Skip to content

Add first-class logical UUID data type#18239

Open
xiangfu0 wants to merge 11 commits intoapache:masterfrom
xiangfu0:codex/uuid-grpc-query-fix
Open

Add first-class logical UUID data type#18239
xiangfu0 wants to merge 11 commits intoapache:masterfrom
xiangfu0:codex/uuid-grpc-query-fix

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 16, 2026

Summary

Adds `UUID` as a first-class Pinot data type backed by 16-byte BYTES storage with canonical lowercase RFC 4122 strings as the external representation.

Core type plumbing

  • `FieldSpec.DataType.UUID` — stored as BYTES (16 bytes), sortable via unsigned byte order. UUID is placed after `UNKNOWN` (ordinal 14) to avoid shifting existing ordinals since `AnyValueAggregationFunction` serializes `DataType.ordinal()` into intermediate results.
  • `PinotDataType.UUID` — conversions to/from STRING/BYTES; numeric ops throw as expected.
  • `DataSchema.ColumnDataType.UUID` — backed by BYTES internally, maps to Calcite `SqlTypeName.UUID`.
  • `Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint.

UuidUtils (`pinot-spi`)

  • Canonical conversions: `toBytes()`, `toUUID()`, `toString()`
  • `UuidKey` inner class — stores UUID as two `long` fields for O(1) hashCode/equals in hot groupby/join paths (avoids ByteArray allocation per row)
  • `compare()` uses `Long.compareUnsigned` for correct unsigned byte-order UUID comparison
  • `UuidUtils.toBytes(Object)` — handles String, java.util.UUID, ByteArray, and byte[] inputs uniformly

Query engine

  • Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via `BytesRaw*Evaluator` + `DataType.UUID`
  • Range predicates — `RangePredicateEvaluatorFactory` has a `case UUID` using `UuidUtils.toBytes()` for bounds; `PredicateUtils.getStoredValue` converts UUID strings to hex-encoded bytes for dictionary lookups
  • Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`, `NoDictionaryMultiColumnGroupKeyGenerator` updated
  • MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`, `UuidLookupTable`
  • `RequestContextUtils.evaluateLiteralValue` — supports CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID on predicate RHS literals; catches only `BadQueryRequestException` (not all RuntimeException) to avoid masking bugs
  • `CastTransformFunction` — CAST(col AS UUID) per-row support
  • `InTransformFunction` — UUID literal sets use `ObjectOpenHashSet` (O(1) hash/equals) not ByteArray
  • `BaseTransformFunction.transformToBytesValuesSV` — safe fallback for UUID via string conversion
  • Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`, `BYTES_TO_UUID`, `UUID_TO_STRING`
  • `ValueBasedSegmentPruner` — UUID bloom filter checked with canonical UUID strings

gRPC / response encoding (bug fixes)

  • `JsonResponseEncoder.extractValue` — added `case UUID` to avoid `IllegalArgumentException` when decoding gRPC responses with UUID columns. Root cause of `UuidUpsertRealtimeTest` failure.
  • `ArrowResponseEncoder` — full UUID support; avoids `ClassCastException` in Arrow encoding path.
  • `ServerPlanRequestUtils.computeInOperands` — emits canonical UUID string literals (not raw bytes) for UUID columns in MSQE join dynamic filters.

Bloom filter (bug fixes)

  • `BloomFilterHandler` — UUID rebuild paths fixed:
    • Dictionary path: calls `UuidUtils.toString(dictionary.getBytesValue(i))` instead of `dictionary.getStringValue(i)` (which returns hex for byte-backed dicts).
    • Non-dictionary path: explicit `case UUID` added before `case BYTES`.
  • `BloomFilterCreator` — `add(Object, int)` and `add(Object[], int[])` UUID paths use `UuidUtils.toBytes(value)` instead of an unsafe `(byte[]) value` cast; handles String, java.util.UUID, ByteArray, and byte[] input from any ingestion source.

Index creation (bug fixes)

  • `RawValueBitmapInvertedIndexCreator.add(Object, int)` and `add(Object[], int[])` — both single-value and multi-value BYTES branches check `_dataType == DataType.UUID` and route through `UuidUtils.toBytes(value)` to handle all input representations.

Segment / realtime

  • `MutableSegmentImpl` — UUID primary key normalization for upsert
  • `MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare` (unsigned) instead of `ByteArray.compare` (signed) for UUID
  • `BloomFilterCreator` — UUID stored as canonical string

Input format

  • Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType
  • `AvroUtils.extractFieldDataType` — rejects `ARRAY` at schema inference time with a clear error (UUID is SV-only); `IllegalStateException` propagates directly without re-wrapping.

Tests

  • `UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for batch and realtime
  • `UuidUpsertRealtimeTest` — upsert with UUID primary key
  • Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils, response encoders, Avro utils, predicate evaluators
  • `ServerPlanRequestUtilsTest` — regression tests for `computeInOperands` UUID fix

Known limitations

  • Nil UUID (`00000000-0000-0000-0000-000000000000`) is the null sentinel — collides with a valid but rare UUID value. Fixing requires out-of-band null representation.
  • Mixed-case UUID primary keys (e.g., `550E...` vs `550e...`) could route to different Kafka partitions before normalization in realtime upsert. Test forces single partition. Fix requires pre-routing canonicalization.

Test plan

  • `UuidTypeTest` (batch): predicate pushdown, group-by, CAST, ORDER BY
  • `UuidTypeRealtimeTest`: same for realtime segment
  • `UuidUpsertRealtimeTest`: upsert/dedup with UUID primary key (verifies gRPC fix)
  • `UuidUtilsTest`, `DataSchemaTest`, `PinotDataTypeTest`: unit coverage
  • `ArrowResponseEncoderTest`, `JsonResponseEncoderTest`: encoding round-trips
  • `RequestContextUtilsTest`: literal evaluation for UUID functions
  • `ServerPlanRequestUtilsTest`: regression for computeInOperands UUID fix

🤖 Generated with Claude Code

xiangfu0 and others added 4 commits April 16, 2026 00:02
Adds UUID as a first-class Pinot data type backed by 16-byte BYTES
storage with canonical lowercase RFC 4122 string as the external
representation.

## Core type plumbing
- `FieldSpec.DataType.UUID(BYTES, 16, false, true)` — stored as BYTES,
  fixed 16 bytes, not numeric, sortable
- `PinotDataType.UUID` — conversions to/from STRING/BYTES, numeric ops
  throw as expected
- `DataSchema.ColumnDataType.UUID` — backed by BYTES internal, maps to
  Calcite `SqlTypeName.UUID`
- `Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint

## UuidUtils (pinot-spi)
- Canonical conversions: `toBytes()`, `toUUID()`, `toString()`
- `UuidKey` inner class — stores UUID as two `long` fields for O(1)
  hashCode/equals in hot groupby/join paths (avoids ByteArray allocation)
- `compare()` uses `Long.compareUnsigned` for correct byte-order UUID
  comparison vs. `ByteArray.compare()` signed bytes

## Query engine
- Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via
  `BytesRaw*Evaluator` + `DataType.UUID`
- Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`,
  `NoDictionaryMultiColumnGroupKeyGenerator` updated
- MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`,
  `UuidLookupTable`
- `RequestContextUtils.evaluateLiteralValue` — supports
  CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID
  on predicate RHS literals; uses `BadQueryRequestException` for arity
  errors
- `CastTransformFunction` — CAST(col AS UUID) per-row support
- Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`,
  `BYTES_TO_UUID`, `UUID_TO_STRING`

## Segment / realtime
- `MutableSegmentImpl` — UUID primary key normalization for upsert via
  `normalizePrimaryKeyValue()`
- `MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare`
  (unsigned byte-order) instead of `ByteArray.compare` (signed) for UUID
- `BloomFilterCreator` — UUID stored as canonical string in bloom filter
- `RawValueBitmapInvertedIndexCreator` — UUID handled via stored BYTES

## Response encoding
- Arrow encoder — emits UUID string values in UUID columns
- JSON encoder — passes through UUID as string

## Input format
- Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType;
  Pinot UUID field → Avro `{"type":"string","logicalType":"uuid"}`
- Rejects Avro ARRAY<uuid> with a clear error (UUID is SV-only)

## Tests
- `UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for
  batch and realtime ingestion, predicate pushdown, group-by, CAST
- `UuidUpsertRealtimeTest` — upsert with UUID primary key
- Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils,
  response encoders, Avro utils, predicate evaluators

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In ServerPlanRequestUtils, the BYTES branch builds IN-predicate literals
by iterating sorted byte arrays and calling getLiteralExpression(bytes[]).
For UUID columns (which are stored as 16-byte BYTES internally) this
emits raw bytes to the broker, but the broker plan sends UUID values as
canonical string literals. The mismatch causes wrong query results when
filtering a UUID column through the gRPC/MSQE path.

Fix: detect ColumnDataType.UUID and call UuidUtils.toString() to emit
the canonical UUID string literal instead of raw bytes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…trings

UUID columns were broken in two ways during bloom-filter rebuild:
- Dictionary path: called dictionary.getStringValue() which returns hex
  encoding for byte-backed dicts, but query-time pruning hashes canonical
  UUID strings, so rebuilt filters would never match.
- Non-dictionary path: no UUID case, causing IllegalStateException when
  rebuilding bloom filters on raw UUID columns.

Fix both paths to use UuidUtils.toString() to produce canonical lowercase
UUID strings, consistent with BloomFilterCreator and ValueBasedSegmentPruner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…latile

- Move DataType.UUID to end of enum (after UNKNOWN) to preserve existing
  ordinals; AnyValueAggregationFunction serializes ordinals into intermediate
  results so inserting UUID between BYTES and STRUCT would break mixed-version
  clusters. UUID is now ordinal 14, leaving all prior ordinals unchanged.
- canEvaluateLiteral: catch BadQueryRequestException instead of RuntimeException
  to avoid masking NPEs and ClassCastExceptions from bugs in evaluateLiteralValue.
- Remove nilUuidBytes() and its test: method was introduced and deprecated in
  the same PR with no external callers; nullUuidBytes() is the right method.
- IdentifierTransformFunction._uuidStringValuesSV: remove volatile; transform
  functions are single-threaded so the volatile is misleading and inconsistent
  with BaseTransformFunction's plain fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 60.91743% with 426 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.46%. Comparing base (3cc1902) to head (f25a211).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
...r/impl/inv/RawValueBitmapInvertedIndexCreator.java 54.34% 58 Missing and 5 partials ⚠️
...ain/java/org/apache/pinot/spi/utils/UuidUtils.java 73.62% 32 Missing and 16 partials ⚠️
...ot/common/request/context/RequestContextUtils.java 59.61% 32 Missing and 10 partials ⚠️
...pby/NoDictionarySingleColumnGroupKeyGenerator.java 50.00% 15 Missing and 6 partials ⚠️
...e/operator/groupby/OneUuidKeyGroupIdGenerator.java 0.00% 20 Missing ⚠️
...upby/NoDictionaryMultiColumnGroupKeyGenerator.java 36.36% 11 Missing and 3 partials ⚠️
...erator/transform/function/InTransformFunction.java 40.90% 11 Missing and 2 partials ⚠️
...main/java/org/apache/pinot/spi/data/FieldSpec.java 74.00% 9 Missing and 4 partials ⚠️
...a/org/apache/pinot/common/utils/PinotDataType.java 47.82% 11 Missing and 1 partial ⚠️
...inot/query/planner/logical/RexExpressionUtils.java 0.00% 11 Missing ⚠️
... and 42 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18239      +/-   ##
============================================
+ Coverage     63.32%   63.46%   +0.13%     
  Complexity     1627     1627              
============================================
  Files          3238     3257      +19     
  Lines        197011   198120    +1109     
  Branches      30466    30665     +199     
============================================
+ Hits         124762   125727     +965     
- Misses        62244    62304      +60     
- Partials      10005    10089      +84     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.43% <60.91%> (+0.13%) ⬆️
java-21 63.41% <60.91%> (+8.15%) ⬆️
temurin 63.46% <60.91%> (+0.13%) ⬆️
unittests 63.45% <60.91%> (+0.13%) ⬆️
unittests1 55.51% <52.93%> (+0.22%) ⬆️
unittests2 34.92% <25.04%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- InTransformFunction (C4): UUID IN-predicate literal path now uses
  ObjectOpenHashSet<UuidKey> instead of ObjectOpenHashSet<ByteArray>.
  Per-row lookup uses UuidKey.fromBytes() — O(1) precomputed two-long
  hash/equals vs. per-row ByteArray allocation + Arrays.hashCode.
  Consistent with InPredicateEvaluatorFactory's UuidKey-based evaluators.

- BaseTransformFunction (C8): Replace IllegalStateException for UUID in
  transformToBytesValuesSV with a safe string-based fallback. Subclasses
  returning UUID should override this method; the fallback converts via
  transformToStringValuesSV so callers never crash on a missing override.

- RequestContextUtils (C9): Add comment documenting that evaluateFunctionLiteral
  hard-codes supported RHS function names; must be updated when new UUID
  or constant-folding functions are added.

- CommonConstants (C10): Document that INTERNAL_UUID_BYTES is a shared
  read-only sentinel; callers must not mutate the array returned by getBytes().

- ServerPlanRequestUtilsTest (C6): Add regression test for UUID
  computeInOperands fix — asserts that UUID ByteArray values produce
  canonical string literals (not byte[] literals) in the dynamic IN filter.
  Also verifies raw BYTES columns still produce binary literals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xiangfu0 and others added 6 commits April 16, 2026 14:38
…schema inference

- BloomFilterCreator: replace unsafe (byte[]) cast with UuidUtils.toBytes(value) for
  UUID single- and multi-value add paths — handles String, UUID, ByteArray, byte[] input
- RawValueBitmapInvertedIndexCreator: add UUID guard in both add(Object) and add(Object[])
  BYTES branches; add comment explaining ByteArray invariant in seal()
- RangePredicateEvaluatorFactory: add UUID case in newRawValueBasedEvaluator using
  UuidUtils.toBytes() for bounds; add comment confirming dictionary value-type contract
- PredicateUtils.getStoredValue: add UUID case converting UUID string to hex for BYTES
  dictionary lookups (SortedDictionary and UnsortedDictionary low-cardinality paths)
- AvroUtils.extractFieldDataType: reject ARRAY<uuid> at schema inference time with a
  clear error; let IllegalStateException propagate without re-wrapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…idate()

BIG_DECIMAL was never restricted to single-value columns in existing Pinot.
My UUID PR incorrectly added a BIG_DECIMAL check alongside the UUID check,
breaking the Spark 3 connector which creates MV BIG_DECIMAL fields. Remove
the BIG_DECIMAL restriction, keeping only the UUID SV-only enforcement.

Fixes: SparkToPinotTypeTranslatorTest "Translate multi value data types"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This test was added alongside the erroneous BIG_DECIMAL SV-only restriction
in Schema.validate(). That restriction was removed in the previous commit
(BIG_DECIMAL was never SV-only in Pinot). Remove the test that enforced it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IsUuidScalarFunction and ToUuidScalarFunction share identical getFunctionInfo
dispatch logic (STRING_FUNCTION_INFO / BYTES_FUNCTION_INFO switch). Extract
it into AbstractStringOrBytesUuidFunction so subclasses only provide the two
FunctionInfo constants via protected abstract methods.

Addresses review comment: the getFunctionInfo pattern appears multiple times.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cleanup

BIG_DECIMAL MV restriction lives in SchemaUtils.validateMultiValueCompatibility()
(enforced at validate-time, not build-time). Commit 0186fbb incorrectly moved
the test to Assert.assertThrows(() -> ...build()); removing the duplicate restriction
from Schema.validate() broke that assertion. Revert to the original checkValidationFails
style matching how JSON MV is tested.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tions() override

_sharedClusterTestSuite.createKafkaTopic(topic) internally calls
getNumKafkaPartitions() on _sharedClusterTestSuite (the first test in the
suite), ignoring any override in the current test class. UuidUpsertRealtimeTest
overrides getNumKafkaPartitions() to 1 (single-partition for UUID upsert
correctness), but the topic was created with 2 partitions, causing mixed-case
UUID keys to route to different partitions before normalization and breaking
upsert dedup.

Fix: pass getNumKafkaPartitions() (called on this) explicitly to createKafkaTopic.
Also fixes waitForKafkaTopicMetadataReadyForConsumer which already used the right value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants