Add first-class logical UUID data type by xiangfu0 · Pull Request #18239 · apache/pinot

xiangfu0 · 2026-04-16T20:01:56Z

Summary

Adds `UUID` as a first-class Pinot data type backed by 16-byte BYTES storage with canonical lowercase RFC 4122 strings as the external representation.

Core type plumbing

`FieldSpec.DataType.UUID` — stored as BYTES (16 bytes), sortable via unsigned byte order. UUID is placed after `UNKNOWN` (ordinal 14) to avoid shifting existing ordinals since `AnyValueAggregationFunction` serializes `DataType.ordinal()` into intermediate results.
`PinotDataType.UUID` — conversions to/from STRING/BYTES; numeric ops throw as expected.
`DataSchema.ColumnDataType.UUID` — backed by BYTES internally, maps to Calcite `SqlTypeName.UUID`.
`Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint.

UuidUtils (`pinot-spi`)

Canonical conversions: `toBytes()`, `toUUID()`, `toString()`
`UuidKey` inner class — stores UUID as two `long` fields for O(1) hashCode/equals in hot groupby/join paths (avoids ByteArray allocation per row)
`compare()` uses `Long.compareUnsigned` for correct unsigned byte-order UUID comparison
`UuidUtils.toBytes(Object)` — handles String, java.util.UUID, ByteArray, and byte[] inputs uniformly

Query engine

Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via `BytesRaw*Evaluator` + `DataType.UUID`
Range predicates — `RangePredicateEvaluatorFactory` has a `case UUID` using `UuidUtils.toBytes()` for bounds; `PredicateUtils.getStoredValue` converts UUID strings to hex-encoded bytes for dictionary lookups
Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`, `NoDictionaryMultiColumnGroupKeyGenerator` updated
MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`, `UuidLookupTable`
`RequestContextUtils.evaluateLiteralValue` — supports CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID on predicate RHS literals; catches only `BadQueryRequestException` (not all RuntimeException) to avoid masking bugs
`CastTransformFunction` — CAST(col AS UUID) per-row support
`InTransformFunction` — UUID literal sets use `ObjectOpenHashSet` (O(1) hash/equals) not ByteArray
`BaseTransformFunction.transformToBytesValuesSV` — safe fallback for UUID via string conversion
Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`, `BYTES_TO_UUID`, `UUID_TO_STRING`
`ValueBasedSegmentPruner` — UUID bloom filter checked with canonical UUID strings

gRPC / response encoding (bug fixes)

`JsonResponseEncoder.extractValue` — added `case UUID` to avoid `IllegalArgumentException` when decoding gRPC responses with UUID columns. Root cause of `UuidUpsertRealtimeTest` failure.
`ArrowResponseEncoder` — full UUID support; avoids `ClassCastException` in Arrow encoding path.
`ServerPlanRequestUtils.computeInOperands` — emits canonical UUID string literals (not raw bytes) for UUID columns in MSQE join dynamic filters.

Bloom filter (bug fixes)

`BloomFilterHandler` — UUID rebuild paths fixed:
- Dictionary path: calls `UuidUtils.toString(dictionary.getBytesValue(i))` instead of `dictionary.getStringValue(i)` (which returns hex for byte-backed dicts).
- Non-dictionary path: explicit `case UUID` added before `case BYTES`.
`BloomFilterCreator` — `add(Object, int)` and `add(Object[], int[])` UUID paths use `UuidUtils.toBytes(value)` instead of an unsafe `(byte[]) value` cast; handles String, java.util.UUID, ByteArray, and byte[] input from any ingestion source.

Index creation (bug fixes)

`RawValueBitmapInvertedIndexCreator.add(Object, int)` and `add(Object[], int[])` — both single-value and multi-value BYTES branches check `_dataType == DataType.UUID` and route through `UuidUtils.toBytes(value)` to handle all input representations.

Segment / realtime

`MutableSegmentImpl` — UUID primary key normalization for upsert
`MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare` (unsigned) instead of `ByteArray.compare` (signed) for UUID
`BloomFilterCreator` — UUID stored as canonical string

Input format

Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType
`AvroUtils.extractFieldDataType` — rejects `ARRAY` at schema inference time with a clear error (UUID is SV-only); `IllegalStateException` propagates directly without re-wrapping.

Tests

`UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for batch and realtime
`UuidUpsertRealtimeTest` — upsert with UUID primary key
Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils, response encoders, Avro utils, predicate evaluators
`ServerPlanRequestUtilsTest` — regression tests for `computeInOperands` UUID fix

Known limitations

Nil UUID (`00000000-0000-0000-0000-000000000000`) is the null sentinel — collides with a valid but rare UUID value. Fixing requires out-of-band null representation.
Mixed-case UUID primary keys (e.g., `550E...` vs `550e...`) could route to different Kafka partitions before normalization in realtime upsert. Test forces single partition. Fix requires pre-routing canonicalization.

Test plan

`UuidTypeTest` (batch): predicate pushdown, group-by, CAST, ORDER BY
`UuidTypeRealtimeTest`: same for realtime segment
`UuidUpsertRealtimeTest`: upsert/dedup with UUID primary key (verifies gRPC fix)
`UuidUtilsTest`, `DataSchemaTest`, `PinotDataTypeTest`: unit coverage
`ArrowResponseEncoderTest`, `JsonResponseEncoderTest`: encoding round-trips
`RequestContextUtilsTest`: literal evaluation for UUID functions
`ServerPlanRequestUtilsTest`: regression for computeInOperands UUID fix

🤖 Generated with Claude Code

Adds UUID as a first-class Pinot data type backed by 16-byte BYTES storage with canonical lowercase RFC 4122 string as the external representation. ## Core type plumbing - `FieldSpec.DataType.UUID(BYTES, 16, false, true)` — stored as BYTES, fixed 16 bytes, not numeric, sortable - `PinotDataType.UUID` — conversions to/from STRING/BYTES, numeric ops throw as expected - `DataSchema.ColumnDataType.UUID` — backed by BYTES internal, maps to Calcite `SqlTypeName.UUID` - `Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint ## UuidUtils (pinot-spi) - Canonical conversions: `toBytes()`, `toUUID()`, `toString()` - `UuidKey` inner class — stores UUID as two `long` fields for O(1) hashCode/equals in hot groupby/join paths (avoids ByteArray allocation) - `compare()` uses `Long.compareUnsigned` for correct byte-order UUID comparison vs. `ByteArray.compare()` signed bytes ## Query engine - Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via `BytesRaw*Evaluator` + `DataType.UUID` - Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`, `NoDictionaryMultiColumnGroupKeyGenerator` updated - MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`, `UuidLookupTable` - `RequestContextUtils.evaluateLiteralValue` — supports CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID on predicate RHS literals; uses `BadQueryRequestException` for arity errors - `CastTransformFunction` — CAST(col AS UUID) per-row support - Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`, `BYTES_TO_UUID`, `UUID_TO_STRING` ## Segment / realtime - `MutableSegmentImpl` — UUID primary key normalization for upsert via `normalizePrimaryKeyValue()` - `MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare` (unsigned byte-order) instead of `ByteArray.compare` (signed) for UUID - `BloomFilterCreator` — UUID stored as canonical string in bloom filter - `RawValueBitmapInvertedIndexCreator` — UUID handled via stored BYTES ## Response encoding - Arrow encoder — emits UUID string values in UUID columns - JSON encoder — passes through UUID as string ## Input format - Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType; Pinot UUID field → Avro `{"type":"string","logicalType":"uuid"}` - Rejects Avro ARRAY<uuid> with a clear error (UUID is SV-only) ## Tests - `UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for batch and realtime ingestion, predicate pushdown, group-by, CAST - `UuidUpsertRealtimeTest` — upsert with UUID primary key - Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils, response encoders, Avro utils, predicate evaluators Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

In ServerPlanRequestUtils, the BYTES branch builds IN-predicate literals by iterating sorted byte arrays and calling getLiteralExpression(bytes[]). For UUID columns (which are stored as 16-byte BYTES internally) this emits raw bytes to the broker, but the broker plan sends UUID values as canonical string literals. The mismatch causes wrong query results when filtering a UUID column through the gRPC/MSQE path. Fix: detect ColumnDataType.UUID and call UuidUtils.toString() to emit the canonical UUID string literal instead of raw bytes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…trings UUID columns were broken in two ways during bloom-filter rebuild: - Dictionary path: called dictionary.getStringValue() which returns hex encoding for byte-backed dicts, but query-time pruning hashes canonical UUID strings, so rebuilt filters would never match. - Non-dictionary path: no UUID case, causing IllegalStateException when rebuilding bloom filters on raw UUID columns. Fix both paths to use UuidUtils.toString() to produce canonical lowercase UUID strings, consistent with BloomFilterCreator and ValueBasedSegmentPruner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…latile - Move DataType.UUID to end of enum (after UNKNOWN) to preserve existing ordinals; AnyValueAggregationFunction serializes ordinals into intermediate results so inserting UUID between BYTES and STRUCT would break mixed-version clusters. UUID is now ordinal 14, leaving all prior ordinals unchanged. - canEvaluateLiteral: catch BadQueryRequestException instead of RuntimeException to avoid masking NPEs and ClassCastExceptions from bugs in evaluateLiteralValue. - Remove nilUuidBytes() and its test: method was introduced and deprecated in the same PR with no external callers; nullUuidBytes() is the right method. - IdentifierTransformFunction._uuidStringValuesSV: remove volatile; transform functions are single-threaded so the volatile is misleading and inconsistent with BaseTransformFunction's plain fields. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-04-16T20:40:55Z

Codecov Report

❌ Patch coverage is 60.91743% with 426 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.46%. Comparing base (3cc1902) to head (f25a211).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
...r/impl/inv/RawValueBitmapInvertedIndexCreator.java	54.34%	58 Missing and 5 partials ⚠️
...ain/java/org/apache/pinot/spi/utils/UuidUtils.java	73.62%	32 Missing and 16 partials ⚠️
...ot/common/request/context/RequestContextUtils.java	59.61%	32 Missing and 10 partials ⚠️
...pby/NoDictionarySingleColumnGroupKeyGenerator.java	50.00%	15 Missing and 6 partials ⚠️
...e/operator/groupby/OneUuidKeyGroupIdGenerator.java	0.00%	20 Missing ⚠️
...upby/NoDictionaryMultiColumnGroupKeyGenerator.java	36.36%	11 Missing and 3 partials ⚠️
...erator/transform/function/InTransformFunction.java	40.90%	11 Missing and 2 partials ⚠️
...main/java/org/apache/pinot/spi/data/FieldSpec.java	74.00%	9 Missing and 4 partials ⚠️
...a/org/apache/pinot/common/utils/PinotDataType.java	47.82%	11 Missing and 1 partial ⚠️
...inot/query/planner/logical/RexExpressionUtils.java	0.00%	11 Missing ⚠️
... and 42 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18239      +/-   ##
============================================
+ Coverage     63.32%   63.46%   +0.13%     
  Complexity     1627     1627              
============================================
  Files          3238     3257      +19     
  Lines        197011   198120    +1109     
  Branches      30466    30665     +199     
============================================
+ Hits         124762   125727     +965     
- Misses        62244    62304      +60     
- Partials      10005    10089      +84

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.43% <60.91%> (+0.13%)`	⬆️
java-21	`63.41% <60.91%> (+8.15%)`	⬆️
temurin	`63.46% <60.91%> (+0.13%)`	⬆️
unittests	`63.45% <60.91%> (+0.13%)`	⬆️
unittests1	`55.51% <52.93%> (+0.22%)`	⬆️
unittests2	`34.92% <25.04%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- InTransformFunction (C4): UUID IN-predicate literal path now uses ObjectOpenHashSet<UuidKey> instead of ObjectOpenHashSet<ByteArray>. Per-row lookup uses UuidKey.fromBytes() — O(1) precomputed two-long hash/equals vs. per-row ByteArray allocation + Arrays.hashCode. Consistent with InPredicateEvaluatorFactory's UuidKey-based evaluators. - BaseTransformFunction (C8): Replace IllegalStateException for UUID in transformToBytesValuesSV with a safe string-based fallback. Subclasses returning UUID should override this method; the fallback converts via transformToStringValuesSV so callers never crash on a missing override. - RequestContextUtils (C9): Add comment documenting that evaluateFunctionLiteral hard-codes supported RHS function names; must be updated when new UUID or constant-folding functions are added. - CommonConstants (C10): Document that INTERNAL_UUID_BYTES is a shared read-only sentinel; callers must not mutate the array returned by getBytes(). - ServerPlanRequestUtilsTest (C6): Add regression test for UUID computeInOperands fix — asserts that UUID ByteArray values produce canonical string literals (not byte[] literals) in the dynamic IN filter. Also verifies raw BYTES columns still produce binary literals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…schema inference - BloomFilterCreator: replace unsafe (byte[]) cast with UuidUtils.toBytes(value) for UUID single- and multi-value add paths — handles String, UUID, ByteArray, byte[] input - RawValueBitmapInvertedIndexCreator: add UUID guard in both add(Object) and add(Object[]) BYTES branches; add comment explaining ByteArray invariant in seal() - RangePredicateEvaluatorFactory: add UUID case in newRawValueBasedEvaluator using UuidUtils.toBytes() for bounds; add comment confirming dictionary value-type contract - PredicateUtils.getStoredValue: add UUID case converting UUID string to hex for BYTES dictionary lookups (SortedDictionary and UnsortedDictionary low-cardinality paths) - AvroUtils.extractFieldDataType: reject ARRAY<uuid> at schema inference time with a clear error; let IllegalStateException propagate without re-wrapping Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…idate() BIG_DECIMAL was never restricted to single-value columns in existing Pinot. My UUID PR incorrectly added a BIG_DECIMAL check alongside the UUID check, breaking the Spark 3 connector which creates MV BIG_DECIMAL fields. Remove the BIG_DECIMAL restriction, keeping only the UUID SV-only enforcement. Fixes: SparkToPinotTypeTranslatorTest "Translate multi value data types" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This test was added alongside the erroneous BIG_DECIMAL SV-only restriction in Schema.validate(). That restriction was removed in the previous commit (BIG_DECIMAL was never SV-only in Pinot). Remove the test that enforced it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

IsUuidScalarFunction and ToUuidScalarFunction share identical getFunctionInfo dispatch logic (STRING_FUNCTION_INFO / BYTES_FUNCTION_INFO switch). Extract it into AbstractStringOrBytesUuidFunction so subclasses only provide the two FunctionInfo constants via protected abstract methods. Addresses review comment: the getFunctionInfo pattern appears multiple times. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…cleanup BIG_DECIMAL MV restriction lives in SchemaUtils.validateMultiValueCompatibility() (enforced at validate-time, not build-time). Commit 0186fbb incorrectly moved the test to Assert.assertThrows(() -> ...build()); removing the duplicate restriction from Schema.validate() broke that assertion. Revert to the original checkValidationFails style matching how JSON MV is tested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tions() override _sharedClusterTestSuite.createKafkaTopic(topic) internally calls getNumKafkaPartitions() on _sharedClusterTestSuite (the first test in the suite), ignoring any override in the current test class. UuidUpsertRealtimeTest overrides getNumKafkaPartitions() to 1 (single-partition for UUID upsert correctness), but the topic was created with 2 partitions, causing mixed-case UUID keys to route to different partitions before normalization and breaking upsert dedup. Fix: pass getNumKafkaPartitions() (called on this) explicitly to createKafkaTopic. Also fixes waitForKafkaTopicMetadataReadyForConsumer which already used the right value. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

xiangfu0 and others added 4 commits April 16, 2026 00:02

lakechd reviewed Apr 16, 2026

View reviewed changes

Comment thread ...-common/src/main/java/org/apache/pinot/common/function/scalar/uuid/ToUuidScalarFunction.java Outdated

xiangfu0 and others added 6 commits April 16, 2026 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add first-class logical UUID data type#18239

Add first-class logical UUID data type#18239
xiangfu0 wants to merge 11 commits intoapache:masterfrom
xiangfu0:codex/uuid-grpc-query-fix

xiangfu0 commented Apr 16, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Apr 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xiangfu0 commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Core type plumbing

UuidUtils (`pinot-spi`)

Query engine

gRPC / response encoding (bug fixes)

Bloom filter (bug fixes)

Index creation (bug fixes)

Segment / realtime

Input format

Tests

Known limitations

Test plan

Uh oh!

codecov-commenter commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xiangfu0 commented Apr 16, 2026 •

edited

Loading

codecov-commenter commented Apr 16, 2026 •

edited

Loading