Add first-class logical UUID data type#18239
Open
xiangfu0 wants to merge 11 commits intoapache:masterfrom
Open
Add first-class logical UUID data type#18239xiangfu0 wants to merge 11 commits intoapache:masterfrom
xiangfu0 wants to merge 11 commits intoapache:masterfrom
Conversation
Adds UUID as a first-class Pinot data type backed by 16-byte BYTES
storage with canonical lowercase RFC 4122 string as the external
representation.
## Core type plumbing
- `FieldSpec.DataType.UUID(BYTES, 16, false, true)` — stored as BYTES,
fixed 16 bytes, not numeric, sortable
- `PinotDataType.UUID` — conversions to/from STRING/BYTES, numeric ops
throw as expected
- `DataSchema.ColumnDataType.UUID` — backed by BYTES internal, maps to
Calcite `SqlTypeName.UUID`
- `Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint
## UuidUtils (pinot-spi)
- Canonical conversions: `toBytes()`, `toUUID()`, `toString()`
- `UuidKey` inner class — stores UUID as two `long` fields for O(1)
hashCode/equals in hot groupby/join paths (avoids ByteArray allocation)
- `compare()` uses `Long.compareUnsigned` for correct byte-order UUID
comparison vs. `ByteArray.compare()` signed bytes
## Query engine
- Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via
`BytesRaw*Evaluator` + `DataType.UUID`
- Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`,
`NoDictionaryMultiColumnGroupKeyGenerator` updated
- MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`,
`UuidLookupTable`
- `RequestContextUtils.evaluateLiteralValue` — supports
CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID
on predicate RHS literals; uses `BadQueryRequestException` for arity
errors
- `CastTransformFunction` — CAST(col AS UUID) per-row support
- Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`,
`BYTES_TO_UUID`, `UUID_TO_STRING`
## Segment / realtime
- `MutableSegmentImpl` — UUID primary key normalization for upsert via
`normalizePrimaryKeyValue()`
- `MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare`
(unsigned byte-order) instead of `ByteArray.compare` (signed) for UUID
- `BloomFilterCreator` — UUID stored as canonical string in bloom filter
- `RawValueBitmapInvertedIndexCreator` — UUID handled via stored BYTES
## Response encoding
- Arrow encoder — emits UUID string values in UUID columns
- JSON encoder — passes through UUID as string
## Input format
- Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType;
Pinot UUID field → Avro `{"type":"string","logicalType":"uuid"}`
- Rejects Avro ARRAY<uuid> with a clear error (UUID is SV-only)
## Tests
- `UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for
batch and realtime ingestion, predicate pushdown, group-by, CAST
- `UuidUpsertRealtimeTest` — upsert with UUID primary key
- Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils,
response encoders, Avro utils, predicate evaluators
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In ServerPlanRequestUtils, the BYTES branch builds IN-predicate literals by iterating sorted byte arrays and calling getLiteralExpression(bytes[]). For UUID columns (which are stored as 16-byte BYTES internally) this emits raw bytes to the broker, but the broker plan sends UUID values as canonical string literals. The mismatch causes wrong query results when filtering a UUID column through the gRPC/MSQE path. Fix: detect ColumnDataType.UUID and call UuidUtils.toString() to emit the canonical UUID string literal instead of raw bytes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…trings UUID columns were broken in two ways during bloom-filter rebuild: - Dictionary path: called dictionary.getStringValue() which returns hex encoding for byte-backed dicts, but query-time pruning hashes canonical UUID strings, so rebuilt filters would never match. - Non-dictionary path: no UUID case, causing IllegalStateException when rebuilding bloom filters on raw UUID columns. Fix both paths to use UuidUtils.toString() to produce canonical lowercase UUID strings, consistent with BloomFilterCreator and ValueBasedSegmentPruner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…latile - Move DataType.UUID to end of enum (after UNKNOWN) to preserve existing ordinals; AnyValueAggregationFunction serializes ordinals into intermediate results so inserting UUID between BYTES and STRUCT would break mixed-version clusters. UUID is now ordinal 14, leaving all prior ordinals unchanged. - canEvaluateLiteral: catch BadQueryRequestException instead of RuntimeException to avoid masking NPEs and ClassCastExceptions from bugs in evaluateLiteralValue. - Remove nilUuidBytes() and its test: method was introduced and deprecated in the same PR with no external callers; nullUuidBytes() is the right method. - IdentifierTransformFunction._uuidStringValuesSV: remove volatile; transform functions are single-threaded so the volatile is misleading and inconsistent with BaseTransformFunction's plain fields. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18239 +/- ##
============================================
+ Coverage 63.32% 63.46% +0.13%
Complexity 1627 1627
============================================
Files 3238 3257 +19
Lines 197011 198120 +1109
Branches 30466 30665 +199
============================================
+ Hits 124762 125727 +965
- Misses 62244 62304 +60
- Partials 10005 10089 +84
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- InTransformFunction (C4): UUID IN-predicate literal path now uses ObjectOpenHashSet<UuidKey> instead of ObjectOpenHashSet<ByteArray>. Per-row lookup uses UuidKey.fromBytes() — O(1) precomputed two-long hash/equals vs. per-row ByteArray allocation + Arrays.hashCode. Consistent with InPredicateEvaluatorFactory's UuidKey-based evaluators. - BaseTransformFunction (C8): Replace IllegalStateException for UUID in transformToBytesValuesSV with a safe string-based fallback. Subclasses returning UUID should override this method; the fallback converts via transformToStringValuesSV so callers never crash on a missing override. - RequestContextUtils (C9): Add comment documenting that evaluateFunctionLiteral hard-codes supported RHS function names; must be updated when new UUID or constant-folding functions are added. - CommonConstants (C10): Document that INTERNAL_UUID_BYTES is a shared read-only sentinel; callers must not mutate the array returned by getBytes(). - ServerPlanRequestUtilsTest (C6): Add regression test for UUID computeInOperands fix — asserts that UUID ByteArray values produce canonical string literals (not byte[] literals) in the dynamic IN filter. Also verifies raw BYTES columns still produce binary literals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lakechd
reviewed
Apr 16, 2026
…schema inference - BloomFilterCreator: replace unsafe (byte[]) cast with UuidUtils.toBytes(value) for UUID single- and multi-value add paths — handles String, UUID, ByteArray, byte[] input - RawValueBitmapInvertedIndexCreator: add UUID guard in both add(Object) and add(Object[]) BYTES branches; add comment explaining ByteArray invariant in seal() - RangePredicateEvaluatorFactory: add UUID case in newRawValueBasedEvaluator using UuidUtils.toBytes() for bounds; add comment confirming dictionary value-type contract - PredicateUtils.getStoredValue: add UUID case converting UUID string to hex for BYTES dictionary lookups (SortedDictionary and UnsortedDictionary low-cardinality paths) - AvroUtils.extractFieldDataType: reject ARRAY<uuid> at schema inference time with a clear error; let IllegalStateException propagate without re-wrapping Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…idate() BIG_DECIMAL was never restricted to single-value columns in existing Pinot. My UUID PR incorrectly added a BIG_DECIMAL check alongside the UUID check, breaking the Spark 3 connector which creates MV BIG_DECIMAL fields. Remove the BIG_DECIMAL restriction, keeping only the UUID SV-only enforcement. Fixes: SparkToPinotTypeTranslatorTest "Translate multi value data types" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This test was added alongside the erroneous BIG_DECIMAL SV-only restriction in Schema.validate(). That restriction was removed in the previous commit (BIG_DECIMAL was never SV-only in Pinot). Remove the test that enforced it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IsUuidScalarFunction and ToUuidScalarFunction share identical getFunctionInfo dispatch logic (STRING_FUNCTION_INFO / BYTES_FUNCTION_INFO switch). Extract it into AbstractStringOrBytesUuidFunction so subclasses only provide the two FunctionInfo constants via protected abstract methods. Addresses review comment: the getFunctionInfo pattern appears multiple times. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cleanup BIG_DECIMAL MV restriction lives in SchemaUtils.validateMultiValueCompatibility() (enforced at validate-time, not build-time). Commit 0186fbb incorrectly moved the test to Assert.assertThrows(() -> ...build()); removing the duplicate restriction from Schema.validate() broke that assertion. Revert to the original checkValidationFails style matching how JSON MV is tested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tions() override _sharedClusterTestSuite.createKafkaTopic(topic) internally calls getNumKafkaPartitions() on _sharedClusterTestSuite (the first test in the suite), ignoring any override in the current test class. UuidUpsertRealtimeTest overrides getNumKafkaPartitions() to 1 (single-partition for UUID upsert correctness), but the topic was created with 2 partitions, causing mixed-case UUID keys to route to different partitions before normalization and breaking upsert dedup. Fix: pass getNumKafkaPartitions() (called on this) explicitly to createKafkaTopic. Also fixes waitForKafkaTopicMetadataReadyForConsumer which already used the right value. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds `UUID` as a first-class Pinot data type backed by 16-byte BYTES storage with canonical lowercase RFC 4122 strings as the external representation.
Core type plumbing
UuidUtils (`pinot-spi`)
Query engine
gRPC / response encoding (bug fixes)
Bloom filter (bug fixes)
Index creation (bug fixes)
Segment / realtime
Input format
Tests
Known limitations
Test plan
🤖 Generated with Claude Code