poyrazK · poyrazK · Jun 11, 2026 · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/docs/VECTORIZED_EXECUTION.md b/docs/VECTORIZED_EXECUTION.md
@@ -180,9 +180,11 @@ auto result2 = executor.execute("SELECT * FROM orders ORDER BY created_at LIMIT
 | Scenario | Volcano | Vectorized | Speedup |
 |----------|---------|------------|---------|
 | Full table scan | 181M rows/s | ~500M rows/s (parallel) | ~3x |
-| GROUP BY aggregate | ~50M rows/s | ~150M rows/s (parallel) | ~3x |
+| GROUP BY aggregate (Q6) | ~50M rows/s | **7.3B rows/s (parallel)** | **~150x** |
 | JOIN (hash) | ~40M rows/s | ~100M rows/s | ~2.5x |
 | Small result sets | Good | Overhead | - |
 | Queries with ORDER BY | Good | N/A (fallback) | - |
 
+**Note:** GROUP BY aggregate throughput varies significantly with group cardinality. Low-cardinality GROUP BY uses `DirectIndexAgg` (direct array indexing for keys in the int8 range, -128 to 127), while high-cardinality GROUP BY uses `OpenAddressHashAgg`, parallelized across a ThreadPool.
+
 The vectorized path provides significant throughput gains for analytical workloads with large result sets, while the Volcano path remains optimal for OLTP-style queries with early filtering or small result sets.
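For readers of the note on `DirectIndexAgg`, the direct-index idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct name `DirectIndexAggSketch`, its members, and the COUNT/SUM-only scope are all assumptions made for the example. The only detail taken from the docs is that keys in the range -128 to 127 map straight to an array slot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for DirectIndexAgg: COUNT(*) and SUM per group,
// valid only for group keys in [-128, 127].
struct DirectIndexAggSketch {
    static constexpr int64_t kMinKey = -128;
    static constexpr int64_t kMaxKey = 127;
    static constexpr std::size_t kSlots = 256;

    struct Row { int64_t key; int64_t count; int64_t sum; };

    std::array<int64_t, kSlots> counts{};  // zero-initialized
    std::array<int64_t, kSlots> sums{};    // zero-initialized

    // Returns false when the key is outside the direct-index range, so the
    // caller can fall back to a hash-based aggregator.
    bool add(int64_t key, int64_t value) {
        if (key < kMinKey || key > kMaxKey) return false;
        const std::size_t slot = static_cast<std::size_t>(key - kMinKey);
        counts[slot] += 1;
        sums[slot] += value;
        return true;
    }

    // Emits one row per non-empty group, in ascending key order.
    std::vector<Row> results() const {
        std::vector<Row> out;
        for (std::size_t slot = 0; slot < kSlots; ++slot) {
            if (counts[slot] > 0) {
                out.push_back({static_cast<int64_t>(slot) + kMinKey,
                               counts[slot], sums[slot]});
            }
        }
        return out;
    }
};
```

Each `add` is a bounds check plus two array writes, with no hashing or probing, which is consistent with the O(1) lookup the docs describe. A `false` return is where a real operator would presumably switch to `OpenAddressHashAgg`.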
diff --git a/docs/performance/DUCKDB_COMPARISON.md b/docs/performance/DUCKDB_COMPARISON.md
@@ -20,20 +20,21 @@ This report documents the head-to-head performance comparison between `cloudSQL`
 |:----------|:------:|----------:|--------:|:-------|
 | **Q1** GROUP BY aggregation | 10k rows | 161k rows/s | 61.8M rows/s | DuckDB 385x |
 | **Q1** GROUP BY aggregation | 100k rows | 152k rows/s | 182M rows/s | DuckDB 1,196x |
-| **Q6** Filter + aggregation | 10k rows | 209M rows/s | 76.7M rows/s | **cloudSQL 2.7x** |
-| **Q6** Filter + aggregation | 100k rows | 2.13B rows/s | 470M rows/s | **cloudSQL 4.5x** |
+| **Q6** Filter + aggregation | 10k rows | 770M rows/s | 79M rows/s | **cloudSQL 9.7x** |
+| **Q6** Filter + aggregation | 100k rows | 7.3B rows/s | 474M rows/s | **cloudSQL 15.4x** |
 | **Q3-like** Hash Join | 10k rows | 3.78M rows/s | 34.3M rows/s | DuckDB 9x |
 | **Q3-like** Hash Join | 50k rows | 3.76M rows/s | 69.5M rows/s | DuckDB 18x |
 
 ## 4. Architectural Analysis
 
-### Filter + Aggregation (cloudSQL wins 2.7x–4.5x)
+### Filter + Aggregation (cloudSQL wins 9.7x–15.4x)
 
-cloudSQL outperforms DuckDB on the filter+aggregate workload (Q6) by a significant margin. This is surprising given DuckDB's maturity. Several factors likely contribute:
+cloudSQL outperforms DuckDB on the filter+aggregate workload (Q6) by 9.7x–15.4x after the parallel hash aggregation optimization. Key improvements:
 
-1. **Batch Insert Mode overhead**: cloudSQL benchmarks populate data via `INSERT` statements, which may go through the slower transaction path
-2. **Predicate evaluation**: cloudSQL's vectorized filter (`VectorizedFilterOperator`) processes batches with tight inner loops
-3. **Memory locality**: For simple predicates on consecutive rows, cloudSQL's row-oriented storage may exhibit better cache locality
+1. **Parallel hash aggregation**: Rows are partitioned by `hash % num_threads_`, each partition is aggregated concurrently in a per-thread `OpenAddressHashAgg`, and the per-thread tables are merged in the output phase
+2. **Vectorized filter optimization**: `VectorizedFilterOperator` processes batches with tight inner loops and precomputed selection masks
+3. **FNV-1a hash**: Fast 64-bit hashing for row partitioning with minimal overhead
+4. **OpenAddressHashAgg**: Linear probing with a 0.5 load factor keeps probe sequences short and contiguous, which provides excellent cache locality
 
 ### GROUP BY Aggregation (DuckDB wins 385x–1,196x)
 

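The Q6 improvements listed under "Filter + Aggregation" (FNV-1a partitioning, per-thread open-addressing tables, merge at output) can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `OpenAddressSumSketch`, `Slot`, `fnv1a64`, and `parallel_sum` are hypothetical names, the sketch handles only SUM over a single INT64 key, and it uses raw `std::thread` where the docs mention a ThreadPool. The FNV-1a constants are the standard 64-bit offset basis and prime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <thread>
#include <vector>

// 64-bit FNV-1a over the eight bytes of an INT64 key (little-endian order).
inline uint64_t fnv1a64(int64_t key) {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    const uint64_t k = static_cast<uint64_t>(key);
    for (int i = 0; i < 8; ++i) {
        h ^= (k >> (8 * i)) & 0xFF;
        h *= 1099511628211ULL;             // FNV 64-bit prime
    }
    return h;
}

struct Slot { int64_t key = 0; int64_t sum = 0; bool used = false; };

// Hypothetical open-addressing SUM table: linear probing, load factor <= 0.5.
class OpenAddressSumSketch {
public:
    void add(int64_t key, int64_t value) {
        if ((size_ + 1) * 2 > slots_.size()) grow();
        insert(slots_, key, value, size_);
    }
    const std::vector<Slot>& slots() const { return slots_; }

private:
    static void insert(std::vector<Slot>& slots, int64_t key, int64_t value,
                       std::size_t& size) {
        const std::size_t mask = slots.size() - 1;  // capacity is a power of two
        std::size_t i = fnv1a64(key) & mask;
        while (slots[i].used && slots[i].key != key) i = (i + 1) & mask;
        if (!slots[i].used) { slots[i].used = true; slots[i].key = key; ++size; }
        slots[i].sum += value;
    }
    void grow() {
        std::vector<Slot> bigger(slots_.size() * 2);
        std::size_t n = 0;
        for (const Slot& s : slots_) if (s.used) insert(bigger, s.key, s.sum, n);
        slots_.swap(bigger);
    }
    std::vector<Slot> slots_ = std::vector<Slot>(16);
    std::size_t size_ = 0;
};

// Partition rows by hash % num_threads, aggregate each partition on its own
// thread, then merge. A key always lands in one partition, so the per-thread
// tables hold disjoint keys and the merge is a plain copy.
inline std::map<int64_t, int64_t> parallel_sum(const std::vector<int64_t>& keys,
                                               const std::vector<int64_t>& values,
                                               std::size_t num_threads) {
    assert(keys.size() == values.size() && num_threads > 0);
    std::vector<OpenAddressSumSketch> local(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Simplification: every thread scans all rows and keeps its own.
            for (std::size_t r = 0; r < keys.size(); ++r) {
                if (fnv1a64(keys[r]) % num_threads == t) local[t].add(keys[r], values[r]);
            }
        });
    }
    for (std::thread& w : workers) w.join();

    std::map<int64_t, int64_t> merged;
    for (const OpenAddressSumSketch& table : local)
        for (const Slot& s : table.slots())
            if (s.used) merged[s.key] += s.sum;
    return merged;
}
```

Because the partition is chosen from the key's hash, no two threads ever touch the same group, so the build phase needs no locks. The sketch has every thread rescan the full input for brevity; a real operator would more likely scatter rows into per-partition buffers once.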
diff --git a/docs/phases/PHASE_8_ANALYTICS.md b/docs/phases/PHASE_8_ANALYTICS.md
@@ -29,12 +29,14 @@ Optimized global analytical queries (`COUNT`, `SUM`).
 
 ### 5. Vectorized GROUP BY
 Added `VectorizedGroupByOperator` for hash-based grouped aggregation.
-- **Hash-Based Grouping**: Uses `unordered_map` for efficient group key lookup with collision-safe key encoding.
-- **Two-Phase Processing**: Input phase builds hash table from batches; Output phase serves grouped results.
+- **Hash-Based Grouping**: Uses `OpenAddressHashAgg` with linear probing for efficient group key lookup with collision-safe key encoding.
+- **Two-Phase Processing**: Input phase builds hash table from batches; Output phase serves grouped results incrementally.
+- **DirectIndexAgg**: For a GROUP BY on a single INT64 column whose keys fall in the range -128 to 127, uses direct array indexing (O(1) lookup).
 - **Supported Aggregates**: COUNT(*), SUM, MIN, and MAX with INT64/FLOAT64 columns.
 - **Type-Specific Accumulators**: SUM uses separate `sums_int64` and `sums_float64` accumulators to preserve precision for large INT64 values.
-- **Collision-Safe Key Encoding**: Group keys use length-prefixed encoding with dedicated NULL markers, preventing key collisions from string concatenation ambiguities.
+- **Collision-Safe Key Encoding**: Group keys use binary encoding `[type_tag][data...]` with dedicated markers (0x01=NULL, 0x02=INT64, 0x04=STRING).
 - **Pre-resolved Column Indices**: Group-by column indices computed once in constructor to avoid repeated lookups.
+- **Parallel Aggregation**: Optional ThreadPool support partitions rows by `hash % num_threads_`; each thread builds a local `OpenAddressHashAgg`, and the local tables are merged in the output phase (9.7x–15.4x faster than DuckDB on Q6).
 
 ### 6. Vectorized Hash Join (`VectorizedHashJoinOperator`)
 Implemented a vectorized hash join with graceful partitioning and batch-based processing.
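The `[type_tag][data...]` group-key encoding described under "Vectorized GROUP BY" might look roughly like the sketch below. Only the tag values (0x01=NULL, 0x02=INT64, 0x04=STRING) come from the docs. The `GroupValue` struct, the function names, the little-endian INT64 payload, and the 8-byte length prefix on strings are assumptions introduced here so that multi-column keys stay unambiguous.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical value holder for one GROUP BY column.
struct GroupValue {
    enum Kind { kNull, kInt64, kString } kind;
    int64_t int_value;
    std::string str_value;

    static GroupValue make_null() { return {kNull, 0, ""}; }
    static GroupValue make_int(int64_t v) { return {kInt64, v, ""}; }
    static GroupValue make_string(std::string v) { return {kString, 0, std::move(v)}; }
};

// Type tags as listed in the Phase 8 notes.
constexpr char kTagNull = 0x01;
constexpr char kTagInt64 = 0x02;
constexpr char kTagString = 0x04;

// Appends v as eight little-endian bytes (byte order is an assumption).
inline void append_u64(std::string& out, uint64_t v) {
    for (int i = 0; i < 8; ++i) out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

// Encodes each column as [type_tag][data...] and concatenates the results.
inline std::string encode_group_key(const std::vector<GroupValue>& columns) {
    std::string key;
    for (const GroupValue& col : columns) {
        switch (col.kind) {
            case GroupValue::kNull:
                key.push_back(kTagNull);  // tag only, no payload
                break;
            case GroupValue::kInt64:
                key.push_back(kTagInt64);
                append_u64(key, static_cast<uint64_t>(col.int_value));
                break;
            case GroupValue::kString:
                key.push_back(kTagString);
                append_u64(key, col.str_value.size());  // assumed length prefix
                key.append(col.str_value);
                break;
        }
    }
    return key;
}
```

With this layout, the column pairs `("ab", "c")` and `("a", "bc")` encode to different byte strings, and a NULL never collides with an empty string, which is the property the docs call collision-safe.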