Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion docs/VECTORIZED_EXECUTION.md
Original file line number Diff line number Diff line change
Expand Up @@ -180,9 +180,11 @@ auto result2 = executor.execute("SELECT * FROM orders ORDER BY created_at LIMIT
| Scenario | Volcano | Vectorized | Speedup |
|----------|---------|------------|---------|
| Full table scan | 181M rows/s | ~500M rows/s (parallel) | ~3x |
| GROUP BY aggregate | ~50M rows/s | ~150M rows/s (parallel) | ~3x |
| GROUP BY aggregate (Q6) | ~50M rows/s | **7.3G rows/s (parallel)** | **~150x** |
| JOIN (hash) | ~40M rows/s | ~100M rows/s | ~2.5x |
| Small result sets | Good | Overhead | - |
| Queries with ORDER BY | Good | N/A (fallback) | - |

**Note:** GROUP BY aggregate performance varies significantly based on cardinality. Low-cardinality GROUP BY uses `DirectIndexAgg` (int8 range optimization), while high-cardinality GROUP BY uses `OpenAddressHashAgg` with parallel processing via ThreadPool.

The vectorized path provides significant throughput gains for analytical workloads with large result sets, while the Volcano path remains optimal for OLTP-style queries with early filtering or small result sets.
15 changes: 8 additions & 7 deletions docs/performance/DUCKDB_COMPARISON.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,20 +20,21 @@ This report documents the head-to-head performance comparison between `cloudSQL`
|:----------|:------:|----------:|--------:|:-------|
| **Q1** GROUP BY aggregation | 10k rows | 161k rows/s | 61.8M rows/s | DuckDB 385x |
| **Q1** GROUP BY aggregation | 100k rows | 152k rows/s | 182M rows/s | DuckDB 1,196x |
| **Q6** Filter + aggregation | 10k rows | 209M rows/s | 76.7M rows/s | **cloudSQL 2.7x** |
| **Q6** Filter + aggregation | 100k rows | 2.13B rows/s | 470M rows/s | **cloudSQL 4.5x** |
| **Q6** Filter + aggregation | 10k rows | 770M rows/s | 79M rows/s | **cloudSQL 9.7x** |
| **Q6** Filter + aggregation | 100k rows | 7.3B rows/s | 474M rows/s | **cloudSQL 15.4x** |
| **Q3-like** Hash Join | 10k rows | 3.78M rows/s | 34.3M rows/s | DuckDB 9x |
| **Q3-like** Hash Join | 50k rows | 3.76M rows/s | 69.5M rows/s | DuckDB 18x |

## 4. Architectural Analysis

### Filter + Aggregation (cloudSQL wins 2.7x–4.5x)
### Filter + Aggregation (cloudSQL wins 9.7x–15.4x)

cloudSQL outperforms DuckDB on the filter+aggregate workload (Q6) by a significant margin. This is surprising given DuckDB's maturity. Several factors likely contribute:
cloudSQL significantly outperforms DuckDB on the filter+aggregate workload (Q6) after parallel hash aggregation optimization. Key improvements:

1. **Batch Insert Mode overhead**: cloudSQL benchmarks populate data via `INSERT` statements, which may go through the slower transaction path
2. **Predicate evaluation**: cloudSQL's vectorized filter (`VectorizedFilterOperator`) processes batches with tight inner loops
3. **Memory locality**: For simple predicates on consecutive rows, cloudSQL's row-oriented storage may exhibit better cache locality
1. **Parallel hash aggregation**: Rows partitioned by `hash % num_threads_`, processed concurrently with per-thread `OpenAddressHashAgg`, merged at output phase
2. **Vectorized filter optimization**: `VectorizedFilterOperator` processes batches with tight inner loops and precomputed selection masks
3. **FNV-1a hash**: Fast 64-bit hashing for row partitioning with minimal overhead
4. **OpenAddressHashAgg**: Linear probing with 0.5 load factor provides excellent cache locality

### GROUP BY Aggregation (DuckDB wins 385x–1,196x)

Expand Down
8 changes: 5 additions & 3 deletions docs/phases/PHASE_8_ANALYTICS.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,12 +29,14 @@ Optimized global analytical queries (`COUNT`, `SUM`).

### 5. Vectorized GROUP BY
Added `VectorizedGroupByOperator` for hash-based grouped aggregation.
- **Hash-Based Grouping**: Uses `unordered_map` for efficient group key lookup with collision-safe key encoding.
- **Two-Phase Processing**: Input phase builds hash table from batches; Output phase serves grouped results.
- **Hash-Based Grouping**: Uses `OpenAddressHashAgg` with linear probing for efficient group key lookup with collision-safe key encoding.
- **Two-Phase Processing**: Input phase builds hash table from batches; Output phase serves grouped results incrementally.
- **DirectIndexAgg**: For single INT64 column GROUP BY with keys in -128 to 127 range, uses direct array indexing (O(1) lookup).
- **Supported Aggregates**: COUNT(*), SUM, MIN, and MAX with INT64/FLOAT64 columns.
- **Type-Specific Accumulators**: SUM uses separate `sums_int64` and `sums_float64` accumulators to preserve precision for large INT64 values.
- **Collision-Safe Key Encoding**: Group keys use length-prefixed encoding with dedicated NULL markers, preventing key collisions from string concatenation ambiguities.
- **Collision-Safe Key Encoding**: Group keys use binary encoding `[type_tag][data...]` with dedicated markers (0x01=NULL, 0x02=INT64, 0x04=STRING).
- **Pre-resolved Column Indices**: Group-by column indices computed once in constructor to avoid repeated lookups.
- **Parallel Aggregation**: Optional ThreadPool support partitions rows by `hash % num_threads_`, each thread builds local `OpenAddressHashAgg`, merged at output phase (9-15x speedup vs DuckDB on Q6).

### 6. Vectorized Hash Join (`VectorizedHashJoinOperator`)
Implemented a vectorized hash join with graceful partitioning and batch-based processing.
Expand Down
Loading
Loading