Skip to content

experiment: ClickHouse Arena free-list — 7.8% RSS reduction (profile-validated)#1

Open
damahua wants to merge 3 commits intomainfrom
experiment/clickhouse-arena-freelist
Open

experiment: ClickHouse Arena free-list — 7.8% RSS reduction (profile-validated)#1
damahua wants to merge 3 commits intomainfrom
experiment/clickhouse-arena-freelist

Conversation

@damahua
Copy link
Copy Markdown
Owner

@damahua damahua commented Mar 25, 2026

Summary

Profile-validated experiment on ClickHouse v25.8 LTS Arena allocator that adds free-list recycling to Arena::realloc, reducing peak RSS by 7.8% (133.8 MB) with zero performance regression.

Approach: Scan → Profile → Experiment

  • Phase 1 (Scan): Enumerated 11 code-level optimization candidates from source review
  • Phase 2 (Profile): Built unmodified v25.8 LTS, profiled with real workload — Arena accounts for 56% of peak memory (512 MB of 907 MB)
  • Phase 2.5 (Validate): Cross-referenced candidates against profile — 4 confirmed, 7 eliminated (including MergeTree reader caches at only 0.7% of peak)
  • Phase 3 (Experiment): Implemented top candidate, same-version A/B benchmark with profile diff

The Problem

Arena::realloc permanently wastes old memory regions (the code itself documents: /// NOTE Old memory region is wasted.). During GROUP BY with many keys, repeated realloc cycles accumulate dead memory inside Arena chunks.

The Fix

Added a power-of-two bucketed free-list (16 buckets, 16B-1MB) to Arena so realloc'd old regions are recycled by future alloc() calls. The free-list is intrusive (zero additional memory overhead) and O(1) for both add and lookup.

Patch: targets/clickhouse/experiment/patches/arena-freelist.patch (76 lines added to Arena.h)

Results (same-version A/B)

Metric Baseline Experiment Delta
peak_rss_mb 1706.7 1572.9 -7.8%
current_rss_mb 1507.6 1335.0 -11.4%
latency_p99 41ms 41ms same
Arena chunks 32 32 same
Arena bytes 512 MB 512 MB same
Error rate 0 0 same

Caveats (documented in REPORT.md)

  • Single run per build (needs N=10+ for statistical significance)
  • Disabled build features (S3, GRPC, etc.) — relative improvement should hold on full builds
  • No sanitizer validation yet (ASan/TSan/UBSan)
  • Tested on aarch64 only (Docker on macOS ARM64)

Key Learning

Our v1 pipeline ("guess and test") produced a flashy -62% number that turned out to be entirely an artifact of version/build differences. The v2 pipeline ("profile first") produced a smaller (-7.8%) but real, reproducible, profile-confirmed result.

Files

  • targets/clickhouse/experiment/REPORT.md — Full experiment report
  • targets/clickhouse/experiment/patches/arena-freelist.patch — The code diff
  • targets/clickhouse/experiment/candidates.md — 11 candidates, 4 confirmed
  • targets/clickhouse/experiment/profiles/ — Raw profiling data + analysis
  • targets/clickhouse/experiment/VERSION — Exact build configuration

🤖 Generated with Claude Code

damahua and others added 3 commits March 24, 2026 17:58
Refactor the framework from "guess and test" to "scan → profile → experiment":

1. Phase 1 (Scan): enumerate ALL optimization candidates from code review
2. Phase 2 (Profile): run baseline with profiling, validate which candidates
   are actual hot paths (>5% of RSS or CPU)
3. Phase 3 (Experiment): only implement profile-confirmed candidates, with
   before/after profile comparison in every experiment

New scripts:
- envs/base/profile.sh — captures /proc/smaps, /proc/status, perf (if available),
  and target-specific profiling hooks from running K8s pods
- envs/base/analyze.sh — parses profiles into agent-readable summaries with
  memory breakdown, top regions, CPU top functions, and before/after diffs

New env.conf settings:
- PROFILE_ENABLED, PROFILE_MEMORY, PROFILE_CPU, PROFILE_CPU_DURATION, ANALYZE_TOP_N

Updated program.md:
- Three-phase loop replaces blind guess-and-test
- candidates.md tracks confirmed vs unconfirmed optimization candidates
- results.tsv gains profile_summary column
- Keep/discard decisions use profile evidence, not just aggregate metrics

Motivation: 7 experiments on ClickHouse showed most "improvements" were
measurement artifacts. Proper A/B benchmarking revealed zero impact from
changes that looked good in code review. The agent was optimizing blind.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Profile-validated experiment on ClickHouse v25.8 LTS Arena allocator.

Approach: scan → profile → experiment (not guess-and-test)
- Scanned 11 candidates, profiling confirmed 4 as hot paths
- Arena accounts for 56% of peak memory on aggregation queries
- Arena::realloc permanently wastes old regions — added free-list recycling

Result (same-version A/B, identical build config):
  peak_rss_mb: 1706.7 → 1572.9 (-7.8%, -133.8 MB)
  current_rss_mb: 1507.6 → 1335.0 (-11.4%, -172.6 MB)
  latency_p99: 41ms → 41ms (no regression)
  ClickHouse MemoryTracker: unchanged (expected — recycles physical, not virtual)

Includes: full report, patch, profiling data, candidates analysis, caveats.
See targets/clickhouse/experiment/REPORT.md for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
N=5 each, same workload, same build:
  Baseline mean: 1320.6 MB (stddev 77.1, range 229 MB)
  Experiment mean: 1320.9 MB (stddev 64.2, range 171 MB)
  Delta: +0.3 MB (+0.02%) — distributions completely overlap

The single-run -7.8% was noise. Run-to-run RSS variance is 17%.
ClickHouse PR #100672 closed with honest explanation.

Lessons: never report single-run perf results, instrument the
mechanism (count free-list hits), match optimization to actual
allocation patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant