Skip to content

Z3 bench reboot#465

Closed
hdson07 wants to merge 45 commits into
algorithmicsuperintelligence:mainfrom
hdson07:z3-bench-reboot
Closed

Z3 bench reboot#465
hdson07 wants to merge 45 commits into
algorithmicsuperintelligence:mainfrom
hdson07:z3-bench-reboot

Conversation

@hdson07

@hdson07 hdson07 commented Jun 9, 2026

Copy link
Copy Markdown

No description provided.

hdson-Axion and others added 30 commits May 18, 2026 09:59
- build_samples.py replaces build_stage1_sample.py: scans raw-data,
  regenerates problems.jsonl, writes stage1_sample.json (5 SAT) and
  stage2_sample.json (50 SAT+UNSAT) using quintile-spread by
  (num_hard_constraints, num_variables). Runtime no longer used for
  selection — raw-data can accumulate beyond 50 without reshuffling.
- evaluator.py: per-problem timeout = max(5s, baseline_ms * 1.3),
  replacing fixed OPENEVOLVE_STAGE{1,2}_TIMEOUT envs. Adaptive cap
  absorbs both small-problem startup overhead and huge-problem
  proportional headroom. _filter_stage2 added; _load_problems no
  longer filters by local_baseline (stage2 returns full set).
- run_phase.sh / rebaseline_local.py / baseline_params.py: refs
  updated to build_samples.py.
- README: reflect new sample layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- build_samples.py: SAT runtime quintile split → stage1 (Q1+2, 5),
  stage2 (Q3+4, 5), stage3 (Q5, 5); stage4 (broad 20, SAT+UNSAT, dedup vs
  stage1-3). Tukey IQR k=3 outlier filter drops far-tail monsters
  (~180s/~130s with current data) so they don't distort quintile
  boundaries or inflate stage wall-clock.
- evaluator.py: evaluate_stage3 chains stage4 internally because
  openevolve cascade hardcodes 3 stage slots. Gate threshold for the
  internal stage3→stage4 transition read from
  config.evaluator.cascade_thresholds[2] via runtime.cascade_threshold().
- config.yaml: cascade_thresholds=[1.03, 1.03, 1.03] — 3% gain required
  at each transition (stage1→2, stage2→3, stage3→4).
- runtime.py: cascade_threshold(index, default) helper.
- rebaseline_local.py: extend union to all 4 stage samples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three evolution surfaces per phase (GLOBAL_OVERRIDES, SIZE_BUCKETS,
STAGE3_OVERRIDES) with constraint-count buckets and outlier-only stage3
overrides. Stage3 sample now picks from Statistics/outliers_top.csv
(top-5 residual, capped at MAX_BASELINE_MS). Evaluator resolves params
per problem with backward-compat shim for legacy get_params().

extract_best / prepare_phase chain bucket + stage3 dicts across phases.
config.yaml system_message embeds full CpSolverParameters reference
(275 fields) so the LLM has authoritative knob names in every call.
hdson-Axion and others added 15 commits May 26, 2026 18:25
- process_parallel: pass log_file to workers and configure root logger
- gitignore + untrack cpsat-bench shared run artifacts (samples, baselines, phase outputs)
- add phase5_custom_subsolvers initial program
- tune cpsat evolve config, samples, evaluator, worker

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift per-bench evaluator / sampler / rebaseline / scoring / phase
extraction / unified-phase materialization into reusable _lib modules.
Per-bench surface shrinks to four hand-edited files (config.yaml,
params.json, adapter.py, _solve_worker.py) plus phase initial_program.py
modules. All orchestration commands route via `python -m _lib.<name>
<bench>`; no per-bench wrapper scripts.

New _lib modules: bench_paths, params_catalog, averaging, scorer,
sampler, rebaseline, self_test, evaluator_core, evaluator_entry,
final_verify, finalize. Extended extract_best + prepare_phase +
load_bench_config to support CLI invocation by bench name.

Per-bench changes:
- shared/ subdirs removed entirely (worker moved to evolve/ root).
- baseline_params.py / score.py / evaluator.py / runtime.py /
  <solver>_runner.py / build_samples.py / rebaseline_local.py /
  extract_best.py / prepare_phase*.py / final_verify.py wrappers
  deleted (logic now in _lib).
- New adapter.py exposes solver-specific constants
  (DECISIVE_RESULTS, KEY_STATS, STATUS_FIELD, WORKERS_KEY, SCORE_MODE,
  STATS_WEIGHTS, get_problem_size).
- New params.json: rich catalog (defaults + locked + grouped params
  with type/range/desc) replaces BASELINE/LOCKED dicts.
- config.yaml gains bench.clustering, bench.evaluation,
  bench.adapter / params_catalog / worker_path / unified_dict_name.
- z3 extras (validate_keys.py, probe_cli_keys.py, _probe_parallel.py,
  verify_stage1_baseline.py) and cpsat extras
  (benchmark_final.py, materialize_final.py) removed.
- Phase initial_program.py modules:
  - imports BASELINE from params_catalog
  - EVOLVE-BLOCK reset to empty (clean slate)
  - robust bench-root resolution via OPENEVOLVE_BENCH_ROOT env +
    walk-up fallback (works after openevolve copies program to /tmp)

run_phase.sh: invokes _lib modules directly, exports
OPENEVOLVE_BENCH_ROOT, passes _lib/evaluator_entry.py as evaluator,
auto-runs _lib.finalize after the last phase to emit
<bench>/evolve/final_program.py (self-contained — all _load_prev*()
calls inlined as literal dicts so no cache/ dependency at runtime).

Behavior unified across solvers:
- 10-run averaging is the standard (configurable via
  bench.evaluation.repeats).
- Clustering driven by bench.clustering (kmeans / quintile /
  thresholds); z3 clusters by features.num_hard_constraints, cpsat
  by features.num_constraints.
- Score mode pluggable: `cost` (cpsat: dtime + cost_ratio) or
  `speedup` (z3: wall-clock).
- SIZE_BUCKETS / STAGE3_OVERRIDES are opt-in via
  bench.evaluation.enable_size_buckets / enable_outlier_stage.
- LLM param validation runs against the catalog before subprocess
  spawn — unknown / range-violating keys surface as invalid_param
  without a solver run.

Generated stage samples + local_baseline now live in
<bench>/evolve/cache/ (safe to delete; regenerated on first run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- _lib/sampler.py: print per-stage report (picks vs pool quartiles for
  baseline_ms + feature, result distribution, pick list with sha / result /
  ms / feature). Surfaces clustering quality + sample spread at a glance.

- cpsat-bench/build_problems.py + z3-bench/build_problems.py: scan
  raw-data/*.meta.jsonl and emit problems.jsonl (one row per meta entry,
  sorted by sha + applied_params_hash). Replaces the raw-scan portion of
  the deleted build_samples.py wrappers. Flags: --filter-decisive,
  --applied-params-hash, --dry-run, --out. _lib stays solver-agnostic;
  these scripts handle solver-specific meta.jsonl format.

- ADD_NEW_SOLVER.md: quick-start guide for integrating a new solver
  (cvc5, minizinc, picosat, ...). Covers raw-data layout, problems.jsonl
  schema, the 4 hand-edited per-bench files (config.yaml + params.json +
  adapter.py + _solve_worker.py), phase modules, verification steps,
  option decisions (score_mode / worker axis / buckets / clustering),
  common pitfalls, time estimates.

- .gitignore: stop tracking locally-regenerated artifacts. Drop the dead
  `shared/*` patterns (refactor removed shared/), add:
    input/*/evolve/cache/                  # _lib regenerates on demand
    input/*/evolve/final_program.py        # _lib.finalize regenerates
    input/*/evolve/phase*/openevolve_output*/
    input/*/evolve/phase*/__pycache__/
  Untrack 8 previously-committed cache/stage{1..4}_sample.json files
  (still on disk, rebuilt by _lib.sampler).

- cpsat/z3 config.yaml: minor knob tweaks (user-side feature-path tuning
  during smoke runs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-bench pipeline generator skill at .claude/skills/openevolve-pipeline/.
Drives the ADD_NEW_SOLVER.md contract: interview user, scaffold 4 per-bench
files + N phase modules under input/<bench>/evolve/, verify with _lib CLIs
(sampler / self_test / rebaseline / run_phase.sh).

Templates cover both speedup (z3-style flat overrides) and cost+worker-axis
(cpsat-style SIZE_BUCKETS) phase shapes, plus Python-binding and CLI-binary
worker variants. References split into interview / decision-guide /
verify / gotchas to keep SKILL.md short.

Unignore .claude/skills/ so the skill ships with the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- z3-bench: build_problems_reboot.py, sat/optimize config examples,
  phase4_unified update, regenerated shared + cache-sat baselines,
  and a 20260519 backup snapshot of the prior phase layout
- _lib: sampler/scorer/evaluator_core/finalize/bench_paths refinements
- openevolve-pipeline skill + claude_code llm backend updates
- gitignore: ignore input/z3-bench/reboot-raw-data

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@hdson07 hdson07 closed this Jun 9, 2026
@hdson07 hdson07 deleted the z3-bench-reboot branch June 9, 2026 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants