Z3 bench reboot#465
Closed
hdson07 wants to merge 45 commits into
Closed
Conversation
- build_samples.py replaces build_stage1_sample.py: scans raw-data,
regenerates problems.jsonl, writes stage1_sample.json (5 SAT) and
stage2_sample.json (50 SAT+UNSAT) using quintile-spread by
(num_hard_constraints, num_variables). Runtime no longer used for
selection — raw-data can accumulate beyond 50 without reshuffling.
- evaluator.py: per-problem timeout = max(5s, baseline_ms * 1.3),
replacing fixed OPENEVOLVE_STAGE{1,2}_TIMEOUT envs. Adaptive cap
absorbs both small-problem startup overhead and huge-problem
proportional headroom. _filter_stage2 added; _load_problems no
longer filters by local_baseline (stage2 returns full set).
- run_phase.sh / rebaseline_local.py / baseline_params.py: refs
updated to build_samples.py.
- README: reflect new sample layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- build_samples.py: SAT runtime quintile split → stage1 (Q1+2, 5), stage2 (Q3+4, 5), stage3 (Q5, 5); stage4 (broad 20, SAT+UNSAT, dedup vs stage1-3). Tukey IQR k=3 outlier filter drops far-tail monsters (~180s/~130s with current data) so they don't distort quintile boundaries or inflate stage wall-clock. - evaluator.py: evaluate_stage3 chains stage4 internally because openevolve cascade hardcodes 3 stage slots. Gate threshold for the internal stage3→stage4 transition read from config.evaluator.cascade_thresholds[2] via runtime.cascade_threshold(). - config.yaml: cascade_thresholds=[1.03, 1.03, 1.03] — 3% gain required at each transition (stage1→2, stage2→3, stage3→4). - runtime.py: cascade_threshold(index, default) helper. - rebaseline_local.py: extend union to all 4 stage samples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three evolution surfaces per phase (GLOBAL_OVERRIDES, SIZE_BUCKETS, STAGE3_OVERRIDES) with constraint-count buckets and outlier-only stage3 overrides. Stage3 sample now picks from Statistics/outliers_top.csv (top-5 residual, capped at MAX_BASELINE_MS). Evaluator resolves params per problem with backward-compat shim for legacy get_params(). extract_best / prepare_phase chain bucket + stage3 dicts across phases. config.yaml system_message embeds full CpSolverParameters reference (275 fields) so the LLM has authoritative knob names in every call.
- process_parallel: pass log_file to workers and configure root logger - gitignore + untrack cpsat-bench shared run artifacts (samples, baselines, phase outputs) - add phase5_custom_subsolvers initial program - tune cpsat evolve config, samples, evaluator, worker Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift per-bench evaluator / sampler / rebaseline / scoring / phase
extraction / unified-phase materialization into reusable _lib modules.
Per-bench surface shrinks to four hand-edited files (config.yaml,
params.json, adapter.py, _solve_worker.py) plus phase initial_program.py
modules. All orchestration commands route via `python -m _lib.<name>
<bench>`; no per-bench wrapper scripts.
New _lib modules: bench_paths, params_catalog, averaging, scorer,
sampler, rebaseline, self_test, evaluator_core, evaluator_entry,
final_verify, finalize. Extended extract_best + prepare_phase +
load_bench_config to support CLI invocation by bench name.
Per-bench changes:
- shared/ subdirs removed entirely (worker moved to evolve/ root).
- baseline_params.py / score.py / evaluator.py / runtime.py /
<solver>_runner.py / build_samples.py / rebaseline_local.py /
extract_best.py / prepare_phase*.py / final_verify.py wrappers
deleted (logic now in _lib).
- New adapter.py exposes solver-specific constants
(DECISIVE_RESULTS, KEY_STATS, STATUS_FIELD, WORKERS_KEY, SCORE_MODE,
STATS_WEIGHTS, get_problem_size).
- New params.json: rich catalog (defaults + locked + grouped params
with type/range/desc) replaces BASELINE/LOCKED dicts.
- config.yaml gains bench.clustering, bench.evaluation,
bench.adapter / params_catalog / worker_path / unified_dict_name.
- z3 extras (validate_keys.py, probe_cli_keys.py, _probe_parallel.py,
verify_stage1_baseline.py) and cpsat extras
(benchmark_final.py, materialize_final.py) removed.
- Phase initial_program.py modules:
- imports BASELINE from params_catalog
- EVOLVE-BLOCK reset to empty (clean slate)
- robust bench-root resolution via OPENEVOLVE_BENCH_ROOT env +
walk-up fallback (works after openevolve copies program to /tmp)
run_phase.sh: invokes _lib modules directly, exports
OPENEVOLVE_BENCH_ROOT, passes _lib/evaluator_entry.py as evaluator,
auto-runs _lib.finalize after the last phase to emit
<bench>/evolve/final_program.py (self-contained — all _load_prev*()
calls inlined as literal dicts so no cache/ dependency at runtime).
Behavior unified across solvers:
- 10-run averaging is the standard (configurable via
bench.evaluation.repeats).
- Clustering driven by bench.clustering (kmeans / quintile /
thresholds); z3 clusters by features.num_hard_constraints, cpsat
by features.num_constraints.
- Score mode pluggable: `cost` (cpsat: dtime + cost_ratio) or
`speedup` (z3: wall-clock).
- SIZE_BUCKETS / STAGE3_OVERRIDES are opt-in via
bench.evaluation.enable_size_buckets / enable_outlier_stage.
- LLM param validation runs against the catalog before subprocess
spawn — unknown / range-violating keys surface as invalid_param
without a solver run.
Generated stage samples + local_baseline now live in
<bench>/evolve/cache/ (safe to delete; regenerated on first run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- _lib/sampler.py: print per-stage report (picks vs pool quartiles for
baseline_ms + feature, result distribution, pick list with sha / result /
ms / feature). Surfaces clustering quality + sample spread at a glance.
- cpsat-bench/build_problems.py + z3-bench/build_problems.py: scan
raw-data/*.meta.jsonl and emit problems.jsonl (one row per meta entry,
sorted by sha + applied_params_hash). Replaces the raw-scan portion of
the deleted build_samples.py wrappers. Flags: --filter-decisive,
--applied-params-hash, --dry-run, --out. _lib stays solver-agnostic;
these scripts handle solver-specific meta.jsonl format.
- ADD_NEW_SOLVER.md: quick-start guide for integrating a new solver
(cvc5, minizinc, picosat, ...). Covers raw-data layout, problems.jsonl
schema, the 4 hand-edited per-bench files (config.yaml + params.json +
adapter.py + _solve_worker.py), phase modules, verification steps,
option decisions (score_mode / worker axis / buckets / clustering),
common pitfalls, time estimates.
- .gitignore: stop tracking locally-regenerated artifacts. Drop the dead
`shared/*` patterns (refactor removed shared/), add:
input/*/evolve/cache/ # _lib regenerates on demand
input/*/evolve/final_program.py # _lib.finalize regenerates
input/*/evolve/phase*/openevolve_output*/
input/*/evolve/phase*/__pycache__/
Untrack 8 previously-committed cache/stage{1..4}_sample.json files
(still on disk, rebuilt by _lib.sampler).
- cpsat/z3 config.yaml: minor knob tweaks (user-side feature-path tuning
during smoke runs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-bench pipeline generator skill at .claude/skills/openevolve-pipeline/. Drives the ADD_NEW_SOLVER.md contract: interview user, scaffold 4 per-bench files + N phase modules under input/<bench>/evolve/, verify with _lib CLIs (sampler / self_test / rebaseline / run_phase.sh). Templates cover both speedup (z3-style flat overrides) and cost+worker-axis (cpsat-style SIZE_BUCKETS) phase shapes, plus Python-binding and CLI-binary worker variants. References split into interview / decision-guide / verify / gotchas to keep SKILL.md short. Unignore .claude/skills/ so the skill ships with the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- z3-bench: build_problems_reboot.py, sat/optimize config examples, phase4_unified update, regenerated shared + cache-sat baselines, and a 20260519 backup snapshot of the prior phase layout - _lib: sampler/scorer/evaluator_core/finalize/bench_paths refinements - openevolve-pipeline skill + claude_code llm backend updates - gitignore: ignore input/z3-bench/reboot-raw-data Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.