Implement benchmark output structure per .claude/BENCHMARK_OUTPUT.md specification, then run a test benchmark to verify.
| Setting | Value |
|---|---|
| LLM Model | mistral-devstral2 (Mistral Devstral-2512) |
| Repositories | deriva, bigdata |
| Runs per repo | 1 |
| Ground Truth | reference/reference_*.archimate (filtered versions) |
Current output structure:

```
workspace/benchmarks/{session_id}/
├── summary.json                        # Session metadata
├── events.ocel.json                    # OCEL 2.0 event log at root
├── events.jsonl                        # JSONL streaming log at root
├── models/
│   ├── deriva_mistral-devstral2_1.xml  # Model naming: {repo}_{model}_{n}.xml
│   └── bigdata_mistral-devstral2_1.xml
├── cache/
│   └── mistral-devstral2/              # Per-model subdirectory
│       ├── abc123def456.json           # Individual cache files by hash
│       └── ...
└── analysis/
    └── summary.json                    # Consistency analysis (JSON only)
```
Desired output structure (per BENCHMARK_OUTPUT.md):

```
benchmark_session_<timestamp>/
├── session_metadata.json               # Renamed from summary.json
├── models/
│   ├── deriva_devstral2_run1.xml       # Model naming: {repo}_{model}_run{n}.xml
│   └── bigdata_devstral2_run1.xml
├── cache/
│   └── llm_cache.json                  # Single consolidated cache file
├── ocel/
│   └── benchmark_events.json           # OCEL in subdirectory, renamed
└── results/
    ├── consistency_metrics.csv         # NEW: Per-run stability
    ├── ground_truth_comparison.csv     # NEW: Precision/recall/F1
    ├── quality_verification.csv        # NEW: Quality scores
    ├── inter_model_agreement.json      # Extracted from analysis
    └── execution_metrics.csv           # NEW: Timing/tokens
```
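The desired skeleton above can be created up front so every exporter writes into an existing directory. A minimal sketch, assuming a hypothetical helper name (`create_session_layout` is not part of the current codebase):

```python
from pathlib import Path


def create_session_layout(root: Path) -> None:
    """Create the benchmark_session directory skeleton from the spec.

    Hypothetical helper: creates the four subdirectories the desired
    structure requires; session_metadata.json and the files inside each
    subdirectory are written later by the exporters.
    """
    for sub in ("models", "cache", "ocel", "results"):
        (root / sub).mkdir(parents=True, exist_ok=True)
```

Calling this once at session start means the exporters no longer each need their own `mkdir` guard.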
| # | Item | Current | Desired | File to Modify | Lines | Effort |
|---|---|---|---|---|---|---|
| 1 | Model filename | `{repo}_{model}_{n}.xml` | `{repo}_{model}_run{n}.xml` | benchmarking.py | 1055 | Low |
| 2 | OCEL location | Root level | `ocel/` subdirectory | benchmarking.py | 1178-1184 | Low |
| 3 | OCEL filename | `events.ocel.json` | `benchmark_events.json` | benchmarking.py | 1179 | Low |
| 4 | Metadata file | `summary.json` | `session_metadata.json` | benchmarking.py | 1195 | Low |
| 5 | Cache structure | Per-model dirs | Single `llm_cache.json` | benchmarking.py | 1200-1234 | Medium |
| 6 | Results directory | `analysis/` | `results/` | analysis.py | 541-548 | Low |
| 7 | `consistency_metrics.csv` | Does not exist | Required | analysis.py | New method | Medium |
| 8 | `ground_truth_comparison.csv` | Does not exist | Required | analysis.py | New method | Medium |
| 9 | `quality_verification.csv` | Does not exist | Required | analysis.py | New method | Medium |
| 10 | `execution_metrics.csv` | Does not exist | Required | analysis.py | New method | Medium |
File: h:\Deriva\deriva\deriva\services\benchmarking.py
Line: 1055

```python
# CURRENT (line 1055):
filename = f"{safe_repo}_{safe_model}_{iteration}.xml"

# CHANGE TO:
filename = f"{safe_repo}_{safe_model}_run{iteration}.xml"
```

Impact: model files will be named `bigdata_mistral-devstral2_run1.xml` instead of `bigdata_mistral-devstral2_1.xml`.
File: h:\Deriva\deriva\deriva\services\benchmarking.py
Lines: 1178-1184

```python
# CURRENT (lines 1178-1184):
# Export OCEL JSON
ocel_json_path = output_dir / "events.ocel.json"
self.ocel_log.export_json(ocel_json_path)

# Export JSONL for streaming
ocel_jsonl_path = output_dir / "events.jsonl"
self.ocel_log.export_jsonl(ocel_jsonl_path)

# CHANGE TO:
# Create ocel subdirectory
ocel_dir = output_dir / "ocel"
ocel_dir.mkdir(parents=True, exist_ok=True)

# Export OCEL JSON
ocel_json_path = ocel_dir / "benchmark_events.json"
self.ocel_log.export_json(ocel_json_path)

# Export JSONL for streaming
ocel_jsonl_path = ocel_dir / "benchmark_events.jsonl"
self.ocel_log.export_jsonl(ocel_jsonl_path)
```

Impact: OCEL files move from the session root to the `ocel/` subdirectory.
File: h:\Deriva\deriva\deriva\services\benchmarking.py
Line: 1195

```python
# CURRENT (line 1195):
with open(output_dir / "summary.json", "w") as f:

# CHANGE TO:
with open(output_dir / "session_metadata.json", "w") as f:
```

Impact: the session metadata file is renamed.
File: h:\Deriva\deriva\deriva\services\benchmarking.py
Lines: 1200-1234

Replace the entire method with consolidation logic:

```python
def _copy_used_cache_entries(
    self,
    used_cache_keys: list[str],
    cache_dir: Path,
    model_name: str,
) -> int:
    """
    Consolidate used LLM cache entries into a single llm_cache.json.

    Instead of copying individual hash-named files to per-model
    subdirectories, consolidates all entries into a single JSON file
    per the BENCHMARK_OUTPUT.md spec.

    Args:
        used_cache_keys: Cache keys (SHA256 hashes) used during the run
        cache_dir: Source cache directory where cache files are stored
        model_name: Name of the model (used for tagging entries)

    Returns:
        Number of cache entries in the consolidated file
    """
    session_id = self.session_id or "unknown"
    cache_file = Path("workspace/benchmarks") / session_id / "cache" / "llm_cache.json"
    cache_file.parent.mkdir(parents=True, exist_ok=True)

    # Load existing consolidated cache or create new
    consolidated: dict[str, Any] = {}
    if cache_file.exists():
        try:
            with open(cache_file) as f:
                consolidated = json.load(f)
        except json.JSONDecodeError:
            consolidated = {}

    # Add new entries from this run
    for cache_key in set(used_cache_keys):  # Deduplicate
        src = cache_dir / f"{cache_key}.json"
        if src.exists() and cache_key not in consolidated:
            try:
                with open(src) as f:
                    data = json.load(f)
                # Extract fields per BENCHMARK_OUTPUT.md spec
                consolidated[cache_key] = {
                    "prompt": data.get("request", {}).get("messages", [{}])[-1].get("content", ""),
                    "response": data.get("response", {}).get("content", ""),
                    "model": model_name,
                    "tokens_in": data.get("usage", {}).get("input_tokens", 0),
                    "tokens_out": data.get("usage", {}).get("output_tokens", 0),
                    "timestamp": data.get("created_at", ""),
                    "bench_hash": f"run{self._current_iteration}_{self._current_repo}_{model_name}",
                }
            except (OSError, json.JSONDecodeError):
                pass  # Skip unreadable entries

    # Write consolidated cache
    with open(cache_file, "w") as f:
        json.dump(consolidated, f, indent=2)
    return len(consolidated)
```

Note: `_current_iteration` and `_current_repo` need to be tracked on the orchestrator class.
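The tracking note above amounts to two attributes set at the start of each run. A minimal sketch of the shape, assuming a hypothetical orchestrator (the class and method names here are illustrative, not the actual benchmarking.py API):

```python
class BenchmarkOrchestrator:
    """Hypothetical minimal orchestrator showing the run-context tracking."""

    def __init__(self) -> None:
        # Run context consumed by _copy_used_cache_entries for bench_hash
        self._current_iteration: int = 0
        self._current_repo: str = ""

    def _begin_run(self, repo: str, iteration: int) -> None:
        """Record run context before the run executes."""
        self._current_repo = repo
        self._current_iteration = iteration

    def bench_hash(self, model_name: str) -> str:
        """Build the bench_hash tag used in consolidated cache entries."""
        return f"run{self._current_iteration}_{self._current_repo}_{model_name}"
```

Calling `_begin_run("deriva", 1)` before each run keeps the tag in sync with the model filename scheme.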
File: h:\Deriva\deriva\deriva\services\analysis.py
Location: after line 548 (after the `export_all` method)

Add these 4 new methods, plus the imports they need at the top of the file:

```python
import csv
from pathlib import Path
```
```python
def export_consistency_metrics_csv(self, path: Path) -> str:
    """
    Export consistency_metrics.csv with per-run element and relationship stability.

    Columns: repository, model, run, total_elements, stable_elements,
             element_stability, total_relationships, stable_relationships,
             relationship_stability
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "run",
            "total_elements", "stable_elements", "element_stability",
            "total_relationships", "stable_relationships", "relationship_stability",
        ])
        for repo, phases in report.stability_reports.items():
            derivation = phases.get("derivation")
            if derivation:
                # Aggregate element stats
                total_elem = sum(b.total for b in derivation.element_breakdown)
                stable_elem = sum(b.stable for b in derivation.element_breakdown)
                elem_stability = stable_elem / total_elem if total_elem > 0 else 0.0

                # Aggregate relationship stats
                total_rel = sum(b.total for b in derivation.relationship_breakdown)
                stable_rel = sum(b.stable for b in derivation.relationship_breakdown)
                rel_stability = stable_rel / total_rel if total_rel > 0 else 0.0

                writer.writerow([
                    repo, "all", "all",
                    total_elem, stable_elem, f"{elem_stability:.2%}",
                    total_rel, stable_rel, f"{rel_stability:.2%}",
                ])
    return str(path)
```
```python
def export_ground_truth_comparison_csv(self, path: Path) -> str:
    """
    Export ground_truth_comparison.csv with precision/recall/F1 against reference models.

    Columns: repository, model, gt_elements, extracted_elements, matched,
             precision, recall, f1

    Uses reference models from reference/reference_{repo}.archimate.
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "gt_elements", "extracted_elements",
            "matched", "precision", "recall", "f1",
        ])
        for repo, sem_report in report.semantic_reports.items():
            writer.writerow([
                repo, "all",
                sem_report.reference_element_count,
                sem_report.derived_element_count,
                len(sem_report.correctly_derived),
                f"{sem_report.precision:.3f}",
                f"{sem_report.recall:.3f}",
                f"{sem_report.f1_score:.3f}",
            ])
    return str(path)
```
```python
def export_quality_verification_csv(self, path: Path) -> str:
    """
    Export quality_verification.csv with quality rubric scores.

    Columns: repository, model, validity_pct, type_correct_pct, name_quality_pct
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "validity_pct", "type_correct_pct", "name_quality_pct",
        ])
        for repo, sem_report in report.semantic_reports.items():
            # Validity: elements traceable to source
            validity = sem_report.precision if sem_report else 0.0
            # Type correct: elements with a matching ArchiMate type
            type_correct = len([
                m for m in sem_report.correctly_derived
                if m.match_type in ("exact", "fuzzy_name")
            ]) / max(sem_report.derived_element_count, 1)
            # Name quality: elements with meaningful names (fuzzy match quality)
            name_quality = getattr(sem_report, "avg_similarity", 0.0)
            writer.writerow([
                repo, "all",
                f"{validity:.1%}",
                f"{type_correct:.1%}",
                f"{name_quality:.1%}",
            ])
    return str(path)
```
```python
def export_execution_metrics_csv(self, path: Path) -> str:
    """
    Export execution_metrics.csv with runtime performance data.

    Columns: repository, model, run, duration_sec, event_count,
             tokens_in, tokens_out, api_calls

    Extracts timing data from OCEL events.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Parse OCEL events for timing data
    run_metrics: dict[str, dict] = {}
    for event in self.ocel_log.events:
        if event.activity == "StartRun":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            run_metrics[run_id] = {
                "start": event.timestamp,
                "repo": event.objects.get("Repository", [""])[0],
                "model": event.objects.get("Model", [""])[0],
                "events": 0,
                "tokens_in": 0,
                "tokens_out": 0,
                "api_calls": 0,
            }
        elif event.activity == "CompleteRun":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            if run_id in run_metrics:
                run_metrics[run_id]["end"] = event.timestamp
        elif event.activity == "LLMRequest":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            if run_id in run_metrics:
                run_metrics[run_id]["api_calls"] += 1
                run_metrics[run_id]["tokens_in"] += event.attributes.get("tokens_in", 0)
                run_metrics[run_id]["tokens_out"] += event.attributes.get("tokens_out", 0)
        # Count all events for this run
        for run_id in event.objects.get("BenchmarkRun", []):
            if run_id in run_metrics:
                run_metrics[run_id]["events"] += 1

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "run", "duration_sec",
            "event_count", "tokens_in", "tokens_out", "api_calls",
        ])
        for run_id, metrics in run_metrics.items():
            # Parse run number from run_id (format: session:repo:model:n)
            parts = run_id.split(":")
            run_num = parts[-1] if len(parts) >= 4 else "1"
            # Calculate duration
            start = metrics.get("start")
            end = metrics.get("end", start)
            duration = (end - start).total_seconds() if start and end else 0.0
            writer.writerow([
                metrics["repo"], metrics["model"], run_num,
                f"{duration:.1f}",
                metrics["events"],
                metrics["tokens_in"],
                metrics["tokens_out"],
                metrics["api_calls"],
            ])
    return str(path)
```

File: h:\Deriva\deriva\deriva\services\analysis.py
Lines: 531-548

```python
# CURRENT:
def export_all(self, output_dir: str | Path) -> dict[str, str]:
    """Export all formats to a directory."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    paths["json"] = self.export_json(output_dir / "benchmark_analysis.json")
    paths["markdown"] = self.export_markdown(output_dir / "benchmark_analysis.md")
    return paths

# CHANGE TO:
def export_all(self, output_dir: str | Path) -> dict[str, str]:
    """Export all formats to the results/ directory per BENCHMARK_OUTPUT.md spec."""
    output_dir = Path(output_dir)
    results_dir = output_dir / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    paths = {}

    # JSON / Markdown exports
    paths["inter_model_json"] = self.export_json(results_dir / "inter_model_agreement.json")
    paths["markdown"] = self.export_markdown(results_dir / "benchmark_analysis.md")

    # CSV exports (new)
    paths["consistency_csv"] = self.export_consistency_metrics_csv(
        results_dir / "consistency_metrics.csv"
    )
    paths["ground_truth_csv"] = self.export_ground_truth_comparison_csv(
        results_dir / "ground_truth_comparison.csv"
    )
    paths["quality_csv"] = self.export_quality_verification_csv(
        results_dir / "quality_verification.csv"
    )
    paths["execution_csv"] = self.export_execution_metrics_csv(
        results_dir / "execution_metrics.csv"
    )
    return paths
```

The ground truth comparison will use:
| Repository | Reference Model Path |
|---|---|
| deriva | reference/reference_deriva.archimate (if exists) |
| bigdata | reference/reference_bigdata.archimate |
Current reference files available:

- `reference/reference_bigdata.archimate` (filtered)
- `reference/reference_lightblue.archimate` (filtered)
- `reference/reference_cloudbased.archimate` (filtered)
- `reference/full_reference_*.archimate` (full, NOT to be used)
Note: no `reference_deriva.archimate` exists, so the ground truth comparison will show N/A or 0 for the deriva repository.
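The missing-reference case can be handled by emitting placeholder rows instead of failing. A minimal sketch, assuming a hypothetical helper (`placeholder_gt_rows` and `GT_COLUMNS` are illustrative names, not existing code; the column order matches ground_truth_comparison.csv above):

```python
from pathlib import Path

# Column order of ground_truth_comparison.csv
GT_COLUMNS = ["repository", "model", "gt_elements", "extracted_elements",
              "matched", "precision", "recall", "f1"]


def placeholder_gt_rows(repos: list[str], reference_dir: Path) -> list[list[str]]:
    """Return N/A rows for repositories that lack a reference model."""
    rows = []
    for repo in repos:
        ref = reference_dir / f"reference_{repo}.archimate"
        if not ref.exists():
            # All metric columns become N/A when no ground truth exists
            rows.append([repo, "all"] + ["N/A"] * (len(GT_COLUMNS) - 2))
    return rows
```

With the current reference set, only `deriva` would produce a placeholder row.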
Run a baseline benchmark:

```shell
uv run deriva-cli benchmark run \
    --repos deriva,bigdata \
    --models mistral-devstral2 \
    -n 1 \
    --per-repo \
    -v \
    -d "Baseline test before BENCHMARK_OUTPUT.md changes"
```

Expected output location: `workspace/benchmarks/bench_YYYYMMDD_HHMMSS/`
```shell
uv run deriva-cli benchmark analyze <session_id>
uv run deriva-cli benchmark comprehensive-analysis <session_id> \
    -o workspace/benchmarks/<session_id>
```

Inspect the output tree:

```shell
# Windows PowerShell
Get-ChildItem -Recurse workspace/benchmarks/<session_id> | Select-Object FullName

# Or simple dir
dir workspace\benchmarks\<session_id> /s
```

After running the baseline benchmark, check:
- [ ] `workspace/benchmarks/{session_id}/` directory created
- [ ] `summary.json` exists (will be renamed to `session_metadata.json`)
- [ ] `events.ocel.json` at root (will be moved to `ocel/`)
- [ ] `events.jsonl` at root (will be moved to `ocel/`)
- [ ] `models/` directory contains XML files
- [ ] Model filenames show the current naming pattern
After implementing the changes and re-running:

- [ ] `session_metadata.json` (renamed)
- [ ] `ocel/benchmark_events.json` (moved + renamed)
- [ ] `ocel/benchmark_events.jsonl` (moved + renamed)
- [ ] `cache/llm_cache.json` (consolidated)
- [ ] `results/consistency_metrics.csv` (new)
- [ ] `results/ground_truth_comparison.csv` (new)
- [ ] `results/quality_verification.csv` (new)
- [ ] `results/inter_model_agreement.json` (extracted)
- [ ] `results/execution_metrics.csv` (new)
- [ ] Model files named `{repo}_{model}_run{n}.xml`
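The post-change checklist can be automated rather than eyeballed. A minimal verification sketch, assuming a hypothetical helper (`missing_outputs` is not existing code; the file list mirrors the checklist above):

```python
from pathlib import Path

# Spec-required files, relative to the session directory
REQUIRED = [
    "session_metadata.json",
    "ocel/benchmark_events.json",
    "ocel/benchmark_events.jsonl",
    "cache/llm_cache.json",
    "results/consistency_metrics.csv",
    "results/ground_truth_comparison.csv",
    "results/quality_verification.csv",
    "results/inter_model_agreement.json",
    "results/execution_metrics.csv",
]


def missing_outputs(session_dir: Path) -> list[str]:
    """Return the spec-required files absent from a session directory."""
    return [rel for rel in REQUIRED if not (session_dir / rel).exists()]
```

An empty return value means the session output matches the spec.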
- Run baseline benchmark with current code
- Inspect output to confirm current structure
- Implement Phase 1 (naming/directory changes)
- Implement Phase 2 (cache consolidation)
- Implement Phase 3 (CSV exports)
- Re-run benchmark and verify output matches spec
| Risk | Mitigation |
|---|---|
| Breaking existing benchmark sessions | Keep backward compatibility by checking for old filenames |
| Missing reference models | Generate placeholder CSV rows with N/A values |
| OCEL parsing errors | Add try/except blocks with graceful fallbacks |
| Large cache files | Limit consolidated entries to current session only |
| File | Changes |
|---|---|
| `deriva/services/benchmarking.py` | Lines 1055, 1178-1184, 1195, 1200-1234 |
| `deriva/services/analysis.py` | Lines 531-548 + 4 new methods |

Total estimated new/modified lines of code: ~150