
Benchmark Implementation Plan

Objective

Implement benchmark output structure per .claude/BENCHMARK_OUTPUT.md specification, then run a test benchmark to verify.


Configuration Summary

| Setting       | Value                                                |
|---------------|------------------------------------------------------|
| LLM Model     | mistral-devstral2 (Mistral Devstral-2512)            |
| Repositories  | deriva, bigdata                                      |
| Runs per repo | 1                                                    |
| Ground Truth  | reference/reference_*.archimate (filtered versions)  |

Current vs Desired Output Structure

Current Structure (what Deriva produces now)

workspace/benchmarks/{session_id}/
├── summary.json                           # Session metadata
├── events.ocel.json                       # OCEL 2.0 event log at root
├── events.jsonl                           # JSONL streaming log at root
├── models/
│   ├── deriva_mistral-devstral2_1.xml    # Model naming: {repo}_{model}_{n}.xml
│   └── bigdata_mistral-devstral2_1.xml
├── cache/
│   └── mistral-devstral2/                 # Per-model subdirectory
│       ├── abc123def456.json              # Individual cache files by hash
│       └── ...
└── analysis/
    └── summary.json                       # Consistency analysis (JSON only)

Desired Structure (per BENCHMARK_OUTPUT.md)

benchmark_session_<timestamp>/
├── session_metadata.json                  # Renamed from summary.json
├── models/
│   ├── deriva_devstral2_run1.xml         # Model naming: {repo}_{model}_run{n}.xml
│   └── bigdata_devstral2_run1.xml
├── cache/
│   └── llm_cache.json                     # Single consolidated cache file
├── ocel/
│   └── benchmark_events.json              # OCEL in subdirectory, renamed
└── results/
    ├── consistency_metrics.csv            # NEW: Per-run stability
    ├── ground_truth_comparison.csv        # NEW: Precision/recall/F1
    ├── quality_verification.csv           # NEW: Quality scores
    ├── inter_model_agreement.json         # Extracted from analysis
    └── execution_metrics.csv              # NEW: Timing/tokens

Gap Analysis

| #  | Item                        | Current                  | Desired                      | File to Modify  | Lines      | Effort |
|----|-----------------------------|--------------------------|------------------------------|-----------------|------------|--------|
| 1  | Model filename              | {repo}_{model}_{n}.xml   | {repo}_{model}_run{n}.xml    | benchmarking.py | 1055       | Low    |
| 2  | OCEL location               | Root level               | ocel/ subdirectory           | benchmarking.py | 1178-1184  | Low    |
| 3  | OCEL filename               | events.ocel.json         | benchmark_events.json        | benchmarking.py | 1179       | Low    |
| 4  | Metadata file               | summary.json             | session_metadata.json        | benchmarking.py | 1195       | Low    |
| 5  | Cache structure             | Per-model dirs           | Single llm_cache.json        | benchmarking.py | 1200-1234  | Medium |
| 6  | Results directory           | analysis/                | results/                     | analysis.py     | 541-548    | Low    |
| 7  | consistency_metrics.csv     | Does not exist           | Required                     | analysis.py     | New method | Medium |
| 8  | ground_truth_comparison.csv | Does not exist           | Required                     | analysis.py     | New method | Medium |
| 9  | quality_verification.csv    | Does not exist           | Required                     | analysis.py     | New method | Medium |
| 10 | execution_metrics.csv       | Does not exist           | Required                     | analysis.py     | New method | Medium |

Implementation Phases

Phase 1: Directory Structure & Naming (P0 - Low Effort)

Task 1.1: Update Model File Naming

File: h:\Deriva\deriva\deriva\services\benchmarking.py Line: 1055

# CURRENT (line 1055):
filename = f"{safe_repo}_{safe_model}_{iteration}.xml"

# CHANGE TO:
filename = f"{safe_repo}_{safe_model}_run{iteration}.xml"

Impact: Model files will be named bigdata_mistral-devstral2_run1.xml instead of bigdata_mistral-devstral2_1.xml


Task 1.2: Move OCEL Files to ocel/ Subdirectory

File: h:\Deriva\deriva\deriva\services\benchmarking.py Lines: 1178-1184

# CURRENT (lines 1178-1184):
# Export OCEL JSON
ocel_json_path = output_dir / "events.ocel.json"
self.ocel_log.export_json(ocel_json_path)

# Export JSONL for streaming
ocel_jsonl_path = output_dir / "events.jsonl"
self.ocel_log.export_jsonl(ocel_jsonl_path)

# CHANGE TO:
# Create ocel subdirectory
ocel_dir = output_dir / "ocel"
ocel_dir.mkdir(parents=True, exist_ok=True)

# Export OCEL JSON
ocel_json_path = ocel_dir / "benchmark_events.json"
self.ocel_log.export_json(ocel_json_path)

# Export JSONL for streaming
ocel_jsonl_path = ocel_dir / "benchmark_events.jsonl"
self.ocel_log.export_jsonl(ocel_jsonl_path)

Impact: OCEL files moved from root to ocel/ subdirectory


Task 1.3: Rename summary.json to session_metadata.json

File: h:\Deriva\deriva\deriva\services\benchmarking.py Line: 1195

# CURRENT (line 1195):
with open(output_dir / "summary.json", "w") as f:

# CHANGE TO:
with open(output_dir / "session_metadata.json", "w") as f:

Impact: Session metadata file renamed


Phase 2: Cache Consolidation (P1 - Medium Effort)

Task 2.1: Replace _copy_used_cache_entries() Method

File: h:\Deriva\deriva\deriva\services\benchmarking.py Lines: 1200-1234

Replace the entire method with consolidation logic:

def _copy_used_cache_entries(
    self,
    used_cache_keys: list[str],
    cache_dir: Path,
    model_name: str,
) -> int:
    """
    Consolidate used LLM cache entries into single llm_cache.json.

    Instead of copying individual hash-named files to per-model subdirectories,
    consolidates all entries into a single JSON file per BENCHMARK_OUTPUT.md spec.

    Args:
        used_cache_keys: List of cache keys (SHA256 hashes) used during the run
        cache_dir: Source cache directory where cache files are stored
        model_name: Name of the model (used for tagging entries)

    Returns:
        Number of cache entries in consolidated file
    """
    session_id = self.session_id or "unknown"
    cache_file = Path("workspace/benchmarks") / session_id / "cache" / "llm_cache.json"
    cache_file.parent.mkdir(parents=True, exist_ok=True)

    # Load existing consolidated cache or create new
    consolidated: dict[str, Any] = {}
    if cache_file.exists():
        try:
            with open(cache_file) as f:
                consolidated = json.load(f)
        except json.JSONDecodeError:
            consolidated = {}

    # Add new entries from this run
    for cache_key in set(used_cache_keys):  # Deduplicate
        src = cache_dir / f"{cache_key}.json"
        if src.exists() and cache_key not in consolidated:
            try:
                with open(src) as f:
                    data = json.load(f)
                    # Extract fields per BENCHMARK_OUTPUT.md spec
                    consolidated[cache_key] = {
                        "prompt": data.get("request", {}).get("messages", [{}])[-1].get("content", ""),
                        "response": data.get("response", {}).get("content", ""),
                        "model": model_name,
                        "tokens_in": data.get("usage", {}).get("input_tokens", 0),
                        "tokens_out": data.get("usage", {}).get("output_tokens", 0),
                        "timestamp": data.get("created_at", ""),
                        "bench_hash": f"run{self._current_iteration}_{self._current_repo}_{model_name}",
                    }
            except (OSError, json.JSONDecodeError):
                pass  # Skip on read errors

    # Write consolidated cache
    with open(cache_file, "w") as f:
        json.dump(consolidated, f, indent=2)

    return len(consolidated)

Note: the orchestrator class must track _current_iteration and _current_repo so that each cache entry can be tagged with the run context in which it was used.
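The run-context tracking can be sketched as follows. This is a minimal stand-in, not the real orchestrator in benchmarking.py: the class and method names are placeholders, and only the two attributes and the bench_hash tag format from the method above come from this plan.

```python
# Sketch only: BenchmarkOrchestrator and run_single are placeholder names;
# the real orchestrator class and its run loop may be shaped differently.

class BenchmarkOrchestrator:
    """Minimal stand-in showing where per-run context would be recorded."""

    def __init__(self) -> None:
        self._current_iteration = 0
        self._current_repo = ""

    def run_single(self, repo: str, iteration: int, model_name: str) -> str:
        # Record run context before any LLM calls, so that cache entries
        # written during this run can be tagged with it.
        self._current_iteration = iteration
        self._current_repo = repo
        # ... derivation work happens here, populating the LLM cache ...
        return self._bench_hash(model_name)

    def _bench_hash(self, model_name: str) -> str:
        # Same tag format used by _copy_used_cache_entries above.
        return f"run{self._current_iteration}_{self._current_repo}_{model_name}"
```

A call such as `run_single("bigdata", 1, "mistral-devstral2")` would then produce the tag `run1_bigdata_mistral-devstral2` for that run's cache entries.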


Phase 3: CSV Export Implementation (P1 - Medium Effort)

Task 3.1: Add CSV Export Methods to BenchmarkAnalyzer

File: h:\Deriva\deriva\deriva\services\analysis.py Location: After line 548 (after export_all method)

Add these 4 new methods:

import csv
from pathlib import Path

def export_consistency_metrics_csv(self, path: Path) -> str:
    """
    Export consistency_metrics.csv with per-run element and relationship stability.

    Columns: repository, model, run, total_elements, stable_elements,
             element_stability, total_relationships, stable_relationships,
             relationship_stability
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "run",
            "total_elements", "stable_elements", "element_stability",
            "total_relationships", "stable_relationships", "relationship_stability"
        ])

        for repo, phases in report.stability_reports.items():
            derivation = phases.get("derivation")
            if derivation:
                # Aggregate element stats
                total_elem = sum(b.total for b in derivation.element_breakdown)
                stable_elem = sum(b.stable for b in derivation.element_breakdown)
                elem_stability = stable_elem / total_elem if total_elem > 0 else 0.0

                # Aggregate relationship stats
                total_rel = sum(b.total for b in derivation.relationship_breakdown)
                stable_rel = sum(b.stable for b in derivation.relationship_breakdown)
                rel_stability = stable_rel / total_rel if total_rel > 0 else 0.0

                writer.writerow([
                    repo, "all", "all",
                    total_elem, stable_elem, f"{elem_stability:.2%}",
                    total_rel, stable_rel, f"{rel_stability:.2%}"
                ])

    return str(path)


def export_ground_truth_comparison_csv(self, path: Path) -> str:
    """
    Export ground_truth_comparison.csv with precision/recall/F1 against reference models.

    Columns: repository, model, gt_elements, extracted_elements, matched,
             precision, recall, f1

    Uses reference models from reference/reference_{repo}.archimate
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "gt_elements", "extracted_elements",
            "matched", "precision", "recall", "f1"
        ])

        for repo, sem_report in report.semantic_reports.items():
            writer.writerow([
                repo, "all",
                sem_report.reference_element_count,
                sem_report.derived_element_count,
                len(sem_report.correctly_derived),
                f"{sem_report.precision:.3f}",
                f"{sem_report.recall:.3f}",
                f"{sem_report.f1_score:.3f}"
            ])

    return str(path)


def export_quality_verification_csv(self, path: Path) -> str:
    """
    Export quality_verification.csv with quality rubric scores.

    Columns: repository, model, validity_pct, type_correct_pct, name_quality_pct
    """
    report = self.generate_report()
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "validity_pct", "type_correct_pct", "name_quality_pct"
        ])

        for repo, sem_report in report.semantic_reports.items():
            # Validity: Elements traceable to source
            validity = sem_report.precision if sem_report else 0.0

            # Type correct: Elements with matching ArchiMate type
            type_correct = len([m for m in sem_report.correctly_derived
                               if m.match_type in ("exact", "fuzzy_name")]) / max(sem_report.derived_element_count, 1)

            # Name quality: Elements with meaningful names (fuzzy match quality)
            name_quality = sem_report.avg_similarity if hasattr(sem_report, 'avg_similarity') else 0.0

            writer.writerow([
                repo, "all",
                f"{validity:.1%}",
                f"{type_correct:.1%}",
                f"{name_quality:.1%}"
            ])

    return str(path)


def export_execution_metrics_csv(self, path: Path) -> str:
    """
    Export execution_metrics.csv with runtime performance data.

    Columns: repository, model, run, duration_sec, event_count,
             tokens_in, tokens_out, api_calls

    Extracts timing data from OCEL events.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Parse OCEL for timing data
    run_metrics: dict[str, dict] = {}

    for event in self.ocel_log.events:
        if event.activity == "StartRun":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            run_metrics[run_id] = {
                "start": event.timestamp,
                "repo": event.objects.get("Repository", [""])[0],
                "model": event.objects.get("Model", [""])[0],
                "events": 0,
                "tokens_in": 0,
                "tokens_out": 0,
                "api_calls": 0,
            }
        elif event.activity == "CompleteRun":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            if run_id in run_metrics:
                run_metrics[run_id]["end"] = event.timestamp
        elif event.activity == "LLMRequest":
            run_id = event.objects.get("BenchmarkRun", [""])[0]
            if run_id in run_metrics:
                run_metrics[run_id]["api_calls"] += 1
                run_metrics[run_id]["tokens_in"] += event.attributes.get("tokens_in", 0)
                run_metrics[run_id]["tokens_out"] += event.attributes.get("tokens_out", 0)

        # Count all events for this run
        for run_id in event.objects.get("BenchmarkRun", []):
            if run_id in run_metrics:
                run_metrics[run_id]["events"] += 1

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([
            "repository", "model", "run", "duration_sec",
            "event_count", "tokens_in", "tokens_out", "api_calls"
        ])

        for run_id, metrics in run_metrics.items():
            # Parse run number from run_id (format: session:repo:model:n)
            parts = run_id.split(":")
            run_num = parts[-1] if len(parts) >= 4 else "1"

            # Calculate duration
            start = metrics.get("start")
            end = metrics.get("end", start)
            duration = (end - start).total_seconds() if start and end else 0.0

            writer.writerow([
                metrics["repo"], metrics["model"], run_num,
                f"{duration:.1f}",
                metrics["events"],
                metrics["tokens_in"],
                metrics["tokens_out"],
                metrics["api_calls"]
            ])

    return str(path)

Task 3.2: Update export_all() to Use results/ Directory

File: h:\Deriva\deriva\deriva\services\analysis.py Lines: 531-548

# CURRENT:
def export_all(self, output_dir: str | Path) -> dict[str, str]:
    """Export all formats to a directory."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    paths = {}
    paths["json"] = self.export_json(output_dir / "benchmark_analysis.json")
    paths["markdown"] = self.export_markdown(output_dir / "benchmark_analysis.md")

    return paths

# CHANGE TO:
def export_all(self, output_dir: str | Path) -> dict[str, str]:
    """Export all formats to results/ directory per BENCHMARK_OUTPUT.md spec."""
    output_dir = Path(output_dir)
    results_dir = output_dir / "results"
    results_dir.mkdir(parents=True, exist_ok=True)

    paths = {}

    # JSON exports
    paths["inter_model_json"] = self.export_json(results_dir / "inter_model_agreement.json")
    paths["markdown"] = self.export_markdown(results_dir / "benchmark_analysis.md")

    # CSV exports (new)
    paths["consistency_csv"] = self.export_consistency_metrics_csv(
        results_dir / "consistency_metrics.csv"
    )
    paths["ground_truth_csv"] = self.export_ground_truth_comparison_csv(
        results_dir / "ground_truth_comparison.csv"
    )
    paths["quality_csv"] = self.export_quality_verification_csv(
        results_dir / "quality_verification.csv"
    )
    paths["execution_csv"] = self.export_execution_metrics_csv(
        results_dir / "execution_metrics.csv"
    )

    return paths

Reference Model Configuration

The ground truth comparison will use:

| Repository | Reference Model Path                             |
|------------|--------------------------------------------------|
| deriva     | reference/reference_deriva.archimate (if exists) |
| bigdata    | reference/reference_bigdata.archimate            |

Current reference files available:

  • reference/reference_bigdata.archimate (filtered)
  • reference/reference_lightblue.archimate (filtered)
  • reference/reference_cloudbased.archimate (filtered)
  • reference/full_reference_*.archimate (full - NOT to be used)

Note: No reference_deriva.archimate exists. Ground truth comparison will show N/A or 0 for the deriva repository.
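A minimal sketch of how reference lookup could handle the missing file, assuming the filtered naming convention from the table above; resolve_reference_path is a hypothetical helper, not an existing Deriva function:

```python
from pathlib import Path
from typing import Optional

# Hypothetical helper: maps a repo name to its filtered reference model,
# returning None when the file is absent (e.g. for the deriva repo) so the
# caller can emit an N/A row instead of failing.

def resolve_reference_path(
    repo: str, reference_dir: Path = Path("reference")
) -> Optional[Path]:
    """Return the filtered reference model for a repo, or None if absent."""
    candidate = reference_dir / f"reference_{repo}.archimate"
    return candidate if candidate.exists() else None
```

Callers in the ground truth export would then write a placeholder row whenever this returns None, per the risk mitigation table below.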


CLI Commands

Step 1: Run Benchmark (Current Code - Baseline)

uv run deriva-cli benchmark run \
  --repos deriva,bigdata \
  --models mistral-devstral2 \
  -n 1 \
  --per-repo \
  -v \
  -d "Baseline test before BENCHMARK_OUTPUT.md changes"

Expected output location: workspace/benchmarks/bench_YYYYMMDD_HHMMSS/

Step 2: Run Analysis

uv run deriva-cli benchmark analyze <session_id>

Step 3: Run Comprehensive Analysis

uv run deriva-cli benchmark comprehensive-analysis <session_id> \
  -o workspace/benchmarks/<session_id>

Step 4: Verify Output Structure

# Windows PowerShell
Get-ChildItem -Recurse workspace/benchmarks/<session_id> | Select-Object FullName

# Or simple dir
dir workspace\benchmarks\<session_id> /s

Verification Checklist

After running baseline benchmark, check:

  • workspace/benchmarks/{session_id}/ directory created
  • summary.json exists (will be renamed to session_metadata.json)
  • events.ocel.json at root (will be moved to ocel/)
  • events.jsonl at root (will be moved to ocel/)
  • models/ directory contains XML files
  • Model filenames show current naming pattern

After implementing changes and re-running:

  • session_metadata.json (renamed)
  • ocel/benchmark_events.json (moved + renamed)
  • ocel/benchmark_events.jsonl (moved + renamed)
  • cache/llm_cache.json (consolidated)
  • results/consistency_metrics.csv (new)
  • results/ground_truth_comparison.csv (new)
  • results/quality_verification.csv (new)
  • results/inter_model_agreement.json (extracted)
  • results/execution_metrics.csv (new)
  • Model files named {repo}_{model}_run{n}.xml
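The post-change checklist can be verified with a short script. This is a sketch: missing_outputs is a hypothetical helper, and the expected relative paths simply mirror the checklist above.

```python
from pathlib import Path

# Relative paths required by BENCHMARK_OUTPUT.md, as summarized in this plan.
EXPECTED = [
    "session_metadata.json",
    "ocel/benchmark_events.json",
    "ocel/benchmark_events.jsonl",
    "cache/llm_cache.json",
    "results/consistency_metrics.csv",
    "results/ground_truth_comparison.csv",
    "results/quality_verification.csv",
    "results/inter_model_agreement.json",
    "results/execution_metrics.csv",
]

def missing_outputs(session_dir: Path) -> list[str]:
    """Return the checklist paths that are absent under a session directory."""
    return [rel for rel in EXPECTED if not (session_dir / rel).exists()]
```

Running `missing_outputs(Path("workspace/benchmarks/<session_id>"))` after the re-run should return an empty list; any entries it returns are spec gaps still to fix.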

Implementation Order

  1. Run baseline benchmark with current code
  2. Inspect output to confirm current structure
  3. Implement Phase 1 (naming/directory changes)
  4. Implement Phase 2 (cache consolidation)
  5. Implement Phase 3 (CSV exports)
  6. Re-run benchmark and verify output matches spec

Risk Mitigation

| Risk                                 | Mitigation                                               |
|--------------------------------------|----------------------------------------------------------|
| Breaking existing benchmark sessions | Keep backward compatibility by checking for old filenames |
| Missing reference models             | Generate placeholder CSV rows with N/A values             |
| OCEL parsing errors                  | Add try/except blocks with graceful fallbacks             |
| Large cache files                    | Limit consolidated entries to the current session only    |
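The backward-compatibility mitigation could be implemented as a filename parser that accepts both naming schemes; parse_model_filename is a hypothetical helper used here only to illustrate the idea.

```python
import re
from pathlib import Path
from typing import Optional

# Matches both the old ({repo}_{model}_{n}.xml) and the new
# ({repo}_{model}_run{n}.xml) model filenames from this plan.
_MODEL_FILE = re.compile(
    r"^(?P<repo>.+?)_(?P<model>.+)_(?:run)?(?P<n>\d+)\.xml$"
)

def parse_model_filename(name: str) -> Optional[tuple[str, str, int]]:
    """Extract (repo, model, run_number) from either naming scheme, else None."""
    m = _MODEL_FILE.match(name)
    if not m:
        return None
    return m.group("repo"), m.group("model"), int(m.group("n"))
```

Analysis code that loads models from older sessions can call this instead of assuming the new `run{n}` suffix, so pre-change sessions keep working.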

Files Summary

| File                            | Changes                                   |
|---------------------------------|-------------------------------------------|
| deriva/services/benchmarking.py | Lines 1055, 1178-1184, 1195, 1200-1234    |
| deriva/services/analysis.py     | Lines 531-548 + 4 new methods             |

Total estimate: ~150 new or modified lines of code