A benchmark for evaluating AI coding agents on legacy software engineering tasks.
The software that processes trillions in daily financial settlements, routes telephone calls across continents, and adjudicates insurance claims was written in COBOL, Fortran, and Java 7. The engineers who understand it are retiring faster than they can be replaced. Every major coding agent benchmark evaluates agents on modern Python and JavaScript. None of them reflect the reality of working with the world's most critical infrastructure.
Legacy-Bench measures how well frontier AI agents can maintain, debug, and modernize legacy code.
Legacy-Bench consists of hundreds of tasks spanning six legacy language families and real enterprise domains. This repository contains ten representative public sample tasks. The full benchmark is available for evaluation -- contact Factory for access.
| Language | % of Benchmark | Domains |
|---|---|---|
| COBOL | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
| Java 7 | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |
| BASIC | 6% | Business applications, accounting, data processing |
| C89 | 5% | Systems programming, low-level debugging, protocol implementation |
| Fortran | 5% | Scientific computing, numerical methods, physics simulation |
| Assembly | 5% | x86 firmware parsing, protocol decoding, hardware simulation |
| Task | Language | Type | Description |
|---|---|---|---|
| 1907c2 | C | fix/debug | Legacy buddy allocator fix |
| 16b04d | COBOL | migration | Railroad retirement migration |
| 2831b5 | Java 7 | fix/debug | Rating engine repair |
| 3af1fe | COBOL | fix/debug | Bond settlement reconciliation |
| 505812 | Java 7 | fix/debug | Inventory cost fix |
| 6fe1ab | Java 7 | fix/debug | MTOM attachment corruption fix |
| 8e8098 | COBOL | fix/debug | Railcar settlement fix |
| d1ddc1 | Fortran | migration | Lattice QCD migration to C++ |
| ecf5e7 | x86-64 ASM | fix/debug | MZ/NE header parser fix |
| fac397 | COBOL | migration | Batch interest migration |
Each task directory follows the Harbor task format:
```
tasks/<task-id>/
  instruction.md   # What the agent must do
  task.toml        # Configuration (timeout, resources, etc.)
  environment/     # The legacy codebase and Dockerfile
  solution/        # Reference solution (oracle)
  tests/           # Verifier scripts run after the agent finishes
```
The agent receives instruction.md and the environment/ directory. After the agent submits its changes, the verifier in tests/ is executed inside the container to check correctness.
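The shape of a verifier varies by task, but many batch tasks reduce to diffing the produced records against a golden file. A minimal sketch in that style, assuming a fixed-width record layout (the record width, filenames, and data below are invented for illustration, not taken from an actual Legacy-Bench task):

```python
# Hypothetical verifier sketch: compare a program's fixed-width batch
# output against a golden file, record by record. The width and layout
# are invented for illustration; real verifiers differ per task.

RECORD_WIDTH = 80  # assumed record length for this illustrative layout

def normalize(text: str) -> list[str]:
    """Trim each record to the assumed width and drop trailing blanks."""
    return [line[:RECORD_WIDTH].rstrip() for line in text.splitlines()]

def diff_records(actual: str, golden: str) -> list[str]:
    """Return human-readable mismatches; an empty list means pass."""
    a, g = normalize(actual), normalize(golden)
    errors = []
    if len(a) != len(g):
        errors.append(f"record count: got {len(a)}, want {len(g)}")
    for i, (ra, rg) in enumerate(zip(a, g), start=1):
        if ra != rg:
            errors.append(f"record {i}: got {ra!r}, want {rg!r}")
    return errors

def test_outputs():
    golden = "HDR 20260101\nDTL 000123\n"
    assert diff_records("HDR 20260101\nDTL 000123\n", golden) == []
    assert diff_records("HDR 20260101\nDTL 000124\n", golden) != []
```

Comparing normalized records rather than raw bytes avoids spurious failures from trailing padding, which is common in fixed-width legacy output.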
- Docker
- Harbor (for automated evaluation)
```
pip install harbor
```

Verify that the tasks and verifiers work by running the oracle:
```
harbor run --dataset legacy-bench \
  --agent oracle \
  --n-concurrent 4
```

To evaluate an agent such as Claude Code:

```
export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset legacy-bench \
  --agent claude-code \
  --model anthropic/claude-opus-4-6 \
  --n-concurrent 4
```

Or use any other Harbor-compatible agent. See the Harbor documentation for details on integrating custom agents.
Each task can also be run manually with Docker:
```
cd tasks/1907c2-c-debug-legacy-buddy-fix

# Build the container
docker build -t legacy-bench-1907c2 -f environment/Dockerfile environment/

# Run the container interactively
docker run -it legacy-bench-1907c2 /bin/bash

# After making changes inside the container, run the verifier
pytest tests/test_outputs.py
```

Refer to task.toml for task-specific settings (timeout, internet access, etc.).
Overall pass rates on the full benchmark range from 16.9% to 42.5% across the 12 model-agent combinations evaluated. For context, the same frontier models score above 70% on Terminal-Bench 2 and SWE-bench Verified.
Key findings:
- Agent iteration works only where errors are visible. Java 7 bug fixing scores highest because stack traces tell the agent what went wrong. COBOL bugs are silent -- wrong output looks correct.
- Bug fixing outperforms implementation and migration. Bug fixing scores roughly 2x higher than implementation, which scores roughly 2x higher than migration. Every model shows this pattern.
- No single model wins. Each model has categorical failures on entire language families. Rankings are inconsistent across task types.
- Agents don't know when they're wrong. In 97% of failures, the agent believes it has solved the task.
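The "silent bug" failure mode is easy to reproduce in miniature. In this toy sketch (rate and principals invented for illustration), interest is computed with truncation instead of rounding, a classic COBOL `COMPUTE` pitfall: every individual record is within a cent of the correct value and looks plausible on a spot check, yet the batch total is wrong.

```python
# Toy illustration of a "silent" legacy bug: per-record truncation
# vs. correct rounding. The rate and principals are invented.
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

RATE = Decimal("0.0375")  # illustrative annual rate

def interest(principal: Decimal, rounding: str) -> Decimal:
    """Interest on one record, quantized to cents with the given mode."""
    return (principal * RATE).quantize(Decimal("0.01"), rounding=rounding)

principals = [Decimal(p) for p in ("1034.57", "2210.03", "879.99", "15002.41")]

truncated = sum(interest(p, ROUND_DOWN) for p in principals)      # buggy path
rounded = sum(interest(p, ROUND_HALF_UP) for p in principals)     # correct path

# Each record is off by at most one cent, so output "looks correct"...
for p in principals:
    assert abs(interest(p, ROUND_DOWN) - interest(p, ROUND_HALF_UP)) <= Decimal("0.01")

# ...but the batch totals silently diverge.
assert truncated != rounded
```

No exception is raised and no stack trace appears; only an end-to-end check against known-good totals catches the drift, which is why iterative agents that rely on visible errors struggle here.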
Read the full analysis: factory.ai/news/legacy-bench
This project is licensed under the Apache License 2.0 -- see the LICENSE file for details.
```bibtex
@misc{legacybench2026,
  title={Legacy-Bench: A Benchmark for AI Agents on Legacy Software Engineering Tasks},
  author={Factory AI},
  year={2026},
  url={https://github.com/factory-ai/legacy-bench}
}
```