Official dataset and benchmark for the paper: *Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation*.
Multi-agent systems (MAS) powered by large language models (LLMs) are increasingly used to solve complex tasks through collaboration. However, these systems frequently fail due to complex inter-agent dependencies and multi-step reasoning processes.
Understanding where and why failures occur is critical for improving reliability.
❗ Key insight: Failure attribution in MAS is inherently non-deterministic — multiple valid explanations for a failure can coexist.
To address this, we introduce:
- MP-Bench: The first benchmark for multi-perspective failure attribution
- A new evaluation paradigm that moves beyond single “ground-truth” failure labels
Existing benchmarks assume:
- A single deterministic failure step
- A unique correct answer
However, in real-world MAS:
- Multiple reasoning paths may be valid
- Different analysts may attribute failure to different steps
We argue that:
Failure attribution should be modeled as a set of plausible explanations, not a single label.
Figure overview:
- (a) Execution trace of a MAS
- (b) Deterministic attribution (existing assumption)
- (c) Multi-perspective attribution (our formulation)
Different experts may:
- Blame different steps
- Provide different reasoning
- Still be equally valid
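One simple way to operationalize "a set of plausible explanations" is set-based scoring: a predicted failure step is credited if any expert annotated it, and its coverage is the fraction of experts who agree. The sketch below is our illustration of this idea (the function names and the any-match rule are assumptions, not necessarily the paper's exact evaluation protocol):

```python
from typing import Dict, List

def multi_perspective_hit(predicted_step: str,
                          expert_annotations: Dict[str, List[str]]) -> bool:
    """Credit a prediction if ANY expert marked the predicted step as a failure.

    expert_annotations maps an expert id to the list of step ids that
    expert annotated as failure steps for one instance.
    """
    return any(predicted_step in steps for steps in expert_annotations.values())

def coverage(predicted_step: str,
             expert_annotations: Dict[str, List[str]]) -> float:
    """Fraction of experts whose annotations contain the predicted step."""
    experts = list(expert_annotations.values())
    if not experts:
        return 0.0
    hits = sum(predicted_step in steps for steps in experts)
    return hits / len(experts)

# Example: three experts blame overlapping but not identical steps.
annotations = {"expert1": ["5", "16"], "expert2": ["5"], "expert3": ["7"]}
```

Under deterministic attribution, a model predicting step "5" would be wrong whenever the single gold label happened to be "7"; under the set-based view it is credited, with a coverage score reflecting how many perspectives it matches.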
- 289 execution logs, collected from the existing MAST and Who&When benchmarks
- 121 diverse MAS configurations, spanning hand-crafted and automated setups
- 3 expert annotators per instance (experts hired through a rigorous interview process)
- Step-level annotations with supporting reasoning
All released JSON instances live under `MP-Bench/`. The directory mirrors which of the three expert annotators produced the labels, and how the underlying MAS was configured:

| Path | Meaning |
|---|---|
| `MP-Bench/1/`, `2/`, `3/` | Annotations from experts 1, 2, and 3 (three independent perspectives per the benchmark design). |
| `…/automatic/` | Traces from automated, algorithm-generated MAS setups (e.g., configurations aligned with algorithmically generated pipelines). |
| `…/manual/` | Traces from hand-crafted MAS setups (e.g., expert-designed agent workflows). |
Each `*.json` file is one execution log instance; the numeric filename is an instance identifier within that split.
```
MP-Bench/
├── 1/
│   ├── automatic/   # Expert 1 · automated MAS configurations
│   └── manual/      # Expert 1 · hand-crafted MAS configurations
├── 2/
│   ├── automatic/
│   └── manual/
└── 3/
    ├── automatic/
    └── manual/
```
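Given the layout above, instances can be loaded by globbing each expert/split subdirectory. A minimal loading sketch (the helper name and the returned tuple shape are our choices, not part of the release):

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

def iter_instances(root: str) -> Iterator[Tuple[str, str, str, dict]]:
    """Yield (expert_id, split, instance_id, data) for every instance
    under <root>/<expert>/<automatic|manual>/<id>.json."""
    for path in sorted(Path(root).glob("*/*/*.json")):
        expert_id = path.parent.parent.name   # "1", "2", or "3"
        split = path.parent.name              # "automatic" or "manual"
        instance_id = path.stem               # numeric filename
        with path.open() as f:
            yield expert_id, split, instance_id, json.load(f)
```

Grouping the yielded tuples by `(split, instance_id)` across the three expert directories recovers the multiple perspectives for each execution log.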
Each JSON instance includes:

```json
{
  "log_source": "source link",
  "annotation": [
    {
      "step": "5",
      "fail_annotation": "1",
      "fail_category": "Resource Access Error",
      "fail_reason": "Direct access to ScienceDirect was blocked with error message 'There was a problem providing the content you requested' and reference number, indicating institutional access restrictions",
      "ideal_action": "The WebSurfer agent should have immediately recognized the access restriction and pivoted to alternative approaches like searching for published research that might contain this statistical data instead of continuing attempts to access restricted content"
    },
    {
      "step": "16",
      "fail_annotation": "1",
      "fail_category": "System Error",
      "fail_reason": "The WebSurfer agent invocation raised an error",
      "ideal_action": "The WebSurfer agent could have had better exception handling."
    }
  ]
}
```

If you use MP-Bench, please cite:

```bibtex
@article{in2026rethinking,
  title={Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation},
  author={In, Yeonjun and Tanjim, Mehrab and Subramanian, Jayakumar and Kim, Sungchul and Bhattacharya, Uttaran and Kim, Wonjoong and Park, Sangwu and Sarkhel, Somdeb and Park, Chanyoung},
  journal={arXiv preprint arXiv:2603.25001},
  year={2026}
}
```

This dataset is released under the Adobe Research License.
- The dataset is limited to non-commercial research purposes only
- Allowed uses include:
  - Academic research
  - Teaching
- Commercial use is strictly prohibited, including:
  - Product development
  - Commercial distribution
  - Any activity leading to financial gain
If you redistribute this dataset (or modified versions):
- You must include a copy of the original Adobe Research License
- Any derivative work must also be restricted to non-commercial research use
- You must retain all original copyright notices and disclaimers
The dataset is provided "as is" without warranty of any kind.
Adobe is not liable for any damages resulting from its use.
For full details, please refer to Adobe Research License v1.2.txt.
