Curator's Take
This benchmark represents a crucial step toward automating one of quantum computing's most pressing challenges: designing the complex error correction circuits needed for fault-tolerant quantum computers. As quantum systems scale beyond hundreds of qubits, manually crafting these specialized circuits becomes impractical, making AI-assisted synthesis essential for practical quantum computing. StabilizerBench provides the first standardized testing ground for measuring how well AI agents handle this intricate task, covering everything from basic circuit generation to full fault-tolerant implementations across 192 quantum error correction codes. The benchmark's clever use of stabilizer circuits allows efficient verification at scale while still testing core quantum programming skills, potentially accelerating the development of AI tools that could unlock the next generation of quantum hardware.
— Mark Eatherly
Summary
As quantum hardware scales toward fault-tolerant operation, the demand for correct quantum error correction (QEC) circuits far outpaces manual design capacity. AI agents offer a promising path to automating this synthesis, yet no benchmark exists to measure their progress on the specialized task of generating QEC circuits. We introduce StabilizerBench, a benchmark suite of 192 stabilizer codes spanning 12 families, 4-196 qubits, and distances 2-21, organized into three tasks of increasing difficulty: state-preparation circuit generation, circuit optimization under semantic constraints, and fault-tolerant circuit synthesis. Although the benchmark is motivated by QEC, stabilizer circuits exercise core competencies required for general quantum programming, including gate decomposition, qubit routing, and semantics-preserving transformations. At the same time, they admit efficient verification via the Gottesman-Knill theorem, enabling the benchmark to scale to large codes without the exponential cost of full unitary comparison. We define a unified generator-weighted scoring system with two tiers: a capability score measuring breadth of success and a quality score capturing circuit merit. We also introduce continuous fault-tolerance and optimization metrics that grade error resilience and circuit improvement beyond binary pass/fail. Following the design of classical benchmarks such as SWE-bench, StabilizerBench specifies inputs, verification oracles, and scoring but leaves prompts and agent strategies open. We evaluate three frontier AI agents and find that the benchmark discriminates across models and tasks, with substantial headroom for improvement.
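To make the Gottesman-Knill verification step concrete: because every benchmark circuit is Clifford, a tableau simulator can decide in polynomial time whether a candidate state-preparation circuit lands in the code space, by checking that the output state is a +1 eigenstate of every stabilizer generator. Below is a minimal sketch of such an oracle using the open-source stim simulator; the paper does not specify its actual harness, and the [[4,2,2]]-code generators and preparation circuit here are purely illustrative.

```python
import stim

def verify_state_prep(circuit: stim.Circuit, generators: list[str]) -> bool:
    """Check that `circuit`, applied to |0...0>, prepares a +1 eigenstate
    of every stabilizer generator (efficient via Gottesman-Knill)."""
    sim = stim.TableauSimulator()
    sim.do_circuit(circuit)
    for g in generators:
        # peek_observable_expectation returns +1 or -1 when the Pauli
        # observable is deterministic on the current state, 0 if random.
        if sim.peek_observable_expectation(stim.PauliString(g)) != +1:
            return False
    return True

# Illustrative check: a GHZ-style Clifford circuit prepares a state in the
# code space of the [[4,2,2]] code, whose generators are XXXX and ZZZZ.
prep = stim.Circuit("""
    H 0
    CX 0 1 0 2 0 3
""")
print(verify_state_prep(prep, ["XXXX", "ZZZZ"]))  # True
```

Each generator check costs only polynomial time in the number of qubits, so an oracle of this shape scales to the benchmark's largest 196-qubit codes, whereas comparing full unitaries would require exponentially large matrices.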