Tags: hardware, algorithms, simulation, policy

Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation

Curator's Take

This research tackles a critical accessibility barrier in quantum computing by making quantum circuit simulation practical on everyday consumer hardware rather than expensive data-center GPUs. The team's clever cache optimization technique addresses the fundamental memory bottleneck that plagues quantum state-vector simulation, achieving nearly 2x performance improvements on common laptop processors from Intel, AMD, and Apple. This democratization of quantum simulation tools could significantly accelerate quantum algorithm development by enabling researchers and students to prototype and test quantum circuits without requiring specialized hardware. The vendor-agnostic approach is particularly valuable as it ensures the optimization benefits work across different integrated GPU architectures, making quantum computing research more accessible to a broader community.

— Mark Eatherly

Summary

The classical simulation of quantum algorithms is a crucial tool for circuit development, testing, and validation. Although GPU acceleration significantly reduces simulation time, most high-performance simulators rely on vendor-specific frameworks that target data-center hardware. To broaden access to quantum simulation, this work proposes a vendor-agnostic approach targeting the integrated GPUs commonly found in consumer-grade laptops. A primary challenge in state-vector simulation is its inherently poor spatial locality, which creates a memory-bandwidth bottleneck. Consequently, baseline implementations experience a severe degradation in relative GPU speedup as the number of simulated qubits increases. To address this limitation, we introduce a state-partitioning optimization that reorganizes the quantum state vector to maximize last-level cache locality and minimize costly main-memory fetches. We evaluate this strategy using a Quantum Phase Estimation algorithm across diverse architectures from Intel, AMD, and Apple. The experimental results demonstrate that the proposed optimization successfully mitigates performance degradation at larger qubit scales. In particular, for a 28-qubit simulation, the optimization reversed a performance deficit on an Intel Core i5, improving the GPU speedup over the CPU from 0.95x to 1.89x, and increased the Apple M1 Pro speedup from 3.71x to 5.88x. Overall, this approach yields consistent execution-time improvements, demonstrating the viability of integrated GPUs for efficient quantum circuit simulation.
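To make the locality problem concrete, the sketch below (not the paper's code; all function names and the block size are illustrative assumptions) shows why state-vector simulation strides across memory, and how a cache-blocked variant can process the state one contiguous, cache-resident chunk at a time. A single-qubit gate on target qubit t pairs amplitudes a distance 2^t apart, so gates on high-index qubits touch memory with huge strides; if a run of gates acts only on qubits below log2(block size), they can all be applied inside each block before moving on, so each block is fetched from main memory once. Gates on higher qubits would additionally require a remapping/swap step between partitions, which is omitted here.

```python
import numpy as np

# Hadamard gate, used as a simple example single-qubit gate.
H = np.array([[1, 1], [1, -1]], dtype=np.complex128) / np.sqrt(2)

def apply_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` (little-endian) of a state vector.

    Amplitudes are paired at distance 2**target, so large `target`
    values produce large, cache-unfriendly strides.
    """
    stride = 1 << target
    for base in range(0, state.size, stride << 1):
        for i in range(base, base + stride):
            a, b = state[i], state[i + stride]
            state[i] = gate[0, 0] * a + gate[0, 1] * b
            state[i + stride] = gate[1, 0] * a + gate[1, 1] * b

def apply_circuit_blocked(state, circuit, block_qubits):
    """Apply a circuit whose gates all target qubits < block_qubits,
    one contiguous block of 2**block_qubits amplitudes at a time.

    Each block stays resident in cache while every gate is applied to
    it, instead of streaming the whole state once per gate.
    """
    block = 1 << block_qubits
    for start in range(0, state.size, block):
        chunk = state[start:start + block]  # NumPy view: edits hit `state`
        for gate, target in circuit:
            assert target < block_qubits, "high-qubit gates need remapping"
            apply_gate(chunk, gate, target)

# Small demonstration: Hadamards on qubits 0-2 of a 10-qubit |0...0> state.
n = 10
state = np.zeros(1 << n, dtype=np.complex128)
state[0] = 1.0
circuit = [(H, 0), (H, 1), (H, 2)]
apply_circuit_blocked(state, circuit, block_qubits=4)
```

The blocked loop is equivalent to applying each gate over the full state because a gate on qubit t < block_qubits only mixes amplitudes within the same block; the reordering just swaps the loop over gates with the loop over blocks, which is the essence of the cache-blocking idea described above.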