Qulacs on Deucalion: Technical Overview & User Guide

Introduction

Qulacs is a high-performance, versatile quantum circuit simulator developed for quantum computing research. Written primarily in C++ with optimized backends (CPU parallelization via OpenMP and SIMD optimizations) and user-friendly Python bindings, Qulacs aims to provide researchers with one of the fastest simulation environments available.

On the Deucalion supercomputer, Qulacs has been installed and tested on the ARM partition, where it uses the system's large-scale distributed computing capabilities to simulate quantum systems of up to 40 qubits, beyond the reach of typical workstations.

Key Features of Qulacs (General)

  • High Speed: Optimized C++ core using SVE instructions and multi-threading (OpenMP) for fast single-node performance.
  • MPI Parallelization: Enables distributed memory parallelization using MPI, allowing simulations to scale across multiple compute nodes for larger qubit systems. (This is the key feature utilized on Deucalion).
  • Versatile Functionality:
    • Simulation of quantum state vectors and density matrices.
    • Calculation of expectation values of observables.
    • Sampling measurement outcomes from quantum states.
    • Noise simulation (including Pauli noise, damping, Kraus operators).
    • Wide range of predefined quantum gates (Pauli, Clifford, Rotation, SWAP, controlled gates, etc.).
    • Support for parametric gates and variational circuits (useful for VQE, QAOA).
    • Gate fusion optimizations.
    • Fast quantum circuit simulations.
  • Python Interface: Easy-to-use Python bindings allow quick prototyping, integration with other Python libraries (like NumPy, SciPy), and script-based simulation control.
  • C++ Interface: Allows direct integration into C++ projects for maximum performance and low-level control.
  • Open Source: Available under the MIT license.
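To make the feature list concrete, here is a minimal sketch of the Python interface: it prepares a two-qubit Bell state, evaluates two observables, and samples measurement outcomes. The function name `bell_state_demo` is illustrative and not part of Qulacs; the import is placed inside the function only so the file can be loaded before the module is available.

```python
def bell_state_demo(shots=100):
    """Prepare a 2-qubit Bell state; return (<Z0 Z1>, <Z0>, samples)."""
    # Lazy import: qulacs is only needed when the function is called.
    from qulacs import Observable, QuantumCircuit, QuantumState

    n_qubits = 2
    state = QuantumState(n_qubits)
    state.set_zero_state()  # start from |00>

    # H on qubit 0 followed by CNOT(0 -> 1) gives (|00> + |11>) / sqrt(2).
    circuit = QuantumCircuit(n_qubits)
    circuit.add_H_gate(0)
    circuit.add_CNOT_gate(0, 1)
    circuit.update_quantum_state(state)

    # Observables are built from Pauli strings such as "Z 0 Z 1".
    zz = Observable(n_qubits)
    zz.add_operator(1.0, "Z 0 Z 1")
    z0 = Observable(n_qubits)
    z0.add_operator(1.0, "Z 0")

    # Each sample is an integer whose bits encode the measured qubits.
    samples = state.sampling(shots)
    return zz.get_expectation_value(state), z0.get_expectation_value(state), samples
```

For a Bell state the two qubits are perfectly correlated, so the Z0 Z1 expectation value should be 1, the Z0 expectation value 0, and every sample either 0 (binary 00) or 3 (binary 11).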

Qulacs Deployment on Deucalion

Qulacs is readily available for use on Deucalion, specifically tested and optimized for the ARM partition.

  • Accessing Qulacs: Qulacs is installed as an environment module. To load it, run the following command on a node of any ARM partition:

    ml qulacs
    
    This command will set up the necessary paths and environment variables, making the Qulacs Python package available for import in your Python scripts (import qulacs).

  • Demonstrated Scalability: Qulacs has been successfully tested utilizing the MPI parallelization capabilities on Deucalion.

    • Maximum Nodes Tested: 1024 ARM nodes.
    • Maximum Qubit System Simulated: 40 qubits (achieved using 1024 nodes).
  • Qubit Scaling on Deucalion: The maximum number of qubits (n) you can simulate depends on the number of nodes (N) allocated for your job: n = 30 + log₂(N)

    • Single Node (N=1): You can simulate systems up to 30 qubits. (log₂(1) = 0)
    • Maximum Tested (N=1024): You can simulate systems up to 40 qubits. (log₂(1024) = 10)
    • Note: This formula applies for N up to 1024, based on current testing. Ensure your job requests sufficient memory per node for the simulation.
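The scaling rule n = 30 + log₂(N) can be turned into a small helper for sizing a job. This is a sketch using only the numbers stated in this guide (30 qubits on one node, tested up to 1024 nodes). The function names are illustrative, and rounding down for node counts that are not powers of two is an assumption, since the guide only gives power-of-two examples.

```python
SINGLE_NODE_QUBITS = 30   # qubits on one ARM node, as stated in this guide
MAX_TESTED_NODES = 1024   # largest node count tested so far


def max_qubits(nodes):
    """Largest qubit count for a node count: n = 30 + floor(log2(N))."""
    if not 1 <= nodes <= MAX_TESTED_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_TESTED_NODES}")
    # bit_length() - 1 equals floor(log2(nodes)) for positive integers.
    # Assumption: non-power-of-two node counts round down.
    return SINGLE_NODE_QUBITS + nodes.bit_length() - 1


def nodes_required(qubits):
    """Smallest power-of-two node count able to hold `qubits` qubits."""
    if qubits < 1:
        raise ValueError("qubits must be positive")
    nodes = 1 << max(0, qubits - SINGLE_NODE_QUBITS)
    if nodes > MAX_TESTED_NODES:
        raise ValueError("beyond the tested range of this guide")
    return nodes


print(max_qubits(512))     # 39
print(nodes_required(40))  # 1024
```

The value returned by `nodes_required` is what you would put in the `--nodes` line of a job script.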

Performance Benchmark on Deucalion

A performance test was conducted to gauge Qulacs's efficiency on a large-scale simulation using Deucalion's infrastructure.

  • System Size: 40 qubits
  • Nodes Used: 1024 (ARM partition)
  • Circuit: Contained 10 complex gates, where each gate acted on 20 qubits simultaneously. Such large multi-qubit gates are computationally demanding.
  • Execution Time: Approximately 300 seconds (5 minutes).

Performance Context: Directly comparing this benchmark with external results is challenging without standardized benchmark circuits and identical hardware/network configurations. However, simulating a 40-qubit system is computationally intensive (requiring 2⁴⁰ complex numbers for the state vector, distributed across nodes). The fact that a circuit with demanding 20-qubit gates completed in 5 minutes on 1024 nodes demonstrates:

  1. The effectiveness of Qulacs's MPI implementation.
  2. The capability of Deucalion's ARM partition and high-speed interconnect for distributed quantum simulations.

This result serves as a valuable data point for users estimating runtime for their own large-scale simulations on Deucalion.
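To give a sense of what 2⁴⁰ complex numbers means in memory, the sketch below estimates the size of the state vector alone. It assumes each amplitude is a double-precision complex number (16 bytes); communication buffers and other overhead are not counted, so real usage will be higher. The function names are illustrative.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (two 8-byte doubles)


def state_vector_bytes(n_qubits):
    """Bytes needed to store 2**n_qubits complex amplitudes."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE


def per_node_gib(n_qubits, nodes):
    """State-vector share per node in GiB, assuming an even split."""
    return state_vector_bytes(n_qubits) / nodes / 2 ** 30


print(state_vector_bytes(40) / 2 ** 40)  # 16.0 (TiB in total)
print(per_node_gib(40, 1024))            # 16.0 (GiB per node)
print(per_node_gib(30, 1))               # 16.0 (GiB on a single node)
```

Under this assumption the per-node share is the same in the single-node 30-qubit case and the 1024-node 40-qubit case, since each added qubit doubles both the state vector and the node count.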

Why Use Qulacs on Deucalion?

  • Large-Scale Simulation: Simulate systems up to 40 qubits, enabling research on algorithms and phenomena not feasible on smaller systems.
  • Ease of Access: Simple module loading (ml qulacs) integrates Qulacs seamlessly into the Deucalion environment.
  • High Performance: Leverage both Qulacs's optimizations and Deucalion's parallel computing power.
  • Familiar Python Interface: Utilize the popular Python interface for rapid development and integration.

Getting Started - Example Workflow

  1. Request Resources: Submit a SLURM job script requesting the desired number of nodes (N) on the ARM partition. Remember to allocate sufficient memory per node.
  2. Load Module: In your job script, include ml qulacs.
  3. Run Python Script: Execute your Python script, which imports and uses Qulacs. For MPI jobs, you'll typically submit the job script with sbatch and launch the Python script inside it with srun, so that Qulacs's MPI backend is correctly initialized. (Consult the Qulacs MPI documentation for specifics on script structure.)

#!/bin/bash
# Example job script for 39 qubits on the large-arm partition.
# To get access to 1024 nodes, please send an e-mail to support justifying your request.
#SBATCH --job-name  qulacs-python
#SBATCH --account   <your_account>
#SBATCH --partition large-arm
#SBATCH --time 00:59:00

#SBATCH --nodes           512
#SBATCH --ntasks          512
#SBATCH --cpus-per-task   48
#SBATCH --mem=0
#SBATCH --exclusive

#SBATCH -o job.out
#SBATCH -e job.err

source /share/env/module_select.sh #configure the correct path for lmod
ml qulacs
export OMP_NUM_THREADS=48
export GOMP_CPU_AFFINITY="0-47"
srun python bench_circuit.py -n 39 -o 1 -d 10

The folder with bench_circuit.py is here. Please download the whole folder to test this benchmark.
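bench_circuit.py itself is not reproduced here. As a rough illustration of what a distributed Qulacs script can look like, the following hypothetical sketch builds a layered circuit and applies it to a state vector created with `use_multi_cpu=True`, the constructor flag that MPI-enabled Qulacs builds use to distribute the state across ranks. The function name, circuit layout, and default sizes are assumptions for illustration and do not describe the actual benchmark.

```python
def run_layered_circuit(n_qubits=10, depth=2, distributed=True):
    """Apply `depth` layers of H and CNOT gates; return the squared norm."""
    # Lazy import: qulacs is only needed when the function is called.
    from qulacs import QuantumCircuit, QuantumState

    # With an MPI-enabled build launched under srun, use_multi_cpu=True
    # asks Qulacs to split the state vector across the MPI ranks.
    state = QuantumState(n_qubits, use_multi_cpu=distributed)
    state.set_zero_state()

    circuit = QuantumCircuit(n_qubits)
    for _ in range(depth):
        for q in range(n_qubits):
            circuit.add_H_gate(q)            # one-qubit layer
        for q in range(n_qubits - 1):
            circuit.add_CNOT_gate(q, q + 1)  # nearest-neighbour entangling layer
    circuit.update_quantum_state(state)

    # Unitary gates preserve the norm, so this should stay close to 1.
    return state.get_squared_norm()
```

To use it in a job, add a call such as `print(run_layered_circuit(39, 10))` at the bottom of the file and launch the file with `srun python` as in the job script above. Every rank executes the same script.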

Further Information