Translated by AI
Playing with Qiskit (8) — Qiskit Aer GPU
Purpose
Following Playing with Qiskit (7) — Qiskit Aer GPU, I want to evaluate a GPU-enabled build of Qiskit Aer in an Ubuntu environment, specifically one built with cuQuantum support.
As the benchmark, I will use a Grover's algorithm circuit, which I often reach for in cases like this, based on the one in Playing with cuQuantum (2) — Grover's Search Algorithm.
Experiment Details
- The experimental environment is a machine with one NVIDIA T4.
- I will run Grover's algorithm at three sizes (18, 23, and 25 qubits) and compare the wall times.
- Because the CPU run becomes noticeably slow beyond a certain number of qubits, I will only run it for the 18-qubit case.
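For a sense of scale, the sketch below estimates the statevector size at each qubit count. It assumes double-precision complex amplitudes (16 bytes each), which is an assumption about the simulator's precision setting rather than something measured here, and `statevector_bytes` is a name made up for this illustration.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory needed to hold 2**n_qubits amplitudes (complex128 assumed)."""
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (18, 23, 25):
    mib = statevector_bytes(n) / 2**20
    print(f"{n} qubits: {mib:.0f} MiB")
```

Under that assumption the three states take 4 MiB, 128 MiB, and 512 MiB, so even the 25-qubit case should fit comfortably in the T4's nominal 16 GB of memory.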
18-qubit Circuit
import numpy as np
from qiskit import QuantumCircuit, QuantumRegister
from qiskit.circuit.library import MCXGate
%matplotlib inline

def reverse_phase(qc: QuantumCircuit, state: str):
    # Flip the phase of the basis state `state` (relies on the global n_qubits).
    # Qubit 0 is the rightmost character, so walk the string in reverse.
    qubits = []
    for i, digit in enumerate(state[::-1]):
        if digit == '0':
            qubits.append(i)
    if qubits:
        qc.x(qubits)
    # MCZ start
    qc.h(n_qubits - 1)
    qc.append(MCXGate(n_qubits - 1), list(range(n_qubits)))
    qc.h(n_qubits - 1)
    # MCZ end
    if qubits:
        qc.x(qubits)

def define_oracle(solutions):
    # Create the oracle that marks every bitstring in `solutions`
    qreg = QuantumRegister(n_qubits, 'qr')
    oracle = QuantumCircuit(qreg)
    for sol in solutions:
        reverse_phase(oracle, sol)
    return oracle

def define_diffuser(n_qubits):
    qreg = QuantumRegister(n_qubits, 'qr')
    diffuser = QuantumCircuit(qreg)
    diffuser.h(qreg[:])
    diffuser.x(qreg[:])
    # MCZ start (HXH = Z)
    diffuser.h(qreg[n_qubits - 1])
    diffuser.append(MCXGate(n_qubits - 1), list(range(n_qubits)))
    diffuser.h(qreg[n_qubits - 1])
    # MCZ end
    diffuser.x(qreg[:])
    diffuser.h(qreg[:])
    return diffuser

solutions = ['101100111000111011', '110001110011000111']
assert len(solutions[0]) == len(solutions[1])
n_qubits = len(solutions[0])
print(f'{n_qubits=}')
oracle = define_oracle(solutions)
#oracle.draw('mpl')
n_qubits=18
diffuser = define_diffuser(n_qubits)
#diffuser.draw('mpl')
N = 2**n_qubits
angle = np.arcsin(np.sqrt(len(solutions) / N))
counts = int((np.pi/2 - angle) / (2*angle) + 0.5)
#print(f'{angle=}, {np.pi/2=}, {counts=}')
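The iteration count computed above follows from the usual geometric picture of Grover's algorithm; the derivation below is the standard textbook argument rather than anything specific to this code. With $M$ solutions among $N = 2^n$ basis states, the initial uniform state makes an angle $\theta$ with the non-solution subspace, and each oracle-plus-diffuser step rotates it by $2\theta$, so the success probability after $k$ steps is $P_k$:

$$
\sin\theta = \sqrt{\frac{M}{N}}, \qquad
P_k = \sin^2\bigl((2k+1)\theta\bigr), \qquad
(2k+1)\theta \approx \frac{\pi}{2}
\;\Longrightarrow\;
k \approx \operatorname{round}\!\left(\frac{\pi/2 - \theta}{2\theta}\right)
$$

For $n = 18$ and $M = 2$ this gives $\theta \approx 2^{-8.5} \approx 0.00276$ and $k = 284$, which is what the `int(... + 0.5)` expression evaluates to.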
qreg = QuantumRegister(n_qubits, 'qr')
grover = QuantumCircuit(qreg)
# initialize |s>
grover.h(qreg[:])
for _ in range(counts):
    grover.compose(oracle, inplace=True)
    grover.compose(diffuser, inplace=True)
#grover.draw('mpl')
print(len(grover))
31542
from qiskit import transpile
from qiskit.visualization import plot_histogram
from qiskit_aer import AerSimulator
qc = grover.copy()
qc.measure_all()
sim_cpu = AerSimulator(method='statevector', device='CPU')
%%time
result_cpu = sim_cpu.run(qc).result()
CPU times: user 59.7 s, sys: 318 ms, total: 1min
Wall time: 30.8 s
counts = result_cpu.get_counts()
print(counts)
{'101100111000111011': 506, '110001110011000111': 518}
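Before creating the GPU-backed simulators, it can be worth confirming that the installed Aer build actually exposes a GPU device. The sketch below uses `AerSimulator.available_devices()` for that; the `pick_device` helper is a hypothetical name for this illustration, and the import is guarded so the snippet still runs where `qiskit_aer` is not installed.

```python
def pick_device(available, preferred="GPU"):
    """Return `preferred` if the build offers it, otherwise fall back to CPU."""
    return preferred if preferred in available else "CPU"


try:
    from qiskit_aer import AerSimulator
    available = tuple(AerSimulator().available_devices())
except ImportError:
    # qiskit_aer is not installed here; assume a CPU-only environment.
    available = ("CPU",)

device = pick_device(available)
print(f"available devices: {available}, using: {device}")
```

If `GPU` is missing from the list, the most likely explanation is that the Aer build was compiled without GPU support.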
sim_gpu = AerSimulator(method='statevector', device='GPU', cuStateVec_enable=False)
%%time
result_gpu = sim_gpu.run(qc).result()
CPU times: user 1.15 s, sys: 390 ms, total: 1.54 s
Wall time: 1.28 s
counts = result_gpu.get_counts()
print(counts)
{'110001110011000111': 513, '101100111000111011': 511}
sim_cuq = AerSimulator(method='statevector', device='GPU', cuStateVec_enable=True)
%%time
result_cuq = sim_cuq.run(qc).result()
CPU times: user 1.58 s, sys: 52 ms, total: 1.64 s
Wall time: 1.63 s
counts = result_cuq.get_counts()
print(counts)
{'110001110011000111': 510, '101100111000111011': 514}
Summarizing the wall times, the order is CPU >> cuQuantum > GPU; that is, the plain GPU backend was the fastest. In the circuit from Playing with Qiskit (7) — Qiskit Aer GPU, cuQuantum was the fastest, so perhaps the result depends on how well the circuit suits each backend?
| CPU | GPU | cuQuantum |
|---|---|---|
| 30.8 s | 1.28 s | 1.63 s |
From here on, I will skip the CPU, since running it would simply take too long.
23-qubit Circuit
solutions = ['10011011000000000011111', '11101110000000000011111']
n_qubits = len(solutions[0])
print(f'{n_qubits=}')
n_qubits=23
print(len(grover))
242831
%%time
result_gpu = sim_gpu.run(qc).result()
CPU times: user 49.2 s, sys: 57.7 s, total: 1min 46s
Wall time: 1min 44s
counts = result_gpu.get_counts()
print(counts)
{'10011011000000000011111': 501, '11101110000000000011111': 523}
%%time
result_cuq = sim_cuq.run(qc).result()
CPU times: user 2min 52s, sys: 5.07 s, total: 2min 57s
Wall time: 2min 57s
counts = result_cuq.get_counts()
print(counts)
{'11101110000000000011111': 523, '10011011000000000011111': 501}
Again, the wall time was cuQuantum > GPU, meaning cuQuantum was slower. The user CPU time is longer in the cuQuantum run, so could that be the cause?
| GPU | cuQuantum |
|---|---|
| 1min 44s | 2min 57s |
25-qubit Circuit
solutions = ['1001101100000000000011111', '1110111000000000000011111']
n_qubits = len(solutions[0])
print(f'{n_qubits=}')
n_qubits=25
print(len(grover))
537097 [1]
%%time
result_gpu = sim_gpu.run(qc).result()
CPU times: user 6min 40s, sys: 7min 51s, total: 14min 32s
Wall time: 14min 31s
counts = result_gpu.get_counts()
print(counts)
{'1110111000000000000011111': 489, '1001101100000000000011111': 535}
%%time
result_cuq = sim_cuq.run(qc).result()
CPU times: user 24min 27s, sys: 26.1 s, total: 24min 53s
Wall time: 24min 49s
counts = result_cuq.get_counts()
print(counts)
{'1001101100000000000011111': 507, '1110111000000000000011111': 517}
Once again, the wall time was cuQuantum > GPU.
| GPU | cuQuantum |
|---|---|
| 14min 31s | 24min 49s |
Summary
Combining all the wall-time tables gives the following:
| n_qubits | CPU | GPU | cuQuantum |
|---|---|---|---|
| 18 | 30.8 s | 1.28 s | 1.63 s |
| 23 | N/A | 1min 44s | 2min 57s |
| 25 | N/A | 14min 31s | 24min 49s |
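To put numbers on the gap, the sketch below converts the wall times in the table to seconds and computes the cuQuantum/GPU ratio, along with a rough GPU time per instruction using the `len(grover)` values printed earlier. The helper `to_seconds` is a hypothetical name, and the per-instruction figure is only indicative.

```python
def to_seconds(text: str) -> float:
    """Convert a %%time-style wall time such as '1min 44s' or '30.8 s' to seconds."""
    text = text.replace(" ", "")
    minutes = 0.0
    if "min" in text:
        mins, text = text.split("min")
        minutes = float(mins)
    seconds = float(text.rstrip("s")) if text else 0.0
    return minutes * 60 + seconds


# (n_qubits, len(grover), GPU wall time, cuQuantum wall time), copied from the article
rows = [
    (18, 31542, "1.28 s", "1.63 s"),
    (23, 242831, "1min 44s", "2min 57s"),
    (25, 537097, "14min 31s", "24min 49s"),
]

for n, n_instr, gpu, cuq in rows:
    t_gpu, t_cuq = to_seconds(gpu), to_seconds(cuq)
    ratio = t_cuq / t_gpu
    us_per_instr = t_gpu / n_instr * 1e6
    print(f"{n} qubits: cuQuantum/GPU = {ratio:.2f}x, "
          f"GPU ~ {us_per_instr:.0f} us per instruction")
```

On these numbers cuQuantum took roughly 1.3x as long as the plain GPU backend at 18 qubits and about 1.7x at 23 and 25 qubits. The per-instruction time is only a rough guide, since the measured circuit also includes the measurements and Aer may fuse gates internally.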
I'm not sure of the exact reasons, but while the GPU and cuQuantum backends were clearly far faster than the CPU (at least at 18 qubits), the ordering from Playing with Qiskit (7) — Qiskit Aer GPU, where the wall time was GPU > cuQuantum (GPU slower than cuQuantum), was reversed this time to cuQuantum > GPU (cuQuantum slower than GPU).
This might come down to how well the circuit suits cuQuantum, to a mismatch in the interface between Qiskit Aer and cuQuantum, or to insufficient tuning; I don't know the details[2]. The options I used during the build may also have been a factor. As more users join over time and more use cases emerge, this situation might change.
For the time being, I think it's best to experiment with smaller circuits to determine the combination that offers the best performance before moving on to larger-scale circuits.
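One way to run that kind of small-scale comparison is a tiny timing helper like the sketch below. The names `time_once` and `fastest` are hypothetical, and the stand-in workloads just call `time.sleep`; in this article's setting each callable would instead be something like `lambda: sim_gpu.run(qc).result()`.

```python
import time
from typing import Callable, Dict


def time_once(run: Callable[[], object]) -> float:
    """Return the wall time in seconds of a single call to `run`."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def fastest(candidates: Dict[str, Callable[[], object]]) -> str:
    """Time each candidate once and return the name of the quickest one."""
    timings = {name: time_once(run) for name, run in candidates.items()}
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds:.3f} s")
    return min(timings, key=timings.get)


# Stand-in workloads; replace these with real simulator runs.
candidates = {
    "slow backend": lambda: time.sleep(0.05),
    "fast backend": lambda: time.sleep(0.01),
}
best = fastest(candidates)
print("fastest:", best)
```

A single run per backend is noisy, so repeating each measurement a few times and taking the minimum or median would give a steadier comparison.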
One thing that puzzles me is that in Playing with Google Cirq + cuQuantum (2) — Grover's Search Algorithm, the depth of an equivalent 25-qubit circuit was 35,377, so I'm not sure why the figure is about 537,000 this time. Since I used different SDKs, I might have made a mistake in the circuit implementation. There also seem to be significant differences in how multi-controlled gates are handled, so something might be there. In any case, since the situation is clearly different, a simple comparison with the results from Cirq + qsim + cuQuantum is likely not possible.
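One detail that may account for part of the gap: `len(grover)` returns the number of instructions in a Qiskit circuit, not its depth (`grover.depth()` reports the latter), so the figure printed here and the Cirq depth are probably not measuring the same thing. The sketch below recomputes the instruction count by hand from how the oracle and diffuser are built; `expected_instructions` is a hypothetical helper, and it assumes every `x` or `h` applied to a list of qubits is stored as one instruction per qubit.

```python
import math


def grover_iterations(n_qubits: int, n_solutions: int) -> int:
    """Same rounding as the `counts` formula used in the article."""
    angle = math.asin(math.sqrt(n_solutions / 2 ** n_qubits))
    return int((math.pi / 2 - angle) / (2 * angle) + 0.5)


def expected_instructions(solutions: list) -> int:
    n = len(solutions[0])
    # Oracle: X on each '0' bit before and after, plus H, MCX, H, per solution.
    oracle = sum(2 * s.count("0") + 3 for s in solutions)
    # Diffuser: H, X, X, H on every qubit, plus H, MCX, H.
    diffuser = 4 * n + 3
    # Initial layer of n Hadamards, then the repeated oracle + diffuser.
    return n + grover_iterations(n, len(solutions)) * (oracle + diffuser)


solutions_25 = ['1001101100000000000011111', '1110111000000000000011111']
print(grover_iterations(25, 2))             # 3216
print(expected_instructions(solutions_25))  # 537097
```

The result matches the 537097 printed for the 25-qubit circuit, which works out to 167 instructions per iteration over 3216 iterations. The Cirq depth of 35,377 happens to equal 11 × 3216 + 1, which would be consistent with about 11 layers per iteration, though that is only an inference. Comparing `grover.depth()` against the Cirq figure would be a more like-for-like check.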
It might be worthwhile to build the target circuits using multiple SDKs and compare their characteristics.
While people often say "GPU is fast" or "cuQuantum is fast," you have to find out for yourself in which execution environment, and from how many qubits, the GPU advantage actually appears. And even when something is said to be fast, you still have to measure which option is faster, and by how much, in your own environment.
The results gave me a lot to think about. Setting a problem I want to solve and running it myself let me see things much more clearly.