Playing with Qiskit (8) — Qiskit Aer GPU

Purpose

Following Playing with Qiskit (7) — Qiskit Aer GPU, I want to evaluate a GPU-enabled build of Qiskit Aer in an Ubuntu environment, specifically one built with cuQuantum support.

As the benchmark, I will use a Grover's algorithm circuit, which I often use for this kind of test, based on Playing with cuQuantum (2) — Grover's Search Algorithm.

Experiment Details

  • The experimental environment is a machine with one NVIDIA T4.
  • I run Grover's algorithm at three sizes (18, 23, and 25 qubits) and compare the wall times.
  • Because CPU simulation becomes noticeably slow as the number of qubits grows, I run the CPU only for the 18-qubit case.
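As a rough sanity check that all three sizes fit on a single T4, the memory of a dense statevector can be estimated from the qubit count alone. The sketch below assumes double-precision complex amplitudes (16 bytes each) and ignores any workspace the simulator allocates on top; the function name is made up for illustration.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense statevector holding 2**n_qubits amplitudes.

    bytes_per_amplitude=16 is an assumption (complex128, double precision).
    """
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (18, 23, 25):
    mib = statevector_bytes(n) / 2 ** 20
    print(f"{n} qubits: {mib:.0f} MiB")
```

Under these assumptions even the 25-qubit state needs about 512 MiB, far below the 16 GB of a T4, so the long run times seen later are more likely tied to the number of gates than to memory capacity.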

18-qubit Circuit

import numpy as np
from qiskit import QuantumCircuit, QuantumRegister
from qiskit.circuit.library import MCXGate
%matplotlib inline

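def reverse_phase(qc: QuantumCircuit, state: str):
    # Flip the phase of the basis state given as a bit string.
    # The qubit count is taken from the circuit instead of a global variable.
    n_qubits = qc.num_qubits
    qubits = []
    # Qiskit is little-endian, so the bit string is read from the right.
    for i, digit in enumerate(state[::-1]):
        if digit == '0':
            qubits.append(i)
    if qubits:
        qc.x(qubits)
    # MCZ start (HXH = Z)
    qc.h(n_qubits - 1)
    qc.append(MCXGate(n_qubits - 1), list(range(n_qubits)))
    qc.h(n_qubits - 1)
    # MCZ end
    if qubits:
        qc.x(qubits)

def define_oracle(solutions):
    # Create an oracle that flips the phase of every solution bit string
    n_qubits = len(solutions[0])
    qreg = QuantumRegister(n_qubits, 'qr')
    oracle = QuantumCircuit(qreg)

    for sol in solutions:
        reverse_phase(oracle, sol)

    return oracle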

def define_diffuser(n_qubits):
    qreg = QuantumRegister(n_qubits, 'qr')
    diffuser = QuantumCircuit(qreg)
    diffuser.h(qreg[:])
    diffuser.x(qreg[:])
    # MCZ start (HXH = Z)
    diffuser.h(qreg[n_qubits - 1])
    diffuser.append(MCXGate(n_qubits - 1), list(range(n_qubits)))
    diffuser.h(qreg[n_qubits - 1])
    # MCZ end
    diffuser.x(qreg[:])
    diffuser.h(qreg[:])

    return diffuser

solutions = ['101100111000111011', '110001110011000111']
assert len(solutions[0]) == len(solutions[1])
n_qubits = len(solutions[0])
print(f'{n_qubits=}')

oracle = define_oracle(solutions)
#oracle.draw('mpl')

n_qubits=18

diffuser = define_diffuser(n_qubits)
#diffuser.draw('mpl')
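To see what the oracle and diffuser do to the amplitudes, here is a toy version on a plain Python list, without Qiskit. It is only a sketch: the oracle is modeled as a sign flip on the marked states and the diffuser as a reflection about the mean, which matches the circuit above up to a global phase. The 3-qubit example and the function name are illustrative.

```python
import math


def grover_probabilities(n_qubits: int, solutions: list[str]) -> list[float]:
    """Simulate Grover's search on a list of real amplitudes."""
    size = 2 ** n_qubits
    marked = {int(s, 2) for s in solutions}
    amps = [1 / math.sqrt(size)] * size  # uniform superposition |s>

    theta = math.asin(math.sqrt(len(marked) / size))
    iterations = int((math.pi / 2 - theta) / (2 * theta) + 0.5)

    for _ in range(iterations):
        # Oracle: flip the sign of every marked basis state.
        amps = [-a if i in marked else a for i, a in enumerate(amps)]
        # Diffuser: reflect every amplitude about the mean.
        mean = sum(amps) / size
        amps = [2 * mean - a for a in amps]

    return [a * a for a in amps]


probs = grover_probabilities(3, ["101", "111"])
print({format(i, "03b"): round(p, 3) for i, p in enumerate(probs) if p > 1e-9})
```

With two solutions among eight states, a single iteration is enough, and the two marked states end up with probability 0.5 each. This is the same near 50/50 split that shows up in the measurement counts of the larger circuits.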

N = 2**n_qubits
angle = np.arcsin(np.sqrt(len(solutions) / N))
# Number of Grover iterations (named so that it does not clash with the measurement counts later on)
n_iterations = int((np.pi/2 - angle) / (2*angle) + 0.5)
#print(f'{angle=}, {np.pi/2=}, {n_iterations=}')

qreg = QuantumRegister(n_qubits, 'qr')
grover = QuantumCircuit(qreg)
# initialize |s>
grover.h(qreg[:])
for _ in range(n_iterations):
    grover.compose(oracle, inplace=True)
    grover.compose(diffuser, inplace=True)
#grover.draw('mpl')

print(len(grover))

31542
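The circuit gets this long because the number of Grover iterations grows like the square root of the search space. The sketch below reuses the iteration formula from the code above with only the standard library, and adds the textbook success probability sin²((2k+1)θ) after k iterations; the function names are made up for illustration.

```python
import math


def grover_iterations(n_qubits: int, n_solutions: int) -> int:
    """Iteration count, using the same rounding as the article's formula."""
    theta = math.asin(math.sqrt(n_solutions / 2 ** n_qubits))
    return int((math.pi / 2 - theta) / (2 * theta) + 0.5)


def success_probability(n_qubits: int, n_solutions: int, iterations: int) -> float:
    """Probability of measuring any solution after the given iterations."""
    theta = math.asin(math.sqrt(n_solutions / 2 ** n_qubits))
    return math.sin((2 * iterations + 1) * theta) ** 2


for n in (18, 23, 25):
    k = grover_iterations(n, 2)
    print(f"{n} qubits: {k} iterations, P(success) = {success_probability(n, 2, k):.6f}")
```

For two solutions this gives 284, 1608, and 3216 iterations at 18, 23, and 25 qubits, so the 25-qubit circuit repeats the oracle and diffuser more than eleven times as often as the 18-qubit one.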

from qiskit import transpile
from qiskit.visualization import plot_histogram
from qiskit_aer import AerSimulator

qc = grover.copy()
qc.measure_all()
sim_cpu = AerSimulator(method='statevector', device='CPU')

%%time
result_cpu = sim_cpu.run(qc).result()

CPU times: user 59.7 s, sys: 318 ms, total: 1min
Wall time: 30.8 s

counts = result_cpu.get_counts()
print(counts)

{'101100111000111011': 506, '110001110011000111': 518}

sim_gpu = AerSimulator(method='statevector', device='GPU', cuStateVec_enable=False)

%%time
result_gpu = sim_gpu.run(qc).result()

CPU times: user 1.15 s, sys: 390 ms, total: 1.54 s
Wall time: 1.28 s

counts = result_gpu.get_counts()
print(counts)

{'110001110011000111': 513, '101100111000111011': 511}

sim_cuq = AerSimulator(method='statevector', device='GPU', cuStateVec_enable=True)

%%time
result_cuq = sim_cuq.run(qc).result()

CPU times: user 1.58 s, sys: 52 ms, total: 1.64 s
Wall time: 1.63 s

counts = result_cuq.get_counts()
print(counts)

{'110001110011000111': 510, '101100111000111011': 514}

Ranking the wall times from slowest to fastest gives CPU >> cuQuantum > GPU. In the circuit from Playing with Qiskit (7) — Qiskit Aer GPU, cuQuantum was the fastest, so perhaps the outcome depends on how well the circuit suits each backend?

| CPU | GPU | cuQuantum |
| --- | --- | --- |
| 30.8 s | 1.28 s | 1.63 s |

From now on, I won't test with the CPU as it is simply a waste of time.

23-qubit Circuit

solutions = ['10011011000000000011111', '11101110000000000011111']
n_qubits = len(solutions[0])
print(f'{n_qubits=}')

n_qubits=23

print(len(grover))

242831

%%time
result_gpu = sim_gpu.run(qc).result()

CPU times: user 49.2 s, sys: 57.7 s, total: 1min 46s
Wall time: 1min 44s

counts = result_gpu.get_counts()
print(counts)

{'10011011000000000011111': 501, '11101110000000000011111': 523}

%%time
result_cuq = sim_cuq.run(qc).result()

CPU times: user 2min 52s, sys: 5.07 s, total: 2min 57s
Wall time: 2min 57s

counts = result_cuq.get_counts()
print(counts)

{'11101110000000000011111': 523, '10011011000000000011111': 501}

Once again cuQuantum took longer than the plain GPU backend (cuQuantum > GPU in wall time). The user CPU time is much longer for the cuQuantum run (2min 52s versus 49.2 s), so could that be the cause?

| GPU | cuQuantum |
| --- | --- |
| 1min 44s | 2min 57s |

25-qubit Circuit

solutions = ['1001101100000000000011111', '1110111000000000000011111']
n_qubits = len(solutions[0])
print(f'{n_qubits=}')

n_qubits=25

print(len(grover))

537097 [1]

%%time
result_gpu = sim_gpu.run(qc).result()

CPU times: user 6min 40s, sys: 7min 51s, total: 14min 32s
Wall time: 14min 31s

counts = result_gpu.get_counts()
print(counts)

{'1110111000000000000011111': 489, '1001101100000000000011111': 535}

%%time
result_cuq = sim_cuq.run(qc).result()

CPU times: user 24min 27s, sys: 26.1 s, total: 24min 53s
Wall time: 24min 49s

counts = result_cuq.get_counts()
print(counts)

{'1001101100000000000011111': 507, '1110111000000000000011111': 517}

Here too, cuQuantum took longer than the plain GPU backend (cuQuantum > GPU).

| GPU | cuQuantum |
| --- | --- |
| 14min 31s | 24min 49s |

Summary

Summarizing all the tables, we get the following:

| n_qubits | CPU | GPU | cuQuantum |
| --- | --- | --- | --- |
| 18 | 30.8 s | 1.28 s | 1.63 s |
| 23 | N/A | 1min 44s | 2min 57s |
| 25 | N/A | 14min 31s | 24min 49s |
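To put a number on the gap, the wall times in the table above can be converted to seconds and divided. This is a small sketch using only the figures already reported; the parsing helper is made up for illustration and only handles the two formats that appear in the table.

```python
import re


def to_seconds(text: str) -> float:
    """Parse wall times such as '30.8 s' or '1min 44s' into seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+(?:\.\d+)?)\s*s", text)
    total = 60.0 * int(minutes.group(1)) if minutes else 0.0
    if seconds:
        total += float(seconds.group(1))
    return total


# Wall times copied from the summary table.
wall_times = {
    18: {"GPU": "1.28 s", "cuQuantum": "1.63 s"},
    23: {"GPU": "1min 44s", "cuQuantum": "2min 57s"},
    25: {"GPU": "14min 31s", "cuQuantum": "24min 49s"},
}

ratios = {
    n: to_seconds(t["cuQuantum"]) / to_seconds(t["GPU"])
    for n, t in wall_times.items()
}
for n, ratio in ratios.items():
    print(f"{n} qubits: cuQuantum / GPU = {ratio:.2f}")
```

The ratio comes out at roughly 1.3 at 18 qubits and about 1.7 at both 23 and 25 qubits, so in these runs the gap widened from 18 to 23 qubits and then stayed roughly constant.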

I'm not sure about the exact reasons, but GPU and cuQuantum are clearly far faster than the CPU (roughly 24 and 19 times faster at 18 qubits). However, in Playing with Qiskit (7) — Qiskit Aer GPU the plain GPU backend was slower than cuQuantum, whereas this time the order is reversed and cuQuantum is slower than the GPU backend.

This might be due to compatibility with the circuit, or perhaps the interface between Qiskit Aer and cuQuantum doesn't align well, or maybe the tuning is insufficient—I don't know the details[2]. It's also possible that there were issues with the options provided during the build. As more users join over time and more use cases emerge, this situation might change.

For the time being, I think it's best to experiment with smaller circuits to determine the combination that offers the best performance before moving on to larger-scale circuits.
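One way to make that small-circuit comparison repeatable is a tiny timing harness that takes any callables and ranks them by best wall time. The sketch below uses only the standard library, with sleep calls standing in for real simulator runs; the helper names are hypothetical.

```python
import time
from typing import Callable


def best_wall_time(run: Callable[[], object], repeats: int = 3) -> float:
    """Return the shortest wall time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def rank_backends(
    runners: dict[str, Callable[[], object]], repeats: int = 3
) -> list[tuple[str, float]]:
    """Time every runner and sort the results from fastest to slowest."""
    timings = {name: best_wall_time(run, repeats) for name, run in runners.items()}
    return sorted(timings.items(), key=lambda item: item[1])


# Stand-in workloads so that the sketch runs without a GPU.
demo = {
    "slow": lambda: time.sleep(0.02),
    "fast": lambda: time.sleep(0.001),
}
ranking = rank_backends(demo)
print(ranking)
```

With the simulators defined in this article, the runners could be something like {'GPU': lambda: sim_gpu.run(qc).result(), 'cuQuantum': lambda: sim_cuq.run(qc).result()}. Taking the best of several runs is a design choice that reduces the influence of one-off warm-up costs; a mean or median would be equally reasonable.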

One thing that puzzles me is that in Playing with Google Cirq + cuQuantum (2) — Grover's Search Algorithm, the depth for an equivalent circuit in the 25-qubit case was 35,377, so I'm not sure why the figure is about 530,000 this time. Part of the gap may be that len(grover) in Qiskit counts instructions rather than circuit depth, so the two numbers may not be measuring the same thing. Since I used different SDKs, I might also have made a mistake in the circuit implementation. There seem to be significant differences around multi-controlled gates, so something might be there. In any case, since the situation is clearly different, a simple comparison with the results from Cirq + qsim + cuQuantum is likely not possible.
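The values printed by len(grover) in this article can be reproduced by counting gates by hand. That suggests they are instruction counts (one per gate, with each multi-controlled X counted once) rather than depths. The sketch below is a counting model of the circuit construction used here, not a Qiskit API; the function name is made up.

```python
import math


def expected_instruction_count(solutions: list[str]) -> int:
    """Count the gates appended by the Grover construction in this article."""
    n = len(solutions[0])
    theta = math.asin(math.sqrt(len(solutions) / 2 ** n))
    iterations = int((math.pi / 2 - theta) / (2 * theta) + 0.5)
    # Oracle, per solution: X on every '0' bit before and after, plus H, MCX, H.
    oracle = sum(2 * s.count("0") + 3 for s in solutions)
    # Diffuser: H and X on all qubits on both sides, plus H, MCX, H.
    diffuser = 4 * n + 3
    # Initial H on every qubit, then the repeated oracle and diffuser.
    return n + iterations * (oracle + diffuser)


cases = {
    18: ["101100111000111011", "110001110011000111"],
    23: ["10011011000000000011111", "11101110000000000011111"],
    25: ["1001101100000000000011111", "1110111000000000000011111"],
}
for n, sols in cases.items():
    print(n, expected_instruction_count(sols))
```

The model gives 31542, 242831, and 537097, which matches the printed values. For a like-for-like comparison with a depth figure from another SDK, the circuit's depth() would probably be the number to look at, ideally after decomposing the multi-controlled gates into the same basis on both sides.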

It might be worthwhile to build the target circuits using multiple SDKs and compare their characteristics.

People often say "GPU is fast" or "cuQuantum is fast," but you have to find out for yourself in which execution environment, and from how many qubits, the GPU advantage actually appears, as well as which option is faster and by how much.

The results gave me a lot to think about, and I felt that setting a problem I want to solve and executing it myself allows me to grasp and see things more clearly.

Footnotes
  1. "My power level is 530,000" ↩︎

  2. Or rather, I could look at the implementation, but I probably wouldn't understand it, and I haven't investigated it. ↩︎
