
Playing with CUDA-Q (5) — Building CUDA-Q Kernels Dynamically


Purpose

The goal is to take the CUDA-Q kernel that was implemented statically in Playing with CUDA-Q (2) — Running Grover's algorithm on GPU, rebuild it with cudaq.make_kernel, and try writing it in a more pythonic way.

Implementation

First, install the necessary packages

We will use the following.

!pip install -qU cudaq "cupy-cuda12x==13.6.0"

Ensuring ASCII art circuit diagrams are displayed correctly

Define the utility.

from IPython.display import HTML, display

def show_fixed(text, font="Consolas, Roboto Mono, monospace", size=13):
    text = text.expandtabs(4)   # Convert tabs to spaces
    esc = (text.replace("&", "&amp;")
               .replace("<", "&lt;")
               .replace(">", "&gt;"))
    html = f'<pre style="font-family:{font}; font-size:{size}px; white-space:pre; font-variant-ligatures:none;">{esc}</pre>'
    display(HTML(html))

Importing the necessary packages

import numpy as np
import cudaq

Defining CUDA-Q kernels dynamically

Referring to 9. Just-in-Time Kernel Creation, I implemented it as follows. One thing to note is that what was z.ctrl in the static kernel implementation becomes kernel.cz here, and its arguments are slightly different. Aside from that, the basic structure is very similar.

def embed_solution(kernel: cudaq.PyKernel, qubits: cudaq.qview, solution: list[int]):
    n_qubits = qubits.size()
    for i, v in enumerate(solution):
        if v == 0:
            kernel.x(qubits[i])
    # MCZ start
    ctrls = [qubits[i] for i in range(n_qubits - 1)]
    last = qubits[n_qubits - 1]
    kernel.cz(ctrls, last)
    # MCZ end
    for i, v in enumerate(solution):
        if v == 0:
            kernel.x(qubits[i])


def oracle(kernel: cudaq.PyKernel, qubits: cudaq.qview, solutions: list[int]):
    n_qubits = qubits.size()
    for i in range(len(solutions) // n_qubits):
        embed_solution(kernel, qubits, solutions[n_qubits*i:n_qubits*(i+1)])
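As a quick sanity check independent of CUDA-Q (a NumPy sketch added for illustration, not part of the original run), the X–MCZ–X pattern in embed_solution should flip the phase of exactly the solution bitstring and nothing else:

```python
import numpy as np

def phase_oracle_diag(solution: list[int]) -> np.ndarray:
    """Diagonal of the X-MCZ-X circuit: -1 only on the solution bitstring."""
    n = len(solution)
    diag = np.ones(2**n)
    for idx in range(2**n):
        bits = [(idx >> (n - 1 - q)) & 1 for q in range(n)]  # qubit 0 = leftmost bit
        # X gates flip every qubit whose solution bit is 0, so the
        # solution state (and only it) lands on |11...1>
        flipped = [b ^ (1 - s) for b, s in zip(bits, solution)]
        if all(flipped):  # MCZ flips the phase only on |11...1>
            diag[idx] = -1.0
    return diag

solution = [0, 1, 1, 0, 1]
diag = phase_oracle_diag(solution)
marked = int("".join(map(str, solution)), 2)  # 0b01101 = 13
print(int(diag[marked]), int((diag == -1).sum()))  # → -1 1
```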


def diffuser(kernel: cudaq.PyKernel, qubits: cudaq.qview):
    n_qubits = qubits.size()
    kernel.h(qubits)
    kernel.x(qubits)
    # MCZ start
    ctrls = [qubits[i] for i in range(n_qubits - 1)]
    last = qubits[n_qubits - 1]
    kernel.cz(ctrls, last)
    # MCZ end
    kernel.x(qubits)
    kernel.h(qubits)
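Likewise, the diffuser's H–X–MCZ–X–H sandwich should equal the reflection about the uniform superposition, up to a global phase. Another small NumPy check (again just an illustrative sketch):

```python
import numpy as np

n = 3
H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X1 = np.array([[0, 1], [1, 0]])

def kron_n(M, n):
    out = np.eye(1)
    for _ in range(n):
        out = np.kron(out, M)
    return out

Hn, Xn = kron_n(H1, n), kron_n(X1, n)
MCZ = np.diag([1.0] * (2**n - 1) + [-1.0])  # phase flip only on |11...1>

D = Hn @ Xn @ MCZ @ Xn @ Hn                 # the diffuser circuit

s = np.full(2**n, 1 / np.sqrt(2**n))        # uniform superposition |s>
reflection = 2 * np.outer(s, s) - np.eye(2**n)
print(np.allclose(D, -reflection))          # → True (equal up to global phase -1)
```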


def make_grover(num_qubits: int, counts: int, solutions: list[int]):
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(num_qubits)

    kernel.h(qubits)
    for _ in range(counts):
        oracle(kernel, qubits, solutions)
        diffuser(kernel, qubits)

    kernel.mz(qubits)

    return kernel


try:
    solutions = [0, 1, 1, 0, 1]
    grover = make_grover(5, 1, solutions)
except Exception as e:
    print(e)
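Before running the CUDA-Q version at scale, the same Grover iteration can be cross-checked with a dense statevector in plain NumPy (small n only; a sketch, not CUDA-Q):

```python
import numpy as np

n = 5
solution = [0, 1, 1, 0, 1]
marked = int("".join(map(str, solution)), 2)  # basis index 13

# Optimal iteration count for 1 solution among 2**n states
angle = np.arcsin(np.sqrt(1 / 2**n))
counts = int((np.pi / 2 - angle) / (2 * angle) + 0.5)

state = np.full(2**n, 1 / np.sqrt(2**n))      # H on every qubit
for _ in range(counts):
    state[marked] *= -1.0                     # oracle: phase-flip the solution
    state = 2 * state.mean() - state          # diffuser: inversion about the mean

probs = state**2
print(counts, int(np.argmax(probs)))          # → 4 13
```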

Experimenting with 19 qubits

solutions = [
    [1,0,0,1,1,0,1,1,0,0,1,0,0,1,1,0,1,0,0],
    [1,1,1,0,1,1,1,0,0,1,0,1,0,0,1,1,0,1,1]
]
num_qubits = len(solutions[0])

N = 2**num_qubits
angle = np.arcsin(np.sqrt(len(solutions) / N))
counts = int((np.pi/2 - angle) / (2*angle) + 0.5)

print(f'{num_qubits=} {counts=}')

num_qubits=19 counts=402
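For reference, this iteration count follows the standard Grover geometry: with M solutions among N = 2**n states, each iteration rotates the state by 2θ toward the solution subspace, where θ = arcsin(√(M/N)), so roughly (π/2 − θ)/(2θ) iterations are needed. Checking the n = 19, M = 2 numbers:

```python
import numpy as np

N, M = 2**19, 2
theta = np.arcsin(np.sqrt(M / N))  # here exactly arcsin(1/512)
counts = int((np.pi / 2 - theta) / (2 * theta) + 0.5)

# After `counts` iterations the amplitude angle is (2*counts + 1) * theta,
# so the probability of measuring one of the two marked states is:
success = np.sin((2 * counts + 1) * theta) ** 2
print(counts, success > 0.999)  # → 402 True
```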

Running sampling and checking results

%%time

try:
    solutions_ = np.array(solutions).flatten().tolist()
    grover = make_grover(num_qubits, counts, solutions_)
    if False:
        show_fixed(str(cudaq.draw(grover)))
    result = cudaq.sample(grover)
    result: dict[str, int] = {k: v for k, v in result.items()}
    print(list(sorted(result.items(), key=lambda k_v: -k_v[1]))[:10])
except Exception as e:
    print(e)

[('1110111001010011011', 505), ('1001101100100110100', 495)]
CPU times: user 1min 40s, sys: 213 ms, total: 1min 40s
Wall time: 1min 43s

It seems slower than the statically implemented CUDA-Q kernel, likely due to overhead somewhere in the dynamic construction.

Relationship Between Number of Qubits and Elapsed Time

I decided to observe the elapsed time by varying the number of qubits. I've included CPU results for reference, but the focus is on comparing static kernel construction versus dynamic kernel construction using the GPU (cuStateVec).

It's hard to see, so let's look at it as a logarithmic graph.

  • Recalling Playing with Qiskit (8) — Qiskit Aer GPU, if you're running on a CPU, it might be better to use Qiskit instead of CUDA-Q.
  • Dynamic kernel construction might also be faster using cuStateVec via Qiskit Aer, but it's hard to say for sure as I might not be fully utilizing cudaq.make_kernel yet.
  • Static kernel construction stands clearly apart from the others, trending far faster. Mastering the syntax of static kernel construction may well be the most efficient approach.

Summary

I explored the dynamic construction of CUDA-Q kernels. I initially expected this to be the most powerful approach, since it can be written pythonically, but perhaps due to details of my implementation it could not fully leverage the GPU.

I will continue to investigate, and if the gap with static kernel construction can be bridged, I'd like to take another shot at it.
