Playing with CUDA-Q (5) — Building CUDA-Q Kernels Dynamically
Purpose
The goal is to take the CUDA-Q kernel that was implemented statically in Playing with CUDA-Q (2) — Running Grover's algorithm on GPU, rebuild it with cudaq.make_kernel, and try writing it in a more pythonic way.
Implementation
First, install the necessary packages
We will use the following.
!pip install -qU cudaq "cupy-cuda12x==13.6.0"
Ensuring ASCII art circuit diagrams are displayed correctly
Define the utility.
from IPython.display import HTML, display
def show_fixed(text, font="Consolas, Roboto Mono, monospace", size=13):
    text = text.expandtabs(4)  # Convert tabs to spaces
    # Escape HTML special characters so the circuit diagram renders verbatim
    esc = (text.replace("&", "&amp;")
               .replace("<", "&lt;")
               .replace(">", "&gt;"))
    html = f'<pre style="font-family:{font}; font-size:{size}px; white-space:pre; font-variant-ligatures:none;">{esc}</pre>'
    display(HTML(html))
Importing the necessary packages
import numpy as np
import cudaq
Defining CUDA-Q kernels dynamically
Referring to 9. Just-in-Time Kernel Creation, I implemented it as follows. Note that what was z.ctrl in the static kernel implementation becomes kernel.cz here, and the argument conventions differ slightly. Aside from that, the basic structure is very similar.
def embed_solution(kernel: cudaq.PyKernel, qubits: cudaq.qview, solution: list[int]):
    n_qubits = qubits.size()
    # Flip qubits whose solution bit is 0 so the solution maps to |11...1>
    for i, v in enumerate(solution):
        if v == 0:
            kernel.x(qubits[i])
    # MCZ start
    ctrls = [qubits[i] for i in range(n_qubits - 1)]
    last = qubits[n_qubits - 1]
    kernel.cz(ctrls, last)
    # MCZ end
    # Undo the flips
    for i, v in enumerate(solution):
        if v == 0:
            kernel.x(qubits[i])
def oracle(kernel: cudaq.PyKernel, qubits: cudaq.qview, solutions: list[int]):
    n_qubits = qubits.size()
    # `solutions` is a flat bit list; process it n_qubits bits at a time
    for i in range(len(solutions) // n_qubits):
        embed_solution(kernel, qubits, solutions[n_qubits*i:n_qubits*(i+1)])
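As a plain-Python aside (no CUDA-Q needed), the slicing in oracle simply walks the flat bit list in chunks of n_qubits. The bit values below are illustrative, not from the article:

```python
# Pure-Python sketch of how `oracle` slices a flat solutions list.
n_qubits = 3
flat = [1, 0, 0, 0, 1, 1]  # two 3-bit solutions, flattened

chunks = [flat[n_qubits*i:n_qubits*(i+1)] for i in range(len(flat) // n_qubits)]
print(chunks)  # [[1, 0, 0], [0, 1, 1]]
```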
def diffuser(kernel: cudaq.PyKernel, qubits: cudaq.qview):
    n_qubits = qubits.size()
    kernel.h(qubits)
    kernel.x(qubits)
    # MCZ start
    ctrls = [qubits[i] for i in range(n_qubits - 1)]
    last = qubits[n_qubits - 1]
    kernel.cz(ctrls, last)
    # MCZ end
    kernel.x(qubits)
    kernel.h(qubits)
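As a sanity check on the MCZ blocks above: a multi-controlled Z only flips the sign of the all-ones basis state. A small NumPy sketch illustrates this (the mcz_matrix helper is my own, not part of CUDA-Q):

```python
import numpy as np

def mcz_matrix(n: int) -> np.ndarray:
    """Multi-controlled Z on n qubits: identity except M[-1, -1] = -1."""
    m = np.eye(2**n)
    m[-1, -1] = -1.0
    return m

state = np.ones(8) / np.sqrt(8)   # uniform superposition over 3 qubits
out = mcz_matrix(3) @ state
# Only the |111> amplitude changes sign; all others are untouched.
print(out[-1] < 0, np.allclose(out[:-1], state[:-1]))  # True True
```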
def make_grover(num_qubits: int, counts: int, solutions: list[int]):
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(num_qubits)
    kernel.h(qubits)
    for _ in range(counts):
        oracle(kernel, qubits, solutions)
        diffuser(kernel, qubits)
    kernel.mz(qubits)
    return kernel
try:
    solutions = [0, 1, 1, 0, 1]
    grover = make_grover(5, 1, solutions)
except Exception as e:
    print(e)
Experimenting with 19 qubits
solutions = [
[1,0,0,1,1,0,1,1,0,0,1,0,0,1,1,0,1,0,0],
[1,1,1,0,1,1,1,0,0,1,0,1,0,0,1,1,0,1,1]
]
num_qubits = len(solutions[0])
N = 2**num_qubits
angle = np.arcsin(np.sqrt(len(solutions) / N))
counts = int((np.pi/2 - angle) / (2*angle) + 0.5)
print(f'{num_qubits=} {counts=}')
num_qubits=19 counts=402
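The iteration count comes from the standard Grover geometry: with θ = arcsin(√(M/N)), the state after k iterations sits at angle (2k+1)θ, so rounding (π/2 − θ)/(2θ) to the nearest integer brings it closest to π/2. A quick check (pure NumPy, no CUDA-Q needed) confirms both the count and the resulting success probability:

```python
import numpy as np

M, num_qubits = 2, 19            # two marked states among 2**19
N = 2**num_qubits
theta = np.arcsin(np.sqrt(M / N))

# Same rounding as in the snippet above
k = int((np.pi/2 - theta) / (2*theta) + 0.5)

# Probability of measuring a marked state after k Grover iterations
p_success = np.sin((2*k + 1) * theta)**2

print(k, p_success > 0.999)  # 402 True
```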
Running sampling and checking results
%%time
try:
    solutions_ = np.array(solutions).flatten().tolist()
    grover = make_grover(num_qubits, counts, solutions_)
    if False:
        show_fixed(str(cudaq.draw(grover)))
    result = cudaq.sample(grover)
    result: dict[str, int] = {k: v for k, v in result.items()}
    print(list(sorted(result.items(), key=lambda k_v: -k_v[1]))[:10])
except Exception as e:
    print(e)
[('1110111001010011011', 505), ('1001101100100110100', 495)]
CPU times: user 1min 40s, sys: 213 ms, total: 1min 40s
Wall time: 1min 43s
This is slower than the static CUDA-Q kernel implementation, likely due to overhead somewhere.
Relationship Between Number of Qubits and Elapsed Time
I decided to observe the elapsed time by varying the number of qubits. I've included CPU results for reference, but the focus is on comparing static kernel construction versus dynamic kernel construction using the GPU (cuStateVec).

It's hard to see, so let's look at it as a logarithmic graph.

- Recalling Playing with Qiskit (8) — Qiskit Aer GPU, if you're running on a CPU, it might be better to use Qiskit instead of CUDA-Q.
- Dynamic kernel construction might also be faster with cuStateVec via Qiskit Aer, but it's hard to say for sure, as I may not be fully utilizing cudaq.make_kernel yet.
- Static kernel construction is clearly distinct from the others, showing a trend of blazing speed. Mastering the syntax of static kernel construction may be the most efficient approach.
Summary
I explored the dynamic construction of CUDA-Q kernels. I initially thought this might be the most powerful approach since it can be written pythonically, but, perhaps due to details of my implementation, it could not fully leverage the GPU.
I will continue to investigate, and if the gap with static kernel construction can be bridged, I'd like to take another shot at it.