CUDA Programming with CUDA Python #2: Mapping Multi-dimensional Data


1. Introduction

I am documenting my progress as I learn GPU programming.
In the previous article, I explained the overview of GPU programming and introduced the minimum working CUDA code using CUDA Python.
https://zenn.dev/bobtk/articles/26925007b68726

In the previous article, I explained that the smallest unit of parallelized computation is called a "Thread" and that processing is assigned to data using a hierarchical structure called Grid-Block-Thread. To delve deeper into this, this article will use the conversion of color images to grayscale as an example to explain how to map CUDA code processes (threads) to data.

Flow of This Article

In this article, I will explain the following topics:

    1. Data Handling
    • Grayscale conversion process
    • Handling of the data to be processed
    2. Mapping Threads (Processes)
    • Organizing how to assign (divide) threads to the data from section 2
    • Organizing how threads reference the data they are intended to calculate
    3. Implementation
    • Turning the content explained in sections 2 and 3 into working code using CUDA Python

Target Audience

  • This article assumes experience with Python coding.
  • Knowledge of C/C++ or CUDA coding is not required.
  • However, it assumes knowledge of "4. Getting Started with CUDA Programming" from the previous article (implementing vector addition with CUDA Python).

2. Data Handling

Below, I will explain what the grayscale conversion process entails and how image data is handled.

2.1 Grayscale Process

I will explain the process of grayscaling a color image used in this article. To simplify, a color image can be considered data consisting of red, green, and blue (RGB) intensities (values from 0 to 255) for multiple points (pixels) in two dimensions. Grayscaling a color image converts these RGB values for each pixel into a single value (luminance).

If the RGB intensity values of a certain pixel are (r, g, b), the luminance L after gray conversion is given by the following formula using a method called the Luminosity Method:

L = 0.21*r + 0.72*g + 0.07*b

The visualization of the color-to-gray conversion based on the above formula is as follows. The color boxes on the left show the (r, g, b) values, and the grayscale boxes on the right show the corresponding (L) calculated using the formula above.

Note that the weights (0.21, 0.72, 0.07) for adding (r, g, b) sum up to 1. This ensures that the luminance L also stays within the range of 0 to 255. There isn't just one set of weights (0.21, 0.72, 0.07), and there are several conventions other than the Luminosity Method. I won't go deeper here, but feel free to look it up if you're interested.

As seen from the formula for L above, only the (r, g, b) values of that specific pixel are used for the calculation. Since it is independent of other pixels' calculations, it's clear that it can be easily parallelized.
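As a quick sanity check, the Luminosity Method formula from above can be evaluated on a single pixel in plain Python (the helper name `luminance` and the sample pixel values are illustrative, not from the original article):

```python
def luminance(r, g, b):
    # Luminosity Method: weights sum to 1.00, so L stays within 0-255
    return 0.21 * r + 0.72 * g + 0.07 * b

# Pure black stays 0; pure white stays (approximately) 255
print(luminance(0, 0, 0))
print(luminance(255, 255, 255))
```

Because the weights sum to exactly 1, a gray input (r = g = b) maps to (approximately) the same value, which matches the intuition that grayscale conversion leaves gray pixels unchanged.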

2.2 Flattening Multi-dimensional Data

Data to be processed is placed in memory. Similar to vector addition, the image data to be processed is first read into the host memory by the CPU, placed in the device (GPU) memory during processing, and then the processed grayscale image is transferred back to the host memory after processing. When writing a CUDA Kernel, it's necessary to understand how images are stored in memory, which I'll explain in this and the next section.

First, I'll explain flattening. Since color images have RGB values for each two-dimensional pixel, they can be viewed as three-dimensional data. On the other hand, memory is a one-dimensional array to allow for high-speed access using only a single address value. Therefore, it's necessary to arrange three-dimensional data in one dimension. Here is an image of the 1D conversion of an image. At the top, an image of 4 horizontal × 3 vertical pixels is shown, illustrating how the r, g, b values of each pixel are arranged in one dimension.

First, the r, g, b values are arranged sequentially in memory starting from the top-left pixel [0, 0] in the horizontal direction (light blue line). Once it reaches the right edge, r, g, b are arranged from the left of the next row (orange line). This sequence continues until the r, g, b of the bottom-right pixel are arranged, completing the memory placement (green line).

Converting multi-dimensional data into a one-dimensional array like this is called flattening. Remember that all multi-dimensional data, not just color images, is flattened and placed in memory as a 1D array.

In Python, flattening multi-dimensional data can be easily done using NumPy's flatten() method.
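The flattening order described above can be confirmed with a tiny example. Below, a 2×2 RGB image (an arbitrary example, smaller than the 4×3 image in the figure) is flattened with NumPy's default row-major (C) order, producing exactly the r, g, b sequence described:

```python
import numpy as np

# A tiny 2 (rows) x 2 (cols) RGB image: shape (height, width, 3)
img = np.array([
    [[10, 20, 30], [40, 50, 60]],
    [[70, 80, 90], [100, 110, 120]],
], dtype=np.uint8)

# Row-major flattening: r,g,b of pixel [0,0], then [0,1], then row 1
flat = img.flatten()
print(flat)  # [ 10  20  30  40  50  60  70  80  90 100 110 120]
```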

2.3 How Color Images are Placed in Memory

Next, let's look at how the RGB values (r, g, b) are represented in CUDA.

First, as a premise, each value of (r, g, b), represented as integers from 0 to 255, is expressed in C(C++) as a type called unsigned char. This might be confusing at first, but it's best to simply memorize it as:

unsigned char: A type that represents integers from 0 to 255

The name is a historical artifact, so don't try too hard to interpret it literally as an "unsigned character type."

However, it's good to understand that unsigned char is 8 digits in binary (8 bits), which is 1 byte of data, and since it can represent 2^8 = 256 different values, it can express integers from 0 to 255, including 0.

Furthermore, to be precise, the logic is actually the reverse: "For computational convenience, each color in (r, g, b) is represented as 1 byte, for a total of 3 bytes."
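On the Python side, the counterpart of unsigned char is NumPy's np.uint8, and the properties above (1 byte, values 0 to 255) can be confirmed directly:

```python
import numpy as np

# np.uint8 corresponds to C's unsigned char
info = np.iinfo(np.uint8)
print(info.min, info.max)           # 0 255
print(np.dtype(np.uint8).itemsize)  # 1 (byte per value)
```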

3. Mapping Threads to Data

So far, we have explained the process of grayscaling color images and how to arrange images in memory. Next, we will look at how to assign (map) threads to data in order to parallelize the process and assign it to each core.

As mentioned in the previous article, when writing a Kernel, it is essential to put yourself in the mindset of the thread performing the calculation. When a thread performs a calculation, it naturally needs to know which pixel it is responsible for.

Below, we consider an example of mapping threads to an image of 76 horizontal pixels × 62 vertical pixels. Since the calculation of luminance L for each pixel can be parallelized, we map one thread to each pixel. Each thread then identifies the appropriate target for processing based on the provided index values.

The following is an image showing the mapping of threads to data. Each small square corresponds to one pixel, and the calculation of luminance L for that square is handled by one thread. The squares enclosed by thick black lines represent blocks (groups of threads). The gray-colored cells represent the target 76x62 pixel image to be processed. The details are explained below in order.

3.1 Blocks and Grids

First, we decide the block size. A block is a group of threads. In the diagram above, we assume a two-dimensional block of 16x16. In this case, 16x16=256 threads will be assigned to one block. I will supplement how to determine the block size in section 3.6.

Next is the definition of the grid. A grid is a group of blocks. Since we need to process the entire image indicated by the gray squares, the blocks are arranged to cover this entire image. This set of arranged blocks is called a grid.

To cover the entire image with blocks while avoiding unnecessary ones, the number of blocks required in the x-axis (horizontal) and y-axis (vertical) directions can be calculated as follows:

  • x-axis: ceil(76/16) = ceil(4.75) = 5
  • y-axis: ceil(62/16) = ceil(3.875) = 4

Note that ceil(x) is the smallest integer greater than or equal to x (i.e., it rounds up to the next whole number). If x is an integer, ceil(x) is simply x. This set of 5x4 = 20 blocks forms the grid for the grayscale conversion of the 76x62 pixel image.
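The grid-size calculation above maps directly to a few lines of Python with math.ceil (the variable names mirror the ones used later in the full implementation):

```python
import math

width, height = 76, 62   # image size in pixels
block_dim = (16, 16)     # threads per block in x and y

# Number of blocks needed to cover the whole image, rounding up
grid_x = math.ceil(width / block_dim[0])
grid_y = math.ceil(height / block_dim[1])
print(grid_x, grid_y, grid_x * grid_y)  # 5 4 20
```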

As we will see in section 3.6, the block size is often determined from perspectives such as efficiency, and the size of the input data is typically not considered. For this reason, as shown in the image, some threads will fall outside the range of the gray squares of the image (the white squares). How to handle these overflowing parts will be explained in section 3.4.

3.2 Block and Thread Indices

Indices are assigned to blocks and threads through structures called blockIdx and threadIdx, respectively. Note that indices are zero-based.

First, the grid described above consists of 5x4 blocks. Therefore, the block indices take the following integer values:

  • Index in the x-axis direction (blockIdx.x): 0–4
  • Index in the y-axis direction (blockIdx.y): 0–3

In the diagram above, the block where (blockIdx.x, blockIdx.y) = (2, 1) is highlighted in blue.

Next, one block in the grid consists of 16x16 threads. Therefore, the thread indices take the following values:

  • Index in the x-axis direction (threadIdx.x): 0–15
  • Index in the y-axis direction (threadIdx.y): 0–15

In the diagram above, the thread where (threadIdx.x, threadIdx.y) = (2, 3) is highlighted in orange.

To supplement, all threads belonging to a specific block have the same blockIdx value. In the image example, the blue threads belong to the block (blockIdx.x, blockIdx.y) = (2, 1), so regardless of which blue thread retrieves the value of blockIdx.x, it will always return 2, and blockIdx.y will return 1.

3.3 Identifying the Pixel Processed by Each Thread

First, let's consider the thread colored in orange. It needs to calculate the column (col, horizontal coordinate) and row (row, vertical coordinate) of the pixel it should process. For the column: there are 2 blocks (16 pixels each) to the left of the blue block the thread belongs to, and within that block there are threadIdx.x = 2 threads to the left of the orange thread. Therefore, the horizontal coordinate col is:

col = 2*16 + 2 = 34

Similarly, the vertical index row is:

row = 1*16 + 3 = 19

Here, while keeping in mind that coordinates are zero-based, look at the diagram until you are convinced that "the number of blocks to the left of the block you belong to is equal to your blockIdx.x value." Also, confirm that "the number of threads to your left within the block you belong to is equal to your threadIdx.x value." Based on this logic, generalizing it, the coordinates of the pixel that the thread should calculate are:

col = blockIdx.x*blockDim.x + threadIdx.x
row = blockIdx.y*blockDim.y + threadIdx.y
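The generalized formulas above can be verified in Python for the orange thread from the diagram (the helper function is illustrative; in an actual kernel, blockIdx, blockDim, and threadIdx are provided by CUDA):

```python
def global_index(block_idx, block_dim, thread_idx):
    # col/row = blockIdx * blockDim + threadIdx, per axis
    return block_idx * block_dim + thread_idx

# Orange thread: (blockIdx.x, blockIdx.y) = (2, 1),
# (threadIdx.x, threadIdx.y) = (2, 3), blockDim = 16 in both axes
col = global_index(2, 16, 2)
row = global_index(1, 16, 3)
print(col, row)  # 34 19
```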

3.4 Boundary Checks

As mentioned in Section 3.1, as shown in the image below, when blocks are tiled over an image, cells (threads) that extend beyond the image appear in the edge blocks. Specifically, while the image we are dealing with this time is 76x62 pixels, the cells colored in orange in the following image extend beyond the x-axis direction where col ≥ 76, the cells colored in purple extend beyond the y-axis direction where row ≥ 62, and the red cells fall into both. (As a reminder, indices are zero-based.)

It is extremely dangerous to read from or write to memory for indices that should not be accessed. Therefore, it is necessary to take measures (boundary checks) to ensure that threads assigned to such regions do not execute the process.

A boundary check is written in the Kernel using an if statement, where width is the number of horizontal pixels and height is the number of vertical pixels of the input image, as follows:

// Boundary check
if (col < width && row < height) {
    // ... kernel processing ...
}

3.5 Memory Access per Thread

Now that we have identified the col and row that each thread should calculate in Section 3.3 and included a boundary check in Section 3.4, we can safely access memory. Next, let's consider how a thread retrieves the RGB values of its target pixel from memory.

As seen in Sections 2.2 and 2.3, image data is flattened as a 1-dimensional array and placed in memory. In an image with a width of width pixels and a height of height pixels, the pixel at coordinates (col, row) becomes the row * width + col-th pixel when we number them starting from the top-left pixel as 0 (let's call this pixel_index). Since each pixel has three values, the first value (r) of the RGB for the pixel at (col, row) is stored in the 3*(row * width + col)-th element of the memory array (let's call this base_index). Then, g is the next element (base_index + 1), and b is the one after that (base_index + 2).

This can be expressed in code-like form as follows:

pixel_index = row * width + col
base_index = 3 * pixel_index
r = input[base_index]
g = input[base_index + 1]
b = input[base_index + 2]

Through this calculation, the appropriate position in the 1D array can be identified from the 2D coordinates.

A concrete example for the 4x3 pixel case from Section 2.2 is shown below. Please examine it until you can clearly visualize the index calculation above.
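For the 4×3 case, the index calculation can also be traced in Python (the helper name `rgb_offsets` is illustrative; the chosen pixel (col=2, row=1) is an arbitrary example):

```python
width = 4  # the 4x3 example image from Section 2.2

def rgb_offsets(col, row, width):
    # Flattened pixel number, counting from the top-left pixel as 0
    pixel_index = row * width + col
    base_index = 3 * pixel_index  # 3 values (r, g, b) per pixel
    return base_index, base_index + 1, base_index + 2

# Pixel (col=2, row=1): pixel_index = 1*4 + 2 = 6, so r, g, b
# live at memory offsets 18, 19, 20
print(rgb_offsets(2, 1, width))  # (18, 19, 20)
```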

3.6 Considerations for Selecting Block Size

In the example above, we arbitrarily set the block size to 16x16=256, but this should actually be determined by considering various factors.

Here, I will list the points to consider when deciding on the block size that we chose as a given in Section 3.1.

Hardware Limitations

The number of threads per block is limited to 1,024 in modern architectures. The block size of 16x16=256 we set here satisfies this limit.

Warp Considerations

As will be explained in another article, threads are executed in groups called "warps" (the name comes from weaving, where the warp is the bundle of lengthwise threads). In current GPUs, 1 warp = 32 threads. Because threads are divided into these warp units, the block size (number of threads per block) should be set to a multiple of the warp size, i.e., a multiple of 32. If it is not a multiple of 32, the last warp in the block will contain idle threads, leaving cores without work and reducing performance.

The block size of 16x16=256 we set this time is exactly 8 times 32, so it also satisfies this requirement.

Note that 1 warp = 32 threads is based on architectures to date and is not guaranteed to remain unchanged forever. If an architecture with, for example, 1 warp = 64 threads appears in the future, the block size will need to be reviewed accordingly.
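Both constraints above (at most 1,024 threads per block, and a multiple of the warp size) are easy to check programmatically. A minimal sketch, assuming the current 32-thread warp size:

```python
WARP_SIZE = 32            # current NVIDIA GPUs; not guaranteed forever
MAX_THREADS_PER_BLOCK = 1024

block_dim = (16, 16)
threads_per_block = block_dim[0] * block_dim[1]

print(threads_per_block)                              # 256
print(threads_per_block <= MAX_THREADS_PER_BLOCK)     # True
print(threads_per_block % WARP_SIZE == 0)             # True: 256 = 8 warps
```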

Performance Considerations

This is the most important factor, as the choice of block size directly impacts performance. Since several prerequisites are needed for the explanation, I will cover this in multiple future articles. A block size that is too large or too small will negatively affect performance.

4. Implementation

In Chapter 2, we organized the handling of image data, and in Chapter 3, we looked at how to map threads to data. Below, we will integrate everything covered so far and proceed with the implementation.

4.1 Implementation of the CUDA Kernel

Summarizing what has been explained so far, the implementation of the CUDA Kernel is as follows:

__global__ void rgb_to_grayscale(
    unsigned char *input,   // Input image (RGB)
    unsigned char *output,  // Output image (grayscale)
    int width,              // Image width
    int height              // Image height
) {
    // Calculate the coordinates of the pixel processed by the thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Boundary check
    if (col < width && row < height) {
        // Calculate the flattened pixel position
        int pixel_index = row * width + col;
        
        // Retrieve RGB values (reading from consecutive memory addresses for 3 channels)
        int base_index = pixel_index * 3;
        unsigned char r = input[base_index];
        unsigned char g = input[base_index + 1];
        unsigned char b = input[base_index + 2];
                
        // Convert to grayscale and store in the output image
        output[pixel_index] = 0.21*r + 0.72*g + 0.07*b;
    }
}

Note that in the kernel above, while the input image input is an array with a length of pixels × 3 (each RGB value), the output image output is an array with a length equal to the number of pixels. Therefore, for the pixel_index-th pixel, while input uses 3 * pixel_index as the base address, output is stored at the pixel_index address.
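This difference in array lengths is worth double-checking when allocating buffers. For the 76×62 example image from Chapter 3:

```python
width, height = 76, 62
n_pixels = width * height

input_len = 3 * n_pixels   # r, g, b per pixel (flattened RGB image)
output_len = n_pixels      # one luminance value per pixel (grayscale)
print(input_len, output_len)  # 14136 4712
```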

4.2 Execution Code using CUDA Python

Next, I will show a complete implementation example for calling the above Kernel from CUDA Python. Steps ① to ⑦ correspond to Section 4.2 of the previous article, so please refer to that as needed.

# ① Preparation
## Import libraries
from cuda.core.experimental import Device, Program, ProgramOptions, LaunchConfig, launch
import cupy as cp
import numpy as np
from PIL import Image
import math
import time

## Initialize CUDA device
device = Device()
device.set_current()
stream = device.create_stream()

## CUDA Kernel source code
kernel_source = """
__global__ void rgb_to_grayscale(
    unsigned char *input,
    unsigned char *output,
    int width,
    int height
) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (col < width && row < height) {
        int pixel_index = row * width + col;
        int base_index = pixel_index * 3;
        
        unsigned char r = input[base_index];
        unsigned char g = input[base_index + 1];
        unsigned char b = input[base_index + 2];
        
        output[pixel_index] = 0.21*r + 0.72*g + 0.07*b;
    }
}
"""

## Compile kernel
capa = device.compute_capability
program_options = ProgramOptions(
    std="c++17", 
    arch=f"sm_{capa.major}{capa.minor}"
)
program = Program(kernel_source, "C++", options=program_options)
module = program.compile("cubin", name_expressions=("rgb_to_grayscale",))
kernel = module.get_kernel("rgb_to_grayscale")

# ② Data preparation
## Generate or load sample image
# Example: Generate a random color image with 100 million pixels (12,032 × 9,024)
width = 12_032
height = 9_024 
rgb_image_host = np.random.randint(
    0, 256, 
    (height, width, 3), 
    dtype=np.uint8
)

# In case of loading an actual image file
# image = Image.open("GPU.jpg")
# rgb_image_host = np.array(image)
# height, width = rgb_image_host.shape[:2]

# Flatten RGB image into a 1D array
rgb_flat_host = rgb_image_host.flatten()

print(f"Image size: {width}×{height}")

# Start measuring GPU computation time
start_gpu = time.time()

# ③ Data transfer to GPU memory
rgb_flat_device = cp.asarray(rgb_flat_host)
gray_device = cp.empty(width * height, dtype=cp.uint8)

# ④ Kernel launch configuration
## 2D Block configuration (16×16 = 256 threads/block)
block_dim = (16, 16)
grid_dim = (
    math.ceil(width / block_dim[0]),
    math.ceil(height / block_dim[1])
)

print(f"Block configuration: {block_dim}")
print(f"Grid configuration: {grid_dim}")

config = LaunchConfig(grid=grid_dim, block=block_dim)

# ⑤ Execution
launch(
    stream, config, kernel,
    rgb_flat_device.data.ptr,
    gray_device.data.ptr,
    cp.uint32(width),
    cp.uint32(height)
)
device.sync()

# ⑥ Retrieve results
gray_host = cp.asnumpy(gray_device)

# Calculate GPU computation time
end_gpu = time.time()
time_gpu = end_gpu - start_gpu

# ⑦ Verification (Run calculation on CPU as well to compare results)
## Grayscale conversion on CPU
start_cpu = time.time()
gray_cpu = (
    0.21 * rgb_image_host[:, :, 0] + 
    0.72 * rgb_image_host[:, :, 1] + 
    0.07 * rgb_image_host[:, :, 2]
).astype(np.uint8).flatten()
end_cpu = time.time()
time_cpu = end_cpu - start_cpu

print(f"\nProcessing results:")
print(f"GPU processing time: {time_gpu:.4f} seconds")
print(f"CPU processing time: {time_cpu:.4f} seconds")
print(f"Speedup ratio: {time_cpu/time_gpu:.2f}x")

# Save result image (optional)
gray_image = gray_host.reshape((height, width))
Image.fromarray(gray_image, mode='L').save("output_gray.png")
print("Saved the grayscale image as 'output_gray.png'")

4.3 Results and Discussion

This is a comparison of execution times when running on a random color image of 100 million pixels (12,032 × 9,024 pixels). (Average time of 5 runs)

GPU [s]    CPU [s]    Speedup Ratio
0.067      0.547      8.2

The effort paid off this time, as we achieved more than an 8x speedup!
As expected of a processor specialized for graphics processing, it has excellent compatibility with these data types and arithmetic operations, resulting in a significant speedup even with a naive implementation.

By the way, I wanted to visually verify if it was correctly converted to grayscale, so I used the code above to convert the following GPU image from Irasutoya.

The converted image is shown below, and the result looks natural.

Summary

In this article, using the grayscale conversion of a color image as an example, we explained thread mapping to 2D data and CUDA implementation. We learned how to process 2D data using the hierarchical structure of Grid-Block-Thread and how to retrieve data from a flattened 1D array. Furthermore, despite the naive implementation, we saw a practical example where the GPU significantly speeds up the process.

In the next article, I will explain GPU processing for matrix multiplication, which is a more complex operation. Matrix multiplication will serve as a base case for future optimization studies and is an entry point for applications in deep learning, so please look forward to the next post.
