
ZeroNIC Anatomy Notes: Unbelievably Fast...


Introduction

Notes on researching "ZeroNIC," a high-performance host networking technology developed through collaborative research by Stanford University, Cornell University, and Enfabrica.

It is based on the paper "High-throughput and Flexible Host Networking for Accelerated Computing" presented at the USENIX Symposium on Operating Systems Design and Implementation 2024. ZeroNIC is a networking technology for bandwidth-intensive applications such as AI and data analysis, achieving both high performance and flexibility through the co-design of hardware and software.

Watch the video here.
https://youtu.be/zdslmVqIsjc?list=PLbRoZ5Rrl5lcNznjOKrTgFmIsFRPA5da-

Challenges of Traditional Network Stacks

Modern data center applications must move massive amounts of data over the network. For example, the NVIDIA DGX B200 provides 400 Gbps of network bandwidth per GPU, for a total of 3.2 Tbps across its eight GPUs. Moving data at such rates exposes several problems in traditional network stacks.

| Stack | Features | Issues |
| --- | --- | --- |
| RDMA | ・High throughput ・CPU offload ・Protocol processing in NIC hardware | ・Lacks flexibility ・Requires a lossless fabric ・Hardware dependence makes protocol improvements difficult ・Issues such as head-of-line (HOL) blocking are hard to solve |
| Linux TCP | ・Flexible ・Many well-tested protocols ・New protocols are easy to introduce | ・Cannot reach high throughput ・The CPU becomes the bottleneck ・Too many data copies between kernel and user space |

According to the research team's experiments, Linux TCP tops out at around 50 Gbps for a single flow, because CPU usage reaches nearly 100%. Implementing zero-copy on the receive side, by contrast, significantly reduces CPU usage and lifts performance to around 100 Gbps. In short, reducing data copies is the key to lowering CPU usage.

Separation of Data Path and Control Path

The core idea of ZeroNIC is to physically separate the data path (payload transfer between the network and application buffers) and the control path (header processing and protocol control).


(Figure from the paper: https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf)

The features are summarized as follows:

| Component | Role | Features |
| --- | --- | --- |
| NIC | ・Splits packet headers from payloads ・DMAs payloads directly to application buffers ・Tracks flow state and ordering | Implemented in hardware for speed and efficiency |
| Control Stack | ・Coordinates the NIC, protocol, and application ・Memory management and signaling ・Queue management | The core of the design: logically couples the physically separated data and control paths |
| Transport Protocol | ・Processes only headers ・Provides congestion control and reliability | Can run in any environment: kernel space, user space, accelerators, etc. |

What's interesting is that this separation enables both high performance and flexibility. ZeroNIC works regardless of whether application buffers reside in CPU or GPU memory, and regardless of whether the transport protocol runs in the kernel or in user space.

NIC Hardware Design

The ZeroNIC hardware adopts a unique architecture to efficiently realize the separation of the data path and control path.


(Figure from the paper: https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf)

Hardware Block Configuration

This diagram is broadly composed of the following components:

| Component | Role |
| --- | --- |
| Transport protocol execution environment | Where the transport protocol (TCP, etc.) runs |
| Application buffers | Where application data is stored (GPU HBM in this example) |
| NIC body | The core packet-processing logic |
| Network port | The connection point to the external network |

Detailed NIC Internal Components

The NIC body consists of various components, each playing a critical role.

| NIC Component | Function |
| --- | --- |
| DMA | ・Direct memory access for headers and payloads ・High-speed data transfer without CPU involvement |
| TX Merge Unit | ・Combines header and payload on transmit ・Assembles complete packets |
| RX Split Unit | ・Splits received packets into header and payload ・Decides the appropriate destination for each part |
| MS List | ・Memory Segment List ・Manages application send/receive buffer information ・Maintained as a linked list per flow |
| MR Table | ・Memory Region Table ・Tracks memory regions registered by the application ・Holds IOMMU address-translation information |
| Flow Table | ・Manages flow state ・Holds flow-cursor information ・Used for packet-processing decisions |

The coordination between each table and list is particularly important for the operation of the NIC hardware.

  1. Upon Packet Arrival
    • A packet arrives from the network port.
    • The RX Split Unit divides the packet into a header and a payload.
    • The Flow Table is searched using packet flow information (source/destination IP addresses/ports, protocol).
    • Flow cursor (last processed position) and MS List information are retrieved from the Flow Table.
    • Appropriate memory segment information is identified from the MS List.
    • Physical addresses in application memory are resolved by referencing the MR Table.
    • The payload is DMAed directly to the application buffer (GPU HBM in this example).
    • The header is transferred to the Transport Protocol.
  2. Upon Packet Transmission
    • The Transport Protocol sends a header to the NIC.
    • The location of the transmission data is identified using the MS List and MR Table.
    • The payload is read from the application buffer (GPU HBM) via DMA.
    • The TX Merge Unit combines the header and payload.
    • The completed packet is sent out from the network port.

The MS List and MR Table are the core of zero-copy transfer, giving the NIC the location of application buffers. The Flow Table is the data structure that enables efficient packet ordering and retransmission handling. Combining these simple data structures lets complex network processing run at high speed.
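The packet-arrival steps above can be sketched as a toy model. To be clear, everything here (the names `MemSegment`, `FlowState`, `on_packet_arrival`, and the page-granular MR translation) is my own illustrative assumption, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemSegment:                 # one entry of the per-flow MS List
    virt_addr: int                # start address of an app buffer segment
    length: int

@dataclass
class FlowState:                  # one Flow Table entry
    cursor: int = 0               # next expected byte offset of this flow
    ms_list: list = field(default_factory=list)

def resolve(ms_list, offset):
    """Walk the MS List to map a flow offset to a virtual buffer address."""
    for seg in ms_list:
        if offset < seg.length:
            return seg.virt_addr + offset
        offset -= seg.length
    raise IndexError("no receive buffer posted for this offset")

def on_packet_arrival(flow_table, mr_table, memory, five_tuple, header, payload):
    """RX Split Unit: DMA the payload into the app buffer, pass on the header."""
    flow = flow_table[five_tuple]                    # Flow Table lookup
    virt = resolve(flow.ms_list, flow.cursor)        # MS List: offset -> virt addr
    phys = mr_table[virt & ~0xFFF] + (virt & 0xFFF)  # MR Table: page translation
    memory[phys:phys + len(payload)] = payload       # zero-copy DMA write
    flow.cursor += len(payload)                      # advance the flow cursor
    return header                                    # only the header reaches transport

# Toy usage: one flow, a 4 KiB registered buffer at virtual address 0x1000.
memory = bytearray(8192)                             # stand-in for host DRAM / GPU HBM
mr_table = {0x1000: 4096}                            # virt page -> "physical" offset
flows = {("10.0.0.1", "10.0.0.2", 80): FlowState(ms_list=[MemSegment(0x1000, 4096)])}
hdr = on_packet_arrival(flows, mr_table, memory, ("10.0.0.1", "10.0.0.2", 80),
                        b"HDR", b"hello")
print(memory[4096:4101], hdr)                        # payload landed directly in the buffer
```

Note that the payload never passes through an intermediate buffer: the Flow Table, MS List, and MR Table lookups are enough to compute its final resting place.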

Flow of Sending and Receiving

Let's take a detailed look at the send/receive flow of ZeroNIC.


(Figure from the paper: https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf)

Basic Configuration

This diagram is broadly composed of the following components:

| Component | Role |
| --- | --- |
| Application | User application (PyTorch, Redis, etc.) |
| Provider Library | Bridges the application and the control stack |
| Control Stack | Coordinates the NIC, transport protocol, and application |
| Transport Protocol | Handles protocol processing such as TCP (control path) |
| NIC | Network interface (implements the data/control-path separation) |
| Application Buffers | Data buffers in CPU memory (DRAM) or GPU memory (HBM) |

Numbers in the diagram represent the order of processing, with blue lines indicating the transmit path and red lines indicating the receive path.

Flow of the Transmit Path (Blue Arrows)


(Figure from the paper: https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf)

The transmit path is processed in the following order:

  1. Initialization
    The application first initializes using the Provider Library. At this stage, connection establishment and memory registration take place.
  2. Send Request
    The application calls send(buff, len, ...).
  3. Queue Management
    The Provider Library forwards the request to the Control Stack, which delegates processing to the Transport Protocol and then adds an entry to the NIC queue.
  4. Protocol Processing
    The Transport Protocol (e.g., TCP) executes protocol-specific processing.
  5. Zero-copy DMA
    The NIC performs a direct DMA transfer of data from the application buffer (this is the core of zero-copy).
  6. Packet Formation
    The NIC combines the header and payload to form a complete packet.
  7. Completion Processing
    Processing completion notification is sent from the NIC via the Control Stack.
  8. Notification Transfer
    The completion notification is sent from the Control Stack to the Provider Library.
  9. App Notification
    Finally, the application is notified of the completion.
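Steps 4 to 6 above can be sketched in miniature: the transport builds only a header, and the NIC reads the payload straight out of the application buffer and merges the two. The header layout and function names here are illustrative assumptions, not ZeroNIC's actual wire format:

```python
import struct

def build_header(seq, length):
    """Transport-protocol step: build a header only; the payload is never touched."""
    return struct.pack("!II", seq, length)        # toy header: sequence + length

def tx_merge(header, app_buffer, offset, length):
    """TX Merge Unit: DMA-read the payload and form the complete packet."""
    payload = app_buffer[offset:offset + length]  # read in place from app memory
    return header + payload                       # wire packet

app_buffer = bytearray(b"training-batch-0123")    # stays put; never copied by the CPU
pkt = tx_merge(build_header(seq=0, length=8), app_buffer, 0, 8)
print(pkt[8:])   # b'training' -- payload pulled straight from the buffer
```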

Flow of the Receive Path (Red Arrows)


(Figure from the paper: https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf)

The receive path is processed in the following order:

  1. Initialization
    Similar to the transmit path, the application first performs initialization.
  2. Receive Request
    The application calls recv(buff, len, ...).
  3. Queue Management
    The Provider Library informs the NIC of the receive buffer information via the Control Stack.
  4. Packet Processing
    The NIC receives the packet and splits it into header and payload.
  5. Zero-copy DMA
    The payload is DMA-transferred directly to the application buffer (CPU DRAM or GPU HBM).
  6. Header Transfer
    Header information is transferred to the Control Stack via the NIC queue.
  7. Protocol Processing
    The Transport Protocol performs protocol processing (such as ACK generation) using only the header.
  8. Completion Processing
    The Control Stack sends a completion notification to the Provider Library.
  9. App Notification
    Finally, the application is notified of the completion.
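Step 7 is worth dwelling on: because the transport sees only headers, generating an ACK never touches (or copies) the payload. A minimal sketch, with an assumed toy header layout rather than real TCP fields:

```python
import struct

def process_header(hdr, flow_cursor):
    """Header-only protocol processing: compute the cumulative ACK number."""
    seq, length = struct.unpack("!II", hdr)  # parse the toy header: seq + length
    if seq == flow_cursor:                   # in-order segment: advance past it
        flow_cursor += length
    return flow_cursor                       # next expected byte == ACK to send

# The payload (1448 bytes here) was already DMAed away; only 8 header bytes arrive.
ack = process_header(struct.pack("!II", 0, 1448), flow_cursor=0)
print(ack)   # 1448
```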

Innovative Points of This Design

The following points are particularly innovative in this diagram:

| Innovation | Description |
| --- | --- |
| Separation of data path and control path | ・Physically separates the header (control path) from the payload (data path) ・Protocol processing uses only the header ・The payload is transferred directly to the application buffer |
| Zero-copy transfer | ・DMA directly to application buffers without passing through kernel buffers ・Greatly reduces CPU copy overhead ・Cuts CPU usage from 100% to 17% |
| Support for diverse endpoints | ・Supports CPU memory (DRAM), GPU memory (HBM), and other accelerator memories |
| Flexible protocol implementation | ・The transport protocol can run in kernel space, user space, or dedicated hardware ・Existing TCP stacks can be used as-is ・New protocols are easy to introduce |

Thanks to this design, ZeroNIC achieves both high performance like RDMA and flexibility like TCP/IP. By making the data path more efficient, it achieves throughput close to 100 Gbps for a single TCP flow while keeping CPU usage around 17%. I think this is quite a groundbreaking design.

Performance Evaluation

The ZeroNIC prototype implements the NIC on a Xilinx Virtex UltraScale+ FPGA and pairs it with the Linux kernel's TCP stack as the transport in the control stack. Comparing it against a Mellanox ConnectX-6 NIC yields quite interesting results.

| Metric | ZeroNIC | MLX TCP (with TX zero-copy) | MLX RoCE |
| --- | --- | --- | --- |
| Single TCP flow throughput | 96.37 Gbps | 50.63 Gbps | 98.03 Gbps |
| CPU usage (protocol processing) | 17.20% | 100% | N/A |
| Estimated maximum throughput | 560.29 Gbps | 50.63 Gbps | N/A |
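The "estimated maximum throughput" figure appears to follow from scaling the measured single-flow rate by the inverse of its CPU usage; that derivation is my own reading of the numbers, not something stated in the table:

```python
# If 96.37 Gbps consumes 17.20% of a core on protocol processing, the CPU-bound
# ceiling is the rate at which that core would be fully busy.
single_flow_gbps = 96.37
cpu_fraction = 0.1720

estimated_max = single_flow_gbps / cpu_fraction
print(round(estimated_max, 2))   # 560.29 -- matches the table's estimate
```

MLX TCP's estimate equals its measured 50.63 Gbps for the same reason: at 100% CPU usage there is no headroom left to scale.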

Performance improvements in real-world applications are also impressive.

| Application | Improvement | Highlights |
| --- | --- | --- |
| NCCL (GPU communication) | 2.66× | ・Supports direct GPU-to-GPU communication ・Maintains TCP's robustness ・No application modifications required |
| Redis (key-value store) | 3.71× | ・Significant reduction in CPU overhead ・Performance comparable to RoCE ・Existing applications work as-is |

It is noteworthy that while ZeroNIC achieves high throughput, it maintains the robustness of TCP. For instance, even with a 1% packet loss, performance hardly degrades thanks to TCP's retransmission mechanism. This is something difficult to achieve with RoCE.

Innovative Features of ZeroNIC

Summarizing ZeroNIC's approach reveals several innovative features.

| Innovation | Description |
| --- | --- |
| Separation of data path and control path | ・Physically separates the header (control path) from the payload (data path) ・Protocol processing uses only the header ・The payload is transferred directly to the application buffer |
| Zero-copy transfer | ・DMA directly to application buffers without passing through kernel buffers ・Greatly reduces CPU copy overhead ・Cuts CPU usage from 100% to 17% |
| Support for diverse endpoints | ・CPU memory (DRAM) ・GPU memory (HBM) ・Other accelerator memories |
| Flexible protocol implementation | ・The transport protocol can run in kernel space, user space, or dedicated hardware ・Existing TCP stacks can be used as-is ・New protocols are easy to introduce |
| Efficient retransmission and reordering | ・Efficient packet management using the MS List and Flow Table ・Out-of-order packets are DMA-transferred to their correct positions ・Cursor management ensures consistency |
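The retransmission/reordering point can be illustrated with a toy model: each payload is written at its sequence-derived offset regardless of arrival order, while a cursor tracks the contiguous in-order prefix. The function and data structures here are illustrative assumptions, not the hardware's actual mechanism:

```python
def deliver(buffer, received, cursor, seq, payload):
    """Place a payload by sequence offset, then advance the cursor over any
    contiguous prefix that the new arrival has completed."""
    buffer[seq:seq + len(payload)] = payload   # DMA to the correct position
    received.add((seq, len(payload)))
    advanced = True
    while advanced:                            # slide the cursor over filled holes
        advanced = False
        for s, length in received:
            if s == cursor:
                cursor += length
                advanced = True
    return cursor

buf, rcvd, cur = bytearray(12), set(), 0
cur = deliver(buf, rcvd, cur, 6, b"world!")    # arrives out of order: cursor stays at 0
cur = deliver(buf, rcvd, cur, 0, b"hello ")    # hole filled: cursor jumps to 12
print(bytes(buf), cur)   # b'hello world!' 12
```

Because the payload already sits at its final position, a retransmitted or reordered packet costs no extra copies; only the cursor bookkeeping in the Flow Table changes.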

Summary

ZeroNIC is a next-generation network architecture that combines high performance like RDMA with flexibility like TCP/IP. Through its novel approach of physically separating the data path and the control path, it can handle bandwidth-demanding applications such as AI training and distributed computing. Most importantly, its ability to improve performance without requiring modifications to existing applications is compelling. This is a technology to watch as a strong candidate for next-generation data center networking.
