ZeroNIC Anatomy Notes: Unbelievably Fast...
Introduction
Notes on researching "ZeroNIC," a high-performance host networking technology developed through collaborative research by Stanford University, Cornell University, and Enfabrica.
It is based on the paper "High-throughput and Flexible Host Networking for Accelerated Computing" presented at the USENIX Symposium on Operating Systems Design and Implementation 2024. ZeroNIC is a networking technology for bandwidth-intensive applications such as AI and data analysis, achieving both high performance and flexibility through the co-design of hardware and software.
Challenges of Traditional Network Stacks
Modern data center applications need to handle massive amounts of data over the network. For example, the NVIDIA DGX-B200 has a bandwidth of 400 Gbps per GPU, providing a total system bandwidth of 3.2 Tbps. Handling such high data transfer volumes reveals several issues with traditional network stacks.
| Stack | Features | Issues |
|---|---|---|
| RDMA | ・High throughput ・CPU offload ・Protocol processing in NIC hardware | ・Lack of flexibility ・Requires a lossless fabric ・Hardware dependency makes protocol improvements difficult ・Hard to solve issues like Head-of-Line (HOL) blocking |
| Linux TCP | ・Flexible ・Many well-tested protocols ・Easy to introduce new protocols | ・Cannot achieve high throughput ・CPU becomes the bottleneck ・Too many data copies between kernel and user space |
According to the research team's experiments, the limit for Linux TCP is 50 Gbps for a single flow. This is because CPU usage reaches nearly 100%. On the other hand, implementing zero-copy on the receiving side significantly reduces CPU usage and improves performance up to 100 Gbps. In other words, reducing data copies is the key to lowering CPU usage.
Separation of Data Path and Control Path
The core idea of ZeroNIC is to physically separate the data path (payload transfer between the network and application buffers) and the control path (header processing and protocol control).

https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf
The features are summarized as follows:
| Component | Role | Features |
|---|---|---|
| NIC | ・Separates packet header and payload ・DMAs the payload directly to application buffers ・Flow state tracking and ordering | Implemented in hardware for speed and efficiency |
| Control Stack | ・Coordination between NIC, protocol, and application ・Memory management and signaling ・Queue management | The core of the design, logically coupling the physically separated data path and control path |
| Transport Protocol | ・Processes only headers ・Provides congestion control and reliability | Can run in any environment: ・Kernel space ・User space ・Accelerators, etc. |
What's interesting is that this separation enables both high performance and flexibility. ZeroNIC can adapt whether application buffers are in the CPU or GPU, and whether the transport protocol runs in the kernel or user space.
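The split itself is conceptually simple. The sketch below is a toy model, not ZeroNIC's actual implementation; the fixed 40-byte header length and the `rx_split` name are assumptions for illustration only.

```python
# Conceptual model of the data/control path split on the receive side:
# the header goes to the control path (transport protocol), the payload
# goes to the data path (DMA into an application buffer).

HEADER_LEN = 40  # assumed fixed TCP/IP header size, for illustration

def rx_split(packet: bytes) -> tuple[bytes, bytes]:
    """Split a packet into (header, payload)."""
    return packet[:HEADER_LEN], packet[HEADER_LEN:]

packet = b"H" * HEADER_LEN + b"payload-bytes"
header, payload = rx_split(packet)
# header -> transport protocol; payload -> application buffer via DMA
```

In the real hardware this split happens in the RX Split Unit at line rate; the point of the sketch is only that the two halves of a packet take entirely different routes through the system.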
NIC Hardware Design
The ZeroNIC hardware adopts a unique architecture to efficiently realize the separation of the data path and control path.

https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf
Hardware Block Configuration
This diagram is broadly composed of the following components:
| Component | Role |
|---|---|
| Transport Protocol Execution Environment | The environment where the transport protocol (TCP, etc.) is executed |
| Application Buffers | Where application data is stored (GPU HBM in this example) |
| NIC Body | The core part of packet processing |
| Network Port | Connection point to the external network |
Detailed NIC Internal Components
The NIC body consists of various components, each playing a critical role.
| NIC Component | Function |
|---|---|
| DMA | ・Direct memory access processing for headers and payloads ・High-speed data transfer without CPU intervention |
| TX Merge Unit | ・Combines header and payload during transmission ・Assembles complete packets |
| RX Split Unit | ・Splits packets into header and payload upon reception ・Determines appropriate transfer destinations for separated data |
| MS List | ・Memory Segment List ・Manages application send/receive buffer information ・Maintained in a linked list format per flow |
| MR Table | ・Memory Region Table ・Tracks memory regions registered by the application ・Holds IOMMU address translation information |
| Flow Table | ・Management of flow state ・Retention of flow cursor information ・Used for packet processing decisions |
The coordination between each table and list is particularly important for the operation of the NIC hardware.
Upon Packet Arrival

1. A packet arrives from the network port.
2. The RX Split Unit divides the packet into a header and a payload.
3. The Flow Table is searched using the packet's flow information (source/destination IP addresses and ports, protocol).
4. The flow cursor (last processed position) and MS List information are retrieved from the Flow Table.
5. The appropriate memory segment is identified from the MS List.
6. Physical addresses in application memory are resolved by referencing the MR Table.
7. The payload is DMAed directly to the application buffer (GPU HBM in this example).
8. The header is transferred to the transport protocol.
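The arrival steps can be modeled with ordinary dictionaries standing in for the hardware tables. This is a hypothetical sketch: `flow_table`, `on_packet_arrival`, and the in-memory `app_buffer` are invented names, not the NIC's real interfaces.

```python
# Toy model of the receive steps: flow table lookup, cursor advance,
# and direct placement of the payload into the application buffer.

HEADER_LEN = 40
flow_table = {}              # 5-tuple -> per-flow state (cursor, etc.)
app_buffer = bytearray(64)   # stands in for GPU HBM / DRAM

def on_packet_arrival(five_tuple, packet):
    header, payload = packet[:HEADER_LEN], packet[HEADER_LEN:]
    state = flow_table[five_tuple]          # Flow Table lookup
    offset = state["cursor"]                # last processed position
    # (MS List / MR Table would translate offset -> physical address here)
    app_buffer[offset:offset + len(payload)] = payload  # direct "DMA"
    state["cursor"] = offset + len(payload)
    return header                           # header goes to the transport

flow = ("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
flow_table[flow] = {"cursor": 0}
hdr = on_packet_arrival(flow, b"\x00" * HEADER_LEN + b"data")
```

Note that the payload never passes through an intermediate buffer: it lands in the application buffer at the position the flow cursor dictates.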
Upon Packet Transmission

1. The Transport Protocol sends a header to the NIC.
2. The location of the transmission data is identified using the MS List and MR Table.
3. The payload is read from the application buffer (GPU HBM) via DMA.
4. The TX Merge Unit combines the header and payload.
5. The completed packet is sent out from the network port.
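The transmit direction is the mirror image. In the hedged sketch below, `mr_table`, `on_transmit`, and the MS List entry layout (`region`, `offset`, `length`) are assumptions chosen to illustrate the address resolution, not the hardware's actual formats.

```python
# Toy model of transmission: an MS List entry plus the MR Table resolve
# where the payload lives; the "TX Merge Unit" is a simple concatenation.

mr_table = {1: 100}            # region id -> base address (simulated)
app_memory = bytearray(200)    # stands in for application memory
app_memory[100:105] = b"hello"

def on_transmit(header: bytes, ms_entry: dict) -> bytes:
    base = mr_table[ms_entry["region"]]            # MR Table lookup
    start = base + ms_entry["offset"]
    payload = bytes(app_memory[start:start + ms_entry["length"]])  # DMA read
    return header + payload                        # TX Merge Unit

pkt = on_transmit(b"HDR", {"region": 1, "offset": 0, "length": 5})
```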
The MS List and MR Table are the core parts of zero-copy transfer, providing the NIC with information on the location of application buffers. The Flow Table is a data structure for efficient packet order management and retransmission processing. By combining these simple data structures, complex network processing can be executed at high speeds.
Flow of Sending and Receiving
Let's take a detailed look at the send/receive flow of ZeroNIC.

https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf
Basic Configuration
This diagram is broadly composed of the following components:
| Component | Role |
|---|---|
| Application | User application (PyTorch, Redis, etc.) |
| Provider Library | Acts as a bridge between the application and the control stack |
| Control Stack | Coordinates between the NIC, transport protocol, and application |
| Transport Protocol | Responsible for protocol processing like TCP (control path) |
| NIC | Network interface (implements separation of data and control paths) |
| Application Buffers | Data buffers such as CPU memory (DRAM) or GPU memory (HBM) |
Numbers in the diagram represent the order of processing, with blue lines indicating the transmit path and red lines indicating the receive path.
Flow of the Transmit Path (Blue Arrows)

https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf
The transmit path is processed in the following order:
1. Initialization: The application first initializes using the Provider Library; connection establishment and memory registration take place at this stage.
2. Send Request: The application calls send(buff, len, ...).
3. Queue Management: The Provider Library forwards the request to the Control Stack, which delegates processing to the Transport Protocol and then adds an entry to the NIC queue.
4. Protocol Processing: The Transport Protocol (e.g., TCP) executes protocol-specific processing.
5. Zero-copy DMA: The NIC DMAs data directly from the application buffer (the core of zero-copy).
6. Packet Formation: The NIC combines the header and payload to form a complete packet.
7. Completion Processing: A completion notification is issued by the NIC via the Control Stack.
8. Notification Transfer: The Control Stack passes the notification to the Provider Library.
9. App Notification: Finally, the application is notified of the completion.
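The queue-based hand-off between the control stack and the NIC can be sketched with two plain queues. Everything here (`nic_queue`, `completion_queue`, `app_send`, `nic_process`, the literal `b"HDR"` header) is invented for illustration; it models only the ordering of the transmit steps, not real descriptors.

```python
from collections import deque

nic_queue = deque()          # control stack -> NIC (send requests)
completion_queue = deque()   # NIC -> control stack -> application

def app_send(buf: bytes) -> None:
    nic_queue.append(buf)                 # steps 2-3: request queued

def nic_process() -> bytes:
    buf = nic_queue.popleft()
    packet = b"HDR" + buf                 # steps 5-6: "DMA" + merge
    completion_queue.append(len(buf))     # step 7: completion posted
    return packet

app_send(b"hello")
pkt = nic_process()
done = completion_queue.popleft()         # steps 8-9: app notified
```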
Flow of the Receive Path (Red Arrows)

https://www.usenix.org/system/files/osdi24-skiadopoulos.pdf
The receive path is processed in the following order:
1. Initialization: As on the transmit path, the application first performs initialization.
2. Receive Request: The application calls recv(buff, len, ...).
3. Queue Management: The Provider Library informs the NIC of the receive buffer information via the Control Stack.
4. Packet Processing: The NIC receives the packet and splits it into header and payload.
5. Zero-copy DMA: The payload is DMA-transferred directly to the application buffer (CPU DRAM or GPU HBM).
6. Header Transfer: The header is transferred to the Control Stack via the NIC queue.
7. Protocol Processing: The Transport Protocol performs protocol processing (such as ACK generation) using only the header.
8. Completion Processing: The Control Stack sends a completion notification to the Provider Library.
9. App Notification: Finally, the application is notified of the completion.
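The receive side can be sketched the same way, with the application posting buffers in advance. The names (`recv_buffers`, `app_recv_post`, `nic_receive`) and the 3-byte header are assumptions for illustration only.

```python
from collections import deque

recv_buffers = deque()   # buffers posted by the app (step 3)

def app_recv_post(buf: bytearray) -> None:
    recv_buffers.append(buf)

HDR = 3  # assumed header length for this toy example

def nic_receive(packet: bytes) -> bytes:
    header, payload = packet[:HDR], packet[HDR:]   # step 4: RX split
    buf = recv_buffers.popleft()
    buf[:len(payload)] = payload                   # step 5: zero-copy "DMA"
    return header                                  # step 6: header onward

buf = bytearray(5)
app_recv_post(buf)
hdr = nic_receive(b"HDRhello")
# buf now holds the payload; only the header reached the protocol stack
```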
Thanks to this design, ZeroNIC achieves both high performance like RDMA and flexibility like TCP/IP. By making the data path more efficient, it achieves throughput close to 100 Gbps for a single TCP flow while keeping CPU usage around 17%. I think this is quite a groundbreaking design.
Performance Evaluation
The prototype implementation of ZeroNIC integrates a NIC using a Xilinx Virtex UltraScale+ FPGA with the Linux kernel's TCP protocol in the control stack. Comparing this with the Mellanox ConnectX-6 NIC yields quite interesting results.
| Metric | ZeroNIC | MLX TCP (with TX zero-copy) | MLX RoCE |
|---|---|---|---|
| Single TCP flow throughput | 96.37 Gbps | 50.63 Gbps | 98.03 Gbps |
| CPU usage (protocol processing) | 17.20% | 100% | N/A |
| Estimated maximum throughput | 560.29 Gbps | 50.63 Gbps | N/A |
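The "estimated maximum throughput" row appears to follow from a simple CPU-bound extrapolation: if 17.2% of a core sustains 96.37 Gbps of protocol processing, then full utilization of that core would scale to roughly 96.37 / 0.172 Gbps.

```python
# Extrapolating the CPU-bound throughput ceiling from the measured numbers.
zeronic_gbps = 96.37   # measured single-flow throughput
cpu_fraction = 0.172   # 17.20% CPU spent on protocol processing

est_max = zeronic_gbps / cpu_fraction
print(round(est_max, 2))  # 560.29, matching the table
```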
Performance improvements in real-world applications are also impressive.
| Application | Improvement Rate | Highlights |
|---|---|---|
| NCCL (GPU communication) | 2.66× | ・Supports direct GPU-to-GPU communication ・Maintains TCP robustness ・No application modifications required |
| Redis (Key-Value Store) | 3.71× | ・Significant reduction in CPU overhead ・Performance comparable to RoCE ・Existing applications can be used as is |
It is noteworthy that while ZeroNIC achieves high throughput, it maintains the robustness of TCP. For instance, even with a 1% packet loss, performance hardly degrades thanks to TCP's retransmission mechanism. This is something difficult to achieve with RoCE.
Innovative Features of ZeroNIC
Summarizing ZeroNIC's approach reveals several innovative features.
| Innovation | Description |
|---|---|
| Separation of Data Path and Control Path | ・Physical separation of the header (control path) and payload (data path) ・Protocol processing is performed using only the header ・The payload is transferred directly to the application buffer |
| Zero-copy Transfer | ・DMA directly to application buffers without going through kernel buffers ・Significant reduction in CPU copy overhead ・Reduction in CPU usage (from 100% to 17%) |
| Support for Diverse Endpoints | ・CPU memory (DRAM) ・GPU memory (HBM) ・Supports other accelerator memories as well |
| Flexible Protocol Implementation | ・The Transport Protocol can run in various environments such as kernel space, user space, or dedicated hardware ・Existing TCP stacks can be used as is ・New protocols can be introduced easily |
| Efficient Handling of Retransmission and Reordering | ・Efficient packet management leveraging the MS List and Flow Table ・Out-of-order packets are DMA-transferred to the correct positions ・Consistency is ensured through cursor management |
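The last row, placing out-of-order packets directly at their final positions, is worth a small illustration. This hedged sketch assumes payload offsets are derived from sequence numbers (as in TCP); the `place` helper and the bookkeeping set are invented for the example.

```python
# Out-of-order payloads are "DMAed" to their sequence-number offset,
# so byte order is recovered in place, without any reordering copies.

buf = bytearray(12)   # stands in for the application buffer
received = set()      # cursor/bookkeeping stand-in

def place(seq: int, payload: bytes) -> None:
    buf[seq:seq + len(payload)] = payload   # land at the correct position
    received.add((seq, len(payload)))

place(6, b"world!")   # arrives first, out of order
place(0, b"hello ")   # arrives second
# buf now reads b"hello world!" despite the reordering
```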
Summary
ZeroNIC is a next-generation network architecture that combines high performance like RDMA with flexibility like TCP/IP. Through its novel approach of physically separating the data path and the control path, it can handle bandwidth-demanding applications such as AI training and distributed computing. Most importantly, its ability to improve performance without requiring modifications to existing applications is compelling. This is a technology to watch as a strong candidate for next-generation data center networking.