Zebra BPF DPlane Demo

YutaroHayakawa

So, I'll post a demo of the Zebra BPF DPlane and the idea behind it here. Since the idea is still very rough and I only have a minimal, dirty PoC, I think a Scrap is a sufficient format.

What I want to do is very simple: support DPlane features that are implemented in FRRouting but not in the underlying Linux kernel, by writing an eBPF program. The concrete example in my mind is the SRv6 End.DT4 function, which is primarily used for the SRv6 L3 VPN use case (see RFC 8986 and RFC 9252).

The SRv6 L3 VPN DPlane consists of two SRv6 functions: H.Encaps (encap side) and End.DT4 (decap side). In the Linux kernel, H.Encaps has been supported since 4.10, but End.DT4 was only merged in 5.11. At the time of writing, some major enterprise Linux distributions (e.g. RHEL) ship kernels older than 5.11, so SRv6 L3 VPN is still not available to users of those distributions.

On the other hand, the BGP-based SRv6 L3 VPN CPlane has been actively developed over the last few years, but despite the progress of the FRR community, some users still suffer from the feature gap between FRR and the underlying kernel implementation.

I think this is where eBPF comes in. Like Cilium does for IPv6 BIG TCP support, standing in for kernel features that aren't available yet is one of the nice use cases of eBPF (it's not always easy, but...).

The problem is that even if we can implement the DPlane in eBPF, FRR doesn't know how to interact with it. While kernel routing uses netlink as the interface between the kernel and userspace, an eBPF DPlane usually uses BPF maps. So we somehow need to teach FRR how to interact with the eBPF side.
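
For example, with libbpf this kind of interaction boils down to "open the pinned map, update an entry". The snippet below only illustrates that interface mismatch: the pin path and the seg6local_val layout are my own assumptions, not the actual format used by the PoC described later.

/* Userspace side of an eBPF DPlane: program a route by updating a map
 * pinned in bpffs, instead of sending a netlink message.
 * NOTE: the pin path and the value layout here are hypothetical. */
#include <netinet/in.h>
#include <bpf/bpf.h>

struct seg6local_val {
	__u32 action;   /* hypothetical encoding, e.g. 1 = End.DT4 */
	__u32 table_id; /* VRF table for the inner IPv4 lookup */
};

static int install_end_dt4_sid(const struct in6_addr *sid, __u32 table_id)
{
	struct seg6local_val val = { .action = 1, .table_id = table_id };
	int fd = bpf_obj_get("/sys/fs/bpf/zebra_seg6local_map");

	if (fd < 0)
		return -1;

	/* The eBPF datapath later looks this entry up by destination SID. */
	return bpf_map_update_elem(fd, sid, &val, BPF_ANY);
}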

FRR (technically Zebra) has a feature to realize exactly that, called the DPlane Provider Plugin (not explicitly documented, but see this sample code). Other implementations such as DPDK, FPM, and even the default Linux kernel DPlane use this abstraction to implement their DPlane logic. Multiple DPlane providers can be loaded simultaneously, and they are executed in order every time Zebra makes changes to its RIB.

The cool part of the DPlane provider abstraction is that a provider executed before the kernel DPlane can tell the kernel provider not to install a specific route update into the FIB, using the dplane_ctx_set_skip_kernel function. This is a perfect fit for my "alternative DPlane" idea.
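
To make that concrete, here is a rough sketch of a provider's process callback, modeled on zebra's sample_plugin.c. The is_end_dt4_route() helper is a hypothetical placeholder for the logic that inspects the ctx, and the exact dplane API signatures may differ between FRR versions.

#include <stdbool.h>
#include "zebra/zebra_dplane.h"

/* Hypothetical helper: would inspect the route's nexthops for a
 * seg6local End.DT4 action. */
static bool is_end_dt4_route(const struct zebra_dplane_ctx *ctx);

static int bpf_dplane_process(struct zebra_dplane_provider *prov)
{
	struct zebra_dplane_ctx *ctx;
	int counter, limit = dplane_provider_get_work_limit(prov);

	for (counter = 0; counter < limit; counter++) {
		ctx = dplane_provider_dequeue_in_ctx(prov);
		if (!ctx)
			break;

		switch (dplane_ctx_get_op(ctx)) {
		case DPLANE_OP_ROUTE_INSTALL:
		case DPLANE_OP_ROUTE_UPDATE:
		case DPLANE_OP_ROUTE_DELETE:
			if (is_end_dt4_route(ctx)) {
				/* Program the pinned BPF map here, then tell
				 * the kernel provider to leave the FIB alone. */
				dplane_ctx_set_skip_kernel(ctx);
				dplane_ctx_set_status(
					ctx, ZEBRA_DPLANE_REQUEST_SUCCESS);
			}
			break;
		default:
			break;
		}

		dplane_provider_enqueue_out_ctx(prov, ctx);
	}

	return 0;
}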

Based on the above idea, I made a PoC of a BPF DPlane Provider which currently only substitutes the DPlane for SRv6 End.DT4. Here is the source in my dev branch.

As you may notice, this branch only contains the logic to write SRv6 routing information into a BPF map (zebra_seg6local_map). This is intentional. By keeping the actual DPlane logic (the eBPF program) out of FRR's scope, developers can easily integrate this DPlane provider with existing eBPF-based products such as Cilium, Calico, Polycube, etc. All they need to do is open the pinned map from bpffs and implement End.DT4 using zebra_seg6local_map in their eBPF program. This also leaves open the choice of program type used to realize the DPlane (TC, XDP, LWT, etc.).
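
For illustration, a TC-BPF flavor of that eBPF side could look roughly like the sketch below: declare zebra_seg6local_map as a map pinned by name and match the outer IPv6 destination against it. The key/value layout reuses the hypothetical seg6local_val from the earlier snippet (an assumption, not the PoC's actual format), and the real End.DT4 work (stripping the outer IPv6/SRH headers and forwarding the inner IPv4 packet in the VRF table) is left out.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ipv6.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical entry layout; must match whatever zebra writes. */
struct seg6local_val {
	__u32 action;   /* e.g. End.DT4 */
	__u32 table_id; /* VRF table for the inner IPv4 lookup */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, struct in6_addr);        /* local SID */
	__type(value, struct seg6local_val);
	__uint(max_entries, 1024);
	__uint(pinning, LIBBPF_PIN_BY_NAME); /* reuse the map pinned in bpffs */
} zebra_seg6local_map SEC(".maps");

SEC("tc")
int seg6local_end_dt4(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct ipv6hdr *ip6;
	struct seg6local_val *val;
	struct in6_addr sid;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IPV6))
		return TC_ACT_OK;

	ip6 = (void *)(eth + 1);
	if ((void *)(ip6 + 1) > data_end)
		return TC_ACT_OK;

	/* Is the outer destination one of the SIDs zebra programmed for us? */
	sid = ip6->daddr;
	val = bpf_map_lookup_elem(&zebra_seg6local_map, &sid);
	if (!val)
		return TC_ACT_OK; /* not ours, let the kernel handle it */

	/* A real End.DT4 would decap the outer IPv6 (+SRH) here and forward
	 * the inner IPv4 packet using the VRF table in val->table_id. */
	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";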

This time, I implemented an example DPlane using TC-BPF. You can find it in YutaroHayakawa/zebra-bpf-dplane-example.

So, let's jump into the demo. In YutaroHayakawa/zebra-bpf-dplane-example, you can find a file topo.yaml which contains the ContainerLab topology. You should be able to reproduce this demo on your Linux machine (please let me know if it doesn't work). The container images for the modified FRR and the example DPlane are already pushed to DockerHub, and the topology file uses them.

The demo topology looks like this. CE routers with the same color belong to the same VPN (VRF), and the PE routers implement the encap/decap of the SRv6 header. All PE and P routers run ISIS and exchange the underlay routes such as the SRv6 locators and loopback addresses. PE0 and PE1 establish an iBGP peering and exchange the VPN routes.

What I'll show in this demo is that ping between ce0 and ce2 works without using the kernel's End.DT4 implementation.

Let's look at the state of PE0 after booting up the lab. First, we can see that the BPF DPlane provider is loaded.

pe0# show zebra dplane providers
Zebra dataplane providers:
BPF (2): in: 38, q: 0, q_max: 11, out: 38, q: 38, q_max: 38
Kernel (1): in: 38, q: 0, q_max: 11, out: 38, q: 38, q_max: 38

We can also see the VPN route received from PE1. It comes from the IPv4 unicast route that CE2 advertises to PE1. Thus, to reach CE2, we can ping an address in 10.0.2.0/24.

pe0# show bgp ipv4 vpn wide
<skip...>
 *>i10.0.2.0/24                                  b::2                                           0    100      0 65002 ?
    UN=b::2 EC{65002:1} label=32 sid=a:2:: sid_structure=[40,24,16,0] type=bgp, subtype=0

This creates an H.Encaps route in vrf0. We can see that it is also installed into the kernel's FIB. a:2:0:0:1:: is the SID to reach 10.0.2.0/24 over the VPN.

pe0# show ip route vrf vrf0
B>  10.0.2.0/24 [20/0] via b::2 (vrf default) (recursive), label 16, seg6local unspec unknown(seg6local_context2str), seg6 a:2:0:0:1::, weight 1, 02:43:48
  *                      via fe80::a8c1:abff:fe38:2280, net0 (vrf default), label 16, seg6local unspec unknown(seg6local_context2str), seg6 a:2:0:0:1::, weight 1, 02:43:48

# ip r show vrf vrf0
10.0.2.0/24 nhid 33  encap seg6 mode encap segs 1 [ a:2:0:0:1:: ] via inet6 fe80::a8c1:abff:fe38:2280 dev net0 proto bgp metric 20

How about the receive side? Let's look at PE1. On the Zebra side, we can see the End.DT4 route for the SID a:2:0:0:1::, and Zebra says it is installed into the FIB (the * symbol). However, we cannot see it in the kernel's FIB. Why? Because the BPF DPlane Provider takes care of the End.DT4 route and tells the kernel DPlane Provider not to install it into the kernel's FIB.

pe1# show ipv6 route
B>* a:2:0:0:1::/128 [20/0] is directly connected, vrf0, seg6local End.DT4 table 100, seg6 ::, weight 1, 02:49:08

# ip -6 r
a:1::/64 nhid 30 via fe80::a8c1:abff:fe02:e53f dev net0 proto isis metric 20 pref medium
a:2::/64 dev lo proto kernel metric 256 pref medium
b:: nhid 30 via fe80::a8c1:abff:fe02:e53f dev net0 proto isis metric 20 pref medium
b::1 nhid 30 via fe80::a8c1:abff:fe02:e53f dev net0 proto isis metric 20 pref medium
b::2 dev lo proto kernel metric 256 pref medium
2001:172:20:20::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev net0 proto kernel metric 256 pref medium
default via 2001:172:20:20::1 dev eth0 metric 1024 pref medium
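
If you want to double-check where the route actually went, one option is to dump the pinned map directly with libbpf. This is again only a sketch; the pin path and the seg6local_val layout are the same assumptions as in the earlier snippets, not necessarily what the PoC really uses.

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>

/* Hypothetical entry layout, matching the earlier sketches. */
struct seg6local_val {
	uint32_t action;   /* e.g. End.DT4 */
	uint32_t table_id; /* VRF table id */
};

int main(void)
{
	struct in6_addr cur, next;
	struct seg6local_val val;
	char buf[INET6_ADDRSTRLEN];
	void *prev = NULL;
	int fd = bpf_obj_get("/sys/fs/bpf/zebra_seg6local_map");

	if (fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}

	/* Walk every SID currently programmed by the BPF DPlane provider. */
	while (bpf_map_get_next_key(fd, prev, &next) == 0) {
		if (bpf_map_lookup_elem(fd, &next, &val) == 0) {
			inet_ntop(AF_INET6, &next, buf, sizeof(buf));
			printf("%s -> action=%u table=%u\n",
			       buf, val.action, val.table_id);
		}
		cur = next;
		prev = &cur;
	}

	return 0;
}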

So, now let's see that the ping works.

$ docker exec -it clab-srv6-vpnv4-ce0 ping 10.0.2.1
PING 10.0.2.1 (10.0.2.1): 56 data bytes
64 bytes from 10.0.2.1: seq=0 ttl=64 time=0.130 ms
64 bytes from 10.0.2.1: seq=1 ttl=64 time=0.112 ms
64 bytes from 10.0.2.1: seq=2 ttl=64 time=0.122 ms
^C
--- 10.0.2.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.112/0.121/0.130 ms

tcpdump shows that the ping packets are encapsulated with H.Encaps on pe0.

pe0# tcpdump -ni net0 dst a:2:0:0:1::
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on net0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:07:48.604130 IP6 a:1:: > a:2:0:0:1::: RT6 (len=2, type=4, segleft=0, last-entry=0, tag=0, [0]a:2:0:0:1::) IP 10.0.1.0 > 10.0.2.1: ICMP echo request, id 91, seq 23, length 64
15:07:49.604306 IP6 a:1:: > a:2:0:0:1::: RT6 (len=2, type=4, segleft=0, last-entry=0, tag=0, [0]a:2:0:0:1::) IP 10.0.1.0 > 10.0.2.1: ICMP echo request, id 91, seq 24, length 64

They reach pe1 (here we see the echo replies being encapsulated back toward pe0's SID).

pe1# tcpdump -ni net0 dst a:1:0:0:1::
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on net0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:09:05.437495 IP6 a:2:: > a:1:0:0:1::: RT6 (len=2, type=4, segleft=0, last-entry=0, tag=0, [0]a:1:0:0:1::) IP 10.0.2.1 > 10.0.1.0: ICMP echo reply, id 97, seq 35, length 64
15:09:06.437638 IP6 a:2:: > a:1:0:0:1::: RT6 (len=2, type=4, segleft=0, last-entry=0, tag=0, [0]a:1:0:0:1::) IP 10.0.2.1 > 10.0.1.0: ICMP echo reply, id 97, seq 36, length 64

And they reach ce2.

ce2# tcpdump -ni net0 dst 10.0.2.1
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on net0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:11:13.460135 IP 10.0.1.0 > 10.0.2.1: ICMP echo request, id 97, seq 163, length 64
15:11:14.460329 IP 10.0.1.0 > 10.0.2.1: ICMP echo request, id 97, seq 164, length 64

That means the BPF-based End.DT4 is indeed working.

I hope you enjoyed this demo. I'm doing this work as a hobby for now, but if it becomes useful for my work or for someone else, I think it's worth considering upstreaming this module to FRR. Feel free to reach out to me if you find it interesting.