XDP packet capture in Rust with aya

Published on 2024-07-06

#rust #aya #xdp

eXpress Data Path (XDP) can achieve exceptional performance in network-related applications like load balancers, firewalls, and anti-DDoS systems.

When developing firewalls or anti-DDoS applications, it is crucial to monitor which packets our XDP program has dropped. This is useful for debugging and understanding the nature of attacks against our infrastructure.

In most Linux distributions, the common tools for packet capture are tcpdump (CLI) and Wireshark (GUI). These tools primarily rely on libpcap, which uses AF_PACKET (or PF_PACKET) sockets to clone and capture raw packet data. Because AF_PACKET taps packets before most of the networking stack processes them (e.g., parsing IP and TCP/UDP headers), it offers good capture performance.

However, XDP drops packets early in the networking stack, meaning tools like tcpdump cannot capture these dropped packets. Below is a simplified diagram of the network path of a packet in the kernel.

To address this, we need to implement a packet capture program using XDP before the packet is dropped. By sending packets to userspace, we can log them into a pcap file. An event-driven map, such as a ring buffer (BPF_MAP_TYPE_RINGBUF), can facilitate this process.

Kernel-space code

First, we declare our ring buffer map in the eBPF code. In this example, the ring buffer map size is set to 16 MB; a larger buffer makes it less likely that the map runs out of space for new events.

#[map]
static RING_BUF: RingBuf = RingBuf::with_byte_size(16_777_216u32, 0);

Below is working sample code that reserves a buffer in our ring buffer map and copies the raw packet into it. Note that this approach can be wasteful: it always reserves 1502 bytes, even when the packet is significantly smaller.

    const U16_SIZE: usize = mem::size_of::<u16>();
    const SIZE: usize = U16_SIZE + 1500;

    // reserve some space
    match RING_BUF.reserve::<[u8; SIZE]>(0) {
        Some(mut event) => {
            let len = ctx.data_end() - ctx.data();

            // Ensure the packet length fits within our reserved buffer (1..=1500 bytes)
            if !aya_ebpf::check_bounds_signed(len as i64, 1, 1500) {
                event.discard(0);
                return Ok(xdp_action::XDP_DROP); // action depends on what you plan to do
            }

            unsafe {
                // We first write the packet length into the buffer. Userspace uses
                // it to read exactly the packet bytes and nothing past them.
                ptr::write_unaligned(event.as_mut_ptr() as *mut _, len as u16);

                // We copy the entire packet (OSI layers 2 to 7) into the buffer
                match aya_ebpf::helpers::gen::bpf_xdp_load_bytes(
                    ctx.ctx,
                    0,
                    event.as_mut_ptr().byte_add(U16_SIZE) as *mut _,
                    len as u32,
                ) {
                    0 => event.submit(0),
                    _ => event.discard(0),
                }
            }
        }
        None => {
            info!(&ctx, "Cannot reserve space in ring buffer.");
        }
    };

Each event reserved in the ring buffer is laid out as a 2-byte packet length followed by up to 1500 bytes of raw packet data.

User-space code

In the userspace program, we can retrieve events from our ring buffer map. Each event contains the packet size and the raw packet data. To save packets in a pcap file, we'll use the pcap-file-tokio library, which leverages Tokio for asynchronous operations.

We'll also need a buffered writer to avoid making a write syscall for each packet, as syscalls can be slow. Without this, the ring buffer might fill up faster than our userspace program can process events, causing it to run out of space.

The following code is executed after loading the XDP program from userspace. We first need to get the map so we can access events.

    let ring_dump = aya::maps::RingBuf::try_from(bpf.take_map("RING_BUF").unwrap()).unwrap();

We create a pcap file and pass it to the buffered writer, using the file name provided as an argument.

    let file_out = File::create(opt.pcap_out.as_str())
        .await
        .expect("Error creating file out");

    // BufWriter avoids a syscall per write: it batches writes for us and reduces the number of syscalls.
    let stream = BufWriter::with_capacity(8192, file_out);
    let mut pcap_writer = PcapWriter::new(stream).await.expect("Error writing file");

We use Tokio to create a watch channel that signals the task for a graceful shutdown, allowing us to flush the file buffer before program termination. Without flushing the buffer, any packets still in the buffer won't be written to the pcap file, resulting in data loss.

    let (tx, rx) = watch::channel(false);

We then spawn a Tokio task to fetch events, write them into the buffer, and flush it upon program termination.

    let pcapdump_task = tokio::spawn(async move {
        let mut rx = rx.clone();
        let mut async_fd = AsyncFd::new(ring_dump).unwrap();

        loop {
            tokio::select! {
                guard = async_fd.readable_mut() => {
                    // Ready to read: take the guard from the completed future
                    // instead of awaiting readable_mut() a second time
                    let mut guard = guard.unwrap();
                    let rb = guard.get_inner_mut();

                    while let Some(read) = rb.next() {
                        let ptr = read.as_ptr();

                        // Retrieve packet len first then packet data
                        let size = unsafe { ptr::read_unaligned::<u16>(ptr as *const u16) };
                        let data = unsafe { slice::from_raw_parts(ptr.byte_add(2), size.into()) };

                        let ts = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();

                        // Write to pcap file
                        let packet = PcapPacket::new(ts, size as u32, data);
                        pcap_writer.write_packet(&packet).await.unwrap();
                    }

                    guard.clear_ready();
                },
                _ = rx.changed() => {
                    // Termination signal received
                    if *rx.borrow() {
                        break;
                    }
                }
            }
        }

        // End of program, flush the buffer
        let mut buf_writer = pcap_writer.into_writer();
        buf_writer.flush().await.unwrap();
    });

After receiving the Ctrl-C signal, we tell the task to stop and wait for it to complete.

    info!("Waiting for Ctrl-C...");
    signal::ctrl_c().await?;

    // Signal the task to stop
    tx.send(true).unwrap();

    // wait for the task to finish
    pcapdump_task.await.unwrap();

    info!("Exiting...");

To test it, we'll assume our XDP program filters UDP packets originating from port 53 (DNS responses). Let's run our program by passing the name of the file we want to save packets in.

vagrant@ebpf-dev:~/xdpdump-rs$ RUST_LOG=debug cargo xtask run -- filtered.pcap
...
[2024-07-06T04:16:47Z INFO  xdpdump_rs] Waiting for Ctrl-C...

We can use the dig command-line tool to send a DNS request to our favorite DNS server. As shown below, the requests time out.

vagrant@ebpf-dev:~$ dig @8.8.8.8 reitw.fr
;; communications error to 8.8.8.8#53: timed out
;; communications error to 8.8.8.8#53: timed out
;; communications error to 8.8.8.8#53: timed out

; <<>> DiG 9.18.24-1-Debian <<>> @8.8.8.8 reitw.fr
; (1 server found)
;; global options: +cmd
;; no servers could be reached

As shown below, we successfully dropped three responses originating from Google's DNS servers.

[2024-07-06T04:17:05Z DEBUG xdpdump_rs] Dropping packet from source 8.8.8.8
[2024-07-06T04:17:10Z DEBUG xdpdump_rs] Dropping packet from source 8.8.8.8
[2024-07-06T04:17:15Z DEBUG xdpdump_rs] Dropping packet from source 8.8.8.8

Now, let's examine the PCAP file with tcpdump. As shown below, we can see three packets containing A responses for the domain reitw.fr.

vagrant@ebpf-dev:~/xdpdump-rs$ tcpdump -vvn -r filtered.pcap
reading from file filtered.pcap, link-type EN10MB (Ethernet), snapshot length 65535
04:17:05.215430 IP (tos 0x0, ttl 64, id 62244, offset 0, flags [none], proto UDP (17), length 129)
    8.8.8.8.53 > 10.0.2.15.40736: [udp sum ok] 47092 q: A? reitw.fr. 4/0/1 reitw.fr. A 185.199.111.153, reitw.fr. A 185.199.108.153, reitw.fr. A 185.199.110.153, reitw.fr. A 185.199.109.153 ar: . OPT UDPsize=512 (101)
04:17:10.247388 IP (tos 0x0, ttl 64, id 62251, offset 0, flags [none], proto UDP (17), length 129)
    8.8.8.8.53 > 10.0.2.15.53953: [udp sum ok] 47092 q: A? reitw.fr. 4/0/1 reitw.fr. A 185.199.110.153, reitw.fr. A 185.199.108.153, reitw.fr. A 185.199.109.153, reitw.fr. A 185.199.111.153 ar: . OPT UDPsize=512 (101)
04:17:15.268431 IP (tos 0x0, ttl 64, id 62258, offset 0, flags [none], proto UDP (17), length 129)
    8.8.8.8.53 > 10.0.2.15.51077: [udp sum ok] 47092 q: A? reitw.fr. 4/0/1 reitw.fr. A 185.199.111.153, reitw.fr. A 185.199.108.153, reitw.fr. A 185.199.109.153, reitw.fr. A 185.199.110.153 ar: . OPT UDPsize=512 (101)

That's it. You can find the full code at this link. The code is experimental and can be enhanced for improved performance and readability.