Tracing a packet's journey using Linux tracepoints, perf, and eBPF

Original link: https://colobu.com/2023/09/11/tracing-a-packet-journey-using-linux-tracepoints-perf-ebpf/

Original text: Tracing a packet journey using Linux tracepoints, perf and eBPF

I've been looking for a low-level Linux network debugging tool for a while. Linux lets you build complex networks that run directly on the host, using a combination of virtual interfaces and network namespaces. When something goes wrong, troubleshooting is pretty tedious. If it's an L3 routing issue, mtr will most likely help. But if it's a lower-layer issue, I usually end up manually checking each interface/bridge/network namespace/iptables and launching a couple of tcpdumps to try to understand what's going on. If you don't know the network setup beforehand, this can feel like walking through a maze.

What I need is a tool that can tell me “Hey, buddy, I’ve seen your packet: it’s gone like this, on this interface, in this network namespace”.

Basically, what I need is mtr on L2.

No such tool? Let’s build one from scratch!

At the end of this article, we will have a low-level packet tracer that is simple and easy to use. If you ping your local Docker container, it will show something like this:


# ping -4 172.17.0.2
[ 4026531957 ] docker0 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026531957 ] vetha373ab6 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026532258 ] eth0 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026532258 ] eth0 reply #17146.001 172.17.0.2 -> 172.17.0.1
[ 4026531957 ] vetha373ab6 reply #17146.001 172.17.0.2 -> 172.17.0.1
[ 4026531957 ] docker0 reply #17146.001 172.17.0.2 -> 172.17.0.1

Tracing to the rescue

One way out of a maze is to explore it. That's what you do while you're walking inside the maze. Another way out of the labyrinth is to change your point of view: move to a God's-eye perspective and observe the paths taken by those who know the way.

In Linux terms, this means moving to the kernel's perspective, where network namespaces are just labels rather than "containers". In the kernel, packets, interfaces and so on are plain, observable objects.

In this article, I will focus on 2 tracing tools: perf and eBPF.

Introducing perf and eBPF

perf is the reference tool for performance-related analysis on Linux. It is developed in the same source tree as the Linux kernel and must be compiled for the specific kernel you will be tracing. It can trace the kernel as well as user programs, and it can work by sampling or with tracepoints. Think of it as a huge superset of strace with much lower overhead. In this article we will only use it in a very simple way. If you want to learn more about perf, I highly recommend visiting Brendan Gregg's blog.

eBPF is a relatively recent addition to the Linux kernel. As the name suggests, it is an extended version of the BPF bytecode, the "Berkeley Packet Filter" used to filter packets on BSD-family systems. On Linux, it can also be used to run platform-independent code safely inside the running kernel, provided it meets some safety criteria. For example, memory accesses are verified before the program runs, and it must be provable that the program will terminate within a bounded time. Even if a program is in fact safe and always terminates, the kernel will reject it if it cannot prove this.
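To make that concrete, here is a minimal sketch of my own (not taken from the final tracer code) of the kind of program the verifier refuses: the unbounded loop below cannot be proven to terminate, so loading it through bcc fails with a verifier error before it ever runs.

#include <uapi/linux/ptrace.h>

// A deliberately invalid probe: the while(1) loop has no provable bound,
// so the verifier rejects the program at load time.
int rejected_probe(struct pt_regs *ctx)
{
    int i = 0;
    while (1) {
        i++;    // unbounded back-edge -> rejected by the verifier
    }
    return 0;   // never reached
}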

Such programs can be used as network classifiers for QoS, for very low-level networking and filtering as part of the eXpress Data Path (XDP), as tracing agents, and in many other scenarios. Tracing probes can be attached to any function whose symbol appears in /proc/kallsyms, or to any tracepoint. In this article, I will focus on tracing agents attached to tracepoints.

For an example of tracing probes attached to kernel functions or as a more detailed introduction, please read my previous article on eBPF.

Laboratory setup

For this article, we need perf and some tooling for eBPF. Since I'm not a big fan of hand-written assembly, I'll use bcc here. It is a powerful and flexible toolkit that lets you write kernel probes in restricted C and do the user-space instrumentation in Python. It might be too heavy for production, but it's perfect for development!

Here I'll recap the installation instructions for Ubuntu 17.04 (Zesty), the operating system on my laptop. Installing perf on other distributions shouldn't differ much, and distribution-specific bcc installation instructions can be found on GitHub.

Note: attaching eBPF programs to tracepoints requires at least Linux kernel 4.7.

Install perf:


# Install
sudo apt install linux-tools-generic
# Test
perf

If you see an error message, it’s likely that your kernel was recently updated but the operating system has not yet restarted.

Install bcc:


# Install required dependencies
sudo apt install bison build-essential cmake flex git libedit-dev python zlib1g-dev libelf-dev libllvm4.0 llvm-dev libclang-dev luajit luajit-5.1-dev
# Grab the bcc source code
git clone https://github.com/iovisor/bcc.git
# Compile and install
mkdir bcc/build
cd bcc/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install

Finding good tracepoints, aka "manually tracing a packet's journey with perf"

There are several ways to find good tracepoints. In an earlier version of this article, I started from the veth driver code and followed function calls from there to find functions to trace. While that gave acceptable results, it could not capture all packets: the path common to all packets lies in unexported (inline or static) functions, which cannot be probed. That's when I realized Linux has tracepoints and decided to rewrite this article and the related code around tracepoints instead of functions. That was rather frustrating, but also much more interesting to me.

I’ve talked too much, let’s get down to business.

Our goal is to trace the path a packet takes. Depending on the interfaces it crosses, the tracepoints it hits may differ (spoiler: they do).

In order to find suitable tracepoints, I pinged 2 internal and 2 external destination IPs while running perf trace:

  • localhost, IP is 127.0.0.1
  • An “innocent” Docker container with IP 172.17.0.2
  • My phone shares the network via USB, the IP is 192.168.42.129
  • My phone via WiFi, the IP is 192.168.43.1

perf trace is a subcommand of the perf command that by default produces output similar to strace (with less overhead). We can easily adjust this to hide the system call itself and only print events of the “net” category. For example, tracing a ping to a Docker container with IP 172.17.0.2 would look like this:


sudo perf trace --no-syscalls --event 'net:*' ping 172.17.0.2 -c1 > /dev/null
0.000 net:net_dev_queue:dev=docker0 skbaddr=0xffff96d481988700 len=98)
0.008 net:net_dev_start_xmit:dev=docker0 queue_mapping=0 skbaddr=0xffff96d481988700 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=0 len=98 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0)
0.014 net:net_dev_queue:dev=veth79215ff skbaddr=0xffff96d481988700 len=98)
0.016 net:net_dev_start_xmit:dev=veth79215ff queue_mapping=0 skbaddr=0xffff96d481988700 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=0 len=98 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0)
0.020 net:netif_rx:dev=eth0 skbaddr=0xffff96d481988700 len=84)
0.022 net:net_dev_xmit:dev=veth79215ff skbaddr=0xffff96d481988700 len=98 rc=0)
0.024 net:net_dev_xmit:dev=docker0 skbaddr=0xffff96d481988700 len=98 rc=0)
0.027 net:netif_receive_skb:dev=eth0 skbaddr=0xffff96d481988700 len=84)
0.044 net:net_dev_queue:dev=eth0 skbaddr=0xffff96d481988b00 len=98)
0.046 net:net_dev_start_xmit:dev=eth0 queue_mapping=0 skbaddr=0xffff96d481988b00 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=0 len=98 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0)
0.048 net:netif_rx:dev=veth79215ff skbaddr=0xffff96d481988b00 len=84)
0.050 net:net_dev_xmit:dev=eth0 skbaddr=0xffff96d481988b00 len=98 rc=0)
0.053 net:netif_receive_skb:dev=veth79215ff skbaddr=0xffff96d481988b00 len=84)
0.060 net:netif_receive_skb_entry:dev=docker0 napi_id=0x3 queue_mapping=0 skbaddr=0xffff96d481988b00 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=2 hash=0x00000000 l4_hash=0 len=84 data_len=0 truesize=768 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0)
0.061 net:netif_receive_skb:dev=docker0 skbaddr=0xffff96d481988b00 len=84)

Keeping only the event name, the device, and the skbaddr makes this more readable:


net_dev_queue dev=docker0 skbaddr=0xffff96d481988700
net_dev_start_xmit dev=docker0 skbaddr=0xffff96d481988700
net_dev_queue dev=veth79215ff skbaddr=0xffff96d481988700
net_dev_start_xmit dev=veth79215ff skbaddr=0xffff96d481988700
netif_rx dev=eth0 skbaddr=0xffff96d481988700
net_dev_xmit dev=veth79215ff skbaddr=0xffff96d481988700
net_dev_xmit dev=docker0 skbaddr=0xffff96d481988700
netif_receive_skb dev=eth0 skbaddr=0xffff96d481988700
net_dev_queue dev=eth0 skbaddr=0xffff96d481988b00
net_dev_start_xmit dev=eth0 skbaddr=0xffff96d481988b00
netif_rx dev=veth79215ff skbaddr=0xffff96d481988b00
net_dev_xmit dev=eth0 skbaddr=0xffff96d481988b00
netif_receive_skb dev=veth79215ff skbaddr=0xffff96d481988b00
netif_receive_skb_entry dev=docker0 skbaddr=0xffff96d481988b00
netif_receive_skb dev=docker0 skbaddr=0xffff96d481988b00

There are a few things to note here. The most obvious is that the skbaddr stays the same most of the time but changes in the middle: that is the point where the echo reply packet is generated as an answer to the echo request (ping). The rest of the time, the same network packet moves between interfaces, hopefully without being copied. Copying is expensive…

Another interesting point: we clearly see the packet go through the docker0 bridge, then the host side of the veth pair (veth79215ff in my case), and finally the container side of the pair, which claims to be eth0. We don't see the network namespaces yet, but this already gives a good overview.

Finally, after seeing the packet on eth0, we hit the tracepoints in reverse order. This is not the reply, but the tail end of the transmission path.

By repeating a similar process for the 4 target scenarios, we can pick the most suitable tracepoints to follow a packet's journey. I chose 4 of them:

  • net_dev_queue
  • netif_receive_skb_entry
  • netif_rx
  • napi_gro_receive_entry

These 4 tracepoints give me trace events in order, with no duplicates, which saves some deduplication work. Still a pretty good pick.

We can easily double-check this selection like this:


sudo perf trace --no-syscalls \
    --event 'net:net_dev_queue' \
    --event 'net:netif_receive_skb_entry' \
    --event 'net:netif_rx' \
    --event 'net:napi_gro_receive_entry' \
    ping 172.17.0.2 -c1 > /dev/null
0.000 net:net_dev_queue:dev=docker0 skbaddr=0xffff8e847720a900 len=98)
0.010 net:net_dev_queue:dev=veth7781d5c skbaddr=0xffff8e847720a900 len=98)
0.014 net:netif_rx:dev=eth0 skbaddr=0xffff8e847720a900 len=84)
0.034 net:net_dev_queue:dev=eth0 skbaddr=0xffff8e849cb8cd00 len=98)
0.036 net:netif_rx:dev=veth7781d5c skbaddr=0xffff8e849cb8cd00 len=84)
0.045 net:netif_receive_skb_entry:dev=docker0 napi_id=0x1 queue_mapping=0 skbaddr=0xffff8e849cb8cd00 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=2 hash=0x00000000 l4_hash=0 len=84 data_len=0 truesize=768 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0)

Mission accomplished!

If you want to go one step further and explore the list of available network tracepoints, you can use perf list:


sudo perf list 'net:*'

This returns a list of tracepoint names like net:netif_rx. The part before the colon ('net') is the event category; the part after it is the event name within that category.

Writing a custom tracer with eBPF/bcc

For most use cases, what we have so far is already more than enough. If you were reading this article to learn how to trace a packet's journey on a Linux system, you already have everything you need. However, if you want to dig deeper, run custom filters, and track more data, such as the network namespaces the packets cross or the source and destination IPs, stay with me.

Starting with Linux kernel 4.7, eBPF programs can be attached to kernel tracepoints. Before that, the only alternative for building this tracer would have been to attach probes to exported kernel symbols. Although that approach works, it has some drawbacks:

  • The kernel's internal API is unstable, while tracepoints are stable (although the data structures they expose are not necessarily so…).
  • For performance reasons, most of the networking internals are inlined or static. Neither can be probed.
  • Finding all the potential call sites of these functions is tedious, and sometimes not all the required data is available at that point.

An earlier version of this article attempted to use kprobes, which were easier to use, but the results were incomplete.

Now, to be honest, accessing data attached to tracepoints is quite a bit more cumbersome than with their kprobe counterparts. While I've tried to keep this article as approachable as possible, you may want to start with one of my earlier articles.
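For comparison, here is a rough sketch (my own illustration, not the article's code) of what a kprobe-based probe looks like with bcc: the kprobe__<symbol> naming convention auto-attaches the probe to the named kernel function, and the probe receives the function's own arguments directly, but it only fires for that one exported, non-inlined symbol.

#include <uapi/linux/ptrace.h>
#include <linux/skbuff.h>

// bcc auto-attaches this probe to the exported kernel function netif_rx().
// Unlike a tracepoint, there is no generated 'args' struct: we receive the
// function's own arguments, here the sk_buff being received.
int kprobe__netif_rx(struct pt_regs *ctx, struct sk_buff *skb)
{
    bpf_trace_printk("netif_rx skb=%p\n", skb);
    return 0;
}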

That said, let's start with a simple Hello World. In this Hello World, we will emit an event every time one of the 4 tracepoints we selected (net_dev_queue, netif_receive_skb_entry, netif_rx and napi_gro_receive_entry) fires. To keep it simple at this stage, we will only send the program's comm, a 16-character string that is basically the program name.


#include <bcc/proto.h>
#include <linux/sched.h>

// Event structure
struct route_evt_t {
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(route_evt);

static inline int do_trace(void* ctx, struct sk_buff* skb)
{
    // Build event for userland
    struct route_evt_t evt = {};
    bpf_get_current_comm(evt.comm, TASK_COMM_LEN);

    // Send event to userland
    route_evt.perf_submit(ctx, &evt, sizeof(evt));

    return 0;
}

/**
 * Attach to Kernel Tracepoints
 */
TRACEPOINT_PROBE(net, netif_rx) {
    return do_trace(args, (struct sk_buff*)args->skbaddr);
}

TRACEPOINT_PROBE(net, net_dev_queue) {
    return do_trace(args, (struct sk_buff*)args->skbaddr);
}

TRACEPOINT_PROBE(net, napi_gro_receive_entry) {
    return do_trace(args, (struct sk_buff*)args->skbaddr);
}

TRACEPOINT_PROBE(net, netif_receive_skb_entry) {
    return do_trace(args, (struct sk_buff*)args->skbaddr);
}

This snippet attaches to the 4 tracepoints of the "net" category, loads the skbaddr field, and passes it to the common section, which for now only loads the program name. If you are wondering where this args->skbaddr comes from (and I'd be glad if you did), the args structure is generated for you by bcc whenever you define a tracepoint with TRACEPOINT_PROBE. Since it is generated on the fly, there is no easy way to see its definition, but there is a better way: we can look at the data source directly in the kernel. Fortunately, every tracepoint has an entry under /sys/kernel/debug/tracing/events. For example, for net:netif_rx, just run cat /sys/kernel/debug/tracing/events/net/netif_rx/format, which should output something like this:


name: netif_rx
ID: 1183
format:
    field:unsigned short common_type;         offset:0;  size:2; signed:0;
    field:unsigned char common_flags;         offset:2;  size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3;  size:1; signed:0;
    field:int common_pid;                     offset:4;  size:4; signed:1;

    field:void * skbaddr;                     offset:8;  size:8; signed:0;
    field:unsigned int len;                   offset:16; size:4; signed:0;
    field:__data_loc char[] name;             offset:20; size:4; signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len

You may notice the print fmt line at the end of the record. This is exactly what perf trace uses to generate its output.

With the kernel-side code in place and understood, we can wrap it in a Python script that displays one line for each event sent by the eBPF probe:


#!/usr/bin/env python
# coding: utf-8

from socket import inet_ntop
from bcc import BPF

import ctypes as ct

bpf_text = '''<SEE CODE SNIPPET ABOVE>'''

TASK_COMM_LEN = 16  # linux/sched.h

class RouteEvt(ct.Structure):
    _fields_ = [
        ("comm", ct.c_char * TASK_COMM_LEN),
    ]

def event_printer(cpu, data, size):
    # Decode event
    event = ct.cast(data, ct.POINTER(RouteEvt)).contents

    # Print event
    print "Just got a packet from %s" % (event.comm)

if __name__ == "__main__":
    b = BPF(text=bpf_text)
    b["route_evt"].open_perf_buffer(event_printer)

    while True:
        b.kprobe_poll()

You can test it now. You need to run with root privileges.

Please note: we are not filtering packets at this stage. Even with low network usage, your terminal may be flooded.


$> sudo python ./tracepkt.py
Just got a packet from ping6
Just got a packet from ping6
Just got a packet from ping
Just got a packet from irq/46-iwlwifi

Here you can see that I was running ping and ping6, and that the WiFi driver had just received a packet: the echo reply.

Now let’s start adding some useful data/filters.

In this article, I won't focus on performance; this demonstrates the power and limitations of eBPF better. To make it (significantly) faster, we could use the packet size as an early filter, assuming no "weird" IP options are set. As it stands, running this sample program will slow down your network traffic.
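As a sketch of what such a filter could look like (this is my own illustration, not part of the article's code, and the 84/98 byte values are assumptions matching the standard ping traffic seen earlier), the tracepoints already expose a len field that a probe could check before parsing anything:

// Cheap pre-filter sketch: skip packets whose size cannot be a standard
// 64-byte ICMP echo (84 bytes at the IP level, 98 with the Ethernet header).
// The sizes are assumptions; other payload sizes or "weird" IP options would
// require adjusting or removing this check.
TRACEPOINT_PROBE(net, net_dev_queue) {
    if (args->len != 98 && args->len != 84) {
        return 0;   // almost certainly not our ping traffic
    }
    return do_trace(args, (struct sk_buff*)args->skbaddr);
}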

Please note: To limit the length of this post, I will focus on the C/eBPF part here. I’ll put a link to the full source code at the end of the post.

Add network interface information

First, you can safely remove the comm field, the code that loads it, and the sched.h include. They're of no real use here, sorry.

You can then include net/inet_sock.h to get the necessary declarations and add char ifname[IFNAMSIZ]; to the event structure.
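For reference, here is a sketch of what the event structure looks like at this point (only ifname is kept; the matching Python ctypes definition also needs an ("ifname", ct.c_char * 16) entry, since IFNAMSIZ is 16 on Linux):

#include <bcc/proto.h>
#include <net/inet_sock.h>   // brings in the declarations we need, including IFNAMSIZ

struct route_evt_t {
    char ifname[IFNAMSIZ];   // network interface name, loaded from skb->dev->name below
};
BPF_PERF_OUTPUT(route_evt);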

Now we will load the device name from the device structure. This is interesting because it is genuinely useful information, and it demonstrates, at a manageable scale, the technique for loading arbitrary data:


// Get the device pointer
struct net_device *dev;
bpf_probe_read(&dev, sizeof(skb->dev), ((char*)skb) + offsetof(typeof(*skb), dev));

// Load the network interface name
bpf_probe_read(&evt.ifname, IFNAMSIZ, dev->name);

You can test it; it works as-is. Just don't forget to add the matching code on the Python side.

Okay, how does this work? In order to load the interface name, we need the interface's device structure. I'll start with the last statement, since it is the easiest to understand; the previous one is really just a more intricate version of it. It uses bpf_probe_read to read data of length IFNAMSIZ from dev->name and copy it into evt.ifname. The first line follows exactly the same logic: it loads the value of the skb->dev pointer into dev. Unfortunately, I could not find another way to get the field address without the offsetof / typeof trick.

As a reminder, the goal of eBPF is to allow safe scripting of the kernel. This means random memory accesses are forbidden: all memory accesses must be validated. Unless the memory you access is on your stack, you have to use the bpf_probe_read accessor. This makes the code cumbersome to read and write, but it also makes it safe. bpf_probe_read is defined in bpf_trace.c in the kernel. The interesting parts are:

  1. It is similar to memcpy, so be aware of the performance cost of copying.
  2. If an error occurs, it returns a buffer initialized to 0 and reports an error. It will not crash or stop the program.

To keep the rest of the code readable, I'll wrap this access pattern in the following macro:

#define member_read(destination, source_struct, source_member)                              \
    do {                                                                                     \
        bpf_probe_read(                                                                      \
            destination,                                                                     \
            sizeof(source_struct->source_member),                                            \
            ((char*)source_struct) + offsetof(typeof(*source_struct), source_member)         \
        );                                                                                   \
    } while (0)

This allows us to write:


member_read(&dev, skb, dev);

Great!

Add network namespace ID

This is probably the most valuable piece of information. In itself, it justifies all this effort. Unfortunately, it is also the hardest to load.

Namespace identifiers can be loaded from two places:

  • the socket's 'sk' structure
  • the device's 'dev' structure

Initially, I used the socket structure, as that's the one I had used when writing solisten.py. Unfortunately, and I'm not sure why, the namespace identifier is no longer readable once the packet crosses a namespace boundary. The field reads as all 0s, a clear indicator of an invalid memory access (remember how bpf_probe_read behaves on errors), and that defeats the whole purpose.

Fortunately, the device approach works. Think of it as asking the packet which interface it is on, and then asking that interface which namespace it belongs to.


struct net* net;
// Get netns id. Equivalent to: evt.netns = dev->nd_net.net->ns.inum
possible_net_t *skc_net = &dev->nd_net;
member_read(&net, skc_net, net);
struct ns_common* ns = member_address(net, ns);
member_read(&evt.netns, ns, inum);

This uses the following additional macro for readability:


#define member_address(source_struct, source_member)                                         \
    ({                                                                                       \
        void* __ret;                                                                         \
        __ret = (void*) (((char*)source_struct) + offsetof(typeof(*source_struct), source_member)); \
        __ret;                                                                               \
    })

Put the pieces together and… DONE!


$> sudo python ./tracepkt.py
[ 4026531957 ] docker0
[ 4026531957 ] vetha373ab6
[ 4026532258 ] eth0
[ 4026532258 ] eth0
[ 4026531957 ] vetha373ab6
[ 4026531957 ] docker0

If you send a ping to the Docker container, you should see this. The packet is passed through the local docker0 bridge and then moved to the veth pair, crossing the network namespace boundary, and the reply returns along the exact opposite path.

That was indeed a tricky one!

Going one step further: tracking only ICMP echo request and echo reply packets

As a bonus, we will also load the source and destination IP addresses of the packet. Either way, we have to read the IP header. I'll stick to IPv4 here, but the same logic applies to IPv6.

The bad news is that nothing is ever truly simple. Remember, we are in the kernel, on the network path, where some packets have not been parsed yet. That means some header offsets are still uninitialized. We will have to compute them all ourselves, from the MAC header to the IP header and finally the ICMP header.

Let's start by loading the MAC header address and deriving the IP header address from it. We won't load the MAC header itself; we'll simply assume it is 14 bytes long.


// Compute the MAC header address
char* head;
u16 mac_header;
member_read(&head, skb, head);
member_read(&mac_header, skb, mac_header);

// Compute the IP header address
#define MAC_HEADER_SIZE 14
char* ip_header_address = head + mac_header + MAC_HEADER_SIZE;

This basically means that the IP header starts at skb->head + skb->mac_header + MAC_HEADER_SIZE.

Now we can decode the IP version in the IP header, the first 4 bits of the first byte, to make sure it is IPv4:


// Load the IP protocol version
u8 ip_version;
bpf_probe_read(&ip_version, sizeof(u8), ip_header_address);
ip_version = ip_version >> 4 & 0xf;

// Filter IPv4 packets
if (ip_version != 4) {
    return 0;
}

Now we can load the full IP header, extract the IP addresses to make the Python output more useful, make sure the next header is ICMP, and derive the ICMP header offset. Here are all of these steps:


// Load the IP header
struct iphdr iphdr;
bpf_probe_read(&iphdr, sizeof(iphdr), ip_header_address);

// Load protocol and addresses
u8 icmp_offset_from_ip_header = iphdr.ihl * 4;
evt.saddr[0] = iphdr.saddr;
evt.daddr[0] = iphdr.daddr;

// Filter ICMP packets
if (iphdr.protocol != IPPROTO_ICMP) {
    return 0;
}

Finally, we can load the ICMP header itself, make sure it is an echo request or reply, and load the id and seq fields from it:


// Compute the ICMP header address and load the ICMP header
char* icmp_header_address = ip_header_address + icmp_offset_from_ip_header;
struct icmphdr icmphdr;
bpf_probe_read(&icmphdr, sizeof(icmphdr), icmp_header_address);

// Filter ICMP echo request and echo reply
if (icmphdr.type != ICMP_ECHO && icmphdr.type != ICMP_ECHOREPLY) {
    return 0;
}

// Get ICMP info
evt.icmptype = icmphdr.type;
evt.icmpid   = icmphdr.un.echo.id;
evt.icmpseq  = icmphdr.un.echo.sequence;

// Fix endianness
evt.icmpid  = be16_to_cpu(evt.icmpid);
evt.icmpseq = be16_to_cpu(evt.icmpseq);

That’s all!

If you want to filter the ICMP packets of a specific ping instance, you can rely on evt.icmpid being the PID of the ping process, at least for Linux's ping.
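For example, a minimal sketch of such a filter (my own addition, not part of the article's code), placed right after the endianness fix above: the Python side could substitute a concrete PID for a placeholder in the C source before loading it, a common bcc pattern.

// Hypothetical filter: the Python side substitutes a concrete PID for the
// TARGET_ICMP_ID placeholder before loading the program, e.g.
//   bpf_text = bpf_text.replace('TARGET_ICMP_ID', str(ping_pid))
// evt.icmpid has already been converted to host byte order at this point.
if (evt.icmpid != TARGET_ICMP_ID) {
    return 0;   // not the ping instance we are interested in
}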

It’s time to show off!

Start the program, then run some “ping” commands in another terminal and observe the results:


# ping -4 localhost
[ 4026531957 ] lo request #20212.001 127.0.0.1 -> 127.0.0.1
[ 4026531957 ] lo request #20212.001 127.0.0.1 -> 127.0.0.1
[ 4026531957 ] lo reply #20212.001 127.0.0.1 -> 127.0.0.1
[ 4026531957 ] lo reply #20212.001 127.0.0.1 -> 127.0.0.1

An ICMP echo request sent by process 20212 (the ICMP id chosen by Linux's ping) goes out through the loopback interface and is delivered to that very same loopback interface, where the echo reply is generated and sent back. The loopback interface is both the emitting and the receiving interface.

What about WiFi gateways?


# ping -4 192.168.43.1
[ 4026531957 ] wlp2s0 request #20710.001 192.168.43.191 -> 192.168.43.1
[ 4026531957 ] wlp2s0 reply #20710.001 192.168.43.1 -> 192.168.43.191

In this case, the echo request and echo reply go through the WiFi interface. Easy.

On a slightly unrelated note, remember when we were only printing the "comm" of the process owning the packet? In this case, the echo request belongs to the ping process, while the reply belongs to the WiFi driver, since, as far as Linux is concerned, the WiFi driver is the one generating the reply.

The last one, and my personal favorite, is pinging a Docker container. This is not because of Docker, but because it best demonstrates the power of eBPF. It allows building an “x-ray”-like tool for analyzing pings.


# ping -4 172.17.0.2
[ 4026531957 ] docker0 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026531957 ] vetha373ab6 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026532258 ] eth0 request #17146.001 172.17.0.1 -> 172.17.0.2
[ 4026532258 ] eth0 reply #17146.001 172.17.0.2 -> 172.17.0.1
[ 4026531957 ] vetha373ab6 reply #17146.001 172.17.0.2 -> 172.17.0.1
[ 4026531957 ] docker0 reply #17146.001 172.17.0.2 -> 172.17.0.1

After some processing, it now looks like this:


       Host netns           | Container netns
+---------------------------+-------------------+
| docker0 ---> veth0e65931 ---> eth0            |
+---------------------------+-------------------+

Last words

eBPF/bcc lets us write a new range of tools for deep troubleshooting, tracking, and tracing of issues in places that were previously unreachable without patching the kernel. Tracepoints are also quite handy: they give good hints about interesting locations, removing the need to tediously read kernel code, and they can be placed in parts of the code that kprobes cannot reach, such as inline or static functions.

To go further, we could add IPv6 support. This is easy to do, and I'll leave it as an exercise for the reader. Ideally, I'd also like to measure the performance impact, but this post is already very long. It could also be interesting to improve this tool by tracing routing and iptables decisions as well as ARP packets. All of this would turn it into the perfect "x-ray" packet tracer for people like me who occasionally have to deal with complex Linux network setups.
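For the curious, here is a rough sketch of what the IPv6 branch could look like (the real implementation lives in the repository linked below; this sketch uses the standard kernel struct ipv6hdr from linux/ipv6.h, assumes the event's saddr/daddr fields were widened to hold 128-bit addresses, and ignores IPv6 extension headers for simplicity):

// Sketch of an IPv6 branch alongside the existing IPv4 one. The base IPv6
// header is a fixed 40 bytes; extension headers are not handled here.
if (ip_version == 6) {
    struct ipv6hdr ipv6hdr;
    bpf_probe_read(&ipv6hdr, sizeof(ipv6hdr), ip_header_address);

    // Only keep ICMPv6 directly following the base header
    if (ipv6hdr.nexthdr != IPPROTO_ICMPV6) {
        return 0;
    }

    // saddr/daddr are 128 bits each, hence the wider event fields
    __builtin_memcpy(evt.saddr, &ipv6hdr.saddr, sizeof(ipv6hdr.saddr));
    __builtin_memcpy(evt.daddr, &ipv6hdr.daddr, sizeof(ipv6hdr.daddr));

    // The ICMPv6 header can then be located and parsed much like the
    // ICMPv4 one, at ip_header_address + sizeof(struct ipv6hdr).
}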

As promised, you can view the full code (with IPv6 support) on Github: https://github.com/yadutaf/tracepkt
