#+OPTIONS: ^:nil

* Protocol for atomic loading of multi-prog dispatchers

With the support for the =freplace= program type, it is possible to load multiple XDP programs on a single interface by building a /dispatcher/ program which will run on the interface, and which will call the component XDP programs as functions using the =freplace= type. For this to work in an interoperable way, applications need to agree on how to attach their XDP programs using this mechanism. This document outlines the protocol implemented by =libxdp=, serving as both documentation and a blueprint for anyone else who wants to implement the same protocol and interoperate.

** Generating a dispatcher

The dispatcher is simply an XDP program that will call each of a number of stub functions in turn, and depending on their return code either continue on to the next function or return immediately. These stub functions are then replaced at load time with the user XDP programs, using the =freplace= functionality.

*** Dispatcher format

The dispatcher XDP program contains the main function with the dispatcher logic, 10 stub functions that can be replaced by component BPF programs, and a configuration structure that is used by the dispatcher logic. In =libxdp=, this dispatcher is generated by [[https://github.com/xdp-project/xdp-tools/blob/master/lib/libxdp/xdp-dispatcher.c.in][an M4 macro file]] which expands to the following:

#+begin_src C
#define XDP_METADATA_SECTION "xdp_metadata"
#define XDP_DISPATCHER_VERSION 1
#define XDP_DISPATCHER_RETVAL 31
#define MAX_DISPATCHER_ACTIONS 10

struct xdp_dispatcher_config {
        __u8 num_progs_enabled;
        __u32 chain_call_actions[MAX_DISPATCHER_ACTIONS];
        __u32 run_prios[MAX_DISPATCHER_ACTIONS];
};

/* While 'const volatile' sounds a little like an oxymoron, there's reason
 * behind the madness:
 *
 * - const places the data in rodata, where libbpf will mark it as read-only
 *   and frozen on program load, letting the kernel do dead code elimination
 *   based on the values.
 *
 * - volatile prevents the compiler from optimising away the checks based on
 *   the compile-time value of the variables, which is important since we will
 *   be changing the values before loading the program into the kernel.
 */
static volatile const struct xdp_dispatcher_config conf = {};

/* The volatile return value prevents the compiler from assuming it knows the
 * return value and optimising based on that.
 */
__attribute__ ((noinline))
int prog0(struct xdp_md *ctx)
{
        volatile int ret = XDP_DISPATCHER_RETVAL;

        if (!ctx)
                return XDP_ABORTED;
        return ret;
}

/* the above is repeated as prog1...prog9 */

SEC("xdp/dispatcher")
int xdp_dispatcher(struct xdp_md *ctx)
{
        __u8 num_progs_enabled = conf.num_progs_enabled;
        int ret;

        if (num_progs_enabled < 1)
                goto out;
        ret = prog0(ctx);
        if (!((1U << ret) & conf.chain_call_actions[0]))
                return ret;

        /* the above is repeated for prog1...prog9 */
out:
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
__uint(dispatcher_version, XDP_DISPATCHER_VERSION) SEC(XDP_METADATA_SECTION);
#+end_src

The dispatcher program is pre-compiled and distributed with =libxdp=. Because the configuration struct is marked as =const= in the source file, it will be put into the =rodata= section, which libbpf will turn into a read-only (frozen) map on load. This allows the kernel verifier to perform dead code elimination based on the values in the map. This is also the reason for the =num_progs_enabled= member of the config struct: together with the checks in the main dispatcher function, the verifier will effectively remove all the stub function calls not being used, without having to rely on dynamic compilation.

When generating a dispatcher, this BPF object file is opened and the configuration struct is populated before the object is loaded. As a forward compatibility measure, =libxdp= will also check for the presence of the =dispatcher_version= field in the =xdp_metadata= section (encoded like the program metadata described in "Processing program metadata" below), and if it doesn't match the expected version (currently only version 1 exists), will abort any action.
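To make this concrete, the following is a minimal sketch (not =libxdp='s actual implementation) of how an application might open the pre-compiled dispatcher object, overwrite the configuration struct in its rodata, and load it with libbpf. The function name and error handling are illustrative, and the sketch assumes libbpf's 1.0-style convention where =bpf_object__open_file()= returns NULL on failure:

#+begin_src C
#include <string.h>
#include <bpf/libbpf.h>

/* Sketch only: open the pre-compiled dispatcher object, overwrite the
 * configuration struct in its rodata, and load it. Assumes the config struct
 * is the only content of the rodata section, and that the
 * 'struct xdp_dispatcher_config' definition from the dispatcher source above
 * is available. */
static struct bpf_object *
load_dispatcher(const char *obj_path, const struct xdp_dispatcher_config *conf)
{
        struct bpf_object *obj;
        struct bpf_map *map;

        obj = bpf_object__open_file(obj_path, NULL);
        if (!obj)
                return NULL;

        /* The 'const volatile' config struct ends up in the .rodata map of
         * the object; overwrite its initial value before loading. */
        bpf_object__for_each_map(map, obj) {
                if (strstr(bpf_map__name(map), ".rodata")) {
                        bpf_map__set_initial_value(map, conf, sizeof(*conf));
                        break;
                }
        }

        /* On load, libbpf freezes the rodata map, allowing the verifier to
         * perform dead code elimination based on num_progs_enabled. */
        if (bpf_object__load(obj)) {
                bpf_object__close(obj);
                return NULL;
        }
        return obj;
}
#+end_src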
*** Populating the dispatcher configuration map

On loading, the dispatcher configuration map is populated as follows:

- The =num_progs_enabled= member is simply set to the number of active programs that will be attached to this dispatcher. The two other fields contain per-component program metadata, which is read from the component programs as explained in the "Processing program metadata" section below.
- The =chain_call_actions= array is populated with a bitmap signifying which XDP actions (return codes) of each component program should be interpreted as a signal to continue execution of the next XDP program. For instance, a packet filtering program might designate that an =XDP_PASS= action should make execution continue, while other return codes should immediately end the call chain and return. The special =XDP_DISPATCHER_RETVAL= (which is set to 31, corresponding to the topmost bit in the bitmap) is always included in each program's =chain_call_actions=; this value is returned by the stub functions, which ensures that should a component program become detached, processing will always continue past the stub function.
- The =run_prios= array contains the effective run priority of each component program when it was installed. This is also read as program metadata, but because it can be overridden at load time, the effective value is stored in the configuration array so it can be carried forward when the dispatcher is replaced. Component programs are expected to be sorted in order of their run priority (as explained below in "Loading and attaching component programs"). An example of a populated configuration is sketched after this list.
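As an illustration, a populated configuration for a dispatcher with two component programs might look like the following. The values are invented for this example, and the struct and =XDP_DISPATCHER_RETVAL= definitions are the ones from the dispatcher source above:

#+begin_src C
#include <linux/bpf.h>   /* enum xdp_action: XDP_PASS, XDP_TX, ... */

/* Hypothetical example configuration for two component programs:
 *  - slot 0: run priority 10, chain call continues on XDP_PASS
 *  - slot 1: run priority 50, chain call continues on XDP_PASS or XDP_TX
 * Bit 31 (XDP_DISPATCHER_RETVAL) is always set, so execution continues past a
 * stub whose component program has been detached. */
static const struct xdp_dispatcher_config example_conf = {
        .num_progs_enabled = 2,
        .chain_call_actions = {
                [0] = (1U << XDP_PASS) | (1U << XDP_DISPATCHER_RETVAL),
                [1] = (1U << XDP_PASS) | (1U << XDP_TX) |
                      (1U << XDP_DISPATCHER_RETVAL),
        },
        .run_prios = { [0] = 10, [1] = 50 },
};
#+end_src

The XDP action values themselves are small integers (=XDP_ABORTED= through =XDP_REDIRECT= from =enum xdp_action=), so each fits as a bit position in the 32-bit bitmap, with bit 31 reserved for =XDP_DISPATCHER_RETVAL=.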
**** Processing program metadata

As explained above, each component program must specify one or more chain call actions and a run priority on attach. When loading a user program, =libxdp= will attempt to read this metadata from the object file as explained in the following; if no values are found in the object file, a default run priority of 50 will be applied, and =XDP_PASS= will be the only chain call action.

The metadata is read from the object file by looking for BTF-encoded metadata in the =.xdp_run_config= object section, encoded similarly to the BTF-defined maps used by libbpf (in the =.maps= section). Here, =libxdp= will look for a struct definition with the XDP program function name prefixed by an underscore (e.g., if the main XDP function is called =xdp_main=, libxdp will look for a struct definition called =_xdp_main=). In this struct, a member called =priority= encodes the run priority, and each XDP action can be set as a chain call action by adding a struct member with the action's name.

The =xdp_helpers.h= header file included with XDP exposes helper macros that can be used with the existing helpers in =bpf_helpers.h= (from libbpf), so a full run configuration metadata section can be defined as follows:

#+begin_src C
#include <bpf/bpf_helpers.h>
#include <xdp/xdp_helpers.h>

struct {
        __uint(priority, 10);
        __uint(XDP_PASS, 1);
        __uint(XDP_DROP, 1);
} XDP_RUN_CONFIG(my_xdp_func);
#+end_src

This example sets priority 10 with chain call actions =XDP_PASS= and =XDP_DROP= for the XDP program starting at =my_xdp_func()=. This turns into the following BTF information (as shown by =bpftool btf dump=):

#+begin_src
[12] STRUCT '(anon)' size=24 vlen=3
        'priority' type_id=13 bits_offset=0
        'XDP_PASS' type_id=15 bits_offset=64
        'XDP_DROP' type_id=15 bits_offset=128
[13] PTR '(anon)' type_id=14
[14] ARRAY '(anon)' type_id=6 index_type_id=10 nr_elems=10
[15] PTR '(anon)' type_id=16
[16] ARRAY '(anon)' type_id=6 index_type_id=10 nr_elems=1
[17] VAR '_my_xdp_func' type_id=12, linkage=global-alloc
[18] DATASEC '.xdp_run_config' size=0 vlen=1
        type_id=17 offset=0 size=24
#+end_src

The parser will look for the =.xdp_run_config= DATASEC, then follow the types recursively, extracting the field values from the =nr_elems= of the anonymous arrays in type IDs 14 and 16.

While =libxdp= will automatically load any metadata specified as above in the program BTF, the application using =libxdp= can override these values at runtime. These overridden values will be the ones used when determining program order, and will be preserved in the dispatcher configuration map for subsequent operations.

*** Loading and attaching component programs

When loading one or more XDP programs onto an interface (assuming no existing program is found on the interface; for adding programs to an existing dispatcher, see below), =libxdp= first prepares a dispatcher program with the right number of slots by populating the configuration struct as described above. This dispatcher program is then loaded into the kernel.

Having loaded the dispatcher program, =libxdp= then loads each of the component programs. To do this, the list of component programs is first sorted by run priority, forming the final run sequence. Should several programs have the same run priority, ties are broken in the following arbitrary, but deterministic, order (see =cmp_xdp_programs()= [[https://github.com/xdp-project/xdp-tools/blob/master/lib/libxdp/libxdp.c][in libxdp.c]]):

- By XDP function name (=bpf_program__name()= from libbpf)
- By sorting already-loaded programs before not-yet-loaded ones
- For not-yet-loaded programs, by program size
- For loaded programs, by BPF tag value (using =memcmp()=)
- By load time

Before loading, each component program's type is reset to =BPF_PROG_TYPE_EXT= with an expected attach type of 0. Then, the attachment target is set to the dispatcher file descriptor and the BTF ID of the stub function to replace (i.e., the first component program has =prog0()= as its target, and so on). The program is then loaded, at which point the kernel will verify the component program's compatibility with the attach point.

Having loaded the component program, it is attached to the dispatcher by way of =bpf_link_create()=, specifying the same target file descriptor and BTF ID used when loading the program. This returns a link fd, which will be pinned to prevent the attachment from going away when the fd is closed (see "Locking and pinning" below).
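A minimal sketch of this load-and-attach sequence for a single component program is shown below. It is not =libxdp='s actual code: the function name, the way the stub function's BTF ID is obtained (here it is simply passed in by the caller), and the error handling are all simplified assumptions:

#+begin_src C
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Sketch only: load one component program against the dispatcher and attach
 * it with bpf_link_create(). 'prog' is the component program inside an
 * already-opened (but not yet loaded) bpf_object, 'dispatcher_fd' is the fd
 * of the loaded dispatcher, 'stub_name' is e.g. "prog0" and 'stub_btf_id' is
 * that stub's BTF ID in the dispatcher. */
static int attach_component(struct bpf_object *obj, struct bpf_program *prog,
                            int dispatcher_fd, const char *stub_name,
                            __u32 stub_btf_id)
{
        DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
                            .target_btf_id = stub_btf_id);
        int link_fd;

        /* Reset the program type and point the attach target at the stub
         * function in the dispatcher before loading. */
        bpf_program__set_type(prog, BPF_PROG_TYPE_EXT);
        bpf_program__set_expected_attach_type(prog, 0);
        bpf_program__set_attach_target(prog, dispatcher_fd, stub_name);

        if (bpf_object__load(obj))
                return -1;

        /* Attach to the same target; the returned link fd must be pinned to
         * keep the attachment alive after the fd is closed. */
        link_fd = bpf_link_create(bpf_program__fd(prog), dispatcher_fd,
                                  0 /* attach_type */, &opts);
        return link_fd;
}
#+end_src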
*** Locking and pinning

To prevent the kernel from detaching an =freplace= program when the last file descriptor referencing it is closed, the programs must be pinned in =bpffs=. This is done in the =xdp= subdirectory of =bpffs=, which by default means =/sys/fs/bpf/xdp=. If the =LIBXDP_BPFFS= environment variable is set, it overrides the location of the top-level =bpffs=, and the =xdp= subdirectory will be created beneath that path. The pathnames generated for pinning are the following:

- /sys/fs/bpf/xdp/dispatch-IFINDEX-DID - dispatcher program for IFINDEX with BPF program ID DID
- /sys/fs/bpf/xdp/dispatch-IFINDEX-DID/prog0-prog - component program 0, program reference
- /sys/fs/bpf/xdp/dispatch-IFINDEX-DID/prog0-link - component program 0, bpf_link reference
- /sys/fs/bpf/xdp/dispatch-IFINDEX-DID/prog1-prog - component program 1, program reference
- /sys/fs/bpf/xdp/dispatch-IFINDEX-DID/prog1-link - component program 1, bpf_link reference
- etc., up to ten component programs

This means that several pin operations have to be performed for each dispatcher program. Semantically, these operations need to appear atomic, so to make sure every consumer of the hierarchy of pinned files gets a consistent view, locking is needed. This is implemented by opening the parent directory =/sys/fs/bpf/xdp= with the =O_DIRECTORY= flag, and obtaining a lock on the resulting file descriptor using =flock(lock_fd, LOCK_EX)=.

When creating a new dispatcher program, it will first be fully populated, with all component programs attached. Then, the programs will be pinned in =bpffs= as specified above, and once this succeeds, the dispatcher will be attached to the interface. If attaching the program fails, the programs will be unpinned again, and the error returned to the caller. This order ensures atomic attachment to the interface, without any risk that component programs will be automatically detached due to a badly timed application crash.

When loading the initial dispatcher program, the =XDP_FLAGS_UPDATE_IF_NOEXIST= flag is set to prevent accidentally overriding any concurrent modifications. If this fails, the whole operation starts over, turning the load into a modification as described below.

** Adding or removing programs from an existing dispatcher

The sections above explain how to generate a dispatcher and attach it to an interface, assuming no existing program is attached. When one or more programs are already attached, a couple of extra steps are required to ensure that the switch is made atomically. Briefly, changing the programs attached to an interface entails the following steps:

- Reading the existing dispatcher program and obtaining references to the component programs.
- Generating a new dispatcher containing the new set of programs (adding or removing the programs needed).
- Atomically swapping out the XDP program attachment on the interface so the new dispatcher takes over from the old one.
- Unpinning and dismantling the old dispatcher.

These operations are each described in turn in the following sections.

*** Reading list of existing programs from the kernel

The first step is to obtain the ID of the XDP program currently loaded on the interface, using =bpf_get_link_xdp_info()=. A file descriptor to the dispatcher is then obtained using =bpf_prog_get_fd_by_id()=, and the BTF information attached to the program is obtained from the kernel. This is checked for the presence of the dispatcher version field (as explained above), and the operation is aborted if this is not present, or doesn't match what the library expects.
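A sketch of this first step might look like the following. The helper name and error handling are illustrative, and it uses the libbpf APIs named above together with =btf__load_from_kernel_by_id()= (newer libbpf versions replace =bpf_get_link_xdp_info()= with =bpf_xdp_query()=):

#+begin_src C
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <bpf/btf.h>
#include <bpf/libbpf.h>

/* Sketch only: find the XDP program attached to 'ifindex' and load its BTF
 * from the kernel, so the caller can look for the xdp_metadata DATASEC and
 * the dispatcher_version field in it. */
static struct btf *get_dispatcher_btf(int ifindex, int *prog_fd)
{
        struct bpf_prog_info info = {};
        __u32 info_len = sizeof(info);
        struct xdp_link_info xinfo = {};

        /* ID of the XDP program currently attached to the interface */
        if (bpf_get_link_xdp_info(ifindex, &xinfo, sizeof(xinfo), 0))
                return NULL;

        /* Get a file descriptor to the program and its kernel BTF ID */
        *prog_fd = bpf_prog_get_fd_by_id(xinfo.prog_id);
        if (*prog_fd < 0)
                return NULL;
        if (bpf_obj_get_info_by_fd(*prog_fd, &info, &info_len))
                return NULL;

        return btf__load_from_kernel_by_id(info.btf_id);
}
#+end_src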
Having thus established that the program loaded on the interface is indeed a compatible dispatcher, the ID of the map containing the configuration struct is obtained from the kernel, and the configuration data is loaded from the map (after checking that the map value size matches the expected configuration struct). Then, the file lock on the directory in =bpffs= is obtained as explained in the "Locking and pinning" section above, and, while holding this lock, file descriptors to each of the component programs and =bpf_link= objects are obtained. The end result is a reference to the full dispatcher structure (and its component programs), corresponding to the one generated on load.

When populating the component program structure in memory, the chain call actions and run priorities from the dispatcher configuration map are used instead of parsing the BTF metadata of each program: this ensures that any modified values specified at load time will be retained instead of being reverted to the values compiled into the BTF metadata.

*** Generating a new dispatcher

Having obtained a reference to the existing dispatcher, =libxdp= takes that and the list of programs to add to or remove from the interface, and simply generates a new dispatcher with the new set of programs. When adding programs, the whole list of programs is sorted according to their run priorities (as explained above), resulting in new programs being inserted in the right place in the existing sequence according to their priority.

Generating this secondary dispatcher relies on the support for multiple attachments of =freplace= programs, which was added in kernel 5.10. This allows the =bpf_link_create()= operation to specify an attachment target in the new dispatcher. In other words, the component programs will briefly be attached to both the old and the new dispatcher, but only one of those will be attached to the interface. After the new dispatcher is complete, its component programs are pinned in =bpffs= as described above.

*** Atomic replace and retry

At this point, =libxdp= has references to both the old dispatcher, already attached to the interface, and the new one with the modified set of component programs. The old dispatcher is then atomically replaced by the new one, using the =XDP_FLAGS_REPLACE= flag in the netlink operation (and the accompanying =IFLA_XDP_EXPECTED_FD= attribute). Once the atomic replace operation succeeds, the old dispatcher is unpinned from =bpffs= and the in-memory references to both the old and new dispatchers are released (since the new dispatcher was already pinned, this will not cause it to be detached from the interface).

Should this atomic replace instead *fail* because the program attached to the interface changed while the new dispatcher was being built, the whole operation is simply started over from the beginning. That is, the new dispatcher is unpinned from =bpffs=, and the in-memory references to both dispatchers are released (but no unpinning of the old dispatcher is performed!). Then, the ID of the program attached to the interface is again read from the kernel, and the operation proceeds from "Reading list of existing programs from the kernel".
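The replace operation itself might be sketched as follows, using the libbpf netlink helpers from the same API generation as the calls named above (newer libbpf versions expose the equivalent functionality through =bpf_xdp_attach()= and its =old_prog_fd= option). The function name is illustrative and this is not =libxdp='s actual code:

#+begin_src C
#include <linux/if_link.h>
#include <bpf/libbpf.h>

/* Sketch only: atomically replace the old dispatcher with the new one on
 * 'ifindex'. If the program attached to the interface is no longer the one
 * referenced by 'old_fd', the kernel rejects the request and the caller
 * starts the whole operation over. */
static int replace_dispatcher(int ifindex, int old_fd, int new_fd)
{
        DECLARE_LIBBPF_OPTS(bpf_xdp_set_link_opts, opts, .old_fd = old_fd);

        return bpf_set_link_xdp_fd_opts(ifindex, new_fd,
                                        XDP_FLAGS_REPLACE, &opts);
}
#+end_src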
** Compatibility with older kernels

The full functionality described above can only be attained with kernel version 5.10 or newer, because this is the version that introduced support for re-attaching an =freplace= program to a secondary attachment point. However, the =freplace= functionality itself was introduced in kernel 5.7, so for kernel versions 5.7 to 5.9, multiple programs can still be attached, as long as they are all attached to the dispatcher immediately as they are loaded.

This is achieved by using =bpf_raw_tracepoint_open()= in place of =bpf_link_create()= when attaching the component programs to the dispatcher. The =bpf_raw_tracepoint_open()= function doesn't take an attach target as a parameter; instead, it simply attaches the =freplace= program to the target that was specified at load time (which is why this only works when all component programs are loaded together with the dispatcher).
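For illustration, the fallback attach on such kernels might look like the following sketch; the function name is an assumption and error handling is omitted:

#+begin_src C
#include <bpf/bpf.h>

/* Sketch only: attach a just-loaded component program on kernels 5.7-5.9,
 * where bpf_link_create() cannot yet target a different freplace attach
 * point. No name is passed: for freplace programs the kernel attaches to the
 * target that was recorded when the program was loaded. */
static int attach_component_legacy(int prog_fd)
{
        return bpf_raw_tracepoint_open(NULL, prog_fd);
}
#+end_src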