Userspace Networking with DPDK
DPDK is a fully open-source project that operates in userspace. It's a multi-vendor and multi-architecture project, and it aims at achieving the high I/O performance and high packet-processing rates that are among the most important requirements in the networking arena. It was created by Intel in 2010 and moved to the Linux Foundation in April 2017, a move that positioned it as one of the most dominant and most important open-source Linux projects. DPDK was created for the telecom/datacom infrastructure, but today, it's used almost everywhere, including the cloud, data centers, appliances, containers and more. In this article, I present a high-level overview of the project and discuss features that were released in DPDK 17.08 (August 2017).
Undoubtedly, a lot of effort in many networking projects is geared toward achieving high speed and high performance. Several factors contribute to achieving this goal with DPDK. One is that DPDK is a userspace application that bypasses the heavy layers of the Linux kernel networking stack and talks directly to the network hardware. Another factor is the use of memory hugepages. By using hugepages (of 2MB or 1GB in size), a smaller number of memory pages is needed than when using standard memory pages (which on many platforms are 4k in size). As a result, the number of Translation Lookaside Buffer (TLB) misses is reduced significantly, and performance is increased. Yet another factor is that low-level optimizations are done in the code, some of them related to memory cache line alignment, aiming at achieving optimal cache use, prefetching and so on. (Delving into the technical details of those optimizations is outside the scope of this article.)
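For example, reserving 2MB hugepages and mounting hugetlbfs so that a DPDK application can use them can be done roughly as follows (the page count of 1024 and the /mnt/huge mountpoint are arbitrary choices for illustration):
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge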
DPDK has gained popularity in recent years, and it's used in many open-source projects. Many Linux distributions (Fedora, Ubuntu and others) have included DPDK support in their packaging systems as well.
The core DPDK ingredients are libraries and drivers, also known as Poll Mode Drivers (PMDs). There are more than 35 libraries at the time of this writing. These libraries abstract away the low-level implementation details, which provides flexibility as each vendor implements its own low-level layers.
The DPDK Development Model
DPDK is written mostly in C, but the project also has a few tools that are written in Python. All code contributions to DPDK are done by patches sent and discussed over the dpdk-dev mailing list. Patches aiming at getting feedback first are usually titled RFCs (Request For Comments). In order to keep the code as stable as possible, the preference is to preserve the ABI (Application Binary Interface) whenever possible. When it seems that there's no other choice, developers should follow a strict ABI deprecation process, including announcing the requested ABI changes over the dpdk-dev mailing list ahead of time. The ABI changes that are approved and merged are documented in the Release Notes. When acceptance of new features is in doubt, but the respective patches are merged into the master tree anyway, they are tagged as "EXPERIMENTAL". This means that those patches may be changed or even removed without prior notice. For example, new experimental rte_bus APIs were added in DPDK 17.08. I also should note that whenever patches for a new, generic API (which should support multiple hardware devices from different vendors) are sent over the mailing list, it's usually expected that at least one hardware device that supports the new feature is available on the market (if the device is merely announced and not yet available, developers can't test it).
There's a technical board of nine members from various companies (Intel, NXP, 6WIND, Cavium and others). Meetings typically are held every two weeks over IRC, and the minutes of those meetings are posted on the dpdk-dev mailing list.
As with other large open-source projects, there are community-driven DPDK events across the globe every year. First, there are various DPDK Summits. Among them, DPDK Summit Userspace is focused on being more interactive and on getting feedback from the community. There also are several DPDK meetups in different locations around the world. Moreover, from time to time there is an online community survey, announced over the dpdk-dev mailing list, in order to get feedback from the community, and everyone can participate in it.
The DPDK website hosts the master DPDK repo, but several other repos are dedicated to new features. Several tools and utilities exist in the DPDK tree, among them the dpdk-devbind.py script, which is used to associate a network device or a crypto device with DPDK, and testpmd, which is a CLI tool for various tasks, such as forwarding, monitoring statistics and more. There are almost 50 sample applications under the "examples" folder, bundled with full, detailed documentation.
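For instance, a typical way to start testpmd in interactive mode after building DPDK looks roughly like this (the core list, the number of memory channels and the build path are assumptions that depend on your setup):
./build/app/testpmd -l 0-3 -n 4 -- -i
From the testpmd prompt, you then can run commands such as show port stats all, or start to begin forwarding.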
Apart from DPDK itself, the DPDK site hosts several other open-source projects. One is the DPDK Test Suite (DTS), which is a Python-based testing framework for DPDK. It has more than 100 test modules for various features, including the most advanced and most recent ones. It runs with IXIA and Scapy traffic generators. It includes both functional and benchmarking tests, and it's very easy to configure and run, as you need to set up only three or four configuration files. You also can set the DPDK version with which you want to run it. DTS development is handled over a dedicated mailing list, and it currently has support for Intel NICs and Mellanox NICs.
DPDK is released every three months. This release cadence is designed to allow DPDK to keep evolving at a rapid pace while giving enough opportunity to review, discuss and improve the contributions. There are usually 3–5 release candidates (RCs) before the final release. For the 17.08 release, there were 1,023 patches from 125 authors, including patches from Intel, Cavium, 6WIND, NXP and others. The release numbers follow the Ubuntu versioning convention. A Long Term Stable (LTS) release is maintained for two years. Plans for future LTS releases currently are being discussed in the DPDK community; the plan is to make every .11 release in an even-numbered year (16.11, 18.11 and so forth) an LTS release and to maintain it for two years.
Recent Features and New Ideas
Several interesting features were added last year. One of the most fascinating capabilities (added in DPDK 17.05, with new features enabled in 17.08 and 17.11) is "Dynamic Device Personalization" (DDP) for the Intel I40E driver (10Gb/25Gb/40Gb). This feature allows applying a per-device profile to the I40E firmware dynamically. You can load a profile by running a testpmd CLI command (ddp add), and you can remove it with ddp del. You also can apply or remove profiles while traffic is flowing, with only a small number of packets dropped while a profile is being handled. These profiles are created by Intel and not by customers, as I40E firmware programming requires deep knowledge of the I40E device internals.
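For instance, loading and later removing a profile from the testpmd prompt looks roughly like this (the port number and the profile filename are placeholders; the actual profile packages are provided by Intel):
testpmd> ddp add 0 ./profile.pkgo
testpmd> ddp del 0 ./profile.pkgo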
Other features to mention include Bruce Richardson's patches providing a more efficient build system for DPDK based on meson and ninja, a new kernel module called Kernel Control Path (KCP), port representors and more.
DPDK and Networking Projects
DPDK is used in various important networking projects. The list is quite long, but I want to mention a few of them briefly:
- Open vSwitch (OvS): the OvS project implements a virtual network switch. It was transitioned to the Linux Foundation in August 2016 and gained a lot of popularity in the industry. DPDK was first integrated into OvS 2.2 in 2015. Later, in OvS 2.4, support for vhost-user, which is a virtual device, was added. Support for advanced features like multiqueue and NUMA awareness was added in subsequent releases.
- Contrail vRouter: Contrail Systems was a startup that developed SDN controllers. Juniper Networks acquired it in 2012 and later released the Contrail vRouter as an open-source project. It uses DPDK to achieve better network performance.
- pktgen-dpdk: an open-source traffic generator based on DPDK (hosted on the DPDK site).
- TRex: a stateful and stateless open-source traffic generator based on DPDK.
- Vector Packet Processing (VPP): an FD.io project.
For those who are newcomers to DPDK, both users and developers, there is excellent documentation hosted on the DPDK site. It's recommended that you actually try running several of the sample applications (following the "Sample Applications User Guides"), starting with the "Hello World" application. It's also a good idea to follow the dpdk-users mailing list on a regular basis. For those who are interested in development, the Programmer's Guide is a good source of information about the architecture and development environment, and developers should follow the dpdk-dev mailing list as well.
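For example, with the make-based build system of DPDK 17.08, building and running the "Hello World" sample could look like this (the RTE_SDK path and the target name are assumptions that depend on where and how you built DPDK):
export RTE_SDK=/path/to/dpdk-17.08
export RTE_TARGET=x86_64-native-linuxapp-gcc
cd $RTE_SDK/examples/helloworld
make
./build/helloworld -l 0-1 -n 4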
DPDK and SR-IOV Example
I want to conclude this article with a very basic example (based on SR-IOV) of how to create a DPDK VF and how to attach it to a VM with qemu. I also show how to create a non-DPDK VF ("kernel VF"), attach it to a VM, run a DPDK app on that VF and communicate with it from the host.
As a preparation step, you need to enable IOMMU and virtualization on the host. To do so, add intel_iommu=on iommu=pt as kernel parameters to the kernel command line (in grub.cfg), and also enable virtualization and VT-d in the BIOS (VT-d stands for "Intel Virtualization Technology for Directed I/O").
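On distributions that use GRUB2, a minimal sketch of adding those parameters could look like this (the exact file and update command vary between distributions):
# Append intel_iommu=on iommu=pt to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
update-grub        # Debian/Ubuntu; on Fedora/RHEL: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot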
You'll use the Intel I40E network interface card for this example. The I40E
device driver supports up to 128 VFs per device, divided equally across ports,
so if you have a quad-port I40E NIC, you can create up to 32 VFs on each port.
For this example, I also show a simple usage of the testpmd CLI, as mentioned earlier. This example is based on DPDK 17.08, the most recent release of DPDK at the time of this writing. In this example, you'll use Single Root I/O Virtualization (SR-IOV), which is an extension of the PCI Express (PCIe) specification that allows sharing a single physical PCI Express resource across several virtual environments. This technology is very popular in data-center/cloud environments, and many network adapters and their drivers support it. I should note that SR-IOV is not limited to network devices, but is available for other PCI devices as well, such as graphics cards.
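You can check from the host whether (and to what extent) a given device supports SR-IOV, for example (assuming the PF used below, whose PCI address is 0000:07:00.0):
lspci -s 07:00.0 -vvv | grep -i "Single Root"
cat /sys/bus/pci/devices/0000:07:00.0/sriov_totalvfs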
DPDK VF
You create DPDK VFs by writing the number of requested VFs into a DPDK sysfs entry called max_vfs. Say that eth8 is the PF on top of which you want to create a VF and its PCI address is 0000:07:00.0. (You can fetch the PCI address with ethtool -i <ethDeviceName> | grep bus-info.)
The following is the sequence you run on the host in order to create a VF and launch a VM.
First, bind the PF to DPDK with usertools/dpdk-devbind.py, for example:
modprobe uio
insmod /build/kmod/igb_uio.ko
./usertools/dpdk-devbind.py -b igb_uio 0000:07:00.0
Then, create two DPDK VFs with:
echo 2 > /sys/bus/pci/devices/0000:07:00.0/max_vfs
You can verify that the two VFs were created by this operation by checking whether two new entries were added when running lspci | grep "Virtual Function", or by verifying that you now have two new symlinks under /sys/bus/pci/devices/0000:07:00.0/ for the two newly created VFs: virtfn0 and virtfn1.
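For instance, listing those symlinks can be done with:
ls -l /sys/bus/pci/devices/0000:07:00.0/ | grep virtfn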
Next, launch the VMs via qemu using PCI Passthrough, for example:
qemu-system-x86_64 -enable-kvm -cpu host \
-drive file=Ubuntu_1604.qcow2,index=0,media=disk,format=qcow2 \
-smp 5 -m 2048 -vga qxl \
-vnc :1 \
-device pci-assign,host=0000:07:02.0 \
-net nic,macaddr=00:00:00:99:99:01 \
-net tap,script=/etc/qemu-ifup
Note: qemu-ifup is a shell script that's invoked when the VM is launched, usually for setting up networking.
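A minimal qemu-ifup sketch, assuming a pre-existing Linux bridge named br0 (the bridge name and this particular setup are assumptions; adjust them to your environment), could look like this:
#!/bin/sh
# $1 is the name of the tap interface that qemu passes to the script
ip link set "$1" up
ip link set "$1" master br0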
Next, you can start a VNC client (such as RealVNC client) to access the VM, and from there, you can verify that the VF was indeed assigned to it, with lspci -n. You should see a single device, which has "8086 154c" as the vendor ID/device ID combination; "8086 154c" is the virtual function PCI ID of the I40E NIC.
You can launch a DPDK application in the guest on top of that VF.
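For example, binding that VF inside the guest to igb_uio and starting testpmd on it could look like the following (the guest PCI address 00:04.0 and the build paths are assumptions that depend on your guest setup):
modprobe uio
insmod ./build/kmod/igb_uio.ko
./usertools/dpdk-devbind.py -b igb_uio 00:04.0
./build/app/testpmd -l 0-1 -n 4 -- -i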
To conclude this example, let's create a kernel VF on the host, run a DPDK application on top of it in the VM, and then look at a simple interaction with the host PF.
First, create two kernel VFs with:
echo 2 > /sys/bus/pci/devices/0000:07:00.0/sriov_numvfs
Here again, you can verify that these two VFs were created by running lspci | grep "Virtual Function".
Next, run this sequence:
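# VF_PCI_0 holds the full PCI address of the first VF (0000:07:02.0 in this example)
VF_PCI_0=0000:07:02.0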
echo "8086 154c" > /sys/bus/pci/drivers/pci-stub/new_id
echo $VF_PCI_0 > /sys/bus/pci/devices/$VF_PCI_0/driver/unbind
echo $VF_PCI_0 > /sys/bus/pci/drivers/pci-stub/bind
Then launch the VM the same way as before, with the same qemu-system-x86_64 command mentioned earlier.
Again, in the guest, you should be able to see the I40E VF with lspci -n.
On the host, running ip link show will show the two VFs of eth8: vf 0 and vf 1. You can set the MAC address of a VF from the host with ip link set, for example:
ip link set eth8 vf 0 mac 00:11:22:33:44:55
Then, when you run a DPDK application like testpmd in the guest and run, for example, show port info 0 from the testpmd CLI, you'll see that the MAC address you set on the host is indeed reflected for that VF in DPDK.
This article provides a high-level overview of the DPDK project, which is growing dynamically and gaining popularity in the industry. The near future likely will bring support for more network interfaces from different vendors, as well as new features.