Data in a Flash, Part III: NVMe over Fabrics Using TCP
A remote NVMe block device exported via an NVMe over Fabrics network using TCP.
Version 5.0 of the Linux kernel brought with it many wonderful features, one of which was the introduction of NVMe over Fabrics (NVMeoF) across native TCP. If you recall, in the previous part of this series ("Data in a Flash, Part II: Using NVMe Drives and Creating an NVMe over Fabrics Network"), I explained how to enable your NVMe network across RDMA (an InfiniBand protocol) through a method referred to as RDMA over Converged Ethernet (RoCE). As the name implies, RoCE allows RDMA traffic to be carried across a traditional Ethernet network. Although this works well, it introduces a bit of overhead (along with latencies). So when the 5.0 kernel introduced native TCP support for NVMe targets, it simplified the procedure needed to configure the same kind of network shown in my last article, and it also made accessing the remote NVMe drive faster.
Software Requirements
To continue with this tutorial, you'll need to have a 5.0 Linux kernel or later installed, with the following modules built and inserted into the operating systems of both your initiator (the server importing the remote NVMe volume) and the target (the server exporting its local NVMe volume):
# NVME Support
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
# CONFIG_NVME_MULTIPATH is not set
CONFIG_NVME_FABRICS=m
CONFIG_NVME_RDMA=m
# CONFIG_NVME_FC is not set
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_LOOP=m
CONFIG_NVME_TARGET_RDMA=m
# CONFIG_NVME_TARGET_FC is not set
CONFIG_NVME_TARGET_TCP=m
More specifically, you need the module to import the remote NVMe volume:
CONFIG_NVME_TCP=m
And the module to export a local NVMe volume:
CONFIG_NVME_TARGET_TCP=m
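A quick way to confirm that your running kernel was built with these options is to check its configuration file. The path below is common on Debian- and Ubuntu-style distributions; adjust it if your distribution keeps the kernel config elsewhere (some expose it at /proc/config.gz instead):
$ grep -E "NVME_TCP|NVME_TARGET_TCP" /boot/config-$(uname -r)
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET_TCP=m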
Before continuing, make sure your physical (or virtual) machine is up to date. Once that's verified, make sure you can see all locally connected NVMe devices (which you'll export across your network):
$ cat /proc/partitions |grep -e nvme -e major
major minor #blocks name
259 0 3907018584 nvme2n1
259 1 3907018584 nvme3n1
259 2 3907018584 nvme0n1
259 3 3907018584 nvme1n1
If you don't see any connected NVMe devices, make sure the kernel module is loaded:
petros@ubu-nvme1:~$ lsmod|grep nvme
nvme 32768 0
nvme_core 61440 1 nvme
The following modules need to be loaded on the initiator:
$ sudo modprobe nvme
$ sudo modprobe nvme-tcp
And, the following modules need to be loaded on the target:
$ sudo modprobe nvmet
$ sudo modprobe nvmet-tcp
Next, you'll install the drive management utility called nvme-cli. This utility is defined and maintained by the very same NVM Express committee that defined the NVMe specification. You can find the GitHub repository hosting the source code here. A recent build is needed, so clone the source code from the GitHub repository, then build and install it.
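A minimal sketch of that sequence follows, assuming the upstream linux-nvme/nvme-cli repository on GitHub (substitute the URL linked above if your source differs):
$ git clone https://github.com/linux-nvme/nvme-cli.git
$ cd nvme-cli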
$ make
$ make install
Accessing the Drive across a Network over TCP
The purpose of this section is to take high-speed NVMe SSD technology beyond the local server. An NVMe drive does not have to be limited to the server it is physically plugged in to. In this example, and for the sake of convenience, I'm using two virtual machines to create this network. There is absolutely no performance advantage in doing this, and I wouldn't recommend it unless you just want to follow along with the exercise. Realistically, you should enable the following only on physical machines connected by high-speed network cards. Anyway, in the target virtual machine, I attached a couple of low-capacity virtual NVMe drives (2GB each):
$ sudo nvme list
Node SN Model Namespace
-------------- -------------- ---------------------- ---------
/dev/nvme0n1 VB1234-56789 ORCL-VBOX-NVME-VER12 1
/dev/nvme0n2 VB1234-56789 ORCL-VBOX-NVME-VER12 2
Usage Format FW Rev
-------------------------- ---------------- --------
2.15 GB / 2.15 GB 512 B + 0 B 1.0
2.15 GB / 2.15 GB 512 B + 0 B 1.0
[Note: the tabular output above has been modified for readability.]
The following instructions rely heavily on configfs, the kernel's user-space configuration filesystem mounted under /sys/kernel/config. In theory, you could export NVMe targets with the open-source utility nvmet-cli, which does all of that complex heavy lifting. But, where is the fun in that?
Exporting a Target
Mount the kernel user configuration filesystem (configfs). This is a requirement; all of the NVMe Target instructions that follow expect the NVMe Target tree to be available in this filesystem:
$ sudo /bin/mount -t configfs none /sys/kernel/config/
Create an NVMe Target subsystem to host your devices (to export) and change into its directory:
$ sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
$ cd /sys/kernel/config/nvmet/subsystems/nvmet-test
This example will simplify host connections by leaving the newly created subsystem accessible to any and every host attempting to connect to it. In a production environment, you definitely should lock this down to specific host machines by their NQN:
$ echo 1 |sudo tee -a attr_allow_any_host > /dev/null
When a target is exported, it is done so with a "unique" NVMe Qualified Name (NQN). The concept is very similar to the iSCSI Qualified Name (IQN). This NQN is what enables other operating systems to import and use the remote NVMe device across a network potentially hosting multiple NVMe devices.
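If you do want to lock things down, the same configfs tree exposes per-host access control. The following is a minimal sketch: the host NQN shown is a placeholder, and on a machine with nvme-cli installed, the initiator's real NQN typically lives in /etc/nvme/hostnqn (or can be generated with nvme gen-hostnqn):
$ echo 0 |sudo tee -a attr_allow_any_host > /dev/null
$ sudo mkdir /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.example:initiator-01
$ sudo ln -s /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.example:initiator-01
↪/sys/kernel/config/nvmet/subsystems/nvmet-test/allowed_hosts/
With that in place, only an initiator presenting the matching host NQN is allowed to connect to the subsystem.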
Define a subsystem namespace and change into its directory:
$ sudo mkdir namespaces/1
$ cd namespaces/1/
Set a local NVMe device to the newly created namespace:
$ echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null
And enable the namespace:
$ echo 1|sudo tee -a enable > /dev/null
Now, you'll create an NVMe Target port to export the newly created subsystem and change into its directory path:
$ sudo mkdir /sys/kernel/config/nvmet/ports/1
$ cd /sys/kernel/config/nvmet/ports/1
You'll use the IP address of your preferred Ethernet interface when exporting your subsystem (for example, eth0):
$ echo 192.168.1.92 |sudo tee -a addr_traddr > /dev/null
Then, you'll set a few other parameters:
$ echo tcp|sudo tee -a addr_trtype > /dev/null
$ echo 4420|sudo tee -a addr_trsvcid > /dev/null
$ echo ipv4|sudo tee -a addr_adrfam > /dev/null
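Before moving on, it doesn't hurt to read those attributes back and confirm the values took effect (a quick sanity check, nothing more):
$ grep -H . /sys/kernel/config/nvmet/ports/1/addr_*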
And create a softlink to point to the subsystem from your newly created port:
$ sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/
↪/sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
You now should see the following message captured in dmesg:
$ dmesg |grep "nvmet_tcp"
[24457.458325] nvmet_tcp: enabling port 1 (192.168.1.92:4420)
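Note that everything configured through configfs above lives only in kernel memory and will not survive a reboot. The nvmet-cli utility mentioned earlier (the command itself is typically installed as nvmetcli) can save and restore the configuration; the following is a rough sketch, and you should verify the exact invocation and default JSON path against your installed version:
$ sudo nvmetcli save /etc/nvmet/config.json
# ...after a reboot...
$ sudo nvmetcli restore /etc/nvmet/config.json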
Importing a Target
The host machine is currently without an NVMe device:
$ nvme list
Node SN Model Namespace
--------- ------------ ------------------------ ---------
Usage Format FW Rev
-------------- ---------------- --------
[Note: the tabular output above has been modified for readability.]
Scan your target machine for any exported NVMe volumes:
$ sudo nvme discover -t tcp -a 192.168.1.92 -s 4420
Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype: tcp
adrfam: ipv4
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid: 4420
subnqn: nvmet-test
traddr: 192.168.1.92
sectype: none
It must be your lucky day. It looks as if the target machine is exporting one or more volumes. You'll need to remember its subnqn field: nvmet-test. Now connect to the subnqn:
$ sudo nvme connect -t tcp -n nvmet-test -a 192.168.1.92 -s 4420
If you go back to list all NVMe devices, you now should see all those exported by that one subnqn:
$ sudo nvme list
Node SN Model
---------------- -------------------- ------------------------
/dev/nvme1n1 8e0999a558e17818 Linux
Namespace Usage Format FW Rev
--------- ----------------------- ---------------- --------
1 2.15 GB / 2.15 GB 512 B + 0 B 5.0.0-3
[Note: the tabular output above has been modified for readability.]
Verify that it also shows up like your other block device:
$ cat /proc/partitions |grep nvme
259 1 2097152 nvme1n1
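From here, you can treat the device like any locally attached disk; for example, you might put a filesystem on it and mount it (the filesystem type and mount point below are arbitrary, purely for illustration):
$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mkdir -p /mnt/remote-nvme
$ sudo mount /dev/nvme1n1 /mnt/remote-nvme
Just remember to unmount the filesystem before disconnecting the device.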
You can disconnect from the target device by typing:
$ sudo nvme disconnect -d /dev/nvme1n1
Summary
There you have it: a remote NVMe block device exported via an NVMe over Fabrics network using TCP. Now you can write to and read from it like any other locally attached high-performance block device. The fact that you now can map the block device over TCP, without the additional overhead of an RDMA setup, should accelerate adoption of the technology.