Data in a Flash, Part II: Using NVMe Drives and Creating an NVMe over Fabrics Network
By design, NVMe drives provide local access to the machines they are plugged in to; however, the NVMe over Fabrics specification addresses this very limitation by enabling remote network access to those same devices.
This article puts into practice what you learned in Part I and shows how to use NVMe drives in a Linux environment. But, before continuing, you first need to make sure that your physical (or virtual) machine is up to date. Once you verify that to be the case, make sure you're able to see all connected NVMe devices:
$ cat /proc/partitions |grep -e nvme -e major
major minor #blocks name
259 0 3907018584 nvme2n1
259 1 3907018584 nvme3n1
259 2 3907018584 nvme0n1
259 3 3907018584 nvme1n1
Those devices also will appear in sysfs:
$ ls /sys/block/|grep nvme
nvme0n1
nvme1n1
nvme2n1
nvme3n1
If you don't see any connected NVMe devices, make sure the kernel module is loaded:
petros@ubu-nvme1:~$ lsmod|grep nvme
nvme 32768 0
nvme_core 61440 1 nvme
Next, install the drive management utility called nvme-cli. This utility is defined and maintained by the very same NVM Express committee that defined the NVMe specification. The nvme-cli source code is hosted on GitHub. Fortunately, some operating systems offer this package in their internal repositories. Installing it on the latest Ubuntu looks something like this:
petros@ubu-nvme1:~$ sudo add-apt-repository universe
petros@ubu-nvme1:~$ sudo apt update && sudo apt install
↪nvme-cli
Using this utility, you're able to list more details of all connected NVMe drives (note: the tabular output below has been reformatted and truncated to better fit here):
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------------------------------------------------
/dev/nvme0n1 PHLF814001... Dell Express Flash NVMe P4500 4.0TB SFF 1
↪4.00 TB / 4.00 TB 512 B + 0 B QDV1DP12
/dev/nvme1n1 PHLF814300... Dell Express Flash NVMe P4500 4.0TB SFF 1
↪4.00 TB / 4.00 TB 512 B + 0 B QDV1DP12
/dev/nvme2n1 PHLF814504... Dell Express Flash NVMe P4500 4.0TB SFF 1
↪4.00 TB / 4.00 TB 512 B + 0 B QDV1DP12
/dev/nvme3n1 PHLF814502... Dell Express Flash NVMe P4500 4.0TB SFF 1
↪4.00 TB / 4.00 TB 512 B + 0 B QDV1DP12
Note: if you don't have a physical NVMe drive connected to your machine but still want to follow along (in limited form), you can install and simulate an NVMe controller plus drive(s) in the latest VirtualBox virtualization application.
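If you go the VirtualBox route, the setup looks roughly like the following sketch (the VM name ubu-nvme1, the disk filename and the 2GB size are placeholders, and NVMe controller emulation requires a reasonably recent VirtualBox release and, depending on the version, the Oracle Extension Pack):
$ VBoxManage createmedium disk --filename nvme-disk.vdi --size 2048
$ VBoxManage storagectl "ubu-nvme1" --name "nvme" --add pcie
↪--controller NVMe --portcount 1
$ VBoxManage storageattach "ubu-nvme1" --storagectl "nvme" --port 0
↪--device 0 --type hdd --medium nvme-disk.vdi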
Drive Management
Issuing the nvme command on the command line prints an online help menu with a complete list of features and functions, some of which locate and identify various NVMe controllers, drives and their namespaces (a short example follows the list below):
list List all NVMe devices and namespaces on machine
list-subsys List nvme subsystems
id-ctrl Send NVMe Identify Controller
id-ns Send NVMe Identify Namespace, display structure
list-ns Send NVMe Identify List, display structure
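For example, id-ctrl dumps the controller's identify data; a quick way to pull out just the model, serial number and firmware revision fields (labeled mn, sn and fr in the output) is:
$ sudo nvme id-ctrl /dev/nvme0 | grep -E '^(mn|sn|fr) '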
Other features of the nvme-cli utility introduce namespace management:
ns-descs Send NVMe Namespace Descriptor List, display
↪structure
create-ns Creates a namespace with the provided parameters
delete-ns Deletes a namespace from the controller
attach-ns Attaches a namespace to requested controller(s)
detach-ns Detaches a namespace from requested controller(s)
Namespaces are a unique function of the NVMe drive. Think of them as sort of a virtual partition of the physical device. A namespace is a defined quantity of non-volatile memory that can be formatted into logical blocks. When provisioned, one or more namespaces are connected to the controller (or to a host, sometimes remotely). Each can support various block sizes (such as 512 bytes, 4 KB and so on). When defined, they will appear as separate block devices to the host.
If the drive contains a single namespace, listing it will show the following:
$ nvme list-ns /dev/nvme0
[ 0]:0x1
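Creating an additional namespace is a matter of calling create-ns and then attaching the new namespace to the controller. The following is only a hedged sketch: the block counts passed to --nsze and --ncap (in logical blocks) and the controller ID passed to attach-ns are placeholders you'd adjust for your own hardware, and not every drive supports more than one namespace:
$ sudo nvme create-ns /dev/nvme0 --nsze=0x200000 --ncap=0x200000
↪--flbas=0 --dps=0
$ sudo nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0
$ sudo nvme ns-rescan /dev/nvme0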
If you create more namespaces, they will be reflected in the listing:
$ sudo nvme list-ns /dev/nvme0
[ 0]:0x1
[ 1]:0x2
and again in the number of block devices registered by your operating system:
$ cat /proc/partitions |grep nvme0
259 0 1953509292 nvme0n1
259 1 1953509292 nvme0n2
With the same utility, you are able to access drive-level logging:
get-log Generic NVMe get log, returns log in raw format
fw-log Retrieve FW Log, show it
smart-log Retrieve SMART Log, show it
error-log Retrieve Error Log, show it
effects-log Retrieve Command Effects Log, show it
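For example, pulling the drive's SMART data (temperature, percentage used, media and error counts, and so on) or its error log is as simple as:
$ sudo nvme smart-log /dev/nvme0
$ sudo nvme error-log /dev/nvme0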
And you can also set drive-level features:
get-feature Get feature and show the resulting value
set-feature Set a feature and show the resulting value
set-property Set a property and show the resulting value
For example, say you want to enable (1) or disable (0) the drive's volatile write cache (VWC). You can check the vwc field reported in the controller's identify data like so:
$ sudo nvme id-ctrl /dev/nvme0|grep vwc
vwc : 0
And set it like so:
$ sudo nvme set-feature /dev/nvme0 -f 0x6 -v 1
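To confirm the change, you can read the same feature ID (0x6) back with get-feature:
$ sudo nvme get-feature /dev/nvme0 -f 0x6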
You can manage and update drive firmware:
fw-commit Verify and commit firmware to a specific slot
↪(fw-activate in old version < 1.2)
fw-download Download new firmware
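A typical update is a two-step process: download the image, then commit it to a slot. The sketch below assumes a vendor-supplied image named nvme-fw.bin and uses slot 1 with commit action 1; check your vendor's documentation and the NVMe specification for the slot and action values appropriate to your drive:
$ sudo nvme fw-download /dev/nvme0 --fw=nvme-fw.bin
$ sudo nvme fw-commit /dev/nvme0 --slot=1 --action=1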
Reset the controller (but not the connected drives) or the entire NVMe subsystem:
reset Resets the controller
subsystem-reset Resets the subsystem
Discover and connect to other NVMe devices over a network (see below):
discover Discover NVMeoF subsystems
connect-all Discover and Connect to NVMeoF subsystems
connect Connect to NVMeoF subsystem
disconnect Disconnect from NVMeoF subsystem
gen-hostnqn Generate NVMeoF host NQN
And more.
The utility even has plugin extensions to support vendor-specific functions. The latest revision includes:
intel Intel vendor specific extensions
lnvm LightNVM specific extensions
memblaze Memblaze vendor specific extensions
wdc Western Digital vendor specific extensions
huawei Huawei vendor specific extensions
netapp NetApp vendor specific extensions
toshiba Toshiba NVME plugin
micron Micron vendor specific extensions
seagate Seagate vendor specific extensions
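For instance, the Intel plugin can retrieve additional vendor-specific SMART attributes (a hedged example; the available subcommands vary by plugin and nvme-cli revision):
$ sudo nvme intel smart-log-add /dev/nvme0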
Accessing the Drive across a Network
Let's look at how to leverage this high-speed SSD technology and expand it beyond the local server. An NVMe drive doesn't have to be limited to the server it's physically plugged in to. In this example, let's configure a Soft RDMA over Converged Ethernet (Soft-RoCE) network on top of traditional TCP/IP and export/import an NVMe block device via this method. This will be your NVMeoF network.
Before continuing though, you'll need to understand a couple of concepts:
- Host: as it relates to the current environment, a host will be the server connecting to a remote block device—specifically, an NVMe target.
- Target: the target will be the server exporting the NVMe device across the network and to the host server.
In this example, and for the sake of convenience, I'm using two virtual machines to create the network. There's absolutely no advantage in doing this, and I don't recommend that anyone do the same other than to follow along with the exercise. Realistically, you should enable the following only on physical machines with high-speed network cards connected. Having said that, in the target virtual machine, let's attach a couple of low-capacity virtual NVMe drives (2GB each):
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------------------------------------------------
/dev/nvme0n1 VB1234-56789 ORCL-VBOX-NVME-VER12 1 2.15 GB / 2.15 GB
↪512 B + 0 B 1.0
/dev/nvme0n2 VB1234-56789 ORCL-VBOX-NVME-VER12 2 2.15 GB / 2.15 GB
↪512 B + 0 B 1.0
(Note: the above tabular output has been edited to fit the column width.)
Again, I've been using a recent release of Ubuntu. To prepare both the host and target operating environments, install the following packages:
$ sudo apt install libibverbs-dev libibverbs1 rdma-core
↪ibverbs-utils
On some distributions, you may need to specify the librxe package (on Ubuntu, its functions are packaged in rdma-core).
Again, on both the host and target, you'll now load the required kernel modules (there are a few):
$ sudo modprobe nvme-rdma
$ sudo modprobe ib_uverbs
$ sudo modprobe rdma_ucm
$ sudo modprobe rdma_rxe
$ sudo modprobe nvmet
$ sudo modprobe nvmet-rdma
The following instructions rely heavily on configfs, the kernel's configuration virtual filesystem mounted under /sys/kernel/config. In theory, you could export NVMe targets with the nvmetcli open-source utility, which does all of that complex heavy lifting. But where is the fun in that?
Setting Up a Soft-RoCE Network
An RDMA network needs to be established between the host and target servers. On each server, identify the network interface to enable for this method of transport:
$ ip addr show enp0s3
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
↪fq_codel state UP group default qlen 1000
link/ether 08:00:27:15:4b:da brd ff:ff:ff:ff:ff:ff
inet 192.168.1.85/24 brd 192.168.1.255 scope global
↪dynamic enp0s3
valid_lft 85865sec preferred_lft 85865sec
inet6 fe80::a00:27ff:fe15:4bda/64 scope link
valid_lft forever preferred_lft forever
Let's configure the RDMA interface on top of the preferred Ethernet interface, but before doing so, first verify that one doesn't already exist:
$ sudo rxe_cfg status
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
enp0s3 yes e1000
Enable the RDMA environment and add the Ethernet interface:
$ sudo rxe_cfg start
$ sudo rxe_cfg add enp0s3
Verify that you now have your RDMA interface (rxe):
$ sudo rxe_cfg status
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
enp0s3 yes e1000 rxe0 (?)
You'll also find this interface listed in sysfs:
$ ls /sys/class/infiniband
rxe0
After applying the same instructions to both host and target machines, you'll need to test the RDMA network.
On the host, set up the server:
$ sudo ibv_rc_pingpong -d rxe0 -g 0
local address: LID 0x0000, QPN 0x000011, PSN 0x5db323,
↪GID fe80::a00:27ff:fe48:d511
remote address: LID 0x0000, QPN 0x000011, PSN 0x3403d4,
↪GID fe80::a00:27ff:fe15:4bda
8192000 bytes in 0.40 seconds = 164.26 Mbit/sec
1000 iters in 0.40 seconds = 398.97 usec/iter
On the target, set up the client (replace the IP with the IP address of your host machine):
$ sudo ibv_rc_pingpong -d rxe0 -g 0 192.168.1.85
local address: LID 0x0000, QPN 0x000011, PSN 0x3403d4,
↪GID fe80::a00:27ff:fe15:4bda
remote address: LID 0x0000, QPN 0x000011, PSN 0x5db323,
↪GID fe80::a00:27ff:fe48:d511
8192000 bytes in 0.40 seconds = 164.46 Mbit/sec
1000 iters in 0.40 seconds = 398.50 usec/iter
If you get responses like those shown above, you've succeeded in configuring your RDMA network on top of TCP.
Exporting a Target
Mount the kernel user configuration filesystem (configfs). This is a requirement: all of the NVMe Target instructions that follow rely on the NVMe Target tree being made available in this filesystem:
$ sudo /bin/mount -t configfs none /sys/kernel/config/
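Note that many modern distributions mount configfs automatically at boot; you can check for an existing mount first with:
$ mount -t configfs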
Create an NVMe Target subsystem to host your devices (to export), and change into its directory:
$ sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
$ cd /sys/kernel/config/nvmet/subsystems/nvmet-test
This example simplifies host connections by leaving the newly created subsystem accessible to any and every host attempting to connect to it (in a production environment, you definitely should lock this down to specific host machines by their NQN):
$ echo 1 |sudo tee -a attr_allow_any_host > /dev/null
When a target is exported, it's done with a "unique" NVMe Qualified Name (NQN). The concept is very similar to the iSCSI Qualified Name (IQN). This NQN is what enables other operating systems to import and use the remote NVMe device across a network potentially hosting multiple NVMe devices.
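On the host side, the gen-hostnqn command listed earlier will generate a host NQN for you (nvme-cli typically reads this value from /etc/nvme/hostnqn; the UUID portion shown below is just a placeholder):
$ sudo nvme gen-hostnqn
nqn.2014-08.org.nvmexpress:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
If you did want to lock the subsystem down instead of allowing any host, that host NQN is what you would register on the target by creating a matching entry under /sys/kernel/config/nvmet/hosts and linking it into the subsystem's allowed_hosts directory; consider that a sketch of the configfs layout rather than a step in this exercise.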
Define a subsystem namespace and change into its directory:
$ sudo mkdir namespaces/1
$ cd namespaces/1/
Set a local NVMe device to the newly created namespace:
$ echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null
And enable the namespace:
$ echo 1|sudo tee -a enable > /dev/null
Now you'll create an NVMe Target port to export the newly created subsystem and change into its directory path:
$ sudo mkdir /sys/kernel/config/nvmet/ports/1
$ cd /sys/kernel/config/nvmet/ports/1
Remember that Ethernet interface you enabled for RDMA communication? Well, you'll use its IP address when exporting your subsystem:
$ echo 192.168.1.92 |sudo tee -a addr_traddr > /dev/null
Next, you'll set a few other parameters:
$ echo rdma|sudo tee -a addr_trtype > /dev/null
$ echo 4420|sudo tee -a addr_trsvcid > /dev/null
$ echo ipv4|sudo tee -a addr_adrfam > /dev/null
Then create a softlink to point to the subsystem from your newly created port:
$ sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/
↪/sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
You now should see the following message captured in dmesg:
$ dmesg |grep "nvmet_rdma"
[24457.458325] nvmet_rdma: enabling port 1 (192.168.1.92:4420)
Importing a Target
The host machine is currently without an NVMe device:
$ nvme list
Node SN Model Namespace Usage Format FW Rev
------- ------ -------- --------- -------- ---------- ------
Let's scan the target machine for any exported NVMe volumes:
$ sudo nvme discover -t rdma -a 192.168.1.92 -s 4420
Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
treq: not specified
portid: 1
trsvcid: 4420
subnqn: nvmet-test
traddr: 192.168.1.92
rdma_prtype: not specified
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
It must be your lucky day. It looks as if the target machine is exporting one or more volumes. You'll need to remember its subnqn field: nvmet-test. You'll now connect to the subnqn:
$ sudo nvme connect -t rdma -n nvmet-test -a 192.168.1.92
↪-s 4420
If you go back to list all NVMe devices, you now should see all those exported by that one subnqn (note: the tabular output below has been reformatted to fit):
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
------- ---- ------ --------- -------- --------- ------
/dev/nvme1n1 8e0999a558e17818 Linux 1 2.15 GB / 2.15 GB
↪512 B + 0 B 4.15.0-3
Verify that it also shows up like your other block devices:
$ cat /proc/partitions |grep nvme
259 1 2097152 nvme1n1
You can disconnect from the target device by typing:
$ sudo nvme disconnect -d /dev/nvme1n1
There you have it: a remote NVMe block device exported via an NVMe over Fabrics network. You now can write to and read from it like any other locally attached high-performance block device.
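For example (and note that this will destroy any data already on the device), you can drop a filesystem onto the imported drive and mount it just as you would a local one:
$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mount /dev/nvme1n1 /mnt
$ df -h /mnt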
Note: if you're seeing I/O errors, there is a known issue with the Linux rxe code, and you may need to run a newer kernel. It is believed that kernel commit 2da36d44a9d54a2c6e1f8da1f7ccc26b0bc6cfec addresses this issue, and it was merged into a later 4.16 release.
The NVMe drive has changed the landscape of high-speed computing. Both the specification and the technology have redefined access to NAND-based SSD media and have been updated to cater better to modern workloads. And although NVMe typically runs within a local machine, it isn't limited to one. Using NVMe over Fabrics technology, NVMe storage can expand beyond the local server and across an entire high-speed network.