The Lustre Distributed Filesystem
There comes a time in a network or storage administrator's career
when a large collection of storage volumes needs to be pooled together
and distributed within a clustered or multiple-client network, while
maintaining high performance with few or no bottlenecks when accessing
the same files. That is where Lustre comes into the picture. The Lustre
filesystem is a high-performance distributed filesystem intended for
larger network and high-availability environments.
The Storage Area Network and Linux
Traditionally, Lustre is configured to manage remote data storage disk
devices within a Storage Area Network (SAN), which is two or more
remotely attached disk devices communicating via a Small Computer System
Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel
over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. To better
explain what a SAN is, it may be more beneficial to begin with what it
isn't. For instance, a SAN shouldn't be confused with a Local Area Network
(LAN), even if that LAN carries storage traffic (that is, via networked filesystem shares and so on). Only if the LAN carries storage traffic using
the iSCSI or FCoE protocols can it then be considered a SAN. Another
thing that a SAN isn't is Network Attached Storage (NAS). Again, the
SAN relies heavily on a SCSI protocol, while the NAS uses the NFS and
SMB/CIFS file-sharing protocols.
An external storage target device will represent storage volumes as
Logical Units within the SAN. Typically, a set of Logical Units will be
mapped across a SAN to an initiator node—in our case, it would be the
server(s) managing the Lustre filesystem. In turn, the server(s) will
identify one or more SCSI disk devices within its SCSI subsystem and treat
them as if they were local drives. The number of SCSI disks identified
is determined by the number of Logical Units mapped to the initiator. If
you want to follow along with the examples here, it is relatively simple
to configure a couple of virtual machines: one as the server node with
one or more additional disk devices to export, and the second to act as
a client node and mount the Lustre-enabled volume. Although it is bad
practice, for testing purposes, it also is possible to have a single
virtual machine configured as both server and client.
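On the initiator, the mapped Logical Units simply appear as additional
SCSI disks. As a quick sanity check before formatting anything, the
kernel's view of the attached SCSI devices can be listed (a hedged
example; device names and counts will vary by system):
$ cat /proc/scsi/scsi
$ sudo fdisk -l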
SCSI
SCSI is an ANSI-standardized hardware and software computing
interface adopted by all early storage manufacturers. Revised editions
of the standard continue to be used today.
The Distributed Filesystem
A distributed filesystem allows access to files from multiple hosts
sharing a computer network. This makes it possible for multiple
users on multiple client nodes to share files and storage resources. The
client nodes do not have direct access to the underlying block storage
but interact with it over the network using a protocol, which makes it
possible to restrict access to the filesystem based on access lists or
capabilities on both the servers and the clients. This is unlike a
clustered filesystem, where all nodes have equal access to the block
storage holding the filesystem and access control must therefore reside
on the client. Another advantage of distributed filesystems is that they
may provide facilities for transparent replication and fault tolerance,
so when a limited number of nodes goes off-line, the system continues
to work without any data loss.
Lustre (or Linux Cluster) is one such distributed filesystem,
usually deployed for large-scale cluster computing. Licensed under
the GNU General Public License (or GPL), Lustre provides a solution in
which high performance and scalability to tens of thousands of nodes
and petabytes of storage become a reality, while remaining relatively
simple to deploy and configure. Although Lustre 2.0 has been released,
for this article, I work with the generally available 1.8.5.
Lustre has a somewhat distinctive architecture, with three major
functional units. One is a single metadata server, or MDS, that contains
a single metadata target, or MDT, for each Lustre filesystem. This
stores namespace metadata, which includes filenames, directories, access
permissions and file layout. The MDT data is stored in a single disk
filesystem mapped locally to the serving node and is a dedicated filesystem
that controls file access and informs the client node(s) which object(s)
make up a file. Second are one or more object storage servers (OSSes) that
store file data on one or more object storage targets (OSTs). An OST is a
dedicated object-based filesystem exported for read/write operations. The
capacity of a Lustre filesystem is determined by the sum of the total
capacities of the OSTs. Finally, there are the clients that access and
use the file data.
Lustre presents all clients with a unified namespace
for all of the files and data in the filesystem, allowing concurrent
and coherent read and write access to the files in the filesystem. When
a client accesses a file, it completes a filename lookup on the MDS,
and either a new file is created or the layout of an existing file is
returned to the client. Locking the file on the OST, the client will
then run one or more read or write operations to the file but will not
directly modify the objects on the OST. Instead, it will delegate tasks to
the OSS. This approach ensures scalability and improves security and
reliability, as it does not allow direct access to the underlying storage,
which would increase the risk of filesystem corruption from misbehaving
or defective clients. Although all three components (MDT, OST and client)
can run on the same node, they typically are configured on separate
nodes communicating over a network (see the details on LNET later in this
article). In
this example, I'm running the MDT and OST on a single server node
while the client will be accessing the OST from a separate node.
Installing Lustre
To obtain Lustre 1.8.5, download the prebuilt binaries packaged in RPMs,
or download the source and build the modules and utilities for your
respective Linux distribution. Oracle provides server RPM packages
for both Oracle Enterprise Linux (OEL) 5 and Red Hat Enterprise Linux
(RHEL) 5, while also providing client RPM packages for OEL 5, RHEL 5
and SUSE Linux Enterprise Server (SLES) 10 and 11. If you will be building
Lustre from source, ensure that you are using Linux
kernel 2.6.16 or greater. Note that in all Lustre deployments, any
server acting as an MDS, MGS (discussed below) or OSS must run a
patched kernel. Running a patched kernel on a Lustre client is optional
and is required only if the client will serve multiple purposes,
such as running as both a client and an OST.
If you already have a supported operating system,
make sure that the patched kernel, lustre-modules, lustre-ldiskfs (a
Lustre-patched backing filesystem kernel module package for the ext3 filesystem), lustre (which includes userspace utilities to configure and run
Lustre) and e2fsprogs packages are installed on the host system, while
also resolving their dependencies from a local or remote repository. Use
the rpm command to install all necessary packages:
$ sudo rpm -ivh kernel-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-modules-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-ldiskfs-3.1.3-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.oel5.i386.rpm
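Before rebooting, you optionally can confirm that the Lustre packages
were registered with the package manager (a simple hedged check; the
exact list will differ with package versions):
$ rpm -qa | grep -i lustre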
After these packages have been installed, list the boot directory to
reveal the newly installed patched Linux kernel:
[petros@lustre-host ~]$ ls /boot/
config-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
grub
initrd-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.img
lost+found
symvers-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.gz
System.map-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
vmlinuz-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
View the /boot/grub/grub.conf file to validate that the newly installed
kernel has been set as the default kernel. Now that all packages have
been installed, a reboot is required to load the new kernel image. Once
the system has been rebooted, an invocation of the uname command will
reveal the currently booted kernel image:
[petros@lustre-host ~]$ uname -a
Linux lustre-host 2.6.18-194.3.1.0.1.el5_lustre.1.8.4 #1
↪SMP Mon Jul 26 22:12:56 MDT 2010 i686 i686 i386 GNU/Linux
Meanwhile, on the client side, the packages for the lustre client
(utilities for patchless clients) and lustre client modules (modules for
patchless clients) need to be installed on all desired client machines:
$ sudo rpm -ivh lustre-client-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-client-modules-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
Note that these client machines need to be within the same network
as the host machine serving the Lustre filesystem. After the packages
are installed, reboot all affected client machines.
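Once the clients are back up, you optionally can load and verify the
Lustre client modules by hand (a hedged check; on most setups this is
not strictly required):
$ sudo modprobe lustre
$ lsmod | grep lustre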
Configuring Lustre Server
In order to configure the Lustre filesystem, you need to configure
Lustre Networking, or LNET, which provides the communication
infrastructure required by the Lustre filesystem. LNET supports
many commonly used network types, which include InfiniBand and IP
networks. It allows simultaneous availability across multiple network
types with routing between them. In this example, let's use
tcp, so use your favorite editor to append the following line to the
/etc/modprobe.conf file:
options lnet networks=tcp
This step restricts LNET to using only the specified network type
(here, TCP) and prevents it from using every available network interface.
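If the server has more than one network interface, LNET also can be
bound to a specific interface by naming it in the same option. The
following sketch assumes a hypothetical interface named eth0:
options lnet networks=tcp0(eth0)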
Before moving on, it is important to discuss the role of the Management
Server or MGS. The MGS stores configuration information for all Lustre
filesystems in a clustered setup. An OST contacts the MGS to
provide information, while the client(s) contact the MGS to retrieve
information. The MGS requires its own disk for storage, although there is
a provision that allows the MGS to share a disk with a single MDT. Type
the following command to create a combined MGS/MDT node:
[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre
↪--fsname=lustre --mgs --mdt /dev/sda1
Permanent disk data:
Target: lustre-MDTffff
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups
device size = 1019MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda1
target name lustre-MDTffff
4k blocks 261048
options -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff -i 4096 -I 512 -q -O
↪dir_index,uninit_groups -F /dev/sda1 261048
Writing CONFIGS/mountdata
If no name is provided, the fsname defaults to lustre. If more than one
of these filesystems is created, each labeled volume must be given a
unique name. These names become important when you access the target
from the client system.
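For example, if a second Lustre filesystem were ever added under the same
MGS, its MDT could be formatted with its own name (a hypothetical sketch;
the fsname and device are placeholders):
$ sudo /usr/sbin/mkfs.lustre --fsname=scratch --mdt
↪--mgsnode=10.0.2.15@tcp0 /dev/sdb1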
Create the OST by executing the following command:
[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --ost
↪--fsname=lustre --mgsnode=10.0.2.15@tcp0 /dev/sda2
Permanent disk data:
Target: lustre-OSTffff
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.0.2.15@tcp
checking for existing Lustre data: not found
device size = 1027MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda2
target name lustre-OSTffff
4k blocks 263064
options -J size=40 -i 16384 -I 256 -q -O
↪dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OSTffff
↪-J size=40 -i 16384 -I 256 -q -O
↪dir_index,extents,uninit_groups -F /dev/sda2 263064
Writing CONFIGS/mountdata
Both the target and the clients need to know where the MGS is, whether
the target is providing information to the MGS or a client is contacting
it for an information lookup. For this target, the MGS was specified with
--mgsnode as the server's IP address followed by the LNET network type
used for communication (tcp0), the same network type defined earlier in
the /etc/modprobe.conf file.
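The server's own NID (the IP address plus LNET network type that the
targets and clients will reference) can be confirmed once the LNET
module is loaded, as a hedged check:
[petros@lustre-host ~]$ sudo /usr/sbin/lctl list_nids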
Now you easily can mount these newly formatted devices local to the host
machine. Mount the MDT:
[petros@lustre-host ~]$ sudo mkdir -p /mnt/mdt
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda1 /mnt/mdt/
Verify that it is mounted with the df command:
[petros@lustre-host ~]$ df -t lustre
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 913560 17752 843600 3% /mnt/mdt
The /var/log/messages file will log the following messages:
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000:
↪new disk, initializing
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000:
↪Now serving lustre-MDT0000 on /dev/sda1 with recovery enabled
Jun 25 10:26:54 lustre-host kernel: Lustre:
↪3285:0:(lproc_mds.c:271:lprocfs_wr_group_upcall())
↪lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000.mdt:
↪set parameter group_upcall=/usr/sbin/l_getgroups
Mount the OST:
[petros@lustre-host ~]$ sudo mkdir -p /mnt/ost
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda2 /mnt/ost/
Verify that it is mounted with the df command:
[petros@lustre-host ~]$ df -t lustre
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 913560 17768 843584 3% /mnt/mdt
/dev/sda2 1035692 42460 940620 5% /mnt/ost
The /var/log/messages file will log the following messages:
Jun 25 10:41:33 lustre-host kernel: Lustre:
↪lustre-OST0000: new disk, initializing
Jun 25 10:41:33 lustre-host kernel: Lustre:
↪lustre-OST0000: Now serving lustre-OST0000 on
↪/dev/sda2 with recovery enabled
Jun 25 10:41:39 lustre-host kernel: Lustre:
↪3417:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000:
↪add target lustre-OST0000_UUID
Jun 25 10:41:39 lustre-host kernel: Lustre:
↪3188:0:(quota_master.c:1716:mds_quota_recovery())
↪Only 0/1 OSTs are active, abort quota recovery
Jun 25 10:41:39 lustre-host kernel: Lustre: lustre-OST0000:
↪received MDS connection from 0@lo
Jun 25 10:41:39 lustre-host kernel: Lustre: MDS lustre-MDT0000:
↪lustre-OST0000_UUID now active, resetting orphans
Even though they are mounted like typical filesystems, you will notice
that the MDT- and OST-labeled volumes do not behave like typical
filesystems; that is where the client comes into play:
[petros@lustre-host ~]$ ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls -l /mnt/
total 4
d--------- 1 root root 0 Jun 25 10:22 ost
d--------- 1 root root 0 Jun 25 10:22 mdt
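On the server, the state of the configured Lustre devices can instead be
inspected with the lctl utility, which lists each attached MGS, MDT and
OST device (a hedged example; the output format varies between releases):
[petros@lustre-host ~]$ sudo /usr/sbin/lctl dl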
Configuring Lustre Client(s)
Remember that you named the Lustre-enabled volume lustre. When you mount
the volume over the network on the client node, you must specify this
name after the server's address and network type (tcp0), as shown
below. Again, TCP was defined as the supported LNET network type in
the /etc/modprobe.conf file:
[petros@client1 ~]$ sudo mkdir -p /lustre
[petros@client1 ~]$ sudo mount -t lustre
↪10.0.2.15@tcp0:/lustre /lustre/
After successfully mounting the remote volume, you will see the
/var/log/messages file append:
Jun 25 10:44:17 client1 kernel: Lustre:
↪Client lustre-client has started
Use df to list the mounted Lustre-enabled volumes:
[petros@client1 ~]$ df -t lustre
Filesystem 1K-blocks Used Available Use% Mounted on
10.0.2.15@tcp0:/lustre 1035692 42460 940556 5% /lustre
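To make the client mount persistent across reboots, an entry along these
lines can be added to the client's /etc/fstab (a sketch; mount options
such as _netdev may vary with your setup):
10.0.2.15@tcp0:/lustre  /lustre  lustre  defaults,_netdev  0 0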
Once mounted, the filesystem can be accessed by the client node. For
instance, you now can read from and write to files located on the mounted
OST. In the following example, let's write approximately 40MB of
data to a new file on /lustre:
[petros@client1 lustre]$ sudo dd if=/dev/zero
↪of=/lustre/test.dat bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 1.30103 seconds, 32.2 MB/s
A df listing will reflect the changes of available and used capacities
to the /lustre mountpoint:
[petros@client1 lustre]$ df -t lustre
Filesystem 1K-blocks Used Available Use% Mounted on
10.0.2.15@tcp0:/lustre 1035692 83420 899596 9% /lustre
[petros@client1 lustre]$ ls -l
total 40960
-rw-r--r-- 1 root root 41943040 Jun 25 10:47 test.dat
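To see which OST objects back the new file and how much space each OST
reports, the lfs utility shipped with the client packages can be queried
(a hedged example; with only one OST configured, the layout will list a
single object):
[petros@client1 lustre]$ lfs getstripe /lustre/test.dat
[petros@client1 lustre]$ lfs df -h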
Summary
Although this has been only a simple introduction to and tutorial of the
Lustre distributed filesystem, there remain many more ways the filesystem
can be configured for a truly rich computing environment. For instance,
Lustre can be configured for high availability to ensure that, in the
event of a failure, the system's services continue without interruption.
That is, the accessibility between the client(s), server(s) and external
target storage is always made available through a process called failover.
Additional high availability is provided by configuring disk drive
redundancy in some form of RAID (hardware, or software via mdadm)
to cope with drive failures. These high-availability techniques
normally apply to the server nodes hosting the MDS and OSS. It is
suggested that OST storage be placed on RAID 5 or, preferably, RAID 6,
while MDT storage should be RAID 1 or RAID 0+1.
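As a rough illustration of the software route, a RAID 6 array for OST
backing storage could be assembled with mdadm before formatting it with
mkfs.lustre (a hypothetical sketch; the member devices are placeholders):
$ sudo mdadm --create /dev/md0 --level=6 --raid-devices=4
↪/dev/sdb /dev/sdc /dev/sdd /dev/sde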
Lustre isn't a difficult technology to work with, and an entire community
provides excellent administrator and developer resources, ranging from
articles to mailing lists and more, to aid the novice in installing and
maintaining a Lustre-enabled system. Commercial
support for Lustre is made available by a non-exhaustive list of vendors
selling bundled computing and Lustre storage systems. Many of these same
vendors also are contributing to the Open Source community surrounding
the Lustre Project.
Failover
Failover is the capability to switch over automatically
to a redundant or standby computer server, system or network upon the
failure or abnormal termination of the previously active application,
server, system or network. Many failover solutions exist
on the Linux platform to cover both SAN and LAN environments.
Resources
Lustre Project Page: http://wiki.lustre.org/index.php
Wikipedia: Lustre: http://en.wikipedia.org/wiki/Lustre_(file_system)
Wikipedia: Distributed File Systems: http://en.wikipedia.org/wiki/Distributed_file_system