A Secure Bioinformatics Linux Lab in an Educational Research Environment
In delivering a new bioinformatics curriculum in the Graduate School at the University of Medicine and Dentistry of New Jersey, we undertook the challenge of incorporating new computational resources over an existing research support infrastructure, adding new services and platforms and responding to an increasingly burdensome responsibility to protect ourselves from network threats. Our new environment spans two cities and links Linux workstations, Linux servers, Silicon Graphics workstations, a Sun 6800 Enterprise Server and the Internet. Open-source solutions, combined with selective use of commercial resources, integrate into a cost-effective, service-friendly bioinformatics research environment. In this report, we describe solutions to a set of challenges in our core, Linux-driven server/client environment.
As with many universities, our public computer labs are Microsoft boxes with the Office suite, and we have a set of clients--Web, secure telnet, secure FTP, IMAP2 mail and X. The bioinformatics software the university hosted lay behind these workstations, on Sun/Solaris and SGI/Irix servers. We needed an environment in which we could do several things: (1) manage workstations efficiently, (2) quickly add or delete applications, (3) rebuild workstations, (4) ensure availability and storage and (5) address network and data security issues.
We recognized that software and configuration information should be stored in a centralized server and available to authenticated clients. Our generic Web and e-mail servers already were overburdened with these services. In addition, each of those servers faced its own distinctive security threats and solutions. A better approach would be to establish a separate server dedicated to serving the scientific community, a scientific server. We needed to bring this project in on a modest budget.
Almost any server/workstation environment dedicated to scientific research might have offered multiple benefits, including parallel processing, centralized administration and secure storage systems. However, many fail in an important respect in our two-city arena. We have a high demand for visualization, and users need X servers such as ReflectionX, Exceed and Cygwin running on their desktops. These X servers display the graphical interfaces of programs running on a remote server or X client. Most molecular modeling software requires visualization using OpenGL. In a local area network, this kind of architecture should suffice. However, our intercampus network was not always up to the task.
Our solution to this set of challenges was to build a bioinformatics computer lab environment dedicated to teaching and research. This lab is designed to be secure, resilient to attacks and failure and adaptable to an array of software and modes of access by authenticated users.
We began with the operating system choice. We elected Linux for many of the usual reasons: its open-source nature, security, easy manageability and free availability made it attractive in an educational environment with limited funds. Even in that economic mood, we chose to go one step up, selecting Red Hat Enterprise Linux for the support a commercial system provides, including workstation monitoring, patches and upgrades using the Red Hat Network. We went with Intel x86 computers because we had a number on hand and they made good economic sense. Plus, if we were to fail, we still would have boxes that otherwise could be deployed.
As now deployed, our Piscataway lab has 14 Red Hat Enterprise Linux workstations and two Enterprise Linux servers. In the Newark lab (where the facility is smaller), we have four Red Hat Enterprise Linux workstations and one Enterprise Linux server. All Piscataway workstations are identical in terms of hardware, as are all Newark workstations; there are minor differences between the two sets, however.
We outlined a set of initial tasks: build a server to run DHCP, host Red Hat CDs for Kickstart installations, authenticate users, host users' home directories and provide a Web and database server. To that end, we did an installation of Red Hat Enterprise Linux AS on two separate PCs in Piscataway and one in Newark to act as our servers--one primary, one backup and one Web/database server.
We also needed DHCP services to permit authorized users to access the network with personal laptops, not to run our workstations. To accommodate the laptop users, we determined the MAC address of each user's laptop, and for each laptop we have an entry similar to the following, where each host is identified by the user's username.
subnet 192.168.1.0 netmask 255.255.255.0 {
    deny unknown-clients;

    # DHCP range
    range 192.168.1.240 192.168.1.245;

    # known clients
    host golharam {
        hardware ethernet 00:12:34:56:78:90;
    }
    ...
}
No academic lab is supportable if each workstation must be built and maintained separately. Toward this end, we used the Kickstart feature of Red Hat, initially installing Linux on a single machine to get an idea of what our Kickstart configuration file would look like. We used that experience as a starting point.
Because the hardware differed slightly between the campuses, we recognized we would end up with multiple configuration files. Managing several configuration files is not a big problem, but it does introduce a point where the configurations can get out of sync if the multiple hardware flavors are not all updated at once. To correct for this complication, we created a Perl script to read a Kickstart template file and generate the necessary Kickstart configuration files. The template we used is a Kickstart configuration file with minor additions.
In our scripts, in any place where we had different options depending on the host (such as using DHCP or assigning a static IP address), we preceded the section with a colon (:) followed by the hostnames to which the section applied. When a section was complete, it ended with a colon-period (:.) combination. See the Resources section at the end for a link to our Kickstart template.
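To make the mechanism concrete, a fragment of such a template might look like the following. The section markers are exactly as described above, but the hostnames, addresses and network options shown here are placeholders rather than lines from our actual template (which is linked in Resources):

# network settings
:hydrogen alanine isoleucine
network --bootproto dhcp --device eth0
:.
:newark1
network --bootproto static --device eth0 --ip 192.168.2.50 --netmask 255.255.255.0 --gateway 192.168.2.1 --nameserver 192.168.2.2
:.

Lines outside any marked section are copied into every generated configuration file.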
In our actual template, you might notice we have three different options for the video card and monitor resolution. One host, alanine, has a 15" flat-panel and uses the Intel 845 board. Four other hosts--isoleucine, tryptophan, tyrosine and proline--have 17" flat-panels using the Intel 845 board. Another group of Piscataway machines has 17" flat-panels and carries the Intel 865 board.
We have several sections in our template file specifying, for example, network settings, partition information, printer information and Red Hat Network registration. Notice, however, that the packages section is identical for all the machines. With our Perl script, mkkickstart, we now can generate the configuration files that keep everything in sync. It takes one parameter, the name of the machine to build for, or all to build all the configurations. See the Resources section at the end for a link to our Kickstart Perl script.
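The following is a stripped-down sketch of the idea behind mkkickstart, not the production script linked in Resources; the template and output file names are placeholders. It copies template lines through unchanged and keeps a host-specific section only when the requested hostname appears on the section's opening line:

#!/usr/bin/perl
# mkkickstart (sketch) -- expand a Kickstart template into per-host configuration files
use strict;
use warnings;

my $which = shift or die "usage: $0 <hostname|all>\n";
# Host list is illustrative; the real script knows all of our machines
my @hosts = $which eq 'all' ? qw(hydrogen alanine isoleucine) : ($which);

for my $host (@hosts) {
    open my $in,  '<', 'kickstart.template' or die "cannot read template: $!";
    open my $out, '>', "ks-$host.cfg"       or die "cannot write ks-$host.cfg: $!";
    my $skip = 0;
    while (my $line = <$in>) {
        if ($line =~ /^:\.\s*$/) {               # ":." ends a host-specific section
            $skip = 0;
        } elsif ($line =~ /^:(.+)$/) {           # ":host1 host2 ..." starts one
            my %wanted = map { $_ => 1 } split ' ', $1;
            $skip = $wanted{$host} ? 0 : 1;
        } elsif (!$skip) {                       # ordinary lines go to every host
            print {$out} $line;
        }
    }
    close $in;
    close $out or die "error writing ks-$host.cfg: $!";
}

Running it with all regenerates every configuration file from the single template, which is what keeps the files in sync.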
In order to make the installs unattended, we put the contents of the Enterprise Workstation CDs on an NFS share on the server. We used the NFS share /products and copied the CDs into /products/RedHat and the Kickstart configuration file into /products.
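Exporting that share is a one-line addition to /etc/exports on the server. The subnet below comes from our DHCP example; the export options are one reasonable choice rather than necessarily the ones we used:

/products    192.168.1.0/255.255.255.0(ro)

After editing the file, running exportfs -ra makes the export active.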
The benefit of using Kickstart is that you can use one configuration file to build a group of machines. This works well when the machines obtain their network information through DHCP. The downside appears when you need to assign specific IP addresses: in that case, you end up creating a separate Kickstart configuration file for each machine.
Our solution is to use a single Kickstart configuration file, in which the machines obtain their networking information by way of DHCP. On the DHCP server, we identified each machine by its MAC address and assigned a static IP and hostname to that MAC address. This allows us to specify static IP addresses using DHCP. A portion of our DHCP configuration is shown below for one machine.
host hydrogen {
    option host-name "hydrogen";
    hardware ethernet 00:0D:56:0A:60:0B;
    fixed-address 192.168.1.161;
}
In Newark, where the machines are on a different subnet with a DHCP server we cannot configure, each machine is given a Kickstart configuration file that contains its specific networking information.
In building a machine using Kickstart, we need to point Anaconda (the Red Hat installer) to the location of the configuration file. Because the configuration files reside on an NFS share on the server, Anaconda needs to be able to access the server, and to do so it needs the necessary network drivers. When initially booting the machines from CD, the network drivers are located on the CD and can be loaded. However, part of the reason for using Kickstart and putting the CDs on the server was so we would not need a CD.
In order to circumvent the need for the CD, we created a bootable floppy disk. An image of a boot floppy exists on the first RH CD, so we used that to build a bootable floppy. Unfortunately, the network drivers do not fit on the floppy. Our solution was to use USB memory sticks.
Memory sticks are recognized as USB hard drives and typically can be booted from. The memory sticks we had have a capacity of 64MB each; recently, memory stick capacity has grown to 256MB and up. By booting off a memory stick, we are able to fit both the boot image from /RedHat/isolinux on the first CD and the initrd.img from /RedHat/images/pxeboot on the first CD. The pxeboot initrd.img contains the necessary network drivers. The exact steps we used to build a bootable memory stick are as follows. Many thanks to the folks on the Red Hat mailing list for this procedure.
1. Format the USB stick as one big FAT partition: mkdosfs /dev/sdb1
2. Mount the memory stick
3. Copy everything from /RedHat/isolinux to the memory stick. You can omit isolinux.bin, boot.cat and TRANS.TBL
4. Rename isolinux.cfg as syslinux.cfg
5. Copy the initrd.img from /RedHat/images/pxeboot to the memory stick. (At present I believe the two initrd.img files are the same)
6. Unmount the memory stick
7. Make the memory stick bootable with syslinux /dev/sdb1
We now were able to boot from the memory stick.
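The last piece is telling Anaconda where the Kickstart file lives, either by typing it at the boot prompt or by putting it on the append line in the stick's syslinux.cfg. The server address and file name below are placeholders for our actual values:

linux ks=nfs:192.168.1.1:/products/ks-hydrogen.cfg ksdevice=eth0

The ksdevice option keeps Anaconda from prompting for which network interface to use on machines with more than one.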
Once the machines were built, a script in the NFS share /products was run as root to perform an automated installation of applications. This included installing Perl modules, compiling and installing applications and installing custom RPM files. It also made modifications to some of the configuration files to add other NFS shares, adjust firewall settings and set environment variables.
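In outline, that post-install script does little more than the following hypothetical sketch; the RPM path, module name and NFS share are placeholders for what our real script installs and configures:

#!/usr/bin/perl
# post-install sketch: run as root from the /products NFS share
use strict;
use warnings;

# Install our custom RPM packages (path is a placeholder)
system("rpm -Uvh /products/RPMS/*.rpm") == 0
    or warn "RPM installation reported errors\n";

# Install a Perl module from CPAN (module name is illustrative)
system("perl -MCPAN -e 'install Bio::Perl'");

# Add another NFS mount to /etc/fstab (server and share are placeholders)
open my $fstab, '>>', '/etc/fstab' or die "cannot open /etc/fstab: $!";
print {$fstab} "server:/data  /data  nfs  rw,intr  0 0\n";
close $fstab;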
We needed to establish user accounts and login protocols. We first used NIS but quickly grew unhappy with its security issues, principally because the port NIS uses could not be made static, preventing us from establishing our firewall. Another security issue with NIS is that passwords are sent unencrypted over the network, allowing anyone to capture them. We had taken the NIS route fearing that establishing an LDAP server would be a more time-consuming task. In the setup of NIS, the NIS HOW-TO proved useful in getting things running.
The migration and setup of an LDAP server did not turn out to be as difficult as first expected, as an established migration path with Perl scripts already was available. For the most part, the migration to LDAP went smoothly thanks to the LDAP HOW-TO.
The tools for adding, modifying and deleting users in LDAP exist, but they are not integrated as well as the NIS tools are. For example, to add a user, /usr/sbin/useradd needs to be called first to generate the necessary user information and create the home directory. The information then needs to be extracted from /etc/passwd and /etc/shadow. Once that is done, the user can be added to LDAP. Thus, we created a Perl script to perform this task for us. See the Resources section at the end for a link to our LDAP useradd Perl script.
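The flavor of that script is captured in the sketch below, which is not the script linked in Resources; the LDAP server name, base DN and bind credentials are assumptions you would replace with your own. It relies on the Net::LDAP module:

#!/usr/bin/perl
# ldap-useradd sketch: create a local account, then mirror it into LDAP
use strict;
use warnings;
use Net::LDAP;

my $user = shift or die "usage: $0 username\n";

# Let the standard tool create the account and home directory
system('/usr/sbin/useradd', $user) == 0 or die "useradd failed\n";

# Pull the new entry out of /etc/passwd ...
my ($name, $pw, $uid, $gid, $quota, $comment, $gecos, $home, $shell) = getpwnam($user)
    or die "$user not found in /etc/passwd\n";

# ... and the password hash out of /etc/shadow
my $hash = '';
open my $shadow, '<', '/etc/shadow' or die "cannot read /etc/shadow: $!";
while (<$shadow>) {
    my @f = split /:/;
    $hash = $f[1] if $f[0] eq $user;
}
close $shadow;

# Add the corresponding posixAccount entry to the directory
# (server, base DN and bind credentials below are placeholders)
my $ldap = Net::LDAP->new('ldap.example.edu') or die "$@";
my $mesg = $ldap->bind('cn=Manager,dc=example,dc=edu', password => 'secret');
die $mesg->error if $mesg->code;

$mesg = $ldap->add("uid=$user,ou=People,dc=example,dc=edu",
    attrs => [
        objectClass   => [qw(top account posixAccount shadowAccount)],
        uid           => $user,
        cn            => $user,
        uidNumber     => $uid,
        gidNumber     => $gid,
        homeDirectory => $home,
        loginShell    => $shell,
        userPassword  => "{crypt}$hash",
    ],
);
die $mesg->error if $mesg->code;
$ldap->unbind;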
The content of our actual script was derived by massaging the migration scripts that come with the LDAP packages. In an NIS environment, users would change passwords by running passwd <username>.
LDAP has no equivalent command, so we created another script to perform this function. See the Resources section at the end for a link to our LDAP passwordchange Perl script.
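A minimal sketch of such a password-change helper is shown below; again, the server name and base DN are assumptions, and because it binds as the user, the directory's access controls must allow users to write their own userPassword attribute:

#!/usr/bin/perl
# ldap-passwd sketch: let a user change his or her own LDAP password
use strict;
use warnings;
use Net::LDAP;
use Term::ReadKey;

my $user = shift or die "usage: $0 username\n";
my $dn   = "uid=$user,ou=People,dc=example,dc=edu";   # base DN is a placeholder

sub prompt {
    my $label = shift;
    print "$label: ";
    ReadMode('noecho');
    chomp(my $answer = ReadLine(0));
    ReadMode('restore');
    print "\n";
    return $answer;
}

my $oldpass = prompt('Current password');
my $newpass = prompt('New password');

# Bind as the user with the current password (server name is a placeholder)
my $ldap = Net::LDAP->new('ldap.example.edu') or die "$@";
my $mesg = $ldap->bind($dn, password => $oldpass);
die "Authentication failed\n" if $mesg->code;

# Store the new password as an MD5 crypt hash (the scheme is a choice, not a requirement)
my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '.', '/');
my $salt  = join '', map { $chars[rand @chars] } 1 .. 8;
my $hash  = crypt($newpass, '$1$' . $salt);

$mesg = $ldap->modify($dn, replace => { userPassword => "{crypt}$hash" });
die $mesg->error if $mesg->code;
$ldap->unbind;
print "Password changed.\n";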
We have yet to devise a mechanism for deleting users. The only way to do this is to use one of the GUI tools available for LDAP, such as DirectoryAdministrator or GQ. We look forward to having LDAP user authentication integrated into Linux more thoroughly, as has been the case for NIS and flat files.
As mentioned previously, users' home directories are mounted on the clients from the server. Initially, this directory strategy worked well. We continue to have some concerns, however, about network reliability and about the security of mounting directories across our two-campus network. We currently are researching other distributed filesystems to use for this purpose, including Lustre and InterMezzo.
A great advantage of a distributed filesystem would be to allow us to reclaim unused disk space on the workstations. For example, in our Piscataway lab, the workstations each have 120GB of hard drive space and less than 20GB actually is being used for OS and applications. This means that each machine is wasting 100GB of space, and with 14 machines, that is over 1TB of unused space. Ideally we would like to make all that space look like one large drive available to all the machines. Users' home directories then could be moved to this distributed space, thus increasing the amount of space each user can use and relaxing some of the responsibilities of the primary server.
One of the issues that arises with user accounts is backups and general system reliability. Our failover scheme is pretty simple: our second server mirrors the first one. The servers sync home directories, databases and certain configuration files nightly. If the main server should go down, a script is available (through sudo) that a number of senior people can execute. This script resets the machine's IP address to be the IP address of the main server and starts various services, such as DHCP, NFS and LDAP.
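The promotion script itself is short. The sketch below conveys the idea but is not our actual script; the interface name, IP address and service names are placeholders for our real configuration:

#!/usr/bin/perl
# failover sketch: promote the backup server to the primary's role
use strict;
use warnings;

# Take over the primary server's IP address (interface and address are placeholders)
system('ifconfig eth0 192.168.1.1 netmask 255.255.255.0 up') == 0
    or die "could not reassign the IP address\n";

# Start the services the primary normally provides
for my $service (qw(dhcpd nfs ldap)) {
    system('service', $service, 'start') == 0
        or warn "failed to start $service\n";
}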
This design isn't meant for high availability, but it does help to prevent user data loss. In our environment, we encourage users to back up critical data and archive it regularly. Increasingly, in our university, user data backup is becoming the responsibility of each user, with central computing services concentrating only on disaster recovery services.
Within university environments, access can be very open--as much as a university can be. The majority of Windows OS machines on the network are highly susceptible to all sorts of attacks in spite of increasingly intense efforts to protect them. Infected machines create network traffic, and we see performance dips because of that traffic. In order to protect our lab from threats, a number of security measures are in place, in addition to those described above.
To enhance security, access is controlled by associating specific IP addresses to specific resources on the workstations and servers. We control that access primarily through the use of iptables.
The server firewall allows incoming SSH traffic from anywhere. It then performs IP address filtering to allow only certain IP addresses access to more open resources, such as NFS, LDAP, CUPS and the FlexLM license server. The Web server uses a slightly different setup to allow only incoming SSH and HTTP traffic.
Each of the workstations has even more restrictive settings, allowing only incoming SSH and VNC traffic. Outgoing traffic on all the machines is considered trusted and is not filtered.
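As an illustration, a workstation's rules boil down to something like the following; the VNC port is a placeholder, because it depends on which display numbers are in use:

# default: drop anything not explicitly allowed
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# incoming SSH
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# incoming VNC (display :1)
iptables -A INPUT -p tcp --dport 5901 -j ACCEPT

On the servers, the equivalent rules add a -s match for the trusted addresses before accepting traffic to ports such as 389 (LDAP) and 631 (CUPS).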
Each of the allowed services uses the standard ports listed in /etc/services, with the exception of NFS and FlexLM. FlexLM port usage can be controlled by adding a specific port to the license file. NFS is a little more troublesome: to set NFS up to use static ports, some background information is needed on how it works. For a detailed description, you can refer to the LinWiz documentation (see Resources). The LinWiz site also includes a Web app that generates an iptables configuration you can use as a starting point for your firewall.
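The general idea is to pin each NFS-related daemon to a port of your choosing so the firewall rules can reference fixed numbers. On a Red Hat system, the approach looks something like the sketch below; the port numbers are arbitrary choices and the exact files vary between releases, so treat it as a starting point rather than our exact configuration:

# /etc/modules.conf -- give the in-kernel lock manager fixed ports
options lockd nlm_udpport=4001 nlm_tcpport=4001

# start the user-space daemons on fixed ports (for example, from the init scripts)
rpc.mountd -p 4002
rpc.statd -p 4000

The firewall then needs to admit only the portmapper (111), nfsd (2049) and the chosen ports, and only from trusted addresses.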
There is more to do. Our workstations are on two campuses, and our file and authentication server is on one. Our intercampus network is less reliable than our LANs, and NFS traffic between campuses is not encrypted. We intend to convert from NFS to a more network-reliable, distributed filesystem that takes advantage of the extensive storage housed on our workstations and communicates more efficiently within each LAN.
Bruce Byrne is a PhD geneticist who learned that he needed computers when he no longer could calculate the outcome of DNA cloning ventures using scissors and highlighters. Bruce is the Associate Director for Education at the Informatics Institute of UMDNJ, where he heads the graduate school's Concentration in Bioinformatics.
John Kerrigan is a PhD chemist who learned that he could leave test tubes in the laboratory and make his new molecules in silico. John is the computational biologist at the University's Academic Computing Service, teaches in the Concentration and collaborates with research scientists interested particularly in rational drug design and biophysics.
Ryan Golhar is a computer scientist who found an interest in applying computer science to biology. Ryan currently is pursuing his PhD at UMDNJ.