GNU Queue: Linux Clustering Made Easy
So, your organization has finally decided to double the number of Linux workstations in your cluster. Now you've got twice as much computer power as before, right?
Wrong. It's not that simple. Old habits die hard, and your organization will probably continue trying to submit most of its jobs to the old computers. Or, used to the old computers being overloaded, your users will submit most of their jobs to the new computers, leaving the old ones idle. Let's face it, it's just too much of a pain to log into every computer on your network to see which one's the least utilized. It's simpler just to send the job somewhere and get on with the rest of the day's work, especially if it's a quick and dirty job and there are lots of computers. The result, however, is slower overall performance and wasted resources.
What you need is a simple utility for sending your job to the least utilized machine automatically. You could install a batch processing system like NQS—maybe you've already installed one—but it's annoying to check your e-mail or run special commands to see if your quick and dirty job has finished running in some batch queue. If something goes wrong, you might need to use nonstandard commands or track down which remote machine is executing your job, do a ps to learn its process id, and then do a kill. Users moving to new departments or new jobs often find that they need to relearn a complex set of nonstandard commands, because their new organization uses a different batch processing system than what they're used to.
You'd like something really simple, something that works through the shell, so that you could check your job's status with a command like jobs, and allow the shell to notify you when the job has terminated, just as if you were running it in the background on your local machine. You'd like to be able to send the job into the background and foreground with bg and fg and kill the job with kill, just as if the job were running on the local node. This way, you can control remote jobs using the same standard shell commands you and your users already know how to use.
Enter GNU Queue. GNU Queue makes it easy to cluster Linux workstations. If you already know how to control jobs running on your local machine, you already know how to control remote jobs using GNU Queue. You don't even need special privileges to install and run GNU Queue on your cluster—anyone can do it. Once you've discovered how incredibly easy it is to cluster Linux environments with GNU Queue, you'll wonder why organizations continue to spend so much money on comparatively hard-to-cluster Windows NT environments.
With GNU Queue, all you have to do is write a simple wrapper shell script that farms the real application out to the network every time it is run:
#!/bin/sh
# hand every invocation off to GNU Queue, which farms the real interpreter out to the cluster
exec queue -i -w -p -- realbogobasicinterpreter "$@"
and name it “bogobasicinterpreter”, with the real bogobasicinterpreter renamed “realbogobasicinterpreter”. Renaming the system's binaries this way assumes, of course, that you have administrative privileges on your cluster (privileges that are not needed merely to install and run GNU Queue). When someone runs bogobasicinterpreter, GNU Queue farms the job out to the network.
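For instance, a user could then invoke the interpreter exactly as before and the job would quietly land on the least-loaded node (myprogram.bas is just a hypothetical input file):
bogobasicinterpreter myprogram.bas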
Another popular way to use GNU Queue is to set up an alias. You can do this even if you don't have administrative privileges on your cluster. If you are using csh, change to your home directory and add the following line to your .cshrc:
alias q 'queue -i -w -p --'
and run the command source .cshrc. Then you can farm out a job simply by typing “q” before its name.
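For example (the commands and file names here are purely illustrative):
q gzip -9 hugefile.dat
q latex thesis.tex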
Either way, GNU Queue does all the hard work, instantly finding a lightly loaded machine to run the job on. It then fires up a proxy job on your local machine that “pretends” to be the remotely executing job, so that you can background, foreground and kill the remotely running job through normal shell commands. There's no need to teach other users new commands to interact with some complicated batch processing system—if they understand how to use the UNIX shell to control local jobs, they understand how to use GNU Queue to control remotely executing jobs.
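A session with the proxy job might look something like this; the gzip command and file name are only illustrative, and the exact output of jobs depends on your shell:
> queue -i -w -n -- gzip -9 hugefile.dat &
> jobs
[1]  + Running    queue -i -w -n -- gzip -9 hugefile.dat
> kill %1
Here jobs lists the proxy just as it would any local background job, and kill %1 terminates both the proxy and the remotely running gzip.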
Of course, GNU Queue supports many additional features. It supports a traditional batch processing mode, where output can optionally be returned by e-mail. Versions 1.20.1 and higher have alpha support for various modes of job migration, which lets the administrator allow running jobs to actually move from one machine to another in order to maintain a constant load throughout the cluster. More importantly, GNU Queue allows administrators to place limits on the number of jobs of a given type that can run (say, allow no more than five bogobasicinterpreter jobs to run on any node) or to prevent certain jobs from running when a machine's load is too great. For example, the bogobasicinterpreter can't be started if the load average exceeds five, and running interpreters are suspended if the load average on the node exceeds seven. It's also possible to place restrictions on the time of day certain jobs may be run (no bogobasicinterpreters on Saturdays) or to have GNU Queue periodically check the return value of a custom script to determine whether or not a program can be run. But you'll probably never need these advanced features.
All of this sounds great, you say. How do I obtain and install GNU Queue? You can download the latest release of GNU Queue from its web site at http://www.gnuqueue.org/. It's a participating project on SourceForge, so you'll also find discussion forums, support forums and a bug-tracking database there. Download the program from the web site, unpack it and then run:
./configure
make install
from the top-level directory on each machine in your cluster, then fire up the dæmon with queued -D &.
For a quick reference on using the queue command to farm jobs out to the network, visit the GNU Queue home page. That's all there is to it!
Before installing GNU Queue on your cluster, you have to make a decision that is basically guided by whether you have root (administrative) privileges on your cluster. If you do, you'll probably want to install GNU Queue in a manner that makes it available to all the users on your site. This is the --enable-root option. On the other hand, if you're just an ordinary Jane or Joe on your cluster or want to see what the fuss is all about without giving away privileges, you can install GNU Queue as an ordinary user, the default mode of installation.
Yes, ordinary users can install GNU Queue as a batch processing system on your cluster! But if another user wants to run a separate copy of GNU Queue, he'll have to change the port numbers in the source code to ensure the two installations don't collide. That's why it's better to let the system administrator install GNU Queue (with the --enable-root option to the configure script) if you expect a lot of users will want to run GNU Queue on your cluster.
Once you've downloaded GNU Queue off the Net, the first thing to do is unpack it using the tar command. Under Linux, this is just tar xzf filename, where filename is the name of the file (compressed with gzip and having either the .tar.gz or .tgz file extension). On other systems it's a little more involved, since the tar installed by default is not GNU tar and doesn't support the z decompression option. You'll need to run the gunzip decompression program explicitly: gunzip filename.tar.gz; tar xf filename.tar, where filename.tar.gz is the file, with .tar.gz extension, that you obtained from the network. (Savvy users might want to use the zcat filename.tar.gz | tar xf - trick, but this assumes the zcat program installed on your system can handle GNU-zipped files. gunzip is part of the GNU gzip package; you can obtain it from ftp://ftp.gnu.org/.)
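For instance, if the file you downloaded were named queue-1.20.1.tar.gz (the version number is only an example), either of the following would unpack it:
tar xzf queue-1.20.1.tar.gz
gunzip queue-1.20.1.tar.gz; tar xf queue-1.20.1.tar
The first form needs GNU tar; the second works with any tar.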
So you've unpacked the distribution and you're sitting in the distribution's top-level directory. Now what? Well, if you're an ordinary Jane or Joe, you install the program into the distribution directory by running ./configure followed by make install on each machine in your cluster, then fire up the dæmon with queued -D &. If you want more details (or you're a system administrator), continue reading.
Run ./configure. If you're installing it on a system where you're not a superuser but an ordinary peon, configure sets the makefile to install GNU Queue into the current directory. queue will go into ./bin; the queued dæmon will go into ./sbin; ./com/queue will be the shared spool directory; the host access control list file will go into ./share; and the queued pid files will go into ./var. If you want things to go somewhere else, run ./configure --prefix=dir, where dir is the top-level directory where you want things to be installed.
The default ./configure option is to install GNU Queue in the local directory for use by a single user only. System administrators should run the command ./configure --enable-root instead. When installing with the --enable-root option, configure sets the makefile to install GNU Queue under the /usr/local prefix. queue will go in /usr/local/bin; queued dæmon will go into /usr/local/sbin; /usr/local/com/queue will be the shared spool directory; the host access control list file will go into /usr/local/share; and the queued pid files will go into /usr/local/var. If you want things to go somewhere else, run the following:
./configure --enable-root --prefix=dir
where dir is the top-level directory where you want things to be installed.
./configure takes a number of additional options that you may wish to be aware of, including options for changing the paths of the various directories; ./configure --help gives a full listing of them. Here are a few examples: --bindir specifies where queue goes; --sbindir specifies where queued goes; --localstatedir specifies where the spool directory and queued pid files go; and --datadir specifies where the host access control file goes. If ./configure fails inelegantly, make sure lex is installed. GNU flex is an implementation of lex available from the FSF, http://www.gnu.org/.
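For instance, a site-wide installation that keeps the spool directory and pid files under /var might be configured along these lines (the paths here are only a suggestion):
./configure --enable-root \
    --bindir=/usr/local/bin \
    --sbindir=/usr/local/sbin \
    --localstatedir=/var/queue \
    --datadir=/usr/local/share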
Now, run make to compile the programs. If your make complains about a syntax error in the Makefile, you'll need to run GNU Make, which is hopefully already installed on your machine (perhaps as gmake or gnumake); if not, you can obtain it from the FSF at http://www.gnu.org/.
If all goes well, make install will install the programs into the directory you specified with ./configure. Missing directories will be created. The host name of the node make install is being run on will be added to the host access control list if it is not already there.
Now, try running Queue. Start up ./queued -D & on the local machine. (If you did a make install on the node, the host name should already be in the host access control list file.)
Here are some simple examples:
> queue -i -w -n -- hostname
> queue -i -r -n -- hostname
For a more sophisticated example, try farming out Emacs and then suspending and resuming it with Control-Z and fg:
> queue -i -w -p -- emacs -nw
If this example works on the localhost, you will want to add additional hosts to the host access control list in share (or --datadir) and start up queued on these.
This line:
> queue -i -w -p -h hostname -- emacs -nw
will run Emacs on hostname. Without the -h argument, queue runs the job on the best (least-loaded) host in the host access control list file. There is also a -H hostname option, which causes hostname to be preferred, but the job will run on other hosts if hostname is unavailable.
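For example, the following prefers a host called fastnode (an illustrative name) but will settle for any other host in the list if fastnode is unavailable:
> queue -i -w -p -H fastnode -- emacs -nw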
At this point, you might be wondering what some of the other options for queue do. ./queue --help gives a list of options to queue. The “--” separates GNU Queue's own options from the command to be run and its options. -i stands for immediate; it places the job to be run in the “now” batch queue. -w invokes the proxy job system, as opposed to -r, which causes output to be returned to the user via e-mail (traditional batch processing mode). -p turns full virtual terminal support on, and -n turns it off. Most users will probably only use -i -w -p (full virtual terminal support, for interactive jobs like Emacs) and -i -w -n (no virtual terminal support, for noninteractive jobs).
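Putting that together, a few examples (the programs being farmed out are arbitrary):
> queue -i -w -p -- vi notes.txt
> queue -i -w -n -- gzip -9 hugefile.dat
> queue -i -r -n -- make all
The first runs an interactive job under a full virtual terminal, the second runs a noninteractive job without one, and the third runs in traditional batch mode, returning its output by e-mail.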
More details on the protocol GNU Queue uses for host selection can be found in the on-line manual and the on-line Internet draft protocol at http://www.gnuqueue.org/.
You can also create additional queues for use with the -q and -d spooldir options. They might be used to specify different queuing behavior for different classes of jobs. Each spooldir must have a profile associated with it. The profile determines queuing behavior for jobs running in that spooldir. See the on-line manual for more details.
That's all there is to it! Of course, for GNU Queue to work well there needs to be some sort of file sharing between nodes in the cluster (for example, NFS, the Network File System). If you have the same home directory regardless of which machine you log into, your system administrator has somehow configured your home directory to be shared across all cluster nodes. You want to make sure that enough of the file system is shared (i.e., is the same) between cluster nodes so that your programs don't get confused when they run. Typically, you'll want system temporary directories (/tmp and /usr/tmp) to be non-shared, but everything else (except maybe the root file system containing kernel images and basic commands) to be shared. Because this configuration is so common on UNIX and Linux clusters, we've assumed here that it is the case, but it isn't necessarily so; check with your system administrator if you have questions about how files are shared across your network cluster.
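One quick way to check that a directory really is shared is to create a file locally and then list it from another node; replace othernode here with a real host from your access control list:
> touch ~/queue-share-test
> queue -i -w -n -h othernode -- ls -l ~/queue-share-test
> rm ~/queue-share-test
If the remote ls can see the file, your home directory is shared between the two machines.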
Documentation about GNU Queue is also available off the web site, including an Internet draft on the protocol GNU Queue uses to farm out jobs. While you're there, you'll probably want to sign up for one of the three mailing lists (queue-announce, queue-developers and queue-support) so that you can learn of new features as they're announced and interact with other GNU Queue users. At the time of writing, queue-developers is by far the most active list, with lively discussion of improvements to GNU Queue's many features and suggested ports to new platforms. You can obtain advice for any problems you encounter from the queue-support mailing list.
Another SourceForge feature mentioned on the home page is the CVS repository for GNU Queue. Interested readers can obtain the latest prerelease development code, containing the latest features (and bugs) as they are added by developers, by unpacking the GNU Queue distribution and running the command cvs update inside the top-level directory. If you're actively making changes to GNU Queue, you can apply for write access to the CVS repository and instantly publish your changes via the cvs ci command. If you can get other developers interested in your work (via the queue-developers mailing list, of course), you can bounce code changes back and forth amongst yourselves via repeated cycles of cvs ci and cvs update. All of this assumes you have cvs installed, which it is by default on many Linux distributions.
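Assuming cvs is on your path and you're sitting in the unpacked distribution's top-level directory, grab the latest code with:
cvs update
then, after editing, building and testing your change, publish it (write access required) with a command like:
cvs ci -m "describe your change"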
Code isn't the only way interested readers can contribute to GNU Queue. There are many ways to contribute to the GNU Queue effort on SourceForge. With a login on SourceForge, one of the project administrators can give you editor privileges for the documentation tree, moderator privileges in the discussion forums, or administrative privileges in the bug tracking and patch database sections of the site.
If you encounter problems with installation not explained here, you may wish to check out the support forum and support mailing list, available off GNU Queue's home page, https://www.gnu.org/software/gnu-queue/. Bugs should be reported to bug-queue@gnu.org.
W. G. Krebs is a PhD candidate in molecular biophysics and biochemistry at Yale University, where he researches web-based biological databases. He has been a systems programmer for longer, and in more languages, than he cares to relate. His wide-ranging interests include political economics, classical and folk music, and the Chinese game of Go; he welcomes your comments at wkrebs@gnu.org or by snail mail c/o Linux Journal.