Big Brother Network Monitoring System
Figure 1. Big Brother (Sean MacGuire) is Watching
I wasn't bored: I don't have time to be bored. Texas Agricultural Extension Service operates a fairly large enterprise-wide network that stretches across hell's half acre, otherwise known as Texas. We have around 3,000 users in 249 counties and 12 district offices who expect to get their e-mail and files across our Wide Area Network. Some users actually expect the network to work most of the time. We use Ethernet networking with Novell servers at some 35 locations; about 15 of them have routers connected via a mixture of 56Kb circuits, fractional T1, frame-relay and radio links. We are not currently using barbed wire fences for our network, no matter what you may have heard.
I am privileged to be part of the team that set up and maintains the network. We do not live in a perfect network world—things happen. Scarcely a day goes by that we do not have one or more WAN link outages, usually of short duration. We sometimes have our hands full just keeping all the pieces connected. Did I mention that the users expect the mail and other software to actually work?
Cruising the USENET newsgroups, I read a posting about “Big Brother, a solution to the problem of Unix Systems Monitoring” written by Sean MacGuire of Montréal, Canada. I was intrigued to notice that Big Brother was a collection of shell scripts and simple C programs designed to monitor a bunch of Unix machines on a network. So what if most of our mission critical servers were Novell-based? Who cares if some of our web servers run on Macintosh, OS/2, Windows 95 or NT? We use both Linux and various flavors of Unix in a surprisingly large number of places.
System administrators often reported difficult installations and software incompatibilities with the monitoring software; thus, frustrated users often gave us our first hint that all was not well. We had cooked up a number of homemade monitoring systems; pinging and tracerouting to all the servers can be very informative. We even looked at a bunch of proprietary (and expensive) network monitoring systems. It is amazing how much money these systems can cost.
According to the blurb by Sean MacGuire on Big Brother:
Big Brother is a loosely-coupled distributed set of tools for monitoring and displaying the current status of an entire Unix network and notifying the system administrator should need be. It came about as the result of automating the day to day tasks encountered while actively administering Unix systems.
The USENET news article provided a URL to the home site of Big Brother, http://www.iti.qc.ca/iti/users/sean/bb-dnld/. I pointed my browser to it and was rewarded with a blue image of a sinister face peering out under the caption “big brother is watching” against a purple background. After my initial shock, I learned that Big Brother featured:
Web-based status display
Configurable warning and panic levels
Notification via pager or e-mail
Free and includes source code
I was fascinated, especially by the last item: “Free and includes source code.” (I often tell people that Linux isn't free, but priceless.) So what could a priceless package do for me? What does Big Brother check?
Connectivity via ping
HTTP servers up and running
Disk space usage
Uptime and CPU usage
Essential processes still running
System-generated messages and warnings
Overall, very sensible. Looking for some “gotchas”, I found I would need a Unix-based machine, a functioning web server and browser (for the display), a compiler, Kermit and a modem line (for the pager). A web server was no problem, as we run many. A C compiler came with Linux, and we use Kermit on many machines with modems. So far, so good.
The Big Brother web site provided links to a few demonstration sites, and a link to download the program as well. I connected to a demonstration site and was greeted with an amazing display:
Figure 2
[BIG BROTHER IMAGE]   Buttons: [help] [info] [page] [view]   Updated @ 22:52

Legend:   [grn] System OK   [yel] Attention   [red] Trouble   [blu] No report

              conn    cpu     disk    http    msgs    procs
iti-s01       [grn]   [grn]   [grn]   [grn]   [yel]   [grn]
route-r-000   [grn]   -       -       -       -       -
inet-gw-0     [grn]   -       -       -       -       -
As you can see, Big Brother is watching. While enduring the scrutiny of the Orwellian face peering out at me, I examined the rest of the display. It is colored like a traffic signal (green/yellow/red), and the update time is clearly displayed beneath it. To the right of “Big Brother” are four buttons, marked clearly Help, Info, Page and View. Beneath the header area is a table with six column headings and three rows, each neatly labelled with a computer host name. The boxes formed by the intersection of the rows and columns contain attractive green and yellow balls. The overall effect is like a decorated tree. The left side of the screen has a yellow tint, gradually becoming black at the center.
Selecting the Help button gives a brief explanation of Big Brother. Choosing the Info Button provides a much longer and more detailed explanation of the system, including a graphic that really is worth a thousand words. The Page button sends a signal to a radio-linked pager—not at all what I had expected. Finally, the View selection provides a brief but perhaps more useful view of the information, isolating only the systems with problems.
In my case, only the “iti-s01” system was displayed. My browser cursor indicated a link as it passed over each colored dot, so I clicked on the blinking yellow dot and received this message:
    yellow Tue Feb 18 22:50:53 EST 1997
    Feb 16 12:22:33 iti-s01 kernel: WARNING: / was not properly dismounted
This puzzled me at first. How on earth could it know that? It turns out that Big Brother (BB) checks the system /var/log/messages file periodically and alerts on any line that contains either WARNING or NOTICE. As I am certain Sean MacGuire is very conscientious, I suspect he adds that line to his message file, so the viewer can see how Big Brother reports its findings.
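The message check is easy to approximate in a few lines of shell. What follows is only a minimal sketch, not Big Brother's own script: the function name scan_msgs is hypothetical, and the mapping to "yellow" and "green" is an assumption based on the display described above. It matches the keywords anywhere in the line, since the sample message carries WARNING after the "kernel:" prefix.

```shell
#!/bin/sh
# Hypothetical sketch of a message-file check; not taken from the BB sources.

# scan_msgs FILE
# Prints "yellow" followed by the matching lines if FILE mentions
# WARNING or NOTICE; otherwise prints "green".
scan_msgs() {
    msgfile=$1
    hits=$(grep -E 'WARNING|NOTICE' "$msgfile")
    if [ -n "$hits" ]; then
        echo "yellow"
        echo "$hits"
    else
        echo "green"
    fi
}

# Example: scan_msgs /var/log/messages
```

A real check would presumably also remember which lines it had already reported, so an old warning does not stay yellow forever; that bookkeeping is left out here.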
Suddenly, my screen spontaneously updated. The update time had changed by five minutes, and a blinking yellow dot appeared under the column labelled procs. I clicked on the blinking yellow dot and was informed that the sendmail process was not running. This got me really interested—Big Brother can monitor whether selected processes are running.
Being a little puzzled about the screen's ability to update itself, I viewed the document source and discovered some HTML commands that were new to me:
    <META HTTP-EQUIV="REFRESH" CONTENT="120">
    <META HTTP-EQUIV="EXPIRES" CONTENT="Tue Feb 18 23:22:07 CST 1997">
The first META line instructs browsers to get an update every 120 seconds. The second tells the browser to get a new copy after the expiration time and date—very clever.
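A page generator written in shell can emit the refresh tag itself. The sketch below is an illustration only: html_head is a hypothetical name, not a function from the Big Brother scripts, and it writes just the REFRESH tag. The EXPIRES tag is omitted because computing a future timestamp portably in plain sh takes more than a line or two.

```shell
#!/bin/sh
# Hypothetical sketch: emit an HTML head that asks the browser to reload.

# html_head SECONDS TITLE
# Writes the opening of an HTML page whose META REFRESH interval is SECONDS.
html_head() {
    refresh=$1
    title=$2
    cat <<EOF
<HTML>
<HEAD>
<META HTTP-EQUIV="REFRESH" CONTENT="$refresh">
<TITLE>$title</TITLE>
</HEAD>
EOF
}

# Example: html_head 120 "Big Brother" > bb.html
```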
I returned to the graphics window and discovered that the yellow area on the left had changed to red. A new host name row appeared with a blinking red dot under the column labelled conn. I clicked on the blinking red dot and read this message:
    red Tue Feb 18 22:59:11 CST 1997
    bb-network.sh: Can't connect to router-000... (paging)
The connection to the machine called router-000 had been interrupted, and the administrator had been paged. Amazingly, while in Texas, I had become aware of a network outage in Montréal, Canada. This really had possibilities—perhaps someday I may get to take a vacation.
I was so impressed with Big Brother that I decided to use it. Sean has thoughtfully made its acquisition easy, but requests that you fill out an on-line registration form with your name and e-mail address. He also likes to know where you heard about Big Brother. I filled out his forms in early November 1996, and received an e-mail survey form in late December. To download Big Brother and to get technical information about how the system works and how to install and configure the package, go to http://www.iti.qc.ca/iti/users/sean/bb-dnld/bb-dnld.html.
When I clicked on the link to download Big Brother, I ended up with a file called bb-src.tgz. I impetuously gunzipped this to get bb-src.tar. I then thought better of the impending error of my ways and decided to download and print the installation instructions before going further. Installation procedures for Big Brother can be found at http://www.iti.qc.ca/iti/users/sean/bb-dnld/bb-install.html, as well as other information about how to set up the system. Just in case, I also grabbed and printed the debugging information (as it turned out, I did not need it) provided at http://www.iti.qc.ca/iti/users/sean/bb-dnld/bb-debug.html.
I had no problems following the installation instructions. I decided to make the $BBHOME directory /usr/src/bb. The automatic configuration routines are said to work for AIX, FreeBSD, HPUX 10, Irix, Linux, NetBSD, OSF, Red Hat Linux, SCO, SCO 3/5, Solaris, SunOS4.1 and UnixWare. I can vouch for Linux, Red Hat Linux, Solaris and SunOS 4.1. The C programs compiled without incident, and the installation went smoothly. As always, your mileage may vary. In less than an hour, I was looking at Big Brother's display of colored lights.
At this point, it's a good idea to re-examine the documentation and information files. Personalize your installation as desired, and above all, have fun.
I admit it. I am a closet hacker. I saw many things about the stock BB distribution that I wanted to improve. Big Brother's modular and elegantly simple construction makes it a joy to modify as desired. The shell scripts are portable, simple, well documented and easy to understand. The use of the modified hosts file to determine which hosts to monitor was gratifyingly familiar. The bbclient script made it extremely easy to move the required components to another similar Unix host. Sean has done a remarkable job in making this package easy to install.
I became obsessive-compulsive about hacking BB and modified it slightly, working from Sean MacGuire's v1.03 distribution as a base. I forwarded my changes to him for possible inclusion in a later distribution.
Features I added to BB proper include:
Links to the info files in the brief view (bb2.html), where I needed them most.
Links to html info files for each column heading and the column info files themselves. I placed these files in the html directory along with bb.html and bb2.html, and gave them boring names like conn.html, cpu.html, ... smtp.html.
Checks to determine if ftp servers, pop3 post offices and SMTP Mail Transfer Agents (MTAs) are accessible ($BBHOME/bin/bb-network.sh). These checks all use bbnet to telnet to the respective ports. I followed Sean's style of adding comments to the bb-hosts file as follows:
    128.194.44.99    behemoth.tamu.edu   # BBPAGER smtp ftp pop3
    165.91.132.4     bryan-ctr.tamu.edu  # pop3 smtp
    128.194.147.128  csdl.tamu.edu       # http://csdl.tamu.edu/ ftp smtp
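To show how tags in the comment field can drive the checks, here is a minimal sketch that picks out the hosts carrying a given tag and maps each service to its well-known TCP port (21 for ftp, 25 for smtp, 110 for pop3). The function names hosts_with_service and service_port are hypothetical and are not part of bb-network.sh; the sketch also assumes the "#" stands alone as a separate field, as in the sample above. The actual connection test (done with bbnet in my changes) is left out.

```shell
#!/bin/sh
# Hypothetical helpers for reading service tags from a bb-hosts style file.

# hosts_with_service FILE SERVICE
# Prints the host name (second field) of every line whose comment
# contains SERVICE as a separate word.
hosts_with_service() {
    hostsfile=$1
    service=$2
    awk -v svc="$service" '
        {
            seen = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "#") { seen = 1; continue }
                if (seen && $i == svc) { print $2; break }
            }
        }' "$hostsfile"
}

# service_port SERVICE
# Prints the well-known TCP port for the three added checks.
service_port() {
    case $1 in
        ftp)  echo 21 ;;
        smtp) echo 25 ;;
        pop3) echo 110 ;;
        *)    return 1 ;;
    esac
}

# Example: list every host and port that a pop3 check would probe.
# for h in $(hosts_with_service bb-hosts pop3); do
#     echo "$h $(service_port pop3)"
# done
```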
Some environment variables to $BBHOME/etc/bbdef.sh for the added monitoring as follows:
    #
    # WARNING AND PANIC LEVELS FOR DIFFERENT
    # THINGS. SEASON TO TASTE
    #
    DFPAGE=Y        # PAGE ON DISK FULL (Y/N)
    CPUPAGE=Y       # PAGE FOR CPU Y/N
    TELNETPAGE=Y    # PAGE ON TELNET FAILURE?
    HTTPPAGE=Y      # PAGE ON HTTP FAILURE?
    FTPPAGE=Y       # PAGE ON FTPD FAILURE?
    POP3PAGE=Y      # PAGE ON POP3 PO FAILURE?
    SMTPPAGE=Y      # PAGE ON SMTP MTA FAILURE?
    export DFPAGE CPUPAGE TELNETPAGE HTTPPAGE \
           FTPPAGE POP3PAGE SMTPPAGE
Updated the bb-info.html and bb-help.html pages to reflect a version of 1.03a and a date of 10 February 1997. I also modified them to add brief mention of the new ftp, pop3 and smtp monitoring checks. Specifically, I changed the bb-help.html file to add new pager codes as follows:
100—Disk Error. Disk is over 95% full...
200—CPU Error. CPU load average is unacceptably high.
300—Process Error. An important process has died.
400—Message file contains a serious error.
500—Network error, can't connect to that IP address.
600—Web server HTTP error—server is down.
610—Ftp server error—server is down.
620—POP3 server error—PopMail Post Office is down.
630—SMTP MTA error—SMTP Mail Host is down.
911—User Page. Message is phone number to call back.
Added sections to the bb-info.html file to explain the ftp, pop3 and smtp monitoring.
Used a standard tag-line file on each html page that identifies the author and location of the page. Thus, mkbb.sh and mkbb2.sh now look for an optional tag-line file to incorporate into the html documents that they generate. The optional files are named mkbb.tag (for mkbb.sh) and mkbb2.tag (for mkbb2.sh). The shell scripts look for the optional tag-line files in the $BBHOME/web directory, which is also where the mkbb.sh and mkbb2.sh files reside.
Went through ALL of the html-generating scripts and html files to ensure that they actually had sections and properly placed double quotes around the various arguments.
Edited the files so that, for the most part, everything fits on an 80-column screen.
Modified $BBHOME/etc/bbsys.sh to make it easier to ignore certain disk volumes as follows:
    # DISK INFORMATION
    #
    DFSORT="4"                    # % COLUMN - 1
    DFUSE="^/dev"                 # PATTERN FOR LINES TO INCLUDE
    DFEXCLUDE="-->E dos|cdrom"    # PATTERN FOR LINES TO EXCLUDE
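The include and exclude patterns amount to two grep passes over the df output, followed by a sort on the percentage column. The sketch below shows the idea under stated assumptions: filter_df is a hypothetical name, the exclude pattern is passed as a plain extended regular expression (dos|cdrom) rather than in the exact form stored in DFEXCLUDE, and the use percentage is taken to be the fifth column of df -k output, which is presumably what the "% COLUMN - 1" comment on DFSORT refers to.

```shell
#!/bin/sh
# Hypothetical sketch of the include/exclude filtering; not the BB script.

# filter_df INCLUDE_RE EXCLUDE_RE
# Reads df output on stdin, keeps lines matching INCLUDE_RE, drops lines
# matching EXCLUDE_RE, and sorts by the fifth column (use %), fullest first.
filter_df() {
    grep -E "$1" | grep -E -v "$2" | sort -k 5,5nr
}

# Example: df -k | filter_df '^/dev' 'dos|cdrom'
```

Anchoring the include pattern at ^/dev also discards the df header line and pseudo file systems such as /proc.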
Modified $BBHOME/etc/bbsys.linux so that the ping program is properly found, as follows:
    # bbsys.linux
    #
    # BIG BROTHER
    # OPERATING SYSTEM DEPENDENT THINGS
    # THAT ARE NEEDED
    #
    PING="/bin/ping"              # LINUX CONNECTIVITY TEST
    PS="/bin/ps -ax"              # LINUX
    DF="/bin/df -k"
    MSGFILE="/var/adm/messages"
    TOUCH="/bin/touch"            # SPECIAL TO LINUX
Added the ability to dynamically traceroute and ping each system being monitored. I spoke with Sean about it, and, in keeping with the KISS (Keep It Simple, Stupid) principle, we thought these features were best added to the info files. The user portion is pretty obvious in the source of the info file. The cgi scripts are very simple shell scripts as shown in Listing 1.
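For readers who want a feel for how small such a CGI script can be, here is a rough sketch. It is not the article's Listing 1: the names valid_host and run_probe are hypothetical, and passing the host name in QUERY_STRING is an assumption about how the info page would call the script. Because the host name comes from a web request, the sketch rejects anything other than letters, digits, dots and hyphens before handing it to the tool.

```shell
#!/bin/sh
# Hypothetical CGI sketch; NOT the article's Listing 1.

# valid_host NAME
# Succeeds only if NAME is non-empty, does not start with a hyphen, and
# contains nothing but letters, digits, dots and hyphens.
valid_host() {
    case $1 in
        ''|-*|*[!A-Za-z0-9.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# run_probe TOOL HOST
# Prints a plain-text CGI response containing the output of TOOL HOST.
run_probe() {
    tool=$1
    host=$2
    echo "Content-type: text/plain"
    echo ""
    if valid_host "$host"; then
        "$tool" "$host" 2>&1
    else
        echo "Refusing to probe invalid host name."
    fi
}

# Example (assumes the web server passes the host in QUERY_STRING):
# run_probe traceroute "$QUERY_STRING"
```

The same wrapper would serve for ping, provided the ping in use is told to stop after a fixed number of packets.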
Sean MacGuire is the primary author of Big Brother. In the finest tradition of decentralized shared software development, Sean solicits improvements, suggestions and enhancements from all. He then skillfully incorporates them as appropriate into the Big Brother distribution. Thus, like Linux, Big Brother is in a dynamic state of positive evolution with contributions from a cast of thousands (at least dozens). This constrained anarchy produces interesting results with an international flavor.
Jacob Lundqvist of Sweden is actively improving the paging interface. He has done a superb job of enhancing the paging portion, adding support for alphanumeric and SMS pagers. Darren Henderson (Maine, US) added AIX support. David Brandon (Texas, US) added proper IRIX support and Jeff Matson (Minnesota, US) made some IRIX fixes. Richard Dansereau (Canada) ported Big Brother to SCO3 and provided support for other df's. Doug White (Oregon, US) made some paging script bug fixes. Ron Nelson (Minnesota, US) adapted BB to Red Hat Linux. Jac Kersing (Netherlands) made some security enhancements to bbd.c. Alan Cox (Wales) suggested some shell script security modifications. Douwe Dijkstra (Netherlands) provided SCO 5 support. Erik Johannessen (Minnesota, US) survived SunOS 4.1.4 installation. Curtis Olson (Minnesota, US) survived IRIX, Linux and SunOS installations. Gunnar Helliesen (Norway) ported Big Brother to Ultrix, OSF and NetBSD. Josh Wilmes (Missouri, US) added Solaris changes for new ping stuff.
Many other unsung heroes around the world are undoubtedly working to enhance BB at this very moment.
I am (ab)using Big Brother in ways not originally envisioned by its creator, Sean MacGuire. Texas Agricultural Extension's networks are wildly heterogeneous mixtures of different operating systems and protocols, rather than a homogeneous Unix-based network. I would like to see Big Brother learn about IPX/SPX protocols for Novell connectivity monitoring. I would also like to see Big Brother data collection modules for Macintosh, Novell, OS/2, Windows 3.1x, Windows 95 and Windows NT. Rewriting Big Brother in Perl might better serve these disparate platforms, if I could only find the time.
We now monitor around 122 hosts. Only 20 are actually Unix-based hosts that run Big Brother's bb program internally. Some 28 are Novell servers, 39 are routers, and the rest are a mixture of Macintosh, OS/2, Windows 3.1x, Windows 95 and Windows NT machines running one or more types of servers (34 FTP and 26 HTTP). We also find it useful to monitor our 31 PopMail post offices and 43 mail hosts and gateways. We are checking connectivity on three DNS servers as well, since they are mission critical.
Big Brother (or, as I now affectionately refer to it, “Big Bother”) is now alerting us to outages five or more times daily. Typically, the system administrator receives a page. BB's display is checked and the info file is used to traceroute and ping the offending machine to validate the outage. Many connection outages involve routers, DSU/CSUs and multiplexors as well as the actual host. BB's display allows us to quickly see a pattern that aids in diagnosis. The ability to dynamically traceroute and ping the host from the html info page also helps to rapidly determine the actual point of failure. If the administrator paged cannot correct the problem, he relays it to the responsible person or agency.
Before we installed Big Brother, we were frequently notified of these failures by frustrated users telephoning us. Now, we are often aware of what has failed before they call. The users are also becoming aware that they can monitor the network through the WWW interface. In many instances, we are able to actually correct the problem before it disturbs our users. It is difficult to accurately measure the time saved, but we estimate that Big Brother has had a net positive effect overall.
We have a machine in a publicly visible area displaying the brief view of Big Brother. The green, yellow, red and blue screen splashes are clearly visible far down the hall, helping our network team to be more aware of problems as they occur. The accessibility of the WWW page has made Big Brother useful even to people at the far ends of our network. Thus, Big Brother has become a helpful member of our network team. Maybe now I'll have time to be bored.
Paul Sittler (p-sittler@tamu.edu) is a human being in the service of Texas Agricultural Extension, a part of the Texas A&M University System. As a human being he is, of course, a skilled tool-maker. He enjoys playing with technology and tries to make it useful to others of his species. He is a shy man of simple tastes, who still has a discriminating palate with respect to German wine. He is multilingual, being at least marginally conversant in several human languages and competent in several computer dialects as well. He was born with a peculiar genetic defect that requires him to disassemble and reassemble things rather than merely use them.