Cluster Hardware Torture Tests
Without stable hardware, any program will fail. The expense of supporting bad hardware can drain an organization, delay progress and frustrate everyone involved. At Stanford Linear Accelerator Center (SLAC), we have created a testing method that helps our group, SLAC Computing Services (SCS), weed out potentially bad hardware and purchase the best hardware at the best possible cost. Commodity hardware changes often, so new evaluations happen each time we purchase systems. We also do minor re-evaluations of revised systems for our clusters about twice a year. This general framework helps SCS perform accurate, efficient evaluations.
This article outlines our computer testing methods and system acceptance criteria. We expanded our basic ideas to other evaluations, such as storage. The methods outlined here help us choose hardware that is much more stable and supportable than our previous purchases. We have found that commodity hardware ranges in quality, so systematic methods and tools for hardware evaluation are necessary. This article is based on one instance of a hardware purchase, but the guidelines apply to the general problem of purchasing commodity computer systems for production computational work.
Maintaining system homogeneity in a growing cluster environment is difficult, as the hardware available to build systems changes often. This adds complexity to management and to software support for new hardware, and it undermines system stability. Furthermore, introducing new hardware can introduce new hardware bugs. To constrain change and efficiently manage our systems, SCS developed a number of tools and requirements to enable an easy fit for new hardware into our management and computing framework. We reduced the features to the minimum that would fit our management infrastructure and still produce valid results with our code. This is our list of requirements (a sketch of how a delivered system might be checked against it follows the list):
One rack unit (1U) case with mounting rails for a 19" rack.
At least two Intel Pentium III CPUs at 1GHz or greater.
At least 1GB of ECC memory for every two CPUs.
100Mb Ethernet interface with PXE support on the network card and in the BIOS.
Serial console support with BIOS-level access support.
One 9GB or larger system disk, 7,200 RPM or greater.
All systems must be FCC- and UL-compliant.
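Several of these requirements can be spot-checked from a running Linux system. The following is a minimal sketch of such a check, assuming the standard /proc and /sys interfaces and a first disk at /dev/sda; the thresholds mirror the list above, and the script is an illustration rather than part of the SCS tooling. ECC and PXE support are not visible this way and still have to be verified in the BIOS or vendor documentation.

#!/usr/bin/env python
# Hypothetical spot-check of a delivered system against the requirements
# list above. The device name, paths and tolerance are assumptions for this
# sketch, not SCS tooling; ECC and PXE support still must be checked in the BIOS.
import sys

def cpu_count():
    # Count "processor" entries in /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        return sum(1 for line in f if line.startswith("processor"))

def mem_total_gb():
    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024.0 * 1024.0)
    return 0.0

def disk_size_gb(dev="sda"):
    # /sys/block/<dev>/size reports the disk size in 512-byte sectors.
    with open("/sys/block/%s/size" % dev) as f:
        return int(f.read()) * 512 / 1e9

cpus, mem, disk = cpu_count(), mem_total_gb(), disk_size_gb()
need_mem = cpus / 2.0                        # 1GB for every two CPUs
checks = [
    ("at least two CPUs", cpus >= 2),
    ("at least %.1fGB memory" % need_mem, mem >= need_mem - 0.1),  # allow for kernel reservation
    ("at least 9GB system disk", disk >= 9),
]
print("CPUs=%d memory=%.1fGB disk=%.1fGB" % (cpus, mem, disk))
failed = [name for name, ok in checks if not ok]
if failed:
    sys.exit("FAILED: " + ", ".join(failed))
print("all checks passed")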
Developing a requirements list was one of the first steps of our hardware evaluation project. Listing only must-haves, as opposed to nice-to-haves, grounded the group. It also discouraged feature creep, useless hardware additions and vendor-specific ways of doing a task. This simple list culled the field of possible vendors and reduced the tendency to add complexity where none was needed. Through it, we chose 11 vendors to participate in our test/bid process. A few vendors proposed more than one model, so a total of 13 models were evaluated.
The 11 vendors we chose ranged from large system builders to small screwdriver shops. The two criteria for participating in the evaluation were to meet the list of basic requirements and send three systems for testing. We wanted the test systems for 90 days. In many cases, we did not need the systems that long, but it's good to have the time to investigate the hardware thoroughly.
For each system evaluation, two of the three systems were racked, and the third was placed on a table for visual inspection and testing. The systems on the tables had their lids removed and were photographed digitally. Later, the tabled systems were used for the power and cooling tests and the visual inspection. The other two systems were integrated into a rack in the same manner as all our clustered systems, but they did not join the pool of production systems. Some systems had unique physical sizing and racking restrictions that prevented us from using them.
Each model of system had a score sheet. The score sheets were posted on our working group's Web page. Each problem was noted on the Web site, and we tried to contact the vendor to resolve any issues. In this way we tested both the system and the vendor's willingness to work with us and fix problems.
We had a variety of experiences with all the systems evaluated. Some vendors simply shipped us another model, and some worked through the problem with us. Others responded that it was not a problem, and one or two ignored us. This quickly narrowed the systems that we considered manageable.
Throughout the period of testing, if a system was not completing a specific task, it was running hardware testing scripts or run-in scripts. Each system did run-in for at least 30 days. No vendor does run-in for more than 72 hours, so our longer cycle allowed us to see failures that appear only over the long term. Other labs reported that they also saw problems over long testing cycles.
In general, we wanted to evaluate a number of aspects of all the systems: the quality of physical engineering, operation, stability and system performance. Finally, we evaluated each vendor's contract, support and responsiveness.
The systems placed on the table were evaluated based on several criteria: quality of construction, physical design, accessibility, quality of power supply and cooling design. To start, the systems varied greatly in quality of construction. We found bent-over, jammed ribbon cables, blocked airflow, flexible cases and cheap, multiscrew access points that were unbelievably bad for a professional product. We found poor design decisions, including a power switch offset in the back of a system that was nearly inaccessible once the system was racked. On the positive side, we came across a few well-engineered systems.
Our evaluation included quality of airflow and cooling, rackability, size/weight and system layout. Features such as front-accessible drive bays were also noted. Airflow is a big problem with hot x86 CPUs, especially in restricted spaces such as a 1U rack system. Some systems had blocked airflow or little to no circulation. Heat can cause instability in systems and reduce operational lifetimes, so good airflow is critical.
Case rigidity, the absence of sharp edges, how the system fits together and cabling also belong in this category. These might seem like small, uninteresting factors until you get cut by a system case or see a large percentage of dead-on-arrival systems because they were mishandled by the shipper and the cases were too weak to take the abuse. We have to use these systems for a number of years; a simple yet glaring problem is a pain and potentially expensive to maintain.
Tool-less access should be standard on all clustered systems. When you have thousands of systems, you are always servicing some of them. To keep the cost of that service low, parts should be quickly and easily replaceable. Unscrewing and screwing six to eight tiny machine screws slows down access to the hardware. Layouts in which one part does not have to come out to reach another, and in which drives are easy to get at, are pluses. Some features we did not ask for, such as keyboard and monitor connections on the front of the case, are fine but not really necessary.
We tested the quality of the power supply using a Dranetz-BMI Power Quality Analyzer (see Sidebar). Power factor correction often is noted in the literature for a system, but we have seen measurements radically different from the published number. For example, one power supply with a published power factor of .96 actually measured .49. This can have terrible consequences when multiplied by 512 systems. We tested the systems at idle and under heavy load. The range of quality was dramatic and an important factor in choosing a manageable system.
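To see why that gap matters, consider the apparent power and line current behind the two power factors. The sketch below assumes a hypothetical 150W real draw per node and a 120V supply; only the .96 and .49 power factors and the 512-node count come from the measurements discussed above.

# Back-of-the-envelope effect of power factor on a 512-node cluster.
# The 150W per-node draw and 120V line voltage are assumptions for this
# illustration; the 0.96 and 0.49 power factors and the 512-node count
# come from the measurements discussed above.
nodes, watts, volts = 512, 150.0, 120.0
for pf in (0.96, 0.49):
    va_per_node = watts / pf                   # apparent power the supply actually draws
    amps_total = nodes * va_per_node / volts   # total line current for the cluster
    print("PF %.2f: %.0f VA per node, %.0f A total" % (pf, va_per_node, amps_total))

At a power factor of .49, the same cluster draws roughly twice the line current it would at .96, and that surprise lands on the power infrastructure rather than on any individual node.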
The physical inspection, features, cooling and power-supply quality tests weeded out a number of systems early in the process. Eliminating these right away reduced the number of systems that needed extensive testing, thereby reducing the amount of time spent on testing overall. System engineering, design and quality of parts ranged broadly.
Measuring Power Supply Quality
Power supplies often come with inaccurate quality claims. We have experienced a number of problems due to poor-quality power supplies, so we test every system's power supply. For an accurate measurement, a Power Quality Analyzer is used to measure systems at idle and under heavy load.
Prior to employing our test methods, SCS built a cluster with poor power supplies and experienced a range of problems. One of the most expensive problems was current being mismanaged by the power supply. Three-phase power distribution systems often are designed with the assumption of nicely balanced loads across the three phases, which results in the neutral current approaching zero. The resulting designs usually use the same gauge wiring on the neutral as on the supply lines.
Unfortunately, low-quality power supplies generate large third-harmonic currents, which are additive in the neutral line. The potential result is neutral current loads in excess of the rated capacity of the wiring, to say nothing of the transformers that were not rated for such loading. And, because code does not allow the neutral to be fused, it was possible to exceed the neutral wiring capacity without tripping a breaker on the supply lines. This required derating all parts of the infrastructure to remain within spec. Derating is expensive and time consuming, and the cluster cannot be used while the work is done.
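The cancellation-versus-addition effect is easy to see numerically. In the sketch below, the per-phase fundamental and third-harmonic amplitudes are hypothetical; the point is that the fundamentals, 120 degrees apart, sum to zero in the neutral, while the third harmonics line up and add.

# Numerical illustration of third-harmonic addition in the neutral.
# The 10A fundamental and 3A third-harmonic amplitudes per phase are
# hypothetical values chosen only to show the effect.
import math

fund, third = 10.0, 3.0       # per-phase amplitudes (A)
samples = 10000
total = 0.0
for n in range(samples):
    t = n / float(samples)    # one cycle of the fundamental
    i_neutral = 0.0
    for k in range(3):        # three phases, 120 degrees apart
        phase = 2 * math.pi * k / 3
        i_neutral += fund * math.sin(2 * math.pi * t - phase)        # cancels across phases
        i_neutral += third * math.sin(3 * (2 * math.pi * t - phase)) # 3 x 120 = 360 degrees: adds
    total += i_neutral * i_neutral
print("neutral RMS current: %.2f A" % math.sqrt(total / samples))
# Prints about 6.36 A: three times the per-phase third-harmonic RMS (3/sqrt(2)),
# even though the fundamentals cancel completely.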
Thanks to Gary Buhrmaster for help on this Sidebar.
Run-in (often called burn-in) is the process manufacturers use to stress-test systems to find faulty hardware before they put them in the field. A number of open-source run-in programs are available. One common program is the Cerberus Test Control System (sourceforge.net/projects/va-ctcs). It is a series of tests and configurable wrapper scripts originally designed for VA Linux Systems' manufacturing. Cerberus is ideal for run-in tests, but we also developed specific tests based on our knowledge of system faults. We were successful in crashing systems with our scripts more often than when using a more general tool. Testing with programs developed from system work experience can be more effective than using Cerberus alone, so consider creating a repository of testing tools.
Read the instructions carefully, and understand that run-in programs can damage a system; you assume the risk by running Cerberus. Also, there are a number of software knobs to turn, so consider what you are doing before you launch the program. But if you are going to build a cluster, you need to test system stability, and run-in scripts are designed to test exactly that quality.
At the time we were testing systems, two members of our group wrote their own run-in scripts based on some of the problems we had seen in our production systems. Whereas benchmarks try to measure system performance and often use sophisticated methods, run-in scripts are simple processes. A system is put under load and either passes or fails. A failure crashes the system or reports an error; a pass often reports nothing at all. We also ran production code, which uncovered more problems; production code always should be run whenever possible. For instance, one of the systems that passed the initial design inspection tests with flying colors failed under heavy load.
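The flavor of such a script is easy to convey. The following is a bare-bones sketch in that pass/fail spirit, not one of the SCS scripts: it loops forever, writing and re-reading pseudo-random data to exercise disk, memory and the I/O path, and speaks up only when a pass fails. The scratch path and pass size are arbitrary choices.

#!/usr/bin/env python
# Bare-bones run-in-style loop, in the pass/fail spirit described above.
# This is an illustrative sketch, not one of the SCS scripts; the scratch
# path and pass size are arbitrary.
import hashlib, os, sys

CHUNK = 1 << 20              # 1MB blocks
BLOCKS = 256                 # 256MB written and re-read per pass
PATH = "/tmp/runin.dat"      # hypothetical scratch file

passno = 0
while True:
    passno += 1
    h_write = hashlib.md5()
    with open(PATH, "wb") as f:
        for _ in range(BLOCKS):
            buf = os.urandom(CHUNK)     # pseudo-random data keeps caches honest
            h_write.update(buf)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    h_read = hashlib.md5()
    with open(PATH, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            h_read.update(buf)
    if h_read.digest() != h_write.digest():
        sys.stderr.write("FAIL: data miscompare on pass %d\n" % passno)
        sys.exit(1)
    # A successful pass is silent; just keep looping until someone stops it.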
A plethora of benchmark programs is available. The best benchmark is to run the code that will be used in production, just as it is good to run production code during run-in. This is not always possible, so a standard set of benchmarks is a decent alternative. Also, standard benchmarks establish a relative performance value between systems, which is good information. We do not expect a dramatic performance difference among commodity chipsets and CPUs. Performance differences do exist, however, when different chipset and motherboard combinations are involved, which was the case in this testing trial.
We also wrote a wrapper around a number of standard benchmarking tools and packaged it into a tool called HEPIX-Comp (High Energy Physics—Compute). It is a convenience tool, not a benchmark program itself. It allows a simple "make server" or "make network" to measure different aspects of a system. For example, HEPIX-Comp wraps the following tools (among others): Bonnie++, IOZone, Netpipe, Linpack, the NFS Connectathon package and streams.
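HEPIX-Comp itself is make-driven, but the idea of such a wrapper is simple enough to sketch. The fragment below is an illustration under assumed command names and flags, not the HEPIX-Comp code: it runs each benchmark, captures its output in a log file and records the wall-clock time, so every system under test produces the same set of artifacts.

#!/usr/bin/env python
# Simplified benchmark wrapper in the spirit of HEPIX-Comp, not the tool
# itself. The command names and flags are assumptions; swap in whatever
# benchmarks are installed on the system under test.
import os, subprocess, time

BENCHMARKS = {
    "bonnie++": ["bonnie++", "-d", "/tmp", "-u", "nobody"],
    "iozone":   ["iozone", "-a"],
    # streams, Netpipe, Linpack and the NFS tests would be added the same way
}
LOGDIR = "bench-logs"
os.makedirs(LOGDIR, exist_ok=True)

for name, cmd in sorted(BENCHMARKS.items()):
    start = time.time()
    with open(os.path.join(LOGDIR, name + ".log"), "w") as log:
        rc = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
    print("%-10s exit=%d  %.1f s" % (name, rc, time.time() - start))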
Understanding the character of the code that runs on the system is paramount when evaluating with standard benchmarks. For example, if you are network-constrained, a fast front-side bus is less important than network bandwidth or latency. These are good benchmarks that measure different aspects of a system. Streams, for example, measures memory subsystem throughput, an important measure for systems with hierarchical memory architectures. Bonnie++ measures different read/write combinations for I/O performance.
Many vendors report performance figures that give the best possible picture. For example, sequential writes as an I/O performance measure look pretty rosy compared to small random writes, which are closer to reality for us. Having a standardized test suite run under the Linux installation used in production establishes a baseline measurement. If a system is tuned for one benchmark, it might perform that benchmark well at the expense of another performance factor; for example, tuning for large-block sequential writes can hurt small random writes. A baseline benchmark suite at least shows an apples-to-apples comparison, although not necessarily the best possible performance. So, this is by no means a perfect system, but rather one more data point in an evaluation that characterizes system performance.
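The gap between those two numbers is easy to demonstrate on any machine. The sketch below times 1MB sequential writes against 4KB writes at random offsets in the same file; the file size, block sizes and scratch path are arbitrary, and this is a rough illustration of the access-pattern effect rather than a calibrated benchmark like Bonnie++ or IOZone.

#!/usr/bin/env python
# Rough illustration of why sequential-write numbers flatter a disk compared
# with small random writes. The file size, block sizes and scratch path are
# arbitrary choices for this sketch, not a calibrated benchmark.
import os, random, time

PATH = "/tmp/iopattern.dat"
SIZE = 64 * 1024 * 1024          # 64MB test file

def timed(func):
    start = time.time()
    func()
    return time.time() - start

def sequential_writes():
    with open(PATH, "wb") as f:
        for _ in range(SIZE // (1 << 20)):
            f.write(b"\0" * (1 << 20))       # 1MB blocks, written in order
        os.fsync(f.fileno())

def random_small_writes():
    with open(PATH, "r+b") as f:
        for _ in range(4096):
            f.seek(random.randrange(0, SIZE - 4096))
            f.write(os.urandom(4096))        # 4KB blocks at random offsets
        os.fsync(f.fileno())

print("sequential 1MB writes: %.2f s" % timed(sequential_writes))
print("random 4KB writes:     %.2f s" % timed(random_small_writes))
os.remove(PATH)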
All the data was collected and placed on internal Web pages created for the evaluation and shared among the group. We met once a week and reported on the progress of the testing. After our engineering tests were complete, we chose a system.
Non-engineering factors (contractual agreements, warranties and terms) are critical to the success of bringing in new systems for production work. The warranty terms and length affect the long-term cost of system support. Another consideration is the financial health of the vendor company. A warranty does little good if the vendor is not around to honor it.
Also crucial are the acceptance criteria, although they are seldom talked about until it is too late. These criteria determine the point in the deployment at which the vendor is finished and the organization is willing to accept the systems. This point should be made in writing in your purchase order. If the vendor drops the systems off at the curb and some hardware-related problem surfaces later, during the rollout period, you need to be within your rights to ask the vendor to fix the problem or remove the system. On the vendor side, a clear distinction needs to be made between what constitutes a hardware problem and what constitutes a software problem. Often a vendor has to work with the client to determine the nature of a problem, so those costs need to be built into the price of the system.
The success of the method outlined in this article is apparent in how much easier, and therefore cheaper, it is to run the systems we chose after this extensive evaluation. We have other systems that we purchased without the qualification outlined here, and we have had fewer problems with the systems that went through the better evaluation. As a result, we are able to get more work done in other areas, such as tool writing and infrastructure development. And with good hardware in production, we and our researchers are less frustrated.
John Goebel works at the Stanford Linear Accelerator Center (SLAC) in Menlo Park, California. He is part of the SLAC Computing Services (High-Performance Group), supporting a high-energy physics project for a worldwide research community.