A Process for Managing and Customizing HPC Operating Systems
For the past ten years, high-performance computing (HPC) has been dominated by clusters of thousands of Linux servers connected by a uniform networking infrastructure. The defining theme of an HPC cluster is uniformity. This uniformity matters most at the application level: communication between all systems in the cluster must be the same, the hardware must be the same, and the operating system must be the same. Any difference in these features must be presented to the user as an explicit choice. This uniformity and consistency in running software is of utmost importance and is what separates HPC clusters from other Linux clusters.
The uniformity also must persist over time. Upgrades and security fixes should never affect application correctness or performance. However, security concerns in HPC environments require updates to be applied in a timely fashion. These two requirements conflict and need to be managed by well-documented processes that involve testing and regular outages.
A process for managing these requirements was developed at the Environmental Molecular Sciences Laboratory (EMSL) during the past ten years. EMSL supports HPC for the United States Department of Energy (DOE) and the open science community. This process gives EMSL an edge in maintaining a secure platform for large computational chemistry simulations that complement instrument research done at EMSL.
Requirements

The process developed at EMSL to maintain HPC clusters has its roots in standard software testing models. The process involves three phases: build testing, integration testing and production. Each phase has its own hardware, software and organizational requirements. Other important systems include configuration management, continuous monitoring and repository management. All of these systems have well defined roles to play in the overall process and need dedicated hardware, separate from the production cluster, to support them.
The build integration phase requires two primary components: package repository management and continuous integration software. These two components interact to give software developers and system administrators knowledge of bugs in individual pieces of software before those updates reach integration testing. Automating this form of testing for critical applications is important because it helps facilitate communication between the operations and development teams.
The integration testing phase requires a test cluster that closely matches the production cluster. The primary difference between the production and test clusters, for HPC, is scale. The test cluster should include at least one of every kind of Linux host in the production cluster, including configuration management and continuous monitoring, just in smaller numbers. Furthermore, the Linux hosts should match the production configuration as closely as possible. Any deviations between the production and test clusters' configuration, in both hardware and software, should be well documented. This documentation helps define the accepted technical risks that might be encountered during production outages.
The production cluster is the culmination of all the preparation done in the build and testing phases. Leading up to an outage, the tasks to be performed during the outage should be identified and documented along with the planned operating system upgrades. These documents should be stored where they are easily accessible to both developers and management, and easy for operational staff to modify and use to track issues. Along with the plan, documented processes for moving configuration management and continuous monitoring changes from testing to production also should be followed.
We have identified the infrastructure required to support and automate the process for managing your own Linux HPC operating system. The build integration phase needs a dedicated build system along with package repository management and continuous integration software. The integration testing phase requires test cluster hardware plus continuous monitoring and configuration management software. Finally, the production cluster also should integrate with the configuration management and continuous monitoring software.
Several systems are not covered here but are critical to integrate into the process. Site-specific backup solutions should be considered for every phase of the process. Furthermore, automated provisioning systems also should be considered for use with this process. At EMSL, we have used both; they are not strictly required by the process, but they do help us sleep better at night.
Build Phase

The build phase is the start of the process. There are three inputs into the system: binary packages, source code packages and tickets. These three inputs produce three outputs: a set of base repositories, a set of patches for upstream contribution and an overlay repository of modified packages. These inputs and outputs provide the operating-system fixes needed for your site while contributing them back to the communities that support them. To understand this process completely, let me break down the components and talk about their requirements.
The package repository management system is utilized throughout the process but first appears in the build phase. It should be able to download binary package repositories from an upstream distribution and keep those downloaded repositories in sync with it. The first set of repositories should be a local copy of the upstream distribution, including updates, synchronized daily. As an added feature, the package repository management system also should be able to exclude certain packages selectively from being downloaded. This feature complements the contents of the overlay repository, which is where custom builds of packages are placed to enhance the base distribution.
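As a concrete sketch of the daily synchronization, the following Python script mirrors a set of upstream repositories with dnf's reposync plugin and regenerates the repository metadata. It assumes a RHEL-like distribution; the repository IDs, mirror path and script itself are placeholders rather than part of any particular product.

```python
#!/usr/bin/env python3
"""Nightly mirror of upstream package repositories (illustrative sketch).

Assumes a RHEL-like distribution where `dnf reposync` and `createrepo_c`
are available; the repository IDs and paths below are placeholders.
"""
import subprocess
from pathlib import Path

MIRROR_ROOT = Path("/srv/repos/base")                # hypothetical local mirror root
UPSTREAM_REPOS = ["baseos", "appstream", "updates"]  # placeholder repository IDs

def sync_repo(repoid: str) -> None:
    """Pull one upstream repository into the local mirror, newest packages only."""
    subprocess.run(
        ["dnf", "reposync", "--repoid", repoid, "--newest-only",
         "--download-path", str(MIRROR_ROOT)],
        check=True,
    )
    # Regenerate repository metadata so clients can point straight at the mirror.
    subprocess.run(["createrepo_c", "--update", str(MIRROR_ROOT / repoid)], check=True)

if __name__ == "__main__":
    for repo in UPSTREAM_REPOS:
        sync_repo(repo)
```

A cron entry or a scheduled continuous integration job can run a script like this nightly to keep the local copies in sync.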
The content of the overlay repository is specific to the critical applications in the distribution that need to be managed separately. For example, HPC sites might be most concerned about the kernel build, the OpenFabrics Enterprise Distribution (OFED) and the software that implements the Message Passing Interface (MPI). This software is then removed from the base distribution and added back in an overlay repository. Furthermore, there can be multiple overlay repositories. For example, security concerns may dictate that the kernel be managed separately from the rest of the distribution. Having the kernel in a separate overlay repository means that the testing phase can be skipped with minimal impact while still maintaining a secure cluster.
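To illustrate how an overlay repository might be carved out, the sketch below excludes the locally managed packages from the base mirror and publishes the site-built replacements from a separate directory. The package patterns and paths are hypothetical, and dnf's --exclude option is assumed here to filter what reposync downloads.

```python
#!/usr/bin/env python3
"""Sketch of an overlay repository: exclude the site-managed packages from the
base mirror and publish the locally rebuilt replacements separately.
Paths and package patterns are placeholders."""
import subprocess
from pathlib import Path

BASE_MIRROR = Path("/srv/repos/base")
OVERLAY_DIR = Path("/srv/repos/overlay")                      # holds locally rebuilt RPMs
MANAGED_PKGS = ["kernel*", "ofed*", "openmpi*", "mvapich2*"]  # illustrative patterns

def sync_base_without_managed(repoid: str) -> None:
    """Mirror the upstream repo but skip packages we rebuild ourselves."""
    cmd = ["dnf", "reposync", "--repoid", repoid,
           "--download-path", str(BASE_MIRROR)]
    for pattern in MANAGED_PKGS:
        cmd += ["--exclude", pattern]   # assumed to keep these out of the sync
    subprocess.run(cmd, check=True)
    subprocess.run(["createrepo_c", "--update", str(BASE_MIRROR / repoid)], check=True)

def publish_overlay() -> None:
    """Regenerate metadata for the overlay repo of custom kernel/OFED/MPI builds."""
    OVERLAY_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["createrepo_c", "--update", str(OVERLAY_DIR)], check=True)

if __name__ == "__main__":
    sync_base_without_managed("baseos")
    publish_overlay()
```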
The packages in the overlay repository are patched to match the needs of the organization. The continuous integration system should be used to patch these specific packages and maintain the builds through future updates. These patches should be submitted back to the upstream distribution along with good reasons why each patch is needed. Some of these patches may be accepted by upstream developers and make it into the distribution, while others may take years to land due to policy decisions on the part of the distribution maintainers.
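A continuous integration job that carries a site patch forward could look roughly like the following. This is only a sketch: the package, patch file and spec handling are hypothetical, and a real job also would publish the resulting RPMs into the overlay repository.

```python
#!/usr/bin/env python3
"""Illustrative CI build step: rebuild an upstream source RPM with a site patch.
The package name, patch file and paths are hypothetical."""
import shutil
import subprocess
from pathlib import Path

SRPM = Path("openmpi-4.1.5-1.el8.src.rpm")                # hypothetical upstream source RPM
SITE_PATCH = Path("patches/openmpi-site-affinity.patch")  # hypothetical site patch
TOPDIR = Path.home() / "rpmbuild"                         # default rpmbuild tree for a non-root user

def rebuild_with_patch() -> None:
    # Unpack the source RPM into the rpmbuild tree (SPECS/ and SOURCES/).
    subprocess.run(["rpm", "-ivh", str(SRPM)], check=True)
    # Drop the site patch next to the upstream sources; the spec file is assumed
    # to be edited (here or beforehand) to apply it.
    shutil.copy(SITE_PATCH, TOPDIR / "SOURCES")
    # Rebuild the package; a failure here surfaces in the CI job long before
    # the update reaches integration testing.
    subprocess.run(["rpmbuild", "-ba", str(TOPDIR / "SPECS" / "openmpi.spec")], check=True)

if __name__ == "__main__":
    rebuild_with_patch()
```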
Another job of the continuous integration system is to support the continuous build and testing of additions that the distribution itself does not provide. These additions may be site-specific applications or open-source software not supported by the distribution. Many open-source software projects maintain compatibility with enterprise distributions but do not seek inclusion in them for reasons of project funding and support.
The final piece of the build phase is the ticket-tracking system. This system gives package developers insight into what needs to be fixed and how. These tickets may come directly from users or from cluster administrators. Furthermore, the users and cluster administrators may use completely different ticketing systems. This piece of the process helps facilitate communication between groups. Having a list of tickets allows objective discussion about priority and ensures that tickets are not forgotten. Tickets may stay open for years or days, depending on priority and the rate of ticket creation. The tickets do not stop with the package developers; the cluster administrators use this system in later phases.
The package management and continuous integration systems are automated processes, while the ticket-tracking system requires human interaction. These systems can be deployed on a single host. However, the process does require that three copies of the package repositories be present for the later phases. Furthermore, there are features of the continuous integration system that integrate with the ticket-tracking system. Enabling this integration does require a certain level of stability in the continuous integration build process. Many of the specifics of these systems are not covered here and will be discussed later.
Test Phase

The integration testing phase requires the package repository management, continuous monitoring and configuration management systems. These three systems help keep the test cluster in a state where integration testing can be done by automated processes. Furthermore, the test cluster hardware configuration should represent all critical aspects of the production cluster in order to mitigate risk to the production cluster.
The package repository management system plays a role in all three phases of the process. This is the first phase in which the packages, including the additions, are tested in production-like configurations. The daily repositories, including the overlay repository, are synchronized to a set of testing repositories to be used by the test cluster. This synchronization ensures a consistent environment in which to perform tests.
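The promotion itself can be as simple as a point-in-time copy, so that integration tests always run against a frozen set of packages. In the sketch below, the stage layout under /srv/repos is an assumption; the same step is reused later to move the testing repositories into production.

```python
#!/usr/bin/env python3
"""Promote the daily mirrors (base plus overlay) to the testing repositories
as a point-in-time snapshot.  Paths are placeholders; the same promotion step
is reused later to move the testing repositories into production."""
import subprocess

STAGES = {
    "daily":      "/srv/repos/daily/",
    "testing":    "/srv/repos/testing/",
    "production": "/srv/repos/production/",
}

def promote(src_stage: str, dst_stage: str) -> None:
    """Copy one stage's repositories to the next; --delete keeps them identical."""
    subprocess.run(
        ["rsync", "-a", "--delete", STAGES[src_stage], STAGES[dst_stage]],
        check=True,
    )

if __name__ == "__main__":
    promote("daily", "testing")        # freeze today's packages for integration tests
    # promote("testing", "production") # run during a production outage
```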
Every time updates are synchronized to the testing repositories, a set of integration tests should be performed on the test cluster. These tests should be designed to simulate the usage of the production cluster. It's important to focus the tests on critical user-level applications and parts of the operating system you have replaced and put into the overlay repository. The continuous integration system can run these tests and alert on failures.
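The tests themselves can stay small and focused on what was replaced. The following sketch checks that the overlay kernel is the one actually running and that a trivial MPI job completes; the expected kernel version and the direct mpirun invocation are illustrative, and a real test would submit through the batch scheduler.

```python
#!/usr/bin/env python3
"""Minimal integration checks the CI system could run against the test
cluster after each repository sync.  The expected kernel version and the
mpirun invocation are illustrative and site-specific."""
import subprocess

EXPECTED_KERNEL = "4.18.0-513.el8.x86_64"   # hypothetical overlay kernel build

def test_kernel_version() -> None:
    """The node should be running the kernel shipped in the overlay repository."""
    running = subprocess.run(["uname", "-r"], capture_output=True,
                             text=True, check=True).stdout.strip()
    assert running == EXPECTED_KERNEL, f"unexpected kernel: {running}"

def test_mpi_hello_world() -> None:
    """A trivial two-rank MPI run exercising the overlay MPI stack."""
    result = subprocess.run(["mpirun", "-np", "2", "hostname"],
                            capture_output=True, text=True, check=True)
    assert len(result.stdout.splitlines()) == 2, "expected output from two ranks"

if __name__ == "__main__":
    test_kernel_version()
    test_mpi_hello_world()
    print("integration checks passed")
```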
Failures in the integration tests should be reported in the ticket-tracking system. This is one of the paths that completes the circle of development. Other tickets cover deployment and re-installation issues. Complex internal infrastructure in the production cluster also may present upgrade issues, and those issues should be tracked as well. The test cluster should be managed by the same procedures as the production cluster, and those procedures should be practiced on the test cluster to minimize tickets before updates are deployed to production. All of these tasks should be repeated until few new tickets are being created, minimizing the risk to the production cluster.
The configuration management and continuous monitoring systems are set up similarly in the test and production environments. These two systems help protect the production state from inadvertent hardware or software changes and, thus, need to be tested when deliberate changes are made. These changes also need to be easy to integrate into the production environment, so it is prudent to maintain the configuration for these two systems in a source code management repository that supports branching and merging. This allows standard processes for making changes and pushing those changes between the testing and production environments.
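If the configuration trees live in git, promotion and merge-back reduce to ordinary branch operations. The sketch below assumes a local convention of "testing" and "production" branches and a hypothetical repository path; the merge-back function covers the emergency production-only changes discussed later.

```python
#!/usr/bin/env python3
"""Sketch of promoting configuration management changes between environments
by merging git branches.  The repository path and the branch names
('testing', 'production') are assumptions about local convention."""
import subprocess

CONFIG_REPO = "/srv/config/cluster-config"   # hypothetical configuration checkout

def git(*args: str) -> None:
    subprocess.run(["git", "-C", CONFIG_REPO, *args], check=True)

def promote_to_production() -> None:
    """Merge the tested configuration into the production branch during an outage."""
    git("checkout", "production")
    git("merge", "--no-ff", "testing", "-m", "Promote tested configuration")
    git("push", "origin", "production")

def merge_back_hotfix() -> None:
    """After an outage, fold emergency production-only changes back into testing."""
    git("checkout", "testing")
    git("merge", "production", "-m", "Merge outage hotfixes back to testing")
    git("push", "origin", "testing")

if __name__ == "__main__":
    promote_to_production()
```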
When the number of tickets has been reduced and it is time to push the changes to the production cluster, five things come out of this phase: an updated set of package repositories; a set of tasks that need to be done during an outage; updated procedures to be used on the production cluster; and changes that need to be merged to production for both the configuration management and the continuous monitoring systems. The package developers and the cluster administrators need to collaborate on the procedural changes and the outage tasks. This collaboration works well in a wiki environment that is internal to both groups. These outputs conclude the integration testing phase of the process.
Production Phase

The production phase of the process takes the results from the integration testing phase and applies them to the production cluster. This phase uses all of the same processes as the testing phase, with a few modifications. Furthermore, this is the phase where users get to effect change in the process. There is also an increase in formal communication between groups through software. The final outputs of this phase feed back to help complete the development and testing cycle. After this phase is completed, the process is finished, and the updated production environment will be maintainable.
The first part of this phase replicates what was done with the package repositories. However, this phase requires that production copies of the repositories be synchronized from the integration testing repositories. This is the final set of package repositories required by the process. Furthermore, the production configurations of the continuous monitoring and configuration management systems also should be created from their respective integration testing configurations.
Users of the production cluster get input into the process during this phase. Depending on the users' requirements, this may be a different instance of the ticket-tracking system or the same one as used by the package developers and cluster administrators. Either way, it's important to track this input so it makes it through the process without getting forgotten.
Communication is key to this part of the process. From the testing phase, we know what tasks need to be completed on the production cluster during an outage and how long they should take. This helps management weigh the cost and benefit of the outage and decide on a path forward. Continued communication also is required during outages, when differences between the production and test clusters bring unexpected issues. These issues should be mitigated quickly, but tickets should be filed to ensure proper resolution so the issue never happens again.
Being prepared for production cluster outages is always important. However, it is impossible to be completely prepared for every possible contingency. The documented differences between the test and production cluster configurations help define the highest risks for any particular outage. It is critical to communicate these risks, and any changes that might be impacted by them, to management prior to outages.
Component | Open-Source Software
Continuous Monitoring | Nagios, Simple Event Correlator, Auditd
Package Repository Management | Cobbler
Continuous Integration | Jenkins, Hudson
Ticket Tracking | Trac, Bugzilla
Wiki Documentation | Trac, Drupal, WordPress
Provisioning | Cobbler
Configuration Management | Cfengine, Chef, Salt, Puppet
Backup Software | Bacula
The process described here may seem like a lot of overhead, and it may not seem applicable to your situation. However, there are specific circumstances in which the testing phase can be skipped. Furthermore, this process is generic enough to be scaled to your needs. There are many pieces of open-source or proprietary software that can meet the requirements of this process.
The testing phase easily can be skipped by pulling the critical applications into separate overlay repositories so they can be managed separately. Then, make sure the process for getting updates is put into the continuous integration system. This may be as simple as a Web-site download that pulls the appropriate software into its respective overlay repository. Then simply synchronize the overlay repository to testing and then to production. This is how security updates can be pushed to production systems immediately.
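A fast path for a critical security fix, then, is just the pieces above chained together: fetch the update into its overlay repository, rebuild the metadata and synchronize straight through to testing and production. The download URL, paths and stage layout in this sketch are placeholders.

```python
#!/usr/bin/env python3
"""Fast path for a critical security update: drop the new package into its
overlay repository and synchronize straight through to testing and production.
The download URL, paths and stage layout are placeholders."""
import subprocess
import urllib.request
from pathlib import Path

KERNEL_OVERLAY = Path("/srv/repos/daily/kernel-overlay")
UPDATE_URL = "https://example.com/security/kernel-update.el8.x86_64.rpm"  # placeholder

def fetch_update() -> None:
    """Download the fixed package into the overlay and rebuild its metadata."""
    KERNEL_OVERLAY.mkdir(parents=True, exist_ok=True)
    target = KERNEL_OVERLAY / Path(UPDATE_URL).name
    urllib.request.urlretrieve(UPDATE_URL, target)   # a real job would verify signatures
    subprocess.run(["createrepo_c", "--update", str(KERNEL_OVERLAY)], check=True)

def push_through(src_stage: str, dst_stage: str) -> None:
    """Synchronize just the kernel overlay from one stage to the next."""
    subprocess.run(["rsync", "-a", "--delete",
                    f"/srv/repos/{src_stage}/kernel-overlay/",
                    f"/srv/repos/{dst_stage}/kernel-overlay/"], check=True)

if __name__ == "__main__":
    fetch_update()
    push_through("daily", "testing")
    push_through("testing", "production")
```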
The same applies to production configuration changes: many times, unexpected issues during an outage demand that configuration changes be made directly to production systems. It should be possible to make these changes directly in the production configuration management system and then merge them back to testing when the outage is over. If the production changes need further development to be made more generic, that work should happen in the build and integration testing phases. The final solution then should be pushed to production during an outage.
In conclusion, the process described here is meant to be suggestive in nature. If the process needs to be modified to get things working again, do so. However, after fixing the issue, consider which part of the process it relates to, and feed what was done back into the process. This process is generic and flexible enough to manage these sorts of changes, keeping systems updated while managing communication through well defined systems.