Fedora Loses Infrastructure Services in Surprise Outage
An unplanned outage struck the Fedora Project early this morning, taking down elements of the project's infrastructure including the package buildsystem, a number of infrastructure-related databases, and the websites for several of the services maintained by the Fedora Infrastructure team.
According to the support ticket entered by Ricky Zhou, the outage began around 8:10 UTC (3:10 EST/2:10 CST) this morning with the crash of the team's database server, db3. Zhou confirmed file corruption on the machine and immediately contacted Infrastructure Lead Mike McGrath. At 8:41 UTC, Zhou notified the Fedora community of the outage by email, noting that the Koji package buildsystem, all databases on the affected server, and several websites (including the project's wiki, its Smolt live instance, and its Transifex live instance) were offline, and that translation services were unavailable. He added that core services, including CVS, DNS, mail, Fedora Hosted, Fedora People, and Fedora Talk, were unaffected.
At 10:05 UTC, McGrath noted in the trouble ticket that while there was little concern over corruption to the server's /, where no data is stored, "issues" were also discovered on /backup, from which db1 was running at the time. After several hours of diagnosis, the ticket was updated to say that IBM would be replacing the machine's RAID controller and motherboard, and that the motherboard would take an indeterminate amount of time to install. McGrath also noted that the machine's last backup appeared to be complete, reducing the potential data loss window to around nine hours. A later note indicates that after copying off db3's files, only /var/lib/pgsql/data/base/19461/pg_internal.init was found to be corrupt.
Just before 15:00 UTC (10:00 EST/9:00 CST), McGrath again updated the status, noting that an interim db3 had been established in a guest machine on the project's xen3 server, and that the team would wait for IBM to install the replacements, then transfer the data back to db3 during another, planned, outage. As of press time, the last note in the ticket indicates that IBM did not replace the motherboard, but did replace the backplane, and that after performing fsck as needed, the box would be placed under general load for twenty-four hours.
For now, a cursory inspection shows the affected websites back online, and with them presumably their respective databases. Koji is reportedly also once again online and available. Fedora users should be prepared for at least one further outage as services are transferred back to db3, but it appears, for now, that the sky is no longer falling.