Stop Killing Your Cattle: Server Infrastructure Advice
It's great to treat your infrastructure like cattle—until it comes to troubleshooting.
If you've spent enough time at DevOps conferences, you've heard the phrase "pets versus cattle" used to describe server infrastructure. The idea is that traditional infrastructure was built by hand without much automation, and therefore, servers were treated more like special pets—you would do anything you could to keep your pet alive, and you knew it by name because you hand-crafted its configuration. As a result, it would take a lot of effort to create a duplicate server if it ever went down. By contrast, modern DevOps practice encourages creating "cattle": instead of unique, hand-crafted servers, you use automation tools to build your servers so that no individual server is special—they are all just farm animals—and therefore, if a particular server dies, it's no problem, because you can respawn an exact copy with your automation tools in no time.
If you want your infrastructure and your team to scale, there's a lot of wisdom in treating servers more like cattle than pets. Unfortunately, there's also a downside to this approach. Some administrators, particularly more junior ones, have extended the concept of disposable servers to the point that it affects their troubleshooting process. Since servers are disposable and a replacement is so easy to spawn, at the first hint of trouble with a particular server or service, these administrators destroy and replace it in the hope that the replacement won't show the problem. Essentially, this is the "reboot the Windows machine" approach IT teams used in the 1990s (and Linux admins sneered at), only applied to the cloud.
This approach isn't dangerous because it is ineffective. It's dangerous exactly because it often works. If you have a problem with a machine and reboot it, or if you have a problem with a cloud server and you destroy and respawn it, often the problem does go away. Because the approach appears to work and because it's a lot easier than actually performing troubleshooting steps, that success then reinforces rebooting and respawning as the first resort, not the last resort that it should be.
The problem with respawning or rebooting before troubleshooting is that once the problem goes away, you can no longer perform any troubleshooting to track down the root cause. To extend the cattle metaphor, it's like shooting every cow that seems a little sluggish or shows signs of a cold because it might have mad cow disease, without ever actually testing it for the disease. If you aren't careful, you'll find you've let a problem go untreated until it has spread to the rest of your herd. Without knowing the root cause, you can't take any steps to prevent it in the future, and although the current issue may not have caused a major outage, there's no way to know whether you'll get off so easy the next time it happens. And the time you save by skipping troubleshooting is time you lose from gaining troubleshooting experience. Eventually, you'll need to flex that troubleshooting muscle, and if you haven't exercised it, you may find yourself with a problem you can't solve.
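One way to keep the cattle workflow without throwing away the evidence is to capture a host's logs before you destroy it, so you can still dig for a root cause after the respawn. The sketch below is a minimal, hypothetical example—the function name, the paths, and the overall workflow are my assumptions, not the API of any particular tool—that simply bundles a directory of logs into a tarball you could ship to central storage for later analysis:

```python
import tarfile
from pathlib import Path


def snapshot_before_respawn(log_dir: str, archive_path: str) -> str:
    """Bundle a doomed host's logs into a gzipped tarball so the
    evidence survives the respawn.

    log_dir      -- directory to preserve (e.g. /var/log on the host;
                    illustrative, point it at whatever your hosts write)
    archive_path -- where to write the .tar.gz (e.g. on shared storage)
    """
    src = Path(log_dir)
    archive = Path(archive_path)
    # Recursively add the whole log directory under its own name,
    # so the archive unpacks to a single top-level folder.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    return str(archive)
```

In practice you would run something like this (or a `journalctl` export, a core-dump copy, and so on) as the first step of your teardown automation, so that preserving evidence costs nothing and the respawn stays just as fast.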
In short, automation is great, and it's incredibly important in modern infrastructure to be able to respawn any host quickly and easily—just don't turn that infrastructure best practice into a troubleshooting worst practice.