In complex IT environments, automation is a godsend. You only need to be responsible for administering more than a handful of servers to realise that. And if you’ve ever been responsible for more than a handful of servers, you’re probably also familiar with the three approaches to troubleshooting using the OSI model:
- Top-down
- Bottom-up
- Divide and conquer
Where does automation sit in these? Personally, I tend to see automation as the top-most layer, and since I generally use the top-down approach, it’s often the first thing I investigate when an issue is reported. This is especially true given that modern approaches to infrastructure provisioning are very good at producing uniform deployments (something that’s essential to large-scale container deployments). Reducing the number of snowflake servers in our environments makes life easier, and I (for one) am not going to complain about that.
But consider this: when those edge cases are gone, the issues that do occur tend to be much larger in scale. If every node, VM or container is configured (supposedly) identically, then every one of them is likely to be impacted by the same fault.
Can troubleshooting itself be automated? Capturing recent changes to a set of objects (nodes, containers or even entire environments) and being able to view that information in a single place saves Ops teams hours of running the same commands in multiple places, capturing the output and comparing it, only to rule a particular object out as unaffected. A rough sketch of the idea follows below.
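To make that concrete, here is a minimal Python sketch, assuming SSH access to each host. The host names and the change-listing command are placeholders for illustration; in a real environment the source of “recent changes” might be package logs, configuration-management reports or audit trails.

```python
#!/usr/bin/env python3
"""Minimal sketch: gather recent changes from several hosts in one place
and flag the hosts whose changes differ from the rest of the fleet."""
import subprocess
from collections import Counter

HOSTS = ["web01", "web02", "web03"]  # hypothetical host names

# Hypothetical example source of "recent changes": the last 20 dpkg actions,
# with the date/time fields stripped so lines compare across hosts.
CHANGE_CMD = "tail -n 20 /var/log/dpkg.log | cut -d' ' -f3-"


def recent_changes(host: str) -> frozenset[str]:
    """Run CHANGE_CMD on a host over SSH and return its output lines as a set."""
    out = subprocess.run(
        ["ssh", host, CHANGE_CMD],
        capture_output=True, text=True, check=True,
    ).stdout
    return frozenset(line.strip() for line in out.splitlines() if line.strip())


def main() -> None:
    changes = {host: recent_changes(host) for host in HOSTS}

    # Count how many hosts saw each change. Changes common to every host are
    # most likely the work of automation; the remainder is per-host drift.
    tally = Counter(c for seen in changes.values() for c in seen)
    common = {c for c, n in tally.items() if n == len(HOSTS)}

    for host, seen in changes.items():
        drift = sorted(seen - common)
        status = f"{len(drift)} unique change(s)" if drift else "in line with the fleet"
        print(f"{host}: {status}")
        for change in drift:
            print(f"  {change}")


if __name__ == "__main__":
    main()
```

The comparison step is the point of the exercise: changes common to the whole fleet are probably deliberate automation, while per-host drift is exactly the snowflake behaviour worth investigating first, and nobody had to run the same command on every box by hand.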
Are there other cases where automation and troubleshooting intersect? Or is it a moot point, given the industry’s trend towards self-healing infrastructure?