It has been said that technology in general, and automation in particular, simply make it easier for humans to propagate errors much faster than they otherwise could.
This morning, I awoke to find that none of my scripts – on any of my machines – had executed overnight. Hmmmm?!?
Initially, I thought it was only the two scripts that I had updated the night before, because those were the logs that I was checking. And, of course, I had only tested them peripherally, as the changes were relatively minor.
Well, it didn’t take more than a couple of minutes to determine that the problem was widespread. The source was a centralized date routine script that attempted to determine if the current date was an ODD or EVEN date. Unfortunately, if you’re using Windows native shell scripting, and you attempt to do arithmetic on a number with a leading ZERO, it assumes that the number is octal rather than decimal. Since 8 and 9 are not valid octal digits, 08 and 09 are rejected as invalid numbers. And there’s no setting to change that.
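To make the failure concrete, here is a minimal Python sketch of the parsing rule described above. It is not the actual routine: the function names are hypothetical, and `parse_c_style` only assumes that the shell follows the usual C-style convention (leading `0` means octal, leading `0x` means hexadecimal). The "safe" version uses one common workaround, which may or may not be the fix I applied: prefix the two-digit day with `1` so it can never start with a zero, then subtract 100.

```python
def parse_c_style(text: str) -> int:
    """Parse an integer using C-style literal rules (an assumed model of the shell's behavior)."""
    if text.lower().startswith("0x"):
        return int(text, 16)          # hexadecimal
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)           # octal: raises ValueError for "08" and "09"
    return int(text, 10)              # plain decimal


def is_even_day_naive(day: str) -> bool:
    """Odd/even check that inherits the octal problem for zero-padded days."""
    return parse_c_style(day) % 2 == 0


def is_even_day_safe(day: str) -> bool:
    """Odd/even check for a two-digit day: "08" -> "108" -> 108 - 100 = 8."""
    return (int("1" + day, 10) - 100) % 2 == 0


if __name__ == "__main__":
    for day in ("07", "08", "09", "10"):
        try:
            naive = is_even_day_naive(day)
        except ValueError:
            naive = "invalid number"
        print(day, "naive:", naive, "safe:", is_even_day_safe(day))
```

Note that days 01 through 07 parse to the same value in octal and decimal, which is why a routine like this can run cleanly for a week at the start of every month and still fail on the 8th and 9th.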
So, today and tomorrow would have been utter script failure, since virtually all of my scripts call a centralized one with this routine. It didn’t take much to fix the problem, at least not after I stopped shaking my head, and this will probably be the straw that really gets me to convert the vast majority of these scripts to PowerShell, like I’ve been threatening to do for a few years.
The funny thing is that I added this particular routine on November 11. Had I done it just two days prior, I would have found the issue during testing, rather than one month later.
How does this tie in to the title of this post? Well, the more infrastructure or automation one ties together, the more likely it is that a single mistake will cascade across different areas and systems, causing substantial failure. And it might not happen immediately.
There are definitely immense gains to be made with consolidation and automation – we just have to evaluate and mitigate the risks more carefully when we’re putting more and more eggs in a smaller number of baskets…
Such is life on Technology Boulevard…