The Level of Uptime – Increasing Pressure Syndrome

When I was earning my bachelor's, I joined the Association for Computing Machinery (ACM) and, through them, several special interest groups. One of those groups was SIGRISK, which focused on high-risk software engineering. At the time, the focus was on complex systems whose loss was irretrievable – like satellite guidance systems or deep-sea locomotion systems – and those whose failure could result in death or injury – like power plant operations systems, medical equipment, and traffic light systems. The approach to engineering these controls was rigorous, more rigorous than most IT staff would consider reasonable.

And the reason was simple. As we’ve seen since – more than once – a spaceship with one line of bad code can veer off course and never return, the data it collects completely different from what it was designed to collect. Traffic lights that are mis-programmed offer a best case of traffic snarls and people late for whatever they were doing, and a worst case of fatal accidents. These systems were categorized by the ACM as high-risk, and special processes were put in place to handle high-risk software development (processes that, interestingly, don’t appear to have been followed by the space agencies – who did a lot of the writing and presenting for SIGRISK when I was a member). Oddly, SIGRISK no longer shows up on the list of SIGs at the ACM website. It is the only SIG I belonged to that seems to have gone away or been merged into something else.

What interests me in all of this is a simple truth that I’ve noticed of late. We are now making the same expectations of everyday software that we made of these over-engineered systems designed to function even in the face of complete failure – and everyday software isn’t designed for that level of protection. Think about it a bit. Critical medical systems can be locked down so that the only interface is the operator’s, and upgrades are only allowed with a specific hardware key. Things launched into space don’t require serious protection from hackers; they’re a long way out of reach. Traffic lights have been hacked, but it isn’t easy, and the public nature of the interfaces makes it difficult to pull off in busy times. But Facebook and Microsoft? They have massive interfaces, global connectedness, and by definition IT staff tweaking them constantly. Configuration, new features, uncensored third-party development… The mind spins.

Ariane 5 courtesy of SpaceFlightNow.com

Makes me wonder if Apple (and to a lesser extent RIM) wasn’t smart to lock down development. RIM has long had a policy of “you have a key for all of your apps; if you want to touch protected APIs, your app will carry your key; and if you are a bad kid and crash our phones, we’ll shut off your key.” Okay, that last bit might be assumed – it’s been a while since I read the agreement (I’ve got a RIM dev license) – but that was the impression I was left with three years ago when I read through the documentation. Apple took a lot of grief for its policies, but seriously, they want their phone to work. Note that Microsoft often gets blamed for problems caused by “rogue” applications.

But it doesn’t address the issue of software stability in a highly exposed, highly dynamic environment. We’re putting pressures on IT folks – who are already under time pressure – that used to be reserved for scientists in laboratories. And we’re expecting them to get it right with an impatience more indicative of a two-year-old than a pool of adults. Every time a big vendor has a crash or a security breach, we act like they’re idiots. The truth is that they have highly complex systems that are exposed to both inexperienced users and experienced hackers, and we don’t give them the years of development time that critical systems get.

So what’s my point? When you’re making demands of your staff, yes, business needs and market timing are important, but give them time to do their job right, or don’t complain about the results. And in an increasingly connected enterprise, don’t assume that some back-office corner piece of software or hardware is less critical than user-facing systems. After all, the bug that bit Microsoft not too long ago was a misconfiguration in lab systems. I’ve worked in a test lab before, and they’re highly volatile places. When big tests are going on, the rest of the architecture can change frequently as gear is pulled into and returned from the big test; complete wipe-and-reconfigure is common – from switches to servers – and security was considered less important than delivering test results. And the media attention lavished on the Facebook outage in September is enough that you’d think people had died from the failure… which was caused by a software configuration change.


www.TechCrunch.com graphic of Facebook downtime

Nice and easy. Don’t demand more than can be delivered, or you’ll get sloppy work, both in App Dev and in Systems Management. Use process to double-check everything, making sure that it is right. Better to take an extra day or even ten than to find your application down and people unable to do anything. Because while Microsoft and Facebook can apologize and move on, internal IT rarely gets off that easily.

Automation tools like those presented by the Infrastructure 2.0 crowd (Lori is one of them) can help a lot, but in the end, people are making changes, even if they’re making them through a button in a browser… Make sure you’ve got a plan to make it go right, and an understanding of how you’ll react if it doesn’t.
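To make that concrete, here’s a minimal sketch of the “plan for it to go right, plan for how you’ll react if it doesn’t” idea, in Python. The apply_change() and rollback() helpers and the health-check URL are hypothetical stand-ins for whatever your automation tooling actually exposes, not any vendor’s API:

```python
# A minimal sketch, assuming your automation exposes some way to apply and back
# out a change. apply_change(), rollback(), and the health-check URL below are
# hypothetical stand-ins, not any particular product's interface.
import time
import urllib.request

def apply_change(change):
    # Push the change through whatever tool actually owns the configuration.
    print(f"Applying: {change}")

def rollback(change):
    # The "how you'll react if it doesn't go right" part.
    print(f"Rolling back: {change}")

def healthy(urls, retries=3, delay=5):
    # The change only counts as done if the user-facing checks still answer.
    for _ in range(retries):
        try:
            for url in urls:
                urllib.request.urlopen(url, timeout=10)
            return True
        except OSError:
            time.sleep(delay)
    return False

def push_change(change, checks):
    apply_change(change)
    if not healthy(checks):
        rollback(change)
        raise RuntimeError(f"{change!r} failed its health check and was rolled back")
    print("Change verified")

if __name__ == "__main__":
    # Placeholder URL -- point this at whatever your users actually hit.
    push_change("update pool member weights", ["https://intranet.example.com/health"])
```

The point isn’t the specific tool; it’s that the verification and the back-out path are written down before the button gets pushed, not improvised afterward.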

And the newly coined “DevOps” hype-word might be helpful too – where Dev meets Operations is a good place to start building in those checks.
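One of the simplest checks to build in at that boundary is a configuration lint that runs in the build or deploy pipeline, so a bad change fails fast instead of in production. The sketch below is a hedged example – the file name and required keys are made up for illustration, not taken from any particular product:

```python
# Validate a deployment config before it ships. The file name and required
# keys here are assumptions for illustration only.
import json
import sys

REQUIRED_KEYS = {"pool_members", "health_check_url", "rollback_image"}

def validate_config(path):
    with open(path) as f:
        cfg = json.load(f)  # malformed JSON fails the pipeline right here
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config missing required keys: {sorted(missing)}")
    if not cfg["pool_members"]:
        raise ValueError("pool_members is empty -- nothing to serve traffic")
    return cfg

if __name__ == "__main__":
    try:
        validate_config(sys.argv[1] if len(sys.argv) > 1 else "deploy.json")
    except Exception as err:
        print(f"Config check failed: {err}")
        sys.exit(1)  # non-zero exit blocks the deployment step
    print("Config check passed")
```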


Published Oct 19, 2010
