Nothing sparks engineering debate quite as much as ‘network change control’. It’s one of those topics we love to hate. We feel buried by useless bureaucracy. We ask, ‘Why can’t our managers just trust us, instead of weighing us down with meaningless process and red tape?’
This may be a controversial perspective, but I think we’ve gotten exactly what we deserve. We endure heavyweight change control procedures because when we make network changes we break stuff. We break stuff in truly spectacular ways, in ways we could never have predicted. We hit weird bugs, asymmetric configuration, faulty hardware and poor process, or we just have a brain fart or a fat-finger moment.
However, the main reason our changes end badly is that we are hopelessly optimistic. We don’t predict that things will go wrong, so we don’t have any controls in place to deal with the inevitable car crash. When we underestimate the risk of a change and it goes horribly wrong, we undermine our credibility and lose trust.
We falsely assume the network engineer is in full control of the change.
There is no problem
A few months back I interviewed an engineer who told me about bringing up a new inter-data centre WAN link. He brought up the link and it carried WAN traffic straight away. After half an hour he got complaints, and after another hour of troubleshooting he backed out his change. The root cause was the new link dropping packets due to CRC errors.
I asked him what he would do differently if he had a chance to do the same procedure again. His answer was, “Nothing – I was just unlucky”. Nothing. I was floored. There are countless ways in which this ‘process’ could have been improved or made safer. However, you can’t improve a bad situation unless you realise there is a problem.
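To make one of those ways concrete, here is a minimal sketch of a pre-service check that could have caught the CRC errors before the link carried production traffic. It is my own illustration, not what the engineer or any vendor does: the CounterSample type, the link_is_clean function and the max_error_ratio threshold are all hypothetical names, and it assumes you can read the interface's receive counters before and after pushing test traffic over the new link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical snapshot of an interface's receive counters."""
    rx_packets: int
    rx_crc_errors: int


def link_is_clean(before: CounterSample, after: CounterSample,
                  max_error_ratio: float = 0.0) -> bool:
    """Return True only if the soak test saw traffic and few enough CRC errors.

    `max_error_ratio` is an assumed threshold; the default tolerates none.
    """
    packets = after.rx_packets - before.rx_packets
    errors = after.rx_crc_errors - before.rx_crc_errors
    # No test traffic, or counters that went backwards (e.g. a reset),
    # means we cannot vouch for the link, so fail safe.
    if packets <= 0 or errors < 0:
        return False
    return errors / packets <= max_error_ratio
```

The point of the sketch is the fail-safe default: if the check cannot prove the link is healthy, the link stays out of service.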
Let’s be honest, we’ve all worked on cowboy networks where no one knows what’s going on. We’ve worked on networks where every second change causes an outage, and sometimes another one when trying to back out the change, often undoing a successful change from a week ago. Once you have worked somewhere like that, I think all engineers agree that we need ‘some’ process to bring the chaos under control.
The engineer is not in full control
About six years ago I worked with a colleague at Amazon who had the misfortune of making a network change which caused a nightmare AWS outage. A cascading storm of failures ensued and everyone and their mothers had an opinion on the root cause.
I remember listening to a very painful episode of the Packet Pushers podcast in which the hosts questioned the professionalism of the network engineer in question. I was angry because I could easily have been that engineer, and that change would not have ended any better. The root cause was a network configuration that was way too complex and a rogue OSPF ABR that decided to inject a summary default route into Area 0.
Did AWS just dismiss this event as a rogue engineer or buggy code? Like hell they did. This failure was diagnosed as a process and architecture failure within AWS, not a human one. This event triggered AWS to completely overhaul their change process and network architecture, and the process continues to this day. AWS isn’t perfect, but they have a great ability to be self-critical. They started thinking defensively.
The starting assumption in AWS networking is that the engineer is not in full control of the change.
AWS assumes that every network change could go wrong, and engineers need to prove during a peer review that they expect this. The engineer will harden their change to show how they will monitor for, prevent and detect errors, and how they will handle issues that arise.
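The shape of such a hardened change can be sketched in a few lines. This is a sketch under my own assumptions, not AWS tooling: Check, ChangeResult and run_defensive_change are hypothetical names, and the probes stand in for whatever monitoring you actually have. It refuses to start if the network is already unhealthy, verifies the result afterwards, and rolls back if any check fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Check:
    """A named health check; `probe` returns True when things look healthy."""
    name: str
    probe: Callable[[], bool]


@dataclass
class ChangeResult:
    applied: bool
    rolled_back: bool
    failed_checks: List[str] = field(default_factory=list)


def failing(checks: List[Check]) -> List[str]:
    """Names of the checks whose probe reports a problem."""
    return [c.name for c in checks if not c.probe()]


def run_defensive_change(apply: Callable[[], None],
                         rollback: Callable[[], None],
                         pre_checks: List[Check],
                         post_checks: List[Check]) -> ChangeResult:
    # Prevent: do not touch a network that is already unhealthy.
    failed = failing(pre_checks)
    if failed:
        return ChangeResult(applied=False, rolled_back=False, failed_checks=failed)

    apply()

    # Detect and handle: verify the change, back out if anything looks wrong.
    failed = failing(post_checks)
    if failed:
        rollback()
        return ChangeResult(applied=True, rolled_back=True, failed_checks=failed)
    return ChangeResult(applied=True, rolled_back=False)
```

A real procedure would also need to handle apply() itself failing part-way and a rollback that does not restore health; the sketch leaves both out to keep the structure visible.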
Suffocating change control
I’m guessing some of you are cursing me right now, because you think I’m going too far. You may have experienced ITIL-style change control at its very worst. You know that a burdensome change control process leads to fossilised networks with no ability to change, and networks with no means of processing and managing risk.
You know that nothing can happen without the change review board’s approval, and even when you explain your change to those folks, they don’t understand what’s going on.
I contend that we are hopelessly optimistic about the chances of success when making network changes. This blind optimism hurts our professional reputations. We engineers need to write defensive change procedures.
The big question is how we get the protections of a defensive change procedure without getting mired in bureaucracy. In the next post I propose a change-control structure that delivers on safety without killing you with process.