Network change – who is in control?


Nothing sparks engineering debate quite like ‘network change control’. It’s one of those topics we love to hate. We feel buried by useless bureaucracy. We ask, ‘Why can’t our managers just trust us, instead of weighing us down with meaningless process and red tape?’

Comfort Zone


This may be a controversial perspective, but I think we’ve gotten exactly what we deserve. We endure heavyweight change control procedures because when we make network changes we break stuff. We break stuff in truly spectacular ways, in ways we could never have predicted. We hit weird bugs, asymmetric configuration, faulty hardware, or poor process, or we simply have a brain-fart/fat-finger moment.

However, the main reason our changes end badly is that we are hopelessly optimistic. We don’t predict that things will go wrong, so we don’t have any controls in place to deal with the inevitable car crash. When we underestimate the risk of a change and it goes horribly wrong, we undermine our credibility and lose trust.

We falsely assume the network engineer is in full control of the change.


There is no problem

I interviewed an engineer a few months back who was bringing up a new inter-data centre WAN link. He told me the story of how he brought up the link and it carried WAN traffic straight away. After half an hour he got complaints and after another hour of troubleshooting he backed out his change. The root cause was the new link dropping packets due to CRC errors.

I asked him what he would do differently if he had a chance to run the same procedure again. His answer was, “Nothing – I was just unlucky”. Nothing. I was floored. There are countless ways in which this ‘process’ could have been improved or made safer. However, you can’t improve a bad situation unless you realise there is a problem.
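One concrete improvement to the WAN-link story is a soak test: snapshot the interface error counters, run test traffic over the new link for a while, snapshot again, and only shift production traffic if nothing grew. Here is a minimal sketch in Python; the counter snapshots are plain dicts, and the field names and tolerance are illustrative assumptions (in practice the data would come from SNMP or a scrape of the device’s interface counters).

```python
# Sketch of a pre/post soak check for a new link, as described above.
# Counter names ("crc_errors" etc.) are assumptions for illustration;
# real snapshots would come from the device (SNMP, CLI scrape, etc.).

SOAK_COUNTERS = ("crc_errors", "input_errors", "output_drops")

def link_is_clean(before: dict, after: dict, tolerance: int = 0) -> bool:
    """Return True if no error counter grew by more than `tolerance`
    between the two snapshots taken around the soak period."""
    return all(
        after.get(c, 0) - before.get(c, 0) <= tolerance
        for c in SOAK_COUNTERS
    )

# Usage: snapshot, soak the link with test traffic, snapshot again,
# and only then decide whether to move production traffic.
before = {"crc_errors": 12, "input_errors": 3, "output_drops": 0}
after = {"crc_errors": 4870, "input_errors": 3, "output_drops": 0}

if not link_is_clean(before, after):
    print("Soak test failed - do not move production traffic")
```

A check this simple would have caught the CRC errors before any production traffic touched the link, turning an outage into a non-event.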

Let’s be honest, we’ve all worked on cowboy networks where no-one knows what’s going on. We’ve worked on networks where every second change causes an outage, and sometimes another one when trying to back out the change, often undoing a successful change from a week ago. Once you have worked somewhere like that, I think all engineers agree that we need ‘some’ process to bring the chaos under control.

The engineer is not in full control

About six years ago I worked with a colleague at Amazon who had the misfortune of making a network change which caused a nightmare AWS outage. A cascading storm of failures ensued and everyone and their mothers had an opinion on the root cause.

I remember listening to a very painful episode of the Packet Pushers podcast in which the hosts questioned the professionalism of the network engineer in question. I was angry because I could easily have been that engineer, and the change would not have ended any better. The root cause was a network configuration that was far too complex, and a rogue OSPF ABR that decided to inject a summary default route into Area 0.

Did AWS just dismiss this event as a rogue engineer or buggy code? Like hell they did. The failure was diagnosed as a process and architecture failure within AWS, not a human one. It triggered AWS to completely overhaul its change process and network architecture, and that work continues to this day. AWS isn’t perfect, but it has a great ability to be self-critical. They started thinking defensively.

The starting assumption in AWS networking is that the engineer is not in full control of the change.

AWS assumes that every network change could go wrong, and engineers need to prove during peer review that they expect this. The engineer hardens their change to show how they will monitor for, prevent and detect errors, and how they will handle any issues that arise.
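The defensive structure described above can be sketched as a simple harness: a change only runs if its pre-checks pass, and it rolls itself back if any post-check fails. This is a minimal illustration in Python, not any real AWS tooling; the function names and shape are assumptions.

```python
# Minimal sketch of a defensive change harness: the starting assumption
# is that the change can fail, so checks and rollback are mandatory
# inputs, not afterthoughts. Illustrative only - not real AWS tooling.

from typing import Callable

def run_change(
    pre_checks: list[Callable[[], bool]],
    apply_change: Callable[[], None],
    post_checks: list[Callable[[], bool]],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change only if all pre-checks pass, and roll it back
    automatically if any post-check fails. Returns True on success."""
    if not all(check() for check in pre_checks):
        return False              # unsafe starting state - abort early
    apply_change()
    if all(check() for check in post_checks):
        return True               # change verified healthy
    rollback()                    # a post-check failed - undo fast
    return False
```

The point of the shape is that you cannot even express a change in this harness without first writing down how you will detect failure and how you will back out, which is exactly the proof a peer review should demand.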

Suffocating change control

I’m guessing some of you are cursing me right now because you think I’m going too far. You may have experienced ITIL-style change control at its very worst. You know that a burdensome change control process leads to fossilised networks with no ability to change, and no means of processing and managing risk.

You know that nothing can happen without change review-board approval, and that even when you explain your change to those folks, they don’t understand what’s going on.

Sherpa Summary

I contend that we are hopelessly optimistic about the chances of success when making network changes. This blind optimism hurts our professional reputations. We engineers need to write defensive change procedures.

The big question is how we get the protection of a defensive change procedure without getting mired in bureaucracy. In the next post I’ll propose a change-control structure that delivers safety without killing you with process.


5 thoughts on “Network change – who is in control?”

  1. John,
    Going down the slippery slope of ITIL-hell, I guess we can all reflect on the results of previous changes. We are, after all, human; we have made mistakes and will continue to make them. Some are “sloppy ones”, such as poor pre-qualification physical line checks, and those can be minimised. For the other ones, which have been carefully thought out – how much time, resource and energy will be added in going from a (probable) 98% success rate to 99.9%?
    I know you know this, but in the AWS case nothing in ITIL change control (peer review, manager review, change control board) would have _prevented_ that bug from hitting; it could only have reduced the time needed to verify and revert (back out) the configuration change.

    Reminds me of this https://www.linkedin.com/pulse/why-your-job-becoming-impossible-do-tragedy-overload-bob-sutton

    ~Fred

    • Hey Fred,
      You make an excellent point about prevention vs management. Complete prevention of change-related downtime is a dangerous path, and I’ll try to better make that point in the next post. Admitting that you’ll probably mess-up should make the engineer and their peers a little sharper on change planning and review, but most of the value will come from rapid detection, rollback and repair.

      Love the new blog at spearpoint.net btw.

      Regards,
      John H

  2. I share a similar view. It was one of those bad days when I had a big outage during a software upgrade of a device. Our change team commanded a rollback after the major impact to the applications (of course, the right thing to do 🙂 ). The product vendor couldn’t understand what was going wrong within 30 minutes of the impact, so the change board and customer asked to roll back without any further ado for troubleshooting. After the rollback, the change team wanted an RCA, but the vendor claims they weren’t given enough time to troubleshoot and hence can’t provide one officially. The logs/configurations/errors we provided, and recreation of the issue in the vendor’s labs, did not yield any conclusions, so there is no RCA from the vendor, and now the change team is not ready to approve further activities on this device. The ping-pong between the change board and the vendor is still ongoing, and the device is running with potentially vulnerable code that needs the upgrade. And here’s the sadly funny part: the server/application team remembers this network engineer as “oh, the guy who created an outage worth millions?!” The engineer’s trustworthiness is shaken because of the outage – which was most likely caused by a bug, since a similar outage in another region was traced to one, but there is still no official RCA from the vendor because they weren’t given enough time to troubleshoot.
