Basic network change control process

Make your own change control

Scenario: You are an engineer who runs a managed network on behalf of a customer. Your manager has asked you to create a change control process. Your customer and your manager will measure you only by the uptime or outages they experience, and don’t care what your process looks like.
I’ve discussed why we need change control in a previous post. Knowing this, what sort of process would you create? In this post I provide a high-level template and some tips.

A gentler form of change control

Before we begin you should note that no matter how many controls you put in place, you can never eliminate 100% of the risk of outages. A zero tolerance approach to outages is a dangerous goal, and can easily destroy agility. If you’re just starting out with change control, keep it high-level and introduce only controls that deliver value to you and your team.

It is much more effective to start from a defensive mindset, where you expect the change to go wrong. When you’re executing the change, focus on quickly detecting when you’ve broken the network and taking immediate action to undo your change.
The general idea is that you invest 90% of the effort into documenting and verifying your change proposal. You get it peer reviewed (if possible) and the final product is a clear and easy-to-implement change. If your change breaks the network or applications, you will have the right graphs and commands to hand to quickly diagnose it, and you’ve got a full rollback configuration as a get-out-of-jail-free card.
If you exclude all the other controls mentioned below, you will still get 80% of the benefit from this simple tip: write down every command you intend to apply in a simple document. This lets you step back and review your change as a whole, and you can easily produce ‘no’-prefixed versions of your config for easy rollback.
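As a minimal sketch of that tip (assuming a Cisco IOS-style device; the access-list name and host address are invented for illustration), the change and its ‘no’ rollback can sit side by side in the same file:

  ! --- Change: permit SSH from the new monitoring host (hypothetical) ---
  ip access-list extended MGMT-IN
   permit tcp host 192.0.2.10 any eq 22
  !
  ! --- Rollback: the same lines with 'no' prepended ---
  ip access-list extended MGMT-IN
   no permit tcp host 192.0.2.10 any eq 22

Keeping both blocks in the same document means the rollback gets reviewed at the same time as the change itself.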

A proposed format

I use a lightweight change control procedure built from the elements below. Take it as a starting point and reduce or expand the list, or individual elements, as you see fit.

  • Create a simple text document describing the entire change.
  • The document will include the following (a skeleton sketch follows this list):
    • A step to ensure the configs are backed up.
    • Check graphs. Don’t just say ‘check graphs’; link to a specific URL and say what levels you expect, e.g. ‘check traffic is below 5 Mbps’ or ‘note traffic levels for later comparison’.
    • Baseline the existing network configuration and state using ‘show config’ and ‘show state’ commands.
      • ‘State’ commands are useful so that you know if you’ve broken anything when your change is complete. e.g. Execute: sh ip ospf interface brief, Check: output includes 7 neighbors.
      • ‘Show config’ commands help you know if your change will update/overwrite or create new configuration. How would you back out a route-map change if you don’t know this? It’s no fun when you delete a route-map you accidentally ‘edited’ when you thought you’d ‘created’ it.
    • Detailed configuration steps – This is the exact configuration you will apply.
    • Post-change verification steps. Repeat your baseline and compare. Use your ‘state’ and ‘show config’ commands from before plus some active tests. E.g. traceroute, web browsing etc.
    • Detailed rollback steps – All commands needed for you or a help-desk technician to roll back your changes while you sleep.
    • Post-change graph checks.
  • Post-change notification – Tell your team what you did, when you did it and what to do if other stuff breaks when you’re asleep.
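
To make the list concrete, here is a stripped-down skeleton of such a document. It assumes a Cisco IOS-style router; the hostname, interface, neighbour count, addresses and graph URL are placeholders rather than recommendations:

  ! CHANGE: add user VLAN 30 gateway on core-rtr-01 (hypothetical example)
  ! PRE-CHECK
  ! - confirm the config backup ran (e.g. last night's backup job)
  ! - graphs: http://graphs.example.com/core-rtr-01 - note current traffic levels
  show running-config interface Vlan30
  show ip ospf interface brief
  ! - expect: no existing Vlan30 config, 7 OSPF neighbours
  ! EXECUTE (enter config mode manually)
  interface Vlan30
   description USER-VLAN-30
   ip address 192.0.2.1 255.255.255.0
   no shutdown
  ! VERIFY (back in exec mode)
  show ip interface brief | include Vlan30
  show ip ospf interface brief
  ! - expect: Vlan30 up/up, still 7 OSPF neighbours; ping/traceroute from a user host
  ! ROLLBACK (if verification fails)
  no interface Vlan30
  ! POST-CHANGE
  ! - recheck graphs and send the post-change notification either way

Everything a help-desk technician needs to back this out at 3am sits in the one file.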

Exceptions

A great manager once told me ‘all processes have exceptions, and you need to include a way to handle them’. Here are some exceptions you may need to deal with:

  • I need to adjust configuration on the fly – The jury is out on this one. The purists will say that if the document is wrong in any way you should ‘halt and back out’ no matter what. I would say that if it’s a cosmetic issue or a minor syntax issue, you can fix it, but you should amend the change document – otherwise roll back. Your team should always be able to see the precise changes you applied.
  • I need time to troubleshoot before rolling back – Sometimes there is cause to do this, but in general it’s a really bad idea. If you have the right verification commands in your change, you’ll automatically have the data for later analysis. I’ve been that guy who consumed the entire change window and then tried to roll back, only for that to fail. It’s your call, but I recommend you quickly gather diagnostic information (see the sketch below), then roll back and analyse later.
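
As a sketch of that ‘grab the data, then roll back’ approach (assuming a Cisco IOS-style device; swap in the show commands for whatever protocols your change actually touched), a quick diagnostic capture can be as simple as running a prepared block of commands with your terminal client logging the session:

  ! quick diagnostic grab before rolling back - run it, save the output, then execute the rollback
  terminal length 0
  show clock
  show logging
  show ip interface brief
  show ip route summary
  show ip ospf neighbor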

Some Extra Tips

  • Use a text editor such as Notepad++ (Windows) or Sublime Text (Mac/Windows) to gain access to block-level editing (Alt+Select). If you’re not already using this, it’s a game-changer. Editing a config as a block really speeds up your changes and reduces the number of copy-paste errors.
  • I prepend my comments in this document with exclamation (!) marks so that they can sit alongside the config without doing any harm if they’re accidentally pasted into the device.
  • I don’t include ‘conf t’ and ‘wr mem’ as live commands, but add them as text steps preceded by ! marks (see the snippet after this list). Again, I want complete control, and I’ve found that an un-commented ‘conf t’ is very dangerous when you copy and paste configs.
  • Exit to exec mode after every step, every time. I’ve had too many accidental right-clicks to count.
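
Putting the last few tips together, a step in the change document might look like the sketch below (Cisco IOS-style syntax; the interface and description are made up). Only the lines without a leading ! are meant to be pasted; the rest are comments or commands you type yourself:

  ! STEP 3: update the uplink description (hypothetical example)
  ! type 'conf t' manually - it is deliberately not included, so a stray paste can't enter config mode
  interface GigabitEthernet0/1
   description UPLINK-TO-DIST-01
  ! type 'end' to drop back to exec mode, verify, then 'wr mem' manually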

Sherpa Summary

A simple document which includes repeated blocks of the following steps will vastly increase your change safety. Start with this and include the other tips and suggestions only where they add value to your situation.
PRE-CHECK, EXECUTE, VERIFY, [ROLLBACK]
Some of you will see this as too heavy and some will feel it’s too light, but either way you should have an opinion on this topic. If you want more practical tips and a lot more depth, you can review the comprehensive change control checklist post.
 

5 thoughts on “Basic network change control process”

  1. Thanks for the “option + select” tip for Sublime. I’ve always used ctrl + select and gnome terminal for that. It’s nice to have more options.

  2. I liked this a lot.
    Thank you for taking the time to publicly write about this.
    The only thing I can add about the “troubleshoot or not until the downtime is over?” question is that I usually handle it like this (in unix land, though):
    If something goes wrong at the start and I can easily get a new downtime, I’ll abort.
    Otherwise I’ll evaluate.
    Whether to enter a troubleshooting window depends on the time spent and the progress made.
    Progress 50-80% in < 50% of time? Can troubleshoot for a bit
    Progress <50% in 50% of time? Rollback!
    Can I undo parts only and test that before hitting 50% of the window? Might go with a partial win – the reasoning is: If I'm wrong and the test fails, I still have more than the expected rollback time – since my "configuration delta" also decreased by a lot.
    I think this is one of the lessons I learned: Be willing to give up parts of what you planned to gain testing time and make groundwork.
    Be able to ISOLATE aspects.
    I.e. want to migrate to LACP'ed VLANs and rebuild the core for redundancy?
    And time runs out?
    Pick either LACP or VLAN. If it wasn't redundant and segmented before you started, you're not failing if you don't tick all your goals. At worst, it'll be like a rollback, but i.e. with nicer interface descriptions 🙂
    I think many people who get stuck troubleshooting too long and forgo their rollback moment (bingo point) are either afraid of being blamed or are used to fighting hard to reach their goal.
    Neither is useful for staying within a downtime window, and both can be helped.
    The trick is again isolating an aspect – i.e. the person who wants to fight with the bug until they've squashed it… They might stay at that for 10, 20 hours and you can't stop them. But you can ask them if they could also track the error while everything else is back up. Suddenly, bringing up stuff on time becomes just a thing they need to do to fight the bug and they'll immediately get on to it. 😉
    For the afraid ones, you need to turn it around and ask them if they could isolate the affected network portions from the error. I bet they immediately know how to do that. I used to be more on that side; always afraid to slap in a makeshift workaround that others would sell as a "great solution". So ask those people, "can you just slap something on this that will last a few hours?"
    But you already mentioned the important part – being prepared so one knows which debug information to collect. I usually made scripts to collect it before I start, and then I can just run those once more when I'm aborting the change.
    One last thought: sometimes there's also a possible culture problem: if people need to justify why they aborted to superiors. Long ago we had some very sloppy superiors who DIDN'T follow any proper procedures (i.e. no backups, no prechecks, no cleanups) and then asked things like "oh but why did you abort? we only need 30 minutes for that".
    My advice for that is to have strict, predefined, documented cleanups and other post actions. If your team does the cleanups for others, there will come a day where you forget or miss one. And that will be the most lovely outage you ever saw.

  3. “The general idea is that you invest 90% of the effort into documenting and verifying your change proposal.” – I like this comment because it highlights the importance of detailed planning and lab testing. Many engineers unfortunately ignore the latter too often when making non-routine changes.
    I’ll add that well-planned and tested changes reduce the need to roll back, whether due to an outage or because an approved change does not provide the desired functionality for your customers (without necessarily causing an outage).
