Scenario: You are an engineer who runs a managed network on behalf of a customer. Your manager has asked you to create a change control process. Your customer and your manager will measure you only by the uptime or outages they experience, and don’t care what your process looks like.
I’ve discussed why we need change control in a previous post. Knowing this, what sort of process would you create? In this post I provide a high-level template and some tips.
A gentler form of change control
Before we begin you should note that no matter how many controls you put in place, you can never eliminate 100% of the risk of outages. A zero tolerance approach to outages is a dangerous goal, and can easily destroy agility. If you’re just starting out with change control, keep it high-level and introduce only controls that deliver value to you and your team.
It is much more effective to start from a defensive mindset, where you expect the change to go wrong. When you’re executing the change, you should focus on quickly detecting when you’ve broken the network and taking immediate action to undo your change.
The general idea is that you invest 90% of the effort into documenting and verifying your change proposal. You get it peer reviewed (if possible) and the final product is a clear and easy-to-implement change. If your change breaks the network or applications, you will have the correct graphs and commands to hand to diagnose the problem quickly, and you’ve got a full rollback configuration as a get-out-of-jail-free card.
If you exclude all the other controls mentioned below, you will get 80% of the benefit from this simple tip – write down every command you intend to apply in a simple document. This will allow you to step back and review your change as a whole and you can easily produce ‘no-format’ versions of your config for easy rollback.
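As a rough illustration of the rollback idea, here is a minimal Python sketch that turns a list of planned commands into a naive rollback list. It assumes that ‘no-format’ means the Cisco-style `no <command>` negation, that every command is a flat, global-mode line, and that each one creates new configuration rather than overwriting an existing value. The function names and the sample commands are hypothetical.

```python
def negate(command):
    """Return the 'no' form of one flat command, or strip an existing 'no'."""
    command = command.strip()
    if command.startswith("no "):
        return command[3:]
    return "no " + command


def naive_rollback(commands):
    """Negate each command in reverse order, skipping blanks and '!' comments."""
    active = [c.strip() for c in commands
              if c.strip() and not c.strip().startswith("!")]
    return [negate(c) for c in reversed(active)]


# Hypothetical change: assumes neither line overwrites existing configuration.
change = [
    "! add a static route and disable DNS lookups",
    "ip route 10.20.0.0 255.255.0.0 192.0.2.1",
    "no ip domain-lookup",
]

for line in naive_rollback(change):
    print(line)
```

Treat the result as a first draft to review by hand. A command that replaced an existing value needs the old value restored rather than a plain negation, and commands nested under an interface or routing process need their parent line repeated.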
A proposed format
I use a lightweight change control procedure using the elements below. Take it as a starting point and reduce or expand the list or individual elements as you see fit.
- Create a simple text document describing the entire change.
- The document will include:
  - A step to ensure the configs are backed up.
  - Check graphs. Don’t just say ‘check graphs’; link to a specific URL and say what levels you expect, e.g. ‘check traffic below 5 Mbps’, or ‘note traffic levels for later comparison’.
  - Baseline the existing network configuration using ‘show config’ commands and ‘show state’ commands.
    - ‘State’ commands are useful so that you know if you’ve broken anything when your change is complete, e.g. Execute: sh ip ospf interface brief, Check: output includes 7 neighbors.
    - ‘Show config’ commands help you know if your change will update/overwrite or create new configuration. How would you back out a route-map change if you don’t know this? It’s no fun when you delete a route-map you accidentally ‘edited’ when you thought you’d ‘created’ it.
  - Detailed configuration steps – This is the exact configuration you will apply.
  - Post-change verification steps. Repeat your baseline and compare. Use your ‘state’ and ‘show config’ commands from before plus some active tests, e.g. traceroute, web browsing etc.
  - Detailed rollback steps – All commands that are needed for you or a help-desk technician to roll back your changes while you sleep.
  - Post-change graph checks.
  - Post-change notification – Tell your team what you did, when you did it and what to do if other stuff breaks when you’re asleep.
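The ‘repeat your baseline and compare’ step can be partly automated. The Python sketch below uses the standard-library `difflib` module to list lines that were added or removed between a pre-change and a post-change capture of the same ‘state’ command. The capture text is a made-up placeholder rather than real device output, and `compare_state` is a hypothetical helper name.

```python
import difflib


def compare_state(pre, post):
    """Return lines removed ('- ') or added ('+ ') between two text captures."""
    diff = difflib.ndiff(pre.splitlines(), post.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Placeholder captures (not real device output): one neighbor has disappeared.
pre_change = "neighbor 10.0.0.1 FULL\nneighbor 10.0.0.2 FULL"
post_change = "neighbor 10.0.0.1 FULL"

changes = compare_state(pre_change, post_change)
if changes:
    print("State changed, investigate or roll back:")
    for line in changes:
        print(line)
else:
    print("No differences from baseline.")
```

Real output usually contains counters and timers that change on every run, so in practice you would filter those fields out before comparing, or compare only the lines you care about.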
A great manager once told me ‘all processes have exceptions, and you need to include a way to handle them’. Here are some exceptions you may need to deal with:
- I need to adjust configuration on the fly – The jury is out on this one. The purists will say that if the document is wrong in any way you should ‘halt and back-out’ no matter what. I would say that if it’s a cosmetic issue or a minor syntax issue, you can fix it, but you should amend the change document – otherwise roll back. Your team should always be able to see the precise changes you applied.
- I need time to troubleshoot before rolling back – Sometimes there is cause to do this but in general it’s a really bad idea. If you have the right verification commands in your change, you’ll automatically have the data for later analysis. I’ve been that guy who consumed the entire change window and then tried to roll back, only for that to fail. It’s your call, but I recommend you quickly gather diagnostic information, then roll back and analyse later.
Some Extra Tips
- Use a text editor such as Notepad++ (Windows) or Sublime Text (Mac/Windows) to gain access to block-level editing (Alt+Select). If you’re not already using this, it’s a game-changer. Editing a config as a block really speeds up your changes and reduces the number of copy-paste errors.
- I prepend my comments in this document with exclamation marks (!) so that comments can be integrated into the doc without risk of being accidentally applied as configuration when the doc is pasted into a device.
- I don’t include ‘conf t’ and ‘wr mem’ but add them as text steps preceded by ! marks. Again, I want complete control, and I have found that an un-commented ‘conf t’ is very dangerous when you copy and paste configs.
- Exit to exec mode after every step, every time. I’ve had too many accidental right-clicks to count.
A simple document which includes repeated blocks of the following steps will vastly increase your change safety. Start with this and include the other tips and suggestions only where they add value to your situation.
PRE-CHECK, EXECUTE, VERIFY, [ROLLBACK]
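To make the repeated block concrete, here is a small Python sketch that renders one such block as plain text. It follows the tips above: notes are prefixed with ‘!’, ‘conf t’ and ‘wr mem’ appear only as commented text steps, and each configuration section ends by returning to exec mode. The `render_step` helper, the layout, the placement of the ‘wr mem’ reminder after VERIFY and the Loopback0 example are all assumptions made for illustration, not a prescribed format.

```python
def comment(text):
    """Prefix a note with '!' so it is treated as a comment, not a command."""
    return "! " + text


def render_step(name, pre_check, execute, verify, rollback):
    """Render one PRE-CHECK / EXECUTE / VERIFY / [ROLLBACK] block as text."""
    lines = [comment("===== " + name + " =====")]
    lines += [comment("PRE-CHECK")] + pre_check
    # 'conf t' and 'wr mem' are only ever written as commented text steps.
    lines += [comment("EXECUTE"), comment("type by hand: conf t")] + execute
    lines += ["end"]  # return to exec mode at the end of the step
    lines += [comment("VERIFY")] + verify
    lines += [comment("if VERIFY passes, type by hand: wr mem")]
    lines += [comment("ROLLBACK (only if VERIFY fails)"),
              comment("type by hand: conf t")]
    lines += rollback + ["end"]
    return "\n".join(lines)


# Hypothetical example step.
step = render_step(
    name="Step 1: label Loopback0",
    pre_check=["sh ip ospf interface brief",
               comment("Check: output includes 7 neighbors")],
    execute=["interface Loopback0", " description MGMT"],
    verify=["sh ip ospf interface brief",
            comment("Check: output still includes 7 neighbors")],
    rollback=["interface Loopback0", " no description"],
)
print(step)
```

Generating the blocks from one function keeps every step in the same shape, which makes a missing rollback or verification section easy to spot during peer review.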
Some of you will see this as too heavy and some will feel it’s too light, but either way you should have an opinion on this topic. If you want some more practical tips and a lot more depth, you can review the comprehensive change control checklist post.