Checklist – Making safe network changes

A few months back I wrote a post about checklists, and promised to follow up with a few of my own. Here is my first attempt at a checklist, trying to address the common pitfalls I see when people make network changes. I’m not sure I follow all the guidance in the ‘Checklist Manifesto’, but here goes.

Before you begin

  • Communication plan
    • Concall bridge is published?
    • Bridge is not dependent upon the network you’re changing?
  • Are you sure the change is fully approved?
  • Who will you call when the CM fails?
    • TAC phone numbers
    • Mgmt escalation contact
    • Data Center Security if drive-of-shame is required
    • Engineering backup / remote-hands
  • Clear timeline, Go/No-Go and Pause/Monitor decision points.
  • Clear ownership and authority (who gets to make the call?)

Monitoring

  • How do you know if you have broken anything?
  • What graphs or stats will you check: throughput/pps, neighbor counts, prefix counts, etc.?
  • Do you know what ‘normal’ looks like during your change window? (A quick ‘before’ snapshot sketch follows this list.)
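
For example, a minimal ‘before’ snapshot on a Cisco IOS box might be as simple as the commands below (the interface name is a placeholder; capture whatever matters for your change so you can diff it afterwards):

  show clock
  show processes cpu history
  show ip interface brief
  show interfaces GigabitEthernet0/1 | include rate
  show ip ospf neighbor
  show ip bgp summary
  show ip route summary
  show logging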

Console

  • Does your console port work for all devices you will touch?
  • Will your console access work when you lose TACACS?
  • Is your routing path to the console router dependent upon the device you are reloading?
    • Are you sure?
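
If TACACS+ is part of that console path, a minimal IOS sketch of a local fallback looks roughly like this (the username and method-list name are placeholders; adapt it to your own AAA policy):

  aaa new-model
  username rescue privilege 15 secret <strong-local-password>
  aaa authentication login CONSOLE group tacacs+ local
  line con 0
   login authentication CONSOLE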

Reloads

  • Do you have a config backup?
  • Is the boot image present on bootflash?
    • Has it been copied to all members of the stack?
  • Do you have a backup image on the bootflash? (there may not be room)
  • Check your config register?
  • Check that your boot statement is correct?
  • Check that the image is not corrupt (verify /md5 – see the sketch after this list)
    • Compare against the checksum on Cisco.com or a trusted checksum cache, rather than against an intermediate copy on your FTP server, which could itself be corrupt.
  • Check that you’re not going to disable SSH access by loading a non-K9 image where SSH is the only method of access?
  • Check that traffic has actually been shifted away?
  • If modular switch / router:
    • Are there any new or unpowered line cards in the router? Can the chassis supply power to all line cards when it comes back up?
    • Are there any line cards with mixed forwarding modes? Is there a risk that system forwarding capacity will drop to match the lowest capacity line card?
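
A rough pre-reload sanity sequence on IOS might look like the following (the image file name is a placeholder for whatever your boot statement actually references):

  dir bootflash:                          ! is the image really there?
  show run | include boot system          ! does the boot statement match it?
  show version | include register         ! config register as expected (usually 0x2102)?
  verify /md5 bootflash:your-ios-image.bin
  copy running-config startup-config
  copy running-config tftp:               ! off-box copy as well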

Upgrades

  • Do you have a config backup?
  • Have you tested the upgrade cycle in a lab with typical production config?
    • Does the upgrade remove any of your configuration?
    • Does the upgrade change any of your configuration?
    • Have any defaults changed? (check your ‘show run all’)
    • Do you know how long the upgrade will take?
  • Have you tested the downgrade cycle in a lab with typical production config?
    • Does the downgrade remove any of your configuration?
    • Do you know how long the downgrade will take?
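
For the lab diffing itself, a rough IOS sketch (assuming your image supports the contextual config-diff utility; the file name is a placeholder):

  copy running-config flash:pre-upgrade.cfg
  ! ... perform the upgrade and reload the lab box ...
  show archive config differences flash:pre-upgrade.cfg system:running-config

Anything that shows up in the diff which you didn’t type yourself is a default or behaviour change to chase down before the production window.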

Configuration

  • Do you have a config backup?
  • Assume the worst:
    • Commit confirmed (junos)
    • Reload in 10 (Cisco – Set an alarm on your phone, don’t get distracted!)
    • Archive/Rollback (Cisco)
  • ACLs: Do you know you won’t cut off your own in-band session?
  • Are all steps explicitly called out?
  • Are all rollback steps explicitly called out?
  • Have you tested the rollback in the lab?
    • Have you vim-diff’ed the pre-upgrade config with the post-rollback config?
    • e.g. If you added ‘redistribute ospf y route-map x’ to the config, does rolling back with ‘no redistribute ospf y route-map x’ restore the exact original config?
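
Putting the ‘assume the worst’ items above into a concrete Cisco IOS sketch (file names are placeholders, and ‘configure replace’ needs an image with the config-replace feature; Junos users get the same effect from ‘commit confirmed’):

  copy running-config flash:pre-change.cfg    ! local snapshot of the known-good config
  reload in 10                                ! only a safety net if startup-config is still the good config
  configure terminal
   ! ... make your change here ...
   end
  reload cancel                               ! if the change looks good, cancel the timer
  configure replace flash:pre-change.cfg      ! if it does not, roll straight back to the snapshot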

Feedback

So folks, how could I make this a better and more useful resource? I’d love to get your suggestions. Drop a quick comment or ping me on Twitter @networksherpa.
Image provided by FreeDigitalPhotos.net
 

13 thoughts on “Checklist – Making safe network changes”

  1. Hey NetworkSherpa,
    I’ve a few to add, if you can drop me your email address I can send you quite a bit more (very cautious financial background and all that);
    If a change will involve the reset or reboot of a device, it can be useful to plan to reset or reboot the device before making your change. This confirms the device will successfully reset or reboot in its current state, so if it fails to do so after the change is made, the cause is likely to be related to the change. This is particularly pertinent where:
    A device has been running for a long time
    A device has mechanical parts such as a hard disk drive
    A device is having its entire operating system or ‘core’ software upgraded
    Pre and post change log checks, dumps and comparisons
    Turning console logging on
    Device stability checks (general CPU load etc.)
    Especially if working remotely, use MAC checks or CDP to ensure the topology is what you think it is
    If changing or migrating L3 addresses, clear surrounding device ARP caches
    When changing or modifying routes, add the new one first, then remove the old – this prevents traffic loss
    Edit ACLs rather than removing and re-adding
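    For example, the ‘add the new route first’ point in IOS static-route syntax (made-up prefixes and next hops):
      ip route 203.0.113.0 255.255.255.0 192.0.2.2      ! add the new next hop first
      no ip route 203.0.113.0 255.255.255.0 192.0.2.1   ! then remove the old one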

  2. One thing I always check (if I can) is the users’ ability to do what they need to. It’s an extra step that’s helped me a lot to know that the network’s working but their application is not.

    1. Thanks for the comments Robert. That’s a great idea; if you have a few user groups it’s great to test on their behalf, or better yet have them test. Scaling that approach is a challenge though.

  3. Great breakdown! Console access is one I don’t always think about. Thanks!
    Also, one I see missed a lot (I’m guilty as well) is a log and stability check beforehand. Depending on when the go/no-go calls are, this could be incorporated into them. But it’s still safe to check right before your changes.
    Thanks again!!

  4. Hi John,
    Really nicely compiled checklist. My question might be specific, but I still wanted to check with you on this. As per your checklist it says
    "Is the boot image present on bootflash?", i.e. to check if the image is in the bootflash directory. Additionally, I have seen that on some devices (c7200, c6509 and nexus7k) there are other directories as well, like flash:, disk0:, slot0:, sup1:, sup2:, sup: etc. So in which directory should the image and backup_running_config be present before we go for a reload of the router?
    Example scenario – I would like to 'wr erase' and reload a device as I am trying to remove the box from the network. But just for a successful rollback step, do I need to have the backup_running_config in the bootflash / flash / slot / disk directory?
    Sorry for hijacking the thread with this specific question. :)
    Thanks,
    Srinivas

    1. Hi Sri,
      The best approach to take that would work across all systems is:
      Config: Do a ‘copy run start’ or the equivalent on non-Cisco gear. I wouldn’t look for the startup config file on the file system, but rely on the command working. If you were really suspicious, you could do a ‘copy run tftp’.
      Image: I would check the image location with ‘show boot’, ‘show boot system’ or ‘show run | i boot’. If the boot statement says disk0: or bootflash: or other, you need to use ‘dir’ to ensure you see the matching image on the correct mount point: disk0:, bootflash: or other.
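      For example, on a box that boots from bootflash: you might run something like this (substitute disk0:, slot0:, etc. if that is what the boot statement names):
        show boot
        show run | include boot system
        dir bootflash: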

  5. Hey John, nice work.
    I liked your original blog about checklists and it’s good to see you come back to it. Now you’ve got me thinking about how I can incorporate some of this in my Implementation Plan template.
    Other people have already mentioned these but I think they’re so important it won’t hurt to mention them again:
    Enabling logging on the console. I’m amazed to see people do changes without this enabled. Generally you want to know if an interface or OSPF/EIGRP/PIM/BGP etc neighbour bounces in the middle of a change!
    Technical testing during and post change. These days I’m in the position of writing changes for other people to implement most of the time, and it’s really made me think about exactly how to test that particular steps within a change have done what they should have. Writing "check internet" just doesn’t cut it, when what you really want someone to do is check that both BGP neighbours are up, and the firewall cluster is in the expected state, and the right router is HSRP master, and…
    Business testing post change. Doing all the technical testing in the world after your late night change doesn’t do you a whole lot of good if people rock into work the next day and find something not working. When you do your technical testing and then get someone to do an actual business test straight away, that can short circuit a whole bunch of questions if something does go wrong before the next business day.
    And finally, not really for a checklist, but kind of related to your “Clear timeline, Go/No-Go” and “ownership and authority”. If you feel you need to go “off plan”, either a) run it past the PM or change manager if you have one, or b) be prepared to explain yourself if anything goes wrong from that point on. (a) is always better if you can. In some places I’ve worked, with very strict change control, (b) would never have cut it. And in others, I’ve had to use up a fair bit of good will explaining why they had to send someone to the top of a mountain to reset a router that was no longer reachable.
    Cheers,
    Ray

    1. Hey Ray,
      Those are some excellent tips. I’m very focussed on network change procedure right now, and I do a lot of scrutiny (tier-3 tech reviewer) on all changes. One that I’ll add to yours is “make zero assumptions about network configuration or state.”
      For example, I saw a very well-written traffic shift CM which, in simplified form, said ‘shift off by setting cost = 65000, roll back by setting cost = 100’. The assumption here is that 100 is the expected normal cost, but there was no explicit check to ensure that was actually the original state.
      I will come back someday and compile all these great suggestions. In the meantime I think the readers will get great value out of all the comments. Thank you.
