Checklist – Making safe network changes

John Harrington

11 years ago

A few months back I wrote a post about checklists, and promised to follow up with a few of my own. Here is my first attempt at a checklist, trying to address the common pitfalls I see when people are making network changes. I’m not sure I follow all the guidance in the ‘Checklist Manifesto’ but here goes.

Before you begin

Communication plan
- Concall bridge is published?
- Bridge is not dependent upon the network you’re changing?
Are you sure the change is fully approved?
Who will you call when the CM fails?
- TAC phone numbers
- Mgmt escalation contact
- Data Center Security if drive-of-shame is required
- Engineering backup / remote-hands
Clear timeline, Go/No-Go and Pause/Monitor decision points.
Clear ownership and authority (who gets to make the call?)

Monitoring

How do you know if you have broken anything?
What graphs or stats will you check: throughput/pps, neighbor counts, prefixes, etc.?
Do you know what ‘normal’ looks like for your change window?
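
One way to make "what does normal look like" concrete is to record a baseline before the window and compare it with the same counters afterwards. Below is a minimal Python sketch of that comparison. The metric names, the numbers and the 10% tolerance are all made up for illustration; how you collect the values (SNMP, CLI scraping, your monitoring system) is up to you.

```python
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return metrics that are missing or deviate from baseline by more than tolerance.

    The result maps metric name -> (expected, observed).
    """
    deviations = {}
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None:
            # Metric disappeared entirely after the change
            deviations[name] = (expected, None)
            continue
        if expected == 0:
            # Avoid dividing by zero: any non-zero value is a change
            changed = observed != 0
        else:
            changed = abs(observed - expected) / abs(expected) > tolerance
        if changed:
            deviations[name] = (expected, observed)
    return deviations


# Hypothetical values captured before and after the change
baseline = {"bgp_neighbors": 12, "ospf_neighbors": 4, "prefixes": 52000, "pps": 180000}
after_change = {"bgp_neighbors": 12, "ospf_neighbors": 3, "prefixes": 51800, "pps": 175000}

problems = compare_to_baseline(baseline, after_change)
print(problems)  # {'ospf_neighbors': (4, 3)}
```

A relative tolerance suits noisy counters like pps; for neighbor counts you may well prefer an exact match, which you can get by passing tolerance=0.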

Console

Does your console port work for all devices you will touch?
Will your console access work when you lose TACACS?
Is your routing path to the console router dependent upon the device you are reloading?
- Are you sure?
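
To back up the "are you sure?" with something you can check, you could take the hops from a traceroute towards the console router and see whether any of them belong to the device you are about to reload. A rough Python sketch follows; the addresses are invented, and it assumes you already have the hop list and the device's interface addresses as plain strings. Bear in mind that a traceroute shows only the path in use right now, so treat a clean result as a hint rather than proof.

```python
import ipaddress


def path_depends_on(hops, device_addresses):
    """Return the traceroute hops that are addresses of the device under change."""
    device = {ipaddress.ip_address(a) for a in device_addresses}
    return [h for h in hops if ipaddress.ip_address(h) in device]


# Hypothetical hops towards the console router, and the addresses
# configured on the device that is about to be reloaded
hops_to_console = ["10.0.0.1", "10.0.12.2", "10.0.99.5"]
reloading_device = ["10.0.12.2", "10.255.0.7"]

shared = path_depends_on(hops_to_console, reloading_device)
print(shared)  # ['10.0.12.2']
```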

Reloads

Do you have a config backup?
Is the boot image present on bootflash?
- Has it been copied to all members of the stack?
Do you have a backup image on the bootflash? (There may not be room.)
Check your config register?
Check that your boot statement is correct?
Check that the image is not corrupt (verify /md5)
- Compare with Cisco.com or trusted checksum cache, rather than an intermediate image on your ftp server which could be corrupt.
Check that you’re not going to disable SSH access by loading a non-K9 image where SSH is the only method of access?
Check that traffic is actually shifted away?
If modular switch / router:
- Are there any new or unpowered line cards in the router? Can the chassis supply power to all line cards when it comes back up?
- Are there any line cards with mixed forwarding modes? Is there a risk that system forwarding capacity will drop to match the lowest capacity line card?
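
For the image checksum item above, it can help to verify the copy on your staging server before it ever reaches the device, in addition to running verify /md5 on the device itself. Here is a minimal Python sketch using the standard hashlib module. The function names are my own, and the trusted checksum is assumed to come from the vendor's download page or your own trusted cache.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large images do not need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, trusted_md5):
    """Return True if the file's MD5 equals the trusted checksum."""
    return md5_of_file(path) == trusted_md5.strip().lower()
```

Usage would look something like image_matches("/tftpboot/image.bin", checksum_from_vendor), where both the path and the variable are placeholders. MD5 is used here only because that is what the device-side check reports; it detects corruption, not deliberate tampering.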

Upgrades

Do you have a config backup?
Have you tested the upgrade cycle in a lab with typical production config?
- Does the upgrade remove any of your configuration?
- Does the upgrade change any of your configuration?
- Have any defaults changed? (Check your show run all.)
- Do you know how long the upgrade will take?
Have you tested the downgrade cycle in a lab with typical production config?
- Does the downgrade remove any of your configuration?
- Do you know how long the downgrade will take?
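
The "does the upgrade remove or change any of your configuration?" questions are easiest to answer by diffing the config (or show run all output) captured before and after the lab upgrade. A small Python sketch using the standard difflib module is below. The config snippets are invented lab examples, not output from any real upgrade.

```python
import difflib


def config_changes(before, after):
    """Return (removed, added) lists of config lines between two config texts."""
    before_lines = before.splitlines()
    after_lines = after.splitlines()
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=before_lines, b=after_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" means lines were both removed and added at this position
        if tag in ("replace", "delete"):
            removed.extend(before_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after_lines[j1:j2])
    return removed, added


# Hypothetical lab captures from before and after the upgrade
pre_upgrade = "hostname lab-sw1\nip ssh version 2\nspanning-tree mode pvst\n"
post_upgrade = "hostname lab-sw1\nip ssh version 2\nspanning-tree mode rapid-pvst\n"

removed, added = config_changes(pre_upgrade, post_upgrade)
print(removed)  # ['spanning-tree mode pvst']
print(added)    # ['spanning-tree mode rapid-pvst']
```

The same comparison works for the downgrade cycle: diff the original config against the config you end up with after upgrading and downgrading again, and expect both lists to be empty.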

Configuration

Do you have a config backup?
Assume the worst:
- Commit confirmed (Junos)
- Reload in 10 (Cisco – Set an alarm on your phone, don’t get distracted!)
- Archive/Rollback (Cisco)
ACLs: Know you won’t cut off your own in-band session?
Are all steps explicitly called out?
Are all rollback steps explicitly called out?
Have you tested the rollback in the lab?
- Have you vim-diff’ed the pre-upgrade config with the post-rollback config?
- e.g. Did you add ‘redistribute ospf y route-map x’ to the config, then roll back with ‘no redistribute ospf y route-map x’? On IOS that form can remove only the route-map and leave ‘redistribute ospf y’ behind.
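
For the ACL question above, you can sanity-check a proposed ACL offline by running your own management session through it before you apply it. The Python sketch below uses a deliberately simplified model: first match wins, implicit deny at the end, and matching only on source prefix and destination port. The AclEntry type and the addresses are hypothetical, and real ACLs (wildcard masks, protocols, established keywords) need more care than this.

```python
import ipaddress
from typing import NamedTuple, Optional


class AclEntry(NamedTuple):
    action: str                      # "permit" or "deny"
    source: str                      # source prefix in CIDR form
    dst_port: Optional[int] = None   # None matches any destination port


def acl_permits(entries, src_ip, dst_port):
    """Return True if the first matching entry permits the session."""
    addr = ipaddress.ip_address(src_ip)
    for entry in entries:
        in_prefix = addr in ipaddress.ip_network(entry.source)
        port_ok = entry.dst_port is None or entry.dst_port == dst_port
        if in_prefix and port_ok:
            return entry.action == "permit"
    return False  # implicit deny when nothing matches


# Hypothetical proposed ACL: SSH allowed from the management range only
proposed = [
    AclEntry("permit", "10.1.0.0/16", 22),
    AclEntry("deny", "0.0.0.0/0"),
]

print(acl_permits(proposed, "10.1.5.20", 22))   # True
print(acl_permits(proposed, "192.0.2.10", 22))  # False
```

If your own source address comes back False, fix the ACL before the change window rather than during it.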

Feedback

So folks, how could I make this a better and more useful resource? I’d love to get your suggestions. Drop a quick comment or ping me on Twitter @networksherpa.