Before you begin
- Communication plan
- Is the conference-call bridge published?
- Is the bridge independent of the network you’re changing?
- Are you sure the change is fully approved?
- Who will you call when the CM fails?
- TAC phone numbers
- Mgmt escalation contact
- Data Center Security if drive-of-shame is required
- Engineering backup / remote-hands
- Clear timeline, Go/No-Go and Pause/Monitor decision points.
- Clear ownership and authority (who gets to make the call?)
Monitoring
- How do you know if you have broken anything?
- What graphs or stats will you check (throughput/pps, neighbor counts, prefixes, etc.)?
- Do you know what ‘normal’ looks like for your change window?
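One way to make 'normal' concrete is to record a few numbers before the window and compare them afterwards. The sketch below is only an illustration: the metric names, values and tolerances are made up, and how you collect them (SNMP, CLI scrape, your graphing tool) is up to you.

```python
def compare_to_baseline(baseline, current, tolerances=None, default_tolerance=0.10):
    """Return {metric: (expected, observed)} for metrics that drifted too far.

    Tolerances are fractions of the baseline value (0.10 means 10%).
    A tolerance of 0.0 flags any change at all.
    """
    tolerances = tolerances or {}
    deviations = {}
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None:
            # Metric disappeared entirely: always worth a look.
            deviations[name] = (expected, None)
            continue
        allowed = tolerances.get(name, default_tolerance)
        if expected == 0:
            drifted = observed != 0
        else:
            drifted = abs(observed - expected) / abs(expected) > allowed
        if drifted:
            deviations[name] = (expected, observed)
    return deviations


# Hypothetical numbers captured before and after the change.
baseline = {"bgp_neighbors": 12, "ospf_neighbors": 4, "prefixes": 52000, "pps": 180000}
after = {"bgp_neighbors": 11, "ospf_neighbors": 4, "prefixes": 51800, "pps": 95000}
# Neighbor counts should not move at all; everything else gets 10% slack.
tolerances = {"bgp_neighbors": 0.0, "ospf_neighbors": 0.0}

for metric, (expected, observed) in compare_to_baseline(baseline, after, tolerances).items():
    print(f"{metric}: expected about {expected}, observed {observed}")
```

With these sample numbers the lost BGP neighbor and the drop in pps are flagged, while the small change in prefix count is not.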
Console
- Does your console port work for all devices you will touch?
- Will your console access work when you lose TACACS?
- Is your routing path to the console router dependent upon the device you are reloading?
- Are you sure?
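If you want something firmer than "I think so", you can compare the hops on your path to the console router against the addresses of everything you are about to touch. This is a rough sketch: it assumes you have already collected the hop list yourself (for example from a traceroute run on your jump host), and the addresses shown are placeholders.

```python
import ipaddress


def hops_at_risk(path_hops, change_addresses):
    """Return the hops on the console path that belong to devices in the change."""
    at_risk = {ipaddress.ip_address(addr) for addr in change_addresses}
    return [hop for hop in path_hops if ipaddress.ip_address(hop) in at_risk]


# Hypothetical path from the jump host to the console router.
console_path = ["10.0.0.1", "10.0.1.1", "10.0.9.5"]
# Hypothetical addresses (all interfaces) of the devices being reloaded.
devices_in_change = ["10.0.1.1", "10.0.1.2"]

shared = hops_at_risk(console_path, devices_in_change)
if shared:
    print("Console path depends on devices in the change:", ", ".join(shared))
else:
    print("No overlap found on this path.")
```

An empty result only covers the path you traced. Routing may pick a different path once the device goes down, so it does not replace actually testing console access.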
Reloads
- Do you have a config backup?
- Is the boot image present on bootflash?
- Has it been copied to all members of the stack?
- Do you have a backup image on the bootflash? (may not be room).
- Check your config register?
- Check that your boot statement is correct?
- Check that the image is not corrupt (verify /md5).
- Compare with Cisco.com or trusted checksum cache, rather than an intermediate image on your ftp server which could be corrupt.
- Check that you’re not going to disable SSH access by loading a non-K9 image where SSH is the only method of access?
- Check that traffic is actually shifted away?
- If modular switch / router:
- Are there any new or unpowered line cards in the router? Can the chassis supply power to all line cards when it comes back up?
- Are there any line cards with mixed forwarding modes? Is there a risk that system forwarding capacity will drop to match the lowest capacity line card?
Upgrades
- Do you have a config backup?
- Have you tested the upgrade cycle in a lab with typical production config?
- Does the upgrade remove any of your configuration?
- Does the upgrade change any of your configuration?
- Have any defaults changed? (Check your show run all.)
- Do you know how long the upgrade will take?
- Have you tested the downgrade cycle in a lab with typical production config?
- Does the downgrade remove any of your configuration?
- Do you know how long the downgrade will take?
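To answer the "removed" and "changed" questions from a lab run, you can diff the config captured before the upgrade against the one captured after. The sketch below uses Python's standard difflib; the sample config lines and the list of volatile header lines to ignore are assumptions, so adjust them for your platform.

```python
import difflib

# Lines that change on every capture and would only add noise (assumed list).
VOLATILE_PREFIXES = (
    "Building configuration",
    "Current configuration",
    "! Last configuration change",
    "! NVRAM config last updated",
)


def normalise(config_text):
    """Split a config into lines, dropping blanks and volatile header lines."""
    lines = []
    for line in config_text.splitlines():
        line = line.rstrip()
        if not line or line.startswith(VOLATILE_PREFIXES):
            continue
        lines.append(line)
    return lines


def config_changes(before, after):
    """Return (removed, added) config lines between two captures."""
    removed, added = [], []
    for entry in difflib.ndiff(normalise(before), normalise(after)):
        if entry.startswith("- "):
            removed.append(entry[2:])
        elif entry.startswith("+ "):
            added.append(entry[2:])
    return removed, added


# Hypothetical captures taken before and after a lab upgrade.
pre_upgrade = """hostname core1
ip ssh version 2
no ip http server
snmp-server community example RO
"""
post_upgrade = """hostname core1
ip ssh version 2
ip http server
"""

removed, added = config_changes(pre_upgrade, post_upgrade)
print("Removed by upgrade:", removed)
print("Added by upgrade:", added)
```

Running the same comparison on the full output with defaults shown, rather than the plain running config, is what catches a default that quietly flipped.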
Configuration
- Do you have a config backup?
- Assume the worst:
- Commit confirmed (Junos)
- Reload in 10 (Cisco – Set an alarm on your phone, don’t get distracted!)
- Archive/Rollback (Cisco)
- ACLs: Are you sure you won’t cut off your own in-band session?
- Are all steps explicitly called out?
- Are all rollback steps explicitly called out?
- Have you tested the rollback in the lab?
- Have you vim-diff’ed the pre-upgrade config with the post-rollback config?
- e.g. Did you add ‘redistribute ospf y route-map x’ to the config, then roll back with ‘no redistribute ospf y route-map x’? On some platforms that removes only the route-map and leaves ‘redistribute ospf y’ behind.
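The redistribute example is exactly the kind of leftover a section-aware comparison can catch. Here is a minimal sketch that pairs each indented line with the unindented line above it and reports anything present after rollback that was not there before. The configs are hypothetical, and the parser only tracks one level of nesting, so treat it as a starting point rather than a real config parser.

```python
def section_lines(config_text):
    """Return a set of (parent, line) pairs for a config.

    Top-level lines get an empty parent. Indented lines are paired with the
    most recent unindented line above them (one level of nesting only).
    """
    pairs = set()
    parent = ""
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.strip() == "!":
            continue  # skip blanks and separator lines
        if line.startswith(" "):
            pairs.add((parent, line.strip()))
        else:
            parent = line
            pairs.add(("", line))
    return pairs


def rollback_residue(pre_change, post_rollback):
    """Return (leftover, missing) lines after a rollback, each as (parent, line)."""
    before = section_lines(pre_change)
    after = section_lines(post_rollback)
    return sorted(after - before), sorted(before - after)


# Hypothetical configs: the rollback removed the route-map but not the redistribution.
pre_change = """router bgp 65000
 neighbor 192.0.2.1 remote-as 65001
!
"""
post_rollback = """router bgp 65000
 neighbor 192.0.2.1 remote-as 65001
 redistribute ospf 1
!
"""

leftover, missing = rollback_residue(pre_change, post_rollback)
for parent, line in leftover:
    print(f"Left behind under '{parent}': {line}")
for parent, line in missing:
    print(f"Missing under '{parent}': {line}")
```

Both lists should be empty after a clean rollback. Anything in `leftover` needs an extra step in your rollback plan.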
Feedback
So folks, how could I make this a better and more useful resource? I’d love to get your suggestions. Drop a quick comment or ping me on Twitter @networksherpa.