Career – The network rockstar and the checklist
We’re in the midst of a networking boom at the moment and new technologies are being released at a rapid pace. So much so that network engineers need a suite of knowledge management tools to navigate the daily deluge of articles, documents, twikis and notes.
That said, how much of your day-to-day activities are markedly different from what they were two years ago? As I see it, the role of the network engineer is largely unchanged. One still has to gather requirements, write design docs, order, ship, build, configure and troubleshoot. Design reviews are still required and change procedures are still needed before touching the live network.
Trouble is, I think we execute these day-to-day tasks poorly, sometimes omitting crucial steps. I’m sure you excel at one or more of these areas, but are you consistent? What about consistency across your team?
If you are, then congratulations, but I don’t see it too often. Why do we keep dropping the ball, only to repeat the same mistake in the next project? Can anything be done to turn this around? In The Checklist Manifesto: How to Get Things Right, Atul Gawande presents checklists as the solution.
But networking is too complex for checklists
I know, I know, I’m insulting your intelligence. You’re far too smart to need a checklist. I’m quite sure you could do a site survey in your sleep. However, checklists are producing results for surgeons, pilots and structural engineers building some of the world’s tallest buildings. Can you really argue that your job is ‘more complex’ or requires ‘more knowledge’ than these professions?
My initial reaction when I heard about the checklist approach was, ‘oh great, more process’, but I’ve had a change of heart after reading the book. A good checklist should produce a better outcome, not simply the same result with additional crappy procedures.
The rockstar
One of the big reasons checklists aren’t already popular is that they’re just not cool. We like to think of ourselves as skillful engineers ready to deploy our years of acquired experience and judgement to solve every problem. There’s no question that experience is vital. However, if we’re honest, we mostly fix problems caused by earlier bad decisions or omissions. If we get our shizz together, there’s really no need for the rockstar network engineer to come in and make all those diving saves.
The author is a surgeon and walks you through some fascinating stories where he applies the checklist approach to real world scenarios. You should read the book if you want a better background on the motivation for checklists and the value that they provide. Here’s an excerpt:
“It is common to misconceive how checklists function in complex lines of work. They are not comprehensive how-to guides, whether for building a skyscraper or getting a plane out of trouble. They are quick and simple tools aimed to buttress the skills of expert professionals. And by remaining swift and usable and resolutely modest, they are saving thousands upon thousands of lives.”
A short simple list of killer items + communication
Adding bullet points to a process or procedure does not make it a checklist. It takes a bit of work to edit down a procedure to remove the how-to steps, and include the absolute minimum checks needed to ensure a successful result. Here’s another quote from the book, this time from the guy who writes the flight check-lists for Boeing:
“..after about sixty to ninety seconds at a given pause point, the checklist often becomes a distraction from other things. People start “shortcutting.” Steps get missed. So you want to keep the list short by focusing on what he called “the killer items”—the steps that are most dangerous to skip and sometimes overlooked nonetheless. The wording should be simple and exact,… and use the familiar language of the profession.”
For example, here are a few checks that could be included in a pre-change checklist.
- Confirm you have all the change approvals in place.
- What are the pause points for test, proceed and rollback? Are there fixed times for each?
- Has console access been confirmed or are you on site?
- Have backups been confirmed, both on the device and on the central backup?
- Have you included config rollback protections in case of isolation?
You don’t need to describe how to do these things in a checklist, just that the steps are completed. However, if the checklist is going to be brutally short then you have traded away completeness for brevity. So what happens when something occurs that wasn’t on the checklist?
The answer is to have a ‘communication step’ built into the checklist. Research has found that getting a team together to introduce themselves and state their role is remarkably effective in mitigating risk. After that follows a very quick briefing discussing the risks, potential complications and contingencies. That’s it. Just strong communications and a brief but powerful checklist.
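To make this concrete, here is a minimal sketch of how the pre-change checks and the communication step could be captured as a short, checkable list. It is only an illustration: the `Check` and `Checklist` classes and the `pre_change_checklist` function are hypothetical names of my own, not something from the book or any tool, and the item wording is paraphrased from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Check:
    """One 'killer item': states what must be true, not how to do it."""
    text: str
    done: bool = False


@dataclass
class Checklist:
    """A short list of checks that must all be confirmed before proceeding."""
    name: str
    checks: list[Check] = field(default_factory=list)

    def confirm(self, index: int) -> None:
        """Mark the check at the given position as completed."""
        self.checks[index].done = True

    def outstanding(self) -> list[str]:
        """Return the text of every check not yet confirmed, in order."""
        return [c.text for c in self.checks if not c.done]

    def ready(self) -> bool:
        """True only when every check has been confirmed."""
        return not self.outstanding()


def pre_change_checklist() -> Checklist:
    """Build the example pre-change checklist (hypothetical wording)."""
    return Checklist(
        name="Pre-change",
        checks=[
            # The communication step comes first.
            Check("Team briefing held: names, roles, risks, contingencies"),
            Check("All change approvals in place"),
            Check("Pause points agreed for test, proceed and rollback"),
            Check("Console access confirmed, or engineer on site"),
            Check("Backups confirmed, on-device and central"),
            Check("Config rollback protection in place in case of isolation"),
        ],
    )


if __name__ == "__main__":
    checklist = pre_change_checklist()
    # Confirm the first four items only, to show the gate refusing to pass.
    for i in range(4):
        checklist.confirm(i)

    if checklist.ready():
        print("All checks confirmed. Proceed with the change.")
    else:
        print("Do not proceed. Outstanding checks:")
        for item in checklist.outstanding():
            print(f"  - {item}")
```

The point of the design is that nothing in the list explains how to take a backup or get console access. Each item is a yes/no confirmation, and the change does not start until `ready()` is true.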
Investigate your failures and fix your process
The author describes a fascinating incident of ice build-up in the fuel lines of jet engines. The crash investigators diagnosed the problem and issued guidance requiring pilots to take a counter-intuitive action: idle the engine when it loses power rather than increase thrust. The issue recurred thirty days later on a Delta 777 flying over Montana.
“Pilots across the world were somehow supposed to learn about these findings and smoothly incorporate them into their flight practices within thirty days. The remarkable thing about this episode—and the reason the story is worth telling—is that the pilots did so. How this happened—it involved a checklist, of course—is instructive. But first think about what happens in most lines of professional work when a major failure occurs. To begin with, we rarely investigate our failures….”
If network engineers do manage to investigate failures, the learnings are rarely incorporated into any sort of standard procedure or checklist. This is really frustrating for senior engineers and managers, but in reality it’s our fault. If you learn a hard lesson and fail to pass it on to the engineers at the coal-face, then you can’t blame the engineer executing the change.
Sherpa Summary
“When we look closely, we recognize the same balls being dropped over and over, even by those of great ability and determination. We know the patterns. We see the costs. It’s time to try something else. Try a checklist.”
Checklists will save you a lot of hardship and a bucket-load of time. You can use that recouped time to start learning those new technologies that keep rolling in. Expect to see a lot more checklists coming your way on this blog.
Disclosure: This post contains links to an affiliate program, for which I receive a few cents if you make purchases.
3 thoughts on “Career – The network rockstar and the checklist”
Thanks, I’m really looking forward to them. I read that article on CNN or Time a few weeks ago about how surgeons making checklists significantly increased the health of patients and had similar thoughts. Doctors can’t be ‘too good’ for checklists if it saves lives. I’m glad to see that you’ve translated that to an NE’s role.
I’ve screwed myself on my first site survey as an NE, walking into what I thought was a simple project and then forgetting to order a $5 MT-RJ-to-SC coupler.
I’ve built my own checklists for various project types, in my head, over the years and so I can’t wait to see what you put on paper (and happy that you’re sharing as well).
Also on your comment of investigating failures. I like to see those results put into a database for RCA in the future. I thought it was a waste of time to try to do years ago but I’ve found since moving to the data center that it’s one of my most valuable tools. I’m surprised how easily I forget what happened to some program I rarely touch a few months ago much less a year ago!
Hey Will,
The root cause analysis database is a good idea. I wonder sometimes though how you could make it interesting enough for people to read it. I’ve played with a few ideas before on anti-patterns. For example, ‘100 ways to break the network’. At first I thought that would be ridiculous but then I realised just how many ways there are to cause a routing or bridging loop.
I have a clunky Excel file for tech tips, but mostly I’m guilty of not putting my process learnings down on paper. When I get a chance to push out the first checklist I’ll probably add it as a dedicated page or a PDF download. I would love to get feedback from you and your experiences at that stage, which would help me improve things considerably.
Thanks as always for sharing your comments and insights.
/John H