Written By: Chris McDermott
Edited By: Christopher Webber (@cwebber)
I love change management. I love the confidence it gives me. I love the traceability: how it’s effectively a changelog for my environment. I love the discipline it instills in my team. If you do change management right, it allows you to move faster. But your mileage may vary.
Not everyone has had a good experience with change management. In caricature, this manifests as the Official Change Board that meets bi-monthly and requires all participants to be present for the full meeting as every proposed plan is read aloud from the long and complicated triplicate copy of the required form. Questions are asked and answered; final judgements eventually rendered. Getting anything done takes weeks or months. People have left organizations because of change management gone wrong.
I suppose we really should start at the beginning, and ask “Why do we need change management at all?” Many teams don’t do much in the way of formal change process. I’ve made plenty of my own production changes without any kind of change management. I’ve also made the occasional human error along the way, with varying degrees of embarrassment.
I challenge you to try a simple exercise. Start writing down your plan before you execute a change that might impact your production environment. It doesn’t have to be fancy – use notepad, or vim, or a pad of paper, or whatever is easiest. Don’t worry about approval or anything. Just jot down three things: step-by-step what you’re planning to do, what you’ll test when you’re done, and what you would do if something went wrong. This is all stuff you already know, presumably. So it should be easy and fast to write it down somewhere.
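If a blank page is what stops you, a few lines of Python can print the skeleton for you. This is only a sketch: the three headings come straight from the exercise above, while the function name, the layout, and the example summary are made up for illustration.

```python
from datetime import date

# The three headings come from the exercise above; the function name,
# layout, and example summary below are hypothetical choices.
SECTIONS = ("Change plan", "Test plan", "Roll-back plan")


def plan_skeleton(summary: str, day: date) -> str:
    """Return a plain-text skeleton with one empty section per heading."""
    lines = [f"Change: {summary}", f"Date: {day.isoformat()}", ""]
    for heading in SECTIONS:
        lines.append(f"{heading}:")
        lines.append("  1. ")  # first step, left blank to fill in by hand
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative summary only; replace it with your own change.
    print(plan_skeleton("Rotate TLS certificate on the load balancer", date.today()))
```

Redirect the output to a file, open it in whatever editor you like, and fill in the steps. A pad of paper works just as well.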
When I go through this exercise, I find that I routinely make small mistakes, or forget steps, or realize that I don’t know where the backups are. Most mistakes are harmless, or they’re things that I would have caught myself as soon as I tried to perform the change. But you don’t always know, and some mistakes can be devastating.
The process of writing down my change plan, test plan, and roll-back plan forces me to think through what I’m planning carefully, and in many cases I have to check a man page or a hostname, or figure out where a backup file is located. And it turns out that doing all that thinking and checking catches a lot of errors. If I talk through my change plan with someone else, well, that catches a whole bunch more. It’s amazing how much smarter two brains are, compared to just one. Sometimes, for big scary changes, I want to run the damn thing past every brain I can find. Heh, in fact, sometimes I show my plan to people I’m secretly hoping can think of a better way to do it. Having another human being review the plan and give feedback helps tremendously.
For me, those are the really critical bits. Write down the complete, detailed plan, and then make sure at least one other person reviews it. There’s other valuable stuff you can do like listing affected systems and stakeholders, and making notification and communication part of the planning process. But it’s critical to keep the process as simple, lightweight, and easy as possible. Use a tool that everyone is already using – your existing ticketing software, or a wiki, or any tool that will work. Figure out what makes sense for your environment, and your organization.
When you can figure out a process that works well, you gain some amazing benefits. There’s a record of everything that was done, and when, and by whom. If a problem manifests 6 or 12 or 72 hours after a change was made, you have the context of why the change was made, and the detailed test plan and roll-back plan right there at your fingertips. Requiring some level of review means that multiple people should always be aware of what’s happening and can help prevent knowledge silos. Calling out stakeholders and communication makes it more likely that people across your organization will be aware of relevant changes being made, and unintended consequences can be minimized. And of course you also reduce mistakes, which is benefit enough all by itself. All of these things combined allow high-functioning teams to move faster and act with more confidence.
I can give you an idea of what this might look like in practice. Here at SendGrid, we have a Kanban board in Jira (a tool that all our engineering teams were already using when we rolled out our change management process). If an engineer is planning a change that has the potential to impact production availability or customer data, they create a new issue on the Change Management Board (CMB). The template has the following fields:
- Affected hosts
- Change plan
- Test plan
- Roll-back plan
- Roll-back verification plan
All the fields are optional except the Summary (Jira’s standard issue title field), and several of them have example text giving people a sample of what’s expected. When the engineer is happy with the plan, they get at least one qualified person to review it. That might be someone on their team, or it might be a couple of people on different teams. Engineers are encouraged to use their best judgement when selecting reviewers. Once a CMB has been approved (the reviewer literally just needs to add a “LGTM” comment on the Jira issue), it is dragged to the “Approved” column, and then the engineer can move it across the board until they’re done with the change. Each time the CMB’s status in Jira changes, it automatically notifies a HipChat channel where we announce things like deploys.
For simple changes, this whole process can happen in the space of 10 or 15 minutes. More complicated ones can take a day or two, or in a few cases weeks (usually indicative of complex inter-team dependencies). The upper bound on how long it has taken is harder to calculate. We’ve had change plans that were written and sent to other teams for review, which then spawned discussions that spawned projects that grew into features or fixes, and the original change plan withered and died. Sometimes that’s the better choice.
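To show how little machinery this needs, here is a rough Python model of the flow described above. It is not our Jira configuration, just a sketch under stated assumptions: the class and method names are invented, “Draft” is a placeholder column name (only “Approved” comes from our board), and requiring the LGTM to come from someone other than the owner is my reading of “at least one qualified person”.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical model of a CMB issue; not an actual Jira schema."""
    summary: str                       # the only required field
    affected_hosts: str = ""
    change_plan: str = ""
    test_plan: str = ""
    rollback_plan: str = ""
    rollback_verification_plan: str = ""
    status: str = "Draft"              # placeholder column name (assumption)
    comments: list = field(default_factory=list)  # (author, text) pairs

    def __post_init__(self):
        if not self.summary.strip():
            raise ValueError("Summary is required")

    def comment(self, author: str, text: str) -> None:
        self.comments.append((author, text))

    def is_reviewed(self, owner: str) -> bool:
        """True if someone other than the owner left an LGTM comment."""
        return any(
            author != owner and "LGTM" in text.upper()
            for author, text in self.comments
        )

    def approve(self, owner: str) -> None:
        """Move to the Approved column once a reviewer has signed off."""
        if not self.is_reviewed(owner):
            raise RuntimeError("needs at least one LGTM from a reviewer")
        self.status = "Approved"
```

The point of the sketch is how small the rule set is: one required field, one reviewer comment, one status change. Anything heavier than that should have to justify itself.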
I don’t think we have it perfect yet; we’ll probably continue to tune it to our needs. Ours is just one possible solution among many. We’ve tried to craft a process that works for us. I encourage you to do the same.