sysadvent: Day 1 - Why You Need a Postmortem Process

Written by: Gabe Abinante (@gabinante)
Edited by: Shaun Mouton (@sdmouton)

Why Postmortems?

Failure is inevitable. As engineers building and maintaining complex systems, we likely encounter failure in some form on a daily basis. Not every failure requires a postmortem, but if a failure impacts the bottom line of the business, it becomes important to follow a postmortem process. I say “follow a postmortem process” instead of “do a postmortem” because a postmortem should have very specific goals designed to prevent future failures in your environment. Simply asking the five whys to try to determine the root cause is not enough.

A postmortem is intended to fill in the knowledge gaps that inevitably exist after an outage:

Who was involved / Who should have been involved?
Was/is communication good between those parties?
How exactly did the incident happen, according to the people who were closest to it?
What went well / what did we do right?
What could have gone better?
What action items can we take from this postmortem to prevent future occurrence?
What else did we learn?
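
One way to keep these questions from being skipped is to capture them in a structured record. The following is a minimal sketch in Python, not a prescribed format: the `Postmortem` class, its field names, and the sample incident are all hypothetical and exist only to illustrate the idea of checking a draft for unanswered questions.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Postmortem:
    """One incident write-up; each field maps to a question in the list above."""
    incident: str
    people_involved: List[str] = field(default_factory=list)
    people_missing: List[str] = field(default_factory=list)   # who should have been involved
    communication_notes: str = ""
    timeline: List[str] = field(default_factory=list)         # as told by those closest to it
    went_well: List[str] = field(default_factory=list)
    could_improve: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)
    other_learnings: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft with only some of the questions answered.
draft = Postmortem(
    incident="Checkout API outage",
    people_involved=["on-call engineer", "database team"],
    timeline=["14:02 alert fired", "14:40 rollback completed"],
    went_well=["Rollback procedure worked on the first attempt"],
)
print(draft.unanswered())
```

An empty field is not always a gap (sometimes nobody was missing from the response), so a list like this is best treated as a prompt for the facilitator rather than a hard gate.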

Without a systematic examination of failure, observers can resort to baseless speculation.

Without an analysis of what went right as well as what went wrong, the process can be viewed as a complete failure.

Without providing key learnings and developing action items, observers are left to imagine that the problem will almost certainly happen again.

A Case Study: The 2012 Knight Capital SMARS Error

Knight Capital was a financial services firm engaged in high-frequency trading on the New York Stock Exchange and NASDAQ. It posted revenue of $1.404 billion in 2011, but went out of business by the end of 2012.

On August 1, 2012, Knight Capital deployed untested software, which contained an obsolete function, to a production environment. The incident happened because an engineer deployed new code to only seven of the eight servers responsible for Knight’s automated routing system for equity orders. The code repurposed a flag that was formerly used to activate an old function known as “Power Peg”, which was designed to move stock prices higher and lower in order to verify the behavior of trading algorithms in a controlled environment. All orders sent with the repurposed flag to the server that had not received the new code triggered the obsolete code still present on that server. As a result, Knight’s trading activities caused a major disruption in the prices of 148 companies listed on the New York Stock Exchange, and the prices of certain stocks jumped by as much as 1200%. For the incoming parent orders that were processed by the defective code, Knight Capital sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes (1). Knight Capital took a pre-tax loss of $440 million. Despite a bailout the day after, this precipitated the collapse of Knight Capital’s stock, which lost 75% of its equity value.
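
The partial deployment described above is the kind of condition a simple pre-activation check can surface. The sketch below is purely illustrative and is not a description of Knight’s tooling: the host names, version strings, and both function names are assumptions. It compares the version each server reports against the expected one and refuses to enable a feature flag while any server is out of step.

```python
from typing import Dict, List


def find_stale_hosts(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


def safe_to_enable(deployed: Dict[str, str], expected: str) -> bool:
    """Allow enabling the flag only if every host runs the expected build."""
    return not find_stale_hosts(deployed, expected)


# Hypothetical inventory: eight routing servers, one of which missed the deploy.
fleet = {f"router-{n}": "2.0" for n in range(1, 8)}
fleet["router-8"] = "1.9"

print(find_stale_hosts(fleet, "2.0"))  # ['router-8']
print(safe_to_enable(fleet, "2.0"))    # False
```

A check like this only helps if the reported versions are trustworthy and the check actually runs before the flag is used; whether anything comparable existed at Knight is exactly the sort of question a public postmortem could have answered.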

I chose to write about this incident because there is an incredible body of writing about it, yet remarkably little information or substance beyond the SEC release. The volume of material is partly due to the incident’s impact - few companies have a technical glitch that puts them out of business so quickly. I believe there’s more to it, however - this type of response is an attempt by the community to make sense of the incident, because the company itself never released a public postmortem. This is an incredibly interesting case because a production bug and operational failure actually precipitated the collapse of a seemingly successful business - but the lack of a public postmortem exposed the company to all kinds of baseless speculation about lackadaisical attitudes towards change controls, testing, and production changes (see various citations, especially 11, 12). It would also seem that there was no internal postmortem, or that it was not well circulated, based upon the Knight Capital CEO’s comments to the press (2).

As John Allspaw notes in his blog (3), one of the worst consequences of Knight’s reticence was news companies and bloggers using the SEC investigation as a substitute for a postmortem. This was harmful to the business and particularly to the engineers involved in the incident. The SEC document is blamey. It’s supposed to be blamey. It details the incident timeline and outlines procedures that should have been in place to prevent an error - and in doing so it focuses entirely on what was lacking, as seen from an outside perspective. What it doesn’t do is accurately explain how the event came to be. What processes WERE in place that the engineers relied upon? What change controls WERE being used? What went right, and what will be done to ensure this doesn’t happen in the future?

Did Knight Capital go out of business because they lost a bunch of money in a catastrophic way? Sure. But their core business was still a profitable model - it’s conceivable that they could have received a bailout, continued operations, and gotten out of the hole created by this massive failure. Unfortunately, they failed to demonstrate to their investors and to the public that they were capable of doing so. By failing to release a public document, they allowed the narrative to be controlled by news sites and bloggers.

Taking a look at IaaS provider outages

Infrastructure providers are in a unique position where they have to release postmortems to all of their customers for every outage, because all of their customers’ business systems rely upon IaaS uptime.

AWS experienced an outage that spanned April 21st-April 24th, 2011 and brought down the web infrastructure of several large companies such as Quora and Hootsuite. The incident began when someone improperly executed a network change and shunted a bunch of traffic to the wrong place, which cut a ton of nodes off from each other. Because so many nodes were affected at one time, all of them trying to re-establish replication and hunt for free nodes caused the entire EBS cluster to run out of free space. This generated a cascading failure scenario that required a massive amount of storage capacity in order to untangle. Recovery took quite a while because capacity had to be physically added to the cluster. The postmortem was published via Amazon’s blog on April 29th, 2011. This incident is notable because it was somewhat widespread (it affected multiple availability zones) and resolution took longer than 24 hours - making it one of the largest outages that AWS has experienced. AWS has a response pattern that is characterized by communication throughout: updates to the status page during the incident, followed by a detailed postmortem afterwards (4). Amazon’s postmortem structure seems to be consistent across multiple events. Many seem to use roughly this outline:

Statement of Purpose
Infrastructure overview of affected systems
Detailed recap of incident by service
Explanation of recovery steps & key learnings
Wrap-up and conclusion

From this we can learn two things. First, we know that Amazon has a postmortem process: they are pursuing specific goals around analyzing the failure of their service. Second, we know what they want to communicate. Primarily, they want to explain why the failure occurred and why it will not happen again in the future. They also provide an avenue for disgruntled stakeholders to reach out, receive compensation, get additional explanation, etc.

Azure experienced a similar storage failure in 2014 and we see a similar response from them - immediate communication via status pages, followed by a postmortem after the incident (5).

Taking a look at how the media approaches these failure events, it’s worth noting that the articles written about the outages include links to the postmortem itself, as well as status pages and social media (6,7). Because the companies are communicative and provide documentation about the problem, the journalist can disseminate that information in their article - thus allowing the company that experienced the failure to control the narrative. Because so much information is supplied, there’s very little speculation about what went right or wrong on the part of individuals or journalists, despite the outage events impacting a huge number of individuals and companies that used the services themselves or software which relied upon them.

Conclusion

So, while postmortems are often considered a useful tool only from an engineering perspective, they are critical to all parts of a business for four reasons:

Resolving existing issues causing failures
Preventing future failures
Controlling public perception of the incident
Helping internal business units and stakeholders to understand the failure

Having a process is equally critical, and that process needs to be informed by the needs of the business both internally and externally. A process helps ensure that the right questions get asked and the right actions are taken to understand and mitigate failure. With the typical software development lifecycle increasing in speed, availability is becoming more of a moving target than ever. Postmortem processes help us zero in on threats to our availability and efficiently attack problems at the source.

About the Author

My name is Gabe Abinante and I am an SRE at ClearSlide. You can reach me at gabe@abinante.com or @gabinante on Twitter. If you are interested in developing or refining a postmortem process, check out the Operations Incident Board on GitHub: https://github.com/Operations-Incident-Board