Edited by: Shaun Mouton (@sdmouton)
Why Postmortems?
Failure is inevitable. As engineers building and maintaining complex systems, we likely encounter failure in some form on a daily basis. Not every failure requires a postmortem, but if a failure impacts the bottom line of the business, it becomes important to follow a postmortem process. I say “follow a postmortem process” instead of “do a postmortem” because a postmortem should have very specific goals designed to prevent future failures in your environment. Simply asking the five whys to try to determine the root cause is not enough.
A postmortem is intended to fill out the sort of knowledge gaps that inevitably exist after an outage:
- Who was involved / Who should have been involved?
- Was/is communication good between those parties?
- How exactly did the incident happen, according to the people who were closest to it?
- What went well / what did we do right?
- What could have gone better?
- What action items can we take from this postmortem to prevent future occurrence?
- What else did we learn?
Without an analysis of what went right as well as what went wrong, the process can be viewed as a complete failure.
Without providing key learnings and developing action items, observers are left to imagine that the problem will almost certainly happen again.
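One way to keep a postmortem from skipping any of these questions is to treat the write-up as a record with a field per question and check for empty fields before it is circulated. The following is a minimal sketch, not an established format: the `Postmortem` class, its field names, and the `missing_sections` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record with one field per question in the list above."""
    participants: list = field(default_factory=list)    # who was / should have been involved
    timeline: str = ""                                  # how the incident happened
    went_well: list = field(default_factory=list)       # what we did right
    could_improve: list = field(default_factory=list)   # what could have gone better
    action_items: list = field(default_factory=list)    # steps to prevent recurrence
    lessons: list = field(default_factory=list)         # what else we learned

    def missing_sections(self):
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# A draft that only records what went wrong.
draft = Postmortem(
    participants=["on-call engineer"],
    timeline="Deploy at 09:00, alerts at 09:05, rollback at 09:40.",
    could_improve=["Rollback took too long"],
)
print(draft.missing_sections())
```

A draft like this one, which documents the failure but has nothing under what went well, action items, or lessons, is exactly the kind of incomplete analysis described above.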
On August 1, 2012, Knight Capital deployed untested software which contained an obsolete function to a production environment. The incident happened because an engineer deployed new code to only seven of the eight servers responsible for Knight’s automated routing system for equity orders. The code repurposed a flag that was formerly used to activate an old function known as “Power Peg”, which was designed to move stock prices higher and lower in order to verify the behavior of trading algorithms in a controlled environment. All orders sent with the repurposed flag to the one server that had not been updated triggered the obsolete code still present on that server. As a result, Knight’s trading activities caused a major disruption in the prices of 148 companies listed on the New York Stock Exchange, with the prices of certain stocks jumping by as much as 1200%. For the incoming parent orders that were processed by the defective code, Knight Capital sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes (1). Knight Capital took a pre-tax loss of $440 million. Despite a bailout the day after, this precipitated the collapse of Knight Capital’s stock, which lost 75% of its equity value.
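The failure mode described here, where a partial deployment leaves one server running old code while a shared flag changes meaning, can be illustrated with a simple pre-activation check. This is a hedged sketch and not a description of Knight’s tooling: the host names, build identifiers, and the `find_version_drift` function are hypothetical, and it assumes each host can report which build it is running.

```python
from collections import defaultdict


def find_version_drift(reported_versions):
    """Group hosts by the build they report.

    reported_versions: dict mapping host name -> build identifier.
    Returns an empty dict when every host reports the same build,
    otherwise a dict mapping each build to the sorted hosts running it.
    """
    hosts_by_build = defaultdict(list)
    for host, build in reported_versions.items():
        hosts_by_build[build].append(host)
    if len(hosts_by_build) <= 1:
        return {}
    return {build: sorted(hosts) for build, hosts in hosts_by_build.items()}


# Hypothetical fleet: seven hosts updated, one missed.
fleet = {f"router-{i}": "build-new" for i in range(1, 8)}
fleet["router-8"] = "build-old"

drift = find_version_drift(fleet)
if drift:
    print("Refusing to enable the new flag; builds differ:", drift)
```

A check like this only catches version skew. It says nothing about whether the new code was tested or whether dead code was removed, which are separate questions a postmortem would need to address.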
I chose to write about this incident because there is an incredible body of writing about it, but remarkably little information or substance beyond the SEC release. The amount of material is certainly partially because the incident had such a high impact - few companies have a technical glitch that puts them out of business so quickly. I believe there’s more to it, however: this type of response is an attempt by the community to make sense of the incident because the company itself never released a public postmortem. This is an incredibly interesting case because a production bug and operational failure actually precipitated the collapse of a seemingly successful business - but the lack of a public postmortem exposed the company to all kinds of baseless speculation about lackadaisical attitudes towards change controls, testing, and production changes (see various citations, especially 11, 12). It would also seem that there was not an internal postmortem, or that it was not well circulated, based upon the Knight Capital CEO’s comments to the press (2).
As John Allspaw notes in his blog (3), one of the worst consequences of Knight’s reticence was news companies and bloggers using the SEC investigation as a substitute for a postmortem. This was harmful to the business and particularly to the engineers involved in the incident. The SEC document is blamey. It’s supposed to be blamey. It details the incident timeline and outlines procedures that should have been in place to prevent an error - and in doing so it focuses entirely on what was lacking, from an outside perspective. What it doesn’t do is accurately explain how the event came to be. What processes WERE in place that the engineers relied upon? What change controls WERE being used? What went right and what will be done to ensure this doesn’t happen in the future?
Did Knight Capital go out of business because they lost a bunch of money in a catastrophic way? Sure. But their core business was still a profitable model - it’s conceivable that they could have received a bailout, continued operations, and gotten out of the hole created by this massive failure. Unfortunately, they failed to demonstrate to their investors and to the public that they were capable of doing so. By failing to release a public document, they allowed the narrative to be controlled by news sites and bloggers.
AWS experienced an outage that spanned April 21st-April 24th, 2011 and brought down the web infrastructure of several large companies such as Quora and Hootsuite. The incident began when someone improperly executed a network change and shunted a bunch of traffic to the wrong place, which cut a ton of nodes off from each other. Because so many nodes were affected at one time, all of them trying to re-establish replication and hunt for free nodes caused the entire EBS cluster to run out of free space. This generated a cascading failure scenario that required a massive amount of storage capacity in order to untangle. Recovery took quite a while because capacity had to be physically added to the cluster. The postmortem was published via Amazon’s blog on April 29th, 2011. This incident is notable because it was somewhat widespread (affected multiple availability zones) and resolution took longer than 24 hours - making it one of the largest outages that AWS has experienced.
AWS has a response pattern that is characterized by communication throughout: updates to the status page during the incident, followed by a detailed postmortem afterwards (4). Amazon’s postmortem structure seems to be consistent across multiple events. Many seem to use roughly this outline:
- Statement of Purpose
- Infrastructure overview of affected systems
- Detailed recap of incident by service
- Explanation of recovery steps & key learnings
- Wrap-up and conclusion
Azure experienced a similar storage failure in 2014 and we see a similar response from them - immediate communication via status pages, followed by a postmortem after the incident (5).
Taking a look at how the media approaches these failure events, it’s worth noting that the articles written about the outages include links to the postmortem itself, as well as status pages and social media (6,7). Because the companies communicate and provide documentation about the problem, the journalist can disseminate that information in their article - thus allowing the company that experienced the failure to control the narrative. Because so much information is supplied, there’s very little speculation about what went right or wrong on the part of individuals or journalists, despite the outage events impacting a huge number of individuals and companies utilizing the services themselves or software which relied upon them.
Postmortems are critical for:
- Resolving existing issues causing failures
- Preventing future failures
- Controlling public perception of the incident
- Helping internal business units and stakeholders to understand the failure
Having a process is equally critical, and that process needs to be informed by the needs of the business both internally and externally. A process helps ensure that the right questions get asked and the right actions are taken to understand and mitigate failure. With the typical software development lifecycle increasing in speed, availability is becoming more of a moving target than ever. Postmortem processes help us zero in on threats to our availability and efficiently attack problems at the source.
email@example.com or @gabinante on twitter. If you are interested in developing or refining a postmortem process, check out the Operations Incident Board on GitHub: https://github.com/Operations-Incident-Board