Way back in 2014, Gartner estimated the average cost of a minute of downtime at $5,600. That widely cited figure adds up to more than $300,000 an hour, and that’s just the average; in 2019, the impact on large organizations at critical “moments of truth”—like an e-commerce site on Black Friday or a streaming video service during a major sporting event—could be much, much larger.
Numbers in that range underscore the importance of responding quickly and effectively to any incidents that affect the availability or performance of your site.
What makes an “incident”?
Basically, an incident occurs any time a service is unavailable or fails to perform as defined—typically in a formal service-level agreement (SLA). Incidents can be caused by a variety of factors: network outages, application bugs, hardware failures, and—increasingly, in today’s complex and multilayered infrastructures—configuration errors.
Incident management (IM) refers to the collective processes that help detect, identify, troubleshoot, and resolve such incidents. Strongly influenced by the IT Infrastructure Library (ITIL), developed by the British government in the 1980s, IM has evolved over the years to include many frameworks and approaches. They all share a common goal, however: giving stakeholders the tools they need to get misbehaving customer-impacting systems up and running again ASAP, while also making those systems more robust and reliable.
But despite its long history, IM is still shrouded in myths and hobbled by misperceptions that prevent companies from resolving incidents as quickly and effectively as they could—and perhaps more importantly, from learning how to reduce the occurrence of incidents.
That’s why we asked incident management experts at New Relic and around the industry to identify common IM myths and mistakes, and share their insights on best practices for optimal incident management.
Myth #1: Speed is everything
Also known as the “any-fix-is-a-good-fix” myth. Rapidly resolving issues is obviously important, especially for systems that directly touch customers. But it’s not the only thing to worry about. A bad or incomplete fix, or a temporary fix, or a fix that breaks something else downstream, can be dangerous to implement in the name of speed.
“A lot of lip service is paid to the need for quality and customer satisfaction in IM, but when you look at a lot of the metrics for measuring IM success, they actually mostly focus on efficiency: how fast an issue is resolved,” says Christoph Goldenstern, vice president of innovation and service excellence at Kepner-Tregoe, a training and consulting firm specializing in IM.
Instead, businesses should focus on the effectiveness of the end result as well as the speed. “Are we ultimately giving the customer resolution in the long term?” Goldenstern asks. “Are we preventing the same thing from happening again? Those are the questions to ask.”
He adds that focusing on “lagging indicators,” or looking backwards to measure how something was done, is not terribly effective. Rather, he says, businesses should concentrate on improving behaviors that drive better and long-lasting results, and create metrics around those.
One metric that Kepner-Tregoe encourages clients to use is the time it takes to get to a good statement of the problem at hand. “We know from our research that the quality of the problem statement is a direct driver of lower resolution time and higher customer satisfaction,” Goldenstern says. “Training your people to create clear, concise, and precise problem statements as quickly as possible will serve you better than simply putting a fix into place.”
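To make that metric concrete, here is a minimal sketch of how a team might track time-to-problem-statement from its incident records. The field names and timestamps are hypothetical, purely for illustration:

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records: when the incident was detected, and when a
# clear, agreed problem statement was written. Field names are illustrative.
incidents = [
    {"detected": datetime(2019, 3, 1, 9, 0),  "problem_stated": datetime(2019, 3, 1, 9, 25)},
    {"detected": datetime(2019, 3, 4, 14, 10), "problem_stated": datetime(2019, 3, 4, 14, 18)},
    {"detected": datetime(2019, 3, 9, 2, 45),  "problem_stated": datetime(2019, 3, 9, 3, 40)},
]

def time_to_problem_statement(incident):
    """Minutes from detection to an agreed problem statement."""
    delta = incident["problem_stated"] - incident["detected"]
    return delta.total_seconds() / 60

times = [time_to_problem_statement(i) for i in incidents]
print(f"median time to problem statement: {median(times):.0f} minutes")
```

Tracking the median (rather than the mean) keeps one drawn-out incident from skewing the trend you are trying to improve.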
Don’t miss: How to Run an Adversarial Game Day
Myth #2: Once you’ve put out the fire you’re done
This myth is, happily, slowly being eradicated. These days, it’s fairly standard to have some kind of postmortem or internal retrospective after resolving an incident. The point is to proactively learn from the incident to make your systems more robust and stable, and to avoid similar incidents in the future. The relevant phrase here is, “proactively learn.”
“It’s really important to incentivize measures for prevention as opposed to just resolving incidents in reactive mode,” says Adam Serediuk, director of operations for xMatters, a maker of DevOps incident-management tools. Unless you dictate that your incident lifecycle doesn’t end until the postmortem is completed and its findings are accepted or rejected, “you’re effectively saying, ‘we’re not really interested in preventing future incidents,’” says Serediuk. There’s a difference, he adds, between reacting and responding. You could react to an incident, for example, by throwing some of your rock stars at it, and fixing it right away. “But that process can’t be easily repeated,” he notes, “and it can’t scale.”
It’s important to think of IM as an end-to-end process in which the response is measured, iterative, repeatable, and scalable, agrees Branimir Valentic, a Croatian ITIL and ISO 20000 specialist at Advisera.com, an international ITSM consultancy. “The point of IM is not just to resolve, but to go much deeper, and to learn,” he says.
One risk is that over time the postmortem can turn into a rote exercise—just a box to be checked by jaded engineers. “There’s this simplistic model where the postmortem just becomes busy work,” warns Beth Long, a senior software engineer and technical product manager at New Relic. Learning from incidents is incredibly valuable but also challenging, she says, “and requires you to constantly be tuning and adapting to figure out how to learn effectively.”
Don’t miss: How and Why to Hold “Blameless Retrospectives”
Myth #3: Report only major issues that customers complain about, to avoid making IT look bad
Another prevalent myth holds that you shouldn’t be overly communicative about your incidents. If you report every incident, the reasoning goes, IT can look as though it’s failing. It’s better to keep your head down, and acknowledge and communicate only the serious incidents that customers have noticed and reported.
That’s the theory, anyway, but it’s a really bad idea. Customers—and internal stakeholders—want to feel that you’re being honest and transparent, and that they can trust you to detect and acknowledge incidents that could impact them. Hiding incidents—even minor ones—can destroy that trust.
You shouldn’t view it as a black mark against your IT organization when things break, says Long. “You’re running complex systems,” she says. “Of course things are going to break. Having incidents is just part of the game. The key is what you do about them.”
“One of the things I like at New Relic,” Long adds, “is that we’re proactive about communicating, both internally and to customers, which counteracts that myth of, ‘Oh no, you can’t tell anyone unless it’s a huge deal.’ A lot of companies are paranoid about sharing any information unless they’re basically forced to, but that’s a mistake. Be transparent.”
Myth #4: Only customer-impacting incidents matter
A related myth is that only incidents that impact external customers are relevant. In fact, some organizations even define incidents solely as “customer-impacting disruptions.” But believing that myth will reduce your overall IM effectiveness. Again, the idea is that IM should be a learning experience—and that you should take proactive actions based on that learning.
“There’s a lot to learn from internal misses and internal-only incidents. They might even be some of your best learning experiences because it’s a chance to hone your response process and learn without pressure,” says xMatters’ Serediuk. “It’s hard to instill true organizational change when things are on fire.”
Say your internal ticketing system goes down or your internal wiki blows up. What type of oversight or lack of control allowed that to happen? In relatively minor internal situations like these, “you can learn under less pressure and perhaps avoid production incidents later on,” says Serediuk. With lower pressure you may be able to focus a little more purposefully on why you had a particular problem, as well as how to prevent it from popping up again.
Don’t miss: Driving Operational Awareness With Incident Data
Myth #5: Systems will always alert you when they’re in pain
Operations folks tend to monitor what they believe to be important. But they’re not always right. When that happens, a system could be in trouble, and your team could be blissfully ignorant. Historically, ops teams looked at such metrics as disk utilization, CPU usage, and network throughput. “But the issue is really, is the service healthy?” says Serediuk.
This comes down to the difference between macro and micro monitoring. In micro monitoring, you’re looking at individual components such as CPU, memory, and disk. With macro monitoring, you’re looking at the bigger picture: how the system as a whole is serving its users.
“This is where service level objectives [SLOs] and service level indicators [SLIs] come into play,” says Serediuk. “You’re judging things by the user experience.” For example, if all of a sudden your web requests per second drop to zero, you know you have a problem. If you were merely doing micro monitoring, such as keeping tabs on memory utilization, you could have missed it. “By looking at the metric that mattered—whether users are engaging with my system,” he notes, “I catch something that I might not otherwise have noticed.”
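The contrast Serediuk draws can be sketched in a few lines of code. This is a simplified illustration, not any monitoring product’s API; the metric names and thresholds are assumptions chosen for the example:

```python
def micro_check(memory_used_pct: float) -> bool:
    """Micro monitoring: is an individual component within its limits?"""
    return memory_used_pct < 90.0

def macro_check(requests_per_second: float, baseline_rps: float) -> bool:
    """Macro monitoring (SLI-style): are users actually being served?
    Flag a problem if throughput collapses far below its normal baseline."""
    return requests_per_second > 0.1 * baseline_rps

# Memory looks perfectly healthy, but traffic has flatlined to zero --
# only the macro check catches the user-facing outage.
print(micro_check(memory_used_pct=62.0))                          # True: component is fine
print(macro_check(requests_per_second=0.0, baseline_rps=500.0))   # False: users aren't being served
```

The micro check passes while the macro check fails, which is exactly the scenario Serediuk describes: component-level metrics can look green while the service, from the user’s point of view, is down.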
Myth #6: You can tell how well your IM processes are working by your mean time to resolution (MTTR)
The MTTR is just what it says: the mean (average) time it takes to resolve an incident. But problems abound with depending on this metric as your barometer for IM success. For starters, not all incidents are created equal. Simple, easy-to-resolve incidents should not be judged with the same metric as more complicated ones.
“How do you compare an enterprise-wide email service going down with an application that has only a handful of users, that maybe suffers from one easily resolved incident every other month?” asks Randy Steinberg, a solutions architect with IT consulting firm Concurrency. “Incidents are so varied, it’s not a good barometer of how well you’re doing.”
Also, measuring MTTR is itself an art, not a science. For example, when does the clock start ticking? Is it when an application starts slowing down? When you get your first alert? When a customer notices? “The boundaries of complex systems are so fluid, this is a difficult metric to capture consistently,” notes New Relic’s Long. MTTR can be useful if your IM response time is so poor that you’re trying to get it down to a “sane” number, she adds. “Otherwise, it can be very misleading.”
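Long’s point about fluid boundaries is easy to demonstrate. In this sketch (with made-up timestamps for a single hypothetical incident), the “time to resolution” varies by a factor of two depending solely on where you start the clock:

```python
from datetime import datetime

# Hypothetical timeline for one incident. The resolution time changes
# dramatically depending on which event you treat as the start.
slowdown_began  = datetime(2019, 6, 1, 10, 0)
first_alert     = datetime(2019, 6, 1, 10, 20)
customer_report = datetime(2019, 6, 1, 10, 45)
resolved        = datetime(2019, 6, 1, 11, 30)

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

for label, start in [("slowdown began", slowdown_began),
                     ("first alert", first_alert),
                     ("customer report", customer_report)]:
    print(f"time to resolution from {label}: {minutes_between(start, resolved):.0f} min")
```

Unless every team measures from the same boundary, averaging these numbers into an MTTR compares apples to oranges.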
Don’t miss: Reducing MTTR the Right Way
Myth #7: We’re getting better at IM because we’re detecting issues faster and earlier
Thanks to the increased efficacy and granularity of automated monitoring and alerting tools like New Relic, businesses are getting much better at detecting incidents than was previously possible. But that doesn’t mean we’re getting better at incident management. Detecting an incident is only half the equation. Resolving it is the other half.
“What’s interesting is that if you look at the overall process, we’re not getting better at responding to incidents in general,” claims Vincent Geffray, senior director of product marketing at Everbridge, a critical-event management company. Why? Because all the gains that we get in the first phase of the process—detecting incidents sooner—are wasted in the second phase of the process, which involves finding the right people to resolve the issue. “It can take a few minutes to detect an issue and then an hour just to get the right people to the table to begin figuring out a solution,” he says.
The remedy? Take the time to study the steps in the incident response process, with an eye toward making them more efficient. That’s where the biggest gains have yet to be achieved.
“What happens in real life,” Geffray says, “after a tool like New Relic has identified a problem with an application, is that a ticket is created in your ticketing system, and then you have to find the right people, get them together, and give them the information they need so they can start investigating.” In most cases, it’s not going to be one person. “Studies show that most IT incidents require a minimum of five people to be resolved,” he notes. “And as you can imagine, the higher the number of mission-critical applications, the larger and more distributed the organization, the more time it takes.”
Myth #8: A “blameless culture” means no accountability for incidents
This is an important myth to dispel, given the (overwhelmingly positive) movement in the IT industry toward a blameless culture.
On the plus side, a blameless culture removes fear from the IM equation: People are much more likely to be candid and transparent when they know they are not going to be fired for making a mistake. “But that doesn’t mean no accountability,” says Long. “Just because there are no sanctions for making mistakes, doesn’t mean you shouldn’t identify which mistakes were made, and by whom, so as to learn from them.”
There’s a big difference between accountability and blame. Blame typically misunderstands the nature of complex systems, in which a particular mistake is more likely to be a triggering event that tips over the dominoes of latent failures. A blameless culture actually enables true accountability, because individuals and teams feel safe enough to be transparent about missteps so that the organization can improve the overall system.
Don’t miss: How and Why to Hold “Blameless Retrospectives”
Myth #9: You need a dedicated IM team
While some companies choose to have a separate, dedicated incident management team, others prefer to rotate people through regular IT engineering jobs. In fact, there are many reasons why you would want IM skills to be distributed throughout your IT organization.
“If you look at the DevOps approach, any engineer in the entire company can respond to any incident in any role, and that’s really powerful,” says Long, who notes that while New Relic has the New Relic Emergency Response Force (NERF) ready to step in as Incident Commanders for high-severity incidents and the “really difficult stuff,” for day-to-day incidents, responses are distributed across the whole organization.
Empowering any engineer who has the necessary information to make tough calls during an incident is crucial, explains Long. “You can’t be sitting around when things are on fire waiting for someone to get on the call. You need to empower whoever is responding to be able to make difficult decisions and know that, as long as they’ve been equipped to do that, go ahead, make the call, do your best.”
Of course, all this requires intense, in-depth, and continuous training, as well as repeatable, iterative processes. You want to have the best possible resources in place to address the biggest incidents, which requires proper organization and well-honed processes. At New Relic, Vice President of Engineering Matthew Flaming has staked out a position that every engineer who is on call should have sufficient training and enough experience to make good calls. “And if they make a call that happens to go sideways, we will have their back,” Flaming says.