The unfortunate reality about running a web service is that every now and again, you’re going to have downtime. Even the best web companies have the occasional blip in service. If downtime is inevitable, then it’s best to plan ahead so that you can be ready. After all, prior preparation prevents poor performance.
Poorly handled downtime will cause your customers a lot of pain, which will in turn affect your bottom line. Some of your customers may outright switch to a competitor. You’ll lose future customers due to lack of trust. You’ll get fewer word-of-mouth referrals because your customers will just like you less.
Luckily, unplanned downtime doesn’t have to turn into a customer service nightmare. It turns out that if you just keep your customers in the loop by communicating what’s happening and what you’re doing to fix the problem, they’ll understand and have a much less negative reaction to the whole situation.
Preparing for an incident
-
-
Define what constitutes an incident
-
The first thing you need to decide is what has to happen for you to notify your customers. Does being down for 1 minute warrant telling your customers? What about three? Five? It depends on how critical the service you offer is to your customers. In general, I suggest telling customers if you’re down for at least 3-5 minutes.
Another thing to consider is the fact that being “down” has shades of grey. Sometimes your whole site won’t be down, just one part or feature (e.g. report generation or email sending). If it’s a critical part of your site, it may make sense to tell your customers as soon as that feature breaks; if it’s a minor feature, it may make sense to tell them only once it has been down for a longer time.
It gets worse. Sometimes parts of your site aren’t down…they’re just running slow. You’ll want to take into consideration what part of your site is running slow, how slow things are, how long they’ve been slow. This is hard to plan for and is something you’ll have to make a judgment call on.
At the end of the day, you’ll have to decide for yourself when it makes sense to communicate an outage.
There are two times when you should 100% be telling your customers about an issue: security issues and data loss.
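To make those decisions ahead of time, it can help to write the rules down somewhere your team can't misread them. Here is a minimal sketch in Python, assuming you track the kind of incident and how long it has been going on. The `Kind` names and the 30-minute threshold for minor features are made up for illustration; the 3-minute figure is the low end of the 3-5 minute suggestion above. Slowness is deliberately left out, since that one stays a judgment call.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical incident categories; rename these to fit your product."""
    FULL_OUTAGE = "full_outage"
    CRITICAL_FEATURE_DOWN = "critical_feature_down"
    MINOR_FEATURE_DOWN = "minor_feature_down"
    SECURITY = "security"
    DATA_LOSS = "data_loss"


# Minutes of downtime before customers are told.
# 3 is the low end of the 3-5 minute suggestion; 30 is an assumed
# placeholder for "a longer time" and should be tuned per product.
NOTIFY_AFTER_MINUTES = {
    Kind.FULL_OUTAGE: 3,
    Kind.CRITICAL_FEATURE_DOWN: 0,
    Kind.MINOR_FEATURE_DOWN: 30,
}


def should_notify(kind: Kind, minutes_elapsed: float) -> bool:
    """Return True once customers should be told about the incident."""
    # Security issues and data loss are always communicated.
    if kind in (Kind.SECURITY, Kind.DATA_LOSS):
        return True
    return minutes_elapsed >= NOTIFY_AFTER_MINUTES[kind]
```

With these assumed numbers, `should_notify(Kind.FULL_OUTAGE, 1)` is false, while `should_notify(Kind.SECURITY, 0)` is true from the first minute.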
-
-
Identify which communication channels will be used
-
The next step is to decide which channels you are going to use to communicate through during downtime. Will you have a dedicated status page? Should you just post it to your blog? Maybe just send updates via Twitter, Facebook, email, or SMS. Maybe you should have a widget so that you can communicate these issues right in your webapp. Decide which one or ones make sense to you.
If you do choose multiple channels, I recommend that you identify one as your primary communication vehicle and funnel everyone there from the other channels. For example, we have a dedicated status page but we also tweet out updates and display a notice in our webapp during downtime. The tweets and in-webapp notices funnel users back to the status page for the full story.
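The funnel idea can be sketched in a few lines of Python. This is only an illustration, not how any particular tool works: the status page URL and the channel names are hypothetical, and actually posting to each channel is left out. The point is that the primary channel carries the full story while every other channel gets a short notice plus a link back.

```python
# Hypothetical address of the primary channel (a dedicated status page).
STATUS_PAGE_URL = "https://status.example.com"

# Hypothetical secondary channels that only point back to the status page.
SECONDARY_CHANNELS = ("twitter", "in_app_notice")


def fan_out(summary: str, details: str) -> dict[str, str]:
    """Build one post per channel; only the status page gets the details."""
    posts = {"status_page": f"{summary}\n\n{details}"}
    for channel in SECONDARY_CHANNELS:
        posts[channel] = f"{summary} Full story: {STATUS_PAGE_URL}"
    return posts
```

Keeping the details in one place also means there is only one post to correct if the situation changes.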
-
-
Who owns the downtime communication process
-
It helps to make sure everyone in the organization knows who is generally in charge of updating the status page. If your product caters to a very technical customer base, it might make sense to have the devops team update the status page. If your product caters to a non-technical customer base, it may make more sense to have your customer service team be in charge of it. Think about your specific case and make sure both teams are on the same page about what they can expect from each other.
-
-
Create templates for common scenarios
-
One great resource to have is a few sets of template language that can be used to quickly send an update. Can you imagine being the engineer who gets woken up at 3:00AM because the site is down for some reason? The last thing you’d want to be doing is thinking about how specific you should be about what’s happening, the type of language you use, or how technical you can get. One way to avoid this situation is to make those decisions ahead of time.
You can create templates for common issues that everyone has and even ones specific to your company around features that are more likely to break or become degraded in some way.
Here are a few of the incident templates that we have:
The site is currently experiencing a higher than normal amount of load, and may be causing pages to be slow or unresponsive. We’re investigating the cause and will provide an update as soon as possible.
Our storage provider for public metrics data is currently experiencing infrastructure issues. Updates will be made available as the situation develops or information is provided to us.
A recent deploy was found to contain errors or significant performance degradations. Our infrastructure has been successfully rolled back to a previous version of the code, and traffic is being served as normal again.
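If you want templates like these to be one lookup away at 3:00AM, you could keep them in code or config. Below is a minimal sketch using Python's standard `string.Template`. The scenario keys and the `draft_update` helper are hypothetical names, and the `$provider` placeholder is my own adaptation of the second template so it can be reused for other upstream providers.

```python
from string import Template

# Scenario keys are hypothetical; the wording follows the templates above.
TEMPLATES = {
    "high_load": Template(
        "The site is currently experiencing a higher than normal amount of "
        "load, and may be causing pages to be slow or unresponsive. We're "
        "investigating the cause and will provide an update as soon as possible."
    ),
    # "$provider" is an added placeholder, not part of the original template.
    "upstream_provider": Template(
        "Our $provider is currently experiencing infrastructure issues. "
        "Updates will be made available as the situation develops or "
        "information is provided to us."
    ),
    "bad_deploy": Template(
        "A recent deploy was found to contain errors or significant "
        "performance degradations. Our infrastructure has been successfully "
        "rolled back to a previous version of the code, and traffic is being "
        "served as normal again."
    ),
}


def draft_update(scenario: str, **fields: str) -> str:
    """Fill in a template; raises KeyError if a placeholder is left unfilled."""
    return TEMPLATES[scenario].substitute(**fields)
```

For example, `draft_update("upstream_provider", provider="storage provider for public metrics data")` reproduces the second template above. Failing loudly on a missing field is a deliberate choice here: a half-filled template reaching customers is worse than an error the engineer sees first.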
When incidents happen
-
-
Initial response
-
Over the course of an incident, you might send out a few updates to keep your customers up-to-date with how you’re handling the problem. The initial update is the single most important one. The speed of first response is critical because it largely determines how your “response effort” will be graded in the public eye.
Your goal here should be to immediately acknowledge the issue in all the communication channels you’ve chosen and to funnel customers from there into your single primary channel. This is a situation where the sets of template communication really shine. Your initial response doesn’t necessarily have to be super specific about what’s happening or when the problem will be fixed, but it should be communicated as early as possible.
When a user tries to access your site and fails (request times out, DNS issue makes request hang, etc.), the first thing they’re going to do is probably check your status page or your Twitter account. If you are truly down and haven’t acknowledged the issue before they check your status page, it is a very frustrating experience for that user. They may feel like you’re lying or just don’t care about them. This is why it is so important that you acknowledge the issue as soon as possible.
One other interesting point here: if you’re a big (millions of users) company and the problem is severe (security breach), it can be helpful to contact tech blogs early so that you can “control the story”.
-
-
Ongoing communication
-
If the issue is not going to be resolved within ~30 minutes, your next step is to designate someone as the dedicated communicator. It’s this person’s job to be the go-between for the devops team and support team to make sure everyone in your organization is aware of what’s happening, how severe the problem is, what the expected time-to-fix is, and other things of this nature.
They will also be in charge of continuing to update the status page or post updates to other channels as the situation evolves. If you have to say “We’re still working on the problem, nothing to report.” that’s still better than saying nothing at all.
You should also have a schedule for ongoing updates. For issues that are still affecting your customers’ ability to use your product, you should never go more than 1 hour without sending an update. You should also always say when the next update will be if the issue is not resolved by then.
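The update schedule is easy to let slip in the middle of an incident, so here is a rough Python sketch of the one-hour rule. It assumes you record when the last update went out as a timezone-aware `datetime`; the function names and the exact wording of the "next update" sentence are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Never go more than 1 hour without an update while customers are affected.
MAX_GAP = timedelta(hours=1)


def next_update_due(last_update: datetime, gap: timedelta = MAX_GAP) -> datetime:
    """Return the deadline for the next update, capped at one hour out."""
    if gap > MAX_GAP:
        raise ValueError("updates must be at most 1 hour apart")
    return last_update + gap


def with_next_update(message: str, last_update: datetime,
                     gap: timedelta = MAX_GAP) -> str:
    """Append a promise of when the next update will arrive (shown in UTC)."""
    due = next_update_due(last_update, gap).astimezone(timezone.utc)
    return (f"{message} Next update by {due:%H:%M} UTC "
            "if the issue is not resolved by then.")


def is_overdue(last_update: datetime, now: datetime) -> bool:
    """True when more than an hour has passed since the last update."""
    return now - last_update > MAX_GAP
```

A check like `is_overdue` could feed a reminder for the dedicated communicator, and it works just as well when the update is only "still working on the problem, nothing to report."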
-
-
What to do after the incident
-
If an incident is serious enough (security breach, loss of data, long downtime), you should write a postmortem. A good postmortem can actually generate a lot of goodwill with your customers.
At the end of the day, a postmortem needs to do 3 things: apologize personally, show that you’re capable of doing your job by explaining what happened and how you fixed it, and lay out your plan to avoid the situation in the future.
-
-
Apologizing
-
Step one is apologizing and meaning it. This means it should be the first paragraph of your postmortem, not an afterthought for the very end. Don’t use language like “we’re sorry for the inconvenience” because it sounds like some bullshit dime-a-dozen apology from the corporate overlords. Say something like “I’m sorry we let you down. The whole team is working very hard to prevent this from happening again in the future.” An insincere apology is worse than no apology at all…it’s like rubbing salt in the wound. Your apology should actually sound like how one human would apologize to another.
-
-
Demonstrating understanding
-
Believe it or not, your customers are rooting for you and want to continue trusting you. In order to do that, you need to be able to demonstrate you know exactly what happened and how you’re going to prevent this in the future. If you can’t do that, how can you expect your customers to trust you in the future?
The amount of technical detail required here will depend on your audience. If you’re a consumer app, you should probably give a high-level overview here. If your customer base is very technical, they might appreciate you telling them exactly what happened, what you tried, what did and didn’t work, and all the other nitty-gritty details.
If the “reason” you were down is due to some upstream provider that your infrastructure is built on, this is NOT an opportunity to flame them. At the end of the day, it was your choice to use their service and to build your stack in such a way that you had this single point of failure. Owning up to your mistake does not include blaming other people for your problems.
-
-
Plan to avoid it in the future
-
This part is pretty straightforward. Your customers want to know what it is you’re going to do to ensure that the same problem doesn’t happen in the future. This is an opportunity for you to detail the fixes or changes you are going to make and your timeline for doing that.
A secondary benefit to putting this in writing is it helps to keep you accountable for actually fixing/changing the things you said that you would. It forces you to harden your systems because you’ve publicly committed to doing it.
Bonus: a downtime secret weapon
We like to think Statuspage is an important piece of any team’s toolbox during downtime.
What does Statuspage do? We make it easy to keep customers informed when things go wrong. Teams have a huge responsibility to keep customers in the loop during things like system downtime. We make that a little easier.
We could tell you more, but we’d rather you see for yourself. Our trial is free and unlimited, so there’s no harm in setting up a page and seeing if it works for your team.