Three Basecamp outages. One week. What happened?

Basecamp has suffered through three serious outages in the last week: on Friday, August 28, on Tuesday, September 1, and again today. It’s embarrassing, and we’re deeply sorry.

This is more than a blip or two. Basecamp has been down during the middle of your day. We know these outages have really caused issues for you and your work. We’ve put you in the position of explaining Basecamp’s reliability to your customers and clients, too.

We’ve been leaning on your goodwill and we’re all out of it.

Here’s what has happened, what we’re doing to recover from these outages, and our plan to get Basecamp reliability back on track.

What happened

Friday, August 28

  • What you saw: Basecamp 3 Campfire chat rooms and Pings stopped loading. You couldn’t chat with each other or your teams for 40 minutes, from 12:15pm to 12:55pm Central Time (17:15–17:55 UTC). Incident timeline.
  • What we saw: We have two independent, redundant network links that connect our two redundant datacenters. The fiber optic line carrying one of the network links was cut in a construction incident. No problem, right? We have a redundant link! Not today. Due to a surprise interdependency between our network providers, we lost the redundant link as well, resulting in a brief disconnect between our datacenters. That disconnect pushed our cross-datacenter Redis replication past its maximum replication buffer size, triggering a catastrophic resync loop that overloaded the primary Redis server and slowed responses to a crawl. This took Basecamp 3 Campfire chats and Pings out of commission. (For the technically curious, a sketch of the Redis settings involved follows below.)
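
For the technically curious: a Redis primary keeps a replication backlog and a per-replica output buffer, and when a replica falls outside those limits it gets disconnected and has to ask for a full resynchronization, which is expensive for the primary. If the full resync itself overflows the buffer again, you get a loop. Here’s a minimal sketch of the settings involved, using redis-py with a hypothetical host and illustrative values, not our actual production configuration:

```python
import redis

# Hypothetical connection details; the limits below are illustrative only.
r = redis.Redis(host="redis-primary.internal", port=6379)

# How much recent write history the primary keeps so a reconnecting replica
# can do a cheap partial resync instead of a full one.
print(r.config_get("repl-backlog-size"))

# Hard/soft limits on the primary's per-replica output buffer. Blow past the
# hard limit and the replica is disconnected; if the backlog can't cover the
# gap, it then needs a full resync.
print(r.config_get("client-output-buffer-limit"))

# Illustrative values: give replicas more headroom to catch up after a brief
# cross-datacenter blip instead of tipping into repeated full resyncs.
r.config_set("repl-backlog-size", str(512 * 1024 * 1024))  # 512 MB
r.config_set("client-output-buffer-limit", "replica 1073741824 536870912 120")
```

Raising those limits is one lever; the broader point is that a short cross-datacenter disconnect shouldn’t be able to snowball into a replication loop that takes chat down.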

Tuesday, September 1

  • What you saw: You couldn’t load Basecamp at all for 17 minutes, from 9:51am to 10:08am Central Time (14:51–15:08 UTC). Nothing seemed to work. When Basecamp came back online, everything seemed back to normal. Incident timeline.
  • What we saw: Same deal, with a new twist. Our network links went offline, taking down Basecamp 3 Campfire chats and Pings again. While recovering from this, one of our load balancers (a hardware device that directs Internet traffic to Basecamp servers) crashed. A standby load balancer picked up operations immediately, but that triggered a third issue: our network routers failed to automatically synchronize with the new load balancer. That required manual intervention, extending the outage. (A sketch of the kind of failover announcement involved follows below.)
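
As a general illustration of that third issue, not a description of our exact gear: a pair of load balancers typically shares a virtual IP, and when the standby takes over it announces the change with a gratuitous ARP so upstream routers re-learn which device owns the address. If that announcement is missed or ignored, routers keep sending traffic toward the failed unit until someone steps in by hand. Here’s a toy sketch of such an announcement using scapy, with made-up addresses:

```python
from scapy.all import ARP, Ether, sendp

VIP = "203.0.113.10"                   # shared virtual IP (documentation range, not ours)
NEW_ACTIVE_MAC = "02:00:00:aa:bb:cc"   # MAC of the unit taking over (made up)

# A gratuitous ARP: "this IP now lives at this MAC." Routers that hear it
# update their ARP caches instead of aging out stale entries on their own.
garp = Ether(dst="ff:ff:ff:ff:ff:ff", src=NEW_ACTIVE_MAC) / ARP(
    op=2,                  # ARP reply
    hwsrc=NEW_ACTIVE_MAC,
    psrc=VIP,
    hwdst="ff:ff:ff:ff:ff:ff",
    pdst=VIP,
)
sendp(garp, iface="eth0", count=3)     # repeat a few times for good measure
```

Failover daemons such as keepalived send announcements like this automatically on takeover; part of our follow-up is understanding why the equivalent synchronization didn’t happen on its own in our setup.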

Wednesday, September 2

  • What you saw: You couldn’t load Basecamp for 15 minutes, from 10:50am to 11:05am Central Time (15:50–16:05 UTC). When Basecamp came back online, chat messages felt slow and sluggish for hours afterward. Incident timeline.
  • What we saw: Earlier in the morning, the primary load balancer in our Virginia datacenter crashed again. Failover to its secondary load balancer proceeded as expected. Later that morning, the secondary load balancer also crashed and failed back to the former primary. This led to the same desynchronization issue from yesterday, which again required manual intervention to fix.

All told, we’ve tickled three obscure, tricky issues in a five-day span, and they led to overlapping, interrelated failure modes. These are exactly the woes we plan for. We detect and avert this sort of technical issue daily, so this week was a stark wake-up call: why didn’t we this time? We’re working to find out.

What we’re doing to recover from these outages

We’re pursuing multiple options in parallel: some to recover from these outages, and some as contingencies in case those recovery plans fall through.

  1. We’re getting to the bottom of the load balancer crash with our vendor. We have a preliminary assessment and bugfix.
  2. We’re replacing our hardware load balancers. We’ve been pushing them hard, and traffic overload was a driving factor in one of the outages.
  3. We’re rerouting our redundant cross-datacenter network paths to ensure proper circuit diversity, eliminating the surprise interdependency between our network providers.
  4. As a contingency, we’re evaluating moving from hardware to software load balancers to decrease provisioning time. When a hardware device has an issue, we’re days out from a replacement. New software can be deployed in minutes (there’s a toy illustration after this list).
  5. As a contingency, we’re evaluating decentralizing our load balancer architecture to limit the impact of any one failure.
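
To put point 4 in perspective: a software load balancer is just a program, so replacing one is a deploy rather than a hardware shipment. The toy round-robin TCP proxy below (plain Python asyncio, placeholder backend addresses) is only meant to illustrate the idea; anything we actually ran would be battle-tested software such as HAProxy or Envoy, not a sketch like this.

```python
import asyncio
import itertools

# Placeholder application servers, purely for illustration.
BACKENDS = [("10.0.0.11", 8080), ("10.0.0.12", 8080)]
backend_cycle = itertools.cycle(BACKENDS)

async def pipe(reader, writer):
    # Copy bytes in one direction until the sending side closes.
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    # Pick the next backend round-robin and splice the two connections together.
    host, port = next(backend_cycle)
    backend_reader, backend_writer = await asyncio.open_connection(host, port)
    await asyncio.gather(
        pipe(client_reader, backend_writer),
        pipe(backend_reader, client_writer),
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8000)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

The appeal isn’t that a proxy is hard to write, it’s that a software replacement can be provisioned, configured, and in service in minutes instead of waiting days for hardware.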

What we’re doing to get our reliability back on track

We engineer our systems with multiple levels of redundancy and resilience precisely to avoid disasters like these, and we practice our response to catastrophic failures within our live systems.

We didn’t catch these specific issues ahead of time. We don’t expect to catch them all! But what did catch us by surprise were the cascading failures, which exposed unexpected fragility and left us with difficult paths to recovery. Those, we can prepare for.

We’ll be assessing our systems for resilience, fragility, and risk, and we’ll review our assessment process itself. We’ll share with you what we learn and the steps we take.

We’re sorry. We’re making it right.

We’re really sorry for the repeated disruption this week. One thing after another. There’s nothing like trying to get your own work done while your computer glitches out on you or just won’t cooperate. This one’s on us. We’ll make it right.

We really appreciate all the understanding and patience you’ve shown us. We’ll do our best to earn back the credibility and goodwill you’ve extended to us as we get Basecamp back to rock-solid reliability. You should expect Basecamp to be up 24/7.

As always, you can follow live updates on Basecamp status here, catch the play-by-play on Twitter, and get in touch with our support team anytime.