Postmortem on the read-only outage of Basecamp on November 8th, 2018


Last Thursday, November 8th, Basecamp 3 was in read-only mode for almost five hours, starting at 7:21am CST and ending at 12:11pm CST. That meant users could access existing messages, todo lists, and files, but no new information could be entered, and no existing information could be altered. Everything was frozen in place.

The root cause was that the ID column on our very busy events table hit its ceiling of 2,147,483,647. Almost every single activity in Basecamp is tracked in this table. When you post a message, update a todo list, or applaud a comment, we track that activity in the events table. So when we became unable to write new events to that table, every attempt to do practically anything in Basecamp was halted.

This was an avoidable problem. We were actively working on expanding the capacity of the events table in the days prior to this outage, but we failed to properly account for how quickly we were running out of headroom.

To compound matters, we should have been aware of the general issue much sooner. The programming framework we use, Ruby on Rails (which was originally extracted from Basecamp!), moved to a new default for database tables in version 5.1, which was released in 2017. That change lifted the headroom for records from 2,147,483,647 to 9,223,372,036,854,775,807 on all tables, which ended up being the same root-cause fix that we applied to our tables.
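To make that concrete, here's an illustrative sketch of what that default change means in a Rails migration. The events table name is real, but the code below is just an example, not our actual schema:

```ruby
# Before Rails 5.1, this is effectively what a new table gave you: a primary
# key stored as a 4-byte signed integer, which tops out at 2,147,483,647.
create_table :events, id: :integer do |t|
  t.timestamps
end

# From Rails 5.1 onward, the same create_table call defaults to an 8-byte
# bigint primary key, with headroom up to 9,223,372,036,854,775,807.
# Tables created under the old default keep their integer columns until you
# explicitly migrate them -- which is exactly what caught us.
create_table :events do |t|
  t.timestamps
end
```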

It’s bad enough that we had the worst outage at Basecamp in probably 10 years, but to know that it was avoidable is hard to swallow. And I cannot express my apologies clearly or deeply enough.

We pride ourselves at Basecamp on being “boring software” because it just works and it’s always available. Since Basecamp 3 was launched, and up until this outage, we’ve had an uptime record of 99.998%. This near five-hour outage has taken that impressive statistic down to a more humbling 99.978%.

Some companies might choose to weasel around an outage like ours by claiming that it was only a “partial outage”, because the application remained available in read-only mode for the majority of this time. But that’s not what we’re going to do at Basecamp. We’re going to take the scar in our uptime record as a reminder to do better.

Because we owe it to everyone using Basecamp to do better. It’s embarrassing and humbling to have suffered the biggest outage at Basecamp in a decade from an issue that we should have addressed years ago, and that we were actively working on addressing, but failed to complete in time.

As the CTO of Basecamp and the creator of Ruby on Rails, I accept full responsibility for our failures. I should have been more vigilant with our own database schema when Rails 5.1 announced the new default, and I should have followed up and asked the right questions when we finally did start work on remediation. I’m really sorry to have failed you 😢

If you have any questions, or if we can help in any way, please reach out to our wonderful support crew who’ve been dealing with each report individually.

I also want to express my deep gratitude to everyone who’s been so gracious with their kind words of encouragement and support during and after this ordeal. I don’t know if we’ve earned such understanding, given our clear culpability, but we are extremely grateful nonetheless.

Note: If you weren’t using Basecamp at the time, you can see how we kept everyone in the loop using our status.basecamp.com updates and a play-by-play record on our blog. We can’t promise to be perfect, but we promise always to keep you informed in a timely and completely transparent manner.


On a personal note, I want to apologize for not posting this postmortem until today. The plan was to have this final summary ready on Friday, but then the Woolsey fire hit, and our family was forced to evacuate our home in Malibu. It’s been a crazy week 😬

Update on Basecamp 3 being stuck in read-only as of Nov 8, 9:22am CST

Basecamp 3 is now back online for reading and writing. All data was confirmed to be fully safe and intact. No emails that were sent to Basecamp during the outage were dropped. We may still have some backlog in processing things like incoming emails, and you may still see some slowdowns here and there as we catch up. But we are back, and we are safe.

We will be following up with a detailed and complete postmortem soon. All in, we were stuck in read-only mode for almost five hours. That’s the most catastrophic failure we’ve had at Basecamp in maybe as much as a decade, and we could not be more sorry. We know that Basecamp customers depend on being able to get to their data and carry on their work, and today we failed you on that.

We’ve let you down on an avoidable issue that we should have been on top of. We will work hard to regain your trust, and to get back to our normal, boring schedule of 99.998% uptime.

Note: If you were in the middle of posting something new to Basecamp, and you got an error, that data is most likely saved in our browser-based autosave system. If it doesn’t appear automatically, we can help you recover that data. Please contact support if you’re in this situation, and we’ll have a team ready to assist.

Below is the timeline for today:

At 7:21am CST, we first got alerted that we had run out of ID numbers on an important tracking table in the database. This was because the column in the database was configured as an integer rather than a big integer. The integer runs out of numbers at 2,147,483,647. The big integer can grow up to 9,223,372,036,854,775,807.
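Those two ceilings aren’t arbitrary: they’re the largest values a signed 4-byte integer and a signed 8-byte integer can hold. A quick sanity check in a Ruby console shows where they come from:

```ruby
# Maximum value of a signed 4-byte (32-bit) integer column:
2**31 - 1   # => 2147483647

# Maximum value of a signed 8-byte (64-bit) big integer column:
2**63 - 1   # => 9223372036854775807
```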

At 7:29am CST, the team diagnosed the problem and started working on the fix. This meant writing what’s called a database migration where you change the column type from the regular integer to the big integer type. Changing a production database is serious business, so we had to test this fix on a staging database to make sure it was safe.
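For the curious, here’s a rough sketch of what such a migration can look like. This is illustrative, not our actual migration: it assumes a MySQL-backed events table, and the auto_increment option is MySQL-specific.

```ruby
class ChangeEventsIdToBigint < ActiveRecord::Migration[5.1]
  def up
    # Widen the primary key from a 4-byte integer to an 8-byte bigint.
    # On a table this size, the ALTER TABLE behind this one line is what
    # takes hours to run.
    change_column :events, :id, :bigint, null: false, auto_increment: true
  end

  def down
    # Only safe while every existing id still fits in 4 bytes.
    change_column :events, :id, :integer, null: false, auto_increment: true
  end
end
```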

At 7:52am CST, we had verified that the fix was correct and tested it on a staging database, so we commenced making the change to the production database table. That table in the database is very large, of course. That’s why it ran out of regular integers. So the migration was estimated to take about one hour and forty minutes.

At 10:56am CST, we completed the upgrade to the databases. This was the largest part of the fix we needed to address the problem. But we still have to verify all the data, update our configurations, and ensure that we won’t have more problems when we go back online. We’re working on this as fast as we can.

At 11:33am CST, we’re still verifying that all data is as it should be for Basecamp 3. The database migration has finished, but the verification process is still ongoing. We’re working as fast as we can and hope to be fully back shortly.

At 11:52am CST, verification of the databases is taking longer than expected. We have four databases per datacenter and two datacenters, so eight databases in total. We need to be absolutely certain that all the data is in proper sync before we can go back online. It’s looking good, but 99% sure isn’t good enough. We need 100%.

At 12:22pm CST, Basecamp came back online after we successfully verified that all data was 100% intact.

At 12:33pm CST, Basecamp had another issue dealing with the intense load of the application being back online. This caused a caching server to get overwhelmed. So Basecamp is down again while we get this sorted.

At 12:41pm CST, Basecamp came back online after we switched over to our backup caching servers. Everything is working as of this moment, but we’re obviously not entirely out of the woods yet. We remain on red alert.

I will continue to update this post with more information, and we will provide a full postmortem after this has completed.


Further insight on the technical problem: It’s embarrassing to admit, but the root cause of this issue with running out of integers has been a known problem in our technical community. We use the development framework Rails (which we created!), and the default setting for that framework moved from integer to big integer two years ago.

We should have known better. We should have done our due diligence when this improvement was made to the framework two years ago. I accept full responsibility for failing to heed that warning, and by extension for causing the multi-hour outage today. I’m really, really sorry 😢