Become the newest member of our Ops team

We’re looking for two new people to join our Ops team and help deliver the most reliable and performant Basecamp sites and services. You’ll work on every single piece of our infrastructure, both colocated and in the cloud. You’ll touch every single one of our applications, and you’ll frequently support other teams at Basecamp too. You’ll be joining our existing operations team: Blake, Eron, John, Matthew, Nathan (and me!). (That means joining our on-call rotation too, whee!) We’re a super close team and we have a lot of fun together. We promise to learn from you and to help you develop new personal and professional skills and mature the ones you already have.

Currently our company works from 32 different cities, spread across 6 countries. You can work from anywhere in the world, so long as your working day overlaps the beginning or end of the day in the U.S. That means 8–10am Eastern time or 3–5pm Pacific time.

About You

As an experienced Ops person, you should be very familiar with Ruby on Rails, Ruby in general, and all the normal frontend components. You’ve used the Percona Toolkit for years and know that Redis is the most stable part of the stack. If you are like us, you find writing new tooling in Go to be a pleasure, but you are also comfortable with good old Bash, Ruby, and Python. (We have a mostly hate relationship with Java.) You enjoy the excitement of putting out the occasional fire, and you have a vision for building stable, reliable, and performant systems that create less work, not more.

In broad strokes, Managers of One thrive at Basecamp. We are committed generalists, eager learners, conscientious workers, and curators of what’s essential. We’re quick to trust. We see things through. We’re kind to each other, look up to each other, and support each other. We achieve together. We are colleagues, here to do our best work. We frequently joke that the Ops team is like a big family — because we care for and treat each other like family members.

About the work

Lately our team has been working on migrating off the hundreds of servers we have in colocation (in 3 data centers) and moving our legacy applications to the cloud. Just this week we’ve been working on:

  • Creating a gem to talk to Google Cloud Storage and implementing it in one of our older Rails apps so we can move tens of terabytes of files off our local object storage
  • Going back and forth with Juniper TAC on an issue we have been having with our MX104 routers + crypto cards
  • Using Nginx/OpenResty + Lua + Redis to implement some new abuse filtering / mitigation in the cloud
  • Moving an existing application deployed on AWS with Terraform to use ALBs instead of ELBs

Please Apply

If you’ve gotten this far and you are thinking “I’d love to do that kind of work at Basecamp”, then please apply.

If you’ve gotten this far and thought, “my friend Jennifer would be great for this”, please let her know!

We’re especially interested in applications from folks in the early stages of their operations careers who show great aptitude.

We’re accepting applications until May 12th, 2017 and we look forward to hearing from you!

Reducing Incident Escalations at Basecamp

Incidents escalated to PagerDuty are generally “all hands on deck”.

tl;dr: We had fewer actual site-down interruptions and fewer false-alarm escalations in 2015.

Here’s a non-exhaustive list of factors that contributed to these improvements over the years:

  • Eliminating scheduled maintenance that would take a site offline
  • Limiting API abuse (and a general decrease in the number of abuse incidents)
  • Automated blocking of other common abuse traffic
  • Fairly generous ongoing hardware refresh with better distribution across cabinets
  • Completely new core and top-of-rack network switches
  • Hiring the right people (and the right number of people)
  • Moving to more stable storage (EMC / Isilon to Cleversafe)
  • Taking control of our public Internet connectivity and routing (Our own IP space, our own routers, carefully selected providers, filtering traffic)
  • Right-sizing database hardware for every major application
  • Better development / deployment practices and consistency in following those practices (local tests / CI, staging, rollout, production)
  • Practicing incident response and keeping playbooks up to date
  • Vastly improved metrics and dashboards
  • Better application architecture and design choices with regard to availability and failure modes
  • Being ruthless in our tuning of internal monitoring and alerting (Nagios) to only escalate alerts that really need to be escalated
  • (Full disclosure: we actually had more incidents escalated from our internal monitoring this year. The “quality” of those escalations is higher, though.)

Basecamp had 99.99+% uptime in 2015!

Overall, our uptime this year was the best it’s ever been in our modern recorded history. All of our customer-facing apps recorded four nines of uptime or better (meaning at least 99.99% uptime), and each individual app had less downtime than last year.
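To put "four nines" in perspective, here is a rough sketch of the arithmetic behind an uptime percentage. The function name and the 365-day year are assumptions made for illustration; this is not how our uptime numbers are actually measured or reported.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_percent: float, days: int = 365) -> float:
    """Return the maximum minutes of downtime allowed over `days`
    while still meeting the given uptime percentage.

    Hypothetical helper for illustration only; assumes a 365-day year
    by default and counts every minute of the period equally.
    """
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common uptime targets side by side.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

By this arithmetic, 99.99% uptime leaves a little under an hour of total downtime across a whole year, for everything: incidents, deploys gone wrong, and maintenance alike.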

Our Applications 2012–2015

What About Basecamp 3?

Basecamp 3 isn’t on the list, because it’s had perfect uptime since launch!

Our team will continue working hard to deliver the most stable and performant Basecamp you’ve ever used. We’re looking forward to a great 2016!