Become the newest member of our Ops team

We’re looking for two new people to join our Ops team and help deliver the most reliable and performant Basecamp sites and services. You’ll work on every single piece of our infrastructure, both colocated and in the cloud. You’ll touch every single one of our applications, and you’ll frequently support other teams at Basecamp too. You’ll be joining our existing operations team with Blake, Eron, John, Matthew, Nathan (and me!). (That means joining our on call rotation too, whee!) We’re a super close team and we have a lot of fun together. We promise to learn from you and help you develop and mature new and existing personal and professional skills.

Currently our company works from 32 different cities, spread across 6 countries. You can work from anywhere in the world, so long as your working day overlaps the beginning or end of the day in the U.S. That means 8–10am Eastern time or 3–5pm Pacific time.

About You

As an experienced Ops person, you should be very familiar with Ruby on Rails, Ruby in general and all the normal frontend components. You’ve used the Percona Toolkit for years and know that Redis is the most stable part of the stack. If you are like us you find writing new tooling in Go to be a pleasure, but you are also comfortable with good old Bash, Ruby, and Python. (We have a mostly hate relationship with Java.) You enjoy the excitement of putting out the occasional fire and you have a vision for building stable, reliable and performant systems that create less work, not more.

In broad strokes, Managers of One thrive at Basecamp. We are committed generalists, eager learners, conscientious workers, and curators of what’s essential. We’re quick to trust. We see things through. We’re kind to each other, look up to each other, and support each other. We achieve together. We are colleagues, here to do our best work. We frequently joke that the Ops team is like a big family — because we care for and treat each other like family members.

About the work

Lately our team has been working on migrating off the hundreds of servers we have in colocation (in 3 data centers) and moving our legacy applications to the cloud. Just this week we’ve been working on:

  • Creating a gem to talk to Google Cloud Storage and implementing it in one of our older Rails apps so we can move tens of terabytes of files off our local object storage
  • Back and forth with Juniper TAC on an issue we have been having with our MX104 routers + crypto cards
  • Using Nginx/Openresty + Lua + Redis to implement some new abuse filtering / mitigation in the cloud
  • Moving an existing application deployed on AWS with Terraform to use ALBs instead of ELBs

Please Apply

If you’ve gotten this far and you are thinking “I’d love to do that kind of work at Basecamp”, then please apply.

If you’ve gotten this far and thought, “my friend Jennifer would be great for this”, please let her know!

We’re especially interested in applications from folks in the early stages of their operations careers who show great aptitude.

We’re accepting applications until May 12th, 2017 and we look forward to hearing from you!

Feeling Safe Across Data Centers

The Single Server Room

In a perfect world, all of your servers would be hard-wired to each other in a single room, with one opening through which the world connects to your little slice of the Internet. That way, all of your cross-service communication into databases, infrastructure and other services happens within a single set of directly connected servers.

In the modern world we often have to reach across the Internet to access services and applications. This can be an awkward feeling and presents some unique problems. However, there are a few techniques and patterns you can use to make it a little less frightening. Let’s talk through some of the bigger concerns and what you can do about them.

Security

One big reason reaching across data centers, even for first-party systems, can be a problem is security. You lose a base level of security when you send data and commands across the Internet, where anyone can glance at your requests and try to decode the data or alter what’s being sent. All someone needs is a basic understanding of networking, an attack technique like man-in-the-middle and perhaps Wireshark to unpack your cross-service request, read sensitive data, tinker with it and send an altered request on to its final destination. Fear not, however: there are some standard techniques to mitigate this risk.

1. SSL

Always communicate over SSL when you’re sending requests back and forth between your systems. It’s a straightforward, standard way to secure communication between two services or entities on the Web. Under the hood, SSL uses public/private key encryption to establish a secure channel between the two parties. Reddit, Facebook and all of your financial institutions use SSL (HTTPS) to communicate with your browser, and likely when they communicate between internal services too. It’s also become far easier and cheaper (free) to get SSL for your own services, thanks to organizations like Let’s Encrypt.
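
For a concrete picture, here is a minimal Ruby sketch of calling an internal service over HTTPS with certificate verification enforced. The hostname is a hypothetical stand-in.

    require "net/http"
    require "openssl"

    # Call an internal service over HTTPS and insist on verifying its certificate.
    # The hostname below is hypothetical.
    uri = URI("https://files.internal.example.com/v1/status")

    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER # never fall back to VERIFY_NONE

    response = http.request(Net::HTTP::Get.new(uri))
    puts response.code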

2. Request Signing

While communication over SSL is somewhat secure, it can fail. Or perhaps you don’t need SSL to prevent snooping, but you do want to ensure the data wasn’t tampered with. At Highrise we decided to use a draft standard currently being worked on at the IETF that outlines a method for signing a request. It lets you use an encryption algorithm and a set of keys you configure to formally verify the content of your request. Let’s say I want to ensure that the Digest, Authentication and Date headers are never altered. Following the protocol, I would set up the request, compute the signature for those headers (using my signing keys), add the signature to the request and execute the request. The standard lets you specify which keys were used to sign the request (via a KeyId parameter), which headers were signed and which algorithm did the signing. The recipient server can use that information to verify that the contents of those headers were not altered in transport. The details of this still-forming protocol go a fair bit deeper and are worth understanding; a follow-up post will dig into them shortly.
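
To make the signing idea concrete, here is a rough Ruby sketch in the spirit of that draft: compute an HMAC over the chosen headers and attach it, along with the key id, algorithm and header list, in a Signature header. The key id, secret and exact formatting are illustrative, not the precise wire format the draft specifies.

    require "openssl"
    require "base64"
    require "time"

    # Illustrative signing key; in practice these live in your configuration.
    SIGNING_KEY_ID = "highrise-ops-key-1"       # hypothetical key identifier
    SIGNING_SECRET = ENV.fetch("SIGNING_SECRET")

    # Build the signing string from the chosen headers, in order, and HMAC it.
    def signature_for(headers, names)
      signing_string = names.map { |n| "#{n.downcase}: #{headers.fetch(n)}" }.join("\n")
      hmac = OpenSSL::HMAC.digest(OpenSSL::Digest.new("SHA256"), SIGNING_SECRET, signing_string)
      Base64.strict_encode64(hmac)
    end

    body = '{"note":"hello"}'
    headers = {
      "Date"           => Time.now.httpdate,
      "Digest"         => "SHA-256=" + Base64.strict_encode64(OpenSSL::Digest::SHA256.digest(body)),
      "Authentication" => "Bearer example-token", # hypothetical credential
    }

    signed = %w[Digest Authentication Date]
    headers["Signature"] = "keyId=\"#{SIGNING_KEY_ID}\",algorithm=\"hmac-sha256\"," +
      "headers=\"#{signed.join(' ').downcase}\",signature=\"#{signature_for(headers, signed)}\""

    # The recipient recomputes the HMAC over the same headers and rejects the
    # request if the signatures don't match.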

These two techniques give us stronger confidence in what’s being sent over the wire to other services.

Reliability

Slow responses from external services caused by network fluctuations, as well as outright downtime, are facts of life in a cross-data-center world. Both kinds of issues can compound and make whole services virtually unusable. You often won’t be able to stop them from happening, so you have to prepare for them. Let’s talk about four mitigation techniques:

1. Local caches: avoid making requests

Caching, or intelligently deciding when a cross-service request is needed at all, cuts down on the number of requests you actually make. ETags and expiration headers help here, as does simply not requesting data unless you absolutely need it to accomplish your task. If the thing didn’t change since the last time it was requested, let the client reuse the data it already has.
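
Here is a small Ruby sketch of that idea using conditional GETs: send the ETag from the last response in an If-None-Match header, and reuse the cached copy when the server answers 304 Not Modified. The in-memory cache and URL handling are simplified for illustration.

    require "net/http"

    # Keep a tiny in-memory cache of { url => { etag:, body: } } and make a
    # conditional GET; on 304 Not Modified, reuse the copy we already have.
    CACHE = {}

    def fetch_with_cache(url)
      uri = URI(url)
      cached = CACHE[url]

      request = Net::HTTP::Get.new(uri)
      request["If-None-Match"] = cached[:etag] if cached

      response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        http.request(request)
      end

      if response.is_a?(Net::HTTPNotModified)
        cached[:body] # nothing changed, skip re-downloading
      else
        CACHE[url] = { etag: response["ETag"], body: response.body }
        response.body
      end
    end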

2. Timeouts and Retries

I mentioned earlier that slow responses from external services can create a compounding problem for your system. You can mitigate this risk by planning for it to happen and wrapping specific patterns around your communication. Specifically, set reasonable timeouts when you make external requests. One problem with timeouts is that you can’t tell whether the request ever reached the server, so you should make your endpoints idempotent whenever possible. Idempotent endpoints also make retries simpler, since you can hit the endpoint repeatedly without causing unintended changes. Finally, you should slow down each subsequent retry to give the system time to recover and to avoid hammering the service. This is called exponential back-off.

At Highrise, certain important requests have a timeout of 1 second. If a request fails, it is retried 3 times before we stop trying and start messaging our team about the issue. Each retry is scheduled further out, following the exponential back-off: 3 seconds after the failure, then 9 seconds, then 27 seconds. When the request does something like sending an email, idempotency becomes a very serious concern: you don’t want to send the exact same email 3 times because of retries. You can accomplish that with a key the server uses to decide whether the operation has already been performed.
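
Here is a condensed Ruby sketch of that pattern: a 1-second timeout, three retries backed off at 3, 9 and 27 seconds, and an idempotency key reused on every attempt. In practice the retries would be rescheduled as background jobs rather than slept through inline, and the endpoint and header names here are hypothetical.

    require "net/http"
    require "json"
    require "securerandom"

    MAX_RETRIES = 3

    # Deliver an email through a (hypothetical) external mailer with a 1-second
    # timeout, exponential back-off and an idempotency key that is reused on
    # every retry so the server can refuse to send the same email twice.
    def deliver_email(payload)
      idempotency_key = SecureRandom.uuid
      attempt = 0

      begin
        uri = URI("https://mailer.internal.example.com/v1/deliveries")
        http = Net::HTTP.new(uri.host, uri.port)
        http.use_ssl = true
        http.open_timeout = 1
        http.read_timeout = 1

        request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
        request["Idempotency-Key"] = idempotency_key
        request.body = payload.to_json

        http.request(request)
      rescue Net::OpenTimeout, Net::ReadTimeout
        attempt += 1
        raise if attempt > MAX_RETRIES # give up and let the caller alert the team

        sleep 3**attempt # 3, 9, then 27 seconds
        retry
      end
    end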

3. Circuit Breakers

Circuit breakers paired with timeouts can help you both handle full-service degradation better and provide a window for recovery. A circuit breaker lets you define a set of rules that say when the breaker should “trip.” When a breaker trips, you skip the operation and instead respond with “try again later, please,” re-queue a job or use some other retry mechanism. In practice at Highrise, we wrap requests to an external service in a circuit breaker. If the breaker trips due to too many request timeouts, we display a message to any users trying to access functionality that relies on that service, and we put jobs that use it on hold. Jobs that were in flight will presumably fail and be retried as usual. A tripped breaker stays tripped for several minutes (a configured value), which keeps us from hammering a service that may be struggling to keep up. That gives Operations some breathing room to add servers, fix a bug or simply let network latency recover a little.
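
A stripped-down circuit breaker in Ruby might look something like this; the thresholds and rescued error classes are illustrative, not our production configuration.

    require "net/http"

    # A very small circuit breaker: after enough consecutive failures it "trips"
    # and short-circuits calls for a while, giving the remote service room to
    # recover. Thresholds are illustrative.
    class CircuitBreaker
      FAILURE_THRESHOLD = 5    # consecutive failures before tripping
      RETRY_AFTER       = 300  # seconds to stay tripped

      def initialize
        @failures   = 0
        @tripped_at = nil
      end

      def call
        raise "circuit open, try again later" if open?

        result = yield
        @failures = 0 # a success resets the count
        result
      rescue Net::OpenTimeout, Net::ReadTimeout
        @failures += 1
        @tripped_at = Time.now if @failures >= FAILURE_THRESHOLD
        raise
      end

      private

      def open?
        return false unless @tripped_at

        if Time.now - @tripped_at > RETRY_AFTER
          @tripped_at = nil # window passed, allow a fresh attempt
          @failures = 0
          false
        else
          true
        end
      end
    end

    # breaker = CircuitBreaker.new
    # breaker.call { mailer.deliver(message) }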

4. Upcheck

Upchecks, health checks and the like are very useful for getting a basic read on whether you can reach a service. Libraries standardize some of this for you, so you don’t have to think much about what to provide. What you really want to know is whether you can reach the service and whether its basic functions are operational. Upchecks paired with a circuit breaker can help you decide whether to show a maintenance page or to skip jobs that won’t work at the moment. These checks should be extremely fast. At Highrise, for our first-party external services, we check once per web request for the liveness of the feature about to be accessed. Say we have an external emailing service: if someone goes to the email feature, we don’t check before every email operation in the code that the service is up. Instead, we check at the beginning of the web request. If the service is up, we continue to the feature; if it isn’t, we display a basic “down, please try again later” message.
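
As a sketch, that per-request check could live in a Rails before_action along these lines; EmailService and its up? method are hypothetical stand-ins for whatever health check the service exposes.

    # Check the service's health once, at the top of the web request, instead of
    # before every operation inside the feature. EmailService is hypothetical.
    class EmailsController < ApplicationController
      before_action :require_email_service

      def index
        # normal feature code, only reached when the upcheck passed
      end

      private

      def require_email_service
        return if EmailService.up? # e.g. a fast GET to the service's upcheck URL

        render plain: "Email is down at the moment, please try again later.",
               status: :service_unavailable
      end
    end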

Act like it isn’t yours

When it comes to external services, even if you wrote it, you have to act like you have no control of it. You can’t assume any external service will always operate normally. The reality is you have limited control, so you have to design a system that explains issues to your users as they happen and mostly recovers on its own. Protect what you can control and avoid requiring humans to repair issues like this. The more your computers can recover on their own, the more you can worry about the next feature, the next user or the next beer.


I’m a Software Engineer at Highrise (@highrise). Follow me on Twitter to tell me what you think, as well as find more ramblings about Software and the world by me @jphenow.

Reducing Incident Escalations at Basecamp

Incidents escalated to PagerDuty are generally “all hands on deck”.

tl;dr: We had fewer actual site-down interruptions and fewer false-alarm escalations in 2015.

Here’s a non-exhaustive list of factors that contributed to these improvements over the years:

  • Eliminating scheduled maintenance that would take a site offline
  • Limiting API abuse (and a general decrease in the number of abuse incidents)
  • Automated blocking of other common abuse traffic
  • Fairly generous ongoing hardware refresh with better distribution across cabinets
  • Completely new core and top of rack network switches
  • Hiring the right people (and the right number of people)
  • Moving to more stable storage (EMC / Isilon to Cleversafe)
  • Taking control of our public Internet connectivity and routing (Our own IP space, our own routers, carefully selected providers, filtering traffic)
  • Right sizing database hardware for every major application
  • Better development / deployment practices and consistency in following those practices (local tests / ci, staging, rollout, production)
  • Practicing incident response and keeping play books up to date
  • Vastly improved metrics and dashboards
  • Better application architecture and design choices with regard to availability and failure modes
  • Being ruthless in our tuning of internal monitoring and alerting (Nagios) to only escalate alerts that really need to be escalated
  • (Full disclosure: we actually had more incidents escalated from our internal monitoring this year, though the “quality” of those escalations was higher.)