Seamless branch deploys with Kubernetes

Basecamp’s newest product HEY has lived on Kubernetes since development first began. While our applications are majestic monoliths, a product like HEY has numerous supporting services that run along-side the main app like our mail pipeline (Postfix and friends), Resque (and Resque Scheduler), and nginx, making Kubernetes a great orchestration option for us.

As you work on code changes or new feature additions for an application, you naturally want to test them somewhere — either in a unique environment or in production via feature flags. For our other applications like Basecamp 3, we make this happen via a series of numbered environments called betas (beta1 through betaX). A beta environment is essentially a mini production environment — it uses the production database but everything else (app services, Resque, Redis) is separate. In Basecamp 3’s case, we have a claim system via an internal chatbot that shows the status of each beta environment (here, none of them are claimed):

(prior to starting work on HEY, we were running 8 beta environments for BC3)

Our existing beta setup is fine, but what if we can do something better with the new capabilities that we are afforded by relying on Kubernetes? Indeed we can! After reading about GitHub’s branch-lab setup, I was inspired to come up with a better solution for beta environments than our existing claims system. The result is what’s in-use today for HEY: a system that (almost) immediately deploys any branch to a branch-specific endpoint that you can access right away to test your changes without having to use the claims system or talk to anyone else (along with an independent job processing fleet and Redis instance to support the environment).

Let’s walk through the developer workflow

  • A dev is working on a feature addition to the app, aptly named new-feature.
  • They make their changes in a branch (called new-feature) and push them to GitHub which automatically triggers a CI run in Buildkite:
  • The first step in the CI pipeline builds the base Docker image for the app (all later steps depend on it). If the dev hasn’t made a change to Gemfile/Gemfile.lock, this step takes ~8 seconds. Once that’s complete, it’s off to the races for the remaining steps, but most importantly for this blog post: Beta Deploy.
  • The “Beta Deploy” step runs bin/deploy within the built base image, creating a POST to GitHub’s Deployments API. In the repository settings for our app, we’ve configured a webhook that responds solely to deployment events — it’s connected to a separate Buildkite pipeline. When GitHub receives a new deployment request, it sends a webhook over to Buildkite causing another build to be queued that handles the actual deploy (known as the deploy build).
  • The “deploy build” is responsible for building the remainder of the images needed to run the app (nginx, etc.) and actually carrying out the Helm upgrades to both the main app chart and the accompanying Redis chart (that supports Resque and other Redis needs of the branch deploy):
  • From there, Kubernetes starts creating the deployments, statefulsets, services, and ingresses needed for the branch, a minute or two later the developer can access their beta at (If this isn’t the first time a branch is being deployed, there’s no initializing step and the deployment just changes the images running in the deployment).

What if a developer wants to manage the deploy from their local machine instead of having to check Buildkite? No problem, the same bin/deploy script that’s used in CI works just fine locally:

$ bin/deploy beta
[✔] Queueing deploy
[✔] Waiting for the deploy build to complete :
[✔] Kubernetes deploy complete, waiting for Pumas to restart

Deploy success! App URL:

(bin/deploy also takes care of verifying that the base image has already been built for the commit being deployed. If it hasn’t it’ll wait for the initial CI build to make it past that step before continuing on to queueing the deploy.)

Remove the blanket!

Sweet, so the developer workflow is easy enough, but there’s got to be more going on below the covers, right? Yes, a lot. But first, story time.

HEY runs on Amazon EKS — AWS’ managed Kubernetes product. While we wanted to use Kubernetes, we don’t have enough bandwidth on the operations team to deal with running a bare-metal Kubernetes setup currently (or relying on something like Kops on AWS), so we’re more than happy to pay AWS a few dollars per month to handle managing our cluster masters for us.

While EKS is a managed service and relatively integrated with AWS, you still need a few other pieces installed to do things like create Application Load Balancers (what we use for the front-end of HEY) and touch Route53. For those two pieces, we have a reliance on the aws-alb-ingress-controller and external-dns projects.

Inside the app Helm chart we have two Ingress resources (one external, and one internal for cross-region traffic that stays within the AWS network) that have all of the right annotations to tell alb-ingress-controller to spin up an ALB with the proper settings (health-checks so that instances are marked healthy/unhealthy, HTTP→HTTPS redirection at the load balancer level, and the proper SSL certificate from AWS Certificate Manager) and also to let external-dns know that we need some DNS records created for this new ALB. Those annotations look something like this:

Annotations:                             alb                  [{"HTTP": 80},{"HTTPS": 443}]                        internet-facing                    ELBSecurityPolicy-TLS-1-2-2017-01          {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}               arn:aws:acm:us-east-1:############:certificate/########-####-####-####-############     ,

alb-ingress-controller and external-dns are both Kubernetes controllers and constantly watch cluster resources for annotations that they know how to handle. In this case, external-dns will know that it shouldn’t create a record for this Ingress resource until it has been issued an Address, which alb-ingress-controller will take care of in it’s own control loop. Once an ALB has been provisioned, alb-ingress-controller will tell the Kubernetes API that this Ingress has X Address and external-dns will carry on creating the appropriate records in the appropriate Route53 zones (in this case, external-dns will create an ALIAS record pointing to Ingress.Address and a TXT ownership record within the same Route53 zone (in the same AWS account as our EKS cluster that has been delegated from the main app domain just for these branch deploys).

These things cost money, right, what about the clean-up!?

Totally, and at the velocity that our developers are working on this app, it can rack up a small bill in EC2 spot instance and ALB costs if we have 20-30 of these branches deployed at once running all the time! We have two methods of cleaning up branch-deploys:

  • a GitHub Actions-triggered clean-up run
  • a daily clean-up run

Both of these run the same code each time, but they’re targeting different things. The GitHub Actions-triggred run is going after deploys for branches that have just been deleted — it is triggered whenever a delete event occurs in the repository. The daily clean-up run is going after deploys that are more than five days old (we do this by comparing the current time with the last deployed time from Helm). We’ve experimented with different lifespans on branch deploys, but five works for us — three is too short, seven is too long, it’s a balance.

When a branch is found and marked for deletion, the clean-up build runs the appropriate helm delete commands against the main app release and the associated Redis release, causing a cascading effect of Kubernetes resources to be cleaned up and deleted, the ALBs to be de-provisioned, and external-dns to remove the records it created (we run external-dns in full-sync mode so that it can delete records that it owns).

Other bits

  • We’ve also run this setup using Jetstack’s cert-manager for issuing certs with Let’s Encrypt for each branch deploy, but dropped it in favor of wildcard certs managed in AWS Certificate Manager because hell hath no fury like me opening my inbox everyday to find 15 cert expiration emails in it. It also added several extra minutes to the deploy provisioning timeline for new branches — rather than just having to wait for the ALB to be provisioned and the new DNS records to propagate, you also had to wait for the certificate verification record to be created, propagate, Let’s Encrypt to issue your cert, etc etc etc.
  • DNS propagation can take a while, even if you remove the costly certificate issuance step. This was particularly noticeable if you used bin/deploy locally because the last step of the script is to hit the endpoint for your deploy over and over again until it’s healthy. This meant that you could end up caching an empty DNS result since external-dns may not have created the record yet (likely, in-fact, for new branches). We help this by setting a low negative caching TTL on the Route53 zone that we use for these deploys.
  • There’s a hard limit on the number of security groups that you can attach to an ENI and there’s only so much tweaking you can do with AWS support to maximize the number of ALBs that you can have attached to the nodes in an EKS cluster. For us this means limiting the number of branch deploys in a cluster to 30. HOWEVER, I have a stretch goal to fix this by writing a custom controller that will play off of alb-ingress-controller and create host-based routing rules on a single ALB that can serve all beta instances. This would increase the number of deploys per cluster up to 95ish (per ALB since an ALB has a limit on the number of rules attached), and reduce the cost of the entire setup significantly because each ALB costs a minimum of $16/month and each deploy has two ALBs (one external and one internal).
  • We re-use the same Helm chart for production, beta, and staging — the only changes are the database endpoints (between production/beta and staging), some resource requests, and a few environmental variables. Each branch deploy is its own Helm release.
  • We use this setup to run a full mail pipeline for each branch deploy, too. This makes it easy for devs to test their changes if they involve mail processing, allowing them to send mail to <their username> and have it appear in their account as if they sent it through the production mail pipeline.
  • Relying on GitHub’s Deployments API means that we get nice touches in PRs like this:
complete with a direct link to the temporary deploy environment

If you’re interested in HEY, checkout and learn about our take on email.

Blake is Senior System Administrator on Basecamp’s Operations team who spends most of his time working with Kubernetes, and AWS, in some capacity. When he’s not deep in YAML, he’s out mountain biking. If you have questions, send them over on Twitter – @t3rabytes.