How to waste half a day by not reading RFC 1034

HEY uses a branch deploy system that I’ve written about here on SvN and talked about frequently on Twitter. Plenty of other companies have implemented their own version of branch deploys (typically under a different name), but this was my own implementation, so I’m proud of it. First, a primer on how it works:

  • Developer makes a code change in a git branch and pushes it to GitHub.
  • An automated build pipeline run is kicked off by a GitHub webhook. It builds some Docker images and kicks off another build that handles the deploy itself.
  • That deploy build, well, it deploys — to AWS EKS, Amazon’s managed Kubernetes offering, via a Helm chart that contains all of the YAML specifications for deployments, services, ingresses, etc.
  • alb-ingress-controller (now aws-load-balancer-controller) creates an ALB for the branch.
  • external-dns creates a DNS record pointing to the new ALB.
  • The developer can then access their branch from their browser using a special branch-specific URL.

This process typically takes 5-10 minutes for a brand-new branch, from push to accessible.
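For a concrete sense of the deploy step, here's a minimal sketch of the kind of helm invocation that deploy build might run per branch (the release name, chart path, and values are illustrative, not our actual chart):

$ helm upgrade --install my-branch ./deploy/chart \
    --namespace branch-deploys \
    --set image.tag="$GIT_SHA" \
    --set ingress.host=my-branch.branch-deploy.com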

Our current setup works well, but it’s got two big faults:

  • Each branch needs its own ALB (that's what the controller generates for each Ingress resource).
  • DNS is DNS: records sometimes take a while to propagate, and we have to manage a ton of them (3-5 for each branch).

These faults are intertwined: if I didn't have to give each branch its own ALB, I could use a single wildcard record to point every subdomain on our branch-deploy-specific domain at one ALB and let the ALB route requests to where they belong via host headers. That means I can save money by not needing all of those ALBs, we can cut the DNS-being-DNS time to zero, and we can drop the external-dns annotations and conditionals spread throughout our YAML.
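For illustration, creating that wildcard ALIAS by hand could look something like this. ZALBZONEID is the ALB's canonical hosted zone ID (aws elbv2 describe-load-balancers reports it) and ZMYZONEID is our Route 53 hosted zone; both are placeholders:

$ cat > wildcard.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "*.branch-deploy.com.",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "ZALBZONEID",
        "DNSName": "internal-k8s-swiper-no-swiping.us-east-1.elb.amazonaws.com.",
        "EvaluateTargetHealth": false
      }
    }
  }]
}
EOF
$ aws route53 change-resource-record-sets --hosted-zone-id ZMYZONEID --change-batch file://wildcard.json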

(Waiting a few minutes for DNS to propagate and resolve doesn't sound like a big deal, but we shoot ourselves in the foot with the way our deploy flow works: as soon as the deploy build finishes, it checks that the revision has actually been deployed by visiting an internal path on the new hostname. That means we attempt to resolve DNS before the record has been created, and your local machine caches that NXDOMAIN response until the TTL expires.)
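A sketch of one way around that foot-gun (not what our pipeline actually does): poll one of the zone's authoritative name servers directly, so no local resolver ever caches the negative answer, and only hit the app once the record exists. The name server and the /up path here are placeholders:

$ until dig +short @ns-123.awsdns-45.com my-branch.branch-deploy.com A | grep -q .; do sleep 5; done
$ curl https://my-branch.branch-deploy.com/up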

Before now, this was doable but required some extra effort that made it not worthwhile — it would likely need to be done through a custom controller that would take care of adding your services to a single Ingress object via custom annotations. This path was Fine™️ (I even made a proof-of-concept controller that did just that), but it meant there was some additional piece of tooling that we now had to manage, along with needing to create and manage that primary Ingress object.

Enter a new version of alb-ingress-controller (under its new name: aws-load-balancer-controller) that includes a new IngressGroup feature that does exactly what I need. It adds a set of annotations that I can add to my Ingresses, causing all of my Ingress resources to become routing rules on a single shared ALB rather than each generating its own.
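A minimal sketch of what that looks like on a branch's Ingress. The group.name annotation is the actual IngressGroup mechanism; the resource names, host, and service are placeholders:

$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-branch
  annotations:
    kubernetes.io/ingress.class: alb
    # every Ingress sharing this group.name gets merged into one ALB
    alb.ingress.kubernetes.io/group.name: branch-deploys
spec:
  rules:
    - host: my-branch.branch-deploy.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-branch-web
                port:
                  number: 80
EOF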

“Great!” I think to myself on the morning I start the project of testing the new revision and figuring out how I want to implement this (using it as an opportunity to clean up a bunch of technical debt, too).

I get everything in place: I've updated aws-load-balancer-controller in my test cluster, deleted all of the branch-specific ALIAS records that existed for the old ALBs, told external-dns not to manage Ingress resources anymore, and set up a wildcard ALIAS pointing to my new single ALB that all of these branches should be sharing.
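For reference, "telling external-dns not to manage Ingress resources" just means dropping --source=ingress from its container args, leaving only --source=service (the namespace and exact flag set here are illustrative):

$ kubectl -n kube-system get deployment external-dns \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
["--provider=aws","--domain-filter=branch-deploy.com","--source=service"]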

It doesn’t work.

$ curl --header "Host: alb-v2.branch-deploy.com" https://alb-v2.branch-deploy.com
curl: (6) Could not resolve host: alb-v2.branch-deploy.com

But if I call the ALB directly with the proper host header, it does:

$ curl --header "Host: alb-v2.branch-deploy.com" --insecure https://internal-k8s-swiper-no-swiping.us-east-1.elb.amazonaws.com
<html><body>You are being <a href="https://alb-v2.branch-deploy.com/sign_in">redirected</a>.</body></html>

(╯°□°)╯︵ ┻━┻

I have no clue what is going on. I can clearly see that the record exists in Route53, but I can’t resolve it locally, nor can some DNS testing services (❤️ MX Toolbox).

Maybe it’s the “Evaluate Target Health” option on the wildcard record? Disabled that and tried again, still nothing.

Thoroughly stumped, I start browsing the Route 53 documentation and find a line that I think is the answer to my problem:

If you create a record named *.example.com and there’s no example.com record, Route 53 responds to DNS queries for example.com with NXDOMAIN (non-existent domain).

So off I go to create a record for branch-deploy.com to see if maybe that's it. That still doesn't do it. This is when I re-read that line and realize it doesn't apply to me anyway; I had read it incorrectly the first time, and I'm not trying to resolve branch-deploy.com itself. (My initial reading was that *.branch-deploy.com wouldn't resolve without a record for branch-deploy.com existing.)

Welp, time to dig into the RFC, there’s bound to be some obscure thing I’m missing here. Correct that assumption was.

Wildcard RRs do not apply:

– When the query is in another zone. That is, delegation cancels the wildcard defaults.

– When the query name or a name between the wildcard domain and the query name is known to exist. For example, if a wildcard RR has an owner name of “*.X”, and the zone also contains RRs attached to B.X, the wildcards would apply to queries for name Z.X (presuming there is no explicit information for Z.X), but not to B.X, A.B.X, or X.

Hmm, that second bullet point sounds like a lead. Let me go back to my Route53 zone and look.

┬─┬ ノ( ゜-゜ノ)

Ah, I see it.

One feature of our branch deploy system is that you can also have a functioning mail pipeline that is specific to your branch. To use that feature, you send email to an address at your-branch.branch-deploy.com. To make that work, each branch gets an MX record on your-branch.branch-deploy.com.

Herein lies the problem. While you can have a wildcard record for branch-deploy.com, if an MX record (or any other record, really) exists for a given subdomain, that name exists in the zone, and an A/AAAA/CNAME lookup for your-branch.branch-deploy.com will not fall back to the wildcard. 🙃
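You can see the behavior with a couple of dig queries (illustrative output; the MX target is a placeholder):

$ dig +short your-branch.branch-deploy.com MX
10 inbound-smtp.us-east-1.amazonaws.com.
$ dig your-branch.branch-deploy.com A | grep -E 'status:|ANSWER:'
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31337
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Because the name exists, the A query comes back NOERROR with zero answers (a "NODATA" response) instead of matching the wildcard, which curl happily reports as "Could not resolve host."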

This is likely a well-known quirk (is it even a quirk, or is it common sense? It surely wasn't common sense for me), but I blew half a day banging my head against my desk trying to figure out why this wasn't working, all because I made a bad assumption, and I really needed to vent about it. Thank you for indulging me.
