I’m reviewing transcripts from interviews we did with customers last year and came across a nice example of interview technique.
The hardest thing about customer interviews is knowing where to dig. An effective interview is more like a friendly interrogation. We don’t want to learn what customers think about the product, or what they like or dislike — we want to know what happened and how they chose. What was the chain of cause and effect that made them decide to use Basecamp? To get those answers we can’t just ask surface questions, we have to keep digging back behind the answers to find out what really happened.
Here’s a small example.
Doris (name changed) works in the office at a construction company. She had “looked for a way to have everything [about their projects] compiled in one area” for a long time. All the construction-specific software she tried was too “in depth.” She gave up her search. Some months later, she and her co-worker tried some software outside the construction domain: Monday and Click-Up.
I asked: How did you get to these options?
She said she and her co-worker did Google searches.
I asked: Why start searching again?You tried looking for software a few months before. What happened to make you look again?
She said they had quite a few new employees. And “needed a place for everything to be.”
That sounds reasonable enough. We could have moved on and started talking about how she got to Basecamp. But instead of accepting that answer, I kept digging.
Ok so you hired more employees. Why not just keep doing things the same way?
“It was an outdated system. It’s all paper based. And this isn’t a paper world.”
We have our answer, right? “Paper based” is the problem. No, not yet. That answer doesn’t tell us enough about what to do.
As designers that’s what we need to know. We need to understand the problem enough to actually decide what specifically to build. “Paper based” sounds like a problem, but what does it tell us? If we had to design a software fix for her right now, where would we start? What would we leave out? How would we know when we made the situation better enough to stop and move on to something else?
So I asked one of my favorite questions:
What was happening that showed you the way you were doing things wasn’t working anymore?
This question is extremely targeted and causal. It’s a very simple question that invites her to describe the problem in a way that is hard, factual, time-bound, contextual, and specific — without any analysis, interpretation, speculation or rationalization. Just: What happened. What did you see. What was wrong.
“The guys would just come ask for the same information over and over again. And it was taking up time for me. . . . They shouldn’t have to ask me some of these questions. You get asked 20, 30 stupid questions and try to go back to something you have to pay attention to . . . you’re working numbers and money you need to be paying attention to what you’re doing.”
Aha. Now we’re getting somewhere. She holds all the information. The guys in the field need the information. She needs to give them access to the information so they can look it up themselves. Then she’ll stop getting interrupted and she can focus on her own work.
This dramatically narrows down the supply side and begins to paint the outlines of actionable design requirements.
I check to see if I understand the causality here.
Was the number of interruptions worse after you hired more people?
“Oh yeah, absolutely.”
Because we kept digging for causality, we got to an understanding of the situation — what caused it, what went wrong, what progress meant to her, and why she was saying “yes” and “no” to different options as she evaluated them.
Here are all my extracted answers from our monthly Basecamp check-in question of What are you reading? for 2019. (See also my answers from 2016, 2017, and 2018).
The Sane Society Another Fromm tome! This one starts from the premise of evaluating the different social characters of various societies. But not from the abstract, pretend objectiveness of “everything is equal; everything is just different”, but from a bold perspective of “some societies truly do better than others at promoting human health and flourish. That’s potentially a dynamite perspective, but Fromm handles it with utter grace and respect.
I particularly enjoyed the concept of mental illness or malaise as an act of rebellion against societal pressures and norms. Particularly the refutation that “the sane person” is whoever is performing their productive function in society. Yeah, fuck that.
I also really liked the depiction of a country’s social character to be an expression of what that society thought it needed. A German reputation for stinginess/savings as virtuous? A reflection of what the country needed in order to rebuild after 20th century devastation. It’s a fascinating recast of “stereotype” as something intellectually productive, and not just lazy othering.
The internet is finally coming out of its long haze on privacy, but it’s with one hell of a hangover. So many practices that were once taken for granted are now getting a second, more critical look. One of those is the practice of spying on whether recipients of marketing emails open them or not.
But whether these open rates are “useful” or not is irrelevant. They’re invasive, they’re extracted without consent, and they break the basic assumptions most people have about email. There’s a general understanding that if you take actions on the internet, like clicking a link or visiting a site, there’s some tracking associated with that. We might not like it, but at least we have a vague understanding of it. Not so with email spy pixels.
Just about every normal person (i.e. someone not working in internet marketing) has been surprised, pissed, or at least dismayed when I tell them about spy pixels in emails. The idea that simply opening an email subjects you to tracking is a completely foreign one to most people.
When I’ve raised this concern in conversations with people in the marketing industry, a lot of them have taken offense to the term “spy pixels”. Affixing the spying label made a lot of them uncomfortable, because they were just trying to help! I get that nobody wants to think of themselves as the bad guy (Eilish not withstanding), but using the word “spy” isn’t exactly a reach.
Here’s the dictionary definition of a spy: “a person who secretly collects and reports information on the activities, movements, and plans”. That fits pretty well to a spy pixel that tracks whether you open an email or not, without your knowledge or consent!
So. Let’s stop doing that. Collectively. And the best place to instigate reform is with the mailing list software we use. A modest proposal for a basic ethics reform:
1) Mailing list software should not have spy pixels turned on by default. This is the most important step, because users will follow the lead of their software. It must be OK to spy on whether people open my marketing emails if the software I’m using it provides that by default.
2)Mailing list software can ask for explicit consent when the sender really does want to track open rates. Let the sender include a disclaimer at the bottom of their email: “[The sender] would like to know when you open this email to help improve their newsletter. If that’s OK with you, [please opt-in to providing read receipts]. Thanks!”.
That’s it. Don’t do it by default, ask for informed consent if you must. Being respectful of someone’s privacy isn’t rocket science.
And remember, you can still tag your links in those emails with ?source=newsletter or whatever to see whether your call-to-action is working. As we discussed, people have a basic understanding that clicking links and visiting websites – explicit actions they take! – has some tracking involved.
This isn’t going to magically make everything better. It’s not going to fix all the issues we have with privacy online or even all the deceptive practices around mailing lists. But it’s going to make things a little better. And if we keep making things a little better, we’ll eventually wake up to a world that’s a lot better.
One of the great tragedies of modern web development over the last five years or so has been the irrational exuberance for microservices. The idea that making a single great web application had simply become too hard, but if we broke that app up into many smaller apps, it’d all be much easier. Turned out, surprise-surprise, that it mostly wasn’t.
As Kelsey Hightower searingly put the fallacy: “We’re gonna break it up and somehow find the engineering discipline we never had in the first place”.
But it’s one of those hard lessons that nobody actually wants to hear. You don’t want to hear that the reason your monolith is a spaghetti monster is because you let it become that way, one commit at the time, due to weak habits, pressurized deadlines, or simply sheer lack of competence. No, what you want to hear is that none of that mess is your fault. That it was simply because of the oppressive monolithic architecture. And that, really, you’re just awesome, and if you take your dirty code and stick it into this new microservices tumbler, it’s going to come out sparking clean, smelling like fucking daffodils.
The great thing about such delusions is that they can keep you warm for quite a while. A yeah, sure, maybe the complexities of your new microservices monstrosity are plain as day right from the get go, but you can always excuse them with “it’s really going to pay off once we…” bullshit. And it’ll work! For a while. Because, who knows? Maybe this is better? But it’s not. And the day you have to really admit its not, you’re probably not even still there. On to the next thing.
Microservices as an architectural gold rush appealed to developers for the same reason TDD appeals to developers: it’s the pseudoscientific promise of a diet. The absolution of a new paradigm to wash away and forgive our sins. Who doesn’t want that?
Well, maybe you? Now after you’ve walked through the intellectual desert of a microservice approach to a problem that didn’t remotely warrant it (ie, almost all of them). Maybe now you’re ready to hear a different story. There’s a slot in your brain for a counterargument that just wasn’t there before.
So here’s the counterargument: Integrated systems are good. Integrated developers are good. Being able to wrap your mind around the whole application, and have developers who are able to make whole features, is good! The road to madness and despair lays in specialization and compartmentalization.
The galaxy brain takes it all in.
But of course, you cry, what if the system is too large to fit in my brain? Won’t it just swap and swap until I kernel failure? Yes, if you try to stick in a bloated beast of an application, sure.
So the work is to shrink the conceptual surface area of your application until it fits in a normal, but capable and competent, programmer’s brain. Using conceptual compression, sheer good code writing, a productive and succinct environment, using shortcuts and patterns. That’s the work.
But the payoff is glorious. Magnificent. SUBLIME. The magic of working on an integrated system together with integrated programmers is a line without limits, arbitrary boundaries, or sully gatekeepers.
Forget frontend or backend. The answer is all of it. At the same time. In the same mind.
This sounds impossible if you’ve cooked your noodle too long in the stew of modern astronautic abstractions. If you turn down the temperature, you’ll see that the web is actually much the same as it always was. Sure, a few expectations increased here, and a couple of breakthrough techniques appeared there, but fundamentally, it’s the same. What changed was us. And mostly not in ways for the better.
If your lived experience still haven’t hit the inevitable wall of defeat on the question of microservices, then be my guest, sit there with your folded arms and your smug pout. It’s ok. I get it. There’s not an open slot for this argument in your brain just yet. It’s ok. I’m patient! I’ll still be here in a couple of years when there’s room. And then I’ll send you a link to this article on twitter.
My name is David Heinemeier Hansson, and I’m the CTO and co-founder of Basecamp, a small internet company from Chicago that sells project-management and team-collaboration software.
When we launched our main service back in 2004, the internet provided a largely free, fair, and open marketplace. We could reach customers and provide them with our software without having to ask any technology company for permission or pay them for the privilege.
Today, this is practically no longer true. The internet has been colonized by a handful of big tech companies that wield their monopoly power without restraint. This power allow them to bully, extort, or, should they please, even destroy our business – unless we accept their often onerous, exploitive, and ever-changing terms and conditions.
We strongly encourage candidates of all different backgrounds and identities to apply. Each new hire is an opportunity for us to bring in a different perspective, and we are always eager to further diversify our company. Basecamp is committed to building an inclusive, supportive place for you to do the best and most rewarding work of your career.
ABOUT THE JOB The Research & Fidelity team consists of two people, Sam Stephenson and Javan Makhmali, whose work has given rise to Stimulus, Turbolinks, and Trix—projects that exemplify our approach to building web applications. You’ll join the team and work with them closely.
In broad terms, Research & Fidelity is responsible for the following:
Designing, implementing, documenting, and maintaining front-end systems for multiple high-traffic applications
Assisting product teams with front-end decisions and participating in code reviews
Tracking evergreen browser changes and keeping our applications up-to-date
Extracting internal systems and processes into open-source software and evolving them over time
As a member of the R&F team at Basecamp, you’ll fend off complexity and find a simpler path. You’ll fix bugs. You’ll go deep. You’ll learn from us and we’ll learn from you. You’ll have the freedom and autonomy to do your best work, and plenty of support along the way.
Our team approaches front-end work from an unorthodox perspective:
We believe designers and programmers should build UI together, and that HTML is a common language and shared responsibility. Our tools and processes are manifestations of this belief.
We are framework builders. We approach intractable problems from first principles to make tools that help make Basecamp’s product development process possible.
Here are some things we’ve worked on recently that might give you a better sense of what you’ll be doing day to day:
Working with a designer during Office Hours (our weekly open invitation) to review and revise their code
Researching Service Workers and building a proof-of-concept offline mode for an existing application
Creating a Stimulus controller to manage “infinite” pagination using IntersectionObserver
Investigating a Safari crash when interacting with <datalist> elements and filing a detailed report on WebKit’s issue tracker
Extracting Rails’ Action Text framework from the rich text system in Basecamp 3
Working with programmers from the iOS and Android teams to co-develop a feature across platforms
Porting Turbolinks from CoffeeScript to TypeScript and refactoring its test suite
Responding to a security report for our Electron-based desktop app and implementing a fix
You might have a CS degree. You might not. That’s not what we’re looking for. We care about what you can do and how you do it, not about how you got here. A strong track record of conscientious, thoughtful work speaks volumes.
This is a remote job. You’re free to work where you work best, anywhere in the world: home office, coworking space, coffeeshops. While we currently have an office in Chicago, you should be comfortable working remotely—most of the company does!
Managers of One thrive at Basecamp. We’re committed generalists, eager learners, conscientious workers, and curators of what’s essential. We’re quick to trust. We see things through. We’re kind to each other, look up to each other, and support each other. We achieve together. We are colleagues, here to do our best work.
We value people who can take a stand yet commit even when they disagree. And understand the value in others being heard. We subject ideas to rigorous consideration and challenge each other, but all remember that we’re here for the same purpose: to do good work together. That comes with direct feedback, openness to each others’ experience, and willingness to show up for each other as well as for the technical work at hand. We’re in this for the long term.
PAY AND BENEFITS Basecamp pays in the top 10% of the industry based on San Francisco rates. Same position, same pay, no matter where you live. The salary for this position is either $149,442 (Programmer) or $186,850 (Senior Programmer). We assess seniority relative to the team at Basecamp during the interviewing process.
Benefits at Basecamp are all about helping you lead a healthy life outside of work. We won’t treat your life as dead code to be optimized away with free dinners and dry cleaning. You won’t find lures to keep you coding ever longer. Quality time to focus on work starts with quality time to think, exercise, cook a meal, be with family and friends—time to yourself.
Work can wait. We offer fully-paid parental leave. We work 4-day weeks through the summer (northern hemisphere), enjoy a yearly paid vacation, and take a one-month sabbatical every three years. We subsidize coworking, home offices, and continuing education, whether professional or hobbyist. We match your charitable contributions. All on a foundation of top-shelf health insurance and a retirement plan with a generous match. See the full list.
HOW TO APPLY Please send an application that speaks directly to this position. Show us your role in Basecamp’s future and Basecamp’s role in yours. Address some of the work we do. Tell us about a newer (less than five years old) web technology you like and why.
We’re accepting applications until Sunday, February 2, 2020, at 9:00PM US-Central time. There’s no benefit to filing early or writing a novel. Keep it sharp, short, and get across what matters to you. We value great writers, so take your time with the application. We’re giving you our full attention.
We expect to take two weeks to review all applications. You’ll hear from us by February 14 about whether you’ve advanced to the written code review part of the application process. If so, you’ll submit some code you’re proud of, review it, and tell its story. Then on to an interview. Our interviews are one hour, all remote, with your future colleagues, on your schedule. We’ll talk through some of your code and some of ours. No gotchas, brainteasers, or whiteboard coding. We aim to make an offer by March 20 with a start date in early April.
At Basecamp we have an internal project called “Your proudest moments”. My colleague Dan set it up so that people at Basecamp could share anything we’re proud of. So far people have shared impressive, really feel-good accomplishments, such as performing complicated house renovations without professional help, writing books, or taking their parents on an unforgettable vacation.
This post comes from my first contribution to this project. As I told them, it went to “Your proudest moments” because we don’t have a “Your most useless and pointless self-inflicted programming hours” project, that would have been the best fit. Still, this quite a ridiculous thing to do made me super proud, and I also had a lot of fun doing it.
Advent of Code is an advent calendar of programming puzzles that’s been happening since 2015, made by Eric Wastl. Every day from 1st to 25th December a new puzzle with 2 parts gets released and for each part solved you get a star. The goal is to collect all 50 stars to save Christmas. All the problems follow a story normally involving the space, a spaceship, elves, reindeers and Santa. The difficulty increases as the days pass. First ones are simpler, but then they start getting complicated and laborious. Some of them are pretty tricky! They aren’t necessarily super hard algorithmically, binary search, BFS, Dijkstra, Floyd–Warshall, A*… might be all you need but I can easily take several hours to finish each one. This year there was also a bit of modular arithmetic that I loved and a little bit of trigonometry. The problems are quite amazing. This year for example included things like a Pong game in Intcode, a made-up assembly language (you had to program the joystick movements and feed them to the program) and a text-based adventure game in Intcode as well. It’s seriously cool.
I’ve done it in 2016, 2017 and 2018, the first time in Ruby, then the last 2 years in Elixir. I never finish on the 25th December because for me it’s impossible to work, take care of life stuff and also spend several hours programming these puzzles every day 😅I normally finish around 27th or 28th December. This year I was moving to a new apartment in the middle of December, and since that wasn’t stressful enough, I took on a new challenge with Advent of Code and finished just this Sunday: doing a polyglot version, that is, every day in a different programming language! I was inspired by a friend who had done this in 2018. I thought I’d give up, but the more I solved, the more invested I was and the less willing to give up!
Every year there’s some sort of made up assembly code that you have to write an interpreter for, and then it reappears in subsequent problems, so you reuse your code, enhance it, etc. In the past, there had been a handful of problems using this assembly language. This year, however, 12 problems involved Intcode programs (the name of 2019’s made up assembly). Doing a polyglot challenge meant I had to rewrite the interpreter every time. I almost go mad 😆- hysterical laughter.
A very important part of the challenge for me was trying to write code as good as possible, even for languages I had never seen. I didn’t want to complete problems by learning how to declare variables, loops and if-else and force my Ruby or Elixir solution into the language. This meant looking at style guides so I could follow the naming conventions and that, but also getting familiar with idioms and structures, and looking at official repos and examples if possible, so my code was at least a little idiomatic. This took a lot of time in some cases, and I’m not sure I achieved my goals, but I did my best! What I didn’t do was to learn and use some features from some languages as I couldn’t really fit them in this kind of problems (like macros to manipulate AST nodes in Crystal or Nim, concurrency stuff in Erlang…).
Some fun things I learnt and other anecdotes from doing this challenge:
Programming languages with array indexes starting at 1 are stressful. In my case, this was Lua, R and Julia. Off-by-one errors are twice as fun here.
Rust is as cool as I had imagined it to be, but it has this complex and interesting memory management system that I only scratched the surface of. Ownership, borrowing and memory safety. Errors were ridiculously cryptic if you didn’t understand this very well, and yes, the hour I spent looking into it wasn’t enough to understand it well.
I always forget how much I like Golang and this made me remember. It’s so neat ❤️
Erlang was such a nice surprise. I had never used it and had this unfounded idea it was tough and unpleasant. It’s not, it’s lovely! 💘 This must be a false rumour spread by Erlang programmers so they can remain in their exclusive club.
Julia was also a nice surprise, easy to pick up, at least for the basics, and I imagined it being pretty good for scientists.
I also remembered how much I love C, it was the first language I learnt. I’m lucky I don’t write C code professionally, though. It’s seriously scary. One obscure, edge error and someone will get root in your machine.
I used Kotlin and Swift too soon for too simple problems! Same for Common Lisp and Prolog. I wasted them when I still thought I wouldn’t finish the challenge.
I skipped PHP and Java! Two languages I know and have even used professionally, but dislike them so much that I preferred to learn new ones! There’s already been enough PHP and Java in my life.
Nim was a language I didn’t even know it existed, and I’m quite glad I chose it.
Crystal’s syntax is so similar to Ruby that I felt I was cheating by using it in the challenge. Still, I made the rules, so…🤷🏻♀️
Dylan was impossible to find documentation and examples for, in the regular ways programmers search for stuff nowadays, at least. Whenever I’d google how to do something, all I’d find was someone named Dylan explaining how to do whatever I wanted to do in another language like Java. I ended up using a really good book from the 90s that I found available online in PDF.
Naming conventions in R are famously anarchic! 5 naming conventions to choose from, and multiple conventions used simultaneously in the same packages, style guides and tutorials 🤣In my code I went for the period.separated one: e.g. cut.cardssimply because I wasn’t going to have the chance of using this convention in any other language. Of course, I threw in some naming inconsistencies as well to comply with the chaos! ✌️
I hope you enjoyed reading this. If you’re interested, here are all my solutions for 2019 and the previous years. I’ll be forever grateful to Eric Wastl for creating Advent of Code, and I’m sure many other programmers share this gratitude. Now I need to think of a new challenge for next year!
A few days ago my wife and I went to see Uncut Gems at a Regal theater in Chicago.
We booked our ticket online, reserved our seats, showed up 15 minutes ahead of time, and settled in.
After the coil of previews, and jaunty, animated ads for sugary snacks, the movie started.
About 20 minutes in, a loud, irritating buzzing started coming from one corner of the theater. No one was sure what to make of it. Was it part of the movie? We all just let it go.
But it didn’t stop. Something was wrong with the audio. It was dark, so you couldn’t see, but you could sense people wondering what happens now. Was someone from the theater company going to come in? Did they even know? Is there anyone up in the booth watching? Did we have to get someone?
We sent a search party. A few people stood up and walked out to go get help. The empty hallways were cavernous, no one in sight.
Eventually someone found someone from the staff to report the issue. Then they came back into the theater to settle in and keep watching the movie.
No one from the theater came into the theater to explain what was going on. The sound continued for about 10 more minutes until the screen abruptly went black. Nothingness. At least the sound was gone.
Again, no one from the theater company came in to say what was going on. We were all on our own.
The nervous, respectfully quiet giggle chatter started. Now what?
A few minutes later, the movie started again. From the beginning. No warning. Were they going to jump forward to right before they cut it off? Or were we going to have to watch the same 25 minutes again?
No one from the theater company appeared, no one said anything. The cost of the ticket apparently doesn’t include being in the loop.
Eventually people started walking out. My wife and I included.
As we walked out into the bright hallway, we squinted and noticed a small congregation of people way at the end of the hall. It felt like finally spotting land after having been at sea for awhile
We walked up. There were about eight of us, and two of them. They worked here. We asked what was going on, they didn’t know. They didn’t know how to fix the sound, there was no technical staff on duty, and all they could think of was to restart that movie to see if that fixed it.
We asked if they were planning on telling the people in the theater what was going on. It never occurred to them. They dealt with movies, they didn’t deal with people.
We asked for a refund. They pointed us to the box office. We went there and asked for a refund. The guy told us no problem, but he didn’t have the power to do that. So he called for a manager. The call echoed. Everyone looked around.
Finally a manager came over. We asked for a refund, he said he could do that. We told him we purchased the tickets through Fandango, which complicated things. Dozens of people lined up behind us. The refund process took a few minutes.
Never a sorry from anyone. Never even an acknowledgment that what happened wasn’t supposed to happen. Not even a comforting “gosh, that’s never happened before” lie. It was all purely transactional. From the tickets themselves, to the problem at hand, to the refund process. Humanity nowhere.
We left feeling sorry for the whole thing. The people who worked at the theater weren’t trained to know how to deal with the problem. They probably weren’t empowered to do anything about it anyway. The technical staff apparently doesn’t work on the premises. The guy at the box office wanted to help, but wasn’t granted the power to do anything. And the manager, who was last in the line of misery, to have to manually, and slowly, process dozens of refunds on his own. No smiles entered the picture.
This is the future, I’m afraid. A future that plans on everything going right so no one has to think about what happens when things go wrong. Because computers don’t make mistakes. An automated future where no one actually knows how things work. A future where people are so far removed from the process that they stand around powerless, unable to take the reigns. A future where people don’t remember how to help one another in person. A future where corporations are so obsessed with efficiency, that it doesn’t make sense to staff a theater with technical help because things only go wrong sometimes. A future with a friendlier past.
I even imagine an executive somewhere looking down on the situation saying “That was well handled. Something went wrong, people told us, someone tried to restart it, it didn’t work. People got their refunds. What’s the problem?” If you don’t know, you’ll never know.
Back in November, we noticed something odd happening with large uploads to Amazon S3. Uploads would pause for 10 seconds at a time and then resume. It had us baffled. When we started to dig, what we found left us with more questions than answers about S3 and AWS networking.
We use Amazon S3 for file storage. Each Basecamp product stores files in a primary region, which is replicated to a secondary region. This ensures that if any AWS region becomes unavailable, we can switch to the other region, with little impact to users uploading and downloading files.
Back in November, we started to notice some really long latencies when uploading large files to S3 in us-west-2, Basecamp 2’s primary S3 region. When uploading files over 100MB, we use S3’s multipart API to upload the file in multiple 5MB segments. These uploads normally take a few seconds at most. But we saw segments take 40 to 60 seconds to upload. There were no retries logged, and eventually the file would be uploaded successfully.
For our applications that run on-premise in our Ashburn, VA datacenter, we push all S3 traffic over redundant 10GB Amazon Direct Connects. For our Chicago, IL datacenter, we push S3 over public internet. To our surprise, when testing uploads from our Chicago datacenter, we didn’t see any increased upload time. Since we only saw horrible upload times going to us-west-2, and not our secondary region in us-east-2, we made the decision to temporarily promote us-east-2 to our primary region.
Now that we were using S3 in us-east-2, our users were no longer feeling the pain of high upload time. But we still needed to get to the bottom of this, so we opened a support case.
Our initial indication was that our direct connections were causing slowness when pushing uploads to S3. However, after testing with mtr, we were able to rule out direct connect packet loss and latency as the culprit. As AWS escalated our case internally, we started to analyze the TCP exchanges while we upload files to S3.
The first thing we needed was a repeatable and easy way to upload files to S3. Taking the time to build and set up proper tooling when diagnosing an issue really pays off in the long run. In this case, we built a simple tool that uses the same Ruby libraries as our production applications. This ensured that our testing would be as close to production as possible. It also included support for multiple S3 regions and benchmarking for the actual uploads. Just as we expected, uploads to both us-west regions were slow.
region user system total real
us-east-1: 1.894525 0.232932 2.127457 ( 3.220910)
us-east-2: 1.801710 0.271458 2.073168 ( 13.369083)
us-west-1: 1.807547 0.270757 2.078304 ( 98.301068)
us-west-2: 1.849375 0.258619 2.107994 (130.012703)
While we were running these upload tests, we used tcpdump to output the TCP traffic so we could read it with Wireshark and TShark.
When analyzing the tcpdump using Wireshark, we found something very interesting: TCP retransmissions. Now we were getting somewhere!
Analysis with TShark gave us the full story of why we were seeing so many retransmissions. During the transfer of 200MB to S3, we would see thousands of out-of-order packets, causing thousands of retransmissions. Even though we were seeing out-of-order packets to all US S3 regions, these retransmissions compounded with the increased round trip time to the us-west regions is why they were so much worse than the us-east regions.
What’s interesting here is that we see thousands of our-of-order packets when transversing our direct connections. However, when going over the public internet, there are no retransmissions or out-of-order packets. When we brought these findings to AWS support, their internal teams reported back that “out-of-order packets are not a bug or some issue with AWS Networking. In general, the out-of-order packets are common in any network.” It was clear to us that out-of-order packets were something we’d have to deal with if we were going to continue to use S3 over our direct connections.
Thankfully, TCP has tools for better handling of dropped or out-of-order packets. Selective Acknowledgement (SACK) is a TCP feature that allows a receiver to acknowledge non-consecutive data. Then the sender can retransmit only missing packets, not the out-of-order packets. SACK is nothing new and is enabled on all modern operating systems. I didn’t have to look far until I found why SACK was disabled on all of our hosts. Back in June, the details of SACK Panic were released. It was a group of vulnerabilities that allowed for a remotely triggered denial-of-service or kernel panic to occur on Linux and FreeBSD systems.
In testing, the benefits of enabling SACK were immediately apparent. The out-of-order packets still exist, but they did not cause a cascade of retransmissions. Our upload time to us-west-2 was more than 22 times faster than with SACK disabled. This is exactly what we needed!
region user system total real
us-east-1: 1.837095 0.315635 2.152730 ( 2.734997)
us-east-2: 1.800079 0.269220 2.069299 ( 3.834752)
us-west-1: 1.812679 0.274270 2.086949 ( 5.612054)
us-west-2: 1.862457 0.186364 2.048821 ( 5.679409)
The solution would not just be as simple as just re-enabling SACK. The majority of our hosts were on new-enough kernels that had the SACK Panic patch in place. But we had a few hosts that could not upgrade and were running vulnerable kernel versions. Our solution was to use iptables to block connections with a low MSS value. This block allowed for SACK to be enabled while still blocking the attack.
$ iptables -A INPUT -p tcp -m tcpmss --mss 1:500 -j DROP
After almost a month of back-and-forth with AWS support, we did not get any indication why packets from S3 are horribly out of order. But thanks to our own detective work, some helpful tools, and SACK, we were able to address the problem ourselves.
Can you believe we used to willingly tell Google about every single visitor to basecamp.com by way of Google Analytics? Letting them collect every last byte of information possible through the spying eye of their tracking pixel. Ugh.
But 2020 isn’t 2010. Our naiveté around data, who captures it, and what they do with it has collectively been brought to shame. Most people now sit with basic understanding that using the internet leaves behind a data trail, and quite a few people have begun to question just how deep that trail should be, and who should have the right to follow it.
In this new world, it feels like an obligation to make sure we’re not aiding and abetting those who seek to exploit our data. Those who hoard every little clue in order to piece of together a puzzle that’ll ultimately reveal all our weakest points and moments, then sell that picture to the highest bidder.
The internet needs to know less about us, not more. Just because it’s possible to track someone doesn’t mean we should.
That’s the ethos we’re trying to live at Basecamp. It’s not a straight path. Two decades of just doing as you did takes a while to unwind. But we’re here for that work.