Feeling Safe Across Data Centers

The Single Server Room

In a perfect world, all of your servers would be hard-wired to each other in a single room, with one opening through which the world connects to your little slice of the Internet. That way, all of your cross-service communication into databases, infrastructure and other services would happen within a single set of directly connected servers.

In the modern world we often have to reach across the Internet to access services and applications. This can feel awkward and presents some unique problems. However, there are a few techniques and patterns you can use to make it a little less frightening. Let’s talk through some of the bigger concerns and what you can do about them.

Security

One big reason reaching across data centers can be an issue, even for first-party systems, is security. You lose a base level of safety when you send data and commands across the Internet, where anyone along the path can inspect your requests and try to decode the data or alter what’s being sent. All someone needs is a basic understanding of networking, an attack technique like man-in-the-middle, and perhaps Wireshark to unpack your cross-service request, see sensitive data, tinker with it and forward an altered request to its final destination. Fear not, however: there are some standard techniques to mitigate this risk:

1. SSL

Always communicate over SSL (TLS) when you’re sending requests back and forth between your systems. This is the straightforward, standard way to secure communication between two services or entities on the Web. Under the hood, SSL uses public/private key encryption to establish a session key that encrypts the body of a request between the two parties. Reddit, Facebook and all of your financial institutions use SSL (HTTPS) to communicate with your browser, and likely when they communicate between internal services. It’s also become far easier and cheaper (free) to get SSL certificates for your own services, thanks to organizations like Let’s Encrypt.
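To make this concrete, here’s a minimal sketch in Python of calling a hypothetical internal service over HTTPS with certificate verification left on. The URL is made up for illustration:

```python
import requests

# Hypothetical internal endpoint; the URL is illustrative only.
INTERNAL_API = "https://billing.internal.example.com/v1/invoices"

# requests verifies TLS certificates by default (verify=True). Leaving it on
# means the connection fails loudly if someone intercepts the traffic with a
# certificate we don't trust, instead of silently exposing the payload.
response = requests.get(INTERNAL_API, timeout=5, verify=True)
response.raise_for_status()
print(response.json())
```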

2. Request Signing

While communication over SSL is reasonably secure, it can fail. Or perhaps you don’t need SSL to prevent snooping, but you do want to ensure the data wasn’t tampered with. At Highrise we decided to use a draft standard currently being worked on at the IETF that outlines a method for signing a request. You pick an encryption algorithm and a set of keys you configure, and use them to produce a formal verification of the content of your request. Let’s say I want to ensure that the Digest, Authorization and Date headers were never altered. Following this protocol I would: set up the request, compute the signature over the specified headers (using the signing keys), add the signature to the request and execute the request. The standard lets you specify which keys were used to sign the request (via a keyId parameter), which headers were signed, and which algorithm did the signing. The recipient server can use this information to verify that the contents of those headers were not altered in transport. The details of this freshly forming protocol go a fair bit deeper and are worth understanding; there will be a follow-up post on the topic shortly.
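As a rough sketch of the flow, and not an exact implementation of the draft, here’s what signing a couple of headers with HMAC-SHA256 could look like in Python. The key id, secret, payload and header names are all made up for illustration; a real implementation should follow the draft’s exact canonicalization rules:

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

# Hypothetical key id and shared secret; in practice these are provisioned
# out of band and rotated regularly.
KEY_ID = "highrise-mailer-key-1"
SECRET = b"not-a-real-secret"

def sign_headers(headers, header_names, key_id=KEY_ID, secret=SECRET):
    # The signing string is each covered header on its own line as
    # "name: value" (lowercased names, in the listed order). The recipient
    # rebuilds the same string to verify nothing changed in transit.
    signing_string = "\n".join(
        f"{name.lower()}: {headers[name]}" for name in header_names
    )
    mac = hmac.new(secret, signing_string.encode("utf-8"), hashlib.sha256)
    signature = base64.b64encode(mac.digest()).decode("ascii")
    covered = " ".join(name.lower() for name in header_names)
    return (
        f'keyId="{key_id}",algorithm="hmac-sha256",'
        f'headers="{covered}",signature="{signature}"'
    )

body = b'{"note": "signed request example"}'
request_headers = {
    "Date": formatdate(usegmt=True),
    "Digest": "SHA-256="
    + base64.b64encode(hashlib.sha256(body).digest()).decode("ascii"),
}
# Attach the signature so the recipient can verify Date and Digest.
request_headers["Signature"] = sign_headers(request_headers, ["Date", "Digest"])
```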

These two protocols give us much stronger confidence in what we send over the wire to other services.

Reliability

Slow responses from external services due to network fluctuations, as well as actual downtime, are facts of a cross-data-center world. Both types of issues can compound and start making whole services virtually unusable. You often won’t be able to stop these things from happening, so you have to prepare for them. Let’s talk about four mitigation techniques:

1. Local caches: avoid making requests

Caching, or intelligently deciding when to make a cross-service request at all, can cut down on the number of requests you actually need to make. Things like ETags and expiration headers can help with this, as can simply not requesting data unless you absolutely need it to accomplish your task. If the thing didn’t change since the last time it was requested, let the client reuse the data it already has.
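Here’s a small, hypothetical sketch of the conditional-request idea using ETags, with a toy in-memory cache standing in for whatever cache store you actually use:

```python
import requests

# Toy in-memory cache keyed by URL; a real one would live in Redis or
# memcached and respect Cache-Control / Expires headers too.
_cache = {}  # url -> (etag, body)

def fetch_with_etag(url):
    headers = {}
    cached = _cache.get(url)
    if cached:
        # Ask the server to skip the body if nothing changed since last time.
        headers["If-None-Match"] = cached[0]
    response = requests.get(url, headers=headers, timeout=5)
    if response.status_code == 304 and cached:
        return cached[1]  # unchanged: reuse the data we already have
    response.raise_for_status()
    etag = response.headers.get("ETag")
    if etag:
        _cache[url] = (etag, response.content)
    return response.content
```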

2. Timeout Retry

I mentioned earlier that slow responses from external services can create a compounding problem for your system. You can mitigate this risk by planning for it to happen and wrapping specific patterns around your communication. Specifically, set reasonable timeouts when you make external requests. One problem with a timed-out request is that you can’t tell whether it ever reached the server, so you should make your endpoints idempotent whenever possible. Idempotent endpoints also make retries simpler, since you can keep hitting the endpoint without fear of unexpected side effects. Finally, you should progressively slow down rescheduling the request, both to give the struggling system time to recover and to avoid hammering the service. This is called exponential back-off.

At Highrise, certain important requests have a timeout of around 1 second. If a request fails it will be retried 3 times before we stop trying and start messaging our team about the issue. Each retry is scheduled further out, per the exponential back-off algorithm: 3 seconds after the first failure, 9 seconds after the second and 27 seconds after the third. In cases where the request does something like send an email, idempotency becomes a very serious concern, so that retries don’t send the exact same email 3 times. You can accomplish that with a key the server uses to decide whether the operation has already been performed.
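A rough sketch of that scheme might look like the following. The endpoint and header name are made up, and the sleep stands in for rescheduling a background job, which is closer to what’s described above:

```python
import time
import uuid
import requests

RETRY_DELAYS = [3, 9, 27]  # seconds between retries: exponential back-off

def send_email(payload):
    # One idempotency key for the whole sequence of attempts, so the mail
    # service can recognize retries of the same logical operation and avoid
    # sending the same email twice. (Header name is illustrative.)
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt, delay in enumerate([0] + RETRY_DELAYS):
        if delay:
            time.sleep(delay)  # a job queue would reschedule instead of sleeping
        try:
            response = requests.post(
                "https://mailer.internal.example.com/send",  # hypothetical endpoint
                json=payload,
                headers=headers,
                timeout=1,  # fail fast so a slow service doesn't back up our workers
            )
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == len(RETRY_DELAYS):
                raise  # out of retries: surface the error and message the team
```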

3. Circuit Breakers

Circuit Breakers paired with timeouts can help you both better handle full-service degradation and provide a window for recovery. A Circuit Breaker basically lets you define a set of rules that say when the breaker should “trip.” When a breaker trips you skip over an operation and instead respond with “try again later please,” re-queue a job or use some other retry mechanism. In practice at Highrise, we wrap requests to an external service in a circuit breaker. If the breaker trips due to too many request timeouts, we display a message to any users trying to access functionality that would use that service, and put on hold any jobs that use it. Jobs that were in-flight will presumably fail and be retried as usual. A tripped breaker stays tripped for several minutes (a configured value) and thus keeps us from hammering a service that may be struggling to keep up. This gives Operations some breathing room to add servers, fix a bug or simply allow network latency to recover a little.
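Here’s a bare-bones sketch of the idea, not our actual implementation: count consecutive failures, trip after a threshold, and stay open for a cool-off period before letting requests through again.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips after `max_failures` consecutive
    failures and stays open for `reset_after` seconds before allowing
    traffic through again."""

    def __init__(self, max_failures=5, reset_after=300):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Cool-off elapsed: close the breaker and try the service again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def call(self, func, *args, **kwargs):
        if not self.available():
            raise RuntimeError("circuit open: try again later")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # a success resets the failure count
            return result
```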

4. Upcheck

Upchecks, health checks and the like are very useful for getting a basic understanding of whether you can reach a service. Libraries standardize some of this for you so you don’t have to think much about what to provide. Really, what you want to know is whether you can reach the service and whether its basic functions are operational. Upchecks paired with a circuit breaker can help you decide whether to show a maintenance page or to skip jobs that won’t work at the moment. These checks should be extremely fast. At Highrise, for our first-party external services, we check once per web request for the liveness of the feature about to be accessed. Again, let’s say we have an external emailing service. If someone goes to the email feature, we wouldn’t check in the code, at each email operation, that the service is up. Instead, we check at the beginning of the web request whether the email service is up. If it is, we continue to the feature; if it isn’t, we display a basic “down, please try later” message.
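A minimal sketch of that per-request check might look like this, with a made-up /up path and service URL:

```python
import requests

def email_service_up(base_url="https://mailer.internal.example.com"):
    """Cheap liveness probe hit once at the start of a web request.

    The /up path and URL are illustrative; the real endpoint should do just
    enough work to prove the service's basic functions are alive."""
    try:
        response = requests.get(f"{base_url}/up", timeout=0.5)
        return response.status_code == 200
    except requests.RequestException:
        return False

# At the top of the email feature's request handler:
#   if not email_service_up():
#       return render("Email is down, please try again later")
```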

Act like it isn’t yours

When it comes to an external service, even one you wrote, you have to act like you have no control over it. You can’t assume any external service will always operate normally. The reality is you have limited control, so you have to design a system that explains issues to your users as they happen and mostly recovers on its own. Protect what you can control, and avoid requiring humans to repair these kinds of issues. The more your computers can recover on their own, the more you can worry about the next feature, the next user or the next beer.


I’m a Software Engineer at Highrise (@highrise). Follow me on Twitter to tell me what you think, as well as find more ramblings about Software and the world by me @jphenow.

Balance Driven Development


I mentioned in my last post that I would talk about my opinions on TDD, so here it is. Kicking it off, I will explain what TDD is and how it’s meant to work. Then I’ll explain what some people have said about it and talk about what I believe the real benefits of TDD are. Finally, I’ll walk through whether I think it’s worth using and explain my use of the practice. Oh, and I’ll also provide a disclaimer as to what the heck possessed me to pile onto this already well-discussed topic.

TDD stands for “Test Driven Development.” At its core, it’s a development practice: a way to approach writing code. The rules of how to practice TDD are fairly simple on the surface. Say you have a new function that you need in order to accomplish a task: write the smallest test you can imagine, run the test, watch it respond with a failure, write the smallest possible amount of code to make that test pass, and repeat until the necessary functionality is complete. People reference a number of benefits of that process.
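To make the loop concrete, here’s a toy example in Python with a made-up slugify helper; the point is the rhythm, not the code:

```python
# Red: write the smallest failing test you can imagine for code that
# doesn't exist yet (pytest-style; `slugify` is a hypothetical helper).
def test_slugify_replaces_spaces_with_dashes():
    assert slugify("hello world") == "hello-world"

# Green: write the smallest possible amount of code to make that test pass.
def slugify(text):
    return text.replace(" ", "-")

# Repeat: the next test (lowercasing, trimming punctuation, ...) drives the
# next small change, until the functionality you actually need is complete.
```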

The main benefit I see TDD-promoters reference is test coverage. Since, with TDD, testing is part of how you write code, you just get more tests that are very well tied to the logic inside your functions. That test coverage paired with ongoing use of the practice tends to make new development less frightening because you have pretty high confidence that your code is covered and will alert you to unexpected behavior changes.

One counter-argument to the test coverage benefit is that the immense depth at which this practice covers your code results in brittle tests. Growing your test code faster than your app code can make it increasingly difficult to change the app without spending many more hours rejiggering your tests. So, while you may have higher confidence in your app at one point, by the time feature additions have forced you to redo much of your testing you’re in kind of a ¯\_(ツ)_/¯ state. So much so that by the time you’re done getting the tests green, you can’t tell whether you’ve fixed the tests properly or just made them look green.

As Software Engineers we like to find processes and tools that allow us to remove blame and responsibility from the human. We want the computer and process to protect us and keep us in a safe zone. I think both of the above arguments are trying to achieve that same end of some kind of safe zone. TDD, in terms of the testing benefit that’s often referenced, would like to keep us in a zone of constant “yes it works.” The anti-TDD position explains a world where the process potentially slows down our ability to progress and potentially hurts our confidence in new functionality due to lots of changing tests.


One of the tricks with the name TDD is that it implies that tests are the benefit, when in fact they’re simply a vehicle for development. I actually liken tests from the TDD process to CO2: they’re there to move things forward, useful for that, but otherwise need to be cleaned up once their purpose has been served. That is to say, a lot of the tests I write during a TDD exercise are meant to be deleted at the end. I tend to use those TDD tests to help write new tests that are intended to live on with the app for regression and documentation. They often even look very similar, but now I’m writing tests to last rather than tests to drive development. These are fundamentally different mindsets.

I’ve mentioned that TDD is more of a vehicle for development. The effect TDD can have on the design of your code seems to be the most overlooked benefit of the practice, while still the most important, in my mind. When I’m writing code without some sort of test, I just add things and out comes the functionality; I’m worried about how I can most easily get the feature done. When I’m writing from the perspective of a test, I’m writing as if I’m a user of the code I need. That alters my mentality about what needs to be written: I’m basically defining an API that I can test and understand. This is much different from writing a bunch of things that technically work and then having to explain that API in reverse in order to write the tests.

With code/API design and test coverage being the main arguments for or against TDD, let’s talk about what I do. I tend to think that when there are two big camps of people shouting for or against an idea, the truth, or at least the best path, lies somewhere in the middle. I think what you can learn about your code for future testing and for understanding design is an enormous win. I don’t feel that my longer-term regression tests come from the practice, though. So after working with TDD for quite some time, I now lean on it as a tool in my toolbox. Generally, I remove those initial tests and move on with life. One thing I will give TDD is that the design mentality around understanding what your external API looks and feels like has ultimately changed how I write code, regardless of whether I’m actually living by the practice. Now, when I write code I tend to sketch out the API I’d expect to use, then aim to fill it in. It’s the same idea as TDD in terms of thinking from the other end, with less rigidity.

So what should you do? If you’ve never tried it, don’t just listen to me or to the others on the all-knowing Internet. Try it! I still think it’s a practice worth doing for a little while, if only to understand it yourself and develop your own opinion. Your opinion may be different from others’, and that’s ok. If it helps you make cool things or enables you or your company to make money, then that’s awesome; keep doing that.


I recognize that I’m beating a proverbial dead horse. Everyone and their mother has already written about their feelings on TDD, and I even recycled a lot of those same arguments here. I decided to publish my own post on the topic because I feel like I’ve found a place somewhere in the middle of the argument. I think a lot of people tend to focus on picking sides, and I wanted to explain that you can use it as a tool: a tool in your belt. It doesn’t have to be your whole world, but if you prefer that tool over another, great!

Also, in the last month several people have asked about my feelings on the topic, so I figured I’d compile them in this format for reference.


Thanks for listening. I work for Highrise HQ building a better CRM. If you liked this, you might check out my Twitter for more silly opinions and feels on tech and society.

Failure as Progress


Failure has a negative connotation, and reasonably so; it literally means that the thing you set out to do did not happen or happened incorrectly. Looking at each step and being distressed by failure is not only a drag, it’s also unproductive. Altering your perception of and response to failure can make life an entirely more enjoyable and fruitful endeavor.


SpaceX is a private space exploration and technology company. Basically, they launch fancy rockets and eventually want to take us to Mars for colonization.

To make a Mars trek a reality, they have to bring the cost of a ticket down to the point where they can sell to non-millionaires. Rockets today are virtually disposable: they’re launched, debris falls into the ocean and that’s it. One big way to lower the cost of a trip is to make rockets actually reusable, more akin to how we think of airplanes now. Landing a large needle-shaped rocket isn’t that simple a task, though. They have to correct for speed, angle, wind conditions, everything. By my count they’ve attempted 12 propulsive landings; 3 went famously well, and the others had varying levels of success. Many were considered successful simply because they were able to collect a ton of data and correct their systems for the next flight.

“We are not expecting a successful attempt this time around, but we are learning” — John Federspiel (Lead Mechanical Design at SpaceX)

SpaceX didn’t set out to make rocket landings possible, work for several years and then output a rocket that can land upright out on the ocean. They worked for several years using what they knew as a launchpad for new questions and experiments. They measured at every step and used those measurements to make decisions about the next mission. The key to the progress is to “commit to failing in a new way.”


In order to succeed, or fail enough to succeed, you first need to define success. One of the biggest mistakes I’ve seen is a project with significant work done on it and no understanding of what the ultimate goal is. You’re simply chipping away at something, but you have no clue what. In order to fail at a thing, or succeed and move on, you have to have a tangible goalpost you’re working towards, with a definition of “it worked.”

A concrete example of this is writing: let’s say I’d like to write more. I’ve told people I’m trying to write more, and initially I did. A month or two went by and someone asked, “how’s the writing going?” and I responded, “It’s going ok.” And that’s the end of it. What’s wrong with that? Well, aside from it not really going at all, I had no measurable concept of whether the writing was going well. I had no baseline to measure against.

In an alternate universe, when I’ve decided to improve my writing, I write up that I’d like to output one post per week and publish somewhere at least once a month for the next 6 months. This time, when a friend asks how my writing has been going, I can loosely compare my actual writing to my goals and decide whether I’ve been failing at writing. If I have been failing I can adjust my expectations or better define my goals.

The time boxing of a goal is also important. If you have a supposedly infinite amount of time to succeed or fail, then you never fail. You’re really just in a constant state of in-progress until you “succeed,” which, aside from being unrealistic, doesn’t teach you anything because you don’t get to learn from any sort of failure.


In Software Engineering there’s a well known development technique called TDD — Test Driven Development. Rather than writing code and then testing it, you write a test and then write the code that makes that test pass. One reason people find it useful is that finding proof that your code works comes fairly naturally.

To an engineer, TDD can sound a little backwards: maybe I don’t know exactly what I need to write, so how should I know what test to write? TDD implies that if you’re writing code before you know what you need, then you haven’t thought about your problem. The act of defining success for your code forces you through the motions of understanding what you need to learn from it: what isn’t there now that needs to be there? TDD can instill a sense of constant failure. You write a test that will fail, watch it fail, make it not fail. It provides a quick understanding that a big red “F” isn’t a bad thing; now we know something, definitively, that we didn’t know before.


The problem with success sans failure is that you don’t yet know the range of possible failures. I prefer to have failed at some point along the way, because at least then you’ve picked up a more concrete understanding of the problem at hand. You might call SpaceX a stellar example: do enough of what you know is feasible while experimenting with things that will fail initially. Learn from acceptable amounts of failure while succeeding enough to stay afloat.

When I was a junior software engineer I feared some of the changes I was making. I feared them not working properly, and I feared critiques, initially. I came to learn that all the critique, the failed tests, even errors in production were just facts of progress. Remove the fear, jump in the deep end and learn. It’s a lot more fun, and you’ll find that you understand everything far better than if you had spent 10 times the amount of time trying to just “get it right.”


I’m a Software Engineer at Highrise. We build a Simple CRM. Check out my rants and recommendations on Twitter.

A couple of footnotes for the curious. Regarding any research I did on SpaceX, most of the inspiration and some of my basic understanding comes from Wait But Why. Do check them out, and if you’re curious about the SpaceX article specifically, check that out here. There’s also some great stuff in their latest (5/6/2016) launch stream video. My count of launches comes from the well-groomed controlled-descent Wikipedia article. I have some feelings about TDD, but I didn’t want to muddy the point of this article with them. I will write more explicitly about TDD (because I feel left out), but if you’re interested, some assessments of the practice on c2 (more of a collection) and some opposing viewpoints from David Heinemeier Hansson are worth checking out.