Respecting privacy at Basecamp

I spend a lot of time as a data scientist thinking about how to use data responsibly, particularly when it comes to privacy. There’s tremendous value to be found by analyzing data, but the only way the data science field will continue to have data to analyze is if we are responsible in how we use it.

As a company, Basecamp strives to have the respect for user privacy that we’d like in every service we personally use.

I could talk about the things that we do relating to privacy:

  • We have a plain English privacy policy, and expect to be GDPR compliant by the deadline.
  • We use encryption for all communications between Basecamp and your browser, and we encrypt our backend services as much as is practical.
  • When you cancel, we delete your account and all your data.
  • We purge log data and database backups after 30 days.

But I think our privacy philosophy is better defined by the things that we don’t do.

We don’t access customer accounts unless they ask.

The only time we’ll ever put ourselves into a position to see a customer’s account is if they grant explicit permission to do so as part of a support ticket. We log and audit all such access.

We don’t look at customer identities.

Many companies, especially startups, review every signup manually and reach out to interesting looking customers. I get lots of these emails, and every one leaves me unsettled.

Tons of companies will also use the fact that you signed up as permission to identify you as a customer for marketing purposes. Over the years, I’ve had to ask no fewer than a dozen companies to remove Basecamp from their marketing material.

I find both of these practices to be distasteful. There’s no reason I, or anyone else here, needs to know the names of people who are signing up for Basecamp. It’s unnecessary.

We don’t share customer data.

There are a few aspects to this, but our basic premise is that it’s your data, and not ours, so we shouldn’t be sharing it.

We get lots of people writing us from big companies asking “does anyone else at Acme use Basecamp?” or people asking “can you tell me any companies in our industry that use Basecamp?”. Just like we don’t look at identities ourselves, we also don’t disclose them to people who ask.

We’ll only provide customer data to law enforcement agencies in response to court orders. Unless specifically prohibited from doing so, we’ll always inform the customer of the request.

It should go without saying, but we don’t sell customer lists or any other data to anyone.

We don’t look at identifiable usage data.

To make Basecamp better, we do analyze usage patterns, and we have instrumentation to enable us to do that. This inherently requires us to in some form look at what people are doing when they’re using Basecamp.

Where we draw the line is that we never look at identifiable usage data. Any data that we use for analysis is stripped of all customer provided content (titles, message or comment bodies, file names, etc.), leaving only metadata, and it’s blinded to remove identifiable information like user IDs, IP addresses, etc. We try to do these things in such a way that it’s impossible for anyone analyzing data to even accidentally have access to anything identifiable.
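
As a rough illustration of the stripping and blinding described above, here is a minimal Python sketch. The field names, the `blind_event` helper, and the choice of a keyed hash (HMAC) are all assumptions made for the example; they are not a description of Basecamp's actual pipeline.

```python
import hashlib
import hmac

# Hypothetical field names; a real event schema will differ.
CONTENT_FIELDS = {"title", "body", "file_name"}   # customer provided content
DROPPED_FIELDS = {"ip_address", "email"}          # identifiers with no analytic use
BLINDED_FIELDS = {"user_id", "account_id"}        # identifiers replaced by opaque tokens


def blind_event(event: dict, secret_key: bytes) -> dict:
    """Return a copy of `event` that keeps only non-identifying metadata."""
    blinded = {}
    for field, value in event.items():
        if field in CONTENT_FIELDS or field in DROPPED_FIELDS:
            continue  # never copy content or direct identifiers
        if field in BLINDED_FIELDS:
            # A keyed hash keeps one user's rows groupable without exposing the ID.
            digest = hmac.new(secret_key, str(value).encode(), hashlib.sha256)
            blinded[field] = digest.hexdigest()[:16]
        else:
            blinded[field] = value
    return blinded


raw = {
    "user_id": 42,
    "ip_address": "203.0.113.7",
    "title": "Q3 launch plan",
    "event_type": "todo_created",
    "hour_of_day": 14,
}
safe = blind_event(raw, secret_key=b"example-key-kept-away-from-analysts")
print(safe)
```

In a setup like this, the key would have to live somewhere analysts cannot reach; otherwise the tokens could be recomputed from known IDs and the blinding would be cosmetic.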

This choice to never look at any identifiable data (or even be able to) does place minor constraints on the analyses we can perform, but so what? There’s plenty of value left in what we can do. My job might be a little bit harder, but I’m happy to spend the extra effort to be respectful of customers’ privacy.

We don’t send customer data to third party services.

As much as possible, we avoid the use of third party services that require any customer data to pass through them. There are many cases of such tools capturing too much, and we can’t control what happens with data once it reaches them.

There are a few cases where we do use third party services, which I’m happy to disclose:

  • We use Amazon Web Services and Google Cloud Platform to host some parts of our applications. In those cases, we use available encryption options to prevent the platform provider from having access to the underlying customer data.
  • We use third party analytics tools (currently Google Analytics and Clicky) on public facing websites only. They capture IP addresses, etc., but are not put in any place where they could capture user provided content.
  • We use a third party helpdesk tool for answering support cases (HelpScout). This means that HelpScout has any data that gets sent in a support ticket.
  • We use third party tools for sending some emails (MailChimp and, which have access to customer email addresses and metadata required to know when to send an email. We don’t send any customer provided data to either service.
  • We use third party CDNs (Akamai and Cloudfront) for serving static assets. Those services have access to IP addresses, etc.

We don’t want you to feel creeped out.

At the end of the day, this is the bottom line. We don’t want to do anything that feels creepy or that we wouldn’t want done with our data.

We know that you’re putting your trust in us when you use Basecamp, and we want to do everything we can to honor and live up to that trust.

What does Team Data do at Basecamp?

Basecamp’s “team data” recently doubled in size with Justin White joining us full-time as a programmer. We’ve been in the data business at Basecamp for over six years, but the occasion of hiring a second team member caused me to reflect on why Team Data exists, what it does, and how we work.

A simple objective: make Basecamp better

We’re basically interested in three things on Team Data:

  1. Make Basecamp-the-product better to help our customers achieve their goals.
  2. Make Basecamp-the-company a great place to work.
  3. Make Basecamp-the-business successful.

These are really the same fundamental objectives every team at Basecamp is working towards, and each team has their own specific angle on this. The support team focuses on how their interactions with customers can achieve these goals, the design and development teams focus on how the functionality of Basecamp itself can achieve these goals, etc.

On the data team, we primarily attempt to use quantitative information to achieve these goals. Our approach isn’t necessarily the best approach for every problem, but it’s the angle we take on things. If we can’t address a specific question or problem with some sort of number, we probably aren’t the best team to answer it, and we’ll gladly defer to the perspective others can bring to bear.

What we do

Pretty much everything we do on Team Data falls into one of two categories:

  1. We answer questions about Basecamp using data and make recommendations. The questions we tackle span a wide range: from specific questions about how a feature is used, to understanding how a change we made impacted signups, to open questions about how we can improve some aspect of business performance.
  2. We build infrastructure and tools to a) support our ability to answer the questions above, and b) help others at Basecamp accomplish their work more effectively.

We occasionally do things that don’t fall into either of those categories, but the core of what we do falls into either analysis or infrastructure.

A sampling of our work over the past few months includes:

  • Analyzing the performance of a new marketing site and account setup process.
  • Improving the internal dashboard app that powers many of our internal tools by removing thousands of lines of dead code and upgrading to a modern version of Rails.
  • Helping design, implement, and analyze a dozen A/B tests.
  • Migrating our data infrastructure from on-premise hardware to cloud-based services.
  • Analyzing the sequencing of notifications sent by Basecamp and recommending ways to adjust timing.

Things we believe about ourselves

Every team at every company has a set of beliefs about how they work, whether they are aware of them, acknowledge them, or codify them. Here on team data, there are a few tenets that we try to embody that we’ve taken the time to write down:

  1. We are scientists. Wherever possible, we apply the scientific method to solving problems, whether through analysis or engineering.
  2. We are objective. There’s no agenda on team data other than seeking the truth; we report the facts whether we like them or not.
  3. We try for simple. We don’t use a machine learning model when a heuristic will do, and we don’t write complicated programs when a simple `awk` one liner will work.
  4. We are rigorous. When a problem demands a nuanced understanding or when data needs to be high quality, we stick to those requirements. We’d rather over-explain a complicated situation than over-simplify it.
  5. We are technology and tool agnostic. Ruby, Go, Scala, R, Python — whatever the best tool for the job is. When possible, we use open-source or third-party tools, but we’ll build what’s needed that isn’t otherwise available.
  6. We collaborate, engaging in peer review of analysis and code.

We don’t hit all of these points every day, but they’re the aspiration we’re working towards.

How we work

Unlike the core product teams at Basecamp, we don’t explicitly work in six week cycles, and we tend to each have multiple projects under way at any given time. Many of our projects are a couple days or weeks, and some stretch over six months or a year. We might do some instrumentation today and then back burner that for 30 days while we wait for data to collect, or a thorny problem might wait until we figure out how to solve it.

Generally, Justin spends about 80% of his time working on infrastructure and the remainder on analysis, and I spend about 80% of my time on analysis and the remainder on infrastructure. This is mostly about specialization — Justin is a far better programmer than I am, and I have more experience and background with analytics than he has. We don’t hit this split exactly, but it’s our general goal.

We get lots of specific requests from others at Basecamp: questions they’d like answered, tools that would help them do their work, etc., and we also have a long list of bigger projects that we’d like to achieve. We explicitly reserve 20% of our time to devote to responding directly to requests, and we both try to set aside Fridays to do just that.

Anyone can add a request to a todolist in our primary Basecamp project, and we’ll triage it, figure out who is best equipped to fulfill it, and try to answer it. Some requests get fulfilled in 20 minutes; we have other requests that have been around for months. That’s ok — we embrace the constraint of not having unlimited time, and we admit that we can’t answer every question that comes up.

Outside of requests, we collaborate with and lean on lots of other teams at Basecamp. We build some of the tooling that the operations team uses for monitoring and operating our applications, and they provide the baseline infrastructure we build our data systems on. We collaborate with developers and designers to figure out what data or analysis is helpful as they design and evaluate new features. We work closely with people working on improving the onboarding experience through A/B testing, providing advice on experimental design, analysis, etc.

One of the most visible things our team does is put out a chart-of-the-day; some piece of what we’re working on, shared daily with the whole company.

Like the rest of Basecamp, we don’t do daily stand-ups or formal status meetings. Justin and I hop on a Google Hangout once a week to review results, help each other get unstuck on problems, and — since Justin is still relatively new to team data — walk through one piece of how our data infrastructure works and discuss areas for improvement each week. Other than that, all of our collaboration happens via Basecamp itself, through pings, messages, comments, etc.

Sound like fun?

Here’s the shameless plug: If you read the above and it sounds like your cup of tea, and you’re a student or aspiring data analyst, I hope you’ll consider joining us this summer as an intern. You’ll work mostly on the analysis side of things: you’ll take requests off our main request list and projects from our backlog, structure the question into something that can be answered quantitatively, figure out what data you need to answer that question, figure out how to get the data, perform analysis, write up results, and make recommendations.

Let’s Chart: stop those lying line charts

I want to talk about one of the most basic tasks a data analyst will be asked to do on a regular basis: present some data over a period of time.

Let’s look at a chart of monthly sales from Noah’s Imaginary Widget Company. I see charts like this on a regular basis:

A basic line chart, right? Nothing fancy or special about it, just a couple clicks in Excel.

Not so simple: this innocent little chart is actually lying to you in a couple of significant ways.

First, you can’t actually tell where the monthly sales values fall — they fall at even points along the width of the chart, but it’s very difficult for you to mentally place the points there. Let’s fix that:

A little better — I can see actual data points now. This chart is still lying to us though. Let’s zoom in on September and October to see why:

The chart makes it look like sales dipped below their September number and then increased to October. This isn’t actually true, or at least we can’t tell from the data we have — we only have monthly numbers, so we can’t possibly have enough information to say that’s what happened in between those two points.

When you use the “smoothed” lines functionality in Excel, Highcharts, D3 or any other visualization tool, you’re asking the tool to lie for you. It’ll happily fit an equation to make things look smooth, but that’s not representing the data. I wish tools didn’t make it so easy to invent data — I can’t think of a legitimate case where you should use an auto smoothing function like this.
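
To make the dip concrete, here is a small Python sketch that draws a smooth curve through four made-up monthly values. It uses a Catmull-Rom spline as a stand-in for whatever curve fitting a charting tool applies; that choice, the `catmull_rom` helper, and the sales numbers are all assumptions for illustration.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Value of a Catmull-Rom spline between p1 (t=0) and p2 (t=1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


# Made-up monthly sales: August, September, October, November.
aug, sep, octo, nov = 150, 100, 110, 130

# Sample the smoothed segment that runs from September to October.
curve = [catmull_rom(aug, sep, octo, nov, i / 100) for i in range(101)]
lowest = min(curve)
print(f"September: {sep}, October: {octo}, lowest point on the smoothed line: {lowest:.1f}")
```

The curve passes through both real data points, yet its lowest value sits below the September figure. That low point exists only in the curve, not in the data.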

Let’s straighten out those lines:

This is getting better — we no longer imply some perfect mathematical equation that doesn’t exist.

This chart still has a big problem though: by connecting the data points, we imply continuity in the underlying data that doesn’t exist. All we have is monthly data, but when you connect them together, you imply to the viewer that you know what happened in between the points.

For example, zoom in on July, August, and September. At the monthly level, they look like:

Here’s one set of daily data that could make up this monthly data:

Alternately, here’s a different set of daily data that would get you the same monthly trend:

Connecting the monthly data points together sure makes it seem to the viewer more like the former than the latter, but we don’t actually have enough data to make that conclusion. It could just as easily be the latter case, but you’re unlikely to consider that possibility based on the monthly connected line chart.
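
The same point can be shown in a few lines of Python. The daily numbers below are invented for the example; what matters is that two very different months collapse to the same monthly total.

```python
# Made-up daily sales for one 30-day month, in units sold.
steady = [10] * 30          # the same amount every day
spiky = [0] * 29 + [300]    # nothing all month, then one huge day

monthly_steady = sum(steady)
monthly_spiky = sum(spiky)
print(monthly_steady, monthly_spiky)  # both monthly totals are 300
```

A chart built only from the monthly totals cannot tell these two months apart, so a line drawn between monthly points says nothing reliable about the days in between.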

The better visualization here is actually to not use a line chart at all. A bar or column chart better conveys discrete quantities like monthly sales: it’s easier to compare relative quantities visually, and it doesn’t imply continuity in the underlying data where there is none.

Much better.

But Noah, aren’t you guilty of using line charts without truly continuous underlying data?

Yep, I am. When you have high frequency data (like if you have once-per-hour data for a few weeks), even though you’re implying some continuity that doesn’t really exist, it can be much easier to comprehend when you do connect the datapoints.

For example, here’s some actual data that meets that criterion: hourly signups for Basecamp over the last two weeks. The bar chart version isn’t bad, but it’s a little hard to grok at first glance, because there’s so much visually going on at that density:

You can probably get a little better by changing the width of the bars, but the equivalent line chart is, at least to me and most people I’ve talked to, a lot easier to comprehend:

So yes, sometimes I deceive with line charts, but it’s a small lie that I can live with.

What if I really do want smoothed data?

If you want to show “smoothed” data, that’s ok, but you should explicitly decide what sort of transformation you want to apply to “smooth” the data and acknowledge it. Here’s that same signup data with a five hour moving average applied:
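
For reference, a moving average is simple enough to write out by hand. The sketch below assumes a trailing window (each point averages the current hour and the four before it); a centered window would be an equally valid choice. The `moving_average` helper and the signup numbers are made up for the example.

```python
def moving_average(values, window=5):
    """Trailing moving average; the first window - 1 points have no value yet."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for i in range(len(values)):
        if i + 1 < window:
            averaged.append(None)  # not enough history; don't invent a number
        else:
            recent = values[i + 1 - window : i + 1]
            averaged.append(sum(recent) / window)
    return averaged


hourly_signups = [4, 6, 5, 9, 11, 10, 8, 3]  # made-up numbers
smoothed = moving_average(hourly_signups)
print(smoothed)
```

Unlike an automatic smoothing option, this transformation can be stated in one sentence on the chart itself, so the viewer knows exactly what was done to the data.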

This isn’t fancy analysis, and I don’t claim to be Edward Tufte — I put out plenty of bad visualizations myself. This might seem too basic to be worth talking about, but I see this sort of deceitful chart almost every day, both from analysts and in tons of commercial products which use smoothed line charts.

If you like charting, maybe you’d like to try out a daily chart habit — you’ll get lots of practice at making good (and some bad) charts.

Getting your recommended daily chart allowance

About a year ago, I wrote about something I’d recently started doing at Basecamp and a year and over 250 charts later, I’m still at it: every workday, I share a different “chart of the day” with my coworkers at Basecamp.

The charts are just pulled from whatever I’m working on, a question someone asked, or something topical (iOS 10 was released a couple days ago, so yesterday’s chart was about adoption among our users). They can be about anything — marketing, support, operations, performance, usage, the company itself, whatever. I don’t intentionally try to make them extra interesting or visually stunning, and I try to spend no more than ten minutes per day on that day’s chart. I just find a chart and post it in the “Data” project on our Basecamp account.

A few charts from the last year

I had two primary motivations in starting this chart habit:

  1. I wanted to challenge myself to keep things fresh, and to tell stories with data without using more than one chart and a couple sentences.
  2. I wanted to make data more accessible. You shouldn’t have to set aside a half hour to read a report to get a piece of information that can change the way you think.

It’s been a fun challenge to keep this up for over a year, and I’d like to share a few of the things that stand out to me from the last 272 charts.

Isn’t this just chart junk food?

Given how I feel about real-time dashboards and the importance of solving real business problems, I sometimes wonder if these charts are just the data equivalent of junk food.

Maybe they’re a little high in data sugar, but I think they serve a purpose that you don’t get from a dashboard.

Every day is something different. In a year of charting, I’ve never intentionally reused a chart, which means that people have seen over 250 unique slices of data about our business. That’s a breadth that’s hard to achieve any other way.

There’s context. I don’t do a lengthy writeup about each chart, but I write a sentence or two about what the chart shows and why it matters. A chart with no context might just be eye candy, but contextualizing makes it more valuable.

They’re a conversation. I post a chart. People read the chart. Some people applaud it. Some people ask a question that I can try to answer. Some people reference it later. Today’s chart is influenced by yesterday’s chart. Unlike a dashboard or a report, the chart of the day serves as the starting point for a conversation about the challenges we’re facing as a company and the things that people care about.

Making data fun

One of my goals with Chart of the Day is to make working with and thinking about data fun for people. Data isn’t just numbers and long reports; it can also inspire, motivate, intrigue, and make you laugh out loud. While I hope that all of my charts cause joy, there are a couple things that I’ve started doing that are a little more blatant in their aim.

Round numbers

When you do something daily, you’ll inevitably start numbering things, and when you do that, you hit round number milestones, and you’ll be tempted to go a little crazy.

I wanted chart #100 to literally light people up with a look at our growth as a company over more than ten years [pun intended].

Chart #100 now resides at Basecamp headquarters in Chicago

When it came time for chart #200, I was just hungry.

Chart #200 now resides in my stomach

I’ve got about two months to figure out what to do for #300. Your suggestions for the wackiest, most over-the-top chart possible are appreciated.

Fun Chart Fridays

On most Fridays, rather than posting a “serious business” chart about Basecamp, I try for something a little lighter. Fun Chart Fridays are either charts about a less serious aspect of Basecamp (Campfire sounds are a perennial crowd pleaser) or something that I’ve seen elsewhere on the internet that’s interesting or amusing.

Sometimes they’re also a good chance to talk about a way of visualizing or thinking about data that’s a little different than the ordinary, or to contrast different looks at the same piece of data.

Inside the mind of a daily charter

Most days, charting is easy: I copy something from an analysis or report I’m working on and I paste it in Basecamp, write a sentence or two of explanation, and move on with my day.

Then there are days where charting is a real slog — I don’t have anything handy because I’ve been working on infrastructure, or I realize that the chart I was all set to post is actually too similar to something else I’ve posted, or I’m just tired. On those days, I sometimes question why I’m bothering. Does it really matter if I post a chart today? Or any day?

Eventually, I always convince myself that it does matter, because data can change the way people see things. People sometimes don’t even know the question they want to ask, which can make it hard for me to help them, but I can at least put a piece of data in front of them each day and hope that it sparks something in them that leads them to think about a problem a little differently. It sounds corny, but if that leads to us making a single better decision for our customers, that’s all the payback I need to make another chart.

A daily reminder that there are questions we can answer if we look at them the right way is pretty neat. — Jim

I’ve missed days because I’ve totally forgotten, and the chart of the day took a few weeks of summer vacation this year, but I’ve yet to just give up and not post on a day because I didn’t feel like it. That’s a small thing, but it gives me some satisfaction, and I’m going to keep on charting until they pry the x-axis out of my fingers.

These charts help us make Basecamp. You can use Basecamp for your daily charting habit too! If you do, let me know.

Twelve Weeks @ Basecamp: A Summer Tale


I found out about Basecamp’s internship program during the winter break completely on accident. Burned out from the fall semester and unmotivated to get any work done, I gave “data science internship” a cursory Google search and happened to find Basecamp had just posted about looking for summer interns.

After reading through the description and learning about Basecamp’s kill on the cover letter mantra, I knew this application wouldn’t be like others. I meticulously combed over my resume and cover letter to make sure I was making my reviewers’ job as easy as possible. It isn’t just about checking off each requested application component: your application materials should speak to your personality, experiences, and really show the reviewer that you can deliver. Basecamp’s application reviewers had to put in tons of work, and to quote Ann:

Tell us your qualifications
Demonstrate why you’re qualified. Sounds like a no brainer, right? People applied for programming internships without showing us any projects they worked on, or even describing their experience in any depth. We’re not looking for fully formed apps — these are interns after all. Projects for classes are great. Bootcamp projects are great. Simple design portfolios are all we’re looking for.

Lesson learned: don’t give application reviewers even the slightest room to reject you — their review of your materials should be as straight-forward and applicable as it should be, and no more.

(A short aside: I decided to have some fun with my cover letter by including my attempt at digitally sketching Basecamp 3’s Happy Camper mascot. Nobody has said anything to me about it, but I’m convinced this gave me a competitive edge.)

Let’s be real: this is how you kill on a cover letter.

True to his word, Noah reached out to me on March 1st asking me to set up a quick phone interview with someone on the Basecamp team. I was thrilled, excited to reignite my skills, and mostly worried.

My first call was with Eron, one of Basecamp’s Ops extraordinaire (Fun fact: Eron once informed me that I tried querying our internal timeseries database with a start time of 70 seconds after January 1, 1970, which broke a thing or two. We don’t talk about that.). We had a cool discussion around the problems Basecamp had faced in the past with respect to load balancing, and the architectural challenges therein. Whereas Eron had worked on an in-house system for load balancing in the event of a Distributed Denial of Service (DDoS), I hadn’t ever really considered the load balancing problem. The conversation was really cool, though, because the problem domain presented interesting avenues for solutions and got us talking. No useless technical screen asking me to implement a Hash-map, no quiz on the internals of Python’s GIL. Lesson learned: few companies know how to properly evaluate prospective interns — Basecamp is one of them.

Around two weeks later, Noah got in touch asking for another phone interview. Okay, I thought, now we’re in the big leagues.

It was interview time. For the past two weeks I had been preparing myself for anything Noah might throw at me that was fair game — SQL queries, Postgres internals, Python trivia, etc. Imagine my surprise when we talked about my personal projects, my interests, having to context switch between MySQL and Postgres queries, and how I’d be able to contribute to Basecamp’s data infrastructure. I remember ending the call thinking I was a bit too casual — after all, not one line of code was written during a startup’s interview process? I must have been funneled into the “call, but don’t bother asking to code” subset. I sent Noah a thank-you email and figured that was the end of that — after all, there was only room for one data science intern.


Three weeks later, I was at the library studying for midterms. Ding. I checked my phone. “Join us at Basecamp this summer!” Lesson learned: don’t let your dreams be dreams. (Actual lesson learned: the logo worked!)


So what did I actually do at Basecamp? With Noah’s frequent guidance I built Thermometer, which is our embarrassingly-parallel aberration detection system designed to autonomously report any aberrations across our 10,000+ internal metrics in real-time. Thermometer was both an engineering and data science challenge — I constantly thought about the trade-offs between performance and statistical rigor, and with Noah’s help quickly settled on a great equilibrium.

Alongside Thermometer I also developed and internally released Thermos, an R package designed to keep Thermometer’s logic straightforward and abstract away the nitty-gritty bits of anomaly detection. I learned a ton about package development and came to truly dedicate my allegiance to the Hadley-verse.

Towards the end of my time here, Noah and I have been pairing up and walking through a few different A/B tests. Each time I get to pair up with Noah on a task, it’s really humbling to see how much I can improve as a data scientist. Things that would take me at least an hour or two to recognize, diagnose, and implement are a casual 1-line fix for him. Lesson learned (in the best way possible): you don’t know anything — yet.


In Noah’s retrospective on hiring Basecamp’s summer intern class, he mentions three core reasons Basecamp was offering internships:

Give back and have an impact on the community
Challenge ourselves to grow
Improve Basecamp (the product and company)

I like to think that as a data science intern, I was able to consistently experience the intent behind all three every day.

Problem-solving, meet the real world

When building Thermometer there were times where I lost sight of what to do next — there was so much! On top of having to deal with real world edge cases and meet certain features, I had never worked on something this conceptually large before. Not to mention I owned 100% of all the commits to Thermometer and Thermos — I should be Noah’s go-to for questions on Thermometer, not the other way around. Time and time again, however, I would ping Noah about a function I had written and ask about code quality, logic integrity, and more.

This isn’t to say I was struggling for air — I also got to propose solutions, discuss trade-offs of pursuing potential solutions, defend certain design decisions, and really feel like a full-time member of the data team. Treating interns as employees was without a doubt the norm at Basecamp, and it reinforces the people-first culture of betterment and transparency at all levels.

The Boy Scout Rule

Always leave the campground cleaner than you found it.

Working on Thermometer and Thermos gave me the opportunity to impact a real-world system at scale. Sharing my achievements with the company was a total confidence-booster, and as a builder it’s awesome when people use something you’ve made. I’ve left Basecamp in better shape to handle data problems in the future than how it was before — and in many ways that has made all the difference.

Parting Words

Basecamp surrounds its interns with so many opportunities to grow, collaborate, learn, and contribute. I’ve been fortunate to gain a crazy amount of both technical and domain knowledge, and it’s in large part thanks to the two-fold effort from the culture’s emphasis on communication and transparency and Noah’s exceptional skills as a data scientist, multi-tasker, and mentor.

Each day presented another challenge, and I got to learn from and face each one with a wide smile. There’s no better feeling than solving a problem you’ve been stuck on for days and knowing that your work is going to be used to support an entire organization, and I’ve grown to appreciate those hair-pulling moments of frustration as learning experiences.

Lesson learned: Basecamp isn’t about implementing bleeding-edge machine learning algorithms or dominating the market — it’s about people helping people. That’s it.

Feel free to send me your thoughts, questions, and prayers on Twitter.

Real-time dashboards considered harmful

Walk into any startup office and you’ll see almost the exact same thing: a bunch of big televisions showing real-time dashboards. Traffic, tweets, leads, sales, revenue, application performance, support cases, satisfaction, A/B test results, open rates; you name it, there’s a real-time dashboard for it.

Walk into Basecamp and you won’t see any of those, and it’s not just because we’re a remote company. It’s because real-time dashboards are often more harmful than they are beneficial.

Robert Caro nailed it in a recent Gothamist interview:

[Gothamist] There’s something called Chartbeat — it shows you how many people are reading a specific article in any given moment, and how long they spend on that article. That’s called “engagement time.” We have a giant flatscreen on the wall that displays it, a lot of publications do.

[Caro] What you just said is the worst thing I ever heard. [Laughs]

What’s the point of that dashboard?

I do a lot of reporting: on operations, on support, on usage, on finances, on marketing, and on every other topic that matters to a business. Whenever I consider a new piece of reporting, I ask myself one question: what’s the point? What’s the action or decision that this reporting is intended to impact? When someone consumes it, what can they do about it? Can they make a decision? Can they go do something personally or ask someone to do something? If there’s nothing that can be done in response to a report, does it need to be reported in that manner?

Most real-time dashboards fail to pass this usefulness test. Knowing how many visitors you have on your site right now or what the open rate was for the email that just went out doesn’t generally enable you to do anything. In a SaaS business, knowing what today’s revenue is doesn’t really enable you to do anything either: revenue today is the consequence of a sales and retention cycle that started long ago.

There are cases where real-time dashboards are invaluable. Knowing whether database response time is higher right now than it was a few minutes ago is incredibly useful when your site is slow, and we use real-time dashboards extensively for solving availability and performance problems at Basecamp.

Schrödinger’s dashboard

Perhaps real-time dashboards aren’t that useful, but if they aren’t a lot of work to set up, what’s the harm? Isn’t faster data better data?

The problem comes when you look at a real-time dashboard: no matter how much you try to train yourself, you’re going to react to the data that you just saw. You might not realize that you’re reacting to it, but you absolutely are.

Almost every metric is noisy. Active users being down 3% from yesterday could be the start of a longer trend, but it’s much more likely that it’s just noise in the data. When you see that 3% decrease on a real-time dashboard, however, the panic starts to set in: what did we do wrong? Anything you were thinking about gets thrown out the window, because now you’re reacting to something that looks urgent, but really isn’t important.
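
One rough way to check whether a change like that is noise is to compare it against how much the metric normally moves from day to day. The sketch below is only an illustration: the daily counts are made up, and the two-standard-deviation cutoff and the function names are assumptions, not a rule taken from this post.

```python
from statistics import mean, stdev

def day_over_day_changes(values):
    """Return fractional day-over-day changes for a series of daily values."""
    return [(today - yesterday) / yesterday
            for yesterday, today in zip(values, values[1:])]

def is_unusual(change, history, threshold=2.0):
    """True if `change` is more than `threshold` standard deviations away
    from the average day-over-day change seen in `history`."""
    changes = day_over_day_changes(history)
    mu, sigma = mean(changes), stdev(changes)
    return abs(change - mu) > threshold * sigma

# Made-up daily active user counts for two weeks (illustrative only).
daily_active_users = [1000, 1040, 990, 1025, 970, 1010, 1045,
                      995, 1030, 985, 1020, 960, 1005, 1035]

# A 3% drop sits well inside the usual day-to-day swings of this series.
print(is_unusual(-0.03, daily_active_users))
# A 15% drop does not.
print(is_unusual(-0.15, daily_active_users))
```

With a series that routinely swings four or five percent in either direction, a 3% dip is not a signal worth reacting to.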

I’ve seen many cases of people looking at real-time A/B test results and judging the experiment after an hour or two. No matter how much labeling you do to point out that the results are meaningless at that scale, humans will still draw conclusions from them. In our case, and for virtually every online business, daily updated results are more than adequate for making decisions, so there’s only downside to real-time A/B test results: the risk of making a decision off insufficient data and that decision turning out to be the wrong one.
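
To see why an hour or two of results is so misleading, here is a minimal sketch of a pooled two-proportion z-test, assuming made-up visitor and conversion counts. The function name is hypothetical, and the normal approximation is itself shaky at very small counts, which is part of the point.

```python
from math import erfc, sqrt

def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled z-test (normal approximation)."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of the standard normal.
    return erfc(abs(z) / sqrt(2))

# After an hour: 5% vs. 9% conversion looks like a huge win for B...
p_early = two_proportion_p_value(5, 100, 9, 100)
# ...but only with far more visitors would the same rates be convincing.
p_later = two_proportion_p_value(500, 10_000, 900, 10_000)

print(f"early p-value: {p_early:.3f}")
print(f"later p-value: {p_later:.3g}")
```

The early comparison is nowhere near a conventional 0.05 significance level, even though the observed rates look dramatically different.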

We recently scaled back and de-emphasized the use of a bunch of metrics relating to our support team. We found that a focus on average customer happiness scores, response time, and case volume made it hard to give each individual customer the attention they deserved, and caused a ton of unnecessary stress. Kristin explained our motivation well:

We’re attempting to change our relationship with Smiley and metrics so that our focus is more on each individual customer and less on any sense of competition with ourselves and/or each other. Smiley leads us to focus on the vocal minority (about 20%) of customers who leave a rating. The customer we’re currently working with should have 100% of our attention, so we shouldn’t be worried about quickly getting rid of them to move on to the next one or focusing on the customer as a potential Smile instead of as a person who needs help.

The next time you feel the urge to look at Smiley and/or Dash, get up and take a break. Make some tea. Eat some cheese popcorn. Pet an animal. Stretch.

I’m really proud of the support team for their evolving relationship with the use of metrics. We got a lot of value out of rigorously analyzing our support caseload to figure out the right level of staffing and scheduling and to address root causes, but we can do all of those things without real-time reporting. Knowing when not to look at a piece of data is just as important as knowing when to look.

Make reporting great again

How can you make reporting less stressful and more useful? Try a few of these simple changes:

  • Change the timeframe. Instead of looking at the last day of data, look at the last week or month. Maybe there’s a bigger seasonal trend that will help to contextualize today’s data.
  • Move upstream. Instead of reporting something like daily revenue, which is the output of every step of your funnel, report on the actual underlying drivers that you can impact.
  • Contextualize. Instead of showing an absolute metric, show a percentage change or a comparison to last week or last month.
  • Convert dashboards to alerts. Computers are great at sending emails according to defined conditions, so let them do that. Don’t rely on checking a real-time dashboard to detect that something isn’t right; define your criteria and let an automated system tell you when you need to take a deeper look.
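
The last two suggestions can be sketched together: express the metric relative to the same day last week, and only speak up when it crosses a threshold you defined in advance. This is a minimal illustration under assumptions. The function names, the 20% threshold, and the sample numbers are all hypothetical, and delivering the message by email or chat is left out.

```python
def percent_change(current, baseline):
    """Percentage change of `current` relative to `baseline`."""
    return (current - baseline) / baseline * 100

def check_metric(name, current, same_day_last_week, max_drop_pct=20.0):
    """Return an alert message if the metric fell by at least `max_drop_pct`
    percent versus the same day last week; otherwise return None."""
    change = percent_change(current, same_day_last_week)
    if change <= -max_drop_pct:
        return f"ALERT: {name} is {change:+.1f}% vs. the same day last week"
    return None

# A small dip produces no alert, so nobody needs to look at anything.
print(check_metric("signups", 950, 1000))
# A large drop produces a message that a scheduled job could send out.
print(check_metric("signups", 700, 1000))
```

Run on a schedule, a check like this replaces the habit of glancing at a screen with a message that only arrives when there is something to do.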

I get it: real-time reporting is fun. It’s something shiny to put up in your lobby, and it fills you with lots of little bits of trivia to drop at a moment’s notice. But that comes at a cost, and too many people embrace real-time reporting without thinking through the consequences.

A paean to slow data

Eschewing real-time dashboards is just one part of what I like to call a “slow data” approach to data science. I’m not talking about free-range histograms or artisanal Poisson distributions, but about taking the time to really understand the problem you’re solving, the data you’re using, and the implications of the results. My profession spends most of its time talking about statistical methods and visualization, and very little time talking about the actual business problems or impacts of the work. Fortunately, I mostly just do arithmetic, make very simple charts, and avoid making real-time dashboards, so I have lots of time to think about the problem we’re trying to solve.

I’d encourage you to give this slower approach to data science a shot in your organization too. Next time you think about making a real-time dashboard, ask a deeper question about the underlying problem instead. I guarantee you’ll find more value from that.

Practical skills that practical data scientists need

When I wrote about how I mostly just use arithmetic, a lot of people asked me about what skills or tools a data scientist needs if not fancy algorithms. What is this mythical “basic math” that I mentioned? Here’s my take on what skills are actually needed for the sort of work that I do at Basecamp: simple analyses focused on solving actual business problems.

The most important skill: being able to understand the business and the problem

I’ll get to actual practical skills that you can learn in a textbook in a minute, but first I have to belabor one point: the real essential skill of a data scientist is the ability to understand the business and the problem, and the intellectual curiosity to want to do so. What are you actually trying to achieve as a business? Who are your customers? When are you selling your product? What are the underlying economics of the business? Are profit margins high or modest? Do you have many small customers or a few large customers? How wide is your product range? Who are you competing with? What challenge is the business facing that you’re trying to solve or provide input towards a decision on? What’s the believable range of answers? Who is involved in solving this problem? Can analysis actually make a difference? How much time is worth investing in this problem?

Understanding the data

Before you look at any data or do any math, you need to understand the underlying data sources, structure, and meaning. Even if someone else goes out and gets the data from wherever it’s stored and gives it to you, you still need to understand the origin and what each part of the data means. Data quality varies dramatically across and within organizations; in some cases you’ll have a well documented data dictionary, and in other cases you’ll have nothing. Regardless, you’ll want to be able to answer the following questions:

  • What data do I need to solve the problem?
  • Where is that data located? In a relational database? In a log file on disk? In a third party service?
  • How comprehensive (time and scope) is the data? Are there gaps in coverage or retention?
  • What does each field in the data mean in terms of actual behavior of humans or computers?
  • How accurate is each field in the data? Does it come from something that’s directly observed, self-reported, third-party sourced, or imputed?
  • How can I use this data in a way that minimizes the risk of violating someone’s privacy?
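
A couple of these questions, coverage gaps and how often a field is actually populated, can be checked mechanically before any analysis starts. The sketch below assumes the data has already been pulled into a list of dictionaries; the field names and sample records are made up for illustration.

```python
from datetime import date, timedelta

def missing_dates(records, date_field="day"):
    """Return the calendar days between the first and last record
    for which no record exists."""
    days = {r[date_field] for r in records}
    start, end = min(days), max(days)
    span = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [d for d in span if d not in days]

def null_rate(records, field):
    """Fraction of records where `field` is absent, None, or empty."""
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

# Hypothetical daily records; note the missing day and the blank countries.
records = [
    {"day": date(2024, 1, 1), "country": "US"},
    {"day": date(2024, 1, 2), "country": None},
    {"day": date(2024, 1, 4), "country": "CA"},
    {"day": date(2024, 1, 5), "country": ""},
]

print(missing_dates(records))
print(null_rate(records, "country"))
```

Checks like these do not answer what a field means, but they quickly show where the data cannot be trusted as-is.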

SQL skills

For better or worse, most of the data that data scientists need lives in relational databases that quack SQL, whether that’s MySQL, Postgres, Hive, Impala, Redshift, BigQuery, Teradata, Oracle, or something else. Your mission is to free the data from the confines of that relational database without crashing the database instance, pulling more or less data than you need to, getting inaccurate data, or waiting a year for a query to finish.

Virtually every query a data scientist writes to get data to analyze to solve business problems will be a SELECT statement. The essential SQL concepts and functions that I find necessary are:

  • WHERE clauses, including IN (…)
  • Joins, mostly left and inner
  • Using already indexed fields
  • if()
  • String manipulation, primarily left() and lower()
  • Date manipulation: date_add, datediff, to and from UNIX timestamps, time component extraction
  • regexp_extract (if you’re lucky to use a database that supports it) or substring_index (if you’re less lucky)
  • Subqueries
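
Here is a small, self-contained sketch that exercises most of that list in one SELECT. It runs against an in-memory SQLite database through Python so that it can be tried anywhere; the tables, columns, and rows are invented for the example. SQLite spells a few things differently from MySQL, so `CASE WHEN` stands in for `if()`, `substr()` for `left()`, and a `julianday()` difference for `datediff`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner_email TEXT,
                           plan TEXT, created_at TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, account_id INTEGER,
                           amount REAL, paid_at TEXT);
    -- Join on an indexed column so the lookup does not scan the whole table.
    CREATE INDEX idx_invoices_account ON invoices (account_id);
    INSERT INTO accounts VALUES
        (1, 'Ann@Example.com', 'pro',   '2024-01-01'),
        (2, 'bob@example.org', 'basic', '2024-01-10'),
        (3, 'cy@example.com',  'free',  '2024-02-01'),
        (4, 'dee@example.com', 'basic', '2024-02-05');
    INSERT INTO invoices VALUES
        (1, 1, 99.0, '2024-01-15'),
        (2, 1, 99.0, '2024-02-15'),
        (3, 2, 29.0, '2024-01-20');
""")

query = """
    SELECT
        a.id,
        lower(a.owner_email)                             AS email,
        substr(a.plan, 1, 1)                             AS plan_initial,
        CASE WHEN i.total IS NULL THEN 0 ELSE 1 END      AS has_paid,
        CAST(julianday(i.first_paid_at) - julianday(a.created_at) AS INTEGER)
                                                         AS days_to_first_payment
    FROM accounts a
    LEFT JOIN (                       -- subquery: one row per paying account
        SELECT account_id, SUM(amount) AS total, MIN(paid_at) AS first_paid_at
        FROM invoices
        GROUP BY account_id
    ) i ON i.account_id = a.id
    WHERE a.plan IN ('pro', 'basic')  -- WHERE ... IN (...)
    ORDER BY a.id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left join keeps account 4, which has never paid, in the result with a null payment delay; an inner join would silently drop it.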

Basic math skills

Once you have some data, you can do some maths. The list of what I consider to be the essential math skills and concepts is not a long one:

  • Arithmetic (addition, subtraction, multiplication, division)
  • Percentages (of total, difference vs. another value)
  • Mean and median (and mean vs. median)
  • Percentiles
  • Histograms and cumulative distribution functions
  • An understanding of probability, randomness, and sampling
  • Growth rates (simple and compound)
  • Power analysis (for proportions and means)
  • Significance testing (for proportions and means)

This isn’t a very complicated set of things. It’s not about the math, it’s about the problem you’re solving.
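
To show how little machinery most of the list above needs, here is a short sketch using only the Python standard library. The invoice amounts are made up, the function names are hypothetical, and the sample size formula is the common unpooled normal approximation for comparing two proportions, which is one of several reasonable choices.

```python
from math import ceil
from statistics import NormalDist, mean, median, quantiles

# Made-up invoice amounts with one large outlier.
invoice_amounts = [20, 25, 25, 30, 30, 35, 40, 45, 50, 900]

avg = mean(invoice_amounts)                  # pulled upward by the outlier
mid = median(invoice_amounts)                # closer to a "typical" invoice
p90 = quantiles(invoice_amounts, n=10)[-1]   # 90th percentile cut point

def compound_growth_rate(start, end, periods):
    """Per-period growth rate that compounds `start` into `end`."""
    return (end / start) ** (1 / periods) - 1

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate visitors needed per group to detect a change in a
    proportion from p1 to p2 with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(f"mean {avg}, median {mid}, 90th percentile {p90}")
print(f"growth per period: {compound_growth_rate(100, 121, 2):.1%}")
print(f"visitors per group: {sample_size_per_group(0.05, 0.06)}")
```

Detecting a move from a 5% to a 6% conversion rate takes on the order of eight thousand visitors per group, which is a useful number to know before an experiment starts rather than after.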

Slightly more advanced math concepts

On occasion, some more advanced mathematical or SQL concepts or skills are of value to common business problems. A handful of the more common things I use include:

  • Analytic functions if supported by your database (lead(), lag(), rank(), etc.)
  • Present and future value and discount rates
  • Survival analysis
  • Linear and logistic regression
  • Bag of Words textual representations

There are some problems that require more advanced techniques, and I don’t mean to disparage or dismiss those. If your business can truly benefit from things like deep learning, congratulations! That probably means you’ve solved all the easy problems that your business is facing.
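
Of the items on that list, present value is the quickest to illustrate. The sketch below assumes cash flows that arrive at the end of each year and a single constant discount rate; the function name and the figures are hypothetical.

```python
def present_value(cash_flows, annual_rate):
    """Discount cash flows received at the end of years 1, 2, 3, ...
    back to today's value at a constant annual rate."""
    return sum(cash_flow / (1 + annual_rate) ** year
               for year, cash_flow in enumerate(cash_flows, start=1))

# A customer expected to pay 1,000 a year for three years, discounted at 10%.
value_today = present_value([1000, 1000, 1000], 0.10)
print(f"{value_today:.2f}")
```

The result is less than the 3,000 in nominal revenue because money received later is worth less than money received now, which matters when comparing, say, annual and monthly billing.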

Data scientists mostly just do arithmetic and that’s a good thing

Hi, I’m Noah. I work at Basecamp. Sometimes I’m called a “data scientist.” Mostly, I just do arithmetic, and I’m ok with that.

Here are a few of the things I worked on in the last couple of weeks, each of them in response to a real problem facing the business:

  • I analyzed conversion, trial completion, and average invoice amounts for users in different countries.
  • I identified the rate at which people accidentally sign up for Basecamp when they mean to sign in to an existing account and how that’s changed over time.
  • I analyzed and reported on financial performance of a few of our products.
  • I ran and analyzed a survey of account owners.
  • I analyzed an A/B test we ran that affected the behavior of a feature within Basecamp.

In the last two weeks, the most “sophisticated” math I’ve done has been a few power analyses and significance tests. Mostly what I’ve done is write SQL queries to get data, perform basic arithmetic on that data (computing differences, percentiles, etc.), graph the results, and write paragraphs of explanation or recommendation.

I haven’t coded up any algorithms, built any recommendation engines, deployed a deep learning system, or built a neural net.

Why not? Because Basecamp doesn’t need those things right now.

The dirty little secret of the ongoing “data science” boom is that most of what people talk about as being data science isn’t what businesses actually need. Businesses need accurate and actionable information to help them make decisions about how they spend their time and resources. There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means that is best gained using simple methods.

Some people will argue that what I’ve described as being valuable isn’t “data science”, but is instead just “business intelligence” or “data analytics”. I can’t argue with that arbitrary definition of data science, but it doesn’t matter what you call it — it’s still the most valuable way for most people who work with data to spend their time.

I get a fair number of emails from people who want to get into “data science” asking for advice. Should they get a master’s degree? Should they do a bunch of Kaggle competitions?

My advice is simple: no. What you should probably do is make sure you understand how to do basic math, know how to write a basic SQL query, and understand how a business works and what it needs to succeed. If you want to be a valuable contributor to a business, instead of spending your weekend working on a data mining competition, go work in a small business. Talk to customers. Watch what products sell and which ones don’t. Think about the economics that drive the business and how you can help it succeed more.

Knowing what matters is the real key to being an effective data scientist.

Infrastructure worth investing in

I analyze data for a living. I occasionally do some other things for Basecamp — help with marketing, pitch in on support, do some “business” things — but at the end of the day, I analyze data. Some of the data is about feature usage, some about application performance or speed, some about our great support team, some about financial matters, and some of it defies categorization; regardless of the type of data, my job is to identify business problems, use data to understand them, and make actionable recommendations.

If you ask almost any data analyst, they’ll tell you that the biggest chunk of their time is spent cleaning and preparing data — getting it into a form that’s usable for reporting or analysis. Ironically, I don’t have any actual data about how much time that process consumes, either personally or for the profession as a whole, but I’d guess that the time spent on acquisition and transformation of data outweighs actual math, statistics, or coming up with recommendations by a factor of five to one or more.

Over time, both personally and as an organization, you get better at capturing and preparing data, and it eats up less and less of your time. I’d characterize the time I’ve spent at Basecamp as four fairly distinct phases of increasing sophistication in how I prepare data for analysis:

  1. The CSV phase: In my early days at Basecamp, I was just happy to have data at all, and everything was basically a comma separated value (CSV) file. I used `SELECT … INTO OUTFILE` to get data from MySQL databases, `awk` to get things from log files, and the ‘export’ button from third party services to get data that I could then analyze.
  2. The R script phase: After a month or two, I graduated to a set of R scripts to get data directly into my analysis environment of choice. A wrapper function got data from our MySQL databases and I wrote API wrappers to get data from external services. Our first substantial automated reports showed up in this phase, and they were literally R scripts piped to `sendmail` on my laptop.
  3. The embryonic data warehouse: Eventually, a fledgling “data warehouse” started to take form — a MySQL instance held some data explicitly for analysis, a Hadoop cluster came into the picture to process and store logs, and we added a dashboard application that standardized reporting.
  4. The 90% data warehouse (today): Today, a centralized data warehouse holds almost all of our data, every type of data belongs to a documented schema, and Tableau has dramatically changed the way we do reporting. It’s not perfect: there are some pieces of data that remain scattered and are analyzed only after cleaning and manual processing, but that’s the exception rather than the rule.

Over the course of this transformation, the time that I spend preparing data for analysis has fallen dramatically — it used to be that I started any “substantial” analysis with two or three days of getting and cleaning data followed by a day of actually analyzing it; now, I might spend twenty minutes getting vastly greater quantities of already clean data and then a couple of days analyzing it far more deeply.

That evolution didn’t come for free — it took substantial investments of both time and money into our data infrastructure. We’ve built out two physical Hadoop clusters, bought software licenses, and poured hundreds of hours over the last five years into developing the systems that enable reporting and analysis.

I used to struggle with feeling guilty every time I spent time on our data infrastructure. After all, wasn’t my job to analyze data and help the business, not build data infrastructure?

Over time, I’ve come to realize that there’s nothing to feel guilty about. The investments in our infrastructure have paid dividends many times over: in direct time savings (mine and others’), in greater insights for the company, and in empowering others to work with data. In the example analysis case I described above, the transformation infrastructure saved a day or two of my time and delivered a better result to the business; I do perhaps thirty or forty such analyses per year. That makes a few weeks or even months of total time spent on those investments look like a bargain.
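
Working that arithmetic through, using only the ranges quoted in the paragraph above (the five-day working week in the final step is an added assumption):

```python
# Ranges quoted in the text: days saved per analysis and analyses per year.
days_saved_per_analysis = (1, 2)
analyses_per_year = (30, 40)

low = days_saved_per_analysis[0] * analyses_per_year[0]
high = days_saved_per_analysis[1] * analyses_per_year[1]

# Assumes five working days per week.
print(f"Roughly {low} to {high} days saved per year "
      f"({low // 5} to {high // 5} working weeks)")
```

Even the low end of that range repays a few weeks of infrastructure work within a single year.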

I often hear people argue that “investing in infrastructure” is just code for giving in to “Not Invented Here” syndrome. The single highest-impact infrastructure investment we’ve made was actually abandoning a custom developed reporting solution for a piece of commercially developed software. Just like any sort of investment, you can of course spend your resources poorly, but done properly, investing in infrastructure can deliver one of the highest returns you can possibly achieve.

Seth Godin had an excellent take on the topic recently:

Here’s something that’s unavoidably true: Investing in infrastructure always pays off. Always. Not just most of the time, but every single time. Sometimes the payoff takes longer than we’d like, sometimes there may be more efficient ways to get the same result, but every time we spend time and money on the four things, we’re surprised at how much of a difference it makes.

I recently wrapped up a fairly large infrastructure project at Basecamp, and my focus is naturally swinging back to the core of what I do: analyzing data. For the first time, however, I’m moving on from an infrastructure project without much guilt about whether it was an investment worth making. Instead, I’m looking forward to reaping the dividends from these investments for years to come.