What does Team Data do at Basecamp?

Basecamp’s “team data” recently doubled in size with Justin White joining us full-time as a programmer. We’ve been in the data business at Basecamp for over six years, but the occasion of hiring a second team member caused me to reflect on why Team Data exists, what it does, and how we work.

A simple objective: make Basecamp better

We’re basically interested in three things on Team Data:

  1. Make Basecamp-the-product better to help our customers achieve their goals.
  2. Make Basecamp-the-company a great place to work.
  3. Make Basecamp-the-business successful.

These are really the same fundamental objectives every team at Basecamp is working towards, and each team has their own specific angle on this. The support team focuses on how their interactions with customers can achieve these goals, the design and development teams focus on how the functionality of Basecamp itself can achieve these goals, etc.

On the data team, we primarily attempt to use quantitative information to achieve these goals. Our approach isn’t necessarily the best approach for every problem, but it’s the angle we take on things. If we can’t address a specific question or problem with some sort of number, we probably aren’t the best team to answer it, and we’ll gladly defer to the perspective others can bring to bear.

What we do

Pretty much everything we do on Team Data falls into one of two categories:

  1. We answer questions about Basecamp using data and make recommendations. The questions we tackle span a wide range: from specific questions about how a feature is used, to understanding how a change we made impacted signups, to open questions about how we can improve some aspect of business performance.
  2. We build infrastructure and tools to a) support our ability to answer the questions above, and b) help others at Basecamp accomplish their work more effectively.

We occasionally do things that don’t fall into either of those categories, but the core of what we do falls into either analysis or infrastructure.

A sampling of our work over the past few months includes:

  • Analyzing the performance of a new marketing site and account setup process.
  • Improving the internal dashboard app that powers many of our internal tools by removing thousands of lines of dead code and upgrading to a modern version of Rails.
  • Helping design, implement, and analyze a dozen A/B tests.
  • Migrating our data infrastructure from on-premises hardware to cloud-based services.
  • Analyzing the sequencing of notifications sent by Basecamp and recommending ways to adjust timing.

Things we believe about ourselves

Every team at every company has a set of beliefs about how they work, whether they are aware of them, acknowledge them, or codify them. Here on team data, there are a few tenets that we try to embody that we’ve taken the time to write down:

  1. We are scientists. Wherever possible, we apply the scientific method to solving problems, whether through analysis or engineering.
  2. We are objective. There’s no agenda on team data other than seeking the truth; we report the facts whether we like them or not.
  3. We try for simple. We don’t use a machine learning model when a heuristic will do, and we don’t write complicated programs when a simple `awk` one-liner will work.
  4. We are rigorous. When a problem demands a nuanced understanding or when data needs to be high quality, we stick to those requirements. We’d rather over-explain a complicated situation than over-simplify it.
  5. We are technology and tool agnostic. Ruby, Go, Scala, R, Python — whatever the best tool for the job is. When possible, we use open-source or third-party tools, but we’ll build what’s needed that isn’t otherwise available.
  6. We collaborate, engaging in peer review of analysis and code.

We don’t hit all of these points every day, but they’re the aspiration we’re working towards.

How we work

Unlike the core product teams at Basecamp, we don’t explicitly work in six-week cycles, and we tend to each have multiple projects under way at any given time. Many of our projects last a couple of days or weeks, and some stretch over six months or a year. We might do some instrumentation today and then back-burner it for 30 days while we wait for data to collect, or a thorny problem might wait until we figure out how to solve it.

Generally, Justin spends about 80% of his time working on infrastructure and the remainder on analysis, and I spend about 80% of my time on analysis and the remainder on infrastructure. This is mostly about specialization — Justin is a far better programmer than I am, and I have more experience and background with analytics than he has. We don’t hit this split exactly, but it’s our general goal.

We get lots of specific requests from others at Basecamp: questions they’d like answered, tools that would help them do their work, etc., and we also have a long list of bigger projects that we’d like to achieve. We explicitly reserve 20% of our time to devote to responding directly to requests, and we both try to set aside Fridays to do just that.

Anyone can add a request to a todolist in our primary Basecamp project, and we’ll triage it, figure out who is best equipped to fulfill it, and try to answer it. Some requests get fulfilled in 20 minutes; we have other requests that have been around for months. That’s ok — we embrace the constraint of not having unlimited time, and we admit that we can’t answer every question that comes up.

Outside of requests, we collaborate with and lean on lots of other teams at Basecamp. We build some of the tooling that the operations team uses for monitoring and operating our applications, and they provide the baseline infrastructure we build our data systems on. We collaborate with developers and designers to figure out what data or analysis is helpful as they design and evaluate new features. We work closely with people working on improving basecamp.com and the onboarding experience through A/B testing, providing advice on experimental design, analysis, etc.

One of the most visible things our team does is put out a chart-of-the-day: some piece of what we’re working on, shared daily with the whole company.

Like the rest of Basecamp, we don’t do daily stand-ups or formal status meetings. Justin and I hop on a Google Hangout once a week to review results, help each other get unstuck on problems, and — since Justin is still relatively new to team data — walk through one piece of how our data infrastructure works and discuss areas for improvement each week. Other than that, all of our collaboration happens via Basecamp itself, through pings, messages, comments, etc.

Sound like fun?

Here’s the shameless plug: If you read the above and it sounds like your cup of tea, and you’re a student or aspiring data analyst, I hope you’ll consider joining us this summer as an intern. You’ll work mostly on the analysis side of things: you’ll take requests off our main request list and projects from our backlog, structure the question into something that can be answered quantitatively, figure out what data you need to answer that question, figure out how to get the data, perform analysis, write up results, and make recommendations.

Data is Man-Made

Here’s a secret from the support team at Highrise. Customer support metrics make us feel icky.

Our team doesn’t know our satisfaction score. We’ve never asked any of the people that use Highrise to try those types of surveys.

We can’t give you an exact number for our average response time. It depends. Sometimes it’s 90 seconds, and other times it’s within 24 hours.

We can’t tell you our average handle time for an issue. Our team has a general idea, but no exact number.

These types of customer support metrics aren’t wrong. We’re sure they work for other support teams.

We’re just not sure they’re right for us.

Because there is one piece of knowledge we’ve come to realize: data is man-made.

What do you mean data is man-made?

Data or metrics or stats are all man-made. A human decides what to measure, how to measure it, how to present it, and how to share it with others.

But why does it matter to measure these things? And what’s the point?

A lot of times people avoid these questions when it comes to data. Companies copy what other teams measure without asking whether it’s important to measure the same things in the same way, or whether it’s important to measure them at all.

But isn’t all data objective?

Many people view numerical data as more trustworthy than qualitative data.

Clayton Christensen, Competing Against Luck

Numbers are black and white. Concrete. You can trust the numbers.


Nope. Almost all data is built on biases and judgement. Because humans are deciding what to measure, how to measure, and why to measure.

Numbers fit perfectly into a spreadsheet or a graph. A number gives a definitive answer to questions like how much or how many.

That doesn’t mean you should treat those numbers as insights and act immediately. Data shouldn’t be used to prove a point.

Data should be used to fuel your imagination.

Words > Numbers

Qualitative data isn’t easy. There aren’t any formulas or simple math. It doesn’t fit into a spreadsheet. It doesn’t answer questions. It’s not black and white.

It’s colorful. Messy. Qualitative data creates more questions. It’s not simple to present or share with others. It takes some time.

Our support team has found one thing to be true. Qualitative data is worth it. 100 percent worth it.

For example, our team recently updated the filters in Highrise. This was an update to an earlier revision of the filters that we had made during the year.

It was driven by one piece of qualitative data from a new user:

Thanks for pointing out those filters. I didn’t even know they were there. Those icons weren’t obvious to me at first.

This hit all of us across the nose. The filters looked better. The new design was much cleaner than the original.

The original design vs. our next iteration

We didn’t make these changes just for aesthetic reasons though. The original design caused a lot of trouble for most of our users who had more than a handful of custom fields. But with the new design, how to use the filters was no longer as obvious.

Folks need to find a specific set of contacts in the city of: Chicago, that have the value: Interested in the custom field: Status, and that are tagged: Potential.

It wasn’t clear how to do that, so our team made a change.

We made it abundantly clear what to click on to add a filter.

Quantitative data didn’t tell us we needed to make this change. It was all qualitative.

Questions from customers and questions from our team. It was a conversation. There is not a numerical value you can put on that.

So what do we measure?

Instead of striving to lower our average response time or improve our customer satisfaction score, our support team is aiming for something a bit different. Something harder to measure. It’s not a number.

As Alison would say, we strive to put ourselves out of work.

Don’t confuse that with us not wanting to work at Highrise. We love it, and love working with our small team.

What we mean is we want to make it easier for people to use Highrise. We want to create a product that is so obvious and so easy to use that we seldom get questions on how to use it.

And when folks do have questions, we want to have resources available to them right away, so they can help themselves. So if someone has a question at 2 am, and we’re not around, they can find an answer without waiting for us.

Because we don’t believe managing a number is going to improve our support. We believe focusing on customers and what they are trying to do with Highrise is going to make a better product, and better support.

Please don’t take this as gospel either. What works for our team, might not work for your team. And vice versa.

Also, chapter 9 of Clayton Christensen’s recent book, Competing Against Luck, was a big inspiration for this post. The entire book is great, and you should check it out.

Real-time dashboards considered harmful

Walk into any startup office and you’ll see almost the exact same thing: a bunch of big televisions showing real-time dashboards. Traffic, tweets, leads, sales, revenue, application performance, support cases, satisfaction, A/B test results, open rates; you name it, there’s a real-time dashboard for it.

Walk into Basecamp and you won’t see any of those, and it’s not just because we’re a remote company. It’s because real-time dashboards are often more harmful than they are beneficial.

Robert Caro nailed it in a recent Gothamist interview:

[Gothamist] There’s something called Chartbeat — it shows you how many people are reading a specific article in any given moment, and how long they spend on that article. That’s called “engagement time.” We have a giant flatscreen on the wall that displays it, a lot of publications do.

[Caro] What you just said is the worst thing I ever heard. [Laughs]

What’s the point of that dashboard?

I do a lot of reporting: on operations, on support, on usage, on finances, on marketing, and on every other topic that matters to a business. Whenever I consider a new piece of reporting, I ask myself one question: what’s the point? What’s the action or decision that this reporting is intended to impact? When someone consumes it, what can they do about it? Can they make a decision? Can they go do something personally or ask someone to do something? If there’s nothing that can be done in response to a report, does it need to be reported in that manner?

Most real-time dashboards fail to pass this usefulness test. Knowing how many visitors you have on your site right now or what the open rate was for the email that just went out doesn’t generally enable you to do anything. In a SaaS business, knowing what today’s revenue is doesn’t really enable you to do anything either: revenue today is the consequence of a sales and retention cycle that started long ago.

There are cases where real-time dashboards are invaluable. Knowing whether database response time is higher right now than it was a few minutes ago is incredibly useful when your site is slow, and we use real-time dashboards extensively for solving availability and performance problems at Basecamp.

Schrödinger’s dashboard

Perhaps real-time dashboards aren’t that useful, but if they aren’t a lot of work to set up, what’s the harm? Isn’t faster data better data?

The problem comes when you look at a real-time dashboard: no matter how much you try to train yourself, you’re going to react to the data that you just saw. You might not realize that you’re reacting to it, but you absolutely are.

Almost every metric is noisy. Active users being down 3% from yesterday could be the start of a longer trend, but it’s much more likely that it’s just noise in the data. When you see that 3% decrease on a real-time dashboard, however, the panic starts to set in: what did we do wrong? Anything you were thinking about gets thrown out the window, because now you’re reacting to something that looks urgent, but really isn’t important.

I’ve seen many cases of people looking at real-time A/B test results and judging the experiment after an hour or two. No matter how much labeling you do to point out that the results are meaningless at that scale, humans will still draw conclusions from them. In our case, and for virtually every online business, daily updated results are more than adequate for making decisions, so there’s only downside to real-time A/B test results: the risk of making a decision off insufficient data and that decision turning out to be the wrong one.
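To make the point about early A/B results concrete, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The visitor and conversion counts are invented for illustration and are not Basecamp data; the function name is hypothetical. The same observed lift (10% vs. 12% conversion) is run at two sample sizes to show how little a small sample can tell you.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: identical conversion rates, very different sample sizes.
early = two_proportion_p_value(conv_a=10, n_a=100, conv_b=12, n_b=100)
later = two_proportion_p_value(conv_a=1000, n_a=10000, conv_b=1200, n_b=10000)

print(f"small sample: p = {early:.3f}")
print(f"large sample: p = {later:.6f}")
```

With the small sample, the p-value is far above any conventional threshold, so the apparent lift is indistinguishable from noise; with the large sample, the same lift is clearly significant.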

We recently scaled back and de-emphasized the use of a bunch of metrics relating to our support team. We found that a focus on average customer happiness scores, response time, and case volume made it hard to give each individual customer the attention they deserved, and caused a ton of unnecessary stress. Kristin explained our motivation well:

We’re attempting to change our relationship with Smiley and metrics so that our focus is more on each individual customer and less on any sense of competition with ourselves and/or each other. Smiley leads us to focus on the vocal minority (about 20%) of customers who leave a rating. The customer we’re currently working with should have 100% of our attention, so we shouldn’t be worried about quickly getting rid of them to move on to the next one or focusing on the customer as a potential Smile instead of as a person who needs help.

The next time you feel the urge to look at Smiley and/or Dash, get up and take a break. Make some tea. Eat some cheese popcorn. Pet an animal. Stretch.

I’m really proud of the support team for their evolving relationship with the use of metrics. We got a lot of value out of rigorously analyzing our support caseload to figure out the right level of staffing and scheduling and to address root causes, but we can do all of those things without real-time reporting. Knowing when not to look at a piece of data is just as important as knowing when to look.

Make reporting great again

How can you make reporting less stressful and more useful? Try a few of these simple changes:

  • Change the timeframe. Instead of looking at the last day of data, look at the last week or month. Maybe there’s a bigger seasonal trend that will help to contextualize today’s data.
  • Move upstream. Instead of reporting something like daily revenue, which is the output of every step of your funnel, report on the actual underlying drivers that you can impact.
  • Contextualize. Instead of showing an absolute metric, show a percentage change or a comparison to last week or last month.
  • Convert dashboards to alerts. Computers are great at sending emails according to defined conditions, so let them do that. Don’t rely on checking a real-time dashboard to detect that something isn’t right; define your criteria and let an automated system tell you when you need to take a deeper look.
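As a rough sketch of the "contextualize" and "convert dashboards to alerts" ideas, the Python below flags a metric only when it falls well outside its recent day-to-day variation, and expresses a value as a percentage change. The function names, the threshold of three standard deviations, and the sample numbers are all assumptions chosen for illustration, not a description of any system Basecamp runs.

```python
from statistics import mean, stdev


def needs_attention(history, today, threshold=3.0):
    """Return True when today's value is unusually far from the recent norm.

    history: recent daily values of the metric
    threshold: how many standard deviations count as unusual (an assumption)
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return today != baseline
    return abs(today - baseline) / spread > threshold


def percent_change(current, previous):
    """Contextualize an absolute number as a change versus a prior period."""
    return (current - previous) / previous * 100


# Hypothetical daily active user counts for the last two weeks.
daily_active = [1010, 990, 1025, 980, 1000, 1015, 995,
                1005, 985, 1020, 1000, 990, 1010, 1000]

print(needs_attention(daily_active, today=970))   # roughly a 3% dip
print(needs_attention(daily_active, today=700))   # roughly a 30% drop
print(f"{percent_change(970, 1000):+.1f}% vs. a hypothetical prior value of 1000")
```

In this sample the 3% dip stays within normal variation and raises nothing, while the 30% drop does. A scheduled job could run a check like this once a day and send an email only when it returns True.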

I get it: real-time reporting is fun. It’s something shiny to put up in your lobby, and it fills you with lots of little bits of trivia to drop at a moment’s notice. But that comes at a cost, and too many people embrace real-time reporting without thinking through the consequences.

A paean to slow data

Eschewing real-time dashboards is just one part of what I like to call a “slow data” approach to data science. I’m not talking about free-range histograms or artisanal Poisson distributions, but about taking the time to really understand the problem you’re solving, the data you’re using, and the implications of the results. My profession spends most of its time talking about statistical methods and visualization, and very little time talking about the actual business problems or impacts of the work. Fortunately, I mostly just do arithmetic, make very simple charts, and avoid making real-time dashboards, so I have lots of time to think about the problem we’re trying to solve.

I’d encourage you to give this slower approach to data science a shot in your organization too. Next time you think about making a real-time dashboard, ask a deeper question about the underlying problem instead. I guarantee you’ll find more value from that.

Practical skills that practical data scientists need

When I wrote about how I mostly just use arithmetic, a lot of people asked me about what skills or tools a data scientist needs if not fancy algorithms. What is this mythical “basic math” that I mentioned? Here’s my take on what skills are actually needed for the sort of work that I do at Basecamp: simple analyses focused on solving actual business problems.

The most important skill: being able to understand the business and the problem

I’ll get to actual practical skills that you can learn in a textbook in a minute, but first I have to belabor one point: the real essential skill of a data scientist is the ability to understand the business and the problem, and the intellectual curiosity to want to do so. What are you actually trying to achieve as a business? Who are your customers? When are you selling your product? What are the underlying economics of the business? Are profit margins high or modest? Do you have many small customers or a few large customers? How wide is your product range? Who are you competing with? What challenge is the business facing that you’re trying to solve or provide input towards a decision on? What’s the believable range of answers? Who is involved in solving this problem? Can analysis actually make a difference? How much time is worth investing in this problem?

Understanding the data

Before looking at any data or doing any math, a data scientist needs to understand the underlying data sources, structure, and meaning. Even if someone else goes out and gets the data from wherever it’s stored and gives it to you, you still need to understand the origin and what each part of the data means. Data quality varies dramatically across and within organizations; in some cases you’ll have a well-documented data dictionary, and in other cases you’ll have nothing. Regardless, you’ll want to be able to answer the following questions:

  • What data do I need to solve the problem?
  • Where is that data located? In a relational database? In a log file on disk? In a third party service?
  • How comprehensive (time and scope) is the data? Are there gaps in coverage or retention?
  • What does each field in the data mean in terms of actual behavior of humans or computers?
  • How accurate is each field in the data? Does it come from something that’s directly observed, self-reported, third-party sourced, or imputed?
  • How can I use this data in a way that minimizes the risk of violating someone’s privacy?
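The question about gaps in coverage is one that a few lines of code can answer directly. The sketch below, in Python, lists calendar days with no data between the first and last observation. The function name and the sample dates are hypothetical; in practice the input might come from a query that returns the distinct dates present in a table.

```python
from datetime import date, timedelta


def missing_days(observed_dates):
    """Return the calendar days between the first and last observation
    that have no data at all."""
    seen = set(observed_dates)
    first, last = min(seen), max(seen)
    gaps = []
    day = first
    while day <= last:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical dates on which an events table has at least one row.
days_with_rows = [
    date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3),
    date(2016, 3, 6), date(2016, 3, 7),
]

for gap in missing_days(days_with_rows):
    print("no data for", gap.isoformat())
```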

SQL skills

For better or worse, most of the data that data scientists need lives in relational databases that quack SQL, whether that’s MySQL, Postgres, Hive, Impala, Redshift, BigQuery, Teradata, Oracle, or something else. Your mission is to free the data from the confines of that relational database without crashing the database instance, pulling more or less data than you need to, getting inaccurate data, or waiting a year for a query to finish.

Virtually every query a data scientist writes to get data to analyze to solve business problems will be a SELECT statement. The essential SQL concepts and functions that I find necessary are:

  • WHERE clauses, including IN (…)
  • Joins, mostly left and inner
  • Using already indexed fields
  • if()
  • String manipulation, primarily left() and lower()
  • Date manipulation: date_add, datediff, to and from UNIX timestamps, time component extraction
  • regexp_extract (if you’re lucky enough to use a database that supports it) or substring_index (if you’re less lucky)
  • Subqueries
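To show several of these concepts working together, here is a small self-contained sketch that runs a single SELECT against an in-memory SQLite database from Python. SQLite is used only because it ships with Python; the table names, columns, and rows are invented. SQLite dialect differs from the databases listed above: it has no `if()` in older versions, so a `CASE` expression stands in for it, and `strftime` stands in for the date functions named in the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, plan TEXT,
                           created_at TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, account_id INTEGER,
                           amount REAL);
    CREATE INDEX idx_invoices_account ON invoices (account_id);

    INSERT INTO accounts VALUES
        (1, 'Ann@Example.com', 'plus',  '2016-01-05'),
        (2, 'bob@example.org', 'free',  '2016-01-20'),
        (3, 'cy@example.com',  'basic', '2016-02-11');
    INSERT INTO invoices VALUES (1, 1, 50.0), (2, 1, 50.0), (3, 3, 20.0);
""")

# One SELECT using: WHERE ... IN, a left join, a subquery, a string function,
# a conditional, date component extraction, and a join on an indexed column.
query = """
    SELECT a.id,
           lower(a.email) AS email,
           CASE WHEN a.plan = 'free' THEN 0 ELSE 1 END AS is_paying,
           CAST(strftime('%m', a.created_at) AS INTEGER) AS signup_month,
           COALESCE(t.total, 0) AS total_invoiced
    FROM accounts a
    LEFT JOIN (SELECT account_id, SUM(amount) AS total
               FROM invoices
               GROUP BY account_id) t ON t.account_id = a.id
    WHERE a.plan IN ('free', 'plus')
    ORDER BY a.id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left join keeps the free account even though it has no invoices, which is the usual reason to prefer it over an inner join when counting or summing per account.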

Basic math skills

Once you have some data, you can do some maths. The list of what I consider to be essential math skills and concepts is not a long one:

  • Arithmetic (addition, subtraction, multiplication, division)
  • Percentages (of total, difference vs. another value)
  • Mean and median (and mean vs. median)
  • Percentiles
  • Histograms and cumulative distribution functions
  • An understanding of probability, randomness, and sampling
  • Growth rates (simple and compound)
  • Power analysis (for proportions and means)
  • Significance testing (for proportions and means)

This isn’t a very complicated set of things. It’s not about the math, it’s about the problem you’re solving.
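For a sense of how little code several of these concepts take, here is a short Python sketch using only the standard library. The invoice amounts and account counts are made up for illustration, and `compound_growth_rate` is a hypothetical helper name.

```python
from statistics import mean, median, quantiles

# Hypothetical invoice amounts: most are small, one is very large.
invoices = [20, 20, 20, 50, 50, 50, 100, 100, 150, 3000]

avg = mean(invoices)
mid = median(invoices)
# quantiles(..., n=10) returns the nine cut points between deciles;
# the last one is the 90th percentile.
p90 = quantiles(invoices, n=10, method="inclusive")[-1]

# Percentage of the total that comes from the single largest invoice.
share_of_largest = max(invoices) / sum(invoices) * 100


def compound_growth_rate(start, end, periods):
    """Per-period growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / periods) - 1


# Hypothetical: 1,000 accounts growing to 2,000 over 12 months.
monthly = compound_growth_rate(1000, 2000, 12)
simple = (2000 - 1000) / 1000 / 12

print(f"mean {avg:.0f} vs. median {mid:.0f}")
print(f"90th percentile: {p90:.0f}")
print(f"largest invoice is {share_of_largest:.0f}% of the total")
print(f"compound monthly growth {monthly:.2%}, simple {simple:.2%}")
```

The gap between mean and median here comes entirely from one outlier, which is why "mean vs. median" earns its own place on the list.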

Slightly more advanced math concepts

On occasion, some more advanced mathematical or SQL concepts or skills are of value to common business problems. A handful of the more common things I use include:

  • Analytic functions if supported by your database (lead(), lag(), rank(), etc.)
  • Present and future value and discount rates
  • Survival analysis
  • Linear and logistic regression
  • Bag of Words textual representations
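Two of these are compact enough to sketch in a few lines of Python: present value with a discount rate, and a bag-of-words representation of text. Both functions and all inputs below are illustrative assumptions, written with the standard library only.

```python
from collections import Counter
import re


def present_value(cash_flows, discount_rate):
    """Discount a list of future cash flows back to today.

    cash_flows[0] arrives one period from now, cash_flows[1] two periods
    from now, and so on. discount_rate is per period (0.10 means 10%).
    """
    return sum(cf / (1 + discount_rate) ** (t + 1)
               for t, cf in enumerate(cash_flows))


def bag_of_words(text):
    """Count word occurrences, ignoring case, punctuation, and word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical customer paying 100 a year for three years, discounted at 10%.
pv = present_value([100, 100, 100], 0.10)
print(round(pv, 2))

# Hypothetical support message.
bag = bag_of_words("I can't find the filters. Where are the filters?")
print(bag.most_common(2))
```

The present value comes out below the 300 of nominal payments because later payments are discounted more heavily.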

There are some problems that require more advanced techniques, and I don’t mean to disparage or dismiss those. If your business can truly benefit from things like deep learning, congratulations! That probably means you’ve solved all the easy problems that your business is facing.

Data scientists mostly just do arithmetic and that’s a good thing

Hi, I’m Noah. I work at Basecamp. Sometimes I’m called a “data scientist.” Mostly, I just do arithmetic, and I’m ok with that.

Here are a few of the things I worked on in the last couple of weeks, each of them in response to a real problem facing the business:

  • I analyzed conversion, trial completion, and average invoice amounts for users in different countries.
  • I identified the rate at which people accidentally sign up for Basecamp when they mean to sign in to an existing account and how that’s changed over time.
  • I analyzed and reported on financial performance of a few of our products.
  • I ran and analyzed a survey of account owners.
  • I analyzed an A/B test we ran that affected the behavior of a feature within Basecamp.

In the last two weeks, the most “sophisticated” math I’ve done has been a few power analyses and significance tests. Mostly what I’ve done is write SQL queries to get data, perform basic arithmetic on that data (computing differences, percentiles, etc.), graph the results, and write paragraphs of explanation or recommendation.

I haven’t coded up any algorithms, built any recommendation engines, deployed a deep learning system, or built a neural net.

Why not? Because Basecamp doesn’t need those things right now.

The dirty little secret of the ongoing “data science” boom is that most of what people talk about as being data science isn’t what businesses actually need. Businesses need accurate and actionable information to help them make decisions about how they spend their time and resources. There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means that is best gained using simple methods.

Some people will argue that what I’ve described as being valuable isn’t “data science”, but is instead just “business intelligence” or “data analytics”. I can’t argue with that arbitrary definition of data science, but it doesn’t matter what you call it — it’s still the most valuable way for most people who work with data to spend their time.

I get a fair number of emails from people who want to get into “data science” asking for advice. Should they get a master’s degree? Should they do a bunch of Kaggle competitions?

My advice is simple: no. What you should probably do is make sure you understand how to do basic math, know how to write a basic SQL query, and understand how a business works and what it needs to succeed. If you want to be a valuable contributor to a business, instead of spending your weekend working on a data mining competition, go work in a small business. Talk to customers. Watch what products sell and which ones don’t. Think about the economics that drive the business and how you can help it succeed more.

Knowing what matters is the real key to being an effective data scientist.

Infrastructure worth investing in

I analyze data for a living. I occasionally do some other things for Basecamp — help with marketing, pitch in on support, do some “business” things — but at the end of the day, I analyze data. Some of the data is about feature usage, some about application performance or speed, some about our great support team, some about financial matters, and some of it defies categorization; regardless of the type of data, my job is to identify business problems, use data to understand them, and make actionable recommendations.

If you ask almost any data analyst, they’ll tell you that the biggest chunk of their time is spent cleaning and preparing data — getting it into a form that’s usable for reporting or analysis. Ironically, I don’t have any actual data about how much time that process consumes, either personally or for the profession as a whole, but I’d guess that the time spent on acquisition and transformation of data outweighs actual math, statistics, or coming up with recommendations by a factor of five to one or more.

Over time, both personally and as an organization, you get better at capturing and preparing data, and it eats up less and less of your time. I’d characterize the time I’ve spent at Basecamp as being four fairly distinct phases of increasingly greater sophistication in terms of how I prepare data for analysis:

  1. The CSV phase: In my early days at Basecamp, I was just happy to have data at all, and everything was basically a comma separated value (CSV) file. I used `SELECT … INTO OUTFILE` to get data from MySQL databases, `awk` to get things from log files, and the ‘export’ button from third party services to get data that I could then analyze.
  2. The R script phase: After a month or two, I graduated to a set of R scripts to get data directly into my analysis environment of choice. A wrapper function got data from our MySQL databases and I wrote API wrappers to get data from external services. Our first substantial automated reports showed up in this phase, and they were literally R scripts piped to `sendmail` on my laptop.
  3. The embryonic data warehouse: Eventually, a fledgling “data warehouse” started to take form — a MySQL instance held some data explicitly for analysis, a Hadoop cluster came into the picture to process and store logs, and we added a dashboard application that standardized reporting.
  4. The 90% data warehouse (today): Today, a centralized data warehouse holds almost all of our data, every type of data belongs to a documented schema, and Tableau has dramatically changed the way we do reporting. It’s not perfect: some pieces of data remain scattered and are analyzed only after cleaning and manual processing, but that’s the exception rather than the rule.
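The "CSV phase" step of turning log lines into something analyzable can be sketched as follows. This is a Python stand-in for the kind of extraction described, not the `awk` commands actually used; the log format, field names, and sample lines are hypothetical.

```python
import csv
import io
import re

# Hypothetical log format; real application logs will differ.
LOG_LINE = re.compile(
    r'(?P<timestamp>\S+) (?P<method>[A-Z]+) (?P<path>\S+) '
    r'status=(?P<status>\d{3}) duration_ms=(?P<duration_ms>\d+)'
)
FIELDS = ["timestamp", "method", "path", "status", "duration_ms"]


def log_to_csv(lines, out):
    """Write one CSV row per parseable log line; return how many were skipped."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    skipped = 0
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            skipped += 1
            continue
        writer.writerow(match.groupdict())
    return skipped


sample_log = [
    "2016-03-01T12:00:01Z GET /projects status=200 duration_ms=48",
    "2016-03-01T12:00:02Z POST /todos status=201 duration_ms=112",
    "worker restarted",
]

buffer = io.StringIO()
skipped = log_to_csv(sample_log, buffer)
print(buffer.getvalue())
print("skipped lines:", skipped)
```

Counting the skipped lines rather than silently dropping them is a small guard against the data-quality surprises that make preparation so time-consuming.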

Over the course of this transformation, the time that I spend preparing data for analysis has fallen dramatically — it used to be that I started any “substantial” analysis with two or three days of getting and cleaning data followed by a day of actually analyzing it; now, I might spend twenty minutes getting vastly greater quantities of already clean data and then a couple of days analyzing it far more deeply.

That evolution didn’t come for free — it took substantial investments of both time and money into our data infrastructure. We’ve built out two physical Hadoop clusters, bought software licenses, and poured hundreds of hours over the last five years into developing the systems that enable reporting and analysis.

I used to struggle with feeling guilty every time I spent time on our data infrastructure. After all, wasn’t my job to analyze data and help the business, not build data infrastructure?

Over time, I’ve come to realize that there’s nothing to feel guilty about. The investments in our infrastructure have paid dividends many times over: in direct time savings (mine and others’), in greater insights for the company, and in empowering others to work with data. In the example analysis case I described above, the transformation infrastructure saved a day or two of my time and delivered a better result to the business; I do perhaps thirty or forty such analyses per year. That makes a few weeks or even months of total time spent on those investments look like a bargain.
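The arithmetic behind that claim is simple enough to spell out. The sketch below uses the ranges given in the text (one to two days saved per analysis, thirty to forty analyses per year); the five-day working week used to convert days into weeks is an added assumption.

```python
def days_saved_per_year(analyses_per_year, days_saved_per_analysis):
    """Total analyst days saved in a year by faster data preparation."""
    return analyses_per_year * days_saved_per_analysis


# Ranges from the text: a day or two saved on each of thirty to forty analyses.
low = days_saved_per_year(30, 1)
high = days_saved_per_year(40, 2)

# Assumption for illustration: about 5 working days per week.
print(f"{low}-{high} working days saved per year")
print(f"about {low / 5:.0f}-{high / 5:.0f} working weeks per year")
```

Even the low end of that range repays a few weeks of infrastructure work within a single year, before counting time saved by anyone else.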

I often hear people argue that “investing in infrastructure” is just code for giving in to “Not Invented Here” syndrome. Yet the single highest-impact infrastructure investment we’ve made was actually abandoning a custom-developed reporting solution for a piece of commercially developed software. Just like any sort of investment, you can of course spend your resources poorly, but done properly, investing in infrastructure can deliver one of the highest returns you can possibly achieve.

Seth Godin had an excellent take on the topic recently:

Here’s something that’s unavoidably true: Investing in infrastructure always pays off. Always. Not just most of the time, but every single time. Sometimes the payoff takes longer than we’d like, sometimes there may be more efficient ways to get the same result, but every time we spend time and money on the four things, we’re surprised at how much of a difference it makes.

I recently wrapped up a fairly large infrastructure project at Basecamp, and my focus is naturally swinging back towards focusing more exclusively on the core of what I do: analyzing data. For the first time, however, I’m moving on from an infrastructure project without much guilt about whether it was an investment worth making. Instead, I’m looking forward to reaping the dividends from these investments for years to come.