Let’s Chart: stop those lying line charts

I want to talk about one of the most basic tasks a data analyst will be asked to do on a regular basis: present some data over a period of time.

Let’s look at a chart of monthly sales from Noah’s Imaginary Widget Company. I see charts like this on a regular basis:

A basic line chart, right? Nothing fancy or special about it, just a couple clicks in Excel.

Not so simple: this innocent little chart is actually lying to you in a couple of significant ways.

First, you can’t actually tell where the monthly sales values fall. They sit at evenly spaced points along the width of the chart, but it’s very difficult to mentally place them there. Let’s fix that:

A little better — I can see actual data points now. This chart is still lying to us though. Let’s zoom in on September and October to see why:

The chart makes it look like sales dipped below their September number and then increased to October. This isn’t actually true, or at least we can’t tell from the data we have — we only have monthly numbers, so we can’t possibly have enough information to say that’s what happened in between those two points.

When you use the “smoothed” lines functionality in Excel, Highcharts, D3 or any other visualization tool, you’re asking the tool to lie for you. It’ll happily fit an equation to make things look smooth, but that’s not representing the data. I wish tools didn’t make it so easy to invent data — I can’t think of a legitimate case where you should use an auto smoothing function like this.
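To see how a smoothing function can invent a dip, here’s a minimal sketch using a uniform Catmull-Rom spline, one common way to draw a smooth curve through a set of points. This is an illustration only: it is not necessarily the curve that Excel, Highcharts, or D3 draws, and the sales figures are made up.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Value of a uniform Catmull-Rom spline between p1 (t=0) and p2 (t=1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )

# Hypothetical monthly sales: August, September, October, November.
aug, sep, octo, nov = 120, 60, 62, 130

# Sample the smoothed curve between the September and October points.
curve = [catmull_rom(aug, sep, octo, nov, i / 20) for i in range(21)]

lowest = min(curve)
print(f"September: {sep}, October: {octo}, lowest point on curve: {lowest:.1f}")
```

The curve passes through both real data points, but in between it bottoms out around 53, a value that appears nowhere in the data. The dip comes entirely from the spline carrying August’s downward slope through September.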

Let’s straighten out those lines:

This is getting better — we no longer imply some perfect mathematical equation that doesn’t exist.

This chart still has a big problem though: by connecting the data points, we imply continuity in the underlying data that doesn’t exist. All we have is monthly data, but when you connect them together, you imply to the viewer that you know what happened in between the points.

For example, zoom in on July, August, and September. At the monthly level, they look like:

Here’s one set of daily data that could make up this monthly data:

Alternately, here’s a different set of daily data that would get you the same monthly trend:

Connecting the monthly data points together sure makes it seem to the viewer more like the former than the latter, but we don’t actually have enough data to make that conclusion. It could just as easily be the latter case, but you’re unlikely to consider that possibility based on the monthly connected line chart.
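The same point can be made with a few lines of code. This is a sketch with invented numbers (the year, the daily rates, and the "everything on the 15th" pattern are all assumptions for illustration): two very different daily histories roll up to identical monthly totals.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_totals(daily):
    """Sum a {date: value} mapping into {(year, month): total}."""
    totals = defaultdict(int)
    for day, value in daily.items():
        totals[(day.year, day.month)] += value
    return dict(totals)

# Every day from July 1 through September 30 (the year is arbitrary).
start, end = date(2016, 7, 1), date(2016, 9, 30)
days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

# Scenario A: the same sales every day within a month.
per_day = {7: 100, 8: 120, 9: 90}
steady = {d: per_day[d.month] for d in days}

# Scenario B: the whole month's sales land on the 15th, nothing otherwise.
month_total = monthly_totals(steady)
lumpy = {d: month_total[(d.year, d.month)] if d.day == 15 else 0 for d in days}

print(monthly_totals(steady))
print(monthly_totals(lumpy))
```

Both print the same three monthly totals, so a chart built from monthly data alone cannot tell these two histories apart.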

The better visualization here is actually to not use a line chart at all. A bar or column chart better conveys discrete quantities like monthly sales: it’s easier to compare relative quantities visually, and it doesn’t imply continuity in the underlying data where there is none.

Much better.

But Noah, aren’t you guilty of using line charts without truly continuous underlying data?

Yep, I am. When you have high frequency data (like if you have once-per-hour data for a few weeks), even though you’re implying some continuity that doesn’t really exist, it can be much easier to comprehend when you do connect the datapoints.

For example, here’s some actual data that meets that criterion: hourly signups for Basecamp over the last two weeks. The bar chart version isn’t bad, but it’s a little hard to grok at first glance, because there’s so much visually going on at that density:

You can probably get a little better by changing the width of the bars, but the equivalent line chart is, at least to me and most people I’ve talked to, a lot easier to comprehend:

So yes, sometimes I deceive with line charts, but it’s a small lie that I can live with.

What if I really do want smoothed data?

If you want to show “smoothed” data, that’s OK, but you should explicitly decide what sort of transformation you want to apply to “smooth” the data and acknowledge it. Here’s that same signup data with a five hour moving average applied:
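A moving average is simple enough to write by hand, which is part of its appeal: you know exactly what was done to the data. Here’s a minimal sketch of a trailing moving average. Whether the window should be trailing or centered is a choice you have to make; trailing is assumed here, and the signup counts are invented.

```python
def moving_average(values, window=5):
    """Trailing moving average; one output per full window of inputs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical hourly signup counts.
hourly_signups = [12, 15, 9, 30, 14, 11, 13, 28, 10, 12]
smoothed = moving_average(hourly_signups, window=5)
print(smoothed)
```

The output is four points shorter than the input, because the first four hours don’t have a full five-hour window behind them. That’s a detail worth stating on the chart rather than hiding.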

This isn’t fancy analysis, and I don’t claim to be Edward Tufte — I put out plenty of bad visualizations myself. This might seem too basic to be worth talking about, but I see this sort of deceitful chart almost every day, both from analysts and in tons of commercial products which use smoothed line charts.

If you like charting, maybe you’d like to try out a daily chart habit — you’ll get lots of practice at making good (and some bad) charts.

Getting your recommended daily chart allowance

About a year ago, I wrote about something I’d recently started doing at Basecamp, and a year and over 250 charts later, I’m still at it: every workday, I share a different “chart of the day” with my coworkers at Basecamp.

The charts are just pulled from whatever I’m working on, a question someone asked, or something topical (iOS 10 was released a couple days ago, so yesterday’s chart was about adoption among our users). They can be about anything — marketing, support, operations, performance, usage, the company itself, whatever. I don’t intentionally try to make them extra interesting or visually stunning, and I try to spend no more than ten minutes per day on that day’s chart. I just find a chart and post it in the “Data” project on our Basecamp account.

A few charts from the last year

I had two primary motivations in starting this chart habit:

  1. I wanted to challenge myself to keep things fresh, and to tell stories with data without using more than one chart and a couple sentences.
  2. I wanted to make data more accessible. You shouldn’t have to set aside a half hour to read a report to get a piece of information that can change the way you think.

It’s been a fun challenge to keep this up for over a year, and I’d like to share a few of the things that stand out to me from the last 272 charts.

Isn’t this just chart junk food?

Given how I feel about real-time dashboards and the importance of solving real business problems, I sometimes wonder if these charts are just the data equivalent of junk food.

Maybe they’re a little high in data sugar, but I think they serve a purpose that you don’t get from a dashboard.

Every day is something different. In a year of charting, I’ve never intentionally reused a chart, which means that people have seen over 250 unique slices of data about our business. That’s a breadth that’s hard to achieve any other way.

There’s context. I don’t do a lengthy writeup about each chart, but I write a sentence or two about what the chart shows and why it matters. A chart with no context might just be eye candy, but contextualizing makes it more valuable.

They’re a conversation. I post a chart. People read the chart. Some people applaud it. Some people ask a question that I can try to answer. Some people reference it later. Today’s chart is influenced by yesterday’s chart. Unlike a dashboard or a report, the chart of the day serves as the starting point for a conversation about the challenges we’re facing as a company and the things that people care about.

Making data fun

One of my goals with Chart of the Day is to make working with and thinking about data fun for people. Data isn’t just numbers and long reports; it can also inspire, motivate, intrigue, and make you laugh out loud. While I hope that all of my charts cause joy, there are a couple things that I’ve started doing that are a little more blatant in their aim.

Round numbers

When you do something daily, you’ll inevitably start numbering things, and when you do that, you hit round number milestones, and you’ll be tempted to go a little crazy.

I wanted chart #100 to literally light people up with a look at our growth as a company over more than ten years [pun intended].

Chart #100 now resides at Basecamp headquarters in Chicago

When it came time for chart #200, I was just hungry.

Chart #200 now resides in my stomach

I’ve got about two months to figure out what to do for #300. Your suggestions for the wackiest, most over-the-top chart possible are appreciated.

Fun Chart Fridays

On most Fridays, rather than posting a “serious business” chart about Basecamp, I try for something a little lighter. Fun Chart Fridays are either charts about a less serious aspect of Basecamp (Campfire sounds are a perennial crowd pleaser) or something that I’ve seen elsewhere on the internet that’s interesting or amusing.

Sometimes they’re also a good chance to talk about a way of visualizing or thinking about data that’s a little different than the ordinary, or to contrast different looks at the same piece of data.

Inside the mind of a daily charter

Most days, charting is easy: I copy something from an analysis or report I’m working on and I paste it in Basecamp, write a sentence or two of explanation, and move on with my day.

Then there are days where charting is a real slog — I don’t have anything handy because I’ve been working on infrastructure, or I realize that the chart I was all set to post is actually too similar to something else I’ve posted, or I’m just tired. On those days, I sometimes question why I’m bothering. Does it really matter if I post a chart today? Or any day?

Eventually, I always convince myself that it does matter, because data can change the way people see things. People sometimes don’t even know the question they want to ask, which can make it hard for me to help them, but I can at least put a piece of data in front of them each day and hope that it sparks something in them that leads them to think about a problem a little differently. It sounds corny, but if that leads to us making a single better decision for our customers, that’s all the payback I need to make another chart.

A daily reminder that there are questions we can answer if we look at them the right way is pretty neat. — Jim

I’ve missed days because I’ve totally forgotten, and the chart of the day took a few weeks of summer vacation this year, but I’ve yet to just give up and not post on a day because I didn’t feel like it. That’s a small thing, but it gives me some satisfaction, and I’m going to keep on charting until they pry the x-axis out of my fingers.

These charts help us make Basecamp. You can use Basecamp for your daily charting habit too! If you do, let me know.

Real-time dashboards considered harmful

Walk into any startup office and you’ll see almost the exact same thing: a bunch of big televisions showing real-time dashboards. Traffic, tweets, leads, sales, revenue, application performance, support cases, satisfaction, A/B test results, open rates; you name it, there’s a real-time dashboard for it.

Walk into Basecamp and you won’t see any of those, and it’s not just because we’re a remote company. It’s because real-time dashboards are often more harmful than they are beneficial.

Robert Caro nailed it in a recent Gothamist interview:

[Gothamist] There’s something called Chartbeat — it shows you how many people are reading a specific article in any given moment, and how long they spend on that article. That’s called “engagement time.” We have a giant flatscreen on the wall that displays it, a lot of publications do.

[Caro] What you just said is the worst thing I ever heard. [Laughs]

What’s the point of that dashboard?

I do a lot of reporting: on operations, on support, on usage, on finances, on marketing, and on every other topic that matters to a business. Whenever I consider a new piece of reporting, I ask myself one question: what’s the point? What’s the action or decision that this reporting is intended to impact? When someone consumes it, what can they do about it? Can they make a decision? Can they go do something personally or ask someone to do something? If there’s nothing that can be done in response to a report, does it need to be reported in that manner?

Most real-time dashboards fail to pass this usefulness test. Knowing how many visitors you have on your site right now or what the open rate was for the email that just went out doesn’t generally enable you to do anything. In a SaaS business, knowing what today’s revenue is doesn’t really enable you to do anything either: revenue today is the consequence of a sales and retention cycle that started long ago.

There are cases where real-time dashboards are invaluable. Knowing whether database response time is higher right now than it was a few minutes ago is incredibly useful when your site is slow, and we use real-time dashboards extensively for solving availability and performance problems at Basecamp.

Schrödinger’s dashboard

Perhaps real-time dashboards aren’t that useful, but if they aren’t a lot of work to set up, what’s the harm? Isn’t faster data better data?

The problem comes when you look at a real-time dashboard: no matter how much you try to train yourself, you’re going to react to the data that you just saw. You might not realize that you’re reacting to it, but you absolutely are.

Almost every metric is noisy. Active users being down 3% from yesterday could be the start of a longer trend, but it’s much more likely that it’s just noise in the data. When you see that 3% decrease on a real-time dashboard, however, the panic starts to set in: what did we do wrong? Anything you were thinking about gets thrown out the window, because now you’re reacting to something that looks urgent, but really isn’t important.

I’ve seen many cases of people looking at real-time A/B test results and judging the experiment after an hour or two. No matter how much labeling you do to point out that the results are meaningless at that scale, humans will still draw conclusions from them. In our case, and for virtually every online business, daily updated results are more than adequate for making decisions, so there’s only downside to real-time A/B test results: the risk of making a decision off insufficient data and that decision turning out to be the wrong one.

We recently scaled back and de-emphasized the use of a bunch of metrics relating to our support team. We found that a focus on average customer happiness scores, response time, and case volume made it hard to give each individual customer the attention they deserved, and caused a ton of unnecessary stress. Kristin explained our motivation well:

We’re attempting to change our relationship with Smiley and metrics so that our focus is more on each individual customer and less on any sense of competition with ourselves and/or each other. Smiley leads us to focus on the vocal minority (about 20%) of customers who leave a rating. The customer we’re currently working with should have 100% of our attention, so we shouldn’t be worried about quickly getting rid of them to move on to the next one or focusing on the customer as a potential Smile instead of as a person who needs help.

The next time you feel the urge to look at Smiley and/or Dash, get up and take a break. Make some tea. Eat some cheese popcorn. Pet an animal. Stretch.

I’m really proud of the support team for their evolving relationship with the use of metrics. We got a lot of value out of rigorously analyzing our support caseload to figure out the right level of staffing and scheduling and to address root causes, but we can do all of those things without real-time reporting. Knowing when not to look at a piece of data is just as important as knowing when to look.

Make reporting great again

How can you make reporting less stressful and more useful? Try a few of these simple changes:

  • Change the timeframe. Instead of looking at the last day of data, look at the last week or month. Maybe there’s a bigger seasonal trend that will help to contextualize today’s data.
  • Move upstream. Instead of reporting something like daily revenue, which is the output of every step of your funnel, report on the actual underlying drivers that you can impact.
  • Contextualize. Instead of showing an absolute metric, show a percentage change or a comparison to last week or last month.
  • Convert dashboards to alerts. Computers are great at sending emails according to defined conditions, so let them do that. Don’t rely on checking a real-time dashboard to detect that something isn’t right; define your criteria and let an automated system tell you when you need to take a deeper look.
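As a rough sketch of the last two ideas, here’s what a contextualized metric and a simple alert rule might look like. The function names, the user counts, and the threshold of three standard deviations are all assumptions for illustration; the right criteria depend on the metric and how noisy it is.

```python
from statistics import mean, stdev

def percent_change(current, previous):
    """Change from `previous` to `current`, as a percentage."""
    return (current - previous) * 100 / previous

def needs_attention(history, today, threshold=3.0):
    """True if today's value sits more than `threshold` standard
    deviations away from the mean of recent history."""
    spread = stdev(history)
    if spread == 0:
        return today != history[0]
    return abs(today - mean(history)) / spread > threshold

# Hypothetical daily active users for the previous seven days.
daily_active_users = [1000, 1020, 980, 1010, 990, 1005, 995]

for today in (970, 900):
    change = percent_change(today, daily_active_users[0])
    flag = "ALERT" if needs_attention(daily_active_users, today) else "ok"
    print(f"today={today} change vs. a week ago={change:+.1f}% -> {flag}")
```

With these made-up numbers, a day that is 3% below the same day last week stays inside the normal range and triggers nothing, while a 10% drop gets flagged. In a real setup the `print` would be an email or a chat message, and nobody would need to watch a screen to catch it.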

I get it: real-time reporting is fun. It’s something shiny to put up in your lobby, and it fills you with lots of little bits of trivia to drop at a moment’s notice. But that comes at a cost, and too many people embrace real-time reporting without thinking through the consequences.

A paean to slow data

Eschewing real-time dashboards is just one part of what I like to call a “slow data” approach to data science. I’m not talking about free-range histograms or artisanal Poisson distributions, but about taking the time to really understand the problem you’re solving, the data you’re using, and the implications of the results. My profession spends most of its time talking about statistical methods and visualization, and very little time talking about the actual business problems or impacts of the work. Fortunately, I mostly just do arithmetic, make very simple charts, and avoid making real-time dashboards, so I have lots of time to think about the problem we’re trying to solve.

I’d encourage you to give this slower approach to data science a shot in your organization too. Next time you think about making a real-time dashboard, ask a deeper question about the underlying problem instead. I guarantee you’ll find more value from that.

Practical skills that practical data scientists need

When I wrote about how I mostly just use arithmetic, a lot of people asked me about what skills or tools a data scientist needs if not fancy algorithms. What is this mythical “basic math” that I mentioned? Here’s my take on what skills are actually needed for the sort of work that I do at Basecamp: simple analyses focused on solving actual business problems.

The most important skill: being able to understand the business and the problem

I’ll get to actual practical skills that you can learn in a textbook in a minute, but first I have to belabor one point: the real essential skill of a data scientist is the ability to understand the business and the problem, and the intellectual curiosity to want to do so. What are you actually trying to achieve as a business? Who are your customers? When are you selling your product? What are the underlying economics of the business? Are profit margins high or modest? Do you have many small customers or a few large customers? How wide is your product range? Who are you competing with? What challenge is the business facing that you’re trying to solve or provide input towards a decision on? What’s the believable range of answers? Who is involved in solving this problem? Can analysis actually make a difference? How much time is worth investing in this problem?

Understanding the data

Before looking at any data or doing any math, a data scientist needs to understand the underlying data sources, structure, and meaning. Even if someone else goes out and gets the data from wherever it’s stored and gives it to you, you still need to understand the origin and what each part of the data means. Data quality varies dramatically across and within organizations; in some cases you’ll have a well documented data dictionary, and in other cases you’ll have nothing. Regardless, you’ll want to be able to answer the following questions:

  • What data do I need to solve the problem?
  • Where is that data located? In a relational database? In a log file on disk? In a third party service?
  • How comprehensive (time and scope) is the data? Are there gaps in coverage or retention?
  • What does each field in the data mean in terms of actual behavior of humans or computers?
  • How accurate is each field in the data? Does it come from something that’s directly observed, self-reported, third-party sourced, or imputed?
  • How can I use this data in a way that minimizes the risk of violating someone’s privacy?

SQL skills

For better or worse, most of the data that data scientists need lives in relational databases that quack SQL, whether that’s MySQL, Postgres, Hive, Impala, Redshift, BigQuery, Teradata, Oracle, or something else. Your mission is to free the data from the confines of that relational database without crashing the database instance, pulling more or less data than you need to, getting inaccurate data, or waiting a year for a query to finish.

Virtually every query a data scientist writes to get data to analyze to solve business problems will be a SELECT statement. The essential SQL concepts and functions that I find necessary are:

  • WHERE clauses, including IN (…)
  • Joins, mostly left and inner
  • Using already indexed fields
  • if()
  • String manipulation, primarily left() and lower()
  • Date manipulation: date_add, datediff, to and from UNIX timestamps, time component extraction
  • regexp_extract (if you’re lucky to use a database that supports it) or substring_index (if you’re less lucky)
  • Subqueries
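To make the list concrete, here’s a small self-contained sketch that exercises most of these concepts in one SELECT. It runs against an in-memory SQLite database purely so that it’s runnable anywhere; the tables, columns, and rows are invented. SQLite spells a few things differently from the functions named above: CASE WHEN stands in for if(), substr() and instr() for left() and substring_index, and julianday() arithmetic for datediff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, plan TEXT, created_at TEXT);
    CREATE TABLE invoices (account_id INTEGER, amount INTEGER);
    CREATE INDEX invoices_account ON invoices (account_id);
    INSERT INTO accounts VALUES
        (1, 'Ann@Example.com',  'basic', '2016-09-01'),
        (2, 'bob@widgets.test', 'plus',  '2016-09-21'),
        (3, 'cy@example.com',   'free',  '2016-08-01');
    INSERT INTO invoices VALUES (1, 20), (1, 30);
""")

query = """
    SELECT a.id,
           -- String manipulation: lower-cased email domain.
           lower(substr(a.email, instr(a.email, '@') + 1)) AS domain,
           -- Conditional, the equivalent of if().
           CASE WHEN paid.total_amount IS NULL THEN 0 ELSE 1 END AS has_paid,
           coalesce(paid.total_amount, 0) AS total_paid,
           -- Date manipulation: account age in days as of a fixed date.
           CAST(julianday('2016-10-01') - julianday(a.created_at) AS INTEGER) AS age_days
    FROM accounts AS a
    -- Left join to a subquery, so accounts with no invoices still appear.
    LEFT JOIN (
        SELECT account_id, sum(amount) AS total_amount
        FROM invoices
        GROUP BY account_id
    ) AS paid ON paid.account_id = a.id
    WHERE a.plan IN ('basic', 'plus')
    ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left join is what keeps the account with no invoices in the result; an inner join would silently drop it, which is exactly the kind of "less data than you need" mistake that’s easy to miss.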

Basic math skills

Once you have some data, you can do some maths. The list of what I consider the essential math skills and concepts is not a long one:

  • Arithmetic (addition, subtraction, multiplication, division)
  • Percentages (of total, difference vs. another value)
  • Mean and median (and mean vs. median)
  • Percentiles
  • Histograms and cumulative distribution functions
  • An understanding of probability, randomness, and sampling
  • Growth rates (simple and compound)
  • Power analysis (for proportions and means)
  • Significance testing (for proportions and means)

This isn’t a very complicated set of things. It’s not about the math, it’s about the problem you’re solving.
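To show how little machinery is involved, here’s a sketch of a few items from the list using only Python’s standard library. The data is made up, the percentile uses the nearest-rank definition (one of several in common use), and the significance test is a pooled two-proportion z-test, which is one standard choice rather than the only one.

```python
import math
from statistics import mean, median

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least
    p percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def compound_growth_rate(start, end, periods):
    """Constant per-period growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / periods) - 1

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# Mean vs. median: one large outlier drags the mean, not the median.
response_minutes = [1, 2, 2, 3, 3, 4, 100]
print(mean(response_minutes), median(response_minutes))
print(percentile(response_minutes, 90))

# Compound growth: 100 customers growing to 121 over two periods.
print(compound_growth_rate(100, 121, 2))

# Hypothetical A/B test: 120 of 1000 converted vs. 100 of 1000.
p_value = two_proportion_p_value(120, 1000, 100, 1000)
print(round(p_value, 3))
```

In the A/B example, a 12% versus 10% conversion rate looks like a clear win, but with 1,000 visitors per arm the p-value comes out around 0.15, so the difference could easily be noise.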

Slightly more advanced math concepts

On occasion, some more advanced mathematical or SQL concepts or skills are of value to common business problems. A handful of the more common things I use include:

  • Analytic functions if supported by your database (lead(), lag(), rank(), etc.)
  • Present and future value and discount rates
  • Survival analysis
  • Linear and logistic regression
  • Bag of Words textual representations
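Of these, present value is the easiest to show in a few lines. This is a minimal sketch, assuming cash flows arrive at the end of each period and a constant discount rate; the subscription price and the 1% monthly rate are invented numbers.

```python
def present_value(cash_flows, discount_rate):
    """Discount cash flows received at the end of periods 1, 2, 3, ...
    back to today's value."""
    return sum(
        cash_flow / (1 + discount_rate) ** period
        for period, cash_flow in enumerate(cash_flows, start=1)
    )

# Hypothetical customer paying 50 per month for a year,
# discounted at 1% per month.
year_of_payments = [50] * 12
value_today = present_value(year_of_payments, 0.01)
print(round(value_today, 2))
```

The result is somewhat less than the undiscounted 600, which is the whole point: money that arrives later is worth less than money in hand today.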

There are some problems that require more advanced techniques, and I don’t mean to disparage or dismiss those. If your business can truly benefit from things like deep learning, congratulations! That probably means you’ve solved all the easy problems that your business is facing.