The value of human, exploratory testing

Ann and Michael find things programmers never would have.

Since unit testing and test-driven development burst onto the programming scene in the early 2000s, too many programmers have deluded themselves into thinking that they could ship high-quality software with automated testing alone. It’s a mirage.

Don’t get me wrong. The industry took a big leap forward when the tooling and conventions for automated testing got put in the spotlight. But in many corners, it also threw the baby out with the bathwater. Automated testing does not replace “testing by hand”; it augments it.

Testing by hand, or exploratory testing, is a crucial technique for ferreting out issues off the happy path. It is best carried out by dedicated testers who did not work on the implementation. Those pesky auditors who have the nerve to try using the application in all the ways a real user might.

None of this is news, of course. I remember reading a statistic long ago saying that Microsoft had three testers for every developer. That sounds wild to me, but I suppose if you’re trying to keep three decades of backwards compatibility going, maybe you do need that sort of firepower.

What it is, though, is forgotten wisdom. Especially in the small and mid-sized shops. Propelled by the idea that automation could take care of the testing, dedicated testers weren’t even on the menu in many establishments.

For many years that included us at Basecamp. Yeah, sure, programmers and designers would sorta click through a feature and make sure it sorta worked. Then we’d ship it and see what users found.

But we leveled up in a big way when Michael Berger became the first dedicated tester at Basecamp several years ago. He’s since been joined by Ann Goliak. And between the two of them, we’ve never shipped higher quality software. Far more issues are caught in the dedicated QA rounds that precede all major releases.

What Ann and Michael bring to the table just cannot be replicated by programmers writing automated tests. We still write lots of those, and they serve as a very useful guide during development, and form a strong regression suite. But they’re woefully insufficient. And it doesn’t matter whether they’re unit, functional, model, or even system tests. They’re no substitute for a fresh pair of human eyes bent on breakage.

I hope we start seeing a renaissance for human testers at some point. Not just as something to do if there’s time, but something for dedicated individuals to do because it’s effective. Long live manual testing!

Ann and Michael just finished testing a major upgrade to the todos feature in Basecamp 3. You should give it a try.

Behind the scenes: A/B testing at Highrise

Highrise launched in 2007 and was a leader in teaching folks about some successful marketing split tests. Today we’ve got a few new lessons. First let’s talk a bit about strategy, and then I’ll share a couple important results we’ve seen.

When to stop

Here at Highrise, we split test constantly. Since these can take quite a lot of time to reach statistical significance, and I want to keep using our time efficiently, it’s important to have another test ready in the queue as soon as one completes. Even if it’s changing a word on a single button.
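For readers wondering what "reaching statistical significance" can look like in practice, here is a minimal sketch of one common check, a two-sided two-proportion z-test. It is not necessarily the method Highrise uses. The function name and the visitor and signup counts are invented for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Uses the pooled two-proportion z-test (a normal approximation), which is
    only reasonable when each variant has plenty of visitors and conversions.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal, via erfc.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 10,000 visitors per variant.
p_small_gap = two_proportion_p_value(200, 10_000, 215, 10_000)
p_large_gap = two_proportion_p_value(200, 10_000, 260, 10_000)
print(f"small gap p-value: {p_small_gap:.3f}")
print(f"large gap p-value: {p_large_gap:.3f}")
```

With these made-up counts, the small gap is nowhere near a conventional 0.05 threshold while the large gap clears it. That is why a one-word button test can sit in the queue for a long time before it tells you anything.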

But it’s super easy to get stuck. We run dozens and dozens of experiments and often end up with nothing. No changes. Or the new stuff just makes things worse.

So when do you stop? It’s helpful to come up with some baselines. We look at other products out there, our past, and our sister product Basecamp. How are we doing compared to those baselines? In some cases we’re behind, so those are ripe areas for testing. Other areas we’re on the baseline, and have decided our time is better spent elsewhere.

Don’t get lazy

Here’s a lesson that came to us at great cost: we weren’t measuring enough variables. When I took over Highrise in 2014, we immediately started split testing an entire new marketing site design. We compared the old and new site’s signups. Waited until we had a statistically significant result. And bingo. Saw the new site was doing better, and switched all our traffic over to it.

We were befuddled then when we saw our growth plateau. What were we doing wrong? We’d been improving the product at a super fast pace. People seemed really happy.

Turned out we were measuring the wrong thing. We were split testing total signups which include free and paid.

When we dove in, we saw our new marketing site had actually hurt our paid conversions but improved our free conversions, masking the overall impact. And those free conversions weren’t upgrading at a high enough rate to make up for it.

But that’s not the whole story either. People who were still signing up for our paid plans were now more often signing up for the cheaper plan instead of what had been our most popular plan, a more expensive one. That totally changed our revenue mix.

The lesson: measure more detail than you think you need to. Put the extra work into splitting up the different variables that are important to you in your split test regime.
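To make the masking effect concrete, here is a small sketch that breaks signups out by variant and plan instead of only counting totals. The numbers, the plan names, and the `summarize` helper are invented for illustration; none of this is Highrise's actual data or tooling.

```python
from collections import Counter

# Hypothetical signup log: one (variant, plan) pair per signup.
signups = (
    [("old", "free")] * 300 + [("old", "basic")] * 40 + [("old", "plus")] * 60
    + [("new", "free")] * 380 + [("new", "basic")] * 50 + [("new", "plus")] * 20
)
visitors = {"old": 10_000, "new": 10_000}


def summarize(signups, visitors):
    """Per-variant conversion rates for total, free, and paid signups,
    plus the share of paid signups that landed on each paid plan."""
    counts = Counter(signups)
    report = {}
    for variant, n in visitors.items():
        by_plan = {plan: c for (v, plan), c in counts.items() if v == variant}
        free = by_plan.get("free", 0)
        paid = sum(c for plan, c in by_plan.items() if plan != "free")
        report[variant] = {
            "total_rate": (free + paid) / n,
            "free_rate": free / n,
            "paid_rate": paid / n,
            "paid_mix": {
                plan: c / paid for plan, c in by_plan.items() if plan != "free"
            } if paid else {},
        }
    return report


report = summarize(signups, visitors)
for variant, stats in report.items():
    print(variant, stats)
```

In this made-up data the new variant wins on total signups while losing on paid signups and pushing paid customers toward the cheaper plan. A test that only watches the total would call it a success.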

It could also be worth paying for a data scientist to come in and make sure you’re doing the right thing. I’ve been split testing marketing sites for many years, but it took Noah Lorang at Basecamp to open my eyes that we were doing something pretty stupid. And it didn’t take him long either. This doesn’t have to be an elaborate project. Just make sure you’re testing the right things. Don’t get lazy. Or you could pay an expensive price like we did.

Some interesting results

Too many plans

One change we made when we relaunched our marketing site was to our plans page. We went from this:

to this:

I wasn’t in love with the change, but I didn’t hate it either. It brought our new, more minimal, aesthetic to the plans, and it also addressed one thing we heard from some customers: “Do you have a bigger plan?”

We did! We just didn’t advertise it. So let’s add that. Can’t hurt?

Well it did hurt (like I mentioned above — we just didn’t see it soon enough). Paid signups went down and people started signing up more for the Basic plan.

When we moved back to something more akin to the old design:

Paid signups went back up 51.4%. And our new revenue improved 67.6%!!

Quite the mistake and improvement. Why the difference? Probably easy to guess that more plans don’t mean better. Too many choices. The extra choice probably just made for too much anxiety and killed people’s desire to sign up. And the original design really made things a no-brainer: “Here, don’t debate, just sign up for this.”

The free link also became bigger in our changes that weren’t working well, encouraging folks to bail into that plan. So we bumped the font size back down to what it was originally.

Revisit old hypotheses and assumptions

Another interesting result we just bumped into was an explainer video we had on our features page.

I remember when we added that over a year ago; we split tested it of course. Though again, a mistake we made, we only split tested total conversions.

Recently we decided maybe the explainer video wasn’t up to date enough (Highrise is improving at a very fast clip), but before removing it, we tested three things: removing it, leaving it at the bottom of the page where it was, or moving it to the top.

Removing it improved our free signups by 53.2%! And didn’t change our paid signups at all.

Why would getting rid of a video sitting at the bottom of a page that doesn’t get a ton of our traffic make that much of a difference?

It’s also an important reminder that not all customers behave the same way on the site. Maybe folks who are more anxious about signing up spend more time poring over the details of our features and how we present them. Then they bail into a free plan. What can we do to take advantage of that information? Maybe offer the free signup at the bottom of the features page? Improve our call to action on that page stressing free trials? Lots of options when you get more granular about what you’re testing.

It’s also worth rethinking a lot of assumptions and hypotheses that we thought we knew a couple years ago. Not just because our testing is more thorough. But also because maybe these things have changed since then. Maybe an image or a video or some copy that converted well just doesn’t have the same impact today. Maybe the consistency of those images with other assets has changed (logos and styles in screenshots). Maybe tastes have simply changed since we tested those.

Just a couple recent lessons from us. Stay tuned for a lot more. There are some really interesting changes we’ve been testing and haven’t gotten quite right yet, but have angles on improving…

P.S. You should follow my YouTube channel, where I share more behind the scenes of Highrise and how history, psychology, and science help us run our own business. And if you find yourself overwhelmed while managing your customer relationships, customer support, or the tons of people you communicate with, check out how Highrise can help!

5 steps to creating frustration-free Android test devices

How to set up devices so that manual testing doesn’t crush your soul

A few days ago, I picked up one of my test devices to try out some new code. I couldn’t believe how frustrating it was.

I wasn’t logged into the right accounts. I didn’t have the right apps installed. By the time I finished testing, I couldn’t even remember how to reproduce the bug.

And like any Android programmer, my testing frustration was magnified because we support numerous OS versions/devices.

To save my sanity, I built a system for a unified, predictable setup on every device. Here’s how to do it.

1. Install the OS versions you support

Depending on what API levels you support, ideally you have a 1–1 device to API ratio. This isn’t always possible of course, but it’s helpful.

So first things first — take an inventory of your devices and which ones support which OS versions. Then examine what your customers use the most and optimize for those scenarios.

With that in mind, my lineup looks like this right now:

  1. Nexus 5 (5.1.1) — The Nexus 5 is the most valuable device in my lineup. It’s supremely flexible and can run all the OS versions that most users have (4.4–6.x).
  2. Nexus 5 (6.0.1) — More than 50% of our customers are on 6.x. This is currently my baseline test device.
  3. Samsung Galaxy S6 (6.0.1) — Samsung devices make up a good chunk of our users, so it’s important to have at least one representative device. Their implementation of certain features (particularly WebView) can be different, so it’s important to test non-stock Android devices.
  4. Nexus 5x (7.0) — A newer device where I can test the very latest Android builds and features.
  5. Nexus 6P (7.0) — Not totally necessary, but it can be helpful to have one big screen device to see how things look in the real world, as compared to something closer to the 5″ size. Also gives me some flexibility to move down to 6.x as needed.

To truly embrace your OCD like me, slap a version label on the back of each device. 🤓

(I admittedly don’t have a 4.4 device, and rely on a Genymotion VM to test for that. I’ve debated knocking my Nexus 6P down to 6.x and flashing a Nexus 5 to 4.4.)
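If your devices are plugged in over USB with debugging enabled, part of the inventory step can be scripted. Here is a rough sketch, assuming `adb` is on your PATH. The helper names are my own invention; the `adb devices` and `getprop` commands and the three property names are standard Android ones.

```python
import subprocess


def parse_adb_devices(output):
    """Extract serial numbers of ready devices from `adb devices` output."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Ready devices look like "<serial>\tdevice"; this skips the header
        # line as well as "offline" and "unauthorized" entries.
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def getprop(serial, prop):
    """Read one system property from a device (requires adb on PATH)."""
    result = subprocess.run(
        ["adb", "-s", serial, "shell", "getprop", prop],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def inventory():
    """Return (serial, model, Android release, API level) per attached device."""
    listing = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    return [
        (
            serial,
            getprop(serial, "ro.product.model"),
            getprop(serial, "ro.build.version.release"),
            getprop(serial, "ro.build.version.sdk"),
        )
        for serial in parse_adb_devices(listing)
    ]


# Parsing works on captured text, so it can be tried without a device attached.
sample = "List of devices attached\n0123456789ABCDEF\tdevice\nemulator-5554\toffline\n\n"
print(parse_adb_devices(sample))
```

Calling `inventory()` with devices attached gives you one row per device, which is handy for printing those version labels.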

2. Install and configure a common set of testing apps

You’ve probably got a common set of apps you rely on to test your app. This is the time to make sure they’re all installed, logged in, and preferences tweaked to your liking.

App choices will vary person to person, but here are a few that I rely on and recommend:

  • 1Password — Keeps all your passwords secure and makes logging in to apps so much easier. Always the first app I install.
  • AZ Screen Recorder — Great for screencasts or to create gifs to share with teammates.
  • Chrome Beta — We do a lot of WebView work, so we want a heads up on how future versions of Chrome/WebView will behave.
  • Dropbox — Automatically uploads screenshots so I can grab them from my computer quickly. I also use it to do some file-based testing.
  • Fleksy / SwiftKey / Google Keyboard — Writing on our homegrown rich text editor, Trix, is a big part of our app. So we test various keyboards frequently.
  • Keep — Super handy to save quick notes, URLs and whatever else synced up across devices.
  • Solid Explorer — The best file manager I’ve found. Moving things around in the file system can be very handy.
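Once you have a list like this, a quick check can tell you which apps a given device is still missing. The sketch below parses the output of `adb -s <serial> shell pm list packages`, which prints one `package:<name>` line per installed app. The helper names are hypothetical, and the package IDs are my best understanding of each app's ID, so verify them against the Play Store before relying on them.

```python
def parse_packages(pm_output):
    """Turn `pm list packages` output ("package:<name>" per line) into a set."""
    prefix = "package:"
    return {
        line.strip()[len(prefix):]
        for line in pm_output.splitlines()
        if line.strip().startswith(prefix)
    }


def missing_apps(required, pm_output):
    """Return the labels of required apps whose package is not installed."""
    installed = parse_packages(pm_output)
    return sorted(
        label for label, package in required.items() if package not in installed
    )


# Assumed package IDs; double-check each one before trusting the result.
REQUIRED = {
    "Chrome Beta": "com.chrome.beta",
    "Dropbox": "com.dropbox.android",
    "Keep": "com.google.android.keep",
}

# Sample text standing in for real `pm list packages` output.
sample_output = "package:com.dropbox.android\npackage:com.android.settings\n"
print(missing_apps(REQUIRED, sample_output))
```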

3. Log in everywhere

It sounds painfully obvious, but with so many devices floating around, you might not actually be logged in everywhere you need to be. Inventory your standard places to log in and do it.

Typically for me this means logging in to just a handful of places:

  1. 1Password for Teams
  2. Google — Personal
  3. Google — Work
  4. Dropbox

It’s basic but there’s nothing more annoying than getting into your testing and realizing halfway through you’re not logged in to the right accounts.

4. Use Nova Launcher for a consistent experience

This was the real game changer for me. Using Nova Launcher, you can make every device look and work the same.

Nova Launcher all the things. 🚀

For me the biggest irritation was the launcher/app organization being different on every device. Everything was hard to find and it slowed me down.

Nova solves all of this.

You can set up your home screen, dock, and app drawer once, then share that across devices. When you pick up another device, your apps are in the exact same place as you expect. It’s predictable and fast: no hunting, no mental overhead.

Here’s how to do it.

  • Pick your favorite device and install Nova Launcher. Buy and install Nova Launcher Prime (this unlocks a set of handy features).
  • Set Nova as your home screen launcher, replacing whatever you’re currently using.
  • Open Nova settings and play with all the settings. There’s too much to cover here, but take the time to make it work exactly how you want. Nova’s customizations can do anything your heart desires.
  • When you’re happy with the setup, in settings go to “Backup & import settings”. Back up your current settings to Dropbox (or wherever).

  • Pick up one of your other devices. Install Nova again.
  • Go to “Backup & import settings” again, but this time do a restore. Pick the file from Dropbox (or wherever) that you saved in the previous step. Repeat for all devices.
  • Voila — your devices now all look and work the same!

The long-term beauty of using Nova is that as your apps or preferences change, just upload a new backup and restore it on all your other devices. You’re all set again!

5. Tweak all your system settings

The last thing to do is go through all your system preferences and get them working the same on each device. For me that means:

  • Making sure all my wifi networks are set up (home, office, favorite coffee shops)
  • DND/total silence is activated. Test devices don’t need to notify me about anything.
  • Developer options and USB debugging are enabled
  • Screen stays awake when plugged in (developer options)
  • Screen brightness is set to a level I like (with adaptive brightness off)
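A couple of the settings above can be applied from your computer once USB debugging is on. This is a hedged sketch rather than a complete setup script: it only builds the `adb shell settings put` commands for stay-awake and brightness. The function name, the placeholder serial, and the default brightness value are my own choices, and the 0–255 brightness scale is an assumption that may not hold on every device.

```python
def settings_commands(serial, brightness=128):
    """Build adb commands that apply a few display settings to one device.

    Only covers settings reachable through `settings put`. Wifi networks,
    Do Not Disturb, and turning on USB debugging itself still need to be
    done by hand on the device.
    """
    base = ["adb", "-s", serial, "shell", "settings", "put"]
    return [
        # 3 = stay awake while on AC (1) or USB (2) power
        base + ["global", "stay_on_while_plugged_in", "3"],
        # 0 = manual brightness, i.e. adaptive brightness off
        base + ["system", "screen_brightness_mode", "0"],
        # assumed 0-255 scale; adjust to taste
        base + ["system", "screen_brightness", str(brightness)],
    ]


# "0123456789ABCDEF" is a placeholder serial, not a real device.
for cmd in settings_commands("0123456789ABCDEF"):
    print(" ".join(cmd))
```

To actually apply them, pass each command list to `subprocess.run(cmd, check=True)` for every serial that `adb devices` reports.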

Optional: live with it

One thing I like to do is swap devices from time to time and “live” with our app for a day or two on that device.

Using the app on a real device under real scenarios gives valuable perspective. You can tell if everything looks, feels, and performs as you’d expect.

To make this process easier, a couple tips:

  • Use a nano SIM from your cellular provider, and keep a SIM card adapter set handy. Even though all newer devices use nano SIM, you still might run into micro SIM slots (or if you’re really lucky, a standard SIM slot!)
  • Install apps that you use outside of work. This helps ensure you don’t jump ship back to your daily driver, and you give the test device a real shot. But keep your personal apps in a separate tab in Nova’s app launcher. That way your testing apps are still front and center, but you can still get to the fun stuff and live with the device for a bit.

That’s it, I’m glad you made it this far! Following these steps should help reduce your manual testing frustrations, and hopefully keep you in the zone doing the more fun stuff (like programming everything that needs to be tested!)

I’m part of a fantastic team that builds (and tests) Basecamp 3 and its companion Android app. Check ’em out and let me know what you think!

Balance Driven Development

I mentioned in my last post that I would talk about my opinions on TDD, so here it is. Kicking it off, I will explain what TDD is and how it’s meant to work. Then I’ll explain what some people have said about it and talk about what I believe the real benefits of TDD are. Finally, I’ll walk through whether I think it’s worth using and explain my use of the practice. Oh, and I’ll also provide a disclaimer as to what the heck possessed me to pile onto this already well-discussed topic.

TDD stands for “Test Driven Development.” At its core, it’s a development practice; a way to approach writing code. The rules of how to practice TDD are fairly simple at their surface. Say you have a new function that you need in order to accomplish a task: write the smallest test you can imagine, run the test, watch it respond with a failure, write the smallest possible amount of code to make that test pass, repeat until the necessary functionality is complete. With that process there are a number of benefits that people reference.
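As a rough illustration of that loop, here is what the end state might look like after three trips around it. The `initials` function is a made-up example, not something from a real codebase. Each test was written first and watched to fail, then just enough code was added to turn it green.

```python
import unittest


def initials(full_name):
    """Return the uppercase initials of a name.

    Grown one failing test at a time: first a single word, then several
    words, then the empty string.
    """
    return "".join(word[0].upper() for word in full_name.split())


class InitialsTest(unittest.TestCase):
    def test_single_word(self):
        self.assertEqual(initials("ada"), "A")

    def test_two_words(self):
        self.assertEqual(initials("ada lovelace"), "AL")

    def test_empty_string(self):
        self.assertEqual(initials(""), "")


# Run the tests explicitly so this works the same in a script or a REPL.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InitialsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first version of `initials` could have been as small as `return full_name[0].upper()`; the second test is what forces the split-and-join.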

The main benefit I see TDD-promoters reference is test coverage. Since, with TDD, testing is part of how you write code, you just get more tests that are very well tied to the logic inside your functions. That test coverage paired with ongoing use of the practice tends to make new development less frightening because you have pretty high confidence that your code is covered and will alert you to unexpected behavior changes.

One counter-argument to the test coverage benefit is that the immense depth at which you’re covering your code in this type of practice results in brittle tests. Growing the test code at a rate faster than your app code can increasingly make it difficult to make changes to your app without spending many more hours rejiggering your tests. So, while you maybe have higher confidence in your app at one point, by the time you’ve redone much of your testing, due to feature additions, you’re in kind of a ¯\_(ツ)_/¯ state. So much so that by the time you’re done getting the tests green, you can’t tell if you’ve fixed the tests properly or if you just made them look green.

As Software Engineers we like to find processes and tools that allow us to remove blame and responsibility from the human. We want the computer and process to protect us and keep us in a safe zone. I think both of the above arguments are trying to achieve that same end of some kind of safe zone. TDD, in terms of the testing benefit that’s often referenced, would like to keep us in a zone of constant “yes it works.” The anti-TDD position explains a world where the process potentially slows down our ability to progress and potentially hurts our confidence in new functionality due to lots of changing tests.

One of the tricks with the name TDD is that it implies that tests are the benefit, when in fact they’re simply a vehicle for development. I actually liken tests from the TDD process to CO2: they’re there to move things forward and useful for that but otherwise need to be cleaned up after their purpose has been served. That is to say that a lot of tests I write during a TDD exercise are meant to be deleted at the end. I tend to use those TDD tests to help write new tests that are intended to live on with the app for regression and documentation. They often even look very similar, but now I’m writing tests to last rather than tests to drive development. These are fundamentally different mindsets.

I’ve mentioned that TDD is more of a vehicle for development. The effect TDD can have on the design of your code seems to be the most overlooked benefit of the practice while still the most important, in my mind. When I’m writing code without some sort of test I just add things and out comes the functionality. I’m worried about how easily I can get the feature done. When I’m writing it from the perspective of a test I’m writing as if I’m a user of the code that I need. That means that my mentality for what needs to be written is altered. I’m basically defining the API that I can test and understand. This is much different than writing a bunch of things that technically work but then having to explain that API from the reverse in order to write the tests.

With code/API-design and test coverage being the main arguments for or against TDD, let’s talk about what I do. I tend to think that when there are two big camps of people shouting for or against an idea, the truth or the best path lies somewhere in the middle. I think some of what you can learn about your code for future testing and for understanding design is an enormous win. I don’t feel that my longer-term regression tests come from the practice, though. So after working with TDD for quite some time, I now lean on it as a tool in my toolbox. Generally, I remove those initial tests and move on with life. One thing I will give TDD is that the design mentality around understanding what your external API looks and feels like has ultimately changed how I write code, regardless of whether I’m actually living by the practice. Now, when I write code I tend to sketch up an API of what I’d expect to use, then I aim to fill that in. The same idea as TDD in terms of thinking from the other end, with less rigidity.
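Here is one way that "sketch the API first, then fill it in" habit might look. The `Cart` class and its methods are purely hypothetical. The point is only the order of work: the calling code in the comment came first, and the class was written to fit it.

```python
# Step 1: write the code I wish I could call, before anything exists.
#
#     cart = Cart()
#     cart.add("notebook", price_cents=450, quantity=2)
#     cart.total_cents()
#
# Step 2: fill in the smallest class that makes that sketch real.


class Cart:
    """A hypothetical cart whose API was designed from the caller's side."""

    def __init__(self):
        self._lines = []  # (name, price_cents, quantity) tuples

    def add(self, name, price_cents, quantity=1):
        self._lines.append((name, price_cents, quantity))

    def total_cents(self):
        return sum(price * qty for _, price, qty in self._lines)


cart = Cart()
cart.add("notebook", price_cents=450, quantity=2)
cart.add("pen", price_cents=120)
print(cart.total_cents())
```

No test drove this one, but the shape of `add` and `total_cents` was still decided by how I wanted to use them, not by whatever was easiest to implement.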

So what should you do? If you’ve never tried it, don’t just listen to me or to the others on the all-knowing Internet. Try it! I still think it’s a practice worth doing for a little while, if only to understand it yourself and develop your own opinion. Your opinion may be different than others’ and that’s ok. If it helps you make cool things or enables you or your company to make money, then that’s awesome; keep doing that.

I recognize that I’m beating a proverbial dead horse. Everyone and their mother has already written about their feelings on TDD; I even recycled a lot of those same arguments here. I decided to publish my own post on the topic because I feel like I’ve found a place somewhere in the middle of the argument. I think a lot of people tend to focus on picking sides and I wanted to explain that I think you can use it as a tool; a tool in your belt. It doesn’t have to be your whole world, but if you prefer that tool over another, great!

Also, in the last month, several people have asked about my feelings on the topic so I figured I’d compile my feelings in this format for reference.

Thanks for listening. I work for Highrise HQ building a better CRM. If you liked this, you might check out my Twitter for more silly opinions and feels on tech and society.