" "

Three Control-Group Mistakes and How To Avoid Them

Sophisticated marketing teams often borrow a technique from the pharmaceutical industry to test new ideas: randomized controlled trials (RCTs). They divide customers into separate groups, treat each group differently, and then measure variations in outcomes. As long as the groups are identical except for the random allocation of the “treatment,” any difference in outcomes can be attributed to the treatment. If done right, RCTs can be almost magical, telling you, for instance, which half of your marketing budget you’re wasting. But in practice, a lot can go wrong—even in large, sophisticated organizations.

In my years of experience dealing with marketing teams, I’ve seen three things that often introduce inaccuracy into RCT results:

  1. Differences creep in that are caused by something other than just “luck of the draw” between the groups.
  2. Teams can simply be unlucky with the actual groups picked.
  3. The team can make the wrong measurement, for example by not applying the same metric to both groups, or by not measuring results over a long enough time period.

These issues can be difficult to spot—but they matter. They can lead companies to end projects that should continue, and to continue projects that are harmful to the business.

Differences not due to randomization

Recently, I worked on a marketing project with a large loyalty program. As part of the project, one group of customers would receive a new, personalized approach to the program. To test the approach, we used randomization to pick a treatment group (5% of existing customers) and, for comparison, an equal-sized control group. Although we used the same database script to pick the two groups, we picked them at different times. The treatment group was picked as we started to develop the new approach, but we didn’t pick the control group until we were ready to test and measure it—about three months later.

This meant that on the day we started the experiment, some people in the control group had joined the program very recently—even right up to the day before. Everybody in the treatment group, on the other hand, had been a member for at least three months. This tenure difference matters because new customers are more likely to have made purchases recently. Longer-term customers are less likely to have made recent purchases, or may have lost their card, moved, or even died.

Solution: To avoid introducing this inaccuracy, pick the groups at the same time and either randomly assign new customers when they arrive, or exclude them from all measurements.
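One simple way to make same-time, as-they-arrive assignment practical is to derive each customer’s group from a hash of their ID, so existing customers and brand-new joiners are assigned by exactly the same rule. The sketch below is a minimal Python illustration of that idea, not the database script used in the project; the 5% group sizes and the ID format are assumptions.

```python
import hashlib

TREATMENT_SHARE = 0.05  # assumed 5% treatment group
CONTROL_SHARE = 0.05    # assumed equal-sized control group

def assign_group(customer_id: str) -> str:
    """Deterministically map a customer ID to 'treatment', 'control', or 'neither'.

    Because the rule depends only on the ID, a customer who joins tomorrow is
    assigned the moment they arrive, so both groups accumulate new joiners at
    the same rate and no tenure gap opens up between them.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    u = int(digest[:12], 16) / float(16 ** 12)  # pseudo-uniform value in [0, 1)
    if u < TREATMENT_SHARE:
        return "treatment"
    if u < TREATMENT_SHARE + CONTROL_SHARE:
        return "control"
    return "neither"

# Existing customers and a new joiner are assigned by the same rule.
for cid in ["C-1001", "C-1002", "C-2024-new-joiner"]:
    print(cid, assign_group(cid))
```

Because the assignment is deterministic, it can be rerun at any time without changing who is in which group.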

A similar thing happens when refreshing a universal control group (customers who don’t get any marketing). In these cases, it’s common to ensure that a customer won’t be in two successive universal control groups. But this makes the refreshed group slightly depleted in longer-term customers (because longer-term customers could have been in the prior universal control group, while new customers couldn’t have because they weren’t customers at the time). This small difference can introduce the same tenure bias we saw with the 5% sample from the loyalty program.

Solution: The simplest way to avoid this problem is to allow customers who have been in a previous universal control group to be in a subsequent one. People often don’t like this idea: it feels risky to switch off marketing for too long for some customers. So, if you can’t persuade your colleagues to allow some of these customers to be part of the new control group, the next best alternative is to make sure they are also excluded from measurement in the comparison group (often labelled “business as usual”).
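If you take the exclusion route, the filtering is easy to get right in code. Here is a minimal pandas sketch, assuming hypothetical boolean flags in_new_ucg and in_prior_ucg on the customer table; the names are placeholders, not fields from any real system.

```python
import pandas as pd

def measurement_populations(customers: pd.DataFrame):
    """Return the two populations to compare: the refreshed universal control
    group (UCG) and a "business as usual" group with prior-UCG members removed,
    so both populations are depleted of those longer-term customers in the
    same way.

    Expects boolean columns 'in_new_ucg' and 'in_prior_ucg' (placeholder names).
    """
    new_ucg = customers[customers["in_new_ucg"]]
    business_as_usual = customers[
        ~customers["in_new_ucg"] & ~customers["in_prior_ucg"]
    ]
    return new_ucg, business_as_usual
```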

Getting unlucky about who’s in each group

You can arrive at similarly inaccurate results if you have “whales” (customers who purchase much more than other customers) in either the treatment or control group. Usually a single customer won’t make a big difference in an RCT, but it can be worth checking for anomalies such as “everyone who pays with cash” or “the combined purchases of all our staff when they use their discount card.” I have seen both of these create spurious “customers” in the data.

Although it's usually not a single customer, it’s not unusual for a small group of customers to be responsible for a large fraction of sales, so they have to be allocated carefully between the treatment and control groups.

Solution: You can avoid this pitfall by using stratified sampling to ensure there is an equal number of these “whales” in both the treatment and the control groups.
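As a rough illustration, here is one way to implement that kind of stratified split in Python with pandas. The 50/50 split, the 99th-percentile definition of a “whale,” and the annual_spend column are all assumptions for the sake of the example.

```python
import numpy as np
import pandas as pd

def stratified_split(customers: pd.DataFrame,
                     whale_quantile: float = 0.99,
                     seed: int = 7) -> pd.DataFrame:
    """Split customers 50/50 into treatment and control, sampling whales and
    regular customers separately so each group ends up with roughly the same
    number of very large spenders."""
    cutoff = customers["annual_spend"].quantile(whale_quantile)
    labelled = customers.assign(
        stratum=np.where(customers["annual_spend"] >= cutoff, "whale", "regular")
    )
    pieces = []
    for _, stratum_df in labelled.groupby("stratum"):
        shuffled = stratum_df.sample(frac=1.0, random_state=seed)
        half = len(shuffled) // 2
        shuffled = shuffled.assign(
            group=["treatment"] * half + ["control"] * (len(shuffled) - half)
        )
        pieces.append(shuffled)
    return pd.concat(pieces)

# Example with simulated spend data (log-normal, so a few customers dominate).
customers = pd.DataFrame({
    "customer_id": [f"C-{i}" for i in range(1000)],
    "annual_spend": np.random.default_rng(0).lognormal(4.0, 1.5, 1000),
})
print(stratified_split(customers).groupby(["stratum", "group"]).size())
```

In practice you would match the split to your actual group sizes (for example, 5% and 5%), but the principle of splitting each stratum separately stays the same.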

Measuring the wrong thing or over the wrong time period

Lastly, differences in the time period over which an RCT is run can have a dramatic effect on results. Suppose you create an RCT that gives members of the treatment group an offer to purchase a large quantity of dog food. If you measure one-week sales to the treatment group versus sales to customers who receive no offer, the results may well show an increase in such sales—and that the offer makes sense.

But if you compare sales over a six-month period, the results might be quite different, because the offer has resulted in customers not buying more, but buying the same amount sooner and at a lower price. The offer “paid” customers to bring spending forward and, in effect, borrowed from the future.

Solution: To account for this effect, RCTs need to be planned in a way that also measures impact after the intervention. The appropriate length of time varies. You can calculate it by plotting the difference over time between the treatment and control groups, and then seeing how quickly the difference returns to zero. If you don't have enough time to perform this kind of statistical analysis, then you can often use your experience to choose a test period that makes sense for the incentive being offered.
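A minimal sketch of that post-period analysis, assuming you already have weekly sales per customer for each group (the DataFrame and column names here are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_weekly_gap(weekly: pd.DataFrame) -> pd.Series:
    """Plot the weekly treatment-minus-control gap in sales per customer.

    Expects columns 'week', 'treatment_sales_per_customer' and
    'control_sales_per_customer'. The week at which the gap settles back to
    roughly zero after the offer ends is a guide to how long the measurement
    window needs to be.
    """
    gap = (weekly["treatment_sales_per_customer"]
           - weekly["control_sales_per_customer"])
    plt.plot(weekly["week"], gap, marker="o")
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Week")
    plt.ylabel("Treatment minus control (sales per customer)")
    plt.title("Does the difference return to zero?")
    plt.show()
    return gap
```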

I have seen people make measurements that, frankly, are hard to fathom: for example, using one metric for the treatment group and another for the control group. If you use different metrics for each group, you can get any answer you want, but acting on that answer can be very costly. One company measured an outcome over three months for one group and compared it to three times the outcome over one month for the other. This inconsistent approach led them to make interventions they thought were effective but that were, in many cases, commercially harmful.

Stop, Plot, and Study

An effective way to avoid most of these potential problems is to measure historical performance. Once the treatment and control groups have been selected—and before you make an offer to the treatment group—look back in time as far as you can and make the same measurement that you plan to make during the experiment. One commonly used measurement is the percentage difference in weekly sales per customer between the treatment group and the control group.

Once you have the data, plot it and take a look. Just by looking at the chart, you’ll be able to see if the signal has drifted over time, and how much it jumps around.

You can quantify your observations using summary statistics such as the average treatment-to-control gap (which should be near zero before the experiment starts) and the standard deviation of this gap, which quantifies how much the signal “jumps around.” But don’t fall into the common trap of calculating the statistics without reviewing the chart. Often the chart tells a richer and more intuitive story than the summary statistics.
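A minimal sketch of that pre-period check, again assuming a weekly table of sales per customer for each group with placeholder column names:

```python
import pandas as pd

def pre_period_check(weekly: pd.DataFrame) -> dict:
    """Summarise the historical treatment-vs-control gap before any offer goes out.

    Expects columns 'treatment_sales_per_customer' and
    'control_sales_per_customer' covering the pre-experiment weeks. Returns the
    average percentage gap (which should sit near zero) and its standard
    deviation (how much the signal jumps around from week to week).
    """
    pct_gap = (
        weekly["treatment_sales_per_customer"]
        / weekly["control_sales_per_customer"]
        - 1.0
    ) * 100.0
    # Look at the chart as well (e.g. pct_gap.plot()); the summary statistics
    # alone can hide drift or one-off spikes.
    return {
        "mean_pct_gap": pct_gap.mean(),
        "std_pct_gap": pct_gap.std(),
    }
```

If the pre-period gap is clearly non-zero or drifting, that is usually a sign to revisit the group selection before launching the experiment.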
