Caveats and Limitations of A/B Testing at Growth Tech Companies

For non-tech industry folks, an “A/B test” is just a randomized controlled trial: you randomly split users (or other units) into treatment and control groups, then later compare key metrics across those groups to decide which experience performed better. In the context of this post, I will be talking specifically about end-user testing of some sort of digital application. Also, when I talk about effect sizes, I mean relative (percentage) effects, so the assumption is that effects are comparable across user bases of different sizes.

Introduction

A/B tests are the gold standard of user testing, but there are a few fundamental limitations to A/B tests:

  1. When evaluating an A/B test, your metrics must be (a) measurable, and (b) actually being measured.
  2. A/B tests measure effects that are (a) on the margin, and (b) within the testing period.

Some of these points may seem obvious on their face, but they have pretty important implications that many businesses (specifically managers, and even many folks with “data” in their job titles) fail to consider.

The “TL;DR” version

What people expect is that as an app grows, sample sizes get larger, which increases the statistical power of experiments. Additionally, larger denominators in KPIs mean that fixed labor costs are spread across more users, so even lower effect sizes (on a percentage basis) become more acceptable, because the same percentage lift is worth more in absolute terms.

Everything above is true, but there are some oft-neglected countervailing effects:

  1. Features “cannibalize” one another, or rather they experience diminishing marginal returns as other features get added: \frac{\partial^2 \textrm{KPI}}{\partial (\textrm{quantity of features})^2} < 0
  2. As a consequence of the first point, contemporaneous effects of experimentation may shrink rapidly toward zero, even if intertemporal effects– which are often stronger anyway– remain non-zero.

The end result is that, at a growth company, it is not unreasonable to find that diminishing effect sizes decrease the statistical power of experiments (which measure contemporaneous or near-term effects) faster than growing sample sizes can increase it. But critically, this doesn’t mean that the features being tested are “bad.”
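
To make that concrete, here is a minimal sketch in Python (scipy assumed; the baseline conversion rate, user counts, and lifts are all invented) of the approximate power of a two-proportion z-test when the sample grows 10x but the relative lift shrinks 10x, which is faster than the square-root rate at which extra sample buys back power:

```python
# Approximate power of a two-sided two-proportion z-test.
# Illustration only: the user counts and lifts below are made up.
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p_control, relative_lift, n_per_arm, alpha=0.05):
    p_treat = p_control * (1 + relative_lift)
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treat * (1 - p_treat) / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p_treat - p_control) / se - z_crit)

baseline_conversion = 0.10
# 10x the users, but the relative lift shrinks 10x (faster than 1/sqrt(n)):
for n_per_arm, lift in [(50_000, 0.05), (500_000, 0.005)]:
    power = power_two_proportions(baseline_conversion, lift, n_per_arm)
    print(f"n={n_per_arm:,} per arm, lift={lift:.1%}: power ~= {power:.2f}")
```

Despite ten times the users, the second scenario has far less power, because power tracks (relative lift) times sqrt(n).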

While drafting this post and before publishing it, I confirmed with data scientists at other companies that they see the same pattern of diminishing effect sizes across A/B tests in aggregate over time, and I have also discussed with them the internal politics surrounding it. That said, everything in this post is ultimately my own opinion and analysis.

Measurable metrics are usually a proxy for immeasurable sentiment

The implication of measurability is that you can’t tell whether users actually enjoy or hate a particular feature if that sentiment doesn’t show up in your metrics, e.g. they hate the feature but continue to use the app within the period of your testing.

The assumption that churn is a measure of user dissatisfaction is, ultimately, an assumption. It is more likely that contemporaneous churn signifies extreme dissatisfaction, while minor dissatisfaction can exist without any immediately obvious churn.

Some folks may mistakenly believe things like, “if we have enough users, we’ll be able to see the effects.” The mistake here is not really one of sample size, but of the timing of the effects: if user dissatisfaction leads to delayed churn, the test may conclude before the measurable quantity shows a statistically significant effect (even though the immeasurable effect, i.e. sentiment, was immediate).
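
Here is a toy simulation of that timing problem (numpy and statsmodels assumed; every number is invented): the treatment annoys a small slice of users enough that they eventually churn, but the extra churn arrives with a lag, so a churn comparison at day 14 sees nothing while the same comparison at day 90 is unambiguous.

```python
# Toy simulation: treatment-induced churn that arrives 30-120 days after
# exposure. A two-proportion z-test at day 14 sees no difference; the same
# test at day 90 does. All parameters are invented for illustration.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n = 500_000                                  # users per arm
baseline_churn_day = rng.exponential(scale=400, size=n)

control_day = baseline_churn_day
treatment_day = baseline_churn_day.copy()    # shared baseline keeps the toy noise-free
annoyed = rng.random(n) < 0.01               # 1% of treated users will eventually leave...
treatment_day[annoyed] = np.minimum(         # ...but only 30-120 days after exposure
    treatment_day[annoyed], rng.uniform(30, 120, size=annoyed.sum()))

for horizon in (14, 90):
    churned = np.array([(treatment_day <= horizon).sum(),
                        (control_day <= horizon).sum()])
    _, pval = proportions_ztest(churned, nobs=np.array([n, n]))
    print(f"day {horizon}: churn rates {churned / n}, p-value {pval:.2g}")
```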

The idea that users will not churn contemporaneously even if they’ll churn later sounds like a stretch– surely across 500,000 users in your experiment, you’ll see some sort of immediate effect, no?– but it’s not a stretch if you think about it from a user experience perspective. Measurable effects on churn or other KPIs may be delayed because users often do not immediately stop what they are doing just because a particular in-app flow becomes more annoying. Usually the biggest hurdle is convincing a user to log into the app in the first place, and a “high intent” user will not be immediately dissuaded from completing a flow even if they are dissatisfied with how smoothly it goes. Measuring the delayed effect may be especially slow if your typical user only logs in, say, once a month. (I highly recommend computing cumulative impulse response functions (CIRFs) of key metrics, such as average dollar spend, partitioned by experiment group. You may be shocked by how many effects linger past the experiment’s end date. You may even see sustained effects in the first derivatives of your CIRFs!) User insistence on completing subpar flows is how you get zero effects within-period but non-zero effects in future periods, even with massive sample sizes.
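
A minimal sketch of that CIRF recommendation, assuming a hypothetical events table with columns group, day (days since exposure), and spend (the column names and schema are mine, not a standard):

```python
# Cumulative "impulse response" of spend to experiment exposure: cumulative
# average spend per user by day since exposure, split by group, plus the
# treatment-minus-control difference. Schema is hypothetical.
import pandas as pd

def cumulative_spend_response(events: pd.DataFrame, users_per_group: dict) -> pd.DataFrame:
    # Total spend per group per day since exposure...
    daily = (events.groupby(["group", "day"])["spend"]
                   .sum()
                   .unstack("group", fill_value=0.0))
    # ...converted to cumulative average spend per user: the CIRF.
    cirf = daily.cumsum().div(pd.Series(users_per_group))
    cirf["difference"] = cirf["treatment"] - cirf["control"]
    return cirf

# Keep computing this past the experiment's end date: a "difference" column
# that keeps widening after the test ends is a lingering (intertemporal) effect.
# cirf = cumulative_spend_response(events, {"treatment": 500_000, "control": 500_000})
# print(cirf.tail())
```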

While we’re here, it should also be noted that “measurable” is not a synonym for “being measured.” Many metrics that sound great in theory during a quick call with management may not ever be collected in practice for a wide variety of reasons. Some of those reasons will be bad reasons and should be identified as missing metrics, then rectified. Other reasons will be good reasons, like it’d involve 30 Jira points of labor, and oh well, that’s life. So the company’s “data-driven” understanding of what is going on is confined to what the company has the internal capacity to measure. This means that many theoretically measurable metrics will be de facto immeasurable. And many of the things you really, really want to know will be unknowable via data.

And finally, relying on a metric also assumes that it is the right metric to be measuring, and that the number actually represents what you think it does. Of course, these assumptions can be violated as well.

Some metrics are adversarial; some effects are non-stationary

Let’s say you are A/B testing a price for a simple consumer product, and you decide to go with whichever one yields the most profit to the business.

This is just a simple (price - marginal cost) * quantity calculation, and it should not usually be hard to calculate which is more profitable. The problem here isn’t the measurability of profit.

Yet, there is still a hidden measurement problem, combined with an intertemporal effect problem. Specifically, the measurement problem is consumer surplus (i.e. it’s semi-measurable or deducible, but most likely not actually being measured at the majority of tech companies). The intertemporal effects are twofold: (1) users who receive low consumer surplus may be less likely to purchase again, and (2) the profit-maximizing price may change across periods due to competition and macroeconomic conditions.
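
For what it’s worth, a rough consumer surplus figure is deducible from the price test itself. Here is a minimal sketch, assuming only two price arms and a linear demand curve interpolated between them (a strong assumption), with every number invented:

```python
# Back-of-the-envelope consumer surplus from a two-arm price test, assuming
# demand is linear through the two observed (price, conversion) points.
# All numbers are invented for illustration.

def linear_demand(price_a, conv_a, price_b, conv_b):
    """Fit conversion = intercept + slope * price through the two arms."""
    slope = (conv_b - conv_a) / (price_b - price_a)
    intercept = conv_a - slope * price_a
    return slope, intercept

def surplus_per_user(price, slope, intercept):
    """Area under the linear demand curve above the charged price (a triangle)."""
    choke_price = -intercept / slope          # price at which conversion hits zero
    conv_at_price = intercept + slope * price
    return 0.5 * (choke_price - price) * conv_at_price

slope, intercept = linear_demand(price_a=10.0, conv_a=0.12, price_b=14.0, conv_b=0.08)
marginal_cost = 2.0
for price, conv in [(10.0, 0.12), (14.0, 0.08)]:
    profit = (price - marginal_cost) * conv
    cs = surplus_per_user(price, slope, intercept)
    print(f"price ${price:.0f}: profit/user ${profit:.2f}, consumer surplus/user ${cs:.2f}")
```

In this made-up example both prices earn identical profit per user, so the “data-driven” profit comparison is silent, while the consumer surplus estimate (and the repeat-purchase behavior it hints at) is what actually separates the two.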

The root problem is that profit is an adversarial metric with respect to the user. Without a countervailing force being “measured” in some way (consumer surplus), you may end up pissing off users enough that they don’t come back a second time.

The other problem is competition, which may introduce non-stationarities in a broader sense. So optimal prices can change a lot over time– even if users have the memories of goldfish and there are no autoregressive (so to speak) intertemporal effects on a user-by-user basis.

I mention A/B testing of prices in particular because it is an especially dangerous thing to attempt without a deep understanding of pricing from the marketing side of things, plus a confidently held theory of how to attain desired business results with pricing. I don’t even mean it’s dangerous in an ethical sense (that’s true but outside the scope of this document); I mean it is dangerous from a strictly long-term business perspective. Even if your theory is that it’s optimal business to squeeze users for as much money as possible in the short term, this should be acknowledged as a theory, and it should be held confidently when executed.

Many managers do not want to confidently state that squeezing users for every penny is desirable, perhaps because it is counter to liberal sensibilities. So they may instead state that to price a particular way is “data-driven.” It’s a farce to hide behind the descriptor “data-driven” in the pricing context, as short-term profit maximization is not a data-driven result, it’s a theory for how to run a business. The price that spits out from the A/B test as maximizing short-term profit is what’s “data-driven,” not the decision to go with said pricing strategy. A general form of this idea is true in all contexts of using KPIs to make decisions (e.g. is maximizing time users spend on an app actually a good thing?), but the pricing context is where it is most obvious that describing a decision as “data-driven” is just begging the question.

A/B tests tend to yield lower effect sizes as the app grows

If your app has 5 features (ABCDE), and you are A/B testing a 6th feature (F), the A/B test is testing the differences between the combination of ABCDE and ABCDEF. In other words, you are measuring the marginal impact of F conditional on ABCDE.

It is easy to imagine a few unintended problems arising from this. Imagine, for example, that all these features are more or less interchangeable in terms of their effects on KPIs in isolation, but that their effects tend to eat into each other as the features get piled on. In this case, the order in which these features get tested (and not the quality of the features in a vacuum) would be the primary determinant of which features evaluate well.

Note that this argument does not rely on the idea that the company tackles the most important features first, so that feature additions by their very nature become more fringe and smaller over time. Although there will certainly be some of that going on, this effect comes from a saturation of features in general. Diminishing marginal returns exist even if all the features are more or less the same quality. And that is the crux of a key limitation of A/B tests: you are testing the marginal effect conditional on whatever was in the app while you conducted the test.
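
As a toy illustration (the functional form and numbers are entirely made up), suppose every feature is identical in quality but their combined effect on a KPI saturates. The measured marginal lift of each successive feature then shrinks purely because of what shipped before it:

```python
# Toy model: identical-quality features whose combined effect saturates.
# kpi(k) = base * (1 + cap * (1 - exp(-rate * k))), k = number of features.
# An A/B test of the (k+1)-th feature measures kpi(k+1) / kpi(k) - 1.
from math import exp

def kpi(num_features, base=100.0, cap=0.40, rate=0.15):
    return base * (1 + cap * (1 - exp(-rate * num_features)))

for k in range(10):
    marginal_lift = kpi(k + 1) / kpi(k) - 1
    print(f"feature #{k + 1}: measured lift {marginal_lift:+.2%}")
```

Every feature here is “the same,” yet the first one tests at roughly a 5-6% lift and the tenth at around 1%, which is exactly the negative-second-derivative condition from the TL;DR.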

In the worst-case scenario, diminishing marginal returns in quantity of features become negative marginal returns, and basically nothing else can be added to the app.

Companies won’t usually tackle this problem head on (which is somewhat reasonable)

Let’s go back to our example of testing F, conditional on ABCDE. Imagine feature D was only ever tested conditional on ABC existing in the app. However, the fairest comparison of features D and F would be testing feature D conditional on ABCE, testing feature F conditional on ABCE, testing the interaction of DF conditional on ABCE, and testing neither conditional on ABCE. Given that ABCDE are already established features in the app, this would mean we need to re-test D.
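
What is being described here is a 2x2 factorial experiment. A minimal sketch of how you might randomize it and estimate both main effects and their interaction with OLS (statsmodels assumed; the KPI and the effect sizes in the fake data are invented):

```python
# Sketch of the "fair" 2x2 factorial comparison of features D and F,
# holding the rest of the app (ABCE) fixed. The outcome is simulated so that
# D and F each help on their own but cannibalize each other when combined.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 40_000
users = pd.DataFrame({
    "has_D": rng.integers(0, 2, n),   # randomized independently of has_F
    "has_F": rng.integers(0, 2, n),
})
users["kpi"] = (1.0
                + 0.05 * users.has_D
                + 0.04 * users.has_F
                - 0.03 * users.has_D * users.has_F   # the cannibalization term
                + rng.normal(0, 0.5, n))

# "kpi ~ has_D * has_F" expands to main effects of D and F plus their interaction.
model = smf.ols("kpi ~ has_D * has_F", data=users).fit()
print(model.params)
```

The interaction coefficient is what a sequence of one-at-a-time A/B tests quietly folds into whichever feature happened to be tested last.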

To be clear, actually doing this “fair” comparison is a little unreasonable. Companies don’t tend to continually A/B test features that have proven themselves successful for a few reasons:

  • Non-zero maintenance cost of sustaining multiple experiences.
  • Difficulty in marketing a feature that not all users receive. (Imagine being advertised to about a feature that you can’t even see because you’re in the control group.)
  • Testing a worse experience comes with the opportunity cost (sometimes called “regret”) of showing some proportion of users something that doesn’t optimize a KPI.

Additionally, there are arguments for retaining older features at full exposure even if their true marginal effects are worse. Imagine that, in the hypothetical “fair” comparison, F is actually slightly better than D. We still may prefer D over F because:

  • Maintaining tried and true features tends to be easier from the engineering perspective.
  • Users come to expect the existence of a feature the longer it is around.
  • Users are “loss averse” to the removal of a feature (i.e. going from “not X” to “X” back to “not X” is potentially worse than just starting and staying on “not X”).

A final reason not to tackle this problem is that the “dumb” interpretation of A/B testing isn’t obviously bad for operational purposes. Effectively, if F has basically no effect conditional on ABCDE existing, then taking this literally to mean “there is no effect of including or excluding F” causes no obvious problems, unless modifying ABCDE is a reasonable option to pursue.

That said, many companies would still benefit from doing the following:

  1. Most importantly, have a theory and use common sense when building the app out. (Many things don’t need to be A/B tested.)
  2. When you start to see diminished effect sizes, run longer-duration A/B tests (a back-of-the-envelope duration calculation is sketched after this list).
  3. Be careful to consider the temporal effect of when the experiment was started; experiments not run contemporaneously are often not reasonable to compare to one another.
  4. Measure IRFs and CIRFs of your KPIs with experiment exposure as the impulse, even past the experimentation period.
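
On point 2, here is a rough sketch of how required test duration scales as the minimum detectable lift shrinks, using statsmodels’ power utilities (the daily traffic, baseline conversion rate, and lifts are all invented):

```python
# How long must a test run to detect a given relative lift on a conversion
# KPI at 80% power and alpha = 0.05? Invented traffic: 25,000 newly exposed
# users per day per arm, with a 10% baseline conversion rate.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

daily_users_per_arm = 25_000
baseline = 0.10

for relative_lift in (0.05, 0.02, 0.01, 0.005):
    effect = proportion_effectsize(baseline * (1 + relative_lift), baseline)
    n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                             power=0.80, ratio=1.0)
    days = n_per_arm / daily_users_per_arm
    print(f"lift {relative_lift:.1%}: ~{n_per_arm:,.0f} users per arm, ~{days:,.1f} days")
```

Halving the detectable lift roughly quadruples the required sample, so durations blow up much faster than intuition suggests.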

Many companies who experience this don’t have a strong institutional understanding of what’s going on

Product managers, data scientists, data analysts, and engineers tend to suck at thinking about their company holistically. In fairness, their job is usually to think about very small slices of the company very intensely. But spending the entire month of August optimizing some API to run faster, even if economical to do so, can obscure the true determinants of the app’s performance, which are often a grab bag of both banal and crazy factors, both intrinsic and extrinsic to the company and its operations. Some tech employees lose the plot and really do need to look at an overpriced MBB consulting deck about their company, unmarred by being too deep in the weeds on any one thing.

So these tech employees may not even realize that their A/B test results are getting worse; or they may notice, but not understand that it is not just a coincidence, that it is a perfectly reasonable thing to expect.

Another reason many tech employees may not understand what’s going on is that they’ve been told A/B tests are good and the gold standard for testing features, without deeply understanding what they do and don’t measure. Unfortunately for managers who insist on being “data-driven,” comparing the results of A/B tests across different periods absolutely does require interpretation and subject matter expertise, because inter-period A/B tests are in some ways incomparable.

These limitations, if not acknowledged, lead to toxic internal politics

A “data-driven” culture that does not acknowledge the simple limitations of the data tools we use is one where data and data-ish rhetoric can be abused to push for nonsensical conclusions and to cause misery.

The main frustration that comes from diminishing marginal effects of A/B testing is that managers will see those beneath them as failing in some sense– upper managers thinking their product managers are failing, product managers thinking their data scientists are failing, and so on– compared to the experiments and features of yesteryear.

The truth of course is more complicated and some combination of:

  1. The intertemporal effects dominate (implying the test needs to run for longer).
  2. The app is saturated with other, cannibalizing features (implying that the work is either not productive, or that other features should be removed).
  3. The feature makes the app better (or worse?) in some fundamentally immeasurable-in-an-A/B-test way (requiring theory, intuition, and common sense to dictate what should happen).

Usually people over-rely on this sort of “data-driven” rhetoric when they lack confidence in what they are doing in some regards– maybe they are completely clueless, or maybe they sense what they are doing is unethical and need some sort of external validation / therapy in the form of a dashboard.

I am a data person at the end of the day. I think the highest-value data is that which tells you that a prior belief was wrong, or tells you something when you had no real prior beliefs at all. A lot of the features you’ll be adding to an app don’t really need to be justified through “data-driven” means if you have a strong prior belief that they make the app better. Maybe a feature doesn’t have an immediately obvious, measurable impact on the treatment group, but it makes users happier over time and more likely to tell their friends to download the app (also, your app’s referral feature is probably garbage/unused and fails to capture 99% of the word-of-mouth downloads that would otherwise be attributable to an A/B test).

These strong prior beliefs may come from, say, subject matter expertise. And if your company is at the point where shrinking contemporaneous, measurable effect sizes are eroding the statistical power of your experiments faster than growing sample sizes can keep up, or you are working on a problem with hard-to-measure or immeasurable effects, you’re going to need subject matter expertise to fill that gap.

Unfortunately, this is all easier said than done. Middle managers don’t like hearing that tests need to run for longer. They don’t like hearing that their product idea “failed” an A/B test. They don’t like having to actively disavow a “data-driven” approach that upper management is pressing them to adhere strictly to (although they’ll likely be politically pressured into pretending to be “data-driven” by always reporting metrics that support the value of their own work, but that’s another story).

I’ll end on a personal anecdote of a time when an overabundance of nominal commitment to “data-driven decision-making” led to some toxic internal politics:

I once had a “data-driven” upper manager who would often say “show me the data” when I said we should do something. The data this person claimed to want was never something that was actually being measured (always for good reasons or for reasons outside of my control). In fact, that was usually why I was not showing data– it didn’t exist in some sense! But I’m not so credulous as to believe that the request for data was sincere; I suspect that this manager knew the data was unattainable, and was using “show me the data” as a bludgeon to deliberately suppress these conversations and “win” arguments.