Use Decision Theory to Choose Significance Levels for Experiments

Alpha shouldn’t always be 5%. Here’s a simple method to decide what it should be.

Originally published at: https://goodenoughstatistics.com/use-decision-theory-to-choose-significance-levels-for-experiments-073fecae0865.

We are all enlightened folks here who know that defaulting to α = 0.05 as our significance threshold isn’t cool and that we should base it on how much noise we’re willing to tolerate before we believe an effect size. So, that’s all well and vague, but how exactly do we do it? I bring good news. We can use decision theory to pick significance levels based on numbers that humans can have a decent intuition about. Get excited.

Frequentist Decision Rules in an Expected Utility Framework

(Check out this paper https://faculty.washington.edu/kenrice/testingrev2a.pdf for more background on this idea).

Statistical decision theory is about how to make decisions from data. Data signals what to do but tells us nothing with certainty, so we have to decide how to deal with uncertainty.

Expected Utility Maximization is a common and convenient way to do so because it is mathematically tractable and has a nice, interpretable axiomatization via the von Neumann-Morgenstern axioms.

In Expected Utility Maximization, we evaluate the utility of an action (a) under each potential (unknown) state (s) of the world and then weight the utility by the likelihood of each state, i.e., we choose the action that maximizes:

EU(a) = E[u(a, s)] = ∫ u(a, s) dF(s)

where F is our distribution of beliefs over the states s.

This decision-making rule usually lives firmly in Bayesian land because only in the Bayesian world do we have a full posterior distribution of beliefs about the parameters (the “states” in the above), and we need the full distribution (F) to compute the expected utility.
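As a toy illustration of the expected-utility computation, here is a minimal sketch over a discrete set of states; the states, probabilities, and utilities are all made-up numbers:

```python
# Expected utility maximization over a discrete set of states.
# States, beliefs, and utilities below are hypothetical illustration values.
states = ["effect_negative", "effect_zero", "effect_positive"]
beliefs = {"effect_negative": 0.2, "effect_zero": 0.3, "effect_positive": 0.5}

# utility[action][state]: payoff of taking `action` when `state` is true.
utility = {
    "launch_control":   {"effect_negative": 0.0,  "effect_zero": 0.0, "effect_positive": -1.0},
    "launch_treatment": {"effect_negative": -2.0, "effect_zero": 0.0, "effect_positive": 1.0},
}

def expected_utility(action):
    # Weight the utility of the action under each state by the state's probability.
    return sum(beliefs[s] * utility[action][s] for s in states)

best_action = max(utility, key=expected_utility)
```

With these numbers, launching the treatment has expected utility 0.1 versus −0.5 for the control, so the rule picks `launch_treatment`.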

However… if we can cast the standard frequentist statistical significance analysis in an expected utility framework, we can effectively make the significance level a utility/loss function parameter. If we think the loss function is reasonable and understand how the significance level affects our implied preferences, we have a good basis for choosing alpha.

First, the threshold question: Is there a utility/loss function that gives the standard frequentist decision rule? The trite answer is: of course. Expected Utility Maximization is flexible enough to support many decision rules. So, the more relevant question is: is there a reasonable loss function that leads to standard hypothesis testing?

After a standard online experiment, there are two decisions we can make post-experiment:

  1. We do not have sufficient evidence, so we decide not to do anything and treat the experiment’s effect size as unknown. (Label this decision as h=0)
  2. We have sufficient evidence to decide on the treatment effect, and then, going forward, we treat the point estimate as the truth. Perhaps this is bad, but it’s unavoidable. We ultimately behave as if the treatment effect is some specific value, e.g., we make financial projections for the impact of some initiative based on a single treatment effect, etc. So, let’s bake that reality into our decision problem. (h=1)

If we make decision h=1, we also make a separate decision d about the true value of the treatment effect.

Let t be the true treatment effect. Now, suppose there is some “null” value of the treatment effect, s, that we would be perfectly content to be unable to measure. For example, s = 0 is a common choice: if we can’t accurately measure a zero treatment effect, we don’t care, because whether we launch A or B in our A/B test, the effect will be the same.

Under decision h=0, our loss should be smaller the closer the null value s is to the true treatment effect. A natural choice for loss is the squared difference between the true treatment effect and the null: (t − s)².

Under decision h=1, we make a decision d about the true treatment effect. A natural choice for our loss is the squared difference between our decision and the truth: (t − d)².

We combine the two parts of the loss function by taking a weighted sum of the h=1 and h=0 cases, putting weight w > 0 on the h=1 decision. Larger weights will make us dislike deciding on the effect size more. Smaller values of w mean we are more inclined to make a call, given the data we have.

This loss function approximates decision-making in industry. We decide whether to discard the experiment’s findings or to pick a value to take as the treatment effect going forward.

Let’s solve for expected loss given data X under each decision:

Given h = 0, the expected loss is: E[(t − s)² | X], where the expectation is with respect to beliefs over the treatment effect t given data X. We can rewrite this as:

E[(t − s)² | X] = Var(t | X) + (E[t | X] − s)²
Given h = 1, the expected loss is: w E[(t − d)² | X]. Choosing d = E[t | X] minimizes the loss because the mean minimizes mean squared error. So, we can write the expected loss as: w Var(t | X).
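The claim that the mean minimizes expected squared loss is easy to check numerically; the “posterior draws” below are made-up numbers for illustration:

```python
# Numerical check that the mean minimizes expected squared loss.
# These "posterior draws" over the treatment effect t are hypothetical.
draws = [0.8, 1.1, 0.9, 1.4, 0.6, 1.2]

def expected_squared_loss(d):
    # E[(t - d)^2] under the empirical distribution of the draws.
    return sum((t - d) ** 2 for t in draws) / len(draws)

posterior_mean = sum(draws) / len(draws)

# Any other decision d incurs at least as much expected loss as the mean.
for d in [posterior_mean - 0.5, posterior_mean, posterior_mean + 0.5]:
    assert expected_squared_loss(posterior_mean) <= expected_squared_loss(d)
```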

So, we choose h = 1 over h = 0 when:

w Var(t | X) < Var(t | X) + (E[t | X] − s)²

or, rearranging:

w − 1 < (E[t | X] − s)² / Var(t | X)
Hey! The right-hand side looks awfully similar to a standard Wald test in frequentist inference. It’s just the square of the t-statistic!

The only difference is that the variance and expectation are with respect to a posterior distribution and not, say, the sample mean or the variance of the sample mean (i.e., not with respect to the empirical distribution).
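Concretely, the rule reduces to a one-line comparison. A minimal sketch, where the posterior mean and variance are hypothetical placeholders for real experiment output:

```python
# Decide h = 1 (make a call) vs. h = 0 (abstain) under the weighted squared loss.
# The posterior summaries below are hypothetical placeholder numbers.
posterior_mean = 0.03   # E[t | X]
posterior_var = 0.0001  # Var(t | X)
null_value = 0.0        # s, the effect we are content to miss
w = 4.84                # weight on the loss from making a call (h = 1)

# The squared t-statistic, with posterior quantities in place of sample ones.
t_squared = (posterior_mean - null_value) ** 2 / posterior_var

# The two forms of the decision rule are algebraically identical.
choose_h1_raw = w * posterior_var < posterior_var + (posterior_mean - null_value) ** 2
choose_h1_tstat = w - 1 < t_squared
assert choose_h1_raw == choose_h1_tstat
```

Here t² = 9, which clears the threshold w − 1 = 3.84, so we make a call on the effect.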

But, in fact, this is not such a large difference. The Bernstein-von Mises theorem tells us (among other things) that for a wide variety of priors and likelihoods, the posterior mean and variance in Bayesian land will converge to the maximum likelihood estimates (implied by the distributions converging in total variation).

So, even though, in practice, we won’t use Bayesian inference, this theory allows us to link frequentist inference to Bayesian decision theory. In other words, following the frequentist decision-making algorithm is (roughly) equivalent to following this Bayesian, expected utility maximization algorithm. So, we can treat the Bayesian version as a thought experiment to help choose critical values for our frequentist tests.

At a 5% significance level, the critical value of a chi-square test is ~3.84. This implies that using the 5% significance level for decision-making is rationalized with a weight of w = 4.84 on the loss from making a decision, i.e., about 5x more weight on the loss from making a call on the treatment effect relative to deciding that we don’t have enough data.

So, a 5% significance level means we’re pretty reluctant to make a call about the treatment effect. Let’s say we think 5x is too much weight on the loss from h=1. We value making a call more. So, let’s halve it and put only 2.5x weight on the loss from a decision. Then, we would get a critical value of w − 1 = 1.5, corresponding to a p-value of about 0.22 — higher than traditional significance levels but perhaps more realistic about how much we worry about making a mistake. We’re doing e-commerce, SaaS, etc. — not building bridges.
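These numbers are straightforward to reproduce. A small sketch using only the standard library, relying on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper-tail probability is erfc(√(x/2)):

```python
import math

def chi2_1df_tail(x):
    """P(chi-square with 1 df > x), via the standard normal: erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def weight_to_alpha(w):
    """Significance level implied by weight w on the h=1 loss (critical value w - 1)."""
    return chi2_1df_tail(w - 1.0)

def alpha_to_weight(alpha, lo=0.0, hi=100.0):
    """Invert weight_to_alpha by bisection (the tail probability is decreasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_1df_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return lo + 1.0  # critical value + 1 = w

# alpha = 5% corresponds to a critical value of ~3.84, i.e., w ~ 4.84.
# A weight of w = 2.5 corresponds to alpha ~ 0.22.
```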

What would we do if we used equal weight (w=1)? Then, we’d always go with h = 1. We’d always make a decision. Intuition: why resign yourself to s if you put equal weight on both? You might as well make the ‘best’ decision given the data.

How to Choose Significance Levels

Summarizing the strategy in bullet points:

  • Decide how much more weight you want to put on the loss from being wrong about your estimate of the treatment effect relative to sticking with the status quo and admitting ignorance of the effect.
  • Subtract one from that weight to get the critical value; the implied significance level is the upper-tail probability of the chi-square distribution with 1 degree of freedom above that value.
  • Going forward, only make calls on experiment results that are significant at that level. Otherwise, ignore the results and go with the control.

The status quo decision-making paradigm is usually: 5% is nominally the significance level, but we don’t use it strictly for making decisions — how could we? Some product surfaces don’t have the volume needed to detect the kind of effect-size improvements we might plausibly generate within the time frame we can reasonably run an experiment.

We still have to decide what to launch. So, we make launch decisions by taking the point estimates as given and ignoring statistical errors. By revealed preference, we don’t actually care about incorrectly making a call as much as implied by the 5% critical values.

Our willingness to make decisions when things aren’t significant at the 5% level suggests that it is the wrong significance level for most industry applications.

The problem with setting the bar too high is that experimenters stop caring about statistical significance. They never get it, so the large standard errors fade into the background. If statistical significance is more achievable and reflects our actual level of concern about making the wrong decision, experimenters can use the fact that something is not significant to stop and say, “We’re not going to learn about this effect. It must be pretty small. Let’s try something else.” The error starts to matter for decision-making.

How to Make Loss-Minimizing Decisions On Experiments

The loss function approach to choosing significance level only makes sense if we optimize the loss function when we make decisions. Here’s how to do so for a given set of experiment results.

  1. Compare the chi-square stat to the relevant critical value for the weight w. If it exceeds the critical value, choose h=1; otherwise, choose h=0.
  2. If we choose h = 0, we Launch Control (the status quo).
  3. If we choose h = 1, we treat the treatment effect as the point estimate from the experiment and make decisions from there, i.e., Launch Treatment if it’s positive (there might be guardrail conditions that also matter); Launch Control if it’s negative.
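The steps above can be sketched as a single function; the estimate, standard error, and weight passed in are hypothetical inputs:

```python
def decide(estimate, std_error, w, null_value=0.0):
    """Loss-minimizing launch decision under the weighted squared loss.

    `estimate` and `std_error` stand in for the posterior mean and standard
    deviation of the treatment effect (approximately the frequentist point
    estimate and its standard error, per Bernstein-von Mises).
    """
    chi_square_stat = ((estimate - null_value) / std_error) ** 2
    critical_value = w - 1.0
    if chi_square_stat <= critical_value:
        return "launch control"  # h = 0: not enough evidence, keep the status quo
    # h = 1: treat the point estimate as the truth and act on its sign
    # (guardrail conditions could also be checked here).
    return "launch treatment" if estimate > 0 else "launch control"
```

For example, with w = 2.5 (critical value 1.5), an estimate of 0.03 with standard error 0.01 gives a chi-square stat of 9 and launches the treatment, while an estimate of 0.01 with the same standard error falls short and keeps the control.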

Thanks for reading!

Connect at: https://linkedin.com/in/zlflynn

Zach