You are one of...current visitors on the English part - also ...current visitors on the Swedish part

(the number of current visitors is automatically updated every 4 minutes)

Cite this page as:

-

Level of significance

-

First published:

on:

INFOVOICE.SE

Last updated:

If you want to share information about this web page...

This web-page describes level of significance and how this relates to p-values. These concepts are sometimes mixed up and reading this page will enable you to understand the difference. The page also contains online calculators that will help you.

You will understand this page best if you have read Introduction to statistics and Inferential statistics.

The difference between level of significance (alpha) and the p-value

A low p-value indicates that it is highly unlikely we would see our data if the effect or correlation we are looking for were truly zero. In other words, a low p-value allows us to reject the null hypothesis in favor of the alternative hypothesis. But exactly how low must the p-value be for us to take this step? This threshold should be determined from case to case and is known as the level of significance, or alpha.

We use inferential statistics to calculate a p-value.  The next step is to compare this calculated p-value against our predetermined level of significance (alpha). If the p-value is below alpha, we reject the null hypothesis and consider the alternative hypothesis to be the most plausible. Conversely, if the p-value is higher than alpha, we cannot reject the null hypothesis; our results do not provide enough evidence to contradict it..

By tradition, alpha is most often set to ≤0.05. Therefore, a p-value of 0.045 would be classified as statistically significant, while a p-value of 0.055 would not. However, it is crucial to remember that this ≤0.05 threshold is not black and white. In reality, findings of p=0.045 and p=0.055 are statistically very similar. Because of this, exact p-values should always be reported, rather than simply stating whether a finding was “significant” or not.

In summary: The level of significance (alpha) is a fixed limit determined by the researcher, usually in advance. It does not depend on our observations and is not calculated; rather, it is a conscious decision based on the safety margin needed to avoid making a Type I error (a false positive). The p-value, on the other hand, is calculated directly from—and depends entirely on—our observed data.

The level of significance and pure chance

Assume we want to know which variables differ between two groups, those who have experienced an illness compared to those who have not (or it could be those who say yes to a question compared to those who say no). Also assume that we want to investigate 50 different variables, some being categorical while others are continuous.  Using chi-square or t-test to compare the two groups would result in 50 p-values, some below and some above 0.05. However, a p-value ≤0.05 can occur by pure chance without representing a real difference between groups. The probability to wrongly reject the null hypothesis (believe that a low p-value represents something else than pure chance) increases the more tests we do (Figure 1).

Figure 1 – Probability of wrongly rejecting the null hypothesis in multiple testing

Deciding the level of significance

To compensate for the possibility of getting statistical significance by pure chance when we do multiple testing we need to lower the limit where we consider a statistical finding as significant as soon as we present more than one p-value. There are different methods to do this.

Family-Wise Error Rate (FWER) and False Discovery Rate (FDR)

Family-Wise Error Rate (FWER) is the probability of making at least one Type I error — that is, at least one false positive — across a family of statistical tests. In simpler terms: If you run several hypothesis tests, FWER is the chance that one or more of them incorrectly appears statistically significant just by chance. The formula is 1-(1-alpha)^m (m=number of statistical tests). For example we make 20 statistical tests the chance of making at least one type I error is:

1(10.05)20=0.641 – (1 – 0.05)^{20} = 0.64

In this example it would be 64% chance of having at least one statistical test being false positive. FWER is concerned with controlling the probability of any such false discovery in the whole set of tests. FWER is stricter than the False Discovery Rate (FDR):

  • FWER controls the chance of making any false positive.
  • FDR controls the expected proportion of false positives among the statistically significant findings.

Overview of “universal adjusters” of the level of significance

MethodDescriptionAdvantageDisadvantage
The Bonferroni methodControls the Family-Wise Error Rate (FWER). Keeps the chance of making even one false positive at exactly 5% (or your chosen level of significance). Never use this method (better to use one of the below recommended methods).Very simple to understand. Does not require that each test is independent. “Very conservative” making it hard to find statistical significance if you calculate many p-values.
The Dunn-Šidák correctionControls the Family-Wise Error Rate (FWER). Keeps the chance of making even one false positive at exactly 5% (or your chosen level of significance). Never use this method (better to use one of the below recommended methods).Slightly less conservative than Bonferroni’s method. However, the difference is very small and this method is, like Bonferroni’s method likely to miss relevant findings.Less easy to understand. Also requires that every single one of your 20 tests is completely independent of the others (which is often not the case in reality).
The Holm-Bonferroni method Controls the Family-Wise Error Rate (FWER). Keeps the chance of making even one false positive at exactly 5% (or your chosen level of significance). Suitable for confirmatory research.Less conservative compared to Bonferroni’s method meaning that the chance of finding a true statistical significance is higher than if you use Bonferroni’s method. Does not require that each test is independent.It is still relatively conservative. If you are doing hundreds of tests, it will still wipe out a lot of potentially valid discoveries.
The Benjamini-Hochberg procedure Instead of controlling the chance of making any false positive, this method controls the False Discovery Rate (FDR) meaning it accepts a small, known percentage of false positives in exchange for a much higher chance of discovering true significant differences. Suitable for exploratory research.Massively more statistical power than any of the above methods. It is the gold standard for exploratory data analysis where you are screening dozens or hundreds of variables to see what looks promising. You use BH when you want to cast a wide net to find leads for future, more targeted research. This is the best approach when making calculating many p-values (>20?). Does not require that each test is independent. You will get some false positive p-values.

The Bonferroni Method

This simple method means that we divide the desired overall level of significance (often 0.05) with the number of p-values calculated. In an example with 50 p-values calculated a Bonferroni adjustment means that only p-values ≤0.001 should be considered as statistically significant which may be difficult to achieve. Hence, Bonferroni’s method is not suitable if you calculate many p-values. The performance of the Holm-Bonferroni method or the he Benjamini-Hochberg procedure is much better so use one of those instead.

The Dunn-Šidák correction

The Dunn-Šidák correction (or simply the Šidák correction) is, as Bonferroni’s method, a historical footnote. The performance of the Holm-Bonferroni method or the he Benjamini-Hochberg procedure is much better so use one of those instead.

The Holm-Bonferroni Method

Controls the Family-Wise Error Rate (FWER) just like the standard Bonferroni, meaning it keeps the chance of making even one false positive at exactly 5% (or your chosen alpha). Instead of treating every test the same, Holm-Bonferroni uses a “step-down” approach. It is essentially a smarter, less punishing version of the standard Bonferroni correction. The procedure is step by step:

  1. Rank your p-values: Order all your p-values from smallest (most significant) to largest (least significant).
  2. Apply a sliding threshold: Instead of dividing your alpha (foe example 0.05) by the total number of tests done for every single test, the threshold changes depending on the rank of the p-value. The formula for the adjusted threshold for the i-th ranked p-value is:
    Threshold = alpha / (m – i + 1)
    (Where alpha is 0.05, m is the total number of p-values calculated, and i is the rank of the p-value).
  3. The Step-Down: For your smallest p-value, the threshold is exactly the same as the one calculated using the standard Bonferroni method. If your p-value is equal to or below this, it is significant. For your second smallest p-value, the threshold becomes slightly easier to achieve compared to the one calculated by the Bonferroni method. You then continue down the list. With each step, the threshold becomes larger (easier to achieve) compared to the threshold calculated by the Bonferroni method. As soon as you encounter a p-value that is larger than its calculated threshold, you stop. That variable, and all variables following it on the list, are declared statistically non-significant.

In a “Step-Down” procedure like Holm-Bonferroni, the p-values function like a series of gates. To reach the final gate, you must first pass through all the preceding ones. When the calculator encounters the first non-significant value, it closes the door completely. According to the rules of Holm-Bonferroni, that p-value and all subsequent p-values are then declared non-significant—regardless of how small they might be in relation to their own theoretical thresholds. This is the “penalty” inherent in Holm-Bonferroni: if one link in the chain breaks, the rest of the tests fall with it to protect against false positives.

Holm-Bonferroni calculator

The Benjamini-Hochberg Procedure

Controls the False Discovery Rate (FDR). This is a massive shift in statistical philosophy. Instead of trying to guarantee zero false positives, the BH procedure accepts that false positives will happen. Its goal is to control the percentage of your “significant” findings that are actually false. If you set your FDR to ≤0.05, it means: “Of all the variables I declare significant today, I am okay with 5% of them being false positives.” The procedure is step by step:

  1. Rank your p-values: Order them from smallest on top to the largest at the bottom of the list.
  2. Calculate a critical value for each p-value: The threshold grows much faster than in the Holm-Bonferroni method. The formula is:
    Critical Value = (i/m) * Q
    (Where i is the rank, m is total number of p-values calculated, and Q is your chosen False Discovery Rate, usually 0.05).
  3. The Step-Up: You look at your list and find the largest p-value that is smaller than its corresponding critical value. That variable, and every variable ranked smaller than it, is declared statistically significant.

Benjamini-Hochberg calculator

Tips on which adjusting method to use

  • Never use the standard Bonferroni correction. The other methods are always better.
  • Having a small number of tests (like 5, 10, or 15) usually means you have specific, pre-planned hypotheses. You aren’t “fishing” for data; you are trying to confirm specific theories. In these cases, you want strict protection against false positives, so Holm-Bonferroni (controlling the FWER) is the correct choice. Use Holm-Bonferroni when you are doing confirmatory research. If a false positive would result in changing a clinical guideline, approving a useless drug, or making a definitive claim in a high-impact journal, you must use Holm-Bonferroni to guarantee that your overall risk for getting one or several false positive results is capped at 5%. Even if you have 30 p-values, if the stakes are high, you have to take the power hit and use Holm-Bonferroni.
  • Use Benjamini-Hochberg when you are doing exploratory research. If your goal is simply to find “variables of interest” to test again in a future, more targeted study, Benjamini-Hochberg is perfect. You accept a 5% false-discovery rate because a false positive here just means a slight waste of time in the next study, not a catastrophic real-world error. You can use BH even if you only have 12 p-values, as long as your goal is purely exploratory. When you cross into 20, 50, or 100+ p-values, you are usually entering the realm of exploratory research (e.g., screening dozens of genes, biomarkers, or demographic variables). In exploratory research, Benjamini-Hochberg (controlling the FDR) is absolutely the right choice because you want to cast a wide net and don’t want a strict FWER penalty to hide valuable leads.

If you perform 100 tests and use Holm-Bonferroni with alpha = 0.05, the goal is that in 95 out of 100 cases, you will have zero false positives. If, on the other hand, you use Benjamini-Hochberg (FDR) at 5%, you accept that approximately 5% of the results you declare significant are actually false. The difference in stringency is therefore enormous!

Scheffé’s Method and Tukey’s HSD

These are methods for follow-ups (post-hoc tests) to an ANOVA. If the ANOVA yields a significant p-value (meaning at least one group is different from the others) then you run Tukey or Scheffé to find out exactly which groups are driving that difference. You aren’t limited to simple pairwise comparisons (e.g., Group A vs. Group B). You can compare the average of Group A and B against Group C. It handles unequal sample sizes between groups perfectly well. These approaches are tailored to group comparisons and they are not “universal adjusters” of the level of significance.

Multiple testing over time of the same outcome

While Holm-Bonferroni and Benjamini-Hochberg are your go-to methods for testing dozens of variables simultaneously at the end of a study, the O’Brien-Fleming boundary or the Pocock Boundary are methods for testing a single primary variable at different points in time while a study is still ongoing. The typical example is interim analysis of intervention studies. Interime analysis creates the exact same Family-Wise Error Rate (FWER) problem we discussed earlier. If you test your data at a 0.05 level of significance at Year 1, Year 2, Year 3, Year 4, and Year 5, your overall chance of a false positive inflates to around 14%. You are essentially “fishing” for a significant result by checking the data repeatedly. To keep the overall error rate at 5%, you have to “spend” your alpha (your 0.05 significance level) carefully across those different looks.

The Pocock Boundary lowers the level of significance and keep it constant during the study. The same level of significance is applied at each interime analysis. For example if data is analysed four times at 25%, 50%, 75% and 100% of observations collected the level of significance is set to 0.015.

The O’Brien-Fleming method is an alpha-spending function. It dictates exactly what your p-value threshold must be at each interim look. Its defining characteristic is that it makes it incredibly difficult to stop a trial early, but it leaves the final threshold almost completely unchanged. If you plan 4 looks at the data (3 interim analyses and 1 final analysis), the O’Brien-Fleming p-value boundaries look roughly like this:

  • Look 1 (25% of data): p-value must be ≤ 0.00001 to stop early.
  • Look 2 (50% of data): p-value must be ≤ 0.0013 to stop early.
  • Look 3 (75% of data): p-value must be ≤ 0.012 to stop early.
  • Final Look (100% of data): p-value must be ≤ 0.043 for final significance. (Notice how the final p-value is 0.043, which is very close to the standard 0.05!)

The O’Brien-Fleming method preserves the final p-value. Because you “spend” very little of your alpha during the early looks, your threshold at the end of the trial is still very close to 0.05. Furthermore, The O’Brien-Fleming method prevents premature stopping. Early in a trial, data can be highly volatile. A few lucky successes can make a drug look like a miracle cure. The massive hurdle at Look 1 (e.g., p ≤ 0.00001) ensures you only stop the trial early if the evidence is absolutely overwhelming. The O’Brien-Fleming boundary should be your preferred go-to method for testing a single primary variable at different points in time while a study is still ongoing.

In what context should I adjust the level of significance?

The level of significance needs to be lowered if you present multiple p-values. Do I need to consider all p-values presented in a single table, all p-values presented in one manuscript or all p-values I have ever calculated in my life? If the latter was correct it would mean that all experienced statisticians would be out of work because of their need for hefty adjustments of the level of significance.

It would be reasonable to adjust for all p-values in a manuscript considered to be a result. P-values are sometimes calculated in correlations or simple regressions purely as a sorting mechanism to decide which variables to include in a final multivariable regression and these preliminary p-values should not be considered as a result. A reasonable suggestion might be to adjust the level of significance only for primary outcomes, a view presented by Steve Grambow in this video:

References

{2262766:NRFTS5CK};{2262766:SIMFEMTT} vancouver default asc 0 1101