This web-page describes what sample size calculation is and what you need to consider. Reading this page will give you the ability to do simple sample size calculations yourself.
You will understand this page best if you first have read the pages Introduction to statistics, Sampling Strategies and Data Collection and Effect size.
Suppose we want to investigate if vitamin C lowers blood pressure. We assume, as an initial null hypothesis, that there will be no difference between groups. The alternative hypothesis is that there is a difference between groups. We want to test this by comparing two groups of individuals: one group receiving vitamin C and one receiving a placebo. In this inferential analysis we want to determine the degree of uncertainty regarding our null hypothesis by calculating an effect size and a p-value. If there truly is a difference between the groups, what is the chance that our study will succeed in rejecting the null hypothesis? How large a sample do we need for our study?
Type II error and the power of a study
It would be beneficial if we could know, before the study is conducted, how great the chance is that we will obtain a p-value low enough to reject the null hypothesis (p<0.05). This can be calculated in advance, and it is called the power of the study. The power of a study is the probability that our study will reject the null hypothesis and detect an effect of vitamin C when an effect actually exists. It is desirable to estimate the power of the project before conducting it. If the power is estimated to be below 80%, the project design should be changed. A common way to modify the design is to increase the planned sample size. How much does the sample size need to be increased to achieve at least an 80% chance of detecting an effect (i.e., obtaining a p-value < 0.05)?
The power of a study is related to the risk of committing a type II error (not seeing an effect of vitamin C that is there). A type II error means that we fail to reject the null hypothesis even though it is actually false, and the probability of committing one is labelled beta (β). In the event of a type II error, our study yields a false negative result (p > 0.05) despite the existence of a real effect or difference. The magnitude of beta is determined partly by the size of the effect we are investigating and partly by the sample size. The effect size is what it is, but the sample size can be manipulated.
The higher the risk of a type II error (β), the lower the power of the study. Mathematically, the power of the study is 1-β (if you prefer to present the power as a number between 0 and 1). Often this is multiplied by 100 to present the power as a percentage between 0 and 100%.
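To make the relation between sample size, β and power concrete, here is a minimal Python sketch. It assumes a two-sided comparison of two group means and uses a normal approximation rather than the exact t distribution, so the numbers are indicative only. The function name `approx_power` and the standardized effect size of 0.5 are illustrative assumptions, not values taken from this page.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    effect_size is the standardized difference (difference in means / SD).
    Assumption: normal approximation, equal group sizes, equal SD.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical effect size of 0.5: power rises (and beta falls) as n grows
for n in (20, 40, 64, 100):
    power = approx_power(0.5, n)
    print(f"n per group = {n:3d}  power = {power:.2f}  beta = {1 - power:.2f}")
```

With the effect size held fixed, increasing the number of participants per group is what moves the power towards the 80% target.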
Introduction to sample size calculation
The statistical calculations (inferential statistics) look at your data and produce results such as effect sizes (odds ratios, hazard ratios, etc.) and p-values. If you do not reach statistical significance, is the reason that there is no correlation or no difference between groups, or is the reason that your sample size was too small? To avoid ending up with the latter problem, it is recommended to use special software and do the planned statistical calculation backwards, using an assumed effect size, a decided level of significance and the desired power of the study. It has gradually become more common that ethics committees and funders require a sample size estimation before approving a project.
In any project you first collect data and initially process them before starting the actual analysis. The following steps include calculating descriptive statistics and, if applicable, also inferential statistics. If your project only includes descriptive statistics, focus your sample size calculation on that. An example of focusing on descriptive statistics: “To confirm a 5% prevalence of a condition with a margin of error of 3% (2-8%) would require 377 observations”. However, if your project includes some inferential statistics, focus your sample size calculation on that and don’t do a sample size calculation for any descriptive statistics.
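For a purely descriptive aim such as estimating a prevalence, one common textbook approach is the normal approximation n = z² · p(1 − p) / margin², optionally followed by a finite population correction. The Python sketch below assumes that approach; the function name `n_for_proportion` and the example inputs are hypothetical. Dedicated calculators may use other interval methods or corrections, so their answers can differ from this approximation.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion(expected_p, margin, confidence=0.95, population=None):
    """Observations needed to estimate a proportion within +/- margin.

    Assumption: normal (Wald) approximation. If a finite population size
    is given, the standard finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected_p * (1 - expected_p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return ceil(n)


# Hypothetical inputs: expected proportion 50%, margin of error 5 percentage points
print(n_for_proportion(0.5, 0.05))
print(n_for_proportion(0.5, 0.05, population=20000))
```

The required number grows quickly as the accepted margin of error shrinks, because the margin enters the formula squared.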
Different approaches to sample size estimation
- Get a convenient sample and hope it is enough
- See how many observations other published projects included and imitate them
- Follow a rule of thumb
- Make a calculation based on your best assumptions.
Hope is good in many situations except this one. Imitating others is not good advice either. What if the others did an underpowered study? Why replicate their mistake? There are some rules of thumb (the third approach above) such as:
- For group comparisons of means (t-test) have at least 30 in each group.
- For group comparisons of proportions (chi-square) have at least 5 in each cell.
- For standard linear regression / correlation have at least 20 observations for each independent variable.
- For logistic regression have at least 10 times more events / end points than independent variables.
- For Cox regression have at least 10 times more events / end points than independent variables. For example, if you have four independent predictor variables in the model and the proportion of positive cases in the population is expected to be 0.30 (30%), you need at least 40 events, so the minimum number of observations required would be 133 (40 / 0.30).
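The arithmetic behind the last rule of thumb can be sketched in a few lines of Python. The function name `min_n_events_per_variable` is hypothetical; the inputs (four predictors, 30% expected event proportion, 10 events per variable) are the ones from the example above.

```python
def min_n_events_per_variable(n_predictors, event_proportion, events_per_variable=10):
    """Minimum number of observations under the events-per-variable rule of thumb."""
    # Events needed so that each predictor is backed by enough events
    required_events = events_per_variable * n_predictors
    # Observations needed to expect that many events
    return required_events / event_proportion


n = min_n_events_per_variable(n_predictors=4, event_proportion=0.30)
print(round(n))
```

Note that the result depends only on the number of predictors and the event proportion, not on the effect size you hope to detect.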
However, these rules of thumb are quite rudimentary because they do not consider the magnitude of the effect size you are looking for. They just give the bare minimum number you should have to avoid violating underlying mathematical assumptions, but they do not consider your particular situation. The best approach to estimate the size of the sample is to do a proper sample size calculation considering the situation in your study. This is done by first making four important decisions:
- Decide what statistical method is going to be used for the inferential statistics.
- Decide what effect size / correlation you are looking for. It is best if this can be estimated using data from previous publications. You have to make a qualified guess if no prior publications exist.
- Decide what would be an acceptable safety margin to avoid committing a type I error (claiming a statistical finding that is not true). This safety margin is labelled alpha or level of significance and is commonly set to 0.05. This means that, when there is in reality no effect, you accept a one in twenty risk of committing a type I error (believing there is an effect when there isn’t).
- Decide what power your study should have. The power is 1 minus the risk of committing a type II error (not identifying an effect/correlation that is true). The power is often set to something between 0.80 and 0.95, which corresponds to a 5-20% risk of committing a type II error.
The rest is quite easy once we have made these four decisions. We enter our decisions into software that does the statistical calculation backwards and states how large a sample we need. Examples of such software are G*Power and PASS. G*Power is free but PASS is quite expensive. G*Power can manage most situations except Cox regression.
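As an illustration of what "doing the calculation backwards" means, the following Python sketch turns the four decisions into a sample size for one specific case: a two-sided comparison of two group means. It assumes the textbook normal approximation n = 2(z₁₋α/₂ + z₁₋β)² / d² per group, where d is the standardized effect size. This is not necessarily the algorithm used by G*Power or PASS, and tools based on the exact t distribution typically return slightly higher numbers. The function name and the example effect size are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(effect_size, alpha=0.05, power=0.80):
    """Participants per group for a two-sided comparison of two means.

    Decision 1: statistical method (comparison of two means, assumed here)
    Decision 2: effect_size (standardized difference, difference / SD)
    Decision 3: alpha (level of significance)
    Decision 4: power (1 - beta)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical standardized effect size of 0.5
print(n_per_group_two_means(0.5, alpha=0.05, power=0.80))
print(n_per_group_two_means(0.5, alpha=0.05, power=0.95))
```

Raising the desired power or looking for a smaller effect both increase the required sample size, which is why the effect size assumption deserves the most care.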
Examples of sample size calculations
The following videos explain this further:
Examples using the software G*Power
Example 1 of sample size calculation for comparing two groups – t-test and Mann-Whitney test
Example 2 of sample size calculation for comparing two groups – t-test and Mann-Whitney test
Example 3 of sample size calculation for unconditional binary Logistic regression when the independent variable is binary (such as gender)
Examples using the software PASS
Example of sample size calculation for Cox regression
Sample size calculation in multiple regression
You may plan for a multiple regression (having more than one independent variable) as your preferred final statistical analysis. There are a few approaches to this situation:
- Make one sample size calculation for each independent variable as if you are going to do simple (unadjusted) regressions. You will get one sample size for each independent variable. Pick the one with the highest number as your preferred sample size (and perhaps add a margin of 20% extra). This is the most common strategy and the one used in the videos above.
- In case you are only interested in one independent variable and want to add a few more only to adjust for them (as confounding variables), try to estimate the contribution from the covariates (R square other X in G*Power) and add it in G*Power together with the expected information around your main independent variable to calculate the sample size required. Finding the right value for the “R square other X” is tricky and might be impossible. Either make a reasonable guess or go with the first strategy above.
- There may be many independent variables in an exploratory study and none are initially more important than another. The simplest solution is to use the first strategy above. It may be difficult to sort out how the variables may relate in a multivariable model without making a lot of guesses.
- Calculating sample size for interaction variables in a regression is tricky for two reasons. Firstly, it is often difficult to find support for the assumptions you need to make so you may be left with some wild guessing. Secondly, you would need more advanced software than G*Power and a statistician who has experience of this advanced calculation (not all statisticians would have that).
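The first strategy in the list above (one calculation per independent variable, keep the largest, perhaps add 20%) can be sketched as follows. The per-variable sample sizes are hypothetical placeholders for numbers you would obtain from separate calculations, and the function name is an assumption for illustration.

```python
from math import ceil


def overall_sample_size(per_variable_n, margin=0.20):
    """Largest per-variable sample size, inflated by an optional safety margin."""
    return ceil(max(per_variable_n) * (1 + margin))


# Hypothetical results of three separate unadjusted sample size calculations
per_variable_n = {"age": 90, "smoking": 140, "diabetes": 251}
print(overall_sample_size(per_variable_n.values()))
```

The variable with the weakest expected effect usually drives the result, since it requires the most observations.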
Level of significance (alpha) versus p-value
A low p-value says it is unlikely that we would get the observed data if the effect / correlation we’re looking for is in reality zero. A low p-value indicates that the null hypothesis can be rejected in favour of the alternative hypothesis. How low must the p-value be for us to believe that our alternative hypothesis is the most plausible? This should be determined from case to case. Read more about this on the page describing the level of significance (alpha).
Calculators for power or sample size
Many statistical software packages, such as G*Power (free), SPSS, Stata, MedCalc, R (free) and SAS, have functions for estimating sample size. There are also several free online calculators:
- ClinCalc.com: Comparing two unmatched groups or one group versus population
- Raosoft: Calculator for confidence interval for proportions
- Sealed envelope: Superiority study comparing two unmatched groups where the outcome variable is binary
- Sealed envelope: Equivalence study comparing two unmatched groups where the outcome variable is binary
- Sealed envelope: Non-inferiority study comparing two unmatched groups where the outcome variable is binary
- Sealed envelope: Superiority study comparing two unmatched groups where the outcome variable is continuous
- Sealed envelope: Equivalence study comparing two unmatched groups where the outcome variable is continuous
- Sealed envelope: Non-inferiority study comparing two unmatched groups where the outcome variable is continuous
- Power and sample size: Lots of different calculators (including survival analysis such as Cox regression)
Sample size estimation in clustered studies
Traditional statistical methods assume that sampling and analysis happen at the same level. Clustered data occur when participants are chosen via groups but analyzed as individuals. While useful, this approach reduces statistical efficiency because observations within the same group are often correlated.
The impact of clusters is measured with the intraclass correlation coefficient (ICC). Think of the ICC as a percentage. Imagine all the differences (variation) in your data as a pie. The ICC tells you how much of that pie is caused by the group someone belongs to. The formula for this can be simplified to:
ICC = Variation caused by the group / Total variation
If the ICC is 0.05, it means 5% of the difference in your results is because of which group the participants were in, and 95% is just individual differences between people.
A typical example is that observations are clustered in different primary health care centres (GP clinics) or different hospitals. Since observations within each cluster are somewhat related, this adds a random variation between clusters that makes your vision slightly blurred. It means that you must increase your sample size to maintain your ability to find what you are looking for. It can be shown that it is better to have many clusters each contributing a few observations than a few clusters each contributing many observations. To estimate this, first calculate the required sample size as if there was no cluster effect. After that, estimate the effect that different cluster designs will have on the required sample size.
You need to find a suitable assumption for the ICC. The ideal situation is if you find a publication with a study similar to yours stating the ICC. If that is the case, use that. Otherwise make a reasonable guess to estimate the ICC. In hospital settings the ICC varies from 0.02 to 0.2. In outpatient care settings common estimates of the ICC are 0.01-0.02. Although the ICC is usually <0.1, it may occasionally be up towards 0.3. If you have no idea what the ICC is, you may explore the consequences of setting it to 0.01, 0.02, 0.05 and 0.1.
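A widely used way to approximate the effect of clustering is the design effect, 1 + (m − 1) × ICC, where m is the number of observations per cluster. This formula is not stated on this page, so the Python sketch below should be read under that assumption, and it further assumes equal cluster sizes. The function names and the unclustered sample size of 200 are hypothetical.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Inflation factor for clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_unclustered, cluster_size, icc):
    """Total observations and number of clusters after adjusting for clustering."""
    # Round first so floating-point noise does not push an exact value up by one
    n_total = ceil(round(n_unclustered * design_effect(cluster_size, icc), 6))
    n_clusters = ceil(n_total / cluster_size)
    return n_total, n_clusters


# Hypothetical: 200 observations needed without clustering.
# Compare small and large clusters across the suggested ICC values.
for icc in (0.01, 0.02, 0.05, 0.1):
    for cluster_size in (5, 50):
        n_total, n_clusters = clustered_sample_size(200, cluster_size, icc)
        print(f"ICC={icc:<4}  cluster size={cluster_size:2d}  "
              f"total n={n_total:4d}  clusters={n_clusters:3d}")
```

For any given ICC, the small-cluster design needs fewer observations in total, which illustrates why many small clusters are preferable to a few large ones.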
Examples of how to write up your sample size estimation
Below are examples of how to write the sample size section in a study protocol or in the final manuscript.
| Situation | Example of how to present the sample size calculation |
|---|---|
| A randomized controlled trial aiming to reduce antibiotic prescribing for urinary tract infections in frail older adults. | “For the sample size calculation, we assumed a clinically relevant reduction in antibiotic prescribing rates from 0.75 to 0.40 per person year, an intracluster correlation coefficient of 0.06, one sided testing, an α of 0.05, a power of 0.8, and a cluster size of 10 patients contributing for seven months in the follow-up period. Using a Wilcoxon test with an adjustment for cluster randomisation, it was estimated that 333 patients would be needed. To account for loss to follow-up, we increased the cluster size to 20 patients. In total, we aimed to include 680 participants in 34 clusters.” |
| An observational study to develop and validate a multivariable prediction model from a retrospective cohort study. The aim was to predict the development of an entero-atmospheric fistula in patients with open abdomen. | “Sample size calculations were based on significant prognostic factors from the recently published systematic review regarding each of the outcomes. All sample size calculations were performed using the software G*Power version 3.1.9.2 with the level of significance set to 0.05, the power to 95% and using a two-tailed test. The sample sizes required for analysing the different independent prognostic factors were for (a) large bowel resection: 287 patients; for (b) failed delayed fascial closure: 99 patients. Therefore, for the expected number of significant variables considered within our study, the aim is to include a total of at least 287 patients.” |
| An observational study to obtain a brief estimate of the relative importance of demographic factors such as rurality, socio-economic standard and ethnicity versus traditional risk factors for women diagnosed with breast cancer in Far North Queensland, Australia. | “A sample size calculation was performed for the primary research question using Power Analysis and Sample Size (PASS) Software. Assuming a power of 0.95, an alpha of 0.05, and hazards ratios of 1.6, 1.4, and 1.3 for Aboriginal and Torres Strait Islander status, remoteness of area of residence, and socioeconomic status respectively, the required sample sizes were 224, 276, and 501.” |
| An intervention study with only one group where some individuals were assumed to react differently to the intervention compared to others. | A sample size calculation was made for the potential difference in antibiotic prescribing in case of a negative test for GAS. It was assumed that 20% of general practice trainees (registrars) would prescribe antibiotics despite a negative test for GAS, in comparison to 40% for specialist general practitioners, assuming a level of significance of 0.05, a power of 0.8 and a two-sided test requires 207 patients. The software G*Power version 3.1.9.2 was used assuming logistic regression with antibiotic prescribing as the dependent variable. The researchers aimed to collect data from 300 patients. |
| A randomized controlled trial that aimed to measure if providing mothers with pedometers would increase physical activity in their children. | A two-tailed Student’s t-test was used as a surrogate analysis in the sample size calculation. Under the assumptions of 80% power, an alpha of 0.05, increase of daily steps of +1300 in the intervention group and no change in the control group with a standard deviation of 1200 in both groups, results in a requirement of 16 participating children. To allow for some loss to follow-up, the target was set to 25 in each group. |
| A randomized controlled trial with three aims. This project aims (a) to estimate people’s interest in health-related research, (b) to establish the extent to which people appreciate being actively informed about current local health-related research and (c) to discover if the level of people’s interest can be influenced by proactively promoting local current health-related research using large TV monitors. | A sample size estimation was made for each of the aims: a) Accepting a margin of error of 2.5% with a 95% confidence level and assuming that 80% are positive towards medical research requires 938 responses. b) Accepting a margin of error of 2.5% with a 95% confidence level and assuming that 50% are positive to the automated information system for medical research requires 1428 responses. c) Assuming a level of significance of 0.05, 95% power, a two-tailed test and assuming that the proportion of patients being positive to medical research increases from 80% to 90% requires 341 surveys before and 341 surveys after the introduction of the automated presentation system. We aimed to collect approximately 500 answered surveys in each phase, in total 1500. |
| A randomized controlled trial aimed to assess the efficacy of low-dose oral prednisolone for four days in addition to conventional therapy in the management of painful acute otitis externa. | Sample size calculations were based on the primary research questions and made two-tailed to avoid the assumption that a difference between groups would always favour the intervention group. Sample size calculations for survival analysis used the statistical software PASS version 11.0.8.20. Other sample size calculations were done using the statistical software G*Power version 3.1.3. We calculated that 198 patients would be sufficient to answer all primary research questions. We expected that some patients would be lost to follow-up so we aimed to include 250 patients. A more detailed description of the sample size calculation is described in the full study protocol. |
| An observational cross-sectional survey study aiming to clarify factors which correlate to the propensity of general practitioners (GPs) to prescribe supplementation for borderline vitamin B12 deficiency. | Male medical practitioners have in other situations been seen as more proactive (for better or worse) than female medical practitioners in prescribing behaviour [Citation 21]. The authors wanted to explore if this was also true for the prescribing of vitamin B12 supplementation for the described scenario. We assumed that 30% of female and 60% of male practitioners are high prescribers of B12 and using logistic regression. Level of significance was set to 0.05 and power set to 80%. A required sample size of 88 GPs was calculated by the statistical power analysis program G*Power, version 3.1.9.2, on 31st October, 2014. We aimed to collect more than 90 surveys. |
| An observational study aimed to 1) establish a reproducible method of assessing abdominal aortic aneurysm (AAA) calcification using computed tomography (CT); 2) investigate the association between AAA calcification and growth. | The required sample size was calculated based on two assumptions. Firstly, mean AAA volume growth/year in patients with calcification volume < median was assumed to be 12 cm3/yr, SD = 6.5 cm3/yr based on results from a previous CT study [Citation 17]. Secondly we predicted that AAA growth rate would be 42% greater in patients with calcification volume < median as suggested by results from a study by Lindholt and colleagues [Citation 16]. Using the G-power 3.1.9.2 tool, (Two tailed t-test: difference between means α = 0.05, Power = 0.95), 30 observations in each group were needed. |
| An observational study aiming to quantify the prevalence of documented urinary tract infection (UTI), nonspecific symptoms, and antibiotic treatment of suspected UTI in nursing homes. The study explored covariations with logistic regression. | To estimate the covariation between presence of a symptom and having diabetes we assume that 3% of non-diabetics and 12% of diabetics has confusion or fatigue or restlessness with an alpha error of 0.05, a power of 90% and a prevalence of diabetes of 15% requires 620. To estimate the covariation between being on antibiotics and having diabetes we assume that 1% of non-diabetics and 8% of diabetics are on antibiotics with an alpha error of 0.05, a power of 90% and a prevalence of diabetes of 15% requires 602. To ensure a suitable sample we aim to include 850 participants. All sample size calculations are made using the software G*Power version 3.1.9.2. |
Read more…
- NCSS Statistical software: Video about basics in calculating power and sample size
- Sabyasachi et al. Sample size calculation – Basic principles. Indian Journal of Anaesthesia. 2016.
- Rutherford et al. Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology. 2015.