Evaluation of tests

Cite this page as:

Ronny Gunnarsson

First published:

January 28, 2018

on:

INFOVOICE.SE

Last updated:

June 12, 2026

This page gives you a bird’s-eye view of the evaluation of various tests used for screening, diagnostics, or different research purposes.

You will understand this page best if you have first read the page Observations and variables.

In research with an empirical-atomistic (“quantitative”) approach, we sometimes refer to our data collection techniques as tests. In research projects that use tests, it is always beneficial to be able to clarify how good the test is in the specific situation used. Some research projects have the direct goal of developing a better description of the validity and reliability of a test.

When we want to evaluate a test, we often compare the outcome of our test with a reference method (gold standard). The reference method is a kind of “answer key” (more on this below). A test can be an analysis of the hemoglobin level in the blood (Hb), measurement of systolic blood pressure, a bacterial culture from the throat, a structured questionnaire, or a structured interview.

What types of tests are there?

Within the empirical-atomistic (“quantitative”) approach, there are four main types of tests:

1. Tests that provide an exact measurement value according to an interval or ratio scale, for example, Hb value. In principle, the value can assume any value, within reasonable limits, of course. Measurement data are continuous or discrete, i.e., the scale steps are equidistant and measured according to the interval scale or ratio scale (for information on different variables and measurement scales, see the page on variables).
2. Tests that provide answers according to an ordinal scale with more than two possible outcomes where the possible outcomes are ordered, for example, a questionnaire with the answer options “strongly agree”-“partially agree”-“undecided”-“strongly disagree”. VAS (Visual Analogue Scale) is sometimes included here.
3. Tests that provide answers according to a nominal scale with more than two possible outcomes where the possible outcomes are unordered, for example, blood type.
4. Tests that provide answers according to a nominal scale with only two possible outcomes (dichotomous outcome). This can be a yes-no answer, for example, the presence or absence of streptococcal bacteria in the throat.

Evaluating Tests

The outcome is measured according to the interval / ratio scale

For tests of type 1 as described above, where both the new test and the reference method produce quantitative results, we want to investigate how well the result of the new test agrees with the result of the reference method/gold standard. This can be done by calculating the difference between the new test and the reference method for each individual, for example: new test result minus reference method result. We then calculate the mean and standard deviation of these differences. The mean difference estimates the systematic bias between the two methods, while the standard deviation describes how much the individual differences vary around this mean. Bland–Altman 95% limits of agreement are usually calculated as: mean difference ± 1.96 × standard deviation of the differences. This calculation assumes that the differences between the two methods are approximately normally distributed and that the variability of the differences is reasonably constant across the measurement range. These limits estimate the range within which about 95% of individual differences between the two methods are expected to lie.

Let us look at an example where two methods of measuring hemoglobin levels in the blood are compared. If the limits of agreement are −12 to +12 g/L, we can state that, when using the new method to measure hemoglobin, about 95% of future individual results are expected to be within 12 g/L of the result obtained by the reference method, assuming similar measurement conditions and approximately normally distributed differences. You can also create a Bland–Altman plot. Read more about this on our page about evaluating the degree of agreement (limits of agreement).
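The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a full Bland–Altman analysis: the function name `limits_of_agreement` and the Hb values are invented for the example, and the sketch assumes approximately normally distributed differences with constant spread across the measurement range.

```python
from statistics import mean, stdev

def limits_of_agreement(new_test, reference, z=1.96):
    """Return (mean difference, lower limit, upper limit) for paired measurements."""
    if len(new_test) != len(reference):
        raise ValueError("paired measurements must have equal length")
    # Difference for each individual: new test result minus reference result
    diffs = [n - r for n, r in zip(new_test, reference)]
    bias = mean(diffs)   # systematic difference between the methods
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical Hb values in g/L, invented for illustration only
hb_new = [132, 145, 118, 150, 127, 139, 122, 160]
hb_ref = [130, 149, 120, 146, 131, 135, 125, 158]

bias, lower, upper = limits_of_agreement(hb_new, hb_ref)
print(f"bias = {bias:.1f} g/L, limits of agreement = {lower:.1f} to {upper:.1f} g/L")
```

With real data, the differences should be inspected (for example in a Bland–Altman plot) before the limits are trusted.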

The outcome is measured according to the ordinal scale

If the number of possible outcomes is reasonably small, the kappa coefficient is suitable as a measure of how well the new test agrees with the established reference method. A standard kappa coefficient does not take the ordering of the outcomes into account. If there are many outcomes, the kappa coefficient tends to become very low. If there is an inherent order, a weighted kappa coefficient is an alternative option.
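As a rough illustration of the difference between the standard and the weighted kappa coefficient, the sketch below computes both from a cross-table of two tests. The function name `kappa`, the use of linear weights (one common choice among several), and the counts are assumptions made for the example.

```python
def kappa(table, weighted=False):
    """Cohen's kappa for a square table where table[i][j] is the number of
    individuals placed in category i by test A and category j by test B."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            if weighted:
                # Linear weights: disagreement grows with category distance
                w = abs(i - j) / (k - 1)
            else:
                # Unweighted: every disagreement counts the same
                w = 0.0 if i == j else 1.0
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / (n * n)
    return 1.0 - observed / expected

# Hypothetical counts for a three-step ordinal scale; all disagreements
# are between neighbouring categories
counts = [[10, 5, 0],
          [5, 10, 5],
          [0, 5, 10]]
print(round(kappa(counts), 3), round(kappa(counts, weighted=True), 3))
```

Because all disagreements in this invented table are only one step apart, the weighted coefficient comes out higher than the unweighted one.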

The outcome is measured according to the nominal scale – more than two outcomes

For tests of type 3, you can indicate how well the test agrees when two different people perform the test; this is called “inter-rater reliability” or inter-rater agreement. If the same person performs the test on two different occasions, it is called test-retest reliability. The degree of agreement is calculated using the kappa coefficient. In the case of a questionnaire consisting of multiple components, you can indicate how well the different parts correlate with each other by reporting Cronbach’s alpha (=”internal consistency reliability”).
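Cronbach's alpha is usually computed from the variances of the individual items and the variance of the summed score. The sketch below uses that textbook formula; the function name `cronbach_alpha` and the questionnaire scores are invented for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: a list of k lists, each holding one item's scores
    for the same respondents in the same order."""
    k = len(items)
    # Total score per respondent across all items
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Hypothetical answers from five respondents to three questionnaire items
answers = [
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
]
print(round(cronbach_alpha(answers), 2))
```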

The outcome is measured according to the nominal scale – only two outcomes

The kappa coefficient can also be applied to tests of type 4. However, it is much more common to evaluate such a test based on characteristics such as the probability of obtaining a positive result in those who have the condition (sensitivity), the probability of obtaining a negative result in those who do not have the condition (specificity), whether the test actually added any new knowledge (likelihood ratio), and its utility in the individual case (predictive value).

Sensitivity and Specificity

To determine the sensitivity and specificity characteristics of a test, we must have an answer key to compare it against. This answer key is called the reference method or “gold standard” (see below) and is the method considered to best reflect the truth. Sensitivity is the proportion of true positives that the test correctly identifies as positive, and specificity is the proportion of true negatives that the test correctly identifies as negative. Read more on the webpage about sensitivity, specificity, and ROC analysis.
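Expressed in terms of a two-by-two table against the reference method, the definitions above can be sketched as follows. The function name and the counts are invented for illustration.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """tp/fn: reference-positive individuals with a positive/negative test.
    fp/tn: reference-negative individuals with a positive/negative test."""
    sensitivity = tp / (tp + fn)  # proportion of true positives detected
    specificity = tn / (tn + fp)  # proportion of true negatives detected
    return sensitivity, specificity

# Hypothetical study: 100 individuals with and 200 without the condition
sens, spec = sensitivity_specificity(tp=90, fn=10, fp=30, tn=170)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```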

Different manufacturing procedures or different ways of handling the test alter the test characteristics; thus, you can increase the sensitivity of a test at the cost of lower specificity, and vice versa. Companies that manufacture tests often invest a great deal of effort into striking the right balance between sensitivity and specificity for the test.

Theoretically, sensitivity and specificity are often viewed as prevalence-independent characteristics of the test. In practical situations, however, they can be indirectly influenced by changes in patient selection, disease spectrum, or observer behavior. In reality, therefore, sensitivity and specificity can be slightly affected even by the prevalence of the phenomenon. Imagine a person sitting and examining culture plates to detect strep throat bacteria. If that person knew that roughly every other plate contained strep throat bacteria, every plate would likely be scrutinized carefully. This results in high sensitivity. If, instead, only 1 in 1000 plates contained strep throat bacteria, each plate would likely not be examined as thoroughly. The probability of missing that 1000th plate would then increase slightly—in other words, sensitivity would drop a little and specificity would increase. Thus, prevalence does influence sensitivity and specificity to a small degree.

Likelihood Ratio

The point of doing a test is so that we will know more afterward. The test should therefore add information. The probability that the individual has the characteristic (e.g., the disease) should be higher after a positive test compared to before the test. If the probability does not increase, the test has not added any new knowledge. The factor by which the odds for the condition increase is called the likelihood ratio (LR) of a positive test. You can calculate the LR of a positive test outcome (PLR) and of a negative test outcome (NLR). Typically, people only calculate the PLR and less frequently the NLR. Read more on the webpage about Likelihood Ratio.

A high PLR means that the test will add new information. The reverse applies to the NLR, meaning a low value is good. LR depends on sensitivity and specificity but not directly on prevalence. Following the reasoning above, sensitivity and specificity can in some situations change slightly if the prevalence changes. As a rule, LR is less affected by changes in prevalence than sensitivity and specificity are. If you know the pre-test prevalence, the LR is an excellent way to calculate the probability that the individual has the characteristic you are looking for after the test (=positive predictive value). This becomes particularly useful when you start with a known prevalence and then perform multiple mutually independent tests in a series. The odds after the first test become the pre-test odds for the next test, and so on.
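The chaining of tests described above can be sketched with the standard relations PLR = sensitivity / (1 − specificity), NLR = (1 − sensitivity) / specificity, and post-test odds = pre-test odds × LR. The function names and the numbers are invented for the example, and the sketch assumes that the tests are mutually independent.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (PLR, NLR) for a dichotomous test."""
    plr = sensitivity / (1 - specificity)
    nlr = (1 - sensitivity) / specificity
    return plr, nlr

def post_test_probability(pre_test_probability, *lrs):
    """Apply one or more likelihood ratios in sequence.
    The odds after each test become the pre-test odds for the next."""
    odds = pre_test_probability / (1 - pre_test_probability)
    for lr in lrs:
        odds *= lr
    return odds / (1 + odds)

plr, nlr = likelihood_ratios(sensitivity=0.90, specificity=0.85)

# Hypothetical pre-test prevalence of 10%
after_one = post_test_probability(0.10, plr)        # one positive test
after_two = post_test_probability(0.10, plr, plr)   # two independent positive tests
print(f"PLR = {plr:.1f}, NLR = {nlr:.2f}")
print(f"probability after one positive test: {after_one:.2f}")
print(f"probability after two positive tests: {after_two:.2f}")
```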

It is important to remember that if you do not know the pre-test prevalence, the likelihood ratio is not much more useful than sensitivity and specificity. A high positive likelihood ratio can show that it is a good test in and of itself, but it does not mean that a positive test indicates the presence of disease with a high probability (assuming it is a disease the test is looking for).

Predictive value of tests

Sensitivity and specificity generally solve the wrong problem. They tell you how the test functions, but not how the patient is doing. The predictive value tells you the probability that the individual patient actually has whatever the test is designed to find. It should be noted that predictive value is a statistical concept and is not used exclusively in medicine. In statistics, predictive values are sometimes calculated for many different phenomena, such as the probability that today’s average wind speed will exceed 10 meters per second.

When we use a test, we do not know who has or lacks the disease (or characteristic) beforehand. We can use sensitivity, specificity, and the occurrence (=prevalence) of the targeted characteristic to calculate the predictive value. Of these three, prevalence is usually the factor that influences the predictive value the most. The positive predictive value (PPV) is the probability that the characteristic (disease?) actually exists in the tested individual if the test is positive. Consequently, the negative predictive value (NPV) is the probability that the characteristic (disease?) is absent in the tested individual if the test is negative. If the prevalence of the characteristic (disease?) decreases, the positive predictive value decreases, while the negative predictive value increases. The conclusion is that if the prevalence changes, sensitivity and specificity might change slightly, but the predictive value will unconditionally change, often quite substantially. Read more on the webpage about predictive values.
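The dependence of the predictive values on prevalence can be illustrated with a small sketch based on Bayes' theorem. The function name and the values for sensitivity, specificity, and prevalence are assumptions chosen for the example.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence                # true positives (as proportions)
    fn = (1 - sensitivity) * prevalence          # false negatives
    fp = (1 - specificity) * (1 - prevalence)    # false positives
    tn = specificity * (1 - prevalence)          # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# Same hypothetical test (sensitivity 90%, specificity 85%) at falling prevalence
results = [predictive_values(0.90, 0.85, prev) for prev in (0.50, 0.10, 0.01)]
for prev, (ppv, npv) in zip((0.50, 0.10, 0.01), results):
    print(f"prevalence {prev:.0%}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

In this sketch the test characteristics are held fixed, so the change in PPV and NPV is driven by prevalence alone.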

The lower the prevalence of the phenomenon/disease you are looking for, the less useful the PPV becomes, whereas the value of the NPV becomes more significant. The reverse applies as prevalence rises. Generally speaking, the higher the predictive values, the more useful they are (more on this later).

The higher the predictive value, the greater the clinical utility of the test. How high does it need to be for the test to be considered useful? That depends on the situation. If we are looking for a dangerous disease that can easily be cured with a side-effect-free treatment, we settle for a lower positive predictive value (PPV). Conversely, if we are looking for less dangerous conditions where the treatment has questionable efficacy or noticeable side effects, we demand a higher PPV (more on this later).

Assessing the Practical Utility of Dichotomous Tests

Which test metric should we use? Put simply, you can say that:

Sensitivity and specificity answer the question: How is the test doing?
Likelihood ratio answers the question: How much new information does the test add?
Predictive value answers the question: How is the patient doing? (or what is the probability of the phenomenon…?)

If we want to assess the practical utility of a test, sensitivity and specificity are fairly uninteresting. Predictive value is by far the best way to evaluate practical (clinical) utility. The likelihood ratio is an alternative pathway to determine the predictive value. This pathway is particularly useful when you want to assess the value of performing several different tests in sequence. If we know both the predictive value and the likelihood ratio characteristics (the latter can be calculated from sensitivity and specificity), we can attempt to estimate the clinical utility of the test (Table 1).

Positive predictive value (PPV)	Negative predictive value (NPV)	Positive Likelihood ratio (PLR)	Negative Likelihood ratio (NLR)	Practical Utility
High		High		The test will provide you with useful information.
High		Low		Even before the test is performed, you already know that the patient likely has the disease. The test does not add much new information.
Low		High		The test provides you with new information, which is, however, of questionable clinical value.
Low		Low		The test is useless in this situation.
	High		High	Even before the test is performed, you already know that the patient likely does not have the disease. The test does not add much new information.
	High		Low	The test will provide you with useful information.
	Low		High	The test is useless in this situation.
	Low		Low	The test provides you with new information, which is, however, of questionable clinical value.

Table 1 – Using predictive values and Likelihood ratio to determine if a test is useful

Our values for the likelihood ratio and predictive values are point estimates. The reliability of these point estimates depends heavily on how many observations we have as a basis for our calculations. This means that we should always calculate 95% confidence intervals for our estimates of the likelihood ratio and predictive values. How we should interpret the likelihood ratio and predictive values is determined entirely by their confidence intervals.

Mathematically, there are no fixed threshold values for how the confidence intervals should be interpreted. Where one chooses to establish thresholds for interpretation is therefore a gray area, where a decision relevant to the specific study must be made in each situation. A few practical threshold values that have been discussed are provided below as examples (Tables 2-4).

Lower limit	Upper limit	Utility
≥90%		Very useful
≥60% and <90%		Probably useful
(everything else)	(everything else)	Information is missing to determine the utility
	>10% and ≤40%	Probably useless
	≤10%	Clearly useless

Table 2 – 95% confidence intervals for determining the utility of predictive values
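The example thresholds in Table 2 can be written as a simple decision rule. This is only a sketch of the table as printed here; the function name is invented, and the thresholds are the discussion examples from the table, not fixed rules.

```python
def predictive_value_utility(ci_lower, ci_upper):
    """Classify a predictive value from the limits of its 95% confidence
    interval, given in percent, using the example thresholds of Table 2."""
    if ci_lower >= 90:
        return "Very useful"
    if ci_lower >= 60:
        return "Probably useful"
    if ci_upper <= 10:
        return "Clearly useless"
    if ci_upper <= 40:
        return "Probably useless"
    return "Information is missing to determine the utility"

# Hypothetical confidence intervals
print(predictive_value_utility(92, 99))
print(predictive_value_utility(30, 70))
```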

Lower limit	Upper limit	Utility
≥10		Very useful
≥5 and <10		Moderately useful
≥2 and <5		Weakly useful
(everything else)	(everything else)	Information is missing to determine the utility
	>1.5 and ≤2	Probably useless
	≤1.5	Clearly useless

Table 3 – 95% confidence intervals for determining the utility of PLR

Lower limit	Upper limit	Utility
	≤0.1	Very useful
	>0.1 and ≤0.2	Moderately useful
	>0.2 and ≤0.5	Weakly useful
(everything else)	(everything else)	Information is missing to determine the utility
>0.2 and ≤0.5		Probably useless
≥0.5		Clearly useless

Table 4 – 95% confidence intervals for determining the utility of NLR

What is high and what is low? It is difficult to give an exact answer because it depends on what you are looking for and the consequences of missing it. The values in Tables 2, 3, and 4 above are a proposal meant to serve as a rough starting point for discussion. The values for predictive values (Table 2) have been used in a previous study.

The examples of thresholds provided in Tables 2–4 can serve as an aid to understanding Table 1. In an actual evaluation of the clinical utility of a test, however, one must weigh the implications of missing what the test is looking for (classifying sick individuals as healthy) against the consequences of classifying healthy individuals as sick.

If it involves a potentially fatal disease that can easily be cured with a harmless treatment, it is crucial not to miss any individual. In this scenario, a post-test PPV of more than 5–10% might be considered sufficient to initiate treatment. On the other hand, if you are evaluating a test to find a disease that only rarely causes serious complications, it is reasonable to require a higher PPV value before administering treatment. For instance, with strep throat, some authors believe that the probability of the individual having streptococci (the PPV of a test for detecting streptococcal bacteria) should exceed 60% before treatment is given. If it involves a disease that only rarely causes serious complications and where the treatment carries risks for the patient, it may be reasonable to demand a PPV of over approximately 95% before initiating treatment.

Binary Diagnostic Test Characteristics Calculator

The calculator above uses standard mathematical approximations to generate metrics and confidence intervals dynamically, without relying on external statistical libraries. Some details of how the calculator works are explained below:

Confidence intervals for proportions (Sensitivity, Specificity, PPV, NPV) are calculated using the standard Wald method. The formula used for the upper and lower bounds is:

p \pm Z \sqrt{\frac{p(1-p)}{n}}

p represents the calculated proportion (e.g., Sensitivity).
Z represents the critical Z-score associated with the user-defined confidence level.
n represents the applicable sample size (e.g., total actual positives for Sensitivity).

Note: To prevent mathematically impossible bounds, the output of the Wald calculation is clamped to the range 0% to 100%.
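The Wald interval with clamping can be sketched as follows. This is an illustration of the formula above, not the calculator's actual source code; the function name and the counts are assumptions.

```python
from math import sqrt

def wald_interval(successes, n, z=1.96):
    """Wald confidence interval for a proportion, clamped to the range 0..1."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    # Clamp so the bounds never leave the range 0% to 100%
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical sensitivity: 90 of 100 reference-positive individuals test positive
low, high = wald_interval(90, 100)
print(f"sensitivity 95% CI: {low:.3f} to {high:.3f}")

# With 98 of 100 the unclamped upper bound would exceed 100%
print(wald_interval(98, 100))
```

The Wald method is known to behave poorly for proportions close to 0 or 1 and for small samples, which is where the clamping becomes visible.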

To allow for custom confidence levels (e.g., 95%, 98%, 99%), the script dynamically converts the user’s percentage into a Z-score using the Abramowitz and Stegun approximation. This algorithm approximates the inverse of the standard normal cumulative distribution function closely enough for this purpose, bypassing the need for a lookup table.
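The page does not state which Abramowitz and Stegun formula the calculator uses. The sketch below assumes the widely used rational approximation 26.2.23, whose absolute error is stated to be below 4.5 × 10⁻⁴; the function name is invented.

```python
from math import log, sqrt

def z_from_confidence(confidence_percent):
    """Two-sided critical Z-score for a confidence level given in percent,
    using the rational approximation of Abramowitz and Stegun 26.2.23
    (assumed here; the calculator may use a different formula)."""
    p = (1 - confidence_percent / 100) / 2  # upper-tail probability, 0 < p <= 0.5
    t = sqrt(-2 * log(p))
    numerator = 2.515517 + 0.802853 * t + 0.010328 * t * t
    denominator = 1 + 1.432788 * t + 0.189269 * t * t + 0.001308 * t ** 3
    return t - numerator / denominator

for level in (95, 98, 99):
    print(level, round(z_from_confidence(level), 3))
```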

Unlike proportions, likelihood ratios are not bounded between 0 and 1; they range from 0 to infinity. Calculating standard errors on a linear scale would result in statistically invalid bounds. To account for this skew, the calculator computes the standard error on a logarithmic scale. First, it calculates the standard error of the natural log of the Likelihood Ratio. For the Positive Likelihood Ratio, the formula is:

SE = \sqrt{\frac{1}{TP} - \frac{1}{TP+FN} + \frac{1}{FP} - \frac{1}{FP+TN}}

The confidence bounds are calculated on this log scale and then exponentiated back to the linear scale to provide the final, properly asymmetrical confidence intervals:

\exp(\ln(LR) \pm Z \times SE)
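Put together, the log-scale interval for the positive likelihood ratio can be sketched as below. The function name and the counts are invented for illustration, and the sketch assumes that no cell of the two-by-two table is zero, since the formula would otherwise divide by zero.

```python
from math import exp, log, sqrt

def plr_confidence_interval(tp, fn, fp, tn, z=1.96):
    """Return (PLR, lower bound, upper bound) using the log-scale method."""
    # PLR = sensitivity / (1 - specificity)
    plr = (tp / (tp + fn)) / (fp / (fp + tn))
    # Standard error of ln(PLR)
    se = sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    lower = exp(log(plr) - z * se)
    upper = exp(log(plr) + z * se)
    return plr, lower, upper

# Hypothetical counts: 100 reference-positive and 200 reference-negative individuals
plr, lower, upper = plr_confidence_interval(tp=90, fn=10, fp=30, tn=170)
print(f"PLR = {plr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval is symmetric around ln(PLR) on the log scale, so after exponentiation the upper bound lies further from the point estimate than the lower bound does.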

Reference Method (Gold Standard)

The gold standard is a generally accepted reference method or the best available method for determining the presence or absence of whatever you are looking for. Hopefully, the generally accepted reference method is also the best method. All the above measures of a test’s value are obtained by comparing our test to a gold standard. It is important to remember that “the truth” and the gold standard are not always the same thing. If they differ, we must keep in mind that our test evaluation is not optimal. The greater the difference between “the truth” and our gold standard, the greater the risk that our new test being evaluated will receive better or worse test values than it actually should. (Falsely better test values if the gold standard and the new test share the same systematic error; falsely worse test values if only our gold standard has a systematic error or a large random error).

When speaking of a predictive value, it is not always a given that it is the probability of a disease that is being predicted. In medical contexts, it can often be the presence of something other than a disease, such as a streptococcal bacterium in the throat. If the presence of the bacterium in the throat means that one is always sick from it, then there is no difference between predicting the presence of a bacterium or a disease, such as strep throat caused by streptococcal bacteria. However, if there are healthy carriers of the same bacterium who should not be treated, it immediately makes a big difference. A positive test could then mean that the individual is a carrier of streptococcal bacteria but is actually sick from a virus. Here, it is crucial to be clear about what is being predicted and its relevance. What is it that our gold standard is actually predicting? More information on this can be found in our section on etiologic predictive value.

Dichotomization

Tests of type 1 (measured on an interval or ratio scale) and type 2 (measured on an ordinal scale) are often converted into yes/no tests by establishing a threshold value. If the test result falls above the threshold, it is considered a “yes” response; if the value falls below, the test is considered to have given a “no” response. Following dichotomization, the test is evaluated as if the outcome were dichotomous.
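A minimal sketch of dichotomization is shown below. The function name, the blood pressure values, and the threshold of 140 mmHg are invented for illustration. The text does not say how a value exactly equal to the threshold is handled; the sketch treats it as a "no", which is an assumption.

```python
def dichotomize(values, threshold):
    """Return True ("yes") for values above the threshold, otherwise False ("no").
    Values exactly equal to the threshold are treated as "no" (an assumption)."""
    return [value > threshold for value in values]

# Hypothetical systolic blood pressures in mmHg with a hypothetical threshold
pressures = [128, 152, 140, 165]
outcomes = dichotomize(pressures, threshold=140)
print(outcomes)
```

The resulting yes/no outcomes can then be compared with the reference method in a two-by-two table.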

Incorrect Methods for Comparing Different Tests

It is considered inappropriate to use correlation analysis to compare the outcomes of type 1 or type 2 tests with a gold standard. The reason is that correlation analysis reflects individual variations to a greater extent than it reflects differences in test outcomes between the new test and the gold standard. Even when the tests show poor agreement, a correlation analysis can still yield a high correlation because the relationship is primarily driven by the fact that you are measuring something where individual variation is highly pronounced. An example of this is using different methods to measure Body Mass Index (BMI). Here, two different methods may agree poorly, yet the differences in BMI between the various individuals carry far more weight than the differences in the outcomes of the two tests.
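The point can be illustrated with invented numbers: a new method that overestimates BMI by about 3 units in every individual still correlates almost perfectly with the reference method, because the variation between individuals dominates. The data and the helper `pearson_r` are assumptions made for this sketch.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

# Hypothetical BMI values with large variation between individuals
bmi_reference = [18.5, 21.0, 24.3, 27.8, 31.2, 35.6, 40.1]
noise = [0.4, -0.5, 0.3, -0.2, 0.5, -0.4, 0.1]
# The new method reads about 3 BMI units too high for everyone
bmi_new = [ref + 3.0 + e for ref, e in zip(bmi_reference, noise)]

r = pearson_r(bmi_new, bmi_reference)
bias = mean(n - ref for n, ref in zip(bmi_new, bmi_reference))
print(f"correlation = {r:.3f}, mean difference = {bias:.2f} BMI units")
```

The correlation coefficient alone would suggest excellent performance, while the mean difference reveals the systematic disagreement.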

Intraclass Correlation (ICC) is a method that has become popular in recent years. However, ICC measures reliability rather than agreement. Occasionally, one sees researchers using a t-test to compare the mean of the results from the new test with the mean of the results produced by our gold standard, subsequently claiming that the lack of a significant difference indicates the new test is as good as the gold standard. This is an incorrect way to evaluate tests. Learn more on our page about choosing a statistical method.
