Standard linear regression


Reading this page will give you an understanding of what standard (classical) linear regression is and how to perform the analysis in the statistical software SPSS (the process is similar in other statistical programs).

You will understand this webpage best if you have first read the pages Covariation and Correlation and regression.

Linear regression involves using observations to derive the equation of the straight line that best describes the data. There are many different types of linear regression models that all use the equation of a straight line (it is important to check the previous link before reading further).

Linear regression in which the dependent variable is measured on an interval or ratio scale is called “standard linear regression”. This webpage focuses exclusively on standard linear regression.

Different types of standard linear regression

There are different types of “standard linear regression”. It is also often just called “linear regression”. Standard linear regression comes in a few different variants:

  • One dependent (Y) and one independent (X) variable: Simple regression = Simple standard linear regression = Unadjusted standard linear regression (= Bivariate standard linear regression)
  • More than one independent variable (multiple X): Multiple (standard) linear regression = Multivariable (standard) linear regression = Adjusted (standard) linear regression
  • More than one dependent variable (several Y): Multivariate (standard) linear regression
  • More than one dependent variable and more than one independent variable (multiple Y and X): Multivariate (standard) linear regression *

* For the sake of consistency, this should be called “Multivariate multiple (standard) linear regression”, but in practice “multiple” is omitted when “multivariate” is used.

The underlying mathematics of linear regression are explained simply and clearly on another webpage. It is important that you familiarize yourself with this before reading further.

Simple or multiple linear regression

A “simple standard linear regression” has only one dependent variable (Y) and one independent variable (x). It is therefore the equation of a straight line, and it can be mathematically described in different ways (all of which mean the same thing):

Y = a + bx
Y = b0 + b1x
Y = b + mx
Y = kx + m

Have a look at this introduction to simple linear regression by Numiqo:

Sometimes, variations in the dependent variable (y) cannot be adequately described by the variation in just a single independent variable (an x). In such cases, it may be appropriate to look at a multiple linear regression that includes several independent variables (multiple x’s). This can be mathematically described as:

Y = a + b1x1 + b2x2 + b3x3 + … + bnxn

In the formula above, we have an intercept (a) and several regression coefficients (several different b’s). Simple linear regression can easily be visualized using a scatter plot (Figure 1 on the page about correlation and regression). A multiple linear regression with two independent variables can be visualized using a three-dimensional (difficult to interpret) scatter plot. Regressions with three or more independent variables cannot be visualized in diagram form.
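As an illustration, here is a minimal sketch of how a simple and a multiple linear regression can be fitted in Python with statsmodels (this page otherwise uses SPSS; the variables weight, age, and height and all values are made up):

```python
# Minimal sketch: simple and multiple linear regression with statsmodels.
# The data and variable names are made up for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "weight": [62, 75, 80, 68, 90, 71, 85, 77],
    "age":    [25, 40, 35, 30, 50, 28, 45, 38],
    "height": [165, 180, 178, 170, 185, 172, 182, 176],
})

# Simple regression: Y = a + b*x
simple = smf.ols("weight ~ age", data=df).fit()
print(simple.params)          # the intercept (a) and the slope (b)

# Multiple regression: Y = a + b1*x1 + b2*x2
multiple = smf.ols("weight ~ age + height", data=df).fit()
print(multiple.summary())     # coefficients, R-squared, F-test and p-values
```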

How is it done in practice?

Prerequisites your observations must meet

  • Requirements for the variables themselves (Data type):
    The dependent variable (y): Must be continuous (numeric). This means it must be measurable on a scale, for example, blood pressure, income, age, or weight. If your outcome is a category (e.g., “Sick” or “Healthy”), you cannot use linear regression, but must switch to logistic regression.
    The independent variables (x): Can be either continuous (e.g., BMI) or categorical (e.g., sex, smoker/non-smoker). If they are categorical, they are recoded into so-called “dummy variables” (often 0 and 1) in the statistical software.
  • Statistical assumptions:
    Once the variables are of the correct type, the model must meet four mathematical requirements. In statistics, these are usually remembered via the acronym LINE:
    L – Linearity: There must be a linear (straight-line) relationship between your x-variables and your y-variable. If the relationship actually looks like a U-curve (e.g., that stress is good up to a certain level, but then becomes dangerous), a straight line will miss the truth completely. This is most easily checked by looking at a scatter plot.
    I – Independence (Independent observations): All observations / individuals must be independent of each other. This means that one individual’s measured value must not affect another’s. If, for example, you have measured the blood pressure of the same patients three days in a row, the observations are dependent. In that case, you violate this requirement and must use other methods (e.g., mixed-effects models).
    N – Normally distributed residuals (Normality): Here is one of the most common misconceptions in statistics: many believe that the x and y variables themselves must be normally distributed. This is wrong. In standard linear regression, it is instead important that the residuals, i.e., the errors around the regression line, are roughly symmetrical and do not contain too many extreme deviations. This means that the points, on average, should lie about as much above as below the line. This is especially important if you have a small number of observations and want to use standard statistical tests and confidence intervals.
    E – Equal variance (Homoscedasticity): This is a complicated word for a simple concept: the spread around the regression line should be roughly equal along the entire line. If the data points lie close to the line at low values but spread out like a large funnel at high values (so-called heteroscedasticity), then the model relies too much on some data points and too little on others.

If you have more than one x-variable in your model, an additional important requirement applies: Absence of multicollinearity. Your independent variables (x) must not correlate too strongly with each other. If, for example, you are trying to predict a person’s salary and include both “Years in the profession” and “Age” as x-variables, these two will be so similar to each other that the model becomes confused about which of them is actually doing the work. In such a case, exclude one of the two independent variables that correlate strongly with each other. You can read more about this further down.

If the observations are not linear or the residuals are not normally distributed, this can often be solved by transforming the variables (e.g., by taking the logarithm of the y-variable). If the data is highly skewed, it may sometimes be necessary to switch to non-parametric tests, i.e., a test other than standard linear regression.
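A minimal sketch of informal checks of the LINE assumptions, assuming the fitted statsmodels model called multiple from the sketch earlier on this page (SPSS can produce corresponding residual plots from the regression dialog):

```python
# Rough, informal checks of the LINE assumptions for a fitted OLS model
# (here the model `multiple` from the earlier sketch).
import matplotlib.pyplot as plt
import scipy.stats as stats

fitted = multiple.fittedvalues
resid = multiple.resid

# L + E: residuals vs fitted values. A flat, even band suggests linearity and
# equal variance; a curve or a funnel shape suggests a problem.
plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# N: normality of the residuals (QQ plot plus the Shapiro-Wilk test).
stats.probplot(resid, plot=plt)
plt.show()
print(stats.shapiro(resid))

# I: independence of observations is a matter of study design (one measurement
# per independent individual) and cannot be verified from these plots alone.
```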

Preparations

  1. Data cleaning: Perform a frequency analysis for each variable separately. You will likely find some surprises, such as a few individuals with a third or fourth sex, a person with an unreasonable age, a person with a body weight of 2 kg, or more missing data than expected. Go back to the source and correct all obvious errors. If you do not have access to the source, change obvious errors to missing data. Check all affected variables after the correction by performing a new frequency analysis. This must be done carefully before you proceed.
  2. Examine traces of potential bias: Look at the proportion of missing data for each variable. There is almost always some missing data. Is the dropout rate high in certain variables? Do you have a reasonable explanation as to why? Could it be a sign that there is an inherent bias (systematic error in the selection of observations/individuals) in your study that may affect the outcome?
  3. Check if assumptions are met: Check whether the assumptions for standard linear regression (see above) are met.
  4. Do any variables need to be transformed?: Sometimes the observations do not meet the conditions stated above. If the observations are not linear or the residuals are not normally distributed, this can often be solved by transforming the variables (e.g., by taking the logarithm of the y-variable). Sometimes it may be relevant to transform a variable even if it meets all the conditions above. An example could be transforming income in a currency into thousands or tens of thousands of the same currency.
  5. Choose a strategy: Perform a simple (unadjusted) linear regression if you only have one independent variable (just one x). If, on the other hand, you have several independent variables (multiple x’s), you need to choose a strategy for how these should be included in the analysis. There are a few different ways to do this. The first approach (5a below) is the best. However, it is not always feasible, so you may need to use a different strategy.
    1. Building a multivariable model – from a predetermined theory: Decide to use a fixed combination of independent variables based on logical reasoning/theories (expert advice). The number of independent variables should not be too large, preferably fewer than 10. This is the preferred method if you have a reasonably good theory about how the variables are connected.
    2. Include everything – without a predetermined theory: Perform a multiple linear regression with all available independent variables without having any theory about whether these variables are meaningful. This can work if you only have a few variables. If you have many variables, it will likely result in a final model that contains many useless variables that mostly constitute “noise”. Avoid doing this.
    3. Fishing expedition: If you have many independent variables and no theory about which ones are useful, you can let the computer suggest which variables are relevant to include. This is usually called going on a “fishing expedition”. You can read more about this below.

Building a multivariable model – from a predetermined theory

Building a linear regression model where you, rather than a computer program, determine what to include is the recommended method. This is step 5a in the list above. The video below, recorded by Numiqo, describes how to build such a model in accordance with point 5a above. The video also provides a good explanation of the concept of multicollinearity.

Building a multivariable model – fishing expedition

If you have many independent variables and no theory about which ones are useful, you can let the computer suggest which variables are relevant to include. This is usually called a “fishing expedition”. The different methods you can ask the computer to use in a fishing expedition are:

  • All possible regressions: Here you ask the computer to perform a regression analysis for all possible combinations of the independent variables. You then choose the regression model that explains the highest proportion of the variance in the dependent variable (y). The problem is that the workload for the computer increases exponentially as the number of independent variables grows. There is usually a maximum limit of 10-15 independent variables; beyond that, it is not worth trying.
  • Forward inclusion/selection: Can be used if you have more variables than observations. Once a variable has been added, it usually stays in the model. The method might choose variable A first because it looks good on its own, but misses that variables B and C together would have been a much better predictor. It often fails to capture complex relationships that only appear when several variables interact. The methods below are better.
  • Backwards elimination: Useful if you have more observations than variables.
  • Stepwise regression: This is a hybrid. It usually starts as forward selection (adding variables), but at each step, it checks backward to see if any previously added variable has now become non-significant (perhaps because the new variable explains the same thing better). This addresses the biggest flaw of forward selection. If variable A was added early but became redundant when variable C was added, the stepwise regression will throw variable A out again. It is more flexible and better than both “Forward inclusion/selection” and “Backwards elimination”. Watch this video by Brandon Foltz introducing the concept: https://www.youtube.com/watch?v=An40g_j1dHA.
  • Regularization techniques such as LASSO regression, Ridge regression, and Elastic Net: These methods can also be called “penalized regression” or “regularized regression”. Stepwise regression carries a risk of overfitting your model to your specific observations. Regularization techniques solve this problem by introducing a penalty for complex models. They are now considered superior to conventional stepwise regression. Of these models, Elastic Net is perhaps the best, and it can be performed in R, STATA, and SPSS (in the newer versions). Read more about this on the page about regularized regression.
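As an illustration of the general idea (not the SPSS procedure), here is a minimal sketch of penalized regression with scikit-learn's ElasticNetCV in Python, using simulated data:

```python
# Minimal sketch: Elastic Net with cross-validation on simulated data.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                       # 8 candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=100)

# The penalty is scale sensitive, so the predictors are standardized first.
X_std = StandardScaler().fit_transform(X)

# Cross-validation chooses the penalty strength; coefficients of unhelpful
# predictors are shrunk toward (or exactly to) zero.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X_std, y)
print(model.l1_ratio_, model.alpha_)
print(model.coef_)        # near-zero coefficients = candidates to drop
```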

How many independent variables can be included?

If you throw too many independent variables into your model, you will run into something called overfitting. Overfitting means that your final model might work for your specific dataset, but as soon as you apply it to another dataset, the model performs poorly. So, where is the limit for having too many independent variables? There is a plethora of different rules of thumb for this, most of them with poor mathematical support. Green investigates this thoroughly and concludes that the least bad rule of thumb is:

N ≥ 50 + (8*m)
N = number of observations and m = number of predictors (often equal to the number of independent variables)
(Example: If you have 5 independent variables, at least 90 observations are required.)

The number of observations required for a certain number of independent variables (x) depends heavily on the effect size (how strongly the independent variables covary with the dependent variable). The rule of thumb stated above corresponds reasonably well with accurate sample size estimates for a medium effect size and with fewer than seven independent variables. In other situations, the rule of thumb does not work well. Assuming a significance level of 0.05 and a power of 0.8, the correct number of observations required for different numbers of independent variables is:

Number of predictors * | Small effect size (R-square = 0.02) | Medium effect size (R-square = 0.13) | High effect size (R-square = 0.26)
1  | 390  | 53  | 24
2  | 481  | 66  | 30
3  | 547  | 76  | 35
4  | 599  | 84  | 39
5  | 645  | 91  | 42
6  | 686  | 97  | 46
7  | 726  | 102 | 48
8  | 757  | 108 | 51
9  | 788  | 113 | 54
10 | 844  | 117 | 56
15 | 952  | 138 | 67
20 | 1066 | 156 | 77
30 | 1247 | 187 | 94
40 | 1407 | 213 | 110

* Here, the number of predictors the model can contain is listed (this is not exactly the same as regression parameters). If all independent variables are measured on an interval scale, the number of independent variables and the number of predictors are the same. This is because for an independent variable measured on an interval scale, there is only one beta coefficient and an associated p-value. It is different if the variable is measured on a nominal scale with more than two categories (levels). In that case, one category is designated as the reference, and the other categories each get their own beta coefficient. A nominal variable with four categories thus results in three predictors.
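As a small, hypothetical illustration, the sketch below shows how a nominal variable with four categories becomes three dummy predictors (one category serves as the reference):

```python
# Sketch: a four-category nominal variable becomes three dummy predictors.
# The variable name and categories are made up.
import pandas as pd

df = pd.DataFrame({"blood_group": ["A", "B", "AB", "O", "A", "O"]})
dummies = pd.get_dummies(df["blood_group"], prefix="blood_group", drop_first=True)
print(list(dummies.columns))   # three dummy columns; the dropped category is the reference
```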

Multicollinearity

In all multivariable linear regression, you must first check all independent variables for multicollinearity. This means checking if any of the independent variables of potential interest correlate strongly with each other. If this is the case, you must make a choice before proceeding. If you use SPSS to perform a standard multiple linear regression, you can check the boxes for Descriptives and Collinearity diagnostics under the Statistics button.

Collinearity diagnostics

Here, two different measures are shown: Tolerance and Variance Inflation Factor (VIF). VIF is 1 divided by Tolerance, so it is sufficient to look at either of these values. Tolerance and VIF for a specific predictor show how much the variance of that predictor’s regression coefficient increases due to covariation with all the other predictors (all other x’s) in the model. You get a Tolerance / VIF for each independent variable (for each x). Tolerance < 0.1 or VIF > 10 indicates that severe multicollinearity exists in your model. These limits are somewhat arbitrary; Tolerance < 0.2 or VIF > 5 can already be considered indicative of possible collinearity.

If you have multicollinearity, you need to investigate further to find where the problem lies, and the best place to look is the correlation matrix (see below). If you only have two independent variables (just two x’s), the Tolerance / VIF shows the same thing as the correlation between those independent variables; thus, if the Tolerance / VIF looks good, you do not need to inspect the correlation matrix. If you have more than two independent variables, you must proceed and inspect the correlation matrix regardless of what the Tolerance / VIF shows, because collinearity between only two independent variables does not always register in the Tolerance / VIF.

Tolerance / VIF indicates whether your regression model is suffering from ‘collinearity fever’, while the correlation matrix (see below) pinpoints the source of the issue.
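Outside SPSS, Tolerance and VIF can be computed directly. Here is a minimal sketch in Python with statsmodels, using made-up data:

```python
# Sketch: Tolerance and VIF per predictor, plus the correlation matrix.
# X is assumed to contain only the independent variables (made-up data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "age":    [25, 40, 35, 30, 50, 28, 45, 38],
    "height": [165, 180, 178, 170, 185, 172, 182, 176],
    "bmi":    [22.8, 23.1, 25.2, 23.5, 26.3, 24.0, 25.7, 24.9],
})

X_const = sm.add_constant(X)        # VIF is computed with an intercept in the model
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")

# The pairwise (zero-order) correlation matrix pinpoints which independent
# variables correlate strongly with each other.
print(X.corr())
```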

Covariance matrix

This shows how the regression coefficients (b) of different independent variables correlate with one another. You get this as an additional output table in SPSS if you check the ‘Covariance matrix’ box under ‘Statistics.’ In most cases, this is information you don’t need to worry about, so there is no need to request it from SPSS.

Correlation matrix

Here, you look at how the independent variables correlate with each other. All independent variables are checked pairwise for ‘zero-order correlations’. This involves checking whether any of the independent variables of potential interest are highly correlated with one another. If you only have two independent variables (just two x-variables), the correlation coefficient r relates to VIF as VIF = 1 / (1 − r²), roughly as follows:
(r=0.60) → VIF ≈ 1.56
(r=0.70) → VIF ≈ 1.96
(r=0.80) → VIF ≈ 2.78
(r=0.90) → VIF ≈ 5.26
(r=0.95) → VIF ≈ 10.3
There is no cut-and-dried definition of what constitutes a ‘strong correlation.’ I suggest that if two independent variables have a Pearson (or Spearman) correlation coefficient above +0.85 or below -0.85 with a p-value < 0.05, they should be considered correlated — meaning there is a potential collinearity problem. If you have collinearity, you must make a choice before proceeding, and your options are:

  • Keep both if you believe there are theoretical reasons for including them both. For example, they might measure entirely different concepts that happen to be correlated.
  • Create a new variable as a composite index of the two original variables. This is appropriate if they measure nearly the same thing.
  • Exclude one of them from further analysis. This choice should be guided by what is most practical to retain for the ongoing analysis (what is likely to be most useful). Which variable is theoretically most relevant to keep, or which variable is measured with the greatest precision?
  • Use regularized regression, such as ridge, lasso, or elastic net. With these techniques, issues regarding correlation between independent variables are often less severe than in ordinary linear regression. However, this does not mean that correlation can be ignored entirely. Regularization can be very useful, especially for prediction, but the choice of which variables to include should still be grounded in theory and the research question — particularly if the model is meant to be interpreted rather than solely used for prediction.
  • See if a non-linear regression method fits the data better.

Interpreting the Results

Once the regression has been performed, you will have a significant amount of information to review in a structured manner. You should examine the following in the specified order:

  1. When performing the regression, you should have checked the option for multicollinearity diagnostics (see above). Now, verify whether the assumptions for linear regression appear to be met — specifically, that you do not have an issue with multicollinearity.
  2. Missing Data. You should have previously examined the missing data for individual variables. Now, you must look at the total (listwise) missing data when using multiple independent variables (performing a multiple linear regression). How many observations were excluded from the regression? The more independent variables you include, the higher the likelihood of missing information in at least one of them. If the missing data is more than negligible, you must check whether it is evenly distributed across the different variables or if one variable stands out. Does the pattern of missing data suggest a potential systematic error in the data collection? If the total missing data is <5%, a non-response analysis (missing data analysis) is generally not required. Missing data >10% requires a non-response analysis to determine if there is a systematic error somewhere in your dataset. Missing data between 5–10% is considered a gray area.
  3. Evaluate your regression model as a whole. R-squared (R^2), or preferably “Adjusted R-squared” if available, indicates how much of the variation in the dependent variable (y) is explained by your independent variables (your x’s). R-squared ranges from 0 to 1. A value of zero means that your regression model explains none of the variation in the dependent variable, while 1.0 means that all variation in y is explained by your x’s. Ideally, you want this value to be as high as possible.
  4. Examine how categorical variables are coded in the regression. If you include gender as a variable, a dummy variable is often created with values of 0 and 1. The result you see represents the group coded as “1” compared to “0,” which serves as the reference. It is crucial to be clear on whether the software designated men or women as “1.” A mistake here will lead to incorrect conclusions.
  5. Examine the evaluation of each independent variable (your x’s). For each independent variable, you are provided with a beta coefficient and a corresponding p-value. Note that there are both “unstandardized” and “standardized” beta coefficients. If you intend to use your regression model to calculate the predicted value of the dependent variable (y) based on specific values for the independent variables (x), you should use the unstandardized coefficients. If, however, you want to determine which independent variables (x) explain the greatest proportion of variation in the dependent variable (y), you should use the standardized coefficients.
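A common way to obtain standardized coefficients outside SPSS is to z-score all variables and refit the model. The sketch below assumes the DataFrame df and the fitted model multiple from the earlier sketch:

```python
# Sketch: unstandardized vs standardized beta coefficients.
# Assumes the DataFrame `df` and the model `multiple` from the earlier sketch.
import statsmodels.formula.api as smf

print(multiple.params)        # unstandardized: predicts y in the original units

df_z = (df - df.mean()) / df.std()                 # z-score every variable
standardized = smf.ols("weight ~ age + height", data=df_z).fit()
print(standardized.params)    # standardized betas: compare the relative importance of the x's
```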

Common terms and what they mean

df = DF = degrees of freedom
  • Total df: Number of observations/individuals in the analysis (n) minus 1 (n − 1).
  • Regression df = Model df: Number of independent variables (number of x’s) in your regression model. It is always 1 in simple linear regression.
  • Residuals df = Error df: Number of observations minus the number of independent variables minus 1. In a simple linear regression this is always the number of observations minus 2 (n − 2).

Sum of squares (SS)
  • Sum of squares Total: A measure of the total variation (spread) of the dependent variable (y) around its own mean, before any independent variables are taken into account.
  • Sum of squares Regression (= Model SS): The portion of the total variation in y that your regression model successfully explains. The higher this value is relative to the Total, the better the model.
  • Sum of squares Residuals (= Sum of squares Error): The variation in y that your model cannot explain (also referred to as noise or error). It is the distance between the actual data points and the regression line.

Mean square (MS)
  • Mean square Regression (= Model MS): Obtained by dividing Sum of squares Regression by Regression df. In simple linear regression this value is exactly the same as Sum of squares Regression (because you divide by 1).
  • Mean square Residuals (= Error MS): The variance of the residuals. Obtained by dividing Sum of squares Residuals by Residuals df.

Statistical test
  • F-statistic = F-value: A measure of how well the overall model fits the data compared to a model with no independent variables. The value is calculated by dividing Mean square Regression by Mean square Residuals.
  • p-value (usually linked to the F-statistic): The F-value is converted into a p-value. If this p-value is significant (e.g., < 0.05), the model as a whole is statistically significant and at least one of your x-variables affects y.
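The sketch below shows how these quantities fit together, read off the fitted statsmodels model multiple from the earlier sketch (the same relationships hold in the ANOVA table that SPSS prints):

```python
# Sketch: how the quantities in the table above relate to each other,
# using the fitted model `multiple` from the earlier sketch.
ss_total = multiple.centered_tss          # Sum of squares Total
ss_model = multiple.ess                   # Sum of squares Regression (Model SS)
ss_resid = multiple.ssr                   # Sum of squares Residuals (Error SS)

ms_model = ss_model / multiple.df_model   # Mean square Regression
ms_resid = ss_resid / multiple.df_resid   # Mean square Residuals

print(ms_model / ms_resid, multiple.fvalue)   # the same F-value
print(multiple.f_pvalue)                      # the p-value of the F-test
```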

(Preliminarily reviewed up to this point. The rest will be reviewed soon.)

Estimating (forecasting) outcome values

Linear regression is often used to explore which independent variables (x) have the largest influence on the dependent variable (y). This allows you to rank the independent variables according to their relative importance. Another use of linear regression is to estimate the value of the dependent variable (y) for a given set of independent variables (x). In that scenario you use independent variables that you already know are associated with the dependent variable. Have a look at this video by Dr Nic Petty:
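As a complement to the video, here is a minimal sketch of such an estimate in Python, assuming the fitted model multiple from the earlier sketch and made-up new observations:

```python
# Sketch: estimating (predicting) the outcome for new observations with a
# fitted model (here `multiple` from the earlier sketch). Values are made up.
import pandas as pd

new_obs = pd.DataFrame({"age": [33, 60], "height": [174, 168]})
pred = multiple.get_prediction(new_obs)
print(pred.summary_frame(alpha=0.05))   # predicted y with 95% intervals
```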

ANOVA, ANCOVA, or Linear Regression?

ANOVA is a method for comparing different groups (where grouping is a categorical variable measured on a nominal scale). ANCOVA is the same type of analysis, but it also includes a continuous variable (a covariate). Linear regression can mix both continuous and categorical independent variables. The underlying mathematics of these analyses are essentially the same, and you obtain basically the same results. The difference lies in how your statistical software presents the results. ANOVA immediately provides a single p-value (known as an omnibus F-test) to determine if your categorical variable, with all its different levels, is significant as a whole. It offers a quick and clear “yes” or “no” presented in a simple way. In contrast, a corresponding regression analysis can tell you if a specific category is significant in relation to the reference category, but it does not compare all levels against each other, only against the reference. If you want to compare all levels of your independent variable with one another, it is often more convenient to use ANOVA.
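The sketch below illustrates this equivalence with made-up data: a one-way ANOVA and a linear regression with the same categorical predictor give the same omnibus F-test:

```python
# Sketch: one-way ANOVA and regression with a categorical predictor give the
# same omnibus F-test. Group labels and outcome values are made up.
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

dat = pd.DataFrame({
    "group":   ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "outcome": [5, 6, 7, 6, 5, 8, 9, 8, 7, 9, 4, 3, 5, 4, 4],
})

# ANOVA: one omnibus F and p-value for the grouping variable as a whole.
f, p = stats.f_oneway(*[g["outcome"] for _, g in dat.groupby("group")])
print(f, p)

# Regression: same omnibus F/p, plus one coefficient per non-reference group.
reg = smf.ols("outcome ~ C(group)", data=dat).fit()
print(reg.fvalue, reg.f_pvalue)
print(reg.params)
```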

Some examples using linear regression

The examples below do not claim to be perfect examples of how to perform linear regression.

  • Providing Mothers with a Pedometer and Subsequent Effect on the Physical Activity of Their Children: A Randomized Controlled Trial of Children with Obesity: The outcome (y) is change in steps and group allocation is one of the independent variables while adjusting for a few other independent variables.
  • Predictors for future activity limitation in women with chronic low back pain consulting primary care: a 2-year prospective longitudinal cohort study: Spearman’s rank correlation was first used to screen for the independent variables to be included in the linear regression. A multiple linear regression then examines which of the remaining independent variables are associated with the perceived level of activity problems (the dependent variable).
  • A randomized controlled trial comparing two ways of providing evidence-based drug information to GPs: Multivariable linear regression was used to compare two groups, where factors that differed between the groups at baseline were included as independent variables to adjust for the effects of these differences. Group membership was the primary independent variable.
