Ronny Gunnarsson. "Clean data file" [on INFOVOICE.SE]. Available on: https://infovoice.se/clean-data-file/. Information was retrieved January 23, 2025.
Suggested pre-reading | What this web page adds |
---|---|
This web-page describes the utterly important process of cleaning data before commencing descriptive or inferential statistical analysis. |
Data usually needs some checking and subsequent cleaning before commencing statistical analysis. Once this is done you have a clean data file
. This is a very important process that should be done in the following order:
- Decide the structure of the data file
- Discovering and correcting errors in data.
- Adding more data where information is lacking.
- Looking for signs of systematic bias in missing data
- Re-coding or calculating some variables into new variables.
- Managing extreme observations (outliers).
- Verifying that you have a
clean data file
.
Decide the structure of the data file
Ideally this is made before data collection. However, sometimes you acquire a potentially interesting set of data and need to decide to what extent the recordings of these observations can be extracted and put into a final data set for further analysis. In both situations you need to decide how your final data set should look like.
This aim is to decide if your file is going to be one single data sheet (most common) or a relational database with several linked data sheets. The rest of this page will only refer to the situation where you aim for a single sheet consisting of rows and columns. This decision should normally be made long before you commence data collection. If you do a prospective data collection performed by yourself then you would expect that you can stick with your previous decision. However, sometimes you request and receive an extract from a database that may look quite different from what you expected. In those cases you may have to rethink the structure of your data file. There are two decisions that needs to be made:
- First you need to decide what the rows should be. Let us take an example. Assume you are interested in consultations between patients and their doctors. The rows can be defined as belonging to one of several different hierarchical levels such as: blood tests taken at one consultation (can be several tests in one consultation), consultations (can be several consultations made by the same patient), doctors (one doctor can see many patients), clinics (one clinic can have many doctors), geographical area (one area may have many clinics), countries (each country can have many regions). Your rows should be on the same level as you want to do your statistical analysis. Do you want to analyse blood samples, consultations or doctors? If you want to analyse doctors then you may have to create a measurement for each doctor derived from the underlying layer.
- The second decision is to define each column. It is very common that you later may add extra columns (see below).
Discovering and correcting errors in data
Normally you plan your project so that one column / variable is a unique id for each participant / observation. It would be a good idea to have another file linking up the unique id with the actual real data. You need to consider integrity and protecting data in case individuals (patients?) are involved so the link to real data might be stored in a separate encrypted file having the unique id with the link to how to find the raw real data. The link to real data might be a patient identification number in the electronic chart of a hospital. You are in some trouble if you have no link to real data making it impossible to check suspicious values. Also never throw away hard copies of questionnaires or any other research related material until several years after last publication.
So what should you do if you have no possibility to check with the true raw data? Some values are obviously wrong and I support changing them to missing data. However, quite often you find data that are unlikely but not impossible. Are they wrong or correct? There is no way to sort this out without a link to the actual data. My suggestion is to eliminate “impossible” observations coding them as missing. I would also suggest leaving unlikely but possible observations as they are unless you have a very good reason to believe it is an error.
Adding more data where information is lacking
The above process will identify where information is incorrect or lacking. You should try to find the true value representing the observation. Perhaps the information resides in another place. Let us assume you want to retrieve the width of aorta in millimeters estimated by an ultrasound investigation. You look at the charts of patients. It turns out that half of the charts only state that the width of the aorta was normal but does not actually state the width in millimeters. Instead of leaving the box with width empty for that patient you should try and look for the information in the ultrasound record which means looking at another place. Hence, you need to be a bit of a detective and try to find innovative ways of retrieving missing information. This usually takes much more time than the subsequent statistical analysis and it is important to put in a fair amount of work into this!
Looking for signs of systematic bias in missing data
Once you have added all data you possibly can track down the next question is to see if there is a pattern in the missing data. Are data missing mainly for patients of one gender or for patients with a specific severity of disease. Check if participants with missing data in one variable differ in the other variables compared to those not having missing data. It should not be any differences if data are missing due to pure random events. It is a problem requiring careful consideration if they differ in a systematic pattern since it may imply the presence of a systematic bias.
Re-coding or calculating some variables into new variables
Don’t do this unless you have completed the steps above. In this step you calculate new variables (columns). It is better to do this in SPSS in one go for all observations rather than doing it for one observation at the time when you enter data. A common example is calculating age from birthdate and a date for a visit. You would subtract birth date from the visit date. Doing so in SPSS give you the age in seconds which is not very useful. If you want age in years you need to use this formulae in SPSS:
(visitdate – birthdate) / (60*60*24*365.25)
The calculation above is done in SPSS using the menu options “Transform” → “Compute Variable”. In many statistical analysis it might be practical to have the age in decades rather than in years. It is usually more informative to get the odds ratio for an increase of age in decades rather than years because an increase of one year is often a rather small step. This can be done also for other variables where it may be sensible to create a variable reducing the number of steps.
It is common to have a variable containing several possible responses but it would be useful to recode to a binary variable. A common example is ethnicity where one variable can contain a number of possible options. In Australia it might be sensible to code:
Code | Clarification |
---|---|
0 | Not of Aboriginal or Torres Strait Islander origin |
1 | Aboriginal origin |
2 | Torres Strait Islander origin |
3 | Aboriginal and Torres Strait Islander origin |
In the analysis it might be sensible to create two new variables derived from the variable ethnicity:
Ethnicity | Aboriginal | Torres Strait Islander |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 1 | 1 |
This recoding is done in SPSS using the menu options “Transform” → “Recode into Different variables”.
Managing extreme observations (outliers)
The crucial question is if extreme observations (outliers) are legitimate registrations or errors. You need to check these with the source (charts?) to be sure it is not an error when transferring information from the original source into your SPSS file. Please note that a registration in the source (chart?) may sometimes be an error in the source so you need to use common sense. There are a few options of managing outliers that are not obvious errors:
- Keep the observations as they are.
- Delete the observations.
- “Clip” the data. Also labelled Winsorizing .
Deleting the observations mens a loss of data and that is rarely a good idea. I suggest keep your data as is if your data is analysed by non-parametric statistical analysis. If you plan to use parametric statistics consider “Winsorizing. The latter is losing some information but not all information. Please note that in case of Winsorizing always create a new variable keeping original data intact.
The above describes management of a few extreme observations. Another web page describes “transformation of data” that engages all observations. Transformation of data may solve the problem with outliers but should not be the primary method if you have only a few extreme observations.
Verifying that you have a clean data file
This is doing item 2 above (Discovering and correcting errors in data) again for all variables you intend to use for descriptive and inferential statistics to ensure that no unreasonable observations remains.