Product Information:-

  • Journals
  • Books
  • Case Studies
  • Regional information
Request a service from our experts.
Visit the JDAL journal page.

How to... collect data

Options:     Print Version - How to... collect data, part 4 Print view

Entering and cleaning up the data

Once you have collected your data, you need to get it into a form where it can be analysed. This initially involves entering the data into a spreadsheet, after which you will need to carry out some basic checks.

Setting up a spreadsheet

Quantitative data should generally be entered into a spreadsheet of some kind, rather than a database. Some analysis packages will allow direct data entry; alternatively a general spreadsheet package can be used. The usual format required is for the questions to be entered across the columns of the spreadsheet, with each respondent (record) in a new row. Ensure you have a unique identifier for each record, usually in the first column. Check before you start how your preferred analysis package expects its data to be formatted.

If the data are to be entered manually, it is helpful to set up some validation procedures on the range of responses allowed for each question. In Microsoft Excel, for example, this can be done from the “Data” menu, by selecting “Validation”. It is possible to restrict the values which can be typed into any cell, and set an error message if any other entry is attempted. See Figure 3. A simple example of how data might be entered onto an Excel spreadsheet.


Figure 3. A simple example of how data might be entered onto an Excel spreadsheet

Image: Figure 3. A simple example of how data might be entered onto an Excel spreadsheet

Alternatively, all responses can be double entered and discrepancies checked. This is staff intensive, as it requires two people to enter the data independently of each other, and a comparison made of their entries. If this is not practical, a random sample of responses could be checked, preferably by a second data entry clerk. In this case, an acceptable error level should be set in advance; if this is exceeded, more extensive checking will be required.

Note that some statistical analysis packages prefer quantitative data to be coded numerically, rather than alphabetically (e.g. code gender: female=1; male=2 rather than female=f; male=m). This can be an issue with data collected automatically from a web questionnaire, depending on the software used. It is important to set up the underlying data form with appropriate coding at the outset. The default option may be to repeat the response wording, as illustrated here:

An example of automatically coded data from a web survey
"Full-time" "No" "Yes"
"£25 001 - £30 000"
"Full-time" "No" "No"
"£15 001 - £20 000"
"Full-time" "No" "Yes"
"1 professional" "1 non-professional"
"£20 001 - £25 000"
"Full-time" "Yes" "No"
"£20 001 - £25 000"
"Full-time" "Yes" "No"
"Below £15 000"
"Full-time" "No" "No"
"£15 001 - £20 000"
"Full-time" "Yes" "No"
"£20 001 - £25 000"

Accuracy of data entry

Once you have entered your data into a spreadsheet or analysis package, it is essential to carry out some basic checks before you begin your main analysis.

If the data have been collected automatically, e.g. from a web-based questionnaire, this is not usually a problem. However, you should still check for duplicates (particularly where an incentive has been offered to participants). If the data have been entered manually, then some quality control measures should be incorporated into the data entry process.

It is always valuable to carry out a simple distribution analysis, showing how many respondents have marked each answer to each question. This will pinpoint any coding errors where out-of-range codes might have been entered, and highlight any unusual values.

Detecting outliers

In some analyses, outliers – individual values which are particularly high or low – can materially affect the results. In such cases, it may be desirable, and legitimate, to omit these from the analysis. Detecting them is largely subjective – a plot of the data distribution is usually adequate to spot extreme values. A more objective test is to examine as a possible outlier any value which is more than 3 standard deviations from the mean of the distribution. Outliers may be indicative of errors in the data or of atypical individuals in the population.

Example of outlier

In this (fictitious) data set, one point clearly stands out from the rest, and should be investigated. See Figure 4. Example of an outlier.


See Figure 4. Example of an outlier

Image: See Figure 4. Example of an outlier


Dealing with missing values

Many statistical packages will automatically exclude blank responses within numerical data from the analysis as being "missing". If a blank response should not be regarded as missing, it will usually be necessary to recode it, e.g. to zero. You can also specify set values to be treated as missing – for example if you have coded "don’t know" = 99 as a valid response. Zero values will not generally be treated as missing by default, nor will blank values in text fields.

If a respondent has clearly not answered most of the questions, but given up well before the end, the best option may be to omit that respondent from the data set entirely.

In some circumstances it may be necessary to include cases with missing values in the data set for certain analyses; the usual procedure in such cases is to replace the missing value by the mean of the remainder of the data.