How to...
Collect data

The general principles of good research practice apply to data collection methods as to any other area of research. Before you embark on any survey or other data collection method, there are a number of factors you should consider.

By Claire Creaser

Designing a data collection instrument

Some preliminary considerations

Whether the data you seek are already available from an existing source
What is the most appropriate method (e.g. questionnaire, interview, observation) or combination of methods?
The practicalities of carrying out the data collection (how, when, where, by whom…?)
How the data will be prepared for analysis (e.g. data entry procedures, coding).

The need for a formal instrument

Whatever methodology you use, you will need a formal instrument to administer the data collection. The design details of a self-completion questionnaire will differ from those of an interview schedule or observation record, but the overall principles are the same. The following is based on a self-completion questionnaire; relevant principles can be applied to any data collection methodology.

Relate to your research questions

For each question that you want to include in your instrument, consider:

Why do I want to know this?
Which of my research questions does it address?
What will I do with the data – how will I analyse and report them?

Include all those questions which are relevant and useful, and omit those which are unnecessary or repetitive.

Quantitative or qualitative data?

This is a key consideration. Quantitative data are not just measures with numerical values (e.g. age, income) but any data which relate to the quantity of the measure concerned. Quantitative data can be analysed in a variety of ways, using spreadsheet functions or specialist statistical packages.

Qualitative data are not amenable to automatic numerical analysis. Specialist packages are available to assist in analysing such data; these are not considered here. There can be considerable value in qualitative data, and most surveys will include at least one opportunity for respondents to make open ended comments.

These pages are concerned only with the analysis of quantitative data.

Coding and analysis

It is advantageous to consider, during the design stage of your project, how you will code and analyse the data you collect. A little care at this stage will pay dividends later!

If you can predict the likely answers to a particular question, or if there is a fixed set of options in which you are interested, then list these with a series of boxes to tick – such data are much easier to analyse and interpret than open ended answers. An “Other” option can be included; although the data gathered from this will generally be incomplete and of little value in practice, it will cover any major areas which you may have forgotten, and allow respondents who are particularly keen to add extra information which may be of interest.

If the range of answers is likely to be extensive – for example you wish to know the respondent’s age – then an open ended question may be preferred. The point at which a series of tick boxes (or equivalent) becomes counter-productive may depend on the format of your questionnaire – for example a web-based questionnaire asking for month of birth could have a drop down menu from which to select, whereas the same question on a paper-based survey might be open-ended, rather than listing all 12 months with tick boxes.

Using tick boxes rather than open ended questions can also help to direct respondents to an appropriate answer level. For example, one paper-based survey asked:

Country/region of the world where employed _______________________

The range of answers was extensive, ranging from individual towns and counties in the UK to whole continents. Considerable post-coding was necessary in order to analyse the responses in a meaningful way. A series of tick boxes could have directed respondents to a region within the UK, and continent beyond it, which was an appropriate level of detail for this particular analysis.

You should also consider whether the options listed for tick box responses are mutually exclusive or whether several answers could be marked. These require different coding schemes and analysis methods. See "Figure 1. Example survey questions".
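As a minimal sketch of the two coding schemes, a single-answer question can be stored as one code per respondent, whereas a multiple-answer question is usually stored as one 0/1 indicator per option. The questions, option lists and function names below are hypothetical and are not taken from Figure 1.

```python
# Hypothetical option lists, for illustration only.
EMPLOYMENT_OPTIONS = ["Full-time", "Part-time", "Not employed"]   # tick one
SERVICE_OPTIONS = ["Loans", "Enquiries", "Study space"]           # tick all that apply


def code_single(answer, options):
    """Mutually exclusive options: one numeric code (1, 2, 3, ...) per respondent."""
    return options.index(answer) + 1


def code_multiple(answers, options):
    """Several answers allowed: one 0/1 indicator per option."""
    return [1 if option in answers else 0 for option in options]


single = code_single("Part-time", EMPLOYMENT_OPTIONS)
multiple = code_multiple({"Loans", "Study space"}, SERVICE_OPTIONS)
print(single)    # 2
print(multiple)  # [1, 0, 1]
```

The single-answer question occupies one column of the data set; the multiple-answer question occupies one column per option, each analysed as a separate yes/no variable.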

For more on surveys, see How to design a survey.

If responses are recorded on paper, and require manual data entry, it may be helpful to indicate the coding and/or column reference for each question in a small typeface on the questionnaire.

Open ended text questions can be post-coded for analysis, analysed manually, or a specialist text analysis package could be used.

The value of piloting

Piloting your survey instrument has several functions, including:

Checking the interpretation of the questions:
a high proportion of "don’t know" or omitted responses may indicate that respondents have not understood the question.
Estimating selected parameters to calculate sample size:
if your research seeks to obtain estimates of means with a prescribed error level, an indication of the likely mean and variation in the population is required to calculate an appropriate sample size.
Refining open ended questions into tick box format:
you can use the responses from the pilot to define categories for a tick box question in the main survey.
Checking coding schemes and developing analysis procedures:
can you do all the analyses you want to with the data available?

The sample of respondents used for piloting your survey does not need to be large, as you are not aiming to draw any inferences from the results. It should be sufficient to obtain a range of responses, from a cross-section of potential respondents to the main survey. You can also ask pilot respondents to comment on the questionnaire itself – e.g. How long did it take them to complete? Was there anything they did not understand?

If your pilot survey indicates that substantial changes are required, it may be desirable to carry out a second pilot.

Data collected from a pilot survey should not be included with those from the main survey in the analysis, even where the questions and coding are the same.

Variable types

Quantitative data can be divided into four broad categories. The analyses which you can (legitimately) perform will depend on the data type. In order of complexity, these are:

Categorical
Ordinal
Interval
Ratio.

Categorical

Categorical data are descriptive variables which allocate subjects to categories which have no inherent order e.g. gender; country of origin. Note that categorical variables may be represented by numerical values in the data set; this does not change their type. Categorical data are commonly used to describe the data set, and to provide sub-divisions for analysis and comparison.

For categorical variables, the measure of position (average) is the mode (the most frequently occurring value).

Ordinal

Ordinal data are descriptive variables which allocate subjects into categories with a natural order – e.g. satisfaction ratings; frequency categories. Ordinal variables are often represented by numerical values in the data set; this does not change their type, and particular care must be taken when analysing them, because the numerical codes indicate order only. See "Figure 2. An ordinal data example".

In some instances, particularly when analysing items from Likert (rating) scales, ordinal variables may be assumed to be interval variables for analysis purposes.

For ordinal variables, the measure of position (average) is the median (the value where half the respondents are above, and half below). The measure of dispersion is the range (maximum minus minimum value).

Interval

Interval variables are those where there is a constant spacing between the values. These are usually numeric, e.g. expenditure; age; temperature; height; number of articles published. For example, the difference in temperature between 15 and 30 degrees is the same as the difference in temperature between 30 and 45 degrees. In practice, variables which are interval but not ratio are generally recorded only in specialist areas, and the majority of numeric variables are ratio variables.

For interval variables, the measure of position (average) is the arithmetic mean. The measure of dispersion is the variance.

Ratio

Ratio variables are interval variables where there is a clear definition of zero, meaning an absence of the item being measured e.g. expenditure; age; height; number of articles published. In practice, the vast majority of numerical measures are of this type. Temperature in degrees centigrade, for example, is not a ratio variable, as a temperature of 0°C is not the same as an absence of temperature.

Further, it is meaningful to discuss ratios for ratio variables (as their name implies) – e.g. someone who earns £30,000 per annum earns twice as much as someone earning £15,000 per annum. Ratios have no intrinsic meaning for interval variables – a day with a temperature of 20°C is not twice as hot as one when the temperature is only 10°C.

For ratio variables, the measure of position (average) is the arithmetic mean. The measure of dispersion is the variance.
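The measures named for each variable type can be computed with Python's standard statistics module. This is a sketch only; all of the data values below are invented for illustration.

```python
import statistics

# Hypothetical responses, for illustration only.
country = ["UK", "France", "UK", "Spain", "UK"]   # categorical
satisfaction = [1, 2, 2, 3, 5, 5, 4]              # ordinal codes, 1 = lowest
income = [15000, 22000, 30000, 18000, 25000]      # ratio

# Categorical: the mode is the only meaningful average.
mode_country = statistics.mode(country)

# Ordinal: median for position, range for dispersion.
median_satisfaction = statistics.median(satisfaction)
range_satisfaction = max(satisfaction) - min(satisfaction)

# Interval and ratio: arithmetic mean and variance.
mean_income = statistics.mean(income)
variance_income = statistics.variance(income)   # sample variance (divides by n - 1)

print(mode_country, median_satisfaction, range_satisfaction)  # UK 3 4
print(mean_income, variance_income)                           # 22000 34500000
```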

Changing a variable's type

It is always possible to reduce a variable to a lower status – a ratio or interval variable can be coded into an ordinal variable; and an ordinal variable can be analysed in the same way as a categorical variable, if required.

Ratio to Ordinal coding

Age in years is a ratio variable, but it could be recoded or collected grouped into an ordinal variable with six categories as follows:

0 - 15 = 1
16 - 30 = 2
31 - 45 = 3
46 - 60 = 4
61 - 75 = 5
over 75 = 6
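The recoding above could be sketched as follows, assuming ages are recorded as whole, non-negative years. The function name is illustrative.

```python
def age_band(age):
    """Recode age in years (ratio) into the six ordinal bands listed above."""
    upper_limits = [15, 30, 45, 60, 75]   # upper limits of bands 1 to 5
    for code, upper in enumerate(upper_limits, start=1):
        if age <= upper:
            return code
    return 6   # over 75


ages = [12, 16, 45, 46, 75, 90]
bands = [age_band(a) for a in ages]
print(bands)  # [1, 2, 3, 4, 5, 6]
```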

It is not generally correct to promote a variable to a higher status. While ratio and interval variables can be analysed using the same methods, and in some cases it may be a reasonable assumption that steps on a rating scale are of equal size, so that the inherently ordinal variable can be considered as interval, this will depend on the context. In the example given in "Figure 2. An ordinal data example", it would not be reasonable to assume that the six steps from "never" to "always" are of equal size, regardless of the coding scheme used.

If you collect age in bands as described in the box immediately above, then these bands cannot be treated as ratio variables for analysis. If you want to do this, then ask respondents to give their age in years, rather than specify a band. Such questions, particularly where they may be considered personal, may attract a high level of non-response, and the choice of wording should be balanced against the sample size and value of the analysis.

Sampling strategies

The purpose of sampling is to balance out the costs of obtaining complete information with the need for an accurate picture of the population of interest. If it is possible to collect data on all the subjects in the population of interest, then this will inevitably give a more accurate picture than that obtained from a sample.

However, this may not always be practicable, so that a sample is usually required. For example, if your research is concerned only with the opinions of those in a particular village, it will probably be possible to survey every household; while if you seek to ascertain the views of the whole county, a sample will be necessary to control costs and complete the research in an appropriate time frame.

The resulting calculations give estimates for the population of interest, based on the responses from the sample selected, and as such are subject to a degree of error. The extent of this error, and the ease with which it might be quantified, are dependent on the sample design and size.

Some definitions

Population of interest – the whole of the people or objects which are the subject of the research.

Sampling frame – a complete list of the population of interest.

Sampling fraction – the proportion of the population which is selected for the sample.

Sample design – the method of selecting individuals from the population for the sample.

Sample designs

There are four main sampling strategies which can be used, alone or in combination, to give data for statistical analysis.

1. Simple random sampling

This is the most straightforward conceptually, although it is often difficult to achieve a true simple random sample in practice. A simple random sample is one in which every member of the population of interest has an equal chance of being selected for the sample, and every possible sample of size n has an equal chance of being selected from the population.

A simple random sample can only be drawn when a sampling frame exists covering the population of interest, and a random number generator is used to select individuals for the sample.
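A minimal sketch of drawing a simple random sample from a sampling frame, using Python's built-in random number generator. The frame, its size and the sample size are hypothetical.

```python
import random

# Hypothetical sampling frame: a complete list of the population of interest.
sampling_frame = [f"member-{i:04d}" for i in range(1, 2001)]

rng = random.Random(2024)   # fixed seed only so that the sketch is repeatable

# Selection without replacement: every member has the same chance of inclusion,
# and every possible sample of 100 is equally likely.
sample = rng.sample(sampling_frame, k=100)
print(len(sample))  # 100
```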

2. Systematic sampling

A systematic sample is statistically equivalent to a simple random sample, and generally easier to administer. It depends on knowing the size of the population of interest. A sampling fraction is calculated from the required sample size divided by the population size, expressed in the form 1/n. A random number between 1 and n is generated to give a starting point, then every subsequent nth member of the population is selected for the sample.

An example of systematic sampling

A survey of library users requires a sample of 400 individuals. A complete list of users (as opposed to members) is not available, so a systematic sample is used. In an average week, approximately 8,000 people visit the library, so the sampling fraction is 400/8,000 = 1/20. A random number between 1 and 20 is generated, say 15. For a period of one week, the 15th person and every following 20th person entering the library are asked to take part in the survey.
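The library example can be sketched in a few lines. The figures are those given above; the variable names are illustrative, and visitors are simply numbered in order of arrival.

```python
import random

population_size = 8000   # approximate number of visitors in the week
sample_size = 400
interval = population_size // sample_size   # 20, i.e. a sampling fraction of 1/20

start = random.randint(1, interval)   # random starting point between 1 and 20

# Positions (in order of arrival) of the visitors asked to take part.
selected = list(range(start, population_size + 1, interval))
print(len(selected))  # 400
```

With a starting point of 15, the positions selected are 15, 35, 55 and so on up to 7,995.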

3. Stratified sampling

Stratified sampling is used when the population of interest comprises several distinct sub-groups, to ensure that the sample contains an adequate number of individuals from every group. The required sample size is divided between these sub-groups, known as strata, then a sample drawn from each stratum to the required size.

Often, the samples drawn from each stratum are proportional to the representation of that stratum in the population, but this does not have to be the case – stratified sampling can be particularly valuable when the population contains a small sub-group from which relatively few members might be selected without stratification, but from which the sample size can be artificially inflated by using a larger sampling fraction. Disproportionate sampling of this sort must be compensated for when estimating parameters for the whole population, by applying appropriate weights to the raw data in the analysis stage.
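The weighting described above can be illustrated with a small sketch. The strata, their sizes, the sample sizes and the survey results are all invented, and every sampled individual is assumed to respond.

```python
# Hypothetical population and (disproportionate) sample sizes per stratum.
stratum_sizes = {"undergraduate": 9000, "postgraduate": 900, "staff": 100}
stratum_samples = {"undergraduate": 300, "postgraduate": 60, "staff": 40}

# Each response stands for (stratum size / number sampled) members of the population.
weights = {s: stratum_sizes[s] / stratum_samples[s] for s in stratum_sizes}

# Hypothetical proportion answering "yes" within each stratum's sample.
p_yes = {"undergraduate": 0.50, "postgraduate": 0.60, "staff": 0.80}

# Weighted estimate for the whole population.
weighted_total = sum(weights[s] * stratum_samples[s] * p_yes[s] for s in stratum_sizes)
population_estimate = weighted_total / sum(stratum_sizes.values())

# Unweighted estimate, which over-represents the small, over-sampled strata.
unweighted_estimate = (
    sum(stratum_samples[s] * p_yes[s] for s in stratum_sizes)
    / sum(stratum_samples.values())
)

print(round(population_estimate, 3), round(unweighted_estimate, 3))  # 0.512 0.545
```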

4. Cluster sampling

Cluster sampling is used when the population of interest comprises several similar sub-groups, to reduce the costs of administering the survey. The initial sample drawn is of the sub-groups, or clusters; all members of each cluster can then be surveyed, or a further sample drawn of individual members within each cluster. It is frequently used in population surveys where a restricted number of geographical areas might be targeted, rather than attempting to survey the whole country.
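A two-stage cluster sample might be sketched as below. The areas, the households and both sample sizes are hypothetical.

```python
import random

rng = random.Random(7)   # fixed seed only so that the sketch is repeatable

# Hypothetical clusters: 50 areas, each containing 200 households.
clusters = {
    f"area-{a:02d}": [f"area-{a:02d}/household-{h:03d}" for h in range(1, 201)]
    for a in range(1, 51)
}

# Stage 1: draw a sample of clusters.
chosen_areas = rng.sample(sorted(clusters), k=5)

# Stage 2: either survey every member of each chosen cluster,
# or draw a further sample within each one (as done here).
households = [h for area in chosen_areas for h in rng.sample(clusters[area], k=20)]
print(len(chosen_areas), len(households))  # 5 100
```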

A fifth method frequently used in market research surveys is quota sampling. This takes stratified sampling to the extreme, in that a specified number, or quota, of individuals is required from each of a set of often very detailed strata. It is not equivalent to random sampling, although it may be representative of the population, and it is not recommended for academic research.

Sample size

The answer to "how many do I need?" is almost certainly "fewer than you might think". Calculation of sample sizes is complex, and depends on the sampling design used, the type of parameters to be estimated, the degree of precision required for those estimates, and the confidence level of the results. Here we shall concentrate on simple random sampling.

Estimates of proportions

The simplest case to evaluate is when the parameters of interest are proportions of the population, the population is large, and a simple random sample is to be selected. It can then be shown that, to obtain estimates of the population proportion which are within 5 percentage points of the true value, with 95 per cent confidence, a sample size of 400 is sufficient, whatever the size of the population or the true value of the population proportion.

Sample size for a population proportion

The sample size n0 for estimating a population proportion p to an accuracy of d either side of the true value, with confidence level 1-α, is given by

$$n_0 = \frac{z_\alpha^2 \, p(1 - p)}{d^2}$$

Where zα is the upper α / 2 point of the normal distribution.

Note that this formula does not depend on population size. The “worst case” scenario is when p = (1 – p) = 0.5. For the standard confidence level of 95 per cent, zα = 1.96, and the sample sizes can then be calculated:

Accuracy within +/-		Sample size
5 per cent		384
2.5 per cent		1,537
2 per cent		2,401
1 per cent		9,604
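The table can be reproduced with a short calculation. This is a sketch; the function name is illustrative, and results are rounded to the nearest whole number to match the table.

```python
def sample_size_proportion(d, p=0.5, z=1.96):
    """n0 = z^2 * p * (1 - p) / d^2, rounded to the nearest whole number."""
    return round(z ** 2 * p * (1 - p) / d ** 2)


# Worst case p = 0.5 at 95 per cent confidence (z = 1.96).
sizes = [sample_size_proportion(d) for d in (0.05, 0.025, 0.02, 0.01)]
print(sizes)  # [384, 1537, 2401, 9604]
```

Rounding up rather than to the nearest whole number is the more cautious choice, and would give 385 for the first row.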

Estimates of means

Estimating an appropriate sample size when the parameters of interest are population means (averages) or totals is more complex, and requires prior knowledge of the population variance. This can be estimated from a small-scale pilot study.

The sample size n0 for estimating a population mean to an accuracy of d either side of the true value, with confidence level 1-α, is given by

$$n_0 = \frac{z_\alpha^2 \, \sigma^2}{d^2}$$

Where σ 2 is the population variance, and zα is the upper α/2 point of the normal distribution. Note that this formula does not depend on population size.
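As an illustration of the formula for a population mean, suppose a pilot study suggested a standard deviation of 12 units, and the mean is wanted to within 2 units either way at 95 per cent confidence. These figures are invented and the function name is illustrative.

```python
import math


def sample_size_mean(sigma, d, z=1.96):
    """n0 = z^2 * sigma^2 / d^2, rounded up to a whole respondent."""
    return math.ceil(z ** 2 * sigma ** 2 / d ** 2)


n0 = sample_size_mean(sigma=12, d=2)
print(n0)  # 139
```

Halving d to 1 unit roughly quadruples the requirement, since d is squared in the denominator.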

Note that larger samples will give improved precision in the estimates calculated. When the proportion of interest is likely to be very low, or the population is small, other considerations must be taken into account (see "Special considerations", below). If this is the case for your research, or if you are considering a complex sample design, it may be helpful to seek specialist advice at an early stage.

Special considerations

Central limit theorem

There are a number of statistical assumptions in the theory of parameter estimation for proportions. In particular, the sample size (n) should satisfy the following inequalities:

(8a) np ≥ 5 and n(1 - p) ≥ 5, where p is the proportion being estimated

In practice, this means that if p = 0.01 (i.e. 1 per cent) then the sample size should be at least 500.
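A short sketch of the check in (8a), returning the smallest sample size that satisfies both inequalities. The function name is illustrative.

```python
import math


def min_sample_size(p, threshold=5):
    """Smallest n with n*p >= threshold and n*(1 - p) >= threshold."""
    rarer = min(p, 1 - p)
    # The small tolerance guards against floating-point error pushing an exact
    # result such as 500 fractionally above the whole number.
    return math.ceil(threshold / rarer - 1e-9)


print(min_sample_size(0.01))  # 500
print(min_sample_size(0.1))   # 50
```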

Accuracy of estimation

If p = 0.1 (i.e. 10 per cent), then the sample size should be at least 50 to satisfy (8a) above. However, considerations of the accuracy of the estimate suggest that a larger sample size may be required. Equation (8b) gives the confidence interval for a given proportion p, using 2.58, the multiplier for 99 per cent confidence. Setting the half-width of this interval to d and rearranging gives n = 2.58² p(1 - p) / d², the same form as the sample size formula for a population proportion. If p = 10 per cent, an interval of one percentage point either side (d = 0.01) requires a sample size of about 5,991. A smaller sample size would be sufficient if less accuracy were required in the estimate; about 959 would be sufficient for accuracy of 2.5 percentage points either way (i.e. p in the range 7.5-12.5 per cent).

(8b) confidence interval:

$$p \pm 2.58 \sqrt{\frac{p(1 - p)}{n}}$$

Power considerations

A third approach may be taken by considering the power of tests of proportions. The power of a hypothesis test is a measure of how well it rejects the hypothesis when it is false (whereas the more commonly used significance level of the test relates to how well it accepts the hypothesis when it is true). The theory here is complex, and again a number of assumptions must be made to arrive at a viable sample size figure. It is particularly relevant to the detection of small proportions. It can be shown that if p=1 per cent, then a sample size of 380 is sufficient for a power of 95 per cent against the alternative that p=0.

Small populations

Where the overall population is small, it may be possible to reduce the sample size without loss of precision. In such cases, the sample size can be calculated as

(8c)

$$n = \frac{1}{1/n_0 + 1/N}$$

Where n0 is as defined previously, and N is the population size.
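Formula (8c) can be sketched as follows, taking n0 = 384 (the figure for accuracy within 5 percentage points) and some hypothetical population sizes. The function name is illustrative.

```python
import math


def adjust_for_small_population(n0, N):
    """Formula (8c): n = 1 / (1/n0 + 1/N), rounded up to a whole respondent."""
    return math.ceil(1 / (1 / n0 + 1 / N))


for N in (500, 1000, 100000):
    print(N, adjust_for_small_population(384, N))
# 500 218
# 1000 278
# 100000 383
```

The adjustment matters for populations in the hundreds or low thousands, and makes almost no difference once the population is large.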

Response rate

It is important to remember that it is not the size of the sample which matters, but the number of responses made. This is what the analysis will be based on, and it will affect the accuracy of any parameter estimates calculated and inferences drawn. Some common analyses have specific requirements below which their assumptions become invalid – the χ² test, for example, requires that the expected values in each cell are greater than or equal to 5. The more complex the table, the larger the sample size needed to meet this criterion.

The response rate you can achieve will depend on a variety of factors – the nature of the population, the type of survey undertaken, the length of the questionnaire, and how easy it is to fill in are just some of these. There are also actions you can take to improve the response rate, for example:

making questionnaires attractive and easy to complete
booking interviews at a mutually convenient time
offering an incentive, such as a prize draw from responses received by a given date
issuing a general reminder shortly before the closing date
following up individual non-respondents after the closing date and offering an extension.

Which of these, if any, might be appropriate will depend on the individual circumstances of each survey.

The likely response rate should be built in to the initial calculations of sample size. It may seem easy to select a large sample in the first instance, and not worry about response rates. However, the danger is that only those with a particular point of view may respond, and the survey will thus give a biased result. It is not generally possible to glean any information about the non-respondents to a survey, although if you have independent information about the population you can compare this to the survey results. If the survey attracts a low response, it may also be useful to compare the responses received at different times during the survey process, to see if any trends in key measures can be observed which might affect the outcomes of the research.
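Building the response rate into the sample size amounts to dividing the number of responses needed by the expected response rate. The sketch below assumes a 25 per cent response rate purely for illustration, and the function name is hypothetical.

```python
import math


def number_to_issue(responses_needed, expected_response_rate):
    """Questionnaires to send out so that enough responses are likely to come back."""
    return math.ceil(responses_needed / expected_response_rate)


print(number_to_issue(384, 0.25))  # 1536
```

This addresses only the number of responses; it does nothing to reduce the risk of bias from those who choose not to respond.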

Entering and cleaning up the data

Once you have collected your data, you need to get it into a form where it can be analysed. This initially involves entering the data into a spreadsheet, after which you will need to carry out some basic checks.

Setting up a spreadsheet

Quantitative data should generally be entered into a spreadsheet of some kind, rather than a database. Some analysis packages will allow direct data entry; alternatively a general spreadsheet package can be used. The usual format required is for the questions to be entered across the columns of the spreadsheet, with each respondent (record) in a new row. Ensure you have a unique identifier for each record, usually in the first column. Check before you start how your preferred analysis package expects its data to be formatted.

If the data are to be entered manually, it is helpful to set up some validation procedures on the range of responses allowed for each question. In Microsoft Excel, for example, this can be done from the “Data” menu, by selecting “Validation”. It is possible to restrict the values which can be typed into any cell, and set an error message if any other entry is attempted. See Figure 3. A simple example of how data might be entered onto an Excel spreadsheet.

A simple example of how data might be entered onto an Excel spreadsheet

Alternatively, all responses can be double entered and discrepancies checked. This is staff intensive, as it requires two people to enter the data independently of each other, and a comparison made of their entries. If this is not practical, a random sample of responses could be checked, preferably by a second data entry clerk. In this case, an acceptable error level should be set in advance; if this is exceeded, more extensive checking will be required.
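Comparing two independent sets of entries is straightforward to automate. The sketch below assumes both sets contain the same respondent identifiers and the same number of columns; the records are invented.

```python
# Hypothetical records keyed by respondent identifier; each value is a list of coded answers.
entry_a = {101: [1, 2, 5, 3], 102: [2, 2, 4, 1], 103: [1, 1, 3, 3]}
entry_b = {101: [1, 2, 5, 3], 102: [2, 3, 4, 1], 103: [1, 1, 3, 8]}

# List every cell where the two entries disagree:
# (respondent, column index, first entry, second entry).
discrepancies = [
    (respondent, column, entry_a[respondent][column], entry_b[respondent][column])
    for respondent in sorted(entry_a)
    for column in range(len(entry_a[respondent]))
    if entry_a[respondent][column] != entry_b[respondent][column]
]
print(discrepancies)  # [(102, 1, 2, 3), (103, 3, 3, 8)]
```

Each discrepancy then needs to be checked against the original questionnaire, since the comparison cannot tell which of the two entries is correct.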

Note that some statistical analysis packages prefer quantitative data to be coded numerically, rather than alphabetically (e.g. code gender: female=1; male=2 rather than female=f; male=m). This can be an issue with data collected automatically from a web questionnaire, depending on the software used. It is important to set up the underlying data form with appropriate coding at the outset. The default option may be to repeat the response wording, as illustrated here:

An example of automatically coded data from a web survey
"Full-time"	"No"	"Yes"	"4"	"£25 001 - £30 000"
"Full-time"	"No"	"No"		"£15 001 - £20 000"
"Full-time"	"No"	"Yes"	"1 professional" "1 non-professional"	"£20 001 - £25 000"
"Full-time"	"Yes"	"No"		"£20 001 - £25 000"
"Full-time"	"Yes"	"No"		"Below £15 000"
"Full-time"	"No"	"No"		"£15 001 - £20 000"
"Full-time"	"Yes"	"No"		"£20 001 - £25 000"

Accuracy of data entry

Once you have entered your data into a spreadsheet or analysis package, it is essential to carry out some basic checks before you begin your main analysis.

If the data have been collected automatically, e.g. from a web-based questionnaire, this is not usually a problem. However, you should still check for duplicates (particularly where an incentive has been offered to participants). If the data have been entered manually, then some quality control measures should be incorporated into the data entry process.

It is always valuable to carry out a simple distribution analysis, showing how many respondents have marked each answer to each question. This will pinpoint any coding errors where out-of-range codes might have been entered, and highlight any unusual values.
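A simple distribution analysis of one question might look like this, assuming a hypothetical five-point rating question coded 1 to 5. The responses are invented.

```python
from collections import Counter

VALID_CODES = {1, 2, 3, 4, 5}                  # hypothetical five-point rating question
responses = [1, 2, 2, 3, 5, 4, 4, 22, 3, 2]    # invented entries; 22 is a keying slip

# How many respondents gave each answer.
distribution = Counter(responses)

# Any code outside the permitted range points to a data entry error.
out_of_range = sorted(code for code in distribution if code not in VALID_CODES)

print(sorted(distribution.items()))  # [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1), (22, 1)]
print(out_of_range)                  # [22]
```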

Detecting outliers

In some analyses, outliers – individual values which are particularly high or low – can materially affect the results. In such cases, it may be desirable, and legitimate, to omit these from the analysis. Detecting them is largely subjective – a plot of the data distribution is usually adequate to spot extreme values. A more objective test is to examine as a possible outlier any value which is more than 3 standard deviations from the mean of the distribution. Outliers may be indicative of errors in the data or of atypical individuals in the population.
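The 3 standard deviation rule can be sketched as follows, using the sample standard deviation. The data are invented and the function name is illustrative.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    return [v for v in values if abs(v - mean) > threshold * sd]


values = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 48, 52, 50, 49, 51, 50, 52, 48, 120]
print(flag_outliers(values))  # [120]
```

Values flagged in this way are candidates for investigation, not for automatic removal. Note also that with ten or fewer values no point can lie more than 3 sample standard deviations from the mean, so the rule is only useful for larger data sets.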

Example of outlier

In this (fictitious) data set, one point clearly stands out from the rest, and should be investigated. See Figure 4. Example of an outlier.

Dealing with missing values

Many statistical packages will automatically exclude blank responses within numerical data from the analysis as being "missing". If a blank response should not be regarded as missing, it will usually be necessary to recode it, e.g. to zero. You can also specify set values to be treated as missing – for example if you have coded "don’t know" = 99 as a valid response. Zero values will not generally be treated as missing by default, nor will blank values in text fields.

If a respondent has clearly not answered most of the questions, but given up well before the end, the best option may be to omit that respondent from the data set entirely.

In some circumstances it may be necessary to include cases with missing values in the data set for certain analyses; the usual procedure in such cases is to replace the missing value by the mean of the remainder of the data.
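A sketch of declaring a code as missing and then replacing missing values by the mean of the remaining data. The code 99 for "don't know" follows the example above; the ratings themselves are invented.

```python
import statistics

DONT_KNOW = 99                       # code declared as missing
ratings = [4, 5, 99, 3, None, 4]     # None stands for a blank response


def is_missing(value):
    return value is None or value == DONT_KNOW


# Mean of the responses actually given.
observed = [r for r in ratings if not is_missing(r)]
fill_value = statistics.mean(observed)

# Replace each missing value by that mean.
completed = [fill_value if is_missing(r) else r for r in ratings]
print(completed)  # [4, 5, 4, 3, 4, 4]
```

Replacing missing values in this way leaves the mean unchanged but reduces the apparent variation in the data, so it is best reserved for the analyses which require complete cases.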

Resources

Connecting Mathematics
Brief explanations of mathematical terms and ideas

Statistics Glossary
compiled by Valerie J. Easton and John H. McColl of Glasgow University

Statsoft electronic textbook

Sampling (2nd edition)
by Steven K Thompson (Wiley, 2002, ISBN 0471291161)
