It is well worth spending a little time considering how you will analyse your data before you design your survey instrument or start to collect any data. This will ensure that data are collected – and, more importantly, coded – in an appropriate way for the analysis you hope to do.
By Claire Creaser
Fundamentals
Start to think about the techniques you will use for your analysis before you collect any data.
What do you want to know?
The analysis must relate to the research questions, and this may dictate the techniques you should use.
What type of data do you have?
The type of data you have is also fundamental – the techniques and tools appropriate to interval and ratio variables are not suitable for categorical or ordinal measures. (See How to collect data for notes on types of data)
What assumptions can – and can’t – you make?
Many techniques rely on the sampling distribution of the test statistic being a Normal distribution (see below). This is always the case when the underlying distribution of the data is Normal, but in practice, the data may not be Normally distributed. For example, there could be a long tail of responses to one side or the other (skewed data). Nonparametric techniques are available to use in such situations, but these are inevitably less powerful and less flexible. However, if the sample size is sufficiently large, the Central Limit Theorem allows use of the standard analyses and tools.
Techniques for a non-Normal distribution
Parametric or nonparametric statistics?
Parametric methods and statistics rely on a set of assumptions about the underlying distribution to give valid results. In general, they require the variables to have a Normal distribution.
Nonparametric techniques must be used for categorical and ordinal data, but for interval & ratio data they are generally less powerful and less flexible, and should only be used where the standard, parametric, test is not appropriate – e.g. when the sample size is small (below 30 observations).
Central limit theorem
As the sample size increases, the shape of the sampling distribution of the test statistic tends to become Normal, even if the distribution of the variable which is being tested is not Normal.
In practice, this can be applied to test statistics calculated from more than 30 observations.
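The effect can be seen in a small simulation. The sketch below is illustrative only: it assumes an exponential population (a strongly skewed distribution chosen for this example, not taken from the text), uses the sample mean as the test statistic, and measures how lopsided the sampling distribution is with a simple skewness calculation. All names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def sample_means(sample_size, replications, rng):
    """Draw repeated samples from a skewed population and return each sample's mean."""
    means = []
    for _ in range(replications):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


rng = random.Random(42)  # fixed seed so the run is repeatable
means_small = sample_means(2, 5000, rng)    # samples of 2 observations
means_large = sample_means(30, 5000, rng)   # samples of 30 observations

skew_small = skewness(means_small)
skew_large = skewness(means_large)
print(f"skewness of sample means, n=2:  {skew_small:.2f}")
print(f"skewness of sample means, n=30: {skew_large:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution; the means of the larger samples should be much closer to symmetric than the means of the very small samples.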
How much can you expect to get out of your data?
The smaller the sample size, the less you can get out of your data. Standard error is inversely related to sample size (it is proportional to one over the square root of the number of observations), so the larger your sample, the smaller the standard error, and the greater the chance you will have of identifying statistically significant results in your analysis.
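As a rough illustration, the standard error of a mean is the standard deviation divided by the square root of the sample size. The sketch below assumes a standard deviation of 10 (an arbitrary figure for the example) and shows that quadrupling the sample size halves the standard error.

```python
import math


def standard_error(standard_deviation, sample_size):
    """Standard error of the mean for a sample of the given size."""
    return standard_deviation / math.sqrt(sample_size)


# Hypothetical standard deviation of 10 at three sample sizes.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
# prints:
# 25 2.0
# 100 1.0
# 400 0.5
```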
Basic techniques
In general, any technique which can be used on categorical data may also be used on ordinal data. Any technique which can be used on ordinal data may also be used on ratio or interval data. The reverse is not the case.
Describing your data
The first stage in any analysis should be to describe your data, and hence the population from which it is drawn. The statistics appropriate for this activity fall into three broad groups, and depend on the type of data you have.
| What do you want to do? | With what type of data? | Appropriate techniques |
| --- | --- | --- |
| Look at the distribution | Categorical / Ordinal | Plot the percentage in each category (column or bar chart) |
| | Ratio / Interval | Histogram; cumulative frequency diagram |
| Describe the central tendency | Categorical | n/a |
| | Ordinal | Median; mode |
| | Ratio / Interval | Mean; median |
| Describe the spread | Categorical | n/a |
| | Ordinal | Range; interquartile range |
| | Ratio / Interval | Range; interquartile range; variance; standard deviation |
See Graphical presentation for descriptions of the main graphical techniques.
Mean – the arithmetic average, calculated by summing all the values and dividing by the number of values in the sum.
Median – the mid point of the distribution, where half the values are higher and half lower.
Mode – the most frequently occurring value.
Range – the difference between the highest and lowest value.
Interquartile range – the difference between the upper quartile (the value where 25 per cent of the observations are higher and 75 per cent lower) and the lower quartile (the value where 75 per cent of the observations are higher and 25 per cent lower). This is particularly useful where there are a small number of extreme observations much higher, or lower, than the majority.
Variance – a measure of spread, calculated as the mean of the squared differences of the observations from their mean.
Standard deviation – the square root of the variance.
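The definitions above can be tried out with Python's standard `statistics` module. This is a minimal sketch on a small hypothetical data set. Two assumptions are worth flagging: the variance follows the definition given here (the mean of the squared differences, i.e. the population form, whereas many packages divide by n-1 for a sample), and the quartiles use the "inclusive" convention, which is only one of several in common use.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical ratio data

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
value_range = max(scores) - min(scores)

# Quartiles: the three cut points that split the ordered data into four parts.
lower_q, _, upper_q = statistics.quantiles(scores, n=4, method="inclusive")
iqr = upper_q - lower_q

variance = statistics.pvariance(scores)  # mean of squared differences from the mean
std_dev = statistics.pstdev(scores)      # square root of that variance

print(mean, median, mode, value_range, iqr, variance, std_dev)
# prints: 5 4.5 4 7 1.5 4 2.0
```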
Differences between groups and variables
Chi-squared test – used to compare the distributions of two or more sets of categorical or ordinal data.
t-tests – used to compare the means of two sets of data.
Wilcoxon U test (often called the Mann-Whitney U test) – nonparametric equivalent of the t-test. Based on the rank order of the data, it may also be used to compare medians.
ANOVA – analysis of variance, to compare the means of more than two groups of data.
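To show what "based on the rank order of the data" means for the U test, the sketch below computes the U statistic directly, by counting, for every pair of observations drawn one from each group, which group has the higher value (ties count as half). The ratings are hypothetical, and the function only produces the statistic; judging significance would still need tables or a statistical package.

```python
def u_statistics(group_a, group_b):
    """Return (U_a, U_b): the number of cross-group pairs each group 'wins'."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5  # a tie is shared between the two groups
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical ordinal ratings (1 to 5) from two independent groups.
ratings_a = [1, 2, 2, 3, 4]
ratings_b = [3, 4, 4, 5, 5]

print(u_statistics(ratings_a, ratings_b))  # prints: (2.5, 22.5)
```

The two values always sum to the number of pairs (here 25); the more unequal they are, the more one group tends to rank above the other.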
| What do you want to do? | With what type of data? | Appropriate techniques |
| --- | --- | --- |
| Compare two groups | Categorical | Chi-squared test |
| | Ordinal | Chi-squared test; Wilcoxon U test |
| | Ratio / Interval | t-test for independent samples |
| Compare more than two groups | Categorical / Ordinal | Chi-squared test |
| | Ratio / Interval | ANOVA |
| Compare two variables over the same subjects | Categorical / Ordinal | Chi-squared test |
| | Ratio / Interval | t-test for dependent samples |
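As an illustration of the most widely applicable test in the table, the sketch below computes a chi-squared statistic for a two-way table of counts. The counts are hypothetical, and the function is a bare-bones version of the standard calculation (observed versus expected counts under no association); it does not look up the p-value.

```python
def chi_squared(observed):
    """Chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: two groups (rows) answering yes / no (columns).
observed = [[30, 10],
            [20, 40]]

statistic, df = chi_squared(observed)
print(round(statistic, 2), df)  # prints: 16.67 1
```

With one degree of freedom the 5 per cent critical value is 3.84, so a statistic of this size would indicate that the two groups answer differently.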
Relationships between variables
The correlation coefficient measures the degree of linear association between two variables, with a value in the range +1 to -1. Positive values indicate that the two variables increase and decrease together; negative values that one increases as the other decreases. A correlation coefficient of zero indicates no linear relationship between the two variables. The Spearman rank correlation is the nonparametric equivalent of the Pearson correlation.
| What type of data? | Appropriate techniques |
| --- | --- |
| Categorical | Chi-squared test |
| Ordinal | Chi-squared test; Spearman rank correlation (rho) |
| Ratio / Interval | Pearson correlation (r) |
Note that correlation analyses will only detect linear relationships between two variables. The figure below illustrates two small data sets where there are clearly relationships between the two variables. However, the correlation for the second data set, where the relationship is not linear, is 0.0. A simple correlation analysis of these data would suggest no relationship between the measures, when that is clearly not the case. This illustrates the importance of undertaking a series of basic descriptive analyses before embarking on analyses of the differences and relationships between variables.
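The point is easy to reproduce. The sketch below uses two small made-up data sets (not the ones in the original figure): one where y is a straight-line function of x, and one where y is the square of x. The Pearson correlation is computed from its usual definition.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(ss_x * ss_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # a perfect straight-line relationship
curved = [x ** 2 for x in xs]      # a perfect, but non-linear, relationship

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
print(r_linear, r_curved)  # prints: 1.0 0.0
```

Both y variables are completely determined by x, yet only the first shows up in the correlation coefficient, which is why plotting the data first is worthwhile.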
Testing validity
Significance levels
The statistical significance of a test is a measure of probability – the probability that you would have obtained a result at least as extreme as the one observed in your sample if the null hypothesis (that there is no effect due to the parameters being tested) were true. The example below tests whether scores in an exam change after candidates have received training. The hypothesis suggests that they should, so the null hypothesis is that they won't.
In general, any level of probability above 5 per cent (p>0.05) is not considered to be statistically significant, and for large surveys 1 per cent (p>0.01) is often taken as a more appropriate level.
Note that statistical significance does not mean that the results you have obtained actually have value in the context of your research. If you have a large enough sample, a very small difference between groups can be identified as statistically significant, but such a small difference may be irrelevant in practice. On the other hand, an apparently large difference may not be statistically significant in a small sample, due to the variation within the groups being compared.
Degrees of freedom
Some test statistics (e.g. chi-squared) require the number of degrees of freedom to be known, in order to test for statistical significance against the correct probability table. In brief, the degrees of freedom is the number of values which can be assigned arbitrarily within the sample.
For example:
In a sample of size n divided into k classes, there are k-1 degrees of freedom (the first k-1 groups could be of any size up to n, while the last is fixed by the total of the first k-1 and the value of n). In numerical terms, if a sample of 500 individuals is taken from the UK, and it is observed that 300 are from England, 100 from Scotland and 50 from Wales, then there must be 50 from Northern Ireland. Given the numbers from the first three groups, there is no flexibility in the size of the final group. Dividing the sample into four groups gives three degrees of freedom.
In a two-way contingency table with p rows and q columns, there are (p-1)*(q-1) degrees of freedom (given the values of the first rows and columns, the last row and column are constrained by the totals in the table).
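The arithmetic in these two examples can be written out as a short sketch. The function names are hypothetical; the UK figures are the ones used above.

```python
def df_for_classes(number_of_classes):
    """Degrees of freedom for a sample divided into k classes: k - 1."""
    return number_of_classes - 1


def df_for_table(rows, columns):
    """Degrees of freedom for a two-way contingency table: (p - 1) * (q - 1)."""
    return (rows - 1) * (columns - 1)


# UK example: once three of the four group sizes are known, the fourth is fixed.
sample_size = 500
known_groups = {"England": 300, "Scotland": 100, "Wales": 50}
northern_ireland = sample_size - sum(known_groups.values())

print(northern_ireland)                       # prints: 50
print(df_for_classes(len(known_groups) + 1))  # prints: 3
print(df_for_table(3, 4))                     # prints: 6
```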
One-tail or two-tail tests
If, as is generally the case, what matters is simply that the statistics for the populations are different, then it is appropriate to use the critical values for a two-tailed test.
If, however, you are only interested in finding out whether the statistic for population A has a larger value than that for population B, then a one-tailed test would be appropriate. The critical value for a one-tailed test is generally lower than for a two-tailed test. A one-tailed test should only be used if your research hypothesis is that population A has a greater value than population B, and it does not matter how different they are if population A has a value that is less than that for population B.
For example
Scenario 1
Null hypothesis – there is no difference in mean exam scores before and after training (i.e. training has no effect on the exam score)
Alternative – there is a difference in the mean scores before and after training (i.e. training has an unspecified effect)
Use a two-tail test
Scenario 2
Null hypothesis – Training does not increase the mean score
Alternative – Mean score increases after training
Use a one-tail test, if there is an observed increase in mean score.
(If there is an observed fall in scores, there is no need to test, as you cannot reject the null hypothesis.)
Scenario 3
Null hypothesis – Training does not cause mean scores to fall
Alternative – Mean score falls after training
Use a one-tail test, if there is an observed fall in mean score.
(If there is an observed increase in scores, there is no need to test, as you cannot reject the null hypothesis.)
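| | Before | After |
| --- | --- | --- |
| Mean | 360.4 | 361.1 |
| Variance | 46,547 | 46,830 |
| Observations | 62 | 62 |
| Degrees of freedom (df) | 61 | |
| t Stat | 1.79 | |
| P(T<=t) one-tail | 0.04 | |
| t Critical one-tail | 1.67 | |
| P(T<=t) two-tail | 0.08 | |
| t Critical two-tail | 2.00 | |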
If the above test results were obtained, then under scenario 1, using a two-tail test, you might conclude that there was no statistically significant difference between the scores (p=0.08), and, as a consequence, that training had no effect. Similarly, under scenario 3, you would conclude that there is no evidence to suggest that training causes mean scores to fall, as they have in fact risen. However, under scenario 2, using a one-tail test, you would conclude that there was an increase in mean scores, statistically significant at the 5 per cent level (p=0.04).
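The decision rule in this example can be sketched in a few lines. The t statistic and critical values below are the ones reported in the table (1.79, 1.67 and 2.00); the function names and the small before/after data set used to show how a dependent-samples t statistic is calculated are hypothetical.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic and degrees of freedom for a dependent-samples (paired) t-test."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1


def is_significant(t_stat, critical_value, two_tailed):
    """Compare a t statistic with the critical value for the chosen kind of test."""
    if two_tailed:
        return abs(t_stat) > critical_value
    return t_stat > critical_value  # one-tailed test for an increase


# Values reported in the table above.
print(is_significant(1.79, 2.00, two_tailed=True))   # scenario 1: prints False
print(is_significant(1.79, 1.67, two_tailed=False))  # scenario 2: prints True

# Hypothetical scores for four candidates, to show the calculation itself.
t_stat, df = paired_t(before=[10, 12, 14, 16], after=[11, 14, 15, 18])
print(round(t_stat, 2), df)  # prints: 5.2 3
```

The same statistic leads to different conclusions only because the one-tail critical value is lower, which is why the choice of test has to follow from the research hypothesis rather than from the result.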
A final warning!
Statistical packages will do what you tell them, on the whole. They do not know whether the data you have provided is of good quality, or (with a very few exceptions) whether it is of an appropriate type for the analysis you have undertaken.
Advanced techniques
These tools and techniques have specialist applications, and will generally be designed into the research methodology at an early stage, before any data are collected. If you are considering using any of these, you may wish to consult a specialist text or an experienced statistician before you start.
In each case, we give some examples of Emerald articles which use the technique.
Factor analysis
To reduce the number of variables for subsequent analysis by creating combinations of the original variables measured which account for as much of the original variance as possible, but allow for easier interpretation of the results. Commonly used to create a small set of dimension ratings from a large number of opinion statements individually rated on Likert scales. You must have more observations (subjects) than you have variables to be analysed.
For example
A Likert scale variable: "I like to eat chocolate ice cream for breakfast"
Strongly agree  1  2  3  4  5  Strongly disagree
A factor analysis of Page and Wong's servant leadership instrument
Rob Dennis and Bruce E. Winston
Leadership & Organization Development Journal, vol. 24 no. 8
Understanding factors for benchmarking adoption: New evidence from Malaysia
Yean Pin Lee, Suhaiza Zailani and Keng Lin Soh
Benchmarking: An International Journal, vol. 13 no. 5
Cluster analysis
To classify subjects into groups with similar characteristics, according to the values of the variables measured. You must have more observations than you have variables included in the analysis.
Organic product avoidance: Reasons for rejection and potential buyers' identification in a countrywide survey
C. Fotopoulos and A. Krystallis
British Food Journal, vol. 104 no. 3/4/5
Detection of financial distress via multivariate statistical analysis
S. Ganesalingam and Kuldeep Kumar
Managerial Finance, vol. 27 no. 4
Discriminant analysis
To identify those variables which best discriminate between known groups of subjects. The results may be used to allocate new subjects to the known groups based on their values of the discriminating variables
Detection of financial distress via multivariate statistical analysis
S. Ganesalingam and Kuldeep Kumar
Managerial Finance, vol. 27 no. 4
Understanding factors for benchmarking adoption: New evidence from Malaysia
Yean Pin Lee, Suhaiza Zailani and Keng Lin Soh
Benchmarking: An International Journal, vol. 13 no. 5
Methodology
Discriminant analysis was used to determine whether statistically significant differences exist between the average score profile on a set of variables for two a priori defined groups and so enabled them to be classified. Besides, it could help to determine which of the independent variables account the most for the differences in the average score profiles of the two groups. In this study, discriminant analysis was the main instrument to classify the benchmarking adopter and non-adopter. It was also utilised to determine which of the independent variables would contribute to benchmarking adoption.
Regression
To model how one, dependent, variable behaves depending on the values of a set of other, independent, variables. The dependent variable must be interval or ratio in type; the independent variables may be of any type, but special methods must be used when including categorical or ordinal independent variables in the analysis.
Developments in milk marketing in England and Wales during the 1990s
Jeremy Franks
British Food Journal, vol. 103 no. 9
Training under fire: The relationship between obstacles facing training and SMEs' development in Palestine
Mohammed Al Madhoun
Journal of European Industrial Training, vol. 30 no. 2
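For the simplest case of regression, one dependent and one independent variable, the least-squares line can be fitted by hand. The sketch below is a minimal illustration on hypothetical data (hours of training against exam score); models with several independent variables would normally be fitted with a statistical package.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: hours of training (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

intercept, slope = fit_line(hours, scores)
print(round(intercept, 1), round(slope, 1))  # prints: 47.7 4.1
```

Here the fitted line suggests that each extra hour of training is associated with about 4.1 more marks in this made-up data set.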
Time series analysis
To investigate the patterns and trends in a variable measured regularly over a period of time. May also be used to identify and adjust for seasonal variation, for example in financial statistics.
An analysis of the trends and cyclical behaviours of house prices in the Asian markets
Ming-Chi Chen, Yuichiro Kawaguchi and Kanak Patel
Journal of Property Investment & Finance, vol. 22 no. 1
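One of the simplest ways to look past seasonal variation is a moving average taken over a full seasonal cycle. The sketch below assumes hypothetical quarterly figures with a strong seasonal pattern and uses a plain four-quarter average; dedicated time series methods (for example centred averages or seasonal decomposition) are more refined.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical quarterly figures for two years: seasonal swings on a rising trend.
quarterly = [10, 20, 30, 40, 14, 24, 34, 44]

trend = moving_average(quarterly, window=4)
print(trend)  # prints: [25.0, 26.0, 27.0, 28.0, 29.0]
```

Averaging over all four quarters removes the within-year swings and leaves the steady underlying rise.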
Graphical presentation
Presenting data in graphical form can increase the accessibility of your results to a non-technical audience, and highlight effects and results which would otherwise require lengthy explanation, or complex tables. It is therefore important that appropriate graphical techniques are used. This section gives examples of some of the most commonly used graphical presentations, and indicates when they may be used. All, except the histogram, have been produced using Microsoft Excel®.
Column or bar charts
There are four main variations, and whether you display the data in horizontal bars or vertical columns is largely a matter of personal preference.
Histogram
To illustrate a frequency distribution in categorical or ordinal data, or grouped ratio/interval data. Usually displayed as a column graph.
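Before a histogram of ratio/interval data can be drawn, the values have to be grouped. The sketch below shows one way to do that grouping, assuming equal-width groups and hypothetical ages, and prints a rough text version of the chart; the function name is illustrative.

```python
from collections import Counter


def frequency_table(values, group_width):
    """Count how many values fall in each group, keyed by the group's lower bound."""
    counts = Counter((v // group_width) * group_width for v in values)
    return dict(sorted(counts.items()))


ages = [18, 22, 25, 31, 34, 35, 38, 41, 47, 52]  # hypothetical respondents' ages

table = frequency_table(ages, group_width=10)
for lower_bound, count in table.items():
    print(f"{lower_bound}-{lower_bound + 9}: {'#' * count}")
# prints:
# 10-19: #
# 20-29: ##
# 30-39: ####
# 40-49: ##
# 50-59: #
```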
Clustered column/bar
To compare categorical, ordinal or grouped ratio/interval data across categories. The data used in Fig 4 are the same as those in Figs 5 and 6.
Stacked column/bar
To illustrate the actual contribution to the total for categorical, ordinal or grouped ratio/interval data by categories. The data used in Fig 5 are the same as those in Figs 4 and 6.
Percentage stacked column/bar
To compare the percentage contribution to the total for categorical, ordinal or grouped ratio/interval data across categories. The data used in Fig 6 are the same as those in Figs 4 and 5.
Line graphs
To show trends in ordinal or ratio/interval data. Points on a graph should only be joined with a line if the data on the x-axis are at least ordinal. One particular application is to plot a frequency distribution for interval/ratio data (Fig 8).
Pie charts
To show the percentage contribution to the whole of categorical, ordinal or grouped ratio/interval data.
Scatter graphs
To illustrate the relationship between two variables, of any type (although most useful where both variables are ratio/interval in type). Also useful in the identification of any unusual observations in the data.
Box and whisker plot
A specialist graph illustrating the central tendency and spread of a large data set, including any outliers.
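The figures behind a box and whisker plot can be calculated directly. The sketch below makes two assumptions that are common conventions rather than anything stated here: quartiles use the "inclusive" method, and an observation counts as an outlier if it lies more than 1.5 times the interquartile range beyond a quartile. The data are hypothetical.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Quartiles, median and outliers as used to draw a box and whisker plot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical, with one extreme value

summary = box_plot_summary(data)
print(summary)
# prints: {'q1': 3.0, 'median': 5.0, 'q3': 7.0, 'outliers': [30]}
```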
Resources
Connecting Mathematics
Brief explanations of mathematical terms and ideas
Statistics Glossary
compiled by Valerie J. Easton and John H. McColl of Glasgow University
100 Statistical Tests by Gopal K. Kanji
(Sage, 1993, ISBN 141292376X)
Oxford Dictionary of Statistics by Graham Upton and Ian Cook
(Oxford University Press, 2006, ISBN 0198614314)