Sunday 16 October 2011

Data Analysis - Eye-balling your data for better understanding


The starting point for any data analysis should be seeing what the data looks like. This is called eye-balling.

There are quite sophisticated ways of eye-balling your data, mainly using Exploratory Data Analysis, but these require yet another level of statistical knowledge.
The informal approach to eye-balling starts with the frequency distribution. You should always generate frequency distributions for your data.
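
As a rough illustration of generating a frequency distribution, here is a minimal Python sketch. The function name `frequency_distribution` and the sample ratings are made up for this example; any statistics package will produce the same kind of table.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, count) pairs sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())


# Hypothetical ratings on a 1-7 scale, invented for illustration.
ratings = [1, 2, 2, 3, 3, 3, 4, 2, 7, 1]

print(f"{'Rating':>6} {'Frequency':>9}")
for value, count in frequency_distribution(ratings):
    print(f"{value:>6} {count:>9}")
```

Values that never occur (5 and 6 here) simply do not appear in the output, so a gap in the printed table is itself something to look at.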

The first step in eye-balling is simply to inspect the type of distribution you have. Ask yourself the question: are there any odd-bods?

Technically, you are looking for outliers - cases which are extreme against the rest of the distribution. For example:
Rating    Frequency
1         20
2         37
3         28
4         12
5         8
6         2
7         7

This distribution could mean that there is a legitimate "lump" at the bottom end of the table (rating 7 has 7 cases, after only 2 at rating 6), but it could also mean that you have a few cases which do not fit too well.

Another example could be:

Age       Frequency
18-25     100
25-30     57
31-35     22
36-40     8
41-45     0
46-50     0
51-60     3

The 3 cases in the 51-60 category are well outside the range of the rest of the sample. You would have to ask yourself: even if they belong to the sample, might they distort the result?

To check out possible outliers, you usually have to go back to your raw data and look at individual cases. Alternatively you can sort your data and look at the potential outliers in isolation from the rest.
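
Going back to the individual cases can be sketched like this. The case ids, the ages and the helper `isolate` are all hypothetical; the 51-60 range is taken from the age table above.

```python
# Hypothetical raw records as (case id, age) pairs, invented for illustration.
cases = [
    (101, 19), (102, 58), (103, 24), (104, 33),
    (105, 52), (106, 27), (107, 38), (108, 55),
]


def isolate(cases, low, high):
    """Return the cases whose age lies in [low, high], sorted by age."""
    selected = [case for case in cases if low <= case[1] <= high]
    return sorted(selected, key=lambda case: case[1])


# Pull out the cases in the suspicious 51-60 category so they can be
# checked against the original questionnaires or data entry sheets.
suspects = isolate(cases, 51, 60)
for case_id, age in suspects:
    print(case_id, age)
```

Keeping the case id alongside the value is the important part: it is what lets you trace an odd-looking number back to its source.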

What you are looking for is:
dirty data
unexpected results
the number of missing cases for each item in the database.
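
A minimal sketch of two of these checks follows, assuming each case is stored as a Python dict and a missing answer is recorded as `None`. The records, the item names and both helper functions are hypothetical.

```python
# Hypothetical records, invented for illustration; None marks a missing answer.
records = [
    {"age": 23, "rating": 3},
    {"age": None, "rating": 5},
    {"age": 31, "rating": 9},  # 9 is outside the 1-7 scale: dirty data
    {"age": 19, "rating": None},
    {"age": 27, "rating": 2},
]


def missing_counts(records):
    """Count the missing (None) values for each item."""
    counts = {}
    for record in records:
        for item, value in record.items():
            counts.setdefault(item, 0)
            if value is None:
                counts[item] += 1
    return counts


def out_of_range(records, item, low, high):
    """Return records whose non-missing value for item is outside [low, high]."""
    return [
        record for record in records
        if record[item] is not None and not (low <= record[item] <= high)
    ]


print(missing_counts(records))
print(out_of_range(records, "rating", 1, 7))
```

Unexpected results are harder to automate, since they depend on what you expected; that part still relies on looking at the frequency tables yourself.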

The shape of the distribution
The next thing to look at is the shape of the distribution: 
How flat is it - kurtosis
How much is it off centre - skew
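
Both measures can be computed straight from a frequency table. The sketch below uses the usual moment-based formulas (skew as the third central moment over the 1.5 power of the variance, and kurtosis as the fourth central moment over the squared variance, minus 3 so that a normal distribution scores 0). The function name `shape_measures` is made up, and statistics packages often apply small-sample corrections, so their figures may differ slightly.

```python
def shape_measures(freq):
    """Return (skew, excess kurtosis) for a {value: count} frequency table."""
    n = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / n

    def central_moment(k):
        # k-th central moment, weighting each value by its frequency.
        return sum(count * (value - mean) ** k for value, count in freq.items()) / n

    m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis


# The rating distribution from the first example in this post.
ratings = {1: 20, 2: 37, 3: 28, 4: 12, 5: 8, 6: 2, 7: 7}
skew, kurt = shape_measures(ratings)
print(f"skew = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A positive skew means the long tail is on the high side, a negative skew means it is on the low side, and a value near zero means the distribution is roughly centred.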

When you have finished eye-balling your data, you will have some basic ideas about how coherent your database is.
You will then be able to say that there is nothing odd about the data, because you have sorted out any problems which have arisen. This means that whatever results you get, you can interpret them without worrying that stray data might have distorted the outcome.
