A numerical summary allows us to reduce a large data set to a single number or a handful of numbers. For example, the average age of Australians condenses millions of ages into one number. Obviously, condensing all these ages into a single number loses information. However, we can report other numerical summaries alongside it to give a picture which better represents the population.
The mean and the median allow us to represent the center of our data.
The mean is found by adding up all the numerical observations in the sample for a variable and dividing by the sample size. For example, to find the average height of people in a bus, you would add everyone's heights and divide by the number of people. Because every observation contributes to the sum, a single extreme value can pull the mean towards it, so the mean is affected by outliers.
The median is the middle score. It is found by placing the numerical observations, such as the heights of people in a bus, in order from smallest to largest and taking the middle score in the list.
e.g. 154 cm, 157 cm, 159 cm, 167 cm, 192 cm
The median would be 159 cm.
However, if there is an even number of observations there is no single middle score, so you add the middle two numbers and divide the sum by 2.
Because the median depends only on the order of the observations, not on how extreme the largest or smallest values are, it is less affected by outliers than the mean. Therefore, the median is the more useful measure of center for data which is skewed or contains outliers, as it is more robust.
If the median is roughly equal to the mean then the data would be expected to be symmetrical. If the median is greater than the mean then the data would likely be left skewed. If the median is less than the mean then the data would be expected to be right skewed.
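The mean and median calculations above can be sketched in R using the bus heights from the example. The sixth passenger (170 cm) and the mistyped value (1920 cm) are hypothetical, added only to show the even-count case and the effect of an outlier.

```r
# Heights (cm) of the five people on the bus, from the example above
heights <- c(154, 157, 159, 167, 192)

mean(heights)    # 165.8: sum of the heights divided by the number of people
median(heights)  # 159: middle value of the sorted heights

# Hypothetical sixth passenger, added only to give an even count
heights_even <- c(heights, 170)
median(heights_even)  # 163: average of the two middle values, 159 and 167

# Hypothetical data-entry error: 192 cm typed as 1920 cm
heights_typo <- c(154, 157, 159, 167, 1920)
mean(heights_typo)    # 511.4: pulled far upwards by the single outlier
median(heights_typo)  # 159: unchanged
```

In the original five heights the mean (165.8) sits above the median (159), which is what we would expect from right-skewed data: the one tall passenger drags the mean up.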
To determine the spread of data we can use the standard deviation.
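To find the standard deviation we first find the gaps in the data, that is, how far each observation sits from the mean. Using R we can do this by subtracting the mean of a variable from every value of that variable:
# The gaps = all the data in a variable - the mean of that variable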
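The standard deviation is then the root mean square (RMS) of the gaps: square each gap, average the squared gaps, and take the square root of that average. The result tells us how spread out our data is around the mean. Note that R's built-in sd() function divides the sum of squared gaps by n - 1 rather than n, so it gives a slightly larger number than the plain RMS, especially for small samples.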
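The gaps and the root mean square steps can be sketched in R as follows. The bus heights are reused as example data, and the variable names (gaps, sd_pop, sd_sample) are only illustrative.

```r
heights <- c(154, 157, 159, 167, 192)

# The gaps: every observation minus the mean of the variable
gaps <- heights - mean(heights)

# Root mean square of the gaps: square, average, then square root
sd_pop <- sqrt(mean(gaps^2))

# R's built-in sd() divides the sum of squared gaps by n - 1 instead of n
n <- length(heights)
sd_sample <- sd(heights)

sd_pop
sd_sample

# The two versions differ only by the factor sqrt(n / (n - 1))
all.equal(sd_sample, sd_pop * sqrt(n / (n - 1)))  # TRUE
```

The gaps always add up to zero, which is why they have to be squared before averaging.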
The following is a general rule which we can apply to data sets that are roughly normally distributed (bell shaped).
- within 1 sd of the mean = about 68% of our data
- within 2 sd of the mean = about 95% of our data
- within 3 sd of the mean = about 99.7% of our data
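These percentages come from the normal distribution, so they can be checked with R's pnorm(), which gives the area under the standard normal curve to the left of a value. This is a small sketch of the theoretical figures; real data will only match them approximately, and only if it is close to bell shaped.

```r
# Share of a normal distribution lying within k standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

round(100 * within_k_sd, 1)  # 68.3 95.4 99.7
```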
The z score can then be used to place a data point into one of these categories. Essentially it tells you how many sd the data point of interest is from the mean, with a negative z score meaning the point is below the mean.
z score = (data point - mean) / sd
Rearranged, this gives data point = mean + z score * sd
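As a sketch, the z score of an observation x is (x - mean) / sd, and it can be computed for every observation at once in R. The bus heights are reused as example data, and sd() (the n - 1 version) is an assumption here; the RMS version would work the same way.

```r
heights <- c(154, 157, 159, 167, 192)

# z score: how many standard deviations each height is from the mean
z <- (heights - mean(heights)) / sd(heights)
z

# Going back the other way: data point = mean + z * sd
heights_back <- mean(heights) + z * sd(heights)
all.equal(heights_back, heights)  # TRUE
```

The tallest passenger has the largest positive z score, and the three shortest passengers have negative z scores because they are below the mean.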
The interquartile range (IQR) tells us how wide the middle 50% of our data is. The first quartile is the median of the lower half of the data. The third quartile is the median of the upper half of the data. The second quartile is just the median. The middle 50% of the data lies between the first and third quartiles.
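A short sketch of finding the quartiles and the IQR in R, again on the bus heights. One caveat: quantile() interpolates between observations by default, so on other data sets it can give slightly different quartiles from the "median of each half" method described above.

```r
heights <- c(154, 157, 159, 167, 192)

# First quartile, median (second quartile) and third quartile
q <- quantile(heights, probs = c(0.25, 0.5, 0.75))
q  # 25%: 157, 50%: 159, 75%: 167

# Interquartile range: third quartile minus first quartile
IQR(heights)  # 10
```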
To determine whether your data contains outliers, you can use the following formulas to calculate the lower threshold and the upper threshold of your data. Any numerical observations less than the lower threshold or greater than the upper threshold will be considered outliers. Note that the thresholds are measured from the first and third quartiles, not from the mean.
Lower threshold = first quartile - 1.5 * IQR
Upper threshold = third quartile + 1.5 * IQR
Where the IQR = the third quartile - the first quartile
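A minimal sketch of the outlier check in R, using the conventional 1.5 * IQR thresholds measured from the first and third quartiles (the same convention a boxplot uses). The variable names lower and upper are illustrative.

```r
heights <- c(154, 157, 159, 167, 192)

q1 <- quantile(heights, 0.25, names = FALSE)  # first quartile: 157
q3 <- quantile(heights, 0.75, names = FALSE)  # third quartile: 167
iqr <- q3 - q1                                # 10

lower <- q1 - 1.5 * iqr  # 142
upper <- q3 + 1.5 * iqr  # 182

# Observations outside the thresholds are flagged as outliers
outliers <- heights[heights < lower | heights > upper]
outliers  # 192
```

Here the 192 cm passenger is flagged because 192 is above the upper threshold of 182.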
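A quartile refers to dividing your ordered data into 4 equal-sized groups, whereas a quantile is the general version: dividing your data into n equal-sized groups, which takes n - 1 cut points. The quartiles are simply the quantiles with n = 4, giving 3 cut points.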
A useful trick to combine your mean and standard deviation into one numerical summary is the coefficient of variation (CV) = SD / mean. It expresses the spread relative to the size of the mean and has no units, so it only makes sense for data where the mean is positive. It is used for quality control in manufacturing as well as by analytical chemists.
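The CV is a one-line calculation in R. The sketch below also converts the bus heights to metres to illustrate that the CV does not depend on the units; the variable names are illustrative.

```r
heights <- c(154, 157, 159, 167, 192)

# Coefficient of variation: standard deviation relative to the mean
cv <- sd(heights) / mean(heights)
cv

# The same heights in metres give the same CV, because the units cancel
cv_m <- sd(heights / 100) / mean(heights / 100)
all.equal(cv, cv_m)  # TRUE
```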
Additionally, the maximum and the minimum provide useful insight. These are just the largest and the smallest numbers in the data set respectively.
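In R the minimum and maximum are available directly, and summary() reports them together with the quartiles, median and mean in one call. A short sketch on the bus heights:

```r
heights <- c(154, 157, 159, 167, 192)

min(heights)    # 154
max(heights)    # 192
range(heights)  # 154 192

# Minimum, quartiles, median, mean and maximum in one call
summary(heights)
```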