Learn.

A numerical summary lets us reduce a large data set to one or a few numbers. For example, the average age of Australians condenses millions of values into a single number. Obviously, condensing all these ages into a single number loses information. However, we can combine it with other numerical summaries to give a picture which better represents the population.

The **mean** and the **median** allow us to represent the center of our data.

The **mean** is found by adding up all the numerical observations in the sample for a variable and dividing the total by the sample size. For example, to find the average height of people in a bus, you would add everyone's heights and divide by the number of people. Because every observation contributes to the total, the mean is affected by outliers.
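As a minimal sketch of the calculation above, the R snippet below computes the mean by hand and with the built-in `mean()`. The vector name `heights` and its values (in cm) are illustrative assumptions.

```r
# Illustrative heights (cm) of five people on a bus
heights <- c(154, 157, 159, 167, 192)

# By hand: add everyone's height, then divide by the number of people
manual_mean <- sum(heights) / length(heights)

# Built-in equivalent
builtin_mean <- mean(heights)

print(manual_mean)   # 165.8
print(builtin_mean)  # 165.8
```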

The median is the **middle** score. This is found by placing the numerical observations, such as the heights of people in a bus, in order from smallest to largest and finding the middle score in the list.

e.g. 154 cm, 157 cm, 159 cm, 167 cm, 192 cm

The median would be 159 cm.

However, if there is an even number of observations, there is no single middle score, so you add the middle two numbers and divide the sum by 2.

Because the median depends only on the middle of the ordered data, it is less affected by outliers than the mean. Therefore, the median is the more useful measure of center for data which is skewed or contains outliers, as it is more robust.
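The sketch below illustrates the median for an odd and an even number of observations, and how an outlier moves the mean but not the median. The extra values 170 and 250 are hypothetical additions made up for the illustration.

```r
# Odd number of observations: the middle score
heights <- c(154, 157, 159, 167, 192)
med_odd <- median(heights)           # 159

# Even number of observations: average of the middle two (159 and 167)
heights_even <- c(heights, 170)      # 170 is a hypothetical sixth person
med_even <- median(heights_even)     # 163

# Replace the tallest person with an extreme (hypothetical) value
heights_outlier <- c(154, 157, 159, 167, 250)
mean_before   <- mean(heights)            # 165.8
mean_after    <- mean(heights_outlier)    # 177.4, pulled up by the outlier
median_after  <- median(heights_outlier)  # still 159
```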

If the median is **equal** to the mean then the data would be expected to be **symmetrical**. If the median is **greater** than the mean then the data would likely be **left skewed**. If the median is **less** than the mean then the data would be expected to be **right skewed**.
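The comparison above can be sketched as a small helper. The function name `skew_hint` and the example vectors are hypothetical, and the result is only a rough hint, since real data rarely has a mean exactly equal to its median.

```r
# Hypothetical helper: compare the median with the mean to hint at skew
skew_hint <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (isTRUE(all.equal(m, med))) {
    "roughly symmetrical"
  } else if (med > m) {
    "likely left skewed"
  } else {
    "likely right skewed"
  }
}

symmetric_data <- c(1, 2, 3, 4, 5)                   # mean 3, median 3
right_data     <- c(1, 2, 2, 3, 3, 3, 4, 20)         # mean 4.75, median 3
left_data      <- c(1, 17, 18, 18, 19, 19, 19, 20)   # mean 16.375, median 18.5

hint_symmetric <- skew_hint(symmetric_data)
hint_right     <- skew_hint(right_data)
hint_left      <- skew_hint(left_data)
```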

To determine the **spread** of data we can use the standard deviation.

To find the standard deviation, we first find the gaps in the data: how far each observation is from the mean of its variable.

The gaps = all the data in a variable - the average of that variable

In R, for a numeric vector `x`, this is `x - mean(x)`.

The **root mean square (RMS)** of the gaps gives us the **standard deviation**: square each gap, average the squares, then take the square root of that average. This tells us how spread out our data is.
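A minimal sketch of the RMS calculation, using illustrative heights in cm. Note that R's built-in `sd()` divides the sum of squared gaps by n - 1 rather than n (the sample standard deviation), so it returns a slightly larger number than the RMS of the gaps.

```r
heights <- c(154, 157, 159, 167, 192)   # illustrative values
n <- length(heights)

# Gaps: each observation minus the mean
gaps <- heights - mean(heights)

# Root mean square of the gaps: square, average, square root
rms_sd <- sqrt(mean(gaps^2))

# Built-in sample standard deviation (divides by n - 1)
sample_sd <- sd(heights)

# The two differ only by the factor sqrt(n / (n - 1))
adjusted <- rms_sd * sqrt(n / (n - 1))
```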

The following rule of thumb applies to data that is roughly bell-shaped (approximately normally distributed).

- About 68% of our data lies within 1 sd of the mean
- About 95% of our data lies within 2 sd of the mean
- About 99.7% of our data lies within 3 sd of the mean
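One way to check these percentages is to simulate bell-shaped data and count how much of it falls within each band. This is a sketch under the assumption that the data is normally distributed; the helper `within_sd`, the seed, and the simulated mean and sd are all made up for the illustration.

```r
set.seed(1)                                # fixed seed so the run is repeatable
x <- rnorm(100000, mean = 170, sd = 10)    # simulated, normally distributed data

# Hypothetical helper: proportion of x within k standard deviations of the mean
within_sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

p1 <- within_sd(x, 1)   # close to 0.68
p2 <- within_sd(x, 2)   # close to 0.95
p3 <- within_sd(x, 3)   # close to 0.997
```

For strongly skewed data the proportions can differ noticeably from these values.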

The **z score** can then be used to place a data point into one of these bands. Essentially, it tells you how many standard deviations the data point of interest is from the mean.

z score = (data point - mean) / sd

Rearranged, this gives: data point = mean + z score * sd
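As a small sketch, the z score of every observation can be computed in one line, and multiplying back by the sd and adding the mean recovers the original values. The `heights` values are illustrative; `scale()` is base R and uses the same sample sd as `sd()`.

```r
heights <- c(154, 157, 159, 167, 192)   # illustrative values (cm)

# z score: how many sd each data point is from the mean
z <- (heights - mean(heights)) / sd(heights)

# Going the other way: data point = mean + z score * sd
recovered <- mean(heights) + z * sd(heights)

# Base R's scale() does the same centring and scaling (it returns a matrix)
z_builtin <- as.numeric(scale(heights))
```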

The interquartile range (IQR) tells us where the middle 50% of our data can be found. The first quartile is the median of the lower half of the data. The third quartile is the median of the upper half of the data. The second quartile is just the median.
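The median-of-halves description can be sketched directly for an even number of observations, where the two halves are unambiguous. The eight values below are illustrative. R's built-in `quantile()` and `IQR()` interpolate between observations by default, so their results can differ slightly from this hand method.

```r
# Illustrative data, sorted from smallest to largest
x <- sort(c(154, 157, 159, 161, 165, 167, 170, 192))
n <- length(x)

lower_half <- x[1:(n / 2)]          # 154 157 159 161
upper_half <- x[(n / 2 + 1):n]      # 165 167 170 192

q1 <- median(lower_half)            # 158
q2 <- median(x)                     # 163
q3 <- median(upper_half)            # 168.5
iqr_halves <- q3 - q1               # 10.5

# Built-in versions (default interpolation, may not match exactly)
builtin_quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
builtin_iqr <- IQR(x)
```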

To determine whether your data contains **outliers**, you can use the following formulas to calculate the upper threshold and the lower threshold of your data. Any numerical observation greater than the upper threshold or less than the lower threshold is considered an outlier.

Lower threshold = first quartile - 1.5 * IQR

Upper threshold = third quartile + 1.5 * IQR

where IQR = the third quartile - the first quartile
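A minimal sketch of the outlier check, assuming the common convention of measuring the thresholds from the first and third quartiles (often called Tukey's fences). The `heights` values are illustrative, and the quartiles come from R's default `quantile()` method.

```r
heights <- c(154, 157, 159, 167, 192)      # illustrative values (cm)

q1 <- unname(quantile(heights, 0.25))      # 157
q3 <- unname(quantile(heights, 0.75))      # 167
iqr <- q3 - q1                             # 10

lower_threshold <- q1 - 1.5 * iqr          # 142
upper_threshold <- q3 + 1.5 * iqr          # 182

# Anything outside the thresholds is flagged as an outlier
outliers <- heights[heights < lower_threshold | heights > upper_threshold]
print(outliers)                            # 192
```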

Quartiles divide your data into 4 equal-sized groups, whereas quantiles are the general case: they divide your data into n equal-sized groups using n - 1 cut points.
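As a sketch of the difference, the snippet below finds the 3 cut points for quartiles and the 9 cut points for deciles (quantiles with 10 groups) on simulated data. The seed and the simulated vector `x` are assumptions for illustration only.

```r
set.seed(42)
x <- rnorm(1000)                                      # simulated data

# Quartiles: 3 cut points split the data into 4 groups
quartile_cuts <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Deciles: 9 cut points split the data into 10 groups
decile_cuts <- quantile(x, probs = (1:9) / 10)

# Assign each observation to its quartile group and count group sizes
groups <- cut(x, breaks = c(-Inf, quartile_cuts, Inf))
group_sizes <- as.vector(table(groups))               # 250 in each group
```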

A useful way to combine your mean and standard deviation into one numerical summary is the coefficient of variation (CV) = SD / mean. It expresses the spread relative to the size of the mean, and is used for quality control in manufacturing and in analytical chemistry.
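A short sketch of the CV calculation on illustrative heights. Because the sd and the mean carry the same units, the CV has no units, so measuring the same heights in metres instead of centimetres gives the same value.

```r
heights_cm <- c(154, 157, 159, 167, 192)   # illustrative values
heights_m  <- heights_cm / 100             # same data in metres

cv_cm <- sd(heights_cm) / mean(heights_cm)
cv_m  <- sd(heights_m) / mean(heights_m)

# Often reported as a percentage
cv_percent <- 100 * cv_cm
```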

Additionally, the maximum number and the minimum number provide useful insight. These are just the largest and the smallest number respectively in the data set.
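In R these can be read off with `min()` and `max()`, or both at once with `range()`. The `heights` values are illustrative.

```r
heights <- c(154, 157, 159, 167, 192)   # illustrative values (cm)

smallest <- min(heights)                # 154
largest  <- max(heights)                # 192
both     <- range(heights)              # c(154, 192)
span     <- largest - smallest          # 38
```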

Memorise.

Master.