Variance measures how spread out a dataset is. Calculating it is very helpful when building statistical models. A low variance can alert the statistician to possible over-fitting of the data. Calculating the variance can seem tricky at first, but once you have the final formula, it is straightforward.
In this section, let’s work through how to find the variance of a sample data set.
Most of the time, statisticians only have access to a sample, or subset, of the full data set. For instance, instead of recording the cost of every car in Germany, a statistician could find the cost of a random sample of a few thousand cars. The sample gives an estimate of car prices in Germany, but most likely not the exact figures. So, as the first step, write down the sample data set you want to work with.
Once you have the sample set, write down the sample variance formula. As discussed earlier, the variance tells you how spread out your dataset is. The closer the variance is to zero, the more tightly the data points are clustered together.
The sample variance formula is: s² = ∑(xᵢ – x̅)² / (n – 1)
- s² is the variance. Remember that the variance is always measured in squared units.
- xᵢ represents any term from the sample data set.
- ∑ means “summation”: calculate the term that follows for each value of xᵢ, then add the results together.
- x̅ is the mean of the sample.
- n is the number of data points.
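As a minimal sketch, the formula can be written as a short Python function. The function name sample_variance and the input values are assumptions chosen only for illustration.

```python
def sample_variance(data):
    """Return the sample variance of a list of numbers (divides by n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    mean = sum(data) / n  # x-bar, the mean of the sample
    # (x_i - x-bar)^2 for every data point
    squared_deviations = [(x - mean) ** 2 for x in data]
    # Sum the squared deviations and divide by n - 1
    return sum(squared_deviations) / (n - 1)

# Hypothetical values, used only to illustrate the call.
print(sample_variance([2, 4, 6]))  # 4.0
```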
Now that we have the sample variance formula, the next step is to calculate the mean of the sample. The term x̅, or “x-bar”, refers to the mean of the sample data set. To calculate it, add all the data points together, then divide the result by the number of data points in the sample.
Let’s say we have a sample data set of 6 data points: 17, 15, 23, 7, 9, and 13. To calculate the mean of this sample, we first add all the data points together.
17 + 15 + 23 + 7 + 9 + 13 = 84
Next, we divide the result by the number of data points in the sample to find the mean.
84/6 = 14
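The same mean calculation can be sketched in Python. The variable names here (sample, total, n, mean) are illustrative assumptions.

```python
sample = [17, 15, 23, 7, 9, 13]

total = sum(sample)  # add all the data points together
n = len(sample)      # number of data points
mean = total / n     # x-bar

print(total, n, mean)  # 84 6 14.0
```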
Hence, the mean of the data set is 14, which gives us x̅ in the formula. In the next step, subtract the mean from each data point in the sample to find xᵢ – x̅. After subtracting, we have:
x₁ – x̅ = 17 – 14 = 3
x₂ – x̅ = 15 – 14 = 1
x₃ – x̅ = 23 – 14 = 9
x₄ – x̅ = 7 – 14 = -7
x₅ – x̅ = 9 – 14 = -5
x₆ – x̅ = 13 – 14 = -1
Following the formula, now square each result. After squaring, we have 9, 1, 81, 49, 25, and 1. Now that we have the terms (xᵢ – x̅)², add them up over all values of i:
9 + 1 + 81 + 49 + 25 + 1 = 166
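As a rough sketch, the subtraction, squaring, and summation steps can be reproduced in Python with list comprehensions. The variable names are assumptions made for this example.

```python
sample = [17, 15, 23, 7, 9, 13]
mean = sum(sample) / len(sample)  # 14.0

# Subtract the mean from each data point
deviations = [x - mean for x in sample]
# Square each deviation
squared = [d ** 2 for d in deviations]
# Add the squared deviations together
sum_of_squares = sum(squared)

print(deviations)      # [3.0, 1.0, 9.0, -7.0, -5.0, -1.0]
print(squared)         # [9.0, 1.0, 81.0, 49.0, 25.0, 1.0]
print(sum_of_squares)  # 166.0
```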
Now we have solved the critical part of the formula. In the next step, we divide the result by n – 1. Since n is 6, n – 1 = 5.
Hence, after dividing, the result is 166/5 = 33.2.
That means s² = 33.2, so the variance of the sample set is 33.2.
Taking the square root gives the sample standard deviation: s = √33.2 ≈ 5.76.
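To double-check the hand calculation, one option is Python’s standard statistics module. Its variance function returns the sample variance s² (dividing by n – 1), and stdev returns the square root of that value, the sample standard deviation. The variable names below are illustrative.

```python
import statistics

sample = [17, 15, 23, 7, 9, 13]

variance = statistics.variance(sample)  # sample variance, divides by n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

print(variance)           # 33.2
print(round(std_dev, 2))  # 5.76
```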