General Information
  Home
Author Bio
Product/Purchase Info

Instructor Resources
Registration Required
  Register
Download Instructor Resources

Computer Activities
and Data Sets
  Table of Contents
Excel
Fathom
SPSS

Community
  Contact the Author
Ideas/Comments for Publisher
Testimonials
Coming soon!

Other Key Sites
  Key Curriculum Press

Key College Publishing

Data Matters with Excel®

Activity 6.2

One feature of the standard deviation is most puzzling. In calculating the standard deviation, we get the squared deviations and we summarize them. Rather than simply averaging them, we sum them and divide by one less than the sample size.

Section 6.2 claims that we divide by one less than the sample size rather than by the sample size in order to get a better estimate of the population’s variance (the square of the population’s standard deviation). The reason Section 6.2 claims this is that the sample’s mean is closer to the sample’s proportions than the population’s mean, so the deviations tend to be smaller than what you would see in the total population.

In this project, we are going to check that claim, by using the representative U.S. sample as our population again. Here are the steps.

  1. Pick a numeric variable and get the population variance of that variable. (Remember that these 50,000 are our population for the moment.)
  2. Pick a sample size, take random samples, and calculate the average squared deviation and the variance.
  3. Explore the two statistics to see which does a better job.

Here’s how to do these steps.

Step 1: Pick a numeric variable and get the population variance of that variable.

Open the workbook you used for the project in Section 5.2. (You could also do this project with RepUSSample.xls, but that will require your copying the macro from Section 5.2.) Select an empty cell to the right of the data, type =var( , select the column of the data you are interested in, type ) , and press “Enter.” That puts the variance of your column in that cell.

Step 2: Pick a sample size, take random samples, and calculate the average squared deviation and the variance.

Using the same strategy you used in the Section 5.2 project, you will take samples into your Sample worksheet and calculate and store the statistics you’re interested in on a third worksheet.

Add a worksheet, Stats. Label column A Variance, column B Mean Squared Error, column C Variances, and column D Mean Squared Errors.

Look at the data and note the column of the data you are interested in. In cell A2 in your Stats worksheet, enter the equation =VAR(Sample!J:J) replacing the “J:J” with “C:C” or “F:F” or whatever is the column of the data you’re interested in.

Before you proceed, you need to choose your sample size. Your sample size will have to be greater than 1. There is no variance or standard deviation for samples of size 1. You might also consider that the difference between the mean squared deviation and the standard deviation is smaller with larger sample sizes.

Now put the mean squared deviation in B2. An easy way to get the mean squared deviation is to multiply the variance by one less than the sample size (that gives you the sum of squares) then divide by the sample size (that gives you the mean squared deviation). The formula is =A2*([your sample size]-1)/[your sample size] . (Put in your own sample size where noted in brackets).

Here is the macro that will create the samples and record the variance and mean squared deviation of each sample.

Sub Sampling()
  Randomize Timer
  For r = 2 To 55 ‘Set “55” to the number of samples you want.
    For SampleRow = 2 To 4 ‘Set “4” to one over your sample size.
    dataRow = 2 + Fix(50000 * Rnd)
    Worksheets("Data").Rows(dataRow).Copy
    Worksheets("Sample").Rows(SampleRow).PasteSpecial
  Next SampleRow
  Worksheets("Stats").Cells(r, 3).Value = Cells(2, 1).Value
  Worksheets("Stats").Cells(r, 4).Value = Cells(2, 2).Value
  Next r
End Sub

This is roughly the same macro you have seen before. The difference (in bold) is that two cells get copied to record the statistics.

Step 3: Explore the two statistics to see which does a better job.

Use means, medians, and histograms to explore the variances and the mean squared deviations. Which seems to do a better job estimating the population’s variance?

Something to Think About

Look at the sampling distribution of variances. It isn’t symmetrical. That means that although the mean variance is a good estimate, most of the time the variance is too low. Often it is way too low. That causes special problems, as we will see later.

Save Your Work

In the next project, you will use the workbook you worked with in this project.


©2008 Key College Publishing. All rights reserved.