Data Matters with Fathom! Dynamic Statistics software
Activity 6.2
One feature of the standard deviation is most puzzling. In calculating the standard deviation, we get the squared deviations and we summarize them. Rather than simply averaging them, we sum them and divide by one less than the sample size.
Section 6.2 claims that we divide by one less than the sample size, rather than by the sample size, in order to get a better estimate of the population's variance (the square of the population's standard deviation). The reason Section 6.2 gives is that the sample's mean is closer to the sample's values than the population's mean is, so the deviations from the sample's mean tend to be smaller than the deviations you would see in the total population.
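In symbols, the two statistics and the reason behind the claim can be written as follows. The notation is introduced here for illustration and is not taken from Section 6.2: $x_1, \dots, x_n$ are the sample values, $\bar{x}$ is the sample's mean, and $\mu$ is the population's mean.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{n-1}{n}\,s^2

\sum_{i=1}^{n}(x_i-\mu)^2
  = \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2
  \;\ge\; \sum_{i=1}^{n}(x_i-\bar{x})^2
```

The second line says that squared deviations measured from the sample's mean can never add up to more than squared deviations measured from the population's mean, which is why simply averaging them tends to come out too small.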
In this project, we are going to check that claim by using the representative U.S. sample as our population again. Here are the steps.
- Pick a numeric attribute and get the population variance of that attribute. (Remember that these 50,000 are our population for the moment.)
- Pick a sample size, take random samples, and calculate the average squared deviation and the variance.
- Explore the two statistics to see which does a better job.
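For readers who want to try the same three steps outside Fathom, here is a minimal Python sketch. It is not the Fathom procedure: the population is a made-up, skewed set of 50,000 numbers standing in for the real data, samples are assumed to be drawn with replacement, and names such as `sample_stats` are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Step 1: a hypothetical stand-in population (NOT the Rep US Sample data).
population = [random.expovariate(1 / 40000) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
# Population variance: average squared deviation over the whole population.
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)


def sample_stats(sample):
    """Return (variance, mean squared deviation) for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / (n - 1), sum_sq / n


# Step 2: take many random samples (with replacement, an assumption).
n = 5
results = [sample_stats(random.choices(population, k=n)) for _ in range(20_000)]
variances = [v for v, _ in results]
msds = [m for _, m in results]

# Step 3: compare the average of each statistic with the population variance.
mean_var = statistics.fmean(variances)
mean_msd = statistics.fmean(msds)
print(f"population variance:         {pop_var:,.0f}")
print(f"mean of sample variances:    {mean_var:,.0f}")
print(f"mean of mean squared devs:   {mean_msd:,.0f}")
```

With a small sample size such as 5, the gap between the two averages should be easy to see; try larger values of `n` to watch it shrink.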
Here's how to do these steps.
Step 1: Pick a numeric attribute and get the population variance of that attribute.
Open Rep US Sample. Select Analyze, Estimate Parameters, Empty Estimate, Estimate Mean. Drag the numeric attribute you want to work with onto Attribute (continuous): <unassigned>. The display shows the standard deviation. You can square that to get the variance.
Step 2: Pick a sample size, take random samples, and calculate the average squared deviation and the variance.
Select the Collection, then Analyze, Sample Cases. Select the Sample Collection and press Ctrl-I to get the Collection Inspector. Set your sample size where you want it (but above 1; there is no variance or standard deviation for samples of size 1). Keep in mind that the difference between the mean squared deviation and the variance is smaller with larger sample sizes. Set Animation as you want it. Click on the Measures tab and enter a measure, Var, with the formula sampleVariance([your attribute]), replacing [your attribute] with the name of the attribute whose population variance you found in Step 1.
Enter another new measure. Call this one MSD (for mean squared deviation). Here is an easy way to get the mean squared deviation: Multiply the variance by one less than the sample size. That gives you the sum of squares. Then divide by the sample size. That gives you the mean squared deviation. The formula is sampleVariance([your attribute])*([your sample size] - 1)/[your sample size].
Where noted in brackets, put in your sample size and your attribute name.
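The shortcut for the mean squared deviation can be checked by hand on a tiny sample. The short Python sketch below uses five made-up values (not from the data set) and compares the shortcut with the direct calculation.

```python
import statistics

sample = [12.0, 15.0, 9.0, 20.0, 14.0]  # made-up values for illustration
n = len(sample)

# Variance: sum of squared deviations divided by n - 1.
var = statistics.variance(sample)  # 16.5

# Shortcut: multiply by n - 1 to recover the sum of squares, then divide by n.
msd_shortcut = var * (n - 1) / n

# Direct calculation: average the squared deviations from the sample mean.
mean = sum(sample) / n
msd_direct = sum((x - mean) ** 2 for x in sample) / n  # 13.2

print(var, msd_shortcut, msd_direct)
```

Both routes give the same mean squared deviation, so the one-line formula is enough for the measure.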
To get multiple samples, select Analyze, Collect Measures. Drag a case table onto the workspace to see the measures. Inspect the Measures Collection to set the number of measures and Animation.
Step 3: Explore the two statistics to see which does a better job.
Use means, medians, and histograms to explore the variances and the mean squared deviations. Which seems to do a better job estimating the population's variance?
Something to Think About
Look at the sampling distribution of variances. It isn't symmetrical. That means that although the mean variance is a good estimate, most of the time the variance is too low. Often it is way too low. That causes special problems, as we will see later.
There is a function, populationVariance(), that you could have used in creating the measures. If you add it to the sample's measures, you will see how it compares with the variance and the mean squared deviation.
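The lopsided shape can also be seen in a quick simulation. The Python sketch below is an illustration only: the population is a hypothetical bell-shaped set of 50,000 numbers, not the Rep US Sample, and samples of size 5 are drawn with replacement. It counts how often a sample's variance falls below the population's variance.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical stand-in population (NOT the real data set).
population = [random.gauss(100, 15) for _ in range(50_000)]
pop_var = statistics.pvariance(population)  # population variance (divides by N)

n = 5
variances = [
    statistics.variance(random.choices(population, k=n))  # divides by n - 1
    for _ in range(10_000)
]

# Share of samples whose variance underestimates the population variance.
share_below = sum(v < pop_var for v in variances) / len(variances)

print(f"population variance:       {pop_var:.1f}")
print(f"mean of sample variances:  {statistics.fmean(variances):.1f}")
print(f"median of sample variances: {statistics.median(variances):.1f}")
print(f"share below population variance: {share_below:.2f}")
```

The mean of the variances lands near the population's variance, but the median sits below it, so more than half of the individual samples come out too low.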
©2008 Key College Publishing. All rights reserved.