Data Matters with SPSS®
Activity 3.1
Section 3.1 claims that if we take random samples from a population, 95% of the time, the proportions in those samples will fall within two standard errors of their matching population proportions. That is a 95% prediction interval. A 67% prediction interval runs from one standard error below the population's proportion to one standard error above it.
To test this claim, use the same data set from RepUSSample.sav that you used in Section 2.1.
The project in Section 3.1 requires these four steps.
- Find out some of the population's proportions.
- Pick a sample size and create prediction intervals for random samples' proportions.
- Use your software to take random samples.
- Check what proportion fell within your prediction interval and what proportion fell outside.
Here's how to do each step.
Step 1: Find out some of the population's proportions.
Open the data set for this project, the same one you used in Section 2.1. Find the proportions for the variables that interest you (select Analyze, Descriptive Statistics, Frequencies).
Step 2: Pick a sample size and create prediction intervals for random samples' proportions.
Check Section 3.1 in Data Matters if you're not sure how to do this.
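As a reminder of the arithmetic, here is a minimal worked example. It assumes Section 3.1 uses the usual standard error of a sample proportion; the population proportion p = 0.5 and the sample size n = 100 are hypothetical values chosen only for illustration.

```latex
\[
SE = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5 \times 0.5}{100}} = 0.05
\]
\[
\text{95\% prediction interval: } p \pm 2\,SE = 0.5 \pm 0.10 \;\Rightarrow\; 0.40 \text{ to } 0.60
\]
\[
\text{67\% prediction interval: } p \pm SE = 0.5 \pm 0.05 \;\Rightarrow\; 0.45 \text{ to } 0.55
\]
```

If your own proportion or sample size differs, substitute your values for p and n.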
Step 3: Use your software to take random samples.
To do this, you will randomly sort the observations and add a variable to indicate which sample each person is selected for. Having done that, we will be able to use AGGREGATE to get the proportions in each sample.
This approach has two peculiarities, both of which you can somewhat overcome. One is that this is sampling without replacement; thus, if a person is selected for one sample, that person cannot show up in another sample. This makes the sampling somewhat different from what Section 3.1 discusses. The difference is that the chances of each kind of person don't stay the same on each draw. For example, let's suppose that only one person in our population knows all the words to "America the Beautiful." On the first draw, the chances of getting that person are 1 out of 50,000, because we have 50,000 in our population. If that person is drawn on the first draw, in every subsequent draw, the chances of getting someone like that are 0. On the other hand, if that person is not drawn on the first draw, the chances of drawing that kind of person increase with the second draw, to 1 out of 49,999. On the third draw, the chances increase to 1 out of 49,998, and so on.
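The pattern in that example can be written as one expression. This is only a restatement of the numbers above, with k standing for the draw number (a symbol introduced here for illustration).

```latex
\[
P(\text{drawn on draw } k \mid \text{not drawn on draws } 1, \dots, k-1) = \frac{1}{50{,}000 - (k-1)}
\]
\[
k = 1:\ \frac{1}{50{,}000}, \qquad k = 2:\ \frac{1}{49{,}999}, \qquad k = 3:\ \frac{1}{49{,}998}
\]
```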
The same thing applies at the sample level. Once the person has been drawn for the first sample, the chances that kind of person will be drawn in other samples are 0, and each time the person is not drawn, the chances that the person will be drawn increase.
You could compensate a lot for these peculiarities by copying the data file and adding it to itself many times, so that there are many copies of each person. This is a lot of work for the computer, but I have included some Syntax under "Diminishing the Effect of Sampling Without Replacement" at the end of this section, in case you would like to do this. Following that, there are instructions for a time-consuming but workable approach that draws each sample independently from the full population.
Sorting Randomly
Open the representative U.S. sample data, then get into the data editor.
To sort the observations randomly, you will add a variable that is a random number, then sort the observations by that random number. To add the random number, in the data editor click on Transform, then Compute. Name your new target variable random, enter RV.UNIFORM(0,100) as the Numeric Expression, and click OK. At each step, you can look at the data to see what is happening.
Select Data, Sort Cases. Double-click on random and click OK.
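If you prefer a Syntax window to the menus, the two steps above can be sketched as follows. The commands mirror the ones used in the Syntax programs at the end of this section; treat this as an equivalent sketch rather than the exact text the menus would paste.

```spss
* Add a uniform random number to every row, then sort by it.
COMPUTE random = RV.UNIFORM(0,100).
EXECUTE.
SORT CASES BY random.
```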
Adding the Sample Variable
Before you can add the sample variable, you need to know how many samples you would like to take and your sample size. Unless you follow the instructions at the end of this section, you won't be able to take more samples than 50,000 divided by your sample size. For example, if you use sample sizes of 100, you won't be able to have more than 500 samples. The reason is that, unless you add copies of the data, there are only 50,000 people to create samples from.
Use Transform, Compute to add the sample variable. The target variable is sample.
There is an If requirement. Before you enter a Numeric Expression, click on If, select "Include if case satisfies condition:", and enter an equation so that SPSS doesn't add sample to more people than you want to consider. For example, if you want 50 samples of 10 people each, then you want to consider 500 people, so you would enter the equation $CASENUM <= 500.
You could let SPSS do the multiplying. For example, if you want 200 samples of 14 people each, you could enter the equation $CASENUM <= 200*14. $CASENUM is the number of the row that the data are in. For example, in the first row, $CASENUM equals 1; in the second row, $CASENUM equals 2; and so on. <= means "is less than or equal to." $CASENUM <= 500 selects every row from 1 to 500.
Click on Continue and you're ready to enter your Numeric Expression. If your sample size is 10, the Numeric Expression is TRUNC(($CASENUM-1)/10).
You set the sample size with the last number in that expression. If your sample size is 55, then the expression is TRUNC(($CASENUM-1)/55).
TRUNC() takes whatever is in its parentheses and strips off the numbers to the right of the decimal. For example, TRUNC(32.343) = 32 and TRUNC(2.6) = 2.
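This equation first subtracts 1, then divides by the sample size, then strips off the decimal. For example, say your sample size is 5. For row 1, $CASENUM is 1. We subtract 1 and get 0; divide by 5 and get 0; and strip off the decimal and still have 0. For row 5, we subtract 1 and get 4; divide by 5 and get 0.8; and strip off the decimal and get 0. For row 6, we subtract 1 and get 5; divide by 5 and get 1; and strip off the decimal and still have 1. So rows 1 through 5 form sample 0, rows 6 through 10 form sample 1, and so on.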
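In Syntax form, the If condition and the Numeric Expression combine into a single IF command. The sketch below assumes 50 samples of 10 people each, as in the example above; change both numbers to suit your own plan.

```spss
* Rows 1 to 500 get a sample number from 0 to 49; later rows stay missing.
IF ($CASENUM <= 50*10) sample = TRUNC(($CASENUM-1)/10).
EXECUTE.
```

Rows beyond the first 500 are left with a missing value for sample, which is why they show up as a separate group when you aggregate.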
Look at the sample column of the data to see how things worked out.
To get the proportions, you will have to know what the numeric code is for the measure you are interested in. For example, in Sex (gender), Female is 2, and Male is 1.
Click on Data and Aggregate. Scroll down, select sample, then click on the triangle next to the Break Variable(s) box. Select the variable you are interested in and click on the triangle next to the Aggregate Variable(s) box. Select Function and click on the circle next to Inside (under Percentages). Enter the code of the outcome that interests you into both boxes to the right of Inside. Select Continue, Replace working data file, and click on OK. Be sure to save your original data when prompted because what you're doing actually replaces the file you're working with.
SPSS also returns a proportion for the cases you didn't assign to a sample. It appears at the top and has . (a period) for sample.
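The Aggregate dialog corresponds roughly to the AGGREGATE command sketched below. The variable name sex and the output name sex_pin are assumptions for illustration; substitute the name of your own variable and its code (2 is the code for Female given above).

```spss
* Save the original data first: OUTFILE=* replaces the working data file.
AGGREGATE
  /OUTFILE=*
  /BREAK=sample
  /sex_pin = PIN(sex, 2, 2).
```

PIN returns a percentage from 0 to 100, so compare it with a prediction interval expressed in percent.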
Step 4: Check what proportion fell within your prediction interval and what proportion fell outside.
To sort the samples' proportions, click on Data, then select Sort Cases. Double-click on the name of the percentages variable, then click OK.
Does your prediction interval work correctly? That is, for the 95% prediction interval, do 95% of the samples' proportions fall inside the interval and 5% fall outside? Do those that fall outside the interval fall evenly, with 2.5% on each side? For the two-thirds prediction interval, do two-thirds fall inside? Does one-sixth fall below and one-sixth fall above the interval?
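Rather than counting rows by eye, you could let SPSS tally them. This sketch assumes the aggregated percentages are in a variable named sex_pin and uses hypothetical interval bounds of 40 and 60 percent; replace both with your own variable name and bounds.

```spss
* position is -1 below the interval, 0 inside it, and 1 above it.
* The leftover row with a missing sample number is skipped.
IF (NOT MISSING(sample)) position = 0.
IF (NOT MISSING(sample) AND sex_pin < 40) position = -1.
IF (NOT MISSING(sample) AND sex_pin > 60) position = 1.
EXECUTE.
FREQUENCIES VARIABLES=position.
```

The frequency table then shows what percentage of samples fell below, inside, and above the interval.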
Try other sample sizes. Try other variables and measurements. Does the prediction interval work equally well for all population proportions? What if your prediction interval goes below 0 or above 100%? Is it sensible to adjust to those limits? Should you change the other side of the interval?
Save the Data File
In the next project, you will use the output of your work in this project. Save the data file where you can access it easily.
Diminishing the Effect of Sampling Without Replacement and the Mutual Exclusivity of the Samples
To diminish the effect of sampling without replacement and the fact that each row can appear in only one sample, you could run this Syntax program before doing the analysis.
SAVE OUTFILE='rep.sav'.
ADD FILES /FILE='rep.sav' /FILE=*.
EXECUTE.
Each time you run that program, you cumulatively double the number of appearances of the data. If you run it once, a person in the data could appear in two samples. If you run it twice, that person could appear in four samples. If you run it three times, that person could appear in eight samples.
The problem is that the data file gets larger. After five additions, which brings the data set up to 1,600,000 observations, my computer gets noticeably slow.
A Tiresome Way to Collect Independent Random Samples with Replacement
First, run this program once.
SAVE OUTFILE='rep.sav'.
COMPUTE random = RV.UNIFORM(0,100).
EXECUTE.
SORT CASES BY random.
IF ($CASENUM <= [REPLACE THIS WITH YOUR SAMPLE SIZE]) Sample = 1.
EXECUTE.
FILTER BY Sample.
AGGREGATE
/OUTFILE='proportions.sav'
/BREAK=Sample
/sex_1 = PIN([Your Variable],[Your Outcome],[Your Outcome again]).
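Then edit the program to this. The changes are the first line (GET FILE replaces SAVE OUTFILE), the AGGREGATE output file (now t.sav), and the three lines added at the end.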
GET FILE='rep.sav'.
COMPUTE random = RV.UNIFORM(0,100).
EXECUTE.
SORT CASES BY random.
IF ($CASENUM <= [REPLACE THIS WITH YOUR SAMPLE SIZE]) Sample = 1.
EXECUTE.
FILTER BY Sample.
AGGREGATE
/OUTFILE='t.sav'
/BREAK=Sample
/sex_1 = PIN([Your Variable],[Your Outcome],[Your Outcome again]).
ADD FILES /FILE='proportions.sav' /FILE='t.sav'.
EXECUTE.
SAVE OUTFILE='proportions.sav'.
What's tiresome is that you have to run the program for every sample. If you want 1,000 samples, you have to run the program 1,000 times.
©2008 Key College Publishing. All rights reserved.