Data Matters with SPSS®

Activity 3.1

Section 3.1 claims that if we take random samples from a population, 95% of the time, the proportions in those samples will fall within two standard errors of their matching population proportions. That is a 95% prediction interval. A 67% prediction interval is from one standard error below the population’s proportion to one standard error above it.

To test this claim, use the same data set from RepUSSample.sav that you used in Section 2.1.

The project in Section 3.1 requires these four steps.

  • Find out some of the population’s proportions.
  • Pick a sample size and create prediction intervals for random samples’ proportions.
  • Use your software to take random samples.
  • Check what proportion fell within your prediction interval and what proportion fell outside.

Here’s how to do each step.

Step 1: Find out some of the population’s proportions.

Open the data set for this project, the same one you used in Section 2.1. Find the proportions for the variables that interest you (select Analyze, Descriptive Statistics, Frequencies).

Step 2: Pick a sample size and create prediction intervals for random samples’ proportions.

Check Section 3.1 in Data Matters if you’re not sure how to do this.
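As a rough cross-check on your hand calculation, here is a minimal Python sketch (not SPSS Syntax). It assumes the usual standard error of a sample proportion, the square root of p(1 - p)/n, with p written as a fraction rather than a percentage. Section 3.1 is the authority if its formula differs. The function name and the example values (a population proportion of 0.5 and a sample size of 100) are made up for illustration.

```python
import math

def prediction_interval(p, n, z=2.0):
    """Return (low, high): z standard errors either side of the population proportion p.

    Assumes the standard error of a sample proportion is sqrt(p * (1 - p) / n).
    """
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Illustrative values only: population proportion 0.5, samples of 100 people.
low95, high95 = prediction_interval(0.5, 100)         # two standard errors
low67, high67 = prediction_interval(0.5, 100, z=1.0)  # one standard error
print(f"95%: {low95:.2f} to {high95:.2f}")  # 95%: 0.40 to 0.60
print(f"67%: {low67:.2f} to {high67:.2f}")  # 67%: 0.45 to 0.55
```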

Step 3: Use your software to take random samples.

To do this, you will randomly sort the observations and add a variable to indicate which sample each person is selected for. Having done that, you will be able to use AGGREGATE to get the proportions in each sample.

This approach has two peculiarities, both of which you can somewhat overcome. One is that this is sampling without replacement: once a person has been drawn, that person cannot be drawn again. This makes the sampling somewhat different from what Section 3.1 discusses. The difference is that the chances of drawing each kind of person don’t stay the same from draw to draw. For example, let’s suppose that only one person in our population knows all the words to “America the Beautiful.” On the first draw, the chances of getting that person are 1 out of 50,000, because we have 50,000 people in our population. If that person is drawn on the first draw, then on every subsequent draw the chances of getting someone like that are 0. On the other hand, if that person is not drawn on the first draw, the chances of drawing that kind of person increase on the second draw, to 1 out of 49,999. On the third draw, the chances increase to 1 out of 49,998, and so on.

The other peculiarity is that the same thing applies at the sample level: each person can appear in only one sample. Once the person has been drawn for one sample, the chances that kind of person will be drawn in other samples are 0, and each time the person is not drawn, the chances that the person will be drawn in a later sample increase.
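The changing chances in the “America the Beautiful” example can be worked out with a few lines of Python (a sketch, not SPSS Syntax). The function name is hypothetical. The population size of 50,000 comes from the text, and the calculation assumes the person has not been picked on any earlier draw.

```python
from fractions import Fraction

POPULATION = 50_000  # population size used in the text

def chance_on_draw(draw, population=POPULATION):
    """Chance of picking one particular person on the given draw (counting from 1),
    assuming that person was not picked on any earlier draw."""
    return Fraction(1, population - (draw - 1))

for draw in (1, 2, 3):
    print(draw, chance_on_draw(draw))
# 1 1/50000
# 2 1/49999
# 3 1/49998
```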

You could largely compensate for these peculiarities by copying the data file and adding it to itself many times, so that there are many copies of each person. This is a lot of work for the computer, but I have included some Syntax under “Diminishing the Effect of Sampling Without Replacement…” at the end of this section, in case you would like to do this. Following that, there are instructions for a time-consuming but workable way to draw independent random samples, in which the same person can appear in more than one sample.

Sorting Randomly

Open the representative U.S. sample data, then get into the data editor.

To sort the observations randomly, you will add a variable that is a random number, then sort the observations by that random number. To add the random number, in the data editor click on Transform, then Compute. Name your new target variable random, enter RV.UNIFORM(0,100) as the Numeric Expression, and click OK. After each step, you can look at the data to see what has happened.

Select Data, Sort Cases. Double-click on random and click OK.
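If it helps to see the idea outside SPSS, here is a small Python sketch of the same technique: give every case a random number between 0 and 100, then sort by that number. The function name and the optional seed are assumptions made for illustration. SPSS does the equivalent with RV.UNIFORM and Sort Cases.

```python
import random

def random_sort(cases, seed=None):
    """Return the cases in random order by sorting on a uniform random key."""
    rng = random.Random(seed)
    # Pair each case with a random number, like the new variable named random.
    keyed = [(rng.uniform(0, 100), case) for case in cases]
    # Sort on the random number only, like Sort Cases by random.
    keyed.sort(key=lambda pair: pair[0])
    return [case for _, case in keyed]

shuffled = random_sort(["Ann", "Bob", "Cal", "Dee", "Eve"], seed=1)
print(shuffled)  # the same five people, in a random order
```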

Adding the Sample Variable

Before you can add the sample variable, you need to know how many samples you would like to take and your sample size. Unless you follow the instructions at the end of this section, you won’t be able to take more samples than 50,000 divided by your sample size. For example, if you use sample sizes of 100, you won’t be able to have more than 500 samples. The reason is that, unless you add copies of the data, there are only 50,000 people to create samples from.
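The limit on the number of samples is just a whole-number division. A minimal Python sketch (the function name is hypothetical; 50,000 is the population size from the text):

```python
POPULATION = 50_000  # number of people in the data set, from the text

def max_samples(sample_size, population=POPULATION):
    """Largest number of non-overlapping samples of the given size."""
    return population // sample_size  # whole samples only

print(max_samples(100))  # 500
print(max_samples(55))   # 909
```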

Use Transform, Compute to add the sample variable. The target variable is sample.

There is an If requirement. Before you enter a Numeric Expression, click on If, select Include if case satisfies condition: , and enter an equation so that SPSS doesn’t add sample to more people than you want to consider. For example, if you want 50 samples of 10 people each, then you want to consider 500 people, so you would enter the equation $CASENUM <= 500 .

You could let SPSS do the multiplying. For example, if you want 200 samples of 14 people each, you could enter the equation $CASENUM <= 200*14. $CASENUM is the number of the row that a case is in. For example, in the first row, $CASENUM equals 1; in the second row, $CASENUM equals 2; and so on. “<=” means “is less than or equal to.” $CASENUM <= 500 is every row from 1 to 500.

Click on Continue and you’re ready to enter your Numeric Expression. If your sample size is 10, the “Numeric Expression” is TRUNC(($CASENUM-1)/10) .

You set the sample size with the last number in that expression. If your sample size is 55, then the expression is TRUNC(($CASENUM-1)/55) .

TRUNC() takes whatever is in its parentheses and strips off the numbers to the right of the decimal. For example, TRUNC(32.343) = 32 and TRUNC(2.6) = 2.

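This equation first subtracts 1, then divides by the sample size, then strips off the decimal. For example, say your sample size is 5. For row 1, $CASENUM is 1. We subtract 1 and get 0; divide by 5 and get 0; and strip off the decimal and still have 0. For row 5, we subtract 1 and get 4; divide by 5 and get .8; and strip off the decimal and get 0. For row 6, we subtract 1 and get 5; divide by 5 and get 1; and have nothing to strip off, so row 6 starts the next sample. The samples are therefore numbered from 0, not from 1.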

Look at the sample column of the data to see how things worked out.
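To check your reading of the sample column, here is a short Python sketch of the same arithmetic. It combines the If condition with the TRUNC expression. The function and parameter names are made up for illustration, and None stands in for the “.” that SPSS shows for rows that get no sample.

```python
import math

def sample_number(casenum, sample_size, n_samples):
    """Mirror of: IF $CASENUM <= n_samples*sample_size, sample = TRUNC(($CASENUM-1)/sample_size)."""
    if casenum > n_samples * sample_size:
        return None  # row is outside the rows being considered
    return math.trunc((casenum - 1) / sample_size)

# Sample size 5, 2 samples: rows 1-5 are sample 0, rows 6-10 are sample 1.
for row in (1, 5, 6, 10, 11):
    print(row, sample_number(row, sample_size=5, n_samples=2))
# 1 0
# 5 0
# 6 1
# 10 1
# 11 None
```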

To get the proportions, you will have to know what the numeric code is for the measure you are interested in. For example, in Sex (gender), Female is 2, and Male is 1.

Click on Data and Aggregate. Scroll down, select sample, then click on the triangle next to the Break Variable(s) box. Select the variable you are interested in and click on the triangle next to the Aggregate Variable(s) box. Select Function and click on the circle next to Inside (under Percentages). Enter the code of the outcome that interests you into both boxes to the right of Inside. Select Continue, Replace working data file, and click on OK. Be sure to save your original data when prompted because what you’re doing actually replaces the file you’re working with.

SPSS also returns a row for the cases you didn’t assign to a sample. It appears at the top and has “.” (a period) for sample; leave it out when you check your prediction interval.
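What Aggregate does here can be sketched in Python: group the rows by sample, then find the percentage of each group that has the code you chose. This is an illustration only, with a hypothetical function name and a tiny made-up data set. It assumes there are no missing values and is not a description of how SPSS works internally.

```python
from collections import defaultdict

def percent_with_code(rows, code):
    """rows holds (sample, value) pairs; return {sample: percent of values equal to code}."""
    counts = defaultdict(lambda: [0, 0])  # sample -> [matches, total]
    for sample, value in rows:
        counts[sample][1] += 1
        if value == code:
            counts[sample][0] += 1
    return {s: 100.0 * matches / total for s, (matches, total) in counts.items()}

# Made-up rows: two samples of four people, with 2 = Female and 1 = Male.
rows = [(0, 2), (0, 1), (0, 2), (0, 2), (1, 1), (1, 1), (1, 2), (1, 2)]
print(percent_with_code(rows, code=2))  # {0: 75.0, 1: 50.0}
```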

Step 4: Check what proportion fell within your prediction interval and what proportion fell outside.

To sort the samples’ proportions, click on Data, then select Sort Cases. Double-click on the name of the percentages variable, then click OK.

Does your prediction interval work correctly? That is, for the 95% prediction interval, do 95% of the samples’ proportions fall inside the interval and 5% fall outside? Do those that fall outside the interval fall evenly, with 2.5% on each side? For the two-thirds prediction interval, do two-thirds fall inside? Does one-sixth fall below and one-sixth fall above the interval?
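If you would rather count than eyeball the sorted list, the tally can be sketched in Python as follows. The function name, the five sample proportions, and the interval limits are all invented for illustration. Use your own proportions and the interval you built in Step 2, in the same units (both fractions or both percentages).

```python
def coverage(proportions, low, high):
    """Return the fractions of samples that fall (inside, below, above) the interval."""
    n = len(proportions)
    below = sum(p < low for p in proportions)
    above = sum(p > high for p in proportions)
    inside = n - below - above  # values equal to a limit count as inside
    return inside / n, below / n, above / n

# Invented example: five sample proportions checked against 0.40 to 0.60.
print(coverage([0.38, 0.45, 0.50, 0.52, 0.61], low=0.40, high=0.60))
# (0.6, 0.2, 0.2)
```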

Try other sample sizes. Try other variables and measurements. Does the prediction interval work equally well for all population proportions? What if your prediction interval goes below 0 or above 100%? Is it sensible to adjust to those limits? Should you change the other side of the interval?

Save the Data File

In the next project, you will use the output of your work in this project. Save the data file where you can access it easily.

Diminishing the Effect of Sampling Without Replacement and the Mutual Exclusivity of the Samples

To diminish the effect of sampling without replacement and the fact that each row can appear in only one sample, you could run this Syntax program before doing the analysis.

SAVE OUTFILE = 'rep.sav'.
ADD FILES FILE = 'rep.sav' /FILE = *.
EXECUTE.

Each time you run that program, you cumulatively double the number of appearances of the data. If you run it once, a person in the data could appear in two samples. If you run it twice, that person could appear in four samples. If you run it three times, that person could appear in eight samples.

The problem is that the data file gets larger. After five additions, which brings the data set up to 1,600,000 observations, my computer gets noticeably slow.
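The growth is easy to tabulate. A minimal Python sketch (the function name is hypothetical, and the starting size of 50,000 comes from the text):

```python
def rows_after_doublings(doublings, start=50_000):
    """Number of observations after doubling the data file the given number of times."""
    return start * 2 ** doublings

for k in range(6):
    print(k, 2 ** k, rows_after_doublings(k))
# Columns: runs of the program, copies of each person, observations in the file.
# The last line printed is: 5 32 1600000
```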

A Tiresome Way to Collect Independent Random Samples with Replacement

First, run this program once.

SAVE OUTFILE = 'rep.sav'.
COMPUTE random = RV.UNIFORM(0,100).
EXECUTE.
SORT CASES BY random.
IF ($CASENUM <= [REPLACE THIS WITH YOUR SAMPLE SIZE]) Sample = 1.
EXECUTE.
FILTER BY Sample.
AGGREGATE
  /OUTFILE='proportions.sav'
  /BREAK=Sample
  /sex_1 = PIN([Your Variable],[Your Outcome],[Your Outcome again]).

Then edit the program to this. The changes are the GET FILE command on the first line, the 't.sav' output file in AGGREGATE, and the last three commands.

GET FILE = 'rep.sav'.
COMPUTE random = RV.UNIFORM(0,100).
EXECUTE.
SORT CASES BY random.
IF ($CASENUM <= [REPLACE THIS WITH YOUR SAMPLE SIZE]) Sample = 1.
EXECUTE.
FILTER BY Sample.
AGGREGATE
  /OUTFILE='t.sav'
  /BREAK=Sample
  /sex_1 = PIN([Your Variable],[Your Outcome],[Your Outcome again]).
ADD FILES FILE = 'proportions.sav' /FILE='t.sav'.
EXECUTE.
SAVE OUTFILE = 'proportions.sav'.

What’s tiresome is that you have to run the program for every sample. If you want 1,000 samples, you have to run the program 1,000 times.
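For comparison, here is a Python sketch of what the repeated runs accomplish (it is not SPSS Syntax). Each pass draws a fresh sample from the whole population, so a person can turn up in more than one sample but at most once within a sample. The function name, the seed, and the made-up population of codes are assumptions for illustration.

```python
import random

def independent_sample_percents(values, code, sample_size, n_samples, seed=None):
    """Draw n_samples independent samples and return the percent with the code in each."""
    rng = random.Random(seed)
    percents = []
    for _ in range(n_samples):
        # One run of the program: re-randomize and keep the first sample_size people.
        sample = rng.sample(values, sample_size)
        matches = sum(v == code for v in sample)
        percents.append(100.0 * matches / sample_size)
    return percents

# Made-up population: 520 people coded 2 and 480 coded 1.
population = [2] * 520 + [1] * 480
percents = independent_sample_percents(population, code=2, sample_size=10,
                                       n_samples=1000, seed=3)
print(len(percents))  # 1000
```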


©2008 Key College Publishing. All rights reserved.