General Information
  Home
Author Bio
Product/Purchase Info

Instructor Resources
Registration Required
  Register
Download Instructor Resources

Computer Activities
and Data Sets
  Table of Contents
Excel
Fathom
SPSS

Community
  Contact the Author
Ideas/Comments for Publisher
Testimonials
Coming soon!

Other Key Sites
  Key Curriculum Press

Key College Publishing

Data Matters with Excel®

Activity 10.2

Violations of Pearson’s assumptions of constant variance and normality can wreck Pearson’s correlation test. How much can they mess it up? In this project, you will find out.

You will set Excel to perform significance tests for correlation when the null hypothesis is true, then keep track of how often the true null hypothesis is rejected. You will do this by first setting up Excel to test distributions, then trying a variety of distributions to see how they do. (As before, until you put data in, the cells with equations may show “DIV/0!” markers.)

Setting Excel to Generate

Set up a worksheet that looks like this.

A

B C D E F

1

X

Y r r values Sig. Null Rejection Proportion

2

 

  =CORREL(A2:A5,B2:B5)     =COUNTIF(E:E,TRUE)/(COUNTA(E:E)-1)

You will be using sets with four observations in each. With four observations, a correlation coefficient of .95 has a p-value of .05, so any r over .95 or below –.95 is statistically significant.

To start, fill in A2 to A5 and B2 to B5 with =RAND()+RAND()+RAND()+RAND()+RAND()+RAND()+
RAND()+RAND()+RAND()+RAND()
. That will provide roughly normally distributed data.

Use this macro to create samples, calculate r, record r, and record whether r is significant. The formula in F2 will show what proportion of the r-values indicated that the null hypothesis should be rejected.

Sub Sample()
  For SampleRow = 2 To 1001
    Calculate
    Cells(SampleRow, 4).Value = Cells(2, 3).Value
    If WorksheetFunction.Count(Cells(SampleRow, 4)) = 1 Then
      Cells(SampleRow, 5).Value = Abs(Cells(SampleRow, 4).Value) > 0.95
    Else
      Cells(SampleRow, 5).Value = ““
    End If
  Next SampleRow
End Sub

That macro is built out of things you have seen in several previous macros. What is new are the lines If . . . Then, Else, and End If.

If-Then/Else/End If works like this: Excel evaluates what is between the If and the Then. If it is true, then Excel does the steps before the Else. If it is false, then Excel does the steps after the Else.

The If-Then/Else/End If code is in the macro in case you create a formula that creates data for which no correlation can be calculated.

Try Other Distributions

Try other formulas to create the x- and y-values. All of your formulas need to start with one RAND() or more, then modify what comes from RAND(). For a list of available functions, you can click on the fx button at the top of the screen.

The only requirements are that the y-values cannot be built from the x-values and the x-values cannot have the y-values in their formulas. That way, they are independent and the null hypothesis is true.

Rerun the macro to see how often Pearson’s correlation test rejects the true null hypotheses.

How do the violations of Pearson’s assumptions affect the Pearson correlation test?

Testing Independence

Pearson’s correlation test also assumed independence. Independence means that you can’t predict a variable’s observation from the other observations of that variable. When independence is violated, Pearson’s correlation test can be very unreliable.

For this project, you only need some data and a scatter plot.

Set up a worksheet like that looks this.

 

A

B

   1   

   X   

Y

2 1 1
3 2 =B2+RAND()-0.5

4

3

=B3+RAND()-0.5

Select A4 to B4, click on the little black box, and drag down about 50 rows. Those formulas set x to be counting numbers.

y is the previous y-value plus a random number. The random number has a mean of 0. RAND() produces random digits from 0 to 1, with all decimals equally likely. RAND()’s mean is .5, so when you subtract .5 at the end of the formula, on average, you add nothing to y.

That formula for y sets y to have a random shift. On the average, the changes in y will equal 0. On average, there is no trend and no correlation with x.

Select columns A and B, click on the chart wizard, select XY (Scatter) and one of the line scatter plots. Click on Finish.

Press “F9.” That causes Excel to re-create the values in the spreadsheet. Keep pressing “F9” until you can see that on average, there is no correlation between x and y.

In this part of the project, there are violations of independence. What effect would a lack of independence have on regression analysis?

Most analysis of time-related data, like stock prices, shows that observations are not independent. For example, today’s stock price is predictable from yesterday’s stock price. How does that affect correlational studies that try to predict something from the date?


©2008 Key College Publishing. All rights reserved.