Data Matters with Excel®
Activity 10.2
Violations of Pearsons assumptions of constant variance and normality can wreck Pearsons correlation test. How much can they mess it up? In this project, you will find out.
You will set Excel to perform significance tests for correlation when the null hypothesis is true, then keep track of how often the true null hypothesis is rejected. You will do this by first setting up Excel to test distributions, then trying a variety of distributions to see how they do. (As before, until you put data in, the cells with equations may show DIV/0! markers.)
Setting Excel to Generate
Set up a worksheet that looks like this.
|
|
A
|
B |
C |
D |
E |
F |
|
1
|
X
|
Y |
r |
r values |
Sig. |
Null Rejection Proportion |
|
2
|
|
|
=CORREL(A2:A5,B2:B5) |
|
|
=COUNTIF(E:E,TRUE)/(COUNTA(E:E)-1) |
You will be using sets with four observations in each. With four observations, a correlation coefficient of .95 has a p-value of .05, so any r over .95 or below .95 is statistically significant.
To start, fill in A2 to A5 and B2 to B5 with =RAND()+RAND()+RAND()+RAND()+RAND()+RAND()+
RAND()+RAND()+RAND()+RAND() . That will provide roughly normally distributed data.
Use this macro to create samples, calculate r, record r, and record whether r is significant. The formula in F2 will show what proportion of the r-values indicated that the null hypothesis should be rejected.
Sub Sample()
For SampleRow = 2 To 1001
Calculate
Cells(SampleRow, 4).Value = Cells(2, 3).Value
If WorksheetFunction.Count(Cells(SampleRow, 4)) = 1 Then
Cells(SampleRow, 5).Value = Abs(Cells(SampleRow, 4).Value) > 0.95
Else
Cells(SampleRow, 5).Value =
End If
Next SampleRow
End Sub
|
That macro is built out of things you have seen in several previous macros. What is new are the lines If . . . Then, Else, and End If.
If-Then/Else/End If works like this: Excel evaluates what is between the If and the Then. If it is true, then Excel does the steps before the Else. If it is false, then Excel does the steps after the Else.
The If-Then/Else/End If code is in the macro in case you create a formula that creates data for which no correlation can be calculated.
Try Other Distributions
Try other formulas to create the x- and y-values. All of your formulas need to start with one RAND() or more, then modify what comes from RAND(). For a list of available functions, you can click on the fx button at the top of the screen.
The only requirements are that the y-values cannot be built from the x-values and the x-values cannot have the y-values in their formulas. That way, they are independent and the null hypothesis is true.
Rerun the macro to see how often Pearsons correlation test rejects the true null hypotheses.
How do the violations of Pearsons assumptions affect the Pearson correlation test?
Testing Independence
Pearsons correlation test also assumed independence. Independence means that you cant predict a variables observation from the other observations of that variable. When independence is violated, Pearsons correlation test can be very unreliable.
For this project, you only need some data and a scatter plot.
Set up a worksheet like that looks this.
|
|
A |
B
|
|
1
|
X |
Y
|
| 2 |
1 |
1 |
| 3 |
2 |
=B2+RAND()-0.5 |
|
4
|
3 |
=B3+RAND()-0.5
|
Select A4 to B4, click on the little black box, and drag down about 50 rows. Those formulas set x to be counting numbers.
y is the previous y-value plus a random number. The random number has a mean of 0. RAND() produces random digits from 0 to 1, with all decimals equally likely. RAND()s mean is .5, so when you subtract .5 at the end of the formula, on average, you add nothing to y.
That formula for y sets y to have a random shift. On the average, the changes in y will equal 0. On average, there is no trend and no correlation with x.
Select columns A and B, click on the chart wizard, select XY (Scatter) and one of the line scatter plots. Click on Finish.
Press F9. That causes Excel to re-create the values in the spreadsheet. Keep pressing F9 until you can see that on average, there is no correlation between x and y.
In this part of the project, there are violations of independence. What effect would a lack of independence have on regression analysis?
Most analysis of time-related data, like stock prices, shows that observations are not independent. For example, todays stock price is predictable from yesterdays stock price. How does that affect correlational studies that try to predict something from the date?
| ©2008 Key College Publishing. All rights reserved. |
|