Data Matters with Excel®
Activity 6.1
The task for this project is exploring the data in the dataset that was used for the examples in Section 6.1 of Data Matters.
These are the exploratory data analysis tools presented in Section 6.1.
- Histogram
- Mean
- Median
- Minimum
- Maximum
- Stem-and-leaf
- Box plots
In previous projects, you have made histograms and found means and medians. Minimums can be gotten the same way as the medians, by sorting and scrolling through to find them. These instructions cover box plots and stem-and-leaf plots, relying on two macros that are in the workbook.
Box Plots
Neville Hunt of Coventry University explains how to get a box plot from Excel on his web page, http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm. The explanation presented here is a small variant on his instructions.
The first step is to get the summary values (minimum, first quartile, median, third quartile, and maximum). We will arrange those in a table, then create the box plot from that table.
Here is what the table looks like.
|
Statistic
|
Category 1 |
Category 2
|
|
First Quartile
|
|
|
|
Minimum in Whisker
|
|
|
|
Median
|
|
|
|
Maximum in Whisker
|
|
|
|
Third Quartile
|
|
|
These instructions are based on getting the table into its own worksheet, Stats. Add and name that worksheet, then label the rows as in the table above. Any number of categories could be listed across the top. Before filling in the table, sort the data in the Data worksheet by both the categorical variable and the continuous variable. For example, lets say you wanted to get box plots of ages for each gender. You would select the Gender and Age columns, then click on Data, Sort. Select Gender in the top of the sort window, then Age, so that it reads Sort by Gender, Then by Age, and click OK.
You can scroll through the data to find the statistics, or you can run the following long macro. If you are finding the statistics by hand, remember that in a box plot, outliers are not included as part of the whiskers. An outlier is any value whose distance to its closest quartile is more than one and a half times the distance between the first and third quartile.
The following macro already appears in the Data Used for 6.1 workbook. To read and understand it, you need to know that in any line, characters (letters, words, numbers, symbols, and so on) that appear to the right of a single quote are ignored by Excel. Everything that begins with a single quote is a comment designed to explain what the macro is doing.
Run the macro, then create a box plot from the table. The instructions for creating a box plot are under Getting the Plots, following the macro.
Sub getStatsForBoxPlot()
This macro assumes that the data has already been sorted
by the categorical measure and then by the numeric measurement.
Before you run this macro, you need to do that sorting by clicking on
Data, then Sort in the worksheet.
The two lines below point the macro at the columns of the data for the box plot.
Change these lines so that categoryColumn is the number of the column
that has your categorical measurements and so that numericColumn is the number of
the column that has your numeric measurements.
categoryColumn = 2
numericColumn = 1
The following three lines start the macro out assuming that
the first row of data for the first category is on line 2,
that the last row is on line 3, and that the column for saving
the statistics for the first category is column 2.
firstRow = 2
lastRow = 3
col = 2
The macro code proceeds with an assumption that cells are on the Data worksheet.
The following line sets Excel to use the Data worksheet as its default.
Worksheets("Data").Activate
A pair of Do-Loop statements directs Excel to repeat the code between them.
In this case, the code is repeated until the first row of the next category is blank.
Do Until Cells(firstRow, 1).Value = ""
The name of the category is whatever is in the first row of data for the
current category.
Category = Cells(firstRow, categoryColumn).Value
Record the name of the category in the first row of the Stats worksheet.
Worksheets("Stats").Cells(1, col).Value = Category
Add to lastRow until the row after no longer lists the current category.
Do Until Cells(lastRow + 1, categoryColumn).Value <> Category
lastRow = lastRow + 1
Loop
Using Excels built-in median function, get the median.
Median = WorksheetFunction.Median(Range(Cells(firstRow, numericColumn), Cells(lastRow, numericColumn)))
Record the median in the fourth row of this categorys column on the Stats worksheet.
Worksheets("Stats").Cells(4, col).Value = Median
Get the first quartile (the median of the bottom half of the data).
medianRow = Fix((firstRow + lastRow) / 2)
firstQuartile=WorksheetFunction.Median(Range(Cells(firstRow, numericColumn),
Cells(medianRow, numericColumn)))
Record the first quartile.
Worksheets("Stats").Cells(2, col).Value = firstQuartile
Get and record the third quartile.
medianRow = Fix((firstRow + lastRow + 0.9) / 2)
thirdQuartile=WorksheetFunction.Median(Range(Cells(medianRow, numericColumn), Cells(lastRow, numericColumn)))
Worksheets("Stats").Cells(6, col).Value = thirdQuartile
Calculate minimumCutOff and maximumCutOff.
They are the minimum and maximum values that are able to be plotted.
The IQR is the distance between the first and third quartile.
IQR = thirdQuartile - firstQuartile
minimumCutOff = firstQuartile - 1.5 * IQR
Any value below the minimumCutOff is an outlier and will not be
included in the box plot.
maximumCutOff = thirdQuartile + 1.5 * IQR
Any value above the maximumCutOff is an outlier and will not be
included in the box plot.
Find the minimum for the box plot.
Start by guessing that the minimum value is the minimum.
minimum = Cells(firstRow, numericColumn).Value
Set minRow to the row we got the first guess from.
minRow = firstRow
Follow the steps inside the Do-Loop statements
until the minimum is greater than or equal to the minimumCutOff.
If the minimum starts out greater than or equal to the minimumCutOff,
the steps are not done at all.
Do Until minimum >= minimumCutOff
minRow = minRow + 1
minimum = Cells(minRow, numericColumn).Value
Loop
Record the minimum.
Worksheets("Stats").Cells(3, col).Value = minimum
Using essentially the same steps, find the maximum for the box plot.
Maximum = Cells(lastRow, numericColumn).Value
maxRow = lastRow
Do Until Maximum <= maximumCutOff
maxRow = maxRow - 1
Maximum = Cells(maxRow, numericColumn).Value
Loop
Worksheets("Stats").Cells(5, col).Value = Maximum
Now set things so that the next categorys statistics can be gathered and recorded.
Set the column for recording one column to the right.
col = col + 1
Set the first row of data to one row after the previous last row.
firstRow = lastRow + 1
Set the last row of data to one row after the new first row.
lastRow = firstRow + 1
Go back to the Do statement and repeat these steps (if there is data in the new first row).
Loop
End Sub
|
Getting the Plots
If there are more rows than columns, Excels chart wizard assumes that each line should be drawn for data in a single column. But for this project, you need the wizard to make one line for each row. To make this clear to the chart wizard, copy the second column and paste into columns D, E, and F to the right of the data. You will create the chart, then delete those pasted data columns from the worksheet, which will delete them from the chart.
Select all of the data in the table as well as the new columns that you just pasted. Click on the chart wizard button. Select Line and click on Finish. A chart appears.
On the worksheet, select just the cells that contain the data you pasted in to guide the chart wizard. Right-click on the selected cells. Select Delete, Shift Cells Left. Click OK.
The chart now looks something like this chart.
On each line, right-click on the line and select Format Data Series, Patterns. Set Line to None and click OK.
Now your chart looks something like this.
Right-click on any of the plotted dots, select Format Data Series, Options. Check the boxes for High-Low lines and Up-Down bars, then click OK. Theres your box plot. It looks something like this.
You can edit it, changing options and formatting to make it easier to read, but you dont have to.
A Stem-and-Leaf Plot
Stem-and-leaf plots were first developed to facilitate by-hand exploratory data analysis. If you would like one, the following macro, which is also in the Dataset for 6.1 workbook, will create a stem-and-leaf plot for you.
Before you can run the macro, you need to sort the data by whichever column you will get the stem-and-leaf from; add a new worksheet, named Stem; and edit the first line of the macro so that dataColumn is the column that has the data you sorted.
Stem-and-Leaf Macro
Sub StemAndLeaf()
dataColumn = 1
Clean everything out of the Stem worksheet.
Worksheets("Stem").Cells.Clear
Look at the Data worksheet.
Worksheets("Data").Activate
Find the maximum value.
rowPointer = 2
Do Until Cells(rowPointer, 1).Value = ""
rowPointer = rowPointer + 1
Loop
Maximum = Cells(rowPointer - 1, dataColumn).Value
Set the divisor to strip off leaves.
divisor = 1
Do Until Maximum / divisor <= 10
divisor = divisor * 10
Loop
If the first digit of the largest value is less than 5, then
use a smaller divisor.
Otherwise you could end up with four or fewer rows in the plot.
If Fix(Maximum / divisor) < 5 Then divisor = divisor * 10
Calculate the top stems value.
topStem = Fix(Maximum / divisor)
Set up the Stem worksheet.
Worksheets("Stem").Activate
Cells(1, 1).Value = "Count"
Cells(1, 2).Value = "Stem"
Cells(1, 3).Value = "Leaves"
For rowPointer = 2 To topStem + 2
Cells(rowPointer, 2).Value = rowPointer - 2
Cells(rowPointer, 3).Value = "|"
Next rowPointer
Calculate the counts.
The following code is slower than it needs to be,
but a faster code would be harder to read and understand.
Worksheets("Data").Activate
rowPointer = 2
Do Until Cells(rowPointer, dataColumn).Value = ""
measurement = Cells(rowPointer, dataColumn).Value
stem = Fix(measurement / divisor)
Worksheets("Stem").Cells(stem + 2, 1).Value=Worksheets("Stem").Cells(stem + 2, 1).Value+1
rowPointer = rowPointer + 1
Loop
Calculate the shrink factor.
Worksheets("Stem").Activate
maximumCount = 0
For rowPointer = 2 To topStem + 2
If Cells(rowPointer, 1).Value > maximumCount Then
maximumCount = Cells(rowPointer, 1).Value
End If
Next rowPointer
shrinkFactor = Fix(maximumCount / 50)
If shrinkFactor < 1 Then shrinkFactor = 1
Cells(1, 4).Value = "Each digit represents" + Str(shrinkFactor) + "cases."
Return to the data, and fill the leaves in light of the values in the data.
Worksheets("Data").Activate
rowPointer = 2
Do Until Cells(rowPointer, dataColumn).Value = ""
measurement = Cells(rowPointer, dataColumn).Value
stem = Fix(measurement / divisor)
leaf = measurement - stem * divisor
leaf = Fix(leaf * 10 / divisor)
Worksheets("Stem").Cells(stem+2,3).Value = Worksheets("Stem").Cells(stem+2,3).Value+Trim(Str(leaf))
rowPointer = rowPointer + shrinkFactor
Loop
Get to the Stem worksheet.
Worksheets("Stem").Activate
End Sub
|
| ©2008 Key College Publishing. All rights reserved. |
|