General Information
  Home
Author Bio
Product/Purchase Info

Instructor Resources
Registration Required
  Register
Download Instructor Resources

Computer Activities
and Data Sets
  Table of Contents
Excel
Fathom
SPSS

Community
  Contact the Author
Ideas/Comments for Publisher
Testimonials
Coming soon!

Other Key Sites
  Key Curriculum Press

Key College Publishing

Data Matters with Excel®

Activity 6.1

The task for this project is exploring the data in the dataset that was used for the examples in Section 6.1 of Data Matters.

These are the exploratory data analysis tools presented in Section 6.1.

  • Histogram
  • Mean
  • Median
  • Minimum
  • Maximum
  • Stem-and-leaf
  • Box plots

In previous projects, you have made histograms and found means and medians. Minimums can be gotten the same way as the medians, by sorting and scrolling through to find them. These instructions cover box plots and stem-and-leaf plots, relying on two macros that are in the workbook.

Box Plots

Neville Hunt of Coventry University explains how to get a box plot from Excel on his web page, http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm. The explanation presented here is a small variant on his instructions.

The first step is to get the summary values (minimum, first quartile, median, third quartile, and maximum). We will arrange those in a table, then create the box plot from that table.

Here is what the table looks like.

Statistic

Category 1

Category 2

First Quartile

Minimum in Whisker

Median

Maximum in Whisker

Third Quartile

These instructions are based on getting the table into its own worksheet, Stats. Add and name that worksheet, then label the rows as in the table above. Any number of categories could be listed across the top. Before filling in the table, sort the data in the Data worksheet by both the categorical variable and the continuous variable. For example, let’s say you wanted to get box plots of ages for each gender. You would select the Gender and Age columns, then click on Data, Sort. Select Gender in the top of the sort window, then Age, so that it reads Sort by Gender, Then by Age, and click OK.

You can scroll through the data to find the statistics, or you can run the following long macro. If you are finding the statistics by hand, remember that in a box plot, outliers are not included as part of the whiskers. An outlier is any value whose distance to its closest quartile is more than one and a half times the distance between the first and third quartile.

The following macro already appears in the Data Used for 6.1 workbook. To read and understand it, you need to know that in any line, characters (letters, words, numbers, symbols, and so on) that appear to the right of a single quote are ignored by Excel. Everything that begins with a single quote is a comment designed to explain what the macro is doing.

Run the macro, then create a box plot from the table. The instructions for creating a box plot are under “Getting the Plots,” following the macro.

Sub getStatsForBoxPlot()

‘This macro assumes that the data has already been sorted
‘by the categorical measure and then by the numeric measurement.
‘Before you run this macro, you need to do that sorting by clicking on
‘Data, then Sort in the worksheet.

‘The two lines below point the macro at the columns of the data for the box plot.
‘Change these lines so that categoryColumn is the number of the column
‘that has your categorical measurements and so that numericColumn is the number of
‘the column that has your numeric measurements.
categoryColumn = 2
numericColumn = 1

‘The following three lines start the macro out assuming that
‘the first row of data for the first category is on line 2,
‘that the last row is on line 3, and that the column for saving
‘the statistics for the first category is column 2.
firstRow = 2
lastRow = 3
col = 2

‘The macro code proceeds with an assumption that cells are on the Data worksheet.
‘The following line sets Excel to use the Data worksheet as its default.
Worksheets("Data").Activate

‘A pair of Do-Loop statements directs Excel to repeat the code between them.
‘In this case, the code is repeated until the first row of the next category is blank.
Do Until Cells(firstRow, 1).Value = ""

‘The name of the category is whatever is in the first row of data for the
‘current category.
Category = Cells(firstRow, categoryColumn).Value

‘Record the name of the category in the first row of the Stats worksheet.
Worksheets("Stats").Cells(1, col).Value = Category

‘Add to lastRow until the row after no longer lists the current category.
Do Until Cells(lastRow + 1, categoryColumn).Value <> Category
lastRow = lastRow + 1
Loop

‘Using Excel’s built-in median function, get the median.
Median = WorksheetFunction.Median(Range(Cells(firstRow, numericColumn), Cells(lastRow, numericColumn)))

‘Record the median in the fourth row of this category’s column on the Stats worksheet.
Worksheets("Stats").Cells(4, col).Value = Median

‘Get the first quartile (the median of the bottom half of the data).
medianRow = Fix((firstRow + lastRow) / 2)
firstQuartile=WorksheetFunction.Median(Range(Cells(firstRow, numericColumn),
Cells(medianRow, numericColumn)))

‘Record the first quartile.
Worksheets("Stats").Cells(2, col).Value = firstQuartile

‘Get and record the third quartile.
medianRow = Fix((firstRow + lastRow + 0.9) / 2)
thirdQuartile=WorksheetFunction.Median(Range(Cells(medianRow, numericColumn), Cells(lastRow, numericColumn)))
Worksheets("Stats").Cells(6, col).Value = thirdQuartile

‘Calculate minimumCutOff and maximumCutOff.
‘They are the minimum and maximum values that are able to be plotted.
‘The IQR is the distance between the first and third quartile.
IQR = thirdQuartile - firstQuartile

minimumCutOff = firstQuartile - 1.5 * IQR
‘Any value below the minimumCutOff is an outlier and will not be
‘included in the box plot.

maximumCutOff = thirdQuartile + 1.5 * IQR
‘Any value above the maximumCutOff is an outlier and will not be
‘included in the box plot.

‘Find the minimum for the box plot.
‘Start by guessing that the minimum value is the minimum.
minimum = Cells(firstRow, numericColumn).Value
‘Set minRow to the row we got the first guess from.
minRow = firstRow

‘Follow the steps inside the Do-Loop statements
‘until the minimum is greater than or equal to the minimumCutOff.
‘If the minimum starts out greater than or equal to the minimumCutOff,
‘the steps are not done at all.
Do Until minimum >= minimumCutOff
  minRow = minRow + 1
  minimum = Cells(minRow, numericColumn).Value
Loop

‘Record the minimum.
Worksheets("Stats").Cells(3, col).Value = minimum

‘Using essentially the same steps, find the maximum for the box plot.
Maximum = Cells(lastRow, numericColumn).Value
maxRow = lastRow
Do Until Maximum <= maximumCutOff
  maxRow = maxRow - 1
  Maximum = Cells(maxRow, numericColumn).Value
Loop
Worksheets("Stats").Cells(5, col).Value = Maximum

‘Now set things so that the next category’s statistics can be gathered and recorded.
‘Set the column for recording one column to the right.
col = col + 1

‘Set the first row of data to one row after the previous last row.
firstRow = lastRow + 1
‘Set the last row of data to one row after the new first row.
lastRow = firstRow + 1

‘Go back to the Do statement and repeat these steps (if there is data in the new first row).
Loop

End Sub

Getting the Plots

If there are more rows than columns, Excel’s chart wizard assumes that each line should be drawn for data in a single column. But for this project, you need the wizard to make one line for each row. To make this clear to the chart wizard, copy the second column and paste into columns D, E, and F to the right of the data. You will create the chart, then delete those pasted data columns from the worksheet, which will delete them from the chart.

Select all of the data in the table as well as the new columns that you just pasted. Click on the chart wizard button. Select Line and click on Finish. A chart appears.

On the worksheet, select just the cells that contain the data you pasted in to guide the chart wizard. Right-click on the selected cells. Select Delete, Shift Cells Left. Click OK.

The chart now looks something like this chart.

On each line, right-click on the line and select Format Data Series, Patterns. Set Line to None and click OK.

Now your chart looks something like this.

Right-click on any of the plotted dots, select Format Data Series, Options. Check the boxes for High-Low lines and Up-Down bars, then click OK. There’s your box plot. It looks something like this.

You can edit it, changing options and formatting to make it easier to read, but you don’t have to.

A Stem-and-Leaf Plot

Stem-and-leaf plots were first developed to facilitate by-hand exploratory data analysis. If you would like one, the following macro, which is also in the Dataset for 6.1 workbook, will create a stem-and-leaf plot for you.

Before you can run the macro, you need to sort the data by whichever column you will get the stem-and-leaf from; add a new worksheet, named Stem; and edit the first line of the macro so that dataColumn is the column that has the data you sorted.

Stem-and-Leaf Macro

Sub StemAndLeaf()
dataColumn = 1

‘Clean everything out of the Stem worksheet.
Worksheets("Stem").Cells.Clear

‘Look at the Data worksheet.
Worksheets("Data").Activate

‘Find the maximum value.
rowPointer = 2
Do Until Cells(rowPointer, 1).Value = ""
  rowPointer = rowPointer + 1
Loop
Maximum = Cells(rowPointer - 1, dataColumn).Value

‘Set the divisor to strip off leaves.
divisor = 1
Do Until Maximum / divisor <= 10
  divisor = divisor * 10
Loop

‘If the first digit of the largest value is less than 5, then
‘use a smaller divisor.
‘Otherwise you could end up with four or fewer rows in the plot.
If Fix(Maximum / divisor) < 5 Then divisor = divisor * 10

‘Calculate the top stem’s value.
topStem = Fix(Maximum / divisor)

‘Set up the Stem worksheet.
Worksheets("Stem").Activate
Cells(1, 1).Value = "Count"
Cells(1, 2).Value = "Stem"
Cells(1, 3).Value = "Leaves"
For rowPointer = 2 To topStem + 2
  Cells(rowPointer, 2).Value = rowPointer - 2
  Cells(rowPointer, 3).Value = "|"
Next rowPointer

‘Calculate the counts.
‘The following code is slower than it needs to be,
‘but a faster code would be harder to read and understand.
Worksheets("Data").Activate
rowPointer = 2
Do Until Cells(rowPointer, dataColumn).Value = ""
  measurement = Cells(rowPointer, dataColumn).Value
  stem = Fix(measurement / divisor)
  Worksheets("Stem").Cells(stem + 2, 1).Value=Worksheets("Stem").Cells(stem + 2, 1).Value+1
  rowPointer = rowPointer + 1
Loop

‘Calculate the shrink factor.
Worksheets("Stem").Activate
maximumCount = 0
For rowPointer = 2 To topStem + 2
  If Cells(rowPointer, 1).Value > maximumCount Then
    maximumCount = Cells(rowPointer, 1).Value
  End If
Next rowPointer

shrinkFactor = Fix(maximumCount / 50)
If shrinkFactor < 1 Then shrinkFactor = 1
Cells(1, 4).Value = "Each digit represents" + Str(shrinkFactor) + "cases."

‘Return to the data, and fill the leaves in light of the values in the data.
Worksheets("Data").Activate
rowPointer = 2
Do Until Cells(rowPointer, dataColumn).Value = ""
  measurement = Cells(rowPointer, dataColumn).Value
  stem = Fix(measurement / divisor)
  leaf = measurement - stem * divisor
  leaf = Fix(leaf * 10 / divisor)

  Worksheets("Stem").Cells(stem+2,3).Value = Worksheets("Stem").Cells(stem+2,3).Value+Trim(Str(leaf))
  rowPointer = rowPointer + shrinkFactor
  Loop

‘Get to the Stem worksheet.
Worksheets("Stem").Activate
End Sub


©2008 Key College Publishing. All rights reserved.