Descriptive Statistics Data Analysis
Prior to beginning work on this assignment, review Chapter 1 and Chapter 2 in your course textbook, Chapter 3 in the Jarman e-book, and the Week 1 Instructor Guidance, and watch the Module 2: Describing Data video and the Khan Academy video on Interquartile Range (IQR). Also, complete the Week 1 learning activity and Week 1 weekly review.
This exercise requires the use of a descriptive statistics calculator. You can find this tool in some versions of Excel (as part of the Analysis ToolPak), or you can use one of the many free online descriptive calculators, such as the Descriptive Statistics Calculator by Calculator Soup.
Your instructor will post an announcement with the data set for your Week 1 assignment.
First, use either Excel or the Calculator Soup descriptive statistics calculator to calculate the descriptive statistics for the given data set. This is explained in Chapter 1 of your course text.
You should get an output similar to the image in Figure 1.3 from your textbook. This output must contain the following values: mean, standard error, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, and count.
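For readers who want to cross-check the Excel or Calculator Soup output, the following is a minimal Python sketch that produces the same thirteen values. The data set is hypothetical (the real one comes from your instructor's announcement), and the skewness and kurtosis lines assume the sample-adjusted formulas that Excel's SKEW and KURT functions use; other tools may report slightly different values.

```python
import math
import statistics

# Hypothetical data set; substitute the values your instructor posts.
data = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]


def describe(values):
    """Return the descriptive statistics named in the assignment (needs n >= 4)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    # Standardized deviations, reused for skewness and kurtosis.
    z = [(x - mean) / sd for x in values]
    # Assumption: sample-adjusted formulas, as in Excel's SKEW and KURT.
    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "mean": mean,
        "standard error": sd / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),  # list, so ties are all reported
        "standard deviation": sd,
        "sample variance": statistics.variance(values),
        "kurtosis": kurtosis,
        "skewness": skewness,
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


stats = describe(data)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Reporting the mode as a list makes it easy to answer the later question about whether the data set has no repeated value, one mode, or several.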
Next, begin writing your paper by reporting your results for each of the values listed above.
Include the data set, the output from the analysis, and the answers to the following questions:
Evaluate the measures of central tendency. Address the following when completing this component:
Which measure of central tendency is most appropriate based on the data type?
Are the mean, median, and mode close to the same value? If not, what does this tell you about the numbers in the set?
Identify any mode(s) in the data set. Is there a mode at all? Is there more than one mode?
Calculate manually the interquartile range and the values of Q1 and Q3. (It is important to calculate this manually because the interquartile range and quartiles output from Calculator Soup might not be accurate.) Address the following when completing this component:
Test to see if there are any outliers in the set. If so, which number(s)?
Which method from Section 2.4 of the text did you first use to check for outliers?
Now try the other method from Section 2.4 of the text. Do you come to the same conclusion about outliers in the data set?
Explain which descriptive statistic you think best summarizes this set of numbers and why.
Choose three of the descriptive statistics that you feel best represent this data set. Why were they chosen?
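The manual quartile and outlier check can be mirrored in a few lines of Python. This is only a sketch under stated assumptions: the data are hypothetical, Q1 and Q3 are taken as the medians of the lower and upper halves (the approach shown in the Khan Academy video), and the outlier test uses the common 1.5 × IQR fences. The two methods in Section 2.4 of the text are not reproduced in this document, so confirm that this rule matches one of them before relying on it.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd count, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


data = [3, 7, 8, 9, 10, 11, 12, 13, 30]  # hypothetical data set

q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are flagged.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {low_fence} to {high_fence}; outliers: {outliers}")
```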
The Descriptive Statistics Data Analysis assignment
Must be two to three double-spaced pages in length (not including title and references pages) and formatted according to APA Style as outlined in the Ashford Writing Center’s APA Style resource.
Must include a separate title page with the following:
Title of paper
Student’s name
Course name and number
Instructor’s name
Date submitted
For further assistance with the formatting and the title page, refer to APA Formatting for Word 2013.
Must include an introduction and conclusion paragraph. Your introduction paragraph needs to end with a clear statement that indicates the purpose of your paper, to report and explain your analysis of the data set.
For assistance with writing these paragraphs, refer to the Introductions & Conclusions resource in the Ashford Writing Center.
Must use the course text and Excel or the Descriptive Statistics Calculator.
Must document any information used from sources in APA Style as outlined in the Ashford Writing Center’s APA: Citing Within Your Paper.
Must include a separate references page that is formatted according to APA Style as outlined in the Ashford Writing Center. See the APA: Formatting Your References List resource in the Ashford Writing Center for specifications.
References:
https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/measuring-spread-quantitative/v/calculating-interquartile-range-iqr
https://digital-films-com.proxy-library.ashford.edu/p_ViewVideo.aspx?xtid=6139
2 Illustrating Data
Chapter Learning Objectives After reading this chapter, you should be able to do the following:
1. Organize measures into frequency distributions, ordered arrays, and stem-and-leaf plots.
2. Create pie charts, bar graphs, and frequency polygons using Excel.
3. Describe the components of data normality.
4. Judge data normality by performing manual calculations and by using Excel output.
5. Develop tools to identify outliers.
tan82773_02_ch02_029-060.indd 29 3/3/16 9:58 AM
© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.
Section 2.1 From Description to Display
Introduction
People who like to organize things will especially like this chapter. What we cover here can be particularly helpful in an age where we are exposed to much more data than we can absorb. When the material is irrelevant, this data overload is not a problem, but when the information is important, we need ways to retain it. This chapter offers some solutions involving visual data displays, which an anecdote will help to illustrate.
During World War II, a British analyst was assigned to recommend to aircraft builders the points on airframes that should be reinforced with armor plating. Too much armor plating and the aircraft would lose maneuverability and range; too little and it would become too vulnerable to enemy fire. The analyst examined aircraft returning from combat, noted which areas showed damage, and drew pictures of the places where they had been hit. He recommended reinforcing the areas where the returning planes had not been damaged. How counterintuitive was that? As illogical as his approach seems, he reasoned that if the damage had been fatal to either the pilot or the aircraft’s ability to fly, the airplanes he examined would not have returned. So damage to the other areas was apparently the most serious, and those were the areas that needed the most protection.
This story is a lesson in the value of clarifying relationships with visual displays. Certainly, mathematical manipulation and statistical procedures are required at times, but often a necessary first step to understanding a data set is to arrange the data so that they can be visually analyzed. The understanding researchers gain from observation can then guide the mathematical analyses that follow.
Chapter 1 emphasized the descriptors and the statistical shorthand that allow us to classify and describe groups of data. That chapter limited descriptions to the scale of the data and the measures of central tendency and variability that allow data summaries. This chapter uses visual display for some of the same purposes and expands the applications for descriptive statistics.
2.1 From Description to Display
The study of statistics has an incremental nature: Each step becomes part of a more involved process later, which makes grasping the early topics important, since they are building blocks for subsequent ones. For now, we will use what we know about data scale and descriptive statistics to arrange measures into the tables and figures that reveal the multiple dimensions of numerical data. Although the stakes for us may be different than they were for the British warplane analyst, the issues are important nevertheless.
Most audiences are more engaged by a visual display than by a text presentation. When a good deal of data must be communicated in a short time, a visual display serves as a good place to begin. The discussions that follow suggest some of the more common procedures for representing different kinds of data, if only to introduce them briefly. For someone interested in a more in-depth discussion, books by authors such as Friendly (2000) and Tufte (2001) will be helpful. Tufte in particular has a reputation for innovative and informative data displays.
Data distributions of one sort or another are ubiquitous. A glance at the latest news reports indicates how unemployment numbers have changed during the year. Checking how the stock market has fluctuated over today’s trading session indicates highs, lows, and the volume of trading. The fact that data fluctuate makes them interesting. Data that either all have the same value or that always occur in the same proportions leave little to be analyzed. They interest us much less than data for which proportions and frequencies change.
Frequency Distributions
Scores on most measures vary, but the variation will generally have some repetition. Whether college admissions test results or the scores on a statistics quiz, not all scores are equally likely; some will occur more frequently than others. Frequency distributions indicate the number of measures in a data set that have the same characteristic. They allow us to display scores in terms of both their variability and their frequency of occurrence.
Suppose a state board administers a licensing test for marriage and family counselors. Rather than report every individual score, the board finds it more economical to report test results in categories:
Meritorious
Exceeds Expectations
Pass
Pass with Exceptions
Fail
Consider the following example: A group of 25 graduates of State U’s marriage and family counseling program takes the test. Table 2.1 shows the group’s results.
Tracking the highs, lows, and trading volume of stocks on a graph allows us to concisely evaluate what would otherwise be very large quantities of data.
Table 2.1: A frequency distribution for licensing test results
Licensing test results f
Meritorious 4
Exceeds expectations 6
Pass 8
Pass with exceptions 4
Fail 3
Total 25
Table 2.1 depicts a frequency distribution, with the symbol f indicating the number of scores that occur in a particular category. If each individual score had been entered rather than being grouped into categories, the result would have been a table with 25 discrete entries. Instead, the data in Table 2.1 represent a grouped frequency distribution. Such a table provides a compact presentation when there are many scores.
Ordered and Disordered Arrays
Table 2.1 is divided into categories, but if each of the 25 results was listed in ranked order from the four that were meritorious down to the three fails, the display would reflect an ordered array. If instead of listing them from highest to lowest, the board arbitrarily piled all the scores into the table, it would show, not surprisingly, a disordered array. In such a table, for example, although the meritorious scores would still occur as a group, they would be in no particular order. Table 2.1 is a much shorter display than either an ordered or a disordered array.
When sample sizes are comparatively small (15 or 20 scores from a larger population, for example), the type of presentation is not an issue, but presentation would be a greater issue if the frequency distribution included data for every aspiring marriage and family counselor in the state who took the licensing test. Even if hundreds of scores were being reported, a grouped frequency distribution would have the same number of rows as Table 2.1. Frequency distributions, then, can make a presentation compact.
Jokela (2012) studied whether associations between individuals’ personality traits and whether they have children are affected by when they were born. Table 2.2 is part of his subjects’ description. It shows the birth cohort, or particular period of birth, and gender for 6,259 subjects (2,971 men and 3,288 women) in a relatively compact display.
Class Intervals
The “groups” in grouped frequency distributions (the birth cohorts in Table 2.2) are called class intervals. Although they provide an economical data presentation and make a great deal of data accessible to even a casual observer, some details are inevitably lost. It is not apparent from studying Table 2.1, for example, which numerical test scores belong to a particular class interval. We can address that deficiency by incorporating a list of score ranges, which might be the following:
28–34 Meritorious
21–27 Exceeds Expectations
14–20 Pass
7–13 Pass with Exceptions
0–6 Fail
With the ranges, we know how scores were classified, but it still is not apparent exactly how one individual whose score is in the “pass” interval, for example, scored. The person could have scored anywhere from 14 to 20. We know only the category. The same difficulty emerges in Table 2.2. The table shows 347 female subjects in the 1920–1929 birth cohort, but it does not make any distinction within the 1920–1929 group, a range of 9 years.
If we cannot know precisely how a particular individual scored, or the exact year in which a subject was born (Table 2.2), the data can at least be roughly ranked. Clearly, those in Table 2.1 who “exceeded expectations” did better than those in the pass category, although exactly how much better is not indicated.
Estimating the Mean from a Class Interval
Indicating the score frequencies in the class intervals reduces the scores to values that can be ranked approximately. Even without the individual scores, we can use the categories to estimate the mean of the scores from class intervals. To estimate the mean from class intervals,
1. Determine the midpoint of each class interval.
2. Multiply each midpoint by the number of scores in its interval and sum the products.
3. Divide that sum by the total number of scores.
Table 2.2: A grouped frequency distribution of subjects’ birth cohort
Birth year Men (2,971) Women (3,288)
1914–1919 0 0
1920–1929 316 347
1930–1939 498 614
1940–1949 732 795
1950–1959 816 802
1960–1969 585 707
1970–1979 24 23
Source: Jokela, M. (2012). Birth-cohort effects in the association between personality and fertility. Psychological Science, 23, 835–841.
Try It!: #1 According to the discussion of the scale of data in Chapter 1, what scale do data categories such as meritorious, exceeds expectations, and so on indicate?
To see how accurate the estimated mean is, using the data in Table 2.1, we will first calculate the actual mean. Perhaps for the licensing test data in the grouped frequency distribution above, the individual scores were the following:
Meritorious: 34, 33, 33, 29
Exceeds Expectations: 26, 26, 24, 23, 23, 22
Pass: 20, 19, 19, 18, 17, 15, 15, 14
Pass with Exceptions: 12, 11, 9, 8
Fail: 6, 3, 1
Using the formula for the mean, M = Σx / n, verify that 460 / 25 = 18.40.
Now, to estimate the mean based on the class intervals, follow these four steps:
1. Determine the midpoint of each class interval by
a) adding the two possible extreme scores within each interval (not the actual scores) and then
b) dividing by 2.
For each interval:
Meritorious: (28 + 34)/2 = 31
Exceeds Expectations: (21 + 27)/2 = 24
Pass: (14 + 20)/2 = 17
Pass with Exceptions: (7 + 13)/2 = 10
Fail: (0 + 6)/2 = 3
2. Multiply the midpoint values from Step 1 by the number of scores in the interval.
31 × 4 = 124
24 × 6 = 144
17 × 8 = 136
10 × 4 = 40
3 × 3 = 9
3. Sum Step 2’s products (the midpoints times the number of values).
124 + 144 + 136 + 40 + 9 = 453
4. Divide the sum of the products from Step 3 by the number of scores.
453/25 = 18.12
The actual mean is 18.40. The estimated mean is 18.12.
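The four steps above can be sketched in Python as a check on the arithmetic. The interval limits, frequencies, and raw scores are taken from the licensing example; the variable and function names are illustrative only.

```python
# Apparent limits (low, high) and frequency f for each class interval.
intervals = {
    "Meritorious": ((28, 34), 4),
    "Exceeds Expectations": ((21, 27), 6),
    "Pass": ((14, 20), 8),
    "Pass with Exceptions": ((7, 13), 4),
    "Fail": ((0, 6), 3),
}

# The 25 individual scores listed for the example.
raw_scores = [
    34, 33, 33, 29,
    26, 26, 24, 23, 23, 22,
    20, 19, 19, 18, 17, 15, 15, 14,
    12, 11, 9, 8,
    6, 3, 1,
]


def estimate_mean(class_intervals):
    """Estimate the mean from interval midpoints weighted by frequency."""
    weighted_sum = 0.0
    count = 0
    for (low, high), f in class_intervals.values():
        midpoint = (low + high) / 2   # Step 1
        weighted_sum += midpoint * f  # Steps 2 and 3
        count += f
    return weighted_sum / count       # Step 4


estimated = estimate_mean(intervals)
actual = sum(raw_scores) / len(raw_scores)

print(f"Estimated mean: {estimated:.2f}")
print(f"Actual mean:    {actual:.2f}")
print(f"Difference:     {actual - estimated:.2f}")
```

Weighting each midpoint by its frequency is what keeps the estimate close to the actual mean; averaging the five midpoints without the frequencies would give 17.0 instead.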
Because this is an estimate, there will generally be a minor discrepancy between the value estimated from the class intervals and the actual value of the mean. In this example, the difference between the estimated and actual mean is 0.28. As the number of values in the data set increases, the discrepancy will usually diminish. The point is that with only the values that constitute the class intervals and the number of scores in each interval, it is possible to estimate the value of the mean. That can be helpful in a data summary when the original scores are unavailable, as is the case for data in Table 2.2. Whenever the value of M is estimated from the class intervals, any reporting of the value must clearly state that it is an estimate and that it was not calculated directly from the raw data.
The Difference Between Apparent and Actual Limits
For the licensing data, the scores are all whole numbers: integers. This makes creating the class intervals easy, but researchers often work with data that include decimal values, and class limits must accommodate any value between the highest and lowest integers. The highest and lowest integers in the category represent the apparent limits of the class interval. For example, in Table 2.1’s meritorious category, the apparent limits are 28 and 34. If the scores do not involve decimal values, determining class limits does not pose a problem, but sometimes decimals are part of the data being represented. A student’s grade point average, for example, is likely to have a decimal value. Ordinary grading procedures also often include decimals. If the lower limit for A work is 90% and the upper limit for B work is 89%, to which class interval does 89.5% belong?
To accommodate any value, class intervals must have actual limits in addition to apparent limits. In the case of grade averages and a great many other kinds of data, the class interval actually extends from a half point below the lower whole number in the interval to a half point above the upper whole number. That means the lower limit for an A would be 89.5%. For the 21–27 class interval (exceeds expectations), the actual limits are 20.5 to 27.5. If we subtract the lower from the upper actual limit we have the width of the class interval: 27.5 − 20.5 = 7.0.
That difference between the actual limits is the same as the number of whole numbers in the 21–27 apparent limits. In this case, that includes 21, 22, 23, 24, 25, 26, 27 or seven whole numbers.
In our licensing example, the use of actual limits involves a problem apparent limits did not present: The lower actual limit for exceeds expectations is the same as the upper actual limit for pass. Both are 20.5. So when scores happen to include whole numbers and decimals, where does a score like 20.5 belong? Sheskin’s (2004) solution is to adopt a rule. Such a rule could dictate, for example, that if the first value for the score in question is an odd number, it falls in one interval (perhaps the upper), and if the value is an even number, it falls in the lower interval. Under that rule, someone scoring 20.5 would receive a pass rating—because the first number, 2, is even. Still, which rule is followed does not matter so long as it is equitable and followed consistently.
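As an illustration only, the actual limits and the example tie-breaking rule can be written out in Python. The function names are hypothetical, and the odd/even rule is just the example given above, not a standard; any equitable rule applied consistently would serve.

```python
def actual_limits(apparent_low, apparent_high):
    """Extend the apparent limits by half a point in each direction."""
    return apparent_low - 0.5, apparent_high + 0.5


def resolve_boundary(score, lower_label, upper_label):
    """Place a score that sits exactly on a shared actual limit.

    Example rule from the text: an odd first digit goes to the upper
    interval, an even first digit goes to the lower interval.
    Assumes a non-negative score.
    """
    first_digit = int(str(score)[0])
    return upper_label if first_digit % 2 == 1 else lower_label


low, high = actual_limits(21, 27)  # exceeds expectations
width = high - low

print(f"Actual limits: {low} to {high}, width {width}")
print(resolve_boundary(20.5, "Pass", "Exceeds Expectations"))
```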
Creating Grouped Frequency Distributions
Speaking of consistency, grouped frequency distributions are also developed according to a couple of conventions:
• Each class interval must have the same range. Whether the class limits are apparent or actual, the ranges of the different intervals must be equal. In the licensing scores example, the range of the apparent limits is 6.0 for each interval: 34 − 28 = 6.0 for the meritorious interval, 27 − 21 = 6.0 for the exceeds-expectation interval, and so on.
• A score must fit into just one group. This is simple enough when scores involve only whole numbers, but with decimal values, the difference between actual and apparent limits becomes relevant.
No rules dictate how many intervals are too few or too many, and of course that is a nonissue when the data have their own categories, like the licensing results. But without prescribed categories, researchers must decide on the number of categories to use. As a rough rule of thumb, Sheskin (2004) suggests taking the square root of the number of scores to determine the number of class intervals. So if, for example, the data set has 50 scores, the square root of 50 (√50 = 7.071), or about 7 class intervals, would be a reasonable number. Sheskin’s proposal is only a suggestion, however. When the data set is large, the rule may not be very helpful. In the Jokela (2012) study, shown in Table 2.2, there were 2,971 male subjects. Fifty-five class intervals (√2,971 = 54.507) probably creates a larger table than anyone wants to use in a presentation or research report.
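A short Python sketch of the square-root rule of thumb follows. Rounding to the nearest whole number is an assumption made here because it reproduces both figures in the text (about 7 and 55); the rule itself does not specify how to round.

```python
import math


def suggested_intervals(n_scores):
    """Square-root rule of thumb for the number of class intervals."""
    return round(math.sqrt(n_scores))  # rounding choice is an assumption


for n in (50, 2971):
    print(f"n = {n}: sqrt = {math.sqrt(n):.3f}, "
          f"suggested intervals = {suggested_intervals(n)}")
```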
The researcher’s objective is to find a reasonable balance between the efficiency that a few categories provide and the precision that more categories yield. For example, if we were to reduce the five intervals in Table 2.1 to just pass and fail categories, the result would be a very compact table, but a good deal of information about the level at which a particular individual passed would be lost.
Score Frequencies and Score Aggregates
Table 2.1 provides a simple summary of frequency, or how the scores on the licensing exam are distributed for 25 test-takers. Other arrangements of these data offer different pictures of how the scores are distributed.
• Frequency indicates how many scores are in each class interval (Table 2.1).
• Relative frequency indicates the proportion or percentage of the total that scores in the class interval represent. Relative frequencies can be reported as common fractions, but proportions or percentages of the whole are more common. The proportions are calculated by dividing the number of scores in the class interval by the total number of scores.
• Sometimes it is helpful to see a running total of scores as one proceeds from one class interval to the next. A cumulative relative frequency value adds each successive class interval to the proportions of scores that precede it so that the last interval will indicate 1.0, or 100%. The cumulative relative frequency for exceeds expectations will be the relative frequency for that class interval (0.24), plus the relative frequency for the preceding class interval (meritorious; 0.16).
0.24 + 0.16 = 0.40
Expanding Table 2.1 by adding columns for relative frequency and for cumulative relative frequency results in Table 2.3.
(The Stem) | (The Leaves)
3 | 3 3 4
2 | 0 2 3 3 4 6 6 9
1 | 1 2 4 5 5 7 8 9 9
0 | 1 3 6 8 9
Table 2.3: Frequencies, relative frequencies, and cumulative relative frequencies
Licensing test results f Relative f Cumulative relative f
Meritorious 4 0.16 0.16
Exceeds expectations 6 0.24 0.40
Pass 8 0.32 0.72
Pass with exceptions 4 0.16 0.88
Fail 3 0.12 1.00
Total 25
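The relative and cumulative relative frequency columns can be reproduced from the frequency column alone. The sketch below uses the counts from Table 2.1; the running total is kept as a count and divided once per row, which avoids small floating-point drift from adding proportions.

```python
# Frequencies from Table 2.1, in the table's order.
frequencies = {
    "Meritorious": 4,
    "Exceeds expectations": 6,
    "Pass": 8,
    "Pass with exceptions": 4,
    "Fail": 3,
}

total = sum(frequencies.values())
running = 0
table = []
for category, f in frequencies.items():
    running += f
    relative = f / total            # proportion of all scores
    cumulative = running / total    # proportion at or above this row
    table.append((category, f, relative, cumulative))

for category, f, relative, cumulative in table:
    print(f"{category:<22} {f:>2} {relative:.2f} {cumulative:.2f}")
```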
Stem-and-Leaf Displays
Sometimes, rather than collapsing or abbreviating the data list, scores need to be organized so that when they are all presented, they are easy to understand. Some data displays accommodate all of the data and still manage to remain fairly compact. One such is the stem-and-leaf display or stem plot. Rather than collapsing the scores into class intervals (and losing some of the information about their original values), the stem-and-leaf approach displays all the original scores. The stem-and-leaf display has its name because each score is reduced to a stem and a leaf. The “stem” in the display is all values in the number preceding the last digit in the score. The “leaf” is the last value in the score.
Figure 2.1 depicts a stem-and-leaf display of the 25 test scores on which Table 2.1 and 2.3 are based.
Figure 2.1: A stem-and-leaf display of test scores
A stem-and-leaf display condenses data to a series of stems (all values in the number except for the last digit) and leaves (the last digit of each value).
Try It!: #2 What would the stem be for a score of 1,012?
At first glance the display appears a little odd, but the beauty of stem-and-leaf displays is that the data list is complete. Single-digit original scores appear on the bottom row, where the stem (the number preceding the final value) is 0. The stem in this particular display is just a series of single values because all of the scores are either single-digit or two-digit numbers. If there were a score of 100, the stem for that score would be a two-digit number, 10.
• The single-digit original scores on the bottom row, then, are 1, 3, 6, 8, and 9.
• The second-row test scores are those for which the first digit is a 1 (the stem is 1). Those scores are 11, 12, 14, 15, 15, 17, 18, 19, and 19.
• The third-row test scores are those for which the first number (the stem) is a 2. These, of course, are the test scores in the 20s.
• And the top row, with a stem of 3, contains the three highest scores: 33, 33, and 34.
Once a person is oriented to stems and leaves, the display is not difficult to interpret. A glance makes it clear, for example, that the bulk of these test scores are in the 10s and 20s.
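To show how mechanical the construction is, here is a minimal Python sketch that builds the display from the 25 licensing scores. It splits each whole-number score into stem and leaf with integer division by 10; the function name is illustrative, and stems with no scores are simply omitted in this version.

```python
from collections import defaultdict

scores = [
    34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
    20, 19, 19, 18, 17, 15, 15, 14,
    12, 11, 9, 8, 6, 3, 1,
]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


display = stem_and_leaf(scores)
for stem in sorted(display, reverse=True):  # highest stem on top
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```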
Data Cross-Tabulations
Beyond simply listing data, the stem-and-leaf display suggests that the way data are organized can make what are often quite subtle relationships easier to recognize. Other types of displays also do this very well. For the sake of the licensing test example, assume that the 25 people represent all those from a particular city who took the test in a given year. Assume further that they are the products of two different universities in that city. We know from the earlier tables that the test had just three outright failures and an additional four who passed with exceptions, a kind of conditional pass. A researcher might find it important to determine whether students from the two universities performed similarly. Cross-tabulating the data is one way to present them so that such questions are easier to answer.
Tables 2.1 and 2.3 organized test results according to just the categories that constitute the class intervals. If the results add the university the student attended, a data table (Table 2.4) can be developed so that the columns indicate the test results, and the rows indicate the university attended.
Table 2.4: Cross-tabulating test results with the institution
Institution | Meritorious | Exceeds expectations | Pass | Pass with exceptions | Fail
University A | 0 | 1 | 4 | 4 | 3
University B | 4 | 5 | 4 | 0 | 0
(The five result columns are the class intervals.)
Try It!: #3 How many “stems” would a stem-and-leaf plot have if scores represented every integer from 1 to 99?
This cross-tabulation reveals information about the relative success of students from the two universities. If we aggregate the data across institutions, it is not apparent, for example, that no one from University B failed the test, nor is it clear that all those who scored at the meritorious level were from University B. Cross-tabulating data allows a second variable to be represented and provides for a more sophisticated level of analysis.
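A small Python sketch makes the point about aggregation concrete: summing the Table 2.4 counts across the two universities returns exactly the Table 2.1 frequencies, so the institutional pattern disappears. The counts come from Table 2.4; the variable names are illustrative.

```python
categories = [
    "Meritorious",
    "Exceeds expectations",
    "Pass",
    "Pass with exceptions",
    "Fail",
]

# Counts from Table 2.4, one row per institution.
crosstab = {
    "University A": [0, 1, 4, 4, 3],
    "University B": [4, 5, 4, 0, 0],
}

# Collapsing across institutions gives the single-variable distribution.
column_totals = [sum(column) for column in zip(*crosstab.values())]
row_totals = {school: sum(counts) for school, counts in crosstab.items()}

for category, count in zip(categories, column_totals):
    print(f"{category}: {count}")
print(row_totals)
```

The column totals (4, 6, 8, 4, 3) match Table 2.1, while the row totals show how the 25 test-takers divide between the two universities.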
If, for example, we also had access to marital status information, the rows could be divided further to reflect that variable, which would allow us to compare results by test result, by university, and by marital status. If you have read enough to know what information is likely to be important in your report, the published research will guide you to the additional variables that ought to be gathered and presented in a data display.
Cross-tabulation is a visually simple way to represent multiple variables. Linn (2003) wanted to indicate the percentage of students in two different grades who were performing at proficient or above in two different states on two different parts of the National Assessment of Educational Progress (NAEP). If you count them, Linn’s data contain four different variables, displayed in Table 2.5:
Table 2.5: Representing multiple variables in a cross-tabulation: percentage of students performing at proficient and beyond on the NAEP
Grade | 1998 Reading, Colorado | 1998 Reading, Massachusetts | 1996 Mathematics, Colorado | 1996 Mathematics, Massachusetts
4 | 34 | 37 | 22 | 24
8 | 30 | 36 | 25 | 28
Source: Linn, R.L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3–13.
Table 2.5 reveals that the percentage of students scoring at proficient and beyond on both the 1998 Reading test and the 1996 Mathematics test was modestly greater in Massachusetts than in Colorado. The table also reveals that the gap increased slightly from the fourth to the eighth grade. Cross-tabulations readily reveal trends and comparisons such as these.
2.2 Graphs and Other Data Figures
Sometimes, rather than visual displays that group or arrange the scores, a more graphic presentation is helpful. That was certainly the case for the aircraft analyst discussed in the chapter introduction. Pie charts and bar graphs are both quite common because they require very little explanation. As compact and efficient as the stem-and-leaf display is, the unfamiliar observer must be oriented to it before the data make sense. This is less often the case with pie and bar graphs.
Pie Charts
Perhaps better than any other type of graph or figure, the pie chart, or pie graph, clarifies proportions. Scholars have been using pie charts to illustrate proportional differences for probably two hundred years. Technically speaking, a pie chart is a circle that is divided into sectors. The size of each sector reflects that category’s percentage of the total area of the circle. For instance, suppose a pie chart is used to illustrate where people in a particular county live, with the following percentages:
• 25% are city dwellers,
• 20% live in the suburbs,
• 25% live in small towns, and
• the remaining 30% live in rural areas.
In the circle used for the pie chart,
• 1/4th of the area will be the city sector,
• 1/5th of the area will be for those in the suburbs,
• 1/4th of the area will represent the small-town residents, and
• the remaining 3/10ths of the area will represent the rural dwellers.
When we are interested in how much of the whole is explained by individual categories, a pie chart is usually more illustrative than a table. This is particularly the case when the data sets are large.
Perhaps a sociologist is interested in the ethnic group makeup of the residents in a particular county. Examining census data might produce the following statistics (depicted as a pie chart in Figure 2.2):
African Americans 23,375
Asian Americans 18,217
Caucasian Americans 32,667
Hispanic Americans 40,886
Native Americans 11,364
Other 5,887
Figure 2.2: A pie chart of the ethnic makeup of the county
This pie chart depicts census data by ethnic group for a single county. Pie charts are useful for showing each category as a proportion of the whole.
To make this pie chart in Excel, enter the data in two columns just as they are listed above Figure 2.2. Drag the cursor to highlight both columns, and then select the Insert tab at the top of the page. Select the Pie option. The default result is a two-dimensional pie chart.
Treated as a list, the data are certainly precise but perhaps are not as communicative as they might be. If the intent is to indicate how different ethnic groups compare as proportions of the entire population of the county, a pie chart is probably more helpful. Figure 2.2 shows the proportions of each ethnic group within the county population.
This particular graph does not indicate the numbers on which the proportions are based, but the exact counts can be listed separately. In any event, the raw numbers may not matter to someone who wants a graphic demonstration of the fact that Hispanic residents constitute the largest single ethnic group in the county, that the second largest group is the Caucasian group, that Native Americans are about half as numerous as African Americans, and so on.
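The comparisons in this paragraph can be checked directly from the census counts listed above Figure 2.2. The following Python sketch is an illustration only; the variable names are assumptions made for the example. It converts each count to a percentage of the county total, ranks the groups from largest to smallest, and computes the ratio of the Native American count to the African American count.

```python
# Counts as listed above Figure 2.2.
counts = {
    "African Americans": 23375,
    "Asian Americans": 18217,
    "Caucasian Americans": 32667,
    "Hispanic Americans": 40886,
    "Native Americans": 11364,
    "Other": 5887,
}

total = sum(counts.values())
percentages = {group: 100 * n / total for group, n in counts.items()}

# Rank groups from largest to smallest count.
ranked = sorted(counts, key=counts.get, reverse=True)
for group in ranked:
    print(f"{group:20s} {counts[group]:>7,d} {percentages[group]:5.1f}%")

# "About half as numerous" corresponds to a ratio near 0.5.
ratio = counts["Native Americans"] / counts["African Americans"]
print(f"Native American / African American ratio: {ratio:.2f}")
```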
Pie charts illustrate large proportional differences better than small ones. Note that Figure 2.2 makes it difficult to assess how much of the whole the Native American population constitutes, or what proportion is Other. Pie charts generally work better when comparing an individual “slice” to the whole rather than one slice to another.
Bar Graphs
Bar graphs or bar charts use a series of bars of different lengths to represent the different quantities of some variable. The bars can be either horizontal or vertical.
Gaps between the bars indicate that the categories in the graph are not continuous; they are discrete or independent categories. For example, such a chart might illustrate the popularity of different academic majors at a university or show the ethnic makeup of a population, as Figure 2.3 does for the county.
Figure 2.3: A bar graph of the ethnic makeup of the county
Unlike a pie chart, a bar graph shows data values for each population group, allowing for a more exact representation of each group’s proportion within the whole population.
(Figure 2.3 bars, in order: African Americans 23,375; Asian Americans 18,217; Caucasian Americans 32,667; Hispanic Americans 40,886; Native Americans 11,364; Other 5,887.)
(Figure 2.4 axis labels: South, Midwest, New England, Northeast, Atlantic/South, Mountain/West, Pacific/West, Southwest. Legend: Total Population, Under 18.)
One advantage of bar graphs over pie charts is that bar graphs supply the data values along the y (vertical) axis or along the bar, as Figure 2.3 illustrates. The presence of the data values makes it a good deal easier to get a rough idea of the approximate totals for each ethnic group and simplifies comparisons from group to group. In a bar graph with discrete categories, the order of the bars is usually not significant, although there may be an order the researcher wishes to emphasize. In Figure 2.3, the order of ethnic groups happens to be alphabetical.
To create this bar graph in Excel, perform these steps using the same data set used to create the pie chart:
1. Highlight both columns of data.
2. Select the Insert tab at the top of the page and choose Bar.
3. Select All Chart Types at the bottom of the page (the default charts all use horizontal bars).
4. Select the upper-left-column graph.
5. Place your cursor on the series 1 notation at the right.
6. Press the Delete key on your keyboard, and then click OK.
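For readers working outside Excel, the same idea can be sketched as a plain-text bar graph in Python. This is a rough illustration and not a reproduction of Figure 2.3; the function name `text_bar_chart` and the `width` parameter are assumptions made for the example. Each bar's length is scaled to the largest count, and the data value is printed beside the bar so that groups can be compared directly.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category: label, a bar scaled to `width`, and the value."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # The largest category gets a bar of exactly `width` characters.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<{label_width}} | {bar} {value:,}")
    return lines

# Same data set used for the pie chart, in alphabetical order.
counts = {
    "African Americans": 23375,
    "Asian Americans": 18217,
    "Caucasian Americans": 32667,
    "Hispanic Americans": 40886,
    "Native Americans": 11364,
    "Other": 5887,
}

chart = text_bar_chart(counts)
print("\n".join(chart))
```

Printing one bar per line keeps the categories visibly separate, which mirrors the gaps between bars for discrete categories.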
Bar graphs can also depict different variables within a single data set. Using the 2000 U.S. census results, Lopez (2003) examined the ethnic-group characteristics of school-aged children. She used a bar graph to indicate the percentage of mixed-race children under age 18 for each region in the United States. Figure 2.4 shows the resulting bar graph.
The bar graph makes it clear that mixed-race children are a much greater proportion of the population in the West and Southwest than they are in the South or the Midwest, for example.
Figure 2.4: Percentage of mixed-race children under 18 by region
Bar graphs are capable of depicting multiple variables, as this graph of mixed-race children within each U.S. region illustrates.
Source: Lopez, 2003.
(Figure: histogram titled “Stress Levels (1–9) Among 32 Hospital Personnel,” with axis ticks from 1 to 9 and from 0 to 7.)
Histograms
The ethnicity categories in both Figures 2.3 and 2.4 are nominal scale, meaning the categories are not continuous and the order of the categories is unimportant. Sometimes, data categories continue from one to the next, so that each category indicates an incremental increase or decrease in the level of the same characteristic. A bar graph of this kind of data is a histogram.
The subtle visual difference between histograms and other graphs is the absence of a gap between the bars or columns, which serves as a reminder that the data continue without interruption into the next category. Earlier, this chapter discussed actual versus apparent limits in class intervals; here the lack of interruption indicates that limits in a histogram are actual limits.
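The link between actual limits and the missing gaps can be illustrated with a short Python sketch. It assumes whole-number scores on a 1 to 9 scale with one score per bar, and it assumes that actual limits extend half a measurement unit beyond the apparent limits; the function name `actual_limits` is hypothetical and chosen for the example. Under those assumptions, the upper limit of each bar equals the lower limit of the next, so the bars touch.

```python
def actual_limits(apparent_lower, apparent_upper, unit=1):
    """Extend apparent limits by half a measurement unit on each side."""
    half = unit / 2
    return (apparent_lower - half, apparent_upper + half)

# Assumed intervals: one bar per whole-number score from 1 through 9.
bars = [actual_limits(score, score) for score in range(1, 10)]

# Adjacent bars share a boundary, so no gap separates them.
shared = all(bars[i][1] == bars[i + 1][0] for i in range(len(bars) - 1))
print(bars[0], bars[-1], shared)
```

The same function applies to wider class intervals; for example, apparent limits of 10 and 14 would give actual limits of 9.5 and 14.5.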