Chapter 4
Summary Statistics
In Chapter 4:
• The prior chapter used stemplots and histograms to look at the shape, location, and spread of a distribution.
• This chapter uses numerical summaries for similar purposes.
Summary Statistics
• Central location – Mean – Median – Mode
• Spread – Range and interquartile range (IQR) – Variance and standard deviation
• Shape summaries – seldom used in practice
Notation
• n ≡ sample size
• X ≡ the variable (e.g., ages of subjects)
• xi ≡ the value of individual i for variable X
• Σ ≡ sum all values (capital sigma)
• Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52
n = 10
X = AGE variable
x1 = 21, x2 = 42, …, x10 = 52
Σxi = x1 + x2 + … + x10 = 21 + 42 + … + 52 = 290
§4.1: Central Location: Sample Mean
• “Arithmetic average” • Traditional measure of central location • Sum the values and divide by n • “xbar” refers to the sample mean
$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: Sample Mean
Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52
Note that n = 10, Σxi = 21 + 42 + … + 52 = 290, and
$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.0$
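The arithmetic in this example can be checked with a minimal Python sketch. The function name sample_mean is chosen here for illustration only; it is not part of the original slides.

```python
# Ages of the ten participants from the illustrative data set.
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]


def sample_mean(values):
    """Sum the values and divide by n (illustrative helper)."""
    n = len(values)
    return sum(values) / n


total = sum(ages)          # sum of all x_i
xbar = sample_mean(ages)   # the sample mean, "xbar"
print(total, xbar)         # 290 29.0
```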
Figure 4.1 The mean is the balancing point of a distribution
Uses of the Sample Mean
• The sample mean can be used to predict:
– The value of an observation drawn at random from the sample
– The population mean
Population Mean
• Same operation as sample mean except based on entire population (N ≡ population size)
• Conceptually important • Usually not available in practice • Sometimes referred to as the expected value
$\mu = \frac{\sum x_i}{N} = \frac{1}{N}\sum x_i$
§4.2 Central Location: Median
The median is the value with a depth of (n+1)/2 in the ordered data.
When n is even, average the two values that straddle a depth of (n+1)/2.
For the 10 values listed below, the median has depth (10+1) / 2 = 5.5, placing it between 27 and 28. Average these two values to get median = 27.5.
05 11 21 24 27 ↑ 28 30 42 50 52
(↑ marks the median position.) Average the adjacent values: M = 27.5
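The depth rule can be sketched in Python as follows. The helper name median_by_depth is an illustrative assumption, and the sketch assumes a non-empty list of numbers.

```python
import math


def median_by_depth(values):
    """Median as the value at depth (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)      # values must be ordered first
    n = len(ordered)
    depth = (n + 1) / 2           # 1-based depth; ends in .5 when n is even
    lo = math.floor(depth)        # depth of the value just below (or at) the median
    hi = math.ceil(depth)         # depth of the value just above (or at) the median
    # When n is odd, lo == hi and the average is the middle value itself.
    return (ordered[lo - 1] + ordered[hi - 1]) / 2


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(median_by_depth(ages))      # 27.5
```

Because the function sorts its input, unordered data such as 6 2 4 still give a median of 4.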
More Examples of Medians
• Example A: 2 4 6 Median = 4
• Example B: 2 4 6 8 Median = 5 (average of 4 and 6)
• Example C: 6 2 4 Median ≠ 2 (Values must be ordered first)
The Median is Robust
• The median is more resistant to skews and outliers than the mean; it is more robust.
• This data set has a mean of 1600:
1362 1439 1460 1614 1666 1792 1867
• Here’s the same data set with a data entry error “outlier” (the last value, 9867). This data set has a mean of 2743:
1362 1439 1460 1614 1666 1792 9867
• The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
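A short sketch using Python's standard statistics module makes the comparison concrete. The variable names are illustrative.

```python
from statistics import mean, median

# Seven metabolic rates, and the same data with a data entry error in the last value.
rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = rates[:-1] + [9867]

mean_clean = mean(rates)                # 11200 / 7
mean_error = round(mean(with_error))    # 19200 / 7, rounded to the nearest integer
median_clean = median(rates)
median_error = median(with_error)

print(mean_clean, median_clean)         # mean 1600, median 1614
print(mean_error, median_error)         # mean 2743, median 1614
```

The single bad value moves the mean by more than 1100 units while the median does not move at all.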
§4.3: Mode
• The mode is the most commonly encountered value in the dataset
• This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
• This data set has no mode
{4, 6, 7, 8} (each point appears only once)
• The mode is useful only in large data sets with repeating values
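One way to find the mode is to count occurrences, sketched below with collections.Counter. The function name modes is an illustrative assumption. The sketch follows the convention above that a data set with no repeated values has no mode, and it assumes a non-empty input.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())    # highest frequency in the data
    if top == 1:
        return []                 # every value appears once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([4, 7, 7, 7, 8, 8, 9]))   # [7]
print(modes([4, 6, 7, 8]))            # []
```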
Figure 4.4 Effect of a skew on the mean, median, and mode.
Note how the mean gets pulled toward the longer tail more than the median.
• mean = median → symmetrical distribution
• mean > median → positive skew
• mean < median → negative skew
§4.5 Spread: Quartiles
• Two distributions can be quite different yet can have the same mean
• These data compare particulate matter in air samples (μg/m3) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.
Site 1|  |Site 2
----------------
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
     ×10
Spread: Range
• Range = maximum – minimum
• Illustrative example (same two-site stemplot as above):
Site 1 range = 68 – 22 = 46
Site 2 range = 40 – 32 = 8
• Beware: the sample range will tend to underestimate the population range.
• Always supplement the range with at least one additional measure of spread
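The two ranges can be checked with a brief sketch. The values below are read off the back-to-back stemplot (stem × 10 plus leaf); that reading, and the helper name sample_range, are assumptions made for illustration.

```python
# Values as read from the stemplot (assumed reading: stem x 10 + leaf).
site1 = [22, 24, 28, 32, 36, 38, 42, 68]
site2 = [32, 33, 34, 36, 36, 38, 39, 40]


def sample_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


print(sample_range(site1))   # 46
print(sample_range(site2))   # 8
```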
Spread: Quartiles
• Quartile 1 (Q1): cuts off bottom quarter of data = median of the lower half of the data set
• Quartile 3 (Q3): cuts off top quarter of data = median of the upper half of the data set
• Interquartile Range (IQR) = Q3 – Q1; covers the middle 50% of the distribution
05 11 21 24 27 28 30 42 50 52
↑ Q1 at 21, ↑ median between 27 and 28, ↑ Q3 at 42
Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21
Quartiles (Tukey’s Hinges) – Example 2
Data are metabolic rates (cal/day), n = 7
1362 1439 1460 1614 1666 1792 1867
(the median is the middle value, 1614)
• When n is odd, include the median in both halves of the data set.
• Bottom half: 1362 1439 1460 1614, which has a median of 1449.5 (Q1)
• Top half: 1614 1666 1792 1867, which has a median of 1729 (Q3)
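The hinge rule can be sketched in Python as below. The function name tukey_hinges is an illustrative assumption. Note that statistical packages often use other quartile definitions, so their results may differ slightly from these hinges.

```python
from statistics import median


def tukey_hinges(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the overall median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2              # size of each half (counts the median when n is odd)
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(upper)


ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]

q1, q3 = tukey_hinges(ages)
print(q1, q3, q3 - q1)               # 21 42 21
print(tukey_hinges(rates))           # (1449.5, 1729.0)
```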
Five-Point Summary
• Q0 (the minimum)
• Q1 (25th percentile)
• Q2 (median)
• Q3 (75th percentile)
• Q4 (the maximum)
§4.6 Boxplots
1. Calculate 5-point summary. Draw box from Q1 to Q3 w/ line at median
2. Calculate IQR and fences as follows:
FenceLower = Q1 – 1.5(IQR)
FenceUpper = Q3 + 1.5(IQR)
Do not draw fences
3. Determine if any values lie outside the fences (outside values). If so, plot these separately.
4. Determine values inside the fences (inside values). Draw whisker from Q3 to upper inside value. Draw whisker from Q1 to lower inside value
Illustrative Example: Boxplot
1. 5-point summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line at 27.5
2. IQR = 42 – 21 = 21
FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5
3. No values above upper fence. No values below lower fence
4. Upper inside value = 52. Lower inside value = 5. Draw whiskers
Data: 05 11 21 24 27 28 30 42 50 52
[Figure: boxplot of the data on a vertical axis from 0 to 60. Upper inside value = 52, Q3 = 42, Q2 = 27.5, Q1 = 21, lower inside value = 5.]
Illustrative Example: Boxplot 2
Data: 3 21 22 24 25 26 28 29 31 51
[Figure: boxplot of the data on a vertical axis from 0 to 60. Outside value (51), upper inside value (31), upper hinge (29), median (25.5), lower hinge (22), lower inside value (21), outside value (3).]
1. 5-point summary: 3, 22, 25.5, 29, 51: draw box
2. IQR = 29 – 22 = 7
FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.5
3. One above top fence (51). One below bottom fence (3)
4. Upper inside value is 31. Lower inside value is 21. Draw whiskers
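The four boxplot steps can be sketched as a single Python function that returns the numbers needed for the drawing. The function name boxplot_stats, the dictionary keys, and the parameter k are illustrative assumptions. Quartiles are computed as Tukey's hinges, and the sketch assumes at least one value lies inside the fences.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Compute the 5-point summary, fences, whisker ends, and outside values."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                       # median counted in both halves when n is odd
    q1 = median(ordered[:half])               # lower hinge
    q3 = median(ordered[n - half:])           # upper hinge
    iqr = q3 - q1
    fence_lower = q1 - k * iqr                # fences are computed but not drawn
    fence_upper = q3 + k * iqr
    inside = [x for x in ordered if fence_lower <= x <= fence_upper]
    outside = [x for x in ordered if x < fence_lower or x > fence_upper]
    return {
        "five_point": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "iqr": iqr,
        "fences": (fence_lower, fence_upper),
        "whiskers": (inside[0], inside[-1]),  # lower and upper inside values
        "outside": outside,                   # plotted as separate points
    }


example1 = boxplot_stats([5, 11, 21, 24, 27, 28, 30, 42, 50, 52])
example2 = boxplot_stats([3, 21, 22, 24, 25, 26, 28, 29, 31, 51])
print(example1["fences"], example1["whiskers"], example1["outside"])
print(example2["fences"], example2["whiskers"], example2["outside"])
```

For the first data set the whiskers reach the minimum and maximum because nothing falls outside the fences; for the second, 3 and 51 are reported as outside values and the whiskers stop at 21 and 31.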
Illustrative Example: Boxplot 3
Seven metabolic rates:
1362 1439 1460 1614 1666 1792 1867