Chapter 3: Numerical Descriptions of Data
Chapter 1 discussed what a population, sample, parameter, and statistic are, and how to take different types of samples. Chapter 2 discussed ways to graphically display data. There was also a discussion of important characteristics: center, variation, distribution, outliers, and changing characteristics of the data over time. Distributions and outliers can be examined using graphical means. Finding the center and variation can be done using numerical methods that will be discussed in this chapter. Both graphical and numerical methods are part of a branch of statistics known as descriptive statistics. Later, descriptive statistics will be used to make decisions and/or estimate population parameters using methods that are part of the branch called inferential statistics.

Section 3.1: Measures of Center
This section focuses on measures of central tendency. Many times you are asking what to expect on average. For example, when you pick a major, you would probably ask how much you expect to earn in that field. If you are thinking of relocating to a new town, you might ask how much you can expect to pay for housing. If you are planting vegetables in the spring, you might want to know how long it will be until you can harvest. These questions, and many more, can be answered by knowing the center of the data set.

There are three measures of the “center” of the data. They are the mode, median, and mean. Any of the values can be referred to as the “average.”

The mode is the data value that occurs the most frequently in the data. To find it, you count how often each data value occurs, and then determine which data value occurs most often.

The median is the data value in the middle of a sorted list of data. To find it, you put the data in order, and then determine which data value is in the middle of the data set.

The mean is the arithmetic average of the numbers. This is the center that most people call the average, though all three (mean, median, and mode) really are averages.

There are no symbols for the mode and the median, but the mean is used a great deal, and statisticians gave it a symbol. There are actually two symbols, one for the population parameter and one for the sample statistic. In most cases you cannot find the population parameter, so you use the sample statistic to estimate the population parameter.
Population Mean:
$\mu = \dfrac{\sum x}{N}$, pronounced mu
N is the size of the population. x represents a data value. $\sum x$ means to add up all of the data values.
Sample Mean:
$\bar{x} = \dfrac{\sum x}{n}$, pronounced x bar
n is the size of the sample. x represents a data value. $\sum x$ means to add up all of the data values.
The value for $\bar{x}$ is used to estimate $\mu$ since $\mu$ can’t be calculated in most situations.

Example #3.1.1: Finding the Mean, Median, and Mode
Suppose a vet wants to find the average weight of cats. The weights (in pounds) of five cats are in table #3.1.1.
Table #3.1.1: Weights of Cats in Pounds
6.8  8.2  7.5  9.4  8.2
Find the mean, median, and mode of the weight of a cat.
Solution: Before starting any mathematics problem, it is always a good idea to define the unknown in the problem. In this case, you want to define the variable. The symbol for the variable is x.
The variable is x = weight of a cat
Mean:
$\bar{x} = \dfrac{6.8 + 8.2 + 7.5 + 9.4 + 8.2}{5} = \dfrac{40.1}{5} = 8.02$ pounds
Median:
You need to sort the list for both the median and mode. The sorted list is in table #3.1.2.
Table #3.1.2: Sorted List of Cats’ Weights
6.8  7.5  8.2  8.2  9.4
There are 5 data points so the middle of the list would be the 3rd number. (Just put a finger at each end of the list and move them toward the center one number at a time. Where your fingers meet is the median.)
Table #3.1.3: Sorted List of Cats’ Weights with Median Marked
6.8  7.5  [8.2]  8.2  9.4
The median is therefore 8.2 pounds.
Mode:
This is easiest to do from the sorted list that is in table #3.1.2. Which value appears most often? The number 8.2 appears twice, while all other numbers appear once.
Mode = 8.2 pounds.
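The hand calculations in example #3.1.1 can be checked with a short R sketch. Base R has mean() and median(), but its mode() function reports how an object is stored rather than the most frequent value, so the sketch below counts values with table() instead. The variable names counts and modes are illustrative choices, not part of the example.

```r
# Weights (in pounds) of the five cats from table #3.1.1
weights <- c(6.8, 8.2, 7.5, 9.4, 8.2)

mean(weights)    # arithmetic average: sum of the values divided by n
median(weights)  # middle value of the sorted list

# Count how often each value occurs, then keep the value(s)
# that share the highest count.
counts <- table(weights)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes
```

Because the last step keeps every value tied for the highest count, it would return two values for a bimodal data set.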
A data set can have more than one mode. If two values tie for occurring most often, then both values are modes and the data is called bimodal (two modes). If every data point occurs the same number of times, there is no mode. If more than two values tie for occurring most often, then usually there is no mode.

In example #3.1.1, there were an odd number of data points. In that case, the median was just the middle number. What happens if there is an even number of data points? What would you do?

Example #3.1.2: Finding the Median with an Even Number of Data Points
Suppose a vet wants to find the median weight of cats. The weights (in pounds) of six cats are in table #3.1.4. Find the median.
Table #3.1.4: Weights of Six Cats
6.8  8.2  7.5  9.4  8.2  6.3
Solution:
Variable: x = weight of a cat
First sort the list if it is not already sorted. There are 6 numbers in the list so the number in the middle is between the 3rd and 4th number. Use your fingers starting at each end of the list in table #3.1.5 and move toward the center until they meet. There are two numbers there.
Table #3.1.5: Sorted List of Weights of Six Cats
6.3  6.8  7.5  8.2  8.2  9.4
To find the median, just average the two numbers.
$\text{median} = \dfrac{7.5 + 8.2}{2} = 7.85$ pounds
The median is 7.85 pounds.
Example #3.1.3: Finding Mean and Median using Technology
Suppose a vet wants to find the mean and median weight of cats. The weights (in pounds) of six cats are in table #3.1.4. Find the mean and median.
Solution:
Variable: x = weight of a cat
You can do the calculations for the mean and median using technology. The procedure for calculating the sample mean ($\bar{x}$) and the sample median (Med) on the TI-83/84 is in figures 3.1.1 through 3.1.4. First you need to go into the STAT menu, and then Edit. This will allow you to type in your data (see figure #3.1.1).
Figure #3.1.1: TI-83/84 Calculator Edit Setup
Once you have the data into the calculator, you then go back to the STAT menu, move over to CALC, and then choose 1-Var Stats (see figure #3.1.2). The calculator will now put 1-Var Stats on the main screen. Now type in L1 (2nd button and 1) and then press ENTER. (Note if you have the newer operating system on the TI-84, then the procedure is slightly different.) If you press the down arrow, you will see the rest of the output from the calculator. The results from the calculator are in figure #3.1.3.
Figure #3.1.2: TI-83/84 Calculator CALC Menu
Figure #3.1.3: TI-83/84 Calculator Input for Example #3.1.3 Variable
Figure #3.1.4: TI-83/84 Calculator Results for Example #3.1.3 Variable
The commands for finding the mean and median using R are as follows:
variable<-c(type in your data with commas in between)
To find the mean, use mean(variable)
To find the median, use median(variable)
So for this example, the commands would be
weights<-c(6.8, 8.2, 7.5, 9.4, 8.2, 6.3)
mean(weights)
[1] 7.733333
median(weights)
[1] 7.85
Example #3.1.4: Effect of Extreme Values on Mean and Median
Suppose you have the same set of cats from example #3.1.1 but one additional cat was added to the data set. Table #3.1.6 contains the six cats’ weights, in pounds.
Table #3.1.6: Weights of Six Cats
6.8  7.5  8.2  8.2  9.4  22.1
Find the mean and the median.
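Solution:
Variable: x = weight of a cat
$\text{mean} = \bar{x} = \dfrac{6.8 + 7.5 + 8.2 + 8.2 + 9.4 + 22.1}{6} = \dfrac{62.2}{6} \approx 10.37$ pounds
The data is already in order, and with six values the median is the average of the 3rd and 4th values, which are 8.2 and 8.2.
$\text{median} = \dfrac{8.2 + 8.2}{2} = 8.2$ pounds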
The mean is much higher than the median. Why is this? Notice that when the value of 22.1 was added, the mean went from 8.02 to 10.37, but the median did not change at all. This is because the mean is affected by extreme values, while the median is not. The very heavy cat brought the mean weight up. In this case, the median is a much better measure of the center.
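The comparison in example #3.1.4 can be reproduced with a few lines of R. This is only a sketch for checking the arithmetic. The names five_cats and six_cats are illustrative.

```r
# Sorted weights from example #3.1.1, then the same cats plus the 22.1-pound cat
five_cats <- c(6.8, 7.5, 8.2, 8.2, 9.4)
six_cats <- c(five_cats, 22.1)

# The mean shifts noticeably when the extreme value is added ...
mean(five_cats)
mean(six_cats)

# ... while the median stays where it was
median(five_cats)
median(six_cats)
```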
An outlier is a data value that is very different from the rest of the data. It can be really high or really low. An extreme value may be an outlier if it is far enough from the center. In example #3.1.4, the data value 22.1 pounds is an extreme value and it may be an outlier.

If there are extreme values in the data, the median is a better measure of the center than the mean. If there are no extreme values, the mean and the median will be similar so most people use the mean.

The mean is not a resistant measure because it is affected by extreme values. The median and the mode are resistant measures because they are not affected by extreme values.

As a consumer you need to be aware that people choose the measure of center that best supports their claim. When you read an article in the newspaper and it talks about the “average” it usually means the mean but sometimes it refers to the median. Some articles will use the word “median” instead of “average” to be more specific. If you need to make an important decision and the information says “average”, it would be wise to ask if the “average” is the mean or the median before you decide.

As an example, suppose that a company wants to use the mean salary as the average salary for the company. This is because the high salaries of the administration will pull the mean higher. The company can say that the employees are paid well because the average is high. However, the employees want to use the median since it discounts the extreme values of the administration and will give a lower value of the average. This will make the salaries seem lower and suggest that a raise is in order.

Why use the mean instead of the median? The reason is that when multiple samples are taken from the same population, the sample means tend to be more consistent than other measures of the center. In that sense, the sample mean is the more reliable measure of center.
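The claim that sample means tend to be more consistent from sample to sample can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not a proof: the population is assumed to be normally distributed with mean 8 and standard deviation 1 (made-up values), the sample size and number of repetitions are arbitrary, and sd() measures spread, which is covered in section 3.2. For populations with many extreme values the comparison can come out differently.

```r
set.seed(1)  # arbitrary seed so the run can be repeated

reps <- 2000  # number of samples to draw (arbitrary)
n <- 25       # size of each sample (arbitrary)

# Each column is one sample from the assumed normal population
samples <- replicate(reps, rnorm(n, mean = 8, sd = 1))

sample_means <- apply(samples, 2, mean)
sample_medians <- apply(samples, 2, median)

# Spread of each statistic across the repeated samples
sd(sample_means)
sd(sample_medians)
```

A smaller value in the first line than in the second means the sample means varied less from sample to sample than the sample medians did in this run.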
To understand how the different measures of center relate to skewed or symmetric distributions, see figure #3.1.5. As you can see, sometimes the mean is smaller than the median and mode, sometimes the mean is larger than the median and mode, and sometimes they are the same values.
Figure #3.1.5: Mean, Median, Mode as Related to a Distribution
One last type of average is a weighted average. Weighted averages are used quite often in real life. Some teachers use them in calculating your grade in the course, or your grade on a project. Some employers use them in employee evaluations. The idea is that some activities are more important than others. As an example, a fulltime teacher at a community college may be evaluated on their service to the college, their service to the community, whether their paperwork is turned in on time, and their teaching. However, teaching is much more important than whether their paperwork is turned in on time. When the evaluation is completed, more weight needs to be given to the teaching and less to the paperwork. This is a weighted average.
Weighted Average
$\text{weighted average} = \dfrac{\sum xw}{\sum w}$
where w is the weight of the data value, x.
Example #3.1.5: Weighted Average
In your biology class, your final grade is based on several things: a lab score, scores on two major tests, and your score on the final exam. There are 100 points available for each score. The lab score is worth 15% of the course, the two exams are worth 25% of the course each, and the final exam is worth 35% of the course. Suppose you earned scores of 95 on the labs, 83 and 76 on the two exams, and 84 on the final exam. Compute your weighted average for the course.
Solution: Variable: x = score
The weighted average is
$\dfrac{\sum xw}{\sum w} = \dfrac{\text{sum of the scores times their weights}}{\text{sum of all the weights}}$
$\text{weighted average} = \dfrac{95(0.15) + 83(0.25) + 76(0.25) + 84(0.35)}{0.15 + 0.25 + 0.25 + 0.35} = \dfrac{83.4}{1.00} = 83.4\%$
A weighted average can be found using technology. The procedure for calculating the weighted average on the TI-83/84 is in figures 3.1.6 through 3.1.9. First you need to go into the STAT menu, and then Edit. This will allow you to type in the scores into L1 and the weights into L2 (see figure #3.1.6).
Figure #3.1.6: TI-83/84 Calculator Edit Setup
Once you have the data in the calculator, you then go back to the STAT menu, move over to CALC, and then choose 1-Var Stats (see figure #3.1.7). The calculator will now put 1-Var Stats on the main screen. Now type in L1 (2nd button and 1), then a comma (button above the 7 button), and then L2 (2nd button and 2) and then press ENTER. (Note if you have the newer operating system on the TI-84, then the procedure is slightly different.) The results from the calculator are in figure #3.1.9. The $\bar{x}$ is the weighted average.
Figure #3.1.7: TI-83/84 Calculator CALC Menu
Figure #3.1.8: TI-83/84 Calculator Input for Weighted Average
Figure #3.1.9: TI-83/84 Calculator Results for Weighted Average
The commands for finding the weighted average using R are as follows:
x<-c(type in your data with commas in between)
w<-c(type in your weights with commas in between)
weighted.mean(x,w)
So for this example, the commands would be
x<-c(95, 83, 76, 84)
w<-c(.15, .25, .25, .35)
weighted.mean(x,w)
[1] 83.4
Example #3.1.6: Weighted Average
The faculty evaluation process at John Jingle University rates a faculty member on the following activities: teaching, publishing, committee service, community service, and submitting paperwork in a timely manner. The process involves reviewing student evaluations, peer evaluations, and supervisor evaluation for each teacher and awarding him/her a score on a scale from 1 to 10 (with 10 being the best). The weights for each activity are 20 for teaching, 18 for publishing, 6 for committee service, 4 for community service, and 2 for paperwork.
a) One faculty member had the following ratings: 8 for teaching, 9 for publishing, 2 for committee work, 1 for community service, and 8 for paperwork. Compute the weighted average of the evaluation.
Solution:
Variable: x = rating
The weighted average is
$\dfrac{\sum xw}{\sum w} = \dfrac{\text{sum of the scores times their weights}}{\text{sum of all the weights}}$
$\text{evaluation} = \dfrac{8(20) + 9(18) + 2(6) + 1(4) + 8(2)}{20 + 18 + 6 + 4 + 2} = \dfrac{354}{50} = 7.08$
b) Another faculty member had ratings of 6 for teaching, 8 for publishing, 9 for committee work, 10 for community service, and 10 for paperwork. Compute the weighted average of the evaluation.
Solution:
$\text{evaluation} = \dfrac{6(20) + 8(18) + 9(6) + 10(4) + 10(2)}{20 + 18 + 6 + 4 + 2} = \dfrac{378}{50} = 7.56$
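Both evaluations in example #3.1.6 can be checked in R with weighted.mean(), the same function used for example #3.1.5. In this sketch the names w, ratings_a, and ratings_b are illustrative. The weights do not need to add up to 1, because the function divides by the sum of the weights.

```r
# Weights for teaching, publishing, committee service,
# community service, and paperwork (in that order)
w <- c(20, 18, 6, 4, 2)

ratings_a <- c(8, 9, 2, 1, 8)     # first faculty member
ratings_b <- c(6, 8, 9, 10, 10)   # second faculty member

weighted.mean(ratings_a, w)
weighted.mean(ratings_b, w)

# The same calculation written out from the formula
sum(ratings_a * w) / sum(w)
```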
c) Which faculty member had the higher average evaluation?
Solution:
The second faculty member has a higher average evaluation.
You can find a weighted average using technology.

The last thing to mention is which average is used on which type of data.
Mode can be found on nominal, ordinal, interval, and ratio data, since the mode is just the data value that occurs most often. You are just counting the data values.
Median can be found on ordinal, interval, and ratio data, since you need to put the data in order. As long as there is order to the data you can find the median.
Mean can be found on interval and ratio data, since you must have numbers to add together.
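To illustrate why the mode works even on nominal data, here is a small R sketch. The coat colors are made-up data used only for illustration. Counting with table() needs no ordering and no arithmetic, which is why the mode is available at every level of measurement.

```r
# Hypothetical nominal data: coat colors of seven cats (invented values)
coat_colors <- c("black", "tabby", "white", "tabby", "calico", "tabby", "black")

# Count each category and report the most frequent one
color_counts <- table(coat_colors)
color_counts
names(color_counts)[color_counts == max(color_counts)]
```

Trying to add these values together would make no sense, which matches the statement that the mean requires interval or ratio data.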
Section 3.1: Homework
1.) Cholesterol levels were collected from patients two days after they had a heart attack (Ryan, Joiner & Ryan, Jr, 1985) and are in table #3.1.7. Find the mean, median, and mode.
Table #3.1.7: Cholesterol Levels
270 236 210 142 280 272 160 220 226 242 186 266 206 318 294 282 234 224 276 282 360 310 280 278 288 288 244 236
2.) The lengths (in kilometers) of rivers on the South Island of New Zealand that flow to the Pacific Ocean are listed in table #3.1.8 (Lee, 1994). Find the mean, median, and mode.
Table #3.1.8: Lengths of Rivers (km) Flowing to Pacific Ocean
River         Length (km)    River         Length (km)
Clarence      209            Clutha        322
Conway        48             Taieri        288
Waiau         169            Shag          72
Hurunui       138            Kakanui       64
Waipara       64             Rangitata     121
Ashley        97             Ophi          80
Waimakariri   161            Pareora       56
Selwyn        95             Waihao        64
Rakaia        145            Waitaki       209
Ashburton     90
3.) The lengths (in kilometers) of rivers on the South Island of New Zealand that flow to the Tasman Sea are listed in table #3.1.9 (Lee, 1994). Find the mean, median, and mode.
Table #3.1.9: Lengths of Rivers (km) Flowing to Tasman Sea
River         Length (km)    River         Length (km)
Hollyford     76             Waimea        48
Cascade       64             Motueka       108
Arawhata      68             Takaka        72
Haast         64             Aorere        72
Karangarua    37             Heaphy        35
Cook          32             Karamea       80
Waiho         32             Mokihinui     56
Whataroa      51             Buller        177
Wanganui      56             Grey          121
Waitaha       40             Taramakau     80
Hokitika      64             Arahura       56
4.) Eyeglassmatic manufactures eyeglasses for their retailers. They research to see how many defective lenses they made during the time period of January 1 to March 31. Table #3.1.10 contains the defect and the number of defects. Find the mean, median, and mode.
Table #3.1.10: Number of Defective Lenses
Defect type              Number of defects
Scratch                  5865
Right shaped – small     4613
Flaked                   1992
Wrong axis               1838
Chamfer wrong            1596
Crazing, cracks          1546
Wrong shape              1485
Wrong PD                 1398
Spots and bubbles        1371
Wrong height             1130
Right shape – big        1105
Lost in lab              976
Spots/bubble – intern    976
5.) Print-O-Matic printing company’s employees have salaries that are contained in table #3.1.11.
Table #3.1.11: Salaries of Print-O-Matic Printing Company Employees
Employee                     Salary ($)
CEO                          272,500
Driver                       58,456
CD74                         100,702
CD65                         57,380
Embellisher                  73,877
Folder                       65,270
GTO                          74,235
Handwork                     52,718
Horizon                      76,029
ITEK                         64,553
Mgmt                         108,448
Platens                      69,573
Polar                        75,526
Pre Press Manager            108,448
Pre Press Manager/ IT        98,837
Pre Press/ Graphic Artist    75,311
Designer                     90,090
Sales                        109,739
Administration               66,346
a.) Find the mean and median.
b.) Find the mean and median with the CEO’s salary removed.
c.) What happened to the mean and median when the CEO’s salary was removed? Why?
d.) If you were the CEO, who is answering concerns from the union that employees are underpaid, which average of the complete data set would you prefer? Why?
e.) If you were a platen worker, who believes that the employees need a raise, which average would you prefer? Why?
6.) Print-O-Matic printing company spends specific amounts on fixed costs every month. The costs of those fixed costs are in table #3.1.12.
Table #3.1.12: Fixed Costs for Print-O-Matic Printing Company
Monthly charges      Monthly cost ($)
Bank charges         482
Cleaning             2208
Computer expenses    2471
Lease payments       2656
Postage              2117
Uniforms             2600
a.) Find the mean and median.
b.) Find the mean and median with the bank charges removed.
c.) What happened to the mean and median when the bank charges were removed? Why?
d.) If it is your job to oversee the fixed costs, which average using the complete data set would you prefer to use when submitting a report to administration to show that costs are low? Why?
e.) If it is your job to find places in the budget to reduce costs, which average using the complete data set would you prefer to use when submitting a report to administration to show that fixed costs need to be reduced? Why?
7.) State which type of measurement scale each represents, and then which center measures can be used for the variable.
a.) You collect data on people’s likelihood (very likely, likely, neutral, unlikely, very unlikely) to vote for a candidate.
b.) You collect data on the diameter at breast height of trees in the Coconino National Forest.
c.) You collect data on the year wineries were started.
d.) You collect the drink types that people in Sydney, Australia drink.
8.) State which type of measurement scale each represents, and then which center measures can be used for the variable.
a.) You collect data on the height of plants using a new fertilizer.
b.) You collect data on the cars that people drive in Campbelltown, Australia.
c.) You collect data on the temperature at different locations in Antarctica.
d.) You collect data on the first, second, and third winner in a beer competition.
9.) Looking at graph #3.1.1, state if the graph is skewed left, skewed right, or symmetric and then state which is larger, the mean or the median.
Graph #3.1.1: Skewed or Symmetric Graph
10.) Looking at graph #3.1.2, state if the graph is skewed left, skewed right, or symmetric and then state which is larger, the mean or the median.
Graph #3.1.2: Skewed or Symmetric Graph
11.) An employee at Coconino Community College (CCC) is evaluated based on goal setting and accomplishments toward the goals, job effectiveness, competencies, and CCC core values. Suppose for a specific employee, goal 1 has a weight of 30%, goal 2 has a weight of 20%, job effectiveness has a weight of 25%, competency 1 has a weight of 4%, competency 2 has a weight of 3%, competency 3 has a weight of 3%, competency 4 has a weight of 3%, competency 5 has a weight of 2%, and core values has a weight of 10%. Suppose the employee has scores of 3.0 for goal 1, 3.0 for goal 2, 2.0 for job effectiveness, 3.0 for competency 1, 2.0 for competency 2, 2.0 for competency 3, 3.0 for competency 4, 4.0 for competency 5, and 3.0 for core values. Find the weighted average score for this employee. If an employee has a score less than 2.5, they must have a Performance Enhancement Plan written. Does this employee need a plan?
12.) An employee at Coconino Community College (CCC) is evaluated based on goal setting and accomplishments toward goals, job effectiveness, competencies, and CCC core values. Suppose for a specific employee, goal 1 has a weight of 20%, goal 2 has a weight of 20%, goal 3 has a weight of 10%, job effectiveness has a weight of 25%, competency 1 has a weight of 4%, competency 2 has a weight of 3%, competency 3 has a weight of 3%, competency 4 has a weight of 5%, and core values has a weight of 10%. Suppose the employee has scores of 2.0 for goal 1, 2.0 for goal 2, 4.0 for goal 3, 3.0 for job effectiveness, 2.0 for competency 1, 3.0 for competency 2, 2.0 for competency 3, 3.0 for competency 4, and 4.0 for core values. Find the weighted average score for this employee. If an employee has a score less than 2.5, they must have a Performance Enhancement Plan written. Does this employee need a plan?
13.) A statistics class has the following activities and weights for determining a grade in the course: test 1 worth 15% of the grade, test 2 worth 15% of the grade, test 3 worth 15% of the grade, homework worth 10% of the grade, semester project worth 20% of the grade, and the final exam worth 25% of the grade. If a student receives an 85 on test 1, a 76 on test 2, an 83 on test 3, a 74 on the homework, a 65 on the project, and a 79 on the final, what grade did the student earn in the course?
14.) A statistics class has the following activities and weights for determining a grade in the course: test 1 worth 15% of the grade, test 2 worth 15% of the grade, test 3 worth 15% of the grade, homework worth 10% of the grade, semester project worth 20% of the grade, and the final exam worth 25% of the grade. If a student receives a 92 on test 1, an 85 on test 2, a 95 on test 3, a 92 on the homework, a 55 on the project, and an 83 on the final, what grade did the student earn in the course?
Section 3.2: Measures of Spread
Variability is an important idea in statistics. If you were to measure the height of everyone in your classroom, every observation gives you a different value. That means not every student has the same height. Thus there is variability in people’s heights. If you were to take a sample of the income level of people in a town, every sample gives you different information. There is variability between samples too. Variability describes how the data are spread out. If the data are very close to each other, then there is low variability. If the data are very spread out, then there is high variability. How do you measure variability? It would be good to have a number that measures it. This section will describe some of the different measures of variability, also known as variation.

In example #3.1.1, the average weight of a cat was calculated to be 8.02 pounds. How much does this tell you about the weight of all cats? Can you tell if most of the weights were close to 8.02 or were the weights really spread out? What are the highest weight and the lowest weight? All you know is that the center of the weights is 8.02 pounds. You need more information.

The range of a set of data is the difference between the highest and the lowest data values (or maximum and minimum values).
Range = highest value − lowest value = maximum value − minimum value

Example #3.2.1: Finding the Range
Look at the following three sets of data. Find the range of each of these. a) 10, 20, 30, 40, 50 Solution:
Graph #3.2.1: Dot Plot for Example #3.2.1a
mean = 30, median = 30, range = 50 −10 = 40
b) 10, 29, 30, 31, 50 Solution:
Graph #3.2.2: Dot Plot for Example #3.2.1b
mean = 30, median = 30, range = 50 −10 = 40
c) 28, 29, 30, 31, 32 Solution:
Graph #3.2.3: Dot Plot for Example #3.2.1c
mean = 30, median = 30, range = 32 − 28 = 4
Based on the mean, median, and range in example #3.2.1, the first two distributions appear to be the same, but you can see from the graphs that they are different. In example #3.2.1a the data are spread out equally. In example #3.2.1b the data have a clump in the middle and a single value at each end. The mean and median for example #3.2.1c are the same as for the other two data sets, but the range is very different. All the data are clumped together in the middle.
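The three data sets in example #3.2.1 can be summarized side by side in R. This is a sketch for checking the values. The helper function summarize_set and the other names are illustrative, not standard R functions.

```r
set_a <- c(10, 20, 30, 40, 50)
set_b <- c(10, 29, 30, 31, 50)
set_c <- c(28, 29, 30, 31, 32)

# Illustrative helper: mean, median, and range (maximum minus minimum)
summarize_set <- function(x) {
  c(mean = mean(x), median = median(x), range = max(x) - min(x))
}

# One column per data set, one row per statistic
summary_table <- sapply(list(a = set_a, b = set_b, c = set_c), summarize_set)
summary_table
```

The columns for a and b come out identical even though the dot plots differ, which is the limitation of the range that the text describes.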
The range doesn’t really provide a very accurate picture of the variability. A better way to describe how the data is spread out is needed. Instead of looking at the distance between the highest and lowest values, how about looking at the distance each value is from the mean? This distance is called the deviation.

Example #3.2.2: Finding the Deviations
Suppose a vet wants to analyze the weights of cats. The weights (in pounds) of five cats are 6.8, 8.2, 7.5, 9.4, and 8.2. Find the deviation for each of the data values.
Solution:
Variable: x = weight of a cat
The mean for this data set is $\bar{x} = 8.02$ pounds.
Table #3.2.1: Deviations of Weights of Cats
x      $x - \bar{x}$
6.8    6.8 − 8.02 = −1.22
8.2    8.2 − 8.02 = 0.18
7.5    7.5 − 8.02 = −0.52
9.4    9.4 − 8.02 = 1.38
8.2    8.2 − 8.02 = 0.18
Now you might want to average the deviation, so you need to add the deviations together.
Table #3.2.2: Sum of Deviations of Weights of Cats
x        $x - \bar{x}$
6.8      6.8 − 8.02 = −1.22
8.2      8.2 − 8.02 = 0.18
7.5      7.5 − 8.02 = −0.52
9.4      9.4 − 8.02 = 1.38
8.2      8.2 − 8.02 = 0.18
Total    0
This can’t be right. The average distance from the mean cannot be 0. The reason it adds to 0 is because there are some positive and negative values. You need to get rid of the negative signs. How can you do that? You could square each deviation.
Table #3.2.3: Squared Deviations of Weights of Cats
x        $x - \bar{x}$            $(x - \bar{x})^2$
6.8      6.8 − 8.02 = −1.22    1.4884
8.2      8.2 − 8.02 = 0.18     0.0324
7.5      7.5 − 8.02 = −0.52    0.2704
9.4      9.4 − 8.02 = 1.38     1.9044
8.2      8.2 − 8.02 = 0.18     0.0324
Total    0                     3.728
Now average the total of the squared deviations. The only thing is that in statistics there is a strange average here. Instead of dividing by the number of data values you divide by the number of data values minus 1. In this case you would have
$s^2 = \dfrac{3.728}{5 - 1} = \dfrac{3.728}{4} = 0.932 \text{ pounds}^2$
Notice that this is denoted as $s^2$. This is called the variance and it is a measure of the average squared distance from the mean. If you now take the square root, you will get the average distance from the mean. This is called the standard deviation, and is denoted with the letter s.
$s = \sqrt{0.932} \approx 0.965$ pounds
The standard deviation is the average (mean) distance from a data point to the mean. It can be thought of as how much a typical data point differs from the mean.
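The steps of example #3.2.2 (deviations, squared deviations, dividing by n − 1, and taking the square root) can be followed in R. The sketch below mirrors the hand calculation and then compares it with R's built-in var() and sd() functions. The intermediate names are illustrative.

```r
# Weights (in pounds) of the five cats
weights <- c(6.8, 8.2, 7.5, 9.4, 8.2)

# Step 1: deviation of each value from the mean
deviations <- weights - mean(weights)
sum(deviations)   # zero, apart from tiny floating-point rounding

# Step 2: square each deviation and add them up
squared_deviations <- deviations^2
sum(squared_deviations)

# Step 3: divide by n - 1 to get the sample variance
s2 <- sum(squared_deviations) / (length(weights) - 1)

# Step 4: the square root gives the sample standard deviation
s <- sqrt(s2)
s2
s

# Built-in functions for comparison
var(weights)
sd(weights)
```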
The sample variance formula:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
where $\bar{x}$ is the sample mean, n is the sample size, and $\sum$ means to find the sum.
The sample standard deviation formula:
$s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
The n − 1 on the bottom has to do with a concept called degrees of freedom. Basically, it makes the sample standard deviation a better approximation of the population standard deviation.
The population variance formula:
$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
where $\sigma$ is the Greek letter sigma and $\sigma^2$ represents the population variance, $\mu$ is the population mean, and N is the size of the population.
The population standard deviation formula:
$\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
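To see how the two denominators differ in practice, the following R sketch applies both formulas to the five cat weights. Treating these five weights as an entire population is an assumption made purely for illustration. R's var() and sd() functions divide by n − 1, so the population versions are written out by hand here.

```r
x <- c(6.8, 8.2, 7.5, 9.4, 8.2)
n <- length(x)

sum_sq <- sum((x - mean(x))^2)   # sum of squared deviations

# Sample formulas: divide by n - 1 (this is what var() and sd() do)
sample_var <- sum_sq / (n - 1)
sample_sd <- sqrt(sample_var)

# Population formulas: divide by N, assuming the data are the whole population
pop_var <- sum_sq / n
pop_sd <- sqrt(pop_var)

sample_var
sample_sd
pop_var
pop_sd
```

The population values come out slightly smaller because the same sum is divided by a larger number.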
Note: the sum of the deviations should always be 0. If it isn’t, then it is because you rounded, you used the median instead of the mean, or you made an error. Try not to round too much in the calculations for standard deviation since each rounding causes a slight error.

Example #3.2.3: Finding the Standard Deviation
Suppose that a manager wants to test two new training programs. He randomly selects 5 people for each training type and measures the time it takes to complete a task after the training. The times for both trainings are in table #3.2.4. Which training method is better?
Table #3.2.4: Time to Finish Task in Minutes
Training 1    56  75  48  63  59
Training 2    60  58  66  59  58
Solution: It is important that you define what each variable is since there are two of them.
Variable 1: X1 = productivity from training 1
Variable 2: X2 = productivity from training 2
To answer which training method is better, first you need some descriptive statistics. Start with the mean for each sample.
$\bar{x}_1 = \dfrac{56 + 75 + 48 + 63 + 59}{5} = 60.2$ minutes
$\bar{x}_2 = \dfrac{60 + 58 + 66 + 59 + 58}{5} = 60.2$ minutes
Since both means are the same values, you cannot answer the question about which is better. Now calculate the standard deviation for each sample.
Table #3.2.5: Squared Deviations for Training 1
$x_1$     $x_1 - \bar{x}_1$    $(x_1 - \bar{x}_1)^2$
56       −4.2       17.64
75       14.8       219.04
48       −12.2      148.84
63       2.8        7.84
59       −1.2       1.44
Total    0          394.8
Table #3.2.6: Squared Deviations for Training 2
$x_2$     $x_2 - \bar{x}_2$    $(x_2 - \bar{x}_2)^2$
60       −0.2       0.04
58       −2.2       4.84
66       5.8        33.64
59       −1.2       1.44
58       −2.2       4.84
Total    0          44.8
The variance for each sample is:
$s_1^2 = \dfrac{394.8}{5 - 1} = 98.7 \text{ minutes}^2$
$s_2^2 = \dfrac{44.8}{5 - 1} = 11.2 \text{ minutes}^2$
The standard deviations are:
$s_1 = \sqrt{98.7} \approx 9.93$ minutes
$s_2 = \sqrt{11.2} \approx 3.35$ minutes
From the standard deviations, the second training seemed to be the better training since the data is less spread out. This means it is more consistent. It would be better for the managers in this case to have a training program that produces more consistent results so they know what to expect for the time it takes to complete the task.
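The comparison between the two trainings can be checked in R with the built-in mean(), var(), and sd() functions. This is a sketch for verifying the hand calculations. The names training_1 and training_2 are illustrative.

```r
# Times (in minutes) from table #3.2.4
training_1 <- c(56, 75, 48, 63, 59)
training_2 <- c(60, 58, 66, 59, 58)

# The means match, so they cannot separate the two trainings
c(mean(training_1), mean(training_2))

# Sample variances and standard deviations (both divide by n - 1)
c(var(training_1), var(training_2))
c(sd(training_1), sd(training_2))
```

The smaller standard deviation for the second training is what supports calling it the more consistent program.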
You can do the calculations for the descriptive statistics using technology. The procedure for calculating the sample mean ($\bar{x}$) and the sample standard deviation ($s_x$) for X2 in example #3.2.3 on the TI-83/84 is in figures 3.2.1 through 3.2.4 (the procedure is the same for X1). Note the calculator also gives you the population standard deviation ($\sigma_x$) because it doesn’t know whether the data you input is a population or a sample. You need to decide which value to use, based on whether you have a population or a sample. In almost all cases you have a sample and will be using $s_x$. Also, the calculator uses the notation $s_x$ instead of just s. It is just a way for it to denote the information. First you need to go into the STAT menu, and then Edit. This will allow you to type in your data (see figure #3.2.1).
Figure #3.2.1: TI-83/84 Calculator Edit Setup
Once you have the data into the calculator, you then go back to the STAT menu, move over to CALC, and then choose 1-Var Stats (see figure #3.2.2). The calculator will now put 1-Var Stats on the main screen. Now type in L2 (2nd button and 2) and then press ENTER. (Note if you have the newer operating system on the TI-84, then the procedure is slightly different.) The results from the calculator are in figure #3.2.4.
Figure #3.2.2: TI-83/84 Calculator CALC Menu
Figure #3.2.3: TI-83/84 Calculator Input for Example #3.2.3 Variable X2
Figure #3.2.4: TI-83/84 Calculator Results for Example #3.2.3 Variable X2
The processes for finding the mean, median, range, standard deviation, and variance in R are as follows:
variable<-c(type in your data)
To find the mean, use mean(variable)
To find the median, use median(variable)
To find the range, use range(variable). Then find maximum – minimum.
To find the standard deviation, use sd(variable)
To find the variance, use var(variable)
For the second data set in example #3.2.3, the commands and results would be
productivity_2<-c(60, 58, 66, 59, 58)
mean(productivity_2)
[1] 60.2
median(productivity_2)
[1] 59
range(productivity_2)
[1] 58 66
sd(productivity_2)
[1] 3.34664
var(productivity_2)
[1] 11.2
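Because range() returns the minimum and the maximum rather than a single number, the subtraction still has to be done separately. The sketch below shows two equivalent ways to do it for the data in example #3.2.3. The use of diff() is a convenience that the text does not mention, and the names limits, range_value, and range_value_alt are illustrative.

```r
productivity_2 <- c(60, 58, 66, 59, 58)

# range() returns a vector holding the minimum and the maximum
limits <- range(productivity_2)

# Option 1: subtract the minimum from the maximum explicitly
range_value <- max(productivity_2) - min(productivity_2)

# Option 2: diff() subtracts the first element of limits from the second
range_value_alt <- diff(limits)

print(range_value)
print(range_value_alt)
```

Both options give the maximum minus the minimum, which for this data set is 66 − 58.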
In general, a “small” standard deviation means the data is close together (more consistent) and a “large” standard deviation means the data is spread out (less consistent). Sometimes you want consistent data and sometimes you don’t. As an example, if you are making bolts, you want the lengths to be very consistent, so you want a small standard deviation. If you are administering a test to see who can be a pilot, you want a large standard deviation so you can tell who are the good pilots and who are the bad ones.
What do “small” and “large” mean? To a bicyclist whose average speed is 20 mph, s = 20 mph is huge. To an airplane whose average speed is 500 mph, s = 20 mph is nothing. The “size” of the variation depends on the size of the numbers in the problem and the mean. Another situation where you can determine whether a standard deviation is small or large is when you are comparing two different samples, such as in example #3.2.3. A sample with a smaller standard deviation is more consistent than a sample with a larger standard deviation.
Many other books and authors stress that there is a computational formula for calculating the standard deviation. However, this formula doesn’t give you an idea of what standard deviation is and what you are doing. It is only good for doing the calculations quickly. It goes back to the days when standard deviations were calculated by hand, and the person needed a quick way to calculate the standard deviation. It is an archaic formula that this author is trying to eradicate. It is not necessary anymore, since most calculators and computers will do the calculations for you with as much meaning as this formula gives. It is suggested that you never use it. If you want to understand what the standard deviation is doing, then you should use the definition formula. If you want an answer quickly, use a computer or calculator.
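The bicyclist and airplane comparison above can be made concrete by expressing each standard deviation as a fraction of the corresponding mean. This is only an informal sketch of why the same s = 20 mph reads so differently in the two settings; the variable names are illustrative, and the ratio is offered here as one possible way to judge size, not as a method the text prescribes.

```r
# Values taken from the bicyclist and airplane illustration
s <- 20               # standard deviation in mph, the same in both cases
mean_bicycle <- 20    # average speed of the bicyclist in mph
mean_airplane <- 500  # average speed of the airplane in mph

# Standard deviation expressed as a fraction of the mean
relative_bicycle <- s / mean_bicycle
relative_airplane <- s / mean_airplane

print(relative_bicycle)
print(relative_airplane)
```

For the bicyclist the standard deviation is as large as the mean itself, while for the airplane it is a small fraction of the mean.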
Use of Standard Deviation
One of the uses of the standard deviation is to describe how a population is distributed by using Chebyshev’s Theorem. This theorem works for any distribution, whether it is skewed, symmetric, bimodal, or any other shape. It gives you an idea of how much data is a certain distance on either side of the mean.
Chebyshev’s Theorem
For any set of data:
At least 75% of the data fall in the interval from µ − 2σ to µ + 2σ.
At least 88.9% of the data fall in the interval from µ − 3σ to µ + 3σ.
At least 93.8% of the data fall in the interval from µ − 4σ to µ + 4σ.
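The three percentages listed are special cases of the general form of Chebyshev’s Theorem: at least 1 − 1/k² of the data lie within k standard deviations of the mean, for any k > 1. That general formula is a standard result that the text above does not state, so the sketch below is an addition for illustration, and the function name chebyshev_bound is hypothetical.

```r
# Minimum proportion of data within k standard deviations of the mean,
# according to Chebyshev's Theorem (only meaningful for k > 1)
chebyshev_bound <- function(k) {
  stopifnot(all(k > 1))
  1 - 1 / k^2
}

# The three cases listed in the theorem statement
k <- c(2, 3, 4)
bounds <- chebyshev_bound(k)

# Show the bounds as percentages
print(100 * bounds)
```

The values for k = 3 and k = 4 are 8/9 and 15/16, which the theorem statement reports rounded as 88.9% and 93.8%.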
Example #3.2.4: Using Chebyshev’s Theorem
The U.S. Weather Bureau has provided the information in table #3.2.7 about the total annual number of reported strong to violent (F3+) tornados in the United States for the years 1954 to 2012. ("U.S. tornado climatology," 17)
Table #3.2.7: Annual Number of Violent Tornados in the U.S.
46 47 31 41 24 56 56 23 31 59 39 70 73 85 33 38 45 39 35 22 51 39 51 131 37 24 57 42 28 45 98 35 54 45 30 15 35 64 21 84 40 51 44 62 65 27 34 23 32 28 41 98 82 47 62 21 31 29 32
a.) Use Chebyshev’s theorem to find an interval centered about the mean annual number of strong to violent (F3+) tornados in which you would expect at least 75% of the years to fall.