Statistics Using Technology
By Kathryn Kozak
Photo taken by Richard Kozak at Dorrigo National Park in NSW, Australia
Creative Commons Attribution Sharealike. This license is considered by some to be the most open license. It allows reuse, remixing, and distribution (including commercial), but requires that any remixes use the same license as the original. This limits where the content can be remixed into, but on the other hand ensures that no one can remix the content and then put the remix under a more restrictive license.
2014 Kathryn Kozak ISBN: 978-1-312-18519-7
Table of Contents
Preface
Chapter 1: Statistical Basics
  Section 1.1: What is Statistics?
  Section 1.2: Sampling Methods
  Section 1.3: Experimental Design
  Section 1.4: How Not to Do Statistics
Chapter 2: Graphical Descriptions of Data
  Section 2.1: Qualitative Data
  Section 2.2: Quantitative Data
  Section 2.3: Other Graphical Representations of Data
Chapter 3: Numerical Descriptions of Data
  Section 3.1: Measures of Center
  Section 3.2: Measures of Spread
  Section 3.3: Ranking
Chapter 4: Probability
  Section 4.1: Empirical Probability
  Section 4.2: Theoretical Probability
  Section 4.3: Conditional Probability
  Section 4.4: Counting Techniques
Chapter 5: Discrete Probability Distributions
  Section 5.1: Basics of Probability Distributions
  Section 5.2: Binomial Probability Distribution
  Section 5.3: Mean and Standard Deviation of Binomial Distribution
Chapter 6: Continuous Probability Distributions
  Section 6.1: Uniform Distribution
  Section 6.2: Graphs of the Normal Distribution
  Section 6.3: Finding Probabilities for the Normal Distribution
  Section 6.4: Assessing Normality
  Section 6.5: Sampling Distribution and the Central Limit Theorem
Chapter 7: One-Sample Inference
  Section 7.1: Basics of Hypothesis Testing
  Section 7.2: One-Sample Proportion Test
  Section 7.3: One-Sample Test for the Mean
Chapter 8: Estimation
  Section 8.1: Basics of Confidence Intervals
  Section 8.2: One-Sample Interval for the Proportion
  Section 8.3: One-Sample Interval for the Mean
Chapter 9: Two-Sample Inference
  Section 9.1: Paired Samples for Two Means
  Section 9.2: Independent Samples for Two Means
  Section 9.3: Two Proportions
Chapter 10: Regression and Correlation
  Section 10.1: Regression
  Section 10.2: Correlation
  Section 10.3: Inference for Regression and Correlation
Chapter 11: Chi-Square and ANOVA Tests
  Section 11.1: Chi-Square Test for Independence
  Section 11.2: Chi-Square Goodness of Fit
  Section 11.3: Analysis of Variance (ANOVA)
Appendix: Critical Value Tables
  Table A.1: Normal Critical Values for Confidence Levels
  Table A.2: Critical Values for t-Interval
Index
Preface: I hope you find this book useful in teaching statistics. When writing this book, I tried to follow the GAISE Standards (GAISE recommendations. (2014, January 05). Retrieved from http://www.amstat.org/education/gaise/GAISECollege_Recommendations.pdf ), which are
1.) Emphasize statistical literacy and develop statistical understanding.
2.) Use real data.
3.) Stress conceptual understanding, rather than mere knowledge of procedure.
4.) Foster active learning in the classroom.
5.) Use technology for developing concepts and analyzing data.
To this end, I ask students to interpret the results of their calculations. I incorporated the use of technology for most calculations. Because of that, you will not find me using any of the computational formulas for standard deviations or correlation and regression, since I prefer that students understand the concept of these quantities. Also, because I utilize technology, you will not find the standard normal table, Student’s t-table, binomial table, chi-square distribution table, and F-distribution table in the book. The only tables I provided were for critical values for confidence intervals, since they are more difficult to find using technology.
Another difference between this book and other statistics books is the order of hypothesis testing and confidence intervals. Most books present confidence intervals first and then hypothesis tests. I find that presenting hypothesis testing first and then confidence intervals is more understandable for students. I have also de-emphasized the use of the z-test. In fact, I only use it to introduce hypothesis testing, and never utilize it again. You may also notice that when I introduced hypothesis testing and confidence intervals, proportions were introduced before means. However, when two-sample tests and confidence intervals are introduced, I switched this order. This is because many instructors do not discuss proportions for two samples. However, you might try assigning problems for proportions without discussing them in class. After doing two samples for means, the proportions are similar. Lastly, to aid student understanding and interest, most of the homework and examples utilize real data. Again, I hope you find this book useful for your introductory statistics class.
I want to make a comment about the mathematical knowledge that I assumed the students possess. The course for which I wrote this book has a higher prerequisite than most introductory statistics courses.
However, I do feel that students can read and understand this book as long as they have had basic algebra and can substitute numbers into formulas. I do not show how to create most of the graphs, but most students should have been exposed to them in high school. So I hope the mathematical level is appropriate for your course. The technology that I utilized for creating the graphs was Microsoft Excel, and I utilized the TI-83/84 graphing calculator for most calculations, including hypothesis testing, confidence intervals, and probability distributions. This is because these tools are readily available to my students. Please feel free to use any other technology that is more appropriate for your students. Do make sure that you use some technology.
Acknowledgments: I would like to thank the following people for taking their valuable time to review the book. Their comments and insights improved this book immensely.
Jane Tanner, Onondaga Community College
Rob Farinelli, College of Southern Maryland
Carrie Kinnison, retired engineer
Sean Simpson, Westchester Community College
Kim Sonier, Coconino Community College
Jim Ham, Delta College
I also want to thank Coconino Community College for granting me a sabbatical so that I would have the time to write the book. Lastly, I want to thank my husband Rich and my son Dylan for supporting me in this project. Without their love and support, I would not have been able to complete the book.
Chapter 1: Statistical Basics
Section 1.1: What is Statistics?
You are exposed to statistics regularly. If you are a sports fan, then you have the statistics for your favorite player. If you are interested in politics, then you look at the polls to see how people feel about certain issues or candidates. If you are an environmentalist, then you research arsenic levels in the water of a town or analyze the global temperatures. If you are in the business profession, then you may track the monthly sales of a store or use quality control processes to monitor the number of defective parts manufactured. If you are in the health profession, then you may look at how successful a procedure is or the percentage of people infected with a disease. There are many other examples from other areas. To understand how to collect data and analyze it, you need to understand what the field of statistics is and the basic definitions.
Statistics is the study of how to collect, organize, analyze, and interpret data collected from a group.
There are two branches of statistics. One is called descriptive statistics, which is where you collect and organize data. The other is called inferential statistics, which is where you analyze and interpret data. First you need to look at descriptive statistics since you will use the descriptive statistics when making inferences.
To understand how to create descriptive statistics and then conduct inferences, there are a few definitions that you need to look at. Note, many of the words that are defined have common definitions that are used in non-statistical terminology. In statistics, some have slightly different definitions. It is important that you notice the difference and utilize the statistical definitions.
The first thing to decide in a statistical study is whom you want to measure and what you want to measure. You always want to make sure that you can answer the question of whom you measured and what you measured.
The who is known as the individual and the what is the variable.
Individual – a person or object that you are interested in finding out information about.
Variable (also known as a random variable) – the measurement or observation of the individual.
If you put the individual and the variable into one statement, then you obtain a population.
Population – set of all values of the variable for the entire group of individuals.
Notice, the population answers who you want to measure and what you want to measure. Make sure that your population always answers both of these questions. If it doesn’t, then you haven’t given someone who is reading your study the entire picture. As an example, if you just say that you are going to collect data from the senators in the U.S.
Congress, you haven’t told your reader what you are going to collect. Do you want to know their income, their highest degree earned, their voting record, their age, their political party, their gender, their marital status, or how they feel about a particular issue? Without telling what you want to measure, your reader has no idea what your study is actually about.
Sometimes the population is very easy to collect. For example, if you are interested in finding the average age of all of the current senators in the U.S. Congress, there are only 100 senators, so this wouldn’t be hard to find. However, if instead you were interested in knowing the average age at which a senator first took office, for all senators who ever served in the U.S. Congress, then this would be a bit more work. It is still doable, but it would take a bit of time to collect. But what if you are interested in finding the average diameter at breast height of all of the Ponderosa Pine trees in the Coconino National Forest? This would be impossible to actually collect. What do you do in these cases? Instead of collecting the entire population, you take a smaller group of the population, kind of a snapshot of the population. This smaller group is called a sample.
Sample – a subset from the population. It looks just like the population, but contains less data.
How you collect your sample can determine how accurate the results of your study are. There are many ways to collect samples. No sampling method is perfect, but some create better samples than others. Sampling techniques will be discussed later. For now, realize that every time you take a sample you will find different data values. The sample is a snapshot of the population, and there is more information than is in the picture. The idea is to try to collect a sample that gives you an accurate picture, but you will never know for sure if your picture is the correct picture.
Unlike previous mathematics classes where there was always one right answer, in statistics there can be many answers, and you don’t know which are right. Once you have your data, either from a population or a sample, you need to know how you want to summarize the data. As an example, suppose you are interested in finding the proportion of people who like a candidate, the average height a plant grows to using a new fertilizer, or the variability of the test scores. Understanding how you want to summarize the data helps to determine the type of data you want to collect. Since the population is what we are interested in, then you want to calculate a number from the population. This is known as a parameter. As mentioned already, you can’t really collect the entire population. Even though this is the number you are interested in, you can’t really calculate it. Instead you use the number calculated from the sample, called a statistic, to estimate the parameter. Since no sample is exactly the same, the statistic values are going to be different from sample to sample. They estimate the value of the parameter, but again, you do not know for sure if your answer is correct.
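The relationship between a parameter and a statistic can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population of plant heights is made-up data generated for illustration, and the sample size of 10 and the five repeated samples are arbitrary choices. It shows that the parameter is one fixed number, while the statistic changes from sample to sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (in cm) of 10,000 plants.
# These values are simulated for illustration, not real data.
population = [random.gauss(50, 8) for _ in range(10_000)]

# Parameter: a number calculated from the entire population.
# In practice this is usually unknown because the population cannot be collected.
parameter = statistics.mean(population)

# Statistic: a number calculated from a sample.
# Each new sample of 10 plants gives a different value.
sample_means = [
    statistics.mean(random.sample(population, 10)) for _ in range(5)
]

print(f"parameter (population mean): {parameter:.2f}")
for i, sample_mean in enumerate(sample_means, start=1):
    print(f"statistic from sample {i}: {sample_mean:.2f}")
```

Each printed statistic is an estimate of the same parameter, yet none of them is guaranteed to equal it.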
Parameter – a number calculated from the population. Usually denoted with a Greek letter. This number is a fixed, unknown number that you want to find.
Statistic – a number calculated from the sample. Usually denoted with letters from the Latin alphabet, though sometimes there is a Greek letter with a ^ (called a hat) above it. Since you can find samples, it is readily known, though it changes depending on the sample taken. It is used to estimate the parameter value.
One last concept to mention is that there are two different types of variables – qualitative and quantitative. Each type of variable has different parameters and statistics that you find. It is important to know the difference between them.
Qualitative or categorical variable – answer is a word or name that describes a quality of the individual.
Quantitative or numerical variable – answer is a number, something that can be counted or measured from the individual.
Example #1.1.1: Stating Definitions for Qualitative Variable
In 2010, the Pew Research Center questioned 1500 adults in the U.S. to estimate the proportion of the population favoring marijuana use for medical purposes. It was found that 73% are in favor of using marijuana for medical purposes. State the individual, variable, population, and sample.
Solution:
Individual – a U.S. adult
Variable – the response to the question “should marijuana be used for medical purposes?” This is qualitative data since you are recording a person’s response – yes or no.
Population – set of all responses of adults in the U.S.
Sample – set of 1500 responses of U.S. adults who are questioned.
Parameter – percentage who favor marijuana for medical purposes calculated from population
Statistic – percentage who favor marijuana for medical purposes calculated from sample
Example #1.1.2: Stating Definitions for Qualitative Variable
A parking control officer records the manufacturer of every 5th car in the college parking lot in order to guess the most common manufacturer.
Solution:
Individual – a car in the college parking lot
Variable – the name of the manufacturer. This is qualitative data since you are recording the name of a manufacturer.
Population – set of all names of the manufacturer of cars in the college parking lot.
Sample – set of recorded names of the manufacturer of the cars in the college parking lot
Parameter – percentage of each manufacturer calculated from population
Statistic – percentage of each manufacturer calculated from sample
Example #1.1.3: Stating Definitions for Quantitative Variable
A biologist wants to estimate the average height of a plant that is given a new plant food. She gives 10 plants the new plant food. State the individual, variable, population, and sample.
Solution:
Individual – a plant given the new plant food
Variable – the height of the plant (Note: it is not the average height since you cannot measure an average – it is calculated from data.) This is quantitative data since you will have a number.
Population – set of all the heights of plants when the new plant food is used
Sample – set of 10 heights of plants when the new plant food is used
Parameter – average height of all plants
Statistic – average height of 10 plants
Example #1.1.4: Stating Definitions for Quantitative Variable
A doctor wants to see if a new treatment for cancer extends the life expectancy of a patient versus the old treatment. She gives one group of 25 cancer patients the new treatment and another group of 25 the old treatment. She then measures the life expectancy of each of the patients. State the individuals, variables, populations, and samples.
Solution:
In this example there are two individuals, two variables, two populations, and two samples.
Individual 1: cancer patient given new treatment
Individual 2: cancer patient given old treatment
Variable 1: life expectancy when given new treatment. This is quantitative data since you will have a number.
Variable 2: life expectancy when given old treatment. This is quantitative data since you will have a number.
Population 1: set of all life expectancies of cancer patients given new treatment
Population 2: set of all life expectancies of cancer patients given old treatment
Sample 1: set of 25 life expectancies of cancer patients given new treatment
Sample 2: set of 25 life expectancies of cancer patients given old treatment
Parameter 1 – average life expectancy of all cancer patients given new treatment
Parameter 2 – average life expectancy of all cancer patients given old treatment
Statistic 1 – average life expectancy of 25 cancer patients given new treatment
Statistic 2 – average life expectancy of 25 cancer patients given old treatment
There are two types of quantitative variables, called discrete and continuous. The difference is in how many values the data can have. If you can actually count the number of data values (even if you are counting to infinity), then the variable is called discrete. If it is not possible to count the number of data values, then the variable is called continuous.
Discrete data can only take on particular values like integers. Discrete data are usually things you count.
Continuous data can take on any value. Continuous data are usually things you measure.
Example #1.1.5: Discrete or Continuous
Classify the quantitative variable as discrete or continuous.
a.) The weight of a cat.
Solution: This is continuous since it is something you measure.
b.) The number of fleas on a cat.
Solution: This is discrete since it is something you count.
c.) The size of a shoe.
Solution: This is discrete since it can only take certain values, such as 7, 7.5, 8, 8.5, 9. You can’t buy a 9.73 shoe.
There are also four measurement scales for different types of data, with each building on the ones below it. They are:
Measurement Scales:
Nominal – data is just a name or category. There is no order to any data and since there are no numbers, you cannot do any arithmetic on this level of data. Examples of this are gender, car name, ethnicity, and race.
Ordinal – data that is nominal, but you can now put the data in order, since one value is more or less than another value. You cannot do arithmetic on this data, but you can now put data values in order. Examples of this are grades (A, B, C, D, F), finishing place in a race (1st, 2nd, 3rd), and size of a drink (small, medium, large).
Interval – data that is ordinal, but you can now subtract one value from another and that subtraction makes sense. You can do arithmetic on this data, but only addition and subtraction. Examples of this are temperature and time on a clock.
Ratio – data that is interval, but you can now divide one value by another and that ratio makes sense. You can now do all arithmetic on this data. Examples of this are height, weight, distance, and time.
Nominal and ordinal data come from qualitative variables. Interval and ratio data come from quantitative variables.
Most people have a hard time deciding if the data are nominal, ordinal, interval, or ratio. First, if the variable is qualitative (words instead of numbers) then it is either nominal or ordinal. Now ask yourself if you can put the data in a particular order. If you can, it is ordinal. Otherwise, it is nominal. If the variable is quantitative (numbers), then it is either interval or ratio. For ratio data, a value of 0 means there is no measurement. This is known as the absolute zero. If there is an absolute zero in the data, then the data are ratio. If there is no absolute zero, then the data are interval. An example of an absolute zero is if you have $0 in your bank account, then you are without money. The amount of
money in your bank account is ratio data. A word of caution: sometimes ordinal data is displayed using numbers, such as 5 being strongly agree, and 1 being strongly disagree. These numbers are not really numbers. Instead they are used to assign numerical values to ordinal data. In reality you should not perform any computations on this data, though many people do. If there are numbers, make sure the numbers are inherent numbers, and not numbers that were assigned.
Example #1.1.6: Measurement Scale
State which measurement scale each is.
a.) Time of first class
Solution: This is interval since it is a number, but 0 o’clock means midnight and not the absence of time.
b.) Hair color
Solution: This is nominal since it is not a number, and there is no specific order for hair color.
c.) Length of time to take a test
Solution: This is ratio since it is a number, and if you take 0 minutes to take a test, it means you didn’t take any time to complete it.
d.) Age groupings (baby, toddler, adolescent, teenager, adult, elderly)
Solution: This is ordinal since it is not a number, but you could put the data in order from youngest to oldest or the other way around.
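The questions used above to decide on a measurement scale can be written as a small decision function. This is only a sketch: the function name and its three yes/no arguments are hypothetical names chosen for illustration, and you still have to answer the questions about the data yourself.

```python
def measurement_scale(is_quantitative, can_be_ordered=False, has_absolute_zero=False):
    """Return the measurement scale by following the questions in the text.

    is_quantitative:   True if the data are inherent numbers, not assigned codes.
    can_be_ordered:    only used for qualitative data.
    has_absolute_zero: only used for quantitative data (0 means "none").
    """
    if not is_quantitative:
        # Qualitative data are either nominal or ordinal.
        return "ordinal" if can_be_ordered else "nominal"
    # Quantitative data are either interval or ratio.
    return "ratio" if has_absolute_zero else "interval"


# The four parts of Example #1.1.6, with the answers to the questions filled in.
examples = {
    "time of first class": measurement_scale(True, has_absolute_zero=False),
    "hair color": measurement_scale(False, can_be_ordered=False),
    "length of time to take a test": measurement_scale(True, has_absolute_zero=True),
    "age groupings": measurement_scale(False, can_be_ordered=True),
}

for name, scale in examples.items():
    print(f"{name}: {scale}")
```

Note that numbers assigned to ordinal categories, such as a 1 to 5 agreement scale, should be entered as qualitative data that can be ordered, not as quantitative data.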
Section 1.1: Homework
1.) Suppose you want to know how Arizona workers age 16 or older travel to work. To estimate the percentage of people who use the different modes of travel, you take a sample containing 500 Arizona workers age 16 or older. State the individual, variable, population, sample, parameter, and statistic.
2.) You wish to estimate the mean cholesterol levels of patients two days after they had a heart attack. To estimate the mean you collect data from 28 heart patients. State the individual, variable, population, sample, parameter, and statistic.
3.) Print-O-Matic would like to estimate the mean salary of all its employees. To accomplish this they collect the salary of 19 employees. State the individual, variable, population, sample, parameter, and statistic.
4.) To estimate the percentage of households in Connecticut which use fuel oil as a heating source, a researcher collects information from 1000 Connecticut households about what fuel is their heating source. State the individual, variable, population, sample, parameter, and statistic.
5.) The U.S. Census Bureau needs to estimate the median income of males in the U.S., so they collect incomes from 2500 males. State the individual, variable, population, sample, parameter, and statistic.
6.) The U.S. Census Bureau needs to estimate the median income of females in the U.S., so they collect incomes from 3500 females. State the individual, variable, population, sample, parameter, and statistic.
7.) Eyeglassmatic manufactures eyeglasses and they would like to know the percentage of each defect type made. They review 25,891 defects and classify each defect that is made. State the individual, variable, population, sample, parameter, and statistic.
8.) The World Health Organization wishes to estimate the mean density of people per square kilometer, so they collect data on 56 countries. State the individual, variable, population, sample, parameter, and statistic.
9.) State the measurement scale for each.
a.) Cholesterol level
b.) Defect type
c.) Time of first class
d.) Opinion on a 5 point scale, with 5 being strongly agree and 1 being strongly disagree
10.) State the measurement scale for each.
a.) Temperature in degrees Celsius
b.) Ice cream flavors available
c.) Pain levels on a scale from 1 to 10, 10 being the worst pain ever
d.) Salary of employees
Section 1.2: Sampling Methods
As stated before, if you want to know something about a population, it is often impossible or impractical to examine the whole population. It might be too expensive in terms of time or money. It might be impractical – you can’t test all batteries for their length of lifetime because there wouldn’t be any batteries left to sell. You need to look at a sample. Hopefully the sample behaves the same as the population.
When you choose a sample you want it to be as similar to the population as possible. If you want to test a new painkiller for adults you would want the sample to include people who are fat, skinny, old, young, healthy, not healthy, male, female, etc.
There are many ways to collect a sample. None are perfect, and you are not guaranteed to collect a representative sample. Those are, unfortunately, the limitations of sampling. However, there are several techniques that can result in samples that give you a semi-accurate picture of the population. Just remember to be aware that the sample may not be representative. As an example, you can take a random sample from a group of people that is half male and half female, yet by chance everyone you choose is female. If this happens, it may be a good idea to collect a new sample if you have the time and money.
There are many sampling techniques, though only four will be presented here. The simplest, and the type that is strived for, is a simple random sample. This is where you pick the sample such that every sample of a given size has the same chance of being chosen. This type of sample is actually hard to collect, since it is sometimes difficult to obtain a complete list of all individuals. There are many cases where you cannot conduct a truly random sample. However, you can get as close as you can. Now suppose you are interested in what type of music people like. It might not make sense to try to find an answer for everyone in the U.S. You probably don’t like the same music as your parents.
The answers vary so much you probably couldn’t find an answer for everyone all at once. It might make sense to look at people in different age groups, or people of different ethnicities. This is called a stratified sample. The issue with this sample type is that sometimes people subdivide the population too much. It is best to just have one stratification. Also, a stratified sample has problems similar to those of a simple random sample. If your population has some order in it, then you could do a systematic sample. This is popular in manufacturing. The problem is that it is possible to miss a manufacturing mistake because of how this sample is taken. If you are collecting polling data based on location, then a cluster sample that divides the population based on geographical means would be the easiest sample to conduct. The problem is that if you are looking for opinions of people, people who live in the same region may have similar opinions. As you can see, each of the sampling techniques has pluses and minuses.
A simple random sample (SRS) of size n is a sample that is selected from a population in a way that ensures that every different possible sample of size n has the same chance of being selected. Also, every individual associated with the population has the same chance of being selected.
Ways to select a simple random sample:
Put all names in a hat and draw a certain number of names out.
Assign each individual a number and use a random number table or a calculator or computer to randomly select the individuals that will be measured.
Example #1.2.1: Choosing a Simple Random Sample Describe how to take a simple random sample from a classroom.
Solution: Give each student in the class a number. Then use a random number generator to pick as many of those numbers as the sample size you want. The students with the chosen numbers form the sample.
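The numbering approach just described can be sketched with Python's standard `random` module. The class size of 20 and the sample size of 5 are assumptions chosen for illustration; `random.sample` draws without replacement, so no student is picked twice and every group of 5 students is equally likely.

```python
import random

class_size = 20   # assumed number of students in the class
sample_size = 5   # assumed number of students to select

# Give each student in the class a number from 1 to class_size.
student_numbers = list(range(1, class_size + 1))

# Draw the sample without replacement.
chosen = random.sample(student_numbers, sample_size)

print("Selected student numbers:", sorted(chosen))
```

Running the code again will usually select different students, which is the point: the sample depends on chance, not on where students sit or how tall they are.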
Example #1.2.2: How Not to Choose a Simple Random Sample
You want to choose 5 students out of a class of 20. Give some examples of samples that are not simple random samples.
Solution:
Choose 5 students from the front row. The people in the last row have no chance of being selected.
Choose the 5 shortest students. The tallest students have no chance of being selected.
Stratified sampling is where you break the population into groups called strata, then take a simple random sample from each stratum. For example:
If you want to look at musical preference, you could divide the individuals into age groups and then conduct simple random samples inside each group. If you want to calculate the average price of textbooks, you could divide the individuals into groups by major and then conduct simple random samples inside each group.
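The musical preference example can be sketched in code as follows. Everything specific here is an assumption made for illustration: the 300 numbered people, the three age groups, the way people are assigned to groups, and the choice of 10 people per stratum. The sketch only shows the two steps of stratified sampling, which are to split the population into strata and then take a simple random sample inside each one.

```python
import random
from collections import defaultdict

# Hypothetical population: 300 numbered people, each assigned to an age group.
age_groups = ["under 30", "30 to 59", "60 and over"]
population = [(person_id, age_groups[person_id % 3]) for person_id in range(300)]

# Step 1: break the population into strata (one list of people per age group).
strata = defaultdict(list)
for person_id, group in population:
    strata[group].append(person_id)

# Step 2: take a simple random sample inside each stratum.
per_stratum = 10  # assumed sample size for each stratum
stratified_sample = {
    group: random.sample(members, per_stratum)
    for group, members in strata.items()
}

for group, chosen in stratified_sample.items():
    print(group, sorted(chosen))
```

Because a sample is drawn from every stratum, each age group is guaranteed to appear in the final sample, which a single simple random sample of the whole population does not guarantee.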