Introduction to the Practice of Statistics
NINTH EDITION
David S. Moore
George P. McCabe
Bruce A. Craig
Purdue University
Vice President, STEM: Ben Roberts
Publisher: Terri Ward
Senior Acquisitions Editor: Karen Carson
Marketing Manager: Tom DeMarco
Marketing Assistant: Cate McCaffery
Development Editor: Jorge Amaral
Senior Media Editor: Catriona Kaplan
Assistant Media Editor: Emily Tenenbaum
Director of Digital Production: Keri deManigold
Senior Media Producer: Alison Lorber
Associate Editor: Victoria Garvey
Editorial Assistant: Katharine Munz
Photo Editor: Cecilia Varas
Photo Researcher: Candice Cheesman
Director of Design, Content Management: Diana Blume
Text and Cover Designer: Blake Logan
Project Editor: Edward Dionne, MPS North America LLC
Illustrations: MPS North America LLC
Production Manager: Susan Wein
Composition: MPS North America LLC
Printing and Binding: LSC Communications
Cover Illustration: Drawing Water: Spring 2011 detail (Midwest) by David Wicks
“Look Back” Arrow: NewCorner/Shutterstock
Library of Congress Control Number: 2016946039
Student Edition Hardcover: ISBN-13: 978-1-319-01338-7 ISBN-10: 1-319-01338-4
Student Edition Loose-leaf: ISBN-13: 978-1-319-01362-2 ISBN-10: 1-319-01362-7
Instructor Complimentary Copy: ISBN-13: 978-1-319-01428-5 ISBN-10: 1-319-01428-3
© 2017, 2014, 2012, 2009 by W. H. Freeman and Company
All rights reserved
Printed in the United States of America
First printing
W. H. Freeman and Company One New York Plaza Suite 4500 New York, NY 10004-1562 www.macmillanlearning.com
Brief Contents
To Teachers: About This Book
To Students: What Is Statistics?
About the Authors
Data Table Index
Beyond the Basics Index
PART I Looking at Data
CHAPTER 1 Looking at Data—Distributions
CHAPTER 2 Looking at Data—Relationships
CHAPTER 3 Producing Data
PART II Probability and Inference
CHAPTER 4 Probability: The Study of Randomness
CHAPTER 5 Sampling Distributions
CHAPTER 6 Introduction to Inference
CHAPTER 7 Inference for Means
CHAPTER 8 Inference for Proportions
PART III Topics in Inference
CHAPTER 9 Inference for Categorical Data
CHAPTER 10 Inference for Regression
CHAPTER 11 Multiple Regression
CHAPTER 12 One-Way Analysis of Variance
CHAPTER 13 Two-Way Analysis of Variance
Tables
Answers to Odd-Numbered Exercises
Notes and Data Sources
Index
Contents
To Teachers: About This Book To Students: What Is Statistics? About the Authors Data Table Index Beyond the Basics Index
PART I Looking at Data CHAPTER 1 Looking at Data—Distributions Introduction
1.1 Data Key characteristics of a data set
Section 1.1 Summary Section 1.1 Exercises 1.2 Displaying Distributions with Graphs
Categorical variables: Bar graphs and pie charts Quantitative variables: Stemplots and histograms Histograms Data analysis in action: Don’t hang up on me Examining distributions Dealing with outliers Time plots
Section 1.2 Summary Section 1.2 Exercises 1.3 Describing Distributions with Numbers
Measuring center: The mean Measuring center: The median Mean versus median Measuring spread: The quartiles The five-number summary and boxplots The 1.5 × IQR rule for suspected outliers Measuring spread: The standard deviation Properties of the standard deviation Choosing measures of center and spread Changing the unit of measurement
Section 1.3 Summary Section 1.3 Exercises 1.4 Density Curves and Normal Distributions
Density curves
Measuring center and spread for density curves Normal distributions The 68–95–99.7 rule Standardizing observations Normal distribution calculations Using the standard Normal table Inverse Normal calculations Normal quantile plots
Beyond the Basics: Density estimation Section 1.4 Summary Section 1.4 Exercises Chapter 1 Exercises
CHAPTER 2 Looking at Data—Relationships Introduction
2.1 Relationships Examining relationships
Section 2.1 Summary Section 2.1 Exercises 2.2 Scatterplots
Interpreting scatterplots The log transformation Adding categorical variables to scatterplots Scatterplot smoothers Categorical explanatory variables
Section 2.2 Summary Section 2.2 Exercises 2.3 Correlation
The correlation r Properties of correlation
Section 2.3 Summary Section 2.3 Exercises 2.4 Least-Squares Regression
Fitting a line to data Prediction Least-squares regression Interpreting the regression line Facts about least-squares regression Correlation and regression Another view of r²
Section 2.4 Summary Section 2.4 Exercises 2.5 Cautions about Correlation and Regression
Residuals Outliers and influential observations
Beware of the lurking variable Beware of correlations based on averaged data Beware of restricted ranges
Beyond the Basics: Data mining Section 2.5 Summary Section 2.5 Exercises 2.6 Data Analysis for Two-Way Tables
The two-way table Joint distribution Marginal distributions Describing relations in two-way tables Conditional distributions Simpson’s paradox
Section 2.6 Summary Section 2.6 Exercises 2.7 The Question of Causation
Explaining association Establishing causation
Section 2.7 Summary Section 2.7 Exercises Chapter 2 Exercises
CHAPTER 3 Producing Data Introduction
3.1 Sources of Data Anecdotal data Available data Sample surveys and experiments
Section 3.1 Summary Section 3.1 Exercises 3.2 Design of Experiments
Comparative experiments Randomization Randomized comparative experiments How to randomize Randomization using software Randomization using random digits Cautions about experimentation Matched pairs designs Block designs
Section 3.2 Summary Section 3.2 Exercises 3.3 Sampling Design
Simple random samples How to select a simple random sample
Stratified random samples Multistage random samples Cautions about sample surveys
Beyond the Basics: Capture-recapture sampling Section 3.3 Summary Section 3.3 Exercises 3.4 Ethics
Institutional review boards Informed consent Confidentiality Clinical trials Behavioral and social science experiments
Section 3.4 Summary Section 3.4 Exercises Chapter 3 Exercises
PART II Probability and Inference CHAPTER 4 Probability: The Study of Randomness Introduction
4.1 Randomness The language of probability Thinking about randomness The uses of probability
Section 4.1 Summary Section 4.1 Exercises 4.2 Probability Models
Sample spaces Probability rules Assigning probabilities: Finite number of outcomes Assigning probabilities: Equally likely outcomes Independence and the multiplication rule Applying the probability rules
Section 4.2 Summary Section 4.2 Exercises 4.3 Random Variables
Discrete random variables Continuous random variables Normal distributions as probability distributions
Section 4.3 Summary Section 4.3 Exercises 4.4 Means and Variances of Random Variables
The mean of a random variable Statistical estimation and the law of large numbers
Thinking about the law of large numbers Beyond the Basics: More laws of large numbers
Rules for means The variance of a random variable Rules for variances and standard deviations
Section 4.4 Summary Section 4.4 Exercises 4.5 General Probability Rules
General addition rules Conditional probability General multiplication rules Tree diagrams Bayes’s rule Independence again
Section 4.5 Summary Section 4.5 Exercises Chapter 4 Exercises
CHAPTER 5 Sampling Distributions Introduction
5.1 Toward Statistical Inference Sampling variability Sampling distributions Bias and variability Sampling from large populations Why randomize?
Section 5.1 Summary Section 5.1 Exercises 5.2 The Sampling Distribution of a Sample Mean
The mean and standard deviation of x̅ The central limit theorem A few more facts
Beyond the Basics: Weibull distributions Section 5.2 Summary Section 5.2 Exercises 5.3 Sampling Distributions for Counts and Proportions
The binomial distributions for sample counts Binomial distributions in statistical sampling Finding binomial probabilities Binomial mean and standard deviation Sample proportions Normal approximation for counts and proportions The continuity correction Binomial formula The Poisson distributions
Section 5.3 Summary
Section 5.3 Exercises Chapter 5 Exercises
CHAPTER 6 Introduction to Inference Introduction Overview of inference 6.1 Estimating with Confidence
Statistical confidence Confidence intervals Confidence interval for a population mean How confidence intervals behave Choosing the sample size Some cautions
Section 6.1 Summary Section 6.1 Exercises 6.2 Tests of Significance
The reasoning of significance tests Stating hypotheses Test statistics P-values Statistical significance Tests for a population mean Two-sided significance tests and confidence intervals The P-value versus a statement of significance
Section 6.2 Summary Section 6.2 Exercises 6.3 Use and Abuse of Tests
Choosing a level of significance What statistical significance does not mean Don’t ignore lack of significance Statistical inference is not valid for all sets of data Beware of searching for significance
Section 6.3 Summary Section 6.3 Exercises 6.4 Power and Inference as a Decision
Power Increasing the power Inference as decision Two types of error Error probabilities The common practice of testing hypotheses
Section 6.4 Summary Section 6.4 Exercises Chapter 6 Exercises
CHAPTER 7 Inference for Means
Introduction
7.1 Inference for the Mean of a Population The t distributions The one-sample t confidence interval The one-sample t test Matched pairs t procedures Robustness of the t procedures
Beyond the Basics: The bootstrap Section 7.1 Summary Section 7.1 Exercises 7.2 Comparing Two Means
The two-sample z statistic The two-sample t procedures The two-sample t confidence interval The two-sample t significance test Robustness of the two-sample procedures Inference for small samples Software approximation for the degrees of freedom The pooled two-sample t procedures
Section 7.2 Summary Section 7.2 Exercises 7.3 Additional Topics on Inference
Choosing the sample size Inference for non-Normal populations
Section 7.3 Summary Section 7.3 Exercises Chapter 7 Exercises
CHAPTER 8 Inference for Proportions Introduction
8.1 Inference for a Single Proportion Large-sample confidence interval for a single proportion
Beyond the Basics: The plus four confidence interval for a single proportion Significance test for a single proportion Choosing a sample size for a confidence interval Choosing a sample size for a significance test
Section 8.1 Summary Section 8.1 Exercises 8.2 Comparing Two Proportions
Large-sample confidence interval for a difference in proportions Beyond the Basics: The plus four confidence interval for a difference in proportions
Significance test for a difference in proportions Choosing a sample size for two sample proportions
Beyond the Basics: Relative risk Section 8.2 Summary
Section 8.2 Exercises Chapter 8 Exercises
PART III Topics in Inference CHAPTER 9 Inference for Categorical Data Introduction
9.1 Inference for Two-Way Tables The hypothesis: No association Expected cell counts The chi-square test Computations Computing conditional distributions The chi-square test and the z test
Beyond the Basics: Meta-analysis Section 9.1 Summary Section 9.1 Exercises 9.2 Goodness of Fit Section 9.2 Summary Section 9.2 Exercises Chapter 9 Exercises
CHAPTER 10 Inference for Regression Introduction
10.1 Simple Linear Regression Statistical model for linear regression Preliminary data analysis and inference considerations Estimating the regression parameters Checking model assumptions Confidence intervals and significance tests Confidence intervals for mean response Prediction intervals Transforming variables
Beyond the Basics: Nonlinear regression Section 10.1 Summary Section 10.1 Exercises 10.2 More Detail about Simple Linear Regression
Analysis of variance for regression The ANOVA F test Calculations for regression inference Inference for correlation
Section 10.2 Summary Section 10.2 Exercises Chapter 10 Exercises
CHAPTER 11 Multiple Regression Introduction
11.1 Inference for Multiple Regression Population multiple regression equation Data for multiple regression Multiple linear regression model Estimation of the multiple regression parameters Confidence intervals and significance tests for regression coefficients ANOVA table for multiple regression Squared multiple correlation R²
Section 11.1 Summary Section 11.1 Exercises 11.2 A Case Study
Preliminary analysis Relationships between pairs of variables Regression on high school grades Interpretation of results Examining the residuals Refining the model Regression on SAT scores Regression using all variables Test for a collection of regression coefficients
Beyond the Basics: Multiple logistic regression Section 11.2 Summary Section 11.2 Exercises Chapter 11 Exercises
CHAPTER 12 One-Way Analysis of Variance Introduction
12.1 Inference for One-Way Analysis of Variance Data for one-way ANOVA Comparing means The two-sample t statistic An overview of ANOVA The ANOVA model Estimates of population parameters Testing hypotheses in one-way ANOVA The ANOVA table The F test Software
Beyond the Basics: Testing the equality of spread Section 12.1 Summary Section 12.1 Exercises 12.2 Comparing the Means
Contrasts
Multiple comparisons Power
Section 12.2 Summary Section 12.2 Exercises Chapter 12 Exercises
CHAPTER 13 Two-Way Analysis of Variance Introduction
13.1 The Two-Way ANOVA Model Advantages of two-way ANOVA The two-way ANOVA model Main effects and interactions
13.2 Inference for Two-Way ANOVA The ANOVA table for two-way ANOVA
Chapter 13 Summary Chapter 13 Exercises Tables Answers to Odd-Numbered Exercises Notes and Data Sources Index
To Teachers: About This Book
Statistics is the science of data. Introduction to the Practice of Statistics (IPS) is an introductory text based on this principle. We present methods of basic statistics in a way that emphasizes working with data and mastering statistical reasoning. IPS is elementary in mathematical level but conceptually rich in statistical ideas. After completing a course based on our text, we would like students to be able to think objectively about conclusions drawn from data and use statistical methods in their own work.
In IPS, we combine attention to basic statistical concepts with a comprehensive presentation of the elementary statistical methods that students will find useful in their work. IPS has been successful for several reasons:
1. IPS examines the nature of modern statistical practice at a level suitable for beginners. We focus on the production and analysis of data as well as the traditional topics of probability and inference.
2. IPS has a logical overall progression, so data production and data analysis are a major focus, while inference is treated as a tool that helps us draw conclusions from data in an appropriate way.
3. IPS presents data analysis as more than a collection of techniques for exploring data. We emphasize systematic ways of thinking about data. Simple principles guide the analysis: always plot your data; look for overall patterns and deviations from them; when looking at the overall pattern of a distribution for one variable, consider shape, center, and spread; for relations between two variables, consider form, direction, and strength; always ask whether a relationship between variables is influenced by other variables lurking in the background. We warn students about pitfalls in clear cautionary discussions.
4. IPS uses real examples to drive the exposition. Students learn the technique of least-squares regression and how to interpret the regression slope. But they also learn the conceptual ties between regression and correlation and the importance of looking for influential observations.
5. IPS is aware of current developments both in statistical science and in teaching statistics. Brief, optional Beyond the Basics sections give quick overviews of topics such as density estimation, scatterplot smoothers, data mining, nonlinear regression, and meta-analysis. Chapter 16 gives an elementary introduction to the bootstrap and other computer-intensive statistical methods.
The title of the book expresses our intent to introduce readers to statistics as it is used in practice. Statistics in practice is concerned with drawing conclusions from data. We focus on problem solving rather than on methods that may be useful in specific settings.
GAISE
The College Report of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) Project (www.amstat.org/education/gaise/) was funded by the American Statistical Association to make recommendations for how introductory statistics courses should be taught. This report and its update contain many interesting teaching suggestions, and we strongly recommend that you read them. The philosophy and approach of IPS closely reflect the GAISE recommendations. Let’s examine each of the latest recommendations in the context of IPS.
1. Teach statistical thinking. Through our experiences as applied statisticians, we are very familiar with the components that are needed for the appropriate use of statistical methods. We focus on formulating questions, collecting and finding data, evaluating the quality of data, exploring the relationships among variables, performing statistical analyses, and drawing conclusions. In examples and exercises throughout the text, we emphasize putting the analysis in the proper context and translating numerical and graphical summaries into conclusions.
2. Focus on conceptual understanding. With the software available today, it is very easy for almost anyone to apply a wide variety of statistical procedures, both simple and complex, to a set of data. Without a firm grasp of the concepts, such applications are frequently meaningless. By using the methods that we present on real sets of data, we believe that students will gain an excellent understanding of these concepts. Our emphasis is on the input (questions of interest, collecting or finding data, examining data) and the output (conclusions) for a statistical analysis. Formulas are given only where they will provide some insight into concepts.
3. Integrate real data with a context and a purpose. Many of the examples and exercises in IPS include data that we have obtained from collaborators or consulting clients. Other data sets have come from research related to these activities. We have also used the Internet as a data source, particularly for data related to social media and other topics of interest to undergraduates. Our emphasis on real data, rather than artificial data chosen to illustrate a calculation, serves to motivate students and help them see the usefulness of statistics in everyday life. We also frequently encounter interesting statistical issues that we explore. These include outliers and nonlinear relationships. All data sets are available from the text website.
4. Foster active learning in the classroom. As we mentioned earlier, we believe that statistics is exciting as something to do rather than something to talk about. Throughout the text, we provide exercises in Use Your Knowledge sections that ask the students to perform some relatively simple tasks that reinforce the material just presented. Other exercises are particularly suited to being worked on and discussed within a classroom setting.
5. Use technology for developing concepts and analyzing data. Technology has altered statistical practice in a fundamental way. In the past, some of the calculations that we performed were particularly difficult and tedious. In other words, they were not fun. Today, freed from the burden of computation by software, we can concentrate our efforts on the big picture: what questions are we trying to address with a study and what can we conclude from our analysis?
6. Use assessments to improve and evaluate student learning. Our goal for students who complete a course based on IPS is that they are able to design and carry out a statistical study for a project in their capstone course or other setting. Our exercises are oriented toward this goal. Many ask about the design of a statistical study and the collection of data. Others ask for a paragraph summarizing the results of an analysis. This recommendation includes the use of projects, oral presentations, article critiques, and written reports. We believe that students using this text will be well prepared to undertake these kinds of activities. Furthermore, we view these activities not only as assessments but also as valuable tools for learning statistics.
Teaching Recommendations
We have used IPS in courses taught to a variety of student audiences. For general undergraduates from mixed disciplines, we recommend covering Chapters 1 through 8 and Chapters 9, 10, or 12. For a quantitatively strong audience (sophomores planning to major in actuarial science or statistics), we recommend moving more quickly. Add Chapters 10 and 11 to the core material in Chapters 1 through 8. In general, we recommend deemphasizing the material on probability because these students will take a probability course later in their program. For beginning graduate students in such fields as education, family studies, and retailing, we recommend that the students read the entire text (Chapters 11 and 13 lightly), again with reduced emphasis on Chapter 4 and some parts of Chapter 5. In all cases, beginning with data analysis and data production (Part I) helps students overcome their fear of statistics and builds a sound base for studying inference. We believe that IPS can easily be adapted to a wide variety of audiences.
The Ninth Edition: What’s New?
Chapter 1 now begins with a short section giving an overview of data.
“Toward Statistical Inference” (previously Section 3.3), which introduces the concepts of statistical inference and sampling distributions, has been moved to Section 5.1 to better assist with the transition from a single data set to sampling distributions.
Coverage of mosaic plots as a visual tool for relationships between two categorical variables has been added to Chapters 2 and 9.
Chapter 3 now begins with a short section giving a basic overview of data sources.
Coverage of equivalence testing has been added to Chapter 7.
There is a greater emphasis on sample size determination using software in Chapters 7 and 8.
Resampling and bootstrapping are now introduced in Chapter 7 rather than Chapter 6.
“Inference for Categorical Data” is the new title for Chapter 9, which includes goodness of fit as well as inference for two-way tables.
There are more JMP screenshots and updated screenshots of Minitab, Excel, and SPSS outputs.
Design. A new design incorporates colorful, revised figures throughout to aid the students’ understanding of text material. Photographs related to chapter examples and exercises make connections to real-life applications and provide a visual context for topics. More figures with software output have been included.
Exercises and Examples. More than 30% of the exercises are new or revised, and there are more than 1700 exercises total. Exercise sets have been added at the end of sections in Chapters 9 through 12. To maintain the attractiveness of the examples to students, we have replaced or updated a large number of them. More than 30% of the 430 examples are new or revised. A list of exercises and examples categorized by application area is provided on the inside of the front cover.
In addition to the new ninth edition enhancements, IPS has retained the successful pedagogical features from previous editions:
Look Back. At key points in the text, Look Back margin notes direct the reader to the first explanation of a topic, providing page numbers for easy reference.
Caution. Warnings in the text, signaled by a caution icon, help students avoid common errors and misconceptions.
Challenge Exercises. More challenging exercises are signaled with an icon. Challenge exercises are varied: some are mathematical, some require open-ended investigation, and others require deeper thought about the basic concepts.
Applets. Applet icons are used throughout the text to signal where related interactive statistical applets can be found on the IPS website and in LaunchPad.
Use Your Knowledge Exercises. We have found these exercises to be a very useful learning tool. They appear throughout each section and are listed, with page numbers, before the section-ending exercises.
Technology output screenshots. Most statistical analyses rely heavily on statistical software. In this book, we discuss the use of Excel 2013, JMP 12, Minitab 17, SPSS 23, CrunchIt, R, and a TI-83/-84 calculator for conducting statistical analysis. As specialized statistical packages, JMP, Minitab, and SPSS are the most popular software choices both in industry and in colleges and schools of business. R is an extremely powerful statistical environment that is free to anyone; it relies heavily on members of the academic and general statistical communities for support. As an all-purpose spreadsheet program, Excel provides a limited set of statistical analysis options in comparison. However, given its pervasiveness and wide acceptance in industry and the computer world at large, we believe it is important to give Excel proper attention. It should be noted that for users who want more statistical capabilities but want to work in an Excel environment, there are a number of commercially available add-on packages (if you have JMP, for instance, it can be invoked from within Excel). Finally, instructions are provided for the TI-83/-84 calculators.
Even though basic guidance is provided in the book, it should be emphasized that IPS is not bound to any of these programs. Computer output from statistical packages is very similar, so you can feel quite comfortable using any one of these packages.
Acknowledgments
We are pleased that the first eight editions of Introduction to the Practice of Statistics have helped to move the teaching of introductory statistics in a direction supported by most statisticians. We are grateful to the many colleagues and students who have provided helpful comments, and we hope that they will find this new edition another step forward. In particular, we would like to thank the following colleagues who offered specific comments on the new edition:
Ali Arab, Georgetown University
Tessema Astatkie, Dalhousie University
Fouzia Baki, McMaster University
Lynda Ballou, New Mexico Institute of Mining and Technology
Sanjib Basu, Northern Illinois University
David Bosworth, Hutchinson Community College
Max Buot, Xavier University
Nadjib Bouzar, University of Indianapolis
Matt Carlton, California Polytechnic State University–San Luis Obispo
Gustavo Cepparo, Austin Community College
Pinyuen Chen, Syracuse University
Dennis L. Clason, University of Cincinnati–Blue Ash College
Tadd Colver, Purdue University
Chris Edwards, University of Wisconsin–Oshkosh
Irina Gaynanova, Texas A&M University
Brian T. Gill, Seattle Pacific University
Mary Gray, American University
Gary E. Haefner, University of Cincinnati
Susan Herring, Sonoma State University
Lifang Hsu, Le Moyne College
Tiffany Kolba, Valparaiso University
Lia Liu, University of Illinois at Chicago
Xuewen Lu, University of Calgary
Antoinette Marquard, Cleveland State University
Frederick G. Schmitt, College of Marin
James D. Stamey, Baylor University
Engin Sungur, University of Minnesota–Morris
Anatoliy Swishchuk, University of Calgary
Richard Tardanico, Florida International University
Melanee Thomas, University of Calgary
Terri Torres, Oregon Institute of Technology
Mahbobeh Vezvaei, Kent State University
Yishi Wang, University of North Carolina–Wilmington
John Ward, Jefferson Community and Technical College
Debra Wiens, Rocky Mountain College
Victor Williams, Paine College
Christopher Wilson, Butler University
Anne Yust, Birmingham-Southern College
Biao Zhang, The University of Toledo
Michael L. Zwilling, University of Mount Union
The professionals at Macmillan, in particular, Terri Ward, Karen Carson, Jorge Amaral, Emily Tenenbaum, Ed Dionne, Blake Logan, and Susan Wein, have contributed greatly to the success of IPS. In addition, we would like to thank Tadd Colver at Purdue University for his valuable contributions to the ninth edition, including authoring the back-of-book answers, solutions, and Instructor’s Guide. We’d also like to thank Monica Jackson at American University for accuracy reviewing the back-of-book answers and solutions and for authoring the test bank. Thanks also to Michael Zwilling at University of Mount Union for accuracy reviewing the test bank, Christopher Edwards at University of Wisconsin Oshkosh for authoring the lecture slides, and James Stamey at Baylor University for authoring the Clicker slides.
Most of all, we are grateful to the many friends and collaborators whose data and research questions have enabled us to gain a deeper understanding of the science of data. Finally, we would like to acknowledge John W. Tukey, whose contributions to data analysis have had such a great influence on us as well as on a whole generation of applied statisticians.
Media and Supplements
LaunchPad, our online course space, combines an interactive e-Book with high-quality multimedia content and ready-made assessment options, including LearningCurve adaptive quizzing. Content is easy to assign or adapt with your own material, such as readings, videos, quizzes, discussion groups, and more. LaunchPad also provides access to a Gradebook that offers a window into your students’ performance, either individually or as a whole. Use LaunchPad on its own or integrate it with your school’s learning management system so your class is always on the same page. To learn more about LaunchPad for Introduction to the Practice of Statistics, Ninth Edition, or to request access, go to launchpadworks.com.
Assets integrated into LaunchPad include:
Interactive e-Book. Every LaunchPad e-Book comes with powerful study tools for students, video and multimedia content, and easy customization for instructors. Students can search, highlight, and bookmark, making it easier to study and access key content. And teachers can ensure that their classes get just the book they want to deliver: customize and rearrange chapters; add and share notes and discussions; and link to quizzes, activities, and other resources.
LearningCurve provides students and instructors with powerful adaptive quizzing, a game-like format, direct links to the e-Book, and instant feedback. The quizzing system features questions tailored specifically to the text and adapts to students’ responses, providing material at different difficulty levels and topics based on student performance.
JMP Student Edition (developed by SAS) is easy to learn and contains all the capabilities required for introductory statistics. JMP is the leading commercial data analysis software of choice for scientists, engineers, and analysts at companies throughout the world (for Windows and Mac). Register inside LaunchPad at no additional cost.
CrunchIt!® is a Web-based statistical program that allows users to perform all the statistical operations and graphing needed for an introductory statistics course and more. It saves users time by automatically loading data from IPS, 9e, and it provides the flexibility to edit and import additional data.
StatBoards Videos are brief whiteboard videos that illustrate difficult topics through additional examples, written and explained by a select group of statistics educators.
Stepped Tutorials are centered on algorithmically generated quizzing with step-by-step feedback to help students work their way toward the correct solution. These exercise tutorials (two to three per chapter) are easily assignable and assessable.
Statistical Video Series consists of StatClips, StatClips Examples, and Statistically Speaking “Snapshots.” View animated lecture videos, whiteboard lessons, and documentary-style footage that illustrate key statistical concepts and help students visualize statistics in real-world scenarios.
Video Technology Manuals, available for TI-83/84 calculators, Minitab, Excel, JMP, SPSS, R, Rcmdr, and CrunchIt!®, provide brief instructions for using specific statistical software.
StatTutor Tutorials offer multimedia tutorials that explore important concepts and procedures in a presentation that combines video, audio, and interactive features. The newly revised format includes built-in, assignable assessments and a bright new interface.
Statistical Applets give students hands-on opportunities to familiarize themselves with important statistical concepts and procedures in an interactive setting that allows them to manipulate variables and see the results graphically. Icons in the textbook indicate when an applet is available for the material being covered. Applets are assessable and assignable in LaunchPad.
Stats@Work Simulations put students in the role of the statistical consultant, helping them better understand statistics interactively within the context of real-life scenarios.
EESEE Case Studies (Electronic Encyclopedia of Statistical Examples and Exercises), developed by The Ohio State University Statistics Department, teach students to apply their statistical skills by exploring actual case studies using real data.
http://launchpadworks.com
SolutionMaster offers an easy-to-use web-based version of the instructor’s solutions, allowing instructors to generate a solution file for any set of homework exercises.
Data files are available in JMP, ASCII, Excel, TI, Minitab, SPSS (an IBM Company)*, R, and CSV formats.
Student Solutions Manual provides solutions to the odd-numbered exercises in the text and is available as a print supplement and electronically in LaunchPad.
Instructor’s Guide with Full Solutions includes teaching suggestions, chapter comments, and detailed solutions to all exercises and is available electronically in LaunchPad.
Test Bank offers hundreds of multiple-choice questions and is available in LaunchPad.
Lecture Slides offer a customizable, detailed lecture presentation of statistical concepts covered in each chapter of IPS, 9e. Image slides contain all textbook figures and tables. Lecture slides and image slides are available in LaunchPad.
WebAssign offers algorithmic questions from IPS, 9e, in a powerful online instructional system. WebAssign lets you easily create assignments, grade homework, and give your students instant feedback. Along with flexible features, class and question-level analytics are available for instructors and students. WebAssign Premium also includes the following resources described above: e-Book, data files, LearningCurve, StatTutor Tutorials, Statistical Videos, Video Technology Manuals, solutions manuals, lecture and image slides, i-Clicker slides, test bank, and practice quizzes.
Additional Resources Available with IPS, 9e
Special Software Package. A student version of JMP is available for packaging with the printed text. JMP is also available inside LaunchPad at no additional cost.
i-Clicker is a two-way radio-frequency classroom response solution developed by educators for educators. Each step of i-Clicker’s development has been informed by teaching and learning.
* SPSS was acquired by IBM in October 2009
To Students: What Is Statistics?
Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. We are bombarded by data in our everyday lives. The news mentions movie box-office sales, the latest poll of the president’s popularity, and the average high temperature for today’s date. Advertisements claim that data show the superiority of the advertiser’s product. All sides in public debates about economics, education, and social policy argue from data. A knowledge of statistics helps separate sense from nonsense in this flood of data.
The study and collection of data are also important in the work of many professions, so training in the science of statistics is valuable preparation for a variety of careers. Each month, for example, government statistical offices release the latest numerical information on unemployment and inflation. Economists and financial advisers, as well as policymakers in government and business, study these data in order to make informed decisions. Doctors must understand the origin and trustworthiness of the data that appear in medical journals. Politicians rely on data from polls of public opinion. Business decisions are based on market research data that reveal consumer tastes and preferences. Engineers gather data on the quality and reliability of manufactured products. Most areas of academic study make use of numbers and, therefore, also make use of the methods of statistics. This means it is extremely likely that your undergraduate research projects will involve, at some level, the use of statistics.
Learning from Data
The goal of statistics is to learn from data. To learn, we often perform calculations or make graphs based on a set of numbers. But to learn from data, we must do more than calculate and plot because data are not just numbers; they are numbers that have some context that helps us learn from them.
More than two-thirds of Americans are overweight or obese according to the Centers for Disease Control and Prevention (CDC) website (www.cdc.gov/nchs/nhanes.htm). What does it mean to be obese or to be overweight? To answer this question, we need to talk about body mass index (BMI). Your weight in kilograms divided by the square of your height in meters is your BMI. A man who is 6 feet tall (1.83 meters) and weighs 180 pounds (81.65 kilograms) will have a BMI of 81.65/(1.83)² = 24.4 kg/m². How do we interpret this number? According to the CDC, a person is classified as overweight if his or her BMI is between 25 and 29.9 kg/m² and as obese if his or her BMI is 30 kg/m² or more. Therefore, more than two-thirds of Americans have a BMI of 25 kg/m² or more. The man who weighs 180 pounds and is 6 feet tall is not overweight or obese, but if he gains 5 pounds, his BMI would increase to 25.1, and he would be classified as overweight.
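The BMI arithmetic above can be checked with a short Python sketch. The function names are illustrative and not part of the text. The sketch also assumes that any BMI of at least 25 but below 30 counts as overweight, which is how the two CDC cutoffs quoted above are usually read.

```python
# 1 pound is defined as exactly 0.45359237 kilograms.
KG_PER_POUND = 0.45359237


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def cdc_category(bmi_value):
    """Classify a BMI with the cutoffs quoted in the text (assumed: 25 <= BMI < 30 is overweight)."""
    if bmi_value >= 30:
        return "obese"
    if bmi_value >= 25:
        return "overweight"
    return "not overweight or obese"


height_m = 1.83  # 6 feet, rounded as in the text
for pounds in (180, 185):
    value = bmi(pounds * KG_PER_POUND, height_m)
    print(f"{pounds} lb: BMI = {value:.1f} kg/m^2 -> {cdc_category(value)}")
```

The number alone (24.4 or 25.1) says little. The classification step is where the context of the data enters.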
When you do statistical problems, even straightforward textbook problems, don’t just graph or calculate. Think about the context and state your conclusions in the specific setting of the problem. As you are learning how to do statistical calculations and graphs, remember that the goal of statistics is not calculation for its own sake but gaining understanding from numbers. The calculations and graphs can be automated by a calculator or software, but you must supply the understanding. This book presents only the most common specific procedures for statistical analysis. A thorough grasp of the principles of statistics will enable you to quickly learn more advanced methods as needed. On the other hand, a fancy computer analysis carried out without attention to basic principles will often produce elaborate nonsense. As you read, seek to understand the principles as well as the necessary details of methods and recipes.
The Rise of Statistics
Historically, the ideas and methods of statistics developed gradually as society grew interested in collecting and using data for a variety of applications. The earliest origins of statistics lie in the desire of rulers to count the number of inhabitants or measure the value of taxable land in their domains. As the physical sciences developed in the seventeenth and eighteenth centuries, the importance of careful measurements of weights, distances, and other physical quantities grew. Astronomers and surveyors striving for exactness had to deal with variation in their measurements. Many measurements should be better than a single measurement, even though they vary among themselves. How can we best combine many varying observations? Statistical methods that are still important were invented in order to analyze scientific measurements.
By the nineteenth century, the agricultural, life, and behavioral sciences also began to rely on data to answer fundamental questions. How are the heights of parents and children related? Does a new variety of wheat produce higher yields than the old, and under what conditions of rainfall and fertilizer? Can a person’s mental ability and behavior be measured just as we measure height and reaction time? Effective methods for dealing with such questions developed slowly and with much debate.
As methods for producing and understanding data grew in number and sophistication, the new discipline of statistics took shape in the twentieth century. Ideas and techniques that originated in the collection of government data, in the study of astronomical or biological measurements, and in the attempt to understand heredity or intelligence came together to form a unified “science of data.” That science of data—statistics—is the topic of this text.
The Organization of This Book
Part I of this book, called simply “Looking at Data,” concerns data analysis and data production. The first two chapters deal with statistical methods for organizing and describing data. These chapters progress from simpler to more complex data. Chapter 1 examines data on a single variable; Chapter 2 is devoted to relationships among two or more variables. You will learn both how to examine data produced by others and how to organize and summarize your own data. These summaries will first be graphical, then numerical, and then, when appropriate, in the form of a mathematical model that gives a compact description of the overall pattern of the data. Chapter 3 outlines arrangements (called designs) for producing data that answer specific questions. The principles presented in this chapter will help you to design proper samples and experiments for your research projects and to evaluate other such investigations in your field of study.
Part II, consisting of Chapters 4 through 8, introduces statistical inference—formal methods for drawing conclusions from properly produced data. Statistical inference uses the language of probability to describe how reliable its conclusions are, so some basic facts about probability are needed to understand inference. Probability is the subject of Chapters 4 and 5. Chapter 6, perhaps the most important chapter in the text, introduces the reasoning of statistical inference. Effective inference is based on good procedures for producing data (Chapter 3), careful examination of the data (Chapters 1 and 2), and an understanding of the nature of statistical inference as discussed in Chapter 6. Chapters 7 and 8 describe some of the most common specific methods of inference, for drawing conclusions about means and proportions from one and two samples.
The five shorter chapters in Part III introduce somewhat more advanced methods of inference, dealing with relations in categorical data, regression and correlation, and analysis of variance. Four supplementary chapters, available from the text website, present additional statistical topics.
What Lies Ahead
Introduction to the Practice of Statistics is full of data from many different areas of life and study. Many exercises ask you to express briefly some understanding gained from the data. In practice, you would know much more about the background of the data you work with and about the questions you hope the data will answer. No textbook can be fully realistic. But it is important to form the habit of asking, “What do the data tell me?” rather than just concentrating on making graphs and doing calculations.
You should have some help in automating many of the graphs and calculations. You should certainly have a calculator with basic statistical functions. Look for keywords such as “two-variable statistics” or “regression” when you shop for a calculator. More advanced (and more expensive) calculators will do much more, including some statistical graphs. You may be asked to use software as well. There are many kinds of statistical software, from spreadsheets to large programs for advanced users of statistics. The kind of computing available to learners varies a great deal from place to place—but the big ideas of statistics don’t depend on any particular level of access to computing.
Because graphing and calculating are automated in statistical practice, the most important assets you can gain from the study of statistics are an understanding of the big ideas and the beginnings of good judgment in working with data. Ideas and judgment can’t (at least yet) be automated. They guide you in telling the computer what to do and in interpreting its output. This book tries to explain the most important ideas of statistics, not just teach methods. Some examples of big ideas that you will meet are “always plot your data,” “randomized comparative experiments,” and “statistical significance.”
You learn statistics by doing statistical problems. “Practice, practice, practice.” Be prepared to work problems. The basic principle of learning is persistence. Being organized and persistent is more helpful in reading this book than knowing lots of math. The main ideas of statistics, like the main ideas of any important subject, took a long time to discover and take some time to master. The gain will be worth the pain.
About the Authors
David S. Moore is Shanti S. Gupta Distinguished Professor of Statistics, Emeritus, at Purdue University and was 1998 president of the American Statistical Association. He received his AB from Princeton and his PhD from Cornell, both in mathematics. He has written many research papers in statistical theory and served on the editorial boards of several major journals.
Professor Moore is an elected fellow of the American Statistical Association and of the Institute of Mathematical Statistics and is an elected member of the International Statistical Institute. He has served as program director for statistics and probability at the National Science Foundation.
In recent years, Professor Moore has devoted his attention to the teaching of statistics. He was the content developer for the Annenberg/Corporation for Public Broadcasting college-level telecourse, Against All Odds: Inside Statistics, and for the series of video modules, Statistics: Decisions through Data, intended to aid the teaching of statistics in schools. He is the author of influential articles on statistics education and of several leading texts. Professor Moore has served as president of the International Association for Statistical Education and has received the Mathematical Association of America’s national award for distinguished college or university teaching of mathematics.
George P. McCabe is Associate Dean for Academic Affairs in the College of Science and Professor of Statistics at Purdue University. In 1966, he received a BS degree in mathematics from Providence College and in 1970 a PhD in mathematical statistics from Columbia University. His entire professional career has been spent at Purdue, with sabbaticals at Princeton University, the Commonwealth Scientific and Industrial Research Organization (CSIRO) in Melbourne (Australia), the University of Berne (Switzerland), the National Institute of Standards and Technology (NIST) in Boulder, Colorado, and the National University of Ireland in Galway. Professor McCabe is an elected fellow of the American Association for the Advancement of Science and of the American Statistical Association; he was 1998 chair of its section on Statistical Consulting. In 2008–2010, he served on the Institute of Medicine Committee on Nutrition Standards for the National School Lunch and Breakfast Programs. He has served on the editorial boards of several statistics journals. He has consulted with many major corporations and has testified as an expert witness on the use of statistics in several cases.
Professor McCabe’s research interests have focused on applications of statistics. Much of his recent work has focused on problems in nutrition, including nutrient requirements, calcium metabolism, and bone health. He is the author or coauthor of more than 190 publications in many different journals.
Bruce A. Craig is Professor of Statistics and Director of the Statistical Consulting Service at Purdue University. He received his BS in mathematics and economics from Washington University in St. Louis and his PhD in statistics from the University of Wisconsin–Madison. He is an elected fellow of the American Association for the Advancement of Science and of the American Statistical Association and was chair of its section on Statistical Consulting in 2009. He has also been an active member of the Eastern North American Region of the International Biometric Society and was elected by the voting membership to the Regional Committee between 2003 and 2006.
Professor Craig has served on the editorial board of several statistical journals and has been a member of several data and safety monitoring boards, including Purdue’s institutional review board.
Professor Craig’s research interests focus on the development of novel statistical methodology to address research questions in the life sciences. Areas of current interest are diagnostic testing, inter-rater agreement, and abundance estimation. He is an author or coauthor of more than 100 papers in more than 50 different journals. In 2005, he was named Purdue University Faculty Scholar.
Data Table Index
TABLE 1.1 IQ test scores for 60 randomly chosen fifth-grade students
TABLE 1.2 Service times (seconds) for calls to a customer service center
TABLE 1.3 Educational data for 78 seventh-grade students
TABLE 2.1 Four data sets for exploring correlation and regression
TABLE 2.2 Two measures of glucose level in diabetics
TABLE 2.3 Dwelling permits, sales, and production for 21 countries
TABLE 2.4 World record times for the 10,000-meter run
TABLE 5.1 Length (in minutes) of 60 visits to a statistics help room
TABLE 7.1 Monthly rates of return on a portfolio (%)
TABLE 7.2 Parts measurements using optical software
TABLE 7.3 DRP scores for third-graders
TABLE 7.4 Seated systolic blood pressure (mm Hg)
TABLE 7.5 Length (in seconds) of audio files sampled from an iPod
TABLE 10.1 Annual number of tornadoes in the United States between 1953 and 2014
TABLE 10.2 In-state tuition and fees (in dollars) for 33 public universities
TABLE 10.3 Sales price and assessed value (in thousands of $) of 35 homes in a midwestern city
TABLE 10.4 Watershed area (km²), percent forest, and index of biotic integrity
TABLE 13.1 Iron content (mg/100 g) of food cooked in different pots
TABLE 13.2 Tool diameter data
Beyond the Basics Index
Chapter 1 Density estimation
Chapter 2 Data mining
Chapter 3 Capture-recapture sampling
Chapter 4 More laws of large numbers
Chapter 5 Weibull distributions
Chapter 7 The bootstrap
Chapter 8 The plus four confidence interval for a single proportion
Chapter 8 The plus four confidence interval for a difference in proportions
Chapter 8 Relative risk
Chapter 9 Meta-analysis
Chapter 10 Nonlinear regression
Chapter 11 Multiple logistic regression
Chapter 12 Testing the equality of spread
CHAPTER 1 Looking at Data—Distributions
1.1 Data
1.2 Displaying Distributions with Graphs
1.3 Describing Distributions with Numbers
1.4 Density Curves and Normal Distributions
Introduction
Statistics is the science of learning from data. Data are numerical or qualitative descriptions of the objects that we want to study. In this chapter, we will master the art of examining data.
We begin in Section 1.1 with some basic ideas about data. We will learn about the different types of data that are collected and how data sets are organized.
Section 1.2 starts our process of learning from data by looking at graphs. These visual displays give us a picture of the overall patterns in a set of data. We have excellent software tools that help us make these graphs. However, it takes a little experience and a lot of judgment to study the graphs carefully and to explain what they tell us about our data.
Section 1.3 continues our process of learning from data by computing numerical summaries. These sets of numbers describe key characteristics of the patterns that we saw in our graphical summaries.
The final section in this chapter helps us make the transition from data summaries to statistical models that are used to draw conclusions and to make predictions. Specifically, we learn about using density curves to describe a set of data and are introduced to the Normal distributions. These distributions can be used to describe many sets of data that we will encounter. They also play a fundamental role in many of the methods of statistical analysis.
1.1 Data
When you complete this section, you will be able to:
Give examples of cases in a data set.
Identify the variables in a data set.
Demonstrate how a label can be used as a variable in a data set.
Identify the values of a variable.
Classify variables as categorical or quantitative.
Describe the key characteristics of a set of data.
Explain how a rate is the result of adjusting one variable to create another.
A statistical analysis starts with a set of data. We construct a set of data by first deciding what cases, or units, we want to study. For each case, we record information about characteristics that we call variables.
CASES, LABELS, VARIABLES, AND VALUES
Cases are the objects described by a set of data. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.
A label is a special variable used in some data sets to distinguish the different cases.
A variable is a characteristic of a case.
Different cases can have different values of the variables.
EXAMPLE 1.1
COUPONS
Restaurant discount coupons. A website offers coupons that can be used to get discounts for various items at local restaurants. Coupons for food are very popular. Figure 1.1 gives information for seven restaurant coupons that were available for a recent weekend. These are the cases. Data for each coupon are listed on a different line, and the first column has the coupons numbered from 1 to 7. The remaining columns give the type of restaurant, the name of the restaurant, the item being discounted, the regular price, and the discount price.
FIGURE 1.1 Spreadsheet of food discount coupons, Example 1.1.
Some variables, like the type of restaurant, the name of the restaurant, and the item, simply place coupons into categories. The regular price and discount price columns have numerical values for which we can do arithmetic. It makes sense to give an average of the regular prices, but it does not make sense to give an “average” type of restaurant. We can, however, do arithmetic to compare the regular prices classified by type of restaurant.
CATEGORICAL AND QUANTITATIVE VARIABLES
A categorical variable places a case into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
EXAMPLE 1.2
COUPONS
Categorical and quantitative variables for coupons. The restaurant discount coupon file has six variables: coupon number, type of restaurant, name of restaurant, item, regular price, and discount price. The two price variables are quantitative variables. Coupon number, type of restaurant, name of restaurant, and item are categorical variables.
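The distinction can be made concrete with a small Python sketch. The rows below are hypothetical values invented for illustration; they are not the data in Figure 1.1. The column names Type and Item are also assumptions, while ID, Name, RegPrice, and DiscPrice follow the names used in this section.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coupon rows for illustration only (NOT the Figure 1.1 data).
coupons = [
    {"ID": 1, "Type": "Italian", "Name": "Restaurant A", "Item": "Pizza", "RegPrice": 20.0, "DiscPrice": 10.0},
    {"ID": 2, "Type": "Italian", "Name": "Restaurant B", "Item": "Pasta", "RegPrice": 30.0, "DiscPrice": 18.0},
    {"ID": 3, "Type": "BBQ", "Name": "Restaurant C", "Item": "Ribs", "RegPrice": 40.0, "DiscPrice": 25.0},
    {"ID": 4, "Type": "Mexican", "Name": "Restaurant D", "Item": "Tacos", "RegPrice": 10.0, "DiscPrice": 6.0},
]

CATEGORICAL = ("ID", "Type", "Name", "Item")
QUANTITATIVE = ("RegPrice", "DiscPrice")

# Averaging makes sense for a quantitative variable.
avg_regular = mean(row["RegPrice"] for row in coupons)

# A categorical variable is used for grouping (or counting), not averaging.
prices_by_type = defaultdict(list)
for row in coupons:
    prices_by_type[row["Type"]].append(row["RegPrice"])
avg_by_type = {kind: mean(prices) for kind, prices in prices_by_type.items()}

print("Average regular price:", avg_regular)
print("Average regular price by type:", avg_by_type)
```

Note that ID holds numbers, yet averaging the coupon numbers would be meaningless. Whether arithmetic makes sense, not whether the values are digits, decides the classification.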
An appropriate label for your cases should be chosen carefully. In our food coupon example, a natural choice of a label would be the name of the restaurant. However, if there are two or more coupons available for a particular restaurant, or if a restaurant is a chain with different discounts offered at different locations, then the name of the restaurant would not uniquely label each of the coupons. In the restaurant discount coupon file, the first variable, ID, is a unique label for each coupon.
spreadsheet
The display in Figure 1.1 is from an Excel spreadsheet. Spreadsheets are very useful for doing the kind of simple computations that you will do in Exercise 1.2. You can type in a formula and have the same computation performed for each row.
Note that the names we have chosen for the variables in our spreadsheet do not have spaces. For example, instead of “Restaurant Name” for the name of the restaurant, we simply use Name. In some statistical software packages, spaces are not allowed in variable names. For this reason, when creating spreadsheets for eventual use with statistical software, it is best to avoid spaces in variable names. Another convention is to use an underscore (_) where you would normally use a space. For our data set, we could have used Regular_Price and Discount_Price for the two price variables.
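The underscore convention is easy to apply automatically. The following minimal Python sketch uses a hypothetical helper, to_variable_name, that is not part of any statistical package.

```python
def to_variable_name(label):
    """Replace each run of whitespace in a column label with a single underscore."""
    return "_".join(label.split())


for label in ("Regular Price", "Discount Price", "Name"):
    print(label, "->", to_variable_name(label))
```

Some packages impose further restrictions on names (for example, on leading digits or punctuation), which this sketch does not attempt to handle.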
USE YOUR KNOWLEDGE
1.1 Read the spreadsheet. Refer to Figure 1.1. Give the regular price and the discount price for the Smokey Grill ribs coupon.
1.2 How much is the discount worth? Refer to Example 1.1. Consider adding another column to the spreadsheet that gives the coupon savings. Explain how you would compute the entries in this column. Does the new column contain values for a categorical variable or for a quantitative variable? Explain your answer.
unit of measurement
Another important part of the description of any quantitative variable is its unit of measurement. For both RegPrice and DiscPrice, the unit of measurement is clearly dollars. In other settings, it may not be as obvious. For example, if we were measuring heights of children, we might choose to use either inches or centimeters. The units of measurement are an important part of the description of a quantitative variable.
Key characteristics of a data set
In practice, any set of data is accompanied by background information that helps us understand the data. When you plan a statistical study or explore data from someone else’s work, ask yourself the following questions:
1. Who? What cases do the data describe? How many cases does the data set contain?
2. What? How many variables do the data contain? What are the exact definitions of these variables? What are the units of measurement for each quantitative variable?
3. Why? What purpose do the data have? Do we hope to answer some specific questions? Do we want to draw conclusions about cases other than the ones we actually have data for? Are the variables that are recorded suitable for the intended purpose?
EXAMPLE 1.3
Statistics class data. Suppose that you are a teaching assistant for a statistics class and one of your jobs is to keep track of the grades for students in two sections of the course. The cases are the students in the class. There are weekly homework assignments, two exams during the semester, and a final exam. Each of these components is given a numerical score, and the components are added to get a total score that can range from 0 to 1000. Cutoffs of 900, 800, 700, etc., are used to assign letter grades of A, B, C, etc.
The spreadsheet for this course will have seven variables:
An identifier for each student.
The number of points earned for homework.
The number of points earned for the first exam.
The number of points earned for the second exam.
The number of points earned for the final exam.
The total number of points earned.
The letter grade earned.
The student identifier is a label and the letter grade earned is a categorical variable. All the other variables are measured in “points.” Because we can do arithmetic with their values, these variables are quantitative variables.
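The way the quantitative total determines the categorical letter grade can be sketched in Python. The cutoffs of 900, 800, and 700 come from the example. The cutoff of 600 for a D is an assumption that continues the "etc." pattern, and the component scores in the usage lines are hypothetical.

```python
# Cutoffs of 900, 800, and 700 are from the example; 600 for a D is an
# assumption made by continuing the pattern.
CUTOFFS = [(900, "A"), (800, "B"), (700, "C"), (600, "D")]


def total_points(homework, exam1, exam2, final):
    """Add the component scores to get the total score (a quantitative variable)."""
    return homework + exam1 + exam2 + final


def letter_grade(total):
    """Convert a total score from 0 to 1000 into a letter grade (a categorical variable)."""
    if not 0 <= total <= 1000:
        raise ValueError("total score must be between 0 and 1000")
    for cutoff, letter in CUTOFFS:
        if total >= cutoff:
            return letter
    return "F"


# Hypothetical scores for one student.
total = total_points(homework=180, exam1=170, exam2=160, final=330)
print(total, letter_grade(total))
```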
In our example of statistics class data, the possible values for the grade variable are A, B, C, D, and F. When computing grade point averages, many colleges and universities translate these letter grades into numbers using A = 4, B = 3, C = 2, D = 1, and F = 0. The transformed variable with numeric values is considered to be quantitative because we can average the numerical values across different courses to obtain a grade point average.
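The translation from letter grades to numbers can be written as a lookup table. This Python sketch computes a simple unweighted average. Treating every course equally is an assumption made here for brevity, since many institutions weight each course by its credit hours.

```python
from statistics import mean

# Translation of letter grades into numbers, as given in the text.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(letter_grades):
    """Unweighted average of grade points (assumes every course counts equally)."""
    return mean(GRADE_POINTS[grade] for grade in letter_grades)


# Hypothetical grades in four courses.
print(grade_point_average(["A", "B", "B", "C"]))
```

The average is meaningful only for the transformed numeric variable. The original letters are categories and cannot be averaged directly.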
Sometimes, experts argue about numerical scales such as this. They ask whether or not the difference between an A and a B is the same as the difference between a D and an F. Similarly, many questionnaires ask people to respond on a 1 to 5 scale, with 1 representing strongly agree, 2 representing agree, etc. Again we could ask whether or not the five possible values for this scale are equally spaced in some sense. From a practical point of view, the averages that can be computed when we convert categorical scales such as these to numerical values frequently provide a very useful way to summarize data.
EXAMPLE 1.4
Who, what, and why for the statistics class data. The data set in Example 1.3 was constructed to keep track of the grades for students in an introductory statistics course. The cases are the students in the class. There are seven variables in this data set. These include a label for each student and scores for the various course requirements. There are no units for the label and grade. The other variables all have “points” as the unit.
USE YOUR KNOWLEDGE
1.3 Who, what, and why? For the restaurant discount coupon data of Example 1.1 (page 2), what cases do the data describe? How many cases are there? How many variables are there? What are their definitions and units of measurement? What purpose do the data have?
EXAMPLE 1.5
Statistics class data for a different purpose. Suppose that the data for the students in the introductory statistics class were also to be used to study relationships between student characteristics and success in the course. Here, we have decided to focus on the TotalPoints and Grade as the outcomes of interest. Other variables of interest would have been included—for example, Sex, PrevStat (whether or not the student has taken a statistics course previously), and Year (student classification as first, second, third, or fourth year). ID is a categorical variable, TotalPoints is a quantitative variable, and the remaining variables are all categorical.
USE YOUR KNOWLEDGE
1.4 Apartment rentals. A data set lists apartments available for students to rent. Information provided includes the monthly rent, whether or not cable is included free of charge, whether or not pets are allowed, the number of bedrooms, and the distance to the campus. Describe the cases in the data set, give the number of variables, and specify whether each variable is categorical or quantitative.
instrument
Often, the variables in a statistical study are easy to understand: height in centimeters, study time in minutes, and so on. But each area of work also has its own special variables. A psychologist uses the Minnesota Multiphasic Personality Inventory (MMPI), and a physical fitness expert measures “VO2 max” (the volume of oxygen consumed per minute while exercising at your maximum capacity). Both of these variables are measured with special instruments. VO2 max is measured by exercising while breathing into a mouthpiece connected to an apparatus that measures oxygen consumed. Scores on the MMPI are based on a long questionnaire, which is also called an instrument.
Part of mastering your field of work is learning what variables are important and how they are best measured. Because details of particular measurements usually require knowledge of the particular field of study, we will say little about them.
rate
Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions. Often, for example, the rate at which something occurs is a more meaningful measure than a simple count of occurrences.
EXAMPLE 1.6
Comparing colleges based on graduates. Think about comparing colleges based on the numbers of graduates. This view tells you something about the relative sizes of different colleges. However, if you are interested in how well colleges succeed at graduating students they admit, it would be better to use a rate. For example, you can find data on the Internet on the six-year graduation rates of different colleges. These rates are computed by examining the progress of first-year students who enroll in a given year. Suppose that at College A there were 1000 first-year students in a particular year, and 800 graduated within six years. The graduation rate is
800/1000 = 0.80
or 80%. College B has 2000 students who entered in the same year, and 1200 graduated within six years. The graduation rate is
1200/2000 = 0.60
or 60%. How do we compare these two colleges? College B has more graduates but College A has a better graduation rate.
adjusting one variable to create another
In Example 1.6, when we computed the graduation rate, we used the total number of students to adjust the number of graduates. We constructed a new variable by dividing the number of graduates by the total number of students. Computing a rate is just one of several ways of adjusting one variable to create another. We often divide one variable by another to compute a more meaningful variable to study. Example 1.20 (page 20) is another type of adjustment.
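The adjustment in Example 1.6 can be sketched in a few lines of Python. This is only an illustration: the function name `rate` and the dictionary layout are our own choices, and the numbers are the ones given for College A and College B.

```python
def rate(count, total):
    """Adjust a count by a total: return count / total as a proportion."""
    return count / total

# Entering first-year students and six-year graduates from Example 1.6.
colleges = {
    "College A": {"entered": 1000, "graduated": 800},
    "College B": {"entered": 2000, "graduated": 1200},
}

rates = {name: rate(c["graduated"], c["entered"]) for name, c in colleges.items()}

for name, r in rates.items():
    # The count and the rate answer different questions about each college.
    print(f"{name}: {colleges[name]['graduated']} graduates, rate {r:.0%}")
```

College B leads on the count and College A leads on the rate, which is why the choice of variable should follow from the question being asked.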
USE YOUR KNOWLEDGE
1.5 How should you express the change? Between the first exam and the second exam in your statistics course, you increased the amount of time that you spent working exercises. Which of the following three ways would you choose to express the results of your increased work: (a) give the grades on the two exams, (b) give the ratio of the grade on the second exam divided by the grade on the first exam, (c) take the difference between the grade on the second exam and the grade on the first exam, and express this as a percent of the grade on the first exam. Give reasons for your answer.
1.6 Which variable would you choose? Refer to Example 1.6 on colleges and their graduates.
(a) Give a setting in which you would prefer to evaluate the colleges based on the numbers of graduates. Give a reason for your choice.
(b) Give a setting in which you would prefer to evaluate the colleges based on the graduation rates. Give a reason for your choice.
Exercises 1.5 and 1.6 illustrate an important point about presenting the results of your statistical calculations. Always consider how to best communicate your results to a general audience. For example, the numbers produced by your calculator or by statistical software frequently contain more digits than are needed. Be sure that you do not include extra information generated by software that will distract from a clear explanation of what you have found.
SECTION 1.1 SUMMARY
A data set contains information on a number of cases. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.
For each case, the data give values for one or more variables. A variable describes some characteristic of a case, such as a person’s height, gender, or salary. Variables can have different values for different cases.
A label is a special variable used to identify cases in a data set.
Some variables are categorical and others are quantitative. A categorical variable places each individual into a category, such as male or female. A quantitative variable has numerical values that measure some characteristic of each case, such as height in centimeters or annual salary in dollars.
The key characteristics of a data set answer the questions Who?, What?, and Why?
SECTION 1.1 EXERCISES
For Exercises 1.1 and 1.2, see page 3; for Exercise 1.3, see page 5; for Exercise 1.4, see page 5; and for Exercises 1.5 and 1.6, see page 6.
1.7 How do you do online research? A study of 552 first-year college students asked about their favorite choice for doing online research. Possible choices were “Google or Google Scholar,” “Library database or website,” “Wikipedia or online encyclopedia,” and “Other.” Names of the students were not recorded, but the students were numbered from 1 to 552 in the data file. The researchers also recorded age, sex, and major area of study for each student.
(a) What are the cases?
(b) Identify the variables and their possible values.
(c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.
(d) Was a label used? Explain your answer.
(e) Summarize the key characteristics of your data set.
1.8 Summer jobs. You are collecting information about summer jobs that are available for college students in your area. Describe a data set that you could use to organize the information that you collect.
(a) What are the cases?
(b) Identify the variables and their possible values.
(c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.
(d) Use a label and explain how you chose it.
(e) Summarize the key characteristics of your data set.
1.9 Employee application data. The personnel department keeps records on all employees in a company. Here is the information that they keep in one of their data files: employee identification number, last name, first name, middle initial, department, number of years with the company, salary, education (coded as high school, some college, or college degree), and age.
(a) What are the cases for this data set?
(b) Describe each type of information as a label, a quantitative variable, or a categorical variable.
(c) Set up a spreadsheet that could be used to record the data. Give appropriate column headings and five sample cases.
1.10 How would you rank cities? Various organizations rank cities and produce lists of the 10 or the 100 best based on various measures. Create a list of criteria that you would use to rank cities. Include at least eight variables, and give reasons for your choices. Say whether each variable is quantitative or categorical.
1.11 Survey of students. A survey of students in an introductory statistics class asked the following questions: (1) age; (2) do you like to sing? (Yes, No); (3) can you play a musical instrument (not at all, a little, pretty well); (4) how much did you spend on food last week (in dollars); (5) height.
(a) Classify each of these variables as categorical or quantitative and give reasons for your answers.
(b) For each variable give the possible values.
1.12 What questions would you ask? Refer to the previous exercise. Make up your own survey with at least six questions. Include at least two categorical variables and at least two quantitative variables. Tell which variables are categorical and which are quantitative. Give reasons for your answers. For each variable, give the possible values.
1.13 How would you rate colleges? Popular magazines rank colleges and universities on their “academic quality” in serving undergraduate students. Describe five variables that you would like to see measured for each college if you were choosing where to study. Give reasons for each of your choices.
1.14 Attending college in your state or in another state. The U.S. Census Bureau collects a large amount of information concerning higher education.1 For example, the bureau provides a table that includes the following variables: state, number of students from the state who attend college, number of students who attend college in their home state.
(a) What are the cases for this set of data?
(b) Is there a label variable? If yes, what is it?
(c) Identify each variable as categorical or quantitative.
(d) Explain how you might use each of the quantitative variables to explain something about the states.
(e) Consider a variable computed as the number of students in each state who attend college in the state divided by the total number of students from the state who attend college. Explain how you would use this variable to explain something about the states.
1.15 Alcohol-impaired driving fatalities. A report on drunk-driving fatalities in the United States gives the number of alcohol-impaired driving fatalities for each state.2 Discuss at least three different ways that these numbers could be converted to rates. Give the advantages and disadvantages of each.
1.2 Displaying Distributions with Graphs
When you complete this section, you will be able to:
Analyze the distribution of a categorical variable using a bar graph.
Analyze the distribution of a categorical variable using a pie chart.
Analyze the distribution of a quantitative variable using a stemplot.
Analyze the distribution of a quantitative variable using a histogram.
Examine the distribution of a quantitative variable with respect to the overall pattern of the data and deviations from that pattern.
Identify the shape, center, and spread of the distribution of a quantitative variable.
Identify and describe any outliers in the distribution of a quantitative variable.
Use a time plot to describe the distribution of a quantitative variable that is measured over time.
exploratory data analysis
Statistical tools and ideas help us examine data to describe their main features. This examination is called exploratory data analysis. Like an explorer crossing unknown lands, we want first to simply describe what we see. Here are two basic strategies that help us organize our exploration of a set of data:
Begin by examining each variable by itself. Then move on to study the relationships among the variables.
Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
We follow these principles in organizing our learning. This chapter presents methods for describing a single variable. We will study relationships among several variables in Chapter 2. Within each chapter, we will begin with graphical displays, then add numerical summaries for a more complete description.
Categorical variables: Bar graphs and pie charts
distribution of a categorical variable
count percent proportion
The values of a categorical variable are labels for the categories, such as “yes” and “no.” The distribution of a categorical variable lists the categories and gives either the count or the percent of cases that fall in each category. An alternative to the percent is the proportion, the count divided by the sum of the counts. Note that the percent is simply the proportion times 100.
EXAMPLE 1.7
ONLINE
How do you do online research? A study of 552 first-year college students asked about their preferences for online resources. One question asked them to pick their favorite.3 Here are the results:
Resource                            Count (n)
Google or Google Scholar            406
Library database or website         75
Wikipedia or online encyclopedia    52
Other                               19
Total                               552
Resource is the categorical variable in this example, and the values are the names of the online resources.
Note that the last value of the variable resource is “Other,” which includes all other online resources that were given as selection options. For data sets that have a large number of values for a categorical variable, we often create a category such as this that includes categories that have relatively small counts or percents. Careful judgment is needed when doing this. You don’t want to cover up some important piece of information contained in the data by combining data in this way.
EXAMPLE 1.8
ONLINE
Favorites as percents. When we look at the online resources data set, we see that Google is the clear winner. We see that 406 reported Google or Google Scholar as their favorite. To interpret this number, we need to know that the total number of students polled was 552. When we say that Google is the winner, we can describe this win by saying that 73.6% (406 divided by 552, expressed as a percent) of the students reported Google as their favorite. Here is a table of the preference percents:
Resource                            Percent (%)
Google or Google Scholar            73.6
Library database or website         13.6
Wikipedia or online encyclopedia    9.4
Other                               3.4
Total                               100.0
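The step from counts to proportions to percents can be checked with a short Python sketch. The variable names are our own; the counts are those reported in Example 1.7.

```python
# Counts of favorite online resource from Example 1.7.
counts = {
    "Google or Google Scholar": 406,
    "Library database or website": 75,
    "Wikipedia or online encyclopedia": 52,
    "Other": 19,
}

total = sum(counts.values())  # number of students polled

# Proportion = count / sum of counts; percent = proportion times 100.
proportions = {name: n / total for name, n in counts.items()}
percents = {name: round(100 * p, 1) for name, p in proportions.items()}

for name, pct in percents.items():
    print(f"{name:<34}{pct:>6.1f}")
```

Rounded percents do not always add to exactly 100. Here they do, but it is worth checking the total whenever you report a rounded table.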
The use of graphical methods allows us to see this information and other characteristics of the data easily. We now examine two types of graphs.
EXAMPLE 1.9
ONLINE
bar graph
Bar graph for the online resource preference data. Figure 1.2 displays the online resource preference data using a bar graph. The heights of the four bars show the percents of the students who reported each of the resources as their favorite.
FIGURE 1.2 Bar graph for the online resource preference data, Example 1.9.
The categories in a bar graph can be put in any order. In Figure 1.2, we ordered the resources based on their preference percents. For other data sets, an alphabetical ordering or some other arrangement might produce a more useful graphical display.
You should always consider the best way to order the values of the categorical variable in a bar graph. Choose an ordering that will be useful to you. If you have difficulty, ask a friend if your choice communicates what you expect. Note that a bar graph using counts will look the same as a bar graph using percents. A pie chart naturally uses percents.
EXAMPLE 1.10
ONLINE
pie chart
Pie chart for the online resource preference data. The pie chart in Figure 1.3 helps us see what part of the whole each group forms. Here it is very easy to see that Google is the favorite for about three-quarters of the students.
FIGURE 1.3 Pie chart for the online resource preference data, Example 1.10.
USE YOUR KNOWLEDGE
1.16
ONLINE
Compare the bar graph with the pie chart. Refer to the bar graph in Figure 1.2 and the pie chart in Figure 1.3 for the online resource preference data. Which graphical display does a better job of describing the data? Give reasons for your answer.
To make a pie chart, you must include all the categories that make up a whole. A category such as “Other” in this example can be used, but the sum of the percents for all the categories should be 100%. This constraint makes bar graphs more flexible.
Quantitative variables: Stemplots and histograms
A stemplot (also called a stem-and-leaf plot) gives a quick picture of the shape of a distribution while including the actual numerical values in the graph. Stemplots work best for small numbers of observations that are all greater than 0.
STEMPLOT
To make a stemplot,
1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
EXAMPLE 1.11
STAT
Soluble corn fiber and calcium. Soluble corn fiber (SCF) has been promoted for various health benefits. One study examined the effect of SCF on the absorption of calcium of adolescent boys and girls. Calcium absorption is expressed as a percent of calcium in the diet. Here are the data for the condition where subjects consumed 12 grams per day (g/d) of SCF.4
50 43 43 44 50 44 35 49 54 76 31 48 61 70 62 47 42 45 43 59 53 53 73
To make a stemplot of these data, use the first digits as stems and the second digits as leaves. Figure 1.4 shows the steps in making the plot. Figure 1.4(a) shows the stems, which have values 3, 4, 5, 6, and 7. The first entry in our data set is 50. This appears in Figure 1.4(b) on the 5 stem with a leaf of 0. Similarly, the second value, 43, appears in the 4 stem with a leaf of 3. The stemplot is completed in Figure 1.4(c), where the leaves are ordered from smallest to largest.
The center of the distribution is in the 40s, and the data are more stretched out toward high values than low values (the highest value is 76, while the lowest is 31). In the plot, we do not see any extreme values that lie far from the remaining data.
FIGURE 1.4 Making a stemplot of the data in Example 1.11. (a) Write the stems. (b) Go through the data and write each leaf on the proper stem. For example, the values on the 3-stem are 35 and 31 in the order given in the display for the example. (c) Arrange the leaves on each stem in order out from the stem. The 3-stem now has leaves 1 and 5.
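The three steps for making a stemplot can also be sketched in Python. This is a minimal illustration for nonnegative whole numbers, not the output of any particular statistical package; the function name `stemplot` is our own, and the data are the SCF values from Example 1.11.

```python
from collections import defaultdict

def stemplot(values):
    """Return the text lines of a basic stemplot for nonnegative integers."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # step 1: final digit is the leaf
        leaves[stem].append(leaf)
    lines = []
    # Step 2: stems in a column, smallest at the top (empty stems are kept).
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves in increasing order out from the stem.
        row = "".join(str(d) for d in sorted(leaves[stem]))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Calcium absorption (percent) with 12 g/d of SCF, Example 1.11.
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]

print("\n".join(stemplot(scf)))
```

The printed plot has five stems, 3 through 7, with the longest row of leaves on the 4 stem.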
1.17
USE YOUR KNOWLEDGE
STAT
Make a stemplot. Here are the scores on the first exam in an introductory statistics course for 30 students in one section of the course:
82 73 92 82 75 98 94 57 80 90 92 80 87 91 65 73 70 85 83 61 70 90 75 75 59 68 85 78 80 94
Use these data to make a stemplot. Then use the stemplot to describe the distribution of the first-exam scores for this course.
back-to-back stemplot
When you wish to compare two related distributions, a back-to-back stemplot with common stems is useful. The leaves on each side are ordered out from the common stem.
EXAMPLE 1.12
SCF
Soluble corn fiber and calcium. Refer to Example 1.11, which gives the data for subjects consuming 12 g/d of SCF. Here are the data for subjects under control conditions (0 g/d of SCF):
42 33 41 49 42 47 48 47 53 72 47 63 68 59 35 46 43 55 38 49 51 51 66
Figure 1.5 gives the back-to-back stemplot for the SCF and control conditions. The values on the left give absorption for the control condition, while the values on the right give absorption when SCF was consumed. The values for SCF appear to be somewhat higher than the controls.
FIGURE 1.5 A back-to-back stemplot to compare the distributions of calcium absorption under control and SCF conditions, Example 1.12.
splitting stems
trimming
There are two modifications of the basic stemplot that can be helpful in different situations. You can double the number of stems in a plot by splitting each stem into two: one with leaves 0 to 4 and the other with leaves 5 through 9.
When the observed values have many digits, it is often best to trim the numbers by removing the last digit or digits before making a stemplot. If you are using software, you can round the data, which is what was done for the data given in Example 1.11.
You must use your judgment in deciding whether to split stems and whether to trim or round, though statistical software will often make these choices for you. Remember that the purpose of a stemplot is to display the shape of a distribution. If there are many stems with no leaves or only one leaf, trimming will reduce the number of stems. Let’s take a look at the effect of splitting the stems for our SCF data.
EXAMPLE 1.13
SCF
Stemplot with split stems for SCF. Figure 1.6 presents the data from Example 1.12 in a stemplot with split stems.
FIGURE 1.6 A back-to-back stemplot with split stems to compare the distributions of calcium absorption under control and SCF conditions, Example 1.13.
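A back-to-back stemplot with split stems can be sketched as follows. This is an assumption-laden illustration rather than a reproduction of Figure 1.6: the function names are hypothetical, the data are the control and SCF values from Examples 1.11 and 1.12, and every split stem between the smallest and largest is printed even when it has no leaves.

```python
def split_stem_leaves(values):
    """Map (stem, half) to sorted leaves; half 0 holds leaves 0-4, half 1 holds 5-9."""
    groups = {}
    for v in values:
        stem, leaf = divmod(v, 10)
        groups.setdefault((stem, leaf // 5), []).append(leaf)
    return {key: sorted(leaves) for key, leaves in groups.items()}

def back_to_back(left, right):
    """Return text lines with left leaves | stem | right leaves."""
    left_groups = split_stem_leaves(left)
    right_groups = split_stem_leaves(right)
    keys = sorted(set(left_groups) | set(right_groups))
    width = max((len(v) for v in left_groups.values()), default=0)
    lines = []
    stem, half = keys[0]
    while (stem, half) <= keys[-1]:
        # Leaves on both sides are ordered out from the common stem,
        # so the left side is written in reverse.
        lhs = "".join(str(d) for d in reversed(left_groups.get((stem, half), [])))
        rhs = "".join(str(d) for d in right_groups.get((stem, half), []))
        lines.append(f"{lhs.rjust(width)} | {stem} | {rhs}")
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return lines

control = [42, 33, 41, 49, 42, 47, 48, 47, 53, 72, 47, 63,
           68, 59, 35, 46, 43, 55, 38, 49, 51, 51, 66]
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]

print("\n".join(back_to_back(control, scf)))
```

Keeping empty split stems in the display preserves any gaps in the data, which is the issue raised in the exercise on keeping the space.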
USE YOUR KNOWLEDGE
1.18 Which stemplot do you prefer? Look carefully at the stemplots for the SCF data in Figures 1.5 and 1.6. Which do you prefer? Give reasons for your answer.
1.19 Why should you keep the space? Suppose that you had a data set similar to the one given in Example 1.12, but in which the control values of 66 and 68 were both changed to 64.
(a) Make a stemplot of these data using split stems.
(b) Should you use one stem or two stems for the 60s? Give a reason for your answer. (Hint: How would your choice reveal or conceal a potentially important characteristic of the data?)
Histograms
Stemplots display the actual values of the observations. This feature makes stemplots awkward for large data sets. Moreover, the picture presented by a stemplot divides the observations into groups (stems) determined by the number system rather than by judgment.
histogram
Histograms do not have these limitations. A histogram breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. You can choose any convenient number of classes, but you should choose classes of equal width.
Making a histogram by hand requires more work than a stemplot. Histograms do not display the actual values observed. For these reasons, we prefer stemplots for small data sets.
The construction of a histogram is best shown by example. Most statistical software packages will make a histogram for you.
EXAMPLE 1.14
IQ
Distribution of IQ scores. You have probably heard that the distribution of scores on IQ tests is supposed to be roughly “bell-shaped.” Let’s look at some actual IQ scores. Table 1.1 displays the IQ scores of 60 fifth-grade students chosen at random from one school.
1. Divide the range of the data into classes of equal width. Let’s use
75 ≤ IQ score < 85
85 ≤ IQ score < 95
...
145 ≤ IQ score < 155
TABLE 1.1 IQ Test Scores for 60 Randomly Chosen Fifth-Grade Students
145  139  126  122  125  130   96  110  118  118
101  142  134  124  112  109  134  113   81  113
123   94  100  136  109  131  117  110  127  124
106  124  115  133  116  102  127  117  109  137
117   90  103  114  139  101  122  105   97   89
102  108  110  128  114  112  114  102   82  101
Be sure to specify the classes precisely so that each individual falls into exactly one class. A student with IQ 84 would fall into the first class, but IQ 85 falls into the second.
frequency frequency table
2. Count the number of individuals in each class. These counts are called frequencies, and a table of frequencies for all classes is a frequency table.
Class                    Count
75 ≤ IQ score < 85       2
85 ≤ IQ score < 95       3
95 ≤ IQ score < 105      10
105 ≤ IQ score < 115     16
115 ≤ IQ score < 125     13
125 ≤ IQ score < 135     10
135 ≤ IQ score < 145     5
145 ≤ IQ score < 155     1
3. Draw the histogram. First, on the horizontal axis mark the scale for the variable whose distribution you are displaying. That’s the IQ score. The scale runs from 75 to 155 because that is the span of the classes we chose. The vertical axis contains the scale of counts. Each bar represents a class. The base of the bar covers the class, and the bar height is the class count. There is no horizontal space between the bars unless a class is empty, in which case its bar has height zero. Figure 1.7 is our histogram. It does look roughly “bell-shaped.”
FIGURE 1.7 Histogram of the IQ scores of 60 fifth-grade students, Example 1.14.
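The counting in step 2 of Example 1.14 can be reproduced with a short Python sketch. The function `frequency_table` and its parameters are our own illustrative names; the data are the 60 IQ scores in Table 1.1 and the classes are the ones chosen in the example.

```python
# IQ scores of 60 fifth-grade students (Table 1.1).
iq = [
    145, 139, 126, 122, 125, 130, 96, 110, 118, 118,
    101, 142, 134, 124, 112, 109, 134, 113, 81, 113,
    123, 94, 100, 136, 109, 131, 117, 110, 127, 124,
    106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
    117, 90, 103, 114, 139, 101, 122, 105, 97, 89,
    102, 108, 110, 128, 114, 112, 114, 102, 82, 101,
]

def frequency_table(values, start, width, n_classes):
    """Count values in equal-width classes start + k*width <= v < start + (k+1)*width."""
    counts = [0] * n_classes
    for v in values:
        k = (v - start) // width  # index of the class that contains v
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

table = frequency_table(iq, start=75, width=10, n_classes=8)
for low, high, count in table:
    # One asterisk per student gives a rough sideways histogram.
    print(f"{low:>3} <= IQ score < {high:<3}  {count:>2}  {'*' * count}")
```

Because each class includes its lower endpoint and excludes its upper endpoint, a score of 84 is counted in the first class and a score of 85 in the second, so every individual falls into exactly one class.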
Large sets of data are often reported in the form of frequency tables when it is not practical to publish the individual observations. In addition to the frequency (count) for each class, we may be interested in the fraction or percent of the observations that fall in each class. A histogram of percents looks just like a frequency histogram such as Figure 1.7. Simply relabel the vertical scale to read in percents. Use histograms of percents for comparing several distributions that have different numbers of observations.
USE YOUR KNOWLEDGE
STAT
1.20 Make a histogram. Refer to the first-exam scores from Exercise 1.17 (page 12). Use these data to make a histogram with classes 50 to 59, 60 to 69, etc. Compare the histogram with the stemplot as a way of describing this distribution. Which do you prefer for these data?
Our eyes respond to the area of the bars in a histogram. Because the classes are all the same width, area is determined by height and all classes are fairly represented. There is no one right choice of the classes in a histogram. Too few classes will give a “skyscraper” graph, with all values in a few classes with tall bars. Too many will produce a “pancake” graph, with most classes having one or no observations. Neither choice will give a good picture of the shape of the distribution. You must use your judgment in choosing classes to display the shape. Statistical software will choose the classes for you. The software’s choice is often a good one, but you can change it if you want.
You should be aware that the appearance of a histogram can change when you change the classes. The histogram function in the One-Variable Statistical Calculator applet on the text website allows you to change the number of classes by dragging with the mouse, so that it is easy to see how the choice of classes affects the histogram.
USE YOUR KNOWLEDGE
1.21 Change the classes in the histogram. Refer to the first-exam scores from Exercise 1.17 (page 12) and the histogram that you produced in Exercise 1.20. Now make a histogram for these data using classes 40 to 59, 60 to 79, and 80 to 100. Compare this histogram with the one that you produced in Exercise 1.20. Which do you prefer? Give a reason for your answer.
STAT
1.22 Use smaller classes. Repeat the previous exercise using classes 55 to 59, 60 to 64, 65 to 69, etc. Of the three histograms, which do you prefer? Give reasons for your answer.
Although histograms resemble bar graphs, their details and uses are distinct. A histogram shows the distribution of counts or percents among the values of a single variable. A bar graph compares the counts or percents of different items. The horizontal axis of a bar graph need not have any measurement scale but simply identifies the items being compared.
Draw bar graphs with blank space between the bars to separate the items being compared. Draw histograms with no space, to indicate that all values of the variable are covered. Some spreadsheet programs, which are not primarily intended for statistics, will draw histograms as if they were bar graphs, with space between the bars. Often, you can tell the software to eliminate the space to produce a proper histogram.
Data analysis in action: Don’t hang up on me
Many businesses operate call centers to serve customers who want to place an order or make an inquiry. Customers want their requests handled thoroughly. Businesses want to treat customers well, but they also want to avoid wasted time on the phone. They therefore monitor the length of calls and encourage their representatives to keep calls short.
TABLE 1.2 Service Times (Seconds) for Calls to a Customer Service Center
  77   289   128    59    19   148   157   203   126   118
 104   141   290    48     3     2   372   140   438    56
  44   274   479   211   179     1    68   386  2631    90
  30    57    89   116   225   700    40    73    75    51
 148     9   115    19    76   138   178    76    67   102
  35    80   143   951   106    55     4    54   137   367
 277   201    52     9   700   182    73   199   325    75
 103    64   121    11     9    88  1148     2   465    25
EXAMPLE 1.15
CALLS80
How long are customer service center calls? We have data on the lengths of all 31,492 calls made to the customer service center of a small bank in a month. Table 1.2 displays the lengths of the first 80 calls.5
Take a look at the data in Table 1.2. In this data set, the cases are calls made to the bank’s call center. The variable recorded is the length of each call. The units are seconds. We see that the call lengths vary a great deal. The longest call lasted 2631 seconds, almost 44 minutes. More striking is that 8 of these 80 calls lasted less than 10 seconds.
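A first look of this kind is easy to script. The following Python sketch, with variable names of our own choosing, uses the 80 call lengths in Table 1.2 to find the longest call and to count the calls shorter than 10 seconds.

```python
# Lengths in seconds of the first 80 calls (Table 1.2).
calls = [
    77, 289, 128, 59, 19, 148, 157, 203, 126, 118,
    104, 141, 290, 48, 3, 2, 372, 140, 438, 56,
    44, 274, 479, 211, 179, 1, 68, 386, 2631, 90,
    30, 57, 89, 116, 225, 700, 40, 73, 75, 51,
    148, 9, 115, 19, 76, 138, 178, 76, 67, 102,
    35, 80, 143, 951, 106, 55, 4, 54, 137, 367,
    277, 201, 52, 9, 700, 182, 73, 199, 325, 75,
    103, 64, 121, 11, 9, 88, 1148, 2, 465, 25,
]

longest = max(calls)
very_short = [c for c in calls if c < 10]  # calls under 10 seconds

print(f"Number of calls: {len(calls)}")
print(f"Longest call: {longest} seconds ({longest / 60:.1f} minutes)")
print(f"Calls under 10 seconds: {len(very_short)} "
      f"({len(very_short) / len(calls):.0%} of the calls)")
```

Scanning a list works for 80 calls, but a graph is the practical tool for all 31,492.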
We started our study of the customer service center data by examining a few cases, the ones displayed in Table 1.2. It would be very difficult to examine all 31,492 cases in this way. How can we do this? Let’s try a histogram.
EXAMPLE 1.16
CALLS
Histogram for customer service center call lengths. Figure 1.8 is a histogram of the lengths of all 31,492 calls. We did not plot the few lengths greater than 1200 seconds (20 minutes). As expected, the graph shows that most calls last between about 1 and 5 minutes, with some lasting much longer when customers have complicated problems. More striking is the fact that 7.6% of all calls are no more than 10 seconds long.
FIGURE 1.8 The distribution of call lengths for 31,492 calls to a bank’s customer service center, Example 1.16. The data show a surprising number of very short calls. These are mostly due to representatives deliberately hanging up in order to bring down their average call length.
It turned out that the bank penalized representatives whose average call length was too long, so some representatives just hung up on customers to bring their average length down. Neither the customers nor the bank was happy about this. The bank changed its policy, and later data showed that calls under 10 seconds had almost disappeared.
tails
The extreme values of a distribution are in the tails of the distribution. The high values are in the upper, or right, tail and the low values are in the lower, or left, tail. The overall pattern in Figure 1.8 is made up of the many moderate call lengths and the long right tail of more lengthy calls. The striking deviation from the overall pattern is the surprising number of very short calls in the left tail.
Our examination of the call center data illustrates some important principles:
After you understand the background of your data (cases, variables, units of measurement), the first thing to do is plot your data.
When you look at a plot, look for an overall pattern and also for any striking deviations from the pattern.
Examining distributions
Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution, you can see its important features as follows.
EXAMINING A DISTRIBUTION
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.
In Section 1.3, we will learn how to describe center and spread numerically. For now, we can describe the center of a distribution by its midpoint, the value with roughly half the observations taking smaller values and half taking larger values. We can describe the spread of a distribution by giving the smallest and largest values. Stemplots and histograms display the shape of a distribution in the same way. Just imagine a stemplot turned on its side so that the larger values lie to the right.
Some things to look for in describing shape are
modes unimodal
Does the distribution have one or several major peaks, called modes? A distribution with one major peak is called unimodal.
symmetric skewed
Is it approximately symmetric or is it skewed in one direction? A distribution is symmetric if the values smaller and larger than its midpoint are mirror images of each other. It is skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values), and skewed to the left if the left tail is much longer than the right tail.
Some variables commonly have distributions with predictable shapes. Many biological measurements on specimens from the same species and sex—lengths of bird bills, heights of young women—have symmetric distributions. Money amounts, on the other hand, usually have right-skewed distributions. There are many moderately priced houses, for example, but the few very expensive mansions give the distribution of house prices a strong right-skew.
EXAMPLE 1.17
IQ
Examine the histogram of IQ scores. What does the histogram of IQ scores (Figure 1.7, page 15) tell us?
Shape: The distribution is roughly symmetric with a single peak in the center. We don’t expect real data to be perfectly symmetric, so in judging symmetry, we are satisfied if the two sides of the histogram are roughly similar in shape and extent.
Center: You can see from the histogram that the midpoint is not far from 110. Looking at the actual data shows that the midpoint is 114.
Spread: The histogram has a spread from 75 to 155. Looking at the actual data shows that the spread is from 81 to 145. There are no outliers or other strong deviations from the symmetric, unimodal pattern.
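The midpoint and spread quoted from "the actual data" can be verified directly. This Python sketch uses the standard library's `statistics.median` as the midpoint, which matches the informal definition given earlier (about half the observations smaller, half larger); the variable names are our own.

```python
from statistics import median

# IQ scores of 60 fifth-grade students (Table 1.1).
iq = [
    145, 139, 126, 122, 125, 130, 96, 110, 118, 118,
    101, 142, 134, 124, 112, 109, 134, 113, 81, 113,
    123, 94, 100, 136, 109, 131, 117, 110, 127, 124,
    106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
    117, 90, 103, 114, 139, 101, 122, 105, 97, 89,
    102, 108, 110, 128, 114, 112, 114, 102, 82, 101,
]

midpoint = median(iq)                  # center: half the scores lie on each side
smallest, largest = min(iq), max(iq)   # spread: smallest and largest values

print(f"Midpoint: {midpoint}")
print(f"Spread: from {smallest} to {largest}")
```

The histogram suggests a spread of 75 to 155 only because those are the class boundaries; the data themselves run from 81 to 145.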
EXAMPLE 1.18
Examine the histogram of call lengths. The distribution of call lengths in Figure 1.8, on the other hand, is strongly skewed to the right. The midpoint, the length of a typical call, is about 115 seconds, or just under 2 minutes. The spread is very large, from 1 second to 28,739 seconds.
The longest few calls are outliers. They stand apart from the long right tail of the distribution, though we can’t see this from Figure 1.8, which omits the largest observations. The longest call lasted almost 8 hours—that may well be due to equipment failure rather than an actual customer call.
USE YOUR KNOWLEDGE
STAT
1.23 Describe the first-exam scores. Refer to the first-exam scores from Exercise 1.17 (page 12). Use your favorite graphical display to describe the shape, the center, and the spread of these data. Are there any outliers?
Dealing with outliers
In data sets smaller than the service call data, you can spot outliers by looking for observations that stand apart (either high or low) from the overall pattern of a histogram or stemplot. Identifying outliers is a matter for judgment. Look for points that are clearly apart from the body of the data, not just the most extreme observations in a distribution. You should search for an explanation for any outlier. Sometimes outliers point to errors made in recording the data. In other cases, the outlying observation may be caused by equipment failure or other unusual circumstances.
EXAMPLE 1.19
COLLEGE
College students. How does the number of undergraduate college students vary by state? Figure 1.9 is a histogram of the numbers of undergraduate students in each of the states.6 Notice that more than 50% of the states are included in the first bar of the histogram. These states have fewer than 300,000 undergraduates. The next bar includes another 30% of the states. These have between 300,000 and 600,000 students. The bar at the far right of the histogram corresponds to the state of California, which has 2,685,893 undergraduates. California certainly stands apart from the other states for this variable. It is an outlier.
FIGURE 1.9 The distribution of the numbers of undergraduate college students for the 50 states, Example 1.19.
The state of California is an outlier in the previous example because it has a very large number of undergraduate students. California has the largest population of all the states, so we might expect it to have a large number of college students. Let’s look at these data in a different way.
EXAMPLE 1.20
COLLEGE
College students per 1000. To account for the fact that there is large variation in the populations of the states, for each state we divide the number of undergraduate students by the population and then multiply by 1000. This gives the undergraduate college enrollment expressed as the number of students per 1000 people in each state. Figure 1.10 gives a stemplot of the distribution. California has 60 undergraduate students per 1000 people. This is one of the higher values in the distribution, but it is clearly not an outlier.
FIGURE 1.10 Stemplot of the numbers of undergraduate college students per 1000 people in each of the 50 states, Example 1.20.
USE YOUR KNOWLEDGE
COLLEGE
1.24 Four states with large populations. There are four states with populations greater than 15 million.
(a) Examine the data file and report the names of these four states.
(b) Find these states in the distribution of number of undergraduate students per 1000 people. To what extent do these four states influence the distribution of number of undergraduate students per 1000 people?
In Example 1.19, we looked at the distribution of the number of undergraduate students, while in Example 1.20, we adjusted these data by expressing the counts as number per 1000 people in each state. Which way is correct? The answer depends upon why you are examining the data.
If you are interested in marketing a product to undergraduate students, the unadjusted numbers would be of interest because you want to reach the most people. On the other hand, if you are interested in comparing states with respect to how well they provide opportunities for higher education to their residents, the population-adjusted values would be more suitable. Always think about why you are doing a statistical analysis, and this will guide you in choosing an appropriate analytic strategy.
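The contrast between counts and population-adjusted rates can be sketched in a few lines of Python. The three states below, with their counts and populations, are invented for illustration and are not taken from the COLLEGE data file. The point is only that the state with the largest count need not have the largest rate.

```python
# Hypothetical data: (undergraduates, population) for three invented states.
states = {
    "State A": (2_400_000, 40_000_000),
    "State B": (450_000, 6_000_000),
    "State C": (90_000, 1_800_000),
}

def per_1000(count, population):
    """Number of undergraduates per 1000 residents."""
    return count / population * 1000

rates = {name: per_1000(count, pop) for name, (count, pop) in states.items()}

# The answer to "which state is largest?" depends on the measure used.
largest_count = max(states, key=lambda name: states[name][0])
largest_rate = max(rates, key=rates.get)

for name in states:
    print(f"{name}: {states[name][0]:>9,} students, {rates[name]:.0f} per 1000")
print("Largest count:", largest_count)
print("Largest rate: ", largest_rate)
```

With these made-up numbers, State A leads on the raw count while State B leads on the per-1000 rate, which is why the purpose of the analysis should drive the choice.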
Here is an example with a different kind of outlier.
EXAMPLE 1.21
PTH
Healthy bones and PTH. Bones are constantly being built up (bone formation) and torn down (bone resorption). Young people who are growing have more formation than resorption. When we age, resorption increases to the point where it exceeds formation. (The same phenomenon occurs when astronauts travel in space.) The result is osteoporosis, a disease associated with fragile bones that are more likely to break. The underlying mechanisms that control these processes are complex and involve a variety of substances. One of these is parathyroid hormone (PTH). Here are the values of PTH measured on a sample of 29 boys and girls aged 12 to 15 years:7
39 59 30 48 71 31 25 31 71 50 38 63 49 45 31 33 28 40 127 49 59 50 64 28 46 35 28 19 29
The data are measured in picograms per milliliter (pg/ml) of blood. The original data were recorded with one digit after the decimal point. They have been rounded to simplify our presentation here. Figure 1.11 gives a stemplot of the data.
The observation 127 clearly stands out from the rest of the distribution. A PTH measurement on this individual taken on a different day was similar to the rest of the values in the data set. We conclude that this outlier was caused by a laboratory error or a recording error, and we are confident in discarding it for any additional analysis.
FIGURE 1.11 Stemplot of the values of PTH, Example 1.21.
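A stemplot like this one is easy to produce by hand, but a short Python sketch shows the mechanics. It assumes tens as stems and ones as leaves. The function name `stemplot` is our own, and the layout may differ in detail from Figure 1.11.

```python
from collections import defaultdict

# PTH values (pg/ml) from Example 1.21.
pth = [39, 59, 30, 48, 71, 31, 25, 31, 71, 50, 38, 63, 49, 45, 31,
       33, 28, 40, 127, 49, 59, 50, 64, 28, 46, 35, 28, 19, 29]

def stemplot(values):
    """Return stemplot lines with tens as stems and ones as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | {''.join(leaves.get(stem, []))}")
    return lines

plot = stemplot(pth)
print("\n".join(plot))
```

Keeping the empty stems between 7 and 12 is what makes the observation 127 visibly stand apart from the rest of the data.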
Time plots
Whenever data are collected over time, it is a good idea to plot the observations in time order. Displays of the distribution of a variable that ignore time order, such as stemplots and histograms, can be misleading when there is systematic change over time.
TIME PLOT
A time plot of a variable plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.
EXAMPLE 1.22
VITDS
Seasonal variation in vitamin D. Although we get some of our vitamin D from food, most of us get about 75% of what we need from the sun. Cells in the skin make vitamin D in response to sunlight. If people do not get enough exposure to the sun, they can become deficient in vitamin D, resulting in weakened bones and other health problems. The elderly, who need more vitamin D than younger people, and people who live in northern areas, where there is relatively little sunlight in the winter, are particularly vulnerable to these problems.
Figure 1.12 is a plot of the serum levels of vitamin D versus time of year for samples of subjects from Switzerland.8 The units measuring vitamin D are nanomoles per liter (nmol/l) of blood. The observations are grouped into periods of two months for the plot. Means are marked by filled-in circles and are connected by a line in the plot. The effect of the lack of sunlight in the winter months on vitamin D levels is clearly evident in the plot.
FIGURE 1.12 Plot of vitamin D versus months of the year, Example 1.22.
The data described in the preceding example are based on a subset of the subjects in a study of 248 subjects. The researchers were particularly concerned about subjects whose levels were deficient, defined as a serum vitamin D level of less than 50 nmol/l. They found that there was a 3.8-fold higher deficiency rate in February-March than in August-September: 91.2% versus 24.3%. To ensure that individuals from this population have adequate levels of vitamin D, some form of supplementation is needed, particularly during certain times of the year.
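The 3.8-fold figure is the ratio of the two deficiency rates. As a quick check in Python, using only the two percents quoted above (the variable names are ours):

```python
# Deficiency rates quoted in the text (percent of subjects below 50 nmol/l).
rate_feb_mar = 91.2
rate_aug_sep = 24.3

# Ratio of the winter rate to the summer rate.
fold = rate_feb_mar / rate_aug_sep
print(f"Ratio of deficiency rates: {fold:.2f}")
```

The ratio is a little over 3.75, which rounds to the 3.8 reported by the researchers.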
SECTION 1.2 SUMMARY
Exploratory data analysis uses graphs and numerical summaries to describe the variables in a data set and the relations among them.
The distribution of a variable tells us what values it takes and how often it takes these values.
Bar graphs and pie charts display the distributions of categorical variables. These graphs use the counts or percents of the categories.
Stemplots and histograms display the distributions of quantitative variables. Stemplots separate each observation into a stem and a one-digit leaf. Histograms plot the frequencies (counts) or the percents of equal-width classes of values.
When examining a distribution, look for shape, center, and spread and for clear deviations from the overall shape.
Some distributions have simple shapes, such as symmetric or skewed. The number of modes (major peaks) is another aspect of overall shape. Not all distributions have a simple overall shape, especially when there are few observations.
Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
When observations on a variable are taken over time, make a time plot that graphs time horizontally and the values of the variable vertically. A time plot can reveal changes over time.
SECTION 1.2 EXERCISES
For Exercise 1.16, see page 11; for Exercise 1.17, see page 12; for Exercises 1.18 and 1.19, see page 14; for Exercise 1.20, see page 16; for Exercises 1.21 and 1.22, see page 16; for Exercise 1.23, see page 19; and for Exercise 1.24, see page 21.
1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests that Facebook apps can generate large amounts of money, as much as $1 million a month.9 The following table gives the numbers of Facebook users by country for the top 10 countries based on the number of users:10 FACEBK
Country           Facebook users (in millions)
Brazil             29.30
India              37.38
Mexico             29.80
Germany            21.46
France             23.19
Philippines        26.87
Indonesia          40.52
United Kingdom     30.39
United States     155.74
Turkey             30.63
(a) Use a bar graph to describe the numbers of users in these countries.
(b) Do you think that the United States is an outlier in this data set? Explain your answer.
(c) Describe the major features of your graph in a short paragraph.
1.26 Facebook use increases by country. Refer to the previous exercise. The report also gave the increases in the number of Facebook users for a one-month period for the same countries: FACEBK
Country           Increase in users (in millions)
Brazil            2.47
India             1.75
Mexico            0.84
Germany           0.51
France            0.38
Philippines       0.38
Indonesia         0.37
United Kingdom    0.22
United States     0.65
Turkey            0.09
(a) Use a bar graph to describe the increase in users in these countries.
(b) Describe the major features of your graph in a short paragraph.
(c) Do you think a stemplot would be a better graphical display for these data? Give reasons for your answer.
(d) Write a short paragraph about possible business opportunities suggested by the data you described in this exercise and the previous one.
1.27 The Titanic and class. On April 15, 1912, on her maiden voyage, the Titanic collided with an iceberg and sank. The ship was luxurious but did not have enough lifeboats for the 2224 passengers and crew. As a result of the collision, 1502 people died.11 The ship had three classes of passengers. The level of luxury and the price of the ticket varied with the class, with first class being the most luxurious. There were 323 passengers in first class, 277 in second class, and 709 in third class.12
TITANIC
(a) Make a bar graph of these data.
(b) Give a short summary of how the number of passengers varied with class.
(c) If you made a bar graph of the percent of passengers in each class, would the general features of the graph differ from the one you made in part (a)? Explain your answer.
1.28 Another look at the Titanic and class. Refer to the previous exercise. TITANIC
(a) Make a pie chart to display the data.
(b) Compare the pie chart with the bar graph. Which do you prefer? Give reasons for your answer.
1.29 Who survived? Refer to the two previous exercises. The number of first-class passengers who survived was 200. For second and third class, the numbers were 119 and 181, respectively. Create a graphical summary that shows how the survival of passengers depended on class. TITANIC
1.30 Potassium from potatoes. The 2015 Dietary Guidelines for Americans13 notes that the average potassium (K) intake for U.S. adults is about half of the recommended amount. A major source of potassium is potatoes. Nutrients in the diet can have different absorption depending on the source. One study looked at absorption of potassium from different sources. Participants ate a controlled diet for five days, and the amount of potassium absorbed was measured. Data for a diet that included 40 milliequivalents (mEq) of potassium were collected from 27 adult subjects.14 KPOT40
(a) Make a stemplot of the data.
(b) Describe the pattern of the distribution.
(c) Are there any outliers? If yes, describe them and explain why you have declared them to be outliers.
(d) Describe the shape, center, and spread of the distribution.
1.31 Potassium from a supplement. Refer to the previous exercise. Data were also recorded for 29 subjects who received a potassium salt supplement with 40 mEq of potassium. Answer the questions in the previous exercise for the supplemented subjects. KSUP40
1.32 Energy consumption. The U.S. Energy Information Administration reports data summaries of various energy statistics. Let’s look at the total amount of energy consumed, in quadrillions of British thermal units (Btu), for each month in a recent year. Here are the data:15 ENERGY
Month        Energy (quadrillion Btu)
January      9.58
February     8.46
March        8.56
April        7.56
May          7.66
June         7.79
July         8.23
August       8.21
September    7.64
October      7.78
November     8.19
December     8.82
(a) Look at the table and describe how the energy consumption varies from month to month.
(b) Make a time plot of the data and describe the patterns.
(c) Suppose you wanted to communicate information about the month-to-month variation in energy consumption. Which would be more effective, the table of the data or the graph? Give reasons for your answer.
1.33 Energy consumption in a different year. Refer to the previous exercise. Here are the data for the previous year: ENERGY
Month        Energy (quadrillion Btu)
January      8.99
February     8.02
March        8.38
April        7.52
May          7.62
June         7.72
July         8.27
August       8.17
September    7.64
October      7.72
November     8.14
December     9.08
(a) Analyze these data using the questions in the previous exercise as a guide.
(b) Compare the patterns across the two years. Describe any similarities and differences.
1.34 Favorite colors. What is your favorite color? One survey produced the following summary of responses to that question: blue, 42%; green, 14%; purple, 14%; red, 8%; black, 7%; orange, 5%; yellow, 3%; brown, 3%; gray, 2%; and white, 2%.16 Make a bar graph of the percents and write a short summary of the major features of your graph. FAVCOL
1.35 Least-favorite colors. Refer to the previous exercise. The same study also asked people about their least-favorite color. Here are the results: orange, 30%; brown, 23%; purple, 13%; yellow, 13%; gray, 12%; green, 4%; white, 4%; red, 1%; black, 0%; and blue, 0%. Make a bar graph of these percents and write a summary of the results. LFAVCOL
1.36 Garbage. The formal name for garbage is “municipal solid waste.” In the United States, approximately 250 million tons of garbage are generated in a year. Following is a breakdown of the materials that made up American municipal solid waste in 2012:17 GARBAGE
Material             Weight (million tons)   Percent of total
Food scraps           36.4                    14.5
Glass                 11.6                     4.6
Metals                22.4                     8.9
Paper, paperboard     68.6                    27.4
Plastics              31.7                    12.7
Rubber, leather        7.5                     3.0
Textiles              14.3                     5.7
Wood                  15.8                     6.3
Yard trimmings        34.0                    13.5
Other                  8.5                     3.4
Total                250.9                   100.0
(a) Add the weights. The sum is not exactly equal to the value of 250.9 million tons given in the table. Why?
(b) Make a bar graph of the percents. The graph gives a clearer picture of the main contributors to garbage if you order the bars from tallest to shortest.
(c) Also make a pie chart of the percents. Comparing the two graphs, notice that it is easier to see the small differences among “Food scraps,” “Plastics,” and “Yard trimmings” in the bar graph.
1.37 Vehicle colors. Vehicle colors differ among regions of the world. Here are data on the most popular colors for vehicles in North America:18 VCOLOR
Color     Percent
White     24
Black     19
Silver    16
Gray      15
Red       10
Blue       7
Brown      5
Other      4
(a) Describe these data with a bar graph.
(b) Describe these data with a pie chart.
(c) Which graphical summary do you prefer? Give reasons for your answer.
1.38 Sketch a skewed distribution. Sketch a histogram for a distribution that is skewed to the left. Suppose that you and your friends emptied your pockets of coins and recorded the year marked on each coin. The distribution of dates would be skewed to the left. Explain why.
1.39 Grades and self-concept. Table 1.3 presents data on 78 seventh-grade students in a rural midwestern school.19 The researcher was interested in the relationship between the students’ “self-concept” and their academic performance. The data we give here include each student’s grade point average (GPA), score on a standard IQ test, and gender, taken from school records. Gender is coded as F for female and M for male. The students are identified only by an observation number (OBS). The missing OBS numbers show that some students dropped out of the study. The final variable is each student’s score on the Piers-Harris Children’s Self-Concept Scale, a psychological test administered by the researcher. SEVENGR
(a) How many variables does this data set contain? Which are categorical variables and which are quantitative variables?
(b) Make a stemplot of the distribution of GPA, after rounding to the nearest tenth of a point.
(c) Describe the shape, center, and spread of the GPA distribution. Identify any suspected outliers from the overall pattern.
(d) Make a back-to-back stemplot of the rounded GPAs for female and male students. Write a brief comparison of the two distributions.
1.40 Describe the IQ scores. Make a graph of the distribution of IQ scores for the seventh-grade students in Table 1.3. Describe the shape, center, and spread of the distribution, as well as any outliers. IQ scores are usually said to be centered at 100. Is the midpoint for these students close to 100, clearly above, or clearly below? SEVENGR
TABLE 1.3 Educational Data for 78 Seventh-Grade Students
OBS     GPA   IQ  Gender  Self-concept
001   7.940  111  M       67
002   8.292  107  M       43
003   4.643  100  M       52
004   7.470  107  M       66
005   8.882  114  F       58
006   7.585  115  M       51
007   7.650  111  M       71
008   2.412   97  M       51
009   6.000  100  F       49
010   8.833  112  M       51
011   7.470  104  F       35
012   5.528   89  F       54
013   7.167  104  M       54
014   7.571  102  F       64
015   4.700   91  F       56
016   8.167  114  F       69
017   7.822  114  F       55
018   7.598  103  F       65
019   4.000  106  M       40
020   6.231  105  F       66
021   7.643  113  M       55
022   1.760  109  M       20
024   6.419  108  F       56
026   9.648  113  M       68
027  10.700  130  F       69
028  10.580  128  M       70
029   9.429  128  M       80
030   8.000  118  M       53
031   9.585  113  M       65
032   9.571  120  F       67
033   8.998  132  F       62
034   8.333  111  F       39
035   8.175  124  M       71
036   8.000  127  M       59
037   9.333  128  F       60
038   9.500  136  M       64
039   9.167  106  M       71
040  10.140  118  F       72
041   9.999  119  F       54
043  10.760  123  M       64
044   9.763  124  M       58
045   9.410  126  M       70
046   9.167  116  M       72
047   9.348  127  M       70
048   8.167  119  M       47
050   3.647   97  M       52
051   3.408   86  F       46
052   3.936  102  M       66
053   7.167  110  M       67
054   7.647  120  M       63
055   0.530  103  M       53
056   6.173  115  M       67
057   7.295   93  M       61
058   7.295   72  F       54
059   8.938  111  F       60
060   7.882  103  F       60
061   8.353  123  M       63
062   5.062   79  M       30
063   8.175  119  M       54
064   8.235  110  M       66
065   7.588  110  M       44
068   7.647  107  M       49
069   5.237   74  F       44
071   7.825  105  M       67
072   7.333  112  F       64
074   9.167  105  M       73
076   7.996  110  M       59
077   8.714  107  F       37
078   7.833  103  F       63
079   4.885   77  M       36
080   7.998   98  F       64
083   3.820   90  M       42
084   5.936   96  F       28
085   9.000  112  F       60
086   9.500  112  F       70
087   6.057  114  M       51
088   6.057   93  F       21
089   6.938  106  M       56
1.41 Describe the self-concept scores. Based on a suitable graph, briefly describe the distribution of self-concept scores for the students in Table 1.3. Be sure to identify any suspected outliers. SEVENGR
1.42 The Boston Marathon. Women were allowed to enter the Boston Marathon in 1972. Here are the times (in minutes, rounded to the nearest minute) for the winning women from 1972 to 2015.
Make a graph that shows change over time. What overall pattern do you see? Have times stopped improving in recent years? If so, when did improvement end? MARATH
Year  Time
1972  190
1973  186
1974  167
1975  162
1976  167
1977  168
1978  165
1979  155
1980  154
1981  147
1982  150
1983  143
1984  149
1985  154
1986  145
1987  146
1988  145
1989  144
1990  145
1991  144
1992  144
1993  145
1994  142
1995  145
1996  147
1997  146
1998  143
1999  143
2000  146
2001  144
2002  141
2003  145
2004  144
2005  145
2006  143
2007  149
2008  145
2009  152
2010  146
2011  142
2012  151
2013  146
2014  139
2015  145
1.3 Describing Distributions with Numbers
When you complete this section, you will be able to:
Describe the center of a distribution by using the mean.
Describe the center of a distribution by using the median.
Compare the mean and the median as measures of center for a particular set of data.
Describe the spread of a distribution by using quartiles.
Describe a distribution by using the five-number summary.
Describe a distribution by using a boxplot.
Compare one or more sets of data measured on the same variable by using side-by-side boxplots.
Identify outliers by using the 1.5 × IQR rule.
Describe the spread of a distribution by using the standard deviation.
Choose measures of center and spread for a particular set of data.
Compute the effects of a linear transformation on the mean, the median, the standard deviation, and the interquartile range.
We can begin our data exploration with graphs, but numerical summaries make our analysis more specific. For categorical variables, numerical summaries are the counts or percents that we use to construct pie charts or bar graphs. In this section, we focus on numerical summaries for quantitative variables. A brief description of the distribution of a quantitative variable should include its shape and numbers describing its center and spread. We describe the shape of a distribution based on inspection of a histogram or a stemplot. Now we will learn specific ways to use numbers to measure the center and spread of a distribution. We can calculate these numerical measures for any quantitative variable. But to interpret measures of center and spread, and to choose among the several measures we will learn, you must think about the shape of the distribution and the meaning of the data. The numbers, like graphs, are aids to understanding, not “the answer” in themselves.
EXAMPLE 1.23
TTS24
The distribution of business start times. An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. They have determined the time, in days, to complete all the procedures required to start a business.20 Data for 189 countries are included in the data set, TTS. For this section, we examine data, rounded to integers, for a sample of 24 of these countries. Here are the data:
16 4 5 6 5 7 12 19 10 2 25 19 38 5 24 8 6 5 53 32 13 49 11 17
FIGURE 1.13 Stemplot for the sample of 24 business start times, Example 1.23.
The stemplot in Figure 1.13 shows us the shape, center, and spread of the business start times. The stems are tens of days and the leaves are days. The distribution is skewed to the right with a very long tail of high values. All but six of the times are less than 20 days. The center appears to be about 10 days, and the values range from 2 days to 53 days. There do not appear to be any outliers.
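The counts quoted in this description can be confirmed directly from the 24 observations. The following Python sketch uses only the data listed in Example 1.23. The variable names are ours.

```python
# Business start times (days) for the sample of 24 countries, Example 1.23.
times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
         38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

n = len(times)
# Observations in the long right tail: 20 days or more.
at_least_20 = sorted(t for t in times if t >= 20)
smallest, largest = min(times), max(times)

print(f"n = {n}")
print(f"Times of 20 days or more: {at_least_20}")
print(f"Range: {smallest} to {largest} days")
```

Six of the 24 times are 20 days or more, and the values run from 2 to 53 days, in agreement with the stemplot.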
Measuring center: The mean
Numerical description of a distribution begins with a measure of its center or average. The two common measures of center are the mean and the median. The mean is the “average value” and the median is the “middle value.” These are two different ideas for “center,” and the two measures behave differently. We need precise recipes for the mean and the median.
THE MEAN x̅
To find the mean x̅ of a set of observations, add their values and divide by the number of observations. If the n observations are x1, x2, . . . , xn, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or, in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum x_i \]
The Σ (capital Greek sigma) in the formula for the mean is short for “add them all up.” The bar over the x indicates the mean of all the x-values. Pronounce the mean x̅ as “x-bar.” This notation is so common that writers who are discussing data use x̅, y̅, etc., without additional explanation. The subscripts on the observations xi are a way of keeping the n observations separate.
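As an illustration of the recipe, here is a minimal Python sketch that applies it to the 24 business start times of Example 1.23. It adds the values, divides by n, and compares the result with the standard library's `statistics.mean`.

```python
from statistics import mean

# Business start times (days) for the sample of 24 countries, Example 1.23.
times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
         38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

n = len(times)        # number of observations
total = sum(times)    # "add them all up"
x_bar = total / n     # divide by the number of observations

print(f"sum = {total}, n = {n}, mean = {x_bar:.3f} days")
print(f"statistics.mean gives {mean(times):.3f} days")
```

The sum is 391 days over 24 countries, so the mean is a little more than 16 days.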