Introduction to the Practice of Statistics
NINTH EDITION
David S. Moore George P. McCabe Bruce A. Craig Purdue University
Vice President, STEM: Ben Roberts
Publisher: Terri Ward
Senior Acquisitions Editor: Karen Carson
Marketing Manager: Tom DeMarco
Marketing Assistant: Cate McCaffery
Development Editor: Jorge Amaral
Senior Media Editor: Catriona Kaplan
Assistant Media Editor: Emily Tenenbaum
Director of Digital Production: Keri deManigold
Senior Media Producer: Alison Lorber
Associate Editor: Victoria Garvey
Editorial Assistant: Katharine Munz
Photo Editor: Cecilia Varas
Photo Researcher: Candice Cheesman
Director of Design, Content Management: Diana Blume
Text and Cover Designer: Blake Logan
Project Editor: Edward Dionne, MPS North America LLC
Illustrations: MPS North America LLC
Production Manager: Susan Wein
Composition: MPS North America LLC
Printing and Binding: LSC Communications
Cover Illustration: Drawing Water: Spring 2011 detail (Midwest) by David Wicks
“Look Back” Arrow: NewCorner/Shutterstock
Library of Congress Control Number: 2016946039
Student Edition Hardcover: ISBN-13: 978-1-319-01338-7 ISBN-10: 1-319-01338-4
Student Edition Loose-leaf: ISBN-13: 978-1-319-01362-2 ISBN-10: 1-319-01362-7
Instructor Complimentary Copy: ISBN-13: 978-1-319-01428-5 ISBN-10: 1-319-01428-3
© 2017, 2014, 2012, 2009 by W. H. Freeman and Company All rights reserved Printed in the United States of America First printing
W. H. Freeman and Company One New York Plaza Suite 4500 New York, NY 10004-1562 www.macmillanlearning.com
Brief Contents
To Teachers: About This Book To Students: What Is Statistics? About the Authors Data Table Index Beyond the Basics Index
PART I Looking at Data CHAPTER 1 Looking at Data—Distributions
CHAPTER 2 Looking at Data—Relationships
CHAPTER 3 Producing Data
PART II Probability and Inference CHAPTER 4 Probability: The Study of Randomness
CHAPTER 5 Sampling Distributions
CHAPTER 6 Introduction to Inference
CHAPTER 7 Inference for Means
CHAPTER 8 Inference for Proportions
PART III Topics in Inference CHAPTER 9 Inference for Categorical Data
CHAPTER 10 Inference for Regression
CHAPTER 11 Multiple Regression
CHAPTER 12 One-Way Analysis of Variance
CHAPTER 13 Two-Way Analysis of Variance Tables Answers to Odd-Numbered Exercises Notes and Data Sources Index
Contents
To Teachers: About This Book To Students: What Is Statistics? About the Authors Data Table Index Beyond the Basics Index
PART I Looking at Data CHAPTER 1 Looking at Data—Distributions Introduction
1.1 Data Key characteristics of a data set
Section 1.1 Summary Section 1.1 Exercises 1.2 Displaying Distributions with Graphs
Categorical variables: Bar graphs and pie charts Quantitative variables: Stemplots and histograms Histograms Data analysis in action: Don’t hang up on me Examining distributions Dealing with outliers Time plots
Section 1.2 Summary Section 1.2 Exercises 1.3 Describing Distributions with Numbers
Measuring center: The mean Measuring center: The median Mean versus median Measuring spread: The quartiles The five-number summary and boxplots The 1.5 × IQR rule for suspected outliers Measuring spread: The standard deviation Properties of the standard deviation Choosing measures of center and spread Changing the unit of measurement
Section 1.3 Summary Section 1.3 Exercises 1.4 Density Curves and Normal Distributions
Density curves
Measuring center and spread for density curves Normal distributions The 68–95–99.7 rule Standardizing observations Normal distribution calculations Using the standard Normal table Inverse Normal calculations Normal quantile plots
Beyond the Basics: Density estimation Section 1.4 Summary Section 1.4 Exercises Chapter 1 Exercises
CHAPTER 2 Looking at Data—Relationships Introduction
2.1 Relationships Examining relationships
Section 2.1 Summary Section 2.1 Exercises 2.2 Scatterplots
Interpreting scatterplots The log transformation Adding categorical variables to scatterplots Scatterplot smoothers Categorical explanatory variables
Section 2.2 Summary Section 2.2 Exercises 2.3 Correlation
The correlation r Properties of correlation
Section 2.3 Summary Section 2.3 Exercises 2.4 Least-Squares Regression
Fitting a line to data Prediction Least-squares regression Interpreting the regression line Facts about least-squares regression Correlation and regression Another view of r²
Section 2.4 Summary Section 2.4 Exercises 2.5 Cautions about Correlation and Regression
Residuals Outliers and influential observations
Beware of the lurking variable Beware of correlations based on averaged data Beware of restricted ranges
Beyond the Basics: Data mining Section 2.5 Summary Section 2.5 Exercises 2.6 Data Analysis for Two-Way Tables
The two-way table Joint distribution Marginal distributions Describing relations in two-way tables Conditional distributions Simpson’s paradox
Section 2.6 Summary Section 2.6 Exercises 2.7 The Question of Causation
Explaining association Establishing causation
Section 2.7 Summary Section 2.7 Exercises Chapter 2 Exercises
CHAPTER 3 Producing Data Introduction
3.1 Sources of Data Anecdotal data Available data Sample surveys and experiments
Section 3.1 Summary Section 3.1 Exercises 3.2 Design of Experiments
Comparative experiments Randomization Randomized comparative experiments How to randomize Randomization using software Randomization using random digits Cautions about experimentation Matched pairs designs Block designs
Section 3.2 Summary Section 3.2 Exercises 3.3 Sampling Design
Simple random samples How to select a simple random sample
Stratified random samples Multistage random samples Cautions about sample surveys
Beyond the Basics: Capture-recapture sampling Section 3.3 Summary Section 3.3 Exercises 3.4 Ethics
Institutional review boards Informed consent Confidentiality Clinical trials Behavioral and social science experiments
Section 3.4 Summary Section 3.4 Exercises Chapter 3 Exercises
PART II Probability and Inference CHAPTER 4 Probability: The Study of Randomness Introduction
4.1 Randomness The language of probability Thinking about randomness The uses of probability
Section 4.1 Summary Section 4.1 Exercises 4.2 Probability Models
Sample spaces Probability rules Assigning probabilities: Finite number of outcomes Assigning probabilities: Equally likely outcomes Independence and the multiplication rule Applying the probability rules
Section 4.2 Summary Section 4.2 Exercises 4.3 Random Variables
Discrete random variables Continuous random variables Normal distributions as probability distributions
Section 4.3 Summary Section 4.3 Exercises 4.4 Means and Variances of Random Variables
The mean of a random variable Statistical estimation and the law of large numbers
Thinking about the law of large numbers Beyond the Basics: More laws of large numbers
Rules for means The variance of a random variable Rules for variances and standard deviations
Section 4.4 Summary Section 4.4 Exercises 4.5 General Probability Rules
General addition rules Conditional probability General multiplication rules Tree diagrams Bayes’s rule Independence again
Section 4.5 Summary Section 4.5 Exercises Chapter 4 Exercises
CHAPTER 5 Sampling Distributions Introduction
5.1 Toward Statistical Inference Sampling variability Sampling distributions Bias and variability Sampling from large populations Why randomize?
Section 5.1 Summary Section 5.1 Exercises 5.2 The Sampling Distribution of a Sample Mean
The mean and standard deviation of x̅ The central limit theorem A few more facts
Beyond the Basics: Weibull distributions Section 5.2 Summary Section 5.2 Exercises 5.3 Sampling Distributions for Counts and Proportions
The binomial distributions for sample counts Binomial distributions in statistical sampling Finding binomial probabilities Binomial mean and standard deviation Sample proportions Normal approximation for counts and proportions The continuity correction Binomial formula The Poisson distributions
Section 5.3 Summary
Section 5.3 Exercises Chapter 5 Exercises
CHAPTER 6 Introduction to Inference Introduction Overview of inference 6.1 Estimating with Confidence
Statistical confidence Confidence intervals Confidence interval for a population mean How confidence intervals behave Choosing the sample size Some cautions
Section 6.1 Summary Section 6.1 Exercises 6.2 Tests of Significance
The reasoning of significance tests Stating hypotheses Test statistics P-values Statistical significance Tests for a population mean Two-sided significance tests and confidence intervals The P-value versus a statement of significance
Section 6.2 Summary Section 6.2 Exercises 6.3 Use and Abuse of Tests
Choosing a level of significance What statistical significance does not mean Don’t ignore lack of significance Statistical inference is not valid for all sets of data Beware of searching for significance
Section 6.3 Summary Section 6.3 Exercises 6.4 Power and Inference as a Decision
Power Increasing the power Inference as decision Two types of error Error probabilities The common practice of testing hypotheses
Section 6.4 Summary Section 6.4 Exercises Chapter 6 Exercises
CHAPTER 7 Inference for Means
Introduction
7.1 Inference for the Mean of a Population The t distributions The one-sample t confidence interval The one-sample t test Matched pairs t procedures Robustness of the t procedures
Beyond the Basics: The bootstrap Section 7.1 Summary Section 7.1 Exercises 7.2 Comparing Two Means
The two-sample z statistic The two-sample t procedures The two-sample t confidence interval The two-sample t significance test Robustness of the two-sample procedures Inference for small samples Software approximation for the degrees of freedom The pooled two-sample t procedures
Section 7.2 Summary Section 7.2 Exercises 7.3 Additional Topics on Inference
Choosing the sample size Inference for non-Normal populations
Section 7.3 Summary Section 7.3 Exercises Chapter 7 Exercises
CHAPTER 8 Inference for Proportions Introduction
8.1 Inference for a Single Proportion Large-sample confidence interval for a single proportion
Beyond the Basics: The plus four confidence interval for a single proportion Significance test for a single proportion Choosing a sample size for a confidence interval Choosing a sample size for a significance test
Section 8.1 Summary Section 8.1 Exercises 8.2 Comparing Two Proportions
Large-sample confidence interval for a difference in proportions Beyond the Basics: The plus four confidence interval for a difference in proportions
Significance test for a difference in proportions Choosing a sample size for two sample proportions
Beyond the Basics: Relative risk Section 8.2 Summary
Section 8.2 Exercises Chapter 8 Exercises
PART III Topics in Inference CHAPTER 9 Inference for Categorical Data Introduction
9.1 Inference for Two-Way Tables The hypothesis: No association Expected cell counts The chi-square test Computations Computing conditional distributions The chi-square test and the z test
Beyond the Basics: Meta-analysis Section 9.1 Summary Section 9.1 Exercises 9.2 Goodness of Fit Section 9.2 Summary Section 9.2 Exercises Chapter 9 Exercises
CHAPTER 10 Inference for Regression Introduction
10.1 Simple Linear Regression Statistical model for linear regression Preliminary data analysis and inference considerations Estimating the regression parameters Checking model assumptions Confidence intervals and significance tests Confidence intervals for mean response Prediction intervals Transforming variables
Beyond the Basics: Nonlinear regression Section 10.1 Summary Section 10.1 Exercises 10.2 More Detail about Simple Linear Regression
Analysis of variance for regression The ANOVA F test Calculations for regression inference Inference for correlation
Section 10.2 Summary Section 10.2 Exercises Chapter 10 Exercises
CHAPTER 11 Multiple Regression Introduction
11.1 Inference for Multiple Regression Population multiple regression equation Data for multiple regression Multiple linear regression model Estimation of the multiple regression parameters Confidence intervals and significance tests for regression coefficients ANOVA table for multiple regression Squared multiple correlation R²
Section 11.1 Summary Section 11.1 Exercises 11.2 A Case Study
Preliminary analysis Relationships between pairs of variables Regression on high school grades Interpretation of results Examining the residuals Refining the model Regression on SAT scores Regression using all variables Test for a collection of regression coefficients
Beyond the Basics: Multiple logistic regression Section 11.2 Summary Section 11.2 Exercises Chapter 11 Exercises
CHAPTER 12 One-Way Analysis of Variance Introduction
12.1 Inference for One-Way Analysis of Variance Data for one-way ANOVA Comparing means The two-sample t statistic An overview of ANOVA The ANOVA model Estimates of population parameters Testing hypotheses in one-way ANOVA The ANOVA table The F test Software
Beyond the Basics: Testing the equality of spread Section 12.1 Summary Section 12.1 Exercises 12.2 Comparing the Means
Contrasts
Multiple comparisons Power
Section 12.2 Summary Section 12.2 Exercises Chapter 12 Exercises
CHAPTER 13 Two-Way Analysis of Variance Introduction
13.1 The Two-Way ANOVA Model Advantages of two-way ANOVA The two-way ANOVA model Main effects and interactions
13.2 Inference for Two-Way ANOVA The ANOVA table for two-way ANOVA
Chapter 13 Summary Chapter 13 Exercises Tables Answers to Odd-Numbered Exercises Notes and Data Sources Index
To Teachers: About This Book
Statistics is the science of data. Introduction to the Practice of Statistics (IPS) is an introductory text based on this principle. We present methods of basic statistics in a way that emphasizes working with data and mastering statistical reasoning. IPS is elementary in mathematical level but conceptually rich in statistical ideas. After completing a course based on our text, we would like students to be able to think objectively about conclusions drawn from data and use statistical methods in their own work.
In IPS, we combine attention to basic statistical concepts with a comprehensive presentation of the elementary statistical methods that students will find useful in their work. IPS has been successful for several reasons:
1. IPS examines the nature of modern statistical practice at a level suitable for beginners. We focus on the production and analysis of data as well as the traditional topics of probability and inference.
2. IPS has a logical overall progression, so data production and data analysis are a major focus, while inference is treated as a tool that helps us draw conclusions from data in an appropriate way.
3. IPS presents data analysis as more than a collection of techniques for exploring data. We emphasize systematic ways of thinking about data. Simple principles guide the analysis: always plot your data; look for overall patterns and deviations from them; when looking at the overall pattern of a distribution for one variable, consider shape, center, and spread; for relations between two variables, consider form, direction, and strength; always ask whether a relationship between variables is influenced by other variables lurking in the background. We warn students about pitfalls in clear cautionary discussions.
4. IPS uses real examples to drive the exposition. Students learn the technique of least-squares regression and how to interpret the regression slope. But they also learn the conceptual ties between regression and correlation and the importance of looking for influential observations.
5. IPS is aware of current developments both in statistical science and in teaching statistics. Brief, optional Beyond the Basics sections give quick overviews of topics such as density estimation, scatterplot smoothers, data mining, nonlinear regression, and meta-analysis. Chapter 16 gives an elementary introduction to the bootstrap and other computer-intensive statistical methods.
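The conceptual tie between regression and correlation mentioned in item 4 can be illustrated numerically. The following is a minimal sketch, not code from the text: the function name `least_squares` and the data values are made up for illustration. It computes the least-squares slope as r times the ratio of the standard deviations, then refits after adding one extreme point to show how a single influential observation can change the line.

```python
import statistics


def least_squares(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    # Correlation r: sum of products of standardized values, divided by n - 1.
    r = sum(
        ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)
    # The regression slope is tied to the correlation: b1 = r * s_y / s_x.
    slope = r * s_y / s_x
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept, r


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
slope, intercept, r = least_squares(x, y)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.4f}")

# Refit with one extreme observation far from the overall pattern.
x2, y2 = x + [10.0], y + [0.0]
slope2, intercept2, r2 = least_squares(x2, y2)
print(f"with extra point: slope={slope2:.3f} intercept={intercept2:.3f} r={r2:.4f}")
```

Comparing the two printed lines shows why plotting the data and looking for influential observations matters before interpreting a slope.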
The title of the book expresses our intent to introduce readers to statistics as it is used in practice. Statistics in practice is concerned with drawing conclusions from data. We focus on problem solving rather than on methods that may be useful in specific settings.
GAISE The College Report of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) Project (www.amstat.org/education/gaise/) was funded by the American Statistical Association to make recommendations for how introductory statistics courses should be taught. This report and its update contain many interesting teaching suggestions, and we strongly recommend that you read it. The philosophy and approach of IPS closely reflect the GAISE recommendations. Let’s examine each of the latest recommendations in the context of IPS.
1. Teach statistical thinking. Through our experiences as applied statisticians, we are very familiar with the components that are needed for the appropriate use of statistical methods. We focus on formulating questions, collecting and finding data, evaluating the quality of data, exploring the relationships among variables, performing statistical analyses, and drawing conclusions. In examples and exercises throughout the text, we emphasize putting the analysis in the proper context and translating numerical and graphical summaries into conclusions.
2. Focus on conceptual understanding. With the software available today, it is very easy for almost anyone to apply a wide variety of statistical procedures, both simple and complex, to a set of data. Without a firm grasp of the concepts, such applications are frequently meaningless. By using the methods that we present on real sets of data, we believe that students will gain an excellent understanding of these concepts. Our emphasis is on the input (questions of interest, collecting or finding data, examining data) and the output (conclusions) for a statistical analysis. Formulas are given only where they will provide some insight into concepts.
3. Integrate real data with a context and a purpose. Many of the examples and exercises in IPS include data that we have obtained from collaborators or consulting clients. Other data sets have come from research related to these activities. We have also used the Internet as a data source, particularly for data related to social media and other topics of interest to undergraduates. Our emphasis on real data, rather than artificial data chosen to illustrate a calculation, serves to motivate students and help them see the usefulness of statistics in everyday life. We also frequently encounter interesting statistical issues that we explore. These include outliers and nonlinear relationships. All data sets are available from the text website.
4. Foster active learning in the classroom. As we mentioned earlier, we believe that statistics is exciting as something to do rather than something to talk about. Throughout the text, we provide exercises in Use Your Knowledge sections that ask the students to perform some relatively simple tasks that reinforce the material just presented. Other exercises are particularly suited to being worked on and discussed within a classroom setting.
5. Use technology for developing concepts and analyzing data. Technology has altered statistical practice in a fundamental way. In the past, some of the calculations that we performed were particularly difficult and tedious. In other words, they were not fun. Today, freed from the burden of computation by software, we can concentrate our efforts on the big picture: what questions are we trying to address with a study and what can we conclude from our analysis?
6. Use assessments to improve and evaluate student learning. Our goal for students who complete a course based on IPS is that they are able to design and carry out a statistical study for a project in their capstone course or other setting. Our exercises are oriented toward this goal. Many ask about the design of a statistical study and the collection of data. Others ask for a paragraph summarizing the results of an analysis. This recommendation includes the use of projects, oral presentations, article critiques, and written reports. We believe that students using this text will be well prepared to undertake these kinds of activities. Furthermore, we view these activities not only as assessments but also as valuable tools for learning statistics.
Teaching Recommendations We have used IPS in courses taught to a variety of student audiences. For general undergraduates from mixed disciplines, we recommend covering Chapters 1 through 8 and Chapters 9, 10, or 12. For a quantitatively strong audience—sophomores planning to major in actuarial science or statistics—we recommend moving more quickly. Add Chapters 10 and 11 to the core material in Chapters 1 through 8. In general, we recommend deemphasizing the material on probability because these students will take a probability course later in their program. For beginning graduate students in such fields as education, family studies, and retailing, we recommend that the students read the entire text (Chapters 11 and 13 lightly), again with reduced emphasis on Chapter 4 and some parts of Chapter 5. In all cases, beginning with data analysis and data production (Part I) helps students overcome their fear of statistics and builds a sound base for studying inference. We believe that IPS can easily be adapted to a wide variety of audiences.
The Ninth Edition: What’s New?
Chapter 1 now begins with a short section giving an overview of data.
“Toward Statistical Inference” (previously Section 3.3), which introduces the concepts of statistical inference and sampling distributions, has been moved to Section 5.1 to better assist with the transition from a single data set to sampling distributions.
Coverage of mosaic plots as a visual tool for relationships between two categorical variables has been added to Chapters 2 and 9.
Chapter 3 now begins with a short section giving a basic overview of data sources.
Coverage of equivalence testing has been added to Chapter 7.
There is a greater emphasis on sample size determination using software in Chapters 7 and 8.
Resampling and bootstrapping are now introduced in Chapter 7 rather than Chapter 6.
“Inference for Categorical Data” is the new title for Chapter 9, which includes goodness of fit as well as inference for two-way tables.
There are more JMP screenshots and updated screenshots of Minitab, Excel, and SPSS outputs.
Design A new design incorporates colorful, revised figures throughout to aid the students’ understanding of text material. Photographs related to chapter examples and exercises make connections to real-life applications and provide a visual context for topics. More figures with software output have been included.
Exercises and Examples More than 30% of the exercises are new or revised, and there are more than 1700 exercises total. Exercise sets have been added at the end of sections in Chapters 9 through 12. To maintain the attractiveness of the examples to students, we have replaced or updated a large number of them. More than 30% of the 430 examples are new or revised. A list of exercises and examples categorized by application area is provided on the inside of the front cover.
In addition to the new ninth edition enhancements, IPS has retained the successful pedagogical features from previous editions:
Look Back At key points in the text, Look Back margin notes direct the reader to the first explanation of a topic, providing page numbers for easy reference.
Caution Warnings in the text, signaled by a caution icon, help students avoid common errors and misconceptions.
Challenge Exercises More challenging exercises are signaled with an icon. Challenge exercises are varied: some are mathematical, some require open-ended investigation, and others require deeper thought about the basic concepts.
Applets Applet icons are used throughout the text to signal where related interactive statistical applets can be found on the IPS website and in LaunchPad.
Use Your Knowledge Exercises We have found these exercises to be a very useful learning tool. They appear throughout each section and are listed, with page numbers, before the section-ending exercises.
Technology output screenshots Most statistical analyses rely heavily on statistical software. In this book, we discuss the use of Excel 2013, JMP 12, Minitab 17, SPSS 23, CrunchIt, R, and a TI-83/-84 calculator for conducting statistical analysis. As specialized statistical packages, JMP, Minitab, and SPSS are the most popular software choices both in industry and in colleges and schools of business. R is an extremely powerful statistical environment that is free to anyone; it relies heavily on members of the academic and general statistical communities for support. As an all-purpose spreadsheet program, Excel provides a limited set of statistical analysis options in comparison. However, given its pervasiveness and wide acceptance in industry and the computer world at large, we believe it is important to give Excel proper attention. It should be noted that for users who want more statistical capabilities but want to work in an Excel environment, there are a number of commercially available add-on packages (if you have JMP, for instance, it can be invoked from within Excel). Finally, instructions are provided for the TI-83/-84 calculators.
Even though basic guidance is provided in the book, it should be emphasized that IPS is not bound to any of these programs. Computer output from statistical packages is very similar, so you can feel quite comfortable using any one of these packages.
Acknowledgments We are pleased that the first eight editions of Introduction to the Practice of Statistics have helped to move the teaching of introductory statistics in a direction supported by most statisticians. We are grateful to the many colleagues and students who have provided helpful comments, and we hope that they will find this new edition another step forward. In particular, we would like to thank the following colleagues who offered specific comments on the new edition: Ali Arab, Georgetown University Tessema Astatkie, Dalhousie University Fouzia Baki, McMaster University Lynda Ballou, New Mexico Institute of Mining and Technology Sanjib Basu, Northern Illinois University David Bosworth, Hutchinson Community College
Max Buot, Xavier University Nadjib Bouzar, University of Indianapolis Matt Carlton, California Polytechnic State University–San Luis Obispo Gustavo Cepparo, Austin Community College Pinyuen Chen, Syracuse University Dennis L. Clason, University of Cincinnati–Blue Ash College Tadd Colver, Purdue University Chris Edwards, University of Wisconsin–Oshkosh Irina Gaynanova, Texas A&M University Brian T. Gill, Seattle Pacific University Mary Gray, American University Gary E. Haefner, University of Cincinnati Susan Herring, Sonoma State University Lifang Hsu, Le Moyne College Tiffany Kolba, Valparaiso University Lia Liu, University of Illinois at Chicago Xuewen Lu, University of Calgary Antoinette Marquard, Cleveland State University Frederick G. Schmitt, College of Marin James D. Stamey, Baylor University Engin Sungur, University of Minnesota–Morris Anatoliy Swishchuk, University of Calgary Richard Tardanico, Florida International University Melanee Thomas, University of Calgary Terri Torres, Oregon Institute of Technology Mahbobeh Vezvaei, Kent State University Yishi Wang, University of North Carolina–Wilmington John Ward, Jefferson Community and Technical College Debra Wiens, Rocky Mountain College Victor Williams, Paine College Christopher Wilson, Butler University Anne Yust, Birmingham-Southern College Biao Zhang, The University of Toledo Michael L. Zwilling, University of Mount Union
The professionals at Macmillan, in particular, Terri Ward, Karen Carson, Jorge Amaral, Emily Tenenbaum, Ed Dionne, Blake Logan, and Susan Wein, have contributed greatly to the success of IPS. In addition, we would like to thank Tadd Colver at Purdue University for his valuable contributions to the ninth edition, including authoring the back-of-book answers, solutions, and Instructor’s Guide. We’d also like to thank Monica Jackson at American University for accuracy reviewing the back-of-book answers and solutions and for authoring the test bank. Thanks also to Michael Zwilling at University of Mount Union for accuracy reviewing the test bank, Christopher Edwards at University of Wisconsin Oshkosh for authoring the lecture slides, and James Stamey at Baylor University for authoring the Clicker slides.
Most of all, we are grateful to the many friends and collaborators whose data and research questions have enabled us to gain a deeper understanding of the science of data. Finally, we would like to acknowledge John W. Tukey, whose contributions to data analysis have had such a great influence on us as well as on a whole generation of applied statisticians.
Media and Supplements
LaunchPad, our online course space, combines an interactive e-Book with high-quality multimedia content and ready-made assessment options, including LearningCurve adaptive quizzing. Content is easy to assign or adapt with your own material, such as readings, videos, quizzes, discussion groups, and more. LaunchPad also provides access to a Gradebook that offers a window into your students’ performance—either individually or as a whole. Use LaunchPad on its own or integrate it with your school’s learning management system so your class is always on the same page. To learn more about LaunchPad for Introduction to the Practice of Statistics, Ninth Edition, or to request access, go to launchpadworks.com.
Assets integrated into LaunchPad include:
Interactive e-Book. Every LaunchPad e-Book comes with powerful study tools for students, video and multimedia content, and easy customization for instructors. Students can search, highlight, and bookmark, making it easier to study and access key content. And teachers can ensure that their classes get just the book they want to deliver: customize and rearrange chapters; add and share notes and discussions; and link to quizzes, activities, and other resources.
LearningCurve provides students and instructors with powerful adaptive quizzing, a game-like format, direct links to the e-Book, and instant feedback. The quizzing system features questions tailored specifically to the text and adapts to students’ responses, providing material at different difficulty levels and topics based on student performance.
JMP Student Edition (developed by SAS) is easy to learn and contains all the capabilities required for introductory statistics. JMP is the leading commercial data analysis software of choice for scientists, engineers, and analysts at companies throughout the world (for Windows and Mac). Register inside LaunchPad at no additional cost.
CrunchIt!® is a Web-based statistical program that allows users to perform all the statistical operations and graphing needed for an introductory statistics course and more. It saves users time by automatically loading data from IPS, 9e, and it provides the flexibility to edit and import additional data.
StatBoards Videos are brief whiteboard videos that illustrate difficult topics through additional examples, written and explained by a select group of statistics educators.
Stepped Tutorials are centered on algorithmically generated quizzing with step-by-step feedback to help students work their way toward the correct solution. These exercise tutorials (two to three per chapter) are easily assignable and assessable.
Statistical Video Series consists of StatClips, StatClips Examples, and Statistically Speaking “Snapshots.” View animated lecture videos, whiteboard lessons, and documentary-style footage that illustrate key statistical concepts and help students visualize statistics in real-world scenarios.
Video Technology Manuals, available for TI-83/84 calculators, Minitab, Excel, JMP, SPSS, R, Rcmdr, and CrunchIt!®, provide brief instructions for using specific statistical software.
StatTutor Tutorials offer multimedia tutorials that explore important concepts and procedures in a presentation that combines video, audio, and interactive features. The newly revised format includes built-in, assignable assessments and a bright new interface.
Statistical Applets give students hands-on opportunities to familiarize themselves with important statistical concepts and procedures in an interactive setting that allows them to manipulate variables and see the results graphically. Icons in the textbook indicate when an applet is available for the material being covered. Applets are assessable and assignable in LaunchPad.
Stats@Work Simulations put students in the role of the statistical consultant, helping them better understand statistics interactively within the context of real-life scenarios.
EESEE Case Studies (Electronic Encyclopedia of Statistical Examples and Exercises), developed by The Ohio State University Statistics Department, teach students to apply their statistical skills by exploring actual case studies using real data.
http://launchpadworks.com
SolutionMaster offers an easy-to-use web-based version of the instructor’s solutions, allowing instructors to generate a solution file for any set of homework exercises.
Data files are available in JMP, ASCII, Excel, TI, Minitab, SPSS (an IBM Company)*, R, and CSV formats.
Student Solutions Manual provides solutions to the odd-numbered exercises in the text and is available as a print supplement and electronically in LaunchPad.
Instructor’s Guide with Full Solutions includes teaching suggestions, chapter comments, and detailed solutions to all exercises and is available electronically in LaunchPad.
Test Bank offers hundreds of multiple-choice questions and is available in LaunchPad.
Lecture Slides offer a customizable, detailed lecture presentation of statistical concepts covered in each chapter of IPS, 9e. Image slides contain all textbook figures and tables. Lecture slides and image slides are available in LaunchPad.
WebAssign offers algorithmic questions from IPS, 9e, in a powerful online instructional system. WebAssign lets you easily create assignments, grade homework, and give your students instant feedback. Along with flexible features, class and question-level analytics are available for instructors and students. WebAssign Premium also includes the following resources described above: e-Book, data files, LearningCurve, StatTutor Tutorials, Statistical Videos, Video Technology Manuals, solutions manuals, lecture and image slides, i-Clicker slides, test bank, and practice quizzes.
Additional Resources Available with IPS, 9e
Special Software Package A student version of JMP is available for packaging with the printed text. JMP is also available inside LaunchPad at no additional cost.
i-Clicker is a two-way radio-frequency classroom response solution developed by educators for educators. Each step of i-Clicker’s development has been informed by teaching and learning.
* SPSS was acquired by IBM in October 2009
To Students: What Is Statistics?
Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. We are bombarded by data in our everyday lives. The news mentions movie box-office sales, the latest poll of the president’s popularity, and the average high temperature for today’s date. Advertisements claim that data show the superiority of the advertiser’s product. All sides in public debates about economics, education, and social policy argue from data. A knowledge of statistics helps separate sense from nonsense in this flood of data.
The study and collection of data are also important in the work of many professions, so training in the science of statistics is valuable preparation for a variety of careers. Each month, for example, government statistical offices release the latest numerical information on unemployment and inflation. Economists and financial advisers, as well as policymakers in government and business, study these data in order to make informed decisions. Doctors must understand the origin and trustworthiness of the data that appear in medical journals. Politicians rely on data from polls of public opinion. Business decisions are based on market research data that reveal consumer tastes and preferences. Engineers gather data on the quality and reliability of manufactured products. Most areas of academic study make use of numbers and, therefore, also make use of the methods of statistics. This means it is extremely likely that your undergraduate research projects will involve, at some level, the use of statistics.
Learning from Data
The goal of statistics is to learn from data. To learn, we often perform calculations or make graphs based on a set of numbers. But to learn from data, we must do more than calculate and plot because data are not just numbers; they are numbers that have some context that helps us learn from them.
More than two-thirds of Americans are overweight or obese according to the Centers for Disease Control and Prevention (CDC) website (www.cdc.gov/nchs/nhanes.htm). What does it mean to be obese or to be overweight? To answer this question, we need to talk about body mass index (BMI). Your weight in kilograms divided by the square of your height in meters is your BMI. A man who is 6 feet tall (1.83 meters) and weighs 180 pounds (81.65 kilograms) will have a BMI of 81.65/(1.83)² = 24.4 kg/m². How do we interpret this number? According to the CDC, a person is classified as overweight if his or her BMI is between 25 and 29.9 kg/m² and as obese if his or her BMI is 30 kg/m² or more. Therefore, more than two-thirds of Americans have a BMI of 25 kg/m² or more. The man who weighs 180 pounds and is 6 feet tall is not overweight or obese, but if he gains 5 pounds, his BMI would increase to 25.1, and he would be classified as overweight.
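The BMI arithmetic in this paragraph can be checked with a few lines of code. The following is a minimal sketch, not part of the original text: the function names `bmi` and `cdc_category` are made up for illustration, and the classification simply treats any BMI of at least 25 but below 30 as overweight.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def cdc_category(bmi_value: float) -> str:
    """Classify a BMI using the cutoffs quoted in the text (25 and 30 kg/m^2)."""
    if bmi_value >= 30:
        return "obese"
    if bmi_value >= 25:
        return "overweight"
    return "not overweight or obese"


LB_TO_KG = 0.45359237  # one pound expressed in kilograms

height_m = 1.83  # 6 feet, rounded as in the text
before = bmi(180 * LB_TO_KG, height_m)  # the man at 180 pounds
after = bmi(185 * LB_TO_KG, height_m)   # the same man after gaining 5 pounds

print(round(before, 1), cdc_category(before))
print(round(after, 1), cdc_category(after))
```

The code reproduces the calculation, but deciding what a BMI of 24.4 or 25.1 means for a particular person still depends on the context, which is the point of the paragraph above.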
When you do statistical problems, even straightforward textbook problems, don’t just graph or calculate. Think about the context and state your conclusions in the specific setting of the problem. As you are learning how to do statistical calculations and graphs, remember that the goal of statistics is not calculation for its own sake but gaining understanding from numbers. The calculations and graphs can be automated by a calculator or software, but you must supply the understanding. This book presents only the most common specific procedures for statistical analysis. A thorough grasp of the principles of statistics will enable you to quickly learn more advanced methods as needed. On the other hand, a fancy computer analysis carried out without attention to basic principles will often produce elaborate nonsense. As you read, seek to understand the principles as well as the necessary details of methods and recipes.
The Rise of Statistics
Historically, the ideas and methods of statistics developed gradually as society grew interested in collecting and using data for a variety of applications. The earliest origins of statistics lie in the desire of rulers to count the number of inhabitants or measure the value of taxable land in their domains. As the physical sciences developed in the seventeenth and eighteenth centuries, the importance of careful measurements of weights, distances, and other physical quantities grew. Astronomers and surveyors striving for exactness had to deal with variation in their measurements. Many measurements should be better than a single measurement, even though they vary among themselves. How can we best combine many varying observations? Statistical methods that are still important were invented in order to analyze scientific measurements.
By the nineteenth century, the agricultural, life, and behavioral sciences also began to rely on data to answer fundamental questions. How are the heights of parents and children related? Does a new variety of wheat produce higher yields than the old, and under what conditions of rainfall and fertilizer? Can a person’s mental ability and behavior be measured just as we measure height and reaction time? Effective methods for dealing with such questions developed slowly and with much debate.
As methods for producing and understanding data grew in number and sophistication, the new discipline of statistics took shape in the twentieth century. Ideas and techniques that originated in the collection of government data, in the study of astronomical or biological measurements, and in the attempt to understand heredity or intelligence came together to form a unified “science of data.” That science of data—statistics—is the topic of this text.
The Organization of This Book
Part I of this book, called simply “Looking at Data,” concerns data analysis and data production. The first two chapters deal with statistical methods for organizing and describing data. These chapters progress from simpler to more complex data. Chapter 1 examines data on a single variable; Chapter 2 is devoted to relationships among two or more variables. You will learn both how to examine data produced by others and how to organize and summarize your own data. These summaries will first be graphical, then numerical, and then, when appropriate, in the form of a mathematical model that gives a compact description of the overall pattern of the data. Chapter 3 outlines arrangements (called designs) for producing data that answer specific questions. The principles presented in this chapter will help you to design proper samples and experiments for your research projects and to evaluate other such investigations in your field of study.
Part II, consisting of Chapters 4 through 8, introduces statistical inference—formal methods for drawing conclusions from properly produced data. Statistical inference uses the language of probability to describe how reliable its conclusions are, so some basic facts about probability are needed to understand inference. Probability is the subject of Chapters 4 and 5. Chapter 6, perhaps the most important chapter in the text, introduces the reasoning of statistical inference. Effective inference is based on good procedures for producing data (Chapter 3), careful examination of the data (Chapters 1 and 2), and an understanding of the nature of statistical inference as discussed in Chapter 6. Chapters 7 and 8 describe some of the most common specific methods of inference, for drawing conclusions about means and proportions from one and two samples.
The five shorter chapters in Part III introduce somewhat more advanced methods of inference, dealing with relations in categorical data, regression and correlation, and analysis of variance. Four supplementary chapters, available from the text website, present additional statistical topics.
What Lies Ahead
Introduction to the Practice of Statistics is full of data from many different areas of life and study. Many exercises ask you to express briefly some understanding gained from the data. In practice, you would know much more about the background of the data you work with and about the questions you hope the data will answer. No textbook can be fully realistic. But it is important to form the habit of asking, “What do the data tell me?” rather than just concentrating on making graphs and doing calculations.
You should have some help in automating many of the graphs and calculations. You should certainly have a calculator with basic statistical functions. Look for keywords such as “two-variable statistics” or “regression” when you shop for a calculator. More advanced (and more expensive) calculators will do much more, including some statistical graphs. You may be asked to use software as well. There are many kinds of statistical software, from spreadsheets to large programs for advanced users of statistics. The kind of computing available to learners varies a great deal from place to place—but the big ideas of statistics don’t depend on any particular level of access to computing.
Because graphing and calculating are automated in statistical practice, the most important assets you can gain from the study of statistics are an understanding of the big ideas and the beginnings of good judgment in working with data. Ideas and judgment can’t (at least yet) be automated. They guide you in telling the computer what to do and in interpreting its output. This book tries to explain the most important ideas of statistics, not just teach methods. Some examples of big ideas that you will meet are “always plot your data,” “randomized comparative experiments,” and “statistical significance.”
You learn statistics by doing statistical problems. “Practice, practice, practice.” Be prepared to work problems. The basic principle of learning is persistence. Being organized and persistent is more helpful in reading this book than knowing lots of math. The main ideas of statistics, like the main ideas of any important subject, took a long time to discover and take some time to master. The gain will be worth the pain.
About the Authors
David S. Moore is Shanti S. Gupta Distinguished Professor of Statistics, Emeritus, at Purdue University and was 1998 president of the American Statistical Association. He received his AB from Princeton and his PhD from Cornell, both in mathematics. He has written many research papers in statistical theory and served on the editorial boards of several major journals.
Professor Moore is an elected fellow of the American Statistical Association and of the Institute of Mathematical Statistics and is an elected member of the International Statistical Institute. He has served as program director for statistics and probability at the National Science Foundation.
In recent years, Professor Moore has devoted his attention to the teaching of statistics. He was the content developer for the Annenberg/Corporation for Public Broadcasting college-level telecourse, Against All Odds: Inside Statistics, and for the series of video modules, Statistics: Decisions through Data, intended to aid the teaching of statistics in schools. He is the author of influential articles on statistics education and of several leading texts. Professor Moore has served as president of the International Association for Statistical Education and has received the Mathematical Association of America’s national award for distinguished college or university teaching of mathematics.
George P. McCabe is Associate Dean for Academic Affairs in the College of Science and Professor of Statistics at Purdue University. In 1966, he received a BS degree in mathematics from Providence College and in 1970 a PhD in mathematical statistics from Columbia University. His entire professional career has been spent at Purdue, with sabbaticals at Princeton University, the Commonwealth Scientific and Industrial Research Organization (CSIRO) in Melbourne (Australia), the University of Berne (Switzerland), the National Institute of Standards and Technology (NIST) in Boulder, Colorado, and the National University of Ireland in Galway. Professor McCabe is an elected fellow of the American Association for the Advancement of Science and of the American Statistical Association; he was 1998 chair of its section on Statistical Consulting. In 2008–2010, he served on the Institute of Medicine Committee on Nutrition Standards for the National School Lunch and Breakfast Programs. He has served on the editorial boards of several statistics journals. He has consulted with many major corporations and has testified as an expert witness on the use of statistics in several cases.
Professor McCabe’s research interests have focused on applications of statistics. Much of his recent work has focused on problems in nutrition, including nutrient requirements, calcium metabolism, and bone health. He is the author or coauthor of more than 190 publications in many different journals.
Bruce A. Craig is Professor of Statistics and Director of the Statistical Consulting Service at Purdue University. He received his BS in mathematics and economics from Washington University in St. Louis and his PhD in statistics from the University of Wisconsin–Madison. He is an elected fellow of the American Association for the Advancement of Science and of the American Statistical Association and was chair of its section on Statistical Consulting in 2009. He has also been an active member of the Eastern North American Region of the International Biometrics Society and was elected by the voting membership to the Regional Committee between 2003 and 2006.
Professor Craig has served on the editorial board of several statistical journals and has been a member of several data and safety monitoring boards, including Purdue’s institutional review board.
Professor Craig’s research interests focus on the development of novel statistical methodology to address research questions in the life sciences. Areas of current interest are diagnostic testing, inter-rater agreement, and abundance estimation. He is an author or coauthor of more than 100 papers in more than 50 different journals. In 2005, he was named Purdue University Faculty Scholar.
Data Table Index
TABLE 1.1 IQ test scores for 60 randomly chosen fifth-grade students
TABLE 1.2 Service times (seconds) for calls to a customer service center
TABLE 1.3 Educational data for 78 seventh-grade students
TABLE 2.1 Four data sets for exploring correlation and regression
TABLE 2.2 Two measures of glucose level in diabetics
TABLE 2.3 Dwelling permits, sales, and production for 21 countries
TABLE 2.4 World record times for the 10,000-meter run
TABLE 5.1 Length (in minutes) of 60 visits to a statistics help room
TABLE 7.1 Monthly rates of return on a portfolio (%)
TABLE 7.2 Parts measurements using optical software
TABLE 7.3 DRP scores for third-graders
TABLE 7.4 Seated systolic blood pressure (mm Hg)
TABLE 7.5 Length (in seconds) of audio files sampled from an iPod
TABLE 10.1 Annual number of tornadoes in the United States between 1953 and 2014
TABLE 10.2 In-state tuition and fees (in dollars) for 33 public universities
TABLE 10.3 Sales price and assessed value (in thousands of $) of 35 homes in a midwestern city
TABLE 10.4 Watershed area (km²), percent forest, and index of biotic integrity
TABLE 13.1 Iron content (mg/100 g) of food cooked in different pots
TABLE 13.2 Tool diameter data
Beyond the Basics Index
Chapter 1 Density estimation
Chapter 2 Data mining
Chapter 3 Capture-recapture sampling
Chapter 4 More laws of large numbers
Chapter 5 Weibull distributions
Chapter 7 The bootstrap
Chapter 8 The plus four confidence interval for a single proportion
Chapter 8 The plus four confidence interval for a difference in proportions
Chapter 8 Relative risk
Chapter 9 Meta-analysis
Chapter 10 Nonlinear regression
Chapter 11 Multiple logistic regression
Chapter 12 Testing the equality of spread
CHAPTER 1 Looking at Data—Distributions
1.1 Data
1.2 Displaying Distributions with Graphs
1.3 Describing Distributions with Numbers
1.4 Density Curves and Normal Distributions
Introduction
Statistics is the science of learning from data. Data are numerical or qualitative descriptions of the objects that we want to study. In this chapter, we will master the art of examining data.
We begin in Section 1.1 with some basic ideas about data. We will learn about the different types of data that are collected and how data sets are organized.
Section 1.2 starts our process of learning from data by looking at graphs. These visual displays give us a picture of the overall patterns in a set of data. We have excellent software tools that help us make these graphs. However, it takes a little experience and a lot of judgment to study the graphs carefully and to explain what they tell us about our data.
Section 1.3 continues our process of learning from data by computing numerical summaries. These sets of numbers describe key characteristics of the patterns that we saw in our graphical summaries.
The final section in this chapter helps us make the transition from data summaries to statistical models that are used to draw conclusions and to make predictions. Specifically, we learn about using density curves to describe a set of data and are introduced to the Normal distributions. These distributions can be used to describe many sets of data that we will encounter. They also play a fundamental role in many of the methods of statistical analysis.
1.1 Data
When you complete this section, you will be able to:
Give examples of cases in a data set.
Identify the variables in a data set.
Demonstrate how a label can be used as a variable in a data set.
Identify the values of a variable.
Classify variables as categorical or quantitative.
Describe the key characteristics of a set of data.
Explain how a rate is the result of adjusting one variable to create another.
A statistical analysis starts with a set of data. We construct a set of data by first deciding what cases, or units, we want to study. For each case, we record information about characteristics that we call variables.
CASES, LABELS, VARIABLES, AND VALUES
Cases are the objects described by a set of data. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.
A label is a special variable used in some data sets to distinguish the different cases.
A variable is a characteristic of a case.
Different cases can have different values of the variables.
EXAMPLE 1.1
COUPONS
Restaurant discount coupons. A website offers coupons that can be used to get discounts for various items at local restaurants. Coupons for food are very popular. Figure 1.1 gives information for seven restaurant coupons that were available for a recent weekend. These are the cases. Data for each coupon are listed on a different line, and the first column has the coupons numbered from 1 to 7. The remaining columns give the type of restaurant, the name of the restaurant, the item being discounted, the regular price, and the discount price.
FIGURE 1.1 Spreadsheet of food discount coupons, Example 1.1.
Some variables, like the type of restaurant, the name of the restaurant, and the item, simply place coupons into categories. The regular price and discount price columns have numerical values for which we can do arithmetic. It makes sense to give an average of the regular prices, but it does not make sense to give an “average” type of restaurant. We can, however, do arithmetic to compare the regular prices classified by type of restaurant.
CATEGORICAL AND QUANTITATIVE VARIABLES
A categorical variable places a case into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
EXAMPLE 1.2
COUPONS
Categorical and quantitative variables for coupons. The restaurant discount coupon file has six variables: coupon number, type of restaurant, name of restaurant, item, regular price, and discount price. The two price variables are quantitative variables. Coupon number, type of restaurant, name of restaurant, and item are categorical variables.
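The distinction can be made concrete with a short sketch. The three coupon records below are hypothetical and are not the values shown in Figure 1.1; the column names Type and Item are also assumptions chosen for illustration. The code averages a quantitative variable and then uses a categorical variable only to group the cases.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coupon records; these are NOT the values shown in Figure 1.1.
coupons = [
    {"ID": 1, "Type": "Italian", "Name": "Restaurant A", "Item": "Pizza",
     "RegPrice": 20.0, "DiscPrice": 10.0},
    {"ID": 2, "Type": "Italian", "Name": "Restaurant B", "Item": "Pasta",
     "RegPrice": 30.0, "DiscPrice": 18.0},
    {"ID": 3, "Type": "Mexican", "Name": "Restaurant C", "Item": "Tacos",
     "RegPrice": 12.0, "DiscPrice": 6.0},
]

# Quantitative variable: averaging the regular prices is meaningful.
overall_mean = mean(c["RegPrice"] for c in coupons)
print("Mean regular price:", round(overall_mean, 2))

# Categorical variable: there is no "average type," but the categories can be
# used to group cases before doing arithmetic on a quantitative variable.
prices_by_type = defaultdict(list)
for c in coupons:
    prices_by_type[c["Type"]].append(c["RegPrice"])

mean_by_type = {t: mean(prices) for t, prices in prices_by_type.items()}
print("Mean regular price by type:", mean_by_type)
```

Note that ID holds numbers, yet averaging it would be meaningless, which is why the text treats coupon number as categorical.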
An appropriate label for your cases should be chosen carefully. In our food coupon example, a natural choice of a label would be the name of the restaurant. However, if there are two or more coupons available for a particular restaurant, or if a restaurant is a chain with different discounts offered at different locations, then the name of the restaurant would not uniquely label each of the coupons. In the restaurant discount coupon file, the first variable, ID, is a unique label for each coupon.
spreadsheet
The display in Figure 1.1 is from an Excel spreadsheet. Spreadsheets are very useful for doing the kind of simple computations that you will do in Exercise 1.2. You can type in a formula and have the same computation performed for each row.
Note that the names we have chosen for the variables in our spreadsheet do not have spaces. For example, instead of “Restaurant Name” for the name of the restaurant, we simply use Name. Some statistical software packages do not allow spaces in variable names. For this reason, when creating spreadsheets for eventual use with statistical software, it is best to avoid spaces in variable names. Another convention is to use an underscore (_) where you would normally use a space. For our data set, we could have used Regular_Price and Discount_Price for the two price variables.
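If a spreadsheet already has headers that contain spaces, the underscore convention can be applied automatically. This is a minimal sketch; the helper name `to_variable_name` is made up for illustration, and it only handles whitespace, not other characters that a particular package might reject.

```python
def to_variable_name(header: str) -> str:
    """Replace each run of whitespace in a column header with one underscore."""
    # str.split() with no argument drops leading/trailing whitespace and
    # splits on any run of whitespace, so stray spaces are handled too.
    return "_".join(header.split())


headers = ["Regular Price", "Discount Price", "Name"]
clean_headers = [to_variable_name(h) for h in headers]
print(clean_headers)
```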
USE YOUR KNOWLEDGE
1.1 Read the spreadsheet. Refer to Figure 1.1. Give the regular price and the discount price for the Smokey Grill ribs coupon.
1.2 How much is the discount worth? Refer to Example 1.1. Consider adding another column to the spreadsheet that gives the coupon savings. Explain how you would compute the entries in this column. Does the new column contain values for a categorical variable or for a quantitative variable? Explain your answer.
unit of measurement
Another important part of the description of any quantitative variable is its unit of measurement. For both RegPrice and DiscPrice, the unit of measurement is clearly dollars. In other settings, it may not be as obvious. For example, if we were measuring heights of children, we might choose to use either inches or centimeters. The units of measurement are an important part of the description of a quantitative variable.
Key characteristics of a data set
In practice, any set of data is accompanied by background information that helps us understand the data. When you plan a statistical study or explore data from someone else’s work, ask yourself the following questions:
1. Who? What cases do the data describe? How many cases does the data set contain?
2. What? How many variables do the data contain? What are the exact definitions of these variables? What are the units of measurement for each quantitative variable?
3. Why? What purpose do the data have? Do we hope to answer some specific questions? Do we want to draw conclusions about cases other than the ones we actually have data for? Are the variables that are recorded suitable for the intended purpose?
EXAMPLE 1.3
Statistics class data. Suppose that you are a teaching assistant for a statistics class and one of your jobs is to keep track of the grades for students in two sections of the course. The cases are the students in the class. There are weekly homework assignments, two exams during the semester, and a final exam. Each of these components is given a numerical score, and the components are added to get a total score that can range from 0 to 1000. Cutoffs of 900, 800, 700, etc., are used to assign letter grades of A, B, C, etc.
The spreadsheet for this course will have seven variables:
An identifier for each student. The number of points earned for homework. The number of points earned for the first exam. The number of points earned for the second exam. The number of points earned for the final exam. The total number of points earned. The letter grade earned.
The student identifier is a label and the letter grade earned is a categorical variable. All the other variables are measured in “points.” Because we can do arithmetic with their values, these variables are quantitative variables.
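The way the last two variables are derived from the others can be sketched in code. Several details here are assumptions, not statements from the text: the 600 cutoff for a D continues the pattern "900, 800, 700, etc."; a score exactly at a cutoff is taken to earn the higher grade; and the student record and the field names other than TotalPoints and Grade are hypothetical.

```python
# (cutoff, letter) pairs from highest to lowest. The text gives 900, 800, 700,
# "etc."; the 600 cutoff for a D is an assumption that continues the pattern.
CUTOFFS = [(900, "A"), (800, "B"), (700, "C"), (600, "D")]


def letter_grade(total_points: float) -> str:
    """Return the letter grade for a total score between 0 and 1000."""
    for cutoff, letter in CUTOFFS:
        if total_points >= cutoff:  # assumed: meeting the cutoff earns the grade
            return letter
    return "F"


# One hypothetical case: a label plus four quantitative component scores.
student = {"ID": "S001", "Homework": 180, "Exam1": 160, "Exam2": 170, "Final": 340}

# TotalPoints is quantitative (a sum of points); Grade is categorical.
student["TotalPoints"] = sum(
    points for name, points in student.items() if name != "ID"
)
student["Grade"] = letter_grade(student["TotalPoints"])
print(student)
```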
In our example of statistics class data, the possible values for the grade variable are A, B, C, D, and F. When computing grade point averages, many colleges and universities translate these letter grades into numbers using A = 4, B = 3, C = 2, D = 1, and F = 0. The transformed variable with numeric values is considered to be quantitative because we can average the numerical values across different courses to obtain a grade point average.
Sometimes, experts argue about numerical scales such as this. They ask whether or not the difference between an A and a B is the same as the difference between a D and an F. Similarly, many questionnaires ask people to respond on a 1 to 5 scale, with 1 representing strongly agree, 2 representing agree, etc. Again we could ask whether or not the five possible values for this scale are equally spaced in some sense. From a practical point of view, the averages that can be computed when we convert categorical scales such as these to numerical values frequently provide a very useful way to summarize data.
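The letter-to-number translation described above amounts to a lookup followed by an average. The sketch below uses a hypothetical list of four course grades and, as a simplification, gives every course equal weight; many institutions weight each course by its credit hours, which this sketch does not attempt.

```python
from statistics import mean

# The translation given in the text.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(letters):
    """Unweighted average of the numeric values of a list of letter grades."""
    return mean(GRADE_POINTS[letter] for letter in letters)


grades = ["A", "B", "B", "C"]  # hypothetical grades in four courses
gpa = grade_point_average(grades)
print(gpa)
```

Whether the result is meaningful rests on the question raised above: the average treats the step from A to B as the same size as the step from D to F.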
EXAMPLE 1.4
Who, what, and why for the statistics class data. The data set in Example 1.3 was constructed to keep track of the grades for students in an introductory statistics course. The cases are the students in the class. There are seven variables in this data set. These include a label for each student and scores for the various course requirements. There are no units for the label and grade. The other variables all have “points” as the unit.
USE YOUR KNOWLEDGE
1.3 Who, what, and why? For the restaurant discount coupon data of Example 1.1 (page 2), what cases do the data describe? How many cases are there? How many variables are there? What are their definitions and units of measurement? What purpose do the data have?
EXAMPLE 1.5
Statistics class data for a different purpose. Suppose that the data for the students in the introductory statistics class were also to be used to study relationships between student characteristics and success in the course. Here, we have decided to focus on TotalPoints and Grade as the outcomes of interest. Other variables of interest would also be included: for example, Sex, PrevStat (whether or not the student has taken a statistics course previously), and Year (student classification as first, second, third, or fourth year). ID is a categorical variable, TotalPoints is a quantitative variable, and the remaining variables are all categorical.
USE YOUR KNOWLEDGE
1.4 Apartment rentals. A data set lists apartments available for students to rent. Information provided includes the monthly rent, whether or not cable is included free of charge, whether or not pets are allowed, the number of bedrooms, and the distance to the campus. Describe the cases in the data set, give the number of variables, and specify whether each variable is categorical or quantitative.
instrument
Often, the variables in a statistical study are easy to understand: height in centimeters, study time in minutes, and so on. But each area of work also has its own special variables. A psychologist uses the Minnesota Multiphasic Personality Inventory (MMPI), and a physical fitness expert measures “VO2 max” (the volume of oxygen consumed per minute while exercising at your maximum capacity). Both of these variables are measured with special instruments. VO2 max is measured by exercising while breathing into a mouthpiece connected to an apparatus that measures oxygen consumed. Scores on the MMPI are based on a long questionnaire, which is also called an instrument.
Part of mastering your field of work is learning what variables are important and how they are best measured. Because details of particular measurements usually require knowledge of the particular field of study, we will say little about them.
rate
Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions. Often, for example, the rate at which something occurs is a more meaningful measure than a simple count of occurrences.
EXAMPLE 1.6
Comparing colleges based on graduates. Think about comparing colleges based on the numbers of graduates. This view tells you something about the relative sizes of different colleges. However, if you are interested in how well colleges succeed at graduating students they admit, it would be better to use a rate. For example, you can find data on the Internet on the six-year graduation rates of different colleges. These rates are computed by examining the progress of first-year students who enroll in a given year. Suppose that at College A there were 1000 first-year students in a particular year, and 800 graduated within six years. The graduation rate is
800/1000 = 0.80
or 80%. College B has 2000 students who entered in the same year, and 1200 graduated within six years. The graduation rate is
1200/2000 = 0.60
or 60%. How do we compare these two colleges? College B has more graduates, but College A has a better graduation rate.
adjusting one variable to create another
In Example 1.6, when we computed the graduation rate, we used the total number of students to adjust the number of graduates. We constructed a new variable by dividing the number of graduates by the total number of students. Computing a rate is just one of several ways of adjusting one variable to create another. We often divide one variable by another to compute a more meaningful variable to study. Example 1.20 (page 20) is another type of adjustment.
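The adjustment in Example 1.6 can be sketched in a few lines of Python. The function name `graduation_rate` and the dictionary layout are illustrative choices of ours, not part of the example; the counts are the ones given for College A and College B.

```python
def graduation_rate(graduates, entering_students):
    """Return the fraction of an entering class that graduated."""
    return graduates / entering_students

# (graduates within six years, first-year students), from Example 1.6
colleges = {
    "College A": (800, 1000),
    "College B": (1200, 2000),
}

rates = {name: graduation_rate(g, n) for name, (g, n) in colleges.items()}

for name, (g, n) in colleges.items():
    # The count and the rate answer different questions about a college.
    print(f"{name}: {g} graduates, rate = {rates[name]:.0%}")
```

The count ranks College B first, while the rate ranks College A first, which is the contrast the example draws.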
USE YOUR KNOWLEDGE
1.5 How should you express the change? Between the first exam and the second exam in your statistics course, you increased the amount of time that you spent working exercises. Which of the following three ways would you choose to express the results of your increased work: (a) give the grades on the two exams, (b) give the ratio of the grade on the second exam divided by the grade on the first exam, (c) take the difference between the grade on the second exam and the grade on the first exam, and express this as a percent of the grade on the first exam. Give reasons for your answer.
1.6 Which variable would you choose? Refer to Example 1.6 on colleges and their graduates.
(a) Give a setting in which you would prefer to evaluate the colleges based on the numbers of graduates. Give a reason for your choice.
(b) Give a setting in which you would prefer to evaluate the colleges based on the graduation rates. Give a reason for your choice.
Exercises 1.5 and 1.6 illustrate an important point about presenting the results of your statistical calculations. Always consider how best to communicate your results to a general audience. For example, the numbers produced by your calculator or by statistical software frequently contain more digits than are needed. Be sure that you do not include extra information generated by software that will distract from a clear explanation of what you have found.
SECTION 1.1 SUMMARY
A data set contains information on a number of cases. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.
For each case, the data give values for one or more variables. A variable describes some characteristic of a case, such as a person’s height, gender, or salary. Variables can have different values for different cases.
A label is a special variable used to identify cases in a data set.
Some variables are categorical and others are quantitative. A categorical variable places each individual into a category, such as male or female. A quantitative variable has numerical values that measure some characteristic of each case, such as height in centimeters or annual salary in dollars.
The key characteristics of a data set answer the questions Who?, What?, and Why?
SECTION 1.1 EXERCISES
For Exercises 1.1 and 1.2, see page 3; for Exercise 1.3, see page 5; for Exercise 1.4, see page 5; and for Exercises 1.5 and 1.6, see page 6.
1.7 How do you do online research? A study of 552 first-year college students asked about their favorite choice for doing online research. Possible choices were “Google or Google Scholar,” “Library database or website,” “Wikipedia or online encyclopedia,” and “Other.” Names of the students were not recorded, but the students were numbered from 1 to 552 in the data file. The researchers also recorded age, sex, and major area of study for each student.
(a) What are the cases?
(b) Identify the variables and their possible values.
(c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.
(d) Was a label used? Explain your answer.
(e) Summarize the key characteristics of your data set.
1.8 Summer jobs. You are collecting information about summer jobs that are available for college students in your area. Describe a data set that you could use to organize the information that you collect.
(a) What are the cases?
(b) Identify the variables and their possible values.
(c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.
(d) Use a label and explain how you chose it.
(e) Summarize the key characteristics of your data set.
1.9 Employee application data. The personnel department keeps records on all employees in a company. Here is the information that they keep in one of their data files: employee identification number, last name, first name, middle initial, department, number of years with the company, salary, education (coded as high school, some college, or college degree), and age.
(a) What are the cases for this data set?
(b) Describe each type of information as a label, a quantitative variable, or a categorical variable.
(c) Set up a spreadsheet that could be used to record the data. Give appropriate column headings and five sample cases.
1.10 How would you rank cities? Various organizations rank cities and produce lists of the 10 or the 100 best based on various measures. Create a list of criteria that you would use to rank cities. Include at least eight variables, and give reasons for your choices. Say whether each variable is quantitative or categorical.
1.11 Survey of students. A survey of students in an introductory statistics class asked the following questions: (1) age; (2) do you like to sing? (Yes, No); (3) can you play a musical instrument (not at all, a little, pretty well); (4) how much did you spend on food last week (in dollars); (5) height.
(a) Classify each of these variables as categorical or quantitative and give reasons for your answers.
(b) For each variable give the possible values.
1.12 What questions would you ask? Refer to the previous exercise. Make up your own survey with at least six questions. Include at least two categorical variables and at least two quantitative variables. Tell which variables are categorical and which are quantitative. Give reasons for your answers. For each variable, give the possible values.
1.13 How would you rate colleges? Popular magazines rank colleges and universities on their “academic quality” in serving undergraduate students. Describe five variables that you would like to see measured for each college if you were choosing where to study. Give reasons for each of your choices.
1.14 Attending college in your state or in another state. The U.S. Census Bureau collects a large amount of information concerning higher education.1 For example, the bureau provides a table that includes the following variables: state, number of students from the state who attend college, number of students who attend college in their home state.
(a) What are the cases for this set of data?
(b) Is there a label variable? If yes, what is it?
(c) Identify each variable as categorical or quantitative.
(d) Explain how you might use each of the quantitative variables to explain something about the states.
(e) Consider a variable computed as the number of students in each state who attend college in the state divided by the total number of students from the state who attend college. Explain how you would use this variable to explain something about the states.
1.15 Alcohol-impaired driving fatalities. A report on drunk-driving fatalities in the United States gives the number of alcohol-impaired driving fatalities for each state.2 Discuss at least three different ways that these numbers could be converted to rates. Give the advantages and disadvantages of each.
1.2 Displaying Distributions with Graphs
When you complete this section, you will be able to:
Analyze the distribution of a categorical variable using a bar graph.
Analyze the distribution of a categorical variable using a pie chart.
Analyze the distribution of a quantitative variable using a stemplot.
Analyze the distribution of a quantitative variable using a histogram.
Examine the distribution of a quantitative variable with respect to the overall pattern of the data and deviations from that pattern.
Identify the shape, center, and spread of the distribution of a quantitative variable.
Identify and describe any outliers in the distribution of a quantitative variable.
Use a time plot to describe the distribution of a quantitative variable that is measured over time.
exploratory data analysis
Statistical tools and ideas help us examine data to describe their main features. This examination is called exploratory data analysis. Like an explorer crossing unknown lands, we want first to simply describe what we see. Here are two basic strategies that help us organize our exploration of a set of data:
Begin by examining each variable by itself. Then move on to study the relationships among the variables.
Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
We follow these principles in organizing our learning. This chapter presents methods for describing a single variable. We will study relationships among several variables in Chapter 2. Within each chapter, we will begin with graphical displays, then add numerical summaries for a more complete description.
Categorical variables: Bar graphs and pie charts
distribution of a categorical variable
count percent proportion
The values of a categorical variable are labels for the categories, such as “yes” and “no.” The distribution of a categorical variable lists the categories and gives either the count or the percent of cases that fall in each category. An alternative to the percent is the proportion, the count divided by the sum of the counts. Note that the percent is simply the proportion times 100.
EXAMPLE 1.7
ONLINE
How do you do online research? A study of 552 first-year college students asked about their preferences for online resources. One question asked them to pick their favorite.3 Here are the results:
Resource                            Count (n)
Google or Google Scholar              406
Library database or website            75
Wikipedia or online encyclopedia       52
Other                                  19
Total                                 552
Resource is the categorical variable in this example, and the values are the names of the online resources.
Note that the last value of the variable resource is “Other,” which includes all other online resources that were given as selection options. For data sets that have a large number of values for a categorical variable, we often create a category such as this that includes categories that have relatively small counts or percents. Careful judgment is needed when doing this. You don’t want to cover up some important piece of information contained in the data by combining data in this way.
EXAMPLE 1.8
ONLINE
Favorites as percents. When we look at the online resources data set, we see that Google is the clear winner. We see that 406 reported Google or Google Scholar as their favorite. To interpret this number, we need to know that the total number of students polled was 552. When we say that Google is the winner, we can describe this win by saying that 73.6% (406 divided by 552, expressed as a percent) of the students reported Google as their favorite. Here is a table of the preference percents:
Resource                            Percent (%)
Google or Google Scholar               73.6
Library database or website            13.6
Wikipedia or online encyclopedia        9.4
Other                                   3.4
Total                                 100.0
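As a check on the arithmetic, the percents can be recomputed from the counts in Example 1.7. This is a minimal Python sketch; the variable names are ours, and rounding to one decimal place is an assumption chosen to match the table.

```python
# Counts from Example 1.7
counts = {
    "Google or Google Scholar": 406,
    "Library database or website": 75,
    "Wikipedia or online encyclopedia": 52,
    "Other": 19,
}

total = sum(counts.values())

# Proportion = count / sum of counts; percent = proportion * 100
proportions = {resource: c / total for resource, c in counts.items()}
percents = {resource: round(100 * p, 1) for resource, p in proportions.items()}

for resource in counts:
    print(f"{resource:<34}{counts[resource]:>5}{percents[resource]:>7.1f}")
print(f"{'Total':<34}{total:>5}")
```

Rounded percents do not always add to exactly 100, although they happen to do so for these data.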
The use of graphical methods allows us to see this information and other characteristics of the data easily. We now examine two types of graphs.
EXAMPLE 1.9
ONLINE
bar graph
Bar graph for the online resource preference data. Figure 1.2 displays the online resource preference data using a bar graph. The heights of the four bars show the percents of the students who reported each of the resources as their favorite.
FIGURE 1.2 Bar graph for the online resource preference data, Example 1.9.
The categories in a bar graph can be put in any order. In Figure 1.2, we ordered the resources based on their preference percents. For other data sets, an alphabetical ordering or some other arrangement might produce a more useful graphical display.
You should always consider the best way to order the values of the categorical variable in a bar graph. Choose an ordering that will be useful to you. If you have difficulty, ask a friend if your choice communicates what you expect. Note that a bar graph using counts will look the same as a bar graph using percents. A pie chart naturally uses percents.
EXAMPLE 1.10
ONLINE
pie chart
Pie chart for the online resource preference data. The pie chart in Figure 1.3 helps us see what part of the whole each group forms. Here it is very easy to see that Google is the favorite for about three-quarters of the students.
FIGURE 1.3 Pie chart for the online resource preference data, Example 1.10.
USE YOUR KNOWLEDGE
ONLINE
1.16 Compare the bar graph with the pie chart. Refer to the bar graph in Figure 1.2 and the pie chart in Figure 1.3 for the online resource preference data. Which graphical display does a better job of describing the data? Give reasons for your answer.
To make a pie chart, you must include all the categories that make up a whole. A category such as “Other” in this example can be used, but the sum of the percents for all the categories should be 100%. This constraint makes bar graphs more flexible.
Quantitative variables: Stemplots and histograms
A stemplot (also called a stem-and-leaf plot) gives a quick picture of the shape of a distribution while including the actual numerical values in the graph. Stemplots work best for small numbers of observations that are all greater than 0.
STEMPLOT
To make a stemplot,
1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
EXAMPLE 1.11
STAT
Soluble corn fiber and calcium. Soluble corn fiber (SCF) has been promoted for various health benefits. One study examined the effect of SCF on the absorption of calcium of adolescent boys and girls. Calcium absorption is expressed as a percent of calcium in the diet. Here are the data for the condition where subjects consumed 12 grams per day (g/d) of SCF.4
50 43 43 44 50 44 35 49 54 76 31 48 61 70 62 47 42 45 43 59 53 53 73
To make a stemplot of these data, use the first digits as stems and the second digits as leaves. Figure 1.4 shows the steps in making the plot. Figure 1.4(a) shows the stems, which have values 3, 4, 5, 6, and 7. The first entry in our data set is 50. This appears in Figure 1.4(b) on the 5 stem with a leaf of 0. Similarly, the second value, 43, appears on the 4 stem with a leaf of 3. The stemplot is completed in Figure 1.4(c), where the leaves are ordered from smallest to largest.
The center of the distribution is in the 40s, and the data are more stretched out toward high values than low values (the highest value is 76, while the lowest is 31). In the plot, we do not see any extreme values that lie far from the remaining data.
FIGURE 1.4 Making a stemplot of the data in Example 1.11. (a) Write the stems. (b) Go through the data and write each leaf on the proper stem. For example, the values on the 3-stem are 35 and 31 in the order given in the display for the example. (c) Arrange the leaves on each stem in order out from the stem. The 3-stem now has leaves 1 and 5.
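The three steps for making a stemplot can be sketched in Python for the data in Example 1.11. The function `stemplot` is a hypothetical helper of ours, and it assumes non-negative whole numbers whose final digit serves as the leaf.

```python
def stemplot(values):
    """Return the rows of a stemplot for non-negative integers."""
    leaves = {}
    # Sorting first means each stem's leaves are already in increasing order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # stem = all but the final digit
        leaves.setdefault(stem, []).append(str(leaf))
    rows = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(leaves.get(stem, []))}".rstrip())
    return rows

# Calcium absorption (percent) with 12 g/d of SCF, Example 1.11
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]

rows = stemplot(scf)
print("\n".join(rows))
```

The 3 stem comes out with leaves 1 and 5, as described for Figure 1.4(c).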
USE YOUR KNOWLEDGE
STAT
1.17 Make a stemplot. Here are the scores on the first exam in an introductory statistics course for 30 students in one section of the course:
82 73 92 82 75 98 94 57 80 90 92 80 87 91 65 73 70 85 83 61 70 90 75 75 59 68 85 78 80 94
Use these data to make a stemplot. Then use the stemplot to describe the distribution of the first-exam scores for this course.
back-to-back stemplot
When you wish to compare two related distributions, a back-to-back stemplot with common stems is useful. The leaves on each side are ordered out from the common stem.
EXAMPLE 1.12
SCF
Soluble corn fiber and calcium. Refer to Example 1.11, which gives the data for subjects consuming 12 g/d of SCF. Here are the data for subjects under control conditions (0 g/d of SCF):
42 33 41 49 42 47 48 47 53 72 47 63 68 59 35 46 43 55 38 49 51 51 66
Figure 1.5 gives the back-to-back stemplot for the SCF and control conditions. The values on the left give absorption for the control condition, while the values on the right give absorption when SCF was consumed. The values for SCF appear to be somewhat higher than the controls.
FIGURE 1.5 A back-to-back stemplot to compare the distributions of calcium absorption under control and SCF conditions, Example 1.12.
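A back-to-back stemplot like the one described in Example 1.12 can be sketched as follows. The helper names `leaves_by_stem` and `back_to_back` are our own, the layout is plain text rather than the typeset figure, and the code assumes non-negative whole numbers.

```python
def leaves_by_stem(values):
    """Map each stem to its leaves (as strings) in increasing order."""
    out = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        out.setdefault(stem, []).append(str(leaf))
    return out

def back_to_back(left, right):
    """Return rows of a back-to-back stemplot with common stems."""
    left_leaves = leaves_by_stem(left)
    right_leaves = leaves_by_stem(right)
    stems = set(left_leaves) | set(right_leaves)
    width = max(len(v) for v in left_leaves.values())
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        # Reverse the left side so its leaves increase outward from the stem.
        left_part = "".join(reversed(left_leaves.get(stem, []))).rjust(width)
        right_part = "".join(right_leaves.get(stem, []))
        lines.append(f"{left_part} | {stem} | {right_part}".rstrip())
    return lines

# Control condition (0 g/d of SCF), Example 1.12
control = [42, 33, 41, 49, 42, 47, 48, 47, 53, 72, 47, 63,
           68, 59, 35, 46, 43, 55, 38, 49, 51, 51, 66]
# SCF condition (12 g/d of SCF), Example 1.11
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]

lines = back_to_back(control, scf)
print("\n".join(lines))
```

As in Figure 1.5, the control values are on the left and the SCF values are on the right.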
splitting stems
trimming
There are two modifications of the basic stemplot that can be helpful in different situations. You can double the number of stems in a plot by splitting each stem into two: one with leaves 0 to 4 and the other with leaves 5 through 9.
When the observed values have many digits, it is often best to trim the numbers by removing the last digit or digits before making a stemplot. If you are using software, you can round the data, which is what was done for the data given in Example 1.11.
You must use your judgment in deciding whether to split stems and whether to trim or round, though statistical software will often make these choices for you. Remember that the purpose of a stemplot is to display the shape of a distribution. If there are many stems with no leaves or only one leaf, trimming will reduce the number of stems. Let’s take a look at the effect of splitting the stems for our SCF data.
EXAMPLE 1.13
SCF
Stemplot with split stems for SCF. Figure 1.6 presents the data from Example 1.12 in a stemplot with split stems.
FIGURE 1.6 A back-to-back stemplot with split stems to compare the distributions of calcium absorption under control and SCF conditions, Example 1.13.
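Splitting stems can be sketched in the same way. The hypothetical helper `split_stemplot` below handles a single distribution (the SCF condition only, not the back-to-back layout of Figure 1.6) and assumes non-negative whole numbers.

```python
def split_stemplot(values):
    """Return stemplot rows with each stem split into leaves 0-4 and 5-9."""
    groups = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        # The second part of the key is False for leaves 0-4, True for 5-9.
        groups.setdefault((stem, leaf >= 5), []).append(str(leaf))
    lo, hi = min(groups), max(groups)
    keys = [(s, upper) for s in range(lo[0], hi[0] + 1)
            for upper in (False, True)]
    # Keep every half-stem between the smallest and largest, even if empty.
    keys = [k for k in keys if lo <= k <= hi]
    return [f"{s} | {''.join(groups.get((s, upper), []))}".rstrip()
            for s, upper in keys]

# SCF condition (12 g/d of SCF), Example 1.11
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]

split_rows = split_stemplot(scf)
print("\n".join(split_rows))
```

Note that the sketch prints the empty upper half of the 6 stem instead of dropping it, so the gap in the data stays visible.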
USE YOUR KNOWLEDGE
1.18 Which stemplot do you prefer? Look carefully at the stemplots for the SCF data in Figures 1.5 and 1.6. Which do you prefer? Give reasons for your answer.
1.19 Why should you keep the space? Suppose that you had a data set similar to the one given in Example 1.12, but in which the control values of 66 and 68 were both changed to 64.
(a) Make a stemplot of these data using split stems.
(b) Should you use one stem or two stems for the 60s? Give a reason for your answer. (Hint: How would your choice reveal or conceal a potentially important characteristic of the data?)
Histograms
Stemplots display the actual values of the observations. This feature makes stemplots awkward for large data sets. Moreover, the picture presented by a stemplot divides the observations into groups (stems) determined by the number system rather than by judgment.
histogram
Histograms do not have these limitations. A histogram breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. You can choose any convenient number of classes, but you should choose classes of equal width.
Making a histogram by hand requires more work than a stemplot. Histograms do not display the actual values observed. For these reasons, we prefer stemplots for small data sets.
The construction of a histogram is best shown by example. Most statistical software packages will make a histogram for you.
EXAMPLE 1.14
IQ
Distribution of IQ scores. You have probably heard that the distribution of scores on IQ tests is supposed to be roughly “bell-shaped.” Let’s look at some actual IQ scores. Table 1.1 displays the IQ scores of 60 fifth-grade students chosen at random from one school.
TABLE 1.1 IQ Test Scores for 60 Randomly Chosen Fifth-Grade Students
145 139 126 122 125 130  96 110 118 118
101 142 134 124 112 109 134 113  81 113
123  94 100 136 109 131 117 110 127 124
106 124 115 133 116 102 127 117 109 137
117  90 103 114 139 101 122 105  97  89
102 108 110 128 114 112 114 102  82 101
1. Divide the range of the data into classes of equal width. Let’s use
75 ≤ IQ score < 85
85 ≤ IQ score < 95
...
145 ≤ IQ score < 155
Be sure to specify the classes precisely so that each individual falls into exactly one class. A student with IQ 84 would fall into the first class, but IQ 85 falls into the second.
frequency frequency table
2. Count the number of individuals in each class. These counts are called frequencies, and a table of frequencies for all classes is a frequency table.
Class                     Count
75 ≤ IQ score < 85           2
85 ≤ IQ score < 95           3
95 ≤ IQ score < 105         10
105 ≤ IQ score < 115        16
115 ≤ IQ score < 125        13
125 ≤ IQ score < 135        10
135 ≤ IQ score < 145         5
145 ≤ IQ score < 155         1
3. Draw the histogram. First, on the horizontal axis mark the scale for the variable whose distribution you are displaying. That’s the IQ score. The scale runs from 75 to 155 because that is the span of the classes we chose. The vertical axis contains the scale of counts. Each bar represents a class. The base of the bar covers the class, and the bar height is the class count. There is no horizontal space between the bars unless a class is empty, so that its bar has height zero. Figure 1.7 is our histogram. It does look roughly “bell-shaped.”
FIGURE 1.7 Histogram of the IQ scores of 60 fifth-grade students, Example 1.14.
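Steps 1 and 2 of Example 1.14 can be reproduced with a short Python sketch. The variable names are ours, and the row of asterisks is only a rough text stand-in for the bars of Figure 1.7.

```python
# IQ scores from Table 1.1
iq = [145, 139, 126, 122, 125, 130, 96, 110, 118, 118,
      101, 142, 134, 124, 112, 109, 134, 113, 81, 113,
      123, 94, 100, 136, 109, 131, 117, 110, 127, 124,
      106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
      117, 90, 103, 114, 139, 101, 122, 105, 97, 89,
      102, 108, 110, 128, 114, 112, 114, 102, 82, 101]

start, width, n_classes = 75, 10, 8   # classes 75-85, 85-95, ..., 145-155

# Each score falls into exactly one class: start + i*width <= score < next edge
counts = [0] * n_classes
for score in iq:
    counts[(score - start) // width] += 1

for i, c in enumerate(counts):
    lo = start + i * width
    print(f"{lo:>3} <= IQ score < {lo + width:<3} {c:>2}  {'*' * c}")
```

Integer division places a score of 84 in the first class and a score of 85 in the second, matching the rule for specifying classes precisely.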
Large sets of data are often reported in the form of frequency tables when it is not practical to publish the individual observations. In addition to the frequency (count) for each class, we may be interested in the fraction or percent of the observations that fall in each class. A histogram of percents looks just like a frequency histogram such as Figure 1.7. Simply relabel the vertical scale to read in percents. Use histograms of percents for comparing several distributions that have different numbers of observations.
USE YOUR KNOWLEDGE
STAT
1.20 Make a histogram. Refer to the first-exam scores from Exercise 1.17 (page 12). Use these data to make a histogram with classes 50 to 59, 60 to 69, etc. Compare the histogram with the stemplot as a way of describing this distribution. Which do you prefer for these data?
Our eyes respond to the area of the bars in a histogram. Because the classes are all the same width, area is determined by height and all classes are fairly represented. There is no one right choice of the classes in a histogram. Too few classes will give a “skyscraper” graph, with all values in a few classes with tall bars. Too many will produce a “pancake” graph, with most classes having one or no observations. Neither choice will give a good picture of the shape of the distribution. You must use your judgment in choosing classes to display the shape. Statistical software will choose the classes for you. The software’s choice is often a good one, but you can change it if you want.
You should be aware that the appearance of a histogram can change when you change the classes. The histogram function in the One-Variable Statistical Calculator applet on the text website allows you to change the number of classes by dragging with the mouse, so that it is easy to see how the choice of classes affects the histogram.
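Without the applet, the same experiment can be tried in code. The sketch below recounts the IQ scores of Table 1.1 for several class widths; the function `class_counts` and the particular widths are illustrative assumptions of ours.

```python
def class_counts(values, start, width):
    """Count values in equal-width classes beginning at `start`."""
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts

# IQ scores from Table 1.1
iq = [145, 139, 126, 122, 125, 130, 96, 110, 118, 118,
      101, 142, 134, 124, 112, 109, 134, 113, 81, 113,
      123, 94, 100, 136, 109, 131, 117, 110, 127, 124,
      106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
      117, 90, 103, 114, 139, 101, 122, 105, 97, 89,
      102, 108, 110, 128, 114, 112, 114, 102, 82, 101]

# Wide classes pile the data into a few tall bars; narrow classes
# spread the same 60 scores thinly over many short bars.
for width in (40, 20, 10, 2):
    print(f"width {width:>2}: {class_counts(iq, 75, width)}")
```

Every choice of width accounts for the same 60 observations; only the picture of the shape changes.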
USE YOUR KNOWLEDGE
1.21 Change the classes in the histogram. Refer to the first-exam scores from Exercise 1.17 (page 12) and the histogram that you produced in Exercise 1.20. Now make a histogram for these data using classes 40 to 59, 60 to 79, and 80 to 100. Compare this histogram with the one that you produced in Exercise 1.20. Which do you prefer? Give a reason for your answer.
STAT
1.22 Use smaller classes. Repeat the previous exercise using classes 55 to 59, 60 to 64, 65 to 69, etc. Of the three histograms, which do you prefer? Give reasons for your answer.
Although histograms resemble bar graphs, their details and uses are distinct. A histogram shows the distribution of counts or percents among the values of a single variable. A bar graph compares the counts or percents of different items. The horizontal axis of a bar graph need not have any measurement scale but simply identifies the items being compared.
Draw bar graphs with blank space between the bars to separate the items being compared. Draw histograms with no space, to indicate that all values of the variable are covered. Some spreadsheet programs, which are not primarily intended for statistics, will draw histograms as if they were bar graphs, with space between the bars. Often, you can tell the software to eliminate the space to produce a proper histogram.
Data analysis in action: Don’t hang up on me
Many businesses operate call centers to serve customers who want to place an order or make an inquiry. Customers want their requests handled thoroughly. Businesses want to treat customers well, but they also want to avoid wasted time on the phone. They therefore monitor the length of calls and encourage their representatives to keep calls short.
TABLE 1.2 Service Times (Seconds) for Calls to a Customer Service Center
  77  289  128   59   19  148  157  203
 126  118  104  141  290   48    3    2
 372  140  438   56   44  274  479  211
 179    1   68  386 2631   90   30   57
  89  116  225  700   40   73   75   51
 148    9  115   19   76  138  178   76
  67  102   35   80  143  951  106   55
   4   54  137  367  277  201   52    9
 700  182   73  199  325   75  103   64
 121   11    9   88 1148    2  465   25
EXAMPLE 1.15
CALLS80
How long are customer service center calls? We have data on the lengths of all 31,492 calls made to the customer service center of a small bank in a month. Table 1.2 displays the lengths of the first 80 calls.5
Take a look at the data in Table 1.2. In this data set, the cases are calls made to the bank’s call center. The variable recorded is the length of each call. The units are seconds. We see that the call lengths vary a great deal. The longest call lasted 2631 seconds, almost 44 minutes. More striking is that 8 of these 80 calls lasted less than 10 seconds.
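The observations about Table 1.2 are easy to confirm with a few lines of Python. The variable names are ours; the values are the 80 call lengths from the table.

```python
# Service times (seconds) for the first 80 calls, Table 1.2
calls = [77, 289, 128, 59, 19, 148, 157, 203,
         126, 118, 104, 141, 290, 48, 3, 2,
         372, 140, 438, 56, 44, 274, 479, 211,
         179, 1, 68, 386, 2631, 90, 30, 57,
         89, 116, 225, 700, 40, 73, 75, 51,
         148, 9, 115, 19, 76, 138, 178, 76,
         67, 102, 35, 80, 143, 951, 106, 55,
         4, 54, 137, 367, 277, 201, 52, 9,
         700, 182, 73, 199, 325, 75, 103, 64,
         121, 11, 9, 88, 1148, 2, 465, 25]

n_calls = len(calls)
longest = max(calls)
short_calls = [t for t in calls if t < 10]   # calls under 10 seconds

print(f"number of calls: {n_calls}")
print(f"longest call: {longest} seconds ({longest / 60:.1f} minutes)")
print(f"calls under 10 seconds: {len(short_calls)}")
```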
We started our study of the customer service center data by examining a few cases, the ones displayed in Table 1.2. It would be very difficult to examine all 31,492 cases in this way. How can we do this? Let’s try a histogram.
EXAMPLE 1.16
CALLS
Histogram for customer service center call lengths. Figure 1.8 is a histogram of the lengths of all 31,492 calls. We did not plot the few lengths greater than 1200 seconds (20 minutes). As expected, the graph shows that most calls last between about 1 and 5 minutes, with some lasting much longer when customers have complicated problems. More striking is the fact that 7.6% of all calls are no more than 10 seconds long.
FIGURE 1.8 The distribution of call lengths for 31,492 calls to a bank’s customer service center, Example 1.16. The data show a surprising number of very short calls. These are mostly due to representatives deliberately hanging up in order to bring down their average call length.
It turned out that the bank penalized representatives whose average call length was too long—so some representatives just hung up on customers to bring their average length down. Neither the customers nor the bank were happy about this. The bank changed its policy, and later data showed that calls under 10 seconds had almost disappeared.
tails
The extreme values of a distribution are in the tails of the distribution. The high values are in the upper, or right, tail and the low values are in the lower, or left, tail. The overall pattern in Figure 1.8 is made up of the many moderate call lengths and the long right tail of more lengthy calls. The striking deviation from the overall pattern is the surprising number of very short calls in the left tail.
Our examination of the call center data illustrates some important principles:
After you understand the background of your data (cases, variables, units of measurement), the first thing to do is plot your data.
When you look at a plot, look for an overall pattern and also for any striking deviations from the pattern.
Examining distributions
Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution, you can see its important features as follows.
EXAMINING A DISTRIBUTION
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.
In Section 1.3, we will learn how to describe center and spread numerically. For now, we can describe the center of a distribution by its midpoint, the value with roughly half the observations taking smaller values and half taking larger values. We can describe the spread of a distribution by giving the smallest and largest values. Stemplots and histograms display the shape of a distribution in the same way. Just imagine a stemplot turned on its side so that the larger values lie to the right.
Some things to look for in describing shape are
modes unimodal
Does the distribution have one or several major peaks, called modes? A distribution with one major peak is called unimodal.
symmetric skewed
Is it approximately symmetric or is it skewed in one direction? A distribution is symmetric if the patterns of values smaller and larger than its midpoint are mirror images of each other. It is skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values), and skewed to the left if the left tail is much longer than the right tail.
Some variables commonly have distributions with predictable shapes. Many biological measurements on specimens from the same species and sex—lengths of bird bills, heights of young women—have symmetric distributions. Money amounts, on the other hand, usually have right-skewed distributions. There are many moderately priced houses, for example, but the few very expensive mansions give the distribution of house prices a strong right-skew.
EXAMPLE 1.17
IQ
Examine the histogram of IQ scores. What does the histogram of IQ scores (Figure 1.7, page 15) tell us?
Shape: The distribution is roughly symmetric with a single peak in the center. We don’t expect real data to be perfectly symmetric, so in judging symmetry, we are satisfied if the two sides of the histogram are roughly similar in shape and extent.
Center: You can see from the histogram that the midpoint is not far from 110. Looking at the actual data shows that the midpoint is 114.
Spread: The histogram has a spread from 75 to 155. Looking at the actual data shows that the spread is from 81 to 145. There are no outliers or other strong deviations from the symmetric, unimodal pattern.
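The statements about the actual data can be checked directly. This minimal sketch uses the median from Python's standard `statistics` module as the midpoint, and the smallest and largest values as the spread; the variable names are ours.

```python
from statistics import median

# IQ scores from Table 1.1
iq = [145, 139, 126, 122, 125, 130, 96, 110, 118, 118,
      101, 142, 134, 124, 112, 109, 134, 113, 81, 113,
      123, 94, 100, 136, 109, 131, 117, 110, 127, 124,
      106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
      117, 90, 103, 114, 139, 101, 122, 105, 97, 89,
      102, 108, 110, 128, 114, 112, 114, 102, 82, 101]

# Midpoint: about half the observations are smaller and half are larger.
midpoint = median(iq)
smallest, largest = min(iq), max(iq)

print(f"midpoint: {midpoint}")
print(f"spread: from {smallest} to {largest}")
```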
EXAMPLE 1.18
Examine the histogram of call lengths. The distribution of call lengths in Figure 1.8, on the other hand, is strongly skewed to the right. The midpoint, the length of a typical call, is about 115 seconds, or just under 2 minutes. The spread is very large, from 1 second to 28,739 seconds.
The longest few calls are outliers. They stand apart from the long right tail of the distribution, though we can’t see this from Figure 1.8, which omits the largest observations. The longest call lasted almost 8 hours—that may well be due to equipment failure rather than an actual customer call.
USE YOUR KNOWLEDGE
STAT
1.23 Describe the first-exam scores. Refer to the first-exam scores from Exercise 1.17 (page 12). Use your favorite graphical display to describe the shape, the center, and the spread of these data. Are there any outliers?
Dealing with outliers
In data sets smaller than the service call data, you can spot outliers by looking for observations that stand apart (either high or low) from the overall pattern of a histogram or stemplot. Identifying outliers is a matter for judgment. Look for points that are clearly apart from the body of the data, not just the most extreme observations in a distribution. You should search for an explanation for any outlier. Sometimes outliers point to errors made in recording the data. In other cases, the outlying observation may be caused by equipment failure or other unusual circumstances.
EXAMPLE 1.19
COLLEGE
College students. How does the number of undergraduate college students vary by state? Figure 1.9 is a histogram of the numbers of undergraduate students in each of the states.6 Notice that more than 50% of the states are included in the first bar of the histogram. These states have fewer than 300,000 undergraduates. The next bar includes another 30% of the states. These have between 300,000 and 600,000 students. The bar at the far right of the histogram corresponds to the state of California, which has 2,685,893 undergraduates. California certainly stands apart from the other states for this variable. It is an outlier.
FIGURE 1.9 The distribution of the numbers of undergraduate college students for the 50 states, Example 1.19.
The state of California is an outlier in the previous example because it has a very large number of undergraduate students. California has the largest population of all the states, so we might expect it to have a large number of college students. Let’s look at these data in a different way.
EXAMPLE 1.20
COLLEGE
College students per 1000. To account for the fact that there is large variation in the populations of the states, for each state we divide the number of undergraduate students by the population and then multiply by 1000. This gives the undergraduate college enrollment expressed as the number of students per 1000 people in each state. Figure 1.10 gives a stemplot of the distribution. California has 60 undergraduate students per 1000 people. This is one of the higher values in the distribution, but it is clearly not an outlier.
FIGURE 1.10 Stemplot of the numbers of undergraduate college students per 1000 people in each of the 50 states, Example 1.20.
USE YOUR KNOWLEDGE
1.24 Four states with large populations. There are four states with populations greater than 15 million. COLLEGE
(a) Examine the data file and report the names of these four states.
(b) Find these states in the distribution of number of undergraduate students per 1000 people. To what extent do these four states influence the distribution of number of undergraduate students per 1000 people?
In Example 1.19, we looked at the distribution of the number of undergraduate students, while in Example 1.20, we adjusted these data by expressing the counts as number per 1000 people in each state. Which way is correct? The answer depends upon why you are examining the data.
If you are interested in marketing a product to undergraduate students, the unadjusted numbers would be of interest because you want to reach the most people. On the other hand, if you are interested in comparing states with respect to how well they provide opportunities for higher education to their residents, the population-adjusted values would be more suitable. Always think about why you are doing a statistical analysis, and this will guide you in choosing an appropriate analytic strategy.
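The population adjustment used in Example 1.20 is a one-line calculation. The sketch below illustrates it in Python; the two states and their counts are hypothetical values chosen for easy arithmetic, not figures from the COLLEGE data file, and the function name per_thousand is our own.

```python
def per_thousand(count, population):
    """Express a count as a rate per 1000 people."""
    return 1000 * count / population

# Hypothetical (undergraduates, population) pairs, NOT taken from the data file.
states = {
    "Large State": (2_000_000, 40_000_000),
    "Small State": (60_000, 1_000_000),
}

rates = {name: per_thousand(count, pop) for name, (count, pop) in states.items()}

for name, rate in rates.items():
    print(name, rate)
```

In this made-up case the large state has far more students in total but a lower rate per 1000 people, which is why the two displays can tell different stories.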
Here is an example with a different kind of outlier.
EXAMPLE 1.21
PTH
Healthy bones and PTH. Bones are constantly being built up (bone formation) and torn down (bone resorption). Young people who are growing have more formation than resorption. When we age, resorption increases to the point where it exceeds formation. (The same phenomenon occurs when astronauts travel in space.) The result is osteoporosis, a disease associated with fragile bones that are more likely to break. The underlying mechanisms that control these processes are complex and involve a variety of substances. One of these is parathyroid hormone (PTH). Here are the values of PTH measured on a sample of 29 boys and girls aged 12 to 15 years:7
39 59 30 48 71 31 25 31 71 50 38 63 49 45 31 33 28 40 127 49 59 50 64 28 46 35 28 19 29
The data are measured in picograms per milliliter (pg/ml) of blood. The original data were recorded with one digit after the decimal point. They have been rounded to simplify our presentation here. Figure 1.11 gives a stemplot of the data.
The observation 127 clearly stands out from the rest of the distribution. A PTH measurement on this individual taken on a different day was similar to the rest of the values in the data set. We conclude that this outlier was caused by a laboratory error or a recording error, and we are confident in discarding it for any additional analysis.
FIGURE 1.11 Stemplot of the values of PTH, Example 1.21.
Time plots
Whenever data are collected over time, it is a good idea to plot the observations in time order. Displays of the distribution of a variable that ignore time order, such as stemplots and histograms, can be misleading when there is systematic change over time.
TIME PLOT
A time plot of a variable plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.
EXAMPLE 1.22
VITDS
Seasonal variation in vitamin D. Although we get some of our vitamin D from food, most of us get about 75% of what we need from the sun. Cells in the skin make vitamin D in response to sunlight. If people do not get enough exposure to the sun, they can become deficient in vitamin D, resulting in weakened bones and other health problems. The elderly, who need more vitamin D than younger people, and people who live in northern areas, where there is relatively little sunlight in the winter, are particularly vulnerable to these problems.
Figure 1.12 is a plot of the serum levels of vitamin D versus time of year for samples of subjects from Switzerland.8 The units measuring vitamin D are nanomoles per liter (nmol/l) of blood. The observations are grouped into periods of two months for the plot. Means are marked by filled-in circles and are connected by a line in the plot. The effect of the lack of sunlight in the winter months on vitamin D levels is clearly evident in the plot.
FIGURE 1.12 Plot of vitamin D versus months of the year, Example 1.22.
The data described in the preceding example are based on a subset of the subjects in a study of 248 subjects. The researchers were particularly concerned about subjects whose levels were deficient, defined as a serum vitamin D level of less than 50 nmol/l. They found that there was a 3.8-fold higher deficiency rate in February-March than in August-September: 91.2% versus 24.3%. To ensure that individuals from this population have adequate levels of vitamin D, some form of supplementation is needed, particularly during certain times of the year.
SECTION 1.2 SUMMARY
Exploratory data analysis uses graphs and numerical summaries to describe the variables in a data set and the relations among them.
The distribution of a variable tells us what values it takes and how often it takes these values.
Bar graphs and pie charts display the distributions of categorical variables. These graphs use the counts or percents of the categories.
Stemplots and histograms display the distributions of quantitative variables. Stemplots separate each observation into a stem and a one-digit leaf. Histograms plot the frequencies (counts) or the percents of equal-width classes of values.
When examining a distribution, look for shape, center, and spread and for clear deviations from the overall shape. Some distributions have simple shapes, such as symmetric or skewed. The number of modes (major peaks) is another aspect of overall shape. Not all distributions have a simple overall shape, especially when there are few observations.
Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
When observations on a variable are taken over time, make a time plot that graphs time horizontally and the values of the variable vertically. A time plot can reveal changes over time.
SECTION 1.2 EXERCISES
For Exercise 1.16, see page 11; for Exercise 1.17, see page 12; for Exercises 1.18 and 1.19, see page 14; for Exercise 1.20, see page 16; for Exercises 1.21 and 1.22, see page 16; for Exercise 1.23, see page 19; and for Exercise 1.24, see page 21.
1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests that Facebook apps can generate large amounts of money, as much as $1 million a month.9 The following table gives the numbers of Facebook users by country for the top 10 countries based on the number of users:10 FACEBK
Country           Facebook users (in millions)
Brazil            29.30
India             37.38
Mexico            29.80
Germany           21.46
France            23.19
Philippines       26.87
Indonesia         40.52
United Kingdom    30.39
United States     155.74
Turkey            30.63
(a) Use a bar graph to describe the numbers of users in these countries.
(b) Do you think that the United States is an outlier in this data set? Explain your answer.
(c) Describe the major features of your graph in a short paragraph.
1.26 Facebook use increases by country. Refer to the previous exercise. The report also gave the increases in the number of Facebook users for a one-month period for the same countries: FACEBK
Country           Increase in users (in millions)
Brazil            2.47
India             1.75
Mexico            0.84
Germany           0.51
France            0.38
Philippines       0.38
Indonesia         0.37
United Kingdom    0.22
United States     0.65
Turkey            0.09
(a) Use a bar graph to describe the increase in users in these countries.
(b) Describe the major features of your graph in a short paragraph.
(c) Do you think a stemplot would be a better graphical display for these data? Give reasons for your answer.
(d) Write a short paragraph about possible business opportunities suggested by the data you described in this exercise and the previous one.
1.27 The Titanic and class. On April 15, 1912, on her maiden voyage, the Titanic collided with an iceberg and sank. The ship was luxurious but did not have enough lifeboats for the 2224 passengers and crew. As a result of the collision, 1502 people died.11 The ship had three classes of passengers. The level of luxury and the price of the ticket varied with the class, with first class being the most luxurious. There were 323 passengers in first class, 277 in second class, and 709 in third class.12
TITANIC
(a) Make a bar graph of these data.
(b) Give a short summary of how the number of passengers varied with class.
(c) If you made a bar graph of the percent of passengers in each class, would the general features of the graph differ from the one you made in part (a)? Explain your answer.
1.28 Another look at the Titanic and class. Refer to the previous exercise. TITANIC
(a) Make a pie chart to display the data.
(b) Compare the pie chart with the bar graph. Which do you prefer? Give reasons for your answer.
1.29 Who survived? Refer to the two previous exercises. The number of first-class passengers who survived was 200. For second and third class, the numbers were 119 and 181, respectively. Create a graphical summary that shows how the survival of passengers depended on class. TITANIC
1.30 Potassium from potatoes. The 2015 Dietary Guidelines for Americans13 notes that the average potassium (K) intake for U.S. adults is about half of the recommended amount. A major source of potassium is potatoes. Nutrients in the diet can have different absorption depending on the source. One study looked at absorption of potassium from different sources. Participants ate a controlled diet for five days, and the amount of potassium absorbed was measured. Data for a diet that included 40 milliequivalents (mEq) of potassium were collected from 27 adult subjects.14 KPOT40
(a) Make a stemplot of the data.
(b) Describe the pattern of the distribution.
(c) Are there any outliers? If yes, describe them and explain why you have declared them to be outliers.
(d) Describe the shape, center, and spread of the distribution.
1.31 Potassium from a supplement. Refer to the previous exercise. Data were also recorded for 29 subjects who received a potassium salt supplement with 40 mEq of potassium. Answer the questions in the previous exercise for the supplemented subjects. KSUP40
1.32 Energy consumption. The U.S. Energy Information Administration reports data summaries of various energy statistics. Let’s look at the total amount of energy consumed, in quadrillions of British thermal units (Btu), for each month in a recent year. Here are the data:15 ENERGY
Month        Energy (quadrillion Btu)
January      9.58
February     8.46
March        8.56
April        7.56
May          7.66
June         7.79
July         8.23
August       8.21
September    7.64
October      7.78
November     8.19
December     8.82
(a) Look at the table and describe how the energy consumption varies from month to month.
(b) Make a time plot of the data and describe the patterns.
(c) Suppose you wanted to communicate information about the month-to-month variation in energy consumption. Which would be more effective, the table of the data or the graph? Give reasons for your answer.
1.33 Energy consumption in a different year. Refer to the previous exercise. Here are the data for the previous year: ENERGY
Month        Energy (quadrillion Btu)
January      8.99
February     8.02
March        8.38
April        7.52
May          7.62
June         7.72
July         8.27
August       8.17
September    7.64
October      7.72
November     8.14
December     9.08
(a) Analyze these data using the questions in the previous exercise as a guide.
(b) Compare the patterns across the two years. Describe any similarities and differences.
1.34 Favorite colors. What is your favorite color? One survey produced the following summary of responses to that question: blue, 42%; green, 14%; purple, 14%; red, 8%; black, 7%; orange, 5%; yellow, 3%; brown, 3%; gray, 2%; and white, 2%.16 Make a bar graph of the percents and write a short summary of the major features of your graph. FAVCOL
1.35 Least-favorite colors. Refer to the previous exercise. The same study also asked people about their least-favorite color. Here are the results: orange, 30%; brown, 23%; purple, 13%; yellow, 13%; gray, 12%; green, 4%; white, 4%; red, 1%; black, 0%; and blue, 0%. Make a bar graph of these percents and write a summary of the results. LFAVCOL
1.36 Garbage. The formal name for garbage is “municipal solid waste.” In the United States, approximately 250 million tons of garbage are generated in a year. Following is a breakdown of the materials that made up American municipal solid waste in 2012:17 GARBAGE
Material             Weight (million tons)    Percent of total
Food scraps          36.4                     14.5
Glass                11.6                     4.6
Metals               22.4                     8.9
Paper, paperboard    68.6                     27.4
Plastics             31.7                     12.7
Rubber, leather      7.5                      3.0
Textiles             14.3                     5.7
Wood                 15.8                     6.3
Yard trimmings       34.0                     13.5
Other                8.5                      3.4
Total                250.9                    100.0
(a) Add the weights. The sum is not exactly equal to the value of 250.9 million tons given in the table. Why?
(b) Make a bar graph of the percents. The graph gives a clearer picture of the main contributors to garbage if you order the bars from tallest to shortest.
(c) Also make a pie chart of the percents. Comparing the two graphs, notice that it is easier to see the small differences among “Food scraps,” “Plastics,” and “Yard trimmings” in the bar graph.
1.37 Vehicle colors. Vehicle colors differ among regions of the world. Here are data on the most popular colors for vehicles in North America:18 VCOLOR
Color     Percent
White     24
Black     19
Silver    16
Gray      15
Red       10
Blue      7
Brown     5
Other     4
(a) Describe these data with a bar graph.
(b) Describe these data with a pie chart.
(c) Which graphical summary do you prefer? Give reasons for your answer.
1.38 Sketch a skewed distribution. Sketch a histogram for a distribution that is skewed to the left. Suppose that you and your friends emptied your pockets of coins and recorded the year marked on each coin. The distribution of dates would be skewed to the left. Explain why.
1.39 Grades and self-concept. Table 1.3 presents data on 78 seventh-grade students in a rural midwestern school.19 The researcher was interested in the relationship between the students’ “self-concept” and their academic performance. The data we give here include each student’s grade point average (GPA), score on a standard IQ test, and gender, taken from school records. Gender is coded as F for female and M for male. The students are identified only by an observation number (OBS). The missing OBS numbers show that some students dropped out of the study. The final variable is each student’s score on the Piers-Harris Children’s Self-Concept Scale, a psychological test administered by the researcher. SEVENGR
(a) How many variables does this data set contain? Which are categorical variables and which are quantitative variables?
(b) Make a stemplot of the distribution of GPA, after rounding to the nearest tenth of a point.
(c) Describe the shape, center, and spread of the GPA distribution. Identify any suspected outliers from the overall pattern.
(d) Make a back-to-back stemplot of the rounded GPAs for female and male students. Write a brief comparison of the two distributions.
1.40 Describe the IQ scores. Make a graph of the distribution of IQ scores for the seventh-grade students in Table 1.3. Describe the shape, center, and spread of the distribution, as well as any outliers. IQ scores are usually said to be centered at 100. Is the midpoint for these students close to 100, clearly above, or clearly below? SEVENGR
TABLE 1.3 Educational Data for 78 Seventh-Grade Students
OBS    GPA       IQ     Gender    Self-concept
001    7.940     111    M         67
002    8.292     107    M         43
003    4.643     100    M         52
004    7.470     107    M         66
005    8.882     114    F         58
006    7.585     115    M         51
007    7.650     111    M         71
008    2.412     97     M         51
009    6.000     100    F         49
010    8.833     112    M         51
011    7.470     104    F         35
012    5.528     89     F         54
013    7.167     104    M         54
014    7.571     102    F         64
015    4.700     91     F         56
016    8.167     114    F         69
017    7.822     114    F         55
018    7.598     103    F         65
019    4.000     106    M         40
020    6.231     105    F         66
021    7.643     113    M         55
022    1.760     109    M         20
024    6.419     108    F         56
026    9.648     113    M         68
027    10.700    130    F         69
028    10.580    128    M         70
029    9.429     128    M         80
030    8.000     118    M         53
031    9.585     113    M         65
032    9.571     120    F         67
033    8.998     132    F         62
034    8.333     111    F         39
035    8.175     124    M         71
036    8.000     127    M         59
037    9.333     128    F         60
038    9.500     136    M         64
039    9.167     106    M         71
040    10.140    118    F         72
041    9.999     119    F         54
043    10.760    123    M         64
044    9.763     124    M         58
045    9.410     126    M         70
046    9.167     116    M         72
047    9.348     127    M         70
048    8.167     119    M         47
050    3.647     97     M         52
051    3.408     86     F         46
052    3.936     102    M         66
053    7.167     110    M         67
054    7.647     120    M         63
055    0.530     103    M         53
056    6.173     115    M         67
057    7.295     93     M         61
058    7.295     72     F         54
059    8.938     111    F         60
060    7.882     103    F         60
061    8.353     123    M         63
062    5.062     79     M         30
063    8.175     119    M         54
064    8.235     110    M         66
065    7.588     110    M         44
068    7.647     107    M         49
069    5.237     74     F         44
071    7.825     105    M         67
072    7.333     112    F         64
074    9.167     105    M         73
076    7.996     110    M         59
077    8.714     107    F         37
078    7.833     103    F         63
079    4.885     77     M         36
080    7.998     98     F         64
083    3.820     90     M         42
084    5.936     96     F         28
085    9.000     112    F         60
086    9.500     112    F         70
087    6.057     114    M         51
088    6.057     93     F         21
089    6.938     106    M         56
1.41 Describe the self-concept scores. Based on a suitable graph, briefly describe the distribution of self-concept scores for the students in Table 1.3. Be sure to identify any suspected outliers. SEVENGR
1.42 The Boston Marathon. Women were allowed to enter the Boston Marathon in 1972. Here are the times (in minutes, rounded to the nearest minute) for the winning women from 1972 to 2015.
Make a graph that shows change over time. What overall pattern do you see? Have times stopped improving in recent years? If so, when did improvement end? MARATH
Year    Time
1972    190
1973    186
1974    167
1975    162
1976    167
1977    168
1978    165
1979    155
1980    154
1981    147
1982    150
1983    143
1984    149
1985    154
1986    145
1987    146
1988    145
1989    144
1990    145
1991    144
1992    144
1993    145
1994    142
1995    145
1996    147
1997    146
1998    143
1999    143
2000    146
2001    144
2002    141
2003    145
2004    144
2005    145
2006    143
2007    149
2008    145
2009    152
2010    146
2011    142
2012    151
2013    146
2014    139
2015    145
1.3 Describing Distributions with Numbers
When you complete this section, you will be able to:
Describe the center of a distribution by using the mean.
Describe the center of a distribution by using the median.
Compare the mean and the median as measures of center for a particular set of data.
Describe the spread of a distribution by using quartiles.
Describe a distribution by using the five-number summary.
Describe a distribution by using a boxplot.
Compare one or more sets of data measured on the same variable by using side-by-side boxplots.
Identify outliers by using the 1.5 × IQR rule.
Describe the spread of a distribution by using the standard deviation.
Choose measures of center and spread for a particular set of data.
Compute the effects of a linear transformation on the mean, the median, the standard deviation, and the interquartile range.
We can begin our data exploration with graphs, but numerical summaries make our analysis more specific. For categorical variables, numerical summaries are the counts or percents that we use to construct pie charts or bar graphs. In this section, we focus on numerical summaries for quantitative variables. A brief description of the distribution of a quantitative variable should include its shape and numbers describing its center and spread. We describe the shape of a distribution based on inspection of a histogram or a stemplot. Now we will learn specific ways to use numbers to measure the center and spread of a distribution. We can calculate these numerical measures for any quantitative variable. But to interpret measures of center and spread, and to choose among the several measures we will learn, you must think about the shape of the distribution and the meaning of the data. The numbers, like graphs, are aids to understanding, not “the answer” in themselves.
EXAMPLE 1.23
TTS24
The distribution of business start times. An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. They have determined the time, in days, to complete all the procedures required to start a business.20 Data for 189 countries are included in the data set, TTS. For this section, we examine data, rounded to integers, for a sample of 24 of these countries. Here are the data:
16 4 5 6 5 7 12 19 10 2 25 19 38 5 24 8 6 5 53 32 13 49 11 17
FIGURE 1.13 Stemplot for the sample of 24 business start times, Example 1.23.
The stemplot in Figure 1.13 shows us the shape, center, and spread of the business start times. The stems are tens of days and the leaves are days. The distribution is skewed to the right with a very long tail of high values. All but six of the times are less than 20 days. The center appears to be about 10 days, and the values range from 2 days to 53 days. There do not appear to be any outliers.
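For readers who want to reproduce a display like this one, here is a minimal Python sketch that builds a stemplot with tens as stems and ones as leaves, as described above. The function name stemplot and the "stem | leaves" layout are our own choices; statistical software may format the display differently.

```python
from collections import defaultdict

# Times (in days) to start a business for the 24 sampled countries (Example 1.23).
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

def stemplot(values):
    """Return the lines of a stemplot: tens digit as stem, ones digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return lines

for line in stemplot(start_times):
    print(line)
```

The first two stems hold 18 of the 24 observations, which matches the statement that all but six of the times are less than 20 days.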
Measuring center: The mean
Numerical description of a distribution begins with a measure of its center or average. The two common measures of center are the mean and the median. The mean is the “average value” and the median is the “middle value.” These are two different ideas for “center,” and the two measures behave differently. We need precise recipes for the mean and the median.
THE MEAN x̅
To find the mean x̅ of a set of observations, add their values and divide by the number of observations. If the n observations are x1, x2, . . . , xn, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or, in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum x_i \]
The Σ (capital Greek sigma) in the formula for the mean is short for “add them all up.” The bar over the x indicates the mean of all the x-values. Pronounce the mean x̅ as “x-bar.” This notation is so common that writers who are discussing data use x̅, y̅, etc., without additional explanation. The subscripts on the observations xi are a way of keeping the n observations separate.
EXAMPLE 1.24
TTS24
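Mean time to start a business. The mean time to start a business is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{16 + 4 + \cdots + 17}{24} = \frac{391}{24} = 16.292 \]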
The mean time to start a business for the 24 countries in our data set is 16.3 days. Note that we have rounded the answer. Our goal in using the mean to describe the center of a distribution is not to demonstrate that we can compute with great accuracy. The additional digits do not provide any additional useful information. In fact, they distract our attention from the important digits that are meaningful. Do you think it would be better to report the mean as 16 days?
The value of the mean will not necessarily be equal to the value of one of the observations in the data set. Our example of time to start a business illustrates this fact.
In practice, you can key the data into your calculator and hit the Mean key. You don’t have to actually add and divide. But you should know that this is what the calculator is doing.
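If you use software rather than a calculator, the same add-and-divide recipe takes only a few lines. The following Python sketch checks the arithmetic of Example 1.24; the variable names are our own, and the standard-library statistics module is used only to confirm the hand calculation.

```python
import statistics

# Times (in days) to start a business for the 24 sampled countries (Example 1.23).
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

# The recipe for the mean: add the values, then divide by the number of observations.
total = sum(start_times)
n = len(start_times)
x_bar = total / n

print(total, n)         # 391 24
print(round(x_bar, 1))  # 16.3

# The library routine carries out the same computation.
assert abs(statistics.mean(start_times) - x_bar) < 1e-9
```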
USE YOUR KNOWLEDGE
1.43 Include the outlier. For Example 1.23, a random sample of 24 countries was selected from a data set that included 189 countries. The South American country of Suriname, where the start time is 208 days, was not included in the random sample. Consider the effect of adding Suriname to the original set. Show that the mean for the new sample of 25 countries has increased to 24 days. (This is a rounded number. You should report the mean with two digits after the decimal to show that you have performed this calculation.) TTS25
1.44 Find the mean. Here are the scores on the first exam in an introductory statistics course for 10 students: STAT
83 74 93 85 75 97 93 55 92 81
Find the mean first-exam score for these students.
resistant measure
Exercise 1.43 illustrates an important weakness of the mean as a measure of center: the mean is sensitive to the influence of a few extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.
robust measure
A measure that is resistant does more than limit the influence of outliers. Its value does not respond strongly to changes in a few observations, no matter how large those changes may be. The mean fails this requirement because we can make the mean as large as we wish by making a large enough increase in just one observation. A resistant measure is sometimes called a robust measure.
Measuring center: The median
We used the midpoint of a distribution as an informal measure of center in Section 1.2. The median is the formal version of the midpoint, with a specific rule for calculation.
THE MEDIAN M
The median M is the midpoint of a distribution. Half the observations are smaller than the median and the other half are larger than the median. Here is a rule for finding the median:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list. Find the location of the median by counting (n + 1)/2 observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list. The location of the median is again (n + 1)/2 from the bottom of the list.
Note that the formula (n + 1)/2 does not give the median, just the location of the median in the ordered list. Medians require little arithmetic, so they are easy to find by hand for small sets of data. Arranging even a moderate number of observations in order is tedious, however, so that finding the median by hand for larger sets of data is unpleasant. Even simple calculators have an x̅ button, but you will need computer software or a graphing calculator to automate finding the median.
EXAMPLE 1.25
TTS24
Median time to start a business. To find the median time to start a business for our 24 countries, we first arrange the data in order from smallest to largest:
2 4 5 5 5 5 6 6 7 8 10 11 12 13 16 17 19 19 24 25 32 38 49 53
The count of observations n = 24 is even. The median, then, is the average of the two center observations in the ordered list. To find the location of the center observations, we first compute
\[ \text{location of } M = \frac{n + 1}{2} = \frac{25}{2} = 12.5 \]
Therefore, the center observations are the 12th and 13th observations in the ordered list. The median is
\[ M = \frac{11 + 12}{2} = 11.5 \]
Note that you can use the stemplot in Figure 1.13 directly to compute the median. In the stemplot the cases are already ordered and you simply need to count from the top or the bottom to the desired location.
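The three-step rule for the median translates directly into code. Here is a minimal Python sketch that follows the rule literally; the function name median_by_rule is our own, and in practice a library routine such as statistics.median in Python's standard library does the same job.

```python
def median_by_rule(values):
    """Find the median by sorting and locating position (n + 1)/2."""
    ordered = sorted(values)              # Step 1: arrange from smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        # Step 2: n odd, so (n + 1)/2 is a whole number; take that observation.
        location = (n + 1) // 2
        return ordered[location - 1]      # Python lists count from 0
    # Step 3: n even, so (n + 1)/2 falls halfway between two center observations.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

# Times (in days) to start a business for the 24 sampled countries (Example 1.23).
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

print(median_by_rule(start_times))  # 11.5
```

With n = 24 the location is 12.5, so the function averages the 12th and 13th ordered values, 11 and 12.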
USE YOUR KNOWLEDGE
1.45 Include the outlier. Include Suriname, where the start time is 208 days, in the data set, and show that the median is 12 days. Note that with this case included, the sample size is now 25 and the median is the 13th observation in the ordered list. Write out the ordered list and circle the outlier. Describe the effect of the outlier on the median for this set of data. TTS25
1.46 Calls to a customer service center. The service times for 80 calls to a customer service center are given in Table 1.2 (page 17). Use these data to compute the median service time. CALLS80
1.47 Find the median. Here are the scores on the first exam in an introductory statistics course for 10 students: STAT
83 74 93 85 75 97 93 55 92 81
Find the median first-exam score for these students.
Mean versus median
Exercises 1.43 and 1.45 illustrate an important difference between the mean and the median. Suriname is an outlier. It pulls the mean time to start a business up from 16 days to 24 days. The median increased slightly, from 11.5 days to 12 days.
The median is more resistant than the mean. If the largest start time in the data set was 1200 days, the median for all 25 countries would still be 12 days. The largest observation just counts as one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and so will chase a single large observation upward.
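The claim about resistance is easy to check numerically. The Python sketch below adds a single extreme value to the 24 start times, first 208 days and then the hypothetical 1200 days mentioned above, and recomputes both measures. The helper name center_with_extra is our own; the mean and median come from the standard-library statistics module.

```python
import statistics

# Times (in days) to start a business for the 24 sampled countries (Example 1.23).
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

def center_with_extra(extra):
    """Return (mean, median) after adding one extra observation to the sample."""
    sample = start_times + [extra]
    return round(statistics.mean(sample), 2), statistics.median(sample)

for extreme in (208, 1200):
    mean, median = center_with_extra(extreme)
    print(extreme, mean, median)
# 208 23.96 12
# 1200 63.64 12
```

The mean follows the extreme value upward, while the median stays at 12 days in both cases.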
The best way to compare the response of the mean and median to extreme observations is to use an interactive applet that allows you to place points on a line and then drag them with your computer’s mouse. Exercises 1.83, 1.84, and 1.85 use the Mean and Median applet on the website for this text to compare the mean and the median.
The median and mean are the most common measures of the center of a distribution. The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.
The endowment for a college or university is money set aside and invested. The income from the endowment is usually used to support various programs. The distribution of the sizes of the endowments of colleges and universities is strongly skewed to the right. Most institutions have modest endowments, but a few are very wealthy. The median endowment of colleges and universities in a recent year was $93 million—but the mean endowment was $498 million.21 The few wealthy institutions pull the mean up but do not affect the median. Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.
We can now give a better answer to the question of how to deal with outliers in data. First, look at the data to identify outliers and investigate their causes. You can then correct outliers if they are wrongly recorded, delete them for good reason, or otherwise give them individual attention. The outlier in Example 1.21 (page 21) can be dropped from the data once we discover that it is an error. If you have no clear reason to drop outliers, you may want to use resistant measures in your analysis, so that outliers have little influence over your conclusions. The choice is often a matter for judgment.
Measuring spread: The quartiles
A measure of center alone can be misleading. Two countries with the same median family income are very different if one has extremes of wealth and poverty and the other has little variation among families. A drug manufactured with the correct mean concentration of active ingredient is dangerous if some batches are much too high and others much too low.
We are interested in the spread or variability of incomes and drug potencies as well as their centers. The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.
quartile
We can describe the spread or variability of a distribution by giving several percentiles. The median divides the data in two; half of the observations are above the median and half are below the median. We could call the median the 50th percentile. The upper quartile is the median of the upper half of the data. Similarly, the lower quartile is the median of the lower half of the data. With the median, the quartiles divide the data into four equal parts; 25% of the data are in each part.
percentile
We can do a similar calculation for any percent. The pth percentile of a distribution is the value that has p percent of the observations fall at or below it. To calculate a percentile, arrange the observations in increasing order and count up the required percent from the bottom of the list.
Our definition of percentiles is a bit inexact because there is not always a value with exactly p percent of the data at or below it. We will be content to take the nearest observation for most percentiles, but the quartiles are important enough to require an exact rule.
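As a rough illustration of "counting up the required percent from the bottom of the list," here is a minimal Python sketch. It assumes the nearest-rank convention (count up the smallest whole number of observations that covers p percent of the list); the text does not commit to a particular rule, and the function name `percentile_nearest` is made up for this example. The data are the 24 times to start a business listed in Example 1.26.

```python
import math

def percentile_nearest(values, p):
    """Return an observation with at least p percent of the data at or below it.

    Assumption (not from the text): the nearest-rank convention, which counts
    up ceil(p% of n) observations from the bottom of the ordered list.
    """
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)  # number of observations to count up
    return ordered[k - 1]

# Times to start a business (days) for the sample of 24 countries
times = [2, 4, 5, 5, 5, 5, 6, 6, 7, 8, 10, 11,
         12, 13, 16, 17, 19, 19, 24, 25, 32, 38, 49, 53]

p90 = percentile_nearest(times, 90)
share = sum(t <= p90 for t in times) / len(times)
print(p90, round(100 * share, 1))  # 38 91.7
```

No observation has exactly 90% of the data at or below it, so the sketch settles for 38, which has 22 of the 24 observations (about 91.7%) at or below it. This is the inexactness described above.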
THE QUARTILES Q1 AND Q3
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose positions in the ordered list are to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose positions in the ordered list are to the right of the location of the overall median.
Here is an example.
EXAMPLE 1.26
TTS24
Finding the quartiles. Here is the ordered list of the times to start a business in our sample of 24 countries:
2 4 5 5 5 5 6 6 7 8 10 11 12 13 16 17 19 19 24 25 32 38 49 53
The count of observations n = 24 is even, so the median is at position (24 + 1)/2 = 12.5, that is, between the 12th and the 13th observation in the ordered list. There are 12 cases above this position and 12 below it. The first quartile is the median of the first 12 observations, and the third quartile is the median of the last 12 observations. Check that Q1 = 5.5 and Q3 = 21.5.
1.48
Notice that the quartiles are resistant. For example, Q3 would have the same value if the highest start time was 530 days rather than 53 days.
Be careful when several observations take the same numerical value. Write down all the observations and apply the rules just as if they all had distinct values.
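The quartile rule can be sketched in a few lines of Python. This is one reading of the rule in the box above, not the method used by any particular software package; the function names are chosen for this example. When n is odd, the middle observation sits at the location of the median and is left out of both halves.

```python
def median(ordered):
    """Median of an already sorted list: the middle value, or the mean of the two middle values."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3), taking Q1 and Q3 as medians of the two halves of the ordered list."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)          # repeated values are kept, as the text advises
    n = len(ordered)
    half = n // 2                     # observations on each side of the median's location
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)

# Times to start a business (days), Example 1.26
times = [2, 4, 5, 5, 5, 5, 6, 6, 7, 8, 10, 11,
         12, 13, 16, 17, 19, 19, 24, 25, 32, 38, 49, 53]
print(quartiles(times))               # (5.5, 11.5, 21.5)

# Resistance: replace the largest value, 53, by 530
changed = times[:-1] + [530]
print(quartiles(changed))             # (5.5, 11.5, 21.5)
```

The four 5s in the list are treated as four separate observations, and changing the largest time from 53 to 530 days leaves all three values unchanged.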
USE YOUR KNOWLEDGE
STAT
Find the quartiles. Here are the scores on the first exam in an introductory statistics course for 10 students:
83 74 93 85 75 97 93 55 92 81
Find the quartiles for these first-exam scores.
There are several rules for calculating quartiles, which often give slightly different values. The differences are generally small. For describing data, just report the values that your software gives.
1.49
1.50
The five-number summary and boxplots
In Section 1.2, we used the smallest and largest observations to indicate the spread of a distribution. These single observations tell us little about the distribution as a whole, but they give information about the tails of the distribution that is missing if we know only Q1, M, and Q3. To get a quick summary of both center and spread, use all five numbers.
THE FIVE-NUMBER SUMMARY
The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is
Minimum Q1 M Q3 Maximum
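For a concrete case, the following Python sketch assembles the five-number summary for the 24 times to start a business from Example 1.26. It uses `median` from Python's standard `statistics` module and the quartile rule given earlier in this section; the function name `five_number_summary` is an illustrative choice.

```python
from statistics import median

def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum), with quartiles as medians of the two halves."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                      # size of each half, excluding the middle value if n is odd
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Times to start a business (days), Example 1.26
times = [2, 4, 5, 5, 5, 5, 6, 6, 7, 8, 10, 11,
         12, 13, 16, 17, 19, 19, 24, 25, 32, 38, 49, 53]
print(five_number_summary(times))      # (2, 5.5, 11.5, 21.5, 53)
```

Other software may report slightly different quartiles for the same data, because several quartile rules are in use.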
EXAMPLE 1.27
CALLS80
Service center call lengths. Table 1.2 (page 17) gives the service center call lengths for the sample of 80 calls that we discussed in Example 1.15. The five-number summary for these data is 1.0, 54.5, 103.5, 200, and 2631. The distribution is highly skewed. The mean is 197 seconds, a value that is very close to the third quartile.
USE YOUR KNOWLEDGE
CALLS80
Verify the calculations. Refer to the five-number summary and the mean for service center call lengths given in Example 1.27. Verify these results. Do not use software for this exercise and be sure to show all your work.
STAT
Find the five-number summary. Here are the scores on the first exam in an introductory statistics course for 10 students:
83 74 93 85 75 97 93 55 92 81
Find the five-number summary for these first-exam scores.
The five-number summary leads to another visual representation of a distribution, the boxplot.
1.51
BOXPLOT
A boxplot is a graph of the five-number summary.
A central box spans the quartiles Q1 and Q3.
A line in the box marks the median M. Lines extend from the box out to the smallest and largest observations.
whiskers box-and-whisker plots
The lines extending to the smallest and largest observations are sometimes called whiskers, and boxplots are sometimes called box-and-whisker plots. Software provides many varieties of boxplots, some of which use different choices for the placement of the whiskers.
When you look at a boxplot, first locate the median, which marks the center of the distribution. Then look at the spread. The quartiles show the spread of the middle half of the data, and the extremes (the smallest and largest observations) show the spread of the entire data set.
EXAMPLE 1.28
IQ
IQ scores. In Example 1.14 (page 14), we used a histogram to examine the distribution of a sample of 60 IQ scores. A boxplot for these data is given in Figure 1.14. Note that the mean is marked with a “+” and appears very close to the median. The two quartiles are each approximately the same distance from the median, and the two whiskers are approximately the same distance from the corresponding quartiles. All these characteristics are consistent with a symmetric distribution, as illustrated by the histogram in Figure 1.7.
FIGURE 1.14 Boxplot for sample of 60 IQ scores, Example 1.28.
USE YOUR KNOWLEDGE
STAT
Make a boxplot. Here are the scores on the first exam in an introductory statistics course for 10 students:
83 74 93 85 75 97 93 55 92 81
Make a boxplot for these first-exam scores.
The 1.5 × IQR rule for suspected outliers
If we look at the data in Table 1.2 (page 17), we can spot a clear outlier, a call lasting 2631 seconds, more than twice the length of any other call. How can we describe the spread of this distribution? The smallest and largest observations are extremes that do not describe the spread of the majority of the data. The distance between the quartiles (the range of the center half of the data) is a more resistant measure of spread than the range. This distance is called the interquartile range.
THE INTERQUARTILE RANGE IQR
The interquartile range IQR is the distance between the first and third quartiles, IQR = Q3 − Q1
EXAMPLE 1.29
IQR for service center call length data. In Exercise 1.49 (page 34) you verified that the five-number summary for our data on service center call lengths was 1.0, 54.5, 103.5, 200, and 2631. Therefore, we calculate
IQR = Q3 − Q1 = 200 − 54.5 = 145.5
The quartiles and the IQR are not affected by changes in either tail of the distribution. They are resistant, therefore, because changes in a few data points have no further effect once these points move outside the quartiles.
However, no single numerical measure of spread, such as IQR, is very useful for describing skewed distributions. The two sides of a skewed distribution have different spreads, so one number can’t summarize them. We can often detect skewness from the five-number summary by comparing how far the first quartile and the minimum are from the median (left tail) with how far the third quartile and the maximum are from the median (right tail). The interquartile range is mainly used as the basis for a rule of thumb for identifying suspected outliers.
THE 1.5 × IQR RULE FOR OUTLIERS
Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
EXAMPLE 1.30
CALLS80
Suspected outliers for call length data. For the call length data in Table 1.2 (page 17), 1.5 × IQR = 1.5 × 145.5 = 218.25
1.52
Any values below 54.5 – 218.25 = −163.75 or above 200 + 218.25 = 418.25 are flagged as possible outliers. There are no low outliers, but the eight longest calls are flagged as possible high outliers. Their lengths are
438 465 479 700 700 951 1148 2631
It is difficult to imagine calls lasting this long.
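The arithmetic in Example 1.30 can be sketched in Python. The function name `iqr_fences` and its parameter `k` are illustrative choices, not taken from the text. The quartiles are those reported in Example 1.27, and only the minimum and the eight longest calls quoted above are checked, because the full Table 1.2 is not reproduced here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs of the k x IQR rule for suspected outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of the call lengths (seconds), from the five-number summary
q1, q3 = 54.5, 200
low, high = iqr_fences(q1, q3)
print(low, high)                       # -163.75 418.25

# The shortest call and the eight longest calls quoted in the example
shortest = 1.0
longest_calls = [438, 465, 479, 700, 700, 951, 1148, 2631]

flagged = [x for x in longest_calls if x < low or x > high]
print(flagged == longest_calls)        # True: all eight exceed the upper cutoff
print(shortest < low)                  # False: no low outliers
```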
USE YOUR KNOWLEDGE
STAT
Find the IQR. Here are the scores on the first exam in an introductory statistics course for 10 students:
83 74 93 85 75 97 93 55 92 81
Find the interquartile range and use the 1.5 × IQR rule to check for outliers. How low would the lowest score need to be for it to be an outlier according to this rule?
modified boxplot
Two variations on the basic boxplot can be very useful. The first, called a modified boxplot, uses the 1.5 × IQR rule. The lines that extend out from the quartiles are terminated in whiskers that are 1.5 × IQR in length. Points beyond the whiskers are plotted individually and are classified as outliers according to the 1.5 × IQR rule.
side-by-side boxplots
The other variation is to use two or more boxplots in the same graph to compare groups measured on the same variable. These are called side-by-side boxplots. The following example illustrates these two variations.
EXAMPLE 1.31
POETS
Do poets die young? According to William Butler Yeats, “She is the Gaelic muse, for she gives inspiration to those she persecutes. The Gaelic poets die young, for she is restless, and will not let them remain long on earth.” One study designed to investigate this issue examined the age at death for writers from different cultures and genders.22
Three categories of writers examined were novelists, poets, and nonfiction writers. We examine the ages at death for female writers in these categories from North America. Figure 1.15 shows modified side-by-side boxplots for the three categories of writers.
Displaying the boxplots for the three categories of writers lets us compare the three distributions. We see that nonfiction writers tend to live the longest, followed by novelists. The poets do appear to die young! There is one outlier among the nonfiction writers, which is plotted individually along with the value of its label. This writer died at the age of 40, young for a nonfiction writer, but not for a novelist or a poet!
FIGURE 1.15 Modified side-by-side boxplots for the data on writers’ age at death, for Example 1.31.
Measuring spread: The standard deviation
The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread, or variability. The standard deviation measures spread by looking at how far the observations are from their mean.
THE STANDARD DEVIATION s
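The variance s2 of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations x1, x2, . . . , xn is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation,
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
The standard deviation s is the square root of the variance s2:
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$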
The idea behind the variance and the standard deviation as measures of spread is as follows: The deviations xi − x̅ display the spread of the values xi about their mean x̅. Some of these deviations will be positive and some negative because some of the observations fall on each side of the mean. In fact, the sum of the deviations of the observations from their mean will always be zero. Squaring the deviations makes the negative deviations positive so that observations far from the mean in either direction have large positive squared deviations. The variance is the average squared deviation. Therefore, s2 and s will be large if the observations are widely spread about their mean and small if the observations are all close to the mean.
EXAMPLE 1.32
METABOL
Metabolic rate. A person’s metabolic rate is the rate at which the body consumes energy. Metabolic rate is important in studies of weight gain, dieting, and exercise. Here are the metabolic rates of seven men who took part in a study of dieting. (The units are calories per 24 hours. These are the same calories used to describe the energy content of foods.)
1792 1666 1362 1614 1460 1867 1439
Enter these data into your calculator or software and verify that
x̅ = 1600 calories s = 189.24 calories
Figure 1.16 plots these data as dots on the calorie scale, with their mean marked by an asterisk (*). The arrows mark two of the deviations from the mean. If you were calculating s by hand, you would find the first deviation as
x1 − x̅ = 1792 − 1600 = 192
1.53
FIGURE 1.16 Metabolic rates for seven men, with the mean (*) and the deviations of two observations from the mean, Example 1.32.
Exercise 1.80 asks you to calculate the seven deviations from Example 1.32, square them, and find s2 and s directly from the deviations. Working one or two short examples by hand helps you understand how the standard deviation is obtained. In practice, you will use either software or a calculator that will find s.
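The hand calculation can also be laid out step by step in code. The sketch below follows the definition of s for the seven metabolic rates of Example 1.32; the variable names are illustrative.

```python
import math

# Metabolic rates (calories per 24 hours) for the seven men in Example 1.32
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
mean = sum(rates) / n                      # x-bar
deviations = [x - mean for x in rates]     # x_i minus x-bar
squared = [d ** 2 for d in deviations]     # squared deviations
variance = sum(squared) / (n - 1)          # s^2: divide by n - 1, not n
sd = math.sqrt(variance)                   # s

print(mean)                # 1600.0
print(deviations[0])       # 192.0
print(sum(deviations))     # 0.0
print(round(sd, 2))        # 189.24
```

The first deviation matches the value 192 computed above, and the deviations sum to zero, as they do for any data set.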
USE YOUR KNOWLEDGE
STAT
Find the variance and the standard deviation. Here are the scores on the first exam in an introductory statistics course for 10 students:
83 74 93 85 75 97 93 55 92 81
Find the variance and the standard deviation for these first-exam scores.
The idea of the variance is straightforward: it is the average of the squares of the deviations of the observations from their mean. The details we have just presented, however, raise some questions.
Why do we square the deviations?
First, the sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be. This is not true of the unsquared distances. So squared deviations point to the mean as center in a way that distances do not. Second, the standard deviation turns out to be the natural measure of spread for a particularly important class of symmetric unimodal distributions, the Normal distributions. We will meet the Normal distributions in the next section.
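The first point can be checked with a short derivation. The number c below is an arbitrary candidate for the center, introduced only for this illustration. Write each x_i − c as (x_i − x̄) + (x̄ − c) and expand the square:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - c)^2
  &= \sum_{i=1}^{n} \bigl[(x_i - \bar{x}) + (\bar{x} - c)\bigr]^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - c)\sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - c)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - c)^2
\end{align*}
```

The middle term vanishes because the deviations from the mean always sum to zero. The last term is never negative and equals zero only when c = x̄, so the sum of squared deviations is smallest when it is taken about the mean.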
Why do we emphasize the standard deviation rather than the variance?
One reason is that s, not s2, is the natural measure of spread for Normal distributions, which are introduced in the next section.
There is also a more general reason to prefer s to s2. Because the variance involves squaring the deviations, it does not have the same unit of measurement as the original observations. The variance of the metabolic rates, for example, is measured in squared calories. Taking the square root gives us a description of the spread of the distribution in the original measurement units.
Why do we average by dividing by n – 1 rather than n in calculating the variance?
Because the sum of the deviations is always zero, the last deviation can be found once we know the other n − 1. So we are not averaging n unrelated numbers. Only n − 1 of the squared deviations can vary freely, and we average by dividing the total by n − 1.
degrees of freedom
The number n – 1 is called the degrees of freedom of the variance or standard deviation. Many calculators offer a choice between dividing by n and dividing by n − 1, so be sure to use n − 1.
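The two divisors can be compared directly. The sketch below uses `variance` (divides by n − 1) and `pvariance` (divides by n) from Python's standard `statistics` module on the metabolic rates of Example 1.32, and shows that the last deviation is fixed once the other n − 1 are known. The variable names are illustrative.

```python
from statistics import pvariance, variance

# Metabolic rates (calories per 24 hours), Example 1.32
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
n = len(rates)
mean = sum(rates) / n
deviations = [x - mean for x in rates]

# Because all n deviations sum to zero, the first n - 1 determine the last one.
last_from_others = -sum(deviations[:-1])
print(last_from_others, deviations[-1])    # -161.0 -161.0

s2 = variance(rates)     # sum of squared deviations divided by n - 1
p2 = pvariance(rates)    # same sum divided by n
print(round(s2, 2), round(p2, 2))          # 35811.67 30695.71
```

Dividing by n gives a smaller value than dividing by n − 1. The text's s2 is the first of the two numbers.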
1.54
1.55
Properties of the standard deviation
Here are the basic properties of the standard deviation s as a measure of spread.
PROPERTIES OF THE STANDARD DEVIATION
s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger.
s, like the mean x̅, is not resistant. A few outliers can make s very large.
USE YOUR KNOWLEDGE
A standard deviation of zero. Construct a data set with 6 cases that has a variable with s = 0.
The use of squared deviations renders s even more sensitive than x̅ to a few extreme observations. For example, when we add Suriname to our sample of 24 countries for the analysis of the time to start a business (Exercise 1.43, page 29, and Exercise 1.45, page 31), we increase the standard deviation from 14.2 to 40.8! Distributions with outliers and strongly skewed distributions have standard deviations that do not give much helpful information about such distributions.
USE YOUR KNOWLEDGE
Effect of an outlier on the IQR. Find the IQR for the time to start a business with and without Suriname. What do you conclude about the sensitivity of this measure of spread to the inclusion of an outlier?
Choosing measures of center and spread
TTS24, TTS25
How do we choose between the five-number summary and x̅ and s to describe the center and spread of a distribution? Because the two sides of a strongly skewed distribution have different spreads, no single number such as s describes the spread well. The five-number summary, with its two quartiles and two extremes, does a better job.
CHOOSING A SUMMARY
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̅ and s for reasonably symmetric distributions that are free of outliers.
Remember that a graph gives the best overall picture of a distribution. Numerical measures of center and spread report specific facts about a distribution, but they do not describe its shape. Numerical summaries do not disclose the presence of multiple modes or gaps, for example. Always plot your data.
EXAMPLE 1.33
TTS24
Results from software. We prefer to examine the numerical summaries and graphical summaries together. Figure 1.17 gives (a) a boxplot, (b) a histogram, and (c) numerical summaries for the time to start a business from Example 1.23 (page 28) using Minitab. Similar displays are given for SPSS in Figure 1.18 (a), (b), and (c) and for JMP in Figure 1.19. Examine and compare the outputs carefully. Notice that they give different numbers of significant digits for some of these numerical summaries. There are also variations in how they make the boxplots and how they define classes for the histograms.
FIGURE 1.17 Graphical and numerical summaries from Minitab: (a) boxplot, (b) histogram, and (c) numerical summaries for the time to start a business, Example 1.33.
FIGURE 1.18 Graphical and numerical summaries from SPSS: (a) boxplot, (b) histogram, and (c) numerical summaries for the time to start a business, Example 1.33.
FIGURE 1.19 Graphical and numerical summaries from JMP for the time to start a business, Example 1.33.
Changing the unit of measurement
The same variable can be recorded in different units of measurement. Americans commonly record distances in miles and temperatures in degrees Fahrenheit, while the rest of the world measures distances in kilometers and temperatures in degrees Celsius. Fortunately, it is easy to convert numerical descriptions of a distribution from one unit of measurement to another. This is true because a change in the measurement unit is a linear transformation of the measurements.
LINEAR TRANSFORMATIONS
A linear transformation changes the original variable x into the new variable xnew given by an equation of the form
xnew = a + bx
Adding the constant a shifts all values of x upward or downward by the same amount. In particular, such a shift changes the origin (zero point) of the variable. Multiplying by the positive constant b changes the size of the unit of measurement.
EXAMPLE 1.34
Change the units. (a) If a distance x is measured in kilometers, the same distance in miles is
xnew = 0.62x
For example, a 10-kilometer race covers 6.2 miles. This transformation changes the units without changing the origin: a distance of 0 kilometers is the same as a distance of 0 miles.
(b) A temperature x measured in degrees Fahrenheit must be reexpressed in degrees Celsius to be easily understood by the rest of the world. The transformation is
$$x_{new} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x$$
Thus, the high of 95°F on a hot American summer day translates into 35°C. In this case,
$$a = -\frac{160}{9} \quad \text{and} \quad b = \frac{5}{9}$$
This linear transformation changes both the unit size and the origin of the measurements. The origin in the Celsius scale (0°C, the temperature at which water freezes) is 32° in the Fahrenheit scale.
Linear transformations do not change the shape of a distribution. If measurements on a variable x have a right-skewed distribution, any new variable xnew obtained by a linear transformation xnew = a + bx (for b > 0) will also have a right-skewed distribution. If the distribution of x is symmetric and unimodal, the distribution of xnew remains symmetric and unimodal.
Although a linear transformation preserves the basic shape of a distribution, the center and spread will change. Because linear changes of measurement scale are common, we must be aware of their effect on numerical descriptive measures of center and spread. Fortunately, the changes follow a simple pattern.
EXAMPLE 1.35
Use scores to find the points. In an introductory statistics course, homework counts for 300 points out of a total of 1000 possible points for all course requirements. During the semester, there were 12 homework assignments, and each was given a grade on a scale of 0 to 100. The maximum total score for the 12 homework assignments is therefore 1200. To convert the homework scores to final grade points, we need to convert the scale of 0 to 1200 to a scale of 0 to 300. We do this by multiplying the homework scores by 300/1200. In other words, we divide the homework scores by 4. Here are the homework scores and the corresponding final grade points for five students:
Student  1     2     3    4     5
Score    1056  1080  900  1164  1020
Points   264   270   225  291   255
These two sets of numbers measure the same performance on homework for the course. Because we obtained the points by dividing the scores by 4, the mean of the points will be the mean of the scores divided by 4. Similarly, the standard deviation of points will be the standard deviation of the scores divided by 4.
USE YOUR KNOWLEDGE
1.56 Calculate the points for a student. Use the setting of Example 1.35 to find the points for a student whose score is 950.
Here is a summary of the rules for linear transformations:
EFFECT OF A LINEAR TRANSFORMATION
To see the effect of a linear transformation on measures of center and spread, apply these rules:
Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.
Adding the same number a (either positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.
In Example 1.35, when we converted from score to points, we described the transformation as dividing by 4. The multiplication part of the summary of the effect of a linear transformation applies to this case because division by 4 is the same as multiplication by 0.25. Similarly, the second part of the summary applies to subtraction as well as addition because subtraction is simply the addition of a negative number.
The measures of spread IQR and s do not change when we add the same number a to all the observations because adding a constant changes the location of the distribution but leaves the spread unaltered. You can find the effect of a linear transformation xnew = a + bx by combining these rules. For example, if x has mean x̅, the transformed variable xnew has mean a + b x̅.
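These rules can be checked numerically on the five homework scores of Example 1.35. The sketch below uses Python's standard `statistics` module. The shift of 100 in the second half is a made-up constant, chosen only to illustrate the effect of adding a.

```python
import math
from statistics import mean, median, stdev

# Homework scores for the five students in Example 1.35
scores = [1056, 1080, 900, 1164, 1020]

# Multiplying by b = 0.25 (dividing by 4) converts scores to points
points = [x / 4 for x in scores]
print(points)                                   # [264.0, 270.0, 225.0, 291.0, 255.0]
print(mean(points) == mean(scores) / 4)         # True: center is multiplied by b
print(median(points) == median(scores) / 4)     # True
print(math.isclose(stdev(points), stdev(scores) / 4))   # True: spread is multiplied by b

# Adding a constant a (hypothetical value, for illustration only)
a = 100
shifted = [x + a for x in scores]
print(mean(shifted) == mean(scores) + a)        # True: center moves by a
print(math.isclose(stdev(shifted), stdev(scores)))      # True: spread is unchanged
```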
SECTION 1.3 SUMMARY
A numerical summary of a distribution should report its center and its spread or variability.
The mean x̅ and the median M describe the center of a distribution in different ways. The mean is the arithmetic average of the observations, and the median is their midpoint.
When you use the median to describe the center of a distribution, describe its spread by giving the quartiles. The first quartile Q1 has one-fourth of the observations below it, and the third quartile Q3 has three-fourths of the observations below it.
The interquartile range is the difference between the quartiles. It is the spread of the center half of the data. The 1.5 × IQR rule flags observations more than 1.5 × IQR beyond the quartiles as possible outliers.
The five-number summary consisting of the median, the quartiles, and the smallest and largest individual observations provides a quick overall description of a distribution. The median describes the center, and the quartiles and extremes show the spread.
Boxplots based on the five-number summary are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data. In a modified boxplot, points identified by the 1.5 × IQR rule are plotted individually. Side-by-side boxplots can be used to display boxplots for more than one group on the same graph.
The variance s2 and especially its square root, the standard deviation s, are common measures of spread about the mean as center. The standard deviation s is zero when there is no spread and gets larger as the spread increases.
A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are. The median and quartiles are resistant, but the mean and the standard deviation are not.
The mean and standard deviation are good descriptions for symmetric distributions without outliers. They are most useful for the Normal distributions introduced in the next section. The five-number summary is a better exploratory description for skewed distributions.
Linear transformations have the form xnew = a + bx. A linear transformation changes the origin if a ≠ 0 and changes the size of the unit of measurement if b > 0. Linear transformations do not change the overall shape of a distribution. A linear transformation multiplies a measure of spread by b and changes a percentile or measure of center m into a + bm.
Numerical measures of particular aspects of a distribution, such as center and spread, do not report the entire shape of most distributions. In some cases, particularly distributions with multiple peaks and gaps, these measures may not be very informative.
SECTION 1.3 EXERCISES For Exercises 1.43 and 1.44, see page 29; for Exercises 1.45 to 1.47, see page 31; for Exercise 1.48, see page 33; for Exercises 1.49 and 1.50, see page 34; for Exercise 1.51, see page 35; for Exercise 1.52, see page 37; for Exercise 1.53, see page 39; for Exercise 1.54, see page 40; for Exercise 1.55, see page 40; and for Exercise 1.56, see page 45.
1.57 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. KPOT40
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.58 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. KSUP40
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.59 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. KPOT40
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Give the five-number summary and explain the meaning of each of the five numbers.
(d) Which numerical summaries do you prefer for describing the distribution: the mean and the standard deviation, or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.60 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. KSUP40
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Give the five-number summary and explain the meaning of each of the five numbers.
(d) Which numerical summaries do you prefer for describing the distribution: the mean and the standard deviation, or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.61 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. In Exercise 1.30, you used a stemplot to examine the distribution of the potassium absorption. KPOT40
(a) Make a histogram and use it to describe the distribution of potassium absorption.
(b) Make a boxplot and use it to describe the distribution of potassium absorption.
(c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.
1.62 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. In Exercise 1.31, you used a stemplot to examine the distribution of the potassium absorption. KSUP40
(a) Make a histogram and use it to describe the distribution of potassium absorption.
(b) Make a boxplot and use it to describe the distribution of potassium absorption.
(c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.
1.63 Compare the potatoes with the supplement. Refer to Exercises 1.30 and 1.31 (page 24). Use a back-to-back stemplot to display the data for the two sources of potassium. Use the stemplot to compare the two distributions and write a short summary of your findings. KPS40
1.64 Potassium sources. Refer to Exercises 1.30 and 1.31 (page 24). Use side-by-side boxplots to describe the distributions. KPS40
(a) Summarize what you see in the boxplots and compare it with what you saw in the stemplots.
(b) For comparing these two distributions, do you prefer back-to-back stemplots or side-by-side boxplots? Give reasons for your answer.
1.65 Gosset’s data on double stout sales. William Sealy Gosset worked at the Guinness Brewery in Dublin and made substantial contributions to the practice of statistics.23 In his work at the brewery, he collected and analyzed a great deal of data. Archives with Gosset’s handwritten tables, graphs, and notes have been preserved at the Guinness Storehouse in Dublin.24 In one study, Gosset examined the change in the double stout market before and after World War I (1914–1918). For various regions in England and Scotland, he calculated the ratio of sales in 1925, after the war, as a percent of sales in 1913, before the war. Here are the data: STOUT
Bristol 94
Cardiff 112
English Agents 78
English O 68
English P 46
English R 111
Glasgow 66
Liverpool 140
London 428
Manchester 190
Newcastle-on-Tyne 118
Scottish 24
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.66 Measures of spread for the double stout data. Refer to the previous exercise. STOUT
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Which measure do you prefer for describing the spread of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.67 Are there outliers in the double stout data? Refer to the previous two exercises. STOUT
(a) Find the IQR for these data.
(b) Use the 1.5 × IQR rule to identify and name any outliers.
(c) Make a boxplot for these data and describe the distribution using only the information in the boxplot.
(d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.
(e) Make a stemplot for these data.
(f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the double stout data.
1.68 Smolts. Smolts are young salmon at a stage when their skin becomes covered with silvery scales and they start to migrate from freshwater to the sea. The reflectance of a light shined on a smolt’s skin is a measure of the smolt’s readiness for the migration. Here are the reflectances, in percents, for a sample of 50 smolts:25 SMOLTS
57.6 54.8 63.4 57.0 54.7 42.3 63.6 55.5 33.5 63.3 58.3 42.1 56.1 47.8 56.1 55.9 38.8 49.7 42.3 45.6 69.0 50.4 53.0 38.3 60.4 49.3 42.8 44.5 46.4 44.3 58.9 42.1 47.6 47.9 69.2 46.6 68.1 42.8 45.6 47.3 59.6 37.8 53.9 43.2 51.4 64.5 43.8 42.7 50.9 43.8
(a) Find the mean reflectance for these smolts.
(b) Find the median reflectance for these smolts.
(c) Do you prefer the mean or the median as a measure of center for these data? Give reasons for your preference.
1.69 Measures of spread for smolts. Refer to the previous exercise. SMOLTS
(a) Find the standard deviation of the reflectance for these smolts.
(b) Find the quartiles of the reflectance for these smolts.
(c) Do you prefer the standard deviation or the quartiles as a measure of spread for these data? Give reasons for your preference.
1.70 Are there outliers in the smolt data? Refer to the previous two exercises. SMOLTS
(a) Find the IQR for the smolt data.
(b) Use the 1.5 × IQR rule to identify any outliers.
(c) Make a boxplot for the smolt data and describe the distribution using only the information in the boxplot.
(d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.
(e) Make a stemplot for these data.
(f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the smolt reflectance data.
1.71 Potatoes. A quality product is one that is consistent and has very little variability in its characteristics. Controlling variability can be more difficult with agricultural products than with those that are manufactured. The following table gives the weights, in ounces, of the 25 potatoes sold in a 10-pound bag: POTATO
7.6 7.9 8.0 6.9 6.7 7.9 7.9 7.9 7.6 7.8 7.0 4.7 7.6 6.3 4.7 4.7 4.7 6.3 6.0 5.3 4.3 7.9 5.2 6.0 3.7
(a) Summarize the data graphically and numerically. Give reasons for the methods you chose to use in your summaries.
(b) Do you think that your numerical summaries do an effective job of describing these data? Why or why not?
(c) There appear to be two distinct clusters of weights for these potatoes. Divide the sample into two subsamples based on the clustering. Give the mean and standard deviation for each subsample. Do you think that this way of summarizing these data is better than a numerical summary that uses all the data as a single sample? Give a reason for your answer.
1.72 The alcohol content of beer. Brewing beer involves a variety of steps that can affect the alcohol content. A website gives the percent alcohol for 159 domestic brands of beer.26 BEER
(a) Use graphical and numerical summaries of your choice to describe the data. Give reasons for your choice.
(b) The data set contains an outlier. Explain why this particular beer is unusual.
(c) For the outlier, give a short description of how you think this particular beer should be marketed.
1.73 Outlier for alcohol content of beer. Refer to the previous exercise. BEER
(a) Calculate the mean with and without the outlier. Do the same for the median. Explain how these values change when the outlier is excluded.
(b) Calculate the standard deviation with and without the outlier. Do the same for the quartiles. Explain how these values change when the outlier is excluded.
(c) Write a short paragraph summarizing what you have learned in this exercise.
1.74 Calories in beer. Refer to the previous two exercises. The data set also lists calories per 12 ounces of beverage. BEER
(a) Analyze the data and summarize the distribution of calories for these 159 brands of beer.
(b) In the previous exercise, you identified one brand of beer as an outlier. To what extent is this brand an outlier in the distribution of calories? Explain your answer.
(c) Does the distribution of calories suggest marketing strategies for this brand of beer? Describe some marketing strategies.
1.75 Median versus mean for net worth. A report on the assets of American households says that the median net worth of U.S. families is $81,200. The mean net worth of these families is $534,600.27 What explains the difference between these two measures of center?
1.76 Create a data set. Create a data set with seven observations for which the median would change by a large amount if the smallest observation were deleted.
1.77 Mean versus median. A small accounting firm pays each of its seven clerks $55,000, three junior accountants $80,000 each, and the firm’s owner $650,000. What is the mean salary paid at this firm? How many of the employees earn less than the mean? What is the median salary?
1.78 Be careful about how you treat the zeros. In computing the median income of any group, some federal agencies omit all members of the group who had no income. Give an example to show that the reported median income of a group can go down even though the group becomes economically better off. Is this also true of the mean income?
1.79 How does the median change? The firm in Exercise 1.77 gives no raises to the clerks and junior accountants, while the owner’s take increases to $900,000. How does this change affect the mean? How does it affect the median?
1.80 Metabolic rates. Calculate the mean and standard deviation of the metabolic rates in Example 1.32 (page 38), showing each step in detail. First find the mean x̅ by summing the seven observations and dividing by 7. Then find each of the deviations xi – x̅ and their squares. Check that the deviations have sum 0. Calculate the variance as an average of the squared deviations (remember to divide by n – 1). Finally, obtain s as the square root of the variance. METABOL
1.81 Earthquakes. Each year there are about 900,000 earthquakes of magnitude 2.5 or less that are usually not felt. In contrast, there are about 10 of magnitude 7.0 that cause serious damage.28 Explain why the average magnitude of earthquakes is not a good measure of their impact.
1.82 IQ scores. Many standard statistical methods that you will study in Part II of this book are intended for use with distributions that are symmetric and have no outliers. These methods start with the mean and standard deviation, x̅ and s. For example, standard methods would typically be used for the IQ and GPA data in Table 1.3 (page 26). IQ
(a) Find x̅ and s for the IQ data. In large populations, IQ scores are standardized to have mean 100 and standard deviation 15. In what way does the distribution of IQ among these students differ from the overall population?
(b) Find the median IQ score. It is, as we expect, close to the mean.
(c) Find the mean and median for the GPA data. The two measures of center differ a bit. What feature of the data (see your stemplot in Exercise 1.39 or make a new stemplot) explains the difference?
1.83 Mean and median for two observations. The Mean and Median applet allows you to place observations on a line and see their mean and median visually. Place two observations on the line by clicking below it. Why does only one arrow appear?
1.84 Mean and median for three observations. In the Mean and Median applet, place three observations on the line by clicking below it, two close together near the center of the line and one somewhat to the right of these two.
(a) Pull the single rightmost observation out to the right. (Place the cursor on the point, hold down a mouse button, and drag the point.) How does the mean behave? How does the median behave? Explain briefly why each measure acts as it does.
(b) Now drag the rightmost point to the left as far as you can. What happens to the mean? What happens to the median as you drag this point past the other two (watch carefully)?
1.85 Mean and median for seven observations. Place seven observations on the line in the Mean and Median applet by clicking below it.
(a) Add one additional observation without changing the median. Where is your new point?
(b) Use the applet to convince yourself that when you add yet another observation (there are now nine in all), the median does not change no matter where you put the ninth point. Explain why this must be true.
1.86 Imputation. Various problems with data collection can cause some observations to be missing. Suppose a data set has 20 cases. Here are the values of the variable x for 10 of these cases: IMPUTE
17 6 12 14 20 23 9 12 16 21
The values for the other 10 cases are missing. One way to deal with missing data is called imputation. The basic idea is that missing values are replaced, or imputed, with values that are based on an analysis of the data that are not missing. For a data set with a single variable, the usual choice of a value for imputation is the mean of the values that are not missing. The mean for this data set is 15.
(a) Verify that the mean is 15 and find the standard deviation for the 10 cases for which x is not missing.
(b) Create a new data set with 20 cases by setting the values for the 10 missing cases to 15. Compute the mean and standard deviation for this data set.
(c) Summarize what you have learned about the possible effects of this type of imputation on the mean and the standard deviation.
1.87 A standard deviation contest. This is a standard deviation contest. You must choose four numbers from the whole numbers 10 to 20, with repeats allowed.
(a) Choose four numbers that have the smallest possible standard deviation.
(b) Choose four numbers that have the largest possible standard deviation.
(c) Is more than one choice possible in either part (a) or part (b)? Explain.
1.88 Longleaf pine trees. The Wade Tract in Thomas County, Georgia, is an old-growth forest of longleaf pine trees (Pinus palustris) that has survived in a relatively undisturbed state since before the settlement of the area by Europeans. A study collected data on 584 of these trees.29 One of the variables measured was the diameter at breast height (DBH). This is the diameter of the tree at 4.5 feet and the units are centimeters (cm). Only trees with DBH greater than 1.5 cm were sampled. Here are the diameters of a random sample of 40 of these trees: PINES
10.5 13.3 26.0 18.3 52.2 9.2 26.1 17.6 40.5 31.8 47.2 11.4 2.7 69.3 44.4 16.9 35.7 5.4 44.2 2.2 4.3 7.8 38.1 2.2 11.4 51.5 4.9 39.7 32.6 51.8 43.6 2.3 44.6 31.5 40.3 22.3 43.3 37.5 29.1 27.9
(a) Find the five-number summary for these data.
(b) Make a boxplot.
(c) Make a histogram.
(d) Write a short summary of the major features of this distribution. Do you prefer the boxplot or the histogram for these data?
1.89 Weight gain. A study of diet and weight gain deliberately overfed 15 volunteers for eight weeks. The mean increase in fat was x̅ = 2.41 kilograms, and the standard deviation was s = 1.25 kilograms. What are x̅ and s in pounds? (A kilogram is 2.2 pounds.)
1.90 Changing units from inches to centimeters. Changing the unit of length from inches to centimeters multiplies each length by 2.54 because there are 2.54 centimeters in an inch. This change of units multiplies our usual measures of spread by 2.54. This is true of IQR and the standard deviation. What happens to the variance when we change units in this way?
1.91 A different type of mean. The trimmed mean is a measure of center that is more resistant than the mean but uses more of the available information than the median. To compute the 10% trimmed mean, discard the highest 10% and the lowest 10% of the observations and compute the mean of the remaining 80%. Trimming eliminates the effect of a small number of outliers. Compute the 10% trimmed mean of the service time data in Table 1.2 (page 17). Then compute the 20% trimmed mean. Compare the values of these measures with the median and the ordinary untrimmed mean.
1.92 Changing units from centimeters to inches. Refer to Exercise 1.88 (page 50). Change the measurements from centimeters to inches by multiplying each value by 0.39. Answer the questions from that exercise and explain the effect of the transformation on these data.
1.4 Density Curves and Normal Distributions
When you complete this section, you will be able to:
Compare the mean and the median for symmetric and skewed distributions.
Sketch a Normal distribution for any given mean and standard deviation.
Apply the 68–95–99.7 rule to find proportions of observations within one, two, and three standard deviations of the mean for any Normal distribution.
Transform values of a variable from a general Normal distribution to the standard Normal distribution.
Compute areas under a Normal curve using software or Table A.
Perform inverse Normal calculations to find values of a Normal variable corresponding to various areas.
Assess the extent to which the distribution of a set of data can be approximated by a Normal distribution.
We now have a kit of graphical and numerical tools for describing distributions. What is more, we have a clear strategy for exploring data on a single quantitative variable:
1. Always plot your data: make a graph, usually a stemplot or a histogram.
2. Look for the overall pattern and for striking deviations such as outliers.
3. Calculate an appropriate numerical summary to briefly describe center and spread.
density curves
Technology has expanded the set of graphs that we can choose for Step 1. It is possible, though painful, to make histograms by hand. Using software, clever algorithms can describe a distribution in a way that is not feasible by hand, by fitting a smooth curve to the data in addition to or instead of a histogram. The curves used are called density curves. Before we examine density curves in detail, here is an example of what software can do.
EXAMPLE 1.36
TTS
Density curves for times to start a business and Titanic passenger ages. Figure 1.20 illustrates the use of density curves along with histograms to describe distributions. Figure 1.20(a) shows the distribution of the times to start a business for 189 countries (see Example 1.23, page 28). The outlier, Suriname, described in Exercise 1.43 (page 29) has been deleted from the data set. The distribution is highly skewed to the right. Most of the data are in the first two classes, with 40 or fewer days to start a business.
TITANIC
Exercise 1.27 (page 24) describes data on the class of the ticket of the Titanic passengers, and Figure 1.20(b) shows the distribution of the ages of these passengers. It has a single mode, a long right tail, and a relatively short left tail.
FIGURE 1.20 (a) The distribution of the time to start a business, Example 1.36. The distribution is pictured with both a histogram and a density curve. (b) The distribution of the ages of the Titanic passengers, Example 1.36. These distributions have a single mode with tails of two different lengths.
FIGURE 1.21 (a) The distribution of Iowa Test vocabulary scores for Gary, Indiana, seventh-graders, Example 1.37. The shaded bars in the histogram represent scores less than or equal to 6.0. (b) The shaded area under the Normal density curve also represents scores less than or equal to 6.0. This area is 0.293, close to the true 0.303 for the actual data.
A smooth density curve is an idealization that gives the overall pattern of the data but ignores minor irregularities. We first discuss density curves in general and then focus on a special class of density curves, the bell-shaped Normal curves.
Density curves One way to think of a density curve is as a smooth approximation to the irregular bars of a histogram. Figure 1.21 shows a histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills. Scores of many students on this national test have a very regular distribution. The histogram is symmetric, and both tails fall off quite smoothly from a single center peak. There are no large gaps or obvious outliers. The curve drawn through the tops of the histogram bars in Figure 1.21 is a good description of the overall pattern of the data.
EXAMPLE 1.37
Vocabulary scores. In a histogram, the areas of the bars represent either counts or proportions of the observations. In Figure 1.21(a), we shaded the bars that represent students with vocabulary scores 6.0 or lower. There are 287 such students, who make up the proportion 287/947 = 0.303 of all Gary seventh-graders. The shaded bars in Figure 1.21(a) make up proportion 0.303 of the total area under all the bars. If we adjust the scale so that the total area of the bars is 1, the area of the shaded bars will also be 0.303.
In Figure 1.21(b), we shaded the area under the curve to the left of 6.0. If we adjust the scale so that the total area under the curve is exactly 1, areas under the curve will then represent proportions of the observations. That is, area = proportion. The curve is then a density curve. The shaded area under the density curve in Figure 1.21(b) represents the proportion of students with score 6.0 or lower. This area is 0.293, only 0.010 away from the histogram result. You can see that areas under the density curve give quite good approximations of areas given by the histogram.
DENSITY CURVE
A density curve is a curve that
Is always on or above the horizontal axis.
Has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the proportion of all observations that fall in that range.
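The two defining properties, and the rule that area equals proportion, can be checked numerically. The sketch below is illustrative only: the density f(x) = 2(1 − x) on the interval from 0 to 1 and the helper name `area_under` are assumptions chosen for simplicity and do not come from the text.

```python
def density(x):
    """Hypothetical density curve: f(x) = 2(1 - x) on [0, 1], and 0 elsewhere."""
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def area_under(f, a, b, steps=100_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Property 1: the curve is never below the horizontal axis (checked on a grid).
never_negative = all(density(i / 1000) >= 0.0 for i in range(-500, 1501))

# Property 2: the total area under the curve is 1.
total_area = area_under(density, 0.0, 1.0)

# Area above a range of values = proportion of observations in that range.
proportion_below_half = area_under(density, 0.0, 0.5)

print(never_negative)
print(round(total_area, 4))
print(round(proportion_below_half, 4))
```

For this assumed curve, three-quarters of the area lies to the left of 0.5, so the curve describes a distribution in which 75% of the observations fall below 0.5.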
The density curve in Figure 1.21 is a Normal curve. Density curves, like distributions, come in many shapes. Figure 1.22 shows two density curves, a symmetric Normal density curve and a right-skewed curve.
We will discuss Normal density curves in detail in this section because of the important role that they play in statistics. There are, however, many applications where the use of other families of density curves is essential.
A density curve of an appropriate shape is often an adequate description of the overall pattern of a distribution. Outliers, which are deviations from the overall pattern, are not described by the curve.
Measuring center and spread for density curves Our measures of center and spread apply to density curves as well as to actual sets of observations, but only some of these measures are easily seen from the curve. A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest. Because areas under a density curve represent proportions of the observations, the median is the point with half the total area on each side. You can roughly locate the quartiles by dividing the area under the curve into quarters as accurately as possible by eye. The IQR is the distance between the first and third quartiles. There are mathematical ways of calculating areas under curves. These allow us to locate the median and quartiles exactly on any density curve.
FIGURE 1.22 (a) A symmetric Normal density curve with its mean and median marked. (b) A right-skewed density curve with its mean and median marked.
FIGURE 1.23 The mean of a density curve is the point at which it would balance.
What about the mean and standard deviation? The mean of a set of observations is their arithmetic average. If we think of the observations as weights strung out along a thin rod, the mean is the point at which the rod would balance. This fact is also true of density curves. The mean is the point at which the curve would balance if it were made out of solid material. Figure 1.23 illustrates this interpretation of the mean.
A symmetric curve, such as the Normal curve in Figure 1.22(a), balances at its center of symmetry. Half the area under a symmetric curve lies on either side of its center, so this is also the median.
For a right-skewed curve, such as those shown in Figures 1.22(b) and 1.23, the small area in the long right tail tips the curve more than the same area near the center. The mean (the balance point), therefore, lies to the right of the median. It is hard to locate the balance point by eye on a skewed curve. There are mathematical ways of calculating the mean for any density curve, so we are able to mark the mean as well as the median in Figure 1.22(b). The standard deviation can also be calculated mathematically, but it can’t be located by eye on most density curves.
MEDIAN AND MEAN OF A DENSITY CURVE
The median of a density curve is the equal-areas point, the point that divides the area under the curve in half.
The mean of a density curve is the balance point, at which the curve would balance if made of solid material.
The median and mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
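As a rough numerical illustration of the equal-areas point and the balance point, the sketch below locates both for a hypothetical right-skewed density, f(x) = 2(1 − x) on the interval from 0 to 1. The density and the helper names `integrate` and `median_of` are assumptions made for this example, not part of the text.

```python
def density(x):
    """Hypothetical right-skewed density: f(x) = 2(1 - x) on [0, 1], else 0."""
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def median_of(f, low, high, tolerance=1e-6):
    """Bisect for the equal-areas point: half the area lies to its left."""
    a, b = low, high
    while b - a > tolerance:
        middle = (a + b) / 2
        if integrate(f, low, middle) < 0.5:
            a = middle
        else:
            b = middle
    return (a + b) / 2


# The balance point is the area-weighted average position along the axis.
mean = integrate(lambda x: x * density(x), 0.0, 1.0)
median = median_of(density, 0.0, 1.0)

print(f"median = {median:.4f}")
print(f"mean   = {mean:.4f}")
```

For this assumed curve the mean (about 0.333) lies to the right of the median (about 0.293), in the direction of the long right tail.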
A density curve is an idealized description of a distribution of data. For example, the density curve in Figure 1.21 is exactly symmetric, but the histogram of vocabulary scores is only approximately symmetric. We therefore need to distinguish between the mean and standard deviation of the density curve and the numbers x̅ and s computed from the actual observations. The usual notation for the mean of an idealized distribution is µ (the Greek letter mu). We write the standard deviation of a density curve as σ (the Greek letter sigma). In Chapter 5, we refer to x̅ and s as statistics associated with a sample and to µ and σ as parameters associated with a population.
Normal distributions
One particularly important class of density curves has already appeared in Figures 1.21 and 1.22(a). These density curves are symmetric, unimodal, and bell-shaped. They are called Normal curves, and they describe Normal distributions. All Normal distributions have the same overall shape.
The exact density curve for a particular Normal distribution is specified by giving the distribution’s mean µ and its standard deviation σ. The mean is located at the center of the symmetric curve and is the same as the median. Changing µ without changing σ moves the Normal curve along the horizontal axis without changing its spread.
The standard deviation σ controls the spread of a Normal curve. Figure 1.24 shows two Normal curves with different values of σ. The curve with the larger standard deviation is more spread out.
The standard deviation σ is the natural measure of spread for Normal distributions. Not only do µ and σ completely determine the shape of a Normal curve, but we can locate σ by eye on the curve. Here’s how. As we move out in either direction from the center µ, the curve changes from falling ever more steeply to falling ever less steeply.
The points at which this change of curvature takes place are located at distance σ on either side of the mean µ. You can feel the change as you run your finger along a Normal curve, and so find the standard deviation. Remember that µ and σ alone do not specify the shape of most distributions, and that the shape of density curves in general does not reveal σ. These are special properties of Normal distributions.
FIGURE 1.24 Two Normal curves, showing the mean µ and the standard deviation σ.
There are other symmetric bell-shaped density curves that are not Normal. The Normal density curves are specified by a particular equation. The height of the density curve at any point x is given by
$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
We will not make direct use of this fact, although it is the basis of mathematical work with Normal distributions. Notice that the equation of the curve is completely determined by the mean µ and the standard deviation σ.
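For readers who want to see the equation at work, here is a minimal sketch that evaluates the Normal density, exp(−((x − µ)/σ)²/2) / (σ√(2π)), and checks numerically that the total area under the curve is 1. The values µ = 10 and σ = 2 and the function names are illustrative assumptions, not values from the text.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def area(mu, sigma, a, b, steps=20_000):
    """Midpoint-rule approximation of the area under the Normal curve from a to b."""
    width = (b - a) / steps
    return sum(
        normal_density(a + (i + 0.5) * width, mu, sigma) for i in range(steps)
    ) * width


mu, sigma = 10.0, 2.0  # hypothetical mean and standard deviation

peak_height = normal_density(mu, mu, sigma)  # the curve is highest at the mean
left_height = normal_density(mu - 3.0, mu, sigma)
right_height = normal_density(mu + 3.0, mu, sigma)  # equal by symmetry
total_area = area(mu, sigma, mu - 10 * sigma, mu + 10 * sigma)

print(f"peak height = {peak_height:.5f}")
print(f"total area  = {total_area:.6f}")
```

Only µ and σ appear as inputs, which reflects the point that these two numbers completely determine the curve.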
Why are the Normal distributions important in statistics? Here are three reasons.
1. Normal distributions are good descriptions for some distributions of real data. Distributions that are often close to Normal include scores on tests taken by many people (such as the Iowa Test of Figure 1.21, page 53), repeated careful measurements of the same quantity, and characteristics of biological populations (such as lengths of baby pythons and yields of corn).
2. Normal distributions are good approximations to the results of many kinds of chance outcomes, such as tossing a coin many times.
3. Many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions.
However, even though many sets of data follow a Normal distribution, many do not. Most income distributions, for example, are skewed to the right and so are not Normal. Non-Normal data, like nonnormal people, not only are common but are also sometimes more interesting than their Normal counterparts.
The 68–95–99.7 rule Although there are many Normal curves, they all have common properties. Here is one of the most important.
THE 68–95–99.7 RULE
In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of the mean µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
Figure 1.25 illustrates the 68–95–99.7 rule. By remembering these three numbers, you can think about Normal distributions without constantly making detailed calculations.
FIGURE 1.25 The 68–95–99.7 rule for Normal distributions.
EXAMPLE 1.38
Heights of young women. The distribution of heights of young women aged 18 to 24 is approximately Normal with mean µ = 64.5 inches and standard deviation σ = 2.5 inches. Figure 1.26 shows what the 68–95–99.7 rule says about this distribution.
Two standard deviations equals five inches for this distribution. The 95 part of the 68–95–99.7 rule says that the middle 95% of young women are between 64.5 – 5 and 64.5 + 5 inches tall, that is, between 59.5 and 69.5 inches. This fact is exactly true for an exactly Normal distribution. It is approximately true for the heights of young women because the distribution of heights is approximately Normal.
The other 5% of young women have heights outside the range from 59.5 to 69.5 inches. Because the Normal distributions are symmetric, half of these women are on the tall side. So the tallest 2.5% of young women are taller than 69.5 inches.
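The rule can be compared with software calculations. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with the N(64.5, 2.5) height distribution of Example 1.38. The helper name `proportion_within` is an assumption introduced for this illustration.

```python
from statistics import NormalDist

# Heights of young women from Example 1.38: N(64.5, 2.5), in inches.
heights = NormalDist(mu=64.5, sigma=2.5)


def proportion_within(dist, k):
    """Proportion of a Normal distribution within k standard deviations of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


within = {k: proportion_within(heights, k) for k in (1, 2, 3)}
taller_than_69_5 = 1 - heights.cdf(69.5)

for k, proportion in within.items():
    print(f"within {k} standard deviation(s): {proportion:.4f}")
print(f"taller than 69.5 inches: {taller_than_69_5:.4f}")
```

The computed proportions are about 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%. The upper-tail proportion comes out near 0.023 rather than 0.025 because two standard deviations on either side of the mean cover slightly more than 95% of a Normal distribution.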
FIGURE 1.26 The 68–95–99.7 rule applied to the heights of young women, Example 1.38.
N(µ, σ)
Because we will mention Normal distributions often, a short notation is helpful. We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ, σ). For example, the distribution of young women’s heights is N(64.5, 2.5).
USE YOUR KNOWLEDGE
1.93 Test scores. Many states assess the skills of their students in various grades. One program that is available for this purpose is the National Assessment of Educational Progress (NAEP).30 One of the tests provided by the NAEP assesses the reading skills of 12th-grade students. In a recent year, the national mean score was 288 and the standard deviation was 38. Assuming that these scores are approximately Normally distributed, N(288, 38), use the 68–95–99.7 rule to give a range of scores that includes 95% of these students.
1.94 Use the 68–95–99.7 rule. Refer to the previous exercise. Use the 68–95–99.7 rule to give a range of scores that includes 99.7% of these students.
Standardizing observations As the 68–95–99.7 rule suggests, all Normal distributions share many properties. In fact, all Normal distributions are the same if we measure in units of size σ about the mean µ as center. Changing to these units is called standardizing. To standardize a value, subtract the mean of the distribution and then divide by the standard deviation.
STANDARDIZING AND z-SCORES
If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
A standardized value is often called a z-score.
A z-score tells us how many standard deviations the original observation falls away from the mean, and in which direction. Observations larger than the mean are positive when standardized, and observations smaller than the mean are negative.
To compare scores based on different measures, z-scores can be very useful. For example, see Exercise 1.124 (page 73), where you are asked to compare an SAT score with an ACT score.
EXAMPLE 1.39
Find some z-scores. The heights of young women are approximately Normal with µ = 64.5 inches and σ = 2.5 inches. The z-score for height is
$$z = \frac{\text{height} - 64.5}{2.5}$$
A woman’s standardized height is the number of standard deviations by which her height differs from the mean height of all young women. A woman 68 inches tall, for example, has z-score
$$z = \frac{68 - 64.5}{2.5} = 1.4$$
or 1.4 standard deviations above the mean. Similarly, a woman 5 feet (60 inches) tall has z-score
$$z = \frac{60 - 64.5}{2.5} = -1.8$$
or 1.8 standard deviations less than the mean height.
USE YOUR KNOWLEDGE
1.95 Find the z-score. Consider the NAEP scores (see Exercise 1.93, page 59), which we assume are approximately Normal, N(288, 38). Give the z-score for a student who received a score of 350.
1.96 Find another z-score. Consider the NAEP scores, which we assume are approximately Normal, N(288, 38). Give the z-score for a student who received a score of 240. Explain why your answer is negative even though all the test scores are positive.
We need a way to write variables, such as “height” in Example 1.38, that follow a theoretical distribution such as a Normal distribution. We use capital letters near the end of the alphabet for such variables. If X is the height of a young woman, we can then shorten “the height of a young woman is less than 68 inches” to “X < 68.” We will use lowercase x to stand for any specific value of the variable X.
We often standardize observations from symmetric distributions to express them in a common scale. We might, for example, compare the heights of two children of different ages by calculating their z-scores. The standardized heights tell us where each child stands in the distribution for his or her age group.
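The z-score arithmetic of Example 1.39 takes only a few lines of code. This is a minimal sketch in which the function name `z_score` is an assumption. The mean 64.5 and standard deviation 2.5 are the values given for the heights of young women.

```python
def z_score(x, mu, sigma):
    """Standardize x: subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma


MU, SIGMA = 64.5, 2.5  # heights of young women, in inches (Example 1.39)

z_tall = z_score(68, MU, SIGMA)   # a woman 68 inches tall
z_short = z_score(60, MU, SIGMA)  # a woman 5 feet (60 inches) tall

print(f"68 inches: z = {z_tall:.1f}")
print(f"60 inches: z = {z_short:.1f}")
```

The sign of the result carries the direction: the 68-inch height is above the mean and gives a positive z-score, while the 60-inch height is below the mean and gives a negative one.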
Standardizing is a linear transformation that transforms the data into the standard scale of z-scores. We know that a linear transformation does not change the shape of a distribution, and that the mean and standard deviation change in a simple manner. In particular, the standardized values for any distribution always have mean 0 and standard deviation 1.
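The claim that standardized values always have mean 0 and standard deviation 1 can be checked on any data set. The sketch below uses a small hypothetical data set, an assumption for illustration only, and standardizes it with its own mean x̅ and standard deviation s.

```python
from statistics import mean, stdev

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations, not from the text

x_bar = mean(data)  # sample mean
s = stdev(data)     # sample standard deviation (divides by n - 1)

# Standardize every observation with the data set's own mean and standard deviation.
z_scores = [(x - x_bar) / s for x in data]

z_mean = mean(z_scores)
z_sd = stdev(z_scores)

print(f"mean of z-scores = {z_mean:.6f}")
print(f"standard deviation of z-scores = {z_sd:.6f}")
```

Because floating-point arithmetic is used, the mean of the z-scores may differ from 0 by a tiny rounding error.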
If the variable we standardize has a Normal distribution, standardizing does more than give a common scale. It makes all Normal distributions into a single distribution, and this distribution is still Normal. Standardizing a variable that has any Normal distribution produces a new variable that has the standard Normal distribution.
THE STANDARD NORMAL DISTRIBUTION
The standard Normal distribution is the Normal distribution N(0, 1) with mean 0 and standard deviation 1.
If a variable X has any Normal distribution N(µ, σ) with mean µ and standard deviation σ, then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has the standard Normal distribution.
FIGURE 1.27 The cumulative proportion for a value x is the proportion of all observations from the distribution that are less than or equal to x. This is the area to the left of x under the Normal curve.
Normal distribution calculations
Areas under a Normal curve represent proportions of observations from that Normal distribution. There is no formula for areas under a Normal curve. Calculations use either software that calculates areas or a table of areas. The table and most software calculate one kind of area: cumulative proportions. A cumulative proportion is the proportion of observations in a distribution that lie at or below a given value. When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the left of a given value. Figure 1.27 shows the idea more clearly than words do.
The key to calculating Normal proportions is to match the area you want with areas that represent cumulative proportions. Then get areas for cumulative proportions either from software or (with an extra step) from a table. The following examples show the method in pictures.
EXAMPLE 1.40
NCAA eligibility for competition. To be eligible to compete in their first year of college, Division I athletes must meet certain academic standards set by the National Collegiate Athletic Association (NCAA). These are based on their grade point average (GPA) in certain courses and combined scores on the SAT Critical Reading and Mathematics sections or the ACT composite score.31
For a student with a 3.0 GPA, the combined SAT score must be 800 or higher. Based on the distribution of SAT scores for college-bound students, we assume that the distribution of the combined Critical Reading and Mathematics scores is approximately Normal with mean 1010 and standard deviation 225.32 What proportion of college-bound students have SAT scores of 800 or more?
Here is the calculation in pictures: the proportion of scores above 800 is the area under the curve to the right of 800. That’s the total area under the curve (which is always 1) minus the cumulative proportion up to 800.
area to the right of 800 = total area − area to the left of 800
0.8247 = 1 − 0.1753
That is, the proportion of college-bound SAT takers with a 3.0 GPA who are eligible to compete is 0.8247, or about 82%.
There is no area under a smooth curve that is exactly over the point 800. Consequently, the area to the right of 800 (the proportion of scores > 800) is the same as the area at or to the right of this point (the proportion of scores ≥ 800). The actual data may contain a student who scored exactly 800 on the SAT. That the proportion of scores exactly equal to 800 is 0 for a Normal distribution is a consequence of the idealized smoothing of Normal distributions for data.
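With software, the calculation in Example 1.40 takes one cumulative proportion and one subtraction. The sketch below assumes Python's `statistics.NormalDist` as the software; any statistical package with a Normal cumulative probability function would serve equally well.

```python
from statistics import NormalDist

# Combined SAT scores, assumed approximately N(1010, 225) as in Example 1.40.
sat = NormalDist(mu=1010, sigma=225)

below_800 = sat.cdf(800)      # cumulative proportion: area to the left of 800
at_least_800 = 1 - below_800  # total area (1) minus the cumulative proportion

print(f"{at_least_800:.4f}")  # about 0.8247, the value quoted in Example 1.40
```

Because the area exactly over the point 800 is zero, the same number answers both "more than 800" and "800 or more."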
EXAMPLE 1.41
NCAA eligibility for aid and practice. The NCAA has a category of eligibility in which a first-year student may not compete but is still eligible to receive an athletic scholarship and to practice with the team. The requirements for this category are a 3.0 GPA and combined SAT Critical Reading and Mathematics scores of at least 620.
What proportion of college-bound students who take the SAT would be eligible to receive an athletic scholarship and to practice with the team but would not be eligible to compete? That is, what proportion have scores between 620 and 800? Here are the pictures:
area between 620 and 800 = area to the left of 800 − area to the left of 620
0.1338 = 0.1753 − 0.0415
About 13% of college-bound students with a 3.0 GPA have SAT scores between 620 and 800.
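The area between two values is the difference of two cumulative proportions. That rule can be wrapped in a small helper, sketched here with Python's `statistics.NormalDist`; the function name `proportion_between` is hypothetical and introduced only for this illustration.

```python
from statistics import NormalDist

def proportion_between(dist, low, high):
    """Area under the curve of `dist` between low and high (low <= high)."""
    return dist.cdf(high) - dist.cdf(low)

# Combined SAT scores, assumed approximately N(1010, 225) as in Example 1.41.
sat = NormalDist(mu=1010, sigma=225)
between = proportion_between(sat, 620, 800)

print(f"{between:.4f}")  # about 0.13, matching Example 1.41
```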
How do we find the numerical values of the areas in Examples 1.40 and 1.41? If you use software, just plug in mean 1010 and standard deviation 225. Then ask for the cumulative proportions for 800 and for 620. (Your software will probably refer to these as “cumulative probabilities.” We will learn in Chapter 4 why the language of probability fits.) Sketches of the areas that you want, similar to the ones in Examples 1.40 and 1.41, are very helpful in making sure that you are doing the correct calculations.
You can use the Normal Curve applet on the text website to find Normal proportions. The applet is more flexible than most software—it will find any Normal proportion, not just cumulative proportions. The applet is an excellent way to understand Normal curves. But, because of the limitations of web browsers, the applet is not as accurate as statistical software.
If you are not using software, you can find cumulative proportions for Normal curves from a table. That requires an extra step, as we now explain.
Using the standard Normal table
The extra step in finding cumulative proportions from a table is that we must first standardize to express the problem in the standard scale of z-scores. This allows us to get by with just one table, a table of standard Normal cumulative proportions. Table A in the back of the book gives standard Normal probabilities. The picture at the top of the table reminds us that the entries are cumulative proportions, areas under the curve to the left of a value z.
EXAMPLE 1.42
Find the proportion from z. What proportion of observations on a standard Normal variable Z take values less than 1.47? We need to find the area to the left of 1.47; locate 1.4 in the left-hand column of Table A and then locate the remaining digit 7 as .07 in the top row. The entry opposite 1.4 and under .07 is 0.9292. This is the cumulative proportion we seek. Figure 1.28 illustrates this area.
FIGURE 1.28 The area under a standard Normal curve to the left of the point z = 1.47 is 0.9292, Example 1.42.
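To see how a table entry relates to software output, one row of a standard Normal table can be regenerated. This sketch assumes Python's `statistics.NormalDist` and assumes that entries are rounded to four decimal places, as the value 0.9292 suggests; `table_row` is a hypothetical helper, written for non-negative rows only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard Normal distribution N(0, 1)

def table_row(first_digits):
    """Cumulative proportions for z = first_digits + .00, .01, ..., .09.

    Intended for non-negative rows such as 1.4; entries rounded to 4 places.
    """
    base = round(first_digits * 100)  # e.g. 1.4 -> 140 hundredths
    return [round(std_normal.cdf((base + j) / 100), 4) for j in range(10)]

row = table_row(1.4)
print(row[7])  # entry in row 1.4 under column .07: 0.9292
```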
Now that you see how Table A works, let’s redo the NCAA Examples 1.40 and 1.41 using the table.
EXAMPLE 1.43
Find the proportion from x. What proportion of college-bound students who take the SAT have scores of at least 800? The picture that leads to the answer is exactly the same as in Example 1.40. The extra step is that we first standardize to read cumulative proportions from Table A. If X is SAT score, we want the proportion of students for which X ≥ x where x = 800.
1. Standardize. Subtract the mean, then divide by the standard deviation, to transform the problem about X into a problem about a standard Normal Z:
X ≥ 800
(X − 1010) / 225 ≥ (800 − 1010) / 225
Z ≥ −0.93
2. Use the table. Look at the pictures in Example 1.40. From Table A, we see that the proportion of observations less than −0.93 is 0.1762. The area to the right of −0.93 is therefore 1 − 0.1762 = 0.8238. This is about 82%.
The area from the table in Example 1.43 (0.8238) is slightly less accurate than the area from software in Example 1.40 (0.8247) because we must round z to two places when we use Table A. The difference is rarely important in practice.
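The small gap between the two answers comes entirely from rounding z to two decimal places before the lookup. The sketch below imitates the table method in Python's `statistics.NormalDist` and compares it with the unrounded calculation. The helper `table_cumulative` is hypothetical, and rounding table entries to four places is an assumption based on the values quoted above.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard Normal distribution N(0, 1)

def table_cumulative(x, mu, sigma):
    """Mimic a table lookup: round z to 2 places, then round the area to 4."""
    z = round((x - mu) / sigma, 2)
    return round(std_normal.cdf(z), 4)

# Table method: standardize, round, look up, subtract from 1.
table_answer = round(1 - table_cumulative(800, 1010, 225), 4)

# Software method: no rounding of z.
software_answer = 1 - NormalDist(1010, 225).cdf(800)

print(table_answer, round(software_answer, 4))  # 0.8238 versus 0.8247
```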
EXAMPLE 1.44
Eligibility for aid and practice. What proportion of all students who take the SAT would be eligible to receive athletic scholarships and to practice with the team but would not be eligible to compete in the eyes of the NCAA? That is, what proportion of students have SAT scores between 620 and 800? First, sketch the areas, exactly as in Example 1.41. We again use X as shorthand for an SAT score.
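1. Standardize.
620 ≤ X < 800
(620 − 1010) / 225 ≤ (X − 1010) / 225 < (800 − 1010) / 225
−1.73 ≤ Z < −0.93
2. Use the table.
area between −1.73 and −0.93 = (area left of −0.93) − (area left of −1.73)
= 0.1762 − 0.0418 = 0.1344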
As in Example 1.41, about 13% of students would be eligible to receive athletic scholarships and to practice with the team.
Sometimes we encounter a value of z more extreme than those appearing in Table A. For example, the area to the left of z = −4 is not given in the table. The z-values in Table A leave only area 0.0002 in each tail unaccounted for. For practical purposes, we can act as if there is zero area outside the range of Table A.
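Software has no such limit, so the size of the neglected tail can be checked directly. This sketch uses Python's `statistics.NormalDist` and assumes, for illustration only, that the most extreme entry in the table is z = −3.49; the text does not state the table's range.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard Normal distribution N(0, 1)

TABLE_MIN_Z = -3.49  # assumed smallest z in the table (not stated in the text)

area_below_table = std_normal.cdf(TABLE_MIN_Z)  # tail the table leaves out
area_below_minus_4 = std_normal.cdf(-4)         # the example value z = -4

print(round(area_below_table, 4))             # 0.0002
print(area_below_minus_4 < area_below_table)  # True: even less area below -4
```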
USE YOUR KNOWLEDGE
1.97 Find the proportion. Consider the NAEP scores, which are approximately Normal, N(288, 38). Find the proportion of students who have scores less than 350. Find the proportion of students who have scores greater than or equal to 350. Sketch the relationship between these two calculations using pictures of Normal curves similar to the ones given in Example 1.40 (page 61).
1.98 Find another proportion. Consider the NAEP scores, which are approximately Normal, N(288, 38). Find the proportion of students who have scores between 300 and 350. Use pictures of Normal curves similar to the ones given in Example 1.41 (page 62) to illustrate your calculations.
Inverse Normal calculations
Examples 1.40 to 1.44 illustrate the use of Normal distributions to find the proportion of observations in a given event, such as “SAT score between 620 and 800.” We may instead want to find the observed value corresponding to a given proportion.
Statistical software will do this directly. Without software, use Table A backward, finding the desired proportion in the body of the table and then reading the corresponding z from the left column and top row.
EXAMPLE 1.45
How high for the top 10%? Scores for college-bound students on the SAT Critical Reading test in recent years follow approximately the N(500, 120) distribution.33 How high must a student score to place in the top 10% of all students taking the SAT?
Again, the key to the problem is to draw a picture. Figure 1.29 shows that we want the score x with an area of 0.10 above it. That’s the same as area below x equal to 0.90.
FIGURE 1.29 Locating the point on a Normal curve with area 0.10 to its right, Example 1.45.
Statistical software has a function that will give you the x for any cumulative proportion you specify. The function often has a name such as “inverse cumulative probability.” Plug in mean 500, standard deviation 120, and cumulative proportion 0.9. The software tells you that x = 653.786. We see that a student must score at least 654 to place in the highest 10%.
Without software, first find the standard score z with cumulative proportion 0.9, then “unstandardize” to find x. Here is the two-step process:
1. Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997. This is the entry corresponding to z = 1.28. So z = 1.28 is the standardized value with area 0.9 to its left.
2. Unstandardize to transform the solution from z back to the original x scale. We know that the standardized value of the unknown x is z = 1.28. So x itself satisfies
(x − 500) / 120 = 1.28
Solving this equation for x gives
x = 500 + (1.28)(120) = 653.6
This equation should make sense: it finds the x that lies 1.28 standard deviations above the mean on this particular Normal curve. That is the “unstandardized” meaning of z = 1.28. The general rule for unstandardizing a z-score is
x = μ + zσ
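Both routes in Example 1.45 can be reproduced in a few lines. This sketch assumes Python's `statistics.NormalDist`, whose `inv_cdf` method plays the role of the "inverse cumulative probability" function; `unstandardize` is a hypothetical helper that simply applies the rule x = μ + zσ.

```python
from statistics import NormalDist

mu, sigma = 500, 120  # SAT Critical Reading scores, N(500, 120)

# Software route: ask directly for the x with cumulative proportion 0.90.
x_software = NormalDist(mu, sigma).inv_cdf(0.90)

def unstandardize(z, mu, sigma):
    """Convert a z-score back to the original scale: x = mu + z * sigma."""
    return mu + z * sigma

# Table route: z = 1.28 is the table entry closest to area 0.9.
x_table = unstandardize(1.28, mu, sigma)

print(round(x_software, 3), round(x_table, 1))  # 653.786 and 653.6
```

The two answers differ slightly because the table gives z only to two decimal places. Either way, a score of 654 is needed to place in the top 10%.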
USE YOUR KNOWLEDGE
1.99 What score is needed to be in the top 20%? Consider the NAEP scores, which are approximately Normal, N(288, 38). How high a score is needed to be in the top 20% of students who take this exam?
1.100 Find the score that 75% of students will exceed. Consider the NAEP scores, which are approximately Normal, N(288, 38). Seventy-five percent of the students will score above x on this exam. Find x.
Normal quantile plots
The Normal distributions provide good descriptions of some distributions of real data, such as the Iowa Test vocabulary scores. The distributions of some other common variables are usually skewed and therefore distinctly non-Normal. Examples include economic variables such as personal income and gross sales of business firms, the survival times of cancer patients after treatment, and the service lifetime of mechanical or electronic components. While experience can suggest whether or not a Normal distribution is plausible in a particular case, it is risky to assume that a distribution is Normal without actually inspecting the data.
A histogram or stemplot can reveal distinctly non-Normal features of a distribution, such as outliers, pronounced skewness, or gaps and clusters. If the stemplot or histogram appears roughly symmetric and unimodal, however, we need a more sensitive way to judge the adequacy of a Normal model. The most useful tool for assessing Normality is another graph, the Normal quantile plot.
Here is the basic idea of a Normal quantile plot. The graphs produced by software use more sophisticated versions of this idea. It is not practical to make Normal quantile plots by hand.
1. Arrange the observed data values from smallest to largest. Record what percentile of the data each value occupies. For example, the smallest observation in a set of 20 is at the 5% point, the second smallest is at the 10% point, and so on.