
OpenIntro Statistics Fourth Edition

David Diez
Data Scientist, OpenIntro

Mine Çetinkaya-Rundel
Associate Professor of the Practice, Duke University
Professional Educator, RStudio

Christopher D Barr
Investment Analyst, Varadero Capital

This book may be downloaded as a free PDF at openintro.org/os. This textbook is also available under a Creative Commons license, with the source files hosted on Github.


Table of Contents

1 Introduction to data
    1.1 Case study: using stents to prevent strokes
    1.2 Data basics
    1.3 Sampling principles and strategies
    1.4 Experiments

2 Summarizing data
    2.1 Examining numerical data
    2.2 Considering categorical data
    2.3 Case study: malaria vaccine

3 Probability
    3.1 Defining probability
    3.2 Conditional probability
    3.3 Sampling from a small population
    3.4 Random variables
    3.5 Continuous distributions

4 Distributions of random variables
    4.1 Normal distribution
    4.2 Geometric distribution
    4.3 Binomial distribution
    4.4 Negative binomial distribution
    4.5 Poisson distribution

5 Foundations for inference
    5.1 Point estimates and sampling variability
    5.2 Confidence intervals for a proportion
    5.3 Hypothesis testing for a proportion

6 Inference for categorical data
    6.1 Inference for a single proportion
    6.2 Difference of two proportions
    6.3 Testing for goodness of fit using chi-square
    6.4 Testing for independence in two-way tables

7 Inference for numerical data
    7.1 One-sample means with the t-distribution
    7.2 Paired data
    7.3 Difference of two means
    7.4 Power calculations for a difference of means
    7.5 Comparing many means with ANOVA

8 Introduction to linear regression
    8.1 Fitting a line, residuals, and correlation
    8.2 Least squares regression
    8.3 Types of outliers in linear regression
    8.4 Inference for linear regression

9 Multiple and logistic regression
    9.1 Introduction to multiple regression
    9.2 Model selection
    9.3 Checking model conditions using graphs
    9.4 Multiple regression case study: Mario Kart
    9.5 Introduction to logistic regression

A Exercise solutions

B Data sets within the text

C Distribution tables

Preface

OpenIntro Statistics covers a first course in statistics, providing a rigorous introduction to applied statistics that is clear, concise, and accessible. This book was written with the undergraduate level in mind, but it’s also popular in high schools and graduate courses.

We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods.

• Statistics is an applied field with a wide range of practical applications.

• You don’t have to be a math guru to learn from real, interesting data.

• Data are messy, and statistical tools are imperfect. But, when you understand the strengths and weaknesses of these tools, you can use them to learn about the world.

Textbook overview

The chapters of this book are as follows:

1. Introduction to data. Data structures, variables, and basic data collection techniques.

2. Summarizing data. Data summaries, graphics, and a teaser of inference using randomization.

3. Probability. Basic principles of probability.

4. Distributions of random variables. The normal model and other key distributions.

5. Foundations for inference. General ideas for statistical inference in the context of estimating the population proportion.

6. Inference for categorical data. Inference for proportions and tables using the normal and chi-square distributions.

7. Inference for numerical data. Inference for one or two sample means using the t-distribution, statistical power for comparing two groups, and also comparisons of many means using ANOVA.

8. Introduction to linear regression. Regression for a numerical outcome with one predictor variable. Most of this chapter could be covered after Chapter 1.

9. Multiple and logistic regression. Regression for numerical and categorical data using many predictors.

OpenIntro Statistics supports flexibility in choosing and ordering topics. If the main goal is to reach multiple regression (Chapter 9) as quickly as possible, then the following are the ideal prerequisites:

• Chapter 1, Section 2.1, and Section 2.2 for a solid introduction to data structures and statistical summaries that are used throughout the book.

• Section 4.1 for a solid understanding of the normal distribution.

• Chapter 5 to establish the core set of inference tools.

• Section 7.1 to give a foundation for the t-distribution.

• Chapter 8 for establishing ideas and principles for single predictor regression.


Examples and exercises

Examples are provided to establish an understanding of how to apply methods.

EXAMPLE 0.1

This is an example. When a question is asked here, where can the answer be found?

The answer can be found here, in the solution section of the example!

When we think the reader should be ready to try determining the solution to an example, we frame it as Guided Practice.

GUIDED PRACTICE 0.2

The reader may check or learn the answer to any Guided Practice problem by reviewing the full solution in a footnote.1

Exercises are also provided at the end of each section as well as review exercises at the end of each chapter. Solutions are given for odd-numbered exercises in Appendix A.

Additional resources

Video overviews, slides, statistical software labs, data sets used in the textbook, and much more are readily available at

openintro.org/os

We also have improved the ability to access data in this book through the addition of Appendix B, which provides additional information for each of the data sets used in the main text and is new in the Fourth Edition. Online guides to each of these data sets are also provided at openintro.org/data and through a companion R package.

We appreciate all feedback as well as reports of any typos through the website. A short-link to report a new typo or review known typos is openintro.org/os/typos.

For those focused on statistics at the high school level, consider Advanced High School Statistics, which is a version of OpenIntro Statistics that has been heavily customized by Leah Dorazio for high school courses and AP® Statistics.

Acknowledgements

This project would not be possible without the passion and dedication of many more people beyond those on the author list. The authors would like to thank the OpenIntro Staff for their involvement and ongoing contributions. We are also very grateful to the hundreds of students and instructors who have provided us with valuable feedback since we first started posting book content in 2009.

We also want to thank the many teachers who helped review this edition, including Laura Acion, Matthew E. Aiello-Lammens, Jonathan Akin, Stacey C. Behrensmeyer, Juan Gomez, Jo Hardin, Nicholas Horton, Danish Khan, Peter H.M. Klaren, Jesse Mostipak, Jon C. New, Mario Orsi, Steve Phelps, and David Rockoff. We appreciate all of their feedback, which helped us tune the text in significant ways and greatly improved this book.

1Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice.


Chapter 1 Introduction to data

1.1 Case study: using stents to prevent strokes

1.2 Data basics

1.3 Sampling principles and strategies

1.4 Experiments

Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data, and in this first chapter, we focus on both the properties of data and on the collection of data.

For videos, slides, and other resources, please visit

www.openintro.org/os


1.1 Case study: using stents to prevent strokes

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

In this section we will consider an experiment that studies the effectiveness of stents in treating patients at risk of stroke. Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question conducted an experiment with 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

Treatment group. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

Control group. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.

Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Figure 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period.

Patient   group       0-30 days   0-365 days
1         treatment   no event    no event
2         treatment   stroke      stroke
3         treatment   no event    no event
...       ...         ...         ...
450       control     no event    no event
451       control     no event    no event

Figure 1.1: Results for five patients from the stent study.

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. Figure 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study. For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left side of the table at the intersection of the treatment row and the stroke column: 33.

            0-30 days             0-365 days
            stroke   no event     stroke   no event
treatment   33       191          45       179
control     13       214          28       199
Total       46       405          73       378

Figure 1.2: Descriptive statistics for the stent study.
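As a rough illustration of how the counts in Figure 1.2 could be tallied from patient-level records like those in Figure 1.1, here is a minimal Python sketch. The records are rebuilt from the 0-30 day counts in Figure 1.2 rather than read from the actual study file, and the names records and counts are ours, chosen only for illustration.

```python
from collections import Counter

# Hypothetical patient-level records (group, 30-day outcome), rebuilt from
# the 0-30 day counts in Figure 1.2; the real data file is not shown here.
records = (
    [("treatment", "stroke")] * 33
    + [("treatment", "no event")] * 191
    + [("control", "stroke")] * 13
    + [("control", "no event")] * 214
)

# Tally how many patients fall into each (group, outcome) combination.
counts = Counter(records)

for group in ("treatment", "control"):
    stroke = counts[(group, "stroke")]
    no_event = counts[(group, "no event")]
    print(f"{group:<10} stroke={stroke:>3}  no event={no_event:>3}")

# Column totals across both groups.
total_stroke = sum(n for (_, outcome), n in counts.items() if outcome == "stroke")
total_no_event = sum(n for (_, outcome), n in counts.items() if outcome == "no event")
print(f"{'Total':<10} stroke={total_stroke:>3}  no event={total_no_event:>3}")
```

The point of the sketch is only that a summary table is a count of cases per combination of group and outcome; any tool that can count rows would do the same job.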


GUIDED PRACTICE 1.1

Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year. (Please note: answers to all Guided Practice exercises are provided using footnotes.)1

We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.

Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.

Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.

These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups?
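The arithmetic behind these two proportions and their difference can be checked with a short Python sketch. The counts come from the 0-365 day columns of Figure 1.2; the variable names are ours.

```python
# Counts after 365 days, taken from Figure 1.2.
stroke_treatment, n_treatment = 45, 224
stroke_control, n_control = 28, 227

# A proportion is the number of patients with the outcome divided by group size.
p_treatment = stroke_treatment / n_treatment
p_control = stroke_control / n_control
difference = p_treatment - p_control

print(f"treatment:  {p_treatment:.2f}")   # 0.20
print(f"control:    {p_control:.2f}")     # 0.12
print(f"difference: {difference:.2f}")    # 0.08
```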

This second question is subtle. Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process. It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance?
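To get a feel for this kind of natural variation, the following Python sketch simulates it in two ways. It is only an illustration under our own assumptions, not the analysis the researchers published: the seed, the number of repetitions, and all variable names are arbitrary choices. The first part repeats the 100 coin flips several times. The second part shuffles the 73 strokes and 378 non-events from the 0-365 day columns of Figure 1.2 across groups of 224 and 227 patients, and records how often chance alone produces a difference at least as large as the one observed.

```python
import random

rng = random.Random(2021)  # fixed seed (arbitrary) so the run is repeatable

# Part 1: flip a fair coin 100 times, and repeat that 10 times.
heads_counts = [sum(rng.random() < 0.5 for _ in range(100)) for _ in range(10)]
print("heads in 10 sets of 100 flips:", heads_counts)

# Part 2: shuffle the 365-day outcomes from Figure 1.2 across the two groups.
outcomes = [1] * 73 + [0] * 378   # 1 = stroke, 0 = no event
n_treatment, n_control = 224, 227
observed_diff = 45 / 224 - 28 / 227

n_shuffles = 2000  # assumed number of repetitions; more gives a steadier estimate
at_least_as_large = 0
for _ in range(n_shuffles):
    rng.shuffle(outcomes)
    # The first 224 shuffled outcomes play the role of the treatment group.
    strokes_treatment = sum(outcomes[:n_treatment])
    strokes_control = sum(outcomes[n_treatment:])
    diff = strokes_treatment / n_treatment - strokes_control / n_control
    if diff >= observed_diff:
        at_least_as_large += 1

fraction = at_least_as_large / n_shuffles
print(f"observed difference: {observed_diff:.3f}")
print(f"fraction of shuffles with a difference at least this large: {fraction:.3f}")
```

A small fraction would mean that a difference this large rarely arises from shuffling alone. How small is small enough is a question the later chapters on inference address.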

While we don’t yet have our statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.

Be careful: Do not generalize the results of this study to all patients and all stents. This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients. In addition, there are many types of stents and this study only considered the self-expanding Wingspan stent (Boston Scientific). However, this study does leave us with an important lesson: we should keep our eyes open for surprises.

1The proportion of the 224 patients who had a stroke within 365 days: 45/224 = 0.20.


Exercises

1.1 Migraine and acupuncture, Part I. A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 females diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. 43 patients in the treatment group received acupuncture that is specifically designed to treat migraines. 46 patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). 24 hours after patients received acupuncture, they were asked if they were pain free. Results are summarized in the contingency table below.2

                    Pain free
                    Yes   No   Total
Group   Treatment   10    33   43
        Control     2     44   46
        Total       12    77   89

[Figure from the original paper displaying the appropriate area (M) versus the inappropriate area (S) used in the treatment of migraine attacks.]

(a) What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?

(b) What percent were pain free in the control group?

(d) Your findings so far might suggest that acupuncture is an effective treatment for migraines for all people who suffer from migraines. However, this is not the only possible conclusion that can be drawn based on your findings so far. What is one other possible explanation for the observed difference between the percentages of patients that are pain free 24 hours after receiving acupuncture in the two groups?

1.2 Sinusitis and antibiotics, Part I. Researchers studying the effect of antibiotic treatment for acute sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control. Study participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste. The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc. At the end of the 10-day period, patients were asked if they experienced improvement in symptoms. The distribution of responses is summarized below.3

                    Self-reported improvement in symptoms
                    Yes   No   Total
Group   Treatment   66    19   85
        Control     65    16   81
        Total       131   35   166

(a) What percent of patients in the treatment group experienced improvement in symptoms?

(b) What percent experienced improvement in symptoms in the control group?

(d) Your findings so far might suggest a real difference in effectiveness of antibiotic and placebo treatments for improving symptoms of sinusitis. However, this is not the only possible conclusion that can be drawn based on your findings so far. What is one other possible explanation for the observed difference between the percentages of patients in the antibiotic and placebo treatment groups that experience improvement in symptoms of sinusitis?

2G. Allais et al. “Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints”. In: Neurological Sci. 32.1 (2011), pp. 173–175.

3J.M. Garbutt et al. “Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial”. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685–692.


1.2 Data basics

Effective organization and description of data is a first step in most analyses. This section introduces the data matrix for organizing data as well as some terminology about different forms of data that will be used throughout this book.

1.2.1 Observations, variables, and data matrices

Figure 1.3 displays rows 1, 2, 3, and 50 of a data set for 50 randomly sampled loans offered through Lending Club, which is a peer-to-peer lending company. These observations will be referred to as the loan50 data set.

Each row in the table represents a single loan. The formal name for a row is a case or observational unit. The columns represent characteristics, called variables, for each of the loans. For example, the first row represents a loan of $7,500 with an interest rate of 7.34%, where the borrower is based in Maryland (MD) and has an income of $70,000.

GUIDED PRACTICE 1.2

What is the grade of the first loan in Figure 1.3? And what is the home ownership status of the borrower for that first loan? For these Guided Practice questions, you can check your answer in the footnote.4

In practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood. For instance, it is always important to be sure we know what each variable means and the units of measurement. Descriptions of the loan50 variables are given in Figure 1.4.

      loan_amount   interest_rate   term   grade   state   total_income   homeownership
1     7500          7.34            36     A       MD      70000          rent
2     25000         9.43            60     B       OH      254000         mortgage
3     14500         6.08            36     A       MO      80000          mortgage
...   ...           ...             ...    ...     ...     ...            ...
50    3000          7.96            36     A       CA      34000          rent

Figure 1.3: Four rows from the loan50 data matrix.

variable        description
loan_amount     Amount of the loan received, in US dollars.
interest_rate   Interest rate on the loan, in an annual percentage.
term            The length of the loan, which is always set as a whole number of months.
grade           Loan grade, which takes a value A through G and represents the quality of the loan and its likelihood of being repaid.
state           US state where the borrower resides.
total_income    Borrower’s total income, including any second income, in US dollars.
homeownership   Indicates whether the person owns, owns but has a mortgage, or rents.

Figure 1.4: Variables and their descriptions for the loan50 data set.

The data in Figure 1.3 represent a data matrix, which is a convenient and common way to organize data, especially if collecting data in a spreadsheet. Each row of a data matrix corresponds to a unique case (observational unit), and each column corresponds to a variable.

4The loan’s grade is A, and the borrower rents their residence.


When recording data, use a data matrix unless you have a very good reason to use a different structure. This structure allows new cases to be added as rows or new variables as new columns.
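One way to picture a data matrix in code is as a list of rows, where every row holds a value for the same set of variables. The Python sketch below uses the first three rows and row 50 of Figure 1.3. The underscored key names are our spelling of the column labels, and the extra variable loan_to_income is hypothetical: it is not part of loan50 and is added only to show what a new column looks like.

```python
# A data matrix as a list of rows (cases); each dictionary key is a variable.
# Values are the first three rows of Figure 1.3.
loans = [
    {"loan_amount": 7500, "interest_rate": 7.34, "term": 36, "grade": "A",
     "state": "MD", "total_income": 70000, "homeownership": "rent"},
    {"loan_amount": 25000, "interest_rate": 9.43, "term": 60, "grade": "B",
     "state": "OH", "total_income": 254000, "homeownership": "mortgage"},
    {"loan_amount": 14500, "interest_rate": 6.08, "term": 36, "grade": "A",
     "state": "MO", "total_income": 80000, "homeownership": "mortgage"},
]

# Adding a new case appends one row (here, row 50 of Figure 1.3).
loans.append(
    {"loan_amount": 3000, "interest_rate": 7.96, "term": 36, "grade": "A",
     "state": "CA", "total_income": 34000, "homeownership": "rent"}
)

# Adding a new variable adds one column, i.e. one new value in every row.
# `loan_to_income` is a hypothetical variable, not part of loan50.
for row in loans:
    row["loan_to_income"] = row["loan_amount"] / row["total_income"]

n_rows = len(loans)
n_cols = len(loans[0])
print(n_rows, "cases x", n_cols, "variables")
```

Because every row shares the same keys, neither operation disturbs the existing data, which is the practical advantage of the structure.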

GUIDED PRACTICE 1.3

The grades for assignments, quizzes, and exams in a course are often recorded in a gradebook that takes the form of a data matrix. How might you organize grade data using a data matrix?5

GUIDED PRACTICE 1.4

We consider data for 3,142 counties in the United States, which includes each county’s name, the state where it resides, its population in 2017, how its population changed from 2010 to 2017, poverty rate, and six additional characteristics. How might these data be organized in a data matrix?6

The data described in Guided Practice 1.4 represent the county data set, which is shown as a data matrix in Figure 1.5. The variables are summarized in Figure 1.6.

5There are multiple strategies that can be followed. One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam. Under this setup, it is easy to review a single line to understand a student’s grade history. There should also be columns to include student information, such as one column to list student names.

6Each county may be viewed as a case, and there are eleven pieces of information recorded for each case. A table with 3,142 rows and 11 columns could hold these data, where each row represents a county and each column represents a particular piece of information.


      name      state    pop     pop change  poverty  homeownership  multi unit  unemp rate  metro  median edu    median hh income
1     Autauga   Alabama  55504    1.48       13.7     77.5            7.2        3.86        yes    some college  55317
2     Baldwin   Alabama  212628   9.19       11.8     76.7           22.6        3.99        yes    some college  52562
3     Barbour   Alabama  25270   -6.22       27.2     68.0           11.1        5.90        no     hs diploma    33368
4     Bibb      Alabama  22668    0.73       15.2     82.9            6.6        4.39        yes    hs diploma    43404
5     Blount    Alabama  58013    0.68       15.6     82.0            3.7        4.02        yes    hs diploma    47412
6     Bullock   Alabama  10309   -2.28       28.5     76.9            9.9        4.93        no     hs diploma    29655
7     Butler    Alabama  19825   -2.69       24.4     69.0           13.7        5.49        no     hs diploma    36326
8     Calhoun   Alabama  114728  -1.51       18.6     70.7           14.3        4.93        yes    some college  43686
9     Chambers  Alabama  33713   -1.20       18.8     71.4            8.7        4.08        no     hs diploma    37342
10    Cherokee  Alabama  25857   -0.60       16.1     77.5            4.3        4.05        no     hs diploma    40041
...
3142  Weston    Wyoming  6927    -2.93       14.4     77.9            6.5        3.98        no     some college  59605

Figure 1.5: Eleven rows from the county data set.

variable          description
name              County name.
state             State where the county resides, or the District of Columbia.
pop               Population in 2017.
pop change        Percent change in the population from 2010 to 2017. For example, the value 1.48 in the first row means the population for this county increased by 1.48% from 2010 to 2017.
poverty           Percent of the population in poverty.
homeownership     Percent of the population that lives in their own home or lives with the owner, e.g. children living with parents who own the home.
multi unit        Percent of living units that are in multi-unit structures, e.g. apartments.
unemp rate        Unemployment rate as a percent.
metro             Whether the county contains a metropolitan area.
median edu        Median education level, which can take a value among below hs, hs diploma, some college, and bachelors.
median hh income  Median household income for the county, where a household’s income equals the total income of its occupants who are 15 years or older.

Figure 1.6: Variables and their descriptions for the county data set.

1.2. DATA BASICS 15

1.2.2 Types of variables

Examine the unemp rate, pop, state, and median edu variables in the county data set. Each of these variables is inherently different from the other three, yet some share certain characteristics.

First consider unemp rate, which is said to be a numerical variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes don’t have any clear meaning.

The pop variable is also numerical, although it seems to be a little different than unemp rate. This variable of the population count can only take whole non-negative numbers (0, 1, 2, ...). For this reason, the population variable is said to be discrete since it can only take numerical values with jumps. On the other hand, the unemployment rate variable is said to be continuous.

The variable state can take up to 51 values after accounting for Washington, DC: AL, AK, ..., and WY. Because the responses themselves are categories, state is called a categorical variable, and the possible values are called the variable’s levels.

Finally, consider the median edu variable, which describes the median education level of county residents and takes values below hs, hs diploma, some college, or bachelors in each county. This variable seems to be a hybrid: it is a categorical variable but the levels have a natural ordering. A variable with these properties is called an ordinal variable, while a regular categorical variable without this type of special ordering is called a nominal variable. To simplify analyses, any ordinal variable in this book will be treated as a nominal (unordered) categorical variable.

all variables
    numerical
        continuous
        discrete
    categorical
        nominal (unordered categorical)
        ordinal (ordered categorical)

Figure 1.7: Breakdown of variables into their respective types.
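The breakdown in Figure 1.7 can be sketched in a few lines of Python. This is only an illustration, not part of the book: the values are copied from the first four rows of Figure 1.5, while the helper classify, the list EDU_LEVELS, and the underscore spellings of the variable names are hypothetical names introduced here. The helper guesses a type from how the values are stored, which is a rough heuristic: telephone area codes are stored as whole numbers, yet the text treats them as categorical, so the final call always rests on what the values mean.

```python
# Values copied from the first four rows of Figure 1.5.
unemp_rate = [3.86, 3.99, 5.90, 4.39]
pop = [55504, 212628, 25270, 22668]
state = ["Alabama", "Alabama", "Alabama", "Alabama"]
median_edu = ["some college", "some college", "hs diploma", "hs diploma"]

# Level order for median edu, lowest to highest, as listed in the text.
EDU_LEVELS = ["below hs", "hs diploma", "some college", "bachelors"]


def classify(values, ordered_levels=None):
    """Hypothetical helper: guess a variable's type from how it is stored."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete numerical"      # whole-number counts
    if all(is_number(v) for v in values):
        return "continuous numerical"    # values without forced jumps
    if ordered_levels is not None and all(v in ordered_levels for v in values):
        return "ordinal categorical"     # categories with a natural order
    return "nominal categorical"         # categories with no order


print(classify(unemp_rate))              # continuous numerical
print(classify(pop))                     # discrete numerical
print(classify(state))                   # nominal categorical
print(classify(median_edu, EDU_LEVELS))  # ordinal categorical

# Averaging is sensible for a numerical variable.
mean_unemp = sum(unemp_rate) / len(unemp_rate)

# Levels of an ordinal variable can be compared by their position in the order.
highest_edu = max(median_edu, key=EDU_LEVELS.index)
print(highest_edu)                       # some college
```

Calling classify(median_edu) without the level order returns "nominal categorical", which mirrors the book's simplification of treating ordinal variables as nominal.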

EXAMPLE 1.5

Data were collected about students in a statistics course. Three variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course. Classify each of the variables as continuous numerical, discrete numerical, or categorical.

The number of siblings and student height represent numerical variables. Because the number of siblings is a count, it is discrete. Height varies continuously, so it is a continuous numerical variable. The last variable classifies students into two categories – those who have and those who have not taken a statistics course – which makes this variable categorical.

GUIDED PRACTICE 1.6

An experiment is evaluating the effectiveness of a new drug in treating migraines. A group variable is used to indicate the experiment group for each patient: treatment or control. The num migraines variable represents the number of migraines the patient experienced during a 3-month period. Classify each variable as either numerical or categorical.7

7 The group variable can take just one of two group names, making it categorical. The num migraines variable describes a count of the number of migraines, an outcome for which basic arithmetic is sensible, so it is numerical; more specifically, since it represents a count, num migraines is a discrete numerical variable.

1.2.3 Relationships between variables

Many analyses are motivated by a researcher looking for a relationship between two or more variables. A social scientist may like to answer some of the following questions:

(1) If homeownership is lower than the national average in one county, will the percent of multi-unit structures in that county tend to be above or below the national average?

(2) Does a higher than average increase in county population tend to correspond to counties with higher or lower median household incomes?

(3) How useful a predictor is median education level for the median household income for US counties?

To answer these questions, data must be collected, such as the county data set shown in Figure 1.5. Examining summary statistics could provide insights for each of the three questions about counties. Additionally, graphs can be used to visually explore data.

Scatterplots are one type of graph used to study the relationship between two numerical variables. Figure 1.8 compares the variables homeownership and multi unit, which is the percent of units in multi-unit structures (e.g. apartments, condos). Each point on the plot represents a single county. For instance, the highlighted dot corresponds to County 413 in the county data set: Chattahoochee County, Georgia, which has 39.4% of units in multi-unit structures and a homeownership rate of 31.3%. The scatterplot suggests a relationship between the two variables: counties with a higher rate of multi-units tend to have lower homeownership rates. We might brainstorm as to why this relationship exists and investigate each idea to determine which are the most reasonable explanations.

[Scatterplot. Horizontal axis: Percent of Units in Multi-Unit Structures, 0% to 100%. Vertical axis: Homeownership Rate, with ticks from 20% to 80%. One dot is highlighted.]

Figure 1.8: A scatterplot of homeownership versus the percent of units that are in multi-unit structures for US counties. The highlighted dot represents Chattahoochee County, Georgia, which has a multi-unit rate of 39.4% and a homeownership rate of 31.3%.

The multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern. When two variables show some connection with one another, they are called associated variables. Associated variables can also be called dependent variables and vice-versa.

[Scatterplot. Horizontal axis: Median Household Income, $0 to $120k. Vertical axis: Population Change over 7 Years, with ticks from −10% to 20%. One dot is highlighted.]

Figure 1.9: A scatterplot showing pop change against median hh income. Owsley County, Kentucky, is highlighted; it lost 3.63% of its population from 2010 to 2017 and had a median household income of $22,736.

GUIDED PRACTICE 1.7

Examine the variables in the loan50 data set, which are described in Figure 1.4 on page 12. Create two questions about possible relationships between variables in loan50 that are of interest to you.8

EXAMPLE 1.8

This example examines the relationship between a county’s population change from 2010 to 2017 and median household income, which is visualized as a scatterplot in Figure 1.9. Are these variables associated?

The larger the median household income for a county, the higher the population growth observed for the county. While this trend isn’t true for every county, the trend in the plot is evident. Since there is some relationship between the variables, they are associated.

Because there is a downward trend in Figure 1.8 – counties with more units in multi-unit structures are associated with lower homeownership – these variables are said to be negatively associated. A positive association is shown in the relationship between the median hh income and pop change in Figure 1.9, where counties with higher median household income tend to have higher rates of population growth.
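The direction of an association can also be checked numerically. The sketch below uses the sample correlation coefficient, a summary this section does not introduce, purely to read off a sign: negative for a downward trend, positive for an upward one. The data are the eleven rows printed in Figure 1.5, which are mostly Alabama counties and so are not representative of all 3,142 counties; the function names correlation and direction and the underscore spellings of the variable names are assumptions made for this illustration.

```python
from math import sqrt

# Columns copied from the eleven rows shown in Figure 1.5.
multi_unit = [7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3, 6.5]
homeownership = [77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4, 77.5, 77.9]
pop_change = [1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, -1.20, -0.60, -2.93]
median_hh_income = [55317, 52562, 33368, 43404, 47412, 29655,
                    36326, 43686, 37342, 40041, 59605]


def correlation(x, y):
    """Sample correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def direction(r):
    """Label the sign of a correlation as a direction of association."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


r_housing = correlation(multi_unit, homeownership)
r_income = correlation(median_hh_income, pop_change)

print(direction(r_housing))  # negative
print(direction(r_income))   # positive
```

Even in this small sample the signs agree with the trends described for Figures 1.8 and 1.9. A correlation only captures straight-line trends, so a value near zero would not by itself show that two variables are independent.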

If two variables are not associated, then they are said to be independent. That is, two variables are independent if there is no evident relationship between the two.

ASSOCIATED OR INDEPENDENT, NOT BOTH

A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.
