14.2 choosing among linear quadratic and exponential models answers

18/10/2021 Client: muhammad11 Deadline: 2 Day

Questions On Logistic Regression Using R

From the textbook attached below, I need solutions to the following questions from Chapter 10: Logistic Regression

Problem 1: Financial conditions of banks (b, c, e)

Problem 4: Competitive auctioning on eBay (a, b, c, d, e, f)

DATA MINING FOR BUSINESS ANALYTICS

Concepts, Techniques, and Applications in R

Galit Shmueli

Peter C. Bruce

Inbal Yahav

Nitin R. Patel

Kenneth C. Lichtendahl, Jr.

This edition first published 2018

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl Jr. to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data applied for

Hardback: 9781118879368

Cover Design: Wiley

Cover Image: © Achim Mittler, Frankfurt am Main/Gettyimages

Set in 11.5/14.5pt BemboStd by Aptara Inc., New Delhi, India

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1

The beginning of wisdom is this:

Get wisdom, and whatever else you get, get insight.

– Proverbs 4:7

Contents

Foreword by Gareth James

Foreword by Ravi Bapna

Preface to the R Edition

Acknowledgments

PART I PRELIMINARIES

CHAPTER 1 Introduction

1.1 What Is Business Analytics?

1.2 What Is Data Mining?

1.3 Data Mining and Related Terms

1.4 Big Data

1.5 Data Science

1.6 Why Are There So Many Different Methods?

1.7 Terminology and Notation

1.8 Road Maps to This Book

Order of Topics

CHAPTER 2 Overview of the Data Mining Process

2.1 Introduction

2.2 Core Ideas in Data Mining

Classification

Prediction

Association Rules and Recommendation Systems

Predictive Analytics

Data Reduction and Dimension Reduction

Data Exploration and Visualization

Supervised and Unsupervised Learning

2.3 The Steps in Data Mining

2.4 Preliminary Steps

Organization of Datasets

Predicting Home Values in the West Roxbury Neighborhood

Loading and Looking at the Data in R

Sampling from a Database

Oversampling Rare Events in Classification Tasks

Preprocessing and Cleaning the Data

2.5 Predictive Power and Overfitting

Overfitting

Creation and Use of Data Partitions

2.6 Building a Predictive Model

Modeling Process

2.7 Using R for Data Mining on a Local Machine

2.8 Automating Data Mining Solutions

Data Mining Software: The State of the Market (by Herb Edelstein)

Problems

PART II DATA EXPLORATION AND DIMENSION REDUCTION

CHAPTER 3 Data Visualization

3.1 Uses of Data Visualization

Base R or ggplot?

3.2 Data Examples

Example 1: Boston Housing Data

Example 2: Ridership on Amtrak Trains

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

Distribution Plots: Boxplots and Histograms

Heatmaps: Visualizing Correlations and Missing Values

3.4 Multidimensional Visualization

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation

Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering

Reference: Trend Lines and Labels

Scaling up to Large Datasets

Multivariate Plot: Parallel Coordinates Plot

Interactive Visualization

3.5 Specialized Visualizations

Visualizing Networked Data

Visualizing Hierarchical Data: Treemaps

Visualizing Geographical Data: Map Charts

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal

Prediction

Classification

Time Series Forecasting

Unsupervised Learning

Problems

CHAPTER 4 Dimension Reduction

4.1 Introduction

4.2 Curse of Dimensionality

4.3 Practical Considerations

Example 1: House Prices in Boston

4.4 Data Summaries

Summary Statistics

Aggregation and Pivot Tables

4.5 Correlation Analysis

4.6 Reducing the Number of Categories in Categorical Variables

4.7 Converting a Categorical Variable to a Numerical Variable

4.8 Principal Components Analysis

Example 2: Breakfast Cereals

Principal Components

Normalizing the Data

Using Principal Components for Classification and Prediction

4.9 Dimension Reduction Using Regression Models

4.10 Dimension Reduction Using Classification and Regression Trees

Problems

PART III PERFORMANCE EVALUATION

CHAPTER 5 Evaluating Predictive Performance

5.1 Introduction

5.2 Evaluating Predictive Performance

Naive Benchmark: The Average

Prediction Accuracy Measures

Comparing Training and Validation Performance

Lift Chart

5.3 Judging Classifier Performance

Benchmark: The Naive Rule

Class Separation

The Confusion (Classification) Matrix

Using the Validation Data

Accuracy Measures

Propensities and Cutoff for Classification

Performance in Case of Unequal Importance of Classes

Asymmetric Misclassification Costs

Generalization to More Than Two Classes

5.4 Judging Ranking Performance

Lift Charts for Binary Data

Decile Lift Charts

Beyond Two Classes

Lift Charts Incorporating Costs and Benefits

Lift as a Function of Cutoff

5.5 Oversampling

Oversampling the Training Set

Evaluating Model Performance Using a Non-oversampled Validation Set

Evaluating Model Performance if Only Oversampled Validation Set Exists

Problems

PART IV PREDICTION AND CLASSIFICATION METHODS

CHAPTER 6 Multiple Linear Regression

6.1 Introduction

6.2 Explanatory vs. Predictive Modeling

6.3 Estimating the Regression Equation and Prediction

Example: Predicting the Price of Used Toyota Corolla Cars

6.4 Variable Selection in Linear Regression

Reducing the Number of Predictors

How to Reduce the Number of Predictors

Problems

CHAPTER 7 k-Nearest Neighbors (kNN)

7.1 The k-NN Classifier (Categorical Outcome)

Determining Neighbors

Classification Rule

Example: Riding Mowers

Choosing k

Setting the Cutoff Value

k-NN with More Than Two Classes

Converting Categorical Variables to Binary Dummies

7.2 k-NN for a Numerical Outcome

7.3 Advantages and Shortcomings of k-NN Algorithms

Problems

CHAPTER 8 The Naive Bayes Classifier

8.1 Introduction

Cutoff Probability Method

Conditional Probability

Example 1: Predicting Fraudulent Financial Reporting

8.2 Applying the Full (Exact) Bayesian Classifier

Using the “Assign to the Most Probable Class” Method

Using the Cutoff Probability Method

Practical Difficulty with the Complete (Exact) Bayes Procedure

Solution: Naive Bayes

The Naive Bayes Assumption of Conditional Independence

Using the Cutoff Probability Method

Example 2: Predicting Fraudulent Financial Reports, Two Predictors

Example 3: Predicting Delayed Flights

8.3 Advantages and Shortcomings of the Naive Bayes Classifier

Problems

CHAPTER 9 Classification and Regression Trees

9.1 Introduction

9.2 Classification Trees

Recursive Partitioning

Example 1: Riding Mowers

Measures of Impurity

Tree Structure

Classifying a New Record

9.3 Evaluating the Performance of a Classification Tree

Example 2: Acceptance of Personal Loan

9.4 Avoiding Overfitting

Stopping Tree Growth: Conditional Inference Trees

Pruning the Tree

Cross-Validation

Best-Pruned Tree

9.5 Classification Rules from Trees

9.6 Classification Trees for More Than Two Classes

9.7 Regression Trees

Prediction

Measuring Impurity

Evaluating Performance

9.8 Improving Prediction: Random Forests and Boosted Trees

Random Forests

Boosted Trees

9.9 Advantages and Weaknesses of a Tree

Problems

CHAPTER 10 Logistic Regression

10.1 Introduction

10.2 The Logistic Regression Model

10.3 Example: Acceptance of Personal Loan

Model with a Single Predictor

Estimating the Logistic Model from Data: Computing Parameter Estimates

Interpreting Results in Terms of Odds (for a Profiling Goal)

10.4 Evaluating Classification Performance

Variable Selection

10.5 Example of Complete Analysis: Predicting Delayed Flights

Data Preprocessing

Model-Fitting and Estimation

Model Interpretation

Model Performance

Variable Selection

10.6 Appendix: Logistic Regression for Profiling

Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome

Appendix B: Evaluating Explanatory Power

Appendix C: Logistic Regression for More Than Two Classes

Problems

CHAPTER 11 Neural Nets

11.1 Introduction

11.2 Concept and Structure of a Neural Network

11.3 Fitting a Network to Data

Example 1: Tiny Dataset

Computing Output of Nodes

Preprocessing the Data

Training the Model

Example 2: Classifying Accident Severity

Avoiding Overfitting

Using the Output for Prediction and Classification

11.4 Required User Input

11.5 Exploring the Relationship Between Predictors and Outcome

11.6 Advantages and Weaknesses of Neural Networks

Problems

CHAPTER 12 Discriminant Analysis

12.1 Introduction

Example 1: Riding Mowers

Example 2: Personal Loan Acceptance

12.2 Distance of a Record from a Class

12.3 Fisher’s Linear Classification Functions

12.4 Classification Performance of Discriminant Analysis

12.5 Prior Probabilities

12.6 Unequal Misclassification Costs

12.7 Classifying More Than Two Classes

Example 3: Medical Dispatch to Accident Scenes

12.8 Advantages and Weaknesses

Problems

CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling

13.1 Ensembles

Why Ensembles Can Improve Predictive Power

Simple Averaging

Bagging

Boosting

Bagging and Boosting in R

Advantages and Weaknesses of Ensembles

13.2 Uplift (Persuasion) Modeling

A-B Testing

Uplift

Gathering the Data

A Simple Model

Modeling Individual Uplift

Computing Uplift with R

Using the Results of an Uplift Model

13.3 Summary

Problems

PART V MINING RELATIONSHIPS AMONG RECORDS

CHAPTER 14 Association Rules and Collaborative Filtering

14.1 Association Rules

Discovering Association Rules in Transaction Databases

Example 1: Synthetic Data on Purchases of Phone Faceplates

Generating Candidate Rules

The Apriori Algorithm

Selecting Strong Rules

Data Format

The Process of Rule Selection

Interpreting the Results

Rules and Chance

Example 2: Rules for Similar Book Purchases

14.2 Collaborative Filtering

Data Type and Format

Example 3: Netflix Prize Contest

User-Based Collaborative Filtering: “People Like You”

Item-Based Collaborative Filtering

Advantages and Weaknesses of Collaborative Filtering

Collaborative Filtering vs. Association Rules

14.3 Summary

Problems

CHAPTER 15 Cluster Analysis

15.1 Introduction

Example: Public Utilities

15.2 Measuring Distance Between Two Records

Euclidean Distance

Normalizing Numerical Measurements

Other Distance Measures for Numerical Data

Distance Measures for Categorical Data

Distance Measures for Mixed Data

15.3 Measuring Distance Between Two Clusters

Minimum Distance

Maximum Distance

Average Distance

Centroid Distance

15.4 Hierarchical (Agglomerative) Clustering

Single Linkage

Complete Linkage

Average Linkage

Centroid Linkage

Ward’s Method

Dendrograms: Displaying Clustering Process and Results

Validating Clusters

Limitations of Hierarchical Clustering

15.5 Non-Hierarchical Clustering: The k-Means Algorithm

Choosing the Number of Clusters (k)

Problems

PART VI FORECASTING TIME SERIES

CHAPTER 16 Handling Time Series

16.1 Introduction

16.2 Descriptive vs. Predictive Modeling

16.3 Popular Forecasting Methods in Business

Combining Methods

16.4 Time Series Components

Example: Ridership on Amtrak Trains

16.5 Data-Partitioning and Performance Evaluation

Benchmark Performance: Naive Forecasts

Generating Future Forecasts

Problems

CHAPTER 17 Regression-Based Forecasting

17.1 A Model with Trend

Linear Trend

Exponential Trend

Polynomial Trend

17.2 A Model with Seasonality

17.3 A Model with Trend and Seasonality

17.4 Autocorrelation and ARIMA Models

Computing Autocorrelation

Improving Forecasts by Integrating Autocorrelation Information

Evaluating Predictability

Problems

CHAPTER 18 Smoothing Methods

18.1 Introduction

18.2 Moving Average

Centered Moving Average for Visualization

Trailing Moving Average for Forecasting

Choosing Window Width (w)

18.3 Simple Exponential Smoothing

Choosing Smoothing Parameter α

Relation Between Moving Average and Simple Exponential Smoothing

18.4 Advanced Exponential Smoothing

Series with a Trend

Series with a Trend and Seasonality

Series with Seasonality (No Trend)

Problems

PART VII DATA ANALYTICS

CHAPTER 19 Social Network Analytics

19.1 Introduction

19.2 Directed vs. Undirected Networks

19.3 Visualizing and Analyzing Networks

Graph Layout

Edge List

Adjacency Matrix

Using Network Data in Classification and Prediction

19.4 Social Data Metrics and Taxonomy

Node-Level Centrality Metrics

Egocentric Network

Network Metrics

19.5 Using Network Metrics in Prediction and Classification

Link Prediction

Entity Resolution

Collaborative Filtering

19.6 Collecting Social Network Data with R

19.7 Advantages and Disadvantages

Problems

CHAPTER 20 Text Mining

20.1 Introduction

20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words”

20.3 Bag-of-Words vs. Meaning Extraction at Document Level

20.4 Preprocessing the Text

Tokenization

Text Reduction

Presence/Absence vs. Frequency

Term Frequency–Inverse Document Frequency (TF-IDF)

From Terms to Concepts: Latent Semantic Indexing

Extracting Meaning

20.5 Implementing Data Mining Methods

20.6 Example: Online Discussions on Autos and Electronics

Importing and Labeling the Records

Text Preprocessing in R

Producing a Concept Matrix

Fitting a Predictive Model

Prediction

20.7 Summary

Problems

PART VIII CASES

CHAPTER 21 Cases

21.1 Charles Book Club

The Book Industry

Database Marketing at Charles

Data Mining Techniques

Assignment

21.2 German Credit

Background

Data

Assignment

21.3 Tayko Software Cataloger

Background

The Mailing Experiment

Data

Assignment

21.4 Political Persuasion

Background

Predictive Analytics Arrives in US Politics

Political Targeting

Uplift

Data

Assignment

21.5 Taxi Cancellations

Business Situation

Assignment

21.6 Segmenting Consumers of Bath Soap

Business Situation

Key Problems

Data

Measuring Brand Loyalty

Assignment

21.7 Direct-Mail Fundraising

Background

Data

Assignment

21.8 Catalog Cross-Selling

Background

Assignment

21.9 Predicting Bankruptcy

Predicting Corporate Bankruptcy

Assignment

21.10 Time Series Case: Forecasting Public Transportation Demand

Background

Problem Description

Available Data

Assignment Goal

Assignment

Tips and Suggested Steps

References

Data Files Used in the Book

Index

Foreword by Gareth James

The field of statistics has existed in one form or another for 200 years, and by the second half of the 20th century had evolved into a well-respected and essential academic discipline. However, its prominence expanded rapidly in the 1990s with the explosion of new, and enormous, data sources. For the first part of this century, much of this attention was focused on biological applications, in particular, genetics data generated as a result of the sequencing of the human genome. However, the last decade has seen a dramatic increase in the availability of data in the business disciplines, and a corresponding interest in business-related statistical applications.

The impact has been profound. Ten years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives. Today, we offer a Masters in Business Analytics, which is the largest specialized masters program in the school and has application volume rivaling those of our MBA programs. Our department’s faculty size and course offerings have increased dramatically, yet the MBA students are still complaining that the classes are all full. Google’s chief economist, Hal Varian, was indeed correct in 2009 when he stated that “the sexy job in the next 10 years will be statisticians.”

This demand is driven by a simple, but undeniable, fact. Business analytics solutions have produced significant and measurable improvements in business performance, on multiple dimensions and in numerous settings, and as a result, there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills is challenging given that, in addition to the obvious required knowledge of statistical methods, they need to understand business-related issues, possess strong communication skills, and be comfortable dealing with multiple computational packages. Most statistics texts concentrate on abstract training in classical methods, without much emphasis on practical, let alone business, applications.

This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through to modern methods like neural networks, bagging and boosting, and even much more business specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject. However, just as important as the list of topics, is the way that they are all presented in an applied fashion using business applications. Indeed the last chapter is entirely dedicated to 10 separate cases where business analytics approaches can be applied.

In this latest edition, the authors have added an important new dimension in the form of the R software package. Easily the most widely used and influential open source statistical software, R has become the go-to tool for such purposes. With literally hundreds of freely available add-on packages, R can be used for almost any business analytics related problem. The book provides detailed descriptions and code involving applications of R in numerous business settings, ensuring that the reader will actually be able to apply their knowledge to real-life problems.

We recently introduced a business analytics course into our required MBA core curriculum and I intend to make heavy use of this book in developing the syllabus. I’m confident that it will be an indispensable tool for any such course.

GARETH JAMES

Marshall School of Business, University of Southern California, 2017

Foreword by Ravi Bapna

Data is the new gold—and mining this gold to create business value in today’s context of a highly networked and digital society requires a skillset that we haven’t traditionally delivered in business or statistics or engineering programs on their own. For those businesses and organizations that feel overwhelmed by today’s Big Data, the phrase you ain’t seen nothing yet comes to mind. Yesterday’s three major sources of Big Data—the 20+ years of investment in enterprise systems (ERP, CRM, SCM, …), the 3 billion plus people on the online social grid, and the close to 5 billion people carrying increasingly sophisticated mobile devices—are going to be dwarfed by tomorrow’s smarter physical ecosystems fueled by the Internet of Things (IoT) movement.

The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, even garbage bins and streetlights, to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store’s mobile app’s shopping list, and arranging a Task Rabbit to do a grocery run for you. Or the refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you. Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask, who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources, and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision-making. Such individuals are a rare commodity today, but their creation has been the focus of this book for a decade now. This book’s forte is that it relies on explaining the core set of concepts required for today’s business analytics professionals using real-world data-rich cases in a hands-on manner, without sacrificing academic rigor. It provides a modern day foundation for Business Analytics, the notion of linking the x’s to the y’s of interest in a predictive sense. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

I can’t say enough about the long-awaited R edition. R is my go-to platform for analytics these days. It’s also used by a wide variety of instructors in our MS-Business Analytics program. The open-innovation paradigm used by R is one key part of the analytics perfect storm, the other components being the advances in computing and the business appetite for data-driven decision-making.

I look forward to using the book in multiple fora, in executive education, in MBA classrooms, in MS-Business Analytics programs, and in Data Science bootcamps. I trust you will too!

RAVI BAPNA

Carlson School of Management, University of Minnesota, 2017

Preface to the R Edition

This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, ranging from dedicated data mining classes to more general business analytics courses (including our own experience teaching this material both online and in person for more than 10 years). The first edition, based on the Excel add-in XLMiner, was followed by two more XLMiner editions, a JMP edition, and now this R edition, with its companion website, www.dataminingbook.com.

This new R edition, which relies on the free and open-source R software, presents output from R, as well as the code used to produce that output, including specification of a variety of packages and functions. Unlike computer-science or statistics-oriented textbooks, the focus in this book is on data mining concepts, and how to implement the associated algorithms in R. We assume a basic facility with R.

For this R edition, two new co-authors stepped on board—Inbal Yahav and Casey Lichtendahl—bringing both expertise teaching business analytics courses using R and data mining consulting experience in business and government. Such practical experience is important, since the open-source nature of R software makes available a plethora of approaches, packages, and functions for data mining. Given the main goal of this book—to introduce data mining concepts using R software for illustration—our challenge was to choose an R code cocktail that supports highlighting the important concepts. In addition to providing R code and output, this edition also incorporates updates and new material based on feedback from instructors teaching MBA, undergraduate, diploma, and executive courses, and from their students as well.

One update, compared to the first two editions of the book, is the title: we now use Business Analytics in place of Business Intelligence. This reflects the change in terminology since the second edition: Business Intelligence today refers mainly to reporting and data visualization (“what is happening now”), while Business Analytics has taken over the “advanced analytics,” which include predictive analytics and data mining. In this new edition, we therefore use the updated terms.

This R edition includes the material that was recently added in the third edition of the original (XLMiner-based) book:

• Social network analysis

• Text mining

• Ensembles

• Uplift modeling

• Collaborative filtering

Since the appearance of the (XLMiner-based) second edition, the landscape of the courses using the textbook has greatly expanded: whereas initially, the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in Business Analytics degrees and certificate programs, ranging from undergraduate programs, to post-graduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many cases, this textbook is used across multiple courses. The book is designed to continue supporting the general “Predictive Analytics” or “Data Mining” course as well as supporting a set of courses in dedicated business analytics programs.

A general “Business Analytics,” “Predictive Analytics,” or “Data Mining” course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I–III, and choose a subset of methods from Parts IV and V. Instructors can choose to use cases as team assignments, class discussions, or projects. For a two-semester course, Part VI might be considered, and we recommend introducing the new Part VII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

Predictive Analytics: Supervised Learning In a dedicated Business Analytics program, the topic of Predictive Analytics is typically instructed across a set of courses. The first course would cover Parts I–IV and instructors typically choose a subset of methods from Part IV according to the course length. We recommend including the new Chapter 13 in such a course, as well as the new “Part VII: Data Analytics.”

Predictive Analytics: Unsupervised Learning This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts III and V). If this course follows the Predictive Analytics: Supervised Learning course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as the new part on “Data Analytics.”

Forecasting Analytics A dedicated course on time series forecasting would rely on Part VI.

Advanced Analytics A course that integrates the learnings from Predictive Analytics (supervised and unsupervised learning). Such a course can focus on Part VII: Data Analytics, where social network analytics and text mining are introduced. Some instructors choose to use the Cases (Chapter 21) in such a course.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many data mining competition datasets available). From our experience and other instructors’ experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of data mining and the challenges that arise in the process.

Acknowledgments

We thank the many people who assisted us in improving the first three editions of the initial XLMiner version of this book and the JMP edition, as well as those who helped with comments on early drafts of this R edition. Anthony Babinec, who has been using earlier editions of this book for years in his data mining courses at Statistics.com, provided us with detailed and expert corrections. Dan Toy and John Elder IV greeted our project with early enthusiasm and provided detailed and useful comments on initial drafts. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, has provided invaluable comments and helpful suggestions since the book’s start.

Many of the instructors, teaching assistants, and students using earlier editions of the book have contributed invaluable feedback both directly and indirectly, through fruitful discussions, learning journeys, and interesting data mining projects that have helped shape and improve the book. These include MBA students from the University of Maryland, MIT, the Indian School of Business, National Tsing Hua University, and Statistics.com. Instructors from many universities and teaching programs, too numerous to list, have supported and helped improve the book since its inception.

Several professors have been especially helpful with this R edition: Hayri Tongarlak, Prashant Joshi, Jay Annadatha, Roger Bohn, and Sridhar Vaithianathan provided detailed comments and R code files for the companion website; Scott Nestler has been a helpful friend of this book project from the beginning.

Kuber Deokar, instructional operations supervisor at Statistics.com, has been unstinting in his assistance, support, and detailed attention. We also thank Shweta Jadhav and Dhanashree Vishwasrao, assistant teachers. Valerie Troiano has shepherded many instructors and students through the Statistics.com courses that have helped nurture the development of these books.

Colleagues and family members have been providing ongoing feedback and assistance with this book project. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on the first two editions; Bruce McCullough and Adam Hughes did the same for the first edition. Noa Shmueli provided careful proofs of the third edition. Ran Shenberger offered design tips.

Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided the scenario and data for the section on uplift modeling. We also thank Jen Golbeck, director of the Social Intelligence Lab at the University of Maryland and author of Analyzing the Social Web, whose book inspired our presentation in the chapter on social network analytics. Randall Pruim contributed extensively to the chapter on visualization.

Marietta Tretter at Texas A&M shared comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman provided feedback and suggestions on the data visualization chapter and overall design tips.

Susan Palocsay and Mia Stephens have provided suggestions and feedback on numerous occasions, as has Margret Bjarnadottir. We also thank Catherine Plaisant at the University of Maryland’s Human–Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter. Gregory Piatetsky-Shapiro, founder of KDNuggets.com, has been generous with his time and counsel over the many years of this project.

This book would not have seen the light of day without the nurturing support of the faculty at the Sloan School of Management at MIT. Our special thanks to Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering.

Colleagues at the University of Maryland’s Smith School of Business, Shrivardhan Lele, Wolfgang Jank, and Paul Zantek, provided practical advice and comments. We thank Robert Windle, and University of Maryland MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts.

Anand Bodapati provided both data and advice. Jake Hofman from Microsoft Research and Sharad Borle assisted with data access. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. Over a decade ago, they helped us develop a balanced philosophical perspective on the emerging field of data mining.

Lastly, we thank the folks at Wiley for the decade-long successful journey of this book. Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book to the starting gate. Jon Gurstelle, Kathleen Pagliaro, and Katrina Maceda greatly assisted us in pushing ahead and finalizing this R edition. We are also especially grateful to Amy Hendrickson, who assisted with typesetting and making this book beautiful.

Part I

Preliminaries

CHAPTER 1

Introduction

1.1 What Is Business Analytics?

Business Analytics (BA) is the practice and art of bringing quantitative data to bear on decision-making. The term means different things to different organizations.

Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working-class readership in Britain had launched a web version of the paper, and did tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. This simple application, for this company, was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals—say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon’s budget.

Business Analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful applications involve little more than counting, rule-checking, and basic arithmetic. For some organizations, this is what is meant by analytics.

The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and reporting for understanding “what happened and what is happening.” This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user-friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real-time data, but also to directly interact with it. Effective dashboards are those that tie directly into company data, and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition. Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr. © 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.

Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and data mining algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify “on average” relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week’s web traffic).

Readers familiar with earlier editions of this book may have noticed that the book title has changed from Data Mining for Business Intelligence to Data Mining for Business Analytics in this edition. The change reflects the more recent term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting.

WHO USES PREDICTIVE ANALYTICS?

The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations’ capabilities throughout the economy. A few examples:

Credit scoring: One long-established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of credit-worthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior.

Future purchases: A more recent (and controversial) example is Target’s use of predictive modeling to classify sales prospects as “pregnant” or “not-pregnant.” Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream.

Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most-likely tax cheats (Siegel, 2013).

The Business Analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A-B testing. These are often used for pricing decisions:

• Orbitz, the travel site, found that it could price hotel options higher for Mac users than Windows users.

• Staples online store found it could charge more for staplers if a customer lived far from a Staples store.
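As a rough illustration of how the outcome of such a pricing experiment might be checked, the sketch below compares purchase rates under two price variants with a two-proportion z-test. It is written in Python using only the standard library rather than R. The function name and all visitor and purchase counts are hypothetical and are not drawn from the Orbitz or Staples cases.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical experiment: variant A shows the current price, variant B a new price
z, p_value = two_proportion_z_test(conv_a=120, n_a=2000, conv_b=150, n_b=2000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

A small p-value would suggest that the difference in purchase rates is unlikely to be due to chance alone. Whether the difference matters for the business is a separate question that depends on margins and volume.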

Beware the organizational setting where analytics is a solution in search of a problem: A manager, knowing that business analytics and data mining are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and data mining requires both an understanding of the business context where value is to be captured, and an understanding of exactly what the data mining methods do.

1.2 What Is Data Mining?

In this book, data mining refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine-learning methods that inform decision-making, often in an automated fashion. Prediction is typically an important component, often at the individual level. Rather than “what is the relationship between advertising and sales,” we might be interested in “what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?” Or we might be interested in clustering customers into different “personas” that receive different marketing treatment, then assigning each new prospect to one of these personas.

The era of Big Data has accelerated the use of data mining. Data mining methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.

1.3 Data Mining and Related Terms

The field of analytics is growing rapidly, both in terms of the breadth of applications, and in terms of the number of organizations using advanced analytics. As a result, there is considerable overlap and inconsistency of definitions.

The term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. One major consulting firm has a “data mining department,” but its responsibilities are in the area of studying and graphing past data in search of general trends. And, to confuse matters, their more advanced predictive models are the responsibility of an “advanced analytics department.” Other terms that organizations use are predictive analytics, predictive modeling, and machine learning.

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. But the core tenets of classical statistics (computing is difficult and data are scarce) do not apply in data mining applications where both data and computing power are plentiful.

This gives rise to Daryl Pregibon’s description of data mining as “statistics at scale and speed” (Pregibon, 1999). Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an “average effect”—for example, “a $1 price increase will reduce average demand by 2 boxes.” In contrast, the focus in machine learning is on predicting individual records—“the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes.” The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance in our sample) is absent from data mining.

In comparison to statistics, data mining deals with large datasets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require. As a result, the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.
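Overfitting can be seen in a small simulation. The sketch below is only an illustration under assumed settings (simulated data, an arbitrary seed, and a degree-8 polynomial chosen as the "too flexible" model); none of these values come from the text. It fits a straight line and a flexible polynomial to half of the records and compares their errors on the half that was held out.

```r
# Simulated data: a linear signal plus random noise (all values are made up).
set.seed(1)
n <- 40
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# Hold out half of the records; the models never see them while fitting.
train_idx <- sample(n, n / 2)
train   <- data.frame(x = x[train_idx],  y = y[train_idx])
holdout <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A simple model (straight line) and a flexible one (degree-8 polynomial).
simple   <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 8), data = train)

errors <- data.frame(
  model   = c("linear", "degree-8 polynomial"),
  train   = c(rmse(train$y, fitted(simple)),
              rmse(train$y, fitted(flexible))),
  holdout = c(rmse(holdout$y, predict(simple, holdout)),
              rmse(holdout$y, predict(flexible, holdout)))
)
print(errors)
```

The flexible model always matches the training records at least as closely as the straight line, because the line is a special case of the polynomial. Its holdout error is typically larger, which is the sign that it has fit the noise rather than the signal.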

In this book, we use the term machine learning to refer to algorithms that learn directly from data, especially local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to methods that apply global structure to the data. A simple example is a linear regression model (statistical) vs. a k-nearest-neighbors algorithm (machine learning). A given record would be treated by linear regression in accord with an overall linear equation that applies to all the records. In k-nearest neighbors, that record would be classified in accord with the values of a small number of nearby records.
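The contrast between global structure and local patterns can be sketched in base R. The data are simulated, and `knn_predict` is a hypothetical helper written only for this illustration (it is not a function from any package). Because the outcome here is numerical, the nearest neighbors are averaged; for a class outcome, a majority vote among the neighbors would be used instead.

```r
# Simulated training data with a nonlinear relationship (values are made up).
set.seed(2)
train <- data.frame(x = runif(50, 0, 10))
train$y <- sin(train$x) + rnorm(50, sd = 0.2)

# Statistical model: one global equation, y = b0 + b1 * x, applied to all records.
global_fit <- lm(y ~ x, data = train)

# Machine learning style: each prediction uses only the k nearest training records.
# knn_predict is a hypothetical helper written for this sketch.
knn_predict <- function(x_new, train, k = 5) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(train$x - x0))[1:k]  # indices of the k closest records
    mean(train$y[nearest])                    # local average of their outcomes
  })
}

new_records <- data.frame(x = c(1.5, 4.7, 8.0))
comparison <- data.frame(
  x      = new_records$x,
  linear = predict(global_fit, new_records),
  knn    = knn_predict(new_records$x, train, k = 5)
)
print(comparison)
```

The value k = 5 is an arbitrary choice for the example. Setting k equal to the number of training records makes every prediction the overall mean, which shows how k controls how "local" the method is.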

Lastly, many practitioners, particularly those from the IT and computer science communities, use the term machine learning to refer to all the methods discussed in this book.

1.4 Big Data

Data mining and Big Data go hand in hand. Big Data is a relative term: data today are big by reference to the past, and to the methods and devices available to deal with them. The challenge Big Data presents is often characterized by the four V’s: volume, velocity, variety, and veracity. Volume refers to the amount of data. Velocity refers to the flow rate, that is, the speed at which it is being generated and changed. Variety refers to the different types of data being generated (currency, dates, numbers, text, etc.). Veracity refers to the fact that data is being generated by organic distributed processes (e.g., millions of people signing up for services or free downloads) and not subject to the controls or quality checks that apply to data collected for a study.

Most large organizations face both the challenge and the opportunity of Big Data because most routine data processes now generate data that can be stored and, possibly, analyzed. The scale can be visualized by comparing the data in a traditional statistical analysis (say, 15 variables and 5000 records) to the Walmart database. If you consider the traditional statistical study to be the size of a period at the end of a sentence, then the Walmart database is the size of a football field. And that probably does not include other data associated with Walmart—social media data, for example, which comes in the form of unstructured text.

If the analytical challenge is substantial, so can be the reward:

• OKCupid, the online dating site, uses statistical models with their data to predict what forms of message content are most likely to produce a response.

• Telenor, a Norwegian mobile phone service company, was able to reduce subscriber turnover 37% by using models to predict which customers were most likely to leave, and then lavishing attention on them.

• Allstate, the insurance company, tripled the accuracy of predicting injury liability in auto claims by incorporating more information about vehicle type.

The above examples are from Eric Siegel’s book Predictive Analytics (2013, Wiley).

Some extremely valuable tasks were not even feasible before the era of Big Data. Consider web searches, the technology on which Google was built. In early days, a search for “Ricky Ricardo Little Red Riding Hood” would have yielded various links to the I Love Lucy TV show, other links to Ricardo’s career as a band leader, and links to the children’s story of Little Red Riding Hood. Only once the Google database had accumulated sufficient data (including records of what users clicked on) would the search yield, in the top position, links to the specific I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and English, Little Red Riding Hood for his infant son.

1.5 Data Science

The ubiquity, size, value, and importance of Big Data have given rise to a new profession: the data scientist. Data science is a mix of skills in the areas of statistics, machine learning, math, programming, business, and IT. The term itself is thus broader than the other concepts we discussed above, and it is a rare individual who combines deep skills in all the constituent areas. In their book Analyzing the Analyzers (Harris et al., 2013), the authors describe the skill sets of most data scientists as resembling a ‘T’: deep in one area (the vertical bar of the T), and shallower in other areas (the top of the T).

At a large data science conference session (Strata+Hadoop World, October 2014), most attendees felt that programming was an essential skill, though there was a sizable minority who felt otherwise. And, although Big Data is the motivating power behind the growth of data science, most data scientists do not actually spend most of their time working with terabyte-size or larger data.

Data of the terabyte or larger size would be involved at the deployment stage of a model. There are manifold challenges at that stage, most of them IT and programming issues related to data-handling and tying together different components of a system. Much work must precede that phase. It is that earlier piloting and prototyping phase on which this book focuses: developing the statistical and machine learning models that will eventually be plugged into a deployed system. What methods do you use with what sorts of data and problems? How do the methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?

1.6 Why Are There So Many Different Methods?

As can be seen in this book or any other resource on data mining, there are many different methods for prediction and classification. You might ask yourself why they coexist, and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line.

FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS

Different methods can lead to different results, and their performance can vary. It is therefore customary in data mining to apply several different methods and select the one that appears most useful for the goal at hand.

1.7 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or target variable. To a statistician, it is the dependent variable or the response. Here is a summary of terms used:

Algorithm: A specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.

Attribute: see Predictor.

Case: see Observation.

Confidence: A performance measure in association rules of the type “IF A and B are purchased, THEN C is also purchased.” Confidence is the conditional probability that C will be purchased IF A and B are purchased.

Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent Variable: see Response.

Estimation: see Prediction.

Feature: see Predictor.

Holdout Data (or holdout set): A sample of data not used in fitting a model, but instead used to assess the performance of that model. This book uses the terms validation set and test set instead of holdout set.

Input Variable: see Predictor.

Model: An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).

Observation: The unit of analysis on which the measurements are taken (a customer, a transaction, etc.), also called instance, sample, example, case, record, pattern, or row. In spreadsheets, each row typically represents a record; each column, a variable. Note that the use of the term “sample” here is different from its usual meaning in statistics, where it refers to a collection of observations.

Outcome Variable: see Response.

Output Variable: see Response.

P(A | B): The conditional probability of event A occurring given that event B has occurred, read as “the probability that A will occur given that B has occurred.”

Prediction: The prediction of the numerical value of a continuous output variable; also called estimation.

Predictor: A variable, usually denoted by X, used as an input into a predictive model, also called a feature, input variable, independent variable, or from a database perspective, a field.

Profile: A set of measurements on an observation (e.g., the height, weight, and age of a person).

Record: see Observation.

Response: A variable, usually denoted by Y, which is the variable being predicted in supervised learning, also called dependent variable, output variable, target variable, or outcome variable.

Sample: In the statistical community, “sample” means a collection of observations. In the machine learning community, “sample” means a single observation.

Score: A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.

Success Class: The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).

Supervised Learning: The process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.

Target: see Response.

Test Data (or test set): The portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data.

Training Data (or training set): The portion of the data used to fit a model.

Unsupervised Learning: An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.

Validation Data (or validation set): The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.

Variable: Any measurement on the records, including both the input (X) variables and the output (Y) variable.
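The Confidence and P(A | B) entries above can be made concrete with a few lines of base R. The six baskets below are invented for illustration; the calculation simply counts how often C appears among the baskets that contain both A and B.

```r
# Hypothetical transactions: each row is one basket, TRUE = item was purchased.
baskets <- data.frame(
  A = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE,  TRUE),
  B = c(TRUE,  TRUE,  FALSE, TRUE,  TRUE,  TRUE),
  C = c(TRUE,  FALSE, TRUE,  TRUE,  TRUE,  TRUE)
)

antecedent <- baskets$A & baskets$B   # baskets containing both A and B
both       <- antecedent & baskets$C  # ... that also contain C

# Confidence = P(C | A and B) = count(A, B, and C) / count(A and B)
confidence <- sum(both) / sum(antecedent)
confidence  # 0.75: three of the four baskets with A and B also contain C
```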

1.8 Road Maps to This Book

The book covers many of the widely used predictive and classification methods as well as other data mining tools. Figure 1.2 outlines data mining from a process perspective and where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: it organizes data mining procedures according to the type and structure of the data.

Order of Topics

The book is divided into eight parts:

Part I (Chapters 1–2) gives a general overview of data mining and its components.

Part II (Chapters 3–4) focuses on the early stages of data exploration and dimension reduction.

Part III (Chapter 5) discusses performance evaluation. Although it contains only one chapter, we discuss a variety of topics, from predictive performance metrics to misclassification costs. The principles covered in this part are crucial for the proper evaluation and comparison of supervised learning methods.

Part IV includes eight chapters (Chapters 6–13), covering a variety of popular supervised learning methods (for classification and/or prediction). Within this part, the topics are generally organized according to the level of sophistication of the algorithms, their popularity, and ease of understanding. The final chapter introduces ensembles and combinations of methods.

FIGURE 1.2 DATA MINING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS

TABLE 1.1 ORGANIZATION OF DATA MINING METHODS IN THIS BOOK, ACCORDING TO THE NATURE OF THE DATA∗

                       | Supervised: Continuous Response | Supervised: Categorical Response | Unsupervised: No Response
Continuous predictors  | Linear regression (6)           | Logistic regression (10)         | Principal components (4)
                       | Neural nets (11)                | Neural nets (11)                 | Cluster analysis (15)
                       | k-Nearest neighbors (7)         | Discriminant analysis (12)       | Collaborative filtering (14)
                       | Ensembles (13)                  | k-Nearest neighbors (7)          |
                       |                                 | Ensembles (13)                   |
Categorical predictors | Linear regression (6)           | Neural nets (11)                 | Association rules (14)
                       | Neural nets (11)                | Classification trees (9)         | Collaborative filtering (14)
                       | Regression trees (9)            | Logistic regression (10)         |
                       | Ensembles (13)                  | Naive Bayes (8)                  |
                       |                                 | Ensembles (13)                   |

∗Numbers in parentheses indicate chapter number.

Part V focuses on unsupervised mining of relationships. It presents association rules and collaborative filtering (Chapter 14) and cluster analysis (Chapter 15).

Part VI includes three chapters (Chapters 16–18), with the focus on forecasting time series. The first chapter covers general issues related to handling and understanding time series. The next two chapters present two popular forecasting approaches: regression-based forecasting and smoothing methods.

Part VII (Chapters 19–20) presents two broad data analytics topics: social network analysis and text mining. These methods apply data mining to specialized data structures: social networks and text.

Finally, Part VIII includes a set of cases.

Although the topics in the book can be covered in the order of the chapters, each chapter stands alone. We advise, however, reading Parts I–III before proceeding to chapters in Parts IV–V. Similarly, Chapter 16 should precede other chapters in Part VI.

USING R AND RSTUDIO

To facilitate a hands-on data mining experience, this book uses R, a free software environment for statistical computing and graphics, and RStudio, an integrated development environment (IDE) for R. The R programming language is widely used in academia and industry for data mining and data analysis. R offers a variety of methods for analyzing data, provided by a variety of separate packages. Among the numerous packages, R has extensive coverage of statistical and data mining techniques for classification, prediction, mining associations and text, forecasting, and data exploration and reduction. It offers a variety of supervised data mining tools: neural nets, classification and regression trees, k-nearest-neighbor classification, naive Bayes, logistic regression, linear regression, and discriminant analysis, all for predictive modeling. R’s packages also cover unsupervised algorithms: association rules, collaborative filtering, principal components analysis, k-means clustering, and hierarchical clustering, as well as visualization tools and data-handling utilities. Often, the same method is implemented in multiple packages, as we will discuss throughout the book. The illustrations, exercises, and cases in this book are written in relation to R.

Download: To download R and RStudio, visit www.r-project.org and www.rstudio.com/products/RStudio and follow the instructions there.

Installation: Install both R and RStudio. Note that R releases new versions fairly often. When a new version is released, some packages might require a new installation of R (this is rare).

Use: To start using R, open RStudio, then open a new script under File > New File > R Script. RStudio contains four panels as shown in Figure 1.3: Script (top left), Console (bottom left), Environment (top right), and additional information, such as plot and help (bottom right). To run a selected code line from the Script panel, press ctrl+r. Code lines starting with # are comments.

Package Installation: To start using an R package, you will first need to install it. Installation is done via the information panel (tab “Packages”) or using the command install.packages(). New packages might not support old R versions and may require a new R installation.
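As a minimal sketch of the command-line route, the lines below install a package only when it is missing and then load it. `ensure_package` is a hypothetical helper written for this illustration, and "ggplot2" appears only as an example package name; `requireNamespace()`, `install.packages()`, and `library()` are standard R functions.

```r
# ensure_package is a hypothetical helper written for this sketch.
ensure_package <- function(pkg) {
  # Install only if the package is not already available in the library.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)  # attach it to the current session
  invisible(TRUE)
}

ensure_package("stats")      # ships with R, so nothing is downloaded
# ensure_package("ggplot2")  # an add-on package: installed on first use
```

Installing requires an internet connection, which is why the second call is left commented out here.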

FIGURE 1.3 RSTUDIO SCREEN

CHAPTER 2

Overview of the Data Mining Process

In this chapter, we give an overview of the steps involved in data mining, starting from a clear goal definition and ending with model deployment. The general steps are shown schematically in Figure 2.1. We also discuss issues related to data collection, cleaning, and preprocessing. We introduce the notion of data partitioning, where methods are trained on a set of training data and then their performance is evaluated on a separate set of validation data, as well as explain how this practice helps avoid overfitting. Finally, we illustrate the steps of model building by applying them to data.

FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS (steps shown, in order: Define Purpose; Obtain Data; Explore & Clean Data; Determine DM Task; Choose DM Methods; Apply Methods & Select Final Model; Evaluate Performance; Deploy)

2.1 Introduction

In Chapter 1, we saw some very general definitions of data mining. In this chapter, we introduce the variety of methods sometimes referred to as data mining. The core of this book focuses on what has come to be called predictive analytics, the tasks of classification and prediction as well as pattern discovery, which have become key elements of a “business analytics” function in most large firms. These terms are described and illustrated below.

Not covered in this book to any great extent are two simpler database methods that are sometimes considered to be data mining techniques: (1) OLAP (online analytical processing) and (2) SQL (structured query language). OLAP and SQL searches on databases are descriptive in nature and are based on business rules set by the user (e.g., “find all credit card customers in a certain zip code with annual charges > $20,000, who own their home and who pay the entire amount of their monthly bill at least 95% of the time.”) Although SQL queries are often used to obtain the data in data mining, they do not involve statistical modeling or automated algorithmic methods.
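To see what "based on business rules set by the user" means in practice, here is the example query written as a base R filter rather than SQL. The table, its column names, and its values are all invented for illustration. Every condition is supplied by the analyst; nothing is estimated or learned from the data.

```r
# Hypothetical customer table; column names and values are invented.
customers <- data.frame(
  id             = 1:5,
  zip            = c("20001", "20001", "20002", "20001", "20001"),
  annual_charges = c(25000, 18000, 40000, 22000, 31000),
  owns_home      = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  pct_paid_full  = c(0.97, 0.99, 0.96, 0.98, 0.90)
)

# A descriptive, rule-based query: each threshold is chosen by the user.
selected <- subset(
  customers,
  zip == "20001" & annual_charges > 20000 & owns_home & pct_paid_full >= 0.95
)
selected$id  # only customer 1 satisfies all four rules
```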

2.2 Core Ideas in Data Mining

Classification

Classification is perhaps the most basic form of data analysis. The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late, or declare bankruptcy. A credit card transaction can be normal or fraudulent. A packet of data traveling on a network can be benign or threatening. A bus in a fleet can be available for service or unavailable. The victim of an illness can be recovered, still be ill, or be deceased.

A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

Prediction

Prediction is similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser). Of course, in classification we are trying to predict a class, but the term prediction in this book refers to the prediction of the value of a continuous variable. (Sometimes in the data mining literature, the terms estimation and regression are used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.)
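The difference between the two tasks can be sketched with one simulated dataset and two base R models: a logistic regression for the class (purchaser or nonpurchaser) and a linear regression for the numerical value (amount of purchase). The data, the coefficients used to simulate them, and the 0.5 cutoff for assigning a class are assumptions made for this example.

```r
# Simulated shoppers (all values are made up for illustration).
set.seed(3)
n <- 200
income    <- round(runif(n, 20, 120))                   # predictor, in $000s
purchased <- rbinom(n, 1, plogis(-4 + 0.06 * income))   # class: 1 = purchaser
amount    <- 10 + 0.8 * income + rnorm(n, sd = 5)       # numerical outcome
shoppers  <- data.frame(income, purchased, amount)

# Classification: predict a class (purchaser vs. nonpurchaser).
class_model <- glm(purchased ~ income, data = shoppers, family = binomial)

# Prediction: predict the value of a continuous variable (purchase amount).
value_model <- lm(amount ~ income, data = shoppers)

new_shopper <- data.frame(income = 90)
prob <- predict(class_model, new_shopper, type = "response")
predicted_class  <- ifelse(prob > 0.5, "purchaser", "nonpurchaser")
predicted_amount <- predict(value_model, new_shopper)

predicted_class   # a class label
predicted_amount  # a number
```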

Association Rules and Recommendation Systems

Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or “what goes with what.” Association rules, or affinity analysis, is designed to find such general association patterns between items in large databases. The rules can then be used in a variety of ways. For example, grocery stores can use such information for product placement. They can use the rules for weekly promotional offers or for bundling products. Association rules derived from a hospital database on patients’ symptoms during consecutive hospitalizations can help find “which symptom is followed by what other symptom” to help predict future symptoms for returning patients.

Online recommendation systems, such as those used on Amazon.com and Netflix.com, use collaborative filtering, a method that uses individual users’ preferences and tastes given their historic purchase, rating, browsing, or any other measurable behavior indicative of preference, as well as other users’ history. In contrast to association rules that generate rules general to an entire population, collaborative filtering generates “what goes with what” at the individual user level. Hence, collaborative filtering is used in many recommendation systems that aim to deliver personalized recommendations to users with a wide range of preferences.
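One simple form of collaborative filtering can be sketched as follows: find the user whose past ratings are most similar to the target user's, then recommend an item that this neighbor rated highly and the target has not yet rated. The ratings below are invented, and cosine similarity over co-rated items is just one of several similarity measures that could be used; this is an illustration, not the specific method of any real recommendation system.

```r
# Hypothetical ratings (1-5); NA = not rated. Rows are users, columns are items.
ratings <- rbind(
  ann  = c(5, 4, NA, 1, NA),
  bob  = c(5, 5, 4,  1, 2),
  cara = c(1, 2, 1,  5, 4)
)
colnames(ratings) <- c("item1", "item2", "item3", "item4", "item5")

# Cosine similarity computed over the items both users have rated.
cosine_sim <- function(u, v) {
  shared <- !is.na(u) & !is.na(v)
  sum(u[shared] * v[shared]) /
    (sqrt(sum(u[shared]^2)) * sqrt(sum(v[shared]^2)))
}

target <- "ann"
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(o) cosine_sim(ratings[target, ], ratings[o, ]))

# Recommend the unrated item that the most similar user rated highest.
neighbor  <- names(which.max(sims))
unrated   <- names(which(is.na(ratings[target, ])))
recommend <- unrated[which.max(ratings[neighbor, unrated])]

neighbor   # "bob": his ratings on shared items track ann's most closely
recommend  # "item3": bob rated it 4, higher than his rating of item5
```

The recommendation is specific to ann. A different target user would get a different neighbor and possibly a different item, which is the sense in which the method works at the individual user level.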

Predictive Analytics

Classification, prediction, and to some extent, association rules and collaborative filtering constitute the analytical methods employed in predictive analytics. The term predictive analytics is sometimes used to also include data pattern identification methods such as clustering.

Data Reduction and Dimension Reduction

The performance of data mining algorithms is often improved when the number of variables is limited, and when large numbers of records can be grouped into homogeneous groups. For example, rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups and build separate models for each group. Or a marketer might want to classify customers into different “personas,” and must therefore group customers into homogeneous groups to define the personas. This process of consolidating a large number of records (or cases) into a smaller set is termed data reduction. Methods for reducing the number of cases are often called clustering.

Reducing the number of variables is typically called dimension reduction. Dimension reduction is a common initial step before deploying data mining methods, intended to improve predictive power, manageability, and interpretability.
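Both ideas can be tried with functions from base R's stats package: `kmeans()` for grouping records and `prcomp()` for principal components. The customer data below are simulated, and the choices of two clusters and two retained components are assumptions made for this sketch.

```r
# Simulated customers: two loose groups measured on four variables.
set.seed(4)
group <- rep(c(0, 3), each = 50)
customers <- data.frame(
  spend  = group + rnorm(100),
  visits = group + rnorm(100),
  items  = group + rnorm(100),
  tenure = rnorm(100)
)

# Data reduction: cluster 100 records into 2 homogeneous groups.
clusters <- kmeans(scale(customers), centers = 2, nstart = 10)
table(clusters$cluster)  # number of records assigned to each group

# Dimension reduction: replace 4 variables with a few principal components.
pca <- prcomp(customers, scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", ]
reduced <- pca$x[, 1:2]  # keep the scores on the first two components
dim(reduced)             # 100 records, 2 variables
```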

Data Exploration and Visualization

One of the earliest stages of engaging with a dataset is exploring it. Exploration is aimed at understanding the global landscape of the data, and detecting unusual values. Exploration is used for data cleaning and manipulation as well as for visual discovery and “hypothesis generation.”

Methods for exploring data include looking at various data aggregations and summaries, both numerically and graphically. This includes looking at each variable separately as well as looking at relationships among variables. The purpose is to discover patterns and exceptions. Exploration by creating charts and dashboards is called Data Visualization or Visual Analytics. For numerical variables, we use histograms and boxplots to learn about the distribution of their values, to detect outliers (extreme observations), and to find other information that is relevant to the analysis task. Similarly, for categorical variables, we use bar charts. We can also look at scatter plots of pairs of numerical variables to learn about possible relationships, the type of relationship, and again, to detect outliers. Visualization can be greatly enhanced by adding features such as color and interactive navigation.
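A first pass at exploration might look like the following base R sketch. The data are simulated, with one extreme value planted on purpose so that the outlier check has something to find; the variable names are invented.

```r
# Simulated data with one extreme value planted on purpose.
set.seed(5)
sales <- data.frame(
  amount = c(rnorm(99, mean = 50, sd = 10), 500),
  region = sample(c("East", "West", "North"), 100, replace = TRUE)
)

summary(sales$amount)  # numerical summary of a numerical variable
table(sales$region)    # counts for a categorical variable

# boxplot.stats() reports the points a boxplot would draw as outliers.
outliers <- boxplot.stats(sales$amount)$out
outliers

# Charts: these open a plot window, so they run only in an interactive session.
if (interactive()) {
  hist(sales$amount, main = "Distribution of amount")
  boxplot(amount ~ region, data = sales)
  barplot(table(sales$region))
}
```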

Supervised and Unsupervised Learning

A fundamental distinction among data mining techniques is between supervised and unsupervised methods. Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known. Such data are also called “labeled data,” since they contain the label (outcome value) for each record. These training data are the data from which the classification or prediction algorithm “learns,” or is “trained,” about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of labeled data (the validation data) where the outcome is known but initially hidden, to see how well it does in comparison to other models. If many different models are being tried out, it is prudent to save a third sample, which also includes known outcomes (the test data) to use with the model finally selected to predict how well it will do. The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.

Simple linear regression is an example of a supervised learning algorithm (although rarely called that in the introductory statistics course where you probably first encountered it). The Y variable is the (known) outcome variable and the X variable is a predictor variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
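As a small worked example, the sketch below computes the least-squares slope and intercept from their standard closed-form expressions and checks that `lm()` returns the same line. The five (X, Y) pairs are invented for illustration.

```r
# Hypothetical labeled data: X is the predictor, Y is the known outcome.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for the line y = b0 + b1 * x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() minimizes the same sum of squared deviations.
fit <- lm(y ~ x)
coef(fit)

# Use the fitted line to predict Y for a new X whose outcome is unknown.
new_prediction <- predict(fit, data.frame(x = 6))
new_prediction
```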

Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify. Hence, there is no “learning” from cases where such an outcome variable is known. Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.

Supervised and unsupervised methods are sometimes used in conjunction. For example, unsupervised clustering methods are used to separate loan applicants into several risk-level groups. Then, supervised algorithms are applied separately to each risk-level group for predicting propensity of loan default.

SUPERVISED LEARNING REQUIRES GOOD SUPERVISION

In some cases, the value of the outcome variable (the ‘label’) is known because it is an inherent component of the data. Web logs will show whether a person clicked on a link or not. Bank records will show whether a loan was paid on time or not. In other cases, the value of the known outcome must be supplied by a human labeling process to accumulate enough data to train a model. E-mail must be labeled as spam or legitimate; documents in legal discovery must be labeled as relevant or irrelevant. In either case, the data mining algorithm can be led astray if the quality of the supervision is poor.

Gene Weingarten reported in the January 5, 2014 Washington Post magazine how the strange phrase “defiantly recommend” is making its way into English via auto-correction. “Defiantly” is closer to the common misspelling definatly than is definitely, so Google.com, in the early days, offered it as a correction when users typed the misspelled word “definatly.” In the ideal supervised learning model, humans guide the auto-correction process by rejecting defiantly and substituting definitely. Google’s algorithm would then learn that this is the best first-choice correction of “definatly.” The problem was that too many people were lazy, just accepting the first correction that Google presented. All these acceptances then cemented “defiantly” as the proper correction.

2.3 The Steps in Data Mining

This book focuses on understanding and using data mining algorithms (Steps 4 to 7 below). However, some of the most serious errors in analytics projects result from a poor understanding of the problem—an understanding that must be developed before we get into the details of algorithms to be used. Here is a list of steps to be taken in a typical data mining effort:

1. Develop an understanding of the purpose of the data mining project. How will the stakeholder use the results? Who will be affected by the results? Will the analysis be a one-shot effort or an ongoing procedure?

2. Obtain the dataset to be used in the analysis. This often involves sampling from a large database to capture records to be used in an analysis. How well this sample reflects the records of interest affects the ability of the data mining results to generalize to records outside of this sample. It may also involve pulling together data from different databases or sources. The databases could be internal (e.g., past purchases made by customers) or external (e.g., credit ratings). While data mining deals with very large databases, usually the analysis to be done requires only thousands or tens of thousands of records.

3. Explore, clean, and preprocess the data. This step involves verifying that the data are in reasonable condition. How should missing data be handled? Are the values in a reasonable range, given what you would expect for each variable? Are there obvious outliers? The data are reviewed graphically: for example, a matrix of scatterplots showing the relationship of each variable with every other variable. We also need to ensure consistency in the definitions of fields, units of measurement, time periods, and so on. In this step, new variables are also typically created from existing ones. For example, “duration” can be computed from start and end dates.

4. Reduce the data dimension, if necessary. Dimension reduction can involve operations such as eliminating unneeded variables, transforming variables (e.g., turning “money spent” into “spent > $100” vs. “spent ≤ $100”), and creating new variables (e.g., a variable that records whether at least one of several products was purchased). Make sure that you know what each variable means and whether it is sensible to include it in the model.

5. Determine the data mining task (classification, prediction, clustering, etc.). This involves translating the general question or problem of Step 1 into a more specific data mining question.

6. Partition the data (for supervised tasks). If the task is supervised (classification or prediction), randomly partition the dataset into three parts: training, validation, and test datasets.

7. Choose the data mining techniques to be used (regression, neural nets, hierarchical clustering, etc.).

8. Use algorithms to perform the task. This is typically an iterative process: trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm). Where appropriate, feedback from the algorithm’s performance on validation data is used to refine the settings.

9. Interpret the results of the algorithms. This involves making a choice as to the best algorithm to deploy, and where possible, testing the final choice on the test data to get an idea as to how well it will perform. (Recall that each algorithm may also be tested on the validation data for tuning purposes; in this way, the validation data become a part of the fitting process and are likely to underestimate the error in the deployment of the model that is finally chosen.)

10. Deploy the model. This step involves integrating the model into operational systems and running it on real records to produce decisions or actions. For example, the model might be applied to a purchased list of possible customers, and the action might be “include in the mailing if the predicted amount of purchase is > $10.” A key step here is “scoring” the new records, or using the chosen model to predict the outcome value (“score”) for each new record.
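The random partition described in Step 6 can be sketched in a few lines of base R. This is only an illustration: the toy data frame, the 50/30/20 split, and all object names below are assumptions made for the example, not values prescribed by the text.

```r
# Hypothetical toy data: 100 records with one predictor and one outcome.
set.seed(1)  # fix the seed so the random partition is reproducible
toy.df <- data.frame(x = rnorm(100), y = rnorm(100))

# Shuffle the row indices, then cut them into 50% / 30% / 20% pieces.
# These proportions are an illustrative assumption.
n <- nrow(toy.df)
shuffled <- sample(n)
n.train <- floor(0.5 * n)
n.valid <- floor(0.3 * n)

train.rows <- shuffled[1:n.train]
valid.rows <- shuffled[(n.train + 1):(n.train + n.valid)]
test.rows  <- shuffled[(n.train + n.valid + 1):n]

train.df <- toy.df[train.rows, ]
valid.df <- toy.df[valid.rows, ]
test.df  <- toy.df[test.rows, ]

# Every record lands in exactly one of the three partitions.
c(train = nrow(train.df), valid = nrow(valid.df), test = nrow(test.df))
```

Shuffling the indices once and slicing them guarantees that the three partitions do not overlap, which is what allows the test data to give an honest estimate in Step 9.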

The foregoing steps encompass the steps in SEMMA, a methodology developed by the software company SAS:

Sample Take a sample from the dataset; partition into training, validation, and test datasets.

Explore Examine the dataset statistically and graphically.

Modify Transform the variables and impute missing values.

Model Fit predictive models (e.g., regression tree, neural network).

Assess Compare models using a validation dataset.

IBM SPSS Modeler (previously SPSS-Clementine) has a similar methodology, termed CRISP-DM (CRoss-Industry Standard Process for Data Mining). All these frameworks include the same main steps involved in predictive modeling.

2.4 Preliminary Steps

Organization of Datasets

Datasets are nearly always constructed and displayed so that variables are in columns and records are in rows. We will illustrate this with home values in West Roxbury, Boston, in 2014. Fourteen variables are recorded for over 5000 homes. The spreadsheet is organized so that each row represents a home: the first home’s assessed value was $344,200, its tax was $4330, its size was 9965 ft², it was built in 1880, and so on. In supervised learning situations, one of these variables will be the outcome variable, typically listed in the first or last column (in this case it is TOTAL VALUE, in the first column).

Predicting Home Values in the West Roxbury Neighborhood

The Internet has revolutionized the real estate industry. Realtors now list houses and their prices on the web, and estimates of house and condominium prices have become widely available, even for units not on the market. At this time of writing, Zillow (www.zillow.com) is the most popular online real estate information site in the United States¹, and in 2014 they purchased their major rival, Trulia. By 2015, Zillow had become the dominant platform for checking house prices and, as such, the dominant online advertising venue for realtors. What used to be a comfortable 6% commission structure for realtors, affording them a handsome surplus (and an oversupply of realtors), was being rapidly eroded by an increasing need to pay for advertising on Zillow. (This, in fact, is the key to Zillow’s business model: redirecting the 6% commission away from realtors and to itself.)

Zillow gets much of the data for its “Zestimates” of home values directly from publicly available city housing data, used to estimate property values for tax assessment. A competitor seeking to get into the market would likely take the same approach. So might realtors seeking to develop an alternative to Zillow.

A simple approach would be a naive, model-less method: just use the assessed values as determined by the city. Those values, however, do not necessarily include all properties, and they might not include changes warranted by remodeling, additions, etc. Moreover, the assessment methods used by cities may not be transparent or always reflect true market values. However, the city property data can be used as a starting point to build a model, to which additional data (such as that collected by large realtors) can be added later.

Let’s look at how Boston property assessment data, available from the city of Boston, might be used to predict home values. The data in WestRoxbury.csv include information on single family owner-occupied homes in West Roxbury, a neighborhood in southwest Boston, MA, in 2014. The data include values for various predictor variables, and for an outcome, assessed home value (“total value”). This dataset has 14 variables and 5802 homes. A description of each variable is given in Table 2.1 (the full data dictionary provided by the City of Boston is available at http://goo.gl/QBRlYF; we have modified a few variable names), and a sample of the data is shown in Table 2.2.

As we saw earlier, below the header row, each row in the data represents a home. For example, the first home was assessed at a total value of $344.2 thousand (TOTAL VALUE). Its tax bill was $4330. It has a lot size of 9965 square feet (ft²), was built in the year 1880, has two floors, six rooms, and so on.

¹ Harney, K., “Zestimates may not be as right as you’d like,” Washington Post, Feb. 7, 2015, p. T10.

TABLE 2.1 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATASET

TOTAL VALUE  Total assessed value for property, in thousands of USD
TAX          Tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT SQ FT    Total lot size of parcel in square feet
YR BUILT     Year the property was built
GROSS AREA   Gross floor area
LIVING AREA  Total living area for residential properties (ft²)
FLOORS       Number of floors
ROOMS        Total number of rooms
BEDROOMS     Total number of bedrooms
FULL BATH    Total number of full baths
HALF BATH    Total number of half baths
KITCHEN      Total number of kitchens
FIREPLACE    Total number of fireplaces
REMODEL      When the house was remodeled (Recent/Old/None)

Loading and Looking at the Data in R

To load data into R, we will typically want to have the data available as a csv (comma separated values) file. If the data are in an xls (or xlsx) file, we can save that same file in Excel as a csv file: go to File > Save as > Save as type: CSV (Comma delimited) (*.csv) > Save. Note: When dealing with .csv files in Excel, beware of two things:

• Opening a .csv file in Excel strips off leading 0’s, which corrupts zipcode data.

• Saving a .csv file in Excel saves only the digits that are displayed; if you need precision to a certain number of decimals, you need to ensure they are displayed before saving.
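A related caution applies on the R side: by default, read.csv also parses a zipcode column as a number and drops its leading zero. The sketch below, which assumes a hypothetical two-record csv with columns named ZIP and VALUE, shows one way to keep such a column as text using the colClasses argument.

```r
# Hypothetical two-record csv held in a string so the example is self-contained.
csv.text <- "ZIP,VALUE\n02132,344.2\n02131,412.6"

# Default behavior: ZIP is parsed as a number, so the leading zero is lost.
default.df <- read.csv(text = csv.text)
default.df$ZIP   # numeric values 2132 and 2131

# Declaring ZIP as character keeps the code exactly as written in the file.
# Columns not named in colClasses are still type-detected automatically.
zip.df <- read.csv(text = csv.text, colClasses = c(ZIP = "character"))
zip.df$ZIP       # character values "02132" and "02131"
```

This only helps if the zeros are still present in the file; once Excel has stripped them and the file has been re-saved, they cannot be recovered by R.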

Once we have R and RStudio installed on our machine and the WestRoxbury.csv file saved as a csv file, we can run the code in Table 2.3 to load the data into R.

TABLE 2.2 FIRST 10 RECORDS IN THE WEST ROXBURY HOME VALUES DATASET

TOTAL VALUE  TAX   LOT SQ FT  YR BUILT  GROSS AREA  LIVING AREA  FLOORS  ROOMS  BEDROOMS  FULL BATH  HALF BATH  KITCHEN  FIREPLACE  REMODEL
344.2        4330  9965       1880      2436        1352         2       6      3         1          1          1        0          None
412.6        5190  6590       1945      3108        1976         2       10     4         2          1          1        0          Recent
330.1        4152  7500       1890      2294        1371         2       8      4         1          1          1        0          None
498.6        6272  13,773     1957      5032        2608         1       9      5         1          1          1        1          None
331.5        4170  5000       1910      2370        1438         2       7      3         2          0          1        0          None
337.4        4244  5142       1950      2124        1060         1       6      3         1          0          1        1          Old
359.4        4521  5000       1954      3220        1916         2       7      3         1          1          1        0          None
320.4        4030  10,000     1950      2208        1200         1       6      3         1          0          1        0          None
333.5        4195  6835       1958      2582        1092         1       5      3         1          0          1        1          Recent
409.4        5150  5093       1900      4818        2992         2       8      4         2          0          1        0          None

TABLE 2.3 WORKING WITH FILES IN R

To start, open RStudio and go to File > New File > R Script, which opens a new tab. Then save your Untitled1.R file into the directory where your csv is saved, and give it the name WestRoxbury.R. From the Menu Bar, go to Session > Set Working Directory > To Source File Location. This sets the working directory as the place where both the R file and csv file are saved.

code for loading and creating subsets from the data

housing.df <- read.csv("WestRoxbury.csv", header = TRUE)  # load data
dim(housing.df)   # find the dimension of data frame
head(housing.df)  # show the first six rows
View(housing.df)  # show all the data in a new tab

# Practice showing different subsets of the data
housing.df[1:10, 1]              # show the first 10 rows of the first column only
housing.df[1:10, ]               # show the first 10 rows of each of the columns
housing.df[5, 1:10]              # show the fifth row of the first 10 columns
housing.df[5, c(1:2, 4, 8:10)]   # show the fifth row of some columns
housing.df[, 1]                  # show the whole first column
housing.df$TOTAL_VALUE           # a different way to show the whole first column
housing.df$TOTAL_VALUE[1:10]     # show the first 10 rows of the first column
length(housing.df$TOTAL_VALUE)   # find the length of the first column
mean(housing.df$TOTAL_VALUE)     # find the mean of the first column
summary(housing.df)              # find summary statistics for each column

Data from a csv file is stored in R as a data frame (e.g., housing.df). If our csv file has column headers, these headers get automatically stored as the column names of our data. A data frame is the fundamental object almost all analyses begin with in R. A data frame has rows and columns. The rows are the observations for each case (e.g., house), and the columns are the variables of interest (e.g., TOTAL VALUE, TAX). The code in Table 2.3 walks you through some basic steps you will want to perform prior to doing any analysis: finding the size and dimension of your data (number of rows and columns), viewing all the data, displaying only selected rows and columns, and computing summary statistics for variables of interest. Note that comments are preceded with the # symbol.
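For readers who do not have the csv file at hand, the same ideas can be tried on a small data frame typed in directly. The sketch below uses the first three records of Table 2.2 and only five of the columns; the object name mini.df and the column name LOT_SQFT are assumptions made for this example.

```r
# First three records of Table 2.2, restricted to five columns for brevity.
mini.df <- data.frame(
  TOTAL_VALUE = c(344.2, 412.6, 330.1),   # assessed value, thousands of USD
  TAX         = c(4330, 5190, 4152),      # tax bill, USD
  LOT_SQFT    = c(9965, 6590, 7500),      # lot size, square feet
  ROOMS       = c(6, 10, 8),
  REMODEL     = c("None", "Recent", "None")
)

dim(mini.df)                # number of rows, then number of columns
names(mini.df)              # column names act as the variable names
mini.df[2, "TAX"]           # one cell: the second home's tax bill
mini.df[mini.df$ROOMS > 6, ]  # rows selected by a condition on a column
mean(mini.df$TOTAL_VALUE)   # summary statistic for one variable
```

Each row is one home and each column is one variable, which is the layout that functions such as summary() and the modeling functions used later expect.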

Sampling from a Database

Typically, we perform data mining on less than the complete database. Data mining algorithms will have varying limitations on what they can handle in terms of the numbers of records and variables, limitations that may stem from computing power and capacity as well as from the software itself. Even within those limits, many algorithms will execute faster with smaller samples.

Accurate models can often be built with as few as several thousand records. Hence, we will want to sample a subset of records for model building. Table 2.4 provides code for sampling in R.

TABLE 2.4 SAMPLING IN R

code for sampling and over/under-sampling

# random sample of 5 observations
s <- sample(row.names(housing.df), 5)
housing.df[s,]

# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01))
housing.df[s,]

Oversampling Rare Events in Classification Tasks

If the event we are interested in classifying is rare, for example, customers purchasing a product in response to a mailing, or fraudulent credit card transactions, sampling a random subset of records may yield so few events (e.g., purchases) that we have little information on them. We would end up with lots of data on nonpurchasers and non-fraudulent transactions but little on which to base a model that distinguishes purchasers from nonpurchasers or fraudulent from non-fraudulent. In such cases, we would want our sampling procedure to overweight the rare class (purchasers or frauds) relative to the majority class (nonpurchasers, non-frauds) so that our sample would end up with a healthy complement of purchasers or frauds.
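One simple way to overweight the rare class is to keep every rare record and draw only an equal number of majority records. The sketch below is a minimal illustration of that idea on made-up data; the 2% response rate, the 50/50 target mix, and the object names are all assumptions, and other weighting schemes are equally possible.

```r
# Hypothetical data: 1000 records, of which 20 (2%) are responders.
resp.df <- data.frame(id = 1:1000,
                      responder = rep(c(1, 0), times = c(20, 980)))

rare.rows  <- which(resp.df$responder == 1)
major.rows <- which(resp.df$responder == 0)

# Keep all rare records and sample the same number of majority records.
set.seed(2)  # reproducible draw
keep.rows <- c(rare.rows, sample(major.rows, length(rare.rows)))
balanced.df <- resp.df[keep.rows, ]

mean(resp.df$responder)      # share of responders in the full data
mean(balanced.df$responder)  # share of responders in the reweighted sample
```

A model trained on such a sample sees responders far more often than it will in practice, so its predicted rates need to be interpreted with the original class proportions in mind.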

Assuring an adequate number of responder or “success” cases to train the model is just part of the picture. A more important factor is the costs of misclassification. Whenever the response rate is extremely low, we are likely to attach more importance to identifying a responder than to identifying a non-responder. In direct-response advertising (whether by traditional mail, e-mail, or web advertising), we may encounter only one or two responders for every hundred records; the value of finding such a customer far outweighs the costs of reaching him or her. In trying to identify fraudulent transactions, or customers unlikely to repay debt, the costs of failing to find the fraud or the nonpaying customer are likely to exceed the cost of more detailed review of a legitimate transaction or customer.

If the costs of failing to locate responders are comparable to the costs of misidentifying non-responders as responders, our models would usually achieve highest overall accuracy if they identified everyone as a non-responder (or almost everyone, if it is easy to identify a few responders without catching many non-responders). In such a case, the misclassification rate is very low (equal to the rate of responders), but the model is of no value.

More generally, we want to train our model with the asymmetric costs in mind so that the algorithm will catch the more valuable responders, probably at the cost of “catching” and misclassifying more non-responders as responders than would be the case if we assume equal costs. This subject is discussed in detail in Chapter 5.
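The trade-off between accuracy and cost can be made concrete with a small calculation. Everything in the sketch below is hypothetical: the 2% response rate, the behavior of the second model, and the two cost figures (50 for a missed responder, 1 for a wasted contact) are assumptions chosen only to illustrate the argument.

```r
# Hypothetical outcomes: 1000 records, the first 20 of which are responders.
actual <- rep(c(1, 0), times = c(20, 980))

# Naive rule: classify every record as a non-responder.
naive.pred  <- rep(0, length(actual))
naive.error <- mean(naive.pred != actual)   # 20 / 1000, the response rate

# A hypothetical model that catches 15 of the 20 responders
# but also flags 85 non-responders by mistake.
model.pred  <- c(rep(1, 15), rep(0, 5), rep(1, 85), rep(0, 895))
model.error <- mean(model.pred != actual)   # (5 + 85) / 1000

# Total cost under assumed asymmetric costs.
total.cost <- function(pred, actual, miss = 50, false.alarm = 1) {
  sum(miss * (pred == 0 & actual == 1) +
      false.alarm * (pred == 1 & actual == 0))
}
naive.cost <- total.cost(naive.pred, actual)   # 20 misses * 50
model.cost <- total.cost(model.pred, actual)   # 5 misses * 50 + 85 false alarms * 1
c(naive.error = naive.error, model.error = model.error,
  naive.cost = naive.cost, model.cost = model.cost)
```

Under these assumed costs the second model makes more errors than the naive rule yet costs far less, which is the sense in which overall accuracy can be a misleading yardstick when costs are asymmetric.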

Preprocessing and Cleaning the Data

Types of Variables There are several ways of classifying variables. Variables can be numerical or text (character/string). They can be continuous (able to assume any real numerical value, usually in a given range), integer (taking only integer values), categorical (assuming one of a limited number of values), or date. Categorical variables can be either coded as numerical (1, 2, 3) or text (payments current, payments not current, bankrupt). Categorical variables can be unordered (called nominal variables) with categories such as North America, Europe, and Asia; or they can be ordered (called ordinal variables) with categories such as high value, low value, and nil value.
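In R, these distinctions map roughly onto numeric and integer vectors, factors, and ordered factors. The sketch below uses made-up values and object names to show how each type can be declared and inspected; it is an illustration of base R behavior, not a required workflow.

```r
# Hypothetical values for four kinds of variable.
amount <- c(12.5, 300, 0, 48.2)                       # continuous
rooms  <- c(6L, 10L, 8L, 9L)                          # integer
region <- factor(c("Europe", "Asia", "North America", "Asia"))  # nominal
value  <- factor(c("low", "high", "nil", "low"),
                 levels = c("nil", "low", "high"),
                 ordered = TRUE)                      # ordinal

class(amount)    # "numeric"
class(rooms)     # "integer"
class(region)    # "factor"
class(value)     # "ordered" "factor"

levels(region)       # categories, stored in alphabetical order by default
nlevels(region)      # number of distinct categories
value[1] < value[2]  # comparisons are meaningful only for ordered factors
```

Declaring the levels explicitly matters for ordinal variables, because without the levels argument R would order the categories alphabetically rather than by meaning.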

Continuous variables can be handled by most data mining routines with the exception of the naive Bayes classifier, which deals exclusively with categorical predictor variables. The machine learning roots of data mining grew out of problems with categorical outcomes; the roots of statistics lie in the analysis of continuous variables. Sometimes, it is desirable to convert continuous variables to categorical variables. This is done most typically in the case of outcome variables, where the numerical variable is mapped to a decision (e.g., credit scores above a certain threshold mean “grant credit,” a medical test result above a certain threshold means “start treatment”).
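Converting a continuous variable into a categorical one is a one-line operation in R. The sketch below assumes hypothetical credit scores, an arbitrary cutoff of 680, and the label "deny credit" for the complementary decision; none of these come from the text.

```r
# Hypothetical credit scores and an assumed decision threshold.
credit.score <- c(580, 640, 700, 720, 810)
threshold <- 680

# Two categories: map each score to a decision.
decision <- ifelse(credit.score > threshold, "grant credit", "deny credit")
table(decision)

# More than two categories: cut() bins the scores into labeled ranges.
# With the default right = TRUE, each bin includes its upper bound.
band <- cut(credit.score,
            breaks = c(-Inf, 600, 700, Inf),
            labels = c("low", "medium", "high"))
data.frame(credit.score, decision, band)
```

Because each bin is closed on the right, a score of exactly 700 falls in the "medium" band here; setting right = FALSE would move it to "high".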

For the West Roxbury data, Table 2.5 presents some R statements to review the variables and determine what type (class) R thinks they are, and to determine the number of levels in a factor variable.

Handling Categorical Variables Categorical variables can also be handled by most data mining routines, but often require special handling. If the categorical variable is ordered (age group, degree of creditworthiness, etc.), we can sometimes code the categories numerically (1, 2, 3, ...) and treat the variable as if it were a continuous variable. The smaller the number of categories, and the less they represent equal increments of value, the more problematic this approach becomes, but it often works well enough.
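Coding an ordered categorical variable numerically can be sketched as follows. The ratings and object names are hypothetical; the point of the example is that the category order has to be stated explicitly, since R otherwise sorts the categories alphabetically.

```r
# Hypothetical creditworthiness ratings.
rating <- c("low", "high", "medium", "low")

# Without explicit levels the codes follow alphabetical order:
# high = 1, low = 2, medium = 3, which does not reflect the true ordering.
alpha.code <- as.integer(factor(rating))

# Declaring the order gives low = 1, medium = 2, high = 3.
rating.ord  <- factor(rating, levels = c("low", "medium", "high"),
                      ordered = TRUE)
rating.code <- as.integer(rating.ord)

data.frame(rating, alpha.code, rating.code)
```

Treating rating.code as a continuous predictor assumes that the step from low to medium is worth the same as the step from medium to high, which is the equal-increments caveat raised above.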

Nominal categorical variables, however, often cannot be used as is. In many cases, they must be decomposed into a series of binary variables, called dummy
