Recent Orders

Our Reviews

Sample Papers

How It Works

Get First 2 Pages Of Your Homework Absolutely Free!

Messages

Welcome to TutorsOnSpot.Com!

World's No. 1 Assignment Writing Market

Post Your Homework

Proposals

Post your homework and get free proposals here!

Post Your Homework

Stuck in your homework and missing deadline? Get urgent help in $10/Page with 24 hours deadline

Get Urgent Writing Help In Your Essays, Assignments, Homeworks, Dissertation, Thesis Or Coursework & Achieve A+ Grades.

Privacy Guaranteed - 100% Plagiarism Free Writing - Free Turnitin Report - Professional And Experienced Writers - 24/7 Online Support

Get Free Quotes Post Your Requirements

Fyf 893 from you flowers

01/12/2021 Client: muhammad11 Deadline: 2 Day

you will discuss the basic concepts and algorithms of Association Analysis.

Boston San Francisco NewYork

London Toronto Sydney Tokyo Singapore Madrid

MexicoCity Munich Paris CapeTown HongKong Montreal

G.R r+6,q

If you purchased this book within the United States or Canada you should be aware that it has been wrongfirlly imported without the approval of the Publishel or the Author.

T3 Loo 6

- {)gq* 3 AcquisitionsEditor Matt Goldstein

ProjectEditor Katherine Harutunian Production Supervisor Marilyn Lloyd Production Services Paul C. Anagnostopoulos of Windfall Software Marketing Manager Michelle Brown Copyeditor Kathy Smith Proofreader IenniferMcClain Technicallllustration GeorgeNichols Cover Design Supervisor Joyce Cosentino Wells Cover Design Night & Day Design Cover Image @ 2005 Rob Casey/Brand X pictures hepress and Manufacturing Caroline Fell Printer HamiltonPrinting

Access the latest information about Addison-Wesley titles from our iWorld Wide Web site: http : //www. aw-bc.com/computing

Many of the designations used by manufacturers and sellers to distiriguish their products are claimed as trademarks. where those designations appear in this book, and Addison- Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

The programs and applications presented in this book have been incl,[rded for their instructional value. They have been tested with care, but are not guatanteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contract Department, 75 Arlington Street, Suite 300, Boston, MA02II6 or fax your request to (617) g4g-j047.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or any other media embodiments now known or hereafter to become known, without the prior written permission of the publisher. printed in the united States of America.

lsBN 0-321-42052-7

2 3 4 5 67 8 9 10-HAM-O8 07 06

our famili,es

Preface

Advances in data generation and collection are producing data sets of mas- sive size in commerce and a variety of scientific disciplines. Data warehouses store details of the sales and operations of businesses, Earth-orbiting satellites beam high-resolution images and sensor data back to Earth, and genomics ex- periments generate sequence, structural, and functional data for an increasing number of organisms. The ease with which data can now be gathered and stored has created a new attitude toward data analysis: Gather whatever data you can whenever and wherever possible. It has become an article of faith that the gathered data will have value, either for the purpose that initially motivated its collection or for purposes not yet envisioned.

The field of data mining grew out of the limitations of current data anal- ysis techniques in handling the challenges posedl by these new types of data sets. Data mining does not replace other areas of data analysis, but rather takes them as the foundation for much of its work. While some areas of data mining, such as association analysis, are unique to the field, other areas, such as clustering, classification, and anomaly detection, build upon a long history of work on these topics in other fields. Indeed, the willingness of data mining researchers to draw upon existing techniques has contributed to the strength and breadth of the field, as well as to its rapid growth.

Another strength of the field has been its emphasis on collaboration with researchers in other areas. The challenges of analyzing new types of data cannot be met by simply applying data analysis techniques in isolation from those who understand the data and the domain in which it resides. Often, skill in building multidisciplinary teams has been as responsible for the success of data mining projects as the creation of new and innovative algorithms. Just as, historically, many developments in statistics were driven by the needs of agriculture, industry, medicine, and business, rxrany of the developments in data mining are being driven by the needs of those same fields.

This book began as a set of notes and lecture slides for a data mining course that has been offered at the University of Minnesota since Spring 1998 to upper-division undergraduate and graduate Students. Presentation slides

viii Preface

and exercises developed in these offerings grew with time and served as a basis for the book. A survey of clustering techniques in data mining, originally written in preparation for research in the area, served as a starting point for one of the chapters in the book. Over time, the clustering chapter was joined by chapters on data, classification, association analysis, and anomaly detection. The book in its current form has been class tested at the home institutions of the authors-the University of Minnesota and Michigan State University-as well as several other universities.

A number of data mining books appeared in the meantime, but were not completely satisfactory for our students primarily graduate and undergrad- uate students in computer science, but including students from industry and a wide variety of other disciplines. Their mathematical and computer back- grounds varied considerably, but they shared a common goal: to learn about data mining as directly as possible in order to quickly apply it to problems in their own domains. Thus, texts with extensive mathematical or statistical prerequisites were unappealing to many of them, as were texts that required a substantial database background. The book that evolved in response to these students needs focuses as directly as possible on the key concepts of data min- ing by illustrating them with examples, simple descriptions of key algorithms, and exercises.

Overview Specifically, this book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, vi- sualization, predictive modeling, association analysis, clustering, and anomaly detection. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. In addition, this book also pro- vides a starting point for those readers who are interested in pursuing research in data mining or related fields.

The book covers five main topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each of these areas is covered in a pair of chapters. For classification, association analysis, and clustering, the introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the more advanced chapter dis- cusses advanced concepts and algorithms. The objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

Preface ix

To help the readers better understand the concepts that have been pre- sented, we provide an extensive set of examples, figures, and exercises. Bib- Iiographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. The book also contains a comprehensive subject and author index.

To the Instructor As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites- no database knowledge is needed and we assume only a modest background in statistics or mathematics. To this end, the book was designed to be as self-contained as possible. Necessary material from statistics, linear algebra, and machine learning is either integrated into the body of the text, or for some advanced topics, covered in the appendices.

Since the chapters covering major data mining topics are self-contained, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2, 4, 6, 8, and 10. Although the introductory data chapter (2) should be covered first, the basic classification, association analy- sis, and clustering chapters (4, 6, and 8, respectively) can be covered in any order. Because of the relationship of anomaly detection (10) to classification (4) and clustering (8), these chapters should precede Chapter 10. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (5, 7, and 9, respectively) to fit the schedule and in- terests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials The supplements for the book are available at Addison- Wesley's Website www.aw.con/cssupport. Support materials available to all readers of this book include

PowerPoint lecture slides

Suggestions for student projects

Data mining resources such as data mining algorithms and data sets

On-line tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

x Preface

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. Please contact your school's Addison-Wesley representative for information on obtaining ac- cess to this material. Comments and suggestions, as well as reports of errors, can be sent to the authors through dnbook@cs.unm.edu.

Acknowledgments Many people contributed to this book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contribu- tions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data min- ing classes. Some ofthe exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data min- ing groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the Univer- sity of Minnesota and Michigan State University who worked with early drafbs of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Ra- jiv Bhatia, Fbederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable IATEX as- sistance in creating an author index. Irene Moulitsas also provided assistance with IATEX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Min- nesota and Michigan State who have helped create a positive environment for data mining research. They include Dan Boley, Joyce Chai, Anil Jain, Ravi

Preface xi

Janardan, Rong Jin, George Karypis, Haesun Park, William F. Punch, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Steve Can- non, Piet C. de Groen, FYan Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin Silverstein, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney, Jagdish Chan- dra, Joe Coughlan, Michael Coyle, Stephen Davis, Flederica Darema, Richard Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan, James Sido- ran, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, and Xiaodong Zhanghave been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Michelle Brown, Matt Goldstein, Katherine Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work and Paul Anag- nostopoulos, who provided I4.T[X support. We are grateful to the following Pearson reviewers: Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joy- deep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California- Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polvtechnic Institute).

Gontents

Preface

Introduction 1 1.1 What Is Data Mining? 2 7.2 Motivating Challenges 4 1.3 The Origins of Data Mining 6 1.4 Data Mining Tasks 7 1.5 Scope and Organization of the Book 11 1.6 Bibliographic Notes 13

vl l

t.7 Exercises

Data

19 2.I Types of Data 22

2.1.I Attributes and Measurement 23 2.L.2 Types of Data Sets . 29

2.2 Data Quality 36 2.2.I Measurement and Data Collection Issues 37 2.2.2 Issues Related to Applications

2.3 Data Preprocessing 2.3.L Aggregation 2.3.2 Sampling 2.3.3 Dimensionality Reduction 2.3.4 Feature Subset Selection 2.3.5 Feature Creation 2.3.6 Discretization and Binarization 2.3:7 Variable Tlansformation .

2.4 Measures of Similarity and Dissimilarity . . . 2.4.L Basics 2.4.2 Similarity and Dissimilarity between Simple Attributes . 2.4.3 Dissimilarities between Data Objects . 2.4.4 Similarities between Data Objects

43 44 45 47 50 52 55 57 63 65 66 67 69 72

xiv Contents

2.4.5 Examples of Proximity Measures 2.4.6 Issues in Proximity Calculation 2.4.7 Selecting the Right Proximity Measure

2.5 BibliographicNotes 2.6 Exercises

Exploring Data 3.i The Iris Data Set 3.2 Summary Statistics

3.2.L Frequencies and the Mode 3.2.2 Percentiles 3.2.3 Measures of Location: Mean and Median 3.2.4 Measures of Spread: Range and Variance 3.2.5 Multivariate Summary Statistics 3.2.6 Other Ways to Summarize the Data

3.3 Visualization 3.3.1 Motivations for Visualization 3.3.2 General Concepts 3.3.3 Techniques 3.3.4 Visualizing Higher-Dimensional Data . 3.3.5 Do's and Don'ts

3.4 OLAP and Multidimensional Data Analysis 3.4.I Representing Iris Data as a Multidimensional Array 3.4.2 Multidimensional Data: The General Case . 3.4.3 Analyzing Multidimensional Data 3.4.4 Final Comments on Multidimensional Data Analysis Bibliographic Notes Exercises

Classification: Basic Concepts, Decision Tlees, and Model Evaluation 4.1 Preliminaries 4.2 General Approach to Solving a Classification Problem 4.3 Decision Tlee Induction

4.3.1 How a Decision Tlee Works 4.3.2 How to Build a Decision TYee 4.3.3 Methods for Expressing Attribute Test Conditions 4.3.4 Measures for Selecting the Best Split . 4.3.5 Algorithm for Decision Tlee Induction 4.3.6 An Examole: Web Robot Detection

3.5 3.6

73 80 83 84 88

97 98 98 99

100 101 102 704 105 105 105 106 110 724 130 131 131 133 135 139 139 747

L45 746 748 150 150 151 155 158 164 166

Contents xv

4.3.7 Characteristics of Decision Tlee Induction 4.4 Model Overfitting

4.4.L Overfitting Due to Presence of Noise 4.4.2 Overfitting Due to Lack of Representative Samples 4.4.3 Overfitting and the Multiple Comparison Procedure 4.4.4 Estimation of Generalization Errors 4.4.5 Handling Overfitting in Decision Tlee Induction

4.5 Evaluating the Performance of a Classifier 4.5.I Holdout Method 4.5.2 Random Subsampling . . . 4.5.3 Cross-Validation 4.5.4 Bootstrap

4.6 Methods for Comparing Classifiers 4.6.L Estimating a Confidence Interval for Accuracy 4.6.2 Comparing the Performance of Two Models . 4.6.3 Comparing the Performance of Two Classifiers

4.7 BibliographicNotes 4.8 Exercises

5 Classification: Alternative Techniques 5.1 Rule-Based Classifier

5.1.1 How a Rule-Based Classifier Works 5.1.2 Rule-Ordering Schemes 5.1.3 How to Build a Rule-Based Classifier 5.1.4 Direct Methods for Rule Extraction 5.1.5 Indirect Methods for Rule Extraction 5.1.6 Characteristics of Rule-Based Classifiers

5.2 Nearest-Neighbor classifiers 5.2.L Algorithm 5.2.2 Characteristics of Nearest-Neighbor Classifiers

5.3 Bayesian Classifiers 5.3.1 Bayes Theorem 5.3.2 Using the Bayes Theorem for Classification 5.3.3 Naive Bayes Classifier 5.3.4 Bayes Error Rate 5.3.5 Bayesian Belief Networks

5.4 Artificial Neural Network (ANN) 5.4.I Perceptron 5.4.2 Multilayer Artificial Neural Network 5.4.3 Characteristics of ANN

168 172 L75 L77 178 179 184 186 186 187 187 188 188 189 191 192 193 198

207 207 209 2 I I 2r2 2r3 22L 223 223 225 226 227 228 229 23L 238 240 246 247 25r 255

xvi Contents

5.5 Support Vector Machine (SVM) 5.5.1 Maximum Margin Hyperplanes 5.5.2 Linear SVM: Separable Case 5.5.3 Linear SVM: Nonseparable Case 5.5.4 Nonlinear SVM . 5.5.5 Characteristics of SVM Ensemble Methods 5.6.1 Rationale for Ensemble Method 5.6.2 Methods for Constructing an Ensemble Classifier 5.6.3 Bias-Variance Decomposition 5.6.4 Bagging 5.6.5 Boosting 5.6.6 Random Forests 5.6.7 Empirical Comparison among Ensemble Methods Class Imbalance Problem 5.7.1 Alternative Metrics 5.7.2 The Receiver Operating Characteristic Curve 5.7.3 Cost-Sensitive Learning . . 5.7.4 Sampling-Based Approaches . Multiclass Problem Bibliographic Notes Exercises

5 .6

o . t

256 256 259 266 270 276 276 277 278 28r 283 285 290 294 294 295 298 302 305 306 309 315

c .6

5.9 5.10

Association Analysis: Basic Concepts and Algorithms 327 6.1 Problem Definition . 328 6.2 Flequent Itemset Generation 332

6.2.I The Apri,ori Principle 333 6.2.2 Fbequent Itemset Generation in the Apri,ori, Algorithm . 335 6.2.3 Candidate Generation and Pruning . . . 338 6.2.4 Support Counting 342 6.2.5 Computational Complexity 345

6.3 Rule Generatiorr 349 6.3.1 Confidence-Based Pruning 350 6.3.2 Rule Generation in Apri,ori, Algorithm 350 6.3.3 An Example: Congressional Voting Records 352

6.4 Compact Representation of Fbequent Itemsets 353 6.4.7 Maximal Flequent Itemsets 354 6.4.2 Closed Frequent Itemsets 355

6.5 Alternative Methods for Generating Frequent Itemsets 359 6.6 FP-Growth Alsorithm 363

Contents xvii

6.6.1 FP-tee Representation 6.6.2 Frequent Itemset Generation in FP-Growth Algorithm .

6.7 Evaluation of Association Patterns 6.7.l Objective Measures of Interestingness 6.7.2 Measures beyond Pairs of Binary Variables 6.7.3 Simpson's Paradox

6.8 Effect of Skewed Support Distribution 6.9 Bibliographic Notes

363 366 370 37r 382 384 386 390 404

4L5 415 4t8 418 422 424 426 429 429 431 436 439 442 443 444 447 448 453 457 457 458 458

460 461 463 465 469 473

6.10 Exercises

7 Association Analysis: Advanced 7.I Handling Categorical Attributes 7.2 Handling Continuous Attributes

Concepts

7.2.I Discretization-Based Methods 7.2.2 Statistics-Based Methods 7.2.3 Non-discretizalion Methods Handling a Concept Hierarchy Seouential Patterns 7.4.7 Problem Formulation 7.4.2 Sequential Pattern Discovery 7.4.3 Timing Constraints 7.4.4 Alternative Counting Schemes

7.5 Subgraph Patterns 7.5.1 Graphs and Subgraphs . 7.5.2 Frequent Subgraph Mining 7.5.3 Apri,od-like Method 7.5.4 Candidate Generation 7.5.5 Candidate Pruning 7.5.6 Support Counting

7.6 Infrequent Patterns 7.6.7 Negative Patterns 7.6.2 Negatively Correlated Patterns 7.6.3 Comparisons among Infrequent Patterns, Negative Pat-

terns, and Negatively Correlated Patterns 7.6.4 Techniques for Mining Interesting Infrequent Patterns 7.6.5 Techniques Based on Mining Negative Patterns 7.6.6 Techniques Based on Support Expectation .

7.7 Bibliographic Notes 7.8 Exercises

7.3 7.4

xviii Contents

Cluster Analysis: Basic Concepts and Algorithms 8.1 Overview

8.1.1 What Is Cluster Analysis? 8.I.2 Different Types of Clusterings . 8.1.3 Different Types of Clusters

8.2 K-means 8.2.7 The Basic K-means Algorithm 8.2.2 K-means: Additional Issues 8.2.3 Bisecting K-means 8.2.4 K-means and Different Types of Clusters 8.2.5 Strengths and Weaknesses 8.2.6 K-means as an Optimization Problem

8.3 Agglomerative Hierarchical Clustering 8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 8.3.2 Specific Techniques 8.3.3 The Lance-Williams Formula for Cluster Proximity . 8.3.4 Key Issues in Hierarchical Clustering . 8.3.5 Strengths and Weaknesses DBSCAN 8.4.1 Tladitional Density: Center-Based Approach 8.4.2 The DBSCAN Algorithm 8.4.3 Strengths and Weaknesses Cluster Evaluation 8.5.1 Overview 8.5.2 Unsupervised Cluster Evaluation Using Cohesion and

Separation 8.5.3 Unsupervised Cluster Evaluation Using the Proximity

Matrix 8.5.4 Unsupervised Evaluation of Hierarchical Clustering . 8.5.5 Determining the Correct Number of Clusters 8.5.6 Clustering Tendency 8.5.7 Supervised Measures of Cluster Validity 8.5.8 Assessing the Significance of Cluster Validity Measures .

8.4

8.5

487 490 490 49r 493 496 497 506 508 510 510 513 515 516 518 524 524 526 526 527 528 530 532 533

536

542 544 546 547 548 553 ooo

559 8.6 Bibliograph 8.7 Exercises

ic Notes

Cluster Analysis: Additional Issues and Algorithms 569 9.1 Characteristics of Data, Clusters, and Clustering Algorithms . 570

9.1.1 Example: Comparing K-means and DBSCAN . . . . . . 570 9.1.2 Data Characteristics 577

Contents xix

9.1.3 Cluster Characteristics . . 573 9.L.4 General Characteristics of Clustering Algorithms 575

9.2 Prototype-Based Clustering 577 9.2.1 F\zzy Clustering 577 9.2.2 Clustering Using Mixture Models 583 9.2.3 Self-Organizing Maps (SOM) 594

9.3 Density-Based Clustering 600 9.3.1 Grid-Based Clustering 601 9.3.2 Subspace Clustering 604 9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based

Clustering 608 9.4 Graph-Based Clustering 612

9.4.1 Sparsification 613 9.4.2 Minimum Spanning Tlee (MST) Clustering . . . 674 9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities

Using METIS 616 9.4.4 Chameleon: Hierarchical Clustering with Dynamic

Modeling 9.4.5 Shared Nearest Neighbor Similarity 9.4.6 The Jarvis-Patrick Clustering Algorithm 9.4.7 SNN Density 9.4.8 SNN Density-Based Clustering

9.5 Scalable Clustering Algorithms 9.5.1 Scalability: General Issues and Approaches 9.5.2 BIRCH 9.5.3 CURE

9.6 Which Clustering Algorithm? 9.7 Bibliographic Notes 9.8 Exercises

616 622 625 627 629 630 630 633 635 639 643 647

10 Anomaly Detection 651 10.1 Preliminaries 653

10.1.1 Causes of Anomalies 653 10.1.2 Approaches to Anomaly Detection 654 10.1.3 The Use of Class Labels 655 10.1.4 Issues 656

10.2 Statistical Approaches 658 t0.2.7 Detecting Outliers in a Univariate Normal Distribution 659 10.2.2 Outl iersinaMult ivar iateNormalDistr ibut ion . . . . . 661 10.2.3 A Mixture Model Approach for Anomaly Detection. 662

xx Contents

10.2.4 Strengths and Weaknesses 10.3 Proximity-Based Outlier Detection

10.3.1 Strengths and Weaknesses 10.4 Density-Based Outlier Detection

10.4.1 Detection of Outliers Using Relative Density 70.4.2 Strengths and Weaknesses

10.5 Clustering-Based Techniques 10.5.1 Assessing the Extent to Which an Object Belongs to a

Cluster 10.5.2 Impact of Outliers on the Initial Clustering 10.5.3 The Number of Clusters to Use 10.5.4 Strengths and Weaknesses

665 666 666 668 669 670 67L

672 674 674 674 675 680

685 b6i)

10.6 Bibliograph 10.7 Exercises

ic Notes

Appendix A Linear Algebra A.1 Vectors

A.1.1 Definition 685 4.I.2 Vector Addition and Multiplication by a Scalar 685 A.1.3 Vector Spaces 687 4.7.4 The Dot Product, Orthogonality, and Orthogonal

Projections 688 A.1.5 Vectors and Data Analysis 690

42 Matrices 691 A.2.1 Matrices: Definitions 691 A-2.2 Matrices: Addition and Multiplication by a Scalar 692 4.2.3 Matrices: Multiplication 693 4.2.4 Linear tansformations and Inverse Matrices 695 4.2.5 Eigenvalue and Singular Value Decomposition . 697 4.2.6 Matrices and Data Analysis 699

A.3 Bibliographic Notes 700

Appendix B Dimensionality Reduction 7OL 8.1 PCA and SVD 70I

B.1.1 Principal Components Analysis (PCA) 70L 8.7.2 SVD . 706

8.2 Other Dimensionality Reduction Techniques 708 8.2.I Factor Analysis 708 8.2.2 Locally Linear Embedding (LLE) . 770 8.2.3 Multidimensional Scaling, FastMap, and ISOMAP 7I2

Contents xxi

8.2.4 Common Issues B.3 Bibliographic Notes

Appendix C Probability and Statistics C.1 Probability

C.1.1 Expected Values C.2 Statistics

C.2.L Point Estimation C.2.2 Central Limit Theorem C.2.3 Interval Estimation

C.3 Hypothesis Testing

Appendix D Regression D.1 Preliminaries D.2 Simple Linear Regression

D.2.L Least Square Method D.2.2 Analyzing Regression Errors D.2.3 Analyzing Goodness of Fit

D.3 Multivariate Linear Regression D.4 Alternative Least-Square Regression Methods

Appendix E Optimization E.1 Unconstrained Optimizafion

E.1.1 Numerical Methods 8.2 Constrained Optimization

E.2.I Equality Constraints 8.2.2 Inequality Constraints

Author Index

Subject Index

715 7L6

7L9 7L9 722 723 724 724 725 726

739 739 742 746 746 747

750

758

769

729 729 730 731 733 735 736 737

Introduction

Rapid advances in data collection and storage technology have enabled or- ganizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analy- sis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be devel- oped.

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.

Business Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed retailers to collect up-to-the-minute data about customer purchases at the checkout coun- ters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and cus- tomer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, work- flow management, store layout, and fraud detection. It can also help retailers

2 Chapter 1 lntroduction

answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?)) Some of these questions mo- tivated the creation of association analvsis (Chapters 6 and 7), a new data analysis technique.

Medicine, Science, and Engineering Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our under- standing of the Earth's climate system, NASA has deployed a series of Earth- orbiting satellites that continuously generate global observations of the Iand surface, oceans, and atmosphere. However, because of the size and spatio- temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?" "How is land surface precipitation and temper- ature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecu- lar biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled sci- entists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high- dimensional nature of data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a

1.1 What Is Data Mining? 3

future observation, such as predicting whether a newly arrived. customer will spend more than $100 at a department store.

Not all information discovery tasks are considered to be data mining. For example, Iooking up individual records using a database managemenr sysrem or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of the sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.

Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful in- formation, as shown in Figure 1.1. This process consists of a series of trans- formation steps, from data preprocessing to postprocessing of data mining results.

Information

Figure 1 ,1. The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spread- sheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data

4 Chapter 1 Introduction

preprocessing is perhaps the most laborious and time-consuming step in the

overall knowledge discovery process. ,,Closing the loop" is the phrase often used to refer to the process of in-

tegrating data mining results into decision support systems. For example,

in business applications, the insights offered by data mining results can be

integrated with campaign management tools so that effective marketing pro-

motions can be conducted and tested. Such integration requires a postpro-

cessing step that ensures that only valid and useful results are incorporated

into the decision support system. An example of postprocessing is visualiza-

tion (see Chapter 3), which allows analysts to explore the data and the data

mining results from a variety of viewpoints. Statistical measures or hypoth-

esis testing methods can also be applied during postprocessing to eliminate

spurious data mining results.

L.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encoun-

tered practical difficulties in meeting the challenges posed by new data sets.

The following are some of the specific challenges that motivated the develop-

ment of data mining.

Scalability Because of advances in data generation and collection, data sets

with sizes of gigabytes, terabytes, or even petabytes are becoming common.

If data mining algorithms are to handle these massive data sets, then they

must be scalable. Many data mining algorithms employ special search strate-

gies to handle exponential search problems. Scalability may also require the

implementation of novel data structures to access individual records in an ef-

ficient manner. For instance, out-of-core algorithms may be necessary when

processing data sets that cannot fit into main memory. Scalability can also be

improved by using sampling or developing parallel and distributed algorithms.

High Dimensionality It is now common to encounter data sets with hun-

dreds or thousands of attributes instead of the handful common a few decades

ago. In bioinformatics, progress in microarray technology has produced gene

expression data involving thousands of features. Data sets with temporal

or spatial components also tend to have high dimensionality. For example,

consider a data set that contains measurements of temperature at various

locations. If the temperature measurements are taken repeatedly for an ex-

tended period, the number of dimensions (features) increases in proportion to

L.2 Motivating Challenges 5

the number of measurements taken. Tladitional data analysis techniques that were developed for low-dimensional data often do not work well for such high- dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.

Heterogeneous and Complex Data TYaditional data analysis methods often deal with data sets containing attributes of the same type, either contin- uous or categorical. As the role of data mining in business, science, medicine, and other flelds has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such com- plex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child re- lationships between the elements in semi-structured text and XML documents.

Data ownership and Distribution Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. Among the key challenges faced by distributed data mining algorithms in- clude (1) how to reduce the amount of communication needed to perform the distributed computatior, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.

Non-traditional Analysis The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor- intensive. Current data analysis tasks often require the generation and evalu- ation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed

6 Chapter 1 Introduction

experiment and often represent opportunistic samples of the data, rather than

random samples. Also, the data sets frequently involve non-traditional types

of data and data distributions.

1.3 The Origins of Data Mining

Brought together by the goal of meeting the challenges of the previous sec-

tion, researchers from different disciplines began to focus on developing more

efficient and scalable tools that could handle diverse types of data. This work,

which culminated in the field of data mining, built upon the methodology and

algorithms that researchers had previously used. In particular, data mining

draws upon ideas, such as (1) sampling, estimation, and hypothesis testing

from statistics and (2) search algorithms, modeling techniques, and learning

theories from artificial intelligence, pattern recognition, and machine learning.

Data mining has also been quick to adopt ideas from other areas, including

optimization, evolutionary computing, information theory, signal processing,

visualization, and information retrieval. A number of other areas also play key supporting roles. In particular,

database systems are needed to provide support for efficient storage, index-

ing, and query processing. Techniques from high performance (parallel) com-

puting are often important in addressing the massive size of some data sets.

Distributed techniques can also help address the issue of size and are essential

when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a conlluence of many disciplines.

Data Mining Tasks 7

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a par- ticular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent vari- able, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the un- derlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Four of the core data mining tasks.

L.4

Figure 1.3.

8 Chapter 1 Introduction

Predictive modeling refers to the task of building a model for the target

variable as a function of the explanatory variables. There are two types of

predictive modeling tasks: classification, which is used for discrete target

variables, and regression, which is used for continuous target variables. For

example, predicting whether a Web user will make a purchase at an online

bookstore is a classification task because the target variable is binary-valued.

On the other hand, forecasting the future price of a stock is a regression task

because price is a continuous-valued attribute. The goal of both tasks is to

learn a model that minimizes the error between the predicted and true values

of the target variable. Predictive modeling can be used to identify customers

that will respond to a marketing campaign, predict disturbances in the Earth's

ecosystem, or judge whether a patient has a particular disease based on the

results of medical tests.

Example 1.1 (Predicting the Type of a Flower). Consider the task of

predicting a species of flower based on the characteristics of the flower. In

particular, consider classifying an Iris flower as to whether it belongs to one

of the following three Iris species: Setosa, Versicolour, or Virginica. To per-

form this task, we need a data set containing the characteristics of various

flowers of these three species. A data set with this type of information is

the well-known Iris data set from the UCI Machine Learning Repository at

http: /hrurw.ics.uci.edu/-mlearn. In addition to the species of a flower,

this data set contains four other attributes: sepal width, sepal length, petal

length, and petal width. (The Iris data set and its attributes are described

further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal

length for the 150 flowers in the Iris data set. Petal width is broken into the

categories low, med'ium, and hi'gh, which correspond to the intervals [0' 0.75),

[0.75, 1.75), [1.75, oo), respectively. Also, petal length is broken into categories

low, med,'ium, and hi,gh, which correspond to the intervals [0' 2.5), [2.5,5), [5' oo), respectively. Based on these categories of petal width and length, the

following rules can be derived:

Petal width low and petal length low implies Setosa. Petal width medium and petal length medium implies Versicolour. Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not

perfect) job of classifying most of the flowers. Note that flowers from the

Setosa species are well separated from the Versicolour and Virginica species

with respect to petal width and length, but the latter two species overlap

somewhat with respect to these attributes. I

r Setosa . Versicolour o Virginica

L.4 Data Mining Tasks I

l - - - - a - - f o - - - - - - - i l a o r , f t f o o t o a i : o o o I I

' t 0 f 0 a o 0?oo r a a r f I

? 1 . 7 5 E() r 1 . 5 E

= (t' ()

( L l

!0_l! _.! o_ _o. t a a r O

. .4. a?o o a a a a

a aaaaaaa a a a a a

aa a a a a a

l l t l l

l l t I

I l l l l t t I

I t !

1 2 2 . 5 3 4 5 ( Petal Length (cm)

Figure 1.4. Petal width versus petal length for 1 50 lris flowers,

Association analysis is used to discover patterns that describe strongly as- sociated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identi- fying Web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis). The transactions shown in Ta- ble 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are fre- quently bought together by customers. For example, we may discover the rule {Diapers} -----* {lt:.ft}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items. I

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other

10 Chapter 1 Introduction

Table 1 .1. Market basket data.

Tlansaction ID Items 1 2 3 4 r

o 7 8 9 10

{Bread, Butter, Diapers, Milk}

{Coffee, Sugar, Cookies, Sakoon}

{Bread, Butter, Coffee, Diapers, Milk, Eggs}

{Bread, Butter, Salmon, Chicken}

{fgg", Bread, Butter}

{Salmon, Diapers, Milk}

{Bread, Tea, Sugar, Eggs}

{Coffee, Sugar, Chicken, Eggs}

{Bread, Diapers, Mi1k, Salt}

{Tea, Eggs, Cookies, Diapers, Milk}

than observations that belong to other clusters. Clustering has been used to

group sets of related customers, find areas of the ocean that have a significant

impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles

shown in Table 1.2 can be grouped based on their respective topics. Each

article is represented as a set of word-frequency pairs (r, "),

where tu is a word

and c is the number of times the word appears in the article. There are two

natural clusters in the data set. The first cluster consists of the first four ar-

ticles, which correspond to news about the economy, while the second cluster

contains the last four articles, which correspond to news about health care. A

good clustering algorithm should be able to identify these two clusters based

on the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article Words I 2 .) A

r J

7 8

dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2

machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1 job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3

domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2 patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2 pharmaceutical:2, company: 3, drug: 2,vaccine:1, f lu: 3

death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2

medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1

1.5 Scope and Organization of the Book 11

Anomaly detection is the task of identifying observations whose character- istics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection al- gorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.

Example 1.4 (Credit Card trYaud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent. I

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, prepro- cessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.

Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, sup- port vector machines, and ensemble classifiers, which are collections of classi-

!2 Chapter 1 lntroduction

fiers. The multiclass and imbalanced class problems are also discussed. These

topics can be covered independently. Association analysis is explored in Chapters 6 and 7. Chapter 6 describes

the basics of association analysis: frequent itemsets, association rules, and

some of the algorithms used to generate them. Specific types of frequent

itemsets-maximal, closed, and hyperclique-that are important for data min-

ing are also discussed, and the chapter concludes with a discussion of evalua-

tion measures for association analysis. Chapter 7 considers a variety of more

advanced topics, including how association analysis can be applied to categor-

ical and continuous data or to data that has a concept hierarchy. (A concept

hierarchy is a hierarchical categorization of objects, e.g., store items, clothing,

shoes, sneakers.) This chapter also describes how association analysis can be

extended to find sequential patterns (patterns involving order), patterns in

graphs, and negative relationships (if one item is present, then the other is

not). Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes

the different types of clusters and then presents three specific clustering tech-

niques: K-means, agglomerative hierarchical clustering, and DBSCAN. This

is followed by a discussion of techniques for validating the results of a cluster-

ing algorithm. Additional clustering concepts and techniques are explored in

Chapter 9, including fiszzy and probabilistic clustering, Self-Organizing Maps

(SOM), graph-based clustering, and density-based clustering. There is also a

discussion of scalability issues and factors to consider when selecting a clus-

tering algorithm. The last chapter, Chapter 10, is on anomaly detection. After some basic

definitions, several different types of anomaly detection are considered: sta-

tistical, distance-based, density-based, and clustering-based. Appendices A

through E give a brief review of important topics that are used in portions of

the book: linear algebra, dimensionality reduction, statistics, regression, and

optimization. The subject of data mining, while relatively young compared to statistics

or machine learning, is already too large to cover in a single book. Selected

references to topics that are only briefly covered, such as data quality' are

provided in the bibliographic notes of the appropriate chapter. References to

topics not covered in this book, such as data mining for streams and privacy-

preserving data mining, are provided in the bibliographic notes of this chapter.

Bibliographic Notes 13

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory text- books include those by Dunham [10], Han and Kamber l2L), Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6], and Hastie et al. 124]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and F]ank [42]. There are also some more specialized books: Chakrabarti [a] (web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al.

111] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [a0] (bioinformatics), and Zaki and Ho [44] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE In- ternational Conference on Data Mining (ICDM), the SIAM International Con- ference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data min- ing papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the In- ternational Conference on Machine Learning (ICML), and the National Con- ference on Artificial Intelligence (AAAI).

Journal publications on data mining include IEEE Transact'ions on Knowl- edge and Data Engi,neering, Data Mi,ning and Knowledge Discouery, Knowl- edge and Information Systems, Intelli,gent Data Analysi,s, Inforrnati,on Sys- tems, and lhe Journal of Intelligent Informati,on Systems.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Fliedman lf 4]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics.

1_.6

L4 Chapter 1 Introduction

Glymour et al. 116] consider the lessons that statistics may have for data

mining. Smyth et aL [38] describe how the evolution of data mining is being

driven by new types of data and applications, such as those involving streams,

graphs, and text. Emerging applications in data mining are considered by Han

et al. [20] and Smyth [37] describes some research challenges in data mining.

A discussion of how developments in data mining research can be turned into

practical tools is given by Wu et al. [43]. Data mining standards are the

subject of a paper by Grossman et al. [17]. Bradley [3] discusses how data

mining algorithms can be scaled to large data sets. With the emergence of new data mining applications have come new chal-

lenges that need to be addressed. For instance, concerns about privacy breaches

as a result of data mining have escalated in recent years, particularly in ap-

plication domains such as Web commerce and health care. As a result, there

is growing interest in developing data mining algorithms that maintain user

privacy. Developing techniques for mining encrypted or randomized data is

known as privacy-preserving data mining. Some general references in this

area include papers by Agrawal and Srikant l1], Clifton et al. [7] and Kargupta

et al. [27]. Vassilios et al. [39] provide a survey. Recent years have witnessed a growing number of applications that rapidly

generate continuous streams of data. Examples of stream data include network

traffic, multimedia streams, and stock prices. Several issues must be considered

when mining data streams, such as the limited amount of memory available,

the need for online analysis, and the change of the data over time. Data

mining for stream data has become an important area in data mining. Some

selected publications are Domingos and Hulten [8] (classification), Giannella

et al. [15] (association analysis), Guha et al. [19] (clustering), Kifer et al. [28] (change detection), Papadimitriou et al. [32] (time series), and Law et al. [30] (dimensionality reduction).

[1]

12l

l o l[')]

[4]

Bibliography R. Agrawal and R. Srikant. Privacy-preserving data mining. ln Proc. of 2000 ACM-

SIGMOD IntI. Conf. on Management of Data, pages 439-450, Dallas, Texas, 2000.

ACM Press.

M. J. A. Berry and G. Linofi. Data Mtni,ng Technr,ques: For Marketing, Sales, and'

Customer Relati,onship Management. Wiley Computer Publishing, 2nd edition, 2004.

P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms

to large databases. Communicati,ons of the ACM, 45(8):38 43,2002.

S. Chakrabarti. Mini.ng the Web: Di.scoueri.ng Knouledge from Hypertert Data' Morgan

Kaufmann, San Flancisco, CA, 2003.

Bibliography 15

[5] M.-s. chen, J. Han, and P. s. Yu. Data Mining: An overview from a Database Perspective. IEEE Transact'ions on Knowled,ge abd Data Engineering, g(6):g66-gg3, 1996.

[6] v. cherkassky and F. Mulier. Learn'ing from Data: concepts, Theory, and, Method,s. Wiley Interscience, 1g98.

[7] c. clifton, M. Kantarcioglu, and J. vaidya. Defining privacy for data mining. In National Sc'ience Foundat'ion workshop on Nert Generation Data Mining, pages 126- 133, Baltimore, MD, November 2002.

f8] P' Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intt. conf. on Knowled,ge Discouery and Data M'in'ing, pages z1-80, Boston, Massachusetts, 2000. ACM Press.

J9] R. o' Duda, P. E. Hart, and D. G. stork. Pattern classification John wiley & sons, Inc., New York, 2nd edition, 2001.

[10] M. H. Dunham. Data Mini,ng: Introd,uctory and, Ad,uanced ropics. prentice Hall, 2002. f11] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in

Data Mining and, Knowled,ge Discouery. Morgan Kaufmann Publishers, San Ftancisco, CA, September 200I.

112] u. M. Fayyad, G. Piatetsky-Shapiro, and P. smyth. Fyom Data Mining to Knowledge Discovery: An overview. rn Ad,aances in Knowledge Discouery and Data M'ining, pages 1-34. AAAI Press, 1996.

[13] u. M. Fayyad, G. Piatetsky-shapiro, P. Smyth, and R. uthurusamy, editors. Aduances 'in Knowled,ge Discouery and Data Mini.ng. AAAI/MIT press, 1g96.

[14] J. H. Friedman. Data Mining and Statistics: What's the Connection? Unpublished. www-stat.stanford.edu/-jhf/ftp/dm-stat.ps, 1992.

[15] c. Giannella, J. Han, J. Pei, X. Yan, and P. s. Yu. Mining Fyequent patterns in Data streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K. sivakumar, and Y. Yesha, editors, Nert Generation Data M,ining, pages ISI-2I2. AAAI/MIT,2003.

116] c. Glymour, D. Madigan, D. Pregibon, and P. smyth. statistical rhemes and Lessons for Data Mining. Data Mining and Knowleilge D,iscouerg, 1(1):11-28, 1992.

[17] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards initiatives. c omtnunications of the A c M, 45(g) :59-6I, 2002.

[18] R. L. Grossman, c. Kamath, P. Kegelmeyer, v. Kumar, and R. Namburu, editors. Data Mini;ng for Sci,entific and Engineering Applicati,ons. Kluwer Academic Publishers, 2001.

119] s. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. o'callaghan. clustering Data Streams: Theory and Practice. IEEE Tbansact'ions on Knowledge and, Data Engineering, 15(3) :515-528, May/June 2003.

[20] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. pregibon. Emerging scientific applications in data mining. Communications of the ACM, 4S(8):54-b8,2002.

[21] J. Han and M. Kamber. Data Mining: concepts and, Techniques. Morgan Kaufmann Publishers, San Francisco, 2001.

[22] D. J. Hand. Data Mining: Statistics and More? The American Statistician, 52(2): 1 1 2 - 1 1 8 , 1 9 9 8 .

[23] D. J. Hand, H. Mannila, and P. smyth. Principles of Data Mining. MIT press, 2001.

l24l T. Hastie, R. Tibshirani, and J. H. Fliedman. The Elements of Stati.stical Learning: Data Mini,ng, Inference, Pred,iction. Springer, New York, 2001.

[25] M. Kantardzic. Data Mini,ng: concepts, Mod,el.s, Method,s, and Algorithms. wiley-IEEE Press, Piscataway NJ, 2003.

16 Chapter I Introduction

[26] H. Kargupta and P. K. Chan, editors. Aduances in Di,stributed and Parallel Knowledge

Discouery. AAAI Press, September 2002.

l27l H. Kargupta, s. Datta, Q. wang, and K. sivakumar. on the Privacy Preserving Prop-

erties of Random Data Perturbation Techniques. In Proc. of the 2003 IEEE IntI. Conf.

on Data Mi,n'ing, pages 99 106, Melbourne, Florida, December 2003. IEEE Computer

Society.

f28] D. Kifer, s. Ben-David, and J. Gehrke. Detecting change in Data streams. In Proc. of

the 30th VLDB Conf., pages 180 191, Toronto, Canada,2004. Morgan Kaufmann.

f29] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD Workshop

on Research Issues in Data Mini'ng and Knowledge Di'scouery, pages 54-62, 2000.

[30] M H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for Data

Streams. In Proc. of the SIAM Intl. Conf. on Data Mi.ning, Lake Buena Vista, Florida,

Apri l2004. SIAM.

[31] T. Mitchell. Mach'ine Learning. McGraw-Hill, Boston, MA, 1997.

[32] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised stream min-

ing. VLDB Journal, 13(3):222 239.2004.

f33l O. Parr Rud. Data Mi,ning Cookbook: Modeling Data for Marleet'ing, Risk and, Customer

Relationship Management John Wiley & Sons, New York, NY, 2001.

[34] D. Pyle. Business Modeling and, Data Mining. Morgan Kaufmann, san FYancisco, cA,

2003.

135] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to Science-Guest

Editors' Introduction. IEEE Computer, S2(8):34 37, 1999.

[36] R. Roiger and M. Geatz. Data Mzni,ng: A Tutorial Based, Primer. Addison-Wesley,

2002.

137] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data Mining. In

Proc. of the 2001 ACM SIGMOD Workshop on Research Issues i.n Data Mining and

Knowledg e Discouerg, 2OOL.

[38] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data mining algo-

rithms. Commun'ications of the ACM, 45(8):33-37, 2002.

[39] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis.

State-of-the-art in privacy preserving data mining. SIGMOD Record,,33(1):50-57' 2004.

[40] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. tr. Shasha, editors. Data Mining in

Bi,oi,nformatics. Springer, September 2004.

[41] A. R. Webb. Statistical Pattern Recogn'iti'on. John Wiley & Sons, 2nd edition, 2002.

[42] I.H. Witten and E. Frank. Data Mini,ng: Practzcal Machine Learn'ing Tools and Tech-

niques with Jaaa Implernentat'ions. Morgan Kaufmann, 1999.

[43] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research Meets Practical

Development ? Knowledg e and Inf ormati'on Sy stems, 5 (2) :248-261, 2003.

l44l M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining. Springer, September

2002.

L.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

J .

L.7 Exercises L7

(a) Dividing the customers of a company according to their gender.

(b) Dividing the customers of a company according to their profitability.

(d) Sorting a student database based on student identification numbers.

(e) Predicting the outcomes of tossing a (fair) pair of dice.

(f) Predicting the future stock price of a company using historical records.

(g) Monitoring the heart rate of a patient for abnormalities.

(h) Monitoring seismic waves for earthquake activities.

(i) Extracting the frequencies of a sound wave.

Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.

For each of the following data sets, explain whether or not data privacy is an important issue.

(a) Census data collected from 1900-1950.

(b) IP addresses and visit times of Web users who visit your Website.

(d) Names and addresses of people from the telephone book.

(e) Names and email addresses collected from the Web.

Data This chapter discusses several data-related issues that are important for suc- cessful data mining:

The Type of Data Data sets differ in a number of ways. For example, the attributes used to describe data objects can be of different types-quantitative or qualitative-and data sets may have special characteristics; e.g., some data sets contain time series or objects with explicit relationships to one another. Not surprisingly, the type of data determines which tools and techniques can be used to analyze the data. F\rrthermore, new research in data mining is often driven by the need to accommodate new application areas and their new types of data.

The Quality of the Data Data is often far from perfect. while most data mining techniques can tolerate some level of imperfection in the data, a focus on understanding and improving data quality typically improves the quality of the resulting analysis. Data quality issues that often need to be addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and data that is biased or, in some other way, unrepresentative of the phenomenon or population that the data is supposed to describe.

Preprocessing Steps to Make the Data More suitable for Data Min- ing often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining tech- nique or tool. For example, a continuous attribute, e.g., length, m&y need to be transformed into an attribute with discrete categories, e.g., short, med,ium, or long, in order to apply a particular technique. As another example, the

20 Chapter 2 Data

number of attributes in a data set is often reduced because many techniques

are more effective when the data has a relatively small number of attributes.

Analyzing Data in Terms of Its Relationships One approach to data

analysis is to find relationships among the data objects and then perform

the remaining analysis using these relationships rather than the data objects

themselves. For instance, we can compute the similarity or distance between pairs of objects and then perform the analysis-clustering, classification, or

anomaly detection-based on these similarities or distances. There are many

such similarity or distance measures) and the proper choice depends on the

type of data and the particular application.

Example 2.1 (An Illustration of Data-Related Issues). To further il-

Iustrate the importance of these issues, consider the following hypothetical sit-

uation. You receive an email from a medical researcher concerning a project

that you are eager to work on.

Hi,

I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we

meet when I get back to discuss your preliminary results? I might invite a few other members of mv team.

Thanks and see you in a couple of days.

Despite some misgivings, you proceed to analyze the data. The first few

rows of the fiIe are as follows:

232 33.5 0 10.7 72r 16.9 2 2L0.L 165 24.0 0 427.6

A brieflook at the data reveals nothing strange. You put your doubts aside

and start the analysis. There are only 1000 lines, a smaller data file than you

had hoped for, but two days later, you feel that you have made some progress.

You arrive for the meeting, and while waiting for others to arrive, you strike

0r2 020 027

2 L

up a conversation with a statistician who is working on the project. When she learns that you have also been analyzing the data from the project, she asks if you would mind giving her a brief overview of your results.

Statistician: So, you got the data for all the patients? Data Miner: Yes. I haven't had much time for analysis, but I

do have a few interesting results. Statistician: Amazing. There were so many data issues with

this set of patients that I couldn't do much. Data Miner: Oh? I didn't hear about any possible problems. Statistician: Well, first there is field 5, the variable we want to

predict. It's common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?

Data Miner: No. Statistician: But surely you heard about what happened to field

4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all 10's were changed into 0's. Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.

Data Miner: Interesting. Were there any other problems? Statistician: Yes, fields 2 and 3 are basically the same, but I

assume that you probably noticed that. Data Miner: Yes, but these fields were only weak predictors of

field 5. Statistician: Anyway, given all those problems, I'm surprised

you were able to accomplish anything. Data Miner: Thue, but my results are really quite good. Field 1

is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.

Statistician: What? Field 1 is just an identification number. Data Miner: Nonetheless, my results speak for themselves. Statistician: Oh, no! I just remembered. We assigned ID

numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.

Table 2,1. A sample data set containing student information.

Homework is Completed By:

Writer	Writer Name	Amount	Client Comments & Rating
ONLINE	Instant Homework Helper 4.8 4305 Orders Completed	$36	She helped me in last minute in a very reasonable price. She is a lifesaver, I got A+ grade in my homework, I will surely hire her again for my next assignments, Thumbs Up! 5.00
Answer.docx Turnitin Report.pdf Contact Writer For Solution Contact Writer For Solution

Order & Get This Solution Within 3 Hours in $25/Page

Custom Original Solution And Get A+ Grades

100% Plagiarism Free
Proper APA/MLA/Harvard Referencing
Delivery in 3 Hours After Placing Order
Free Turnitin Report
Unlimited Revisions
Privacy Guaranteed

Order Now

Order & Get This Solution Within 6 Hours in $20/Page

Custom Original Solution And Get A+ Grades

100% Plagiarism Free
Proper APA/MLA/Harvard Referencing
Delivery in 6 Hours After Placing Order
Free Turnitin Report
Unlimited Revisions
Privacy Guaranteed

Order Now

Order & Get This Solution Within 12 Hours in $15/Page

Custom Original Solution And Get A+ Grades

100% Plagiarism Free
Proper APA/MLA/Harvard Referencing
Delivery in 12 Hours After Placing Order
Free Turnitin Report
Unlimited Revisions
Privacy Guaranteed

Order Now

6 writers have sent their proposals to do this homework:

Writer	Writer Name	Offer	Chat
ONLINE	Smart Tutor I have assisted scholars, business persons, startups, entrepreneurs, marketers, managers etc in their, pitches, presentations, market research, business plans etc. 4.9 1008 Orders Completed	$18	Chat With Writer
ONLINE	Top Grade Tutor I have worked on wide variety of research papers including; Analytical research paper, Argumentative research paper, Interpretative research, experimental research etc. 4.8 11445 Orders Completed	$34	Chat With Writer
ONLINE	Instant Assignments After reading your project details, I feel myself as the best option for you to fulfill this project with 100 percent perfection. 4.8 1869 Orders Completed	$38	Chat With Writer
ONLINE	Buy Coursework Help I can assist you in plagiarism free writing as I have already done several related projects of writing. I have a master qualification with 5 years’ experience in; Essay Writing, Case Study Writing, Report Writing. 4.8 1617 Orders Completed	$20	Chat With Writer
ONLINE	Quick Mentor I have read your project details and I can provide you QUALITY WORK within your given timeline and budget. 4.1 210 Orders Completed	$26	Chat With Writer
ONLINE	Coursework Helper I will be delighted to work on your project. As an experienced writer, I can provide you top quality, well researched, concise and error-free work within your provided deadline at very reasonable prices. 4.8 3234 Orders Completed	$21	Chat With Writer