School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics
HW07 Lecture 07 Classification
Questions
Perform the ID3 Algorithm
R exercise for Decision Tree section 7_1
Explain how Random Forest Algorithm works
Iris Dataset with Decision Tree vs Random Forest
R exercise for Naïve Bayes section 7_2
Analyze Classifier Performance section 7_3
Redo the ID3 and Naïve Bayes calculations for the Golf dataset
HW07 Q1 Apply the ID3 algorithm to construct the decision tree for the data set below
http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
Select  Size    Color  Shape
yes     medium  blue   brick
yes     small   red    sphere
yes     large   green  pillar
yes     large   green  sphere
no      small   red    wedge
no      large   red    wedge
no      large   red    pillar
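Before drawing the tree, it helps to compute the quantities ID3 actually uses: the entropy of the class label and the information gain of each attribute. A minimal sketch in R, assuming the table above is typed in by hand (the `shapes`, `entropy`, and `info_gain` names are mine, not from the assignment):

```r
# Type in the shapes data set from the table above
shapes <- data.frame(
  Select = c("yes","yes","yes","yes","no","no","no"),
  Size   = c("medium","small","large","large","small","large","large"),
  Color  = c("blue","red","green","green","red","red","red"),
  Shape  = c("brick","sphere","pillar","sphere","wedge","wedge","pillar")
)

# Shannon entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting on one attribute
info_gain <- function(df, attr, target = "Select") {
  splits   <- split(df[[target]], df[[attr]])
  weighted <- sum(sapply(splits,
                         function(s) length(s) / nrow(df) * entropy(s)))
  entropy(df[[target]]) - weighted
}

sapply(c("Size", "Color", "Shape"), function(a) info_gain(shapes, a))
```

Shape has the highest gain (about 0.70 bits against a base entropy of about 0.99), so ID3 places it at the root: wedge is purely "no", brick and sphere lean "yes", and only pillar needs a further split.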
HW07 Q2
Analyze the R code in section 7_1 to create the decision tree classifier for the dataset bank-sample.csv
Create and explain all plots and results
# install the packages first if needed: install.packages(c("rpart", "rpart.plot"))
# put this code into an RStudio source file and execute lines via Ctrl+Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# build a simple decision tree by keeping only the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing +
               loan + contact + poutcome,
             method="class", data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)
# plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
HW07 Q3
Explain how the Random Forest algorithm works
http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics
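As background for the linked post: a random forest combines two sources of randomness. Each tree is grown on a bootstrap sample of the rows (bagging), and each split considers only a random subset of the predictors; the forest predicts by majority vote over its trees. A minimal sketch with the `randomForest` package on the built-in iris data (the package and parameter choices are mine):

```r
library(randomForest)
set.seed(42)

# ntree bootstrap trees; mtry predictors tried at each split
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,
                    mtry  = 2)

print(fit)       # out-of-bag (OOB) error, estimated from held-out rows
importance(fit)  # mean decrease in Gini impurity per predictor
```

Because each tree never sees its out-of-bag rows, the OOB error gives a built-in validation estimate without a separate test set.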
HW07 Q4 Using the Iris Dataset
Use a Decision Tree classifier and a Random Forest
Attributes: sepal length, sepal width, petal length, petal width
Every iris flower has sepals and petals
Three species (Setosa, Versicolor, Virginica) with different measurements
R.A. Fisher, 1936
HW07 Q4 Using Iris Dataset
Decision Tree applied to Iris Dataset
https://rpubs.com/abhaypadda/k-nn-decision-tree-on-IRIS-dataset or
https://davetang.org/muse/2013/03/12/building-a-classification-tree-in-r/
What are the disadvantages of Decision Trees?
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
Random Forest applied to the Iris dataset; compare the results:
https://rpubs.com/rpadebet/269829
http://rischanlab.github.io/RandomForest.html
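One way to set up the comparison the question asks for is a simple holdout split and accuracy on the held-out rows; the 70/30 proportion and the seed below are my choices, not prescribed by the assignment:

```r
library(rpart)
library(randomForest)
set.seed(1)

# 70/30 train/test split of the built-in iris data
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# single decision tree vs. a forest of trees
tree <- rpart(Species ~ ., data = train, method = "class")
rf   <- randomForest(Species ~ ., data = train)

tree_acc <- mean(predict(tree, test, type = "class") == test$Species)
rf_acc   <- mean(predict(rf, test) == test$Species)
c(tree = tree_acc, forest = rf_acc)
```

On a dataset this small and well-separated the two accuracies are usually close; the forest's advantage (lower variance, no single brittle set of splits) shows up more clearly on noisier data.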
Get data and e1071 package
sample<-read.table("sample1.csv",header=TRUE,sep=",")
traindata<-as.data.frame(sample[1:14,])
testdata<-as.data.frame(sample[15,])
traindata #lists train data
testdata #lists test data, no Enrolls variable
install.packages("e1071", dep = TRUE)
library(e1071) #contains naïve Bayes function
model<-naiveBayes(Enrolls~Age+Income+JobSatisfaction+Desire,traindata)
model # generates model output
results<-predict(model,testdata)
results # provides test prediction
HW07 Q5 Section 7.2 Naïve Bayes in R
HW07 Q6 Section 7.3 Classifier Performance
# install some packages
install.packages("ROCR")
library(ROCR)
# training set
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
# drop a few columns
drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
# testing set
banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
banktest <- banktest[, !(names(banktest) %in% drops)]
# build the naïve Bayes classifier (naiveBayes comes from e1071)
library(e1071)
nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
# score the testing set
nb_prediction <- predict(nb_model,
                         # remove column "subscribed"
                         banktest[, -ncol(banktest)],
                         type='raw')
score <- nb_prediction[, c("yes")]
actual_class <- banktest$subscribed == 'yes'
pred <- prediction(score, actual_class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, lwd=2, xlab="False Positive Rate (FPR)",
ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
## corresponding AUC score
auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc
7.3 Diagnostics of Classifiers
We cover three classifiers:
Logistic regression, decision trees, naïve Bayes
Tools to evaluate classifier performance:
Confusion matrix
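In R a confusion matrix is simply a cross-tabulation of predicted against actual class labels. A minimal sketch with made-up label vectors (not the bank data itself), just to show the layout and the accuracy computation:

```r
# toy predicted vs. actual labels for a binary classifier
actual    <- factor(c("yes","no","yes","no","no","yes","no","no"))
predicted <- factor(c("yes","no","no","no","no","yes","yes","no"))

# rows = predictions, columns = ground truth
cm <- table(Predicted = predicted, Actual = actual)
cm

# accuracy = (TP + TN) / total, i.e. the diagonal over the grand total
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The off-diagonal cells are the false positives and false negatives, which feed directly into the TPR/FPR quantities the ROC code in this section plots.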
7.3 Diagnostics of Classifiers
Bank marketing example
Training set of 2000 records
Test set of 100 records, evaluated with a confusion matrix
HW07 Q7 Review the calculations for the ID3 and Naïve Bayes algorithms
Record  OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY GOLF
X0      Rainy     Hot          High      False  No
X1      Rainy     Hot          High      True   No
X2      Overcast  Hot          High      False  Yes
X3      Sunny     Mild         High      False  Yes
X4      Sunny     Cool         Normal    False  Yes
X5      Sunny     Cool         Normal    True   No
X6      Overcast  Cool         Normal    True   Yes
X7      Rainy     Mild         High      False  No
X8      Rainy     Cool         Normal    False  Yes
X9      Sunny     Mild         Normal    False  Yes
X10     Rainy     Mild         Normal    True   Yes
X11     Overcast  Mild         High      True   Yes
X12     Overcast  Hot          Normal    False  Yes
X13     Sunny     Mild         High      True   No
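The hand calculations can be checked in R. The sketch below types in the golf table, recomputes the ID3 information gains, and fits naïve Bayes with e1071 to score record X13 (the `gain` helper name is mine; training on all 14 rows, including X13, is purely illustrative):

```r
library(e1071)

golf <- data.frame(
  Outlook     = c("Rainy","Rainy","Overcast","Sunny","Sunny","Sunny","Overcast",
                  "Rainy","Rainy","Sunny","Rainy","Overcast","Overcast","Sunny"),
  Temperature = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool","Mild","Cool",
                  "Mild","Mild","Mild","Hot","Mild"),
  Humidity    = c("High","High","High","High","Normal","Normal","Normal","High",
                  "Normal","Normal","Normal","High","Normal","High"),
  Windy       = c("False","True","False","False","False","True","True","False",
                  "False","False","True","True","False","True"),
  Play        = c("No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes",
                  "Yes","Yes","Yes","No")
)

# ID3: information gain of each attribute against Play
entropy <- function(x) { p <- table(x) / length(x); -sum(p * log2(p)) }
gain <- function(attr) {
  entropy(golf$Play) -
    sum(sapply(split(golf$Play, golf[[attr]]),
               function(s) length(s) / nrow(golf) * entropy(s)))
}
sapply(c("Outlook", "Temperature", "Humidity", "Windy"), gain)

# Naïve Bayes: posterior class for record X13 (Sunny, Mild, High, True)
nb <- naiveBayes(Play ~ ., data = golf)
predict(nb, golf[14, -5])
```

Outlook has the largest gain (about 0.25 bits against a base entropy of about 0.94), so ID3 splits on it first, and the naïve Bayes posterior favors "No" for X13, matching the recorded label.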
Questions?