Data Analytics Lab
Lab 6 DATA Instructions.docx
6-2 Lab 6: Clustering and Association
Assignment
This lab is designed to investigate and practice association rules. After completing the tasks in this lab, you should be able to use R functions for association rule models. You will complete the following tasks in this lab:
· Use the RStudio environment to code association rule models
· Apply constraints in the Market Basket analysis methods, such as minimum thresholds on support and confidence measures that can be used to select interesting rules from the set of all possible rules
· Use the R packages "arules" and "arulesViz" to execute and inspect the models and the effect of the various thresholds
In order to complete this assignment, you will need to download a copy of the Lab 6 document, enter your responses (the areas highlighted in yellow), and submit your completed file as a Word document. Add your last name to the filename of the document you submit (for example, Britton_Lab_6.docx). You will also need the following files:
· MBAdata.csv
· mba.R
This is a pass/fail assignment.
dat_510_lab_6.docx
Lab Exercise 6: Association Rules
Purpose:
This lab is designed to investigate and practice Association Rules. After completing the tasks in this lab you should be able to:
· Use R functions for Association Rule based models
Tasks:
Tasks you will complete in this lab include:
· Use the RStudio environment to code Association Rule models
· Apply constraints in the Market Basket Analysis methods such as minimum thresholds on support and confidence measures that can be used to select interesting rules from the set of all possible rules
· Use the R packages “arules” and “arulesViz” to execute and inspect the models and the effect of the various thresholds
References:
· The groceries data set - provided for arules by Michael Hahsler, Kurt Hornik and Thomas Reutterer.
· Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.
Workflow Overview
LAB Instructions
Step
Action
1
Download the lab files from the Learning Environment:
· Start Here > Assignment Guidelines and Rubrics > Data Files
· MBAdata (CSV file)
· mba.R (R File)
2
Set the Working Directory and install the “arules” and “arulesViz” packages:
To understand Market Basket Analysis and the R package “arules,” use a small set of book-purchase transactions.
1. Set the working directory to the folder that contains the lab files by executing the command:
setwd("")
· (Or using the “Tools” option in the tool bar in the RStudio environment.)
2. Install the packages (select the mirror if prompted) and load the required libraries:
> # Install the packages and load libraries
> install.packages('arules')
> install.packages('arulesViz')
> library('arules')
> library('arulesViz')
3
Read in the Data for Modeling:
· A transaction list is a special data type (class “transactions”) provided by the “arules” package.
1. Read the data in as a transaction list using the following statement for the book-purchase data, “MBAdata.csv”.
> # read in the csv file as transaction data
> txn <- read.transactions("MBAdata.csv", rm.duplicates = FALSE, format = "single", sep = ",", cols = c(1, 2))
The arguments of the read.transactions function are detailed below:
· file
the file name.
· format
a character string indicating the format of the data set. One of "basket" or "single"; can be abbreviated.
· sep
a character string specifying how fields are separated in the data file, or NULL (default). For basket format, this can be a regular expression; otherwise, a single character must be given. The default corresponds to white space separators.
· cols
For the ‘single’ format, cols is a numeric vector of length two giving the numbers of the columns (fields) with the transaction and item ids, respectively. For the ‘basket’ format, cols can be a numeric scalar giving the number of the column (field) with the transaction ids. If cols = NULL, the data do not contain transaction ids.
· rm.duplicates
a logical value specifying if duplicate items should be removed from the transactions.
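The difference between the "single" and "basket" formats described above can be sketched in base R, without arules. This is only an illustration: the rows are a few lines copied from MBAdata.csv, and the names csv_text, pairs, baskets, and basket_lines are made up for the example.

```r
# A few rows of MBAdata.csv, inlined so the sketch runs without the file.
csv_text <- "101,R-basics
101,Stat-intro
101,PSQL-basics
102,Stat-intro
102,R-basics
105,PSQL-basics"

# "single" format: one (transaction id, item) pair per line.
pairs <- read.csv(text = csv_text, header = FALSE,
                  col.names = c("tid", "item"), stringsAsFactors = FALSE)

# Group the items by transaction id; each list element is one basket.
baskets <- split(pairs$item, pairs$tid)
print(baskets)

# The same data as "basket" format lines: one transaction per line, items only.
basket_lines <- vapply(baskets, paste, character(1), collapse = ",")
print(unname(basket_lines))
```

With cols = c(1, 2), read.transactions performs the equivalent grouping itself, using column 1 as the transaction id and column 2 as the item.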
4
Review Transaction data:
1. First inspect the transaction data (the available slot names can vary with the installed version of the arules package)
> txn@transactionInfo
> txn@itemInfo
Or
> txn@itemsetInfo
> txn@itemInfo
2. Review the results on the console
5
Plot Transactions:
1. Use the “image” function that shows a visual representation of the transaction set in which the rows are individual transactions (identified by transaction ids) and the dark squares are items contained in each transaction.
> image(txn)
2. Review the output in the graphics window
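The picture drawn by image(txn) is a transaction-by-item grid. As a rough base R sketch of that grid (not the arules implementation), the MBAdata.csv rows can be cross-tabulated into a logical matrix; the names txn_df and incidence are illustrative only.

```r
# The rows of MBAdata.csv as a data frame (transaction id, item).
txn_df <- data.frame(
  tid  = c(101, 101, 101, 102, 102, 103, 103, 103,
           104, 104, 104, 105, 106, 106, 107, 107),
  item = c("R-basics", "Stat-intro", "PSQL-basics", "Stat-intro", "R-basics",
           "Stat-intro", "Learn-Spanish", "Jane-Austen", "Stat-intro",
           "R-basics", "Harry-Potter-DVD", "PSQL-basics", "Stat-intro",
           "PSQL-basics", "Stat-intro", "R-basics"),
  stringsAsFactors = FALSE
)

# Cross-tabulate ids against items; TRUE means the item is in the transaction.
incidence <- unclass(table(txn_df$tid, txn_df$item)) > 0

# Rows are transactions, columns are items; 1 marks a "dark square".
print(incidence * 1L)

# Row and column totals: basket sizes and item counts.
items_per_txn <- rowSums(incidence)
txns_per_item <- colSums(incidence)
print(items_per_txn)
print(txns_per_item)
```

Each TRUE cell corresponds to one dark square in the plot, so dense columns (here Stat-intro) mark frequently purchased items.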
6
Mine the Association Rules:
The “apriori” function, provided by the arules package, is used as follows:
rules <- apriori(data,
parameter = list(supp = 0.5, conf = 0.9,
target = "rules"))
where the arguments are:
· data
object of class transactions or any data structure which can be coerced into transactions (for example, a binary matrix or data.frame).
· parameter
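named list. The default behavior is to mine rules with minimum support 0.1, minimum confidence 0.8, and maxlen 5 (the maximum number of items per rule; recent arules versions document a default of 10, so check ?apriori for your installed version).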
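The supp and conf thresholds refer to measures that can be computed by hand on small data. The following base R sketch does so for the book-purchase transactions in MBAdata.csv; support and rule_measures are hypothetical helpers written for this illustration, not arules functions.

```r
# The seven transactions from MBAdata.csv, one character vector per basket.
baskets <- list(
  "101" = c("R-basics", "Stat-intro", "PSQL-basics"),
  "102" = c("Stat-intro", "R-basics"),
  "103" = c("Stat-intro", "Learn-Spanish", "Jane-Austen"),
  "104" = c("Stat-intro", "R-basics", "Harry-Potter-DVD"),
  "105" = c("PSQL-basics"),
  "106" = c("Stat-intro", "PSQL-basics"),
  "107" = c("Stat-intro", "R-basics")
)

# Support: fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Hypothetical helper: quality measures for the rule lhs => rhs.
rule_measures <- function(lhs, rhs, baskets) {
  supp <- support(c(lhs, rhs), baskets)
  conf <- supp / support(lhs, baskets)   # P(rhs | lhs)
  lift <- conf / support(rhs, baskets)   # confidence relative to P(rhs)
  c(support = supp, confidence = conf, lift = lift)
}

m <- rule_measures("R-basics", "Stat-intro", baskets)
print(round(m, 3))

# The reverse rule has the same support and lift but lower confidence.
rev_m <- rule_measures("Stat-intro", "R-basics", baskets)
print(round(rev_m, 3))

# Apply the thresholds supp = 0.5 and conf = 0.9 to both rules.
passes     <- m[["support"]] >= 0.5 && m[["confidence"]] >= 0.9
rev_passes <- rev_m[["support"]] >= 0.5 && rev_m[["confidence"]] >= 0.9
print(c(forward = passes, reverse = rev_passes))
```

Support and lift are symmetric in the two sides of a rule, while confidence is not, which is why only one direction of this pair clears a confidence threshold of 0.9.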
1. Run the following statement on the transaction data:
> #mine association rules
> basket_rules <- apriori(txn,parameter=list(sup=0.5,conf=0.9,target="rules"))
2. Review the output on the console. The number of rules generated can be seen in the output and is represented as follows:
writing ... [1 rule(s)] done [0.00s]
3. Inspect the rule using the following statement:
> inspect(basket_rules)
4. Review the output.
5. State the generated rule and its support, confidence, and lift values.
7
Read in Groceries dataset
Use the standard data set “Groceries”, which is available with the “arules” package.
· The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
1. Read in the data set and inspect the item information
> #Read in Groceries data
> data(Groceries)
> Groceries@itemInfo
8
Mine the Rules for the Groceries Data:
> #mine rules
> rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
· Note the values used for the parameter list.
1. How many rules are generated?
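A support of 0.001 on 9835 transactions means an itemset must occur in at least 10 transactions (0.001 × 9835 ≈ 9.8), which is why so many rules survive. The effect of lowering the minimum support can be sketched in base R on the much smaller book-purchase data; this is an illustration limited to itemsets of size 1 and 2, not a reimplementation of apriori, and all names are made up for the example.

```r
# The seven transactions from MBAdata.csv.
baskets <- list(
  c("R-basics", "Stat-intro", "PSQL-basics"),
  c("Stat-intro", "R-basics"),
  c("Stat-intro", "Learn-Spanish", "Jane-Austen"),
  c("Stat-intro", "R-basics", "Harry-Potter-DVD"),
  c("PSQL-basics"),
  c("Stat-intro", "PSQL-basics"),
  c("Stat-intro", "R-basics")
)
items <- sort(unique(unlist(baskets)))

# Candidate itemsets of size 1 and 2 (apriori also explores larger sets).
candidates <- c(as.list(items), combn(items, 2, simplify = FALSE))

# Support of each candidate: share of baskets containing all of its items.
supp <- vapply(
  candidates,
  function(s) mean(vapply(baskets, function(b) all(s %in% b), logical(1))),
  numeric(1)
)

# Count how many candidates remain frequent as the minimum support is lowered.
thresholds <- c(0.5, 0.25, 0.1)
n_frequent <- vapply(thresholds, function(t) sum(supp >= t), integer(1))
print(data.frame(min_support = thresholds, frequent_itemsets = n_frequent))
```

Every frequent itemset is a source of candidate rules, so the rule count grows in the same direction as the minimum support is reduced.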
9
Extract the Rules in which the Confidence Value is >0.8 and high lift:
1. Execute the following commands:
> subrules <- rules[quality(rules)$confidence > 0.8]
> plot(subrules, control = list(jitter=2))
> inspect(subrules)
2. Review the results.
3. How many sub-rules did you extract?
· These high-confidence rules tend to be more valuable for the business, because the consequent appears in more than 80% of the transactions that contain the antecedent.
4. Extract the three rules with the highest lift and plot them.
> #Extract the top three rules with high lift
> rules_high_lift <- head(sort(rules, by="lift"), 3)
> inspect(rules_high_lift)
> plot(rules_high_lift,method="graph",
+ control=list(type="items"))
5. List the rules and the values of support, confidence, and lift associated with them.
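The filtering and sorting in this step can be mimicked in base R on the book-purchase data. This is a sketch under simplifying assumptions: it only considers rules with one item on each side, and the data frame named rules is a stand-in for the rules object that apriori returns.

```r
# The seven transactions from MBAdata.csv.
baskets <- list(
  c("R-basics", "Stat-intro", "PSQL-basics"),
  c("Stat-intro", "R-basics"),
  c("Stat-intro", "Learn-Spanish", "Jane-Austen"),
  c("Stat-intro", "R-basics", "Harry-Potter-DVD"),
  c("PSQL-basics"),
  c("Stat-intro", "PSQL-basics"),
  c("Stat-intro", "R-basics")
)
items <- sort(unique(unlist(baskets)))
supp <- function(s) mean(vapply(baskets, function(b) all(s %in% b), logical(1)))

# All ordered pairs lhs => rhs with lhs != rhs, plus their quality measures.
rules <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
rules <- rules[rules$lhs != rules$rhs, ]
rules$support <- mapply(function(l, r) supp(c(l, r)),
                        rules$lhs, rules$rhs, USE.NAMES = FALSE)
rules$confidence <- rules$support /
  vapply(rules$lhs, supp, numeric(1), USE.NAMES = FALSE)
rules$lift <- rules$confidence /
  vapply(rules$rhs, supp, numeric(1), USE.NAMES = FALSE)

# Drop pairs that never occur together.
rules <- rules[rules$support > 0, ]

# Same idea as rules[quality(rules)$confidence > 0.8].
subrules <- rules[rules$confidence > 0.8, ]

# Same idea as head(sort(rules, by = "lift"), 3).
top3 <- head(rules[order(rules$lift, decreasing = TRUE), ], 3)
print(top3, row.names = FALSE)
```

On this tiny data set the two highest-lift rules each rest on a single transaction, which illustrates why lift is usually read together with a support threshold.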
End of Lab Exercise
Workflow Overview (steps shown in the figure):
1. Set the Working Directory and install the “arules” and “arulesViz” packages
2. Read in the Data for Modeling
3. Review Transaction data
4. Plot Transactions
5. Mine the Association Rules
6. Read in Groceries dataset
7. Mine the Rules for the Groceries Data and Visualize results
8. Extract the Rules in which the Confidence Value is >0.8 and high lift and visualize results
MOD6/mba.R
#part1
setwd("D:/Users/XXUserXX/Desktop/DAT-510/MOD6/")

# Install the packages and load libraries
install.packages('arules')
library('arules')

#read in the csv file as a transaction data
txn <- read.transactions("MBAdata.csv", rm.duplicates = FALSE, format = "single", sep = ",", cols = c(1, 2))

#inspect transaction data
txn@transactionInfo
txn@itemInfo
image(txn)

#mine association rules
basket_rules <- apriori(txn, parameter = list(sup = 0.5, conf = 0.9, target = "rules"))
inspect(basket_rules)

#Part2
#Read in Groceries data
data(Groceries)
Groceries
Groceries@itemInfo

#mine rules
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))

#Extract rules with confidence > 0.8
subrules <- rules[quality(rules)$confidence > 0.8]
inspect(subrules)

#Extract the top three rules with high lift
rules_high_lift <- head(sort(rules, by = "lift"), 3)
inspect(rules_high_lift)
MOD6/MBAdata.csv
101,R-basics
101,Stat-intro
101,PSQL-basics
102,Stat-intro
102,R-basics
103,Stat-intro
103,Learn-Spanish
103,Jane-Austen
104,Stat-intro
104,R-basics
104,Harry-Potter-DVD
105,PSQL-basics
106,Stat-intro
106,PSQL-basics
107,Stat-intro
107,R-basics