Apriori Property, Interesting Rules Distinguished from Coincidental Rules, Use of a Categorical Variable
1. What is the Apriori property?
2. Following is a list of five transactions that include items A, B, C, and D:
• T1: {A, B, C}
• T2: {A, B}
• T3: {B}
• T4: {A, C}
• T5: {A, C, D}
Which itemsets satisfy the minimum support of 0.5?
(Hint: An itemset may include more than one item.)
3. How are interesting rules distinguished from coincidental rules?
4. A local retailer has a database that stores 10,000 transactions from last summer. After analyzing the data, a data science team identified the following statistics:
• {battery} appears in 4,000 transactions.
• {sunscreen} appears in 3,000 transactions.
• {sandals} appears in 4,000 transactions.
• {bowls} appears in 1,000 transactions.
• {battery, sunscreen} appears in 1,500 transactions.
• {battery, sandals} appears in 1,000 transactions.
• {battery, bowls} appears in 1,250 transactions.
• {battery, sunscreen, sandals} appears in 600 transactions.
Answer the following questions:
a. What are the support values of the preceding itemsets?
b. Assuming the minimum support is 0.05, which itemsets are considered frequent?
c. What are the confidence values of {battery} -> {sunscreen} and {battery, sunscreen} -> {sandals}?
d. Which of the two rules is more interesting?
5. In the use of a categorical variable with n possible values, explain the following:
a. Why only n - 1 binary variables are necessary
b. Why using n variables would be problematic
6. If the probability of an event occurring is 0.4, then
a. What is the odds ratio?
b. What is the log odds ratio?
Data Science and Big Data Analytics
Chapter 5: Advanced Analytical Theory and Methods: Association Rules
Chapter Sections
5.1 Overview
5.2 Apriori Algorithm
5.3 Evaluation of Candidate Rules
5.4 Applications of Association Rules
5.5 Example: Grocery Store Transactions
5.6 Validation and Testing
5.7 Diagnostics
5.1 Overview
Association rules method
Unsupervised learning method
Descriptive (not predictive) method
Used to find hidden relationships in data
The relationships are represented as rules
Questions association rules might answer
Which products tend to be purchased together?
What products do similar customers tend to buy?
5.1 Overview
Example – general logic of association rules
5.1 Overview
Rules have the form X -> Y
When X is observed, Y tends to be observed as well
Itemset
Collection of items or entities
k-itemset = {item 1, item 2,…,item k}
Examples
Items purchased in one transaction
Set of hyperlinks clicked by a user in one session
5.1 Overview – Apriori Algorithm
Apriori is the most fundamental algorithm
Given an itemset L, the support of L is the percentage of transactions that contain L
E.g., if {milk} appears in 500 of 1,000 transactions, Support({milk}) = 500/1,000 = 0.5
Frequent itemset – items that appear together “often enough”
Minimum support defines “often enough” (a percentage of transactions)
If an itemset is frequent, then any subset of it is frequent (the Apriori property)
5.1 Overview – Apriori Algorithm
If {B,C,D} is frequent, then all of its subsets are frequent
5.2 Apriori Algorithm
Frequent = meets minimum support
Bottom-up iterative algorithm
Identify the frequent (min support) 1-itemsets
Frequent 1-itemsets are paired into 2-itemsets, and the frequent 2-itemsets are identified, etc.
Definitions for next slide
D = transaction database
d = minimum support threshold
N = maximum length of itemset (optional parameter)
Ck = set of candidate k-itemsets
Lk = set of k-itemsets with minimum support
5.2 Apriori Algorithm
(figure: Apriori algorithm pseudocode, not reproduced here)
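In place of the pseudocode, here is a minimal R sketch of the bottom-up loop using the previous slide's definitions (D, d, N, Ck, Lk). The transactions reuse Exercise 2, the candidate-generation step is simplified to pairwise unions, and every helper name is an assumption for illustration. In practice, use arules::apriori() as in Section 5.5.

# Minimal Apriori sketch (illustrative only)
D <- list(c("A","B","C"), c("A","B"), c("B"), c("A","C"), c("A","C","D"))  # transaction database
d <- 0.5   # minimum support threshold
N <- 3     # maximum itemset length

# support of an itemset = fraction of transactions containing all of its items
support <- function(itemset, D) mean(sapply(D, function(t) all(itemset %in% t)))

# L1: frequent 1-itemsets
Ck <- as.list(sort(unique(unlist(D))))
Lk <- Filter(function(s) support(s, D) >= d, Ck)
frequent <- Lk

k <- 2
while (length(Lk) > 1 && k <= N) {
  # Ck: candidate k-itemsets from unions of pairs of frequent (k-1)-itemsets
  pairs <- combn(seq_along(Lk), 2, simplify = FALSE)
  Ck <- unique(Filter(function(s) length(s) == k,
                      lapply(pairs, function(p) sort(union(Lk[[p[1]]], Lk[[p[2]]])))))
  Lk <- Filter(function(s) support(s, D) >= d, Ck)  # prune by minimum support
  frequent <- c(frequent, Lk)
  k <- k + 1
}

frequent  # with these transactions: {A}, {B}, {C}, {A,C}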
5.3 Evaluation of Candidate Rules – Confidence
Frequent itemsets can form candidate rules
Confidence measures the certainty of a rule: Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
E.g., if {milk, eggs} appears in 300 of 1,000 transactions and {milk} in 500, Confidence(milk -> eggs) = 0.3/0.5 = 0.6
Minimum confidence – a predefined threshold that rules must meet
Problem with confidence
Given a rule X -> Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y
It cannot tell whether the rule reflects a true implication
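A quick numeric illustration of this pitfall (the numbers are invented for the example): when Y is very common, X -> Y earns high confidence even with no real association, which motivates the lift and leverage measures on the next slides.

# X and Y statistically independent, with Y in 90% of transactions
supp_X  <- 0.30
supp_Y  <- 0.90
supp_XY <- supp_X * supp_Y      # 0.27 under independence
supp_XY / supp_X                # confidence = 0.9 -- looks "certain"
supp_XY / (supp_X * supp_Y)     # lift = 1 -- no real association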
5.3 Evaluation of Candidate Rules – Lift
Lift measures how much more often X and Y occur together than expected if they were statistically independent: Lift(X -> Y) = Support(X ∪ Y) / (Support(X) * Support(Y))
Lift = 1 if X and Y are statistically independent
Lift > 1 indicates the degree of usefulness of the rule
Example – in 1000 transactions,
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5
If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0
5.3 Evaluation of Candidate Rules – Leverage
Leverage measures the difference between the probability of X and Y appearing together and what would be expected if they were statistically independent: Leverage(X -> Y) = Support(X ∪ Y) - Support(X) * Support(Y)
Leverage = 0 if X and Y are statistically independent
Leverage > 0 indicates the degree of usefulness of the rule
Example – in 1000 transactions,
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk->eggs) = 0.3 - 0.5*0.4 = 0.1
If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk->bread) = 0.4 - 0.5*0.4 = 0.2
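All three measures from this section can be computed in R. A minimal sketch, assuming the arules package and its Groceries dataset (both introduced in Section 5.5); interestMeasure() is the arules helper used here, and the thresholds are illustrative.

library(arules)
data(Groceries)
rules <- apriori(Groceries,
                 parameter = list(support = 0.02, confidence = 0.3, target = "rules"))
# confidence, lift, and leverage for each mined rule
interestMeasure(rules, measure = c("confidence", "lift", "leverage"),
                transactions = Groceries)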
5.4 Applications of Association Rules
The term market basket analysis refers to a specific implementation of association rules
For better merchandising – products to include/exclude from inventory each month
Placement of products in close proximity to related products
Association rules also used for
Recommender systems – Amazon, Netflix
Clickstream analysis from web usage log files
E.g., visitors to page X click on links A, B, C more than on links D, E, F
5.5 Example: Grocery Store Transactions – 5.5.1 The Groceries Dataset
Install the packages from the RStudio menu (Packages -> Install -> arules, arulesViz) or from the console:
> install.packages(c("arules", "arulesViz"))
> library('arules')
> library('arulesViz')
> data(Groceries)
> summary(Groceries) # indicates 9835 rows
The class of the Groceries dataset is transactions, which contains 3 slots
transactionInfo # data frame with one row per transaction
itemInfo # data frame storing the item labels
data # binary incidence matrix indicating which items appear in each transaction
> Groceries@itemInfo[1:10,] # display the first 10 item labels
> apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", ")) # display the items in transactions 10 through 20
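As a side note (not from the original slides), arules also provides helpers that read item supports directly from the transactions object, avoiding the @ slot access above:
> itemFrequency(Groceries)[1:10] # support of the first 10 items
> itemFrequencyPlot(Groceries, topN = 10) # bar chart of the 10 most frequent items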
5.5 Example: Grocery Store Transactions – 5.5.2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below runs each iteration separately.
Assume a minimum support threshold of 0.02 (0.02 * 9835 ≈ 197 transactions); 122 frequent itemsets are found in total
First, get itemsets of length 1
> itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 59 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
Second, get itemsets of length 2
> itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 61 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
Third, get itemsets of length 3
> itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 2 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
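The three calls above can also be combined into one. A sketch (consistent with the counts above, since 59 + 61 + 2 = 122): omitting maxlen lets apriori return every frequent itemset up to its default maximum length.
> itemsets <- apriori(Groceries, parameter = list(minlen = 1, support = 0.02, target = "frequent itemsets"))
> summary(itemsets) # should report 122 itemsets in total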
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
The Apriori algorithm will now generate rules.
Set the minimum support threshold to 0.001 (low, to allow many more rules for the scatterplot) and the minimum confidence threshold to 0.6, generating 2,918 rules.
> rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules"))
> summary(rules) # finds 2918 rules
> plot(rules) # displays scatterplot
The scatterplot shows that the highest lift occurs at a low support and a low confidence.
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
> plot(rules) # (figure: scatterplot of the 2,918 rules)
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Get a scatterplot matrix to compare the support, confidence, and lift of the 2,918 rules
> plot(rules@quality) # displays scatterplot matrix
Lift is proportional to confidence with several linear groupings.
Note that Lift = Confidence/Support(Y), so when support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
> plot(rules@quality) # (figure: scatterplot matrix of support, confidence, and lift)
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Compute 1/Support(Y), which is the slope
> slope<-sort(round(rules@quality$lift/rules@quality$confidence,2))
Display the number of times each slope appears in dataset
> unlist(lapply(split(slope,f=slope),length))
Display the top 10 rules sorted by lift
> inspect(head(sort(rules,by="lift"),10))
The rule {Instant food products, soda} -> {hamburger meat}
has the highest lift, approximately 19 (page 154)
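To examine these measures outside of arules, the rules can be coerced to a data frame via the standard as() coercion (exact column names vary slightly across arules versions):
> rules_df <- as(rules, "data.frame") # columns include support, confidence, lift
> head(rules_df[order(-rules_df$lift), ], 10) # same top-10-by-lift view as inspect()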
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Find the rules with confidence above 0.9
> confidentRules<-rules[quality(rules)$confidence>0.9]
> confidentRules # set of 127 rules
Plot a matrix-based visualization of the LHS vs. RHS of the rules
> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE))
The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE)) # (figure: matrix visualization)
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Visualize the top 5 rules with the highest lift.
> highLiftRules<-head(sort(rules,by="lift"),5)
> plot(highLiftRules,method="graph",control=list(type="items"))
In the graph, the arrow always points from an item on the LHS to an item on the RHS.
For example, the arrows that connect ham, processed cheese, and white bread suggest the rule
{ham, processed cheese} -> {white bread}
The size of a circle indicates support; its shade represents lift
5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
(figure: graph visualization of the top 5 rules by lift)
5.6 Validation and Testing
Frequent itemsets and high-confidence rules are found using pre-specified minimum support and minimum confidence levels
Measures like lift and leverage then help ensure that interesting rules are identified rather than coincidental ones
However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions
E.g., rules like {paper} -> {pencil} are not interesting/meaningful
Incorporating subjective knowledge requires domain experts
Good rules provide valuable insights for institutions to improve their business operations
5.7 Diagnostics
Although minimum support is pre-specified in Phases 3 and 4 (model planning and building), the level can be adjusted to tune the number of rules generated – variants and improvements of Apriori are available
For large datasets, the Apriori algorithm can be computationally expensive – efficiency improvements include
Partitioning
Sampling (see the sketch after this list)
Transaction reduction
Hash-based itemset counting
Dynamic itemset counting
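As a concrete sketch of the sampling idea above (the sample size and seed are illustrative; arules supplies a sample() method for transactions): mine a random subset of the database, then verify the candidate itemsets against the full data.
> set.seed(42)
> sampled <- sample(Groceries, 2000) # random 2,000 of the 9,835 transactions
> itemsets <- apriori(sampled, parameter = list(support = 0.02, target = "frequent itemsets"))
> summary(itemsets) # candidates to re-check on the full dataset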