Introduction to Data Mining
Mainly based on Tan, P.N., Steinbach, M., and Kumar, V. (2014) Introduction to Data Mining, Pearson
The CRISP-DM Process Model
Mariscal et al (2010). A survey of data mining and knowledge discovery process models and methodologies, Knowledge Engineering Review, 25, 137-166.
Learning Outcomes
• Define support, confidence, association rule
• Understand association rule mining
• Two-step approach to mine association rules
• Techniques in frequent itemset generation
– Reduce number of candidates: Apriori principle
– Reduce number of comparison: Hash tree
– Compact representation of frequent itemsets
– Alternative methods
• Techniques in rule generations
• Evaluate association rules
Example: Questions to Think
When we go shopping, we often have a list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips.
Q. Are there any relationships between items from huge databases?
Q. How do we uncover the relationships/ patterns?
• Product Placement
• Promotional discounts
• Advertisement on items
• New product
www.kdnuggets.com
Association Rule Learning
• Association learning aims to discover the probability of co-occurrence of items in a collection
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication here means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
The set of all items: I = {i1, i2, … , id} .
The set of all transactions: T = {t1, t2, … , tN}
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– σ(X) = |{ti | X ⊆ ti, ti ∈ T}|
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule, Support & Confidence
• Association Rule
– An implication expression of the form X → Y, where X and Y are disjoint itemsets.
– The strength of an association rule can be measured in terms of support and confidence.
• Rule Evaluation Metrics
– Support (s): Fraction of transactions that contain both X and Y
  s(X → Y) = σ(X ∪ Y) / |T|
– Confidence (c): Measures how often items in Y appear in transactions that contain X
  c(X → Y) = σ(X ∪ Y) / σ(X)
Example: Association Rule
• Association Rule
– Example:
{Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s)
  s(X → Y) = σ(X ∪ Y) / |T|
– Confidence (c)
  c(X → Y) = σ(X ∪ Y) / σ(X)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example:
{Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
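As a quick sanity check, the same support and confidence can be computed directly in R. This is a minimal sketch; the transaction encoding and helper names below are illustrative and not part of the slide material:

# encode the five market-basket transactions as character vectors
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)
# support count sigma(X): number of transactions containing every item of X
sigma <- function(X) sum(sapply(transactions, function(t) all(X %in% t)))
s <- sigma(c("Milk", "Diaper", "Beer")) / length(transactions)           # 2/5 = 0.4
conf <- sigma(c("Milk", "Diaper", "Beer")) / sigma(c("Milk", "Diaper"))  # 2/3 = 0.67
s
conf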
Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } –> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Application 2
• Supermarket shelf management:
– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
– A classic rule —
• If a customer buys diaper and milk, then he is very likely to buy beer.
• We may put six-packs of beers stacked next to diapers!
Application 3
• Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts and tools, to reduce the number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
The most common application of association rule mining is Market Basket Analysis.
Association Rule Mining Task
• Association Rule Discovery: Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
This is computationally prohibitive!
One challenge is: how to reduce the computational complexity
Example: Mining Association Rules
Example of Rules:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidences
• Thus, we may decouple the support and confidence requirements
Example: Importance of Confidence and Support in Mining Association Rules
Custm. ID Items
1 3, 5, 8.
2 2, 6, 8.
3 1, 4, 7, 10.
4 3, 8, 10.
5 2, 5, 8.
6 1, 5, 6.
7 4, 5, 6, 8.
8 2, 3, 4.
9 1, 5, 7, 8.
10 3, 8, 9, 10.
Calculate support, confidence of the association rules:
• Rule 1: {5} –> {8} C = 80%
• Rule 2: {8} –> {5} C = 57%
• Which rule is more meaningful?
• Rule 3: {9} –> {3} C = 100%
S = 1/10.
• Is this rule meaningful?
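The quoted confidences can be verified with a few lines of base R; a small sketch (the variable names are mine) over the ten customer baskets above:

baskets <- list(
  c(3, 5, 8), c(2, 6, 8), c(1, 4, 7, 10), c(3, 8, 10), c(2, 5, 8),
  c(1, 5, 6), c(4, 5, 6, 8), c(2, 3, 4), c(1, 5, 7, 8), c(3, 8, 9, 10)
)
count <- function(X) sum(sapply(baskets, function(b) all(X %in% b)))
count(c(5, 8)) / count(5)         # Rule 1: {5} -> {8}, confidence = 4/5 = 80%
count(c(5, 8)) / count(8)         # Rule 2: {8} -> {5}, confidence = 4/7 = 57%
count(c(9, 3)) / count(9)         # Rule 3: {9} -> {3}, confidence = 1/1 = 100%
count(c(9, 3)) / length(baskets)  # support of Rule 3 = 1/10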
Video Clip: Frequent Itemsets & Association Rules
• Support;
• Confidence;
• Frequent Itemset
• Association rules
Tutorial: Evaluate Association Rule of Iris Dataset Using R
#create a sub-directory "datamining" in the "C:" drive using File Explorer;
#copy "discrete2LevelIris.csv" into the working directory "C:\datamining";
setwd("C:/datamining")
iris <- read.table("discrete2LevelIris.csv", header = TRUE, sep = ",")

#get support count for whole dataset
count_total <- nrow(iris)

#get support for each iris species
support_Setosa <- length(which(iris$species == "Setosa")) / count_total
support_Versicolor <- length(which(iris$species == "Versicolor")) / count_total
support_Virginica <- length(which(iris$species == "Virginica")) / count_total
support_Setosa
support_Versicolor
support_Virginica

##Rules: {sepalLength == "high"} -> {species?}
count_sepalLenH <- length(which(iris$sepalLength == "high"))  # 55

#rule: {sepalLength == "high"} -> {Setosa}
count_sepalLenH_Setosa <- length(which(iris$sepalLength == "high" & iris$species == "Setosa"))  # [1] 0

#rule: {sepalLength == "high"} -> {Virginica}
count_sepalLenH_Virginica <- length(which(iris$sepalLength == "high" & iris$species == "Virginica"))  # [1] 39
support_sepalLenH_Virginica <- count_sepalLenH_Virginica / count_total
confidence_sepalLenH_Virginica <- count_sepalLenH_Virginica / count_sepalLenH
support_sepalLenH_Virginica
confidence_sepalLenH_Virginica

#rule: {sepalLength == "high"} -> {Versicolor}
count_sepalLenH_Versicolor <- length(which(iris$sepalLength == "high" & iris$species == "Versicolor"))  # [1] 16
support_sepalLenH_Versicolor <- count_sepalLenH_Versicolor / count_total
confidence_sepalLenH_Versicolor <- count_sepalLenH_Versicolor / count_sepalLenH
support_sepalLenH_Versicolor
confidence_sepalLenH_Versicolor

##Rule 1: {sepalLength == "high", sepalWidth == "high"} -> {Virginica}
count_sepalLenH_sepalWidH <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "high"))
count_sepalLenH_sepalWidH
count_sepalLenH_sepalWidH_Virginica <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "high" & iris$species == "Virginica"))
count_sepalLenH_sepalWidH_Virginica
#support of the association rule
support_sepalLenH_sepalWidH_Virginica <- count_sepalLenH_sepalWidH_Virginica / count_total
support_sepalLenH_sepalWidH_Virginica
#confidence of the association rule
confidence_sepalLenH_sepalWidH_Virginica <- count_sepalLenH_sepalWidH_Virginica / count_sepalLenH_sepalWidH
confidence_sepalLenH_sepalWidH_Virginica

##Rule 2: {sepalLength == "high", sepalWidth == "low", petalWidth == "high"} -> {Virginica}
count_sepalLenH_sepalWidL_petalWidH <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "high"))
count_sepalLenH_sepalWidL_petalWidH
count_sepalLenH_sepalWidL_petalWidH_Virginica <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "high" & iris$species == "Virginica"))
count_sepalLenH_sepalWidL_petalWidH_Virginica
#support of the association rule
support_sepalLenH_sepalWidL_petalWidH_Virginica <- count_sepalLenH_sepalWidL_petalWidH_Virginica / count_total
support_sepalLenH_sepalWidL_petalWidH_Virginica
#confidence of the association rule
confidence_sepalLenH_sepalWidL_petalWidH_Virginica <- count_sepalLenH_sepalWidL_petalWidH_Virginica / count_sepalLenH_sepalWidL_petalWidH
confidence_sepalLenH_sepalWidL_petalWidH_Virginica

##Rule 3: {sepalLength == "high", sepalWidth == "low", petalWidth == "low"} -> {Versicolor}
count_sepalLenH_sepalWidL_petalWidL <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "low"))
count_sepalLenH_sepalWidL_petalWidL
count_sepalLenH_sepalWidL_petalWidL_Versicolor <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "low" & iris$species == "Versicolor"))
count_sepalLenH_sepalWidL_petalWidL_Versicolor
#support of the association rule
support_sepalLenH_sepalWidL_petalWidL_Versicolor <- count_sepalLenH_sepalWidL_petalWidL_Versicolor / count_total
support_sepalLenH_sepalWidL_petalWidL_Versicolor
#confidence of the association rule
confidence_sepalLenH_sepalWidL_petalWidL_Versicolor <- count_sepalLenH_sepalWidL_petalWidL_Versicolor / count_sepalLenH_sepalWidL_petalWidL
confidence_sepalLenH_sepalWidL_petalWidL_Versicolor
Mining Association Rules
• Market Basket Analysis
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
The computational requirements for frequent itemset generation are generally more expensive than rule generation
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => expensive since M = 2^d − 1 !!!
For a data set that contains d items
Computational Complexity
• Given d unique items in a data set:
– Total number of itemsets = 2^d − 1
– Total number of possible association rules:
Q. How many rules if d=2?
Q. How many rules if d=3?
R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
Tan et al. (2014)
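To get a feel for how fast this grows, the closed-form expression can be evaluated in R (a quick sketch; the answers for d = 2 and d = 3 follow directly):

# total number of possible association rules for d unique items
num_rules <- function(d) 3^d - 2^(d + 1) + 1
num_rules(2)  # 2
num_rules(3)  # 12
num_rules(6)  # 602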
Generation Strategies
Strategies to reduce computational complexity of frequent
Itemset generation
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by direct hashing and pruning (DHP) and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Techniques in Frequent Itemset Generation
• Reduce number of candidates: Apriori principle
• Reduce number of comparison: Hash tree
• Compact representation of frequent itemsets
• Alternative methods to generate frequent itemsets
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
∀ X, Y: (X ⊆ Y) ⟹ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
All of its subsets (i.e., the shaded itemsets) must also be frequent
Example: Apriori Principle
Transaction Data Items (1-itemsets)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Item Support count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
Example: Apriori Principle
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Minimum Support = 3
2-Itemset Count
Bread, Milk 3
Bread, Beer 2
Bread, Diaper 3
Milk, Beer 2
Milk, Diaper 3
Beer, Diaper 3
3-Itemset Count
Bread, Milk, Diaper 3
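The same frequent itemsets can be reproduced with the arules package; a minimal sketch, assuming arules is installed, with minimum support = 3/5:

library(arules)
trans <- as(list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
), "transactions")
# mine all itemsets with support count >= 3 (i.e., support >= 0.6)
freq <- apriori(trans, parameter = list(support = 3/5, target = "frequent itemsets"))
inspect(sort(freq, by = "support"))
# expected output includes {Bread, Milk, Diaper} with support 0.6 (count 3)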
Working Example
Itemsets
{1,2,3,4}
{1,2,4}
{1,2}
{2,3,4}
{2,3}
{3,4}
{2,4}
The supermarket has a database of transactions where each transaction is a set of SKUs. Use Apriori or Brute-force to determine the frequent itemsets of this database with minsup=3.
Apriori Algorithm
1. Let k=1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the database
• Eliminate candidates that are infrequent, leaving only those that are frequent
Apriori Algorithm is based on two Apriori principles;
Apriori Algorithm is a level-wise search: 1. self-joining; 2. pruning;
Apriori Algorithm makes repeated passes over the data set to count the support
The Apriori Algorithm In Pseudocode
1: Find all large 1-itemsets
2: For (k = 2 ; while Lk-1 is non-empty; k++)
3 {Ck = apriori-gen(Lk-1)
4 For each c in Ck, initialise c.count to zero
5 For all records r in the DB
6 {Cr = subset(Ck, r); For each c in Cr , c.count++ }
7    Set Lk := all c in Ck whose count >= minsup
8 } /* end; return all of the Lk sets */
apriori-gen : Notes
Suppose we have worked out that the frequent 2-itemsets are:
L2 = { {milk, noodles}, {milk, tights}, {noodles, bacon} }
apriori-gen now generates 3-itemsets that all may be frequent.
An obvious way to do this would be to generate all of the possible 3-itemsets that can be made from {milk, noodles, tights, bacon}.
But this would include, e.g., {milk, tights, bacon}. Now, if this really was a frequent 3-itemset, that would mean the number of records containing all three is >= minsup;
This implies it would have to be true that the number of records containing {tights, bacon} is >= minsup. But, it can’t be, because this is not one of the large 2-itemsets.
apriori-gen : the Join Step
apriori-gen is clever in generating not too many candidate frequent itemsets, but making sure to not lose any that do turn out to be frequent.
To explain it, we need to note that there is always an ordering of the items. We will assume alphabetical order, and that the data structures used always keep members of a set in alphabetical order, a < b will mean that a comes before b in alphabetical order.
Suppose we have Lk and wish to generate Ck+1
First we take every distinct pair of sets in Lk
{a1, a2 , … ak} and {b1, b2 , … bk}, and do this: in all cases where {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak< bk, then, {a1, a2 , … ak, bk} is a candidate k+1-itemset.
Example: the Join Step
Suppose the 2-itemsets are:
L2 = { {milk, noodles}, {milk, tights}, {noodles, bacon}, {noodles, peas}, {noodles, tights}}
The pairs that satisfy {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak< bk, are:
{milk, noodles}|{milk, tights} {noodles, bacon}|{noodles, peas} {noodles, bacon}|{noodles, tights} {noodles, peas}|{noodles, tights}
So the candidate 3-itemsets are:
{milk, noodles, tights}, {noodles, bacon, peas}
{noodles, bacon, tights} {noodles, peas, tights},
According to Apriori principle, all other 3-itemsets cannot be frequent!
apriori-gen : the Prune Step
Now we have some candidate (k+1)-itemsets, and we are guaranteed to have all of the ones that could possibly be frequent, but we may still be able to prune some more before the next stage of Apriori counts their support.
In the prune step, we take the candidate (k+1)-itemsets and remove any for which some k-subset is not a frequent k-itemset. Such a candidate could not possibly be a frequent (k+1)-itemset.
E.g. in the current example, we have (let n:=noodles; b:=bacon; etc):
L2 = { {milk, n}, {milk, tights}, {n, bacon}, {n, peas}, {n, tights}}
And the candidate (k+1)-itemsets so far are: {m, n, t}, {n, b, p}, {n, p, t}, {n, b, t}.
Now, {b, p} is not a frequent 2-itemset, so {n, b, p} is pruned; {p, t} is not a frequent 2-itemset, so {n, p, t} is pruned; {b, t} is not a frequent 2-itemset, so {n, b, t} is pruned.
After this we finally have C3 = {{milk, noodles, tights}}
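The join and prune steps can be written compactly in R. The sketch below is illustrative only (the function name apriori_gen and the representation of itemsets as alphabetically sorted character vectors are my own choices); because items are kept in strict alphabetical order, {noodles, bacon} is stored as c("bacon", "noodles"), so the intermediate candidates differ slightly from the slide, but the final C3 is the same: {milk, noodles, tights}.

# apriori-gen sketch: join frequent k-itemsets that share their first k-1 items, then prune
apriori_gen <- function(Lk) {
  k <- length(Lk[[1]])
  candidates <- list()
  for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
    a <- Lk[[i]]; b <- Lk[[j]]
    # join step: identical first k-1 items and a[k] < b[k]
    if (identical(a[-k], b[-k]) && a[k] < b[k]) {
      cand <- c(a, b[k])
      # prune step: every k-subset of the candidate must itself be in Lk
      keep <- all(sapply(seq_len(k + 1), function(m)
        any(sapply(Lk, function(x) identical(x, cand[-m])))))
      if (keep) candidates[[length(candidates) + 1]] <- cand
    }
  }
  candidates
}

L2 <- list(c("milk", "noodles"), c("milk", "tights"), c("bacon", "noodles"),
           c("noodles", "peas"), c("noodles", "tights"))
apriori_gen(L2)  # only {milk, noodles, tights} survives the join + prune steps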
Example: Apriori Algorithm
• With k = 3 (& k-itemsets lexicographically ordered):
Join Step
• {3,4,5}, {3,4,7}, {3,5,6}, {3,5,7}, {3,5,8}, {4,5,6}, {4,5,7}
• Generate candidate (k+1)-itemsets by joining each pair of k-itemsets that share the same first k−1 items:
{3,4,5,7}, {3,5,6,7}, {3,5,6,8}, {3,5,7,8}, {4,5,6,7}
Prune Step
• Delete (prune) all itemset candidates with a non-frequent subset. For example, {3,5,6,7} cannot be frequent since its subset {5,6,7} is not frequent.
• Here, only one candidate remains: {3,4,5,7}.
Exercise 1: Apriori Algorithm
Table 1. Example of market basket transactions
Transaction ID Items
1 {a, b, d, e}
2 {b, c, d}
3 {a, b, d, e}
4 {a, c, d, e}
5 {b, c, d, e}
6 {b, d, e}
7 {c, d}
8 {a, b, c}
9 {a, d, e}
10 {b, d}
Suppose the Apriori algorithm is applied to the data set shown in Table 1 with minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered infrequent. Label each node in the lattice in Fig 1 with one of the following letters: N = not considered by the Apriori algorithm; F = frequent itemset; I = infrequent itemset.
Exercise 1: Apriori Algorithm
Techniques in Frequent Itemset Generation
• Reduce number of candidates: Apriori principle
• Reduce number of comparison: Hash tree
• Compact representation of frequent itemsets
• Alternative methods to generate frequent itemsets
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset, which is time-consuming.
– To reduce the number of comparisons, store the candidates in a hash structure
• Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
Generate Hash Tree
Suppose we have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
We need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
Hash function: h(p) = p mod 3
Association Rule Discovery: Hash tree
Enumerate Subset of 3 Items
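As a simplified illustration of the idea (not a full hash tree), the sketch below hashes the 15 candidate 3-itemsets on their first item with h(p) = p mod 3; a real hash tree applies the same hash function recursively to the second and third items at deeper levels, and splits leaves that exceed the maximum leaf size.

candidates <- list(
  c(1, 4, 5), c(1, 2, 4), c(4, 5, 7), c(1, 2, 5), c(4, 5, 8),
  c(1, 5, 9), c(1, 3, 6), c(2, 3, 4), c(5, 6, 7), c(3, 4, 5),
  c(3, 5, 6), c(3, 5, 7), c(6, 8, 9), c(3, 6, 7), c(3, 6, 8)
)
h <- function(p) p %% 3
# first level of the tree: bucket each candidate by the hash of its first item
buckets <- split(candidates, sapply(candidates, function(x) h(x[1])))
buckets  # three branches, keyed by hash values 0, 1 and 2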
Factors Affecting Complexity of Apriori Algorithm
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also increase
• Size of database (number of transactions)
– since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
• Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
Techniques in Frequent Itemset Generation
• Reduce number of candidates: Apriori principle
• Reduce number of comparison: Hash tree
• Compact representation of frequent itemsets
• Alternative methods to generate frequent itemsets
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their supersets
The original binary 0/1 table of 15 transactions over 30 items (A1-A10, B1-B10, C1-C10), shown here in compact form:
TID 1-5: each contains all of A1, A2, ..., A10 (and none of the B or C items)
TID 6-10: each contains all of B1, B2, ..., B10
TID 11-15: each contains all of C1, C2, ..., C10
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069
• Need a compact representation
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
Superset = a set which includes another set or sets.
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset
TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}
Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
Itemset Support
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
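The support table above, together with the maximal and closed flags, can be reproduced with arules; a minimal sketch assuming the package is installed:

library(arules)
trans <- as(list(c("A", "B"), c("B", "C", "D"), c("A", "B", "C", "D"),
                 c("A", "B", "D"), c("A", "B", "C", "D")), "transactions")
# enumerate every itemset that occurs at least once (support >= 1/5)
itemsets <- apriori(trans, parameter = list(support = 1/5, target = "frequent itemsets"))
inspect(itemsets)
is.maximal(itemsets)  # TRUE where no frequent immediate superset exists
is.closed(itemsets)   # TRUE where no superset has the same support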
Maximal vs Closed Itemsets
Maximal frequent itemsets are the smallest set of itemsets from which all other frequent itemsets can be derived.
Closed itemsets provide a minimal representation of itemsets without losing their support information
Techniques in Frequent Itemset Generation
• Reduce number of candidates: Apriori principle
• Reduce number of comparison: Hash tree
• Compact representation of frequent itemsets
• Alternative methods to generate frequent itemsets
Alternative Methods for Frequent Itemset Generation
• Apriori has successfully addressed the combinatorial explosion of frequent itemset generation
• The performance of Apriori may degrade significantly for dense data sets due to the increasing width of transactions
• Traversal of Itemset Lattice
• Representation of Transaction Data Set
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– General-to-specific vs Specific-to-general
Alternative Methods for Frequent Itemset Generation
• Representation of Database
– horizontal vs vertical data layout
The support of each candidate itemset can be counted by intersecting the TID-lists of their subsets.
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B
TID-lists in the vertical layout:
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
E.g. TID-list(A) ∩ TID-list(C) has size 3, i.e. σ({A, C}) = 3.
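The TID-list intersection is easy to express in R; a small sketch using the TID-lists from the vertical table above (variable names are mine):

# vertical layout: each item maps to the IDs of the transactions that contain it
tid <- list(
  A = c(1, 4, 5, 6, 7, 8, 9),
  B = c(1, 2, 5, 7, 8, 10),
  C = c(2, 3, 4, 8, 9),
  D = c(2, 4, 5, 9),
  E = c(1, 3, 6)
)
# support count of {A, C} = size of the intersection of the two TID-lists
length(intersect(tid$A, tid$C))  # 3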
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Note that for a given frequent itemset, the rules generated by a binary partitioning have the same support
Rule Generation
• Given a frequent itemset Y, find all non-empty proper subsets X ⊂ Y such that X → (Y − X) satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
• If |Y| = k, then there are 2^k − 2 candidate association rules (ignoring ∅ → Y and Y → ∅)
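For a concrete check of the 2^k − 2 count, the candidate rules of {A,B,C,D} can be enumerated with base R (a quick sketch):

items <- c("A", "B", "C", "D")
k <- length(items)
# every non-empty proper subset X of the itemset yields a candidate rule X -> (Y - X)
for (m in 1:(k - 1)) {
  for (lhs in combn(items, m, simplify = FALSE)) {
    rhs <- setdiff(items, lhs)
    cat(paste(lhs, collapse = ","), "->", paste(rhs, collapse = ","), "\n")
  }
}
# prints 2^4 - 2 = 14 candidate rules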
Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an antimonotone property
c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset has an anti-monotone property
– e.g., Y = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
Lattice of rules generated from the same itemset
Rule Generation for Apriori Algorithm
• Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• Join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC
• Prune rule D => ABC if its subset AD => BC does not have high confidence (prune step)
Effect of Support Distribution
• Many real data sets have skewed support distribution for items
Support distribution of a retail data set
Effect of Support Distribution
• How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
– If minsup is set too low, it is computationally expensive and the number of itemsets is very large
• Using a single minimum support threshold may not be effective
Multiple Minimum Support
• How to apply multiple minimum supports?
– MS(i):= minimum support for item i
– e.g.: MS(Milk)=5%, MS(Coke) = 3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) = 0.1%
– Challenge: Support is no longer anti-monotone
Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%
Then: {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent. Why?
(Because MS({Milk, Coke}) = min(5%, 3%) = 3% > 1.5%, while MS({Milk, Coke, Broccoli}) = 0.1% ≤ 0.5%.)
Multiple Minimum Support
Item MS(I) Sup(I)
A 0.10% 0.25%
B 0.20% 0.26%
C 0.30% 0.29%
D 0.50% 0.05%
E 3% 4.20%
Modified Apriori Algorithm for Multiple Minimum Support
• Order the items according to their minimum support
(in ascending order)
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– Ordering: Broccoli, Salmon, Coke, Milk
• Need to modify Apriori such that:
– L1 : set of frequent items
– F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i( MS(i) )
– C2 : candidate itemsets of size 2 is generated from F1 instead of L1
Modified Apriori Algorithm for Multiple Minimum Support
• Modifications to Apriori:
– In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subsets of size k
– Pruning step has to be modified:
• Prune only if infrequent subset contains the first item;
• e.g.: from two frequent 2-itemsets {Broccoli, Coke} and {Broccoli, Milk}, we generate a 3-itemset candidate={Broccoli, Coke, Milk}, where items are ordered according to minimum support;
• Then, although {Coke, Milk} is infrequent, the candidate {Broccoli, Coke, Milk} is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.
Working Ex.: Frequent itemset with multiple minsups
Table 1. Dataset Table 2. Minsups of items
TID items
t1 {1, 3, 4,6}
t2 {1, 3, 5, 6, 7}
t3 {1, 2, 3, 6, 8}
t4 {2, 6, 7}
t5 {2, 3}
item minsup count
1 1 3
2 2 3
3 3 4
4 3 1
5 2 1
6 3 4
7 2 2
8 1 1
item minsup count
8 1 1
1 1 3
5 2 1
7 2 2
2 2 3
4 3 1
6 3 4
3 3 4
Q1: Which 1-itemsets are frequent? Which 2-itemsets are frequent?
Q2: Generate association rules from the 2-itemsets {8, 1}, {2, 3} and {6, 3}, respectively, and calculate their supports and confidences.
Hint: rank items according to minsups.
Frequent itemsets:
Itemset: count
8:1
8 1:1
8 1 2:1
8 1 2 6:1
8 1 2 6 3:1
8 1 2 3:1
8 1 6:1
8 1 6 3:1
8 1 3:1
8 2:1
8 2 6:1
8 2 6 3:1
8 2 3:1
8 6:1
8 6 3:1
8 3:1
1:3
1 7:1
1 7 5:1
1 7 5 6:1
1 7 5 6 3:1
1 7 5 3:1
1 7 6:1
1 7 6 3:1
1 7 3:1
1 5:1
1 5 6:1
1 5 6 3:1
1 5 3:1
1 2:1
1 2 6:1
1 2 6 3:1
1 2 3:1
1 6:3
1 6 4:1
1 6 4 3:1
1 6 3:3
1 4:1
1 4 3:1
1 3:3
7:2
7 6:2
2:3
2 6:2
2 3:2
6:4
6 3:3
3:4
Association Rule Evaluation
• Association rule algorithms tend to produce too many rules
– many of them are uninteresting or redundant
– Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank the derived patterns
• In the original formulation of association rules, support & confidence are the only measures used
• Other interestingness measures?
Association Rule Evaluation
Butter → Bread,
Chocolate → Teddy Bear,
Beer → Diapers,
• Which of these three seem interesting to you?
• Which of these three might affect the way you do business?
• After the creation of association rules we must decide which rules are actually interesting and of use to us.
• A market basket data set with only about 10 transactions and 5 items can have up to 100 association rules
• We need to identify the most interesting ones. Interestingness is the term coined for patterns that we consider of interest. It can be identified by subjective and objective measures.
Application of Interestingness Measure
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.
Drawback of Confidence
            Coffee   No Coffee   Total
Tea            15        5          20
No Tea         75        5          80
Total          90       10         100
Consider the association rule Tea → Coffee:
Confidence = P(Coffee | Tea) = 15/20 = 0.75
Support(Tea, Coffee) = 15/100 = 0.15
but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee | No Tea) = 75/80 = 0.9375
Subjective vs. Objective Measures of Interestingness
• Subjective measures are those that depend on the class of users who examine the pattern
• Objective measures use statistical information which can be derived from the data to determine whether a particular pattern is interesting, e.g. support and confidence.
• Other Objective Measures of Interestingness:
– Lift
– Interest Factor
– Correlation Analysis
– Etc.
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)
– P(S,B) = 420/1000 = 0.42
– P(S) × P(B) = 0.6 × 0.7 = 0.42
– P(S,B) = P(S) × P(B) => Statistical independence
– P(S,B) > P(S) × P(B) => Positively correlated
– P(S,B) < P(S) × P(B) => Negatively correlated
Statistical-based Measures
• Measures that take into account statistical dependence
Interest Factor
• Interpretation of Interest factor: compare the support of itemset {A,B} to the expected support under the assumption that A and B are statistically independent:
– s(A,B) ≈ P(A and B)
– s(A) ≈ P(A), s(B) ≈ P(B)
– Statistical independence: P(A and B) = P(A)xP(B)
• Use of interest factor:
– I(A,B) >1 : A and B occur together more frequently than expected by chance.
– I(A,B) < 1 : A and B occur together less frequently than expected by chance.
Example: Lift/Interest/Correlation
            Coffee   No Coffee   Total
Tea            15        5          20
No Tea         75        5          80
Total          90       10         100
Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75 / 0.9 = 0.8333 (< 1: Tea and Coffee occur together less often than if they were independent)
φ = (15×5 − 75×5) / √(90×20×10×80) = −0.25 (negatively associated)
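The confidence, lift and φ values above can be recomputed in R from the contingency table (a quick sketch):

# contingency table: rows = Tea / no Tea, columns = Coffee / no Coffee
n <- matrix(c(15, 5, 75, 5), nrow = 2, byrow = TRUE,
            dimnames = list(c("Tea", "NoTea"), c("Coffee", "NoCoffee")))
N <- sum(n)                                         # 100
confidence <- n["Tea", "Coffee"] / sum(n["Tea", ])  # 0.75
lift <- confidence / (sum(n[, "Coffee"]) / N)       # 0.75 / 0.9 = 0.833
phi <- (n[1, 1] * n[2, 2] - n[1, 2] * n[2, 1]) /
  sqrt(sum(n[1, ]) * sum(n[2, ]) * sum(n[, 1]) * sum(n[, 2]))  # -0.25
c(confidence = confidence, lift = lift, phi = phi)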
Drawback of Lift & Interest
Table 1:
            Y    Not Y   Total
X          90        0      90
Not X       0       10      10
Total      90       10     100
Lift = 0.9 / (0.9 × 0.9) = 1.11

Table 2:
            Y    Not Y   Total
X          10        0      10
Not X       0       90      90
Total      10       90     100
Lift = 0.1 / (0.1 × 0.1) = 10
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
Sometimes, Lift could be misleading in terms of association
Correlation analysis: φ-coefficient
• For binary variables, correlation can be measured using the φ -coefficient
• φ-coefficient is: symmetric; Invariant under inversion; not invariant under null addition; not invariant under scaling
• φ-Coefficient considers the co-occurrence and co-absence equally important: the two contingency tables evaluate to the same value
• This makes the measure more suitable to symmetrical variables
Example: φ-Coefficient
• The φ-coefficient is analogous to the correlation coefficient for continuous variables

Table 1:
            Y    Not Y   Total
X          60       10      70
Not X      10       20      30
Total      70       30     100
φ = (60×20 − 10×10) / √(70×30×70×30) = 0.5238

Table 2:
            Y    Not Y   Total
X          20       10      30
Not X      10       60      70
Total      30       70     100
φ = (20×60 − 10×10) / √(30×70×30×70) = 0.5238

• The φ-coefficient is the same for both tables
Exercise 2: Support, Confidence, Lift
Based on the transaction database, calculate the support, confidence and lift for the following association rules and interpret the results:
1. rule 1: 3 ==> 2
2. rule 2: 1 ==> 5
3. rule 3: 5 ==> 2
4. rule 4: 4 5 ==> 2
Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}
5. rule 5: 4 5 ==> 1
6. rule 6: 1 5 ==> 2
7. rule 7: 1 2 ==> 5
8. rule 8: 1 4 5 ==> 2
9. rule 9: 1 2 4 ==> 5
10. rule 10: 4 5 ==> 1 2
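One way to check your answers is with the arules package; a minimal sketch (assuming arules is installed) that encodes the six transactions with item labels as character strings:

library(arules)
trans <- as(list(c("1", "2", "4", "5"), c("2", "3", "5"), c("1", "2", "4", "5"),
                 c("1", "2", "3", "5"), c("1", "2", "3", "4", "5"), c("2", "3", "4")),
            "transactions")
# mine all rules with at least two items, then look up the ones in the exercise
rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.1, minlen = 2))
# e.g. rules whose left-hand side contains both 4 and 5 and whose right-hand side is 2
inspect(subset(rules, lhs %ain% c("4", "5") & rhs %in% "2"))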
Using R software to mine Association Rules in a transactional dataset
#load the arules package and the built-in Groceries dataset;
> library(arules)
> data("Groceries")
#show the numbers of rows and columns;
> Groceries
transactions in sparse format with
9835 transactions (rows) and
169 items (columns)
>
#observe the first five transactions in dataset:
> inspect(head(Groceries, 5))
items
[1] {citrus fruit,semi-finished bread,margarine,ready soups}
[2] {tropical fruit,yogurt,coffee}
[3] {whole milk}
[4] {pip fruit,yogurt,cream cheese ,meat spreads}
[5] {other vegetables,whole milk,condensed milk,long life bakery product}
#Generating Rules
>grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
#check the number of association rules
> grocery_rules
set of 15 rules
>
#inspect the first five rules
> inspect(head(grocery_rules, 5))
lhs rhs
support confidence lift count
[1] {curd,yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
[2] {other vegetables,butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
[3] {other vegetables,domestic eggs} => {whole milk}
0.01230300 0.5525114 2.162336 121
[4] {yogurt,whipped/sour cream} => {whole milk} 0.01087951 0.5245098 2.052747 107
[5] {other vegetables,whipped/sour cream} => {whole milk}
0.01464159 0.5070423 1.984385 144
#inspect the first five rules by confidence
> inspect(head(sort(grocery_rules, by = "confidence"), 5))
lhs rhs
support confidence lift count
[1] {citrus fruit,root vegetables} => {other vegetables}
0.01037112 0.5862069 3.029608 102
[2] {tropical fruit,root vegetables} => {other vegetables}
0.01230300 0.5845411 3.020999 121
[3] {curd,yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
[4] {other vegetables,butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
[5] {tropical fruit,root vegetables} => {whole milk}
0.01199797 0.5700483 2.230969 118
#inspect the first five rules by lift
> inspect(head(sort(grocery_rules, by = "lift"), 5))
lhs rhs
support confidence lift count
[1] {citrus fruit,root vegetables} => {other vegetables}
0.01037112 0.5862069 3.029608 102
[2] {tropical fruit,root vegetables} => {other vegetables}
0.01230300 0.5845411 3.020999 121
[3] {root vegetables,rolls/buns} => {other vegetables} 0.01220132 0.5020921 2.594890 120
[4] {root vegetables,yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
[5] {curd,yogurt} => {whole milk}
0.01006609 0.5823529 2.279125 99
#Generate rules by specifying the antecedent or consequent
#To show what products are bought before buying “whole milk” and will generate rules that lead to buying “whole milk”.
> wholemilk_rules <- apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.08), appearance = list(rhs = "whole milk"))
#inspect the first five rules by lift
> inspect(head(sort(wholemilk_rules, by = "lift"), 5))
lhs rhs
support confidence lift count
[1] {rice,sugar} => {whole milk} 0.001220132 1 3.913649 12
[2] {canned fish,hygiene articles} => {whole milk}
0.001118454 1 3.913649 11
[3] {root vegetables,butter,rice} => {whole milk} 0.001016777 1 3.913649 10
[4] {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1 3.913649 17
[5] {butter,soft cheese,domestic eggs} => {whole milk}
0.001016777 1 3.913649 10
# Limiting the number of rules generated by increasing minsupport and minconfidence
grocery_rules_increased_thresholds <- apriori(Groceries, parameter = list(support = 0.02, confidence = 0.5))
>
#inspect the generated rule
> inspect(grocery_rules_increased_thresholds)
lhs rhs support confidence lift count
[1] {other vegetables,yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 219
>
#Note:
#If you want to get stronger rules, you have to increase the confidence.
#If you want lengthier rules, increase the maxlen parameter.
#If you want to eliminate shorter rules, increase the minlen parameter.
Further Readings on Association Rules in Business Cases
• Huang, D., Lu, X. and Duan, H. (2011). Mining association rules to support resource allocation in business process management, Expert Systems with Applications, 38, 9483-9490.
• Kamsu-Foguem et al. (2013). Mining association rules for the quality improvement of the production process, Expert Systems with Applications, 40, 1034-1045.
Summary of Association Rule
1. What is the purpose of mining association rules?
2. What are key concepts in mining association rules?
3. How to mine association rules from a given transaction dataset?
4. How to reduce the computational complexity in mining association rules?
5. How to measure whether association rules are good?