The CRISP-DM Process Model

Mariscal et al. (2010). A survey of data mining and knowledge discovery process models and methodologies. Knowledge Engineering Review, 25, 137-166.

Learning Outcomes

  • Define support, confidence, association rule
  • Understand association rule mining
  • Two-step approach to mine association rules
  • Techniques in frequent itemset generation
    • Reduce number of candidates: Apriori principle
    • Reduce number of comparisons: hash tree
    • Compact representation of frequent itemsets
    • Alternative methods
  • Techniques in rule generation
  • Evaluate association rules

Example: Questions to Think About

When we go shopping, we often have a list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips.

  1. Are there any relationships between items in these huge transaction databases?
  2. How do we uncover the relationships/patterns?

Possible uses:
  • Product placement
  • Promotional discounts
  • Advertisement on items
  • New product

Source: www.kdnuggets.com

Association Rule Learning

  • Association rule learning is to discover the probability of the co-occurrence of items in a collection
  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

  • Itemset
    • A collection of one or more items
    • The set of all items: I = {i1, i2, …, id}
    • The set of all transactions: T = {t1, t2, …, tN}
    • Example: {Milk, Bread, Diaper}
  • k-itemset
    • An itemset that contains k items
  • Support count (σ)
    • σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|
    • Frequency of occurrence of an itemset
    • e.g. σ({Milk, Bread, Diaper}) = 2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

  • Support
    • Fraction of transactions that contain an itemset
    • e.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
    • An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule, Support & Confidence

  • Association Rule
    • An implication expression of the form X → Y, where X and Y are disjoint itemsets.
    • The strength of an association rule can be measured in terms of support and confidence.
  • Rule Evaluation Metrics
    • Support (s): Fraction of transactions that contain both X and Y

      s(X → Y) = σ(X ∪ Y) / |T|

    • Confidence (c): Measures how often items in Y appear in transactions that contain X

      c(X → Y) = σ(X ∪ Y) / σ(X)

Example: Association Rule

  • Association Rule
    • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
    • Support (s): s(X → Y) = σ(X ∪ Y) / |T|
    • Confidence (c): c(X → Y) = σ(X ∪ Y) / σ(X)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
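These numbers can be checked with a few lines of base R (an illustrative sketch added here, not part of the original slides; the function and variable names are mine):

#verify support and confidence of {Milk, Diaper} -> {Beer} on the five example transactions
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

#support count: number of transactions that contain every item of the itemset
support_count <- function(itemset) sum(sapply(transactions, function(t) all(itemset %in% t)))

supp <- support_count(c("Milk", "Diaper", "Beer")) / length(transactions)                # 2/5 = 0.4
conf <- support_count(c("Milk", "Diaper", "Beer")) / support_count(c("Milk", "Diaper"))  # 2/3 ~ 0.67
supp
conf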

 

Application 1

  • Marketing and Sales Promotion:
    • Let the rule discovered be

{Bagels, … } –> {Potato Chips}

  • Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  • Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  • Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Application 2

  • Supermarket shelf management:
    • Goal: To identify items that are bought together by sufficiently many customers.
    • Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
    • A classic rule —
  • If a customer buys diapers and milk, then he is very likely to buy beer.
  • We may put six-packs of beer next to the diapers!

Application 3

  • Inventory Management:
  • Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts and tools to reduce the number of visits to consumer households.
  • Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

The most common application of association rule mining is Market Basket Analysis.

 

Association Rule Mining Task

  • Association Rule Discovery: Given a set of transactions T, the goal of association rule mining is to find all rules having
    • support ≥ minsup threshold
    • confidence ≥ minconf threshold
  • Brute-force approach:
    • List all possible association rules
    • Compute the support and confidence for each rule
    • Prune rules that fail the minsup and minconf thresholds
    ⇒ It is computationally prohibitive!

One key challenge is how to reduce the computational complexity.

Example: Mining Association Rules

Example of Rules:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support, s(X → Y) = σ(X ∪ Y) / |T|, but can have different confidences, c(X → Y) = σ(X ∪ Y) / σ(X)
  • Thus, we may decouple the support and confidence requirements

Example: Importance of Confidence and Support in Mining Association Rules

Custm. ID  Items
1          3, 5, 8
2          2, 6, 8
3          1, 4, 7, 10
4          3, 8, 10
5          2, 5, 8
6          1, 5, 6
7          4, 5, 6, 8
8          2, 3, 4
9          1, 5, 7, 8
10         3, 8, 9, 10

Calculate the support and confidence of the association rules:

  • Rule 1: {5} → {8}, C = 80%
  • Rule 2: {8} → {5}, C = 57%
  • Which rule is more meaningful?
  • Rule 3: {9} → {3}, C = 100%, S = 1/10
  • Is this rule meaningful?

Video Clip: Frequent Itemsets & Association Rules

  • Support;
  • Confidence;
  • Frequent Itemset
  • Association rules

Tutorial: Evaluate Association Rule of Iris Dataset Using R

#create a sub-directory "datamining" in the "C:" drive using File Explorer;
#copy "discrete2LevelIris.csv" into the working directory "C:\datamining";
setwd("C:/datamining")
iris <- read.table("discrete2LevelIris.csv", header = TRUE, sep = ",");

#get support count for the whole dataset
count_total <- nrow(iris);

#get support for each iris species
support_Setosa <- length(which(iris$species == "Setosa")) / count_total;
support_Versicolor <- length(which(iris$species == "Versicolor")) / count_total;
support_Virginica <- length(which(iris$species == "Virginica")) / count_total;
support_Setosa
support_Versicolor
support_Virginica

##Rules: {sepalLength == "high"} -> {species?}
count_sepalLenH <- length(which(iris$sepalLength == "high")) # 55

#rule: {sepalLength == "high"} -> {Setosa}
count_sepalLenH_Setosa <- length(which(iris$sepalLength == "high" & iris$species == "Setosa")) # [1] 0

#rule: {sepalLength == "high"} -> {Virginica}
count_sepalLenH_Virginica <- length(which(iris$sepalLength == "high" & iris$species == "Virginica")) # [1] 39
support_sepalLenH_Virginica <- count_sepalLenH_Virginica / count_total;
confidence_sepalLenH_Virginica <- count_sepalLenH_Virginica / count_sepalLenH;
support_sepalLenH_Virginica
confidence_sepalLenH_Virginica

#rule: {sepalLength == "high"} -> {Versicolor}
count_sepalLenH_Versicolor <- length(which(iris$sepalLength == "high" & iris$species == "Versicolor")) # [1] 16
support_sepalLenH_Versicolor <- count_sepalLenH_Versicolor / count_total;
confidence_sepalLenH_Versicolor <- count_sepalLenH_Versicolor / count_sepalLenH;
support_sepalLenH_Versicolor
confidence_sepalLenH_Versicolor

##Rule 1: {sepalLength == "high", sepalWidth == "high"} -> {Virginica}
count_sepalLenH_sepalWidH <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "high"));
count_sepalLenH_sepalWidH
count_sepalLenH_sepalWidH_Virginica <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "high" & iris$species == "Virginica"));
count_sepalLenH_sepalWidH_Virginica

#support of the association rule
support_sepalLenH_sepalWidH_Virginica <- count_sepalLenH_sepalWidH_Virginica / count_total;
support_sepalLenH_sepalWidH_Virginica

#confidence of the association rule
confidence_sepalLenH_sepalWidH_Virginica <- count_sepalLenH_sepalWidH_Virginica / count_sepalLenH_sepalWidH;
confidence_sepalLenH_sepalWidH_Virginica

##Rule 2: {sepalLength == "high", sepalWidth == "low", petalWidth == "high"} -> {Virginica}
count_sepalLenH_sepalWidL_petalWidH <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "high"));
count_sepalLenH_sepalWidL_petalWidH
count_sepalLenH_sepalWidL_petalWidH_Virginica <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "high" & iris$species == "Virginica"));
count_sepalLenH_sepalWidL_petalWidH_Virginica

#support of the association rule
support_sepalLenH_sepalWidL_petalWidH_Virginica <- count_sepalLenH_sepalWidL_petalWidH_Virginica / count_total;
support_sepalLenH_sepalWidL_petalWidH_Virginica

#confidence of the association rule
confidence_sepalLenH_sepalWidL_petalWidH_Virginica <- count_sepalLenH_sepalWidL_petalWidH_Virginica / count_sepalLenH_sepalWidL_petalWidH;
confidence_sepalLenH_sepalWidL_petalWidH_Virginica

##Rule 3: {sepalLength == "high", sepalWidth == "low", petalWidth == "low"} -> {Versicolor}
count_sepalLenH_sepalWidL_petalWidL <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "low"));
count_sepalLenH_sepalWidL_petalWidL
count_sepalLenH_sepalWidL_petalWidL_Versicolor <- length(which(iris$sepalLength == "high" & iris$sepalWidth == "low" & iris$petalWidth == "low" & iris$species == "Versicolor"));
count_sepalLenH_sepalWidL_petalWidL_Versicolor

#support of the association rule
support_sepalLenH_sepalWidL_petalWidL_Versicolor <- count_sepalLenH_sepalWidL_petalWidL_Versicolor / count_total;
support_sepalLenH_sepalWidL_petalWidL_Versicolor

#confidence of the association rule
confidence_sepalLenH_sepalWidL_petalWidL_Versicolor <- count_sepalLenH_sepalWidL_petalWidL_Versicolor / count_sepalLenH_sepalWidL_petalWidL;
confidence_sepalLenH_sepalWidL_petalWidL_Versicolor
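As an optional cross-check (an illustrative sketch, not part of the original tutorial), the same rules can be mined with the arules package. This assumes install.packages("arules") has been run and that discrete2LevelIris.csv contains only the discretised categorical columns used above (sepalLength, sepalWidth, petalWidth, …, species):

#alternative using the arules package (assumed installed); column names follow the tutorial above
library(arules)

iris_disc <- read.table("discrete2LevelIris.csv", header = TRUE, sep = ",",
                        stringsAsFactors = TRUE)

#coerce the all-categorical data frame to a transactions object (items become "column=level")
iris_trans <- as(iris_disc, "transactions")

#mine rules whose consequent is one of the species items; thresholds are illustrative
species_items <- grep("^species=", itemLabels(iris_trans), value = TRUE)
iris_rules <- apriori(iris_trans,
                      parameter = list(support = 0.05, confidence = 0.5),
                      appearance = list(default = "lhs", rhs = species_items))
inspect(head(sort(iris_rules, by = "confidence"), 5))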

Mining Association Rules

  • Market Basket Analysis
  • Two-step approach:
    1. Frequent Itemset Generation
      • Generate all itemsets whose support ≥ minsup
    2. Rule Generation
      • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is generally more computationally expensive than rule generation.

Frequent Itemset Generation

  • Brute-force approach:
    • Each itemset in the lattice is a candidate frequent itemset
    • Count the support of each candidate by scanning the database of transactions
    • Match each transaction against every candidate
    • Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates and w the maximum transaction width
    ⇒ Expensive, since M = 2^d − 1 for a data set that contains d items!

Computational Complexity

  • Given d unique items in a data set:
    • Total number of itemsets = 2^d − 1
    • Total number of possible association rules:

      R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

Q. How many rules if d = 2?

Q. How many rules if d = 3?

Tan et al. (2014)
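A quick numeric check of this formula in R (added here as an illustration; the function names are mine), which also answers the two questions above:

#total number of possible association rules for d items: R = 3^d - 2^(d+1) + 1
total_rules <- function(d) 3^d - 2^(d + 1) + 1
total_rules(2)   # 2 rules when d = 2
total_rules(3)   # 12 rules when d = 3

#cross-check via the double sum over antecedent size k and consequent size j
total_rules_sum <- function(d) sum(sapply(1:(d - 1), function(k) choose(d, k) * sum(choose(d - k, 1:(d - k)))))
total_rules_sum(3)  # also 12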

Generation Strategies

Strategies to reduce the computational complexity of frequent itemset generation:

  • Reduce the number of candidates (M)
    • Complete search: M = 2^d
    • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
    • Reduce size of N as the size of itemset increases
    • Used by direct hashing and pruning (DHP) and vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
    • Use efficient data structures to store the candidates or transactions
    • No need to match every candidate against every transaction

Techniques in Frequent Itemset Generation

  • Reduce number of candidates: Apriori principle
  • Reduce number of comparisons: Hash tree
  • Compact representation of frequent itemsets
  • Alternative methods to generate frequent itemsets

Reducing Number of Candidates

  • Apriori principle:
    • If an itemset is frequent, then all of its subsets must also be frequent
  • Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  • Support of an itemset never exceeds the support of its subsets
  • This is known as the anti-monotone property of support

Illustrating Apriori Principle

(Lattice figure) If an itemset is frequent, then all of its subsets, i.e. the shaded itemsets in the lattice, must also be frequent.

Example: Apriori Principle

Transaction Data:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Items (1-itemsets):

Item    Support count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Example: Apriori Principle

Transaction Data:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Minimum Support (count) = 3

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

2-Itemset       Count
Bread, Milk     3
Bread, Beer     2
Bread, Diaper   3
Milk, Beer      2
Milk, Diaper    3
Beer, Diaper    3

3-Itemset             Count
Bread, Milk, Diaper   3

 

Example: Apriori Principle

Working Example

Itemsets
{1,2,3,4}
{1,2,4}
{1,2}
{2,3,4}
{2,3}
{3,4}
{2,4}

The supermarket has a database of transactions where each transaction is a set of SKUs. Use Apriori or brute force to determine the frequent itemsets of this database with minsup = 3; a brute-force sketch in R follows below.
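The brute-force sketch (added here for illustration, not from the original slides) enumerates every candidate itemset and counts its support by scanning all transactions, which is exactly the expensive O(NM) approach that the following techniques improve on:

#brute-force frequent itemset mining on the working example (minsup = 3)
transactions <- list(c(1,2,3,4), c(1,2,4), c(1,2), c(2,3,4), c(2,3), c(3,4), c(2,4))
minsup <- 3
items <- sort(unique(unlist(transactions)))

frequent <- list()
for (k in seq_along(items)) {
  for (cand in combn(items, k, simplify = FALSE)) {     # every candidate k-itemset
    count <- sum(sapply(transactions, function(t) all(cand %in% t)))
    if (count >= minsup) frequent[[length(frequent) + 1]] <- list(itemset = cand, count = count)
  }
}
for (f in frequent) cat("{", paste(f$itemset, collapse = ","), "}:", f$count, "\n")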

Apriori Algorithm

  1. Let k=1
  2. Generate frequent itemsets of length 1
  3. Repeat until no new frequent itemsets are identified
    • Generate length (k+1) candidate itemsets from length k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the database
    • Eliminate candidates that are infrequent, leaving only those that are frequent

Apriori Algorithm is based on two Apriori principles;

Apriori Algorithm is a level-wise search: 1. self-joining; 2. pruning;

Apriori Algorithm makes repeated passes over the data set to count the support

The Apriori Algorithm in Pseudocode

1:  Find all large 1-itemsets
2:  For (k = 2; while Lk-1 is non-empty; k++)
3:  {   Ck = apriori-gen(Lk-1)
4:      For each c in Ck, initialise c.count to zero
5:      For all records r in the DB
6:      {   Cr = subset(Ck, r); for each c in Cr, c.count++   }
7:      Set Lk := all c in Ck whose count >= minsup
8:  }   /* end; return all of the Lk */

apriori-gen : Notes

Suppose we have worked out that the frequent 2-itemsets are:

L2 = { {milk, noodles}, {milk, tights}, {noodles, bacon} }

apriori-gen now generates 3-itemsets that all may be frequent.

An obvious way to do this would be to generate all of the possible 3-itemsets that you can make from {milk, noodles, tights, bacon}.

But this would include, e.g., {milk, tights, bacon}. Now, if this really was a frequent 3-itemset, the number of records containing all three would be >= minsup.

This implies the number of records containing {tights, bacon} would also have to be >= minsup. But it can't be, because {tights, bacon} is not one of the large 2-itemsets.

apriori-gen : the Join Step

apriori-gen is clever: it generates relatively few candidate frequent itemsets, while making sure not to lose any that do turn out to be frequent.

To explain it, note that there is always an ordering of the items. We will assume alphabetical order, and that the data structures used always keep members of a set in alphabetical order; a < b will mean that a comes before b alphabetically.

Suppose we have Lk and wish to generate Ck+1.

First we take every distinct pair of sets in Lk,

{a1, a2, …, ak} and {b1, b2, …, bk},

and do this: in all cases where {a1, a2, …, ak-1} = {b1, b2, …, bk-1} and ak < bk, then {a1, a2, …, ak, bk} is a candidate (k+1)-itemset.

Example: the Join Step

Suppose the 2-itemsets are:

L2 = { {milk, noodles},  {milk, tights}, {noodles, bacon}, {noodles, peas}, {noodles, tights}}

The pairs that satisfy {a1, a2, …, ak-1} = {b1, b2, …, bk-1} and ak < bk are:

{milk, noodles} | {milk, tights}
{noodles, bacon} | {noodles, peas}
{noodles, bacon} | {noodles, tights}
{noodles, peas} | {noodles, tights}

So the candidate 3-itemsets are:

{milk, noodles, tights}, {noodles, bacon, peas}, {noodles, bacon, tights}, {noodles, peas, tights}

According to Apriori principle, all other 3-itemsets cannot be frequent!

apriori-gen : the Prune Step

Now we have some candidate k+1 itemsets, and are guaranteed to have all of the ones that possibly could be frequent, but we have the chance to maybe prune out some more before we enter the next stage of Apriori that counts their support.

In the prune step, we take the candidate k+1 itemsets we have, and remove any for which some k-subset of  it is not a frequent k-itemset. Such couldn’t possibly be a frequent k+1-itemset.

E.g. in the current example (let m := milk, n := noodles, b := bacon, p := peas, t := tights):

L2 = { {m, n}, {m, t}, {n, b}, {n, p}, {n, t} }

And the candidate (k+1)-itemsets so far are: {m, n, t}, {n, b, p}, {n, p, t}, {n, b, t}.

Now, {b, p} is not a frequent 2-itemset, so {n, b, p} is pruned. {p, t} is not a frequent 2-itemset, so {n, p, t} is pruned. {b, t} is not a frequent 2-itemset, so {n, b, t} is pruned.

After this we finally have C3 = { {milk, noodles, tights} }.
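The join and prune steps can be sketched in a few lines of R (an illustration added here, not from the original slides; the function names are mine). Itemsets are kept in alphabetical order, as the join step requires, so the intermediate candidates differ slightly from the walkthrough above, but the surviving candidate set C3 = {{milk, noodles, tights}} is the same:

#apriori-gen sketch: join frequent k-itemsets that share a (k-1)-prefix, then prune
L2 <- list(c("bacon", "noodles"), c("milk", "noodles"), c("milk", "tights"),
           c("noodles", "peas"), c("noodles", "tights"))

apriori_gen <- function(Lk) {
  k <- length(Lk[[1]])
  candidates <- list()
  #join step: merge pairs that agree on the first k-1 items and differ in the last
  for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
    a <- Lk[[i]]; b <- Lk[[j]]
    if (identical(a[-k], b[-k]) && a[k] < b[k])
      candidates[[length(candidates) + 1]] <- c(a, b[k])
  }
  #prune step: drop any candidate with a k-subset that is not in Lk
  in_Lk <- function(s) any(sapply(Lk, identical, y = s))
  keep <- sapply(candidates, function(cand) all(sapply(seq_along(cand), function(d) in_Lk(cand[-d]))))
  candidates[keep]
}

apriori_gen(L2)   # a single candidate: {"milk", "noodles", "tights"}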

Example: Apriori Algorithm

  • With k = 3 (& k-itemsets lexicographically ordered):

Join Step

  • L3 = {3,4,5}, {3,4,7}, {3,5,6}, {3,5,7}, {3,5,8}, {4,5,6}, {4,5,7}
  • Generate all possible (k+1)-itemsets by joining every two k-itemsets whose first k−1 items agree:

{3,4,5,7}, {3,5,6,7}, {3,5,6,8}, {3,5,7,8}, {4,5,6,7}

Prune Step

  • Delete (prune) all itemset candidates with non-frequent subsets. For example, {3,5,6,7} can never be frequent since its subset {5,6,7} is not frequent.
  • Here, only one candidate remains: {3,4,5,7}.

Exercise 1: Apriori Algorithm

Table 1. Example of market basket transactions

Transaction ID  Items
1               {a, b, d, e}
2               {b, c, d}
3               {a, b, d, e}
4               {a, c, d, e}
5               {b, c, d, e}
6               {b, d, e}
7               {c, d}
8               {a, b, c}
9               {a, d, e}
10              {b, d}

Suppose the Apriori algorithm is applied to the data set shown in Table 1 with minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered infrequent. Label each node in the lattice in Fig 1 with one of the following letters: N = not considered by the Apriori algorithm; F = frequent itemset; I = infrequent itemset.

Exercise 1: Apriori Algorithm

Techniques in Frequent Itemset Generation

  • Reduce number of candidates: Apriori principle
  • Reduce number of comparisons: Hash tree
  • Compact representation of frequent itemsets
  • Alternative methods to generate frequent itemsets

Reducing Number of Comparisons

  • Candidate counting:
    • Scan the database of transactions to determine the support of each candidate itemset, which is time-consuming.
    • To reduce the number of comparisons, store the candidates in a hash structure
  • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

Generate Hash Tree

Suppose we have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

We need:

  • Hash function
  • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

Hash function: h(p) = p mod 3

(Figure: the 15 candidate 3-itemsets distributed over the leaf nodes of the hash tree)

Association Rule Discovery: Hash tree

Enumerating Subsets of 3 Items

 

Factors Affecting Complexity of Apriori Algorithm

  • Choice of minimum support threshold
    • lowering support threshold results in more frequent itemsets
    • this may increase number of candidates and max length of frequent itemsets
  • Dimensionality (number of items) of the data set
    • more space is needed to store support count of each item
    • if number of frequent items also increases, both computation and I/O costs may also increase
  • Size of database (number of transactions)
    • since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
  • Average transaction width
    • transaction width increases with denser data sets
    • This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)

Techniques in Frequent Itemset Generation

  • Reduce number of candidates: Apriori principle
  • Reduce number of comparisons: Hash tree
  • Compact representation of frequent itemsets
  • Alternative methods to generate frequent itemsets

Compact Representation of Frequent Itemsets

  • Some itemsets are redundant because they have identical support as their supersets
TID     A1 … A10   B1 … B10   C1 … C10
1-5     all 1      all 0      all 0
6-10    all 0      all 1      all 0
11-15   all 0      all 0      all 1

(Each of transactions 1-5 contains exactly the items A1, …, A10; transactions 6-10 contain exactly B1, …, B10; and transactions 11-15 contain exactly C1, …, C10.)

  • Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
  • We need a compact representation

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent

Superset = a set which includes another set or sets.

Closed Itemset

  • An itemset is closed if none of its immediate supersets has the same support as the itemset
 

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2

Maximal vs Closed Itemsets

Maximal vs Closed Frequent Itemsets

Maximal vs Closed Itemsets

  • Maximal frequent itemsets are the smallest set of itemsets from which all other frequent itemsets can be derived.
  • Closed itemsets provide a minimal representation of itemsets without losing their support information
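A short R sketch (added as an illustration, not from the original slides) that recomputes the supports in the tables above and flags, for an illustrative minimum support count of 2, which frequent itemsets are maximal and which are closed:

#flag closed and maximal frequent itemsets for the small {A,B,C,D} example
transactions <- list(c("A","B"), c("B","C","D"), c("A","B","C","D"), c("A","B","D"), c("A","B","C","D"))
items <- c("A", "B", "C", "D")
minsup_count <- 2   # illustrative threshold on the support count

support_count <- function(s) sum(sapply(transactions, function(t) all(s %in% t)))
itemsets <- unlist(lapply(seq_along(items), function(k) combn(items, k, simplify = FALSE)), recursive = FALSE)
supports <- sapply(itemsets, support_count)
is_imm_superset <- function(big, small) all(small %in% big) && length(big) == length(small) + 1

for (i in seq_along(itemsets)) {
  if (supports[i] < minsup_count) next                  # only frequent itemsets are of interest
  sup <- supports[sapply(itemsets, is_imm_superset, small = itemsets[[i]])]
  maximal <- all(sup < minsup_count)                    # no immediate superset is frequent
  closed  <- all(sup < supports[i])                     # no immediate superset has the same support
  cat("{", paste(itemsets[[i]], collapse = ","), "} support =", supports[i],
      if (maximal) "[maximal]" else "", if (closed) "[closed]" else "", "\n")
}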

Techniques in Frequent Itemset Generation

  • Reduce number of candidates: Apriori principle
  • Reduce number of comparisons: Hash tree
  • Compact representation of frequent itemsets
  • Alternative methods to generate frequent itemsets

 

Alternative Methods for Frequent Itemset Generation

  • Apriori has successfully addressed the combinatorial explosion of frequent itemset generation
  • The performance of Apriori may degrade significantly for dense data sets due to the increasing width of transactions
  • Alternative methods differ in:
    • Traversal of the itemset lattice (general-to-specific vs specific-to-general)
    • Representation of the transaction data set

 

Alternative Methods for Frequent Itemset Generation

  • Representation of Database

– horizontal vs vertical data layout

The support of each candidate itemset can be counted by intersecting the TID-lists of their subsets.
Horizontal data layout:

TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (TID-list for each item):

A   B   C   D   E
1   1   2   2   1
4   2   3   4   3
5   5   4   5   6
6   7   8   9
7   8   9
8   10
9

E.g. TID-list(A) ∩ TID-list(C): its size gives σ({A,C}) = 3.
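In R, vertical-layout support counting is a straightforward intersection of TID-lists (an illustrative sketch added here, using the TID-lists from the table above; the names are mine):

#support counting with the vertical data layout: intersect the items' TID-lists
tid_lists <- list(
  A = c(1, 4, 5, 6, 7, 8, 9),
  B = c(1, 2, 5, 7, 8, 10),
  C = c(2, 3, 4, 8, 9),
  D = c(2, 4, 5, 9),
  E = c(1, 3, 6)
)

support_count <- function(itemset) length(Reduce(intersect, tid_lists[itemset]))

support_count(c("A", "C"))        # size of TID-list(A) intersected with TID-list(C)
support_count(c("A", "C", "D"))   # the same idea extends to larger itemsets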

Mining Association Rules

  • Two-step approach:
  1. Frequent Itemset Generation
    • Generate all itemsets whose support ≥ minsup
  2. Rule Generation
    • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Note that for a given frequent itemset, the rules generated by a binary partitioning have the same support

Rule Generation

  • Given a frequent itemset Y, find all non-empty subsets X ⊂ Y such that X → (Y − X) satisfies the minimum confidence requirement

    • If {A,B,C,D} is a frequent itemset, the candidate rules are:

      ABC → D, ABD → C, ACD → B, BCD → A,
      A → BCD, B → ACD, C → ABD, D → ABC,
      AB → CD, AC → BD, AD → BC,
      BC → AD, BD → AC, CD → AB

  • If |Y| = k, then there are 2^k − 2 candidate association rules (ignoring Y → ∅ and ∅ → Y)

Rule Generation

  • How to efficiently generate rules from frequent itemsets?
    • In general, confidence does not have an anti-monotone property:

      c(ABC → D) can be larger or smaller than c(AB → D)

  • But the confidence of rules generated from the same itemset has an anti-monotone property
  • e.g., Y = {A,B,C,D}:

      c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

  • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

(Figure: lattice of rules generated from the same itemset)

Rule Generation for Apriori Algorithm

  • A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  • Join step: joining CD => AB and BD => AC would produce the candidate rule D => ABC
  • Prune step: prune rule D => ABC if its subset AD => BC does not have high confidence

Effect of Support Distribution

  • Many real data sets have skewed support distribution for items

Support distribution of a retail data set

Effect of Support Distribution

  • How to set the appropriate minsup threshold?
    • If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
    • If minsup is set too low, it is computationally expensive and the number of itemsets is very large
  • Using a single minimum support threshold may not be effective

Multiple Minimum Support

  • How to apply multiple minimum supports?
  • MS(i) := minimum support for item i
  • e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
  • MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
  • Challenge: Support is no longer anti-monotone

Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%

Then {Milk, Coke} is infrequent, but {Milk, Coke, Broccoli} is frequent. Why?
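The min rule for itemset thresholds is easy to illustrate in R (a sketch added here, using the item thresholds and supports quoted above; function names are mine):

#minimum support of an itemset under multiple minimum supports = min of its items' MS values
MS <- c(Milk = 0.05, Coke = 0.03, Broccoli = 0.001, Salmon = 0.005)
itemset_ms <- function(items) min(MS[items])
is_frequent <- function(items, support) support >= itemset_ms(items)

itemset_ms(c("Milk", "Broccoli"))                            # 0.001, i.e. 0.1%
is_frequent(c("Milk", "Coke"), support = 0.015)              # FALSE: 1.5% < min(5%, 3%) = 3%
is_frequent(c("Milk", "Coke", "Broccoli"), support = 0.005)  # TRUE: 0.5% >= 0.1%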

Multiple Minimum Support

 

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

Multiple Minimum Support

 

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

Modified Apriori Algorithm for Multiple Minimum Support

  • Order the items according to their minimum support (in ascending order)
  • e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
  • Ordering: Broccoli, Salmon, Coke, Milk
  • Need to modify Apriori such that:
    • L1 : set of frequent items
    • F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i( MS(i) )
    • C2 : candidate itemsets of size 2 are generated from F1 instead of L1

Modified Apriori Algorithm for Multiple Minimum Support

  • Modifications to Apriori:

– In traditional Apriori,

  • A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
  • The candidate is pruned if it contains any infrequent subsets of size k

– Pruning step has to be modified:

  • Prune only if infrequent subset contains the first item;
  • e.g.: from two frequent 2-itemsets {Broccoli, Coke} and {Broccoli, Milk}, we generate the 3-itemset candidate {Broccoli, Coke, Milk}, where items are ordered according to minimum support;
  • Then, although {Coke, Milk} is infrequent, the candidate {Broccoli, Coke, Milk} is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.

Working Ex.: Frequent Itemsets with Multiple Minsups

Table 1. Dataset

TID  Items
t1   {1, 3, 4, 6}
t2   {1, 3, 5, 6, 7}
t3   {1, 2, 3, 6, 8}
t4   {2, 6, 7}
t5   {2, 3}

Table 2. Minsups of items

Item  Minsup  Count
1     1       3
2     2       3
3     3       4
4     3       1
5     2       1
6     3       4
7     2       2
8     1       1

Items ranked in ascending order of minsup:

Item  Minsup  Count
8     1       1
1     1       3
5     2       1
7     2       2
2     2       3
4     3       1
6     3       4
3     3       4

Q1: Which 1-itemsets are frequent? Which 2-itemsets are frequent?

Q2: Generate association rules from the 2-itemsets {8, 1}, {2, 3} and {6, 3}, respectively, and calculate their supports and confidences.

Hint: rank items according to minsups.

 

Frequent itemsets (itemset : count):

8:1, 8 1:1, 8 1 2:1, 8 1 2 6:1, 8 1 2 6 3:1, 8 1 2 3:1, 8 1 6:1, 8 1 6 3:1, 8 1 3:1,
8 2:1, 8 2 6:1, 8 2 6 3:1, 8 2 3:1, 8 6:1, 8 6 3:1, 8 3:1,
1:3, 1 7:1, 1 7 5:1, 1 7 5 6:1, 1 7 5 6 3:1, 1 7 5 3:1, 1 7 6:1, 1 7 6 3:1, 1 7 3:1,
1 5:1, 1 5 6:1, 1 5 6 3:1, 1 5 3:1, 1 2:1, 1 2 6:1, 1 2 6 3:1, 1 2 3:1,
1 6:3, 1 6 4:1, 1 6 4 3:1, 1 6 3:3, 1 4:1, 1 4 3:1, 1 3:3,
7:2, 7 6:2, 2:3, 2 6:2, 2 3:2, 6:4, 6 3:3, 3:4

Association Rule Evaluation

  • Association rule algorithms tend to produce too many rules
    • many of them are uninteresting or redundant
    • Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
  • Interestingness measures can be used to prune/rank the derived patterns
  • In the original formulation of association rules, support & confidence are the only measures used
  • Other interestingness measures?

Association Rule Evaluation

Butter → Bread,
Chocolate → Teddy Bear,
Beer → Diapers

  • Which of these three seem interesting to you?
  • Which of these three might affect the way you do business?
  • After the creation of association rules we must decide which rules are actually interesting and of use to us.
  • A market basket data set with about 10 transactions and 5 items can have up to 100 association rules
  • We need to identify the most interesting ones. Interestingness is the term coined to describe patterns that we consider of interest. It can be assessed by subjective and objective measures.

Application of Interestingness Measure

Computing Interestingness Measure

  • Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Drawback of Confidence

          Coffee   no Coffee   Total
Tea       15       5           20
no Tea    75       5           80
Total     90       10          100

Consider the association rule Tea → Coffee:

Confidence = P(Coffee | Tea) = 15/20 = 0.75

Support(Tea, Coffee) = 15/100 = 15%

but P(Coffee) = 0.9, and P(Coffee | no Tea) = 75/80 = 0.9375

⇒ Although the confidence is high, the rule is misleading

Subjective vs. Objective Measures of Interestingness

  • Subjective measures are those that depend on the class of users who examine the pattern
  • Objective measures use statistical information which can be derived from the data to determine whether a particular pattern is interesting, e.g. support and confidence.
  • Other objective measures of interestingness:
    • Lift
    • Interest Factor
    • Correlation analysis
    • Etc.

Statistical Independence

  • Population of 1000 students
    • 600 students know how to swim (S)
    • 700 students know how to bike (B)
    • 420 students know how to swim and bike (S,B)
    • P(S, B) = 420/1000 = 0.42
    • P(S) × P(B) = 0.6 × 0.7 = 0.42
    • P(S, B) = P(S) × P(B) => statistically independent
    • P(S, B) > P(S) × P(B) => positively correlated
    • P(S, B) < P(S) × P(B) => negatively correlated
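A quick check of these numbers in R (illustrative only):

#compare the observed joint probability with the product of the marginals
p_swim <- 600 / 1000
p_bike <- 700 / 1000
p_both <- 420 / 1000

p_both / (p_swim * p_bike)                  # ~1, i.e. independence (this ratio is the lift)
isTRUE(all.equal(p_both, p_swim * p_bike))  # TRUE (all.equal avoids floating-point surprises)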

Statistical-based Measures

  • Measures that take into account statistical dependence

Interest Factor

  • Interpretation of the Interest factor: compare the support of the itemset {A,B} to its expected support under the assumption that A and B are statistically independent:
    • I(A,B) = s(A,B) / ( s(A) × s(B) )
    • s(A,B) ≈ P(A and B); s(A) ≈ P(A), s(B) ≈ P(B)
    • Statistical independence: P(A and B) = P(A) × P(B)
  • Use of interest factor:
    • I(A,B) >1 : A and B occur together more frequently than expected by chance.
    • I(A,B) < 1 : A and B occur together less frequently than expected by chance.

Example: Lift/Interest/Correlation

          Coffee   no Coffee   Total
Tea       15       5           20
no Tea    75       5           80
Total     90       10          100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9

⇒ Lift = 0.75 / 0.9 = 0.8333 (< 1: Tea and Coffee occur together less often than independent events would)

φ = (15×5 − 75×5) / √(90 × 20 × 10 × 80) = −0.25 (negatively associated)
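The same lift and φ calculations in R (an illustrative sketch added here, not part of the original slides; variable names are mine):

#lift and phi-coefficient for the Tea -> Coffee contingency table above
n <- 100
tea_coffee <- 15; tea_nocoffee <- 5; notea_coffee <- 75; notea_nocoffee <- 5

support_tea    <- (tea_coffee + tea_nocoffee) / n    # P(Tea) = 0.20
support_coffee <- (tea_coffee + notea_coffee) / n    # P(Coffee) = 0.90
support_both   <- tea_coffee / n                     # P(Tea, Coffee) = 0.15

confidence <- support_both / support_tea             # 0.75
lift <- confidence / support_coffee                  # 0.8333 (< 1)

phi <- (tea_coffee * notea_nocoffee - tea_nocoffee * notea_coffee) /
  sqrt((tea_coffee + tea_nocoffee) * (notea_coffee + notea_nocoffee) *
       (tea_coffee + notea_coffee) * (tea_nocoffee + notea_nocoffee))
phi                                                  # -0.25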

Drawback of Lift & Interest

Table 1:
          Y    not Y   Total
X         90   0       90
not X     0    10      10
Total     90   10      100

Lift(X → Y) = 0.9 / (0.9 × 0.9) ≈ 1.11

Table 2:
          Y    not Y   Total
X         10   0       10
not X     0    90      90
Total     10   90      100

Lift(X → Y) = 0.1 / (0.1 × 0.1) = 10

Statistical independence: if P(X,Y) = P(X) × P(Y), then Lift = 1.

Although X and Y co-occur in 90% of the transactions in Table 1 but in only 10% in Table 2, Table 2 receives the much higher lift. Sometimes Lift can therefore be misleading in terms of association.

Correlation Analysis: φ-Coefficient

  • For binary variables, correlation can be measured using the φ-coefficient
  • The φ-coefficient is: symmetric; invariant under inversion; not invariant under null addition; not invariant under scaling
  • The φ-coefficient considers co-occurrence and co-absence equally important: the two contingency tables below evaluate to the same value
  • This makes the measure more suitable for symmetrical variables

Example: φ-Coefficient

  • The φ-coefficient is analogous to the correlation coefficient for continuous variables

Table 1:
          Y    not Y   Total
X         60   10      70
not X     10   20      30
Total     70   30      100

φ = (60×20 − 10×10) / √(70×30×70×30) = 0.5238

Table 2:
          Y    not Y   Total
X         20   10      30
not X     10   60      70
Total     30   70      100

φ = (20×60 − 10×10) / √(30×70×30×70) = 0.5238

⇒ The φ-coefficient is the same for both tables

Exercise 2: Support, Confidence, Lift

Based on the transaction database, calculate the support, confidence and lift for the following association rules and interpret the results:

Transaction ID  Items
t1              {1, 2, 4, 5}
t2              {2, 3, 5}
t3              {1, 2, 4, 5}
t4              {1, 2, 3, 5}
t5              {1, 2, 3, 4, 5}
t6              {2, 3, 4}

  1. rule 1: 3 ==> 2
  2. rule 2: 1 ==> 5
  3. rule 3: 5 ==> 2
  4. rule 4: 4 5 ==> 2
  5. rule 5: 4 5 ==> 1
  6. rule 6: 1 5 ==> 2
  7. rule 7: 1 2 ==> 5
  8. rule 8: 1 4 5 ==> 2
  9. rule 9: 1 2 4 ==> 5
  10. rule 10: 4 5 ==> 1 2

Using R software to mine Association Rules in a transactional dataset

#load the arules package, which provides apriori(), and its built-in Groceries dataset
> library(arules)
> data("Groceries")

#show the numbers of rows and columns;
> Groceries
transactions in sparse format with
 9835 transactions (rows) and
 169 items (columns)
>

#observe the first five transactions in the dataset:
> inspect(head(Groceries, 5))

    items
[1] {citrus fruit,semi-finished bread,margarine,ready soups}
[2] {tropical fruit,yogurt,coffee}
[3] {whole milk}
[4] {pip fruit,yogurt,cream cheese,meat spreads}
[5] {other vegetables,whole milk,condensed milk,long life bakery product}

#Generating rules

>grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

#check the number of association rules

> grocery_rules
set of 15 rules

>

#inspect the first five rules
> inspect(head(grocery_rules, 5))
    lhs                                      rhs           support    confidence lift     count
[1] {curd,yogurt}                         => {whole milk}  0.01006609 0.5823529  2.279125  99
[2] {other vegetables,butter}             => {whole milk}  0.01148958 0.5736041  2.244885 113
[3] {other vegetables,domestic eggs}      => {whole milk}  0.01230300 0.5525114  2.162336 121
[4] {yogurt,whipped/sour cream}           => {whole milk}  0.01087951 0.5245098  2.052747 107
[5] {other vegetables,whipped/sour cream} => {whole milk}  0.01464159 0.5070423  1.984385 144

#inspect the first five rules by confidence
> inspect(head(sort(grocery_rules, by = "confidence"), 5))
    lhs                                  rhs                 support    confidence lift     count
[1] {citrus fruit,root vegetables}    => {other vegetables}  0.01037112 0.5862069  3.029608 102
[2] {tropical fruit,root vegetables}  => {other vegetables}  0.01230300 0.5845411  3.020999 121
[3] {curd,yogurt}                     => {whole milk}        0.01006609 0.5823529  2.279125  99
[4] {other vegetables,butter}         => {whole milk}        0.01148958 0.5736041  2.244885 113
[5] {tropical fruit,root vegetables}  => {whole milk}        0.01199797 0.5700483  2.230969 118

#inspect the first five rules by lift
> inspect(head(sort(grocery_rules, by = "lift"), 5))
    lhs                                  rhs                 support    confidence lift     count
[1] {citrus fruit,root vegetables}    => {other vegetables}  0.01037112 0.5862069  3.029608 102
[2] {tropical fruit,root vegetables}  => {other vegetables}  0.01230300 0.5845411  3.020999 121
[3] {root vegetables,rolls/buns}      => {other vegetables}  0.01220132 0.5020921  2.594890 120
[4] {root vegetables,yogurt}          => {other vegetables}  0.01291307 0.5000000  2.584078 127
[5] {curd,yogurt}                     => {whole milk}        0.01006609 0.5823529  2.279125  99

#Generate rules by specifying the antecedent or consequent
#To show what products are bought before buying "whole milk", generate rules whose consequent is "whole milk".
> wholemilk_rules <- apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.08), appearance = list(rhs = "whole milk"))

#inspect the first five rules by lift
> inspect(head(sort(wholemilk_rules, by = "lift"), 5))
    lhs                                            rhs           support     confidence lift     count
[1] {rice,sugar}                                => {whole milk}  0.001220132 1          3.913649 12
[2] {canned fish,hygiene articles}              => {whole milk}  0.001118454 1          3.913649 11
[3] {root vegetables,butter,rice}               => {whole milk}  0.001016777 1          3.913649 10
[4] {root vegetables,whipped/sour cream,flour}  => {whole milk}  0.001728521 1          3.913649 17
[5] {butter,soft cheese,domestic eggs}          => {whole milk}  0.001016777 1          3.913649 10

#Limiting the number of rules generated by increasing the minimum support and confidence
> grocery_rules_increased_thresholds <- apriori(Groceries, parameter = list(support = 0.02, confidence = 0.5))
>

#inspect the generated rule
> inspect(grocery_rules_increased_thresholds)
    lhs                          rhs           support    confidence lift     count
[1] {other vegetables,yogurt} => {whole milk}  0.02226741 0.5128806  2.007235 219
>

#Note:
#If you want to get stronger rules, you have to increase the confidence.
#If you want lengthier rules, increase the maxlen parameter.
#If you want to eliminate shorter rules, increase the minlen parameter.

Further Readings on Association Rules in Business Cases

  • Huang, D., Lu, X. and Duan, H. (2011). Mining association rules to support resource allocation in business process management. Expert Systems with Applications, 38, 9483-9490.
  • Kamsu-Foguem et al. (2013). Mining association rules for the quality improvement of the production process. Expert Systems with Applications, 40, 1034-1045.

Summary of Association Rule

  1. What is the purpose of mining association rules?
  2. What are key concepts in mining association rules?
  3. How to mine association rules from a given transaction dataset?
  4. How to reduce the computational complexity in mining association rules?
  5. How to measure whether association rules are good?
