Frequent Itemset and Association Rule Mining Lab: A Practical Guide
Objective:
- To understand and implement frequent itemset and association rule mining techniques.
Theory:
Frequent pattern mining is a fundamental data mining task that searches for recurring regularities in large datasets. Association rules are formed by analyzing data for frequent if/then patterns and using the criteria of support and confidence to identify the most important relationships. Support indicates how frequently the items appear in the database; confidence indicates how often the if/then statements are found to be true.
Dataset:
For this lab, we’ll use a sample dataset representing transactions in a retail store. Each transaction includes a unique identifier and a set of items bought by a customer.
| TID  | Items Purchased     |
|------|---------------------|
| T100 | Milk, Bread, Eggs   |
| T200 | Milk, Bread         |
| T300 | Milk, Cheese        |
| T400 | Bread, Cheese       |
| T500 | Milk, Bread, Cheese |
Pre-Lab Questions:
What is a frequent itemset?
A frequent itemset is a set of items, attributes, or events that occur frequently together in a dataset. “Frequent” means that the itemset appears in a significant number of transactions, exceeding a predefined minimum support threshold. For example, in market basket analysis, {milk, bread} being purchased together frequently would be a frequent itemset.
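This check can be sketched directly on the lab's five transactions (a minimal illustration; the function name `is_frequent` is ours, not from any library):

```python
# The five transactions from this lab's sample dataset, as sets.
transactions = [
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread"},
    {"Milk", "Cheese"},
    {"Bread", "Cheese"},
    {"Milk", "Bread", "Cheese"},
]

def is_frequent(itemset, transactions, min_support=0.3):
    # Count transactions that contain every item in the itemset,
    # then compare the fraction against the minimum support threshold.
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions) >= min_support

print(is_frequent({"Milk", "Bread"}, transactions))  # appears in 3 of 5 → True
print(is_frequent({"Eggs"}, transactions))           # appears in 1 of 5 → False
```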
Explain the concepts of support and confidence in association rule mining.
Support:
Support measures the proportion of transactions that contain both the antecedent and consequent items. It indicates how frequently the itemset appears in the dataset. A high support value suggests that the itemset is common and potentially important.
For a rule X → Y, the support is calculated as:
Support(X → Y) = (Number of transactions containing X and Y) / (Total number of transactions)
Confidence:
Confidence measures the reliability of the association rule. It is the likelihood that a customer who buys the antecedent item will also buy the consequent item. A high confidence value indicates a strong association between the items.
For a rule X → Y, the confidence is calculated as:
Confidence(X → Y) = (Number of transactions containing X and Y) / (Number of transactions containing X)
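As a worked illustration of both formulas on this lab's five transactions (a sketch; the helper names `support` and `confidence` are ours), consider the rule {Milk} → {Bread}:

```python
transactions = [
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread"},
    {"Milk", "Cheese"},
    {"Bread", "Cheese"},
    {"Milk", "Bread", "Cheese"},
]

def support(X, Y, transactions):
    # Fraction of all transactions containing every item of X and Y.
    both = sum(1 for t in transactions if X | Y <= t)
    return both / len(transactions)

def confidence(X, Y, transactions):
    # Among transactions containing X, the fraction that also contain Y.
    has_x = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X | Y <= t)
    return both / has_x

# 3 of 5 transactions contain both Milk and Bread → support 0.6;
# of the 4 transactions containing Milk, 3 also contain Bread → confidence 0.75.
print(support({"Milk"}, {"Bread"}, transactions))     # 0.6
print(confidence({"Milk"}, {"Bread"}, transactions))  # 0.75
```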
Describe the Apriori algorithm and its purpose.
The Apriori algorithm is a classic algorithm used for frequent itemset mining and association rule learning. Its purpose is to identify frequent itemsets in a transaction database, which can then be used to generate association rules. The Apriori algorithm leverages the Apriori property: all subsets of a frequent itemset must also be frequent.
- Key Steps:
- The algorithm starts by scanning the database to count the occurrences of each item, identifying frequent items (1-itemsets) that meet the minimum support threshold.
- It iteratively generates candidate itemsets of length k+1 from the frequent itemsets of length k.
- The algorithm prunes candidate itemsets that do not meet the minimum support threshold.
- The process continues until no new frequent itemsets are found.
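The steps above can be sketched in plain Python (an illustrative sketch, not mlxtend's implementation; `apriori_sketch` and the transaction sets are names invented here, and a full implementation would also discard candidates with any infrequent (k-1)-subset before counting):

```python
def apriori_sketch(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Level 1: single items meeting the minimum support threshold.
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: keep only candidates whose support meets the threshold.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread"},
    {"Milk", "Cheese"},
    {"Bread", "Cheese"},
    {"Milk", "Bread", "Cheese"},
]
# With min_support=0.3 this yields 3 frequent items and 3 frequent pairs;
# {Milk, Bread, Cheese} is pruned (support 0.2).
print(apriori_sketch(transactions, 0.3))
```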
By identifying these frequent patterns and associations, businesses can make informed decisions about product placement, cross-selling strategies, and recommendation systems.
Procedure:
- Data Preparation:
- Represent the dataset in a suitable format for analysis (e.g., a list of transactions).
- Frequent Itemset Generation:
- Use the Apriori algorithm to generate frequent itemsets.
- Set a minimum support threshold (e.g., 30%).
- Identify itemsets that meet this threshold.
- Association Rule Mining:
- Generate association rules from the frequent itemsets.
- Set a minimum confidence threshold (e.g., 60%).
- Evaluate the rules based on support and confidence.
- Analysis and Interpretation:
- Interpret the generated association rules.
- Discuss the implications of these rules for business decisions.
Implementation:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# Sample dataset
dataset = [
    ['Milk', 'Bread', 'Eggs'],
    ['Milk', 'Bread'],
    ['Milk', 'Cheese'],
    ['Bread', 'Cheese'],
    ['Milk', 'Bread', 'Cheese']
]

# Data preparation: one-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)  # boolean columns, as apriori expects

# Print the transformed DataFrame
print("Transaction DataFrame:\n", df)

# Frequent itemset generation
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# Check whether any frequent itemsets were found
if frequent_itemsets.empty:
    print("\nNo frequent itemsets found with the given min_support.")
else:
    print("\nFrequent Itemsets:\n", frequent_itemsets)

    # Association rule mining (only meaningful if frequent itemsets exist)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
    if rules.empty:
        print("\nNo association rules found with the given min_threshold.")
    else:
        print("\nAssociation Rules:\n", rules)