Chapter 10 - Decision Tree Learning
Chapter Outline
- Chapter Outline
- Quick Recap
- Decision Tree Learning
- ID3 – Basic Decision Tree Learning Algorithm
- Machine Learning Cycle – ID3 Algorithm
- Overfitting – ID3 Algorithm
- Chapter Summary
Quick Recap
- Quick Recap – Version Space Algorithm
- Problem – FIND-S Algorithm
- Only returns one hypothesis which best fits the Training Data
- However, there can be multiple hypotheses that best fit the Training Data
- Proposed Solution
- Version Space
- Version Space (VSH,D) contains the set of all hypotheses consistent with the Training Examples
- Version Space Algorithm computes the set of all hypotheses consistent with the Training Examples
- Two Algorithms which return a Version Space are
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- List Then Eliminate Algorithm – Summary
- Representation of Example
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Attributes
- Searching Strategy
- Sequential Search
- Training Regime
- Incremental Method
- Inductive Bias of List Then Eliminate Algorithm
- Training Data is error-free
- Target Function / Concept is present in the Hypothesis Space (H)
- Strengths
- List Then Eliminate Algorithm overcomes the limitation of FIND-S Algorithm and
- returns multiple hypotheses that best fit the Training Data, i.e. the Version Space (VSH,D)
- Weaknesses
- Only works on error-free Data
- However, Real-world Data is noisy
- Works on the assumption that Target Function is present in the Hypothesis Space (H)
- However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known
- List Then Eliminate Algorithm is computationally expensive because
- Hypothesis Space (H) is represented as a List, which forces a
- pairwise comparison of each Training Example with every hypothesis in the Hypothesis Space (H)
- To reduce the computational cost of Version Space Algorithms
- Have a more compact representation of Hypothesis Space (H)
- e.g. Candidate Elimination Algorithm
- Candidate Elimination Algorithm – Summary
- Representation of Example
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Attributes
- Searching Strategy
- Bidirectional Search (maintains the General (G) and Specific (S) boundary sets of the Version Space)
- Training Regime
- Incremental Method
- Inductive Bias of Candidate Elimination Algorithm
- Training Data is error-free
- Target Function / Concept is present in the Hypothesis Space (H)
- Strengths
- Candidate Elimination Algorithm overcomes the limitation of List Then Eliminate Algorithm and
- provides a more compact representation of the Version Space
- Version Space returned by Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a
- Single Model or
- Ensemble Model
- Weaknesses
- Only works on error-free Data
- However, Real-world Data is noisy
- Works on assumption that Target Function is present in the Hypothesis Space (H)
- However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known
- Only works when values of Attributes are Categorical
- However, in Real-world many Machine Learning Problems have
- Attributes / Features with Numeric values
- Only works for simple representations of Data
- However, in Real-world many Machine Learning Problems have complex representations (e.g. Face Detection, Face Recognition, etc.)
- Single Model
- A Single Model is trained on the Training Data and used to make predictions on the Test Data
- Strengths
- It is computationally fast and requires less time to make predictions compared to Ensemble Model
- Weaknesses
- Error is likely to be high compared to Ensemble Model
- Ensemble Model
- An Ensemble Model works by training different Models on the same Training Data and using each Model to individually make predictions on the Test Data
- Strengths
- Error is likely to be low compared to Single Model
- Weaknesses
- It is computationally expensive and requires more time to make predictions compared to Single Model
- Voting Approach is one of the simplest approaches to combine predictions from different Models
- In Voting Approach, we may assign
- Same weight to all Models
- Different weights to all Models
- Model Weight plays an important role in making the Final Prediction
- Voting Approach works as follows
- Given a Test instance x
- Step 1: Make individual predictions using different Models
- Step 2: Combine predictions of individual Models
- Step 3: Final Prediction of x will be the class which has the majority vote
- Voting Classifier is not an actual Classifier (Machine Learning Algorithm) but a wrapper for a set of different Machine Learning Algorithms, which are trained and tested on the same Data
- A Voting Classifier combines predictions of individual Machine Learning Algorithms to make Final Predictions on unseen Data
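- A minimal illustrative Python sketch of the Voting Approach above; the Model predictions and weights below are hypothetical placeholders, not taken from the chapter's slides
```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class predictions from several (hypothetical) Models.

    predictions: one predicted class label per Model.
    weights: optional per-Model weights; equal weights are assumed if omitted.
    """
    if weights is None:
        weights = [1] * len(predictions)        # same weight to all Models
    votes = Counter()
    for label, weight in zip(predictions, weights):
        votes[label] += weight                  # Step 2: combine individual predictions
    return votes.most_common(1)[0][0]           # Step 3: class with the majority vote

# Step 1 (hypothetical): three Models predict the class of a Test instance x
model_predictions = ["Female", "Male", "Female"]
print(majority_vote(model_predictions))                      # equal weights  -> Female
print(majority_vote(model_predictions, weights=[1, 3, 1]))   # weighted vote  -> Male
```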
Quick Recap - Models
- Learning Input-Output Functions – General Settings
- Input to Learner
- Set of Training Examples (D)
- Set of Functions / Hypothesis (H)
- Output of Learner
- A hypothesis h from H, which best fits the Training Examples (D)
- Learning is a Searching Problem
- For any Machine Learning Algorithm, we should have clear understanding of the following three points
- Representation of Training Examples (D)
- How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from?
- Representation of Hypothesis (h)
- How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?
- Searching Strategy
- What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?
- Summary - FIND-S Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Specific-to-General Search (FIND-S starts with the most specific hypothesis and generalizes it only as far as the Training Examples require)
- Training Regime
- Incremental Method
- Summary – List Then Eliminate Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Sequential Search
- Training Regime
- Incremental Method
- Summary – Candidate Elimination Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Bidirectional Search (maintains the General (G) and Specific (S) boundary sets of the Version Space)
- Training Regime
- Incremental Method
- Strengths and Weaknesses – Candidate Elimination Algorithm
- Strengths
- Candidate Elimination Algorithm overcomes the limitation of List Then Eliminate Algorithm and
- provides a more compact representation of the Version Space
- Version Space returned by Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a
- Single Model or
- Ensemble Model
- Weaknesses
- Only works on error-free Data
- However, Real-world Data is noisy
- Works on assumption that Target Function is present in the Hypothesis Space (H)
- However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known
- Only works when values of Attributes are Categorical
- However, in Real-world many Machine Learning Problems have
- Attributes / Features with Numeric values
- Only works for simple representations of Data
- However, in Real-world many Machine Learning Problems have complex representation (e.g. Face Detection, Face Recognition, etc.)
- Main Problems in Candidate Elimination Algorithm
- Main Problems
- Cannot handle noisy Data
- Cannot handle Numeric Data
- Cannot work for complex representations of Data
- It may fail to find the Target Function in the Hypothesis Space
- i.e. converge to an empty Version Space
- Proposed Solution
- Decision Tree Learning Algorithms
Decision Tree Learning
- ID3 Decision Tree Learning Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Disjunction (OR) of Conjunctions (AND) of Input Attributes
(a.k.a. Decision Tree)
- Searching Strategy
- Greedy Search
- Simple to Complex (Hill Climbing)
- Training Regime
- Batch Method
- Comparison of Representation of Training Examples (D)
- In FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms, Training Examples (D) are represented as
- Attribute-Value Pair
- Conclusions
- Representation of Training Examples (D) is the same for FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms
- Comparison of Representation of Hypothesis (h)
- In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Hypothesis (h) is represented as
- Conjunction (AND) of Constraints on Input Attributes
- In ID3 Machine Learning Algorithm, Hypothesis (h) is represented as
- Disjunction (OR) of Conjunctions (AND) of Input Attributes
- Conclusion
- Representation of Hypothesis (h) in ID3 ML Algorithm is more expressive and powerful than FIND-S, List Then Eliminate and Candidate Elimination Algorithms
- Reason
- FIND-S, List Then Eliminate and Candidate Elimination Algorithms only use
- Conjunction (AND) relationship between Attributes
- ID3 Algorithm uses both
- Disjunction (OR) and Conjunction (AND) relationships between Attributes
- Comparison of Searching Strategy
- In FIND-S Algorithm, Searching Strategy is
- Specific-to-General Search (FIND-S starts with the most specific hypothesis and generalizes it only as far as the Training Examples require)
- In List Then Eliminate Algorithm, Searching Strategy is
- Sequential Search
- In Candidate Elimination Algorithm, Searching Strategy is
- Bidirectional Search (maintains the General (G) and Specific (S) boundary sets of the Version Space)
- In ID3 Machine Learning Algorithm, Searching Strategy is
- Greedy Search
- Simple to Complex (Hill Climbing)
- Conclusion
- Search Strategy of ID3 Algorithm is different from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms
- Comparison of Training Regimes
- In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Training Regime is
- Incremental Method
- In ID3 Machine Learning Algorithm, Training Regime is
- Batch Method
- Conclusion
- Training Regime of ID3 Algorithm is different from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms
- Decision Tree
- Definition
- A Decision Tree is a graph that uses a branching method to illustrate every possible outcome of a decision
- Decision Trees can be drawn by hand or created with a graphics program or specialized software
- Decision Tree Learning
- Definition
- Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree
- Learned Decision Trees (or Models) can also be re-represented as
- Sets of If-Then Rules (to improve human readability)
- Importance
- Among the most popular Inductive Learning Algorithms
- Applications
- Decision Tree Learning Algorithms are successfully and widely used in a range of Real-world Applications including
- Predicting Magnetic Properties of Crystals
- Profiling Higher-Priced Houses
- Detecting Advertisements on the Web
- Controlling a Production Process
- Diagnosing Hypothyroidism
- Assessing Credit Risk
- Sentiment Analysis
- Emotion Analysis
- Author Profiling
- Word Sense Disambiguation
- Text Reuse and Plagiarism Detection
- Note
- Note the difference between the definitions of
- Decision Tree and
- Decision Tree Learning
- Example – Decision Tree
- A Decision Tree (or Model) is given below
- Example – Decision Tree Cont…
- Steps - Converting a Decision Tree into If-Then Rules
- To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
- Step 1: Write down all the paths in the Decision Tree
- A path is a Conjunction (AND) of Attributes
- Step 2: Join the paths using Disjunction (OR)
- Note
- Decision Tree = Disjunction of Paths
- Example – Converting a Decision Tree into If-Then Rules
- Consider the following Decision Tree; In sha Allah, we will convert it into If-Then Rules
- Example – Converting a Decision Tree into If-Then Rules
- Step 1: Write down all the paths in the Decision Tree
- Step 2: Join the paths using Disjunction (OR)
- Decision Tree = Disjunction (OR) of Paths
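- As an illustrative sketch of the two steps above: the nested-dict tree representation and the example tree below are my own assumptions, loosely modelled on the chapter's Gender Identification example
```python
def tree_to_rules(tree, path=None):
    """Step 1: enumerate every root-to-leaf path of a Decision Tree as a Rule.

    The tree is assumed to be a nested dict: {attribute: {value: subtree_or_leaf}}.
    Each path is a Conjunction (AND) of attribute tests; the whole Rule set is
    their Disjunction (OR), i.e. Decision Tree = Disjunction of Paths.
    """
    path = path or []
    if not isinstance(tree, dict):                       # leaf node -> one If-Then Rule
        conditions = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {conditions} THEN {tree}"]
    rules = []
    (attribute, branches), = tree.items()                # one test attribute per node
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + [(attribute, value)]))
    return rules

# Hypothetical Decision Tree for the Gender Identification example
tree = {"Height": {"Short": {"HairLength": {"Long": "Female", "Short": "Male"}},
                   "Normal": "Female",
                   "Tall": "Male"}}
print(" OR\n".join(tree_to_rules(tree)))                 # Step 2: join paths with OR
```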
- Problems Appropriate for Decision Tree Learning
- In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
- Instances / Examples are Represented as Attribute–Value Pair
- Values of Attributes can be either
- Categorical or
- Numeric
- Target Function (f) is Discrete Valued
- Example 1 – Discrete Valued Target Function
- Gender Identification Problem
- Two Possible Output Values
- Male
- Female
- Example 2 – Discrete Valued Target Function
- Sentiment Analysis
- Three Possible Output Values
- Positive
- Negative
- Neutral
- Disjunctive Hypothesis may be Required
- Easy for Decision Tree Learning Algorithms to learn disjunctive concepts
- Note that such concepts were outside the Hypothesis Space (H) of the Candidate Elimination Algorithm
- Possibly Noisy / Incomplete Training Data
- Decision Tree Learning Algorithms are robust to errors in
- Values of Input Attributes
- Values of the Output Attribute, i.e. the Classification of some Training Example(s) is Incorrect
- Decision Tree Learning Algorithms can be trained on Training Examples where for some instances some Attribute Values are unknown/missing
- Note
- Candidate Elimination Algorithm cannot be trained on noisy/incomplete Data
ID3 - Basic Decision Tree Learning Algorithm
- Note
- There are many state-of-the-art Decision Tree Learning Algorithms available
- For example, J48, Random Forest
- In sha Allah, I will discuss the most basic Decision Tree Learning Algorithm i.e. ID3
- Reason
- In sha Allah, it will help you to completely and correctly understand
- How Decision Tree Learning Algorithms work
- ID3 Algorithm
- How Does the ID3 Algorithm Work?
- ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question
- Question
- Which Attribute should be tested at the root of the tree?
- Answer
- Use a statistical test to evaluate each Input Attribute to determine
- How well it alone classifies the Training Examples?
- Best Attribute
- Best Attribute is the one which alone best classifies the Training Examples
- Best Attribute is selected and used as the test at the root node of the tree
- A descendant of the root node is then created for each possible value of this Attribute, and
- Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
- The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
- This process continues for each new leaf node until either of two conditions is met
- Every Attribute has already been included along this path through the tree, or
- Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)
- Note
- This forms a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
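- A minimal, runnable sketch of this top-down, greedy procedure; the entropy and information-gain helpers anticipate measures defined later in the chapter, and the four-example dataset is a hypothetical stand-in for the chapter's Gender Identification data
```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (defined formally later in the chapter)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Expected reduction in Entropy from partitioning examples on attribute."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def id3(examples, attributes, target):
    """Top-down, greedy construction of a Decision Tree (no backtracking)."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                     # all examples share one Output value
        return labels[0]
    if not attributes:                            # no Attribute left along this path
        return Counter(labels).most_common(1)[0][0]      # majority class
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:     # a descendant for each observed value
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Hypothetical miniature of the Gender Identification Training Data
data = [
    {"Height": "Short",  "HairLength": "Long",  "Gender": "Female"},
    {"Height": "Short",  "HairLength": "Short", "Gender": "Male"},
    {"Height": "Normal", "HairLength": "Short", "Gender": "Female"},
    {"Height": "Tall",   "HairLength": "Short", "Gender": "Male"},
]
print(id3(data, ["Height", "HairLength"], "Gender"))
```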
- Which Attribute is the Best Classifier?
- Most Crucial Step in ID3 Algorithm
- Choosing which Attribute to test at the next node?
- Would like to choose that Attribute which does
- best at separating Training Examples according to their Target Classification (or Output)
- An Attribute which separates Training Examples into two sets
- each of which contains Positive / Negative Training Examples of the Target / Output Attribute in the same ratio as the initial set of Training Examples
- has not helped us progress towards a classification
- Example - Which Attribute is the Best Classifier?
- Consider Gender Identification Problem
- Suppose we have a Set of 14 Training Examples
- Positive (Female) = 9
- Negative (Male) = 5
- Example - Which Attribute is the Best Classifier? Cont.…
- Consider Gender Identification Problem
- Suppose we have a Set of 14 Training Examples
- Positive = 9
- Negative = 5
- Question
- Which Attribute is the best classifier?
- Hair Length or
- Head Covered
- Answer
- Hair Length
- Measure to Pick Best Classifier Attribute
- Question
- How many measures are there for picking the best classifier attribute?
- Answer
- Many measures are available for picking the best classifier Attribute
- Examples
- Information Gain
- Gain Ratio
- Note
- In this Chapter, in sha Allah we will use Information Gain
- Information Gain
- Information Gain is a useful measure for picking the best classifier Attribute
- Information Gain measures
- how well a given Attribute separates Training Examples with respect to their Target Classification
- Information Gain is defined in terms of
- Entropy
- Entropy
- Definition
- Entropy gives a measure of purity/impurity of a Set of Training Examples
- Formula
- Entropy(S) = – p⊕ log2(p⊕) – p⊖ log2(p⊖)
- where S represents the Set of Training Examples (or Sample)
- p⊕ represents the proportion of Positive Training Examples
- p⊖ represents the proportion of Negative Training Examples
- Range of Entropy
- Value of Entropy lies in the range [0, 1] (for a two-class problem)
- 0 means the Sample is Pure
- 1 means that Sample has maximum impurity
- Minimum Entropy vs Maximum Entropy
- Minimum Entropy
- Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category
- Maximum Entropy
- Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
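- A minimal Python sketch of the two-class Entropy formula above; it reproduces the four worked examples that follow
```python
import math

def entropy(p_pos, p_neg):
    """Entropy of a two-class Sample, taking 0 * log2(0) = 0 by convention."""
    value = -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)
    return value + 0.0                 # normalise -0.0 to 0.0 for a pure Sample

print(entropy(14/14, 0/14))            # 0.0  -> pure Sample (all Positive)
print(entropy(0/14, 14/14))            # 0.0  -> pure Sample (all Negative)
print(entropy(7/14, 7/14))             # 1.0  -> maximum impurity
print(round(entropy(9/14, 5/14), 3))   # 0.94 -> the [9+, 5-] Sample
```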
- Example 1 – Computing Entropy
- Considering Gender Identification Problem
- Training Examples = 14
- Positive = 14
- Negative = 0
- Entropy([14+, 0-]) = – p⊕ log2(p⊕) – p⊖ log2(p⊖)
= – (14/14) log2 (14/14) – (0/14) log2 (0/14)
= – 1 log2 (1) – 0 log2 (0)
= – 1 (0) – 0 (0)
= 0
- Note
- Sample Is Pure (all Training Examples are Positive) and
- Entropy is ZERO
- Example 2 – Computing Entropy
- Considering Gender Identification Problem
- Training Examples = 14
- Positive = 0
- Negative = 14
= – (0/14) log2 (0/14) – (14/14) log2 (14/14)
= – 0 log2 (0) – 1 log2 (1)
= – 0 (0) – 1 (0)
= 0
- Note
- Sample Is Pure (all Training Examples are Negative) and
- Entropy is ZERO
- Example 3 – Computing Entropy
- Considering Gender Identification Problem
- Training Examples = 14
- Positive = 7
- Negative = 7
= – (7/14) log2 (7/14) – (7/14) log2 (7/14)
= – 0.5 log2 (0.5) – 0.5 log2 (0.5)
= – 0.5 (-1) – 0.5 (-1)
= 0.5 + 0.5
= 1
- Note
- Sample has maximum impurity (half of the Training Examples are Positive and remaining half are Negative) and
- Entropy is ONE
- Example 4 – Computing Entropy
- Considering Gender Identification Problem
- Training Examples = 14
- Positive = 9
- Negative = 5
= -(9/14) log2 (9/14) – (5/14) log2 (5/14)
= 0.940
- Note
- Sample Is impure
- Information Gain
- Definition
- Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
- Formula
- Given a Set of Training Examples S and Attribute A
- Gain(S, A) = Entropy(S) – Σv ∈ Values(A) (|Sv| / |S|) Entropy(Sv)
- where
- Values(A) is the Set of Values Attribute A can take on
- Sv is the subset of S for which Attribute A has value v
- First term in Gain(S,A) is
- Entropy of Set of Training Examples S
- Second term in Gain(S, A) is
- Expected Entropy after partitioning on A = sum of the Entropies of each subset Sv, weighted by the fraction |Sv| / |S|
- Strengths
- Information Gain can be used to Construct / Learn a Decision Tree using ID3 Algorithm
- Weaknesses
- Information Gain measure favors attributes with many values over those with few values
- Example - Weakness of Information Gain
- Consider the Gender Identification Problem
- If we add a Date attribute to the set of Input Attributes
- The Date attribute will have a distinct value for each day and will have the highest Information Gain
- This is because Date attribute perfectly predicts the Target / Output Attribute for all Training Examples
- Result is a Decision Tree of Depth 1 that
- perfectly classifies Training Examples but fails on all other data
- Example – Computing Information Gain
- Considering Gender Identification Problem
- Information Gain for HeadCovered is computed directly from the formula above; a code sketch follows
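- A minimal sketch of the Gain(S, A) formula; the HeadCovered partition counts used here are hypothetical placeholders, since the chapter's full 14-example table is not reproduced in this text
```python
import math

def entropy(counts):
    """Entropy of a Sample given its per-class counts, e.g. [9, 5] for [9+, 5-]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv| / |S| * Entropy(Sv)."""
    total = sum(parent_counts)
    remainder = sum(sum(sv) / total * entropy(sv) for sv in partitions)
    return entropy(parent_counts) - remainder

# Hypothetical HeadCovered partition of the [9+, 5-] Sample:
#   HeadCovered = Yes -> [6+, 2-],  HeadCovered = No -> [3+, 3-]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))
```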
Machine Learning Cycle - ID3 Algorithm
- Gender Identification Problem
- Machine Learning Problem
- Gender Identification
- Input
- Human
- Output
- Gender of a Human
- Task
- Given a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e., Learn from Input to predict Output
- Representation of Examples
- Example = Input + Output
- Input = Human
- Output = Gender
- Representation of Input and Output
- Attribute-Value Pair
- Representation of Input
- Input is represented as a
- Set of 4 Input Attributes
- Height
- Weight
- Hair Length
- Head Covered
- Representation of Output
- Output is represented as a
- Set of 1 Output Attribute
- Gender
- Note
- Yes means Female and No means Male
- Computing Size of Instance Space (X)
- Computing Size of Instance Space (X)
- |X| = No. of Values of Height Input Attribute x No. of Values of Weight Input Attribute x No. of Values of Hair Length Input Attribute x No. of Values of Head Covered Input Attribute
- |X| = 3*3*2*2 = 36
- Sample Data
- We obtained a Sample Data of 21 examples
- Representation of Hypothesis (h)
- We represent a Hypothesis (h) as
- Disjunction (OR) of Conjunctions (AND) of Input Attributes
- Since ID3 can represent Disjunctive Hypotheses, therefore
- Hypothesis Space (H) = Concept Space (C)
- |H| = |C| = 2^|X| = 2^36 = 68,719,476,736
- Note
- Using ID3 Algorithm, we will find the Target Function / Concept in the Hypothesis Space because
- we are searching complete Concept Space
- In Candidate Elimination Algorithm, we may / may not find the Target Function / Concept in the Hypothesis Space because
- we are searching incomplete Concept Space
- Hypothesis Space (H) – ID3 Algorithm
- In ID3 Algorithm, Hypothesis Space (H) is a
- Set of Decision Trees
- Sample Data
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Testing Data
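- A minimal sketch of the Random Split Approach with the 2/3 – 1/3 ratio used above; the 21 placeholder instances stand in for the chapter's Sample Data table
```python
import random

def random_split(examples, train_fraction=2/3, seed=0):
    """Randomly shuffle the Sample Data and split it into Training and Testing Data."""
    shuffled = examples[:]                   # copy, so the original order is preserved
    random.Random(seed).shuffle(shuffled)    # fixed seed -> reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

sample = [f"x{i}" for i in range(1, 22)]     # 21 hypothetical instances
training_data, testing_data = random_split(sample)
print(len(training_data), len(testing_data))             # -> 14 7
```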
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in the Real-world, to make predictions on Real-time unseen Data
- Feedback Phase
- Take Feedback from the Users and Domain Experts to improve the Model
Training Phase – ID3 Algorithm
- Training Phase - ID3 Algorithm
- Let’s construct/learn a Decision Tree (Model / h) which best fits the Training Examples using ID3 Algorithm
- Selection of Attribute for Root Node
- Question
- What Attribute should be tested at the root node?
- Answer
- Compute Information Gain for all four Attributes (Height, Weight, Hair length and Head Covered) and
- Attribute with the highest Information Gain will be tested at the root node
- Which attribute should be tested at the root?
- Gain (S, Height) = 0.246
- Gain (S, Weight) = 0.029
- Gain (S, Hair Length) = 0.151
- Gain (S, Head Covered) = 0.084
- Height provides the best prediction for Target Classification
- Therefore, Height will be the root node
- After Selection of Root Node, Cont…
- Working on Height = Short Node
- Question
- What Attribute should be tested here?
- Answer
- Compute Information Gain for all three Attributes (Weight, Hair length and Head Covered) and
- Attribute with the highest Information Gain will be tested at Height = Short node
- Computing Information Gain
- Hair Length provides the best prediction for Target Classification
- Therefore, it will be tested at Height = Short node
- Let’s further grow Decision Tree
- Add to the tree a successor for each possible value of Hair Length
- Partition the Training Examples according to the value of Hair Length
- Note
- For Root Node Selection
- No. of Training Examples = 14
- No. of Attributes = 4
- For Height = Short Node Selection
- No. of Training Examples = 5
- No. of Attributes = 3
- Working on Height = Normal Node
- Snormal = {x3,x7,x12,x13}
- Target Classification for all four instances is
- Positive
- i.e. Sample is Pure (Entropy is ZERO)
- Recall that one of the Conditions to Stop Growing the Decision Tree is
- Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is ZERO)
- For Height = Normal Node
- We will not grow the Decision Tree and
- Leaf Node will be assigned the label of the Majority Class
- Considering Snormal = {x3,x7,x12,x13}
- Positive (Female) = 4
- Negative (Male) = 0
- Therefore, Majority Class is Female and this will be the label of Leaf Node
- Decision Tree Learned by ID3 Algorithm
- Decision Tree (or Model) learned by ID3 Algorithm is given below
- Searching Strategy - ID3 Algorithm
- Recall – Learning is a Searching Problem
- Given
- Set of Training Examples (D)
- Set of Hypotheses (Decision Trees)
- Learner (ID3 Algorithm)
- searches the Hypothesis Space (Set of Decision Trees) to find a Hypothesis (Decision Tree / Model / h) which best fits the Training Examples
- Search of ID3 Algorithm is
- simple-to-complex, hill-climbing search guided by the Information Gain evaluation function
- ID3 Algorithm
- Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
- Candidate Elimination Algorithm
- Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
- It only contains Hypotheses with Conjunction relationship between Attributes
- ID3 Algorithm
- ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
- This means that we cannot
- determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
- Candidate Elimination Algorithm
- Candidate Elimination Algorithm maintains Version Space (VSH,D) i.e. set of all Hypotheses consistent with Training Examples
- Conclusion
- ID3 Algorithm incompletely searches a complete Hypothesis Space (H)
- Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)
- ID3 Algorithm
- ID3 Algorithm performs no backtracking
- Once an Attribute is selected for testing at a given node, this choice is never reconsidered
- Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than a Globally Optimal Solution
- Training Regime – ID3 Algorithm
- ID3 Algorithm
- ID3 Algorithm uses Batch Method for Training
- i.e. uses all Training Examples at each step to make statistically based decisions about how to refine the current Hypothesis (h or Decision Tree)
- Using statistically based properties of all Training Examples (Information Gain) means the technique is robust in the face of errors in individual Training Examples
- Candidate Elimination Algorithm
- Candidate Elimination Algorithm uses Incremental Method for Training
- i.e. makes decisions incrementally, based on a single Training Example at a time
- Conclusion
- Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
- Inductive Bias – ID3 Algorithm
- Inductive Bias
- Inductive Bias is the set of assumptions needed, in addition to the Training Examples, to deductively justify the Learner's Classifications
- Inductive Bias of ID3 Algorithm
- Shorter Decision Trees are preferred over Longer Decision Trees
- Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
- Note
- Inductive Bias of ID3 Algorithm arises from its Searching Strategy (simple-to-complex, hill climbing)
- Selects shorter Decision Trees over longer ones
- Selects Decision Trees that place Attributes with highest Information Gain closest to root
- Recall
- ID3 Algorithm incompletely searches a complete Hypothesis Space (H)
- Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)
- Can be put differently by saying
- Inductive Bias of ID3 Algorithm follows from its Searching Strategy (Preference Bias or Search Bias)
- Inductive Bias of Candidate Elimination Algorithm follows from its definition of its search space (Restriction Bias or Language Bias).
- Question
- Which Bias is better?
- Answer
- Main Difference between the Two Biases
- Preference Bias only affects the order in which Hypotheses are searched
- In Preference Bias, Hypothesis Space (H) will contain the Target Function
- Restriction Bias affects which Hypotheses are searched
- In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function
- Generally, better to choose Machine Learning Algorithm with Preference Bias rather than Restriction Bias
- Note
- Some Machine Learning Algorithms may combine Preference and Restriction Biases
- Example
- Checker’s Learning Program
- Restriction Bias
- Linear weighted function of a fixed set of board features introduces Restriction Bias (non-linear potential target functions are excluded)
- Preference Bias
- Selection of moves that lead to winning the game introduces Preference Bias
- Summary - Training Phase of ID3 Algorithm
- Recall
- Data = Model + Error
- Training Data
- Model
- Model – in the form of Rules
- In sha Allah, In the next Phase i.e. Testing Phase, we will
- Evaluate the performance of the Model
Testing Phase – ID3 Algorithm
- Testing Phase
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total number of Test instances
- Note
- Accuracy = 1 – Error
- Evaluate Model (Decision Tree)
- Apply Model on Test Data
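- A minimal sketch of the Error and Accuracy computation; the predicted and actual labels below are hypothetical placeholders for the Decision Tree's output on the Testing Data
```python
def error_rate(predicted, actual):
    """Proportion of incorrectly classified Test instances."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical Model predictions on 7 Test instances vs. their true labels
predicted = ["Female", "Male", "Female", "Female", "Male", "Male",   "Female"]
actual    = ["Female", "Male", "Male",   "Female", "Male", "Female", "Female"]

error = error_rate(predicted, actual)
print(round(error, 3))        # -> 0.286 (2 of 7 misclassified)
print(round(1 - error, 3))    # Accuracy = 1 - Error -> 0.714
```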
Application Phase – ID3 Algorithm
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that the order of Attributes must be exactly the same as that of the Training and Testing Examples
- Step 3: Apply Model on Feature Vector
- Return Prediction to the User
- Male
- Note
- You can take Input from the user, apply Model and return predictions as many times as you like 😊
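- A minimal sketch of Steps 1-4 above; the learned Model (as a nested dict) and the user's Feature Vector are hypothetical placeholders
```python
def classify(tree, instance):
    """Step 3: walk the learned Decision Tree to predict one Feature Vector."""
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches[instance[attribute]]      # follow the branch for this value
    return tree                                   # leaf = predicted class

# Hypothetical learned Model
model = {"Height": {"Short": {"HairLength": {"Long": "Female", "Short": "Male"}},
                    "Normal": "Female",
                    "Tall": "Male"}}

# Steps 1-2: user input converted into a Feature Vector (same Attributes as Training Data)
user_input = {"Height": "Tall", "HairLength": "Short"}

print(classify(model, user_input))                # Step 4: return the prediction -> Male
```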
Feedback Phase – ID3 Algorithm
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
- Improve your Model based on Feedback 😊
Overfitting – ID3 Algorithm
- Overfitting
- Definition
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Example – Overfitting
- Consider the performance of the following two Hypotheses on the Training Data and Test Data
- Note that h has smaller Error than h′ over the Training Examples, but
- h′ has a smaller Error over the entire distribution of instances
- This means that h has overfit the Training Examples
- Overfitting and Decision Tree Learning
- Overfitting is a real problem for Decision Tree Learning
- One empirical study showed that for a range of tasks there was a
- Decrease of 10% – 25% in Accuracy due to Overfitting
- The simple ID3 Algorithm can produce
- Decision Trees that overfit the Training Examples
- Example 1 – Overfitting of ID3 Algorithm
- Suppose in addition to the 14 Training Examples for Gender Identification we get a 15th Training Example, whose Target Classification is wrong i.e. Training Example has Error
- ‹Tall, Heavy, Short, Yes, Gender = Female›
- Question
- What impact will this have on our earlier Decision Tree?
- Since we previously had the correct Training Example
- The addition of incorrect Training Example will now cause ID3 to construct a more complex Decision Tree
- The new (erroneous) Training Example will be sorted into the second leaf node from the left in the Decision Tree, along with the previous Training Examples x9 and x11 (both x9 and x11 labeled as Male)
- Because the new (erroneous) Training Example is labeled as Female, ID3 will search for further refinements to the Decision Tree below this node
- Result will be a Decision Tree which performs
- well on the (erroneous) Training Examples and less well on new unseen instances
- Note
- In this example, ID3 Algorithm overfits Training Examples due to
- Noise in Training Example
- Overfitting can also occur when
- Number of Training Examples is too small to produce a representative sample of Target Function
- Example 2 - Overfitting of ID3 Algorithm
- ID3 Algorithm was Trained and Tested on Data from medical patients who had a form of diabetes
- Observations from Graph
- Accuracy of Decision Tree over Training Examples increases monotonically as Decision Tree grows (to be expected)
- Accuracy of Decision Tree over Test Examples increases till about 25 nodes, then decreases
- Conclusion
- Decision Tree has overfit
- Avoiding Overfitting
- Question
- How can we avoid Overfitting in Decision Tree Learning?
- Answer
- Many Possible Solutions
- Avoiding Overfitting – Two General Approaches
- Two general approaches to avoid Overfitting in Decision Tree Learning are
- Stop growing Decision Tree before perfectly fitting Training Examples
- e.g. when Data Split is not statistically significant
- Grow full Decision Tree, then prune afterward
- In practice
- Second approach has been more successful
- Question
- How can the size of an optimal final Decision Tree be decided to avoid Overfitting?
- Answer
- Many Possible Approaches
- Most Common Approach
- Training and Validation Set Approach
- Training and Validation Set Approach
- Using this approach, Sample Data is split into three sets
- Training Set
- Validation Set
- Testing Set
- Strengths
- Validation Set helps to check whether the Model is Overfitting or not during the Training
- Weaknesses
- Holding Data back for a Validation Set reduces Data available for Training
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Note
- Training Set and Validation Set are used in the
- Training Phase
- Testing Set is used in the
- Testing Phase
- Question 1
- How to split Sample Data using Training and Validation Set Approach when we have very large/huge amount of data?
- Answer 1
- A Good Split may be
- Training Set = 80%
- Validation Set = 10%
- Testing Set = 10%
- Note: to efficiently Train Deep Learning Algorithms, we need a huge amount of Training Data
- Question 2
- How to split Sample Data using Training and Validation Set Approach when we have sufficiently large amount of data?
- Answer 2
- A Good Split may be
- Training Set = 80%
- Testing Set = 20%
- Validation Set = 10% of Training Set
- Example 1 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Text Summarization
- Deep learning Algorithm
- LSTM
- Total Sample Data
- 100,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 80,000
- Validation Set = 10% = 10,000
- Testing Set = 10% = 10,000
- Example 2 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Sentiment Analysis
- Machine Learning Algorithm
- Random Forest
- Total Sample Data
- 10,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 8,000 (of which 800 are held out for Validation, leaving 7,200 for Training)
- Testing Set = 20% = 2,000
- Validation Set = 10% of Training Set = 800
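- A minimal sketch of the split used in Example 2; in practice the Sample Data would first be shuffled, as in the Random Split Approach, and the instance list here is a hypothetical placeholder
```python
def train_val_test_split(examples, test_fraction=0.20, val_fraction_of_train=0.10):
    """Hold out a Testing Set, then carve a Validation Set out of the Training Set."""
    n_test = round(len(examples) * test_fraction)
    testing, training = examples[:n_test], examples[n_test:]
    n_val = round(len(training) * val_fraction_of_train)
    validation, training = training[:n_val], training[n_val:]
    return training, validation, testing

sample = list(range(10_000))                          # 10,000 hypothetical instances
training, validation, testing = train_val_test_split(sample)
print(len(training), len(validation), len(testing))   # -> 7200 800 2000
```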
- Training and Validation Set Approach, Cont…
- Question
- How do I know whether the Model is Overfitting during Training?
- Answer
- Model is Not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- Model is Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
- Approaches for Decision Tree Pruning
- Insha Allah, in this Chapter, I will try to explain two approaches for Decision Tree Pruning
- Reduced Error Pruning Approach
- Rule Post-Pruning Approach
- Reduced Error Pruning Approach
- Reduced Error Pruning Approach works as follows
- Step 1: Split data into
- Training Set
- Validation Set
- Testing Set
- Step 2: Construct / Learn Decision Tree using Training Set
- Step 3: For each node evaluate the impact on Validation Set of removing that node and those below it
- Remove node that most improves Accuracy on Validation Set
- When a node is removed, subtree rooted at it is replaced with a leaf node
- whose Classification is the Most Common Classification of Training Examples beneath the node
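- A minimal sketch of Steps 2-3 above, operating on the nested-dict Decision Tree representation used in the earlier sketches; all helper names are my own, and the handling of Attribute values unseen at a node is a simplification
```python
import copy
from collections import Counter

def classify(tree, instance):
    """Predict with a nested-dict Decision Tree; unseen branch values return None."""
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches.get(instance[attribute])  # None if this value has no branch
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the path (alternating attribute, value keys) to every internal node."""
    if isinstance(tree, dict):
        yield path
        (attribute, branches), = tree.items()
        for value, subtree in branches.items():
            yield from internal_nodes(subtree, path + (attribute, value))

def prune_at(tree, path, label):
    """Return a copy of the tree with the subtree at `path` replaced by a leaf."""
    if not path:
        return label                              # pruning the root collapses the tree
    pruned = copy.deepcopy(tree)
    node = pruned
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = label
    return pruned

def examples_at(path, examples):
    """Training Examples routed to the node reached by `path`."""
    tests = list(zip(path[0::2], path[1::2]))
    return [e for e in examples if all(e[a] == v for a, v in tests)]

def majority_label(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def reduced_error_pruning(tree, training, validation, target):
    """Repeatedly remove the node whose removal most improves Validation accuracy."""
    while isinstance(tree, dict):
        current = accuracy(tree, validation, target)
        candidates = []
        for path in internal_nodes(tree):
            beneath = examples_at(path, training) or training
            leaf = majority_label(beneath, target)    # most common class beneath node
            candidate = prune_at(tree, path, leaf)
            candidates.append((accuracy(candidate, validation, target), candidate))
        best_score, best_tree = max(candidates, key=lambda c: c[0])
        if best_score <= current:                 # no single pruning helps any further
            return tree
        tree = best_tree                          # keep the most helpful pruning, repeat
    return tree

# Usage (hypothetical): pruned = reduced_error_pruning(model, training_data, validation_data, 'Gender')
```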
- Reduced Error Pruning Approach, Cont…
- Strengths
- Decision Tree is pruned to remove Overfitting
- Weaknesses
- If a node close to the root node is removed, then subtree below it will be removed
- Consequently, a lot of information may be lost
- Example – Effect of Reduced Error Pruning
- In the previous example, Reduced Error Pruning produces this effect
- Note that performance improves after applying Reduced Error Pruning Approach
- Rule Post-Pruning Approach
- Rule Post-Pruning Approach works as follows
- Step 1: Split data into
- Training Set
- Validation Set
- Testing Set
- Step 2: Construct / Learn Decision Tree using Training Set
- Step 3: Convert Decision Tree to equivalent Set of Rules
- Step 4: For each Rule, evaluate the impact on the Validation Set of removing that Rule
- Remove Rule that improves Accuracy on Validation Set
- Step 5: Sort the final Set of Rules by their estimated Accuracy and use them in this order when classifying unseen instances
- Strengths
- Converting Decision Tree to Rules allows distinguishing different contexts in which Rules are used
- Treat each path through Decision Tree differently
- Removes the distinction between Attribute tests near the root and those near the leaves
- Rules are often easier for people to understand
- Strengths and Weaknesses – ID3 Algorithm
- Strengths
- Can handle noisy Data
- Always finds the Target Function because
- ID3 Algorithm searches a complete Hypothesis Space
- Weaknesses
- Cannot handle very complex Numeric Representation
- Note
- State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations
- However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
TODO and Your Turn
TODO Task 1
- Task 1
- Consider the Titanic Dataset with the following Attributes
- Gender: Male, Female
- Ticket Class: Upper, Middle, Lower
- Parents/Children Aboard: Zero, One, Two, Three
- Embarked: Cherbourg, Queenstown, Southampton
- Survival: No, Yes
- We obtained the following Sample Data
- Sample Data was split into Training and Testing Data in a
- Train-Test Split Ratio of 80%-20%
- Training Data
- Testing Data
- Note
- Consider ID3 Algorithm when answering questions given below
- Your answer should be
- Well Justified
- Questions
- Write down the Input and Output for the above Machine Learning Problem
- How is a Training Example represented?
- How should the Hypothesis (h) be represented?
- Execute the Machine Learning Cycle
- In the Training Phase
- Use the Information Gain measure to select the best Attribute
- What approach will you use to avoid Overfitting during Training?
- How will you check whether the Model is Overfitting or not during Training?
- After the Training Phase
- What approach is most suitable to remove Overfitting from the Trained Model?
- What could be the possible reasons for Overfitting?
- Write down the observations you made during the execution of the Machine Learning Cycle
Your Turn Task 1
- Task 1
- Select a Machine Learning Problem (similar to Titanic – Machine Learning from Disaster) and answer the questions given below
- Note
- Consider ID3 Algorithm in answering all the questions
- Questions
- Write down the Input and Output for the selected Machine Learning Problem
- How is a Training Example represented?
- How should the Hypothesis (h) be represented?
- Execute the Machine Learning Cycle
- In the Training Phase
- Use the Information Gain measure to select the best Attribute
- What approach will you use to avoid Overfitting during Training?
- How will you check whether the Model is Overfitting or not during Training?
- After the Training Phase
- What approach is most suitable to remove Overfitting from the Trained Model?
- What could be the possible reasons for Overfitting?
- Write down the observations you made during the execution of the Machine Learning Cycle
Chapter Summary
- Chapter Summary
- Main Problems – Candidate Elimination Algorithm
- Cannot handle noisy Data
- Cannot handle Numeric Data
- Cannot work for complex representations of Data
- It may fail to find the Target Function in the Hypothesis Space
- i.e. converge to an empty Version Space
- Proposed Solution
- Decision Tree Learning Algorithms
- Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree
- Learned Decision Trees (or Models) can also be re-represented as
- Sets of If-Then Rules (to improve human readability)
- Decision Tree Learning is among the most popular Inductive Learning Algorithms
- ID3 Algorithm – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Disjunction (OR) of Conjunctions (AND) of Input Attributes (a.k.a. Decision Tree)
- Searching Strategy
- Greedy Search
- Simple to Complex (Hill Climbing)
- Training Regime
- Batch Method
- Inductive Bias
- Shorter Decision Trees are preferred over Longer Decision Trees
- Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
- Strengths
- Can handle noisy Data
- Always finds the Target Function because
- ID3 Algorithm searches a complete Hypothesis Space
- Weaknesses
- Cannot handle very complex Numeric Representation
- Note
- State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations
- However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
- To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
- Step 1: Write down all the paths in the Decision Tree
- A path is a Conjunction (AND) of Attributes
- Step 2: Join the paths using Disjunction (OR)
- Decision Tree = Disjunction of Paths
- In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
- Instances / Examples are Represented as Attribute–Value Pair
- Target Function (f) is Discrete Valued
- Disjunctive Hypothesis may be Required
- Possibly Noisy / Incomplete Training Data
- ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question
- Which Attribute should be tested at the root of the tree?
- Generally, a statistical test is used to evaluate each Input Attribute to determine
- How well it alone classifies the Training Examples?
- Best Attribute is the one which alone best classifies the Training Examples
- Best Attribute is selected and used as the test at the root node of the tree
- A descendant of the root node is then created for each possible value of this Attribute, and
- Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
- The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
- This process continues for each new leaf node until either of two conditions is met
- Every Attribute has already been included along this path through the tree, or
- Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)
- ID3 Algorithm forms a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
- Many measures are available for picking the best classifier Attribute
- e.g. Information Gain, Gain Ratio, etc.
- Information Gain is a useful measure for picking the best classifier Attribute
- Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
- Information Gain measures
- how well a given Attribute separates Training Examples with respect to their Target Classification
- Information Gain is defined in terms of
- Entropy
- Entropy gives a measure of purity / impurity of a Set of Training Examples
- Value of Entropy lies in the range [0, 1] (for a two-class problem)
- 0 means the Sample is Pure
- 1 means that Sample has maximum impurity
- Minimum Entropy
- Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category
- Maximum Entropy
- Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
- A limitation of Information Gain measure is that it favors attributes with many values over those with few values
- Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
- Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
- It only contains Hypotheses with Conjunction relationship between Attributes
- ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
- This means that we cannot
- determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
- ID3 Algorithm incompletely searches a complete Hypothesis Space (H)
- Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)
- ID3 Algorithm performs no backtracking
- Once an Attribute is selected for testing at a given node, this choice is never reconsidered
- Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than a Globally Optimal Solution
- Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
- Preference Bias only affects the order in which Hypotheses are searched
- In Preference Bias, Hypothesis Space (H) will contain the Target Function
- Restriction Bias affects which Hypotheses are searched
- In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function
- Generally, better to choose Machine Learning Algorithm with Preference Bias rather than Restriction Bias
- Some Machine Learning Algorithms may combine Preference and Restriction Biases
- e.g. Checker’s Learning Program
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Overfitting is a real problem for Decision Tree Learning
- For Decision Tree Learning, one empirical study showed that for a range of tasks there was a
- Decrease of 10% – 25% in Accuracy due to Overfitting
- The simple ID3 Algorithm can produce
- Decision Trees that overfit the Training Examples
- Two general approaches to avoid Overfitting in Decision Tree Learning are
- Stop growing Decision Tree before perfectly fitting Training Examples
- e.g. when Data Split is not statistically significant
- Grow full Decision Tree, then prune afterwards
- In practice
- Second approach has been more successful
- Most Common Approach to avoid Overfitting is
- Training and Validation Set Approach
- Using Training and Validation Set Approach, Sample Data is split into three sets
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Holding Data back for a Validation Set reduces Data available for Training
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
- Two approaches for Decision Tree Pruning to avoid Overfitting are
- Reduced Error Pruning Approach
- Rule Post-Pruning Approach
- Rule Post-Pruning Approach is better than Reduced Error Pruning Approach
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present
- Artificial Neural Network