
Ilm o Irfan

Technologies


Machine Learning

  • January 16, 2023


Chapter 10 - Decision Tree Learning

Chapter Outline

  • Chapter Outline
  • Quick Recap 
  • Decision Tree Learning
  • ID3 – Basic Decision Tree Learning Algorithm 
  • Machine Learning Cycle – ID3 Algorithm
  • Overfitting – ID3 Algorithm 
  • Chapter Summary

 

Quick Recap

  • Quick Recap – Version Space Algorithm
  • Problem – FIND-S Algorithm 
    • Only returns one hypothesis which best fits the Training Data
      • However, there can be multiple hypotheses which best fit the Training Data
  • Proposed Solution
    • Version Space
  • Version Space (VSH,D) contains the set of all hypotheses consistent with the Training Examples
  • Version Space Algorithm computes the set of all hypotheses consistent with the Training Examples
  • Two Algorithms which return a Version Space are
    • List Then Eliminate Algorithm
    • Candidate Elimination Algorithm 
  • List Then Eliminate Algorithm – Summary
    • Representation of Example
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Conjunction (AND) of Constraints on Attributes
    • Searching Strategy
      • Sequential Search
    • Training Regime
      • Incremental Method
    • Inductive Bias of List Then Eliminate Algorithm 
      • Training Data is error-free 
      • Target Function / Concept is present in the Hypothesis Space (H)
    • Strengths 
      • List Then Eliminate Algorithm overcomes the limitation of FIND-S Algorithm and 
        • returns multiple hypotheses which best fit the Training Data, i.e. the Version Space (VSH,D)
    • Weaknesses 
      • Only works on error-free Data
        • However, Real-world Data is noisy 
      • Works on the assumption that Target Function is present in the Hypothesis Space (H)
        • However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known 
      • List Then Eliminate Algorithm is computationally expensive because it makes a 
        • pairwise comparison of each Training Example with all hypotheses in the Hypothesis Space (H) 
  • List Then Eliminate Algorithm is computationally expensive because
    • Hypothesis Space (H) is represented as a List , which forces 
      • List Then Eliminate Algorithm to make pairwise comparisons 
  • To reduce the computational cost of Version Space Algorithms 
    • Have a more compact representation of Hypothesis Space (H) 
      • for e.g. Candidate Elimination Algorithm
  • Candidate Elimination Algorithm – Summary
    • Representation of Example
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Conjunction (AND) of Constraints on Attributes
    • Searching Strategy
      • I am not clear about this. Please drop me an email if you know. Jazak Allah Khair
    • Training Regime
      • Incremental Method
    • Inductive Bias of Candidate Elimination Algorithm 
      • Training Data is error-free 
      • Target Function / Concept is present in the Hypothesis Space (H)
    • Strengths
      • Candidate Elimination Algorithm overcomes the limitation of List Then Eliminate Algorithm and 
        • provides a more compact representation of the Version Space
      • Version Space returned by Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a
        • Single Model or
        • Ensemble Model
    • Weaknesses 
      • Only works on error-free Data
        • However, Real-world Data is noisy 
      • Works on assumption that Target Function is present in the Hypothesis Space (H)
        • However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known 
      • Only works when values of Attributes are Categorical 
        • However, in Real-world many Machine Learning Problems have 
          • Attributes / Features with Numeric values
      • Only works for simple representations of Data
        • However, in Real-world many Machine Learning Problems have complex representation (for e.g. Face Detection, Face Recognition etc.)
  • Single Model
    • A Single Model is trained on the Training Data and used to make predictions on the Test Data
    • Strengths 
      • It is computationally fast and requires less time to make predictions compared to Ensemble Model
    • Weaknesses
      • Error is likely to be high compared to Ensemble Model
  • Ensemble Model
    • An Ensemble Model works by training different Models on the same Training Data and using each Model to individually make predictions on the Test Data
    • Strengths 
      • Error is likely to be low compared to Single Model
    • Weaknesses
      • It is computationally expensive and requires more time to make predictions compared to Single Model
  • Voting Approach is one of the simplest approaches to combine predictions from different Models 
    • In Voting Approach, we may assign
      •  Same weight to all Models
      •  Different weights to all Models
    •  Model Weight plays an important role in making the Final Prediction 
  • Voting Approach works as follows
    • Given a Test instance x
      • Step 1: Make individual predictions using different Models
      • Step 2: Combine predictions of individual Models
      • Step 3: Final Prediction of x will be the class which has the majority vote 
  • Voting Classifier is not an actual Classifier (Machine Learning Algorithm) but a wrapper for a set of different Machine Learning Algorithms , which are trained and tested on the same Data
    • A Voting Classifier combines predictions of individual Machine Learning Algorithms to make Final Predictions on unseen Data
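The three-step Voting Approach above can be sketched in a few lines of Python. This is a minimal illustration: the model predictions and weights below are hypothetical stand-ins, since the chapter does not fix a concrete set of Models.

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class predictions of individual Models by (weighted) voting."""
    if weights is None:
        weights = [1] * len(predictions)   # same weight assigned to all Models
    votes = Counter()
    for label, weight in zip(predictions, weights):
        votes[label] += weight             # Step 2: combine individual predictions
    return votes.most_common(1)[0][0]      # Step 3: class with the majority vote

# Step 1: individual predictions of three hypothetical Models for a test instance x
model_predictions = ["Female", "Male", "Female"]
print(majority_vote(model_predictions))             # equal weights -> Female
print(majority_vote(model_predictions, [1, 5, 1]))  # Model 2 dominates -> Male
```

As the second call shows, the Model Weights can change the Final Prediction even when the individual predictions stay the same.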

Quick Recap - Models

  • Learning Input-Output Functions – General Settings
  • Input to Learner 
    • Set of Training Examples (D)
    • Set of Functions / Hypothesis (H)
  • Output of Learner 
    • A hypothesis h from H, which  best fits  the Training Examples (D)
  • Learning is a Searching Problem
  • For  any  Machine Learning Algorithm, we should have a  clear  understanding of the following three points
    • Representation of Training Examples (D)
      • How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from them?
    • Representation of Hypothesis (h)
      • How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?
    • Searching Strategy
      • What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?

 

  • Summary - FIND-S Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • Training Regime 
    • Incremental Method

 

  • Summary – List Then Eliminate Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • Sequential Search 
  • Training Regime 
    • Incremental Method

 

  • Summary – Candidate Elimination Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • Training Regime 
    • Incremental Method

 

  • Strengths and Weaknesses – Candidate Elimination Algorithm
  • Strengths 
    • Candidate Elimination Algorithm  overcomes the limitation  of List Then Eliminate Algorithm and 
      • provides a  more compact  representation of the Version Space
      • Version Space returned by Candidate Elimination Algorithm can be used to  make predictions  on  unseen  Data using either a
        • Single Model or
        • Ensemble Model
  • Weaknesses 
    • Only works on  error free  Data
      • However, Real-world Data is  noisy 
    • Works on  assumption  that Target Function is  present  in the Hypothesis Space (H)
      • However, we  may / may not find  the Target Function in the Hypothesis Space (H) and this  may / may not be known 
    • Only works when  values of Attributes  are  Categorical 
      • However, in Real-world many Machine Learning Problems have 
  • Attributes / Features with  Numeric  values
    • Only works for  simple  representations of Data
      • However, in Real-world many Machine Learning Problems have complex representation (e.g. Face Detection, Face Recognition, etc.)

 

  • Main Problems in Candidate Elimination Algorithm
  • Main Problems
    • Cannot handle  noisy  Data
    • Cannot handle  Numeric  Data
    • Cannot work for  complex representations  of Data
    • It  may fail to find  the Target Function in the Hypothesis Space
      • i.e. converge to an  empty  Version Space
  • Proposed Solution
    • Decision Tree Learning Algorithms

 

Decision Tree Learning

  • ID3 Decision Tree Learning Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Disjunction (OR) of Conjunctions (AND) of Input Attributes 

                 (a.k.a. Decision Tree)

  • Searching Strategy
    • Greedy Search
      • Simple to Complex (Hill Climbing)
  • Training Regime 
    • Batch Method
  • Comparison of Representation of Training Examples (D)
  • In FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms, Training Examples (D) are represented as 
    • Attribute-Value Pair
  • Conclusions 
    • Representation of Training Examples (D) is the  same  for FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms

 

  • Comparison of Representation of Hypothesis (h)
  • In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Hypothesis (h) is represented as 
    • Conjunction (AND) of Constraints on Input Attributes
  • In ID3 Machine Learning Algorithms, Hypothesis (h) is represented as 
    • Disjunction (OR) of Conjunctions (AND) of Input Attributes
  • Conclusion 
    • Representation of Hypothesis (h) in ID3 ML Algorithm is  more expressive and powerful  than FIND-S, List Then Eliminate and Candidate Elimination Algorithms
  • Reason
    • FIND-S, List Then Eliminate and Candidate Elimination Algorithms  only use 
      • Conjunction (AND) relationship between Attributes
    • ID3 Algorithm  uses both 
      • Disjunction (OR) and Conjunction (AND) relationships between Attributes
  • Comparison of Searching Strategy
  • In FIND-S Algorithm, Searching Strategy is
    • I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • In List Then Eliminate Algorithm, Searching Strategy is
    • Sequential Search
  • In Candidate Elimination Algorithm, Searching Strategy is
    • I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • In ID3 Machine Learning Algorithms, Searching Strategy is 
    • Greedy Search 
      • Simple to Complex (Hill Climbing)
  • Conclusion 
    • Search Strategy of ID3 Algorithm is  different  from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms

 

  • Comparison of Training Regimes
  • In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Training Regime is 
    • Incremental Method 
  • In ID3 Machine Learning Algorithm, Training Regime is 
    • Batch Method 
  • Conclusion
    • Training Regime of ID3 Algorithm is  different  from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms
  • Decision Tree
  • Definition
    • A Decision Tree is a  graph  that uses a  branching method  to illustrate  every possible outcome  of a decision
  • Decision Trees can be  drawn by hand  or  created with a graphics program  or specialized software

 

  • Decision Tree Learning
  • Definition 
    • Decision Tree Learning is a  method  for  approximating  Target Functions / Concepts, in which the  learned function  (or Model) is  represented  by a  Decision Tree 
  • Learned Decision Trees (or Models) can also be  re-represented  as 
    •  Sets of If-Then Rules  (to  improve  human readability)
  • Importance
    •  Most popular  of Inductive Learning Algorithms
  • Applications 
    • Decision Tree Learning Algorithms are  successfully  and  widely  used in a  range of Real-world Applications  including 
      • Predicting Magnetic Properties of Crystals
      • Profiling Higher-Priced Houses 
      • Detecting Advertisements on the Web
      • Controlling a Production Process
      • Diagnosing Hypothyroidism
      • Assessing Credit Risk
      • Sentiment Analysis
      • Emotion Analysis
      • Author Profiling
      • Word Sense Disambiguation
      • Text Reuse and Plagiarism Detection
  • Note
    • Note the difference between the definitions of 
      • Decision Tree and
      • Decision Tree Learning
  • Example – Decision Tree
  • A Decision Tree (or Model) is given below

  • Example – Decision Tree Cont…

  • Steps - Converting a Decision Tree into If-Then Rules
  • To  convert  a Decision Tree (or Model) into If-Then Rules, follow these steps
    • Step 1: Write down all the  paths  in the Decision Tree
      • A  path  is a  Conjunction (AND) of Attributes 
    • Step 2: Join the  paths  using Disjunction (OR)
  • Note 
    • Decision Tree = Disjunction of Paths
  • Example – Converting a Decision Tree into If-Then Rules
  • Consider the following Decision Tree, In sha Allah (انشاء اللہ), we will convert it into If-Then Rules

  • Example – Converting a Decision Tree into If-Then Rules
  • Step 1: Write down all the  paths  in the Decision Tree

  • Step 2: Join the  paths  using Disjunction (OR)
    • Decision Tree = Disjunction (OR) of Paths
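The two steps above can be sketched in Python. The nested-dict tree representation and the particular tree below are hypothetical (built from the chapter's Gender Identification attributes), not the chapter's own figure.

```python
def tree_to_rules(tree, path=()):
    """Enumerate every root-to-leaf path of a Decision Tree as an If-Then Rule.

    tree : either a class label (leaf) or a dict {attribute: {value: subtree}}
    Each path is a Conjunction (AND) of attribute tests; the full tree is the
    Disjunction (OR) of its paths.
    """
    if not isinstance(tree, dict):                 # leaf node: emit one rule
        conditions = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {conditions} THEN {tree}"]
    rules = []
    (attribute, branches), = tree.items()          # one test attribute per node
    for value, subtree in branches.items():        # Step 1: follow every path
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules                                   # Step 2: list = disjunction

# Hypothetical Decision Tree for the Gender Identification example
tree = {"Height": {"Short": {"Hair Length": {"Long": "Female",
                                             "Short": "Male"}},
                   "Normal": "Female",
                   "Tall": "Male"}}
for rule in tree_to_rules(tree):
    print(rule)
```

For the tree above this prints one rule per path, e.g. `IF Height = Short AND Hair Length = Long THEN Female`.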

  • Problems Appropriate for Decision Tree Learning
  • In  general , Decision Tree Learning Algorithms are  best  for Machine Learning Problems where
    • Instances / Examples are  Represented  as  Attribute–Value Pair 
      •  Values of Attributes  can be either 
        • Categorical or
        • Numeric
    • Target Function (f) is  Discrete Valued 
      • Example 1 – Discrete Valued Target Function
        • Gender Identification Problem
          • Two Possible Output Values
            • Male 
            • Female
      • Example 2 – Discrete Valued Target Function
        • Sentiment Analysis 
          • Three Possible Output Values
            • Positive
            • Negative
            • Neutral
    • Disjunctive Hypothesis may be Required
      • Easy for Decision Tree Learning Algorithms to learn  disjunctive concepts  
        • Note that such concepts were  outside  the Hypothesis Space (H) of the Candidate Elimination Algorithm
    • Possibly Noisy / Incomplete Training Data
      • Decision Tree Learning Algorithms are  robust to errors  in 
        • Values of  Input Attributes 
        • Values of  Output Attribute , i.e. the Classification of Training Example(s) is Incorrect
      • Decision Tree Learning Algorithms  can be trained  on Training Examples where for  some  instances  some  Attribute Values are  unknown/missing 
      • Note
        • Candidate Elimination Algorithm  cannot  be trained on  noisy/incomplete  Data

ID3 - Basic Decision Tree Learning Algorithm

  • Note
  • There are  many state-of-the-art  Decision Tree Learning Algorithms available 
    • For example, J48, Random Forest
  • In sha Allah, I will discuss the  most basic  Decision Tree Learning Algorithm i.e. ID3 
  • Reason
    • In sha Allah, it will help you to  completely  and  correctly  understand 
      • How Decision Tree Learning Algorithms work?
  • ID3 Algorithm

  • How ID3 Algorithm Works?
  • ID3 Algorithm  learns Decision Trees  by  constructing  them  top-down , beginning with the following question 
  • Question
    • Which Attribute should be  tested  at the  root  of the tree?
  • Answer
    • Use a  statistical test  to  evaluate  each  Input Attribute  to determine 
      • How well it  alone classifies  the Training Examples?
    • Best Attribute 
      • Best Attribute is the one which  alone best classifies the Training Examples 
  • Best Attribute is  selected  and used as the test at the  root node  of the tree
  • A  descendant  of the root node is then created for  each possible value  of this Attribute, and 
    • Training Examples are  sorted  to the  appropriate descendant node  (i.e., down the  branch  corresponding to the  Training Example’s value  for this Attribute)
  • The  entire process is then repeated  using the  Training Examples associated with each descendant node  to select the  Best Attribute  to test at that point in the tree
  • This process continues for  each new leaf node  until either of  two conditions are met 
    •  Every Attribute  has  already  been  included  along this  path  through the tree, or 
    • Training Examples  associated  with this  leaf node ,  all  have the  same Output value  (i.e., their Entropy is Zero)
  • Note 
    • This forms a Greedy Search  for an acceptable Decision Tree, in which the ID3 Algorithm  never backtracks  to reconsider earlier choices
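The top-down, greedy procedure described above can be sketched as a recursive function. This is a simplified illustration, not the official ID3 pseudocode: the statistical test is Information Gain (defined later in this chapter), and the tiny dataset at the end is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in Entropy from partitioning on an attribute."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Top-down greedy construction of a Decision Tree (never backtracks)."""
    if len(set(labels)) == 1:        # stop: all examples share one output value
        return labels[0]
    if not attributes:               # stop: every attribute already on this path
        return Counter(labels).most_common(1)[0][0]   # majority class
    # Best Attribute: the one which alone best classifies the examples
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in {x[best] for x in examples}:   # one descendant per attribute value
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        tree[best][v] = id3([examples[i] for i in idx],
                            [labels[i] for i in idx], remaining)
    return tree

# Hypothetical miniature of the Gender Identification data
examples = [{"Height": "Short", "Hair Length": "Long"},
            {"Height": "Short", "Hair Length": "Short"},
            {"Height": "Normal", "Hair Length": "Short"},
            {"Height": "Tall", "Hair Length": "Short"},
            {"Height": "Normal", "Hair Length": "Long"}]
labels = ["Female", "Male", "Female", "Male", "Female"]
print(id3(examples, labels, ["Height", "Hair Length"]))
```

Note how both stopping conditions from the text appear: a pure leaf (all labels equal) and an exhausted attribute list, in which case the majority class labels the leaf.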
  • Which Attribute is the Best Classifier?
  • Most Crucial Step in ID3 Algorithm
    •  Choosing which Attribute to test at the next node? 
  • Would like to  choose that Attribute  which does 
    •  best  at  separating  Training Examples  according to their Target Classification (or Output) 
  • An Attribute which  separates  Training Examples into  two  sets 
    • each of which contains Positive / Negative Training Examples of the  Target / Output Attribute  in the  same ratio  as the  initial set of Training Examples 
      •   has not helped  us progress towards a classification
  • Example - Which Attribute is the Best Classifier?
  • Consider Gender Identification Problem
  • Suppose we have a Set of 14 Training Examples
    • Positive (Female) = 9 
    • Negative (Male) = 5

  • Example - Which Attribute is the Best Classifier? Cont.…
    • Consider Gender Identification Problem
    • Suppose we have a Set of 14 Training Examples
    • Positive = 9 
    • Negative = 5 
  • Question
    • Which Attribute is the best classifier?
      • Hair Length or
      • Head Covered
  • Answer
    • Hair Length

  • Measure to Pick Best Classifier Attribute
  • Question
    • How many measures are there for picking the best classifier attribute?
  • Answer
    •  Many measures  are available for  picking  the best classifier Attribute
      • Examples
        • Information Gain
        • Gain Ratio
  • Note
    • In this Chapter, in sha Allah we will use Information Gain
  • Information Gain
  • Information Gain is a useful measure for  picking  the best classifier Attribute 
  • Information Gain measures 
    •  how well  a given Attribute  separates  Training Examples  with respect to their Target Classification 
  • Information Gain is defined in terms of 
    • Entropy 

 

  • Entropy
  • Definition 
    • Entropy gives a measure of  purity/impurity  of a Set of Training Examples
  • Formula
    • Entropy(S) = – p⊕ log2 p⊕ – p⊖ log2 p⊖
    • Where S represents the  Set of Training Examples (or Sample) 
    • p⊕ represents the proportion of Positive Training Examples
    • p⊖ represents the proportion of Negative Training Examples
  • Range of Entropy
    • Value of Entropy lies in the range [0, 1]
      • 0 means the  Sample is Pure 
      • 1 means that Sample has  maximum  impurity
  • Minimum Entropy vs Maximum Entropy
  • Minimum Entropy
    • Entropy is minimum (i.e. ZERO), when all  the Training Examples  fall into one single Class / Category 
  • Maximum Entropy
    • Entropy is maximum (i.e. ONE), when half  of the Training Examples are Positive and remaining half are Negative
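The Entropy formula above is a one-liner in Python. A minimal sketch, using the standard convention 0 · log2 0 = 0 so that pure samples come out as exactly zero; it reproduces the worked examples that follow.

```python
import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖, taking 0·log2(0) = 0."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(14/14, 0/14))              # pure sample      -> 0.0 (minimum)
print(entropy(7/14, 7/14))               # half and half    -> 1.0 (maximum)
print(round(entropy(9/14, 5/14), 3))     # impure sample    -> 0.94
```

The `if p > 0` filter implements the 0 · log2 0 = 0 convention without special-casing.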
  • Example 1 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 14
      • Negative = 0
  • Entropy([14+, 0-]) = -p⊕ log2 p⊕- p⊖ log2 p⊖

                                         = – (14/14) log2 (14/14) – (0/14) log2 (0/14)

                                         = – 1 log2 (1) – 0 log2 (0)

                                         = – 1 (0) – 0          (taking 0 log2 0 = 0)

                                         = 0

  • Note
    • Sample Is Pure (all Training Examples are Positive) and 
      • Entropy  is ZERO
  • Example 2 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 0
      • Negative = 14

   Entropy([0+, 14-]) = – (0/14) log2 (0/14) – (14/14) log2 (14/14)

                      = – 0 log2 (0) – 1 log2 (1)

                      = – 0 – 1 (0)          (taking 0 log2 0 = 0)

                       = 0

  • Note
    • Sample Is Pure (all Training Examples are Negative) and 
      • Entropy is ZERO
  • Example 3 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 7
      • Negative = 7

  Entropy([7+, 7-]) = – (7/14) log2 (7/14) – (7/14) log2 (7/14)

                    = – 0.5 log2 (0.5) – 0.5 log2 (0.5)

                    = – 0.5 (-1) – 0.5 (-1)

                = 0.5 + 0.5

                = 1

  • Note
    • Sample has  maximum  impurity (half of the Training Examples are Positive and remaining half are Negative) and 
      • Entropy is ONE
  • Example 4 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 9
      • Negative = 5

  Entropy([9+, 5-]) = – (9/14) log2 (9/14) – (5/14) log2 (5/14)

               = 0.940

  • Note
    • Sample Is impure 
  • Information Gain
  • Definition 
    • Information Gain is the  expected reduction  in Entropy resulting from  partitioning  a Set of Training Examples on the basis of an Attribute
  • Formula 
    • Given a Set of Training Examples S and Attribute A
      • Gain(S, A) = Entropy(S) – Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
  • where
    • Values(A) is the  Set of Values  Attribute A can take on
    • Sv is the  subset of S  for which Attribute A has value v
  • First term in Gain(S,A) is 
    • Entropy of Set of Training Examples S
  • Second term in Gain(S, A) is 
    • Expected Entropy after partitioning on A, i.e. the sum of the Entropies of the subsets Sv, each weighted by the fraction |Sv| / |S|
  • Strengths
    • Information Gain can be used to Construct / Learn a Decision Tree using ID3 Algorithm
  • Weaknesses
    • Information Gain measure  favors  attributes with  many values  over those with  few values 
  • Example - Weakness of Information Gain
  • Consider the Gender Identification Problem
    • If we add a  Date  attribute to the set of Input Attributes
    • The  Date  attribute will have a  distinct value  for each day and will have the  highest  Information Gain
      • This is because  Date  attribute  perfectly predicts  the Target / Output Attribute for  all  Training Examples
    • Result is a Decision Tree of  Depth 1  that 
      •  perfectly  classifies Training Examples but  fails  on  all  other data
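The Gain(S, A) formula can be implemented directly by partitioning the examples on the attribute's values. The four-example dataset below is a made-up miniature, not the chapter's 14-example table (whose figure is not reproduced here).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)   # Sv: examples with A = v
    expected = sum(len(sv) / len(labels) * entropy(sv)   # weighted subset entropies
                   for sv in subsets.values())
    return entropy(labels) - expected

# Hypothetical miniature data: Head Covered perfectly separates the classes,
# so the Entropy drops from 1 to 0 and the Gain is maximal
examples = [{"Head Covered": "Yes"}, {"Head Covered": "Yes"},
            {"Head Covered": "No"}, {"Head Covered": "No"}]
labels = ["Female", "Female", "Male", "Male"]
print(gain(examples, labels, "Head Covered"))   # -> 1.0
```

A perfectly separating attribute is also exactly how the Date-attribute weakness above arises: an attribute with a distinct value per example drives every subset Entropy to zero and therefore maximizes the Gain.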
  • Example – Computing Information Gain
  • Considering Gender Identification Problem
  • Information Gain for HeadCovered will be 

Machine Learning Cycle - ID3 Algorithm

  • Gender Identification Problem
  • Machine Learning Problem
    • Gender Identification
  • Input 
    • Human
  • Output
    • Gender of a Human
  • Task
    • Given a Human (Input), predict the Gender of the Human (Output)
  • Treated as
    • Learning Input-Output Function 
      • i.e., Learn from Input to predict Output
  • Representation of Examples
  • Example = Input + Output
    • Input = Human
    • Output = Gender
  • Representation of Input and Output
    • Attribute-Value Pair
  • Representation of Input
  • Input is represented as a 
    • Set of 4 Input Attributes
      • Height
      • Weight
      • Hair Length
      • Head Covered

  • Representation of Output
  • Output is represented as a 
    • Set of 1 Output Attribute
      • Gender

  • Note
    • Yes  means Female and  No  means Male
  • Computing Size of Instance Space (X)

  • Computing Size of Instance Space (X) 
    • |X| = No. of Values of Height Input Attribute x No. of Values of Weight Input Attribute x No. of Values of Hair Length Input Attribute x No. of Values of Head Covered Input Attribute 
    • |X| = 3*3*2*2 = 36
  • Sample Data
  • We obtained a Sample Data of 21 examples

  • Representation of Hypothesis (h)
  • We represent a Hypothesis (h) as
    • Disjunction (OR) of Conjunction (AND) of Input Attributes
  • Since ID3 represents  Disjunctive Hypothesis , therefore
    • Hypothesis Space (H) = Concept Space (C)
  • |H| = |C| = 2^|X| = 2^36 = 68,719,476,736
  • Note 
    • Using ID3 Algorithm, we  will  find the Target Function / Concept in the Hypothesis Space because 
      • we are searching  complete Concept Space 
    • In Candidate Elimination Algorithm, we  may / may not find  the Target Function / Concept in the Hypothesis Space because
      • we are searching  incomplete Concept Space 
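The two counting arguments above, |X| = 36 and |H| = |C| = 2^36, can be checked directly:

```python
# |X| = product of the number of values of each Input Attribute
num_values = {"Height": 3, "Weight": 3, "Hair Length": 2, "Head Covered": 2}
size_x = 1
for n in num_values.values():
    size_x *= n
print(size_x)        # -> 36

# |H| = |C| = 2^|X|: every subset of the Instance Space is a possible Concept
print(2 ** size_x)   # -> 68719476736
```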
  • Hypothesis Space (H) – ID3 Algorithm
  • In ID3 Algorithm, Hypothesis Space (H) is a
    • Set of Decision Trees
  • Sample Data

  • Split the Sample Data
  • We split the Sample Data using  Random Split Approach  into
    • Training Data – 2 / 3 of Sample Data
    • Testing Data – 1 / 3 of Sample Data
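A minimal sketch of the Random Split Approach; the fixed seed and the index-shuffling are implementation choices for reproducibility, not part of the chapter. On the 21-example Sample Data this yields 14 Training and 7 Testing examples.

```python
import random

def random_split(sample, train_fraction=2/3, seed=42):
    """Randomly split Sample Data into Training (2/3) and Testing (1/3) sets."""
    indices = list(range(len(sample)))
    random.Random(seed).shuffle(indices)       # fixed seed: reproducible split
    cut = round(len(sample) * train_fraction)
    train = [sample[i] for i in indices[:cut]]
    test = [sample[i] for i in indices[cut:]]
    return train, test

sample = list(range(21))         # stands in for the 21 Sample Data examples
train, test = random_split(sample)
print(len(train), len(test))     # -> 14 7
```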
  • Training Data

  • Testing Data

  • Machine Learning Cycle
  • Four phases of a Machine Learning Cycle are
    • Training Phase
      •  Build the Model  using Training Data
    • Testing Phase
      •  Evaluate the performance of Model  using Testing Data
    • Application Phase
      •  Deploy the Model in Real-world , to  make a prediction  on Real-time unseen Data
    • Feedback Phase
      • Take Feedback from the  Users  and  Domain Experts  to  improve the Model

Training Phase – ID3 Algorithm

  • Training Phase - ID3 Algorithm
  • Let’s  construct/learn  a  Decision Tree (Model / h)  which  best fits  the Training Examples using ID3 Algorithm

 

  • Selection of Attribute for Root Node
  • Question
    • What Attribute should be tested at the root node?
  • Answer
    • Compute Information Gain for  all four  Attributes (Height, Weight, Hair length and Head Covered) and 
      • Attribute with the  highest  Information Gain will be  tested at the root node 

  • Which attribute should be tested at the root?
    • Gain (S, Height) = 0.246 
    • Gain (S, Weight) = 0.029
    • Gain (S, Hair Length) = 0.151
    • Gain (S, Head Covered) = 0.084
  •  Height  provides the  best prediction  for Target Classification
    • Therefore, Height will be the  root node 
  • After Selection of Root Node, Cont…

  • Working on Height = Short Node
  • Question
    • What Attribute should be tested here?
  • Answer
    • Compute Information Gain for all  three  Attributes (Weight, Hair Length and Head Covered) and 
      • Attribute with the  highest  Information Gain will be tested at the  Height = Short  node 
  • Computing Information Gain

  • Hair Length provides the  best prediction  for Target Classification
    • Therefore, it will be  tested  at  Height = Short  node
  • Let’s  further  grow Decision Tree
    •  Add  to the tree a  successor  for  each possible value  of Hair Length
    •  Partition  the Training Examples  according  to the  value of Hair Length 
  • Note
    • For Root Node Selection
      • No. of Training Examples = 14
      • No. of Attributes = 4
    • For  Height = Short  Node Selection
      • No. of Training Examples = 5
      • No. of Attributes = 3
  • Working on Height = Normal Node
  • Snormal = {x3,x7,x12,x13}
  • Target Classification for  all four instances  is
    • Positive 
      • i.e. Sample is Pure (Entropy is ZERO)
  • Recall that one of the  Conditions to Stop Growing the Decision Tree  is
    • Training Examples  associated  with this  leaf node ,  all  have the  same Output value  (i.e., their Entropy is ZERO)
  • For  Height = Normal  Node
    • We will  not grow  the Decision Tree and 
      • Leaf Node will be assigned the  label  of the  Majority Class 
  • Considering Snormal = {x3,x7,x12,x13}
    • Positive (Female) = 4
    • Negative (Male) = 0
  • Therefore,  Majority Class  is Female and this will be the label of Leaf Node 
  • Decision Tree Learned by ID3 Algorithm
  • Decision Tree (or Model) learned by ID3 Algorithm is given below

  • Searching Strategy - ID3 Algorithm
  • Recall – Learning is a Searching Problem
    • Given 
      • Set of Training Examples (D) 
      • Set of Hypotheses (Decision Trees)
    • Learner (ID3 Algorithm)
      •  searches  the Hypothesis Space (Set of Decision Trees) to  find a Hypothesis  (Decision Tree / Model / h) which  best fits  the Training Examples
  • Search of ID3 Algorithm is 
    • simple-to-complex, hill-climbing search guided  by the Information Gain evaluation function

  • ID3 Algorithm
    • Hypothesis Space of ID3 Algorithm is  complete space  of finite, discrete-valued functions w.r.t available Attributes
  • Candidate Elimination Algorithm 
    • Hypothesis Space of Candidate Elimination Algorithm is  incomplete space  because
      • It  only  contains Hypotheses with  Conjunction relationship between Attributes  
  • ID3 Algorithm
    • ID3 Algorithm maintains  only one Hypothesis  (h / Decision Tree) at any time, instead of, e.g.,  all Hypotheses  (or Decision Trees) consistent with Training Examples seen so far
    • This means that we  cannot 
      • determine  how many alternative Hypotheses  (Decision Trees) are  consistent  with Training Examples
  • Candidate Elimination Algorithm 
    • Candidate Elimination Algorithm maintains  Version Space (VSH,D)  i.e. set of  all  Hypotheses  consistent  with Training Examples 
  • Conclusion 
    • ID3 Algorithm  incompletely searches  a  complete Hypothesis Space (H) 
    • Candidate Elimination Algorithm  completely searches  an  incomplete Hypothesis Space (H) 
  • ID3 Algorithm
    • ID3 Algorithm performs  no backtracking 
      • Once an Attribute is selected for testing at a given node, this choice is  never reconsidered 

    • Therefore, ID3 Algorithm is  susceptible  to converging to a  Locally Optimal Solution  rather than the  Globally Optimal Solution 

  • Training Regime – ID3 Algorithm
  • ID3 Algorithm 
    • ID3 Algorithm uses  Batch Method  for Training
      • i.e. Uses  all  Training Examples at  each step  to make  statistically based decision  about how to  refine current Hypothesis  (h or Decision Tree)
    • Using statistically based properties of all Training Examples (Information Gain) means the technique is  robust  in the face of  errors in individual Training Examples 
  • Candidate Elimination Algorithm
    • Candidate Elimination Algorithm uses  Incremental Method  for Training 
      • i.e. makes decisions  incrementally  based on  individual  Training Examples 
  • Conclusion
    • Batch Method for Training is  more robust to errors in Training Examples  compared to Incremental Method

 

  • Inductive Bias – ID3 Algorithm
  • Inductive Bias 
    • Inductive Bias is the  set of assumptions  needed, in  addition  to the Training Examples, to  deductively justify  the Learner's Classifications
  • Inductive Bias of ID3 Algorithm
    •  Shorter  Decision Trees are  preferred  over  Longer  Decision Trees
    • Decision Trees that place  high  Information Gain Attributes  close to the root  are  preferred  to those that do not
  • Note
    • Inductive Bias of ID3 Algorithm arises from its  Searching Strategy  (simple-to-complex, hill climbing) 
      •  Selects  shorter Decision Trees over longer ones
      •  Selects  Decision Trees that place Attributes with  highest  Information Gain  closest to root 

 

  • Recall
    • ID3 Algorithm  incompletely searches  a  complete Hypothesis Space (H) 
    • Candidate Elimination Algorithm  completely searches  an  incomplete Hypothesis Space (H) 
  • Can be put differently by saying
    • Inductive Bias of ID3 Algorithm follows from its  Searching Strategy  (Preference Bias or Search Bias)
    • Inductive Bias of Candidate Elimination Algorithm follows from the definition of its search space (Restriction Bias or Language Bias).
  • Question
    • Which Bias is better?
  • Answer
    • Main Difference between the Two Biases
      • Preference Bias only affects the  order  in which Hypotheses are  searched  
        • In Preference Bias, Hypothesis Space (H)  will  contain the Target Function 
      • Restriction Bias affects  which  Hypotheses are searched
        • In Restriction Bias, Hypothesis Space (H)  may / may not  contain the Target Function 
    • Generally, it is  better  to  choose  a Machine Learning Algorithm with Preference Bias rather than Restriction Bias
  • Note 
      •  Some  Machine Learning Algorithms may  combine  Preference and Restriction Biases 
    • Example 
      • Checker’s Learning Program 
      • Restriction Bias
        •  Linear  weighted function of fixed set of board features introduces Restriction Bias ( non-linear  potential target functions excluded)
      • Preference Bias
        • Selection of moves that lead to  winning  the game introduces Preference Bias

 

  • Summary - Training Phase of ID3 Algorithm
  • Recall
    • Data = Model + Error
  • Training Data

  • Model

  • Model – in the form of Rules

  • In sha Allah, In the next Phase i.e. Testing Phase, we will 
    • Evaluate the performance of the Model

Testing Phase – ID3 Algorithm

  • Testing Phase
  • Question 
    • How well has the Model learned?
  • Answer
    • Evaluate the performance of the Model on  unseen  data (or Testing Data)

 

  • Evaluation Measures
  • Evaluation will be carried out using 
    • Error measure

 

  • Error
  • Definition
    • Error is defined as the proportion of incorrectly classified Test instances
  • Formula
    • Error = (Number of incorrectly classified Test instances) / (Total number of Test instances)
  • Note
    • Accuracy = 1 – Error
  • Evaluate Model (Decision Tree)
  • Apply Model on Test Data
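The Error and Accuracy computation above can be sketched in Python. This is a minimal illustration: `error_rate` is a hypothetical helper, and the predictions and labels below are made-up data, not the chapter's actual Test Data.

```python
# Minimal sketch: computing Error and Accuracy of a trained Model on Test Data.
# The predictions and labels below are hypothetical, for illustration only.

def error_rate(predictions, labels):
    """Error = proportion of incorrectly classified Test instances."""
    incorrect = sum(1 for p, y in zip(predictions, labels) if p != y)
    return incorrect / len(labels)

# Hypothetical Test Data labels and Model predictions
y_true = ["Male", "Female", "Male", "Male", "Female"]
y_pred = ["Male", "Female", "Female", "Male", "Female"]

error = error_rate(y_pred, y_true)
accuracy = 1 - error          # Accuracy = 1 - Error
print(error)     # 0.2
print(accuracy)  # 0.8
```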

Application Phase – ID3 Algorithm

  • Application Phase
  • We assume that our Model 
    • performed well  on  large Test Data  and can be  deployed in Real-world 
  • Model is  deployed in the Real-world  and now we can make
    • predictions  on Real-time Data

 

  • Steps – Making Predictions on Real-time Data
  • Step 1: Take Input from User
  • Step 2: Convert  User Input  into  Feature Vector 
    • Exactly same  as  Feature Vectors  of Training and Testing Data
  • Step 3:  Apply  Model on the  Feature Vector 
  • Step 4: Return  Prediction  to the User

 

  • Making Predictions on Real-time Data
  • Step 1: Take input from User

  • Step 2: Convert User Input into  Feature Vector 
    • Note that  order  of Attributes must be  exactly  same as that of Training and Testing Examples

  • Step 3: Apply Model on  Feature Vector 

  • Return  Prediction  to the User
    • Male
  • Note
    • You can take Input from the user, apply Model and return predictions as many times as you like 😊
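The four Application-Phase steps can be sketched as follows. The `model` function and the attribute set (Height, Weight, Hair, Beard) are assumptions standing in for the trained Gender-Identification Decision Tree, not the exact learned Model.

```python
# Minimal sketch of the four steps for making predictions on Real-time Data.
# `model` is a hypothetical stand-in for the deployed Decision Tree
# (re-represented as if-then rules); the attributes are illustrative.

def model(feature_vector):
    height, weight, hair, beard = feature_vector
    if beard == "Yes":
        return "Male"
    if height == "Short" and hair == "Long":
        return "Female"
    return "Male"

# Step 1: Take input from the User
user_input = {"Height": "Tall", "Weight": "Heavy", "Hair": "Short", "Beard": "Yes"}

# Step 2: Convert User Input into a Feature Vector, keeping the Attribute
# order exactly the same as in the Training and Testing Examples
feature_vector = (user_input["Height"], user_input["Weight"],
                  user_input["Hair"], user_input["Beard"])

# Step 3: Apply the Model on the Feature Vector
prediction = model(feature_vector)

# Step 4: Return the Prediction to the User
print(prediction)  # Male
```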

Feedback Phase – ID3 Algorithm

  • Feedback Phase
  • Only Allah is Perfect 😊
  • Take Feedback on your  deployed  Model from
    • Domain Experts and
    • Users
  •  Improve  your Model based on Feedback 😊

 

Overfitting – ID3 Algorithm

  • Overfitting
  • Definition 
    • Given a hypothesis space H, a hypothesis h ∈ H  overfits  the Training Examples if there is another hypothesis h′ ∈ H such that h has  smaller Error  than h′ over the Training Examples, but h′ has a  smaller Error  over the  entire distribution of instances 
  • What Causes Overfitting?
  • Noise  in Training Examples 
  • Number of Training Examples is  too small  to produce a  representative sample  of Target Function 
  • Why Overfitting is a Serious Problem?
    • Overfitting is a serious problem for  many  Machine Learning Algorithms 
      • For example 
        • Decision Tree Learning
        • Regular Neural Networks
        • Deep Learning Algorithm etc. 

 

  • Example – Overfitting
  • Consider the performance of the following two Hypotheses on Training Data and Test Data

  • Note that h has  smaller Error  than h′ over the Training Examples, but 
    • h′ has a  smaller Error  over the  entire distribution of instances 
  • This means that  h has overfit  the Training Examples
  • Overfitting and Decision Tree Learning
  • Overfitting is a  real problem  for Decision Tree Learning 
  • One  empirical study  showed that for a  range of tasks  there was a
    •  Decrease  of 10% – 25% in Accuracy due to Overfitting
  • The simple ID3 Algorithm  can produce  
    • Decision Trees that  overfit  the Training Examples

 

  • Example 1 – Overfitting of ID3 Algorithm
  • Suppose in addition to the 14 Training Examples for Gender Identification we get a 15th Training Example, whose  Target Classification  is  wrong  i.e. Training Example has  Error 
    • ‹Tall, Heavy, Short, Yes, Gender = Female›
  • Question
    • What  impact  this will have on our  earlier  Decision Tree?
  • Since we previously had the  correct  Training Example
  • The  addition  of the  incorrect  Training Example will now  cause  ID3 to construct a  more complex Decision Tree 
  • The  new (erroneous) Training Example  will be sorted into the  second leaf node  from the left in the Decision Tree, along with the  previous Training Examples x9 and x11 (both x9 and x11 labeled as Male) 
    • Because the  new (erroneous) Training Example  is  labeled  as  Female , ID3 will  search  for  further refinements  to the Decision Tree below this node
  • The result will be a Decision Tree which performs 
    •  well  on the ( erroneous ) Training Examples and  less well  on  new unseen  instances
  • Note 
    • In this example, ID3 Algorithm  overfits  Training Examples due to 
      • Noise in Training Example
    • Overfitting can also occur when 
      • Number of Training Examples is  too small  to produce a  representative sample  of Target Function 
  • Example 2 - Overfitting of ID3 Algorithm

ID3 Algorithm when Trained and Tested on data from medical patients who had a form of diabetes

  • Observations from Graph
    • Accuracy of Decision Tree over Training Examples  increases monotonically  as Decision Tree grows (to be expected)
    • Accuracy of Decision Tree over Test Examples  increases  till about 25 nodes, then  decreases 
  • Conclusion
    • Decision Tree has  overfit 
  • Avoiding Overfitting
  • Question
    • How can we  avoid Overfitting  in Decision Tree Learning?
  • Answer
    • Many Possible Solutions

 

  • Avoiding Overfitting – Two General Approaches
  • Two general approaches to avoid Overfitting in Decision Tree Learning are 
    •  Stop growing  Decision Tree before perfectly fitting Training Examples 
      • e.g. when Data Split is not statistically significant
    •  Grow full  Decision Tree, then  prune  afterward
  • In practice
    • Second approach has been  more successful 
  • Question
    • How can the size of an optimal final Decision Tree be decided to avoid Overfitting? 
  • Answer
    • Many Possible Approaches
  •  Most Common  Approach
    • Training and Validation Set Approach
  • Training and Validation Set Approach
  • Using this approach, Sample Data is split into three sets
    • Training Set
    • Validation Set 
    • Testing Set
  • Strengths
    • Validation Set helps to  check  whether the Model is Overfitting or not during the Training
  • Weaknesses
    • Holding Data back for a Validation Set  reduces  Data available for Training
  • Training Set
    • is used to  build the Model  
  • Validation Set
    • is used to  check  whether the  Model is Overfitting or not during Training 
  • Testing Set
    • is used to  evaluate the performance of the Model 
  • Note
    • Training Set and Validation Set are used in the 
      • Training Phase
    • Testing Set is used in the
      • Testing Phase
  • Question 1 
    • How to split Sample Data using Training and Validation Set Approach when we have very large/huge amount of data?
  • Answer 1
      • A Good Split may be
      • Training Set = 80%
      • Validation Set = 10%
      • Testing Set = 10%
  • Note: to  efficiently Train  Deep Learning Algorithms, we need a  huge  amount of Training Data 
  • Question 2 
    • How to split Sample Data using Training and Validation Set Approach when we have sufficiently large amount of data?
  • Answer 2
    • A Good Split may be
      • Training Set = 80%
      • Testing Set = 20%
      • Validation Set = 10% of Training Set
  • Example 1 – Splitting Data using Training and Validation Set Approach
  • Machine Learning Problem
    • Text Summarization 
  • Deep learning Algorithm
    • LSTM 
  • Total Sample Data 
    • 100,000 instances
  • Splitting Data using the following Split Ratio
    • Training Set = 80% = 80,000
    • Validation Set = 10% = 10,000
    • Testing Set = 10% = 10,000

 

  • Example 2 – Splitting Data using Training and Validation Set Approach
  • Machine Learning Problem
    • Sentiment Analysis
  • Machine Learning Algorithm
    • Random Forest
  • Total Sample Data 
    • 10,000 instances
  • Splitting Data using the following Split Ratio
    • Training Set = 80% = 8,000 (7,200 after holding out the Validation Set)
    • Testing Set = 20% = 2,000
    • Validation Set = 10% of Training Set = 800
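Both split schemes above can be sketched in plain Python. The helper names `split_huge` and `split_large` are illustrative, and the example list is hypothetical; the percentages match the two Answers above.

```python
# Minimal sketch of the two Train/Validation/Test split schemes.
# `split_huge` / `split_large` are hypothetical helper names.
import random

def split_huge(data, seed=0):
    """Very large data: Training 80% / Validation 10% / Testing 10%."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def split_large(data, seed=0):
    """Sufficiently large data: Training 80% / Testing 20%,
    then Validation = 10% of the Training Set."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n_train = int(0.8 * len(data))
    train, test = data[:n_train], data[n_train:]
    n_val = int(0.1 * len(train))
    return train[n_val:], train[:n_val], test

examples = list(range(10_000))          # e.g. the Sentiment Analysis example
train, val, test = split_large(examples)
print(len(train), len(val), len(test))  # 7200 800 2000
```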

 

  • Training and Validation Set Approach, Cont…
  • Question
    • How do I know that Model is Overfitting  during Training ?
  • Answer
    • Model is Not Overfitting 
      • During Training, if  Training Accuracy  is  increasing  and  Validation Accuracy  is also  increasing  then  Model is not Overfitting 
    • Model is Overfitting 
      • During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting 
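The check described above can be sketched as a comparison of the Training and Validation Accuracy histories. The accuracy numbers below are hypothetical, and `is_overfitting` is an illustrative helper.

```python
# Minimal sketch: detecting Overfitting during Training by watching
# Training Accuracy vs Validation Accuracy. Numbers are hypothetical.

def is_overfitting(train_acc, val_acc):
    """Overfitting: Training Accuracy increasing while Validation Accuracy decreasing."""
    return train_acc[-1] > train_acc[-2] and val_acc[-1] < val_acc[-2]

# Hypothetical per-step accuracies as a Decision Tree grows
train_acc = [0.70, 0.80, 0.88, 0.95]
val_acc   = [0.68, 0.75, 0.79, 0.73]  # drops at the last step

print(is_overfitting(train_acc, val_acc))  # True
```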

 

  • Approaches for Decision Tree Pruning
  • Insha Allah, in this Chapter, I will try to explain two approaches for Decision Tree Pruning
    • Reduced Error Pruning Approach
    • Rule Post-Pruning Approach

 

  • Reduced Error Pruning Approach
  • Reduced Error Pruning Approach works as follows
  • Step 1: Split data into
    • Training Set
    • Validation Set
    • Testing Set
  • Step 2: Construct / Learn Decision Tree using Training Set
  • Step 3: For each node evaluate the impact on  Validation Set  of removing  that node and those below it 
    • Remove node that  most  improves Accuracy on  Validation Set 
  • When a node is removed,  subtree  rooted at it is  replaced  with a  leaf node  
    • whose  Classification  is the  Most Common Classification  of Training Examples beneath the node
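The steps above can be sketched on a toy tree. This is a simplified bottom-up variant (collapse a subtree to its Most Common Training Classification whenever that does not hurt Validation Accuracy), using a hypothetical dict-based tree representation rather than a full ID3 implementation.

```python
# Minimal sketch of Reduced Error Pruning on a toy Decision Tree.
# The dict-based tree, its "majority" labels, and the Validation Set
# are all hypothetical.

def classify(node, example):
    while isinstance(node, dict):                # internal node: test an Attribute
        node = node["branches"][example[node["attribute"]]]
    return node                                  # leaf: a Classification

def accuracy(node, data):
    return sum(classify(node, x) == y for x, y in data) / len(data)

def prune(node, tree, validation):
    """Bottom-up: replace a subtree with its Most Common Training
    Classification when that does not hurt Validation Accuracy."""
    if not isinstance(node, dict):
        return
    for child in list(node["branches"].values()):
        prune(child, tree, validation)
    before = accuracy(tree, validation)
    saved = dict(node["branches"])
    # Tentatively collapse this node to a leaf with its majority Classification
    node["branches"] = {v: node["majority"] for v in saved}
    if accuracy(tree, validation) < before:
        node["branches"] = saved                 # pruning hurt Accuracy: undo

# Toy tree: tests Beard; the "No" subtree needlessly tests Height
tree = {"attribute": "Beard", "majority": "Male",
        "branches": {"Yes": "Male",
                     "No": {"attribute": "Height", "majority": "Female",
                            "branches": {"Tall": "Male", "Short": "Female"}}}}

validation = [({"Beard": "No", "Height": "Tall"}, "Female"),
              ({"Beard": "No", "Height": "Short"}, "Female"),
              ({"Beard": "Yes", "Height": "Tall"}, "Male")]

prune(tree, tree, validation)
print(accuracy(tree, validation))  # 1.0 after pruning the Height test
```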

 

  • Reduced Error Pruning Approach, Cont…
  • Strengths 
    • Decision Tree is  pruned  to  remove Overfitting 
  • Weaknesses
    • If a node  close  to the  root node  is removed, then  subtree below it  will be removed
      • Consequently, a  lot of information  may be  lost 

 

  • Example – Effect of Reduced Error Pruning
  • In the previous example, Reduced Error Pruning produces this effect

  • Note that  performance improves  after applying Reduced Error Pruning Approach
  • Rule Post-Pruning Approach
  • Rule Post-Pruning Approach works as follows
  • Step 1: Split data into
    • Training Set
    • Validation Set
    • Testing Set
  • Step 2: Construct / Learn Decision Tree using Training Set
  • Step 3:  Convert  Decision Tree to equivalent  Set of Rules 
  • Step 4: For each  Rule ,  evaluate impact  on Validation Set of  removing that Rule  
    • Remove Rule that  improves  Accuracy on  Validation Set 
  • Step 5: Sort the  final Set of Rules  by their estimated Accuracy, and  use them in this order  when classifying new instances
  • Strengths 
    • Converting Decision Tree to Rules allows  distinguishing different contexts  in which Rules are used 
      • Treat  each path  through Decision Tree  differently  
    • Removes  distinction  between testing nodes,  near root  and those  near leaves  
    • Rules are often  easier for people  to  understand 
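Step 3 above (converting a Decision Tree to an equivalent Set of Rules, one per root-to-leaf path) can be sketched as follows. The dict-based toy tree is hypothetical, not the exact Gender-Identification Model.

```python
# Minimal sketch: converting a Decision Tree into an equivalent Set of Rules.
# Each root-to-leaf path becomes one IF-THEN Rule whose body is a
# Conjunction (AND) of Attribute tests. The toy tree is hypothetical.

def tree_to_rules(node, path=()):
    if not isinstance(node, dict):               # leaf: emit one Rule
        return [(path, node)]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, path + ((node["attribute"], value),))
    return rules

tree = {"attribute": "Beard",
        "branches": {"Yes": "Male",
                     "No": {"attribute": "Height",
                            "branches": {"Tall": "Male", "Short": "Female"}}}}

for conditions, label in tree_to_rules(tree):
    body = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {body} THEN Gender = {label}")
# IF Beard = Yes THEN Gender = Male
# IF Beard = No AND Height = Tall THEN Gender = Male
# IF Beard = No AND Height = Short THEN Gender = Female
```

Each Rule can then be evaluated (and removed or kept) independently on the Validation Set, which is what lets Rule Post-Pruning treat each path through the tree differently.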

 

  • Strengths and Weaknesses – ID3 Algorithm
  • Strengths 
    • Can handle  noisy  Data
    •  Always finds  the Target Function because 
      • ID3 Algorithm  searches  a  complete Hypothesis Space 
  • Weaknesses
    • Cannot handle  very complex Numeric Representation 
  • Note
    •  State-of-the-art  Decision Tree Learning Algorithms (like Random Forest) can handle  simple Numeric Representations 

However, they  fail  to handle  very complex Numeric Representations  (e.g. Activity Detection in Video / Image)

TODO and Your Turn​

Todo Tasks
Your Turn Tasks
Todo Tasks

TODO Task 1

  • Task 1
    • Consider the Titanic Dataset with the following Attributes
      • Gender: Male, Female
      • Ticket Class: Upper, Middle, Lower
      • Parents/Children Aboard: Zero, One, Two, Three
      • Embarked: Cherbourg, Queenstown, Southampton
      • Survival: No, Yes
    • We obtained the following Sample Data

    • Sample Data was split into Training and Testing Data in a 
      • Train-Test Split Ratio of 80%-20%
    • Training Data 

    • Testing Data

    • Note
      • Consider ID3 Algorithm when answering questions given below
      • Your answer should be 
        • Well Justified
    • Questions
      • Write down the Input and Output for the above Machine Learning Problem?
      • How Training Example is represented?
      • How Hypothesis (h) should be represented?
      • Execute the Machine Learning Cycle?
    • In Training Phase
      • Use Information Gain measure to select the best Attribute?
      • What approach you will use to avoid Overfitting during Training?
      • How will you check whether the Model is Overfitting or Not during Training?
    • After Training Phase
      • What approach is  most suitable  to remove Overfitting from  Trained Model 
      • What could be the possible reasons of Overfitting?
      • Write down your observations that you observed during the execution of Machine Learning Cycle? 
Your Turn Tasks

Your Turn Task 1


  • Task 1
    • Select a Machine Learning Problem (similar to Titanic – Machine Learning from Disaster) and answer the questions given below
    • Note
      • Consider ID3 Algorithm in answering all the questions
    • Questions
      • Write down the Input and Output for the selected Machine Learning Problem?
      • How Training Example is represented?
      • How Hypothesis (h) should be represented?
      • Execute the Machine Learning Cycle?
  • In Training Phase
    • Use Information Gain measure to select the best Attribute?
    • What approach you will use to avoid Overfitting during Training?
    • How will you check whether the Model is Overfitting or Not during Training?
  • After Training Phase
    • What approach is  most suitable  to  remove  Overfitting from  Trained Model 
    • What could be the possible reasons of Overfitting?
    • Write down your observations that you observed during the execution of Machine Learning Cycle?

 

Chapter Summary

  • Chapter Summary
  • Main Problems – Candidate Elimination Algorithm
    • Cannot handle noisy Data
    • Cannot handle Numeric Data
    • Cannot work for complex representations of Data
    • It may fail to find the Target Function in the Hypothesis Space
      • i.e. converge to an empty Version Space
  • Proposed Solution
    • Decision Tree Learning Algorithms
  • Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree 
  • Learned Decision Trees (or Models) can also be re-represented as 
    •  Sets of If-Then Rules (to improve human readability)
  • Decision Tree Learning is one of the most popular Inductive Learning Algorithms
  • ID3 Algorithm – Summary
    • Representation of Training Examples (D)
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Disjunction (OR) of Conjunction (AND) of Input Attribute (a.k.a. Decision Tree)
    • Searching Strategy
      • Greedy Search
  • Simple to Complex (Hill Climbing)
    • Training Regime 
      • Batch Method
    • Inductive Bias
      •  Shorter Decision Trees are preferred over Longer Decision Trees
      • Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
    • Strengths
      • Can handle noisy Data
      •  Always finds the Target Function because 
        • ID3 Algorithm searches a complete Hypothesis Space 
    • Weaknesses
      • Cannot handle very complex Numeric Representation 
    • Note
      •  State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations 
        • However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
  • To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
    • Step 1: Write down all the paths in the Decision Tree
    • A path is a Conjunction (AND) of Attributes 
    • Step 2: Join the paths using Disjunction (OR)
    • Decision Tree = Disjunction of Paths
  • In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
    • Instances / Examples are Represented as Attribute–Value Pair 
    • Target Function (f) is Discrete Valued 
    • Disjunctive Hypothesis may be Required
    • Possibly Noisy / Incomplete Training Data
  • ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question 
    • Which Attribute should be tested at the root of the tree?
  • Generally, a statistical test is used to evaluate each Input Attribute to determine 
    • How well it alone classifies the Training Examples?
  • Best Attribute is the one which alone best classifies the Training Examples 
  • Best Attribute is selected and used as the test at the root node of the tree
  • A descendant of the root node is then created for each possible value of this Attribute, and 
    • Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
  • The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
  • This process continues for each new leaf node until either of two conditions is met 
    •  Every Attribute has already been included along this path through the tree, or 
    • Training Examples associated with this leaf node , all have the same Output value (i.e., their Entropy is Zero)
  • ID3 Algorithm forms a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
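The top-down loop summarized above can be sketched compactly. This is a minimal illustration under assumptions: examples are hypothetical dicts of Attribute-Value pairs paired with a label, and the helpers `entropy`, `info_gain`, and `id3` are illustrative, not an industrial implementation.

```python
# Compact sketch of ID3: greedy, top-down, no backtracking.
# Data, attribute names, and helper names are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(data, attribute):
    """Expected reduction in Entropy from partitioning on `attribute`."""
    gain = entropy([y for _, y in data])
    for value in {x[attribute] for x, _ in data}:
        subset = [y for x, y in data if x[attribute] == value]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

def id3(data, attributes):
    labels = [y for _, y in data]
    # Stop: all examples agree (Entropy zero) or no Attributes left on this path
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice, never reconsidered: Best Attribute by Information Gain
    best = max(attributes, key=lambda a: info_gain(data, a))
    node = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        node["branches"][value] = id3(subset, [a for a in attributes if a != best])
    return node

data = [({"Beard": "Yes", "Height": "Tall"}, "Male"),
        ({"Beard": "No", "Height": "Tall"}, "Male"),
        ({"Beard": "No", "Height": "Short"}, "Female")]

tree = id3(data, ["Beard", "Height"])
print(tree["attribute"])  # Height (highest Information Gain on this toy data)
```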


  •  Many measures are available for picking the best classifier Attribute
    • e.g. Information Gain, Gain Ratio etc.
  • Information Gain is a useful measure for picking the best classifier Attribute 
  • Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
  • Information Gain measures 
    •  how well a given Attribute separates Training Examples with respect to their Target Classification 
  • Information Gain is defined in terms of 
    • Entropy 
  • Entropy gives a measure of purity / impurity of a Set of Training Examples
    • Value of Entropy lies between [0 – 1]
      • 0 means the Sample is Pure 
      • 1 means that Sample has maximum impurity
  • Minimum Entropy
    • Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category 
  • Maximum Entropy
    • Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
  • A limitation of Information Gain measure is that it favors attributes with many values over those with few values 
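The Entropy extremes stated above can be checked numerically for binary classification, using the standard definition Entropy(S) = −Σ pᵢ log₂ pᵢ; the 9-positive/5-negative sample is only an illustrative example.

```python
# Checking the minimum and maximum of Entropy for binary classification.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 14))                       # 0.0  -> pure sample (minimum)
print(entropy(["+"] * 7 + ["-"] * 7))            # 1.0  -> half/half (maximum)
print(round(entropy(["+"] * 9 + ["-"] * 5), 2))  # 0.94 -> mixed sample
```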
  • Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
  • Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
    • It only contains Hypotheses with Conjunction relationship between Attributes 
  • ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
    • This means that we cannot 
      • determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
  • ID3 Algorithm incompletely searches a complete Hypothesis Space (H) 
  • Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H) 
  • ID3 Algorithm performs no backtracking 
    • Once an Attribute is selected for testing at a given node, this choice is never reconsidered 
    • Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than the Globally Optimal Solution 
  • Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
  • Preference Bias only affects the order in which Hypotheses are searched 
    • In Preference Bias, Hypothesis Space (H) will contain the Target Function 
  • Restriction Bias affects which Hypotheses are searched
    • In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function 
  • Generally, it is better to choose a Machine Learning Algorithm with Preference Bias rather than Restriction Bias
  •  Some Machine Learning Algorithms may combine Preference and Restriction Biases 
    • e.g. Checker’s Learning Program 
  • Given a hypothesis space H, a hypothesis h ∈ H  overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances 
  • What Causes Overfitting?
      •  Noise in Training Examples 
      • Number of Training Examples is too small to produce a representative sample of Target Function 
  • Why Overfitting is a Serious Problem?
    • Overfitting is a serious problem for many Machine Learning Algorithms 
      • For example 
        • Decision Tree Learning
        • Regular Neural Networks
        • Deep Learning Algorithm etc. 
  • Overfitting is a real problem for Decision Tree Learning 
  • For Decision Tree Learning, one empirical study showed that for a range of tasks there was a
    •  Decrease of 10% – 25% in Accuracy due to Overfitting
  • The simple ID3 Algorithm can produce 
    • Decision Trees that overfit the Training Examples
  • Two general approaches to avoid Overfitting in Decision Tree Learning are 
    •  Stop growing Decision Tree before perfectly fitting Training Examples 
      • e.g. when Data Split is not statistically significant
    •  Grow full Decision Tree, then prune afterwards
  • In practice
    • Second approach has been more successful 
  •  Most Common Approach to avoid Overfitting is
    • Training and Validation Set Approach 
  • Using Training and Validation Set Approach, Sample Data is split into three sets
    • Training Set
      • is used to build the Model 
    • Validation Set
      • is used to check whether the Model is Overfitting or not during Training 
    • Testing Set
      • is used to evaluate the performance of the Model 
  • Holding Data back for a Validation Set reduces Data available for Training
  • During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting 
  • During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting 
  • Two approaches for Decision Tree Pruning to avoid Overfitting are
    • Reduced Error Pruning Approach
    • Rule Post-Pruning Approach
      • Rule Post-Pruning Approach is better than Reduced Error Pruning Approach

 

In Next Chapter

  • In Next Chapter
  • In Sha Allah, in next Chapter, I will present
  • Artificial Neural Network