# Chapter 10 - Decision Tree Learning

### Chapter Outline

- Chapter Outline

**Quick Recap**

**Decision Tree Learning**

**ID3 – Basic Decision Tree Learning Algorithm**

**Machine Learning Cycle – ID3 Algorithm**

**Overfitting – ID3 Algorithm**

**Chapter Summary**

## Quick Recap

- Quick Recap – Version Space Algorithm

**Problem – FIND-S Algorithm**

**Only returns one hypothesis, which best fits the Training Data**

**However, there can be multiple hypotheses which best fit the Training Data**

**Proposed Solution**

**Version Space**

**The Version Space (VS_{H,D}) contains the set of all hypotheses consistent with the Training Examples**

**A Version Space Algorithm computes the set of all hypotheses consistent with the Training Examples**

**Two Algorithms which return a Version Space are**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**List Then Eliminate Algorithm – Summary**

**Representation of Example**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Attributes**

**Searching Strategy**

**I am not clear about this. Please drop me an email if you know. Jazak Allah Khair**

**Training Regime**

**Incremental Method**

**Inductive Bias of List Then Eliminate Algorithm**

**Training Data is error-free**

**Target Function / Concept is present in the Hypothesis Space (H)**

**Strengths**

**List Then Eliminate Algorithm overcomes the limitation of the FIND-S Algorithm and**

**returns multiple hypotheses which best fit the Training Data, i.e. the Version Space (VS_{H,D})**

**Weaknesses**

**Only works on error-free Data**

**However, Real-world Data is noisy**

**Works on the assumption that the Target Function is present in the Hypothesis Space (H)**

**However, we may or may not find the Target Function in the Hypothesis Space (H), and this may or may not be known**

**List Then Eliminate Algorithm is computationally expensive because it makes a**

**pairwise comparison of each Training Example with all hypotheses in the Hypothesis Space (H)**

**List Then Eliminate Algorithm is computationally expensive because**

**the Hypothesis Space (H) is represented as a List, which forces the**

**List Then Eliminate Algorithm to make pairwise comparisons**

**To reduce the computational cost of Version Space Algorithms,**

**use a more compact representation of the Hypothesis Space (H),**

**e.g. the Candidate Elimination Algorithm**

**Candidate Elimination Algorithm – Summary**

**Representation of Example**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Attributes**

**Searching Strategy**

**I am not clear about this. Please drop me an email if you know. Jazak Allah Khair**

**Training Regime**

**Incremental Method**

**Inductive Bias of Candidate Elimination Algorithm**

**Training Data is error-free**

**Target Function / Concept is present in the Hypothesis Space (H)**

**Strengths**

**Candidate Elimination Algorithm overcomes the limitation of the List Then Eliminate Algorithm and**

**provides a more compact representation of the Version Space**

**The Version Space returned by the Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a**

**Single Model or**

**Ensemble Model**

**Weaknesses**

**Only works on error-free Data**

**However, Real-world Data is noisy**

**Works on the assumption that the Target Function is present in the Hypothesis Space (H)**

**However, we may or may not find the Target Function in the Hypothesis Space (H), and this may or may not be known**

**Only works when values of Attributes are Categorical**

**However, many Real-world Machine Learning Problems have**

**Attributes / Features with Numeric values**

**Only works for simple representations of Data**

**However, many Real-world Machine Learning Problems have complex representations (e.g. Face Detection, Face Recognition, etc.)**

**Single Model**

**A Single Model is trained on the Training Data and used to make predictions on the Test Data**

**Strengths**

**It is computationally fast and requires less time to make predictions compared to an Ensemble Model**

**Weaknesses**

**Error is likely to be high compared to an Ensemble Model**

**Ensemble Model**

**An Ensemble Model works by training different Models on the same Training Data and using each Model to individually make predictions on the Test Data**

**Strengths**

**Error is likely to be low compared to a Single Model**

**Weaknesses**

**It is computationally expensive and requires more time to make predictions compared to a Single Model**

**The Voting Approach is one of the simplest approaches to combine predictions from different Models**

**In the Voting Approach, we may assign**

**the same weight to all Models, or**

**different weights to the Models**

**Model Weight plays an important role in making the Final Prediction**

**The Voting Approach works as follows**

**Given a Test instance x**

**Step 1: Make individual predictions using the different Models**

**Step 2: Combine the predictions of the individual Models**

**Step 3: The Final Prediction for x will be the class which has the majority vote**

**A Voting Classifier is not an actual Classifier (Machine Learning Algorithm) but a wrapper for a set of different Machine Learning Algorithms, which are trained and tested on the same Data**

**A Voting Classifier combines the predictions of the individual Machine Learning Algorithms to make Final Predictions on unseen Data**
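Steps 1–3 of the Voting Approach can be sketched in a few lines of Python; `majority_vote` and the example gender labels are illustrative names, not from the chapter:

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class predictions from several models by (weighted) voting.

    predictions: one predicted class label per model, e.g. ["Female", "Male"]
    weights: optional per-model weights; equal weights are used if omitted.
    """
    if weights is None:
        weights = [1] * len(predictions)
    votes = Counter()
    for label, weight in zip(predictions, weights):
        votes[label] += weight          # Step 2: combine individual predictions
    # Step 3: the final prediction is the class with the highest total vote.
    return votes.most_common(1)[0][0]

# Three hypothetical models predict the gender for one test instance x.
print(majority_vote(["Female", "Male", "Female"]))           # equal weights
print(majority_vote(["Female", "Male", "Male"], [3, 1, 1]))  # weighted votes
```

In the weighted call, the single heavily weighted model outvotes the other two, which is exactly why the choice of Model Weights matters for the Final Prediction.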

### Quick Recap - Models

- Learning Input-Output Functions – General Settings

**Input to Learner**

**Set of Training Examples (D)**

**Set of Functions / Hypothesis (H)**

**Output of Learner**

**A hypothesis h from H which best fits the Training Examples (D)**

- Learning is a Searching Problem

**For any Machine Learning Algorithm, we should have a clear understanding of the following three points**

**Representation of Training Examples (D)**

**How do we represent Training Examples (D) in a format which a Machine Learning Algorithm can understand and learn from?**

**Representation of Hypothesis (h)**

**How do we represent a Hypothesis (h) in a format which a Machine Learning Algorithm can understand?**

**Searching Strategy**

**What searching strategy is used by a Machine Learning Algorithm to find the h from H which best fits the Training Examples (D)?**

- Summary - FIND-S Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair**

**Training Regime**

**Incremental Method**

- Summary – List Then Eliminate Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**Sequential Search**

**Training Regime**

**Incremental Method**

- Summary – Candidate Elimination Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair**

**Training Regime**

**Incremental Method**

- Strengths and Weaknesses – Candidate Elimination Algorithm

**Strengths**

**Candidate Elimination Algorithm overcomes the limitation of the List Then Eliminate Algorithm and**

**provides a more compact representation of the Version Space**

**The Version Space returned by the Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a**

**Single Model or**

**Ensemble Model**

**Weaknesses**

**Only works on error-free Data**

**However, Real-world Data is noisy**

**Works on the assumption that the Target Function is present in the Hypothesis Space (H)**

**However, we may or may not find the Target Function in the Hypothesis Space (H), and this may or may not be known**

**Only works when values of Attributes are Categorical**

**However, many Real-world Machine Learning Problems have**

**Attributes / Features with Numeric values**

**Only works for simple representations of Data**

**However, many Real-world Machine Learning Problems have complex representations (e.g. Face Detection, Face Recognition, etc.)**

- Main Problems in Candidate Elimination Algorithm

**Main Problems**

**Cannot handle noisy Data**

**Cannot handle Numeric Data**

**Cannot work for complex representations of Data**

**It may fail to find the Target Function in the Hypothesis Space,**

**i.e. converge to an empty Version Space**

**Proposed Solution**

**Decision Tree Learning Algorithms**

## Decision Tree Learning

- ID3 Decision Tree Learning Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Disjunction (OR) of Conjunctions (AND) of Input Attributes**

**(a.k.a. Decision Tree)**

**Searching Strategy**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Training Regime**

**Batch Method**

- Comparison of Representation of Training Examples (D)

**In FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms, Training Examples (D) are represented as**

**Attribute-Value Pair**

**Conclusions**

**Representation of Training Examples (D) is the same for the FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms**

- Comparison of Representation of Hypothesis (h)

**In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Hypothesis (h) is represented as**

**Conjunction (AND) of Constraints on Input Attributes**

**In the ID3 Machine Learning Algorithm, Hypothesis (h) is represented as**

**Disjunction (OR) of Conjunctions (AND) of Input Attributes**

**Conclusion**

**Representation of Hypothesis (h) in the ID3 Algorithm is more expressive and powerful than in the FIND-S, List Then Eliminate and Candidate Elimination Algorithms**

**Reason**

**FIND-S, List Then Eliminate and Candidate Elimination Algorithms only use the**

**Conjunction (AND) relationship between Attributes**

**The ID3 Algorithm uses both**

**Disjunction (OR) and Conjunction (AND) relationships between Attributes**

- Comparison of Searching Strategy

**In FIND-S Algorithm, Searching Strategy is**

**I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair**

**In List Then Eliminate Algorithm, Searching Strategy is**

**Sequential Search**

**In Candidate Elimination Algorithm, Searching Strategy is**

**I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair**

**In the ID3 Machine Learning Algorithm, the Searching Strategy is**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Conclusion**

**The Search Strategy of the ID3 Algorithm is different from that of the FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms**

- Comparison of Training Regimes

**In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Training Regime is**

**Incremental Method**

**In ID3 Machine Learning Algorithm, Training Regime is**

**Batch Method**

**Conclusion**

**The Training Regime of the ID3 Algorithm is different from that of the FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms**

- Decision Tree

**Definition**

**A Decision Tree is a graph that uses a branching method to illustrate every possible outcome of a decision**

**Decision Trees can be drawn by hand or created with a graphics program or specialized software**

- Decision Tree Learning

**Definition**

**Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree**

**Learned Decision Trees (or Models) can also be re-represented as**

**Sets of If-Then Rules (to improve human readability)**

**Importance**

**One of the most popular Inductive Learning Algorithms**

**Applications**

**Decision Tree Learning Algorithms are successfully and widely used in a range of Real-world Applications, including**

**Predicting Magnetic Properties of Crystals**

**Profiling Higher-Priced Houses**

**Detecting Advertisements on the Web**

**Controlling a Production Process**

**Diagnosing Hypothyroidism**

**Assessing Credit Risk**

**Sentiment Analysis**

**Emotion Analysis**

**Author Profiling**

**Word Sense Disambiguation**

**Text Reuse and Plagiarism Detection**

**Note**

**Note the difference between the definitions of**

**Decision Tree and**

**Decision Tree Learning**

- Example – Decision Tree

**A Decision Tree (or Model) is given below**

- Example – Decision Tree Cont…

- Steps - Converting a Decision Tree into If-Then Rules

**To convert a Decision Tree (or Model) into If-Then Rules, follow these steps**

**Step 1: Write down all the paths in the Decision Tree**

**A path is a Conjunction (AND) of Attributes**

**Step 2: Join the paths using Disjunction (OR)**

**Note**

**Decision Tree = Disjunction (OR) of Paths**
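The two conversion steps can be sketched in Python. The nested-dict tree below is a hypothetical stand-in for the chapter's figure (which survives only as an image): internal nodes map an attribute to its value branches, and leaves are class labels.

```python
# Hypothetical toy tree for the Gender Identification problem.
tree = {"Height": {"Short": {"HairLength": {"Long": "Female", "Short": "Male"}},
                   "Normal": "Female",
                   "Tall": "Male"}}

def tree_to_rules(node, path=()):
    """Step 1: collect every root-to-leaf path as a Conjunction (AND) of
    attribute = value tests. Step 2: the returned rule set, read as a whole,
    is the Disjunction (OR) of those paths."""
    if not isinstance(node, dict):                 # leaf: emit one If-Then rule
        conjunction = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {conjunction} THEN {node}"]
    rules = []
    (attribute, branches), = node.items()          # single test per node
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + ((attribute, value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```

Running it prints one If-Then rule per root-to-leaf path, e.g. `IF Height = Normal THEN Female`, which makes the "Decision Tree = Disjunction of Paths" equation concrete.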

- Example – Converting a Decision Tree into If-Then Rules

**Consider the following Decision Tree. In sha Allah (انشاء اللہ), we will convert it into If-Then Rules**

- Example – Converting a Decision Tree into If-Then Rules

**Step 1: Write down all the paths in the Decision Tree**

**Step 2: Join the paths using Disjunction (OR)**

**Decision Tree = Disjunction (OR) of Paths**

- Problems Appropriate for Decision Tree Learning

**In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where**

**Instances / Examples are represented as Attribute–Value Pairs**

**Values of Attributes can be either**

**Categorical or**

**Numeric**

**The Target Function (f) is Discrete-Valued**

**Example 1 – Discrete Valued Target Function**

**Gender Identification Problem**

**Two Possible Output Values**

**Male**

**Female**

**Example 2 – Discrete Valued Target Function**

**Sentiment Analysis**

**Three Possible Output Values**

**Positive**

**Negative**

**Neutral**

**A Disjunctive Hypothesis may be Required**

**It is easy for Decision Tree Learning Algorithms to learn disjunctive concepts**

**Note that such concepts were outside the Hypothesis Space (H) of the Candidate Elimination Algorithm**

**Possibly Noisy / Incomplete Training Data**

**Decision Tree Learning Algorithms are robust to errors in**

**Values of Input Attributes**

**Values of the Output Attribute, i.e. the Classification of Training Example(s) is Incorrect**

**Decision Tree Learning Algorithms can be trained on Training Examples where, for some instances, some Attribute Values are unknown / missing**

**Note**

**The Candidate Elimination Algorithm cannot be trained on noisy / incomplete Data**

## ID3 - Basic Decision Tree Learning Algorithm

- Note

**There are many state-of-the-art Decision Tree Learning Algorithms available**

**For example, J48 and Random Forest**

**In sha Allah, I will discuss the most basic Decision Tree Learning Algorithm, i.e. ID3**

**Reason**

**In sha Allah, it will help you to completely and correctly understand**

**how Decision Tree Learning Algorithms work**

- ID3 Algorithm

- How ID3 Algorithm Works?

**The ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question**

**Question**

**Which Attribute should be tested at the root of the tree?**

**Answer**

**Use a statistical test to evaluate each Input Attribute to determine**

**how well it alone classifies the Training Examples**

**Best Attribute**

**The Best Attribute is the one which alone best classifies the Training Examples**

**The Best Attribute is selected and used as the test at the root node of the tree**

**A descendant of the root node is then created for each possible value of this Attribute, and**

**Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example's value for this Attribute)**

**The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree**

**This process continues for each new leaf node until either of two conditions is met:**

**every Attribute has already been included along this path through the tree, or**

**the Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)**

**Note**

**This forms a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices**
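The top-down, greedy construction described above can be sketched as follows. This is a minimal ID3 under the assumption of categorical attributes and a discrete target; the helper names (`best_attribute`, `id3`) and the four-row dataset are illustrative, not from the chapter:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (handles more than two classes)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes, target):
    """Statistical test: pick the attribute with the highest Information Gain."""
    base = entropy([e[target] for e in examples])
    def gain(attr):
        expected = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e[target] for e in examples if e[attr] == value]
            expected += len(subset) / len(examples) * entropy(subset)
        return base - expected
    return max(attributes, key=gain)

def id3(examples, attributes, target):
    """Grow a Decision Tree top-down; greedy, never backtracks."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:           # stop: all examples share one output
        return labels[0]
    if not attributes:                  # stop: every attribute already used
        return Counter(labels).most_common(1)[0][0]   # majority class
    attr = best_attribute(examples, attributes, target)
    branches = {}
    for value in {e[attr] for e in examples}:         # one branch per value
        subset = [e for e in examples if e[attr] == value]
        branches[value] = id3(subset, [a for a in attributes if a != attr], target)
    return {attr: branches}

# Hypothetical four-row dataset (the chapter's 14-example table is an image).
data = [
    {"HairLength": "Long",  "HeadCovered": "Yes", "Gender": "Female"},
    {"HairLength": "Long",  "HeadCovered": "No",  "Gender": "Female"},
    {"HairLength": "Short", "HeadCovered": "Yes", "Gender": "Male"},
    {"HairLength": "Short", "HeadCovered": "No",  "Gender": "Male"},
]
tree = id3(data, ["HairLength", "HeadCovered"], "Gender")
print(tree)   # HairLength alone classifies perfectly, so it becomes the root
```

Note how both stopping conditions from the text appear as the two base cases of the recursion, and how the selected attribute is removed from the candidate list along each path, never to be reconsidered.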

- Which Attribute is the Best Classiﬁer?

**The Most Crucial Step in the ID3 Algorithm is**

**choosing which Attribute to test at the next node**

**We would like to choose the Attribute which does**

**best at separating the Training Examples according to their Target Classification (or Output)**

**An Attribute which separates the Training Examples into two sets,**

**each of which contains Positive / Negative Training Examples of the Target / Output Attribute in the same ratio as the initial set of Training Examples,**

**has not helped us progress towards a classification**

- Example - Which Attribute is the Best Classiﬁer?

**Consider Gender Identification Problem**

**Suppose we have a Set of 14 Training Examples**

**Positive (Female) = 9**

**Negative (Male) = 5**

- Example - Which Attribute is the Best Classiﬁer? Cont.…

**Consider the Gender Identification Problem**

**Suppose we have a Set of 14 Training Examples**

**Positive****= 9**

**Negative****= 5**

**Question**

**Which Attribute is the best classifier?**

**Hair Length or**

**Head Covered**

**Answer**

**Hair Length**

- Measure to Pick Best Classifier Attribute

**Question**

**How many measures are there for picking the best classifier Attribute?**

**Answer**

**Many measures are available for picking the best classifier Attribute**

**Examples**

**Information Gain**

**Gain Ratio**

**Note**

**In this Chapter, in sha Allah we will use Information Gain**

- Information Gain

**Information Gain is a useful measure for picking the best classifier Attribute**

**Information Gain measures**

**how well a given Attribute separates the Training Examples with respect to their Target Classification**

**Information Gain is defined in terms of**

**Entropy**

- Entropy

**Definition**

**Entropy gives a measure of the purity / impurity of a Set of Training Examples**

**Formula**

**Entropy(S) = – p⊕ log2 p⊕ – p⊖ log2 p⊖**

**where S represents the Set of Training Examples (or Sample),**

**p⊕ represents the proportion of Positive Training Examples, and**

**p⊖ represents the proportion of Negative Training Examples**

**Range of Entropy**

**The value of Entropy lies in the range [0, 1]**

**0 means the Sample is Pure**

**1 means the Sample has maximum impurity**

- Minimum Entropy vs Maximum Entropy

**Minimum Entropy**

**Entropy is minimum (i.e. ZERO) when all the Training Examples fall into one single Class / Category**

**Maximum Entropy**

**Entropy is maximum (i.e. ONE) when half of the Training Examples are Positive and the remaining half are Negative**
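A minimal sketch of the two-class Entropy formula, reproducing the pure, maximum-impurity and [9+, 5-] cases worked in the examples below (`entropy` is an illustrative helper name):

```python
import math

def entropy(p_pos, p_neg):
    """Two-class Entropy: -p⊕ log2 p⊕ - p⊖ log2 p⊖, taking 0·log2(0) = 0."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(14/14, 0/14))            # pure sample        -> 0.0
print(entropy(7/14, 7/14))             # half and half      -> 1.0
print(round(entropy(9/14, 5/14), 3))   # [9+, 5-] sample    -> 0.94
```

The `if p > 0` guard encodes the standard convention that a zero proportion contributes nothing, which is what makes the pure-sample cases come out to exactly zero.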

- Example 1 – Computing Entropy

**Consider the Gender Identification Problem**

**Training Examples = 14**

**Positive = 14**

**Negative = 0**

**Entropy([14+, 0-]) = – p⊕ log2 p⊕ – p⊖ log2 p⊖**

**= – (14/14) log2 (14/14) – (0/14) log2 (0/14)**

**= – 1 log2 (1) – 0 log2 (0)**

**= – 1 (0) – 0 (0)   (taking 0 log2 0 = 0)**

**= 0**

**Note**

**Sample is Pure (all Training Examples are Positive) and**

**Entropy is ZERO**

- Example 2 – Computing Entropy

**Consider the Gender Identification Problem**

**Training Examples = 14**

**Positive = 0**

**Negative = 14**

**Entropy([0+, 14-]) = – (0/14) log2 (0/14) – (14/14) log2 (14/14)**

**= – 0 log2 (0) – 1 log2 (1)**

**= – 0 (0) – 1 (0)**

**= 0**

**Note**

**Sample Is Pure (all Training Examples are Negative) and**

**Entropy is ZERO**

- Example 3 – Computing Entropy

**Consider the Gender Identification Problem**

**Training Examples = 14**

**Positive = 7**

**Negative = 7**

**Entropy([7+, 7-]) = – (7/14) log2 (7/14) – (7/14) log2 (7/14)**

**= – 0.5 log2 (0.5) – 0.5 log2 (0.5)**

**= – 0.5 (–1) – 0.5 (–1)**

**= 0.5 + 0.5**

**= 1**

**Note**

**Sample has maximum impurity (half of the Training Examples are Positive and the remaining half are Negative) and**

**Entropy is ONE**

- Example 4 – Computing Entropy

**Consider the Gender Identification Problem**

**Training Examples = 14**

**Positive = 9**

**Negative = 5**

**Entropy([9+, 5-]) = – (9/14) log2 (9/14) – (5/14) log2 (5/14)**

**= 0.940**

**Note**

**Sample is impure**

- Information Gain

**Definition**

**Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute**

**Formula**

**Given a Set of Training Examples S and an Attribute A:**

**Gain(S, A) = Entropy(S) – Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)**

**where**

**Values(A) is the Set of Values Attribute A can take on**

**S_v is the subset of S for which Attribute A has value v**

**The first term in Gain(S, A) is the**

**Entropy of the Set of Training Examples S**

**The second term in Gain(S, A) is the**

**expected Entropy after partitioning on A, i.e. the sum of the Entropies of each subset S_v, weighted by the ratio of S_v in S**

**Strengths**

**Information Gain can be used to Construct / Learn a Decision Tree using the ID3 Algorithm**

**Weaknesses**

**The Information Gain measure favors Attributes with many values over those with few values**
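Gain(S, A) can be sketched directly from the definition. The four-row Gender dataset below is hypothetical, chosen so that one attribute splits the classes perfectly (gain 1.0) and the other not at all (gain 0.0):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    n = len(examples)
    expected = 0.0
    for v in {e[attr] for e in examples}:
        sv = [e[target] for e in examples if e[attr] == v]
        expected += len(sv) / n * entropy(sv)   # weight each subset by |Sv|/|S|
    return entropy([e[target] for e in examples]) - expected

# Hypothetical toy rows for the Gender Identification problem.
data = [
    {"HairLength": "Long",  "HeadCovered": "Yes", "Gender": "Female"},
    {"HairLength": "Long",  "HeadCovered": "No",  "Gender": "Female"},
    {"HairLength": "Short", "HeadCovered": "Yes", "Gender": "Male"},
    {"HairLength": "Short", "HeadCovered": "No",  "Gender": "Male"},
]
print(information_gain(data, "HairLength", "Gender"))    # perfect split -> 1.0
print(information_gain(data, "HeadCovered", "Gender"))   # uninformative -> 0.0
```

The uninformative attribute leaves each subset in the same positive/negative ratio as the initial set, which is exactly the "has not helped us progress" case described earlier.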

- Example - Weakness of Information Gain

**Consider the Gender Identification Problem**

**If we add a Date Attribute to the set of Input Attributes,**

**the Date Attribute will have a distinct value for each day and will have the highest Information Gain**

**This is because the Date Attribute perfectly predicts the Target / Output Attribute for all Training Examples**

**The result is a Decision Tree of Depth 1 that**

**perfectly classifies the Training Examples but fails on all other data**

- Example – Computing Information Gain

**Considering Gender Identification Problem**

**Information Gain for HeadCovered will be**

## Machine Learning Cycle - ID3 Algorithm

- Gender Identification Problem

**Machine Learning Problem**

**Gender Identification**

**Input**

**Human**

**Output**

**Gender of a Human**

**Task**

**Given a Human (Input), predict the Gender of the Human (Output)**

**Treated as**

**Learning Input-Output Function**

**i.e., Learn from Input to predict Output**

- Representation of Examples

**Example = Input + Output**

**Input = Human**

**Output = Gender**

**Representation of Input and Output**

**Attribute-Value Pair**

- Representation of Input

**Input is represented as a**

**Set of 4 Input Attributes**

**Height**

**Weight**

**Hair Length**

**Head Covered**

- Representation of Output

**Output is represented as a**

**Set of 1 Output Attribute**

**Gender**

**Note**

**Yes means Female and No means Male**

- Computing Size of Instance Space (X)

**Computing Size of Instance Space (X)**

**|X| = No. of Values of Height Input Attribute x No. of Values of Weight Input Attribute x No. of Values of Hair Length Input Attribute x No. of Values of Head Covered Input Attribute**

**|X| = 3 × 3 × 2 × 2 = 36**

- Sample Data

**We obtained a Sample Data of 21 examples**

- Representation of Hypothesis (h)

**We represent a Hypothesis (h) as**

**Disjunction (OR) of Conjunction (AND) of Input Attributes**

**Since ID3 represents Disjunctive Hypotheses, therefore**

**Hypothesis Space (H) = Concept Space (C)**

**|H| = |C| = 2^|X| = 2^36 = 68,719,476,736**

**Note**

**Using the ID3 Algorithm, we will find the Target Function / Concept in the Hypothesis Space because**

**we are searching the complete Concept Space**

**In the Candidate Elimination Algorithm, we may or may not find the Target Function / Concept in the Hypothesis Space because**

**we are searching an incomplete Concept Space**
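The instance-space and hypothesis-space counts above can be checked with two lines of arithmetic:

```python
# Value counts taken from the chapter: Height has 3 values, Weight 3,
# Hair Length 2, Head Covered 2.
x_size = 3 * 3 * 2 * 2      # |X| = 36 possible instances
# A concept labels each of the |X| instances Yes or No, so |C| = 2**|X|;
# ID3's disjunctive hypotheses can express all of them, hence |H| = |C|.
h_size = 2 ** x_size
print(x_size, h_size)       # 36 68719476736
```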

- Hypothesis Space (H) – ID3 Algorithm

**In ID3 Algorithm, Hypothesis Space (H) is a**

**Set of Decision Trees**

- Sample Data

- Split the Sample Data

**We split the Sample Data using****Random Split Approach****into**

**Training Data – 2/3 of the Sample Data**

**Testing Data – 1/3 of the Sample Data**
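The 2/3 – 1/3 Random Split can be sketched as follows; `random_split` is an illustrative helper, and the fixed seed exists only to make the example reproducible:

```python
import random

def random_split(sample, train_fraction=2/3, seed=0):
    """Randomly split a sample into training / testing partitions."""
    rng = random.Random(seed)
    data = list(sample)
    rng.shuffle(data)                         # Random Split: shuffle first
    cut = round(len(data) * train_fraction)   # 2/3 boundary
    return data[:cut], data[cut:]

sample = [f"x{i}" for i in range(1, 22)]      # 21 examples, as in the chapter
train, test = random_split(sample)
print(len(train), len(test))                  # 14 7
```

A 21-example sample splits into 14 training and 7 testing examples, matching the counts used in the Training Phase below.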

- Training Data

- Testing Data

- Machine Learning Cycle

**Four phases of a Machine Learning Cycle are**

**Training Phase**

**Build the Model using the Training Data**

**Testing Phase**

**Evaluate the performance of the Model using the Testing Data**

**Application Phase**

**Deploy the Model in the Real world, to make predictions on Real-time unseen Data**

**Feedback Phase**

**Take Feedback from the Users and Domain Experts to improve the Model**

### Training Phase – ID3 Algorithm

- Training Phase - ID3 Algorithm

**Let's construct / learn a Decision Tree (Model / h) which best fits the Training Examples using the ID3 Algorithm**

- Selection of Attribute for Root Node

**Question**

**What Attribute should be tested at the root node?**

**Answer**

**Compute the Information Gain for all four Attributes (Height, Weight, Hair Length and Head Covered), and the**

**Attribute with the highest Information Gain will be tested at the root node**

**Which Attribute should be tested at the root?**

**Gain(S, Height) = 0.246**

**Gain(S, Weight) = 0.029**

**Gain(S, Hair Length) = 0.151**

**Gain(S, Head Covered) = 0.084**

**Height provides the best prediction for the Target Classification**

**Therefore, Height will be the root node**

- After Selection of Root Node, Cont…

- Working on Height = Short Node

**Question**

**What Attribute should be tested here?**

**Answer**

**Compute the Information Gain for all three remaining Attributes (Weight, Hair Length and Head Covered), and the**

**Attribute with the highest Information Gain will be tested at the Height = Short node**

**Computing Information Gain**

**Hair Length provides the best prediction for the Target Classification**

**Therefore, it will be tested at the Height = Short node**

**Let's grow the Decision Tree further**

**Add to the tree a successor for each possible value of Hair Length**

**Partition the Training Examples according to the value of Hair Length**

**Note**

**For Root Node Selection**

**No. of Training Examples = 14**

**No. of Attributes = 4**

**For Height = Short Node Selection**

**No. of Training Examples = 5**

**No. of Attributes = 3**

- Working on Height = Normal Node

**S_normal = {x3, x7, x12, x13}**

**The Target Classification for all four instances is**

**Positive**

**i.e. the Sample is Pure (Entropy is ZERO)**

**Recall that one of the Conditions to Stop Growing the Decision Tree is:**

**the Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is ZERO)**

**For the Height = Normal Node,**

**we will not grow the Decision Tree, and**

**the Leaf Node will be assigned the label of the Majority Class**

**Considering S_normal = {x3, x7, x12, x13}:**

**Positive (Female) = 4**

**Negative (Male) = 0**

**Therefore, the Majority Class is Female and this will be the label of the Leaf Node**

- Decision Tree Learned by ID3 Algorithm

**Decision Tree (or Model) learned by ID3 Algorithm is given below**

- Searching Strategy - ID3 Algorithm

**Recall – Learning is a Searching Problem**

**Given**

**Set of Training Examples (D)**

**Set of Hypotheses (Decision Trees)**

**Learner (ID3 Algorithm)**

**searches the Hypothesis Space (Set of Decision Trees) to find a Hypothesis (Decision Tree / Model / h) which best fits the Training Examples**

**The Search of the ID3 Algorithm is a simple-to-complex, hill-climbing search guided by the Information Gain evaluation function**

**ID3 Algorithm**

**The Hypothesis Space of the ID3 Algorithm is the complete space of finite, discrete-valued functions w.r.t. the available Attributes**

**Candidate Elimination Algorithm**

**The Hypothesis Space of the Candidate Elimination Algorithm is an incomplete space because**

**it only contains Hypotheses with a Conjunction (AND) relationship between Attributes**

**ID3 Algorithm**

**The ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with the Training Examples seen so far**

**This means that we cannot**

**determine how many alternative Hypotheses (Decision Trees) are consistent with the Training Examples**

**Candidate Elimination Algorithm**

**The Candidate Elimination Algorithm maintains the Version Space (VS_{H,D}), i.e. the set of all Hypotheses consistent with the Training Examples**

**Conclusion**

**The ID3 Algorithm incompletely searches a complete Hypothesis Space (H)**

**The Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)**

**ID3 Algorithm**

**The ID3 Algorithm performs no backtracking**

**Once an Attribute is selected for testing at a given node, this choice is never reconsidered**

**Therefore, the ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than a Globally Optimal Solution**

- Training Regime – ID3 Algorithm

**ID3 Algorithm**

**The ID3 Algorithm uses the Batch Method for Training,**

**i.e. it uses all Training Examples at each step to make a statistically based decision about how to refine the current Hypothesis (h or Decision Tree)**

**Using statistically based properties of all Training Examples (Information Gain) means the technique is robust in the face of errors in individual Training Examples**

**Candidate Elimination Algorithm**

**The Candidate Elimination Algorithm uses the Incremental Method for Training,**

**i.e. it makes decisions incrementally based on single Training Examples**

**Conclusion**

**The Batch Method for Training is more robust to errors in Training Examples compared to the Incremental Method**

- Inductive Bias – ID3 Algorithm

**Inductive Bias**

**Inductive Bias is the set of assumptions needed, in addition to the Training Examples, to deductively justify the Learner's Classifications**

**Inductive Bias of the ID3 Algorithm**

**Shorter Decision Trees are preferred over Longer Decision Trees**

**Decision Trees that place high Information Gain Attributes close to the root are preferred over those that do not**

**Note**

**The Inductive Bias of the ID3 Algorithm arises from its Searching Strategy (simple-to-complex, hill climbing), which**

**selects shorter Decision Trees over longer ones, and**

**selects Decision Trees that place Attributes with the highest Information Gain closest to the root**

- Recall

**ID3 Algorithm****incompletely searches****a****complete Hypothesis Space (H)**

**Candidate Elimination Algorithm****completely searches****an****incomplete Hypothesis Space (H)**

**Can be put differently by saying**

**Inductive Bias of ID3 Algorithm follows from its****Searching Strategy****(Preference Bias or Search Bias)**

**Inductive Bias of Candidate Elimination Algorithm follows from its****deﬁnition of its search space****(Restriction Bias or Language Bias).**

**Question**

**Which Bias is better?**

**Answer**

**Main Difference between Two Biases**

**Preference Bias only affects the****order****in which Hypotheses are****searched**

**In Preference Bias, Hypothesis Space (H)****will****contain the Target Function**

**Restriction Bias affects****which****Hypotheses are searched**

**In Restriction Bias, Hypothesis Space (H)****may / may not****contain the Target Function**

**Generally,****better****to****choose****Machine Learning Algorithm with****Preference Bias****rather than****Restriction Bias**

**Note**

**Some****Machine Learning Algorithms may****combine****Preference and Restriction Biases**

**Example**

**Checker’s Learning Program**

**Restriction Bias**

**A****linear****weighted function of a fixed set of board features introduces Restriction Bias (****non-linear****potential Target Functions are excluded)**

**Preference Bias**

**Selection of moves that lead to winning the game introduces Preference Bias**

- Summary - Training Phase of ID3 Algorithm

**Recall**

**Data = Model + Error**

**Training Data**

**Model**

**Model – in the form of Rules**

**In sha Allah, In the next Phase i.e. Testing Phase, we will**

**Evaluate the performance of the Model**

### Testing Phase – ID3 Algorithm

- Testing Phase

**Question**

**How well has the Model learned?**

**Answer**

**Evaluate the performance of the Model on****unseen****data (or Testing Data)**

- Evaluation Measures

**Evaluation will be carried out using**

**Error measure**

- Error

**Definition**

**Error is defined as the proportion of incorrectly classified Test instances**

**Formula**

**Error = (Number of incorrectly classified Test instances) / (Total number of Test instances)**

**Note**

**Accuracy = 1 – Error**
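A minimal sketch of these two measures; the labels below are illustrative, not from the chapter's data:

```python
# Error = proportion of incorrectly classified Test instances;
# Accuracy = 1 - Error. The labels here are made up for illustration.
actual    = ["Male", "Female", "Male", "Male", "Female"]
predicted = ["Male", "Female", "Female", "Male", "Female"]

incorrect = sum(a != p for a, p in zip(actual, predicted))
error = incorrect / len(actual)
accuracy = 1 - error

print(error)     # 0.2
print(accuracy)  # 0.8
```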

- Evaluate Model (Decision Tree)

**Apply Model on Test Data**

### Application Phase – ID3 Algorithm

- Application Phase

**We assume that our Model**

**performed well****on****large Test Data****and can be****deployed in Real-world**

**Model is****deployed in the Real-world****and now we can make**

**predictions****on Real-time Data**

- Steps – Making Predictions on Real-time Data

**Step 1: Take Input from User**

**Step 2: Convert****User Input****into****Feature Vector**

**Exactly same****as****Feature Vectors****of Training and Testing Data**

**Step 3:****Apply****Model on the****Feature Vector**

**Step 4: Return****Prediction****to the User**
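The four steps above can be sketched as a minimal routine, assuming the learned Model is stored as a nested dictionary (the tree, attribute names, and values below are hypothetical, not the chapter's actual Gender Identification tree):

```python
# Hypothetical learned Model: internal node = {attribute: {value: subtree}},
# leaf = class label.
tree = {"Height": {"Tall": {"Beard": {"Yes": "Male", "No": "Female"}},
                   "Short": "Female"}}

def predict(tree, feature_vector):
    """Step 3: apply the Model by walking the tree until a leaf is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))       # attribute tested at this node
        value = feature_vector[attribute]  # user's value for that attribute
        tree = tree[attribute][value]      # follow the matching branch
    return tree                            # Step 4: the Prediction

# Steps 1-2: user input converted into a Feature Vector
# (attribute names must match those of the Training and Testing Data)
user_input = {"Height": "Tall", "Beard": "Yes"}
print(predict(tree, user_input))  # Male
```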

- Making Predictions on Real-time Data

**Step 1: Take input from User**

**Step 2: Convert User Input into****Feature Vector**

**Note that****order****of Attributes must be****exactly****same as that of Training and Testing Examples**

**Step 3: Apply Model on****Feature Vector**

**Return****Prediction****to the User**

**e.g. Male**

**Note**

**You can take Input from the user, apply Model and return predictions as many times as you like****😊**

### Feedback Phase – ID3 Algorithm

- Feedback Phase

**Only Allah is Perfect****😊**

**Take Feedback on your****deployed****Model from**

**Domain Experts and**

**Users**

**Improve****your Model based on Feedback****😊**

## Overfitting – ID3 Algorithm

- Overfitting

**Definition**

**Given a hypothesis space H, a hypothesis h****∈****H****overﬁts****the Training Examples if there is another hypothesis h′****∈****H such that h has****smaller Error****than h′ over the Training Examples, but h′ has a****smaller Error****over the****entire distribution of instances**

**What Causes Overfitting?**

**Noise****in Training Examples**

**Number of Training Examples is****too small****to produce a****representative sample****of Target Function**

**Why Overfitting is a Serious Problem?**

**Overfitting is a serious problem for****many****Machine Learning Algorithms**

**For example**

**Decision Tree Learning**

**Regular Neural Networks**

**Deep Learning Algorithm etc.**

- Example – Overfitting

**Consider the performance of the following two Hypotheses on Training Data and Test Data**

**Note that h has****smaller Error****than h′ over the Training Examples, but**

**h′ has a****smaller Error****over the****entire distribution of instances**

**This means that****h has overfit****the Training Examples**

- Overfitting and Decision Tree Learning

**Overﬁtting is a****real problem****for Decision Tree Learning**

**One****empirical study****showed that for a****range of tasks****there was a**

**Decrease****of 10% – 25% in Accuracy due to Overfitting**

**The simple ID3 Algorithm****can produce**

**Decision Trees that****overfit****the Training Examples**

- Example 1 – Overfitting of ID3 Algorithm

**Suppose that, in addition to the 14 Training Examples for Gender Identification, we get a 15th Training Example whose****Target Classification****is****wrong****i.e. the Training Example has an****Error**

**‹***Tall, Heavy, Short, Yes, Gender = Female***›**

**Question**

**What****impact****this will have on our****earlier****Decision Tree?**

**Since we previously had the****correct****Training Example, the****addition****of the****incorrect****Training Example will now****cause****ID3 to construct a****more complex Decision Tree**

**The****new (erroneous) Training Example****will be sorted into the****second leaf node****from the left in the Decision Tree, along with the****previous Training Examples x**_{9}**and x**_{11}**(both x**_{9}**and x**_{11}**labeled as Male)**

**Because the****new (erroneous) Training Example****is****labeled****as****Female****, ID3 will****search****for****further refinements****to the Decision Tree below this node**

**Result will be a Decision Tree which performs**

**well****on the (****erroneous****) Training Examples and****less well****on****new, unseen****instances**

**Note**

**In this example, ID3 Algorithm****overfits****Training Examples due to**

**Noise in Training Example**

**Overfitting can also occur when**

**Number of Training Examples is****too small****to produce a****representative sample****of Target Function**

- Example 2 - Overﬁtting of ID3 Algorithm

**ID3 Algorithm Trained and Tested on data from medical patients who had a form of diabetes**

**Observations from Graph**

**Accuracy of Decision Tree over Training Examples****increases monotonically****as Decision Tree grows (to be expected)**

**Accuracy of Decision Tree over Test Examples****increases****till about 25 nodes, then****decreases**

**Conclusion**

**Decision Tree has****overfit**

- Avoiding Overﬁtting

**Question**

**How can we****avoid Overfitting****in Decision Tree Learning?**

**Answer**

**Many Possible Solutions**

- Avoiding Overﬁtting – Two General Approaches

**Two general approaches to avoid Overfitting in Decision Tree Learning are**

**Stop growing****Decision Tree before perfectly ﬁtting Training Examples**

**e.g. when Data Split is not statistically signiﬁcant**

**Grow full****Decision Tree, then****prune****afterward**

**In practice**

**Second approach has been****more successful**

**Question**

**How can the size of an****optimal final****Decision Tree be decided, to****avoid Overfitting****?**

**Answer**

**Many Possible Approaches**

**Most Common****Approach**

**Training and Validation Set Approach**

- Training and Validation Set Approach

**Using this approach, Sample Data is split into three sets**

**Training Set**

**Validation Set**

**Testing Set**

**Strengths**

**Validation Set helps to****check****whether the Model is Overfitting or not during the Training**

**Weaknesses**

**Holding Data back for a Validation Set****reduces****Data available for Training**

**Training Set**

**is used to****build the Model**

**Validation Set**

**is used to****check****whether the****Model is Overfitting or not during Training**

**Testing Set**

**is used to****evaluate the performance of the Model**

**Note**

**Training Set and Validation Set are used in the**

**Training Phase**

**Testing Set is used in the**

**Testing Phase**

**Question 1**

**How to****split****Sample Data using Training and Validation Set Approach when we have****very large/huge****amount of data?**

**Answer 1**

**A good Split may be**

**Training Set****= 80%**

**Validation Set****= 10%**

**Testing Set****= 10%**

**Note: to****efficiently Train****Deep Learning Algorithms, we need a****huge****amount of Training Data**

**Question 2**

**How to****split****Sample Data using Training and Validation Set Approach when we have****sufficiently large****amount of data?**

**Answer 2**

**A Good Split may be**

**Training Set****= 80%**

**Testing Set****= 20%**

**Validation Set****= 10% of Training Set**
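The two split regimes above can be expressed as a small sketch (the sizes passed in are illustrative):

```python
def split_huge(n):
    # Very large data: 80% Training / 10% Validation / 10% Testing
    return int(n * 0.8), int(n * 0.1), int(n * 0.1)

def split_sufficient(n):
    # Sufficiently large data: 80% Training / 20% Testing,
    # with the Validation Set carved out as 10% of the Training portion
    train = int(n * 0.8)
    test = n - train
    validation = int(train * 0.1)
    return train - validation, validation, test

# (train, validation, test)
print(split_huge(100_000))       # (80000, 10000, 10000)
print(split_sufficient(10_000))  # (7200, 800, 2000)
```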

- Example 1 – Splitting Data using Training and Validation Set Approach

**Machine Learning Problem**

**Text Summarization**

**Deep learning Algorithm**

**LSTM**

**Total Sample Data**

**100,000 instances**

**Splitting Data using the following Split Ratio**

**Training Set****= 80%****= 80,000**

**Validation Set****= 10%****= 10,000**

**Testing Set****= 10%****= 10,000**

- Example 2 – Splitting Data using Training and Validation Set Approach

**Machine Learning Problem**

**Sentiment Analysis**

**Machine Learning Algorithm**

**Random Forest**

**Total Sample Data**

**10,000 instances**

**Splitting Data using the following Split Ratio**

**Training Set****= 80% (8,000) − Validation Set****= 7,200**

**Testing Set****= 20%****= 2,000**

**Validation Set****= 10% of Training Set****= 10% of 8,000 = 800**

- Training and Validation Set Approach, Cont…

**Question**

**How do I know whether the Model is Overfitting****during Training****?**

**Answer**

**Model is Not Overfitting**

**During Training, if****Training Accuracy****is****increasing****and****Validation Accuracy****is also****increasing****then****Model is not Overfitting**

**Model is Overfitting**

**During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting**
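A minimal sketch of this rule, comparing the last two recorded accuracies of hypothetical training histories:

```python
def is_overfitting(train_acc, val_acc):
    """Apply the rule above to the last two epochs of each accuracy history:
    Training Accuracy rising while Validation Accuracy falls = Overfitting."""
    train_rising = train_acc[-1] > train_acc[-2]
    val_falling = val_acc[-1] < val_acc[-2]
    return train_rising and val_falling

# Illustrative accuracy histories, not real training runs
print(is_overfitting([0.80, 0.85], [0.70, 0.74]))  # False - both rising
print(is_overfitting([0.85, 0.92], [0.74, 0.69]))  # True  - Model is Overfitting
```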

- Approaches for Decision Tree Pruning

**Insha Allah, in this Chapter, I will try to explain two approaches for Decision Tree Pruning**

**Reduced Error Pruning Approach**

**Rule Post-Pruning Approach**

- Reduced Error Pruning Approach

**Reduced Error Pruning Approach works as follows**

**Step 1: Split data into**

**Training Set**

**Validation Set**

**Testing Set**

**Step 2: Construct / Learn Decision Tree using Training Set**

**Step 3: For each node evaluate the impact on****Validation Set****of removing****that node and those below it**

**Remove node that****most****improves Accuracy on****Validation Set**

**When a node is removed,****subtree****rooted at it is****replaced****with a****leaf node**

**whose****Classiﬁcation****is the****Most Common Classiﬁcation****of Training Examples beneath the node**
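The three steps above can be sketched as follows, assuming a nested-dictionary tree representation; the tree, attribute names, and data are hypothetical illustrations, not the chapter's Gender Identification example:

```python
from collections import Counter
from copy import deepcopy

def predict(tree, example):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == x["label"] for x in data) / len(data)

def paths(tree, prefix=()):
    """Positions of all internal nodes, as (attr, value, ...) tuples."""
    if not isinstance(tree, dict):
        return []
    attr = next(iter(tree))
    found = [prefix]
    for value, sub in tree[attr].items():
        found += paths(sub, prefix + (attr, value))
    return found

def replaced(tree, path, leaf):
    """Copy of the tree with the subtree at `path` replaced by a leaf node."""
    if not path:
        return leaf
    tree = deepcopy(tree)
    node = tree
    pairs = list(zip(path[0::2], path[1::2]))
    for attr, value in pairs[:-1]:
        node = node[attr][value]
    attr, value = pairs[-1]
    node[attr][value] = leaf
    return tree

def majority_label(train, path):
    """Most Common Classification of Training Examples beneath the node."""
    pairs = list(zip(path[0::2], path[1::2]))
    beneath = [x for x in train if all(x[a] == v for a, v in pairs)] or train
    return Counter(x["label"] for x in beneath).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation):
    # Step 3: repeatedly remove the node that MOST improves Validation accuracy
    improved = True
    while improved:
        improved = False
        best, best_acc = tree, accuracy(tree, validation)
        for path in paths(tree):
            candidate = replaced(tree, path, majority_label(train, path))
            acc = accuracy(candidate, validation)
            if acc > best_acc:
                best, best_acc, improved = candidate, acc, True
        tree = best
    return tree

# Overfit tree: the Beard test under "Tall" fits one noisy training example
tree = {"Height": {"Tall": {"Beard": {"Yes": "Male", "No": "Female"}},
                   "Short": "Female"}}
train = [{"Height": "Tall", "Beard": "Yes", "label": "Male"},
         {"Height": "Tall", "Beard": "Yes", "label": "Male"},
         {"Height": "Tall", "Beard": "No", "label": "Female"},   # noisy
         {"Height": "Short", "Beard": "No", "label": "Female"}]
validation = [{"Height": "Tall", "Beard": "No", "label": "Male"},
              {"Height": "Tall", "Beard": "Yes", "label": "Male"},
              {"Height": "Short", "Beard": "No", "label": "Female"}]

pruned = reduced_error_prune(tree, train, validation)
print(pruned)  # {'Height': {'Tall': 'Male', 'Short': 'Female'}}
```

Pruning the Beard subtree replaces it with the leaf "Male" (the majority class of the training examples sorted to that node), which raises Validation accuracy from 2/3 to 3/3.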

- Reduced Error Pruning Approach, Cont…

**Strengths**

**Decision Tree is****pruned****to****remove Overfitting**

**Weaknesses**

**If a node****close****to the****root node****is removed, then****subtree below it****will be removed**

**Consequently, a****lot of information****may be****lost**

- Example – Effect of Reduced Error Pruning

**In the previous example, Reduced Error Pruning produces this effect**

**Note that****performance improves****after applying Reduced Error Pruning Approach**

- Rule Post-Pruning Approach

**Rule Post-Pruning Approach works as follows**

**Step 1: Split data into**

**Training Set**

**Validation Set**

**Testing Set**

**Step 2: Construct / Learn Decision Tree using Training Set**

**Step 3:****Convert****Decision Tree to equivalent****Set of Rules**

**Step 4: For each****Rule****,****evaluate impact****on Validation Set of****removing that Rule**

**Remove Rule that****improves****Accuracy on****Validation Set**

**Step 5: Sort****final Set of Rules****by estimated Accuracy, and use them in this sequence when classifying new instances**

**Strengths**

**Converting Decision Tree to Rules allows****distinguishing different contexts****in which Rules are used**

**Treat****each path****through Decision Tree****differently**

**Removes the****distinction****between Attribute tests****near the root****and those****near the leaves**

**Rules are often****easier for people****to****understand**


- Strengths and Weaknesses – ID3 Algorithm

**Strengths**

**Can handle****noisy****Data**

**Always finds****the Target Function because**

**ID3 Algorithm****searches****a****complete Hypothesis Space**

**Weaknesses**

**Cannot handle****very complex Numeric Representation**

**Note**

**State-of-the-art****Decision Tree Learning Algorithms (like Random Forest) can handle****simple Numeric Representations**

**However, they****fail****to handle****very complex Numeric Representations****(e.g. Activity Detection in Video / Image)**

### TODO and Your Turn

**TODO Task 1**

**Task 1**

**Consider the Titanic Dataset with the following Attributes**

**Gender: Male, Female**

**Ticket Class: Upper, Middle, Lower**

**Parents/Children Aboard: Zero, One, Two, Three**

**Embarked: Cherbourg, Queenstown, Southampton**

**Survival: No, Yes**

**We obtained the following Sample Data**

**Sample Data was split into Training and Testing Data in a**

**Train-Test Split Ratio of 80%-20%**

**Training Data**

**Testing Data**

**Note**

**Consider ID3 Algorithm when answering questions given below**

**Your answer should be**

**Well Justified**

**Questions**

**Write down the Input and Output for the above Machine Learning Problem?**

**How is a****Training Example****represented****?**

**How should the****Hypothesis (h)****be****represented****?**

**Execute the Machine Learning Cycle?**

**In Training Phase**

**Use Information Gain measure to select the****best****Attribute?**

**What****approach****will you use to****avoid****Overfitting during Training?**

**How will you****check****whether the Model is****Overfitting or Not****during Training?**

**After Training Phase**

**What approach is****most suitable****to remove Overfitting from the****Trained Model****?**

**What could be the****possible reasons****for Overfitting?**

**Write down the****observations****you made during the execution of the Machine Learning Cycle?**

**Your Turn Task 1**

**Task 1**

**Select a Machine Learning Problem (similar to Titanic – Machine Learning from Disaster) and answer the questions given below**

**Note**

**Consider ID3 Algorithm in answering all the questions**

**Questions**

**Write down the Input and Output for the selected Machine Learning Problem?**

**How is a****Training Example****represented****?**

**How should the****Hypothesis (h)****be****represented****?**

**Execute the Machine Learning Cycle?**

**In Training Phase**

**Use Information Gain measure to select the****best****Attribute?**

**What****approach****will you use to****avoid****Overfitting during Training?**

**How will you****check****whether the Model is****Overfitting or Not****during Training?**

**After Training Phase**

**What approach is****most suitable****to****remove****Overfitting from the****Trained Model****?**

**What could be the****possible reasons****for Overfitting?**

**Write down the****observations****you made during the execution of the Machine Learning Cycle?**

## Chapter Summary

- Chapter Summary

**Main Problems – Candidate Elimination Algorithm**

**Cannot handle****noisy****Data**

**Cannot handle****Numeric****Data**

**Cannot work for****complex representations****of Data**

**It****may fail to find****the Target Function in the Hypothesis Space**

**i.e. converge to an****empty****Version Space**

**Proposed Solution**

**Decision Tree Learning Algorithms**

**Decision Tree Learning is a****method****for****approximating****Target Functions / Concepts, in which the****learned function****(or Model) is****represented****by a****Decision Tree**

**Learned Decision Trees (or Models) can also be****re-represented****as**

**Sets of If-Then Rules****(to****improve****human readability)**

**Decision Tree Learning is the****most popular****of the Inductive Learning Algorithms**

**ID3 Algorithm – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Disjunction (OR) of Conjunction (AND) of Input Attribute (a.k.a. Decision Tree)**

**Searching Strategy**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Training Regime**

**Batch Method**

**Inductive Bias**

**Shorter****Decision Trees are****preferred****over****Longer****Decision Trees**

**Decision Trees that place****high****Information Gain Attributes****close to the root****are****preferred****to those that do not**

**Strengths**

**Can handle****noisy****Data**

**Always finds****the Target Function because**

**ID3 Algorithm****searches****a****complete Hypothesis Space**

**Weaknesses**

**Cannot handle****very complex Numeric Representation**

**Note**

**State-of-the-art****Decision Tree Learning Algorithms (like Random Forest) can handle****simple Numeric Representations**

**However, they****fail****to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)**

**To****convert****a Decision Tree (or Model) into If-Then Rules, follow the following steps**

**Step 1: Write down all the****paths****in the Decision Tree**

**A****path****is a****Conjunction (AND) of Attributes**

**Step 2: Join the****paths****using Disjunction (OR)**

**Decision Tree = Disjunction of Paths**
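The two conversion steps can be sketched as a short recursive routine; the tree and attribute names below are hypothetical:

```python
def tree_to_rules(tree, conditions=()):
    """Step 1: each root-to-leaf path becomes one If-Then Rule, a
    Conjunction (AND) of attribute tests. Step 2: the list of rules,
    taken together, is the Disjunction (OR) of all paths."""
    if not isinstance(tree, dict):                  # leaf: emit one rule
        body = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {body} THEN {tree}"]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():  # one path per branch
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# Hypothetical tree: internal node = {attribute: {value: subtree}}, leaf = label
tree = {"Height": {"Tall": {"Beard": {"Yes": "Male", "No": "Female"}},
                   "Short": "Female"}}
for rule in tree_to_rules(tree):
    print(rule)
# IF Height = Tall AND Beard = Yes THEN Male
# IF Height = Tall AND Beard = No THEN Female
# IF Height = Short THEN Female
```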

**In****general****, Decision Tree Learning Algorithms are****best****for Machine Learning Problems where**

**Instances / Examples are****Represented****as****Attribute–Value Pair**

**Target Function (f) is****Discrete Valued**

**Disjunctive Hypothesis may be Required**

**Possibly Noisy / Incomplete Training Data**

**ID3 Algorithm****learns Decision Trees****by****constructing****them****top-down****, beginning with the following question**

**Which Attribute should be tested at the****root****of the tree?**

**Generally, a****statistical test****is used to****evaluate****each****Input Attribute****to determine**

**How well it****alone classifies****the Training Examples?**

**Best Attribute is the one which****alone best classifies the Training Examples**

**Best Attribute is****selected****and used as the test at the****root node****of the tree**

**A****descendant****of the root node is then created for****each possible value****of this Attribute, and**

**Training Examples are****sorted****to the****appropriate descendant node****(i.e., down the****branch****corresponding to the****Training Example’s value****for this Attribute)**

**The****entire process is then repeated****using the****Training Examples associated with each descendant node****to select the****Best Attribute****to test at that point in the tree**

**This process continues for****each new leaf node****until either of****two conditions is met**

**Every Attribute****has already been****included****along this****path****through the tree, or**

**Training Examples****associated****with this****leaf node****,****all****have the****same Output value****(i.e., their Entropy is Zero)**

**ID3 Algorithm forms a****Greedy Search****for an acceptable Decision Tree, in which the ID3 Algorithm****never backtracks****to reconsider earlier choices**

**Many measures****are available for****picking****the best classifier Attribute**

**e.g. Information Gain, Gain Ratio, etc.**

**Information Gain is a useful measure for****picking****the best classifier Attribute**

**Information Gain is the****expected reduction****in Entropy resulting from****partitioning****a Set of Training Examples on the basis of an Attribute**

**Information Gain measures**

**how well****a given Attribute****separates****Training Examples****with respect to their Target Classiﬁcation**

**Information Gain is deﬁned in terms of**

**Entropy**

**Entropy gives a measure of****purity / impurity****of a Set of Training Examples**

**Value of Entropy lies in the range [0 – 1] (for a two-Class Sample)**

**0 means the****Sample is Pure**

**1 means that Sample has****maximum****impurity**

**Minimum Entropy**

**Entropy is****minimum****(i.e. ZERO), when****all****the Training Examples****fall into one single Class / Category**

**Maximum Entropy**

**Entropy is****maximum****(i.e. ONE), when****half****of the Training Examples are Positive and remaining half are Negative**
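A minimal sketch of the Entropy of a two-class sample, H(S) = −p₊ log₂ p₊ − p₋ log₂ p₋, confirming the minimum and maximum cases above (the counts are illustrative):

```python
import math

def entropy(positives, negatives):
    """Entropy of a sample with the given positive/negative counts."""
    total = positives + negatives
    h = 0.0
    for count in (positives, negatives):
        if count:                       # 0 * log2(0) is taken as 0
            p = count / total
            h -= p * math.log2(p)
    return h

print(entropy(14, 0))  # 0.0 - all examples in one Class (pure sample)
print(entropy(7, 7))   # 1.0 - half positive, half negative (max impurity)
```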

**A limitation of Information Gain measure is that it****favors****attributes with****many values****over those with****few values**

**Hypothesis Space of ID3 Algorithm is****complete space****of ﬁnite, discrete-valued functions w.r.t available Attributes**

**Hypothesis Space of Candidate Elimination Algorithm is****incomplete space****because**

**It****only****contains Hypotheses with****Conjunction relationship between Attributes**

**ID3 Algorithm maintains****only one Hypothesis****(h / Decision Tree) at any time, instead of, e.g.,****all Hypotheses (****or Decision Trees) consistent with Training Examples seen so far**

**This means that we****cannot**

**determine****how many alternative Hypotheses****(Decision Trees) are consistent with Training Examples**

**ID3 Algorithm****incompletely searches****a****complete Hypothesis Space (H)**

**Candidate Elimination Algorithm****completely searches****an****incomplete Hypothesis Space (H)**

**ID3 Algorithm performs****no backtracking**

**Once an Attribute is selected for testing at a given node, this choice is****never reconsidered**

**Therefore, ID3 Algorithm is****susceptible****to converging to****Locally Optimal Solution****rather than****Globally Optimal Solutions**

**Batch Method for Training is****more robust to errors in Training Examples****compared to Incremental Method**

**Preference Bias only affects the****order****in which Hypotheses are****searched**

**In Preference Bias, Hypothesis Space (H)****will****contain the Target Function**

**Restriction Bias affects****which****Hypotheses are searched**

**In Restriction Bias, Hypothesis Space (H)****may / may not****contain the Target Function**

**Generally,****better****to****choose****Machine Learning Algorithm with Preference Bias rather than Restriction Bias**

**Some****Machine Learning Algorithms may****combine****Preference and Restriction Biases**

**for e.g. Checker’s Learning Program**

**Given a hypothesis space H, a hypothesis h****∈****H****overﬁts****the Training Examples if there is another hypothesis h′****∈****H such that h has****smaller Error****than h′ over the Training Examples, but h′ has a****smaller Error****over the****entire distribution of instances**

**What Causes Overfitting?**

**Noise****in Training Examples**

**Number of Training Examples is****too small****to produce a****representative sample****of Target Function**

**Why Overfitting is a Serious Problem?**

**Overfitting is a****serious****problem for****many****Machine Learning Algorithms**

**For example**

**Decision Tree Learning**

**Regular Neural Networks**

**Deep Learning Algorithm etc.**

**Overﬁtting is a****real problem****for Decision Tree Learning**

**For Decision Tree Learning, one****empirical study****showed that for a****range of tasks****there was a**

**Decrease of 10% – 25% in Accuracy due to Overfitting**

**The simple ID3 Algorithm****can produce**

**Decision Trees that****overfit****the Training Examples**

**Two general approaches to avoid Overfitting in Decision Tree Learning are**

**Stop growing****Decision Tree before perfectly ﬁtting Training Examples**

**e.g. when Data Split is not statistically signiﬁcant**

**Grow full****Decision Tree, then****prune****afterwards**

**In practice**

**Second approach has been****more successful**

**Most Common****Approach to avoid Overfitting is**

**Training and Validation Set Approach**

**Using Training and Validation Set Approach, Sample Data is split into three sets**

**Training Set**

**is used to****build the Model**

**Validation Set**

**is used to check whether the****Model is Overfitting or not during Training**

**Testing Set**

**is used to****evaluate the performance of the Model**

**Holding Data back for a Validation Set****reduces****Data available for Training**

**During Training, if****Training Accuracy****is****increasing****and****Validation Accuracy****is also****increasing****then****Model is not Overfitting**

**During Training, if****Training Accuracy****is****increasing****and****Validation Accuracy****is****decreasing****then****Model is Overfitting**

**Two approaches for Decision Tree Pruning to avoid Overfitting are**

**Reduced Error Pruning Approach**

**Rule Post-Pruning Approach**

**Rule Post-Pruning Approach is better than Reduced Error Pruning Approach**

### In Next Chapter

- In Next Chapter

- In Sha Allah, in next Chapter, I will present

- Artificial Neural Network