
Ilm o Irfan

Technologies


Machine Learning

  • January 16, 2023


Chapter 10 - Decision Tree Learning

Chapter Outline

  • Chapter Outline
  • Quick Recap 
  • Decision Tree Learning
  • ID3 – Basic Decision Tree Learning Algorithm 
  • Machine Learning Cycle – ID3 Algorithm
  • Overfitting – ID3 Algorithm 
  • Chapter Summary

 

Quick Recap

  • Quick Recap – Version Space Algorithm
  • Problem – FIND-S Algorithm 
    • Only returns one hypothesis which best fits the Training Data
      • However, there can be multiple hypotheses which best fit the Training Data
  • Proposed Solution
    • Version Space
  • Version Space (VSH,D) contains the set of all hypotheses consistent with the Training Examples
  • Version Space Algorithm computes the set of all hypotheses consistent with the Training Examples
  • Two Algorithms which return a Version Space are
    • List Then Eliminate Algorithm
    • Candidate Elimination Algorithm 
  • List Then Eliminate Algorithm – Summary
    • Representation of Example
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Conjunction (AND) of Constraints on Attributes
    • Searching Strategy
      • Sequential Search
    • Training Regime
      • Incremental Method
    • Inductive Bias of List Then Eliminate Algorithm 
      • Training Data is error-free 
      • Target Function / Concept is present in the Hypothesis Space (H)
    • Strengths 
      • List Then Eliminate Algorithm overcomes the limitation of FIND-S Algorithm and 
        • returns multiple hypotheses which best fit the Training Data, i.e. the Version Space (VSH,D)
    • Weaknesses 
      • Only works on error-free Data
        • However, Real-world Data is noisy 
      • Works on the assumption that Target Function is present in the Hypothesis Space (H)
        • However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known 
      • List Then Eliminate Algorithm is computationally expensive because it makes a 
        • pairwise comparison of each Training Example with all hypotheses in the Hypothesis Space (H) 
  • List Then Eliminate Algorithm is computationally expensive because
    • Hypothesis Space (H) is represented as a List , which forces 
      • List Then Eliminate Algorithm to make pairwise comparisons 
  • To reduce the computational cost of Version Space Algorithms 
    • Have a more compact representation of Hypothesis Space (H) 
      • for e.g. Candidate Elimination Algorithm
  • Candidate Elimination Algorithm – Summary
    • Representation of Example
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Conjunction (AND) of Constraints on Attributes
    • Searching Strategy
      • I am not clear about this. Please drop me an email if you know. Jazak Allah Khair
    • Training Regime
      • Incremental Method
    • Inductive Bias of Candidate Elimination Algorithm 
      • Training Data is error-free 
      • Target Function / Concept is present in the Hypothesis Space (H)
    • Strengths
      • Candidate Elimination Algorithm overcomes the limitation of List Then Eliminate Algorithm and 
        • provides a more compact representation of the Version Space
      • Version Space returned by Candidate Elimination Algorithm can be used to make predictions on unseen Data using either a
        • Single Model or
        • Ensemble Model
    • Weaknesses 
      • Only works on error-free Data
        • However, Real-world Data is noisy 
      • Works on assumption that Target Function is present in the Hypothesis Space (H)
        • However, we may / may not find the Target Function in the Hypothesis Space (H) and this may / may not be known 
      • Only works when values of Attributes are Categorical 
        • However, in Real-world many Machine Learning Problems have 
          • Attributes / Features with Numeric values
      • Only works for simple representations of Data
        • However, in Real-world many Machine Learning Problems have complex representation (for e.g. Face Detection, Face Recognition etc.)
  • Single Model
    • A Single Model is trained on the Training Data and used to make predictions on the Test Data
    • Strengths 
      • It is computationally fast and requires less time to make predictions compared to Ensemble Model
    • Weaknesses
      • Error is likely to be high compared to Ensemble Model
  • Ensemble Model
    • An Ensemble Model works by training different Models on the same Training Data and using each Model to individually make predictions on the Test Data
    • Strengths 
      • Error is likely to be low compared to Single Model
    • Weaknesses
      • It is computationally expensive and requires more time to make predictions compared to Single Model
  • Voting Approach is one of the simplest approaches to combine predictions from different Models 
    • In Voting Approach, we may assign
      •  Same weight to all Models
      •  Different weights to all Models
    •  Model Weight plays an important role in making the Final Prediction 
  • Voting Approach works as follows
    • Given a Test instance x
      • Step 1: Make individual predictions using different Models
      • Step 2: Combine predictions of individual Models
      • Step 3: Final Prediction of x will be the class which has the majority vote 
  • Voting Classifier is not an actual Classifier (Machine Learning Algorithm) but a wrapper for a set of different Machine Learning Algorithms , which are trained and tested on the same Data
    • A Voting Classifier combines predictions of individual Machine Learning Algorithms to make Final Predictions on unseen Data
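The three-step Voting Approach above can be sketched in a few lines of Python. This is a minimal illustration: the model predictions and weights below are hypothetical stand-ins, since the chapter does not fix a concrete set of Models.

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Combine class predictions of individual Models by (weighted) voting."""
    if weights is None:
        weights = [1] * len(predictions)   # same weight assigned to all Models
    votes = Counter()
    for label, weight in zip(predictions, weights):
        votes[label] += weight             # Step 2: combine individual predictions
    return votes.most_common(1)[0][0]      # Step 3: class with the majority vote

# Step 1: individual predictions of three hypothetical Models for a test instance x
model_predictions = ["Female", "Male", "Female"]
print(majority_vote(model_predictions))             # equal weights -> Female
print(majority_vote(model_predictions, [1, 5, 1]))  # Model 2 dominates -> Male
```

As the second call shows, the Model Weights can change the Final Prediction even when the individual predictions stay the same.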

Quick Recap - Models

  • Learning Input-Output Functions – General Settings
  • Input to Learner 
    • Set of Training Examples (D)
    • Set of Functions / Hypothesis (H)
  • Output of Learner 
    • A hypothesis h from H, which  best fits  the Training Examples (D)
  • Learning is a Searching Problem
  • For  any  Machine Learning Algorithm, we should have a  clear  understanding of the following three points
    • Representation of Training Examples (D)
      • How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from them?
    • Representation of Hypothesis (h)
      • How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?
    • Searching Strategy
      • What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?

 

  • Summary - FIND-S Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • Training Regime 
    • Incremental Method

 

  • Summary – List Then Eliminate Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • Sequential Search 
  • Training Regime 
    • Incremental Method

 

  • Summary – Candidate Elimination Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Conjunction (AND) of Constraints on Input Attributes
  • Searching Strategy
    • I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • Training Regime 
    • Incremental Method

 

  • Strengths and Weaknesses – Candidate Elimination Algorithm
  • Strengths 
    • Candidate Elimination Algorithm  overcomes the limitation  of List Then Eliminate Algorithm and 
      • provides a  more compact  representation of the Version Space
      • Version Space returned by Candidate Elimination Algorithm can be used to  make predictions  on  unseen  Data using either a
        • Single Model or
        • Ensemble Model
  • Weaknesses 
    • Only works on  error free  Data
      • However, Real-world Data is  noisy 
    • Works on  assumption  that Target Function is  present  in the Hypothesis Space (H)
      • However, we  may / may not find  the Target Function in the Hypothesis Space (H) and this  may / may not be known 
    • Only works when  values of Attributes  are  Categorical 
      • However, in Real-world many Machine Learning Problems have 
  • Attributes / Features with  Numeric  values
    • Only works for  simple  representations of Data
      • However, in Real-world many Machine Learning Problems have complex representation (e.g. Face Detection, Face Recognition, etc.)

 

  • Main Problems in Candidate Elimination Algorithm
  • Main Problems
    • Cannot handle  noisy  Data
    • Cannot handle  Numeric  Data
    • Cannot work for  complex representations  of Data
    • It  may fail to find  the Target Function in the Hypothesis Space
      • i.e. converge to an  empty  Version Space
  • Proposed Solution
    • Decision Tree Learning Algorithms

 

Decision Tree Learning

  • ID3 Decision Tree Learning Algorithm
  • Representation of Training Examples (D)
    • Attribute-Value Pair
  • Representation of Hypothesis (h)
    • Disjunction (OR) of Conjunctions (AND) of Input Attributes 

                 (a.k.a. Decision Tree)

  • Searching Strategy
    • Greedy Search
      • Simple to Complex (Hill Climbing)
  • Training Regime 
    • Batch Method
  • Comparison of Representation of Training Examples (D)
  • In FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms, Training Examples (D) are represented as 
    • Attribute-Value Pair
  • Conclusions 
    • Representation of Training Examples (D) is the  same  for FIND-S, List Then Eliminate, Candidate Elimination and ID3 Machine Learning Algorithms

 

  • Comparison of Representation of Hypothesis (h)
  • In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Hypothesis (h) is represented as 
    • Conjunction (AND) of Constraints on Input Attributes
  • In ID3 Machine Learning Algorithms, Hypothesis (h) is represented as 
    • Disjunction (OR) of Conjunctions (AND) of Input Attributes
  • Conclusion 
    • Representation of Hypothesis (h) in ID3 ML Algorithm is  more expressive and powerful  than FIND-S, List Then Eliminate and Candidate Elimination Algorithms
  • Reason
    • FIND-S, List Then Eliminate and Candidate Elimination Algorithms  only use 
      • Conjunction (AND) relationship between Attributes
    • ID3 Algorithm  uses both 
      • Disjunction (OR) and Conjunction (AND) relationships between Attributes
  • Comparison of Searching Strategy
  • In FIND-S Algorithm, Searching Strategy is
    • I am not clear about the searching strategy of FIND-S Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • In List Then Eliminate Algorithm, Searching Strategy is
    • Sequential Search
  • In Candidate Elimination Algorithm, Searching Strategy is
    • I am not clear about the searching strategy of Candidate Elimination Algorithm. Please drop me an email if you know. Jazak Allah Khair
  • In ID3 Machine Learning Algorithms, Searching Strategy is 
    • Greedy Search 
      • Simple to Complex (Hill Climbing)
  • Conclusion 
    • Search Strategy of ID3 Algorithm is  different  from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms

 

  • Comparison of Training Regimes
  • In FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms, Training Regime is 
    • Incremental Method 
  • In ID3 Machine Learning Algorithm, Training Regime is 
    • Batch Method 
  • Conclusion
    • Training Regime of ID3 Algorithm is  different  from FIND-S, List Then Eliminate and Candidate Elimination Machine Learning Algorithms
  • Decision Tree
  • Definition
    • A Decision Tree is a  graph  that uses a  branching method  to illustrate  every possible outcome  of a decision
  • Decision Trees can be  drawn by hand  or  created with a graphics program  or specialized software

 

  • Decision Tree Learning
  • Definition 
    • Decision Tree Learning is a  method  for  approximating  Target Functions / Concepts, in which the  learned function  (or Model) is  represented  by a  Decision Tree 
  • Learned Decision Trees (or Models) can also be  re-represented  as 
    •  Sets of If-Then Rules  (to  improve  human readability)
  • Importance
    •  Most popular  of Inductive Learning Algorithms
  • Applications 
    • Decision Tree Learning Algorithms are  successfully  and  widely  used in a  range of Real-world Applications  including 
      • Predicting Magnetic Properties of Crystals
      • Profiling Higher-Priced Houses 
      • Detecting Advertisements on the Web
      • Controlling a Production Process
      • Diagnosing Hypothyroidism
      • Assessing Credit Risk
      • Sentiment Analysis
      • Emotion Analysis
      • Author Profiling
      • Word Sense Disambiguation
      • Text Reuse and Plagiarism Detection
  • Note
    • Note the difference between the definitions of 
      • Decision Tree and
      • Decision Tree Learning
  • Example – Decision Tree
  • A Decision Tree (or Model) is given below

  • Example – Decision Tree Cont…

  • Steps - Converting a Decision Tree into If-Then Rules
  • To  convert  a Decision Tree (or Model) into If-Then Rules, follow these steps
    • Step 1: Write down all the  paths  in the Decision Tree
      • A  path  is a  Conjunction (AND) of Attributes 
    • Step 2: Join the  paths  using Disjunction (OR)
  • Note 
    • Decision Tree = Disjunction of Paths
  • Example – Converting a Decision Tree into If-Then Rules
  • Consider the following Decision Tree, In sha Allah (انشاء اللہ), we will convert it into If-Then Rules

  • Example – Converting a Decision Tree into If-Then Rules
  • Step 1: Write down all the  paths  in the Decision Tree

  • Step 2: Join the  paths  using Disjunction (OR)
    • Decision Tree = Disjunction (OR) of Paths
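The two steps above can be sketched in Python. The nested-dict tree representation and the particular tree below are hypothetical (built from the chapter's Gender Identification attributes), not the chapter's own figure.

```python
def tree_to_rules(tree, path=()):
    """Enumerate every root-to-leaf path of a Decision Tree as an If-Then Rule.

    tree : either a class label (leaf) or a dict {attribute: {value: subtree}}
    Each path is a Conjunction (AND) of attribute tests; the full tree is the
    Disjunction (OR) of its paths.
    """
    if not isinstance(tree, dict):                 # leaf node: emit one rule
        conditions = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {conditions} THEN {tree}"]
    rules = []
    (attribute, branches), = tree.items()          # one test attribute per node
    for value, subtree in branches.items():        # Step 1: follow every path
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules                                   # Step 2: list = disjunction

# Hypothetical Decision Tree for the Gender Identification example
tree = {"Height": {"Short": {"Hair Length": {"Long": "Female",
                                             "Short": "Male"}},
                   "Normal": "Female",
                   "Tall": "Male"}}
for rule in tree_to_rules(tree):
    print(rule)
```

For the tree above this prints one rule per path, e.g. `IF Height = Short AND Hair Length = Long THEN Female`.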

  • Problems Appropriate for Decision Tree Learning
  • In  general , Decision Tree Learning Algorithms are  best  for Machine Learning Problems where
    • Instances / Examples are  Represented  as  Attribute–Value Pair 
      •  Values of Attributes  can be either 
        • Categorical or
        • Numeric
    • Target Function (f) is  Discrete Valued 
      • Example 1 – Discrete Valued Target Function
        • Gender Identification Problem
          • Two Possible Output Values
            • Male 
            • Female
      • Example 2 – Discrete Valued Target Function
        • Sentiment Analysis 
          • Three Possible Output Values
            • Positive
            • Negative
            • Neutral
    • Disjunctive Hypothesis may be Required
      • Easy for Decision Tree Learning Algorithms to learn  disjunctive concepts  
        • Note that such concepts were  outside  the Hypothesis Space (H) of the Candidate Elimination Algorithm
    • Possibly Noisy / Incomplete Training Data
      • Decision Tree Learning Algorithms are  robust to errors  in 
        • Values of  Input Attributes 
        • Values of  Output Attribute , i.e. the Classification of Training Example(s) is Incorrect
      • Decision Tree Learning Algorithms  can be trained  on Training Examples where for  some  instances  some  Attribute Values are  unknown/missing 
      • Note
        • Candidate Elimination Algorithm  cannot  be trained on  noisy/incomplete  Data

ID3 - Basic Decision Tree Learning Algorithm

  • Note
  • There are  many state-of-the-art  Decision Tree Learning Algorithms available 
    • For example, J48, Random Forest
  • In sha Allah, I will discuss the  most basic  Decision Tree Learning Algorithm i.e. ID3 
  • Reason
    • In sha Allah, it will help you to  completely  and  correctly  understand 
      • How Decision Tree Learning Algorithms work?
  • ID3 Algorithm

  • How ID3 Algorithm Works?
  • ID3 Algorithm  learns Decision Trees  by  constructing  them  top-down , beginning with the following question 
  • Question
    • Which Attribute should be  tested  at the  root  of the tree?
  • Answer
    • Use a  statistical test  to  evaluate  each  Input Attribute  to determine 
      • How well it  alone classifies  the Training Examples?
    • Best Attribute 
      • Best Attribute is the one which  alone best classifies the Training Examples 
  • Best Attribute is  selected  and used as the test at the  root node  of the tree
  • A  descendant  of the root node is then created for  each possible value  of this Attribute, and 
    • Training Examples are  sorted  to the  appropriate descendant node  (i.e., down the  branch  corresponding to the  Training Example’s value  for this Attribute)
  • The  entire process is then repeated  using the  Training Examples associated with each descendant node  to select the  Best Attribute  to test at that point in the tree
  • This process continues for  each new leaf node  until either of  two conditions are met 
    •  Every Attribute  has  already  been  included  along this  path  through the tree, or 
    • Training Examples  associated  with this  leaf node ,  all  have the  same Output value  (i.e., their Entropy is Zero)
  • Note 
    • This forms a Greedy Search  for an acceptable Decision Tree, in which the ID3 Algorithm  never backtracks  to reconsider earlier choices
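The top-down, greedy procedure described above can be sketched as a recursive function. This is a simplified illustration, not the official ID3 pseudocode: the statistical test is Information Gain (defined later in this chapter), and the tiny dataset at the end is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in Entropy from partitioning on an attribute."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Top-down greedy construction of a Decision Tree (never backtracks)."""
    if len(set(labels)) == 1:        # stop: all examples share one output value
        return labels[0]
    if not attributes:               # stop: every attribute already on this path
        return Counter(labels).most_common(1)[0][0]   # majority class
    # Best Attribute: the one which alone best classifies the examples
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in {x[best] for x in examples}:   # one descendant per attribute value
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        tree[best][v] = id3([examples[i] for i in idx],
                            [labels[i] for i in idx], remaining)
    return tree

# Hypothetical miniature of the Gender Identification data
examples = [{"Height": "Short", "Hair Length": "Long"},
            {"Height": "Short", "Hair Length": "Short"},
            {"Height": "Normal", "Hair Length": "Short"},
            {"Height": "Tall", "Hair Length": "Short"},
            {"Height": "Normal", "Hair Length": "Long"}]
labels = ["Female", "Male", "Female", "Male", "Female"]
print(id3(examples, labels, ["Height", "Hair Length"]))
```

Note how both stopping conditions from the text appear: a pure leaf (all labels equal) and an exhausted attribute list, in which case the majority class labels the leaf.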
  • Which Attribute is the Best Classifier?
  • Most Crucial Step in ID3 Algorithm
    •  Choosing which Attribute to test at the next node? 
  • Would like to  choose that Attribute  which does 
    •  best  at  separating  Training Examples  according to their Target Classification (or Output) 
  • An Attribute which  separates  Training Examples into  two  sets 
    • each of which contains Positive / Negative Training Examples of the  Target / Output Attribute  in the  same ratio  as the  initial set of Training Examples 
      •   has not helped  us progress towards a classification
  • Example - Which Attribute is the Best Classifier?
  • Consider Gender Identification Problem
  • Suppose we have a Set of 14 Training Examples
    • Positive (Female) = 9 
    • Negative (Male) = 5

  • Example - Which Attribute is the Best Classifier? Cont.…
    • Consider Gender Identification Problem
    • Suppose we have a Set of 14 Training Examples
    • Positive = 9 
    • Negative = 5 
  • Question
    • Which Attribute is the best classifier?
      • Hair Length or
      • Head Covered
  • Answer
    • Hair Length

  • Measure to Pick Best Classifier Attribute
  • Question
    • How many measures are there for picking the best classifier attribute?
  • Answer
    •  Many measures  are available for  picking  the best classifier Attribute
      • Examples
        • Information Gain
        • Gain Ratio
  • Note
    • In this Chapter, in sha Allah we will use Information Gain
  • Information Gain
  • Information Gain is a useful measure for  picking  the best classifier Attribute 
  • Information Gain measures 
    •  how well  a given Attribute  separates  Training Examples  with respect to their Target Classification 
  • Information Gain is defined in terms of 
    • Entropy 

 

  • Entropy
  • Definition 
    • Entropy gives a measure of  purity/impurity  of a Set of Training Examples
  • Formula
    • Entropy(S) = – p⊕ log2 p⊕ – p⊖ log2 p⊖
    • Where S represents the  Set of Training Examples (or Sample) 
    • p⊕ represents the proportion of Positive Training Examples
    • p⊖ represents the proportion of Negative Training Examples
  • Range of Entropy
    • Value of Entropy lies in the range [0, 1]
      • 0 means the  Sample is Pure 
      • 1 means that Sample has  maximum  impurity
  • Minimum Entropy vs Maximum Entropy
  • Minimum Entropy
    • Entropy is minimum (i.e. ZERO), when all  the Training Examples  fall into one single Class / Category 
  • Maximum Entropy
    • Entropy is maximum (i.e. ONE), when half  of the Training Examples are Positive and remaining half are Negative
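The Entropy formula above is a one-liner in Python. A minimal sketch, using the standard convention 0 · log2 0 = 0 so that pure samples come out as exactly zero; it reproduces the worked examples that follow.

```python
import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖, taking 0·log2(0) = 0."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(14/14, 0/14))              # pure sample      -> 0.0 (minimum)
print(entropy(7/14, 7/14))               # half and half    -> 1.0 (maximum)
print(round(entropy(9/14, 5/14), 3))     # impure sample    -> 0.94
```

The `if p > 0` filter implements the 0 · log2 0 = 0 convention without special-casing.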
  • Example 1 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 14
      • Negative = 0
  • Entropy([14+, 0-]) = -p⊕ log2 p⊕- p⊖ log2 p⊖

                                         = – (14/14) log2 (14/14) – (0/14) log2 (0/14)

                                         = – 1 log2 (1) – 0 log2 (0)

                                         = – 1 (0) – 0          (taking 0 log2 0 = 0)

                                         = 0

  • Note
    • Sample Is Pure (all Training Examples are Positive) and 
      • Entropy  is ZERO
  • Example 2 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 0
      • Negative = 14

   Entropy([0+, 14-]) = – (0/14) log2 (0/14) – (14/14) log2 (14/14)

                      = – 0 log2 (0) – 1 log2 (1)

                      = – 0 – 1 (0)          (taking 0 log2 0 = 0)

                       = 0

  • Note
    • Sample Is Pure (all Training Examples are Negative) and 
      • Entropy is ZERO
  • Example 3 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 7
      • Negative = 7

  Entropy([7+, 7-]) = – (7/14) log2 (7/14) – (7/14) log2 (7/14)

                    = – 0.5 log2 (0.5) – 0.5 log2 (0.5)

                    = – 0.5 (-1) – 0.5 (-1)

                = 0.5 + 0.5

                = 1

  • Note
    • Sample has  maximum  impurity (half of the Training Examples are Positive and remaining half are Negative) and 
      • Entropy is ONE
  • Example 4 – Computing Entropy
  • Considering Gender Identification Problem
    • Training Examples = 14 
      • Positive = 9
      • Negative = 5

  Entropy([9+, 5-]) = – (9/14) log2 (9/14) – (5/14) log2 (5/14)

               = 0.940

  • Note
    • Sample Is impure 
  • Information Gain
  • Definition 
    • Information Gain is the  expected reduction  in Entropy resulting from  partitioning  a Set of Training Examples on the basis of an Attribute
  • Formula 
    • Given a Set of Training Examples S and Attribute A
      • Gain(S, A) = Entropy(S) – Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
  • where
    • Values(A) is the  Set of Values  Attribute A can take on
    • Sv is the  subset of S  for which Attribute A has value v
  • First term in Gain(S,A) is 
    • Entropy of Set of Training Examples S
  • Second term in Gain(S, A) is 
    • Expected Entropy after partitioning on A, i.e. the sum of the Entropies of the subsets Sv, each weighted by the fraction |Sv| / |S|
  • Strengths
    • Information Gain can be used to Construct / Learn a Decision Tree using ID3 Algorithm
  • Weaknesses
    • Information Gain measure  favors  attributes with  many values  over those with  few values 
  • Example - Weakness of Information Gain
  • Consider the Gender Identification Problem
    • If we add a  Date  attribute to the set of Input Attributes
    • The  Date  attribute will have a  distinct value  for each day and will have the  highest  Information Gain
      • This is because  Date  attribute  perfectly predicts  the Target / Output Attribute for  all  Training Examples
    • Result is a Decision Tree of  Depth 1  that 
      •  perfectly  classifies Training Examples but  fails  on  all  other data
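The Gain(S, A) formula can be implemented directly by partitioning the examples on the attribute's values. The four-example dataset below is a made-up miniature, not the chapter's 14-example table (whose figure is not reproduced here).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)   # Sv: examples with A = v
    expected = sum(len(sv) / len(labels) * entropy(sv)   # weighted subset entropies
                   for sv in subsets.values())
    return entropy(labels) - expected

# Hypothetical miniature data: Head Covered perfectly separates the classes,
# so the Entropy drops from 1 to 0 and the Gain is maximal
examples = [{"Head Covered": "Yes"}, {"Head Covered": "Yes"},
            {"Head Covered": "No"}, {"Head Covered": "No"}]
labels = ["Female", "Female", "Male", "Male"]
print(gain(examples, labels, "Head Covered"))   # -> 1.0
```

A perfectly separating attribute is also exactly how the Date-attribute weakness above arises: an attribute with a distinct value per example drives every subset Entropy to zero and therefore maximizes the Gain.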
  • Example – Computing Information Gain
  • Considering Gender Identification Problem
  • Information Gain for HeadCovered will be 

Machine Learning Cycle - ID3 Algorithm

  • Gender Identification Problem
  • Machine Learning Problem
    • Gender Identification
  • Input 
    • Human
  • Output
    • Gender of a Human
  • Task
    • Given a Human (Input), predict the Gender of the Human (Output)
  • Treated as
    • Learning Input-Output Function 
      • i.e., Learn from Input to predict Output
  • Representation of Examples
  • Example = Input + Output
    • Input = Human
    • Output = Gender
  • Representation of Input and Output
    • Attribute-Value Pair
  • Representation of Input
  • Input is represented as a 
    • Set of 4 Input Attributes
      • Height
      • Weight
      • Hair Length
      • Head Covered

  • Representation of Output
  • Output is represented as a 
    • Set of 1 Output Attribute
      • Gender

  • Note
    • Yes  means Female and  No  means Male
  • Computing Size of Instance Space (X)

  • Computing Size of Instance Space (X) 
    • |X| = No. of Values of Height Input Attribute x No. of Values of Weight Input Attribute x No. of Values of Hair Length Input Attribute x No. of Values of Head Covered Input Attribute 
    • |X| = 3*3*2*2 = 36
  • Sample Data
  • We obtained a Sample Data of 21 examples

  • Representation of Hypothesis (h)
  • We represent a Hypothesis (h) as
    • Disjunction (OR) of Conjunction (AND) of Input Attributes
  • Since ID3 represents  Disjunctive Hypothesis , therefore
    • Hypothesis Space (H) = Concept Space (C)
  • |H| = |C| = 2^|X| = 2^36 = 68,719,476,736
  • Note 
    • Using ID3 Algorithm, we  will  find the Target Function / Concept in the Hypothesis Space because 
      • we are searching  complete Concept Space 
    • In Candidate Elimination Algorithm, we  may / may not find  the Target Function / Concept in the Hypothesis Space because
      • we are searching  incomplete Concept Space 
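The two counting arguments above, |X| = 36 and |H| = |C| = 2^36, can be checked directly:

```python
# |X| = product of the number of values of each Input Attribute
num_values = {"Height": 3, "Weight": 3, "Hair Length": 2, "Head Covered": 2}
size_x = 1
for n in num_values.values():
    size_x *= n
print(size_x)        # -> 36

# |H| = |C| = 2^|X|: every subset of the Instance Space is a possible Concept
print(2 ** size_x)   # -> 68719476736
```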
  • Hypothesis Space (H) – ID3 Algorithm
  • In ID3 Algorithm, Hypothesis Space (H) is a
    • Set of Decision Trees
  • Sample Data

  • Split the Sample Data
  • We split the Sample Data using  Random Split Approach  into
    • Training Data – 2 / 3 of Sample Data
    • Testing Data – 1 / 3 of Sample Data
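A minimal sketch of the Random Split Approach; the fixed seed and the index-shuffling are implementation choices for reproducibility, not part of the chapter. On the 21-example Sample Data this yields 14 Training and 7 Testing examples.

```python
import random

def random_split(sample, train_fraction=2/3, seed=42):
    """Randomly split Sample Data into Training (2/3) and Testing (1/3) sets."""
    indices = list(range(len(sample)))
    random.Random(seed).shuffle(indices)       # fixed seed: reproducible split
    cut = round(len(sample) * train_fraction)
    train = [sample[i] for i in indices[:cut]]
    test = [sample[i] for i in indices[cut:]]
    return train, test

sample = list(range(21))         # stands in for the 21 Sample Data examples
train, test = random_split(sample)
print(len(train), len(test))     # -> 14 7
```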
  • Training Data

  • Testing Data

  • Machine Learning Cycle
  • Four phases of a Machine Learning Cycle are
    • Training Phase
      •  Build the Model  using Training Data
    • Testing Phase
      •  Evaluate the performance of Model  using Testing Data
    • Application Phase
      •  Deploy the Model in Real-world , to  make a prediction  on Real-time unseen Data
    • Feedback Phase
      • Take Feedback from the  Users  and  Domain Experts  to  improve the Model

Training Phase – ID3 Algorithm

  • Training Phase - ID3 Algorithm
  • Let’s  construct/learn  a  Decision Tree (Model / h)  which  best fits  the Training Examples using ID3 Algorithm

 

  • Selection of Attribute for Root Node
  • Question
    • What Attribute should be tested at the root node?
  • Answer
    • Compute Information Gain for  all four  Attributes (Height, Weight, Hair length and Head Covered) and 
      • Attribute with the  highest  Information Gain will be  tested at the root node 

  • Which attribute should be tested at the root?
    • Gain (S, Height) = 0.246 
    • Gain (S, Weight) = 0.029
    • Gain (S, Hair Length) = 0.151
    • Gain (S, Head Covered) = 0.084
  •  Height  provides the  best prediction  for Target Classification
    • Therefore, Height will be the  root node 
  • After Selection of Root Node, Cont…

  • Working on Height = Short Node
  • Question
    • What Attribute should be tested here?
  • Answer
    • Compute Information Gain for all  three  Attributes (Weight, Hair Length and Head Covered) and 
      • Attribute with the  highest  Information Gain will be tested at the  Height = Short  node 
  • Computing Information Gain

  • Hair Length provides the  best prediction  for Target Classification
    • Therefore, it will be  tested  at  Height = Short  node
  • Let’s  further  grow Decision Tree
    •  Add  to the tree a  successor  for  each possible value  of Hair Length
    •  Partition  the Training Examples  according  to the  value of Hair Length 
  • Note
    • For Root Node Selection
      • No. of Training Examples = 14
      • No. of Attributes = 4
    • For  Height = Short  Node Selection
      • No. of Training Examples = 5
      • No. of Attributes = 3
  • Working on Height = Normal Node
  • Snormal = {x3,x7,x12,x13}
  • Target Classification for  all four instances  is
    • Positive 
      • i.e. Sample is Pure (Entropy is ZERO)
  • Recall that one of the  Conditions to Stop Growing the Decision Tree  is
    • Training Examples  associated  with this  leaf node ,  all  have the  same Output value  (i.e., their Entropy is ZERO)
  • For  Height = Normal  Node
    • We will  not grow  the Decision Tree and 
      • Leaf Node will be assigned the  label  of the  Majority Class 
  • Considering Snormal = {x3,x7,x12,x13}
    • Positive (Female) = 4
    • Negative (Male) = 0
  • Therefore,  Majority Class  is Female and this will be the label of Leaf Node 
  • Decision Tree Learned by ID3 Algorithm
  • Decision Tree (or Model) learned by ID3 Algorithm is given below

  • Searching Strategy - ID3 Algorithm
  • Recall – Learning is a Searching Problem
    • Given 
      • Set of Training Examples (D) 
      • Set of Hypotheses (Decision Trees)
    • Learner (ID3 Algorithm)
      •  searches  the Hypothesis Space (Set of Decision Trees) to  find a Hypothesis  (Decision Tree / Model / h) which  best fits  the Training Examples
  • Search of ID3 Algorithm is 
    • simple-to-complex, hill-climbing search guided  by the Information Gain evaluation function

  • ID3 Algorithm
    • Hypothesis Space of ID3 Algorithm is  complete space  of finite, discrete-valued functions w.r.t available Attributes
  • Candidate Elimination Algorithm 
    • Hypothesis Space of Candidate Elimination Algorithm is  incomplete space  because
      • It  only  contains Hypotheses with  Conjunction relationship between Attributes  
  • ID3 Algorithm
    • ID3 Algorithm maintains  only one Hypothesis  (h / Decision Tree) at any time, instead of, e.g.,  all Hypotheses  (or Decision Trees) consistent with Training Examples seen so far
    • This means that we  cannot 
      • determine  how many alternative Hypotheses  (Decision Trees) are  consistent  with Training Examples
  • Candidate Elimination Algorithm 
    • Candidate Elimination Algorithm maintains  Version Space (VSH,D)  i.e. set of  all  Hypotheses  consistent  with Training Examples 
  • Conclusion 
    • ID3 Algorithm  incompletely searches  a  complete Hypothesis Space (H) 
    • Candidate Elimination Algorithm  completely searches  an  incomplete Hypothesis Space (H) 
  • ID3 Algorithm
    • ID3 Algorithm performs  no backtracking 
      • Once an Attribute is selected for testing at a given node, this choice is  never reconsidered 

    • Therefore, ID3 Algorithm is  susceptible  to converging to a  Locally Optimal Solution  rather than the  Globally Optimal Solution 

  • Training Regime – ID3 Algorithm
  • ID3 Algorithm 
    • ID3 Algorithm uses  Batch Method  for Training
      • i.e. Uses  all  Training Examples at  each step  to make  statistically based decision  about how to  refine current Hypothesis  (h or Decision Tree)
    • Using statistically based properties of all Training Examples (Information Gain) means the technique is  robust  in the face of  errors in individual Training Examples 
  • Candidate Elimination Algorithm
    • Candidate Elimination Algorithm uses  Incremental Method  for Training 
      • i.e. makes decisions  incrementally  based on  individual  Training Examples 
  • Conclusion
    • Batch Method for Training is  more robust to errors in Training Examples  compared to Incremental Method

 

  • Inductive Bias – ID3 Algorithm
  • Inductive Bias 
    • Inductive Bias is the  set of assumptions  needed, in  addition  to the Training Examples, to  deductively justify  the Learner's Classifications
  • Inductive Bias of ID3 Algorithm
    •  Shorter  Decision Trees are  preferred  over  Longer  Decision Trees
    • Decision Trees that place  high  Information Gain Attributes  close to the root  are  preferred  to those that do not
  • Note
    • Inductive Bias of ID3 Algorithm arises from its  Searching Strategy  (simple-to-complex, hill climbing) 
      •  Selects  shorter Decision Trees over longer ones
      •  Selects  Decision Trees that place Attributes with  highest  Information Gain  closest to root 

 

  • Recall
    • ID3 Algorithm  incompletely searches  a  complete Hypothesis Space (H) 
    • Candidate Elimination Algorithm  completely searches  an  incomplete Hypothesis Space (H) 
  • Can be put differently by saying
    • Inductive Bias of ID3 Algorithm follows from its  Searching Strategy  (Preference Bias or Search Bias)
    • Inductive Bias of Candidate Elimination Algorithm follows from the definition of its search space (Restriction Bias or Language Bias).
  • Question
    • Which Bias is better?
  • Answer
    • Main Difference between the Two Biases
      • Preference Bias only affects the  order  in which Hypotheses are  searched  
        • In Preference Bias, Hypothesis Space (H)  will  contain the Target Function 
      • Restriction Bias affects  which  Hypotheses are searched
        • In Restriction Bias, Hypothesis Space (H)  may / may not  contain the Target Function 
    • Generally, it is  better  to  choose  a Machine Learning Algorithm with Preference Bias rather than Restriction Bias
  • Note 
      •  Some  Machine Learning Algorithms may  combine  Preference and Restriction Biases 
    • Example 
      • Checker’s Learning Program 
      • Restriction Bias
        •  Linear  weighted function of fixed set of board features introduces Restriction Bias ( non-linear  potential target functions excluded)
      • Preference Bias
        • Selection of moves that lead to  winning  the game introduces Preference Bias

 

  • Summary - Training Phase of ID3 Algorithm
  • Recall
    • Data = Model + Error
  • Training Data

  • Model

  • Model – in the form of Rules

  • In sha Allah, In the next Phase i.e. Testing Phase, we will 
    • Evaluate the performance of the Model

Testing Phase – ID3 Algorithm

  • Testing Phase
  • Question 
    • How well has the Model learned?
  • Answer
    • Evaluate the performance of the Model on  unseen  data (or Testing Data)

 

  • Evaluation Measures
  • Evaluation will be carried out using 
    • Error measure

 

  • Error
  • Definition
    • Error is defined as the proportion of incorrectly classified Test instances
  • Formula
    • Error = (Number of incorrectly classified Test instances) / (Total number of Test instances)
  • Note
    • Accuracy = 1 – Error
  • Evaluate Model (Decision Tree)
  • Apply Model on Test Data
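The Error and Accuracy computation above can be sketched in Python. This is a minimal illustration: `error_rate` is a hypothetical helper, and the predictions and labels below are made-up data, not the chapter's actual Test Data.

```python
# Minimal sketch: computing Error and Accuracy of a trained Model on Test Data.
# The predictions and labels below are hypothetical, for illustration only.

def error_rate(predictions, labels):
    """Error = proportion of incorrectly classified Test instances."""
    incorrect = sum(1 for p, y in zip(predictions, labels) if p != y)
    return incorrect / len(labels)

# Hypothetical Test Data labels and Model predictions
y_true = ["Male", "Female", "Male", "Male", "Female"]
y_pred = ["Male", "Female", "Female", "Male", "Female"]

error = error_rate(y_pred, y_true)
accuracy = 1 - error          # Accuracy = 1 - Error
print(error)     # 0.2
print(accuracy)  # 0.8
```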

Application Phase – ID3 Algorithm

  • Application Phase
  • We assume that our Model 
    • performed well  on  large Test Data  and can be  deployed in Real-world 
  • Model is  deployed in the Real-world  and now we can make
    • predictions  on Real-time Data

 

  • Steps – Making Predictions on Real-time Data
  • Step 1: Take Input from User
  • Step 2: Convert  User Input  into  Feature Vector 
    • Exactly same  as  Feature Vectors  of Training and Testing Data
  • Step 3:  Apply  Model on the  Feature Vector 
  • Step 4: Return  Prediction  to the User

 

  • Making Predictions on Real-time Data
  • Step 1: Take input from User

  • Step 2: Convert User Input into  Feature Vector 
    • Note that  order  of Attributes must be  exactly  same as that of Training and Testing Examples

  • Step 3: Apply Model on  Feature Vector 

  • Return  Prediction  to the User
    • Male
  • Note
    • You can take Input from the user, apply Model and return predictions as many times as you like 😊
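The four Application-Phase steps can be sketched as follows. The `model` function and the attribute set (Height, Weight, Hair, Beard) are assumptions standing in for the trained Gender-Identification Decision Tree, not the exact learned Model.

```python
# Minimal sketch of the four steps for making predictions on Real-time Data.
# `model` is a hypothetical stand-in for the deployed Decision Tree
# (re-represented as if-then rules); the attributes are illustrative.

def model(feature_vector):
    height, weight, hair, beard = feature_vector
    if beard == "Yes":
        return "Male"
    if height == "Short" and hair == "Long":
        return "Female"
    return "Male"

# Step 1: Take input from the User
user_input = {"Height": "Tall", "Weight": "Heavy", "Hair": "Short", "Beard": "Yes"}

# Step 2: Convert User Input into a Feature Vector, keeping the Attribute
# order exactly the same as in the Training and Testing Examples
feature_vector = (user_input["Height"], user_input["Weight"],
                  user_input["Hair"], user_input["Beard"])

# Step 3: Apply the Model on the Feature Vector
prediction = model(feature_vector)

# Step 4: Return the Prediction to the User
print(prediction)  # Male
```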

Feedback Phase – ID3 Algorithm

  • Feedback Phase
  • Only Allah is Perfect 😊
  • Take Feedback on your  deployed  Model from
    • Domain Experts and
    • Users
  •  Improve  your Model based on Feedback 😊

 

Overfitting – ID3 Algorithm

  • Overfitting
  • Definition 
    • Given a hypothesis space H, a hypothesis h ∈ H  overfits  the Training Examples if there is another hypothesis h′ ∈ H such that h has  smaller Error  than h′ over the Training Examples, but h′ has a  smaller Error  over the  entire distribution of instances 
  • What Causes Overfitting?
  • Noise  in Training Examples 
  • Number of Training Examples is  too small  to produce a  representative sample  of Target Function 
  • Why Overfitting is a Serious Problem?
    • Overfitting is a serious problem for  many  Machine Learning Algorithms 
      • For example 
        • Decision Tree Learning
        • Regular Neural Networks
        • Deep Learning Algorithm etc. 

 

  • Example – Overfitting
  • Consider the performance of the following two Hypotheses on Training Data and Test Data

  • Note that h has  smaller Error  than h′ over the Training Examples, but 
    • h′ has a  smaller Error  over the  entire distribution of instances 
  • This means that  h has overfit  the Training Examples
  • Overfitting and Decision Tree Learning
  • Overfitting is a  real problem  for Decision Tree Learning 
  • One  empirical study  showed that for a  range of tasks  there was a
    •  Decrease  of 10% – 25% in Accuracy due to Overfitting
  • The simple ID3 Algorithm  can produce  
    • Decision Trees that  overfit  the Training Examples

 

  • Example 1 – Overfitting of ID3 Algorithm
  • Suppose in addition to the 14 Training Examples for Gender Identification we get a 15th Training Example, whose  Target Classification  is  wrong  i.e. Training Example has  Error 
    • ‹Tall, Heavy, Short, Yes, Gender = Female›
  • Question
    • What  impact  this will have on our  earlier  Decision Tree?
  • Since we previously had the  correct  Training Example
  • The  addition  of the  incorrect  Training Example will now  cause  ID3 to construct a  more complex Decision Tree 
  • The  new (erroneous) Training Example  will be sorted into the  second leaf node  from the left in the Decision Tree, along with the  previous Training Examples x9 and x11 (both x9 and x11 labeled as Male) 
    • Because the  new (erroneous) Training Example  is  labeled  as  Female , ID3 will  search  for  further refinements  to the Decision Tree below this node
  • The result will be a Decision Tree which performs 
    •  well  on the ( erroneous ) Training Examples and  less well  on  new unseen  instances
  • Note 
    • In this example, ID3 Algorithm  overfits  Training Examples due to 
      • Noise in Training Example
    • Overfitting can also occur when 
      • Number of Training Examples is  too small  to produce a  representative sample  of Target Function 
  • Example 2 - Overfitting of ID3 Algorithm

ID3 Algorithm when Trained and Tested on data from medical patients who had a form of diabetes

  • Observations from Graph
    • Accuracy of Decision Tree over Training Examples  increases monotonically  as Decision Tree grows (to be expected)
    • Accuracy of Decision Tree over Test Examples  increases  till about 25 nodes, then  decreases 
  • Conclusion
    • Decision Tree has  overfit 
  • Avoiding Overfitting
  • Question
    • How can we  avoid Overfitting  in Decision Tree Learning?
  • Answer
    • Many Possible Solutions

 

  • Avoiding Overfitting – Two General Approaches
  • Two general approaches to avoid Overfitting in Decision Tree Learning are 
    •  Stop growing  Decision Tree before perfectly fitting Training Examples 
      • e.g. when Data Split is not statistically significant
    •  Grow full  Decision Tree, then  prune  afterward
  • In practice
    • Second approach has been  more successful 
  • Question
    • How can the size of an optimal final Decision Tree be decided to avoid Overfitting? 
  • Answer
    • Many Possible Approaches
  •  Most Common  Approach
    • Training and Validation Set Approach
  • Training and Validation Set Approach
  • Using this approach, Sample Data is split into three sets
    • Training Set
    • Validation Set 
    • Testing Set
  • Strengths
    • Validation Set helps to  check  whether the Model is Overfitting or not during the Training
  • Weaknesses
    • Holding Data back for a Validation Set  reduces  Data available for Training
  • Training Set
    • is used to  build the Model  
  • Validation Set
    • is used to  check  whether the  Model is Overfitting or not during Training 
  • Testing Set
    • is used to  evaluate the performance of the Model 
  • Note
    • Training Set and Validation Set are used in the 
      • Training Phase
    • Testing Set is used in the
      • Testing Phase
  • Question 1 
    • How to split Sample Data using Training and Validation Set Approach when we have very large/huge amount of data?
  • Answer 1
      • A Good Split may be
      • Training Set = 80%
      • Validation Set = 10%
      • Testing Set = 10%
  • Note: to  efficiently Train  Deep Learning Algorithms, we need a  huge  amount of Training Data 
  • Question 2 
    • How to split Sample Data using Training and Validation Set Approach when we have sufficiently large amount of data?
  • Answer 2
    • A Good Split may be
      • Training Set = 80%
      • Testing Set = 20%
      • Validation Set = 10% of Training Set
  • Example 1 – Splitting Data using Training and Validation Set Approach
  • Machine Learning Problem
    • Text Summarization 
  • Deep learning Algorithm
    • LSTM 
  • Total Sample Data 
    • 100,000 instances
  • Splitting Data using the following Split Ratio
    • Training Set = 80% = 80,000
    • Validation Set = 10% = 10,000
    • Testing Set = 10% = 10,000

 

  • Example 2 – Splitting Data using Training and Validation Set Approach
  • Machine Learning Problem
    • Sentiment Analysis
  • Machine Learning Algorithm
    • Random Forest
  • Total Sample Data 
    • 10,000 instances
  • Splitting Data using the following Split Ratio
    • Training Set = 80% = 8,000 (7,200 after holding out the Validation Set)
    • Testing Set = 20% = 2,000
    • Validation Set = 10% of Training Set = 800
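Both split schemes above can be sketched in plain Python. The helper names `split_huge` and `split_large` are illustrative, and the example list is hypothetical; the percentages match the two Answers above.

```python
# Minimal sketch of the two Train/Validation/Test split schemes.
# `split_huge` / `split_large` are hypothetical helper names.
import random

def split_huge(data, seed=0):
    """Very large data: Training 80% / Validation 10% / Testing 10%."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def split_large(data, seed=0):
    """Sufficiently large data: Training 80% / Testing 20%,
    then Validation = 10% of the Training Set."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n_train = int(0.8 * len(data))
    train, test = data[:n_train], data[n_train:]
    n_val = int(0.1 * len(train))
    return train[n_val:], train[:n_val], test

examples = list(range(10_000))          # e.g. the Sentiment Analysis example
train, val, test = split_large(examples)
print(len(train), len(val), len(test))  # 7200 800 2000
```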

 

  • Training and Validation Set Approach, Cont…
  • Question
    • How do I know that Model is Overfitting  during Training ?
  • Answer
    • Model is Not Overfitting 
      • During Training, if  Training Accuracy  is  increasing  and  Validation Accuracy  is also  increasing  then  Model is not Overfitting 
    • Model is Overfitting 
      • During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting 
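The check described above can be sketched as a comparison of the Training and Validation Accuracy histories. The accuracy numbers below are hypothetical, and `is_overfitting` is an illustrative helper.

```python
# Minimal sketch: detecting Overfitting during Training by watching
# Training Accuracy vs Validation Accuracy. Numbers are hypothetical.

def is_overfitting(train_acc, val_acc):
    """Overfitting: Training Accuracy increasing while Validation Accuracy decreasing."""
    return train_acc[-1] > train_acc[-2] and val_acc[-1] < val_acc[-2]

# Hypothetical per-step accuracies as a Decision Tree grows
train_acc = [0.70, 0.80, 0.88, 0.95]
val_acc   = [0.68, 0.75, 0.79, 0.73]  # drops at the last step

print(is_overfitting(train_acc, val_acc))  # True
```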

 

  • Approaches for Decision Tree Pruning
  • Insha Allah, in this Chapter, I will try to explain two approaches for Decision Tree Pruning
    • Reduced Error Pruning Approach
    • Rule Post-Pruning Approach

 

  • Reduced Error Pruning Approach
  • Reduced Error Pruning Approach works as follows
  • Step 1: Split data into
    • Training Set
    • Validation Set
    • Testing Set
  • Step 2: Construct / Learn Decision Tree using Training Set
  • Step 3: For each node evaluate the impact on  Validation Set  of removing  that node and those below it 
    • Remove node that  most  improves Accuracy on  Validation Set 
  • When a node is removed,  subtree  rooted at it is  replaced  with a  leaf node  
    • whose  Classification  is the  Most Common Classification  of Training Examples beneath the node
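The steps above can be sketched on a toy tree. This is a simplified bottom-up variant (collapse a subtree to its Most Common Training Classification whenever that does not hurt Validation Accuracy), using a hypothetical dict-based tree representation rather than a full ID3 implementation.

```python
# Minimal sketch of Reduced Error Pruning on a toy Decision Tree.
# The dict-based tree, its "majority" labels, and the Validation Set
# are all hypothetical.

def classify(node, example):
    while isinstance(node, dict):                # internal node: test an Attribute
        node = node["branches"][example[node["attribute"]]]
    return node                                  # leaf: a Classification

def accuracy(node, data):
    return sum(classify(node, x) == y for x, y in data) / len(data)

def prune(node, tree, validation):
    """Bottom-up: replace a subtree with its Most Common Training
    Classification when that does not hurt Validation Accuracy."""
    if not isinstance(node, dict):
        return
    for child in list(node["branches"].values()):
        prune(child, tree, validation)
    before = accuracy(tree, validation)
    saved = dict(node["branches"])
    # Tentatively collapse this node to a leaf with its majority Classification
    node["branches"] = {v: node["majority"] for v in saved}
    if accuracy(tree, validation) < before:
        node["branches"] = saved                 # pruning hurt Accuracy: undo

# Toy tree: tests Beard; the "No" subtree needlessly tests Height
tree = {"attribute": "Beard", "majority": "Male",
        "branches": {"Yes": "Male",
                     "No": {"attribute": "Height", "majority": "Female",
                            "branches": {"Tall": "Male", "Short": "Female"}}}}

validation = [({"Beard": "No", "Height": "Tall"}, "Female"),
              ({"Beard": "No", "Height": "Short"}, "Female"),
              ({"Beard": "Yes", "Height": "Tall"}, "Male")]

prune(tree, tree, validation)
print(accuracy(tree, validation))  # 1.0 after pruning the Height test
```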

 

  • Reduced Error Pruning Approach, Cont…
  • Strengths 
    • Decision Tree is  pruned  to  remove Overfitting 
  • Weaknesses
    • If a node  close  to the  root node  is removed, then  subtree below it  will be removed
      • Consequently, a  lot of information  may be  lost 

 

  • Example – Effect of Reduced Error Pruning
  • In the previous example, Reduced Error Pruning produces this effect

  • Note that  performance improves  after applying Reduced Error Pruning Approach
  • Rule Post-Pruning Approach
  • Rule Post-Pruning Approach works as follows
  • Step 1: Split data into
    • Training Set
    • Validation Set
    • Testing Set
  • Step 2: Construct / Learn Decision Tree using Training Set
  • Step 3:  Convert  Decision Tree to equivalent  Set of Rules 
  • Step 4: For each  Rule ,  evaluate impact  on Validation Set of  removing that Rule  
    • Remove Rule that  improves  Accuracy on  Validation Set 
  • Step 5: Sort the  final Set of Rules  by their estimated Accuracy, and  use them in this order  when classifying new instances
  • Strengths 
    • Converting Decision Tree to Rules allows  distinguishing different contexts  in which Rules are used 
      • Treat  each path  through Decision Tree  differently  
    • Removes  distinction  between testing nodes,  near root  and those  near leaves  
    • Rules are often  easier for people  to  understand 
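Step 3 above (converting a Decision Tree to an equivalent Set of Rules, one per root-to-leaf path) can be sketched as follows. The dict-based toy tree is hypothetical, not the exact Gender-Identification Model.

```python
# Minimal sketch: converting a Decision Tree into an equivalent Set of Rules.
# Each root-to-leaf path becomes one IF-THEN Rule whose body is a
# Conjunction (AND) of Attribute tests. The toy tree is hypothetical.

def tree_to_rules(node, path=()):
    if not isinstance(node, dict):               # leaf: emit one Rule
        return [(path, node)]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, path + ((node["attribute"], value),))
    return rules

tree = {"attribute": "Beard",
        "branches": {"Yes": "Male",
                     "No": {"attribute": "Height",
                            "branches": {"Tall": "Male", "Short": "Female"}}}}

for conditions, label in tree_to_rules(tree):
    body = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {body} THEN Gender = {label}")
# IF Beard = Yes THEN Gender = Male
# IF Beard = No AND Height = Tall THEN Gender = Male
# IF Beard = No AND Height = Short THEN Gender = Female
```

Each Rule can then be evaluated (and removed or kept) independently on the Validation Set, which is what lets Rule Post-Pruning treat each path through the tree differently.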

 

  • Strengths and Weaknesses – ID3 Algorithm
  • Strengths 
    • Can handle  noisy  Data
    •  Always finds  the Target Function because 
      • ID3 Algorithm  searches  a  complete Hypothesis Space 
  • Weaknesses
    • Cannot handle  very complex Numeric Representation 
  • Note
    •  State-of-the-art  Decision Tree Learning Algorithms (like Random Forest) can handle  simple Numeric Representations 

However, they  fail  to handle  very complex Numeric Representations  (e.g. Activity Detection in Video / Image)

TODO and Your Turn​

Todo Tasks
Your Turn Tasks
Todo Tasks

TODO Task 1

  • Task 1
    • Consider the Titanic Dataset with the following Attributes
      • Gender: Male, Female
      • Ticket Class: Upper, Middle, Lower
      • Parents/Children Aboard: Zero, One, Two, Three
      • Embarked: Cherbourg, Queenstown, Southampton
      • Survival: No, Yes
    • We obtained the following Sample Data

    • Sample Data was split into Training and Testing Data in a 
      • Train-Test Split Ratio of 80%-20%
    • Training Data 

    • Testing Data

    • Note
      • Consider ID3 Algorithm when answering questions given below
      • Your answer should be 
        • Well Justified
    • Questions
      • Write down the Input and Output for the above Machine Learning Problem?
      • How Training Example is represented?
      • How Hypothesis (h) should be represented?
      • Execute the Machine Learning Cycle?
    • In Training Phase
      • Use Information Gain measure to select the best Attribute?
      • What approach you will use to avoid Overfitting during Training?
      • How will you check whether the Model is Overfitting or Not during Training?
    • After Training Phase
      • What approach is  most suitable  to remove Overfitting from  Trained Model 
      • What could be the possible reasons of Overfitting?
      • Write down your observations that you observed during the execution of Machine Learning Cycle? 
Your Turn Tasks

Your Turn Task 1


  • Task 1
    • Select a Machine Learning Problem (similar to Titanic – Machine Learning from Disaster) and answer the questions given below
    • Note
      • Consider ID3 Algorithm in answering all the questions
    • Questions
      • Write down the Input and Output for the selected Machine Learning Problem?
      • How Training Example is represented?
      • How Hypothesis (h) should be represented?
      • Execute the Machine Learning Cycle?
  • In Training Phase
    • Use Information Gain measure to select the best Attribute?
    • What approach you will use to avoid Overfitting during Training?
    • How will you check whether the Model is Overfitting or Not during Training?
  • After Training Phase
    • What approach is  most suitable  to  remove  Overfitting from  Trained Model 
    • What could be the possible reasons of Overfitting?
    • Write down your observations that you observed during the execution of Machine Learning Cycle?

 

Chapter Summary

  • Chapter Summary
  • Main Problems – Candidate Elimination Algorithm
    • Cannot handle noisy Data
    • Cannot handle Numeric Data
    • Cannot work for complex representations of Data
    • It may fail to find the Target Function in the Hypothesis Space
      • i.e. converge to an empty Version Space
  • Proposed Solution
    • Decision Tree Learning Algorithms
  • Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree 
  • Learned Decision Trees (or Models) can also be re-represented as 
    •  Sets of If-Then Rules (to improve human readability)
  • Decision Tree Learning is one of the most popular Inductive Learning Algorithms
  • ID3 Algorithm – Summary
    • Representation of Training Examples (D)
      • Attribute-Value Pair
    • Representation of Hypothesis (h)
      • Disjunction (OR) of Conjunction (AND) of Input Attribute (a.k.a. Decision Tree)
    • Searching Strategy
      • Greedy Search
  • Simple to Complex (Hill Climbing)
    • Training Regime 
      • Batch Method
    • Inductive Bias
      •  Shorter Decision Trees are preferred over Longer Decision Trees
      • Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
    • Strengths
      • Can handle noisy Data
      •  Always finds the Target Function because 
        • ID3 Algorithm searches a complete Hypothesis Space 
    • Weaknesses
      • Cannot handle very complex Numeric Representation 
    • Note
      •  State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations 
        • However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
  • To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
    • Step 1: Write down all the paths in the Decision Tree
    • A path is a Conjunction (AND) of Attributes 
    • Step 2: Join the paths using Disjunction (OR)
    • Decision Tree = Disjunction of Paths
  • In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
    • Instances / Examples are Represented as Attribute–Value Pair 
    • Target Function (f) is Discrete Valued 
    • Disjunctive Hypothesis may be Required
    • Possibly Noisy / Incomplete Training Data
  • ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question 
    • Which Attribute should be tested at the root of the tree?
  • Generally, a statistical test is used to evaluate each Input Attribute to determine 
    • How well it alone classifies the Training Examples?
  • Best Attribute is the one which alone best classifies the Training Examples 
  • Best Attribute is selected and used as the test at the root node of the tree
  • A descendant of the root node is then created for each possible value of this Attribute, and 
    • Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
  • The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
  • This process continues for each new leaf node until either of two conditions is met 
    •  Every Attribute has already been included along this path through the tree, or 
    • Training Examples associated with this leaf node , all have the same Output value (i.e., their Entropy is Zero)
  • ID3 Algorithm forms a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
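The top-down loop summarized above can be sketched compactly. This is a minimal illustration under assumptions: examples are hypothetical dicts of Attribute-Value pairs paired with a label, and the helpers `entropy`, `info_gain`, and `id3` are illustrative, not an industrial implementation.

```python
# Compact sketch of ID3: greedy, top-down, no backtracking.
# Data, attribute names, and helper names are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(data, attribute):
    """Expected reduction in Entropy from partitioning on `attribute`."""
    gain = entropy([y for _, y in data])
    for value in {x[attribute] for x, _ in data}:
        subset = [y for x, y in data if x[attribute] == value]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

def id3(data, attributes):
    labels = [y for _, y in data]
    # Stop: all examples agree (Entropy zero) or no Attributes left on this path
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice, never reconsidered: Best Attribute by Information Gain
    best = max(attributes, key=lambda a: info_gain(data, a))
    node = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        node["branches"][value] = id3(subset, [a for a in attributes if a != best])
    return node

data = [({"Beard": "Yes", "Height": "Tall"}, "Male"),
        ({"Beard": "No", "Height": "Tall"}, "Male"),
        ({"Beard": "No", "Height": "Short"}, "Female")]

tree = id3(data, ["Beard", "Height"])
print(tree["attribute"])  # Height (highest Information Gain on this toy data)
```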


  •  Many measures are available for picking the best classifier Attribute
    • e.g. Information Gain, Gain Ratio etc.
  • Information Gain is a useful measure for picking the best classifier Attribute 
  • Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
  • Information Gain measures 
    •  how well a given Attribute separates Training Examples with respect to their Target Classification 
  • Information Gain is defined in terms of 
    • Entropy 
  • Entropy gives a measure of purity / impurity of a Set of Training Examples
    • Value of Entropy lies between [0 – 1]
      • 0 means the Sample is Pure 
      • 1 means that Sample has maximum impurity
  • Minimum Entropy
    • Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category 
  • Maximum Entropy
    • Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
  • A limitation of Information Gain measure is that it favors attributes with many values over those with few values 
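The Entropy extremes stated above can be checked numerically for binary classification, using the standard definition Entropy(S) = −Σ pᵢ log₂ pᵢ; the 9-positive/5-negative sample is only an illustrative example.

```python
# Checking the minimum and maximum of Entropy for binary classification.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 14))                       # 0.0  -> pure sample (minimum)
print(entropy(["+"] * 7 + ["-"] * 7))            # 1.0  -> half/half (maximum)
print(round(entropy(["+"] * 9 + ["-"] * 5), 2))  # 0.94 -> mixed sample
```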
  • Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
  • Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
    • It only contains Hypotheses with Conjunction relationship between Attributes 
  • ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
    • This means that we cannot 
      • determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
  • ID3 Algorithm incompletely searches a complete Hypothesis Space (H) 
  • Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H) 
  • ID3 Algorithm performs no backtracking 
    • Once an Attribute is selected for testing at a given node, this choice is never reconsidered 
    • Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than the Globally Optimal Solution 
  • Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
  • Preference Bias only affects the order in which Hypotheses are searched 
    • In Preference Bias, Hypothesis Space (H) will contain the Target Function 
  • Restriction Bias affects which Hypotheses are searched
    • In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function 
  • Generally, it is better to choose a Machine Learning Algorithm with Preference Bias rather than Restriction Bias
  •  Some Machine Learning Algorithms may combine Preference and Restriction Biases 
    • e.g. Checker’s Learning Program 
  • Given a hypothesis space H, a hypothesis h ∈ H  overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances 
  • What Causes Overfitting?
      •  Noise in Training Examples 
      • Number of Training Examples is too small to produce a representative sample of Target Function 
  • Why Overfitting is a Serious Problem?
    • Overfitting is a serious problem for many Machine Learning Algorithms 
      • For example 
        • Decision Tree Learning
        • Regular Neural Networks
        • Deep Learning Algorithm etc. 
  • Overfitting is a real problem for Decision Tree Learning 
  • For Decision Tree Learning, one empirical study showed that for a range of tasks there was a
    •  Decrease of 10% – 25% in Accuracy due to Overfitting
  • The simple ID3 Algorithm can produce 
    • Decision Trees that overfit the Training Examples
  • Two general approaches to avoid Overfitting in Decision Tree Learning are 
    •  Stop growing Decision Tree before perfectly fitting Training Examples 
      • e.g. when Data Split is not statistically significant
    •  Grow full Decision Tree, then prune afterwards
  • In practice
    • Second approach has been more successful 
  •  Most Common Approach to avoid Overfitting is
    • Training and Validation Set Approach 
  • Using Training and Validation Set Approach, Sample Data is split into three sets
    • Training Set
      • is used to build the Model 
    • Validation Set
      • is used to check whether the Model is Overfitting or not during Training 
    • Testing Set
      • is used to evaluate the performance of the Model 
  • Holding Data back for a Validation Set reduces Data available for Training
  • During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting 
  • During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting 
  • Two approaches for Decision Tree Pruning to avoid Overfitting are
    • Reduced Error Pruning Approach
    • Rule Post-Pruning Approach
      • Rule Post-Pruning Approach is better than Reduced Error Pruning Approach

 

In Next Chapter

  • In Next Chapter
  • In Sha Allah, in next Chapter, I will present
  • Artificial Neural Network