# Chapter 12 - Bayesian Learning

### Chapter Outline


**Quick Recap**

**Bayes Theorem and Bayesian Learning Algorithms**

**Bayes Optimal Classifier**

**Gibbs Algorithm**

**Naive Bayes Algorithm**

**Chapter Summary**

## Quick Recap

- Quick Recap – Artificial Neural Networks


**Artificial Neural Networks – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Combination of Weights between Units**

**Weights are Numeric values**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Batch Method**

**Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)**

**ANNs are suitable for those Machine Learning Problems in which**

**Training Examples (both Input and Output) can be represented as real values (Numeric Representation)**

**Hypothesis (h) can be represented as real values (Numeric Representation)**

**Slow Training Times are OK**

**Predictive Accuracy is more important than understanding**

**When we have noise in Training Data**

**The diagram below shows the General Architecture of Artificial Neural Networks**

- ANNs mainly have three Layers

**Input Layer**

**Hidden Layer**

**Output Layer**

**Each Layer contains one or more Units**

**Main Parameters of ANNs are**

**No. of Input Units**

**No. of Hidden Layers**

**No. of Hidden Units at each Hidden Layer**

**No. of Output Units**

**Mathematical Function at each Hidden and Output Unit**

**Weights between Units**

**ANN will be Fully Connected or Not?**

**Learning Rate**

**ANNs can be broadly categorized based on**

**Number of Hidden Layers in an ANN Architecture**

**Two Layer Neural Networks (a.k.a. Perceptron)**

**Number of Hidden Layers = 0**

**Multi-layer Neural Networks**

**Regular Neural Network**

**Number of Hidden Layers = 1**

**Deep Neural Network**

**Number of Hidden Layers > 1**

**A Perceptron is a simple Two Layer ANN, with**

**One Input Layer**

**Multiple Input Units**

**Output Layer**

**Single Output Unit**

**A Perceptron can be used for Binary Classification Problems**

**We can use Perceptrons to build larger Neural Networks**

**A Perceptron has limited learning abilities, i.e., it fails to learn some simple Boolean-valued Functions (e.g., XOR)**

**A Perceptron works as follows**

**Step 1: Random Weights are assigned to edges between Input-Output Units**

**Step 2: Input is fed into Input Units**

**Step 3: Weighted Sum of Inputs (S) is calculated**

**Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at the Output Unit**

**Step 5: Mathematical Function calculates the Output of the Perceptron**
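To make Steps 1-5 concrete, here is a minimal Python sketch of a Perceptron's forward pass. The step (threshold) activation, the input values, and the random seed are illustrative assumptions, not taken from the chapter.

```python
# A minimal sketch of a Perceptron's forward pass (Steps 1-5 above).
import random

def perceptron_predict(inputs, weights, bias):
    # Step 3: Weighted Sum of Inputs, S = sum(w_i * x_i) + bias
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Steps 4-5: threshold (step) function at the Output Unit
    return 1 if s > 0 else 0

# Step 1: random Weights on the edges between Input and Output Units
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(3)]
bias = random.uniform(-1, 1)

# Step 2: feed an input vector into the Input Units
x = [1.0, 0.0, 1.0]
print(perceptron_predict(x, weights, bias))  # 0 or 1 (Binary Classification)
```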

**Before using ANN to build a Model**

**Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)**

**Perceptron – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Perceptrons can learn Binary Classification Problems**

**Perceptrons can be combined to make larger ANNs**

**Perceptrons can learn linearly separable functions**

**Weaknesses**

**Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable**

**A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer**

**A Regular Neural Network can be used for both**

**Binary Classification Problems and**

**Multi-class Classification Problems**

**Regular Neural Network – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Can learn both linearly separable and non-linearly separable Target Functions**

**Can handle noisy Data**

**Can learn Machine Learning Problems with very complex Numerical Representations**

**Weaknesses**

**Computational Cost and Training Time are high**

**A Regular Neural Network works as follows**

**Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units**

**Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)**

**Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)**

**Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 5: Weighted Sum of Inputs (S) is calculated for each Hidden Unit and fed to the Output Units (making up the Output Layer)**

**Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 7: Softmax Layer converts the values at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network**
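As a concrete illustration of Steps 1-7, the sketch below runs one forward pass through a Regular Neural Network with a single Hidden Layer. The sigmoid Hidden Units, layer sizes, and input values are illustrative assumptions.

```python
# A minimal sketch of the forward pass of a Regular Neural Network
# (Steps 1-7 above): one Hidden Layer and a Softmax over the Output Units.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def softmax(v):
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, w_ih, w_ho):
    # Steps 3-4: weighted sums into the Hidden Units, then the
    # Mathematical Function (sigmoid) at each Hidden Unit
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    # Steps 5-6: weighted sums into the Output Units
    outputs = [sum(w * h for w, h in zip(ws, hidden)) for ws in w_ho]
    # Step 7: Softmax converts Output-Unit values into probabilities
    return softmax(outputs)

random.seed(0)
n_in, n_hidden, n_out = 4, 3, 2          # Step 1: random weights
w_ih = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_ho = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

probs = forward([0.5, 0.1, 0.9, 0.0], w_ih, w_ho)   # Step 2: feed inputs
print(probs, "-> predicted class:", probs.index(max(probs)))
```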

**The Regular Neural Network Training Rule is used to tweak Weights, when**

**Actual Value is different from Predicted Value**

**How Regular Neural Network Training Rule Works?**

**Step 1: For a given Training Example E, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) to 1 and**

**set the Target Output Values of the remaining Output Units to 0**

**Step 2: Train the Regular Neural Network using Training Example E and get the Prediction (Observed Output o(E)) from the Regular Neural Network**

**Step 3: If Target Output t(E) is different from Observed Output o(E)**

**Calculate the Network Error**

**Backpropagate the Error by updating weights between Output-Hidden Units and Hidden-Input Units**

**Step 4: Keep training the Regular Neural Network until the Network Error becomes very small**
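The sketch below turns Steps 1-4 into code for a single Training Example, using the standard sigmoid backpropagation update. The learning rate, layer sizes, and the use of sigmoid (rather than Softmax) Output Units are simplifying assumptions.

```python
# A minimal sketch of the Training Rule for one Hidden Layer and a
# single Training Example E; sigmoid units throughout.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

random.seed(0)
lr = 0.5
x = [0.5, 0.1, 0.9]                 # Training Example E: inputs
t = [1.0, 0.0]                      # Step 1: targets (class unit = 1, rest 0)
w_ih = [[random.uniform(-1, 1) for _ in x] for _ in range(3)]
w_ho = [[random.uniform(-1, 1) for _ in range(3)] for _ in t]

for _ in range(1000):               # Step 4: train until the error is small
    # Step 2: forward pass -> Observed Output o(E)
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_ho]
    # Step 3: error terms, then backpropagate them
    delta_o = [oi * (1 - oi) * (ti - oi) for oi, ti in zip(o, t)]
    delta_h = [hi * (1 - hi) * sum(w_ho[k][j] * delta_o[k]
                                   for k in range(len(o)))
               for j, hi in enumerate(h)]
    for k in range(len(o)):          # update Output-Hidden weights
        for j in range(len(h)):
            w_ho[k][j] += lr * delta_o[k] * h[j]
    for j in range(len(h)):          # update Hidden-Input weights
        for i in range(len(x)):
            w_ih[j][i] += lr * delta_h[j] * x[i]

print("outputs after training:", o)  # close to the targets [1, 0]
```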

**Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)**

**Four Possible Solutions to overcome Overfitting in ANNs are as follows**

**Training and Validation Set Approach**

**Learn Multiple ANNs**

**Momentum**

**Weight Decay Factor**

**ANNs – Strengths and Weaknesses**

**Strengths**

**Can learn problems with very complex Numerical Representations**

**Can handle noisy Data**

**Execution time in Application Phase is fast**

**Good for Machine Learning Problems in which**

**Both Training Examples and Hypothesis (h) have Numeric Representation**

**Weaknesses**

**Require a lot of Training Time (particularly Deep Learning Models)**

**Computational cost for ANNs (particularly Deep Learning Models) is high**

**Overfitting is a serious problem in ANNs**

**ANNs either reject or accept a Hypothesis (h) during Training, i.e., they take a Binary Decision**

**Accept a Hypothesis (h), if it is consistent with the Training Example**

**Reject a Hypothesis (h), if it is not consistent with the Training Example**

### Quick Recap - Models

- Learning Input-Output Functions – General Settings

**Input to Learner**

**Set of Training Examples (D)**

**Set of Functions / Hypothesis (H)**

**Output of Learner**

**a Hypothesis h from H, which best fits the Training Examples (D)**

- Learning is a Searching Problem

**For any Machine Learning Algorithm, we should have a clear understanding of the following three points**

**Representation of Training Examples (D)**

**How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from?**

**Representation of Hypothesis (h)**

**How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?**

**Searching Strategy**

**What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?**

- Summary - FIND-S Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the Search Strategy of the FIND-S Algorithm. Please drop me an email if you know it.**

**Training Regime**

**Incremental Method**

- Summary – List Then Eliminate Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**Sequential Search**

**Training Regime**

**Incremental Method**

- Summary – Candidate Elimination Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the Search Strategy of the Candidate Elimination Algorithm. Please drop me an email if you know it. JazakAllah Khair (may Allah reward you with goodness).**

**Training Regime**

**Incremental Method**

- Summary – ID3 Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Disjunction (OR) of Conjunction (AND) of Input Attributes (a.k.a. Decision Tree)**

**Searching Strategy**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Training Regime**

**Batch Method**

- Summary - Symbolic Representations

**The following Machine Learning Algorithms are based on Symbolic Representation**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm**

**Symbolic Representation – Representing Training Examples**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Symbolic Representation – Representing Hypothesis (h)**

**Two Types of Hypothesis (h) Representations**

**Conjunction (AND) of Constraints on Input Attributes**

**Disjunction (OR) of Conjunction (AND) of Input Attributes**

**Note that both types of Hypothesis (h) Representations are Symbolic**

**i.e., based on Symbols (Categorical Values)**

- Problem with Symbolic Representations

**Problem**

**Cannot handle Machine Learning Problems with very complex Numeric Representations**

**Solution**

**Non-Symbolic Representations**

**For example, Artificial Neural Networks**

- Non-Symbolic Representations - Artificial Neural Networks

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Combination of Weights between Units**

**Note**

**Weights are Numeric values**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Batch Method**

- Comparison of Symbolic and Non-Symbolic Representation

**Representation of Training Examples (D)**

**Symbolic Representations**

**Input**

**Categorical**

**Output**

**Categorical**

**Non-Symbolic Representations**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Symbolic Representations**

**Categorical Values are used**

**Non-Symbolic Representations**

**Numeric Values are used**

**Conclusion**

**The major difference between Symbolic and Non-Symbolic Representation is that**

**in Symbolic Representations, both Training Example (d) and Hypothesis (h) are represented using**

**Categorical Values**

**in Non-Symbolic Representations, both Training Example (d) and Hypothesis (h) are represented using**

**Numerical Values**


- Major Problems in ML Algorithms Studied So Far

**Alhamdulillah (praise be to Allah), up till now we have studied the following Machine Learning Algorithms**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm**

**Perceptron Algorithm (Two Layer ANN)**

**Regular Neural Network (Multi-layer ANN)**

**Major Problems**

**Problem 01**

**These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**Problem 02**

**These ML Algorithms either accept or reject a Hypothesis (h), i.e., take a Binary Decision**

**instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**A Possible Solution**

**Bayesian Learning Algorithms**

**In sha Allah (God willing), in this Chapter, I will try to explain how to use Bayesian Learning Algorithms for Machine Learning Problems**

- Probabilistic ML Algorithms - Naïve Bayes

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical / Numeric**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Probability Distribution**

**Searching Strategy**

**I am not clear about the Search Strategy of the Naïve Bayes Algorithm. Please drop me an email if you know it.**

**Training Regime**

**Batch Method**

## Bayes Theorem and Bayesian Learning Algorithms

- Bayesian Learning

**Definition**

**Bayesian Learning uses Bayes' Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)**

**Note**

**Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms**

**Importance**

**Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)**

**Applications**

**Categorizing News Articles**

**Email Spam Detection**

**Face Recognition**

**Sentiment Analysis**

**Medical Diagnosis**

**Character Recognition**

**Weather Prediction**

**and many more 😊**

- Suitable Problems for Bayesian Learning Algorithms

**Bayesian Learning Algorithms are suitable for the following Machine Learning Problems**

**When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)**

**When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**When you have noisy Data**

**When you have small Training Data**

**In recent years, Bayesian Learning Algorithms have been used in Deep Learning, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**When you want to make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

- Bayes Theorem

*P(h|D) = P(D|h) P(h) / P(D)*

*P(h)* = Prior Probability of Hypothesis *h*

*P(D)* = Prior Probability of Training Examples *D*

*P(D|h)* = Probability of *D* given *h*

*P(h|D)* = Probability of *h* given *D*
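A minimal helper that evaluates Bayes' Theorem from the four quantities defined above; the function name is an illustrative choice, not from the chapter.

```python
# Bayes' Theorem: P(h|D) = P(D|h) * P(h) / P(D)
def posterior(p_d_given_h, p_h, p_d):
    """Return P(h|D) from the likelihood P(D|h), prior P(h), and evidence P(D)."""
    return p_d_given_h * p_h / p_d
```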

- Example – Bayes Theorem

**Consider the following Example: a patient takes a Laboratory Test for cancer, and the result comes back Positive. The Test returns a correct Positive result in 98% of the cases in which cancer is actually present, and a correct Negative result in 97% of the cases in which cancer is not present. Furthermore, 0.008 of the entire population have cancer.**

**In the above Example**

**Hypothesis Space (H)**

**H = {cancer, ¬cancer}**

**H = {Patient has Cancer, Patient does not have Cancer}**

**Set of Training Examples (D) – Outcome of Laboratory Test**

**D = {Positive, Negative}**

**P(h) = Prior Probability of Hypothesis h**

**h = cancer = Patient has Cancer**

**P(h) = P(cancer) = P(Patient has Cancer) = 0.008**

**i.e., P(h) = 0.008**

**P(D) = Prior Probability of Training Data (D)**

**D = Positive Outcome of Laboratory Test**

**P(D) = (0.008 × 0.98) + (0.992 × 0.03) = 0.0376**

*P(D|h)* = Probability of a Positive Outcome of the Laboratory Test given that a Patient has Cancer

*P(D|h)* = 0.98

**Calculate** *P(h|D)* **using Bayes Theorem**

*P(h|D)* = Probability of the Patient having Cancer given a Positive Outcome of the Laboratory Test

*P(h|D)* = (0.98 × 0.008) / 0.0376 = 0.0078 / 0.0376 ≈ 0.21
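The example can be checked numerically; the sketch below reproduces the P(D) and P(h|D) calculations from the values given above.

```python
# Working the cancer example numerically (values from the text).
p_cancer = 0.008                     # P(h): prior probability of cancer
p_pos_given_cancer = 0.98            # P(D|h)
p_pos_given_not_cancer = 0.03
p_not_cancer = 1 - p_cancer          # 0.992

# P(D): total probability of a Positive Laboratory Test
p_pos = p_cancer * p_pos_given_cancer + p_not_cancer * p_pos_given_not_cancer
print(round(p_pos, 4))               # 0.0376

# P(h|D): probability of cancer given a Positive test
print(round(p_pos_given_cancer * p_cancer / p_pos, 4))   # ~0.2085
```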

- Recall

**Input to Learner**

**Set of Training Examples (D)**

**Hypothesis Space (H)**

**Output by Learner**

**Find a Hypothesis h from H, which best fits the Training Data**

- Brute Force Bayesian Learning Algorithm

**A Brute Force Bayesian Learning Algorithm works as follows**

**Step 1: Compute the Posterior Probability P(h|D) for all the Hypotheses (h) in Hypothesis Space (H) using Bayes' Theorem**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

- Example 01 - Brute Force Bayesian Learning Algorithm

**Consider a Hypothesis Space (H) of three Hypotheses (h)**

**H = {h₁, h₂, h₃}**

**Finding the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H) using Bayes' Theorem**

**P(h₁|D) = 0.58**

**P(h₂|D) = 0.48**

**P(h₃|D) = 0.79**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**h_MAP = h₃**

- Example 02 - Brute Force Bayesian Learning Algorithm

**Consider the following Machine Learning Problem**

**In above learning problem**

**Hypothesis Space (H)**

**H = {cancer, ¬cancer}**

**H = {Patient has Cancer, Patient does not have Cancer}**

**Set of Training Examples (D) – Outcome of Laboratory Test**

**D = {Positive, Negative}**

**From the Machine Learning Problem statement, we see that in general**

*P(cancer)* = 0.008, *P(¬cancer)* = 0.992

*P(+|cancer)* = 0.98, *P(−|cancer)* = 0.02

*P(+|¬cancer)* = 0.03, *P(−|¬cancer)* = 0.97

**Given that a patient returns a Positive Laboratory Test**

**What is the MAP Hypothesis, h_MAP?**

**Finding the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H)**

**P(h|D) = P(D|h) P(h) / P(D)**

**Since P(D) is the same for all Hypotheses (h) in the Hypothesis Space (H)**

**P(h|D) ∝ P(D|h) P(h)**

**Computing Posterior Probabilities**

**P(cancer|+)**

**P(cancer|+) ∝ P(+|cancer) P(cancer)**

**P(cancer|+) ∝ 0.98 × 0.008**

**P(cancer|+) ∝ 0.0078**

**P(¬cancer|+)**

**P(¬cancer|+) ∝ P(+|¬cancer) P(¬cancer)**

**P(¬cancer|+) ∝ 0.03 × 0.992**

**P(¬cancer|+) ∝ 0.0298**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**h_MAP = ¬cancer (Patient does not have Cancer)**
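A minimal sketch of this brute-force computation for Example 02: it scores every Hypothesis with the unnormalized posterior P(D|h)P(h) and outputs h_MAP.

```python
# Brute Force MAP for the cancer example (values from the text).
priors      = {"cancer": 0.008, "not_cancer": 0.992}
likelihoods = {"cancer": 0.98,  "not_cancer": 0.03}   # P(+ | h)

# Step 1: unnormalized posterior P(D|h) * P(h) for every h in H
posteriors = {h: likelihoods[h] * priors[h] for h in priors}
print(posteriors)                    # {'cancer': 0.00784, 'not_cancer': 0.02976}

# Step 2: output the Most Probable Hypothesis
h_map = max(posteriors, key=posteriors.get)
print(h_map)                         # not_cancer
```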

- Bayes Theorem and MAP/ML Hypotheses

**Bayes Theorem**

**Generally, we want the Most Probable Hypothesis given the Training Examples**

**i.e., the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

- Bayes Theorem and MAP/ML Hypotheses, Cont…

**In some cases, it may be useful to assume that**

**all Hypotheses (h) in Hypothesis Space (H) have the same Prior Probability**

*i.e., P(hᵢ) = P(hⱼ) for all hᵢ, hⱼ ∈ H*

**In such cases, we only consider P(D|h) to find the Most Probable Hypothesis**

**P(D|h) is called the Likelihood of D given h**

**Any hypothesis that maximizes P(D|h) is called a Maximum Likelihood (ML) Hypothesis**

*h_ML = argmax P(D|h), over h ∈ H*

- Strengths and Weaknesses - Brute Force Bayesian Learning Algorithm

**Strengths**

**The Brute Force Bayesian Learning Algorithm returns the h_MAP hypothesis, which can be used to make predictions on unseen data**

**Weaknesses**

**It is computationally expensive, because we have to**

**calculate the Posterior Probability for all Hypotheses (h) in Hypothesis Space (H)**

### TODO and Your Turn


Todo Tasks

**TODO Task 1**

**Task 1**

**Consider the following Machine Learning Problems**

**Fake Video Detection**

**Fake News Detection**

**Automatic Paraphrase Generation**

**Language Detection from Speech**

**Language Detection from Text**

**Language Detection from Image**

**Note**

**Consider the following Machine Learning Algorithms when answering the questions given below**

**Decision Tree Learning**

**Artificial Neural Networks**

**Bayesian Learning Algorithms**

**Your answer should be**

**Well Justified**

**Questions**

**Write the Input and Output for each Machine Learning Problem**

**Which Machine Learning Problem is most complex?**

**Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?**

Your Turn Tasks

**Your Turn Task 1**

**Task 1**

**Select five Real-world Problems which can be treated as Machine Learning Problems and answer the questions given below**

**Note**

**Consider the following Machine Learning Algorithms when answering the questions given below**

**Decision Tree Learning**

**Artificial Neural Networks**

**Bayesian Learning Algorithms**

**Questions**

**Write the Input and Output for each selected Machine Learning Problem**

**Which Machine Learning Problem is most complex?**

**Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?**

## Bayes Optimal Classifier

- Most Probable Classification

**Question**

**Given an unseen instance x, what is the Most Probable Classification?**

**A Possible Answer**

**Use the Bayes Optimal Classifier to find out the Most Probable Classification**

- Bayes Optimal Classifier

**The Bayes Optimal Classifier works as follows**

**The Bayes Optimal Classifier computes the Most Probable Classification using the following Formula**

*v_OB = argmax Σ P(vⱼ|hᵢ) P(hᵢ|D), over vⱼ ∈ V, where the sum runs over all hᵢ ∈ H*

**So, the Most Probable Classification for a new instance is the value vⱼ which maximizes P(vⱼ|D)**
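A minimal sketch of the formula above; the dictionary shapes of `posteriors` and `predictions` are assumptions for illustration, not an API from the chapter.

```python
# Bayes Optimal Classifier: for each class value v_j, sum
# P(v_j|h_i) * P(h_i|D) over all hypotheses; return the best value.
def bayes_optimal(posteriors, predictions, values):
    """posteriors:  {h: P(h|D)}
       predictions: {h: {v: P(v|h)}}
       values:      iterable of possible class values v_j"""
    score = {v: sum(posteriors[h] * predictions[h][v] for h in posteriors)
             for v in values}
    return max(score, key=score.get), score
```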

- Example – Bayes Optimal Classifier

**Machine Learning Problem**

**Gender Identification**

**Given**

**A Hypothesis Space (H) of three Hypotheses (h)**

**H = {h₁, h₂, h₃}**

**Note**

**vⱼ ∈ {Male, Female}**

**Task**

**What is the Most Probable Classification (Male or Female) of an unseen instance x?**

**Applying the Bayes Optimal Classifier to find the Most Probable Classification (Male or Female) for x**

**Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H using Bayes' Theorem**

**P(h₁|D) = 0.4**

**P(h₂|D) = 0.3**

**P(h₃|D) = 0.3**

**Step 2: Calculate the prediction of each Hypothesis (h) for unseen instance x**

**h₁ predicts unseen instance x as Female**

**h₂ predicts unseen instance x as Male**

**h₃ predicts unseen instance x as Male**

**Summarizing Steps 1 and 2**

**P(h₁|D) = 0.4, P(Male|h₁) = 0, P(Female|h₁) = 1**

**P(h₂|D) = 0.3, P(Male|h₂) = 1, P(Female|h₂) = 0**

**P(h₃|D) = 0.3, P(Male|h₃) = 1, P(Female|h₃) = 0**

**Step 3: Use the combination of Steps 1 and 2 to classify unseen instance x**

**Σ P(Male|hᵢ) P(hᵢ|D) = 0.3 + 0.3 = 0.6 and Σ P(Female|hᵢ) P(hᵢ|D) = 0.4**

**Most Probable Classification of unseen instance x is Male**
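To check Steps 1-3 numerically, here is the gender example worked in code, using the posteriors and per-hypothesis predictions listed above.

```python
# Each hypothesis votes with weight P(h|D); the sums decide the class.
posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"Male": 0, "Female": 1},
               "h2": {"Male": 1, "Female": 0},
               "h3": {"Male": 1, "Female": 0}}

score = {v: sum(posteriors[h] * predictions[h][v] for h in posteriors)
         for v in ("Male", "Female")}
print(score)                         # {'Male': 0.6, 'Female': 0.4}
print(max(score, key=score.get))     # Male
```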

- hMAP and Most Probable Classification

**Question**

**Is h_MAP(x) the Most Probable Classification?**

**where x is an unseen instance**

**Answer**

**No**

- Example - hMAP and Most Probable Classification

**Consider the Gender Identification Problem discussed in previous Slides to explain the working of Bayes Optimal Classifier**

**Calculating h_MAP**

**P(h₁|D) = 0.4**

**P(h₂|D) = 0.3**

**P(h₃|D) = 0.3**

**h_MAP = h₁**

**Classifying unseen instance x**

**Choice 01 – Use h_MAP**

**Prediction = Female**

**Choice 02 – Use Bayes Optimal Classifier**

**Prediction = Male**

**Conclusion**

**The Prediction returned by h_MAP(x) is not always the Most Probable Classification**

- Strengths and Weaknesses – Bayes Optimal Classifier

**Strengths**

**Predicts the Most Probable Classification for unseen data**

**Weaknesses**

**Computationally very expensive, because you need to do the following calculations**

**Step 1: Calculate Posterior Probability for every Hypothesis (h) in Hypothesis Space (H)**

**Step 2: Calculate the prediction of every Hypothesis (h) in Hypothesis Space (H)**

**Step 3: Combine Steps 1 and 2 to calculate the Most Probable Classification for unseen instances**

## Gibbs Algorithm

- Problem in Bayes Optimal Classifier

**Problem**

**Bayes Optimal Classifier is computationally expensive**

**A Possible Solution**

**Use Gibbs Algorithm**

- Gibbs Algorithm

**The Gibbs Algorithm works as follows**

**Step 1: Choose one Hypothesis (h) at random, according to the Posterior Probability Distribution P(h|D)**

**Step 2: Use this h to classify new unseen instances**

**Surprising fact**

**The Expected Error of the Gibbs Algorithm is no worse than twice the Expected Error of the Bayes Optimal Classifier**
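A minimal sketch of the two steps, reusing the gender example's posteriors; `random.choices` draws the hypothesis in proportion to P(h|D).

```python
# Gibbs Algorithm: sample one hypothesis by P(h|D), then classify with it.
import random

posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # from the earlier example
predictions = {"h1": "Female", "h2": "Male", "h3": "Male"}

hyps, weights = zip(*posteriors.items())
h = random.choices(hyps, weights=weights, k=1)[0]  # Step 1: random draw
print(predictions[h])                              # Step 2: classify x with h
```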

### TODO and Your Turn


Todo Tasks

**TODO Task 2**

**Task 1**

**Consider a Machine Learning Problem that has three possible outcomes: child, youth and old, and a Hypothesis Space (H) containing 8 possible hypotheses h₁, h₂, h₃, h₄, h₅, h₆, h₇ and h₈. D is the Set of Training Examples available to train the Model.**

**P(h₁|D) = 0.3, P(h₂|D) = 0.4, P(h₃|D) = 0.3, P(h₄|D) = 0.2, P(h₅|D) = 0.1, P(h₆|D) = 0.4, P(h₇|D) = 0.5, P(h₈|D) = 0.3**

**Given a new instance x such that**

**h₁(x) = child, h₂(x) = youth, h₃(x) = old, h₄(x) = old, h₅(x) = youth, h₆(x) = child, h₇(x) = old, h₈(x) = youth**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**What is the Most Probable Classification of unseen instance x?**

**Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?**

**Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?**

Your Turn Tasks

**Your Turn Task 2**

**Task 1**

**Create a scenario similar to the Task given in TODO and answer the questions given below.**

**Questions**

**What is the Most Probable Classification of unseen instance x?**

**Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?**

**Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?**

## Naïve Bayes Algorithm

- Naive Bayes Algorithm

**A highly practical Bayesian Learning Algorithm**

**Performance comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks**

**Naïve Bayes is very good for**

**Text Classification**

**Assume a Target Function f : X → V, where each instance x in X is described by attributes (a₁, a₂, …, aₙ)**

**The Most Probable Value v_MAP of f(x) is:**

*v_MAP = argmax P(vⱼ | a₁, a₂, …, aₙ) = argmax P(a₁, a₂, …, aₙ | vⱼ) P(vⱼ), over vⱼ ∈ V*

**P(a₁, a₂, …, aₙ | vⱼ) is impractical to compute unless we have a very large Training Set**

**Number of such terms = Number of Possible Instances × Number of Possible Target Values**

**Example: 20 attributes, 5 values each, 2 target values**

**5²⁰ × 2 = 190,734,863,281,250**

**Reliability of Probability Estimates is low, unless**

**each instance is seen many times**

**Key Assumption of Naive Bayes**

**Attribute Values are conditionally independent given the Target Value, i.e., we assume**

*P(a₁, a₂, …, aₙ | vⱼ) = Πᵢ P(aᵢ | vⱼ)*

**Making this assumption leads to the Naive Bayes Classifier**

**Naive Bayes Algorithm**

*v_NB = argmax P(vⱼ) Πᵢ P(aᵢ | vⱼ), over vⱼ ∈ V*
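A minimal sketch of the Naive Bayes formula above; the dictionary shapes are assumptions for illustration, and the probability tables would come from the Training Phase described in the next section.

```python
# Naive Bayes Classifier: v_NB = argmax P(v_j) * product of P(a_i | v_j).
from math import prod

def naive_bayes(instance, class_priors, cond_probs):
    """instance:     {attribute: value}
       class_priors: {v: P(v)}
       cond_probs:   {v: {attribute: {value: P(attribute=value | v)}}}"""
    scores = {v: class_priors[v] *
                 prod(cond_probs[v][a][val] for a, val in instance.items())
              for v in class_priors}
    return max(scores, key=scores.get), scores
```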

### Machine Learning Cycle – Naïve Bayes Algorithm

- Gender Identification Problem

**Machine Learning Problem**

**Gender Identification**

**Input**

**Human**

**Output**

**Gender of a Human**

**Task**

**Given a Human (Input), predict the Gender of the Human (Output)**

**Treated as**

**Learning Input-Output Function**

**i.e. Learn from Input to predict Output**

- Representation of Examples

**Example = Input + Output**

**Input**

**Human**

**Output**

**Gender**

**Representation of Input and Output**

**Attribute-Value Pair**

- Representation of Input

**Input is represented as a**

**Set of 4 Input Attributes**

**Height**

**Weight**

**HairLength**

**HeadCovered**

- Representation of Output

**Output is represented as a**

**Set of 1 Output Attribute**

**Gender**

**Note**

**Yes means Female and No means Male**

- Sample Data

**We obtained Sample Data of 21 examples**

- Split the Sample Data

**We split the Sample Data using a Random Split Approach into**

**Training Data – 2/3 of Sample Data**

**Testing Data – 1/3 of Sample Data**

- Training Data

- Testing Data

- Machine Learning Cycle

**Four phases of a Machine Learning Cycle are**

**Training Phase**

**Build the Model using Training Data**

**Testing Phase**

**Evaluate the performance of the Model using Testing Data**

**Application Phase**

**Deploy the Model in the Real-world, to make predictions on Real-time unseen Data**

**Feedback Phase**

**Take Feedback from the Users and Domain Experts to improve the Model**

- Training Phase – Naïve Bayes Algorithm

**Compute Probabilities**
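A sketch of how these probabilities could be estimated by counting over the Training Data. Since the chapter's Training Data table is not reproduced here, the three examples below are hypothetical stand-ins with the same attributes; no smoothing is applied, so unseen attribute values get probability 0.

```python
# Training Phase: estimate P(v) and P(a_i | v) by counting.
from collections import Counter, defaultdict

ATTRS = ["Height", "Weight", "HairLength", "HeadCovered"]
train = [  # (Height, Weight, HairLength, HeadCovered) -> Gender
    (("Tall", "Heavy", "Short", "No"),   "Male"),
    (("Short", "Light", "Long", "Yes"),  "Female"),
    (("Normal", "Normal", "Long", "No"), "Female"),
]

class_counts = Counter(label for _, label in train)
cond_counts = defaultdict(Counter)           # (class, attr) -> value counts
for features, label in train:
    for attr, value in zip(ATTRS, features):
        cond_counts[(label, attr)][value] += 1

n = len(train)
class_priors = {v: c / n for v, c in class_counts.items()}
cond_probs = {v: {a: {val: cnt / class_counts[v]
                      for val, cnt in cond_counts[(v, a)].items()}
                  for a in ATTRS}
              for v in class_counts}
print(class_priors)   # e.g. {'Male': 0.333..., 'Female': 0.666...}
```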

- Summary – Training Phase of Naïve Bayes Algorithm

**Recall**

**Training Data**

**Model**

**In the next Phase, i.e., the Testing Phase, Insha Allah (God willing), we will**

**Evaluate the performance of the Model**

- Testing Phase – Naïve Bayes Algorithm

**Question**

**How well has the Model learned?**

**Answer**

**Evaluate the performance of the Model on unseen data (or Testing Data)**

- Evaluation Measures

**Evaluation will be carried out using**

**Error measure**

- Error

**Definition**

**Error is defined as the proportion of incorrectly classified Test instances**

**Formula**

*Error = (Number of incorrectly classified Test instances) / (Total number of Test instances)*

**Note**

**Accuracy = 1 – Error**
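The Error measure as code; the prediction and actual lists are illustrative values, not the chapter's results.

```python
# Error: proportion of incorrectly classified Test instances.
def error(predictions, actuals):
    wrong = sum(p != a for p, a in zip(predictions, actuals))
    return wrong / len(actuals)

preds   = ["Female", "Male", "Female", "Female"]
actuals = ["Female", "Male", "Male",   "Female"]
e = error(preds, actuals)
print(e, 1 - e)    # 0.25 0.75  (Error, Accuracy)
```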

- Evaluate Model (Naïve Bayes)

**Test Example x₁₅ = <Tall, Heavy, Long, No>**

**Compute the Most Probable Classification for Test Example x₁₅ using the Naïve Bayes Model**

**Note: vⱼ ∈ {Male, Female}**

**Using Naive Bayes, we need to compute**

*v_NB = argmax P(vⱼ) P(Height = Tall | vⱼ) P(Weight = Heavy | vⱼ) P(HairLength = Long | vⱼ) P(HeadCovered = No | vⱼ), over vⱼ ∈ {Male, Female}*

**Prediction: v_NB = Female**

**Test Example x₁₆ = <Short, Normal, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₁₇ = <Short, Normal, Long, Yes>**

**Prediction: v_NB = Male**

**Test Example x₁₈ = <Normal, Normal, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₁₉ = <Short, Light, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₂₀ = <Normal, Light, Short, Yes>**

**Prediction: v_NB = Female**

**Test Example x₂₁ = <Tall, Light, Short, No>**

**Prediction: v_NB = Female**

- Summary - Evaluate Model (Naïve Bayes)

- Application Phase

**We assume that our Model**

**performed well on large Test Data and can be deployed in the Real-world**

**The Model is deployed in the Real-world and now we can make**

**predictions on Real-time Data**

- Steps – Making Predictions on Real-time Data

**Step 1: Take Input from User**

**Step 2: Convert User Input into a Feature Vector**

**Exactly the same as the Feature Vectors of Training and Testing Data**

**Step 3: Apply the Model on the Feature Vector**

**Step 4: Return the Prediction to the User**

- Making Predictions on Real-time Data

**Step 1: Take input from User**

**Step 2: Convert User Input into Feature Vector**

**Note that order of Attributes must be exactly same as that of Training and Testing Examples**

**Step 3: Apply Model on Feature Vector**

**x = <Height = Short, Weight = Heavy, HairLength = Long, HeadCovered = Yes>**

**Note: vⱼ ∈ {Male, Female}**

**Using Naive Bayes, we need to compute**

*v_NB = argmax P(vⱼ) P(Height = Short | vⱼ) P(Weight = Heavy | vⱼ) P(HairLength = Long | vⱼ) P(HeadCovered = Yes | vⱼ), over vⱼ ∈ {Male, Female}*

**Using these estimates we can now compute v_NB**

**v_NB = Male**

**Step 4: Return the Prediction to the User: Male**

**Note: You can take Input from the user, apply the Model, and return Predictions as many times as you like 😊**

- Feedback Phase

**Only Allah is Perfect 😊**

**Take Feedback on your deployed Model from**

**Domain Experts and**

**Users**

**Improve your Model based on the Feedback 😊**

- Strengths and Weaknesses – Naïve Bayes Algorithm

**Strengths**

**The Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**The Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**The Naïve Bayes Algorithm can handle noisy Data**

**The Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small**

**The Naïve Bayes Algorithm can avoid Overfitting**

**The Naïve Bayes Algorithm is used in Deep Learning Algorithms, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**The Naïve Bayes Algorithm can make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**Weaknesses**

**The Naive Bayes Algorithm assumes that Attributes / Features are independent of each other**

**However, in the Real world, Attributes / Features often depend on each other**

### TODO and Your Turn


Todo Tasks

**TODO Task 3**

**Task 1**

**Consider the Titanic Dataset with the following Attributes**

**Gender: Male, Female**

**Ticket Class: Upper, Middle, Lower**

**Parent/Child Aboard: Zero, One, Two, Three**

**Embarked: Cherbourg, Queenstown, Southampton**

**Survival: No, Yes**

**We obtained the following Sample Data**

**Sample Data was split into Training and Testing Data in a**

**Train-Test Split Ratio of 80%-20%**

**Training Data**

**Testing Data**

**Note**

**Consider the Naïve Bayes Algorithm when answering the questions given below**

**Your answer should be**

**Well Justified**

**Questions**

**Write down the Input and Output for the above Machine Learning Problem**

**How is a Training Example represented?**

**How should a Hypothesis (h) be represented?**

**Execute the Machine Learning Cycle**

**Write down the observations that you made during the execution of the Machine Learning Cycle**

Your Turn Tasks

**Your Turn Task 3**

**Task 1**

**Select a Machine Learning Problem (similar to: Titanic – Machine Learning from Disaster) and answer the questions given below.**

**Note**

**Consider the Naïve Bayes Algorithm when answering all the questions.**

**Questions**

**Write down the Input and Output for the selected Machine Learning Problem**

**How is a Training Example represented?**

**How should a Hypothesis (h) be represented?**

**Execute the Machine Learning Cycle**

**Write down the observations that you made during the execution of the Machine Learning Cycle**

## Chapter Summary

- Chapter Summary

**Major Problems**

**Problem 1**

**These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**Problem 2**

**These ML Algorithms either accept or reject a Hypothesis (h), i.e., take a Binary Decision**

**instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**A Possible Solution**

**Bayesian Learning Algorithms**

**Naïve Bayes – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Probability Distribution**

**Searching Strategy**

**I am not clear about the Search Strategy of the Naïve Bayes Algorithm. Please drop me an email if you know it. JazakAllah Khair (may Allah reward you with goodness).**

**Training Regime**

**Batch Method**

**Strengths**

**The Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**The Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**The Naïve Bayes Algorithm can handle noisy Data**

**The Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small**

**The Naïve Bayes Algorithm can avoid Overfitting**

**The Naïve Bayes Algorithm is used in Deep Learning Algorithms, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**The Naïve Bayes Algorithm can make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**Weaknesses**

**The Naive Bayes Algorithm assumes that Attributes / Features are independent of each other**

**However, in the Real world, Attributes / Features often depend on each other**

**Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms**

**Bayesian Learning uses Bayes' Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)**

**Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)**

**Bayesian Learning Algorithms are suitable for the following Machine Learning Problems**

**When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)**

**When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**When you have noisy Data**

**When you have small Training Data**

**In recent years, Bayesian Learning Algorithms have been used in Deep Learning, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**When you want to make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**A Brute Force Bayesian Learning Algorithm works as follows**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H)**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**The Bayes Optimal Classifier works as follows to calculate the Most Probable Classification of an unseen instance x**

**Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H**

**Step 2: Calculate the prediction of each Hypothesis (h) for each new instance**

**Step 3: Use the combination of Steps 1 and 2 to classify each new instance**

**The Prediction returned by h_MAP(x) is not always the Most Probable Classification**

**To reduce the computation cost in finding the Most Probable Classification using the Bayes Optimal Classifier, the Gibbs Algorithm works as follows**

**Step 1: Choose one Hypothesis (h) at random, according to P(h|D)**

**Step 2: Use this h to classify new unseen instances**

**The Naive Bayes Algorithm is a highly practical Bayesian Learning Algorithm**

**Its performance is comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks**

**Naïve Bayes is very good for**

**Text Classification**

### In Next Chapter


- In sha Allah (God willing), in the next Chapter, I will present

- Instance Based Learning