# Chapter 12 - Bayesian Learning

### Chapter Outline


**Quick Recap**

**Bayes Theorem and Bayesian Learning Algorithms**

**Bayes Optimal Classifier**

**Gibbs Algorithm**

**Naive Bayes Algorithm**

**Chapter Summary**

## Quick Recap

- Quick Recap – Artificial Neural Networks


**Artificial Neural Networks – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Combination of Weights between Units**

**Weights are Numeric values**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Batch Method**

**Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)**

**ANNs are suitable for those Machine Learning Problems in which**

**Training Examples (both Input and Output) can be represented as real values (Numeric Representation)**

**Hypothesis (h) can be represented as real values (Numeric Representation)**

**Slow Training Times are OK**

**Predictive Accuracy is more important than understanding**

**When we have noise in Training Data**

**The diagram below shows the General Architecture of Artificial Neural Networks**

- ANNs mainly have three Layers

**Input Layer**

**Hidden Layer**

**Output Layer**

**Each Layer contains one or more Units**

**Main Parameters of ANNs are**

**No. of Input Units**

**No. of Hidden Layers**

**No. of Hidden Units at each Hidden Layer**

**No. of Output Units**

**Mathematical Function at each Hidden and Output Unit**

**Weights between Units**

**ANN will be Fully Connected or Not?**

**Learning Rate**

**ANNs can be broadly categorized based on**

**Number of Hidden Layers in an ANN Architecture**

**Two Layer Neural Networks (a.k.a. Perceptron)**

**Number of Hidden Layers = 0**

**Multi-layer Neural Networks**

**Regular Neural Network**

**Number of Hidden Layers = 1**

**Deep Neural Network**

**Number of Hidden Layers > 1**

**A Perceptron is a simple Two Layer ANN, with**

**One Input Layer**

**Multiple Input Units**

**Output Layer**

**Single Output Unit**

**A Perceptron can be used for Binary Classification Problems**

**We can use Perceptrons to build larger Neural Networks**

**A Perceptron has limited learning abilities, i.e., it fails to learn some simple Boolean-valued Functions (e.g., XOR)**

**A Perceptron works as follows**

**Step 1: Random Weights are assigned to edges between Input-Output Units**

**Step 2: Input is fed into Input Units**

**Step 3: Weighted Sum of Inputs (S) is calculated**

**Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at the Output Unit**

**Step 5: Mathematical Function calculates the Output of the Perceptron**
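To make Steps 1-5 concrete, here is a minimal Python sketch of a Perceptron's forward pass. The step (threshold) activation, the input values, and the random seed are illustrative assumptions, not taken from the chapter.

```python
# A minimal sketch of a Perceptron's forward pass (Steps 1-5 above).
import random

def perceptron_predict(inputs, weights, bias):
    # Step 3: Weighted Sum of Inputs, S = sum(w_i * x_i) + bias
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Steps 4-5: threshold (step) function at the Output Unit
    return 1 if s > 0 else 0

# Step 1: random Weights on the edges between Input and Output Units
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(3)]
bias = random.uniform(-1, 1)

# Step 2: feed an input vector into the Input Units
x = [1.0, 0.0, 1.0]
print(perceptron_predict(x, weights, bias))  # 0 or 1 (Binary Classification)
```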

**Before using ANN to build a Model**

**Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)**

**Perceptron – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Perceptrons can learn Binary Classification Problems**

**Perceptrons can be combined to make larger ANNs**

**Perceptrons can learn linearly separable functions**

**Weaknesses**

**Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable**

**A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer**

**A Regular Neural Network can be used for both**

**Binary Classification Problems and**

**Multi-class Classification Problems**

**Regular Neural Network – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Can learn both linearly separable and non-linearly separable Target Functions**

**Can handle noisy Data**

**Can learn Machine Learning Problems with very complex Numerical Representations**

**Weaknesses**

**Computational Cost and Training Time are high**

**A Regular Neural Network works as follows**

**Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units**

**Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)**

**Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)**

**Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 5: Weighted Sum of Inputs (S) is calculated for each Hidden Unit and fed to the Output Units (making up the Output Layer)**

**Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 7: Softmax Layer converts the values at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network**
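As a concrete illustration of Steps 1-7, the sketch below runs one forward pass through a Regular Neural Network with a single Hidden Layer. The sigmoid Hidden Units, layer sizes, and input values are illustrative assumptions.

```python
# A minimal sketch of the forward pass of a Regular Neural Network
# (Steps 1-7 above): one Hidden Layer and a Softmax over the Output Units.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def softmax(v):
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, w_ih, w_ho):
    # Steps 3-4: weighted sums into the Hidden Units, then the
    # Mathematical Function (sigmoid) at each Hidden Unit
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    # Steps 5-6: weighted sums into the Output Units
    outputs = [sum(w * h for w, h in zip(ws, hidden)) for ws in w_ho]
    # Step 7: Softmax converts Output-Unit values into probabilities
    return softmax(outputs)

random.seed(0)
n_in, n_hidden, n_out = 4, 3, 2          # Step 1: random weights
w_ih = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_ho = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

probs = forward([0.5, 0.1, 0.9, 0.0], w_ih, w_ho)   # Step 2: feed inputs
print(probs, "-> predicted class:", probs.index(max(probs)))
```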

**The Regular Neural Network Training Rule is used to tweak Weights, when**

**Actual Value is different from Predicted Value**

**How Regular Neural Network Training Rule Works?**

**Step 1: For a given Training Example E, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) to 1 and**

**set the Target Output Values of the remaining Output Units to 0**

**Step 2: Train the Regular Neural Network using Training Example E and get the Prediction (Observed Output o(E)) from the Regular Neural Network**

**Step 3: If Target Output t(E) is different from Observed Output o(E)**

**Calculate the Network Error**

**Backpropagate the Error by updating weights between Output-Hidden Units and Hidden-Input Units**

**Step 4: Keep training the Regular Neural Network until the Network Error becomes very small**
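The sketch below turns Steps 1-4 into code for a single Training Example, using the standard sigmoid backpropagation update. The learning rate, layer sizes, and the use of sigmoid (rather than Softmax) Output Units are simplifying assumptions.

```python
# A minimal sketch of the Training Rule for one Hidden Layer and a
# single Training Example E; sigmoid units throughout.
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

random.seed(0)
lr = 0.5
x = [0.5, 0.1, 0.9]                 # Training Example E: inputs
t = [1.0, 0.0]                      # Step 1: targets (class unit = 1, rest 0)
w_ih = [[random.uniform(-1, 1) for _ in x] for _ in range(3)]
w_ho = [[random.uniform(-1, 1) for _ in range(3)] for _ in t]

for _ in range(1000):               # Step 4: train until the error is small
    # Step 2: forward pass -> Observed Output o(E)
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_ho]
    # Step 3: error terms, then backpropagate them
    delta_o = [oi * (1 - oi) * (ti - oi) for oi, ti in zip(o, t)]
    delta_h = [hi * (1 - hi) * sum(w_ho[k][j] * delta_o[k]
                                   for k in range(len(o)))
               for j, hi in enumerate(h)]
    for k in range(len(o)):          # update Output-Hidden weights
        for j in range(len(h)):
            w_ho[k][j] += lr * delta_o[k] * h[j]
    for j in range(len(h)):          # update Hidden-Input weights
        for i in range(len(x)):
            w_ih[j][i] += lr * delta_h[j] * x[i]

print("outputs after training:", o)  # close to the targets [1, 0]
```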

**Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)**

**Four Possible Solutions to overcome Overfitting in ANNs are as follows**

**Training and Validation Set Approach**

**Learn Multiple ANNs**

**Momentum**

**Weight Decay Factor**

**ANNs – Strengths and Weaknesses**

**Strengths**

**Can learn problems with very complex Numerical Representations**

**Can handle noisy Data**

**Execution time in Application Phase is fast**

**Good for Machine Learning Problems in which**

**Both Training Examples and Hypothesis (h) have Numeric Representation**

**Weaknesses**

**Require a lot of Training Time (particularly Deep Learning Models)**

**Computational cost for ANNs (particularly Deep Learning Models) is high**

**Overfitting is a serious problem in ANNs**

**ANNs either reject or accept a Hypothesis (h) during Training, i.e., they take a Binary Decision**

**Accept a Hypothesis (h), if it is consistent with the Training Example**

**Reject a Hypothesis (h), if it is not consistent with the Training Example**

### Quick Recap - Models

- Learning Input-Output Functions – General Settings

**Input to Learner**

**Set of Training Examples (D)**

**Set of Functions / Hypothesis (H)**

**Output of Learner**

**a Hypothesis h from H, which best fits the Training Examples (D)**

- Learning is a Searching Problem

**For any Machine Learning Algorithm, we should have a clear understanding of the following three points**

**Representation of Training Examples (D)**

**How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from?**

**Representation of Hypothesis (h)**

**How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?**

**Searching Strategy**

**What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?**

- Summary - FIND-S Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the Search Strategy of the FIND-S Algorithm. Please drop me an email if you know it.**

**Training Regime**

**Incremental Method**

- Summary – List Then Eliminate Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**Sequential Search**

**Training Regime**

**Incremental Method**

- Summary – Candidate Elimination Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Conjunction (AND) of Constraints on Input Attributes**

**Searching Strategy**

**I am not clear about the Search Strategy of the Candidate Elimination Algorithm. Please drop me an email if you know it. JazakAllah Khair (may Allah reward you with goodness).**

**Training Regime**

**Incremental Method**

- Summary – ID3 Algorithm

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Disjunction (OR) of Conjunction (AND) of Input Attributes (a.k.a. Decision Tree)**

**Searching Strategy**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Training Regime**

**Batch Method**

- Summary - Symbolic Representations

**The following Machine Learning Algorithms are based on Symbolic Representation**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm**

**Symbolic Representation – Representing Training Examples**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Symbolic Representation – Representing Hypothesis (h)**

**Two Types of Hypothesis (h) Representations**

**Conjunction (AND) of Constraints on Input Attributes**

**Disjunction (OR) of Conjunction (AND) of Input Attributes**

**Note that both types of Hypothesis (h) Representations are Symbolic**

**i.e., based on Symbols (Categorical Values)**

- Problem with Symbolic Representations

**Problem**

**Cannot handle Machine Learning Problems with very complex Numeric Representations**

**Solution**

**Non-Symbolic Representations**

**For example, Artificial Neural Networks**

- Non-Symbolic Representations - Artificial Neural Networks

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Combination of Weights between Units**

**Note**

**Weights are Numeric values**

**Searching Strategy**

**Gradient Descent Search**

**Training Regime**

**Batch Method**

- Comparison of Symbolic and Non-Symbolic Representation

**Representation of Training Examples (D)**

**Symbolic Representations**

**Input**

**Categorical**

**Output**

**Categorical**

**Non-Symbolic Representations**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Symbolic Representations**

**Categorical Values are used**

**Non-Symbolic Representations**

**Numeric Values are used**

**Conclusion**

**The major difference between Symbolic and Non-Symbolic Representation is that**

**in Symbolic Representations, both Training Example (d) and Hypothesis (h) are represented using**

**Categorical Values**

**in Non-Symbolic Representations, both Training Example (d) and Hypothesis (h) are represented using**

**Numerical Values**


- Major Problems in ML Algorithms Studied So Far

**Alhamdulillah (praise be to Allah), up till now we have studied the following Machine Learning Algorithms**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm**

**Perceptron Algorithm (Two Layer ANN)**

**Regular Neural Network (Multi-layer ANN)**

**Major Problems**

**Problem 01**

**These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**Problem 02**

**These ML Algorithms either accept or reject a Hypothesis (h), i.e., take a Binary Decision**

**instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**A Possible Solution**

**Bayesian Learning Algorithms**

**In sha Allah (God willing), in this Chapter, I will try to explain how to use Bayesian Learning Algorithms for Machine Learning Problems**

- Probabilistic ML Algorithms - Naïve Bayes

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Categorical / Numeric**

**Output**

**Categorical**

**Representation of Hypothesis (h)**

**Probability Distribution**

**Searching Strategy**

**I am not clear about the Search Strategy of the Naïve Bayes Algorithm. Please drop me an email if you know it.**

**Training Regime**

**Batch Method**

## Bayes Theorem and Bayesian Learning Algorithms

- Bayesian Learning

**Definition**

**Bayesian Learning uses Bayes' Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)**

**Note**

**Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms**

**Importance**

**Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)**

**Applications**

**Categorizing News Articles**

**Email Spam Detection**

**Face Recognition**

**Sentiment Analysis**

**Medical Diagnosis**

**Character Recognition**

**Weather Prediction**

**and many more 😊**

- Suitable Problems for Bayesian Learning Algorithms

**Bayesian Learning Algorithms are suitable for the following Machine Learning Problems**

**When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)**

**When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**When you have noisy Data**

**When you have small Training Data**

**In recent years, Bayesian Learning Algorithms have been used in Deep Learning, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**When you want to make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

- Bayes Theorem

*P(h|D) = P(D|h) P(h) / P(D)*

*P(h)* = Prior Probability of Hypothesis *h*

*P(D)* = Prior Probability of Training Examples *D*

*P(D|h)* = Probability of *D* given *h*

*P(h|D)* = Probability of *h* given *D*
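A minimal helper that evaluates Bayes' Theorem from the four quantities defined above; the function name is an illustrative choice, not from the chapter.

```python
# Bayes' Theorem: P(h|D) = P(D|h) * P(h) / P(D)
def posterior(p_d_given_h, p_h, p_d):
    """Return P(h|D) from the likelihood P(D|h), prior P(h), and evidence P(D)."""
    return p_d_given_h * p_h / p_d
```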

- Example – Bayes Theorem

**Consider the following Example: a patient takes a Laboratory Test for cancer, and the result comes back Positive. The Test returns a correct Positive result in 98% of the cases in which cancer is actually present, and a correct Negative result in 97% of the cases in which cancer is not present. Furthermore, 0.008 of the entire population have cancer.**

**In the above Example**

**Hypothesis Space (H)**

**H = {cancer, ¬cancer}**

**H = {Patient has Cancer, Patient does not have Cancer}**

**Set of Training Examples (D) – Outcome of Laboratory Test**

**D = {Positive, Negative}**

**P(h) = Prior Probability of Hypothesis h**

**h = cancer = Patient has Cancer**

**P(h) = P(cancer) = P(Patient has Cancer) = 0.008**

**i.e., P(h) = 0.008**

**P(D) = Prior Probability of Training Data (D)**

**D = Positive Outcome of Laboratory Test**

**P(D) = (0.008 × 0.98) + (0.992 × 0.03) = 0.0376**

*P(D|h)* = Probability of a Positive Outcome of the Laboratory Test given that a Patient has Cancer

*P(D|h)* = 0.98

**Calculate** *P(h|D)* **using Bayes Theorem**

*P(h|D)* = Probability of the Patient having Cancer given a Positive Outcome of the Laboratory Test

*P(h|D)* = (0.98 × 0.008) / 0.0376 = 0.0078 / 0.0376 ≈ 0.21
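The example can be checked numerically; the sketch below reproduces the P(D) and P(h|D) calculations from the values given above.

```python
# Working the cancer example numerically (values from the text).
p_cancer = 0.008                     # P(h): prior probability of cancer
p_pos_given_cancer = 0.98            # P(D|h)
p_pos_given_not_cancer = 0.03
p_not_cancer = 1 - p_cancer          # 0.992

# P(D): total probability of a Positive Laboratory Test
p_pos = p_cancer * p_pos_given_cancer + p_not_cancer * p_pos_given_not_cancer
print(round(p_pos, 4))               # 0.0376

# P(h|D): probability of cancer given a Positive test
print(round(p_pos_given_cancer * p_cancer / p_pos, 4))   # ~0.2085
```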

- Recall

**Input to Learner**

**Set of Training Examples (D)**

**Hypothesis Space (H)**

**Output by Learner**

**Find a Hypothesis h from H, which best fits the Training Data**

- Brute Force Bayesian Learning Algorithm

**A Brute Force Bayesian Learning Algorithm works as follows**

**Step 1: Compute the Posterior Probability P(h|D) for all the Hypotheses (h) in Hypothesis Space (H) using Bayes' Theorem**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

- Example 01 - Brute Force Bayesian Learning Algorithm

**Consider a Hypothesis Space (H) of three Hypotheses (h)**

**H = {h₁, h₂, h₃}**

**Finding the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H) using Bayes' Theorem**

**P(h₁|D) = 0.58**

**P(h₂|D) = 0.48**

**P(h₃|D) = 0.79**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**h_MAP = h₃**

- Example 02 - Brute Force Bayesian Learning Algorithm

**Consider the following Machine Learning Problem**

**In above learning problem**

**Hypothesis Space (H)**

**H = {cancer, ¬cancer}**

**H = {Patient has Cancer, Patient does not have Cancer}**

**Set of Training Examples (D) – Outcome of Laboratory Test**

**D = {Positive, Negative}**

**From the Machine Learning Problem statement, we see that in general**

*P(cancer)* = 0.008, *P(¬cancer)* = 0.992

*P(+|cancer)* = 0.98, *P(−|cancer)* = 0.02

*P(+|¬cancer)* = 0.03, *P(−|¬cancer)* = 0.97

**Given that a patient returns a Positive Laboratory Test**

**What is the MAP Hypothesis, h_MAP?**

**Finding the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H)**

**P(h|D) = P(D|h) P(h) / P(D)**

**Since P(D) is the same for all Hypotheses (h) in the Hypothesis Space (H)**

**P(h|D) ∝ P(D|h) P(h)**

**Computing Posterior Probabilities**

**P(cancer|+)**

**P(cancer|+) ∝ P(+|cancer) P(cancer)**

**P(cancer|+) ∝ 0.98 × 0.008**

**P(cancer|+) ∝ 0.0078**

**P(¬cancer|+)**

**P(¬cancer|+) ∝ P(+|¬cancer) P(¬cancer)**

**P(¬cancer|+) ∝ 0.03 × 0.992**

**P(¬cancer|+) ∝ 0.0298**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**h_MAP = ¬cancer (Patient does not have Cancer)**
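A minimal sketch of this brute-force computation for Example 02: it scores every Hypothesis with the unnormalized posterior P(D|h)P(h) and outputs h_MAP.

```python
# Brute Force MAP for the cancer example (values from the text).
priors      = {"cancer": 0.008, "not_cancer": 0.992}
likelihoods = {"cancer": 0.98,  "not_cancer": 0.03}   # P(+ | h)

# Step 1: unnormalized posterior P(D|h) * P(h) for every h in H
posteriors = {h: likelihoods[h] * priors[h] for h in priors}
print(posteriors)                    # {'cancer': 0.00784, 'not_cancer': 0.02976}

# Step 2: output the Most Probable Hypothesis
h_map = max(posteriors, key=posteriors.get)
print(h_map)                         # not_cancer
```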

- Bayes Theorem and MAP/ML Hypotheses

**Bayes Theorem**

**Generally, we want the Most Probable Hypothesis given the Training Examples**

**i.e., the Maximum A Posteriori (MAP) Hypothesis, h_MAP**

- Bayes Theorem and MAP/ML Hypotheses, Cont…

**In some cases, it may be useful to assume that**

**all Hypotheses (h) in Hypothesis Space (H) have the same Prior Probability**

*i.e., P(hᵢ) = P(hⱼ) for all hᵢ, hⱼ ∈ H*

**In such cases, we only consider P(D|h) to find the Most Probable Hypothesis**

**P(D|h) is called the Likelihood of D given h**

**Any hypothesis that maximizes P(D|h) is called a Maximum Likelihood (ML) Hypothesis**

*h_ML = argmax P(D|h), over h ∈ H*

- Strengths and Weaknesses - Brute Force Bayesian Learning Algorithm

**Strengths**

**The Brute Force Bayesian Learning Algorithm returns the h_MAP hypothesis, which can be used to make predictions on unseen data**

**Weaknesses**

**It is computationally expensive, because we have to**

**calculate the Posterior Probability for all Hypotheses (h) in Hypothesis Space (H)**

### TODO and Your Turn


Todo Tasks

**TODO Task 1**

**Task 1**

**Consider the following Machine Learning Problems**

**Fake Video Detection**

**Fake News Detection**

**Automatic Paraphrase Generation**

**Language Detection from Speech**

**Language Detection from Text**

**Language Detection from Image**

**Note**

**Consider the following Machine Learning Algorithms when answering the questions given below**

**Decision Tree Learning**

**Artificial Neural Networks**

**Bayesian Learning Algorithms**

**Your answer should be**

**Well Justified**

**Questions**

**Write the Input and Output for each Machine Learning Problem**

**Which Machine Learning Problem is most complex?**

**Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?**

Your Turn Tasks

**Your Turn Task 1**

**Task 1**

**Select five Real-world Problems which can be treated as Machine Learning Problems and answer the questions given below**

**Note**

**Consider the following Machine Learning Algorithms when answering the questions given below**

**Decision Tree Learning**

**Artificial Neural Networks**

**Bayesian Learning Algorithms**

**Questions**

**Write the Input and Output for each selected Machine Learning Problem**

**Which Machine Learning Problem is most complex?**

**Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?**

## Bayes Optimal Classifier

- Most Probable Classification

**Question**

**Given an unseen instance x, what is the Most Probable Classification?**

**A Possible Answer**

**Use the Bayes Optimal Classifier to find out the Most Probable Classification**

- Bayes Optimal Classifier

**The Bayes Optimal Classifier works as follows**

**The Bayes Optimal Classifier computes the Most Probable Classification using the following Formula**

*v_OB = argmax Σ P(vⱼ|hᵢ) P(hᵢ|D), over vⱼ ∈ V, where the sum runs over all hᵢ ∈ H*

**So, the Most Probable Classification for a new instance is the value vⱼ which maximizes P(vⱼ|D)**
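A minimal sketch of the formula above; the dictionary shapes of `posteriors` and `predictions` are assumptions for illustration, not an API from the chapter.

```python
# Bayes Optimal Classifier: for each class value v_j, sum
# P(v_j|h_i) * P(h_i|D) over all hypotheses; return the best value.
def bayes_optimal(posteriors, predictions, values):
    """posteriors:  {h: P(h|D)}
       predictions: {h: {v: P(v|h)}}
       values:      iterable of possible class values v_j"""
    score = {v: sum(posteriors[h] * predictions[h][v] for h in posteriors)
             for v in values}
    return max(score, key=score.get), score
```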

- Example – Bayes Optimal Classifier

**Machine Learning Problem**

**Gender Identification**

**Given**

**A Hypothesis Space (H) of three Hypotheses (h)**

**H = {h₁, h₂, h₃}**

**Note**

**vⱼ ∈ {Male, Female}**

**Task**

**What is the Most Probable Classification (Male or Female) of an unseen instance x?**

**Applying the Bayes Optimal Classifier to find the Most Probable Classification (Male or Female) for x**

**Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H using Bayes' Theorem**

**P(h₁|D) = 0.4**

**P(h₂|D) = 0.3**

**P(h₃|D) = 0.3**

**Step 2: Calculate the prediction of each Hypothesis (h) for unseen instance x**

**h₁ predicts unseen instance x as Female**

**h₂ predicts unseen instance x as Male**

**h₃ predicts unseen instance x as Male**

**Summarizing Steps 1 and 2**

**P(h₁|D) = 0.4, P(Male|h₁) = 0, P(Female|h₁) = 1**

**P(h₂|D) = 0.3, P(Male|h₂) = 1, P(Female|h₂) = 0**

**P(h₃|D) = 0.3, P(Male|h₃) = 1, P(Female|h₃) = 0**

**Step 3: Use the combination of Steps 1 and 2 to classify unseen instance x**

**Σ P(Male|hᵢ) P(hᵢ|D) = 0.3 + 0.3 = 0.6 and Σ P(Female|hᵢ) P(hᵢ|D) = 0.4**

**Most Probable Classification of unseen instance x is Male**
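To check Steps 1-3 numerically, here is the gender example worked in code, using the posteriors and per-hypothesis predictions listed above.

```python
# Each hypothesis votes with weight P(h|D); the sums decide the class.
posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"Male": 0, "Female": 1},
               "h2": {"Male": 1, "Female": 0},
               "h3": {"Male": 1, "Female": 0}}

score = {v: sum(posteriors[h] * predictions[h][v] for h in posteriors)
         for v in ("Male", "Female")}
print(score)                         # {'Male': 0.6, 'Female': 0.4}
print(max(score, key=score.get))     # Male
```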

- hMAP and Most Probable Classification

**Question**

**Is h_MAP(x) the Most Probable Classification?**

**where x is an unseen instance**

**Answer**

**No**

- Example - hMAP and Most Probable Classification

**Consider the Gender Identification Problem discussed in previous Slides to explain the working of Bayes Optimal Classifier**

**Calculating h_MAP**

**P(h₁|D) = 0.4**

**P(h₂|D) = 0.3**

**P(h₃|D) = 0.3**

**h_MAP = h₁**

**Classifying unseen instance x**

**Choice 01 – Use h_MAP**

**Prediction = Female**

**Choice 02 – Use Bayes Optimal Classifier**

**Prediction = Male**

**Conclusion**

**The Prediction returned by h_MAP(x) is not always the Most Probable Classification**

- Strengths and Weaknesses – Bayes Optimal Classifier

**Strengths**

**Predicts the Most Probable Classification for unseen data**

**Weaknesses**

**Computationally very expensive, because you need to do the following calculations**

**Step 1: Calculate Posterior Probability for every Hypothesis (h) in Hypothesis Space (H)**

**Step 2: Calculate the prediction of every Hypothesis (h) in Hypothesis Space (H)**

**Step 3: Combine Steps 1 and 2 to calculate the Most Probable Classification for unseen instances**

## Gibbs Algorithm

- Problem in Bayes Optimal Classifier

**Problem**

**Bayes Optimal Classifier is computationally expensive**

**A Possible Solution**

**Use Gibbs Algorithm**

- Gibbs Algorithm

**The Gibbs Algorithm works as follows**

**Step 1: Choose one Hypothesis (h) at random, according to the Posterior Probability Distribution P(h|D)**

**Step 2: Use this h to classify new unseen instances**

**Surprising fact**

**The Expected Error of the Gibbs Algorithm is no worse than twice the Expected Error of the Bayes Optimal Classifier**
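A minimal sketch of the two steps, reusing the gender example's posteriors; `random.choices` draws the hypothesis in proportion to P(h|D).

```python
# Gibbs Algorithm: sample one hypothesis by P(h|D), then classify with it.
import random

posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # from the earlier example
predictions = {"h1": "Female", "h2": "Male", "h3": "Male"}

hyps, weights = zip(*posteriors.items())
h = random.choices(hyps, weights=weights, k=1)[0]  # Step 1: random draw
print(predictions[h])                              # Step 2: classify x with h
```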

### TODO and Your Turn


Todo Tasks

**TODO Task 2**

**Task 1**

**Consider a Machine Learning Problem that has three possible outcomes: child, youth and old, and a Hypothesis Space (H) containing 8 possible hypotheses h₁, h₂, h₃, h₄, h₅, h₆, h₇ and h₈. D is the Set of Training Examples available to train the Model.**

**P(h₁|D) = 0.3, P(h₂|D) = 0.4, P(h₃|D) = 0.3, P(h₄|D) = 0.2, P(h₅|D) = 0.1, P(h₆|D) = 0.4, P(h₇|D) = 0.5, P(h₈|D) = 0.3**

**Given a new instance x such that**

**h₁(x) = child, h₂(x) = youth, h₃(x) = old, h₄(x) = old, h₅(x) = youth, h₆(x) = child, h₇(x) = old, h₈(x) = youth**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**What is the Most Probable Classification of unseen instance x?**

**Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?**

**Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?**

Your Turn Tasks

**Your Turn Task 2**

**Task 1**

**Create a scenario similar to the Task given in TODO and answer the questions given below.**

**Questions**

**What is the Most Probable Classification of unseen instance x?**

**Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?**

**Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?**

## Naïve Bayes Algorithm

- Naive Bayes Algorithm

**A highly practical Bayesian Learning Algorithm**

**Performance comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks**

**Naïve Bayes is very good for**

**Text Classification**

**Assume a Target Function f : X → V, where each instance x in X is described by attributes (a₁, a₂, …, aₙ)**

**The Most Probable Value v_MAP of f(x) is:**

*v_MAP = argmax P(vⱼ | a₁, a₂, …, aₙ) = argmax P(a₁, a₂, …, aₙ | vⱼ) P(vⱼ), over vⱼ ∈ V*

**P(a₁, a₂, …, aₙ | vⱼ) is impractical to compute unless we have a very large Training Set**

**Number of such terms = Number of Possible Instances × Number of Possible Target Values**

**Example: 20 attributes, 5 values each, 2 target values**

**5²⁰ × 2 = 190,734,863,281,250**

**Reliability of Probability Estimates is low, unless**

**each instance is seen many times**

**Key Assumption of Naive Bayes**

**Attribute Values are conditionally independent given the Target Value, i.e., we assume**

*P(a₁, a₂, …, aₙ | vⱼ) = Πᵢ P(aᵢ | vⱼ)*

**Making this assumption leads to the Naive Bayes Classifier**

**Naive Bayes Algorithm**

*v_NB = argmax P(vⱼ) Πᵢ P(aᵢ | vⱼ), over vⱼ ∈ V*
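A minimal sketch of the Naive Bayes formula above; the dictionary shapes are assumptions for illustration, and the probability tables would come from the Training Phase described in the next section.

```python
# Naive Bayes Classifier: v_NB = argmax P(v_j) * product of P(a_i | v_j).
from math import prod

def naive_bayes(instance, class_priors, cond_probs):
    """instance:     {attribute: value}
       class_priors: {v: P(v)}
       cond_probs:   {v: {attribute: {value: P(attribute=value | v)}}}"""
    scores = {v: class_priors[v] *
                 prod(cond_probs[v][a][val] for a, val in instance.items())
              for v in class_priors}
    return max(scores, key=scores.get), scores
```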

### Machine Learning Cycle – Naïve Bayes Algorithm

- Gender Identification Problem

**Machine Learning Problem**

**Gender Identification**

**Input**

**Human**

**Output**

**Gender of a Human**

**Task**

**Given a Human (Input), predict the Gender of the Human (Output)**

**Treated as**

**Learning Input-Output Function**

**i.e. Learn from Input to predict Output**

- Representation of Examples

**Example = Input + Output**

**Input**

**Human**

**Output**

**Gender**

**Representation of Input and Output**

**Attribute-Value Pair**

- Representation of Input

**Input is represented as a**

**Set of 4 Input Attributes**

**Height**

**Weight**

**HairLength**

**HeadCovered**

- Representation of Output

**Output is represented as a**

**Set of 1 Output Attribute**

**Gender**

**Note**

**Yes means Female and No means Male**

- Sample Data

**We obtained Sample Data of 21 examples**

- Split the Sample Data

**We split the Sample Data using a Random Split Approach into**

**Training Data – 2/3 of Sample Data**

**Testing Data – 1/3 of Sample Data**

- Training Data

- Testing Data

- Machine Learning Cycle

**Four phases of a Machine Learning Cycle are**

**Training Phase**

**Build the Model using Training Data**

**Testing Phase**

**Evaluate the performance of the Model using Testing Data**

**Application Phase**

**Deploy the Model in the Real-world, to make predictions on Real-time unseen Data**

**Feedback Phase**

**Take Feedback from the Users and Domain Experts to improve the Model**

- Training Phase – Naïve Bayes Algorithm

**Compute Probabilities**
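A sketch of how these probabilities could be estimated by counting over the Training Data. Since the chapter's Training Data table is not reproduced here, the three examples below are hypothetical stand-ins with the same attributes; no smoothing is applied, so unseen attribute values get probability 0.

```python
# Training Phase: estimate P(v) and P(a_i | v) by counting.
from collections import Counter, defaultdict

ATTRS = ["Height", "Weight", "HairLength", "HeadCovered"]
train = [  # (Height, Weight, HairLength, HeadCovered) -> Gender
    (("Tall", "Heavy", "Short", "No"),   "Male"),
    (("Short", "Light", "Long", "Yes"),  "Female"),
    (("Normal", "Normal", "Long", "No"), "Female"),
]

class_counts = Counter(label for _, label in train)
cond_counts = defaultdict(Counter)           # (class, attr) -> value counts
for features, label in train:
    for attr, value in zip(ATTRS, features):
        cond_counts[(label, attr)][value] += 1

n = len(train)
class_priors = {v: c / n for v, c in class_counts.items()}
cond_probs = {v: {a: {val: cnt / class_counts[v]
                      for val, cnt in cond_counts[(v, a)].items()}
                  for a in ATTRS}
              for v in class_counts}
print(class_priors)   # e.g. {'Male': 0.333..., 'Female': 0.666...}
```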

- Summary – Training Phase of Naïve Bayes Algorithm

**Recall**

**Training Data**

**Model**

**In the next Phase, i.e., the Testing Phase, Insha Allah (God willing), we will**

**Evaluate the performance of the Model**

- Testing Phase – Naïve Bayes Algorithm

**Question**

**How well has the Model learned?**

**Answer**

**Evaluate the performance of the Model on unseen data (or Testing Data)**

- Evaluation Measures

**Evaluation will be carried out using**

**Error measure**

- Error

**Definition**

**Error is defined as the proportion of incorrectly classified Test instances**

**Formula**

*Error = (Number of incorrectly classified Test instances) / (Total number of Test instances)*

**Note**

**Accuracy = 1 – Error**
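The Error measure as code; the prediction and actual lists are illustrative values, not the chapter's results.

```python
# Error: proportion of incorrectly classified Test instances.
def error(predictions, actuals):
    wrong = sum(p != a for p, a in zip(predictions, actuals))
    return wrong / len(actuals)

preds   = ["Female", "Male", "Female", "Female"]
actuals = ["Female", "Male", "Male",   "Female"]
e = error(preds, actuals)
print(e, 1 - e)    # 0.25 0.75  (Error, Accuracy)
```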

- Evaluate Model (Naïve Bayes)

**Test Example x₁₅ = <Tall, Heavy, Long, No>**

**Compute the Most Probable Classification for Test Example x₁₅ using the Naïve Bayes Model**

**Note: vⱼ ∈ {Male, Female}**

**Using Naive Bayes, we need to compute**

*v_NB = argmax P(vⱼ) P(Height = Tall | vⱼ) P(Weight = Heavy | vⱼ) P(HairLength = Long | vⱼ) P(HeadCovered = No | vⱼ), over vⱼ ∈ {Male, Female}*

**Prediction: v_NB = Female**

**Test Example x₁₆ = <Short, Normal, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₁₇ = <Short, Normal, Long, Yes>**

**Prediction: v_NB = Male**

**Test Example x₁₈ = <Normal, Normal, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₁₉ = <Short, Light, Short, No>**

**Prediction: v_NB = Female**

**Test Example x₂₀ = <Normal, Light, Short, Yes>**

**Prediction: v_NB = Female**

**Test Example x₂₁ = <Tall, Light, Short, No>**

**Prediction: v_NB = Female**

- Summary - Evaluate Model (Naïve Bayes)

- Application Phase

**We assume that our Model**

**performed well on large Test Data and can be deployed in the Real-world**

**The Model is deployed in the Real-world and now we can make**

**predictions on Real-time Data**

- Steps – Making Predictions on Real-time Data

**Step 1: Take Input from User**

**Step 2: Convert User Input into a Feature Vector**

**Exactly the same as the Feature Vectors of Training and Testing Data**

**Step 3: Apply the Model on the Feature Vector**

**Step 4: Return the Prediction to the User**

- Making Predictions on Real-time Data

**Step 1: Take input from User**

**Step 2: Convert User Input into Feature Vector**

**Note that order of Attributes must be exactly same as that of Training and Testing Examples**

**Step 3: Apply Model on Feature Vector**

**x = <Height = Short, Weight = Heavy, HairLength = Long, HeadCovered = Yes>**

**Note: vⱼ ∈ {Male, Female}**

**Using Naive Bayes, we need to compute**

*v_NB = argmax P(vⱼ) P(Height = Short | vⱼ) P(Weight = Heavy | vⱼ) P(HairLength = Long | vⱼ) P(HeadCovered = Yes | vⱼ), over vⱼ ∈ {Male, Female}*

**Using these estimates we can now compute v_NB**

**v_NB = Male**

**Step 4: Return the Prediction to the User: Male**

**Note: You can take Input from the user, apply the Model, and return Predictions as many times as you like 😊**

- Feedback Phase

**Only Allah is Perfect 😊**

**Take Feedback on your deployed Model from**

**Domain Experts and**

**Users**

**Improve your Model based on the Feedback 😊**

- Strengths and Weaknesses – Naïve Bayes Algorithm

**Strengths**

**The Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**The Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**The Naïve Bayes Algorithm can handle noisy Data**

**The Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small**

**The Naïve Bayes Algorithm can avoid Overfitting**

**The Naïve Bayes Algorithm is used in Deep Learning Algorithms, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**The Naïve Bayes Algorithm can make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**Weaknesses**

**The Naive Bayes Algorithm assumes that Attributes / Features are independent of each other**

**However, in the Real world, Attributes / Features often depend on each other**

### TODO and Your Turn


Todo Tasks

**TODO Task 3**

**Task 1**

**Consider the Titanic Dataset with the following Attributes**

**Gender: Male, Female**

**Ticket Class: Upper, Middle, Lower**

**Parent/Child Aboard: Zero, One, Two, Three**

**Embarked: Cherbourg, Queenstown, Southampton**

**Survival: No, Yes**

**We obtained the following Sample Data**

**Sample Data was split into Training and Testing Data in a**

**Train-Test Split Ratio of 80%-20%**

**Training Data**

**Testing Data**

**Note**

**Consider the Naïve Bayes Algorithm when answering the questions given below**

**Your answer should be**

**Well Justified**

**Questions**

**Write down the Input and Output for the above Machine Learning Problem**

**How is a Training Example represented?**

**How should a Hypothesis (h) be represented?**

**Execute the Machine Learning Cycle**

**Write down the observations that you made during the execution of the Machine Learning Cycle**

Your Turn Tasks

**Your Turn Task 3**

**Task 1**

**Select a Machine Learning Problem (similar to: Titanic – Machine Learning from Disaster) and answer the questions given below.**

**Note**

**Consider the Naïve Bayes Algorithm when answering all the questions.**

**Questions**

**Write down the Input and Output for the selected Machine Learning Problem**

**How is a Training Example represented?**

**How should a Hypothesis (h) be represented?**

**Execute the Machine Learning Cycle**

**Write down the observations that you made during the execution of the Machine Learning Cycle**

## Chapter Summary

- Chapter Summary

**Major Problems**

**Problem 1**

**These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**Problem 2**

**These ML Algorithms either accept or reject a Hypothesis (h), i.e., take a Binary Decision**

**instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**A Possible Solution**

**Bayesian Learning Algorithms**

**Naïve Bayes – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Probability Distribution**

**Searching Strategy**

**I am not clear about the Search Strategy of the Naïve Bayes Algorithm. Please drop me an email if you know it. JazakAllah Khair (may Allah reward you with goodness).**

**Training Regime**

**Batch Method**

**Strengths**

**The Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept**

**The Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**The Naïve Bayes Algorithm can handle noisy Data**

**The Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small**

**The Naïve Bayes Algorithm can avoid Overfitting**

**The Naïve Bayes Algorithm is used in Deep Learning Algorithms, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**The Naïve Bayes Algorithm can make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**Weaknesses**

**The Naive Bayes Algorithm assumes that Attributes / Features are independent of each other**

**However, in the Real world, Attributes / Features often depend on each other**

**Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms**

**Bayesian Learning uses Bayes' Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)**

**Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)**

**Bayesian Learning Algorithms are suitable for the following Machine Learning Problems**

**When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)**

**When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)**

**instead of simply accepting or rejecting a Hypothesis (h)**

**When you have noisy Data**

**When you have small Training Data**

**In recent years, Bayesian Learning Algorithms have been used in Deep Learning, which allows Deep Learning Algorithms to learn from small datasets / corpora**

**When you want to make Probabilistic Predictions**

**instead of making a Binary Prediction, i.e., 0 or 1**

**A Brute Force Bayesian Learning Algorithm works as follows**

**Step 1: Compute Posterior Probability for all the Hypotheses (h) in Hypothesis Space (H)**

**Step 2: Output the Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, h_MAP)**

**The Bayes Optimal Classifier works as follows to calculate the Most Probable Classification of an unseen instance x**

**Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H**

**Step 2: Calculate the prediction of each Hypothesis (h) for each new instance**

**Step 3: Use the combination of Steps 1 and 2 to classify each new instance**

**The Prediction returned by h_MAP(x) is not always the Most Probable Classification**

**To reduce the computation cost in finding the Most Probable Classification using the Bayes Optimal Classifier, the Gibbs Algorithm works as follows**

**Step 1: Choose one Hypothesis (h) at random, according to P(h|D)**

**Step 2: Use this h to classify new unseen instances**

**The Naive Bayes Algorithm is a highly practical Bayesian Learning Algorithm**

**Its performance is comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks**

**Naïve Bayes is very good for**

**Text Classification**

### In Next Chapter


- In sha Allah (God willing), in the next Chapter, I will present

- Instance Based Learning