Chapter 12 - Bayesian Learning
Chapter Outline
- Chapter Outline
- Quick Recap
- Bayes Theorem and Bayesian Learning Algorithms
- Bayes Optimal Classifier
- Gibbs Algorithm
- Naive Bayes Algorithm
- Chapter Summary
Quick Recap
- Quick Recap – artificial Neural Networks
- Following Machine Learning Algorithms are based on Symbolic Representation
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm
- Symbolic Representation – Representing Training Examples
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Symbolic Representation – Representing Hypothesis (h)
- Two Types of Hypothesis (h) Representations
- Conjunction (AND) of Constraints on Input Attributes
- Disjunction (OR) of Conjunction (AND) of Input Attributes
- Note that both types of Hypothesis (h) Representations are Symbolic
- i.e., based on Symbols (Categorical Values)
- Problem – Symbolic Representation
- Cannot handle Machine Learning Problems with very complex Numeric Representations
- Solution
- Non-Symbolic Representations
- for e.g., Artificial Neural Networks
- Artificial Neural Networks – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Numeric
- Output
- Numeric
- Representation of Hypothesis (h)
- Combination of Weights between Units
- Weights are Numeric values
- Searching Strategy
- Gradient Descent Search
- Training Regime
- Batch Method
- Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)
- ANNs are suitable for those Machine Learning Problems in which
- Training Examples (both Input and Output) can be represented as real values (Numeric Representation)
- Hypothesis (h) can be represented as real values (Numeric Representation)
- Slow Training Times are OK
- Predictive Accuracy is more important than understanding
- When we have noise in Training Data
- The diagram below shows the General Architecture of Artificial Neural Networks
- ANNs mainly have three Layers
- Input Layer
- Hidden Layer
- Output Layer
- Each Layer contains one or more Units
- Main Parameters of ANNs are
- No. of Input Units
- No. of Hidden Layers
- No. of Hidden Units at each Hidden Layer
- No. of Output Units
- Mathematical Function at each Hidden and Output Unit
- Weights between Units
- Whether the ANN is Fully Connected or not
- Learning Rate
- ANNs can be broadly categorized based on
- Number of Hidden Layers in an ANN Architecture
- Two Layer Neural Networks (a.k.a. Perceptron)
- Number of Hidden Layers = 0
- Multi-layer Neural Networks
- Regular Neural Network
- Number of Hidden Layers = 1
- Deep Neural Network
- Number of Hidden Layers > 1
- A Perceptron is a simple Two Layer ANN, with
- One Input Layer
- Multiple Input Units
- Output Layer
- Single Output Unit
- A Perceptron can be used for Binary Classification Problems
- We can use Perceptrons to build larger Neural Networks
- Perceptron has limited learning abilities, i.e., it fails to learn simple Boolean-valued Functions (e.g., XOR)
- A Perceptron works as follows (a short code sketch follows these steps)
- Step 1: Random Weights are assigned to edges between Input-Output Units
- Step 2: Input is fed into the Input Units
- Step 3: Weighted Sum of Inputs (S) is calculated
- Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at the Output Unit
- Step 5: Mathematical Function calculates the Output of the Perceptron
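The five steps above can be illustrated with a minimal Python sketch. It is illustrative only: the threshold (step) activation, the three-input example, and the random initialization range are assumptions, not details from the slides.

```python
import random

def perceptron_predict(inputs, weights, bias):
    """Steps 3-5: compute the Weighted Sum of Inputs (S) and apply a threshold function."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias   # Step 3: Weighted Sum S
    return 1 if s >= 0 else 0                                # Steps 4-5: Output of the Perceptron

# Step 1: assign random weights to the edges between Input and Output Units
random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]
bias = random.uniform(-0.5, 0.5)

# Step 2: feed an input vector into the Input Units
x = [1.0, 0.0, 1.0]
print(perceptron_predict(x, weights, bias))   # prints 0 or 1
```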
- Before using ANN to build a Model
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Perceptron – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Gradient Descent Search
- Training Regime
- Incremental Method
- Strengths
- Perceptrons can learn Binary Classification Problems
- Perceptrons can be combined to make larger ANNs
- Perceptrons can learn linearly separable functions
- Weaknesses
- Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable
- A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer
- A Regular Neural Network can be used for both
- Binary Classification Problems and
- Multi-class Classification Problems
- Regular Neural Network – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Gradient Descent Search
- Training Regime
- Incremental Method
- Strengths
- Can learn both linearly separable and non-linearly separable Target Functions
- Can handle noisy Data
- Can learn Machine Learning Problems with very complex Numerical Representations
- Weaknesses
- Computational Cost and Training Time are high
- A Regular Neural Network works as follows (a short forward-pass sketch follows these steps)
- Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units
- Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)
- Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)
- Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)
- Step 5: Weighted Sum of the Hidden Unit outputs is calculated and fed to the Output Units (making up the Output Layer)
- Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)
- Step 7: Softmax Layer converts the vectors at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network
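A minimal forward-pass sketch of Steps 1-7, using NumPy. The layer sizes (4-5-3), the sigmoid function at the Hidden Units, and the identity function at the Output Units are assumptions made only to keep the example small.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Step 1: random weights between Input-Hidden and Hidden-Output Units
W_ih = rng.normal(size=(4, 5))               # 4 Input Units, 5 Hidden Units (assumed)
W_ho = rng.normal(size=(5, 3))               # 5 Hidden Units, 3 Output Units / Classes (assumed)

x = np.array([0.2, 0.7, 0.1, 0.5])           # Step 2: inputs fed into the Input Layer

s_hidden = x @ W_ih                          # Step 3: Weighted Sum fed to the Hidden Units
h = 1.0 / (1.0 + np.exp(-s_hidden))          # Step 4: mathematical function (sigmoid assumed)

s_output = h @ W_ho                          # Step 5: Weighted Sum fed to the Output Units
o = s_output                                 # Step 6: function at Output Units (identity assumed)

probs = softmax(o)                           # Step 7: Softmax converts outputs to probabilities
print("Predicted class:", int(np.argmax(probs)))
```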
- Regular Neural Network Training Rule is used to tweak the Weights, when
- Actual Value is different from Predicted Value
- How does the Regular Neural Network Training Rule work? (a short weight-update sketch follows these steps)
- Step 1: For a given Training Example, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) to 1 and
- Set the Target Output Values of the remaining Output Units to 0
- Step 2: Train the Regular Neural Network using Training Example E and get the Prediction (Observed Output o(E)) from the Regular Neural Network
- Step 3: If Target Output t(E) is different from Observed Output o(E)
- Calculate the Network Error
- Backpropagate the Error by updating the weights between Output-Hidden Units and Hidden-Input Units
- Step 4: Keep training the Regular Neural Network until the Network Error becomes very small
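The sketch below runs one (repeated) training-rule iteration for the same assumed 4-5-3 network: forward pass, compare t(E) with o(E), and backpropagate the error by updating the Output-Hidden and Hidden-Input weights. Sigmoid units and squared error are assumptions; this is a sketch of the idea, not the exact update rule from the earlier chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W_ih, W_ho, lr=0.1):
    """One training-rule iteration: forward pass, then backpropagate the error (Steps 2-3)."""
    h = sigmoid(x @ W_ih)                           # Hidden Unit outputs
    o = sigmoid(h @ W_ho)                           # Observed Output o(E)

    delta_o = (o - t) * o * (1 - o)                 # error term at the Output Units
    delta_h = (delta_o @ W_ho.T) * h * (1 - h)      # error term backpropagated to the Hidden Units

    W_ho -= lr * np.outer(h, delta_o)               # update Output-Hidden weights (in place)
    W_ih -= lr * np.outer(x, delta_h)               # update Hidden-Input weights (in place)
    return 0.5 * np.sum((t - o) ** 2)               # Network Error for this example

rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.5, size=(4, 5))
W_ho = rng.normal(scale=0.5, size=(5, 3))

x = np.array([0.2, 0.7, 0.1, 0.5])
t = np.array([0.0, 1.0, 0.0])                       # Step 1: target 1 for the example's class, 0 otherwise

# Step 4: keep training until the Network Error becomes very small
for _ in range(1000):
    err = train_step(x, t, W_ih, W_ho)
print(round(float(err), 4))
```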
- Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)
- Four Possible Solutions to overcome Overfitting in ANNs are as follows
- Training and Validation Set Approach
- Learn Multiple ANNs
- Momentum
- Weight Decay Factor
- ANNs – Strengths and Weaknesses
- Strengths
- Can learn problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Requires a lot of Training Time (particularly Deep Learning Models)
- Computations cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. takes a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
Quick Recap - Models
- Learning Input-Output Functions – General Settings
- Input to Learner
- Set of Training Examples (D)
- Set of Functions / Hypothesis (H)
- Output of Learner
- a hypothesis h from H, which best fits the Training Examples (D)
- Learning is a Searching Problem
- For any Machine Learning Algorithm, we should have clear understanding of the following three points
- Representation of Training Examples (D)
- How to represent Training Examples (D) in a Format which a Machine Learning Algorithm can understand and learn from them?
- Representation of Hypothesis (h)
- How to represent Hypothesis (h) in a Format which a Machine Learning Algorithm can understand?
- Searching Strategy
- What searching strategy is used by a Machine Learning Algorithm to find h from H, which best fits the Training Examples (D)?
- Summary - FIND-S Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Specific-to-general search: FIND-S starts with the most specific hypothesis and generalizes it only as far as needed to cover each positive Training Example
- Training Regime
- Incremental Method
- Summary – List Then Eliminate Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Sequential Search
- Training Regime
- Incremental Method
- Summary – Candidate Elimination Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Representation of Hypothesis (h)
- Conjunction (AND) of Constraints on Input Attributes
- Searching Strategy
- Version Space search: the Candidate Elimination Algorithm maintains the Specific (S) and General (G) boundary sets and narrows the Version Space with each Training Example
- Training Regime
- Incremental Method
- Summary – ID3 Algorithm
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Representation of Hypothesis (h)
- Disjunction (OR) of Conjunction (AND) of Input Attributes (a.k.a. Decision Tree)
- Searching Strategy
- Greedy Search
- Simple to Complex (Hill Climbing)
- Training Regime
- Batch Method
- Summary - Symbolic Representations
- Following Machine Learning Algorithms are based on Symbolic Representation
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm
- Symbolic Representation – Representing Training Examples
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Symbolic Representation – Representing Hypothesis (h)
- Two Types of Hypothesis (h) Representations
- Conjunction (AND) of Constraints on Input Attributes
- Disjunction (OR) of Conjunction (AND) of Input Attributes
- Note that both types of Hypothesis (h) Representations are Symbolic
- i.e. based on Symbols (Categorical Values)
- Problem with Symbolic Representations
- Problem
- Cannot handle Machine Learning Problems with very complex Numeric Representations
- Solution
- Non-Symbolic Representations
- For example, Artificial Neural Networks
- Non-Symbolic Representations - Artificial Neural Networks
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Numeric
- Output
- Numeric
- Representation of Hypothesis (h)
- Combination of Weights between Units
- Note
- Weights are Numeric values
- Searching Strategy
- Gradient Descent Search
- Training Regime
- Batch Method
- Comparison of Symbolic and Non-Symbolic Representation
- Representation of Training Examples (D)
- Symbolic Representations
- Input
- Categorical
- Output
- Categorical
- Non-Symbolic Representations
- Input
- Numeric
- Output
- Numeric
- Representation of Hypothesis (h)
- Symbolic Representations
- Categorical Values are used
- Non-Symbolic Representations
- Numeric Values are used
- Conclusion
- Major difference between Symbolic and Non-Symbolic Representation is that
- In Symbolic Representations both Training Example (d) and Hypothesis (h) are represented using
- Categorical Values
- In Non-Symbolic Representations both Training Example (d) and Hypothesis (h) are represented using
- Numerical Values
- Strengths and Weaknesses – ANNs
- Strengths
- Can learn Machine Learning Problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Require a lot of Training Time (particularly Deep Learning Models)
- Computational cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. takes a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
- Major Problems in ML Algorithms Studied So Far
- Alhumdulilah (الحمد للہ), Up till now, we have studied the following Machine Learning Algorithms
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm
- Perceptron Algorithm (Two Layer ANN)
- Regular Neural Network (Multi-layer ANN)
- Major Problems
- Problem 01
- These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept
- Problem 02
- These ML Algorithms either accept or reject a Hypothesis (h) i.e. take a Binary Decision
- Instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)
- A Possible Solution
- Bayesian Learning Algorithms
- In sha Allah, in this Chapter, I will try to explain how to use Bayesian Learning Algorithms for Machine Learning Problems
- Probabilistic ML Algorithms - Naïve Bayes
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Categorical / Numeric
- Output
- Categorical
- Representation of Hypothesis (h)
- Probability Distribution
- Searching Strategy
- Naïve Bayes does not perform an explicit search through the Hypothesis Space (H); the hypothesis is obtained directly by estimating the required probabilities from the Training Examples (D)
- Training Regime
- Batch Method
Bayes Theorem and Bayesian Learning Algorithms
- Bayesian Learning
- Definition
- Bayesian Learning uses Bayes’ Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)
- Note
- Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms
- Importance
- Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)
- Applications
- Categorizing News Articles
- Email Spam Detection
- Face Recognition
- Sentiment Analysis
- Medical Diagnosis
- Character Recognition
- Weather Prediction
- and many more 😊
- Suitable Problems for Bayesian Learning Algorithms
- Bayesian Learning Algorithms are suitable for following Machine Learning Problems
- When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)
- When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)
- Instead of simply accepting or rejecting a Hypothesis (h)
- When you have noisy Data
- When you have small Training Data
- In recent years, Bayesian Learning Algorithms have been used within Deep Learning, allowing Deep Learning Algorithms to learn from small datasets / corpora
- When you want to make Probabilistic Predictions
- Instead of making a Binary Prediction i.e. 0 or 1
- Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D), where
- P(h) = Prior Probability of Hypothesis h
- P(D) = Prior Probability of Training Examples D
- P(D|h) = Probability of D given h
- P(h|D) = Probability of h given D
- Example – Bayes Theorem (Cancer Diagnosis from a Laboratory Test)
- In this Example
- Hypothesis Space (H)
- H = {cancer, ¬cancer}
- H = {Patient has Cancer, Patient does not have Cancer}
- Set of Training Examples (D) – Outcome of Laboratory Test
- D = {Positive, Negative}
- P(h) = Prior Probability of Hypothesis h
- h = cancer = Patient has Cancer
- P(h) = P(cancer) = P (Patient has Cancer) = 0.008
- i.e. P(h) = 0.008
- P(D) = Prior Probability of Training Data (D)
- D = Positive Outcome of Laboratory Test
- P(D) = (0.008 * 0.98) + (0.992 * 0.03) = 0.0376
- P(D|h) = Probability of a Positive Outcome of the Laboratory Test given that a Patient has Cancer
- P(D|h) = 0.98
- Calculate P(h|D) using Bayes Theorem
- P(h|D) = Probability of the Patient having Cancer given a Positive Outcome of the Laboratory Test
- Recall
- Input to Learner
- Set of Training Examples (D)
- Hypothesis Space (H)
- Output by Learner
- Find a Hypothesis h from H, which best fits the Training Data
- Brute Force Bayesian Learning Algorithm
- Example 01 - Brute Force Bayesian Learning Algorithm
- Consider a Hypothesis Space (H) of three Hypothesis (h)
- H = {h1, h2, h3}
- Finding Maximum A Posteriori (MAP) Hypothesis, hMAP
- Step 1: Compute Posterior Probability for all the Hypothesis (h) in Hypothesis Space (H) using Bayes’ Theorem
- P(h1|D) = 0.58
- P(h2|D) = 0.48
- P(h3|D) = 0.79
- Step 2: Output Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, hMAP)
- hMAP = h3
- Example 02 - Brute Force Bayesian Learning Algorithm
Consider the following Machine Learning Problem
- In above learning problem
- Hypothesis Space (H)
- H = {cancer, ¬cancer}
- H = {Patient has Cancer, Patient does not have Cancer}
- Set of Training Examples (D) – Outcome of Laboratory Test
- D = {Positive, Negative}
- From the Machine Learning Problem statement, we see that in general
- P(cancer) = 0.008 P(¬cancer) = 0.992
- P(+|cancer) = 0.98 P(−|cancer) = 0.02
- P(+|¬cancer) = 0.03 P(−|¬cancer) = 0.97
- Given that a patient returns a Positive Laboratory Test
- What is the MAP Hypothesis, hMAP?
- Finding Maximum A Posteriori (MAP) Hypothesis, hMAP
- Step 1: Compute Posterior Probability for all the Hypothesis (h) in Hypothesis Space (H)
- P(h|D) = P(D|h) P(h) / P(D)
- Since P(D) is the same for all Hypotheses (h) in the Hypothesis Space (H), we can drop it when comparing Hypotheses
- P(h|D) ∝ P(D|h) P(h)
- Computing Posterior Probabilities
- P(cancer|+)
- P(cancer|+) = P(+|cancer) P(cancer)
- P(cancer|+) = 0.98 ∗ 0.008
- P(cancer|+) = 0.0078
- P(¬cancer|+)
- P(¬cancer|+) = P(+|¬cancer) P(¬cancer)
- P(¬cancer|+) = 0.03 ∗ 0.992
- P(¬cancer|+) = 0.0298
- Step 2: Output Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, hMAP)
- hMAP = ¬cancer (Patient does not have Cancer) – see the short code check below
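A short code check of the numbers in this example; the probabilities are taken from the slides above (only the rounding differs slightly). The normalized posterior at the end is extra information derived from the same numbers, not something stated on the slides.

```python
# P(h) and P(+ | h) as given in the cancer example above
priors = {"cancer": 0.008, "not_cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}

# Unnormalized posteriors: P(h | +) is proportional to P(+ | h) * P(h)
scores = {h: likelihood_pos[h] * priors[h] for h in priors}
print(scores)                        # {'cancer': 0.00784, 'not_cancer': 0.02976}

h_map = max(scores, key=scores.get)
print("hMAP =", h_map)               # not_cancer, i.e. the patient most probably does not have cancer

# Dividing by P(D) = 0.0376 gives the normalized posterior probabilities
total = sum(scores.values())
print({h: round(s / total, 3) for h, s in scores.items()})   # {'cancer': 0.208, 'not_cancer': 0.792}
```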
- Bayes Theorem and MAP/ML Hypotheses
- Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D)
- Generally, we want the Most Probable Hypothesis given the Training Examples
- i.e. the Maximum A Posteriori (MAP) Hypothesis, hMAP
- hMAP = argmax (h ∈ H) P(h|D) = argmax (h ∈ H) P(D|h) P(h)
- Bayes Theorem and MAP/ML Hypotheses, Cont…
- In some cases, it may be useful to assume that
- all Hypotheses (h) in Hypothesis Space (H) have same Prior Probability
- i.e. P(hi) = P(hj) for all hi, hj ∈ H.
- In such cases, we only need to consider P(D|h) to find the Most Probable Hypothesis
- P(D|h) is called the Likelihood of D given h
- Any hypothesis that maximizes P(D|h) is called a Maximum Likelihood (ML) Hypothesis
- hML = argmax (h ∈ H) P(D|h)
- Strengths and Weaknesses - Brute Force Bayesian Learning Algorithm
- Strengths
- Brute Force Bayesian Learning Algorithm returns the hMAP hypothesis, which can be used to make predictions on unseen data
- Weaknesses
- It is computationally expensive, because we have to
- calculate the Posterior Probability for every Hypothesis (h) in the Hypothesis Space (H)
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 1
- Task 1
- Consider the following Machine Learning Problems
- Fake Video Detection
- Fake News Detection
- Automatic Paraphrase Generation
- Language Detection from Speech
- Language Detection from Text
- Language Detection from Image
- Note
- Consider following Machine Learning Algorithms, when answering questions given below
- Decision Tree Learning
- Artificial Neural Networks
- Bayesian Learning Algorithms
- Your answer should be
- Well Justified
- Questions
- Write down the Input and Output for each Machine Learning Problem
- Which Machine Learning Problem is most complex?
- Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?
Your Turn Tasks
Your Turn Task 1
- Task 1
- Select five Real-world Problems which can be treated as Machine Learning Problems and answer the questions given below
- Note
- Consider following Machine Learning Algorithms, when answering questions given below
- Decision Tree Learning
- Artificial Neural Networks
- Bayesian Learning Algorithms
- Questions
- Write down the Input and Output for each selected Machine Learning Problem
- Which Machine Learning Problem is most complex?
- Which Machine Learning Algorithm(s) will be most suitable for each Machine Learning Problem?
Bayes Optimal Classifier
- Most Probable Classification
- Question
- Given an unseen instance x, what is the Most Probable Classification?
- A Possible Answer
- Use Bayes Optimal Classifier to find out the Most Probable Classification
- Bayes Optimal Classifier
- Bayes Optimal Classifier works as follows
- Bayes Optimal Classifier computes the Most Probable Classification using the following Formula
- vOB = argmax (vj ∈ V) Σ (hi ∈ H) P(vj|hi) P(hi|D)
- So, the Most Probable Classification for a new instance is the value vj which maximizes P(vj|D) = Σ (hi ∈ H) P(vj|hi) P(hi|D)
- Example – Bayes Optimal Classifier
- Machine Learning Problem
- Gender Identification
- Given
- A Hypothesis Space (H) of three Hypothesis (h)
- H = {h1, h2, h3}
- Note
- vj = {Male, Female}
- Task
- What is the Most Probable Classification (Male or Female) of an unseen instance x?
- Applying Bayes Optimal Classifier to find out Most Probable Classification (Male or Female) for x
- Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H using Bayes’ Theorem
- P(h1|D) = 0.4
- P(h2|D) = 0.3
- P(h3|D) = 0.3
- Step 2: Calculate the prediction of each Hypothesis (h) for unseen instance x
- h1 predicts unseen instance x as Female
- h2 predicts unseen instance x as Male
- h3 predicts unseen instance x as Male
- Summarizing Steps 1 and 2
- P(h1|D) = 0.4, P(Male | h1) = 0, P(Female| h1) = 1
- P(h2|D) = 0.3, P(Male | h2) = 1, P(Female | h2) = 0
- P(h3|D) = 0.3, P(Male | h3) = 1, P(Female | h3) = 0
- Step 3: Combine the results of Steps 1 and 2 to classify unseen instance x
- P(Male|D) = 0.3 + 0.3 = 0.6 and P(Female|D) = 0.4
- Most Probable Classification of unseen instance x is Male (see the short code sketch below)
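A minimal sketch of Steps 1-3 for this gender example. The posteriors and per-hypothesis predictions are the numbers from the slides above; the code structure itself is only illustrative.

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}                 # Step 1: P(h | D)
predictions = {"h1": "Female", "h2": "Male", "h3": "Male"}     # Step 2: h(x) for the unseen instance x

# Step 3: P(v | D) = sum over h of P(v | h) * P(h | D); here P(v | h) is 1 if h predicts v, else 0
class_scores = {"Male": 0.0, "Female": 0.0}
for h, p in posteriors.items():
    class_scores[predictions[h]] += p

print(class_scores)                               # {'Male': 0.6, 'Female': 0.4}
print("Most Probable Classification:", max(class_scores, key=class_scores.get))   # Male
```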
- hMAP and Most Probable Classification
- Question
- Is hMAP(x) the Most Probable Classification?
- where x is an unseen instance
- Answer
- No
- Example - hMAP and Most Probable Classification
- Consider the Gender Identification Problem discussed in previous Slides to explain the working of Bayes Optimal Classifier
- Calculating hMAP
- P(h1|D) = 0.4
- P(h2|D) = 0.3
- P(h3|D) = 0.3
- hMAP = h1
- Classifying unseen instance x
- Choice 01 – Use hMAP
- Prediction = Female
- Choice 02 – Use Bayes Optimal Classifier
- Prediction = Male
- Conclusion
- Prediction returned by hMAP(x) is not always the Most Probable Classification
- Strengths and Weaknesses – Bayes Optimal Classifier
- Strengths
- Predicts the Most Probable Classification for unseen data
- Weaknesses
- Computationally very expensive, because you need to do the following calculations
- Step 1: Calculate Posterior Probability for every Hypothesis (h) in Hypothesis Space (H)
- Step 2: Calculate prediction for every Hypothesis (h) in Hypothesis Space (H)
- Combine Steps 1 and 2 to calculate the Most Probable Classification for unseen instances
Gibbs Algorithm
- Problem in Bayes Optimal Classifier
- Problem
- Bayes Optimal Classifier is computationally expensive
- A Possible Solution
- Use Gibbs Algorithm
- Gibbs Algorithm
- Gibbs Algorithm works as follows
- Step 1: Choose one Hypothesis (h) at random, according to P(h|D)
- Step 2: Use this h to classify the new unseen instance
- Surprising fact
- The Expected Error of the Gibbs Algorithm is no worse than twice the Expected Error of the Bayes Optimal Classifier (a short sampling sketch follows)
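A minimal sketch of the two Gibbs steps, reusing the gender example's posteriors for illustration. Which hypothesis gets sampled varies from run to run; that randomness is the point of the algorithm.

```python
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}                 # P(h | D)
predictions = {"h1": "Female", "h2": "Male", "h3": "Male"}     # h(x) for the unseen instance x

# Step 1: choose one Hypothesis (h) at random, with probability proportional to P(h | D)
hypotheses, weights = zip(*posteriors.items())
h = random.choices(hypotheses, weights=weights, k=1)[0]

# Step 2: use this single h to classify the new unseen instance
print("Sampled hypothesis:", h, "-> prediction:", predictions[h])
```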
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 2
- Task 1
- Consider a Machine Learning Problem that has three possible outcomes: child, youth, and old, a Hypothesis Space (H) containing 8 possible hypotheses h1, h2, h3, h4, h5, h6, h7, and h8, and a Set of Training Examples D available to train the Model.
- P(h1|D) = 0.3, P(h2|D) = 0.4, P(h3|D) = 0.3, P(h4|D) = 0.2, P(h5|D) = 0.1, P(h6|D) = 0.4, P(h7|D) = 0.5, P(h8|D) = 0.3
- Given a new instance x such that
- h1(x) = child, h2(x) = youth, h3(x) = old, h4(x) = old, h5(x) = youth, h6(x) = child, h7(x) = old, h8(x) = youth
- Note
- Your answer should be
- Well Justified
- Questions
- What is the Most Probable Classification of unseen instance x?
- Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?
- Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?
Your Turn Tasks
Your Turn Task 2
- Task 1
- Create a scenario similar to the Task given in TODO and answer the questions given below.
- Questions
- What is the Most Probable Classification of unseen instance x?
- Is the classification assigned by the most probable hypothesis the same as the most probable classification? If yes, how? If no, why not?
- Should we use the Gibbs Algorithm instead of the Bayes Optimal Classifier?
Naïve Bayes Algorithm
- Naive Bayes Algorithm
- Highly practical Bayesian Learning Algorithm
- Performance comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks
- Naïve Bayes is very good for
- Text Classification
- Assume Target Function f : X → V, where each instance in X is described by attributes (a1, a2, …, an)
- Most Probable Value vMAP of f(x) is:
- vMAP = argmax (vj ∈ V) P(vj | a1, a2, …, an) = argmax (vj ∈ V) P(a1, a2, …, an | vj) P(vj)
- P(a1, a2, …, an | vj) is impractical to estimate unless we have a very large Training Set
- Number of such terms = Number of Possible Instances × Number of Possible Target Values
- Example: 20 attributes, 5 values each, 2 target values
- 5^20 × 2 = 190,734,863,281,250 terms
- Reliability of the Probability Estimates is low, unless
- each instance is seen many times
- Key Assumption of Naive Bayes
- Attribute Values are conditionally independent given the Target Value, i.e., assume
- P(a1, a2, …, an | vj) = Π i P(ai | vj)
- Making this assumption leads to the Naive Bayes Classifier
- vNB = argmax (vj ∈ V) P(vj) Π i P(ai | vj)
- Naive Bayes Algorithm
- Learning: estimate P(vj) and each P(ai | vj) from the Training Examples by counting relative frequencies
- Classification: for a new instance (a1, a2, …, an), return vNB (a minimal code sketch follows)
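A minimal categorical Naive Bayes sketch under the assumptions above. The class name, the tiny four-example dataset, and the use of plain frequency estimates (no smoothing) are illustrative assumptions, not the chapter's actual Training Data.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """vNB = argmax over v of P(v) * product over i of P(a_i | v), estimated by counting."""

    def fit(self, X, y):
        self.class_counts = Counter(y)                         # counts used for P(v)
        self.total = len(y)
        # cond_counts[v][i][a] = number of training examples with class v and attribute i equal to a
        self.cond_counts = defaultdict(lambda: defaultdict(Counter))
        for xs, v in zip(X, y):
            for i, a in enumerate(xs):
                self.cond_counts[v][i][a] += 1
        return self

    def predict(self, xs):
        best_v, best_score = None, -1.0
        for v, cv in self.class_counts.items():
            score = cv / self.total                            # P(v)
            for i, a in enumerate(xs):
                score *= self.cond_counts[v][i][a] / cv        # P(a_i | v) as a simple frequency
            if score > best_score:
                best_v, best_score = v, score
        return best_v

# Illustrative data only, using the Gender Identification attributes from the next section
X = [("Tall", "Heavy", "Short", "No"), ("Short", "Light", "Long", "Yes"),
     ("Normal", "Heavy", "Short", "No"), ("Short", "Normal", "Long", "Yes")]
y = ["Male", "Female", "Male", "Female"]

model = NaiveBayes().fit(X, y)
print(model.predict(("Tall", "Heavy", "Short", "No")))         # Male
```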
Machine Learning Cycle – Naïve Bayes Algorithm
- Gender Identification Problem
- Machine Learning Problem
- Gender Identification
- Input
- Human
- Output
- Gender of a Human
- Task
- Given a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e. Learn from Input to predict Output
- Representation of Examples
- Example = Input + Output
- Input
- Human
- Output
- Gender
- Representation of Input and Output
- Attribute-Value Pair
- Representation of Input
- Input is represented as a
- Set of 4 Input Attributes
- Height
- Weight
- HairLength
- HeadCovered
- Representation of Output
- Output is represented as a
- Set of 1 Output Attribute
- Gender
- Note
- Yes means Female and No means Male
- Sample Data
- We obtained a Sample Data of 21 examples
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Testing Data
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in Real-world, to make prediction on Real-time unseen Data
- Feedback Phase
- Take Feedback from the Users and Domain Experts to improve the Model
- Training Phase – Naïve Bayes Algorithm
- Compute Probabilities
- Summary - Training Phase of Naïve Bayes Algorithm
- Recall
- Training Data
- Model
- In the next Phase i.e. Testing Phase, Insha Allah (انشاء اللہ), we will
- Evaluate the performance of the Model
- Testing Phase – Naïve Bayes Algorithm
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total number of Test instances
- Note
- Accuracy = 1 – Error (see the short snippet below)
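A tiny illustration of the Error formula; the actual and predicted labels below are made up, not the chapter's Testing Data.

```python
actual    = ["Female", "Male", "Female", "Female", "Male", "Female", "Female"]    # hypothetical
predicted = ["Female", "Female", "Male", "Female", "Female", "Female", "Female"]  # hypothetical

# Error = incorrectly classified Test instances / total Test instances
error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print("Error =", round(error, 3), "Accuracy =", round(1 - error, 3))
```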
- Evaluate Model (Naïve Bayes)
- Test Example x15
- x15 = <Tall, Heavy, Long, No>
- Compute the Most Probable Classification for Test Example x15 using the Naïve Bayes Model
- Note
- vj ∈ {Male, Female}
- Using Naive Bayes, we need to compute
- vNB = argmax (vj ∈ {Male, Female}) P(vj) P(Height = Tall | vj) P(Weight = Heavy | vj) P(HairLength = Long | vj) P(HeadCovered = No | vj)
- Note
- Predicting Class of Test Example x15
- Compute vNB
- Prediction
- VNB = Female
- Test Example x16
- x16 = <Short, Normal, Short, No>
- Prediction
- VNB = Female
- Test Example x17
- x17 = <Short, Normal, Long, Yes>
- Prediction
- VNB = Male
- Test Example x18
- x18 = <Normal, Normal, Short, No>
- Prediction
- VNB = Female
- Test Example x19
- x19 = <Short, Light, Short, No>
- Prediction
- VNB = Female
- Test Example x20
- x20 = <Normal, Light, Short, Yes>
- Prediction
- VNB = Female
- Test Example x21
- x21 = <Tall, Light, Short, No>
- Prediction
- VNB = Female
- Summary - Evaluate Model (Naïve Bayes)
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that order of Attributes must be exactly same as that of Training and Testing Examples
- Step 3: Apply Model on Feature Vector
- x = <Height = Short, Weight = Heavy, HairLength = Long, HeadCovered = Yes>
- Note
- vj ∈ {Male, Female}
- Using Naive Bayes, we need to compute
- vNB = argmax (vj ∈ {Male, Female}) P(vj) P(Height = Short | vj) P(Weight = Heavy | vj) P(HairLength = Long | vj) P(HeadCovered = Yes | vj)
- Using the probability estimates from the Training Phase, we can now compute vNB
- vNB = Male
- Step 4: Return Prediction to the User
- Male
- Note
- You can take Input from the user, apply the Model, and return predictions as many times as you like 😊 (an end-to-end sketch of Steps 1-4 follows)
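An end-to-end sketch of Steps 1-4 of the Application Phase. The probability values below are invented placeholders standing in for the Training Phase estimates (chosen only so the sketch reproduces the Male prediction shown above); the attribute names and order match the chapter.

```python
# Hypothetical probability estimates standing in for the Training Phase tables
priors = {"Male": 0.5, "Female": 0.5}
cond = {
    "Male":   {"Height=Short": 0.3, "Weight=Heavy": 0.7, "HairLength=Long": 0.4, "HeadCovered=Yes": 0.5},
    "Female": {"Height=Short": 0.4, "Weight=Heavy": 0.2, "HairLength=Long": 0.6, "HeadCovered=Yes": 0.7},
}

# Step 1: take Input from the User
raw = {"Height": "Short", "Weight": "Heavy", "HairLength": "Long", "HeadCovered": "Yes"}

# Step 2: convert the User Input into a Feature Vector (same attribute order as Training/Testing Data)
features = [f"{attr}={raw[attr]}" for attr in ("Height", "Weight", "HairLength", "HeadCovered")]

# Step 3: apply the Model - vNB = argmax over v of P(v) * product of P(feature | v)
scores = {}
for v in priors:
    score = priors[v]
    for f in features:
        score *= cond[v][f]
    scores[v] = score

# Step 4: return the Prediction to the User
print("Prediction:", max(scores, key=scores.get))   # Male with these placeholder numbers
```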
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
- Improve your Model based on Feedback 😊
- Strengths and Weaknesses – Naïve Bayes Algorithm
- Strengths
- Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept
- Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)
- Instead of simply accepting or rejecting a Hypothesis (h)
- Naïve Bayes Algorithm can handle noisy Data
- Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small
- Naïve Bayes Algorithm can avoid Overfitting
- Naïve Bayes has been used within Deep Learning Algorithms, allowing them to learn from small datasets / corpora
- Naïve Bayes Algorithm can make Probabilistic Predictions
- Instead of making a Binary Prediction i.e. 0 or 1
- Weaknesses
- Naive Bayes Algorithm assumes that Attributes / Features are conditionally independent of each other given the Target Value
- However, in the Real world, Attributes / Features often depend on each other
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 3
- Task 1
- Consider the Titanic Dataset with the following Attributes
- Gender: Male, Female
- Ticket Class: Upper, Middle, Lower
- Parents/Children Aboard: Zero, One, Two, Three
- Embarked: Cherbourg, Queenstown, Southampton
- Survival: No, Yes
- We obtained the following Sample Data
- Sample Data was split into Training and Testing Data in a
- Train-Test Split Ratio of 80%-20%
- Training Data
- Testing Data
- Note
- Consider Naïve Bayes Algorithm when answering questions given below
- Your answer should be
- Well Justified
- Questions
- Write down the Input and Output for the above Machine Learning Problem
- How is a Training Example represented?
- How should the Hypothesis (h) be represented?
- Execute the Machine Learning Cycle
- Write down the observations you made during the execution of the Machine Learning Cycle
Your Turn Tasks
Your Turn Task 3
- Task 1
- Select a Machine Learning Problem (similar to: Titanic – Machine Learning from Disaster) and answer the questions given below.
- Note
- Consider Naïve Bayes Algorithm when answering all the questions.
- Questions
- Write down the Input and Output for the selected Machine Learning Problem
- How is a Training Example represented?
- How should the Hypothesis (h) be represented?
- Execute the Machine Learning Cycle
- Write down the observations you made during the execution of the Machine Learning Cycle
Chapter Summary
- Chapter Summary
- Major Problems in the Machine Learning Algorithms studied in previous chapters
- Problem 1
- These ML Algorithms do not allow us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept
- Problem 2
- These ML Algorithms either accept or reject a Hypothesis (h) i.e., take a Binary Decision
- Instead of assigning a probability to each Hypothesis (h) in the Hypothesis Space (H)
- A Possible Solution
- Bayesian Learning Algorithms
- Naïve Bayes – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Probability Distribution
- Searching Strategy
- Naïve Bayes does not perform an explicit search through the Hypothesis Space (H); the hypothesis is obtained directly by estimating the required probabilities from the Training Examples (D)
- Training Regime
- Batch Method
- Strengths
- Naïve Bayes Algorithm allows us to combine prior knowledge (or past experience) with our Training Examples, when learning a Concept
- Naïve Bayes Algorithm assigns a probability to each Hypothesis (h) in the Hypothesis Space (H)
- Instead of simply accepting or rejecting a Hypothesis (h)
- Naïve Bayes Algorithm can handle noisy Data
- Naïve Bayes Algorithm shows good Accuracy on Testing Data even if the Set of Training Examples (D) is rather small
- Naïve Bayes Algorithm can avoid Overfitting
- Naïve Bayes has been used within Deep Learning Algorithms, allowing them to learn from small datasets / corpora
- Naïve Bayes Algorithm can make Probabilistic Predictions
- Instead of making a Binary Prediction i.e. 0 or 1
- Weaknesses
- Naive Bayes Algorithm assumes that Attributes / Features are conditionally independent of each other given the Target Value
- However, in the Real world, Attributes / Features often depend on each other
- Bayesian Learning Algorithms are Probabilistic Machine Learning Algorithms
- Bayesian Learning uses Bayes’ Theorem to determine the Conditional Probability of a Hypothesis (h) given a Set of Training Examples (D)
- Bayesian Learning Algorithms allow us to combine prior knowledge (or past experience) with our Set of Training Examples (D)
- Bayesian Learning Algorithms are suitable for following Machine Learning Problems
- When you want to combine prior knowledge (or past experience) with your Set of Training Examples (D)
- When you want to assign a probability to each Hypothesis (h) in the Hypothesis Space (H)
- Instead of simply accepting or rejecting a Hypothesis (h)
- When you have noisy Data
- When you have small Training Data
- In recent years, Bayesian Learning Algorithms have been used within Deep Learning, allowing Deep Learning Algorithms to learn from small datasets / corpora
- When you want to make Probabilistic Predictions
- Instead of making a Binary Prediction i.e. 0 or 1
- A Brute Force Bayesian Learning Algorithm works as follows
- Step 1: Compute Posterior Probability for all the Hypothesis (h) in Hypothesis Space (H)
- Step 2: Output Most Probable Hypothesis (a.k.a. Maximum A Posteriori (MAP) Hypothesis, hMAP)
- Bayes Optimal Classifier works as follows to calculate the Most Probable Classification of an unseen instance x
- Step 1: Calculate Posterior Probability for every Hypothesis h ∈ H
- Step 2: Calculate the prediction of each Hypothesis (h) for each new instance
- Step 3: Use the combination of Step 1 and 2. to classify each new instance
- The Prediction returned by hMAP(x) is not always the Most Probable Classification
- To reduce the computation cost in finding the Most Probable Classification using Bayes Optimal Classifier, the Gibbs Algorithm works as follows
- Step 1: Choose one Hypothesis (h) at random, according to P(h|D)
- Step 2: Use this h to classify new unseen instances
- Naive Bayes Algorithm is a highly practical Bayesian Learning Algorithm
- Performance comparable to Artificial Neural Networks and Decision Tree Learning Algorithms on some tasks
- Naïve Bayes is very good for
- Text Classification
In Next Chapter
- In Next Chapter
- In Sha Allah, in next Chapter, I will present
- Instance Based Learning