# Chapter 14 - Evaluating Hypothesis (Models)

### Chapter Outline

- Chapter Outline

**Quick Recap**

**Why Evaluate Hypotheses (Model)?**

**Two Main Diseases of Machine Learning Algorithms**

**Comparing Machine Learning Algorithms**

**Evaluation Measures for Classification Problems**

**Evaluation Measures for Regression Problems**

**Evaluation Measures for Sequence-to-Sequence Problems**

**Chapter Summary**

## Quick Recap

- Quick Recap - Evaluating Hypothesis (Models)

**Machine Learning Algorithms studied so far are**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm (Decision Tree Learning)**

**Perceptron (Two-Layer Artificial Neural Network)**

**Regular Neural Network (Multi-layer Artificial Neural Network)**

**Naïve Bayes Algorithm (Bayesian Learning Algorithm)**

**The above-mentioned ML Algorithms are called Eager Methods**

**How Do Eager ML Algorithms Work?**

**Given**

**Set of Training Examples (D)**

**Set of Functions / Hypotheses (H)**

**Training Phase**

**Build an explicit representation of the Target Function (called the approximated Target Function) by**

**Searching the Hypothesis Space (H) to find a Hypothesis h (approximated Target Function) which best fits the Set of Training Examples (D)**

**Testing Phase**

**Use the explicit representation (approximated Target Function) to classify unseen instances**

**The approximated Target Function is applied on an unseen instance and predicts its Target Classification**

**How Do Lazy ML Algorithms Work?**

**Given**

**Set of Training Examples (D)**

**Training Phase**

**Simply store the Set of Training Examples (D)**

**Testing Phase**

**To classify a new unseen instance x**

**Step 1: Compare unseen instance x with all the stored Training Examples**

**Step 2: Depending on the relationship of unseen instance x with the stored Training Examples, predict the Target Classification for unseen instance x**

**Machine Learning Algorithms which build an explicit representation of the Target Function as Training Examples are presented are called Eager Machine Learning Algorithms**

**Machine Learning Algorithms which defer processing until a new unseen instance must be classified are called Lazy Machine Learning Algorithms**

**Instance-based ML Algorithms are Lazy ML Algorithms**

**The *k*-NN Algorithm is the grand-daddy of Instance-based Machine Learning Algorithms**

**The k nearest neighbors of an unseen instance x are the Training Examples that have the k smallest distances to unseen instance x**

**k-Nearest Neighbor (k-NN) Algorithm – Summary**

**Representation of Training Examples**

**Attribute-Value Pairs**

**Computing Relationship between unseen instance x and stored Training Examples (D)**

**Use a Distance Metric**

**Distance Metrics – Numeric Data**

**Euclidean Distance, Jaccard Coefficient, etc.**

**Distance Metrics – Categorical Data**

**Hamming Distance, Edit Distance, etc.**
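Two of the distance metrics named above can be sketched in a few lines of Python; this is a minimal illustration (not tied to any particular library), assuming instances are given as plain attribute vectors:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two numeric attribute vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    # Number of positions at which two categorical attribute vectors differ
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))                 # 5.0
print(hamming_distance(["red", "S", "yes"], ["red", "M", "no"]))  # 2
```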

**Strengths**

**Very simple implementation**

**Robust regarding the search space**

**For example, Classes don't have to be linearly separable**

**k-NN Algorithm can be easily updated with new Training Examples (Online Learning) at very little cost**

**Need to tune only two main parameters**

**Distance Metric and**

**Value of k**

**Weaknesses**

**Testing each instance is expensive**

**Sensitive to noisy or irrelevant Attributes, which can result in less meaningful Distance scores**

**Sensitive to highly unbalanced datasets**

**k-Nearest Neighbor (k-NN) Algorithm requires three things**

**Set of Training Examples (D)**

**Distance Metric to compute distance between unseen instance x and the Set of stored Training Examples**

**Value of *k***

**Number of Nearest Neighbors to consider when classifying unseen instance x**

**In the k-Nearest Neighbor (k-NN) Algorithm, the value of k should be carefully chosen**

**If k is too small**

**Sensitive to noise points (Training Examples)**

**If k is too large**

**Neighborhood may include points (Training Examples) from other Classes**

**Steps – Classify an Unseen Instance using k-NN Algorithm**

**Given**

**Set of 15 Training Examples**

**Distance Metric = Euclidean Distance**

**k = 3**

**Classifying an unseen instance x using k-NN Algorithm**

**Step 1: Compute Euclidean Distance between unseen instance x and all 15 Training Examples**

**Step 2: Identify the 3 nearest neighbors (as k = 3)**

**Step 3: Use the Class labels of the k nearest neighbors to determine the Class label of unseen instance x (e.g., by taking a Majority Vote)**
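The steps above can be sketched as a minimal Python implementation; the `train` data and the `euclidean` helper are illustrative assumptions, not the 15-example dataset of the chapter:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, x, k=3):
    # train: list of (attribute_vector, class_label) pairs
    # Step 1: sort stored Training Examples by distance to x
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], x))
    # Step 2: keep the k nearest neighbours
    k_nearest = neighbours[:k]
    # Step 3: majority vote over their Class labels
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

train = [([1, 1], "A"), ([1, 2], "A"), ([8, 8], "B"), ([9, 8], "B"), ([2, 1], "A")]
print(knn_classify(train, [1.5, 1.5], k=3))  # A
```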

**k-NN Algorithm suffers from a Scaling issue**

**Attributes may have to be scaled to prevent distance measures from being dominated by one of the Attributes**

**Scale / normalize Attributes before applying the k-NN Algorithm**
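One common way to do this is min-max normalization, which rescales every Attribute to [0, 1]; a minimal sketch assuming purely numeric Attributes (the age/salary data is an invented illustration):

```python
def min_max_scale(dataset):
    # Rescale each Attribute (column) to the range [0, 1] so that no
    # single large-valued Attribute dominates the distance computation
    cols = list(zip(*dataset))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi != lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in dataset
    ]

data = [[20, 100000], [30, 50000], [40, 150000]]  # e.g. age vs salary
print(min_max_scale(data))  # [[0.0, 0.5], [0.5, 0.0], [1.0, 1.0]]
```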

**In Offline Learning, we learn a Concept from a static dataset**

**In Online Learning, we learn a Concept from data and keep updating our Hypothesis (h) as more data becomes available for the same Concept**

## Why Evaluate Hypotheses (Model)?

- Completely and Correctly Learning any Task

**Question**

**How to completely and correctly learn any task?**

**A Possible Answer**

**Follow the Learning Cycle**

**Learning Cycle**

**The four main phases of a Learning Cycle to completely and correctly learn any task are**

**Training Phase**

**Testing Phase**

**Application Phase**

**Feedback Phase**

- Example - Completely and Correctly Learning any Task

**Learning Problem – Learn to Drive a Car in the Real-world**

**Question – How to achieve this goal?**

**A Possible Answer – Follow the Learning Cycle**

**Training Phase – Learn to drive a Car from a Trainer in a Training Centre**

**Assumption – The environment of the Training Centre mimics that of the Real-world**

**Outcome of Training Phase – I have learned to drive a car in a Training Centre**

**Question – Can we completely trust the quality of Training and allow the Trainee to drive a car in the Real-world?**

**Answer – No**

**Problem – We cannot completely trust the quality of Training and allow the Trainee to drive a car in the Real-world**

**A Possible Solution – Evaluate (test) the driving skills of the Trainee at a Test Centre, i.e. an environment which is different from the Training Centre and mimics the Real-world**

**Testing Phase – Evaluate the driving skills of the Trainee at the Test Centre**

**Question – How to ensure quality in evaluation?**

**A Possible Answer – Design an evaluation process which completely and correctly evaluates all aspects of the Task (to be evaluated)**

**Assumption in Current Example – In this example, I am assuming that the evaluation process to evaluate the driving skills of a Trainee is complete and correct in all aspects**

**Outcome of Testing Phase – If (Trainee Performed Well in Testing Phase) Then Allow Trainee to Drive a Car in Real-world**

**Else Do not Allow Trainee to Drive a Car in Real-world and ask him/her to take further Training to refine his/her Driving Skills**

**In this example, we assume that the Trainee Performed Well in the Testing Phase and (s)he has been given a Driving License**

**Duration of a Driving License – Note that the duration of a Driving License is normally a period of 5 years**

**Note – Learning is a continuous process till death 😊**

**Application Phase – Person (Driver) with Driving License is driving the car in the Real-world**

**Feedback Phase – Driving skills of a Driver are constantly monitored by Traffic Police 😊**

**Positive Feedback – Keep driving a car in Real-world**

**Negative Feedback – Punishment – In extreme situations, punishment may result in cancellation of the Driving License**

**Conclusion – It can be noted from the Learning Cycle that there is Continuous Evaluation**

**This indicates that Continuous Evaluation with Quality is very important to bring Quality to the Learning Process**

- Why Evaluate Hypothesis (Model)?

**Given**

**A Machine Learning Algorithm (e.g. ID3 Algorithm) and**

**A Hypothesis (h) or Model, which the ID3 Algorithm has learned from a Set of Training Examples**

**Question**

**Why should we evaluate h (Model)?**

**Answer**

**The main goal of building a Model (Training / Learning) is to use it for Real-world Tasks with good Accuracy**

**No one can perfectly predict that if a Model performs well in the Training Phase, then it will also perform well in the Real-world**

**However, before deploying a Model in the Real-world, it is important to know**

**How well will it perform on unseen Real-time Data?**

**To judge/estimate the performance of a Model (or h) in the Real-world (Application Phase)**

**Evaluate the Model (or h) on large Test Data (Testing Phase)**

**Recall – Machine Learning Assumption**

**If a Model (or h) performs well on large Test Data, it will also perform well on unseen Real-world Data**

**Again, this is an assumption, and we are not 100% sure that a Model which performs well on large Test Data will definitely perform well on Real-world Data**

**Therefore, it is useful to take continuous Feedback on the deployed Model (Feedback Phase) and keep improving it**

- Advantages of Evaluating Hypothesis (Model)?

**The two main advantages of Evaluating a Hypothesis / Model are**

**We get the answer to an important question, i.e.**

**Should we rely on predictions of the Hypothesis / Model when deployed in the Real-world?**

**Machine Learning Algorithms may rely on Evaluation to refine the Hypothesis (h)**

- What We Want to Know When Evaluating Hypothesis / Model

**Question**

**What do we want to know when evaluating a Hypothesis h?**

**Answer**

**Estimate of Error (EoE)**

**How accurately will it classify future unseen instances?**

**Error in Estimate of Error (Error in EoE)**

**How accurate is our Estimate of Error (EoE)?**

**I.e. what Margin of Error (±? %) is associated with our Estimate of Error (EoE)?**

- Recall – Population and Sample

**Population (N)**

**Definition**

**The total set of observations (or examples) for a Machine Learning Problem**

**Collecting Data equivalent to the size of the Population will lead to perfect learning**

**Sample (S)**

**Definition**

**A subset of observations (or examples) drawn from a Population**

**Note**

**The size of a Sample is always less than the size of the Population from which it is taken**

**Most Important Property of a Sample**

**A Sample should be a true representative of the Population**

**Example – Population and Sample**

**Machine Learning Problem**

**Gender Identification**

**Population**

**Set of all observations (humans) in the world**

**Sample**

**A set of 5000 observations (humans) drawn from the Population**

**A Sample should be a true representative of the Population**

**If in a Population**

**60% are Females and 40% are Males**

**Then a Sample (5000 observations) drawn from this Population should have**

**60% Females (3000 observations) and 40% Males (2000 observations)**

- Why We Need Data Sampling?

**For Perfect Learning (Ideal Situation)**

**Collect all Data (or observations/examples) for a Machine Learning Problem**

**Problem**

**Practically Not Possible**

**A Possible Solution (Realistic Situation)**

**Draw a Sample from the Population which should be its true representative (called a Representative Sample)**

**Note**

**Since ML Algorithms learn from Sample Data instead of Population Data, they have Scope of Error**

**Remember**

**Only Allah and His Habib, Hazrat Muhammad S.A.W.W. (teachings) are perfect**

**So, follow them to be successful in this world and the hereafter, Ameen**

- True Error vs Sample Error

**True Error**

**Error computed on the entire Population**

**Sample Error**

**Error computed on Sample Data**

- True Error – Formal Definition

**The True Error of hypothesis h (or Model) with respect to Target Function f and Probability Distribution D is the probability that h will misclassify an instance drawn at random according to D**

- Sample Error – Formal Definition

**The Sample Error of hypothesis h (or Model) with respect to Target Function f and Data Sample S is the proportion of examples in S that h misclassifies**
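This definition translates directly into code as the proportion of misclassified examples; a minimal Python sketch (the prediction and label lists are invented purely for illustration):

```python
def sample_error(predictions, labels):
    # Proportion of examples in Sample S that hypothesis h misclassifies
    misclassified = sum(p != y for p, y in zip(predictions, labels))
    return misclassified / len(labels)

# Suppose h misclassifies 10 of 50 test instances -> Sample Error = 0.2
preds = ["M"] * 40 + ["F"] * 10
truth = ["M"] * 50
print(sample_error(preds, truth))  # 0.2
```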

- Calculating True Error

**Problem**

**Since we cannot acquire the entire Population**

**Therefore, we cannot calculate True Error**

**A Possible Solution**

**Calculate Sample Error in such a way that**

**Sample Error estimates True Error well**

- How Does Sample Error Estimate True Error Well?

**Statistical Theory tells us that Sample Error can estimate True Error well if the following two conditions are fulfilled**

**Condition 01 – the *n* instances in Sample *S* are drawn**

**independently of one another**

**independently of *h***

**according to Probability Distribution *D***

**Condition 02**

**n ≥ 30**

- Important Note

**A very Common Mistake in Evaluation**

**Size of Test Data < 30 instances**

**Remember**

**The Test Data must contain at least 30 instances (n ≥ 30)**

- Example – Calculating Sample Error

**Problem**

**Calculating Sample Error**

- Sample Error vs Estimate of Sample Error (EoSE)

**Ideal Situation**

**Calculate Sample Error**

**Realistic Situation**

**Calculate Estimate of Sample Error (EoSE)**

**In sha Allah (انشاء اللہ), in the next slides, I will try to explain the difference between**

**Sample Error and**

**Estimate of Sample Error (EoSE)**

- Example - Estimate of Sample Error (EoSE)

**Machine Learning Problem**

**Gender Identification**

**Machine Learning Algorithm**

**ID3 Algorithm**

**Population (Instance Space (X))**

**All humans in the world**

**Sampling Technique**

**Random Sampling**

**Representative Sample (Sample Data)**

**Set of Examples (or humans) randomly drawn from the Population**

**Sample Size = 5000 instances**

**Split Sample Data**

**Using a Train-Test Split Ratio of 80%-20%**

**Training Data = 4000 instances**

**Testing Data = 1000 instances**

**Evaluation Measure**

**Error**

- Example - Estimate of Sample Error (EoSE)

**Consider three Samples randomly drawn from the Population**

**Sample S1**

**Sample S2**

**Sample S3**

**Training and Testing the ID3 Algorithm on Sample S1**

**Sample Error (S1) = 0.25**

**Training and Testing the ID3 Algorithm on Sample S2**

**Sample Error (S2) = 0.20**

**Training and Testing the ID3 Algorithm on Sample S3**

**Sample Error (S3) = 0.30**

**Note that the Sample Error for all three random Samples (S1, S2 and S3) is different**

**Conclusion**

**If we randomly draw n different Samples and compute Sample Error, then**

**Sample Error is likely to vary from Sample to Sample**

**Therefore, we cannot compute the Sample Error; we can only compute an Estimate of Sample Error (EoSE)**

- How to Calculate Estimate of Sample Error (EoSE)?

**Question**

**How can we calculate the Estimate of Sample Error (EoSE)?**

**Answer**

**Step 1: Randomly select a Representative Sample S from the Population**

**Step 2: Calculate Sample Error on the Representative Sample S drawn in Step 1**

**Note that this Sample Error is called the Estimate of Sample Error (EoSE)**

- Estimate of Sample Error (EoSE)

**Problem**

**An Estimate cannot be perfect and will contain Error**

**A Possible Solution**

**Calculate the Error in the Estimate of Sample Error (EoSE)**

**Question**

**How to calculate the Error in the Estimate of Sample Error (EoSE)?**

**A Possible Answer**

**Use a Confidence Interval**

- Confidence Interval – Formal Definition

**The most probable value of True Error is Sample Error. With approximately N% Probability, True Error lies in the interval**

**errorS(h) ± zN √( errorS(h) × (1 − errorS(h)) / n )**

**where n represents the size of Sample S, errorS(h) represents Sample Error, and zN is a constant determined by the chosen Confidence Level (e.g. zN = 1.96 for N = 95)**
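The interval can be computed directly from this formula; a minimal Python sketch using the standard two-sided zN constants (the inputs n = 50, errorS(h) = 0.2 are illustrative, matching TODO Task 3 later in the chapter):

```python
import math

# Standard z_N constants for common two-sided Confidence Levels
Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def confidence_interval(sample_error, n, confidence=95):
    # errorS(h) +/- z_N * sqrt( errorS(h) * (1 - errorS(h)) / n )
    margin = Z[confidence] * math.sqrt(sample_error * (1 - sample_error) / n)
    return (sample_error - margin, sample_error + margin)

# h misclassified 10 of 50 instances -> EoSE = 0.2; 95% Confidence Level
lo, hi = confidence_interval(0.2, 50, confidence=95)
print(round(lo, 4), round(hi, 4))  # 0.0891 0.3109
```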

- Confidence Interval

**Definition**

**A Confidence Interval provides us with lower and upper limits around our Estimate of Sample Error (EoSE), and within this interval we can be confident that we have captured the True Error**

**The lower limit and upper limit around our Sample Error tell us the range of values our True Error is likely to lie within**

**A Confidence Interval is often used with a**

**Margin of Error**

**Margin of Error – Definition**

**The Margin of Error is the range of values below and above the Estimate of Sample Error (EoSE) in a Confidence Interval**

- Confidence Level

**Definition**

**The Confidence Level is the probability that the Confidence Interval computed around an Estimate of Sample Error (EoSE) captures the True Error**

**The choice of Confidence Level depends on the field of study**

**Generally, the most common Confidence Level used by Researchers is 95%**

- Example – Confidence Interval

**Question**

**Explain the following statement**

**With 95% Confidence, we can say that True Error lies in an Interval of 0.20 – 0.30**

**Answer**

**If we randomly draw Samples from a Population (using the same technique) and compute the Estimate of Sample Error (EoSE) and its interval, then 95% of the time the True Error will fall within the interval, i.e. between 0.20 and 0.30**

- Example – Calculating Error in Estimate of Sample Error

**Problem**

**A Two-Step Process**

**Step 1: Calculate Estimate of Sample Error (EoSE)**

**Step 2: Calculate Error in Estimate of Sample Error (EoSE)**

**Step 1: Calculate Estimate of Sample Error (EoSE)**

**Step 2: Calculate Error in Estimate of Sample Error (EoSE)**

**Error in Estimate of Sample Error (EoSE)**

**We know with approximately 95% Probability that**

**True Error lies in the range 0.1628 to 0.4372**

- Summary – Confidence Interval

**Our goal is to have a**

**Small Interval with High Confidence**

**Note**

**For Small Intervals, Confidence is normally Low**

**For Large Intervals, Confidence is normally High**

**Example – Interval and Confidence**

**Question**

**How much money does Adeel have in his purse?**

**Answer 01**

**Interval**

**0 – 100K**

**Confidence**

**99%**

**Answer 02**

**Interval**

**500 – 1000**

**Confidence**

**50%**

**Conclusion**

**It is not easy to have a Small Interval with High Confidence 😊**

### TODO and Your Turn


Todo Tasks

**Task 1 – Consider the following Task and answer the questions given below**

**Memorize Quran.e. Pak by Heart**

**Dua: May Allah make us all Huffaz of the Holy Quran, Ameen (اللہ پاک ہم سب کو قرآن پاک کا حافظ بنائے آمین)**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**Write the Input and Output for the above Task?**

**Execute the Learning Cycle**

**What Evaluation Process will you use to ensure quality in Evaluation?**

**What strategies will you use so that you do not forget Quran.e.Pak till death?**

**Task 2**

**Consider the following scenario and answer the questions below**

**A Model (h) was tested on a Sample of 25 instances. The Model (h) misclassified 5 instances.**

**Questions**

**Calculate the Sample Error?**

**Why will Sample Error not estimate True Error well?**

**Task 3**

**Consider the following scenario and answer the questions below**

**A Model (h) was tested on a Sample of 50 instances. The Model (h) misclassified 10 instances.**

**Questions**

**Calculate the Estimate of Sample Error (EoSE)?**

**Why will Sample Error estimate True Error well?**

**Calculate Error in Estimate of Sample Error (EoSE) with**

**Confidence Level = 50%**

**Confidence Level = 95%**

**Discuss the impact of Confidence Level on the Error in Estimate of Sample Error (EoSE)**

**Why can we not deploy the Model (h) in the Real-world, although the Sample Error is low?**

Your Turn Tasks

**Task 1 – Select a Task (similar to Memorize Quran.e.Pak by Heart)**

**Questions**

**Write the Input and Output for the selected Task?**

**Execute the Learning Cycle**

**What Evaluation Process will you use to ensure quality in Evaluation?**

**What strategies will you use so that you do not forget what you have learned till death?**

**Task 2**

**Write a scenario (similar to Task 03 in TODO) and answer the questions below**

**Questions**

**Calculate the Estimate of Sample Error (EoSE)?**

**Will Sample Error estimate True Error well? Explain.**

**Calculate Error in Estimate of Sample Error (EoSE) with**

**Confidence Level = 50%**

**Confidence Level = 95%**

**Discuss the impact of Confidence Level on the Error in Estimate of Sample Error (EoSE)**

**Will you be able to deploy your Model (h) in the Real-world? Explain.**

## Two Main Diseases of Machine Learning Algorithms

- Two Main Diseases of Machine Learning Algorithms

**The two main diseases in Machine Learning are**

**Overfitting**

**Underfitting**

**Overfitting**

**The condition when a Machine Learning Algorithm tries to remember all the Training Examples from the Training Data (Rote Learning) is known as Overfitting of the Model (h)**

**Overfitting happens when our**

**Model (h) has a lot of features or**

**Model (h) is too complex**

**Underfitting**

**The condition when a Machine Learning Algorithm could not learn the correlations between Attributes / Features properly is known as Underfitting of the Model (h)**

**Underfitting happens when our**

**Model misses the trends or patterns in the Training Data and could not generalize well for the Training Examples**

**Question**

**How to overcome the problems of Overfitting and Underfitting?**

**A Possible Solution**

**Use Train-Test Split Approach**

- Train-Test Split Approach

**Definition**

**The Train-Test Split Approach splits the Sample Data into two sets: (1) Train Set and (2) Test Set**

**Train Set**

**Used to build/train the Model (h)**

**Test Set**

**Used to evaluate the performance of the Model (h)**

- Train-Test Split Approach

**Two main variations of Train-Test Split Approach are**

**Random Split Approach**

**Class Balanced Split Approach**

**Note**

**Class Balanced Split Approach should be preferred over Random Split Approach**

**For details**

**See Chapter 03 – Basics of Machine Learning**

- Train-Test Split Ratio

**Definition**

**The Train-Test Split Ratio determines what percentage of the Sample Data will be used as the Train Set and what percentage will be used as the Test Set**

**Question**

**What Train-Test Split Ratio is best?**

**A Possible Answer**

**The Train-Test Split Ratio may vary from one Machine Learning Problem to another**

**e.g. 70%-30%, 80%-20%, 90%-10%, etc.**

**Most Common Train-Test Split Ratio**

**Use 2 / 3 of Sample Data as Train Set**

**Use 1 / 3 of Sample Data as Test Set**
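A minimal sketch of the Class Balanced (stratified) Split Approach using the 2/3–1/3 ratio above; the `(attributes, label)` pair representation and the `seed` parameter (for reproducibility) are illustrative assumptions:

```python
import random
from collections import defaultdict

def class_balanced_split(examples, test_fraction=1/3, seed=42):
    # Stratified split: each Class keeps the same proportion in the
    # Train Set and the Test Set (Class Balanced Split Approach)
    by_class = defaultdict(list)
    for attributes, label in examples:
        by_class[label].append((attributes, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# Invented data: 60 Male and 40 Female instances
data = [([i], "M") for i in range(60)] + [([i], "F") for i in range(40)]
train, test = class_balanced_split(data)
print(len(train), len(test))  # 67 33
```

Both splits preserve the 60/40 class ratio, which a purely Random Split does not guarantee.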

- Strengths and Weaknesses - Train-Test Split Approach

**Strengths**

**The Train-Test Split Approach helps us address the problems of Overfitting and Underfitting**

**Weaknesses**

**The Train-Test Split Approach gives a high-variance Estimate of Sample Error, since changing which examples happen to be in the Train Set can significantly change the Sample Error**

**Question**

**How to****overcome the problem****of****high variance****in Estimate of Sample Error calculated using Train-Test Split Approach?**

**A Possible Answer**

**Use the K-fold Cross-Validation Approach**


- K-fold Cross-Validation Approach

**The K-fold Cross-Validation Approach works as follows**

**Step 1: Split the Train Set into K equal folds (or partitions)**

**Step 2: Use one of the folds (the k-th fold) as the Test Set and the union of the remaining folds (K – 1 folds) as the Training Set**

**Step 3: Calculate the Error of the Model (h)**

**Step 4: Repeat Steps 2 and 3, choosing Train Sets and Test Sets from different folds, and calculate the Error K times**

**Step 5: Calculate the Average Error**

**Important Note**

**In each fold, there must be at least 30 instances**

**All K folds must be disjoint, i.e. an instance appearing in one fold must not appear in any other fold**

**Question**

**What is the****best value****for K?**

**Answer**

**An empirical study showed that the best value is**

**K = 10**

**Important Note**

**To run experiments with 10-fold Cross-Validation, you must have**

**At least 300 instances in your Train Set**
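The five steps can be sketched as follows; `train_and_error` is a hypothetical callback standing in for a real learner, here returning a constant Error just to show the control flow:

```python
def k_fold_cross_validation(examples, k, train_and_error):
    # Step 1: split the data into K (nearly) equal, disjoint folds
    folds = [examples[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        # Step 2: fold i is the Test Set; the union of the rest is the Train Set
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        # Steps 3-4: train the Model and record its Error, K times in total
        errors.append(train_and_error(train, test))
    # Step 5: average the Error over the K iterations
    return sum(errors) / k

# Hypothetical constant-error "learner", purely to demonstrate the flow
fake_learner = lambda train, test: 0.25
print(k_fold_cross_validation(list(range(300)), 10, fake_learner))  # 0.25
```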

- Example 1 – Selecting Value of K

**Consider the following Machine Learning Problem**

**Calculations**

**Size of Train Set = 200 instances**

**There should be at least 30 instances in each fold**

**K = 200 / 30 = 6.66**

**Answer**

**Ali will apply**

**6-fold Cross-Validation**

- Example 2 – Selecting Value of K

**Consider the following Machine Learning Problem**

**Calculations**

**Size of Train Set = 300 instances**

**There should be at least 30 instances in each fold**

**K = 300 / 30 = 10**

**Answer**

**Ali will apply**

**10-fold Cross-Validation**

- Example 3 – Selecting Value of K

**Consider the following Machine Learning Problem**

**Calculations**

**Size of Train Set = 2000 instances**

**There should be at least 30 instances in each fold**

**K = 2000 / 30 = 66.66**

**Answer**

**Ali will apply**

**10-fold Cross-Validation**

**Note**

**The calculation shows K = 66.66, but we apply**

**10-fold Cross-Validation i.e. K = 10**

**Reason**

**An empirical study has shown that the best value is K = 10**
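The rule used in the three examples (K = Train Set size ÷ 30 instances per fold, capped at the empirically best K = 10) can be expressed as a small helper; `choose_k` is an illustrative function name, not a standard API:

```python
def choose_k(train_size, min_fold_size=30, max_k=10):
    # Each fold needs at least 30 instances; empirically K = 10 is best,
    # so cap K at 10 when the Train Set is large enough
    k = train_size // min_fold_size
    return min(k, max_k)

print(choose_k(200))   # 6   (200 / 30 = 6.66 -> 6-fold)
print(choose_k(300))   # 10  (300 / 30 = 10   -> 10-fold)
print(choose_k(2000))  # 10  (66.66 capped at the empirical best K = 10)
```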

- Example - K-fold Cross-Validation Approach

**Consider the following Machine Learning Problem**

**Applying K-fold Cross-Validation Approach**

**Step 1: Split the Train Set into K equal folds (or partitions)**

**Calculating the Value of K**

**Splitting Train Set into K-equal folds (here K = 3)**

**Fold 01 = instances 1 – 33 (total 33 instances)**

**Fold 02 = instances 34 – 66 (total 33 instances)**

**Fold 03 = instances 67 – 100 (total 34 instances)**

**Step 2: Use one of the folds (the k-th fold) as the Test Set and the union of the remaining folds (K – 1 folds) as the Training Set**

**Step 3: Calculate Error of Model (h)**

**Step 4: Repeat Steps 2 and 3, choosing Train Sets and Test Sets from different folds, and calculate the Error K times**

**Iteration 1**

**Iteration 2**

**Iteration 3**

**Step 5: Calculate Average Error**

**Average Error = (Error in Iteration 01 + Error in Iteration 02 + Error in Iteration 03) / 3**

**Average Error = (0.25 + 0.20 + 0.30) / 3 = 0.25**

- Example - 10-fold Cross-Validation Approach

**Diagram below shows split of Train Sets and Test Sets**

**when K = 10**

- Strengths and Weaknesses – K-fold Cross-Validation Approach

**Strengths**

**The K-fold Cross-Validation Approach is a better estimator of Error since**

**All data is used for both Training and Testing**

**Weaknesses**

**It is computationally expensive since**

**We have to repeat the Training and Testing Phases K times**

- Suitable Situations to use Train-Test Split Approach

**It is suitable to use the Train-Test Split Approach in the following situations**

**Situation 1**

**When Training Time is****Very Large**

**Example**

**Since the Training Time of Deep Learning Algorithms is very large, the Train-Test Split Approach is more suitable (compared to the K-fold Cross-Validation Approach)**

**Situation 2**

**Organizing International Competitions**

**Example**

**PAN organized an International Competition on the Author Profiling task, which mainly comprised two phases**

**Training Phase**

**PAN Organizers released the Training Data so that participants could train their Models (h)**

**Evaluation Phase**

**PAN Organizers released the Test Data and asked participants to apply their Models (h) on the Test Data and submit their predictions for evaluation**

**Situation 3**

**Having Very Large Sample Data**

**Example**

**Suppose we want to Train and Test our Machine Learning Algorithms for Plagiarism Detection task on a Sample Data of 21 million instances (PubMed Medline Citations Dataset)**

**In the above situation, it will be more suitable to use the Train-Test Split Approach (compared to the K-fold Cross-Validation Approach)**

- Suitable Situations to use K-fold Cross-Validation Approach

**It is suitable to use the K-fold Cross-Validation Approach in the following situations**

**Situation 1**

**When Training Time is Not Very Large**

**Example**

**Since the Training Time of Feature-based Machine Learning Algorithms is relatively short, the K-fold Cross-Validation Approach is more suitable (compared to the Train-Test Split Approach)**

**Situation 2**

**Having Sample Data that is Not Very Large**

**Example**

**Suppose we want to Train and Test our Machine Learning Algorithms for Sentiment Analysis task on a Sample Data of 10,000 instances**

**In the above situation, it will be more suitable to use the K-fold Cross-Validation Approach (compared to the Train-Test Split Approach)**

### TODO and Your Turn


Todo Tasks

**TODO Task 2**

**Task 1**

**Consider the following scenario and answer the questions given below**

**Rashid has a Sample Data of 10,000 instances for the Sentiment Analysis task (4000 Positive, 2000 Neutral and 4000 Negative). He aims to apply Naïve Bayes, Random Forest, RNN and LSTM on his dataset. The Error measure is used to evaluate the performance of the Model.**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**Split Sample Data using Class Balanced Approach with a Train-Test Split Ratio of 80%-20%?**

**For which of the four ML Algorithms (Naïve Bayes, Random Forest, RNN, LSTM) is Overfitting a serious problem?**

**How will you check whether your Model is Overfitting or Underfitting?**

**Which approach is most suitable in the above scenario?**

**Train-Test Split Approach or**

**K-fold Cross-Validation Approach**

**If K-fold Cross-Validation Approach is applied**

**Calculate the value of K**

**Apply the K-fold Cross-Validation Approach to evaluate the performance of the Model**

**If Train-Test Split Approach is applied**

**What is the most suitable value for the Train-Test Split Ratio?**

**Which Data Split Approach is more suitable (Random Split Approach or Class Balanced Split Approach)?**

Your Turn Tasks

**Your Turn Task 2**

**Task 1**

**Consider a scenario (similar to the one given in the TODO Task) and answer the questions given below**

**Questions**

**Split Sample Data using Class Balanced Approach with a Train-Test Split Ratio of 80%-20%?**

**For which of the four selected ML Algorithms is Overfitting a serious problem?**

**How will you check whether your Model is Overfitting or Underfitting?**

**Which approach is most suitable in your selected scenario?**

**Train-Test Split Approach or**

**K-fold Cross-Validation Approach**

**If K-fold Cross-Validation Approach is applied**

**Calculate****the value of K?**

**Apply****K-fold Cross-Validation Approach to****evaluate the performance of the Model**

**If Train-Test Split Approach is applied**

**What is****the most suitable****value for Train-Test Split Ratio?**

**What Data Split Approach is****more suitable****(Random Split Approach or Class Balanced Split Approach)?**

## Comparing Machine Learning Algorithms

- Comparing Machine Learning Algorithms

**To compare various Machine Learning Algorithms, the following must be the same**

**Train Set**

**Test Set**

**Evaluation Measure**

**Evaluation Methodology**

**Important Note**

**If any of the above is not the same, then it will**

**not be a valid comparison**

- Example 1 – Comparing Machine Learning Algorithms

**Machine Learning Problem**

**Gender Identification**

**Feature-based Machine Learning Algorithms**

**Naïve Bayes**

**Random Forest**

**Support Vector Machine**

**Logistic Regression**

**Multi-Layer Perceptron**

**Dataset**

**Gender Identification on Twitter**

**Total instances = 10000**

**Male instances****= 5000**

**Female instances****= 5000**

**Evaluation Measure**

**Accuracy**

**Evaluation Methodology**

**10-fold Cross-Validation**

**Results obtained after running experiments**

**Conclusion**

**Best****Machine Learning Algorithm on Twitter Corpus is**

**Random Forest with an Accuracy score of 0.80**

**Question**

**Was the comparison of Feature-based Machine Learning Algorithms valid?**

**Answer**

**Yes**

**Reason**

**For all five Machine Learning Algorithms we used the same**

**Train Set**

**Test Set**

**Evaluation Measure and**

**Evaluation Methodology**

- Example 2 – Comparing Machine Learning Algorithms

**Machine Learning Problem**

**Gender Identification**

**Deep Learning Algorithms**

**Recurrent Neural Network (RNN)**

**Long Short-Term Memory (LSTM)**

**BI-LSTM**

**Dataset**

**Gender Identification on Twitter**

**Total instances = 10000**

**Male instances****= 5000**

**Female instances****= 5000**

**Evaluation Measure**

**Accuracy**

**Evaluation Methodology**

**Train-Test Split Ratio of**

**80% – 20%**

**Results obtained after running experiments**

**Conclusion**

**Best****Machine Learning Algorithm on Twitter Corpus is**

**LSTM with an Accuracy score of 0.78**

**Question**

**Was the comparison of Deep Learning Algorithms valid?**

**Answer**

**Yes**

**Reason**

**For all three Deep Learning Algorithms we used the same**

**Train Set**

**Test Set**

**Evaluation Measure and**

**Evaluation Methodology**

- Comparison of Example 01 and Example 02

**Major difference****in Example 01 and Example 02 is of**

**Evaluation Methodology**

**Evaluation Methodology – Example 01**

**10-fold Cross-Validation**

**Reason**

**We were applying****Feature-based****ML Algorithms on a dataset of 10000 instances**

**Evaluation Methodology – Example 02**

**Train-Test Split Approach**

**Train-Test Split Ratio of 80% – 20%**

**Reason**

**We were applying****Deep Learning****ML Algorithms on a dataset of 10000 instances**

- Example 3 – Comparing Machine Learning Algorithms

**Machine Learning Problem**

**Gender Identification**

**Feature-based Machine Learning Algorithms**

**Naïve Bayes**

**Random Forest**

**Support Vector Machine**

**Logistic Regression**

**Multi-Layer Perceptron**

**Deep Learning Algorithms**

**Recurrent Neural Network (RNN)**

**Long Short-Term Memory (LSTM)**

**BI-LSTM**

**Dataset**

**Gender Identification on Twitter**

**Total instances = 10000**

**Male instances****= 5000**

**Female instances****= 5000**

**Evaluation Measure**

**Accuracy**

**Evaluation Methodology**

**Important Note**

**We are applying both****Feature-based ML Algorithms****and****Deep Learning ML Algorithms****on our Twitter Dataset for Gender Identification task**

**Question**

**In the current situation, should we go for the Train-Test Split or K-Fold Cross-Validation Approach?**

**Answer**

**Train-Test Split Approach is****more suitable**

**Therefore, we will use Train-Test Split Ratio of**

**80% – 20%**

**Results obtained after running experiments**

**Conclusion**

**Best****Machine Learning Algorithm on Twitter Corpus is**

**Random Forest with an Accuracy score of 0.83**

**Question**

**Was the comparison of Machine Learning Algorithms valid?**

**Answer**

**Yes**

**Reason**

**For all eight Machine Learning Algorithms we used the same**

**Train Set**

**Test Set**

**Evaluation Measure and**

**Evaluation Methodology**

- Important Point to Note

**Question**

**Why are the results of Feature-based ML Algorithms in Example 03 different from those reported in Example 01, although the dataset is the same?**

**Answer**

**In Example 01, Evaluation Methodology was**

**10-fold Cross-Validation Approach**

**In Example 03, Evaluation Methodology was**

**Train-Test Split Approach**

**Therefore,****results****obtained on the****same dataset****are****different**

**Conclusion**

**Before running experiments****, we need to****ensure****that our Evaluation Methodology is**

**Standard****and**

**Same****for****all****Machine Learning Algorithms**

### TODO and Your Turn

Todo Tasks

**TODO Task 3**

**Task 1**

**Consider the following scenario and answer the questions given below**

**Rashid has a Sample Data of 10,000 instances for the Sentiment Analysis task (4000 Positive, 2000 Neutral and 4000 Negative). He aims to apply Naïve Bayes, Random Forest, RNN and LSTM on his dataset.**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**What main points should Rashid consider to make a valid comparison of all four Machine Learning Algorithms?**

**Discuss the pros and cons of comparing Machine Learning Algorithms using the Train-Test Split Approach.**

**Discuss the pros and cons of comparing Machine Learning Algorithms using the K-fold Cross-Validation Approach.**

Your Turn Tasks

**Your Turn Task 3**

**Task 1**

**Consider a scenario (similar to the one given in the TODO Task) and answer the questions given below**

**Questions**

**What main points should you consider to make a valid comparison of all Machine Learning Algorithms?**

**Discuss the pros and cons of comparing Machine Learning Algorithms using the Train-Test Split Approach.**

**Discuss the pros and cons of comparing Machine Learning Algorithms using the K-fold Cross-Validation Approach.**

## Evaluation Measures for Classification Problems

- Evaluation Measures for Classification Problem

**Some of the****most popular****and****widely used****Evaluation Measures for Classification Problems are**

**Baseline Accuracy**

**Accuracy**

**Precision**

**Recall**

**F****1**

**Area Under the Curve (AUC)**

- Baseline Accuracy (BA)

**Definition**

**Baseline Accuracy (a.k.a. Majority Class Categorization (MCC)) is calculated by assigning the****label of Majority Class****to****all****the Test Instances**

**Formula**
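In standard notation (here $N_{majority}$ denotes the number of Test Instances in the Majority Class and $N$ the total number of Test Instances):

```latex
BA = \frac{N_{majority}}{N}
```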

- Example 1 – Calculating Baseline Accuracy (BA)

**Problem Description **

**Baseline Approach**

**Majority Class Categorization (MCC)**

**Proposed Approach**

**Excellent Learner**

**To make a****contribution****, in the****existing research**

**Proposed Approach****must outperform****Baseline Approach**

**Calculating Baseline Accuracy (BA)**

**Number of Classes**

**Class 01 (Female)**

**Class 02 (Male)**

**Total Number of Test Instances = 600**

**Class 01 (Female)****= 300**

**Class 02 (Male)****= 300**

**Majority Class**

**Both Classes (Female and Male) are equal, i.e. have the same Number of Instances**

**Therefore, we can take****any of the Classes****as Majority Class**

**Majority Class**

**Female**

**Calculating Baseline Accuracy (BA)**

**Important Note**

**If (Accuracy of Proposed Approach (Excellent Learner) > 0.50) Then Proposed Approach has contributed**

**Else****Proposed Approach****has not contributed**

- Example 2 – Calculating Baseline Accuracy (BA)

**Problem Description**

**Baseline Approach**

**Majority Class Categorization (MCC)**

**Proposed Approach**

**Excellent Learner**

**To make a****contribution****, in the****existing research**

**Proposed Approach****must outperform****the Baseline Approach**

**Calculating Baseline Accuracy (BA)**

**Number of Classes**

**Class 01 (Female)**

**Class 02 (Male)**

**Total Number of Test Instances = 600**

**Class 01 (Female)****= 400**

**Class 02 (Male)****= 200**

**Majority Class**

**Female**

**Calculating Baseline Accuracy (BA)**

**Important Note**

- Comparing Example 1 and Example 2

**Example 01 has****Balanced****Data**

**BA = 0.50**

**Example 02 has****Unbalanced****Data**

**BA = 0.66**

**Note**

**BA score changes as the Number of Instances in each Class changes**

**Conclusion**

**Class Balancing****has a****significant****impact on the**

**Calculation of Baseline Accuracy (BA)**

- Example 3 – Calculating Baseline Accuracy (BA)

**Problem Description**

**Baseline Approach**

**Majority Class Categorization (MCC)**

**Proposed Approach**

**Excellent Learner**

**To make a****contribution****, in the****existing research**

**Proposed Approach****must outperform****the Baseline Approach**

**Calculating Baseline Accuracy (BA)**

**Number of Classes**

**Class 1 (Positive)**

**Class 2 (Negative)**

**Class 3 (Neutral)**

**Total Number of Test Instances = 1000**

**Class 01 (Positive)****= 400**

**Class 02 (Negative)****= 250**

**Class 03 (Neutral)****= 350**

**Majority Class**

**Positive**

**Calculating Baseline Accuracy (BA)**

**Important Note**

- Comparing Example 2 and Example 3

**Example 2 is a****Binary****Classification Problem**

**BA = 0.66**

**Example 03 is a****Ternary****Classification Problem**

**BA = 0.40**

**Note**

**The BA score for the Ternary Classification Problem is smaller than that for the Binary Classification Problem**

**Conclusion**

**As the****Number of Classes****increases in a Machine Learning Problem, the**

**Baseline Accuracy (BA)****decreases****(considering****almost****Balanced Data)**

- Baseline Accuracy for Balanced Data

**Assume that we have Balanced Data for all the Classification Tasks given below**

**Gender Identification (Binary Classification Problem)**

**Classes = Male, Female**

**BA = 0.50**

**Sentiment Analysis (Multi-class Classification Problem with 3 Classes)**

**Classes = Positive, Negative, Neutral**

**BA = 0.33**

**Age Group Identification (Multi-class Classification Problem with 4 Classes)**

**Classes = [1 – 18], [19 – 25], [26 – 40], [41 – 100]**

**BA = 0.25**

**Note**

**As the****Number of Classes****increases in a Machine Learning Problem, the**

**Baseline Accuracy (BA) decreases**

**Conclusion**

**Number of Classes****has a****significant****impact on the calculation of Baseline Accuracy (BA)**

- Two Main Factors Affecting Calculation of Baseline Accuracy (BA)

**The two main factors affecting the calculation of Baseline Accuracy (BA) are**

**Number of Classes**

**Number of Instances in each Class**
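Both factors can be seen directly in a minimal Python sketch of Majority Class Categorization (the helper name `baseline_accuracy` is illustrative, not from the slides):

```python
def baseline_accuracy(class_counts):
    """Baseline Accuracy (Majority Class Categorization): assign every
    Test Instance the label of the Majority Class."""
    return max(class_counts) / sum(class_counts)

# Example 1: balanced binary data (300 Female, 300 Male)
ba_balanced = baseline_accuracy([300, 300])      # 0.50
# Example 2: unbalanced binary data (400 Female, 200 Male)
ba_unbalanced = baseline_accuracy([400, 200])    # ~0.66
# Example 3: ternary data (400 Positive, 250 Negative, 350 Neutral)
ba_ternary = baseline_accuracy([400, 250, 350])  # 0.40
```

Adding classes (with roughly balanced counts) lowers the score, while skewing the counts toward one class raises it.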

- Strengths and Weaknesses – Baseline Accuracy (BA)

**Strengths**

**Baseline Accuracy (BA) provides a simple and very basic Baseline Approach against which to compare your Proposed Approach (Machine Learning Algorithm)**

**Weaknesses**

**Baseline Accuracy (BA) is very****naïve****and****cannot be considered****as a****state-of-the-art****and****strong****Baseline Approach**

- Accuracy

**Definition**

**Accuracy is defined as the proportion of correctly classified Test instances**

**Formula**
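In standard form, Accuracy is the ratio of correct predictions to all predictions:

```latex
Accuracy = \frac{\text{Number of correctly classified Test Instances}}{\text{Total number of Test Instances}}
```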

**Note**

**Question**

**When will it be more suitable to use the Accuracy measure?**

**Answer**

**The Accuracy evaluation measure is more suitable for the evaluation of Machine Learning Algorithms when we have**

**Balanced Data**

- Example – Calculating Accuracy

**Problem Description**

**Baseline Approach 1**

**Majority Class Categorization (MCC)**

**Baseline Accuracy = 0.50**

**Baseline Approach 2**

**Efficient Learner previously reported by Adeel**

**Accuracy (Efficient Learner) = 0.70**

**Proposed Approach**

**Excellent Learner**

**Accuracy (Excellent Learner) =?**

**Question**

**With which Baseline Approach (Majority Class Categorization or Efficient Learner) should Rasheed compare his Proposed Approach (Excellent Learner)?**

**Answer**

**Efficient Learner**

**Reason**

**Efficient Learner is a****state-of-the-art****approach and can be considered as a**

**strong****Baseline Approach**

**Important Note**

**To have****quality****in your****research work****,****always compare****your Proposed Approach with a**

**state-of-the-art****and****strong****Baseline Approach**

**Calculating Accuracy for Efficient Learner (Machine Learning Algorithm)**

**Note**

- Strengths and Weaknesses - Accuracy

**Strengths**

**We can****evaluate****and****compare****various Machine Learning Algorithms using Accuracy evaluation measure**

**Weaknesses**

**Accuracy fails to****accurately****evaluate a Machine Learning Algorithm when Test Data is****highly unbalanced**

**Accuracy ignores the possibility of different misclassification costs**

- Example – Accuracy is a Poor Measure for Highly Unbalanced Data

**Consider a Binary Classification Problem with two classes: Positive and Negative. Test Data comprises 1000 instances, out of which 995 instances are Negative and 5 are Positive.**

**A Machine Learning Algorithm which****always predicts****Negative, will have an**

**Accuracy of 0.995**

**Problem**

**Machine Learning Algorithm has****very high****Accuracy (99.5%) on Test Data, even though it****never correctly predicts Positive Test Examples**
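This failure mode is easy to reproduce in a few lines of Python (variable names are illustrative):

```python
# 1000 Test instances: 995 Negative, 5 Positive (highly unbalanced)
y_true = ["Negative"] * 995 + ["Positive"] * 5
# A degenerate Model that always predicts Negative
y_pred = ["Negative"] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)   # 0.995, yet the Model is useless
# Number of Positive instances the Model ever gets right:
positives_found = sum(t == p == "Positive" for t, p in zip(y_true, y_pred))  # 0
```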

**A Possible Solution**

**Confusion Matrix**

- Confusion Matrix

**Definition**

**A Confusion Matrix is a table used to describe the****performance of a Classification Model****(or Classifier) on a Set of Test Examples (Test Data), whose****Actual Values****(or True Values) are****known**

**Purpose**

**To get deeper insights into Model / Classifier behavior**

**Advantages**

**Confusion Matrix****allows****us to****visualize****the****performance****of a Model / Classifier**

**Confusion Matrix allows us to separately get insights into the Errors made for each Class**

**Confusion Matrix gives insights into both**

**Errors****made by a Model / Classifier and**

**Types of Errors****made by a Model / Classifier**

**Confusion Matrix****allows****us to****compute****many****different Evaluation Measures****including**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (or Recall)**

**True Negative Rate**

**False Positive Rate**

**False Negative Rate**

**Precision**

**F****1**

- Confusion Matrices

**Confusion Matrix for a Machine Learning Problem with****n****Number of Classes is given below,**

- Confusion Matrix for Concept Learning

**Confusion Matrix for Concept Learning (a.k.a. Binary Classification Problem) is given below,**

**Considering that**

**Class 01 = Negative**

**Class 02 = Positive**

- Extracting Various Evaluation Measures from Confusion Matrix

**In sha Allah, in the next Slides I will try to explain how to extract the following Evaluation Measures from the Confusion Matrix**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (or Recall or Sensitivity)**

**True Negative Rate**

**False Positive Rate**

**False Negative Rate**

**Precision**

**F-measure**

**F****1**

**Baseline Accuracy (BA)**

**Definition**

**Classify all the Test Examples by assigning them the label of the Majority Class**

**Formula**
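Equivalently, with $N_c$ the number of Test Examples belonging to Class $c$ and $N$ the total number of Test Examples:

```latex
BA = \frac{\max_{c} N_c}{N}
```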

**Accuracy**

**Definition**

**Accuracy (AC) is the proportion of the total number of predictions that were correct**

**Formula**
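In terms of the Confusion Matrix entries for a Binary Classification Problem (TP, TN, FP, FN):

```latex
AC = \frac{TP + TN}{TP + TN + FP + FN}
```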

**Note**

**This is the same as Accuracy evaluation measure defined earlier**

**Recall or True Positive Rate (TPR) or Sensitivity**

**Definition**

**Recall or True Positive Rate (TPR) or Sensitivity is the proportion of Positive cases that were correctly classified**

**Formula**
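In standard notation:

```latex
Recall = TPR = \frac{TP}{TP + FN}
```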

**False Positive Rate**

**Definition**

**False Positive Rate (FPR) is the proportion of Negative cases that were incorrectly classified as Positive**

**Formula**
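In standard notation:

```latex
FPR = \frac{FP}{FP + TN}
```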

**True Negative Rate or Specificity**

**Definition**

**True Negative Rate (TNR) or Specificity is defined as the proportion of Negative cases that were classified correctly**

**Formula**
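In standard notation:

```latex
TNR = \frac{TN}{TN + FP}
```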

**False Negative Rate**

**Definition**

**False Negative Rate (FNR) is the proportion of Positive cases that were incorrectly classified as Negative**

**Formula**
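In standard notation (note that $FNR = 1 - TPR$):

```latex
FNR = \frac{FN}{FN + TP}
```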

**Precision**

**Definition**

**Precision (P) is the proportion of the predicted Positive cases that were correct**

**Formula**
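In standard notation:

```latex
P = \frac{TP}{TP + FP}
```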

- Trade-Off Between Precision and Recall

**Problem**

**There is a****trade-off****between Precision and Recall**

**A Possible Solution**

**Find Evaluation Measures that efficiently combine Precision and Recall**

**Example**

**F-measure (combines Precision and Recall)**

- F-measure

**F-measure**

**Definition**

**Harmonic Mean of Precision and Recall**

**Formula**
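In standard form (with $P$ = Precision, $R$ = Recall):

```latex
F_{\beta} = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}
```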

**where β controls the relative weight assigned to Precision and Recall**

- F1-measure

**Definition**

**When we assign****same weights****to Precision and Recall i.e. β = 1, the F-measure becomes F****1****-measure**
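Setting $\beta = 1$ in the F-measure formula gives:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```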

- Harmonic Mean in F-measure

**Question**

**In F-measure, why do we use Harmonic Mean instead of Arithmetic Mean?**

**Answer**

**We want our Machine Learning Algorithm to have**

**High****F****1****score**

**To achieve High F****1****score, we need both**

**Good****Precision and**

**Good****Recall**

**Harmonic Mean****penalizes****F****1****score, when****either**

**Recall is****high****and Precision is****low****or**

**Recall is****low****and Precision is****high**

**Note the difference in the Formulas of Arithmetic Mean and Harmonic Mean**

**Harmonic Mean Formula**
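For two values $x$ and $y$ (here Precision and Recall):

```latex
\text{Harmonic Mean} = \frac{2xy}{x + y}
```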

**Arithmetic Mean Formula**
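For the same two values $x$ and $y$:

```latex
\text{Arithmetic Mean} = \frac{x + y}{2}
```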

- Example 1 – Arithmetic Mean vs Harmonic Mean

**Situation**

**Both Precision and Recall are almost the same**

**Consider the following values of Precision and Recall**

**Precision = 0.90**

**Recall = 0.87**

**Arithmetic Mean**

**F****1****= 0.885**

**Harmonic Mean**

**F****1****= 0.884**

**Note**

**Both the Arithmetic Mean and Harmonic Mean scores are almost the same**

**Conclusion**

**Arithmetic Mean and Harmonic Mean scores are almost the same when Precision and Recall scores are almost the same**

- Example 2 – Arithmetic Mean vs Harmonic Mean

**Situation**

**Precision is****high****and Recall is****low**

**Consider the following values of Precision and Recall**

**Precision = 0.90**

**Recall = 0.20**

**Arithmetic Mean**

**F****1****= 0.55**

**Harmonic Mean**

**F****1****= 0.33**

**Note**

**Arithmetic Mean score is higher than the Harmonic Mean score**

**Conclusion**

**Harmonic Mean****penalized****the F****1****scores because Precision is****high****and Recall is****low**

- Example 3 – Arithmetic Mean vs Harmonic Mean

**Situation**

**Precision is****low****and Recall is****high**

**Consider the following values of Precision and Recall**

**Precision = 0.20**

**Recall = 0.90**

**Arithmetic Mean**

**F****1****= 0.55**

**Harmonic Mean**

**F****1****= 0.33**

**Note**

**Again, similar to Example 02, the Arithmetic Mean score is higher than the Harmonic Mean score**

**Conclusion**

**Harmonic Mean****penalized****the F****1****scores because Precision is****low****and Recall is****high**

- Summary - Arithmetic Mean vs Harmonic Mean

**Arithmetic Mean score is almost the same as the Harmonic Mean score when**

**Precision and Recall are almost the same**

**Arithmetic Mean score is higher than the Harmonic Mean score when**

**Precision is****high****and Recall is****low**

**Precision is****low****and Recall is****high**

**Conclusion**

**To get****Good F****1****score**

**Both****Precision and Recall should be****good**
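The comparison in the Examples above can be reproduced with a short Python sketch (function names are illustrative):

```python
def arithmetic_mean(p, r):
    """Arithmetic Mean of Precision and Recall."""
    return (p + r) / 2

def harmonic_mean(p, r):
    """Harmonic Mean of Precision and Recall (this is the F1 score)."""
    return 2 * p * r / (p + r)

# Example 1: Precision and Recall almost the same -> both means almost the same
am1, hm1 = arithmetic_mean(0.90, 0.87), harmonic_mean(0.90, 0.87)  # 0.885, ~0.885
# Example 2: Precision high, Recall low -> Harmonic Mean penalizes the score
am2, hm2 = arithmetic_mean(0.90, 0.20), harmonic_mean(0.90, 0.20)  # 0.55, ~0.33
```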

- Example 1 – Confusion Matrix for Binary Classification Problem

**Consider the following Machine Learning Problem**

**Confusion Matrix**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (TPR) or Recall or Sensitivity**

**False Positive Rate (FPR)**

**True Negative Rate (TNR) or Specificity**

**False Negative Rate (FNR)**

**Precision**

**F****1****Measure**
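As a worked sketch of extracting these measures from a Confusion Matrix (the TP/FN/FP/TN counts below are hypothetical assumptions, not the numbers from the worked example):

```python
# Hypothetical Confusion Matrix counts for a Binary Classification Problem
TP, FN = 40, 10   # Actual Positive instances: 50
FP, TN = 20, 30   # Actual Negative instances: 50

total = TP + FN + FP + TN
accuracy  = (TP + TN) / total                              # 0.70
recall    = TP / (TP + FN)                                 # TPR / Sensitivity = 0.80
fpr       = FP / (FP + TN)                                 # 0.40
tnr       = TN / (TN + FP)                                 # Specificity = 0.60
fnr       = FN / (FN + TP)                                 # 0.20
precision = TP / (TP + FP)                                 # ~0.67
f1        = 2 * precision * recall / (precision + recall)  # ~0.73
```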

- Example 2 – Confusion Matrix for Ternary Classification Problem

**Consider the following Machine Learning Problem**

**Confusion Matrix**

**Total Documents = 2000**

**Positive Documents = 1000**

**Negative Documents = 600**

**Neutral Documents = 400**

**Baseline Accuracy**

**Accuracy**

- Second Major Disadvantage of Accuracy

**Recall**

**The second major disadvantage of the Accuracy evaluation measure is that**

**It ignores the possibility of different misclassification costs**

**Recall – Misclassification**

**A Positive Instance is Classified as Negative**

**A Negative Instance is Classified as Positive**

**Misclassifying (or incorrectly predicting)**

**Positive instances may be more or less costly than misclassifying (or incorrectly predicting) Negative instances**

- Example 1 – Impact of Misclassification Costs

**Machine Learning Problem**

**Treating a Patient**

**Classes**

**Class 1****= Patient is Sick****(Positive)**

**Class 2****= Patient is Not Sick****(Negative)**

**Situation 1**

**Positive Example is****misclassified****as Negative Example**

**A Patient is Sick (Positive) but****Model Predicts****that**

**Patient is Not Sick (Negative) i.e. False Negative Rate**

**Situation 2**

**Negative Example is****misclassified****as Positive Example**

**A Patient is Not Sick (Negative) but****Model Predicts****that**

**Patient is Sick (Positive) i.e. False Positive Rate**

**Question**

**Among the two situations discussed above, which one is****more costly****?**

**Answer**

**Situation 1**

**A Patient is Sick (Positive) but****Model Predicts****that Patient is Not Sick (Negative) i.e. False Negative Rate**

**Question**

**Should we build a Model which has****high FPR and low FNR****?**

**Answer**

**Yes, it will be a good approach to build a Model with****high FPR and low FNR**

**Reason**

**The cost of not treating an ill patient (FNR) is****very high****compared to treating a patient who is not ill (FPR)**

**Therefore, a****high FPR****is****acceptable****, but we should try to****minimize FNR as much as possible**

- Example 2 – Impact of Misclassification Costs

**Machine Learning Problem**

**Plagiarism Detection in Student’s Assignments**

**Classes**

**Class 1****= Plagiarized****(Positive)**

**Class 2****= Non-Plagiarized****(Negative)**

**Situation 1**

**Positive Example is****misclassified****as a Negative Example**

**Model misclassifies (incorrectly predicts) a Plagiarized Assignment (Positive) as Non-Plagiarized (Negative) i.e. False Negative Rate**

**Situation 2**

**Negative Example is****misclassified****as Positive Example**

**Model misclassifies (incorrectly predicts) a Non-Plagiarized Assignment (Negative) as Plagiarized (Positive) i.e. False Positive Rate**

**Question**

**Among the two situations discussed above, which one is****more costly****?**

**Answer**

**Situation 1**

**Model misclassifies (incorrectly predicts) a Plagiarized Assignment (Positive) as Non-Plagiarized (Negative) i.e. False Negative Rate**

**Question**

**Should we build a Model which has****high FPR and low FNR****?**

**Answer**

**Yes, it will be a****good****approach to build a Model with****high FPR and low FNR**

**Reason**

**The cost of****not detecting****Plagiarized Assignments (FNR) is****very high****compared to predicting Non-Plagiarized Assignments as Plagiarized (FPR)**

**Therefore, a****high FPR****is****acceptable****, but we should try to****minimize FNR as much as possible**

- Example 3 – Impact of Misclassification Costs

**Machine Learning Problem**

**Predicting Fraud in Loan Applicants**

**Classes**

**Class 1****= Fraud****(Positive)**

**Class 2****= Not Fraud****(Negative)**

**Situation 1**

**Positive Example is****misclassified****as Negative Example**

**Model misclassifies (incorrectly predicts) a Fraud Applicant (Positive) as Not Fraud (Negative) i.e. False Negative Rate**

**Situation 2**

**Negative Example is****misclassified****as Positive Example**

**Model misclassifies (incorrectly predicts) a Not Fraud Applicant (Negative) as Fraud (Positive) i.e. False Positive Rate**

**Question**

**Among the two situations discussed above, which one is****more costly****?**

**Answer**

**Situation 01**

**Model misclassifies (incorrectly predicts) a Fraud Applicant (Positive) as Not Fraud (Negative) i.e. False Negative Rate**

**Question**

**Should we build a Model which has****high FPR and low FNR****?**

**Answer**

**Yes, it will be a****good****approach to build a Model with****high FPR and low FNR**

**Reason**

**The cost of****giving loan****to a Fraud Applicant (FNR) is****very high****compared to****not giving loan****to an Applicant who is Not Fraud (FPR)**

**Therefore, a****high FPR****is****acceptable****, but we should try to****minimize FNR as much as possible**

- Comparing Example 1, Example 2 and Example 3

**To summarize**

**In all three Examples discussed in the previous slides, we found a common pattern**

**Misclassification Cost****of predicting****Positive Examples as Negative Examples****(FNR) is****very high****compared to****Misclassification Cost****of predicting****Negative Examples as Positive Examples****(FPR)**

**To Conclude**

**For the three Machine Learning Problems discussed in the previous Slides (Treating a Patient, Plagiarism Detection in Students’ Assignments and Detection of Fraud Loan Applicants), it will be wise to build Models which have**

**High FPR and Low FNR**

- Important Note – Misclassification Costs

**Before Training your Model****, you must be****very clear****about the Misclassification Costs, otherwise**

**Your Model will****fail****to perform well in Real-world (i.e. Application Phase)**

- Second Major Disadvantage of Accuracy Cont…

**Problem**

**How to handle the problem in the Accuracy evaluation measure caused by ignoring the possibility of different misclassification costs?**

**Two Possible Solutions**

**ROC Curves**

**Precision-Recall Curves**

- ROC Curve

**Definition**

**ROC Curve****summarizes****the****trade-off****between the True Positive Rate (TPR) and False Positive Rate (FPR) for a Classifier / Model using****different****Probability Thresholds**

**Suitable to Use**

**ROC Curves are****more suitable****to use when we have**

**Balanced Data**

- ROC Curve

- ROC Curve – Example

- Precision-Recall Curve

**Definition**

**Precision-Recall Curve****summarizes****the****trade-off****between Precision and Recall (or True Positive Rate) for a Classifier / Model using****different****Probability Thresholds**

**Suitable to Use**

**Precision-Recall Curves are****more suitable****to use when we have**

**Highly****Unbalanced Data**

- Area Under the Curve (AUC)

**Definition**

**Area Under the ROC Curve (AUC) is defined as the probability that a randomly chosen Positive instance is ranked above a randomly chosen Negative one**

**Purpose**

**AUC measures the****degree of separability****between Classes**

**i.e. AUC tells how capable the Model is of distinguishing between Classes**

**How AUC Score is calculated?**

**AUC score is computed using True Positive Rate (TPR) and False Positive Rate (FPR)**

**Range of AUC Score**

**Range of AUC Score is [0 – 1]**

**0 means All Predictions of Model are Wrong**

**1 means All Predictions of Model are Correct**

**What is a good AUC Score?**

**AUC Score = 0.5**

**suggests No Discrimination (i.e., the Model does not have the ability to differentiate Positive instances from Negative ones)**

**AUC Score = 0.7 to 0.8**

**considered Acceptable**

**AUC Score = 0.8 to 0.9**

**considered Excellent**

**AUC Score = Greater Than 0.9**

**considered Outstanding**

- ROC Graph vs AUC

**ROC Graph**

**ROC Graph is a****probability curve**

**AUC**

**AUC represents****degree of separability****between Classes**
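A minimal sketch of the rank-based definition of AUC given above (the function name `auc_score` is illustrative; labels are assumed to be 1 = Positive, 0 = Negative):

```python
def auc_score(y_true, y_score):
    """AUC as the probability that a randomly chosen Positive instance
    (label 1) is ranked above a randomly chosen Negative one (label 0);
    ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Perfect separation: every Positive outranks every Negative -> AUC = 1.0
auc_perfect = auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# No discrimination: identical scores for both Classes -> AUC = 0.5
auc_random = auc_score([0, 1], [0.5, 0.5])
```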

### TODO and Your Turn

Todo Tasks

**TODO Task 4**

**Task 1**

**Consider the following Binary Classification Problem and answer the questions given below.**

**Ayesha had a collection of 800 documents containing news articles. She liked 300 news articles (Positive instances) and disliked (Negative instances) the remaining ones. She applied the Naïve Bayes classifier on the entire dataset. 50 of the liked news articles were correctly classified and 200 of the disliked news articles were incorrectly classified.**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**Draw Confusion Matrix?**

**Calculate the following**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (Recall)**

**False Positive Rate**

**True Negative Rate**

**False Negative Rate**

**Precision**

**F****1**

**What deeper insights about a Classifier’s behaviour can you get using the Confusion Matrix?**

**What impact does Data Distribution have on the calculation of Baseline Accuracy?**

**Considering Baseline Accuracy as the Baseline Approach**

**If a Proposed ML Algorithm produces an Accuracy of 0.60**

**Will we consider it a research contribution?**

**Compare the Harmonic Mean score with the Arithmetic Mean score.**

**What impact do Precision and Recall have on the calculation of F****1****?**

**Whose cost of misclassification is higher, Positive instances or Negative instances?**

**What possible solutions are there to handle the problem of misclassification costs?**

**Naïve Bayes Algorithm obtained an AUC score of 0.75. What does it mean?**

**Which one is more suitable to use, ROC Curve or Precision-Recall Curve?**

**Task 2**

**Consider the following Ternary Classification Problem and answer the questions given below.**

**Fatima had a collection of 2000 documents, which can be classified into three categories: Wholly Derived (WD), Partially Derived (PD) and Non-Derived (ND). 600 documents are WD, 500 are PD and 900 are ND. She ran the Naïve Bayes classifier on the entire dataset. For WD, half of the documents were correctly classified and 100 were classified as PD. For PD, 350 documents were correctly classified and 50 were classified as ND. For ND, 700 documents were correctly classified and 0 were classified as WD.**

**Questions**

**Draw Confusion Matrix?**

**Calculate the following**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (Recall)**

**False Positive Rate**

**True Negative Rate**

**False Negative Rate**

**Precision**

**F****1**

**What deeper insights about a Classifier’s behaviour can you get using the Confusion Matrix?**

**What impact does Data Distribution have on the calculation of Baseline Accuracy?**

**Considering Baseline Accuracy as the Baseline Approach**

**If a Proposed ML Algorithm produces an Accuracy of 0.60**

**Will we consider it a research contribution?**

**Compare the Harmonic Mean score with the Arithmetic Mean score.**

**What impact do Precision and Recall have on the calculation of F****1****?**

**Whose cost of misclassification is higher, Positive instances or Negative instances?**

**What possible solutions are there to handle the problem of misclassification costs?**

**Naïve Bayes Algorithm obtained an AUC score of 0.5. What does it mean?**

**Which one is****more suitable****to use, ROC Curve or Precision-Recall Curve?**

Your Turn Tasks

**Your Turn Task 4**

**Task 1**

**Consider a Binary Classification Problem similar to Task 01 in TODO and answer the questions given below.**

**Questions**

**Draw Confusion Matrix?**

**Calculate the following**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (Recall)**

**False Positive Rate**

**True Negative Rate**

**False Negative Rate**

**Precision**

**F****1**

**What deeper insights about a Classifier's behaviour can you get using the Confusion Matrix?**

**What impact does Data Distribution have on the calculation of Baseline Accuracy?**

**Considering Baseline Accuracy as the Baseline Approach**

**If a Proposed ML Algorithm produces an Accuracy of 0.60**

**Will we consider it a research contribution?**

**Compare the Harmonic Mean score with the Arithmetic Mean score.**

**What impact do Precision and Recall have on the calculation of F1?**

**Whose cost of misclassification is higher, Positive instances or Negative instances?**

**What possible solutions are there to handle the problem of cost of misclassification?**

**Naïve Bayes Algorithm obtained an AUC score of 0.90. What does it mean?**

**Which one is****more suitable****to use, ROC Curve or Precision-Recall Curve?**

**Task 2**

**Consider a Ternary Classification Problem similar to Task 02 in TODO and answer the questions given below.**

**Questions**

**Draw Confusion Matrix?**

**Calculate the following**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (Recall)**

**False Positive Rate**

**True Negative Rate**

**False Negative Rate**

**Precision**

**F****1**

**What deeper insights about a Classifier's behaviour can you get using the Confusion Matrix?**

**What impact does Data Distribution have on the calculation of Baseline Accuracy?**

**Considering Baseline Accuracy as the Baseline Approach**

**If a Proposed ML Algorithm produces an Accuracy of 0.60**

**Will we consider it a research contribution?**

**Compare the Harmonic Mean score with the Arithmetic Mean score.**

**What impact do Precision and Recall have on the calculation of F1?**

**Whose cost of misclassification is higher, Positive instances or Negative instances?**

**What possible solutions are there to handle the problem of cost of misclassification?**

**Naïve Bayes Algorithm obtained an AUC score of 0.60. What does it mean?**

**Which one is****more suitable****to use, ROC Curve or Precision-Recall Curve?**

## Evaluation Measures for Regression Problems

- Evaluation Measures for Regression Problems

**Some of the most popular and widely used Evaluation Measures for Regression Problems are**

**Mean Absolute Error (MAE)**

**Mean Squared Error (MSE)**

**Root Mean Squared Error (RMSE)**

**R****2****or Coefficient of Determination**

**Adjusted R****2**

**Usage**

**These measures are widely and commonly used in**

**Climatology**

**Forecasting**

**Regression Analysis**

**Note**

**In sha Allah (God willing), in this Chapter, I will discuss only three measures**

**Mean Absolute Error (MAE)**

**Mean Square Error (MSE)**

**Root Mean Square Error (RMSE)**

- Mean Absolute Error

**Absolute Error**

**Absolute Error (AE) is the difference between the Actual Value and the Predicted Value**

**Formula**

**AE = │X_Actual – X_Predicted│**

**where X_Actual and X_Predicted represent the Actual Value and the Predicted Value respectively**

**Mean Absolute Error**

**Mean Absolute Error (MAE) is the average of all Absolute Errors**

**Formula**

**MAE = ( Σ │X_Actual – X_Predicted│ ) / n**

**where**

**n represents the total number of instances**

**X_Actual represents the Actual Value**

**X_Predicted represents the Predicted Value**

- Mean Square Error

**Square Error**

**Square Error (SE) is the Square of the difference between the Actual Value and the Predicted Value**

**Formula**

**SE = ( X_Actual – X_Predicted )²**

**where X_Actual and X_Predicted represent the Actual Value and the Predicted Value respectively**

**Mean Square Error**

**Mean Square Error (MSE) is the average of all Square Errors**

**Formula**

**MSE = ( Σ ( X_Actual – X_Predicted )² ) / n**

**where**

**n represents the total number of instances**

**X_Actual represents the Actual Value**

**X_Predicted represents the Predicted Value**

- Root Mean Square Error

**Root Mean Square Error**

**Root Mean Square Error (RMSE) is the Square root of the Mean Square Error**

**Formula**

**RMSE = √( ( Σ ( X_Actual – X_Predicted )² ) / n )**

**where**

**n represents the total number of instances**

**X_Actual represents the Actual Value**

**X_Predicted represents the Predicted Value**

- Example – Calculating MAE, MSE, RMSE

**Consider the predictions returned by the GPA Prediction Problem discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples**

**Calculate MAE, MSE and RMSE**

- Calculating Mean Absolute Error

**To calculate Mean Absolute Error, we will compare**

**Actual Values with Predicted Values**

**Step 1: Calculate Absolute Error for each Test Example**

**AE(d1) = │X_Actual – X_Predicted│ = │2.25 – 2.00│ = 0.25**

**AE(d2) = │X_Actual – X_Predicted│ = │1.94 – 2.14│ = 0.20**

**AE(d3) = │X_Actual – X_Predicted│ = │3.43 – 2.00│ = 1.43**

**AE(d4) = │X_Actual – X_Predicted│ = │1.86 – 1.90│ = 0.04**

**AE(d5) = │X_Actual – X_Predicted│ = │1.94 – 2.00│ = 0.06**

**Step 2: Calculate Mean Absolute Error**

**MAE = ( AE(d1) + AE(d2) + AE(d3) + AE(d4) + AE(d5) ) / 5**

**MAE = ( 0.25 + 0.20 + 1.43 + 0.04 + 0.06 ) / 5**

**MAE = 0.396**

- Calculating Mean Square Error

**To calculate Mean Square Error, we will compare**

**Actual Values with Predicted Values**

**Step 1: Calculate Square Error for each Test Example**

**SE(d1) = ( X_Actual – X_Predicted )² = ( 2.25 – 2.00 )² = 0.0625**

**SE(d2) = ( X_Actual – X_Predicted )² = ( 1.94 – 2.14 )² = 0.0400**

**SE(d3) = ( X_Actual – X_Predicted )² = ( 3.43 – 2.00 )² = 2.0449**

**SE(d4) = ( X_Actual – X_Predicted )² = ( 1.86 – 1.90 )² = 0.0016**

**SE(d5) = ( X_Actual – X_Predicted )² = ( 1.94 – 2.00 )² = 0.0036**

**Step 2: Calculate Mean Square Error**

**MSE = ( SE(d1) + SE(d2) + SE(d3) + SE(d4) + SE(d5) ) / 5**

**MSE = ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5**

**MSE = 0.430**

- Calculating Root Mean Square Error

**To Calculate Root Mean Square Error, we will compare**

**Actual Values with Predicted Values**

**Step 1: Calculate Square Error for each Test Example**

**SE(d1) = ( X_Actual – X_Predicted )² = ( 2.25 – 2.00 )² = 0.0625**

**SE(d2) = ( X_Actual – X_Predicted )² = ( 1.94 – 2.14 )² = 0.0400**

**SE(d3) = ( X_Actual – X_Predicted )² = ( 3.43 – 2.00 )² = 2.0449**

**SE(d4) = ( X_Actual – X_Predicted )² = ( 1.86 – 1.90 )² = 0.0016**

**SE(d5) = ( X_Actual – X_Predicted )² = ( 1.94 – 2.00 )² = 0.0036**

**Step 2: Calculate Mean Square Error**

**MSE = ( SE(d1) + SE(d2) + SE(d3) + SE(d4) + SE(d5) ) / 5**

**MSE = ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5**

**MSE = 0.430**

**Step 3: Calculate Root Mean Square Error**

**RMSE = √( ( SE(d1) + SE(d2) + SE(d3) + SE(d4) + SE(d5) ) / 5 )**

**RMSE = √( ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5 ) = √0.430**

**RMSE = 0.656**

- Comparing MAE, MSE and RMSE

**In the previous example**

**MAE = 0.396**

**MSE = 0.430**

**RMSE = 0.656**

**Note**

**MAE is the lowest score**

**Reason**

**It only averages the absolute differences between the Actual Values and Predicted Values**

**MSE is higher than MAE**

**Reason**

**It amplifies the effect of large differences between the Actual Values and Predicted Values by taking the Square of each difference**

**RMSE is higher than MSE**

**Reason**

**Taking the Square Root of MSE brings the error back to the original units; because the MSE here is less than 1, its Square Root (RMSE) is larger than the MSE**
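The three calculations above can be reproduced in a few lines of Python (a minimal sketch; the `actual` and `predicted` lists are the five GPA values from the worked example):

```python
import math

actual    = [2.25, 1.94, 3.43, 1.86, 1.94]   # Actual GPA values from the example
predicted = [2.00, 2.14, 2.00, 1.90, 2.00]   # Predicted GPA values

n = len(actual)
mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n    # Mean Absolute Error
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n  # Mean Square Error
rmse = math.sqrt(mse)                                            # Root Mean Square Error

print(round(mae, 3), round(mse, 3), round(rmse, 3))
```

This reproduces MAE = 0.396, MSE ≈ 0.430 and RMSE ≈ 0.656 from the worked example.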

### TODO and Your Turn


Todo Tasks

**TODO Task 5**

**Task 1**

**Consider a Regression Problem which aims to predict the Number of Admissions Next Year. The table below contains the Actual and Predicted Values.**

**Note**

**Your answer should be**

**Well Justified**

**Questions**

**Calculate the following**

**Mean Absolute Error**

**Mean Square Error**

**Root Mean Square Error**

**Which one of the three evaluation measures (MAE, MSE, RMSE) is****more suitable****?**

Your Turn Tasks

**Your Turn Task 5**

**Task 1**

**Consider a Regression Problem (similar to the one given in TODO Task) and answer the questions given below.**

**Questions**

**Calculate the following**

**Mean Absolute Error**

**Mean Square Error**

**Root Mean Square Error**

**Which one of the three evaluation measures (MAE, MSE, RMSE) is****more suitable****?**

## Evaluation Measures for Sequence-to-Sequence Problems

- Evaluation Measures for Sequence-to-Sequence Problems

**Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are**

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

**BLEU (Bi-Lingual Evaluation Understudy)**

**METEOR (Metric for Evaluation of Translation with Explicit Ordering)**

**Usage**

**ROUGE, BLEU and METEOR are widely and commonly used to evaluate a range of Sequence-to-Sequence Problems including**

**Text Summarization**

**Machine Translation**

**Chatbot**

**Question Answering**

**Automatic Paraphrase Generation**

**Automatic Grading of Essays**

**Generating Caption for an Image / Video**

**Generating Natural Language Description for an Image / Video**

**Speech to Text**

**Note**

**In this Chapter, In sha Allah (God willing), I will give examples of ROUGE and BLEU**

- Evaluating Text Summarization System using ROUGE

**ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems**

**In sha Allah (God willing), I will use the following three metrics of ROUGE to evaluate an Urdu Text Summarization System**

**ROUGE-1**

**ROUGE-2**

**ROUGE-L**

**Average F1 scores will be reported for the ROUGE-1, ROUGE-2 and ROUGE-L metrics**

**Note**

**To understand the working of the ROUGE-1, ROUGE-2 and ROUGE-L metrics**

**See Tutorial – Evaluating Sequence to Sequence Models using ROUGE**

**Consider the Text Summarization System discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples**

**Below are the Predictions Returned by the Model on Test Data**

- Calculating Average F1 Scores for ROUGE-1, ROUGE-2 and ROUGE-L
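As an illustration of the underlying idea (not the full ROUGE package), ROUGE-1 F1 can be computed from clipped unigram overlap; the helper name `rouge1_f1` is mine, and real evaluations typically use an established implementation such as the `rouge-score` library:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 from clipped unigram overlap between candidate and reference."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())            # multiset (clipped) intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())        # overlap / candidate length
    recall = overlap / sum(ref.values())            # overlap / reference length
    return 2 * precision * recall / (precision + recall)
```

For example, with reference "the cat sat on the mat" and candidate "the cat on the mat", the overlap is 5 unigrams, giving Precision = 5/5, Recall = 5/6 and F1 ≈ 0.91.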

- Evaluating Machine Translation System using BLEU

**BLEU is a de facto standard to automatically evaluate the performance of Machine Translation Systems**

**In sha Allah (God willing), I will use the following four metrics of BLEU to evaluate an Urdu Machine Translation System**

**BLEU-1**

**BLEU-2**

**BLEU-3**

**BLEU-4**

**To understand the working of the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics**

**See Tutorial – Evaluating Sequence to Sequence Models using BLEU**

- Evaluating Machine Translation System using BLEU Cont….

**Consider the Machine Translation System discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples**

**Below are the Predictions Returned by the Model on Test Data**
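A minimal sketch of the idea behind BLEU-1 (clipped unigram precision multiplied by the brevity penalty); the function name `bleu1` is mine, and full BLEU combines the BLEU-1 through BLEU-4 precisions as a geometric mean:

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """BLEU-1: clipped unigram precision multiplied by the brevity penalty."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    clipped = sum(min(count, ref[word]) for word, count in cand.items())
    precision = clipped / sum(cand.values())
    r, c = sum(ref.values()), sum(cand.values())
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)  # penalize short candidates
    return brevity_penalty * precision
```

Clipping stops a candidate from being rewarded for repeating a reference word: with reference "the cat", the candidate "the the the" scores only 1/3, because "the" is credited at most once.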

### TODO and Your Turn


Todo Tasks

**TODO Task 6**

**Task 1**

**Consider the Text Summarization and Machine Translation Problems given in this Chapter. I have calculated ROUGE scores for the Text Summarization System and BLEU scores for the Machine Translation System.**

**Note**

**Your answer should be**

**Well Justified**

**Question**

**Describe the following things about METEOR**

**Definition**

**Purpose**

**Importance**

**Applications**

**Strengths**

**Weaknesses**

**Calculate METEOR scores for the Text Summarization and Machine Translation Systems given in this Chapter.**

**TIP**

**See the following article on METEOR**

**Wikipedia Article on METEOR**

**URL:** https://en.wikipedia.org/wiki/METEOR

Your Turn Tasks

**Your Turn Task 6**

**Task 1**

**Identify a Machine Learning Problem (similar to Text Summarization and Machine Translation in TODO Task) and answer the questions given below.**

**Question**

**Calculate the METEOR score for the selected Machine Learning Problem.**

## Chapter Summary

- Chapter Summary

**To completely and correctly learn any task, follow the Learning Cycle**

**The four main phases of a Learning Cycle to completely and correctly learn any task are**

**Training Phase**

**Testing Phase**

**Application Phase**

**Feedback Phase**

**The main goal of building a Model (Training / Learning) is to use it in doing Real-world Tasks with good Accuracy**

**No one can perfectly predict that if a Model performs well in the Training Phase, it will also perform well in the Real-world**

**However, before deploying a Model in the Real-world, it is important to know**

**How well will it perform on unseen Real-time Data?**

**To judge / estimate the performance of a Model (or h) in the Real-world (Application Phase)**

**Evaluate the Model (or h) on large Test Data (Testing Phase)**

**If (Model Performance = Good AND Test Data = Large) Then Use the Model in Real-world**

**Else Refine (re-train) the Model**

**Recall – Machine Learning Assumption**

**If a Model (or h) performs well on large Test Data, it will also perform well on unseen Real-world Data**

**Again, this is an assumption and we are not 100% sure that a Model which performs well on large Test Data will definitely perform well on Real-world Data**

**Therefore, it is useful to take continuous Feedback on the deployed Model (Feedback Phase) and keep on improving it**

**Two main advantages of Evaluating a Hypothesis / Model are**

**We get the answer to an important question i.e.**

**Should we rely on predictions of the Hypothesis / Model when deployed in the Real-world?**

**Machine Learning Algorithms may rely on Evaluation to refine the Hypothesis (h)**

**When we evaluate a Hypothesis h, we want to know**

**How accurately will it classify future unseen instances?**

**i.e. Estimate of Error (EoE)**

**How accurate is our Estimate of Error (EoE)?**

**i.e. what Margin of Error (±? %) is associated with our Estimate of Error (EoE)? (we call it Error in Estimate of Error (EoE))**

**True Error**

**Error computed on the entire Population**

**Sample Error**

**Error computed on Sample Data**

**Since we cannot acquire the entire Population**

**Therefore, we cannot calculate True Error**

**Calculate Sample Error in such a way that**

**Sample Error estimates True Error well**

**Statistical Theory tells us that Sample Error can estimate True Error well if the following two conditions are fulfilled**

**Condition 01 – the *n* instances in Sample *S* are drawn**

**independently of one another**

**independently of *h***

**according to Probability Distribution *D***

**Condition 02**

**n ≥ 30**

**Estimate of Sample Error (EoSE) can be calculated as follows**

**Step 1: Randomly select a Representative Sample S from the Population**

**Step 2: Calculate Sample Error on the Representative Sample S drawn in Step 1**

**An Estimate cannot be perfect and will contain Error**

**Therefore, calculate the Error in Estimate of Sample Error (EoSE)**

**Error in Estimate of Sample Error (EoSE) can be calculated using a Confidence Interval**

**The Most Probable Value of True Error is the Sample Error**

**With approximately N% Probability (Confidence Level), True Error lies in the interval**

**error_S(h) ± z_N √( error_S(h) ( 1 – error_S(h) ) / n )**

**where n represents the size of the Sample**

**error_S(h) represents the Sample Error**

**and z_N is the constant corresponding to the chosen Confidence Level (e.g. z_N = 1.96 for 95%)**

**The choice of Confidence Level depends on the field of study**

**Generally, the most common Confidence Level used by Researchers is 95%**

**Our goal is to have a**

**Small Interval with High Confidence**
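The interval computation above can be sketched in a few lines (assuming the usual normal approximation; `z = 1.96` corresponds to the 95% Confidence Level):

```python
import math

def confidence_interval(sample_error, n, z=1.96):
    """N% confidence interval for True Error using the normal approximation:
    error_S(h) +/- z * sqrt(error_S(h) * (1 - error_S(h)) / n).
    Valid when n >= 30; z = 1.96 gives a 95% Confidence Level."""
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - margin, sample_error + margin

# Sample Error of 0.10 measured on 100 test instances
low, high = confidence_interval(0.10, 100)
```

For a Sample Error of 0.10 on 100 test instances this gives roughly the interval [0.041, 0.159]; note how a larger n shrinks the interval, matching the goal of a small interval with high confidence.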

**The two main diseases in Machine Learning are**

**Overfitting**

**Underfitting**

**The condition when a Machine Learning Algorithm tries to remember all the Training Examples from the Training Data (Rote Learning) is known as Overfitting of the Model (h)**

**Overfitting happens when our**

**Model (h) has a lot of features or**

**Model (h) is too complex**

**The condition when a Machine Learning Algorithm could not learn the correlations between Attributes / Features properly is known as Underfitting of the Model (h)**

**Underfitting happens when our**

**Model misses the trends or patterns in the Training Data and cannot generalize well for the Training Examples**

**To overcome the problems of Overfitting and Underfitting, we**

**Use the Train-Test Split Approach**

**The Train-Test Split Approach splits the Sample Data into two sets: (1) Train Set and (2) Test Set**

**Two main variations of the Train-Test Split Approach are**

**Random Split Approach**

**Class Balanced Split Approach**

**The Class Balanced Split Approach should be preferred over the Random Split Approach**

**The Train-Test Split Ratio determines what percentage of the Sample Data will be used as the Train Set and what percentage will be used as the Test Set**

**The Train-Test Split Ratio may vary from Machine Learning Problem to Machine Learning Problem**

**e.g. 70%-30%, 80%-20%, 90%-10% etc.**

**Most Common Train-Test Split Ratio**

**Use 2 / 3 of Sample Data as Train Set**

**Use 1 / 3 of Sample Data as Test Set**
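A minimal sketch of a Class Balanced (stratified) split with the common 2/3–1/3 ratio; the helper name `class_balanced_split` is my own illustrative choice:

```python
import random
from collections import defaultdict

def class_balanced_split(labels, test_fraction=1/3, seed=0):
    """Return (train_indices, test_indices), keeping each class's
    train/test proportion close to the requested split ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)            # group instance indices by class
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                 # random order within each class
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test
```

Because the split is performed per class, a rare class keeps roughly the same share of the Test Set as it has in the Sample Data, which a purely Random Split cannot guarantee.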

**The Train-Test Split Approach provides high variance in the Estimate of Sample Error since**

**Changing which examples happen to be in the Train Set can significantly change the Sample Error**

**To overcome the problem of high variance in the Estimate of Sample Error calculated using the Train-Test Split Approach, we**

**Use the K-fold Cross Validation Approach**

**The K-fold Cross Validation Approach works as follows**

**Step 1: Split the Sample Data into K equal folds (or partitions)**

**Step 2: Use one of the folds (the k-th fold) as the Test Set and the union of the remaining folds (K – 1 folds) as the Train Set**

**Step 3: Calculate the Error of the Model (h)**

**Step 4: Repeat Steps 2 and 3, choosing Train Sets and Test Sets from different folds, and calculate the Error K times**

**Step 5: Calculate the Average Error**

**Important Note**

**In each fold, there must be at least 30 instances**

**All K folds must be disjoint, i.e. an instance appearing in one fold must not appear in any other fold**

**Empirical studies show that the best value for K is 10**
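The K-fold procedure above can be sketched in pure Python (a minimal illustration; real projects typically use a library routine such as scikit-learn's `KFold`):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds and yield
    (train_indices, test_indices) for each of the k rounds."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin fold assignment
    for i in range(k):
        test = folds[i]                                # one fold -> Test Set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test                              # remaining k-1 folds -> Train Set

# 10 instances, 5 folds -> 5 disjoint Test Sets of 2 instances each
splits = list(kfold_indices(10, 5))
```

The sketch makes the two notes above easy to verify: every instance appears in exactly one Test Set, and the folds are pairwise disjoint.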

**The K-fold Cross Validation Approach is a better estimator of Error since**

**All data is used for both Training and Testing**

**The K-fold Cross Validation Approach is computationally expensive since**

**we have to repeat the Training and Testing Phases K times**

**It is suitable to use the Train-Test Split Approach in the following situations**

**When Training Time is Very Large**

**Organizing International Competitions**

**Having Very Large Sample Data**

**It is suitable to use the K-fold Cross Validation Approach in the following situations**

**When Training Time is Not Very Large**

**Having Sample Data that is Not Very Large**

**To compare various Machine Learning Algorithms, the following things must be the same**

**Train Set**

**Test Set**

**Evaluation Measure**

**Evaluation Methodology**

**Important Note**

**If any of the above things is not the same, then it will**

**Not be a valid comparison**

**Some of the most popular and widely used Evaluation Measures for Classification Problems are**

**Baseline Accuracy**

**Accuracy**

**Precision**

**Recall**

**F****1**

**Area Under the Curve (AUC)**

**Baseline Accuracy (a.k.a. Majority Class Categorization (MCC)) is calculated by assigning the label of the Majority Class to all the Test Instances**

**Baseline Accuracy provides a simple baseline against which to compare proposed Machine Learning Algorithms**
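For instance, the majority-class baseline can be computed directly from the label counts (a minimal sketch, using the 2000-document distribution from the Ternary Classification task earlier in this chapter):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Majority-class baseline: predict the most frequent label for every instance."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)  # majority count / total

# 900 ND, 600 WD, 500 PD documents
ba = baseline_accuracy(['ND'] * 900 + ['WD'] * 600 + ['PD'] * 500)
```

Here the majority class (ND) covers 900 of 2000 documents, so the Baseline Accuracy is 0.45; any proposed classifier should beat this number.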

**Accuracy is defined as the proportion of correctly classified Test instances**

**Accuracy = 1 – Error**

**The Accuracy evaluation measure is more suitable to use for the evaluation of Machine Learning Algorithms when we have**

**Balanced Data**

**Two main limitations of the Accuracy measure are**

**Accuracy fails to accurately evaluate a Machine Learning Algorithm when Test Data is highly unbalanced**

**Accuracy ignores the possibility of different misclassification costs**

**To overcome the limitations of the Accuracy measure, we use**

**Confusion Matrix**

**A Confusion Matrix is a table used to describe the performance of a Classification Model (or Classifier) on a Set of Test Examples (Test Data), whose Actual Values (or True Values) are known**

**Some of the main advantages of the Confusion Matrix are**

**The Confusion Matrix allows us to visualize the performance of a Model / Classifier**

**The Confusion Matrix allows us to separately get insights into the Errors made for each Class**

**The Confusion Matrix gives insights into both**

**Errors made by a Model / Classifier and**

**Types of Errors made by a Model / Classifier**

**The Confusion Matrix allows us to compute many different Evaluation Measures including**

**Baseline Accuracy**

**Accuracy**

**True Positive Rate (or Recall)**

**True Negative Rate**

**False Positive Rate**

**False Negative Rate**

**Precision**

**F****1**

**Recall or True Positive Rate (TPR) or Sensitivity is the proportion of Positive cases that were correctly classified**

**False Positive Rate (FPR) is the proportion of Negative cases that were incorrectly classified as Positive**

**True Negative Rate (TNR) or Specificity is defined as the proportion of Negatives cases that were classified correctly**

**False Negative Rate (FNR) is the proportion of Positive cases that were incorrectly classified as Negative**

**Precision (P) is the proportion of the predicted Positive cases that were correct**

**F-measure is the Harmonic Mean of Precision and Recall**

**where β controls the relative weight assigned to Precision and Recall**

**When we assign the same weight to Precision and Recall, i.e. β = 1, the F-measure becomes the F1-measure**
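These definitions can be sketched directly from the confusion-matrix counts (a minimal illustration; the helper name `precision_recall_fbeta` is mine):

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, Recall and F_beta from confusion-matrix counts.
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    precision = tp / (tp + fp)      # predicted-Positive cases that were correct
    recall = tp / (tp + fn)         # actual-Positive cases that were found
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# Example: 8 True Positives, 2 False Positives, 4 False Negatives
p, r, f1 = precision_recall_fbeta(8, 2, 4)
```

With these counts, Precision = 0.8 and Recall ≈ 0.67, and the Harmonic Mean pulls F1 toward the lower of the two (≈ 0.73), which is exactly why F1 punishes an imbalance between Precision and Recall.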

**When considering Misclassifications**

**misclassifying (or incorrectly predicting) Positive instances may be more or less costly than misclassifying Negative instances**

**Before Training your Model, you must be very clear about the Misclassification Costs, otherwise**

**Your Model will fail to perform well in the Real-world (i.e. Application Phase)**

**The problem of considering different misclassification costs can be handled using**

**ROC Curves**

**Precision-Recall Curves**

**A ROC Curve summarizes the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) for a Classifier / Model using different Probability Thresholds**

**ROC Curves are more suitable to use when we have Balanced Data**

**A Precision-Recall Curve summarizes the trade-off between Precision and Recall (or True Positive Rate) for a Classifier / Model using different Probability Thresholds**

**Precision-Recall Curves are more suitable to use when we have Highly Unbalanced Data**

**Area Under the ROC Curve (AUC) is defined as the probability that a randomly chosen Positive instance is ranked above a randomly chosen Negative one**

**AUC tells how well a Model is capable of distinguishing between Classes**

**The AUC score is computed using the True Positive Rate (TPR) and False Positive Rate (FPR)**

**Range of AUC Score**

**Range of AUC Score is [0 – 1]**

**0 means All Predictions of Model are Wrong**

**1 means All Predictions of Model are Correct**
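The rank-based definition of AUC can be computed directly from predicted scores (a minimal sketch; ties between a Positive and a Negative score count as half a win):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen Positive instance
    receives a higher score than a randomly chosen Negative one."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)   # compare every pair
    return wins / (len(pos_scores) * len(neg_scores))
```

If every Positive instance outscores every Negative instance the AUC is 1; if the scores carry no information about the class, the AUC sits at 0.5 (No Discrimination).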

**What is a good AUC Score?**

**AUC Score = 0.5**

**suggests No Discrimination (i.e., the Model does not have the ability to differentiate Positive instances from Negative ones)**

**AUC Score = 0.7 to 0.8**

**considered Acceptable**

**AUC Score = 0.8 to 0.9**

**considered Excellent**

**AUC Score = Greater Than 0.9**

**considered Outstanding**

**Some of the most popular and widely used Evaluation Measures for Regression Problems are**

**Mean Absolute Error (MAE)**

**Mean Squared Error (MSE)**

**Root Mean Squared Error (RMSE)**

**R****2****or Coefficient of Determination**

**Adjusted R****2**

**Absolute Error (AE) is the difference between the Actual Value and the Predicted Value**

**Mean Absolute Error (MAE) is the average of all Absolute Errors**

**Square Error (SE) is the Square of the difference between the Actual Value and the Predicted Value**

**Mean Square Error (MSE) is the average of all Square Errors**

**Root Mean Square Error (RMSE) is the Square root of the Mean Square Error**

**Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are**

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

**BLEU (Bi-Lingual Evaluation Understudy)**

**METEOR (Metric for Evaluation of Translation with Explicit Ordering)**

**ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems**

**Normally, we calculate**

**ROUGE-1**

**ROUGE-2**

**ROUGE-L**

**Average F1 scores are reported for the ROUGE-1, ROUGE-2 and ROUGE-L metrics**

**BLEU is a de facto standard to automatically evaluate the performance of Machine Translation Systems**

**Normally, we calculate**

**BLEU-1**

**BLEU-2**

**BLEU-3**

**BLEU-4**

### In Next Chapter

- In Next Chapter

- In sha Allah (God willing), in the next Chapter, I will present

- Book Main Findings, Conclusion and Future Work