Chapter 14 - Evaluating Hypothesis (Models)
Chapter Outline
- Chapter Outline
- Quick Recap
- Why Evaluate Hypotheses (Model)?
- Two Main Diseases of Machine Learning Algorithms
- Comparing Machine Learning Algorithms
- Evaluation Measures for Classification Problems
- Evaluation Measures for Regression Problems
- Evaluation Measures for Sequence-to-Sequence Problems
- Chapter Summary
Quick Recap
- Quick Recap - Evaluating Hypothesis (Models)
- Machine Learning Algorithms studied so far are
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm (Decision Tree Learning)
- Perceptron (Two Layer Artificial Neural Network)
- Regular Neural Network (Multi-layer Artificial Neural Network)
- Naïve Bayes Algorithm (Bayesian Learning Algorithm)
- The above-mentioned ML Algorithms are called Eager Methods
- How Eager ML Algorithms Work?
- Given
- Set of Training Examples (D)
- Set of Functions / Hypothesis (H)
- Training Phase
- Build an explicit representation of the Target Function (called approximated Target Function) by
- Searching Hypothesis Space (H) to find a Hypothesis h (approximated Target Function), which best fits the Set of Training Examples (D)
- Testing Phase
- Use the explicit representation (approximated Target Function) to
- classify unseen instances
- Approximated Target Function is applied on an unseen instance and it predicts the Target Classification
- How Lazy ML Algorithms Work?
- Given
- Set of Training Examples (D)
- Training Phase
- Simply store the Set of Training Examples (D)
- Testing Phase
- To classify a new unseen instance x
- Step 1: Compare unseen instance x with all the stored Training Examples
- Step 2: Depending on relationship of unseen instance x with stored Training Examples
- Predict the Target Classification for unseen instance x
- Machine Learning Algorithms which build an explicit representation of the Target Function as the Training Examples are presented are called Eager Machine Learning Algorithms
- Machine Learning Algorithms which defer processing until a new unseen instance must be classified are called Lazy Machine Learning Algorithms
- Instance-based ML Algorithms are Lazy ML Algorithms
- k-NN Algorithm is the grand-daddy of Instance-based Machine Learning Algorithms
- k nearest neighbors of an unseen instance x are Training Examples that have the k smallest distance to unseen instance x
- k-Nearest Neighbor (k-NN) Algorithm – Summary
- Representation of Training Examples
- Attribute-Value Pair
- Computing Relationship between unseen instance x and stored Training Examples (D)
- Use a Distance Metric
- Distance Metrics – Numeric Data
- Euclidean Distance, Jaccard Co-efficient etc.
- Distance Metrics – Categorical Data
- Hamming Distance, Edit Distance etc.
- Strengths
- Very simple implementation
- Robust regarding the search space
- For example, Classes don’t have to be linearly separable
- k-NN Algorithm can be easily updated with new Training Examples (Online Learning) at very little cost
- Need to tune only two main parameters
- Distance Metric and
- Value of k
- Weaknesses
- Testing each instance is expensive
- Sensitive to noisy or irrelevant Attributes, which can result in less meaningful Distance scores
- Sensitivity to highly unbalanced datasets
- k-Nearest Neighbor (k-NN) Algorithm requires three things
- Set of Training Examples (D)
- Distance Metric to compute distance between unseen instance x and Set of stored Training Examples
- Value of k
- Number of Nearest Neighbors to consider when classifying unseen instance x
- In k-Nearest Neighbor (k-NN) Algorithm, value of k should be
- Carefully chosen
- If k is too small
- Sensitive to noise points (Training Examples)
- If k is too large
- Neighborhood may include points (Training Examples) from other Classes
- Steps – Classify an Unseen Instance using k-NN Algorithm
- Given
- Set of 15 Training Examples
- Distance Metric = Euclidean Distance
- k = 3
- Classifying an unseen instance x using k-NN Algorithm
- Step 1: Compute Euclidean Distance between unseen instance x and all 15 Training Examples
- Identify 3-nearest neighbors (as k = 3)
- Use the Class labels of the k nearest neighbors to determine the Class label of unseen instance x (e.g., by taking a Majority Vote); a code sketch of these steps appears at the end of this recap
- k-NN Algorithm suffers from Scaling issue
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the Attributes
- Scale / normalize Attributes before applying k-NN Algorithm
- In Offline Learning, we learn a Concept from a static dataset
- In Online Learning, we learn a Concept from data and keep on updating our Hypothesis (h) as more data becomes available for the same Concept
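As referenced in the recap above, here is a minimal sketch of the k-NN classification steps (scale the Attributes, compute Euclidean distances, take a Majority Vote among the k nearest Training Examples). The tiny dataset, its attribute values and the class labels are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_unseen, k=3):
    # Scale attributes (min-max) so that no single attribute dominates the distance
    mins, maxs = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(maxs - mins == 0, 1, maxs - mins)
    X_scaled = (X_train - mins) / scale
    x_scaled = (x_unseen - mins) / scale

    # Step 1: Euclidean distance between unseen instance x and every Training Example
    distances = np.sqrt(((X_scaled - x_scaled) ** 2).sum(axis=1))

    # Step 2: identify the k nearest neighbors
    nearest = np.argsort(distances)[:k]

    # Step 3: Majority Vote over the class labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical Training Examples: 2 numeric attributes, binary class labels
X_train = np.array([[1.0, 200], [2.0, 180], [9.0, 20], [8.5, 35], [1.5, 210]])
y_train = np.array(["Female", "Female", "Male", "Male", "Female"])
print(knn_predict(X_train, y_train, np.array([2.2, 190]), k=3))  # -> Female
```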
Why Evaluate Hypotheses (Model)?
- Completely and Correctly Learning any Task
- Question
- How to completely and correctly learn any task?
- A Possible Answer
- Follow the Learning Cycle
- Learning Cycle
- The four main phases of a Learning Cycle to completely and correctly learn any task are
- Training Phase
- Testing Phase
- Application Phase
- Feedback Phase
- Example - Completely and Correctly Learning any Task
- Learning Problem
- Learn to Drive a Car in Real-world
- Question
- How to achieve this goal?
- A Possible Answer
- Follow the Learning Cycle
- Training Phase
- Learn to drive a Car from
- A Trainer in a Training Centre
- Assumption
- The environment of the Training Centre mimics that of the Real-world
- Outcome of Training Phase
- I have learned to drive a car in a Training Centre
- Question
- Can we completely trust the quality of Training and allow the Trainee to drive a car in the Real-world?
- Answer
- No
- Problem
- We cannot completely trust the quality of Training and allow the Trainee to drive a car in the Real-world
- A Possible Solution
- Evaluate (test) the driving skills of the Trainee at a Test Centre, i.e. an environment which is
- different from the Training Centre and
- mimics the Real-world
- Testing Phase
- Evaluate the driving skills of the Trainee at Test Centre
- Question
- How to ensure quality in evaluation?
- A Possible Answer
- Design such an evaluation process which completely and correctly evaluates all aspects of a Task (to be evaluated)
- Assumption – in Current Example
- In this current example, I am assuming that the evaluation process to evaluate the driving skills of a Trainee is
- complete and correct in all aspects
- Outcome of Testing Phase
- If (Trainee Performed Well in Testing Phase)
- Then
- Allow Trainee to Drive a Car in Real-world
- Else
- Do not Allow Trainee to Drive a Car in Real-world and ask him/her to take further Training to refine his / her Driving Skills
- In this example, we assume that
- Trainee Performed Well in Testing Phase and (s)he has been given a
- Driving License
- Duration of a Driving License
- Note that duration of a Driving License is normally for a period of 5 years
- Note
- Learning is a continuous process till death 😊
- Application Phase
- Person (Driver) with Driving License is driving the car in the Real-world
- Feedback Phase
- Driving skills of a Driver are constantly monitored by Traffic Police 😊
- Positive Feedback
- Keep driving a car in Real-world
- Negative Feedback
- Punishment
- In extreme situations, punishment may result in cancellation of Driving License
- Conclusion
- It can be noted from the Learning Cycle that there is
- Continuous Evaluation
- This indicates that Continuous Evaluation with Quality is very important to bring Quality in Learning Process
- Why Evaluate Hypothesis (Model)?
- Given
- A Machine Learning Algorithm (e.g. ID3 Algorithm) and
- A Hypothesis (h) or Model
- which ID3 Algorithm has learned from a Set of Training Examples
- Question
- Why should we evaluate h (Model)?
- Answer
- The main goal of building a Model (Training / Learning) is to use it in doing Real-world Tasks with good Accuracy
- No one can perfectly predict that if a Model performs well in the Training Phase, then it will also perform well in the Real-world
- However, before deploying a Model in the Real-world, it is important to know
- How well will it perform on unseen Real-time Data?
- To judge/estimate the performance of a Model (or h) in Real-world (Application Phase)
- Evaluate the Model (or h) on large Test Data (Testing Phase)
- Recall – Machine Learning Assumption
- If a Model (or h) performs well on large Test Data, it will also perform well on unseen Real-world Data
- Again, this is an assumption and we are not 100% sure that a Model which performs well on large Test Data will definitely perform well on Real-world Data
- Therefore, it is useful to take continuous Feedback on deployed Model (Feedback Phase) and keep on improving it
- Advantages of Evaluating Hypothesis (Model)?
- Two main advantages of Evaluating Hypothesis / Model are
- We get the answer to an important question i.e.
- Should we rely on predictions of Hypothesis / Model when deployed in Real-world?
- Machine Learning Algorithms may rely on Evaluation to refine Hypothesis (h)
- What We Want to Know When Evaluating Hypothesis / Model
- Question
- What do we want to know when evaluating a Hypothesis h?
- Answer
- Estimate of Error (EoE)
- How accurately will it classify future unseen instances?
- Error in Estimate of Error (Error in EoE)
- How accurate is our Estimate of Error (EoE)?
- I.e. what Margin of Error (±? %) is associated with our Estimate of Error (EoE)?
- Recall – Population and Sample
- Population (N)
- Definition
- Total set of observations (or examples) for a Machine Learning Problem
- Collecting Data equivalent to the size of Population, will lead to perfect learning
- Sample (S)
- Definition
- Subset of observations (or examples) drawn from a Population
- Note
- The size of a Sample is always less than the size of the Population from which it is taken
- Most Important Property of a Sample
- A Sample should be true representative of the Population
- Example – Population and Sample
- Machine Learning Problem
- Gender Identification
- Population
- Set of all observations (humans) in the world
- Sample
- A set of 5000 observations (humans) drawn from Population
- A Sample should be true representative of the Population
- if in a Population
- 60% are Females and 40% are Males
- Then a Sample (5000 observations) drawn from this Population should have
- 60% Females (3000 observations) and 40% Males (2000 observations)
- Why We Need Data Sampling?
- For Perfect Learning (Ideal Situation)
- Collect all Data (or observations/examples) for a Machine Learning Problem
- Problem
- Practically Not Possible
- A Possible Solution (Realistic Situation)
- Draw a Sample from Population which should be its true representative (called a Representative Sample)
- Note
- Since ML Algorithms learn from Sample Data instead of Population Data, that is why
- They have Scope of Error
- Remember
- Only Allah and His Habib, Hazrat Muhammad S.A.W.W. (teachings) are perfect
- So, follow them to be successful in this world and hereafter, Ameen
- True Error vs Sample Error
- True Error
- Error computed on entire Population
- Sample Error
- Error computed on Sample Data
- True Error – Formal Definition
- The True Error of hypothesis h (or Model) with respect to Target Function f and Probability Distribution D is the probability that h will misclassify an instance drawn at random according to D
- Sample Error – Formal Definition
- The Sample Error of hypothesis h (or Model) with respect to Target Function f and Data Sample S is the proportion of examples in S that h misclassifies
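Written compactly, and consistent with the two definitions above, the standard notation is (a math sketch; δ(·) is 1 when its argument is true and 0 otherwise):
- True Error: $error_D(h) = \Pr_{x \sim D}\big[f(x) \neq h(x)\big]$
- Sample Error: $error_S(h) = \dfrac{1}{n} \sum_{x \in S} \delta\big(f(x) \neq h(x)\big)$, where n is the number of examples in Sample S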
- Calculating True Error
- Problem
- Since we cannot acquire entire Population
- Therefore, we cannot calculate True Error
- A Possible Solution
- Calculate Sample Error in such a way that
- Sample Error estimates True Error well
- How Sample Error Estimates True Error Well?
- Statistical Theory tells us that Sample Error can estimate True Error well if the following two conditions are fulfilled
- Condition 01 – the n instances in Sample S are drawn
- independently of one another
- independently of h
- according to Probability Distribution D
- Condition 02
- n ≥ 30
- Important Note
- A very Common Mistake in Evaluation
- Size of Test Data < 30 instances
- Remember: the Test Data should always contain at least 30 instances
- Example – Calculating Sample Error
- Problem
- Calculating Sample Error
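The worked figures of this example are not reproduced here, so the sketch below uses hypothetical counts purely to illustrate the calculation: the Sample Error is the number of misclassified instances divided by the size of the Sample.

```python
# Hypothetical counts, for illustration only
n_sample = 40          # size of Sample S (n >= 30, as required above)
n_misclassified = 10   # instances that hypothesis h misclassified

sample_error = n_misclassified / n_sample
print(f"Sample Error error_S(h) = {sample_error:.2f}")  # 0.25
```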
- Sample Error vs Estimate of Sample Error (EoSE)
- Ideal Situation
- Calculate Sample Error
- Realistic Situation
- Calculate Estimate of Sample Error (EoSE)
- In sha Allah (انشاء اللہ), in next Slides, I will try to explain the difference between
- Sample Error and
- Estimate of Sample Error (EoSE)
- Example - Estimate of Sample Error (EoSE)
- Machine Learning Problem
- Gender Identification
- Machine Learning Algorithm
- ID3 Algorithm
- Population (Instance Space (X))
- All humans in the world
- Sampling Technique
- Random Sampling
- Representative Sample (Sample Data)
- Set of Examples (or humans) randomly drawn from the Population
- Sample Size = 5000 instances
- Split Sample Data
- Using a Train-Test Split Ratio of 80%-20%
- Training Data = 4000 instances
- Testing Data = 1000 instances
- Evaluation Measure
- Error
- Example - Estimate of Sample Error (EoSE)
- Consider three Samples randomly drawn from Population
- Sample S1
- Sample S2
- Sample S3
- Training and Testing ID3 Algorithm on Sample S1
- Sample Error (S1) = 0.25
- Training and Testing ID3 Algorithm on Sample S2
- Sample Error (S2) = 0.20
- Training and Testing ID3 Algorithm on Sample S3
- Sample Error (S3) = 0.30
- Note that Sample Error for all three random Samples (S1, S2 and S3) is
- Different
- Conclusion
- If we randomly draw n different Samples and compute Sample Error, then
- Sample Error is likely to vary from Sample to Sample
- Therefore, we cannot compute Sample Error, we can
- only compute Estimate of Sample Error (EoSE)
- How to Calculate Estimate of Sample Error (EoSE)?
- Question
- How can we calculate the Estimate of Sample Error (EoSE)?
- Answer
- Step 1: Randomly select a Representative Sample S from the Population
- Step 2: Calculate Sample Error on Representative Sample S drawn in Step 1
- Note that this Sample Error is called Estimate of Sample Error (EoSE)
- Estimate of Sample Error (EoSE)
- Problem
- Estimate cannot be perfect and will contain Error
- A Possible Solution
- Calculate Error in Estimate of Sample Error (EoSE)
- Question
- How to calculate Error in Estimate of Sample Error (EoSE)?
- A Possible Answer
- Use Confidence Interval
- Confidence Interval – Formal Definition
- The most probable value of the True Error is the Sample Error; with approximately N% Probability, the True Error lies in the interval
- $error_S(h) \pm z_N \sqrt{\dfrac{error_S(h)\,\big(1 - error_S(h)\big)}{n}}$
- where n represents the size of the Sample, $error_S(h)$ represents the Sample Error, and $z_N$ is the constant for the chosen Confidence Level (e.g. $z_N = 1.96$ for 95%)
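A small sketch of the interval calculation based on the formula above; the z-values follow the standard normal table, and the Sample Error of 0.25 measured on n = 100 test instances is a hypothetical example, not one from the text.

```python
import math

# z-values for common Confidence Levels (standard normal distribution)
Z_VALUES = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(sample_error, n, confidence=0.95):
    """Return (lower, upper) limits for the True Error around the given Sample Error."""
    z = Z_VALUES[confidence]
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)  # Margin of Error
    return sample_error - margin, sample_error + margin

# Hypothetical example: Sample Error = 0.25 measured on n = 100 test instances
low, high = confidence_interval(0.25, 100, confidence=0.95)
print(f"With ~95% confidence, the True Error lies in [{low:.4f}, {high:.4f}]")  # ~[0.1651, 0.3349]
```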
- Confidence Interval
- Definition
- Confidence Interval provides us with lower and upper limits around our Estimate of Sample Error (EoSE), and within this interval we can then be confident that we have captured the True Error
- The lower limit and upper limit around our Sample Error tells us the range of values our True Error is likely to lie within
- Confidence Interval is often used with a
- Margin of Error
- Margin of Error – Definition
- Margin of Error is the range of values below and above the Estimate of Sample Error (EoSE) in a Confidence Interval
- Confidence Level
- Definition
- Confidence Level is defined as the probability that the value of an Estimate of Sample Error (EoSE) falls within a specified range of values
- Formula
- Confidence Level = (1 − α) × 100%, where α is the Significance Level
- The choice of Confidence Level depends on the field of study
- Generally, the most common Confidence Level used by Researchers is 95%
- Example – Confidence Interval
- Question
- Explain the following statement
- With 95% Confidence, we can say that Estimate of Sample Error (EoSE) lies in an Interval of 0.20 – 0.30
- Answer
- If we repeatedly draw random Samples from a Population (using the same technique) and compute the Estimate of Sample Error (EoSE), then 95% of the time the True Error will fall within the computed interval, i.e. between 0.20 and 0.30
- Example – Calculating Error in Estimate of Sample Error
- Problem
- A Two-Step Process
- Step 1: Calculate Estimate of Sample Error (EoSE)
- Step 2: Calculate Error in Estimate of Sample Error (EoSE)
- Error in Estimate of Sample Error (EoSE)
- We know with approximately 95% Probability that
- True Error lies in the range 0.1628 to 0.4372
- Summary – Confidence Interval
- Our goal is to have a
- Small Interval with High Confidence
- Note
- For Small Intervals normally Confidence is Low
- For Large Intervals normally Confidence is High
- Example – Interval and Confidence
- Question
- How much money does Adeel have in his purse?
- Answer 01
- Interval
- 0 – 100K
- Confidence
- 99%
- Answer 02
- Interval
- 500 – 1000
- Confidence
- 50%
- Conclusion
It is not easy to have a Small Interval with High Confidence 😊
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
- Task 1
- Consider the following Task and answer the questions given below
- Memorize Quran.e. Pak by Heart
- Dua: اللہ پاک ہم سب کو قرآن پاک کا حافظ بنائے آمین
- Note
- Your answer should be
- Well Justified
- Questions
- Write the Input and Output for the above Task?
- Execute the Learning Cycle
- What Evaluation Process will you use to ensure quality in Evaluation?
- What strategies will you use so that you do not forget Quran.e.Pak till death?
- Task 2
- Consider the following scenario and answer the questions below
- A Model (h) was tested on a Sample of 25 instances. The Model (h) misclassified 5 instances.
- Questions
- Calculate the Sample Error?
- Why will the Sample Error not estimate the True Error well?
- Task 3
- Consider the following scenario and answer the questions below
- A Model (h) was tested on a Sample of 50 instances. The Model (h) misclassified 10 instances.
- Questions
- Calculate the Estimate of Sample Error (EoSE)?
- Why will the Sample Error estimate the True Error well?
- Calculate Error in Estimate of Sample Error (EoSE) with
- Confidence Level = 50%
- Confidence Level = 95%
- Discuss the impact of Confidence Level on Error in Estimate of Sample Error (EoSE)
- Why can we not deploy the Model (h) in the Real-world, although the Sample Error is low?
Your Turn Tasks
- Task 1
- Select a Task (similar to Memorize Quran .e. Pak by Heart)
- Questions
- Write the Input and Output for the selected Task?
- Execute the Learning Cycle
- What Evaluation Process will you use to ensure quality in Evaluation?
- What strategies will you use so that you do not forget what you have learned till death?
- Task 2
- Write a scenario (similar to Task 03 in TODO) and answer the questions below
- Questions
- Calculate the Estimate of Sample Error (EoSE)?
- Will the Sample Error estimate the True Error well? Explain.
- Calculate Error in Estimate of Sample Error (EoSE) with
- Confidence Level = 50%
- Confidence Level = 95%
- Discuss the impact of Confidence Level on Error in Estimate of Sample Error (EoSE)
- Will you be able to deploy your Model (h) in Real-world? Explain.
Two Main Diseases of Machine Learning
- Two Main Diseases of Machine Learning
- The two main diseases in Machine Learning are
- Overfitting
- Underfitting
- Overfitting
- The condition when a Machine Learning Algorithm tries to remember all the Training Examples from the Training Data (Rote Learning) is known as Overfitting of the Model (h)
- Overfitting happens when our
- Model (h) has a lot of features or
- Model (h) is too complex
- Underfitting
- The condition when a Machine Learning Algorithm cannot properly learn the correlations between Attributes / Features is known as Underfitting of the Model (h)
- Underfitting happens when our
- Model misses the trends or patterns in the Training Data and therefore cannot even fit the Training Examples well
- Question
- How to overcome the problems of Overfitting and Underfitting?
- A Possible Solution
- Use Train-Test Split Approach
- Train-Test Split Approach
- Definition
- The Train-Test Split Approach splits the Sample Data into two sets: (1) Train Set and (2) Test Set
- Train Set
- used to build/train the Model (h)
- Test Set
- used to evaluate the performance of the Model (h)
- Train-Test Split Approach
- Two main variations of Train-Test Split Approach are
- Random Split Approach
- Class Balanced Split Approach
- Note
- Class Balanced Split Approach should be preferred over Random Split Approach
- For details
- See Chapter 03 – Basics of Machine Learning
- Train-Test Split Ratio
- Definition
- Train-Test Split Ratio determines what percentage of the Sample Data will be used as Train Set and what percentage of the Sample Data will be used as Test Set
- Question
- What Train-Test Split Ratio is best?
- A Possible Answer
- The Train-Test Split Ratio may vary from Machine Learning Problem to Machine Learning Problem
- e.g. 70%-30%, 80%-20%, 90%-10% etc.
- Most Common Train-Test Split Ratio
- Use 2 / 3 of Sample Data as Train Set
- Use 1 / 3 of Sample Data as Test Set
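A minimal sketch of the Train-Test Split Approach with a Class Balanced (stratified) split, assuming scikit-learn is available; the tiny labelled dataset is hypothetical, and test_size=1/3 follows the common "2/3 Train, 1/3 Test" ratio mentioned above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labelled data: 12 instances, 2 attributes, balanced binary labels
X = np.arange(24).reshape(12, 2)
y = np.array(["Male", "Female"] * 6)

# stratify=y preserves the class proportions in both sets (Class Balanced Split Approach)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 8 4
```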
- Strengths and Weaknesses - Train-Test Split Approach
- Strengths
- Train-Test Split Approach helps us to address the problems of Overfitting and Underfitting
- Weaknesses
- Train-Test Split Approach provides high variance in Estimate of Sample Error since
- Changing which examples happen to be in the Train Set can significantly change the Sample Error
- Question
- How to overcome the problem of high variance in Estimate of Sample Error calculated using Train-Test Split Approach?
- A Possible Answer
- Use the K-fold Cross-Validation Approach
- K-fold Cross-Validation Approach
- K-fold Cross-Validation Approach works as follows
- Step 1: Split Train Set into K equal folds (or partitions)
- Step 2: Use one of the folds (kth fold) as the Test Set and union of remaining folds (k – 1 folds) as Training Set
- Step 3: Calculate Error of Model (h)
- Step 4: Repeat Steps 2 and 3, to choose Train Sets and Test Sets from different folds, and calculate Error K-times
- Step 5: Calculate Average Error (a code sketch of these steps appears after the notes below)
- Important Note
- In each fold, there must be at least 30 instances
- All K folds must be disjoint, i.e. an instance appearing in one fold must not appear in any other fold
- Question
- What is the best value for K?
- Answer
- Empirical studies have shown that the best value is K = 10
- Important Note
- To run experiments with 10-fold Cross-Validation, you must have
- At least 300 instances in your Train Set
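As referenced above, here is a minimal sketch of the five K-fold Cross-Validation steps, assuming scikit-learn is available; the synthetic dataset of 300 instances (so that each of the K = 10 folds has at least 30 instances) and the Decision Tree classifier (standing in for ID3) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: 300 instances, 5 attributes, binary labels
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)        # Step 1: K disjoint folds
errors = []
for train_idx, test_idx in kfold.split(X):                      # Steps 2 and 4: rotate the test fold
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))  # Step 3: Error on the held-out fold

print(f"Average Error over {len(errors)} folds = {np.mean(errors):.3f}")  # Step 5
```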
- Example 1 – Selecting Value of K
- Consider the following Machine Learning Problem
- Calculations
- Size of Train Set = 200 instances
- There should be at least 30 instances in each fold
- K = 200 / 30 = 6.66
- Answer
- Ali will apply
- 6-fold Cross-Validation
- Example 2 – Selecting Value of K
- Consider the following Machine Learning Problem
- Calculations
- Size of Train Set = 300 instances
- There should be at least 30 instances in each fold
- K = 300 / 30 = 10
- Answer
- Ali will apply
- 10-fold Cross-Validation
- Example 3 – Selecting Value of K
- Consider the following Machine Learning Problem
- Calculations
- Size of Train Set = 2000 instances
- There should be at least 30 instances in each fold
- K = 2000 / 30 = 66.66
- Answer
- Ali will apply
- 10-fold Cross-Validation
- Note
- Calculation shows K = 66.6 but we apply
- 10-fold Cross-Validation i.e. K = 10
- Reason
- Empirical study has shown that the best value for K = 10
- Example - K-fold Cross-Validation Approach
- Consider the following Machine Learning Problem
- Applying K-fold Cross-Validation Approach
- Step 1: Split Train Set into K equal folds (or partitions)
- Calculating the Value of K
- Size of Train Set = 100 instances, so K = 100 / 30 = 3.33, i.e. K = 3
- Splitting Train Set into K equal folds (here K = 3)
- Fold 01 = instances 1 – 33 (total 33 instances)
- Fold 02 = instances 34 – 66 (total 33 instances)
- Fold 03 = instances 67 – 100 (total 34 instances)
- Step 2: Use one of the folds (kth fold) as the Test Set and union of remaining folds (k – 1 folds) as Training Set
- Step 3: Calculate Error of Model (h)
- Step 4: Repeat Steps 2 and 3, to choose Train Sets and Test Sets from different folds, and calculate Error K-times
- Iteration 1 (Fold 01 as Test Set): Error = 0.25
- Iteration 2 (Fold 02 as Test Set): Error = 0.20
- Iteration 3 (Fold 03 as Test Set): Error = 0.30
- Step 5: Calculate Average Error
- Average Error = (Error in Iteration 01 + Error in Iteration 02 + Error in Iteration 03) / 3
- Average Error = (0.25 + 0.20 + 0.30) / 3 = 0.25
- Example - 10-fold Cross-Validation Approach
- Diagram below shows split of Train Sets and Test Sets
- when K = 10
- Strengths and Weaknesses – K-fold Cross-Validation Approach
- Strengths
- K-fold Cross-Validation Approach is a better estimator of Error since
- All data is used for both Training and Testing
- Weaknesses
- It is computationally expensive since
- we have to repeat Training and Testing Phases K-times
- Suitable Situations to use Train-Test Split Approach
- It is suitable to use Train-Test Split Approach in the following situations
- Situation 1
- When Training Time is Very Large
- Example
- Since Training Time of Deep Learning Algorithms is very large, therefore, Train-Test Split Approach is more suitable (compared to K-fold Cross-Validation Approach)
- Situation 2
- Organizing International Competitions
- Example
- PAN organized International Competition on Author Profiling task, which mainly comprised of two phases
- Training Phase
- PAN Organizers released the Training Data so that participants can train their Models (h)
- Evaluation Phase
- PAN Organizers released the Test Data and asked participants to apply their Models (h) on Test Data and submit their predictions for evaluation
- Situation 3
- Having Very Huge Sample Data
- Example
- Suppose we want to Train and Test our Machine Learning Algorithms for Plagiarism Detection task on a Sample Data of 21 million instances (PubMed Medline Citations Dataset)
- In the above situation, it will be more suitable to use Train-Test Split Approach (compared to K-fold Cross-Validation Approach)
- Suitable Situations to use K-fold Cross-Validation Approach
- It is suitable to use K-fold Cross-Validation Approach in the following situations
- Situation 1
- When Training Time is Not Very Large
- Example
- Since Training Time of Feature-based Machine Learning Algorithms is relatively fast, therefore, K-fold Cross-Validation Approach is more suitable (compared to Train-Test Split Approach)
- Situation 2
- Having Sample Data that is Not Very Huge
- Example
- Suppose we want to Train and Test our Machine Learning Algorithms for Sentiment Analysis task on a Sample Data of 10,000 instances
- In the above situation, it will be more suitable to use K-fold Cross-Validation Approach (compared to Train-Test Split Approach)
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 2
- Task 1
- Consider the following scenario and answer the questions given below
- Rashid has a Sample Data of 10,000 instances for the Sentiment Analysis task (4000 Positive, 2000 are Neutral and 4000 are Negative). He aims to apply Naïve Bayes, Random Forest, RNN and LSTM on his dataset. Error measure is used to evaluate the performance of the Model.
- Note
- Your answer should be
- Well Justified
- Questions
- Split Sample Data using Class Balanced Approach with a Train-Test Split Ratio of 80%-20%?
- For which of the four ML Algorithms (Naïve Bayes, Random Forest, RNN, LSTM) is Overfitting a serious problem?
- How will you check whether your Model is Overfitting or Underfitting?
- What approach is most suitable in the above scenario?
- Train-Test Split Approach or
- K-fold Cross-Validation Approach
- If K-fold Cross-Validation Approach is applied
- Calculate the value of K?
- Apply K-fold Cross-Validation Approach to evaluate the performance of the Model
- If Train-Test Split Approach is applied
- What is the most suitable value for Train-Test Split Ratio?
- What Data Split Approach is more suitable (Random Split Approach or Class Balanced Split Approach)?
Your Turn Tasks
Your Turn Task 2
- Task 1
- Consider a scenario (similar to the one given in TODO Task) and answer the questions given below
- Questions
- Split Sample Data using Class Balanced Approach with a Train-Test Split Ratio of 80%-20%?
- For which of the four selected ML Algorithms is Overfitting a serious problem?
- How will you check whether your Model is Overfitting or Underfitting?
- What approach is most suitable in your selected scenario?
- Train-Test Split Approach or
- K-fold Cross-Validation Approach
- If K-fold Cross-Validation Approach is applied
- Calculate the value of K?
- Apply K-fold Cross-Validation Approach to evaluate the performance of the Model
- If Train-Test Split Approach is applied
- What is the most suitable value for Train-Test Split Ratio?
- What Data Split Approach is more suitable (Random Split Approach or Class Balanced Split Approach)?
Comparing Machine Learning Algorithms
- Comparing Machine Learning Algorithms
- To compare various Machine Learning Algorithms, following things must be same
- Train Set
- Test Set
- Evaluation Measure
- Evaluation Methodology
- Important Note
- If any of the above things are not same then it will
- Not be a valid comparison
- Example 1 – Comparing Machine Learning Algorithms
- Machine Learning Problem
- Gender Identification
- Feature-based Machine Learning Algorithms
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Multi-Layer Perceptron
- Dataset
- Gender Identification on Twitter
- Total instances = 10000
- Male instances = 5000
- Female instances = 5000
- Evaluation Measure
- Accuracy
- Evaluation Methodology
- 10-fold Cross-Validation
- Results obtained after running experiments
- Conclusion
- Best Machine Learning Algorithm on Twitter Corpus is
- Random Forest with an Accuracy score of 0.80
- Question
- Was the comparison of Feature-based Machine Learning Algorithms valid?
- Answer
- Yes
- Reason
- For all five Machine Learning Algorithms we used same
- Train Set
- Test Set
- Evaluation Measure and
- Evaluation Methodology
- Example 2 – Comparing Machine Learning Algorithms
- Machine Learning Problem
- Gender Identification
- Deep Learning Algorithms
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Dataset
- Gender Identification on Twitter
- Total instances = 10000
- Male instances = 5000
- Female instances = 5000
- Evaluation Measure
- Accuracy
- Evaluation Methodology
- Train-Test Split Ratio of
- 80% – 20%
- Results obtained after running experiments
- Conclusion
- Best Machine Learning Algorithm on Twitter Corpus is
- LSTM with an Accuracy score of 0.78
- Question
- Was the comparison of Deep Learning Algorithms valid?
- Answer
- Yes
- Reason
- For all three Deep Learning Algorithms we used same
- Train Set
- Test Set
- Evaluation Measure and
- Evaluation Methodology
- Comparison of Example 01 and Example 2
- Major difference in Example 01 and Example 02 is of
- Evaluation Methodology
- Evaluation Methodology – Example 01
- 10-fold Cross-Validation
- Reason
- We were applying Feature-based ML Algorithms on a dataset of 10000 instances
- Evaluation Methodology – Example 02
- Train-Test Split Approach
- Train-Test Split Ratio of 80% – 20%
- Reason
- We were applying Deep Learning ML Algorithms on a dataset of 10000 instances
- Example 3 – Comparing Machine Learning Algorithms
- Machine Learning Problem
- Gender Identification
- Feature-based Machine Learning Algorithms
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Multi-Layer Perceptron
- Deep Learning Algorithms
- Recurrent Neural Network (RNN)
- Long Short Term Memory (LSTM)
- BI-LSTM
- Dataset
- Gender Identification on Twitter
- Total instances = 10000
- Male instances = 5000
- Female instances = 5000
- Evaluation Measure
- Accuracy
- Evaluation Methodology
- Important Note
- We are applying both Feature-based ML Algorithms and Deep Learning ML Algorithms on our Twitter Dataset for Gender Identification task
- Question
- In current situation, should we go for Train-Test Split or K-Fold Cross Validation Approach?
- Answer
- Train-Test Split Approach is more suitable
- Therefore, we will use Train-Test Split Ratio of
- 80% – 20%
- Results obtained after running experiments
- Conclusion
- Best Machine Learning Algorithm on Twitter Corpus is
- Random Forest with an Accuracy score of 0.83
- Question
- Was the comparison of Machine Learning Algorithms valid?
- Answer
- Yes
- Reason
- For all eight Machine Learning Algorithms we used same
- Train Set
- Test Set
- Evaluation Measure and
- Evaluation Methodology
- Important Point to Note
- Question
- Why are the results of Feature-based ML Algorithms in Example 03 different from those reported in Example 01, although the dataset is the same?
- Answer
- In Example 01, Evaluation Methodology was
- 10-fold Cross-Validation Approach
- In Example 03, Evaluation Methodology was
- Train-Test Split Approach
- Therefore, results obtained on the same dataset are different
- Conclusion
- Before running experiments, we need to ensure that our Evaluation Methodology is
- Standard and
- Same for all Machine Learning Algorithms
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 3
- Task 1
- Consider the following scenario and answer the questions given below
- Rashid has a Sample Data of 10,000 instances for the Sentiment Analysis task (4000 Positive, 2000 Neutral and 4000 Negative). He aims to apply Naïve Bayes, Random Forest, RNN and LSTM on his dataset.
- Note
- Your answer should be
- Well Justified
- Questions
- What main points should Rashid consider to make a valid comparison of all four Machine Learning Algorithms?
- Discuss the pros and cons of comparing Machine Learning Algorithms using Train-Test Split Approach?
- Discuss the pros and cons of comparing Machine Learning Algorithms using K-fold Cross-Validation Approach?
Your Turn Tasks
Your Turn Task 3
- Task 1
- Consider a scenario (similar to the one given in TODO Task) and answer the questions given below
- Questions
- What main points should you consider to make a valid comparison of all Machine Learning Algorithms?
- Discuss the pros and cons of comparing Machine Learning Algorithms using Train-Test Split Approach?
- Discuss the pros and cons of comparing Machine Learning Algorithms using K-fold Cross-Validation Approach?
Evaluation Measures for Classification Problems
- Evaluation Measures for Classification Problem
- Some of the most popular and widely used Evaluation Measures for Classification Problems are
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F1
- Area Under the Curve (AUC)
- Baseline Accuracy (BA)
- Definition
- Baseline Accuracy (a.k.a. Majority Class Categorization (MCC)) is calculated by assigning the label of Majority Class to all the Test Instances
- Formula
- Baseline Accuracy (BA) = Number of Test Instances in the Majority Class / Total Number of Test Instances
- Example 1 – Calculating Baseline Accuracy (BA)
- Problem Description
- Baseline Approach
- Majority Class Categorization (MCC)
- Proposed Approach
- Excellent Learner
- To make a contribution, in the existing research
- Proposed Approach must outperform Baseline Approach
- Calculating Baseline Accuracy (BA)
- Number of Classes
- Class 01 (Female)
- Class 02 (Male)
- Total Number of Test Instances = 600
- Class 01 (Female) = 300
- Class 02 (Male) = 300
- Majority Class
- Both Classes, Female and Male are equal i.e. have same Number of Instances
- Therefore, we can take any of the Classes as Majority Class
- Majority Class
- Female
- Calculating Baseline Accuracy (BA)
- BA = 300 / 600 = 0.50
- Important Note
- If (Accuracy of Proposed Approach (Excellent Learner) > 0.50)
- Then
- Proposed Approach has contributed
- Else
- Proposed Approach has not contributed
- Example 2 – Calculating Baseline Accuracy (BA)
- Problem Description
- Baseline Approach
- Majority Class Categorization (MCC)
- Proposed Approach
- Excellent Learner
- To make a contribution, in the existing research
- Proposed Approach must outperform the Baseline Approach
- Calculating Baseline Accuracy (BA)
- Number of Classes
- Class 01 (Female)
- Class 02 (Male)
- Total Number of Test Instances = 600
- Class 01 (Female) = 400
- Class 02 (Male) = 200
- Majority Class
- Female
- Calculating Baseline Accuracy (BA)
- BA = 400 / 600 ≈ 0.66
- Important Note
- Comparing Example 1 and Example 2
- Example 01 has Balanced Data
- BA = 0.50
- Example 02 has Unbalanced Data
- BA = 0.66
- Note
- BA score changes as the Number of Instances in each Class changes
- Conclusion
- Class Balancing has a significant impact on the
- Calculation of Baseline Accuracy (BA)
- Example 3 – Calculating Baseline Accuracy (BA)
- Problem Description
- Baseline Approach
- Majority Class Categorization (MCC)
- Proposed Approach
- Excellent Learner
- To make a contribution, in the existing research
- Proposed Approach must outperform the Baseline Approach
- Calculating Baseline Accuracy (BA)
- Number of Classes
- Class 1 (Positive)
- Class 2 (Negative)
- Class 3 (Neutral)
- Total Number of Test Instances = 1000
- Class 01 (Positive) = 400
- Class 02 (Negative) = 250
- Class 03 (Neutral) = 350
- Majority Class
- Positive
- Calculating Baseline Accuracy (BA)
- BA = 400 / 1000 = 0.40
- Important Note
- Comparing Example 2 and Example 3
- Example 2 is a Binary Classification Problem
- BA = 0.66
- Example 03 is a Ternary Classification Problem
- BA = 0.40
- Note
- The BA score for Ternary Classification Problem is smaller than the Binary Classification Problem
- Conclusion
- As the Number of Classes increases in a Machine Learning Problem, the
- Baseline Accuracy (BA) decreases (considering almost Balanced Data)
- Baseline Accuracy for Balanced Data
- Assume that we have Balanced Data for all the Classification Tasks given below
- Gender Identification (Binary Classification Problem)
- Classes = Male, Female
- BA = 0.50
- Sentiment Analysis (Multi-class Classification Problem with 3 Classes)
- Classes = Positive, Negative, Neutral
- BA = 0.33
- Age Group Identification (Multi-class Classification Problem with 4 Classes)
- Classes = [1 – 18], [19 – 25] , [25 – 40] , [40 – 100]
- BA = 0.25
- Note
- As the Number of Classes increases in a Machine Learning Problem, the
- Baseline Accuracy (BA) decreases
- Conclusion
- Number of Classes has a significant impact on the calculation of Baseline Accuracy (BA)
- Two Main Factors Affecting Calculation of Baseline Accuracy (BA)
- The two main factors affecting the calculation of Baseline Accuracy (BA) are
- Number of Classes
- Number of Instances in each Class
- Strengths and Weaknesses – Baseline Accuracy (BA)
- Strengths
- Baseline Accuracy (BA) provides a simple and very basic Baseline Approach to compare your Proposed Approach (Machine Learning Algorithm)
- Weaknesses
- Baseline Accuracy (BA) is very naïve and cannot be considered as a state-of-the-art and strong Baseline Approach
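As a quick check of the three examples above, Baseline Accuracy can be computed directly from the class counts; the helper function below is only an illustration, and the counts are the ones given in the examples.

```python
def baseline_accuracy(class_counts):
    """Baseline Accuracy (BA) = size of the Majority Class / total number of Test Instances."""
    return max(class_counts.values()) / sum(class_counts.values())

print(baseline_accuracy({"Female": 300, "Male": 300}))                        # Example 1 -> 0.50
print(baseline_accuracy({"Female": 400, "Male": 200}))                        # Example 2 -> 0.666... (0.66 in the text)
print(baseline_accuracy({"Positive": 400, "Negative": 250, "Neutral": 350}))  # Example 3 -> 0.40
```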
- Accuracy
- Definition
- Accuracy is defined as the proportion of correctly classified Test instances
- Formula
- Accuracy = Number of Correctly Classified Test Instances / Total Number of Test Instances
- Note
- Question
- When is it more suitable to use the Accuracy measure?
- Answer
- Accuracy evaluation measure is more suitable to use for evaluation of Machine Learning Algorithms when we have
- Balanced Data
- Example – Calculating Accuracy
- Problem Description
- Baseline Approach 1
- Majority Class Categorization (MCC)
- Baseline Accuracy = 0.50
- Baseline Approach 2
- Efficient Learner previously reported by Adeel
- Accuracy (Efficient Learner) = 0.70
- Proposed Approach
- Excellent Learner
- Accuracy (Excellent Learner) =?
- Question
- With which Baseline Approach (Majority Class Categorization or Efficient Learner) should Rasheed compare his Proposed Approach (Excellent Learner)?
- Answer
- Efficient Learner
- Reason
- Efficient Learner is a state-of-the-art approach and can be considered as a
- strong Baseline Approach
- Important Note
- To have quality in your research work, always compare your Proposed Approach with a
- state-of-the-art and strong Baseline Approach
- Calculating Accuracy for Efficient Learner (Machine Learning Algorithm)
- Note
- Strengths and Weaknesses - Accuracy
- Strengths
- We can evaluate and compare various Machine Learning Algorithms using Accuracy evaluation measure
- Weaknesses
- Accuracy fails to accurately evaluate a Machine Learning Algorithm when Test Data is highly unbalanced
- Accuracy ignores possibility of different misclassification costs
- Example – Accuracy is a Poor Measure for Highly Unbalanced Data
- Consider a Binary Classification Problem with two classes: Positive and Negative. Test Data comprises 1000 instances, out of which 995 instances are Negative and 5 are Positive.
- A Machine Learning Algorithm which always predicts Negative, will have an
- Accuracy of 0.995
- Problem
- Machine Learning Algorithm has very high Accuracy (99.5%) on Test Data, even though it never correctly predicts Positive Test Examples
- A Possible Solution
- Confusion Matrix
- Confusion Matrix
- Definition
- A Confusion Matrix is a table used to describe the performance of a Classification Model (or Classifier) on a Set of Test Examples (Test Data), whose Actual Values (or True Values) are known
- Purpose
- To get deeper insights into Model / Classifier behavior
- Advantages
- Confusion Matrix allows us to visualize the performance of a Model / Classifier
- Confusion Matrix allows us to separately get insights into the Errors made for each Class
- Confusion Matrix gives insights to both
- Errors made by a Model / Classifier and
- Types of Errors made by a Model / Classifier
- Confusion Matrix allows us to compute many different Evaluation Measures including
- Baseline Accuracy
- Accuracy
- True Positive Rate (or Recall)
- True Negative Rate
- False Positive Rate
- False Negative Rate
- Precision
- F1
- Confusion Matrices
- Confusion Matrix for a Machine Learning Problem with n Number of Classes is given below,
- Confusion Matrix for Concept Learning
- Confusion Matrix for Concept Learning (a.k.a. Binary Classification Problem) is given below,
- Considering that
- Class 01 = Negative
- Class 02 = Positive
- Extracting Various Evaluation Measures from Confusion Matrix
- In sha Allah, in the next Slides I will try to explain how to extract the following Evaluation Measures from the Confusion Matrix
- Baseline Accuracy
- Accuracy
- True Positive Rate (or Recall or Sensitivity)
- True Negative Rate
- False Positive Rate
- False Negative Rate
- Precision
- F-measure
- F1
- Baseline Accuracy (BA)
- Definition
- Classify all the Test Examples by assigning them the label of the Majority Class
- Formula
- Baseline Accuracy (BA) = Number of Test Examples in the Majority Class / Total Number of Test Examples
- Accuracy
- Definition
- Accuracy (AC) is the proportion of the total number of predictions that were correct
- Formula
- Accuracy (AC) = (TP + TN) / (TP + TN + FP + FN)
- Note
- This is the same as Accuracy evaluation measure defined earlier
- Recall or True Positive Rate (TPR) or Sensitivity
- Definition
- Recall or True Positive Rate (TPR) or Sensitivity is the proportion of Positive cases that were correctly classified
- Formula
- Recall = TPR = TP / (TP + FN)
- False Positive Rate
- Definition
- False Positive Rate (FPR) is the proportion of Negative cases that were incorrectly classified as Positive
- Formula
- FPR = FP / (FP + TN)
- True Negative Rate or Specificity
- Definition
- True Negative Rate (TNR) or Specificity is defined as the proportion of Negatives cases that were classified correctly
- Formula
- TNR = TN / (TN + FP)
- False Negative Rate
- Definition
- False Negative Rate (FNR) is the proportion of Positive cases that were incorrectly classified as Negative
- Formula
- FNR = FN / (FN + TP)
- Precision
- Definition
- Precision (P) is the proportion of the predicted Positive cases that were correct
- Formula
- Precision (P) = TP / (TP + FP)
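All of the measures defined above are simple ratios of the four cells of a binary Confusion Matrix. The sketch below packages them into one function; the counts TP = 70, FN = 30, FP = 20, TN = 80 are hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Evaluation measures derived from a binary Confusion Matrix."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # True Positive Rate / Sensitivity
    return {
        "Accuracy": (tp + tn) / total,
        "Recall (TPR)": recall,
        "FPR": fp / (fp + tn),                    # False Positive Rate
        "TNR (Specificity)": tn / (tn + fp),      # True Negative Rate
        "FNR": fn / (fn + tp),                    # False Negative Rate
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical Confusion Matrix: TP = 70, FN = 30, FP = 20, TN = 80
for name, value in confusion_metrics(70, 30, 20, 80).items():
    print(f"{name}: {value:.3f}")
```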
- Trade-Off Between Precision and Recall
- Problem
- There is a trade-off between Precision and Recall
- A Possible Solution
- Find out Evaluation Measures which efficiently combine Precision and Recall
- Example
- F-measure (combines Precision and Recall)
- F-measure
- F-measure
- Definition
- Harmonic Mean of Precision and Recall
- Formula
- $F_\beta = \dfrac{(1 + \beta^2) \times Precision \times Recall}{(\beta^2 \times Precision) + Recall}$
- where β controls the relative weight assigned to Precision and Recall
- F1-measure
- Definition
- When we assign the same weight to Precision and Recall, i.e. β = 1, the F-measure becomes the F1-measure: F1 = (2 × Precision × Recall) / (Precision + Recall)
- Harmonic Mean in F-measure
- Question
- In the F-measure, why do we use the Harmonic Mean instead of the Arithmetic Mean?
- Answer
- We want our Machine Learning Algorithm to have
- High F1 score
- To achieve High F1 score, we need both
- Good Precision and
- Good Recall
- Harmonic Mean penalizes F1 score, when either
- Recall is high and Precision is low or
- Recall is low and Precision is high
- Note the difference in Formulas of Arithmetic Means and Harmonic Mean
- Harmonic Mean Formula
- Harmonic Mean(P, R) = (2 × P × R) / (P + R)
- Arithmetic Mean Formula
- Arithmetic Mean(P, R) = (P + R) / 2
- Example 1 – Arithmetic Mean vs Harmonic Mean
- Situation
- Both Precision and Recall are almost same
- Consider the following values of Precision and Recall
- Precision = 0.90
- Recall = 0.87
- Arithmetic Mean
- F1 = 0.885
- Harmonic Mean
- F1 = 0.884
- Note
- Both Arithmetic Mean and Harmonic Mean score are almost same
- Conclusion
- Arithmetic Mean and Harmonic Mean scores are almost same when Precision and Recall scores are almost same
- Example 2 – Arithmetic Mean vs Harmonic Mean
- Situation
- Precision is high and Recall is low
- Consider following values of Precision and Recall
- Precision = 0.90
- Recall = 0.20
- Arithmetic Mean
- F1 = 0.55
- Harmonic Mean
- F1 = 0.32
- Note
- Arithmetic Mean score is high compared to Harmonic Mean score
- Conclusion
- Harmonic Mean penalized the F1 scores because Precision is high and Recall is low
- Example 3 – Arithmetic Mean vs Harmonic Mean
- Situation
- Precision is low and Recall is high
- Consider the following values of Precision and Recall
- Precision = 0.20
- Recall = 0.90
- Arithmetic Mean
- F1 = 0.55
- Harmonic Mean
- F1 = 0.32
- Note
- Again, similar to Example 02, Arithmetic Mean score is high compared to Harmonic Mean score
- Conclusion
- Harmonic Mean penalized the F1 scores because Precision is low and Recall is high
- Summary - Arithmetic Mean vs Harmonic Mean
- Arithmetic Mean score is almost the same as the Harmonic Mean score when
- Precision and Recall are almost same
- Arithmetic Mean score is high compared to Harmonic Mean score when
- Precision is high and Recall is low
- Precision is low and Recall is high
- Conclusion
- To get Good F1 score
- Both Precision and Recall should be good
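The three examples above can be reproduced in a few lines; the Precision/Recall pairs below are taken directly from the examples, and the harmonic mean is exactly the F1 score.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    return 2 * p * r / (p + r)   # this is the F1 score

for p, r in [(0.90, 0.87), (0.90, 0.20), (0.20, 0.90)]:
    print(f"P={p}, R={r}: arithmetic={arithmetic_mean(p, r):.3f}, harmonic={harmonic_mean(p, r):.3f}")
# First pair: the two means are nearly equal; the other two pairs: the harmonic
# mean is much lower, i.e. it penalizes the imbalance between Precision and Recall
```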
- Example 1 – Confusion Matrix for Binary Classification Problem
- Consider the following Machine Learning Problem
- Confusion Matrix
- Baseline Accuracy
- Accuracy
- True Positive Rate (TPR) or Recall or Sensitivity
- False Positive Rate (FPR)
- True Negative Rate (TNR) or Specificity
- False Negative Rate (FNR)
- Precision
- F1 Measure
- Example 2 – Confusion Matrix for Ternary Classification Problem
- Consider the following Machine Learning Problem
- Confusion Matrix
- Total Documents = 2000
- Positive Documents = 1000
- Negative Documents = 600
- Neutral Documents = 400
- Baseline Accuracy
- Accuracy
- Second Major Disadvantage of Accuracy
- Recall that the second major disadvantage of the Accuracy evaluation measure is that
- It ignores the possibility of different misclassification costs
- Recall – Misclassification
- A Positive Instance is classified as Negative
- A Negative Instance is classified as Positive
- The cost of misclassifying (or incorrectly predicting) Positive instances may be more or less important than the cost of misclassifying (or incorrectly predicting) Negative instances
- Example 1 – Impact of Misclassification Costs
- Machine Learning Problem
- Treating a Patient
- Classes
- Class 1 = Patient is Sick (Positive)
- Class 2 = Patient is Not Sick (Negative)
- Situation 1
- Positive Example is misclassified as Negative Example
- A Patient is Sick (Positive) but Model Predicts that
- Patient is Not Sick (Negative) i.e. False Negative Rate
- Situation 2
- Negative Example is misclassified as Positive Example
- A Patient is Not Sick (Negative) but Model Predicts that
- Patient is Sick (Positive) i.e. False Positive Rate
- Question
- Among the two situations discussed above, which one is more costly?
- Answer
- Situation 1
- A Patient is Sick (Positive) but Model Predicts that Patient is Not Sick (Negative) i.e. False Negative Rate
- Question
- Should we build a Model which has high FPR and low FNR?
- Answer
- Yes, it will be a good approach to build a Model with high FPR and low FNR
- Reason
- The cost of not treating an ill patient (FNR) is very high compared to treating a patient who is not ill (FPR)
- Therefore, a high FPR is acceptable, but we should try to minimize FNR as much as possible
- Example 2 – Impact of Misclassification Costs
- Machine Learning Problem
- Plagiarism Detection in Student’s Assignments
- Classes
- Class 1 = Plagiarized (Positive)
- Class 2 = Non-Plagiarized (Negative)
- Situation 1
- Positive Example is misclassified as a Negative Example
- Model misclassifies (incorrectly predicts) a Plagiarized Assignment (Positive) as Non-Plagiarized (Negative) i.e. False Negative Rate
- Situation 2
- Negative Example is misclassified as Positive Example
- Model misclassifies (incorrectly predicts) a Non-Plagiarized Assignment (Negative) as Plagiarized (Positive) i.e. False Positive Rate
- Question
- Among the two situations discussed above, which one is more costly?
- Answer
- Situation 1
- Model misclassifies (incorrectly predicts) a Plagiarized Assignment (Positive) as Non-Plagiarized (Negative) i.e. False Negative Rate
- Question
- Should we build a Model which has high FPR and low FNR?
- Answer
- Yes, it will be a good approach to build a Model with high FPR and low FNR
- Reason
- The cost of not detecting Plagiarized Assignments (FNR) is very high compared to predicting Non-Plagiarized Assignments as Plagiarized (FPR)
- Therefore, a high FPR is acceptable, but we should try to minimize FNR as much as possible
- Example 3 – Impact of Misclassification Costs
- Machine Learning Problem
- Predicting Fraud in Loan Applicants
- Classes
- Class 1 = Fraud (Positive)
- Class 2 = Not Fraud (Negative)
- Situation 1
- Positive Example is misclassified as Negative Example
- Model misclassifies (incorrectly predicts) a Fraud Applicant (Positive) as Not Fraud (Negative) i.e. False Negative Rate
- Situation 2
- Negative Example is misclassified as Positive Example
- Model misclassifies (incorrectly predicts) a Not Fraud Applicant (Negative) as Fraud (Positive) i.e. False Positive Rate
- Question
- Among the two situations discussed above, which one is more costly?
- Answer
- Situation 01
- Model misclassifies (incorrectly predicts) a Fraud Applicant (Positive) as Not Fraud (Negative) i.e. False Negative Rate
- Question
- Should we build a Model which has high FPR and low FNR?
- Answer
- Yes, it will be a good approach to build a Model with high FPR and low FNR
- Reason
- The cost of giving loan to a Fraud Applicant (FNR) is very high compared to not giving loan to an Applicant who is Not Fraud (FPR)
- Therefore, a high FPR is acceptable, but we should try to minimize FNR as much as possible
- Comparing Example 1, Example 2 and Example 3
- To summarize
- In all three Examples discussed in previous slides, we found a common pattern
- Misclassification Cost of predicting Positive Examples as Negative Examples (FNR) is very high compared to Misclassification Cost of predicting Negative Examples as Positive Examples (FPR)
- To Conclude
- For three Machine Learning Problems discussed in previous Slides including Treating a Patient, Plagiarism Detection in Students’ Assignments and Detection of Fraud Loan Applicants, it will be wise to build Models which have
- High FPR and Low FNR
- Important Note – Misclassification Costs
- Before Training your Model, you must be very clear about the Misclassification Costs, otherwise
- Your Model will fail to perform well in Real-world (i.e. Application Phase)
- Second Major Disadvantage of Accuracy Cont…
- Problem
- How to handle the problem in Accuracy evaluation measure caused by ignoring the possibility of different misclassification costs?
- Two Possible Solutions
- ROC Curves
- Precision-Recall Curves
- ROC Curve
- Definition
- ROC Curve summarizes the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) for a Classifier / Model using different Probability Thresholds
- Suitable to Use
- ROC Curves are more suitable to use when we have
- Balanced Data
- ROC Curve
- ROC Curve – Example
- Precision-Recall Curve
- Definition
- Precision-Recall Curve summarizes the trade-off between Precision and Recall (or True Positive Rate) for a Classifier / Model using different Probability Thresholds
- Suitable to Use
- Precision-Recall Curves are more suitable to use when we have
- Highly Unbalanced Data
- Area Under the Curve (AUC)
- Definition
- Area Under the ROC Curve (AUC) is defined as the probability that a randomly chosen Positive instance is ranked above a randomly chosen Negative one
- Purpose
- AUC measures the degree of separability between Classes
- i.e. AUC tells how well the Model is able to distinguish between Classes
- How AUC Score is calculated?
- AUC score is computed using True Positive Rate (TPR) and False Positive Rate (FPR)
- Range of AUC Score
- Range of AUC Score is [0 – 1]
- 0 means All Predictions of Model are Wrong
- 1 means All Predictions of Model are Correct
- What is a good AUC Score?
- AUC Score = 0.5
- suggests No Discrimination (i.e., the Model does not have the ability to differentiate Positive instances from Negative ones)
- AUC Score = 0.7 to 0.8
- considered Acceptable
- AUC Score = 0.8 to 0.9
- considered Excellent
- AUC Score = Greater Than 0.9
- considered Outstanding
- ROC Graph vs AUC
- ROC Graph
- ROC Graph is a probability curve
- AUC
- AUC represents degree of separability between Classes
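A minimal sketch of computing the ROC Curve points and the AUC score with scikit-learn, assuming a probabilistic classifier; the synthetic data and the Logistic Regression model are illustrative and not part of the examples above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data; train a simple probabilistic classifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the Positive class for each test instance
scores = model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # TPR vs FPR at different probability thresholds
auc = roc_auc_score(y_te, scores)               # degree of separability between the two Classes
print(f"AUC = {auc:.3f}")  # 0.5 = no discrimination; closer to 1.0 = better separation
```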
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 4
- Task 1
- Consider the following Binary Classification Problem and answer the questions given below.
- Ayesha had a collection of 800 documents containing news articles. She liked 300 news articles (Positive instances) and disliked (Negative instances) the remaining ones. She applied the Naïve Bayes classifier on the entire dataset. 50 of the liked news articles were correctly classified and 200 of the disliked news articles were incorrectly classified.
- Note
- Your answer should be
- Well Justified
- Questions
- Draw Confusion Matrix?
- Calculate the following
- Baseline Accuracy
- Accuracy
- True Positive Rate (Recall)
- False Positive Rate
- True Negative Rate
- False Negative Rate
- Precision
- F1
- What deeper insights about a Classifier's behaviour can you get using the Confusion Matrix?
- What impact does the Data Distribution have on the calculation of Baseline Accuracy?
- Considering Baseline Accuracy as the Baseline Approach
- If a Proposed ML Algorithm produces an Accuracy of 0.60
- Will we consider it a research contribution?
- Compare the Harmonic Mean score with the Arithmetic Mean score?
- What impact do Precision and Recall have on the calculation of F1?
- For which instances is the cost of misclassification higher, Positive or Negative?
- What possible solutions are there to handle the problems of cost of misclassification?
- Naïve Bayes Algorithm obtained an AUC score of 0.75. What does it mean?
- Which one is more suitable to use, ROC Curve or Precision-Recall Curve?
- Task 2
- Consider the following Ternary Classification Problem and answer the questions given below.
- Fatima had a collection of 2000 documents, which can be classified into three categories: Wholly Derived (WD), Partially Derived (PD) and Non-Derived (ND). 600 documents are WD, 500 are PD and 900 are ND. She ran Naïve Bayes classifier on the entire dataset. For WD, half of the documents were correctly classified and 100 were classified as PD. For PD, 350 documents were correctly classified and 50 were classified as ND. For ND, 700 documents were correctly classified and 0 as WD.
- Questions
- Draw Confusion Matrix?
- Calculate the following
- Baseline Accuracy
- Accuracy
- True Positive Rate (Recall)
- False Positive Rate
- True Negative Rate
- False Negative Rate
- Precision
- F1
- What deeper insights about a Classifier's behaviour can you get using the Confusion Matrix?
- What impact does the Data Distribution have on the calculation of Baseline Accuracy?
- Considering Baseline Accuracy as the Baseline Approach
- If a Proposed ML Algorithm produces an Accuracy of 0.60
- Will we consider it a research contribution?
- Compare the Harmonic Mean score with the Arithmetic Mean score?
- What impact do Precision and Recall have on the calculation of F1?
- For which instances is the cost of misclassification higher, Positive or Negative?
- What possible solutions are there to handle the problems of cost of misclassification?
- Naïve Bayes Algorithm obtained an AUC score of 0.5. What does it mean?
- Which one is more suitable to use, ROC Curve or Precision-Recall Curve?
Your Turn Tasks
Your Turn Task 4
- Task 1
- Consider a Binary Classification Problem similar to Task 01 in TODO and answer the questions given below.
- Questions
- Draw Confusion Matrix?
- Calculate the following
- Baseline Accuracy
- Accuracy
- True Positive Rate (Recall)
- False Positive Rate
- True Negative Rate
- False Negative Rate
- Precision
- F1
- What deeper insights about a Classifier's behaviour can you get using a Confusion Matrix?
- What impact does Data Distribution have on the calculation of Baseline Accuracy?
- Considering Baseline Accuracy as the Baseline Approach
- If a Proposed ML Algorithm produces an Accuracy of 0.60
- Will we consider it a research contribution?
- Compare the Harmonic Mean score with the Arithmetic Mean score?
- What impact do Precision and Recall have on the calculation of F1?
- Whose cost of misclassification is higher: Positive instances or Negative instances?
- What possible solutions are there to handle the problem of misclassification costs?
- Naïve Bayes Algorithm obtained an AUC score of 0.90. What does it mean?
- Which one is more suitable to use, ROC Curve or Precision-Recall Curve?
- Task 2
- Consider a Ternary Classification Problem similar to Task 02 in TODO and answer the questions given below.
- Questions
- Draw Confusion Matrix?
- Calculate the following
- Baseline Accuracy
- Accuracy
- True Positive Rate (Recall)
- False Positive Rate
- True Negative Rate
- False Negative Rate
- Precision
- F1
- What deeper insights about a Classifier's behaviour can you get using a Confusion Matrix?
- What impact does Data Distribution have on the calculation of Baseline Accuracy?
- Considering Baseline Accuracy as the Baseline Approach
- If a Proposed ML Algorithm produces an Accuracy of 0.60
- Will we consider it a research contribution?
- Compare the Harmonic Mean score with the Arithmetic Mean score?
- What impact do Precision and Recall have on the calculation of F1?
- Whose cost of misclassification is higher: Positive instances or Negative instances?
- What possible solutions are there to handle the problem of misclassification costs?
- Naïve Bayes Algorithm obtained an AUC score of 0.60. What does it mean?
- Which one is more suitable to use, ROC Curve or Precision-Recall Curve?
Evaluation Measures for Regression Problems
- Evaluation Measures for Regression Problems
- Some of the most popular and widely used Evaluation Measures for Regression Problems are
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R2 or Coefficient of Determination
- Adjusted R2
- Usage
- These measures are widely and commonly used in
- Climatology
- Forecasting
- Regression Analysis
- Note
- Insha Allah (انشاء اللہ), In this Chapter, I will discuss only three measures
- Mean Absolute Error (MAE)
- Mean Square Error (MSE)
- Root Mean Square Error (RMSE)
- Mean Absolute Error
- Absolute Error
- Absolute Error (AE) is the absolute difference between the Actual Value and the Predicted Value
- Formula
- AE = | X_Actual - X_Predicted |
- where X_Actual and X_Predicted represent the Actual Value and the Predicted Value respectively
- Mean Absolute Error
- Mean Absolute Error (MAE) is the average of all Absolute Errors
- Formula
- MAE = ( 1 / n ) × Σ | X_Actual - X_Predicted | (summed over all n Test instances)
- where
- n represents the total number of instances
- X_Actual represents the Actual Value
- X_Predicted represents the Predicted Value
- Mean Square Error
- Square Error
- Square Error (SE) is the Square of the difference between the Actual Value and the Predicted Value
- Formula
- SE = ( X_Actual - X_Predicted )²
- where X_Actual and X_Predicted represent the Actual Value and the Predicted Value respectively
- Mean Square Error
- Mean Square Error (MSE) is the average of all Square Errors
- Formula
- MSE = ( 1 / n ) × Σ ( X_Actual - X_Predicted )² (summed over all n Test instances)
- where
- n represents the total number of instances
- X_Actual represents the Actual Value
- X_Predicted represents the Predicted Value
- Root Mean Square Error
- Root Mean Square Error
- Root Mean Square Error (RMSE) is the Square Root of the Mean Square Error
- Formula
- RMSE = √( ( 1 / n ) × Σ ( X_Actual - X_Predicted )² ) = √MSE
- where
- n represents the total number of instances
- X_Actual represents the Actual Value
- X_Predicted represents the Predicted Value
- Example – Calculating MAE, MSE, RMSE
- Consider the predictions returned by the GPA Prediction Problem discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples
- Calculate MAE, MSE and RMSE
- Calculating Mean Absolute Error
- To calculate Mean Absolute Error, we will compare
- Actual Values with Predicted Values
- Step 1: Calculate Absolute Error for each Test Example
- AE(d11) = | X_Actual - X_Predicted | = | 2.25 - 2.00 | = 0.25
- AE(d12) = | X_Actual - X_Predicted | = | 1.94 - 2.14 | = 0.20
- AE(d13) = | X_Actual - X_Predicted | = | 3.43 - 2.00 | = 1.43
- AE(d14) = | X_Actual - X_Predicted | = | 1.86 - 1.90 | = 0.04
- AE(d15) = | X_Actual - X_Predicted | = | 1.94 - 2.00 | = 0.06
- Step 2: Calculate Mean Absolute Error
- MAE = ( AE(d11) + AE(d12) + AE(d13) + AE(d14) + AE(d15) ) / 5
- MAE = ( 0.25 + 0.20 + 1.43 + 0.04 + 0.06 ) / 5
- MAE = 0.396
- Calculating Mean Square Error
- To calculate Mean Square Error, we will compare
- Actual Values with Predicted Values
- Step 1: Calculate Square Error for each Test Example
- SE(d11) = ( X_Actual - X_Predicted )² = ( 2.25 - 2.00 )² = 0.0625
- SE(d12) = ( X_Actual - X_Predicted )² = ( 1.94 - 2.14 )² = 0.0400
- SE(d13) = ( X_Actual - X_Predicted )² = ( 3.43 - 2.00 )² = 2.0449
- SE(d14) = ( X_Actual - X_Predicted )² = ( 1.86 - 1.90 )² = 0.0016
- SE(d15) = ( X_Actual - X_Predicted )² = ( 1.94 - 2.00 )² = 0.0036
- Step 2: Calculate Mean Square Error
- MSE = ( SE(d11) + SE(d12) + SE(d13) + SE(d14) + SE(d15) ) / 5
- MSE = ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5
- MSE = 0.430
- Calculating Root Mean Square Error
- To Calculate Root Mean Square Error, we will compare
- Actual Values with Predicted Values
- Step 1: Calculate Square Error for each Test Example
- SE(d11) = ( X_Actual - X_Predicted )² = ( 2.25 - 2.00 )² = 0.0625
- SE(d12) = ( X_Actual - X_Predicted )² = ( 1.94 - 2.14 )² = 0.0400
- SE(d13) = ( X_Actual - X_Predicted )² = ( 3.43 - 2.00 )² = 2.0449
- SE(d14) = ( X_Actual - X_Predicted )² = ( 1.86 - 1.90 )² = 0.0016
- SE(d15) = ( X_Actual - X_Predicted )² = ( 1.94 - 2.00 )² = 0.0036
- Step 2: Calculate Mean Square Error
- MSE = ( SE(d11) + SE(d12) + SE(d13) + SE(d14) + SE(d15) ) / 5
- MSE = ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5
- MSE = 0.430
- Step 3: Calculate Root Mean Square Error
- RMSE = √( ( SE(d11) + SE(d12) + SE(d13) + SE(d14) + SE(d15) ) / 5 )
- RMSE = √( ( 0.0625 + 0.0400 + 2.0449 + 0.0016 + 0.0036 ) / 5 ) = √0.430
- RMSE = 0.656
- Comparing MAE, MSE and RMSE
- In the previous example
- MAE = 0.396
- MSE = 0.430
- RMSE = 0.656
- Note
- MAE is the lowest score
- Reason
- It only averages the absolute differences between the Actual Values and the Predicted Values
- MSE is higher than MAE
- Reason
- It amplifies the effect of large differences between the Actual Values and the Predicted Values by squaring each difference (here, the single large error of 1.43 becomes 2.0449)
- RMSE is higher than MSE
- Reason
- RMSE is the Square Root of MSE and expresses the error in the same units as the Actual Values; since MSE is less than 1 in this example, its Square Root (0.656) is larger than MSE itself
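- Illustrative Code Sketch (not part of the original text)
- A minimal pure-Python sketch that reproduces the MAE, MSE and RMSE values of the GPA worked example above

    import math

    actual    = [2.25, 1.94, 3.43, 1.86, 1.94]   # Actual GPA values from the worked example
    predicted = [2.00, 2.14, 2.00, 1.90, 2.00]   # Predicted GPA values from the worked example

    n = len(actual)
    abs_errors    = [abs(a - p) for a, p in zip(actual, predicted)]
    square_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

    mae  = sum(abs_errors) / n        # 0.396
    mse  = sum(square_errors) / n     # ≈ 0.430
    rmse = math.sqrt(mse)             # ≈ 0.656

    print(mae, mse, rmse)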
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 5
- Task 1
- Consider a Regression Problem which aims to predict the Number of Admissions in the Next Year. The table below contains the Actual and Predicted Values.
- Note
- Your answer should be
- Well Justified
- Questions
- Calculate the following
- Mean Absolute Error
- Mean Square Error
- Root Mean Square Error
- Which one of the three evaluation measures (MAE, MSE, RMSE) is more suitable?
Your Turn Tasks
Your Turn Task 5
- Task 1
- Consider a Regression Problem (similar to the one given in TODO Task) and answer the questions given below.
- Questions
- Calculate the following
- Mean Absolute Error
- Mean Square Error
- Root Mean Square Error
- Which one of the three evaluation measures (MAE, MSE, RMSE) is more suitable?
Evaluation Measures for Sequence-to-Sequence Problems
- Evaluation Measures for Sequence-to-Sequence Problems
- Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bi-Lingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit Ordering)
- Usage
- ROUGE, BLEU and METEOR are widely and commonly used to evaluate a range of Sequence to Sequence Problems including
- Text Summarization
- Machine Translation
- Chatbot
- Question Answering
- Automatic Paraphrase Generation
- Automatic Grading of Essays
- Generating Caption for an Image / Video
- Generating Natural Language Description for an Image / Video
- Speech to Text
- Note
- In this Chapter, In sha Allah (انشاء اللہ), I will give an example of ROUGE and BLEU
- Evaluating Text Summarization System using ROUGE
- ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems
- Insha Allah (انشاء اللہ), I will use the following three metrics of ROUGE to evaluate Urdu Text Summarization System
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Average F1 scores will be reported for ROUGE-1, ROUGE-2 and ROUGE-L metrics
- Note
- To understand the working of ROUGE-L, ROUGE-1 and ROUGE-2 metrics
- See Tutorial – Evaluating Sequence to Sequence Models using ROUGE
- Consider the Text Summarization System discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples
- Below are Predictions Returned by Model on Test Data
- Calculating Average F1 Scores for ROUGE-1, ROUGE-2 and ROUGE-L
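- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch, assuming the third-party rouge-score package is installed (pip install rouge-score); the reference and system summaries below are made-up English sentences used only to show the API

    from rouge_score import rouge_scorer

    reference = "the cabinet approved the new education policy today"   # gold (human) summary, made-up
    candidate = "cabinet approved new education policy"                 # system summary, made-up

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)

    for metric, result in scores.items():
        print(metric, "F1 =", round(result.fmeasure, 3))

- In practice, the F1 scores are averaged over all Test Examples and the Average F1 is reported, as stated above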
- Evaluating Machine Translation System using BLEU
- BLEU is a de facto standard to automatically evaluate the performance of Machine Translation Systems
- In sha Allah, I will use following four metrics of BLEU to evaluate Urdu Machine Translation System
- BLEU-1
- BLEU-2
- BLEU-3
- BLEU-4
- To understand the working of BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics
- See Tutorial – Evaluating Sequence to Sequence Models using BLEU
- Evaluating Machine Translation System using BLEU Cont….
- Consider the Machine Translation System discussed in Chapter 05 – Treating a Problem as a Machine Learning Problem – Step by Step Examples
- Below are Predictions Returned by Model on Test Data
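- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch, assuming NLTK is installed (pip install nltk); the reference and candidate translations below are made-up token lists used only to show the API

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["the", "cabinet", "approved", "the", "new", "education", "policy"]   # gold translation, made-up
    candidate = ["cabinet", "approved", "the", "new", "education", "policy"]          # system translation, made-up
    smooth = SmoothingFunction().method1   # avoids zero scores when a higher n-gram order has no match

    # Cumulative BLEU-1 ... BLEU-4 (weights spread over 1-grams up to 4-grams)
    weights_list = [(1.0, 0, 0, 0), (0.5, 0.5, 0, 0), (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
    for k, weights in enumerate(weights_list, start=1):
        score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
        print("BLEU-" + str(k) + ":", round(score, 3))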
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 6
- Task 1
- Consider the Text Summarization and Machine Translation Problems given in this Chapter. I have calculated ROUGE scores for Text Summarization System and BLEU for Machine Translation System.
- Note
- Your answer should be
- Well Justified
- Question
- Describe following things about METEOR
- Definition
- Purpose
- Importance
- Applications
- Strengths
- Weaknesses
- Calculate METEOR scores for the Text Summarization and Machine Translation Systems given in this Chapter?
- TIP
- See the following article on METEOR
- Wikipedia Article on METEOR
- URL: https://en.wikipedia.org/wiki/METEOR
Your Turn Tasks
Your Turn Task 6
- Task 1
- Identify a Machine Learning Problem (similar to Text Summarization and Machine Translation in TODO Task) and answer the questions given below.
- Question
- Calculate METEOR score for the selected Machine Learning Problem?
Chapter Summary
- Chapter Summary
- To completely and correctly learn any task follow the Learning Cycle
- The four main phases of a Learning Cycle to completely and correctly learn any task are
- Training Phase
- Testing Phase
- Application Phase
- Feedback Phase
- The main goal of building a Model (Training / Learning) is to use it in doing Real-world Tasks with good Accuracy
- No one can perfectly predict that a Model which performs well in the Training Phase will also perform well in the Real-world
- However, before deploying a Model in the Real-world, it is important to know
- How well will it perform on unseen Real-time Data?
- To judge / estimate the performance of a Model (or h) in Real-world (Application Phase)
- Evaluate the Model (or h) on large Test Data (Testing Phase)
- If (Model Performance = Good AND Test Data = Large)
- Then
- Use the Model in Real-world
- Else
- Refine (re-train) the Model
- Recall – Machine Learning Assumption
- If a Model (or h) performs well on large Test Data, it will also perform well on unseen Real-world Data
- Again, this is an assumption, and we are not 100% sure that a Model which performs well on large Test Data will definitely perform well on Real-world Data
- Therefore, it is useful to take continuous Feedback on deployed Model (Feedback Phase) and keep on improving it
- Two main advantages of Evaluating Hypothesis / Model are
- We get the answer to an important question i.e.
- Should we rely on predictions of Hypothesis / Model when deployed in Real-world?
- Machine Learning Algorithms may rely on Evaluation to refine Hypothesis (h)
- When we evaluate a Hypothesis h, we want to know
- How accurately will it classify future unseen instances?
- i.e. Estimate of Error (EoE)
- How accurate is our Estimate of Error (EoE)?
- i.e. what Margin of Error (± ?%) is associated with our Estimate of Error (EoE)? (we call this the Error in Estimate of Error)
- True Error
- Error computed on entire Population
- Sample Error
- Error computed on Sample Data
- Since we cannot acquire entire Population
- Therefore, we cannot calculate True Error
- Calculate Sample Error in such a way that
- Sample Error estimates True Error well
- Statistical Theory tells us that Sample Error can estimate True Error well if the following two conditions are fulfilled
- Condition 01 – the n instances in Sample S are drawn
- independently of one another
- independently of h
- according to Probability Distribution D
- Condition 02
- n ≥ 30
- Estimate of Sample Error (EoSE) can be calculated as follows
- Step 1: Randomly select a Representative Sample S from the Population
- Step 2: Calculate Sample Error on Representative Sample S drawn in Step 1
- Estimate cannot be perfect and will contain Error
- Therefore, calculate Error in Estimate of Sample Error (EoSE)
- Error in Estimate of Sample Error (EoSE) can be calculated using Confidence Interval
- The Most Probable Value of True Error is the Sample Error; with approximately N% Probability (Confidence Level), True Error lies in the interval
- error_S(h) ± z_N × √( error_S(h) × (1 - error_S(h)) / n )
- where n represents the size of the Sample, error_S(h) represents the Sample Error, and z_N is the constant for the chosen Confidence Level (e.g., z = 1.96 for 95%)
- The choice of Confidence Level depends on the field of study
- Generally, the most common Confidence Level used by Researchers is 95%
- Our goal is to have a
- Small Interval with High Confidence
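- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch of the Confidence Interval calculation described above; the Sample size and Sample Error used here are made-up illustrative values

    import math

    n = 100               # number of Test instances (must be >= 30), illustrative value
    sample_error = 0.20   # error_S(h) measured on the Sample, illustrative value
    z_95 = 1.96           # z constant for a 95% Confidence Level

    margin = z_95 * math.sqrt(sample_error * (1 - sample_error) / n)
    print("True Error lies in [", round(sample_error - margin, 3), ",", round(sample_error + margin, 3), "] with ~95% confidence")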
- The two main diseases in Machine Learning are
- Overfitting
- Underfitting
- The condition when a Machine Learning Algorithm tries to remember all the Training Examples from the Training Data (Rote Learning) is known as Overfitting of the Model (h)
- Overfitting happens when our
- Model (h) has a lot of features or
- Model (h) is too complex
- The condition when a Machine Learning Algorithm cannot properly learn the correlations between Attributes / Features is known as Underfitting of the Model (h)
- Underfitting happens when our
- Model misses the trends or patterns in the Training Data and cannot even fit the Training Examples well
- To overcome the problems of Overfitting and Underfitting, we
- Use Train-Test Split Approach
- Train-Test Split Approach, splits the Sample Data into two sets: (1) Train Set and (2) Test Set
- Two main variations of Train-Test Split Approach are
- Random Split Approach
- Class Balanced Split Approach
- Class Balanced Split Approach should be preferred over Random Split Approach
- Train-Test Split Ratio determines what percentage of the Sample Data will be used as Train Set and what percentage of the Sample Data will be used as Test Set
- The Train-Test Split Ratio may vary from Machine Learning Problem to Machine Learning Problem
- e.g. 70%-30%, 80%-20%, 90%-10% etc.
- Most Common Train-Test Split Ratio
- Use 2 / 3 of Sample Data as Train Set
- Use 1 / 3 of Sample Data as Test Set
- Train-Test Split Approach provides high variance in Estimate of Sample Error since
- Changing which examples happen to be in the Train Set can significantly change the Sample Error
- To overcome the problem of high variance in Estimate of Sample Error calculated using Train-Test Split Approach, we
- Use K-fold Cross Validation Approach
- K-fold Cross Validation Approach works as follows
- Step 1: Split Train Set into K equal folds (or partitions)
- Step 2: Use one of the folds (kth fold) as the Test Set and union of remaining folds (k – 1 folds) as Training Set
- Step 3: Calculate Error of Model (h)
- Step 4: Repeat Steps 2 and 3, choosing a different fold as the Test Set each time, and calculate the Error K times
- Step 5: Calculate Average Error
- Important Note
- In each fold, there must be at least 30 instances
- All K folds must be disjoint, i.e., an instance appearing in one fold must not appear in any other fold
- Empirical study showed that best value for
- K is 10
- K-fold Cross Validation Approach is a better estimator of Error since
- All data is used for both Training and Testing
- K-fold Cross Validation Approach is computationally expensive since
- we have to repeat Training and Testing Phases K-times
- It is suitable to use Train-Test Split Approach in the following situations
- When Training Time is Very Large
- Organizing International Competitions
- Having Very Huge Sample Data
- It is suitable to use K-fold Cross Validation Approach in the following situations
- When Training Time is Not Very Large
- Having Sample Data that is Not Very Huge
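- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch, assuming scikit-learn; it shows a Class Balanced (stratified) Train-Test Split with a 2/3 - 1/3 ratio and 10-fold Cross Validation on synthetic data; the dataset and classifier are illustrative choices, not taken from this Chapter

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)   # synthetic illustrative data

    # Train-Test Split Approach (2/3 Train, 1/3 Test), Class Balanced via stratify=y
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)

    model = GaussianNB().fit(X_train, y_train)
    print("Accuracy (Train-Test Split):", model.score(X_test, y_test))

    # K-fold Cross Validation Approach with K = 10
    scores = cross_val_score(GaussianNB(), X, y, cv=10)
    print("Average Accuracy (10-fold CV):", scores.mean())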
- To compare various Machine Learning Algorithms, following things must be same
- Train Set
- Test Set
- Evaluation Measure
- Evaluation Methodology
- Important Note
- If any of the above things is not the same, then it will
- Not be a valid comparison
- Some of the most popular and widely used Evaluation Measures for Classification Problems are
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F1
- Area Under the Curve (AUC)
- Baseline Accuracy (a.k.a. Majority Class Categorization (MCC)) is calculated by assigning the label of Majority Class to all the Test Instances
- BA provides a simple baseline to compare proposed Machine Learning Algorithms
- Accuracy is defined as the proportion of correctly classified Test instances
- Accuracy = 1 – Error
- Accuracy evaluation measure is more suitable to use for evaluation of Machine Learning Algorithms when we have
- Balanced Data
- Two main limitations of Accuracy measure are
- Accuracy fails to accurately evaluate a Machine Learning Algorithm when Test Data is highly unbalanced
- Accuracy ignores possibility of different misclassification costs
- To overcome the limitations of Accuracy measure, we use
- Confusion Matrix
- A Confusion Matrix is a table used to describe the performance of a Classification Model (or Classifier) on a Set of Test Examples (Test Data), whose Actual Values (or True Values) are known
- Some of the main advantages of the Confusion Matrix are
- Confusion Matrix allows us to visualize the performance of a Model / Classifier
- Confusion Matrix allows us to separately get insights into the Errors made for each Class
- Confusion Matrix gives insights to both
- Errors made by a Model / Classifier and
- Types of Errors made by a Model / Classifier
- Confusion Matrix allows us to compute many different Evaluation Measures including
- Baseline Accuracy
- Accuracy
- True Positive Rate (or Recall)
- True Negative Rate
- False Positive Rate
- False Negative Rate
- Precision
- F1
- Recall or True Positive Rate (TPR) or Sensitivity is the proportion of Positive cases that were correctly classified
- False Positive Rate (FPR) is the proportion of Negative cases that were incorrectly classified as Positive
- True Negative Rate (TNR) or Specificity is defined as the proportion of Negative cases that were classified correctly
- False Negative Rate (FNR) is the proportion of Positive cases that were incorrectly classified as Negative
- Precision (P) is the proportion of the predicted Positive cases that were correct
- F-measure is the weighted Harmonic Mean of Precision and Recall
- Fβ = ( (1 + β²) × Precision × Recall ) / ( β² × Precision + Recall ), where β controls the relative weight assigned to Precision and Recall
- When we assign the same weight to Precision and Recall, i.e. β = 1, the F-measure becomes the F1-measure
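- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch, assuming scikit-learn; it derives the Confusion Matrix and several of the measures listed above from made-up True and Predicted labels

    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = Positive, 0 = Negative (made-up labels)
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # made-up predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    tpr = tp / (tp + fn)   # True Positive Rate (Recall / Sensitivity)
    fpr = fp / (fp + tn)   # False Positive Rate
    tnr = tn / (tn + fp)   # True Negative Rate (Specificity)
    fnr = fn / (fn + tp)   # False Negative Rate

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1       :", f1_score(y_true, y_pred))
    print("TPR:", tpr, "FPR:", fpr, "TNR:", tnr, "FNR:", fnr)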
- When considering Misclassifications
- Misclassifying (or incorrectly predicting) Positive instances may be more or less costly than misclassifying Negative instances
- Before Training your Model, you must be very clear about the Misclassification Costs, otherwise
- Your Model will fail to perform well in Real-world (i.e. Application Phase)
- The problem of considering different misclassification costs can be handled using
- ROC Curves
- Precision-Recall Curves
- ROC Curve summarizes the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) for a Classifier / Model using different Probability Thresholds
- ROC Curves are more suitable to use when we have Balanced Data
- Precision-Recall Curve summarizes the trade-off between Precision and Recall (or True Positive Rate) for a Classifier / Model using different Probability Thresholds
- Precision-Recall Curves are more suitable to use when we have Highly Unbalanced Data
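- Illustrative Code Sketch (not part of the original text)
- A minimal Python sketch, assuming scikit-learn; it computes the Precision-Recall Curve points from made-up labels and predicted probabilities

    from sklearn.metrics import precision_recall_curve

    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # made-up labels (1 = Positive)
    y_score = [0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5]    # made-up predicted probabilities

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    for p, r in zip(precision, recall):
        print("Precision =", round(p, 2), "Recall =", round(r, 2))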
- Area Under the ROC Curve (AUC) is defined as the probability that a randomly chosen Positive instance is ranked higher than a randomly chosen Negative one
- AUC tells how well the Model is capable of distinguishing between Classes
- AUC score is computed using True Positive Rate (TPR) and False Positive Rate (FPR)
- Range of AUC Score
- Range of AUC Score is [0 – 1]
- 0 means All Predictions of Model are Wrong
- 1 means All Predictions of Model are Correct
- What is a good AUC Score?
- AUC Score = 0.5
- suggests No Discrimination (i.e., the Model does not have the ability to differentiate Positive instances from Negative ones)
- AUC Score = 0.7 to 0.8
- considered Acceptable
- AUC Score = 0.8 to 0.9
- considered Excellent
- AUC Score = Greater Than 0.9
- considered Outstanding
- Some of the most popular and widely used Evaluation Measures for Regression Problems are
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R2 or Coefficient of Determination
- Adjusted R2
- Absolute Error (AE) is the difference between the Actual Value and the Predicted Value
- Mean Absolute Error (MAE) is the average of all Absolute Errors
- Square Error (SE) is the Square of difference between the Actual Value and the Predicted Value
- Mean Square Error (MSE) is the average of all Square Errors
- Root Mean Square Error (RMSE) is the Square Root of the Mean Square Error
- Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bi-Lingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit Ordering)
- ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems
- Normally, we calculate
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Average F1 scores are reported for ROUGE-1, ROUGE-2 and ROUGE-L metrics
- BLEU is a de facto standard to automatically evaluate the performance of Machine Translation Systems
- Normally, we calculate
- BLEU-1
- BLEU-2
- BLEU-3
- BLEU-4
In Next Chapter
- In Next Chapter
- In Sha Allah, in next Chapter, I will present
- Book Main Findings, Conclusion and Future Work