Chapter 6 - Treating a Problem as a Machine Learning Problem – Step by Step Examples
Chapter Outline
- Chapter Outline
- Quick Recap
- Steps – Treating a Real-world Problem as a Machine Learning Problem
- GPA Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Emotion Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Text Summarization System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Machine Translation System – Treating a Real-world Problem as a Supervised Machine Learning Problem
Chapter Summary
Quick Recap
- Quick Recap – Data and Annotations – Step by Step Example
- The five main Steps for Data Annotation are as follows
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle the problems encountered in developing the Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- When you create your proposed Annotated Corpus at prototype and / or full scale level, follow the following main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
- An authentic and appropriate Data Source is essential to develop a large Gold Standard Annotated Corpus
- As discussed in Lecture 3 – Data and Annotations, the main types of Data Sources are
- Sources with Annotations
- Sources without Annotations
- Sources with / without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Supervised Machine Learning Problems can be broadly categorized into three main types
- Classification Problems
- Regression Problems
- Sequence to Sequence Problems
- For each type of Machine Learning Problem
- Suitable Machine Learning Algorithms may differ
- Classification Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Categorical
- Regression Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Numeric
- Sequence to Sequence Problems – Input and Output
- Input
- Unstructured (of variable length)
- Output
- Unstructured (of variable length)
- In this Lecture, we have discussed four Step by Step examples to create a Gold Standard Annotated Corpus
- Data Sources with Annotations
- Developing a Gold Standard Annotated Corpus for Urdu Text Summarization Task
- Developing a Gold Standard Annotated Corpus for GPA Prediction Task
- Data Sources without Annotations
- Developing a Gold Standard Annotated Corpus for Emotion Prediction on Tweets
- Developing a Gold Standard Annotated Corpus for Urdu-English Machine Translation Task
Steps - Treating a Real-world Problem as a Machine Learning Problem
- Steps - Treating a Real-world Problem as a Machine Learning Problem
- Follow these steps to treat a Real-world Problem as a Machine Learning Problem
- Step 1: Decide the Learning Setting
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Step 1: Decide the Learning Setting
- Three main Learning Settings are
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Step 2: Obtain Sample Data
- What Type of Data should be obtained depends upon
- Learning Settings you selected in Step 1
- Supervised Learning requires
- Annotated Data
- Unsupervised Learning requires
- Unannotated Data
- Semi-supervised Learning requires
- Semi-annotated Data
- Two Main Choices to Obtain Sample Data
- Use Existing Corpora / Datasets
- Develop your Own Corpora / Datasets
- Note
- For details on how to develop a Gold Standard Annotated Corpus
- See Lecture 04 – Data and Annotations – Step by Step Examples
- Remember – Very Important
- Machine Learning Approaches are Data Driven Approaches
- In recent years, Research Community has made efforts to develop Gold Standard (or Benchmark) Corpora / Datasets by organizing
- Shared Tasks
- Apart from Shared Tasks
- Researchers have made efforts to develop Gold Standard (or Benchmark) Corpora / Datasets for various tasks
- For details on Shared Tasks
- See Lecture 06 – A Template based Approach to Read a Research Paper (Research Methodology in I.T. Course)
- URL:https://ilmoirfan.com/research-methodology-in-it/ch-methodology-in-it/
- Some Popular Shared Tasks
- Popular Shared Tasks in Natural Language Processing (NLP)
- SemEval
- URL: http://alt.qcri.org/semeval2020/index.php?id=tasks
- PAN
- Note that these Shared Tasks mainly focus on
- Text Data
- Popular Shared Tasks in Information Retrieval (IR)
- TREC
- Note that this Shared Task mainly focuses on
- Text Data
- Image Data
- Video Data
- Popular Shared Tasks in Image Processing (IP)
- BraTS (Brain Tumor Segmentation Challenge)
- SpaceNet
- URL:https://www.grssieee.org/earthvision2020/challenge.html
- AI City Challenge
- Note that these Shared Tasks mainly focus on
- Image Data
- Video Data
- Step 3: Understand and Pre-process Sample Data
- For details on how to understand and pre-process data
- See Lecture 03 – Data and Annotations
- Step 4: Represent Sample Data in Machine Understandable Format
- Machine Understandable Data Format
- Representing Data in a Format, which a Learner (Machine Learning Algorithm) can use to learn
- Very often, Machine Learning Algorithms understand Data represented in the form of
- Attribute-Value Pair
- For more details on Data Representation
- See Lecture 03 – Data and Annotations
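As a minimal sketch of the Attribute-Value Pair idea, each instance can be written as a mapping from attribute names to values and then flattened into a fixed-order Feature Vector. The attribute names and values below are hypothetical and chosen only for illustration.

```python
# Hypothetical instances in Attribute-Value Pair form (names and values are made up).
instances = [
    {"matric_marks": 850, "fsc_marks": 900},
    {"matric_marks": 700, "fsc_marks": 760},
]

# A fixed attribute order, so that every Feature Vector has the same layout.
ATTRIBUTES = ["matric_marks", "fsc_marks"]


def to_feature_vector(instance, attributes=ATTRIBUTES):
    """Flatten one Attribute-Value mapping into a list of numeric values."""
    return [float(instance[name]) for name in attributes]


feature_vectors = [to_feature_vector(inst) for inst in instances]
print(feature_vectors)
```

Keeping the attribute order in one place (here the assumed `ATTRIBUTES` list) helps ensure that Training, Testing and unseen instances are all represented in the same way.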
- Step 5: Suitable ML Algorithms for Supervised Learning
- Scikit-Learn Cheat Sheet is a Good Starting Point for Selecting Suitable ML Algorithms for a specific Machine Learning Problem
- Problem
- Scikit-Learn Cheat Sheet will not work for all situations
- Solution
- Build a deeper understanding of ML Algorithms
- Important Question
- How to choose suitable Machine Learning Algorithm(s) for your Machine Learning Problem?
- A Possible Answer
- Consider the following main points when choosing suitable Machine Learning Algorithms for your Machine Learning Problem
- Main Points to Consider
- Type of Machine Learning Problem
- Number of Parameters
- Size of Training and Testing Data
- Number of Features
- Training and Testing Time
- Accuracy
- Speed and Accuracy in Application Phase
- Type of Machine Learning Problem
- Machine Learning Algorithms are designed to solve specific Machine Learning Problems
- Two Important Points to Know
- Complete and correct understanding of the Type of Machine Learning Problem you are trying to solve using Machine Learning Algorithms
- In previous studies, what Machine Learning Algorithms have proven to be most effective for the Type of Machine Learning Problem you are solving?
- Three main types of Machine Learning Problems are
- Supervised Learning
- Unsupervised Learning (e.g. Clustering)
- Semi-supervised Learning
- Supervised Learning Problems
- A Supervised Learning Problem may fall into one of the following three categories
- Classification Problem
- Regression Problem
- Sequence to Sequence Problem
- Good Starting Points for Classification Problems
- Feature-based ML Algorithms
- For Textual Data
- Random Forest
- Support Vector Machine
- Logistic Regression
- Naïve Bayes
- Gradient Boost
- For Image / Video Data
- Support Vector Machine
- Regular Neural Networks
- Logistic Regression
- Naive Bayes
- Extreme Learning Machines
- Random Forest
- Extreme Gradient Boost
- Type II Approximate Reasoning
- For Audio Data
- Connectionist Temporal Classification
- Deep Learning ML Algorithms
- For Textual Data
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Gated Recurrent Units (GRU)
- BI-GRU
- For Image / Video Data
- Convolutional Neural Networks (most popular)
- For Audio Data
- Recurrent Neural Networks (RNN)
- Good Starting Points for Regression Problems
- Feature-based ML Algorithms
- Linear Regression
- Regression Trees
- Lasso Regression
- Multivariate Regression
- Good Starting Points for Sequence to Sequence Problems
- For Textual Data
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Gated Recurrent Units (GRU)
- BI-GRU
- For Image / Video Data
- Convolutional Neural Networks
- For Audio Data
- Recurrent Neural Networks (RNN)
- Unsupervised Learning Problems
- Good Starting Points for Unsupervised Learning Problems
- Feature-based ML Algorithms
- For Textual Data
- K-Means
- Agglomerative Hierarchical Clustering
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- For Image / Video Data
- K-Means
- Fuzzy C Means
- Deep Learning ML Algorithms
- For Image / Video Data
- Generative Adversarial Networks
- Auto-encoders
- Semi-supervised Learning Problems
- Good Starting Points for Semi-Supervised Learning Problems
- Feature based ML Algorithms
- Label Spreading Algorithm
- Good Starting Points in Machine Learning
- Many Machine Learning Algorithms make use of linearity
- For example, Linear Regression, Logistic Regression, Support Vector Machines etc.
- Machine Learning Algorithms based on linearity are considered as a Good Starting Point
- Two main characteristics of Machine Learning Algorithms based on linearity are
- They are algorithmically simple
- They are fast to train
- Selection of Best Machine Learning Algorithm
- Question
- Which Machine Learning Algorithm(s) is best for a specific Machine Learning Problem?
- Answer
- Apply all available Machine Learning Algorithms and see which performs best
- Problem
- It requires a lot of effort, time and resources to
- Apply all available Machine Learning Algorithms and find the best one
- A Possible Solution
- Start with Good Starting Points
- Machine Learning Experts say that the following Machine Learning Algorithms are Good Starting Points
- Feature based ML Algorithms
- For Structured / Unstructured / Semi-structured Data
- Support Vector Machine
- Logistic Regression
- Deep Learning ML Algorithms
- For Textual Data
- Recurrent Neural Network (RNN)
- For Image / Video Data
- Convolutional Neural Network (CNN)
- For Audio Data
- Recurrent Neural Network (RNN)
- Number of Parameters
- ML Algorithm’s behavior is affected by
- No. of Parameters
- ML Algorithms with Small Number of Parameters
- Strengths
- Require little Hit and Trial to find a good combination of Parameters (or Model)
- Weaknesses
- Do not provide flexibility
- ML Algorithms with Large Number of Parameters
- Strengths
- Provide flexibility
- Weaknesses
- Require a lot of Hit and Trial to find a good combination of Parameters (or Model)
- Size of Training and Testing Data
- Size of Training Data
- Size of Training Data plays a very important role in the Selection of Suitable ML Algorithms
- Feature-based ML Algorithms
- Feature-based ML Algorithms (a.k.a. Classical ML Algorithms) can be accurately trained, even if the Training Data is small
- Deep Learning ML Algorithms
- To accurately train Deep Learning ML Algorithms, a huge amount of Training Data is required
- Size of Testing Data
- Size of Testing Data plays a very important role when evaluating a Machine Learning Algorithm
- To deploy a Model in Real-world (Application Phase), it should fulfill the following two conditions
- Model should perform well (Condition 01) on large Test Data (Condition 02)
- Number of Features
- Features used to Train a Model have a significant impact on the performance of the Model
- Selection of the most discriminating Features is important to get good results
- Problem
- In some Corpora / Datasets, it may happen that
- No. of Features is very high compared to the No. of Instances in a Corpus / Dataset
- Consequently, the Training Time may become unfeasibly long
- Very often, this happens in
- Textual Data
- Genetics Data
- Image / Video Data
- Possible Solutions
- Two popular and widely used approaches to reduce Number of Features in a Corpus / Dataset are
- Feature Reduction
- Feature Selection
- Feature Reduction
- Feature Reduction (a.k.a. Dimensionality Reduction) is a process which transforms Features into a lower dimension
- Popular Methods for Feature Reduction are
- Principal Component Analysis
- Generalized Discriminant Analysis
- Auto-encoders
- Non-negative Matrix Factorization
- Feature Selection
- Feature Selection is the process of selecting the most discriminating (or important) subset of Features (excluding redundant or irrelevant Features) from the Original Set of Features (without changing them)
- Popular Methods for Feature Selection are
- Wrapper Methods
- Filter Methods
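To make the idea of a Filter Method concrete, the sketch below scores every Feature by its absolute Pearson correlation with a numeric Output and keeps the top k. This is only one possible scoring rule, picked here as an assumption; the feature names, values and the helper `select_top_k` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the target
    return cov / math.sqrt(var_x * var_y)


def select_top_k(columns, target, k):
    """Filter Method: keep the k Features with the highest absolute correlation."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical toy data: three Features and a numeric Output.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],  # strongly related to the target
    "f2": [2.0, 2.0, 2.0, 2.0, 2.0],  # constant, hence uninformative
    "f3": [5.0, 3.0, 4.0, 1.0, 2.0],  # loosely (negatively) related
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

selected = select_top_k(columns, target, k=2)
print(selected)
```

Note that the selected Features keep their original values; only the uninformative column is dropped, which is what distinguishes Feature Selection from Feature Reduction.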
- Feature Extraction Vs Feature Selection
- Feature Extraction
- Creates new Features
- Feature Selection
- Selects a subset of Features from the Original Set of Features
- Note
- Given a Corpus / Dataset
- First carry out
- Feature Extraction then
- Feature Reduction / Feature Selection
- Training and Testing Time
- Training and Testing Time depend mainly on two factors
- Size of Training and Testing Data
- Target Accuracy
- Note
- Training Time of Deep Learning ML Algorithms is quite high compared to Feature based ML Algorithms
- Target Accuracy
- The Target Accuracy may differ from Machine Learning Problem to Machine Learning Problem
- Machine Learning Problem 01
- Detection of Enemy Tank from Vehicles
- i.e. Tank vs Non-Tank
- Machine Learning Problem 02
- Gender Identification from Image
- i.e. Male vs Female
- We need much higher Accuracy for Machine Learning Problem 01 (Tank vs Non-Tank) than for Machine Learning Problem 02 (Gender Identification)
- Speed and Accuracy in Application Phase
- Speed and Accuracy requirements in Application Phase may vary from Machine Learning Problem to Machine Learning Problem
- Machine Learning Problem 01
- Detection of Enemy Tank from Vehicles
- i.e. Tank vs Non-Tank
- Machine Learning Problem 02
- Plagiarism Detection in Students’ Assignments
- i.e. Plagiarized vs Non-Plagiarized
- Note that in the Application Phase
- For Enemy Tank Detection
- We need both
- High Accuracy and
- High Speed
- For Plagiarism Detection
- We need
- High Accuracy
- Slow Speed is acceptable
- ML Algorithms – Scikit-Learn Cheat sheet
- Step 6: Split Sample Data into Training Data and Testing Data
- Split Sample Data into
- Training Data
- Testing Data
- Standard Practice for Splitting Sample Data
- Use a Train-Test Split Ratio of
- 67% – 33%
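The split can be sketched in a few lines of plain Python. The helper name `split_sample_data`, the seed and the stand-in instances are assumptions made for illustration; scikit-learn users would typically reach for `train_test_split` from `sklearn.model_selection` instead.

```python
import random


def split_sample_data(sample_data, test_ratio=0.33, seed=42):
    """Shuffle a copy of the Sample Data and split it into Training and Testing Data."""
    shuffled = list(sample_data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


sample_data = list(range(15))  # stand-in for 15 instances
training_data, testing_data = split_sample_data(sample_data)
print(len(training_data), len(testing_data))  # 10 5
```

Shuffling before splitting avoids a Testing set that consists only of the last instances in the file, which matters when the Sample Data is stored in some sorted order.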
- Step 7: Select Suitable Evaluation Measure(s)
- Selection of Suitable Evaluation Measure(s) is important to
- correctly evaluate the performance of a Model
- Selection of Suitable Evaluation Measure(s) mainly depends on
- Type of Machine Learning Problem
- Evaluation Measures for Classification Problem
- Some of the most popular and widely used Evaluation Measures for Classification Problems are
- Baseline Accuracy (a.k.a. Most Common Categorization (MCC))
- Accuracy
- True Negative Rate or Specificity
- False Positive Rate
- False Negative Rate
- Recall or True Positive Rate or Sensitivity
- Precision or Positive Predictive Value
- F1
- Area Under the Curve (AUC)
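For a binary Classification Problem, most of these measures follow directly from the four confusion-matrix counts. The sketch below assumes hypothetical counts and non-zero denominators; the function name `classification_measures` is introduced only for illustration.

```python
def classification_measures(tp, fp, tn, fn):
    """Compute common Evaluation Measures from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)  # True Positive Rate / Sensitivity
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "true_negative_rate": tn / (tn + fp),  # Specificity
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a binary classifier evaluated on 100 test instances.
measures = classification_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Note how True Negative Rate and False Positive Rate always sum to 1, as do Recall and False Negative Rate, so reporting one of each pair is usually enough.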
- Evaluation Measures for Regression Problem
- Some of the most popular and widely used Evaluation Measures for Regression Problems are
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
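The Regression measures listed above can be computed from a list of Actual Values and a list of Predicted Values. The sketch below uses hypothetical values and assumes a single Feature when computing Adjusted R²; the function name and the `n_features` parameter are illustrative.

```python
import math


def regression_measures(actual, predicted, n_features=1):
    """Compute MAE, MSE, RMSE, R2 and Adjusted R2 for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adjusted_r2": adjusted_r2}


# Hypothetical Actual and Predicted Values.
actual = [3.0, 2.5, 3.5, 4.0]
predicted = [2.8, 2.7, 3.5, 3.6]

results = regression_measures(actual, predicted)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are expressed in the same unit as the Output, which usually makes them easier to interpret than MSE.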
- Evaluation Measures for Sequence to Sequence Problem
- Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bi-Lingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
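These measures compare a system output with a reference text by counting overlapping words. The sketch below is a deliberately simplified illustration, assuming whitespace tokenization and a single reference: it computes unigram recall (the core of ROUGE-1) and clipped unigram precision (the building block of BLEU). Full ROUGE and BLEU add more, for example longer n-grams and, for BLEU, a brevity penalty.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Clipped count of unigrams shared by the candidate and the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    return sum((cand_counts & ref_counts).values())  # & keeps the minimum count per word


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that also appear in the candidate."""
    return unigram_overlap(candidate, reference) / len(reference.lower().split())


def unigram_precision(candidate, reference):
    """Share of candidate unigrams that also appear in the reference."""
    return unigram_overlap(candidate, reference) / len(candidate.lower().split())


# Hypothetical reference and system output.
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

recall = rouge1_recall(candidate, reference)
precision = unigram_precision(candidate, reference)
print(f"ROUGE-1 recall: {recall:.3f}, unigram precision: {precision:.3f}")
```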
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Testing Phase
- Use Testing Data to evaluate the performance of the Model
- i.e. calculate Error in the Model
- Step 9: Analyze Results
- Machine Learning Assumption
- Question
- What is a good Model?
- A Possible Answer
- It varies from Machine Learning Problem to Machine Learning Problem
- Generally ,
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Deploy the Model in the Real-world to make predictions on unseen data
- Feedback Phase
- Take Feedback from
- Domain Experts
- Users of the ML system
- Step 11: Improve Model based on Feedback
- There is Always Room for Improvement 😊
- Based on Feedback from Domain Experts and Users
- Improve your Model
- To learn any task, follow the following cycle
- Plan – in Mind
- Design – on Paper
- Execute – at Prototype Level
- Execute – in Real-world
- Feedback – from Audience and Domain Experts
- To be successful in life
- Be a Learner till Death 😊
- Just change one person in life i.e. Yourself
- People fail in life because they
- Try to Change the World 😊
TODO and Your Turn
TODO Task 1
- Task 1
- Consider the following Machine Learning Problems and for each Machine Learning Problem answer the questions given below.
- Automatically Generating Caption for an Image
- Speaker Identification (from Audio)
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for each Machine Learning Problem?
- Decide Learning Settings
- What Learning Settings will be more suitable?
- Obtain Sample Data
- Write names of potential Data Sources?
- Understand and Pre-process Sample Data
- What Pre-processing Techniques should be applied to improve the quality of Sample Data?
- Represent Sample Data in Machine Understandable Format
- How can Sample Data be represented in Machine Understandable Format?
- Select Suitable Machine Learning Algorithms
- Write down suitable Machine Learning Algorithms?
- Number of Features
- Do we need to apply Feature Selection or Feature Reduction?
- Split Sample Data into Training Data and Testing Data
- How should Sample Data be split into Training Data and Testing Data?
- Select Suitable Evaluation Measure(s)
- Write down suitable Evaluation Measure(s)?
- Training Time
- How much Training Time will be required to Train selected Machine Learning Algorithms?
- Number of Parameters
- How many Parameters need to be tuned for selected Machine Learning Algorithms?
- Deployment in Application Phase
- Write two conditions that must be fulfilled before deploying a Model in Real-world?
- Discuss requirements of Speed and Accuracy in Application Phase?
Your Turn Task 1
- Task 1
- Select any two Machine Learning Problems and for each Machine Learning Problem answer the questions given below.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for each Machine Learning Problem?
- Decide Learning Settings
- What Learning Settings will be more suitable?
- Obtain Sample Data
- Write names of potential Data Sources?
- Understand and Pre-process Sample Data
- What Pre-processing Techniques should be applied to improve the quality of Sample Data?
- Represent Sample Data in Machine Understandable Format
- How can Sample Data be represented in Machine Understandable Format?
- Select Suitable Machine Learning Algorithms
- Write down suitable Machine Learning Algorithms?
- Number of Features
- Do we need to apply Feature Selection or Feature Reduction?
- Split Sample Data into Training Data and Testing Data
- How should Sample Data be split into Training Data and Testing Data?
- Select Suitable Evaluation Measure(s)
- Write down suitable Evaluation Measure(s)?
- Training Time
- How much Training Time will be required to Train selected Machine Learning Algorithms?
- Number of Parameters
- How many Parameters need to be tuned for selected Machine Learning Algorithms?
- Deployment in Application Phase
- Write two conditions that must be fulfilled before deploying a Model in Real-world?
- Discuss requirements of Speed and Accuracy in Application Phase?
GPA Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- GPA Prediction Problem
- Task
- Develop a GPA Prediction system to predict GPA of a university student (1st semester) from his / her Matric and FSc marks
- Input
- Matric Marks
- FSc Marks
- Output
- GPA (1st semester)
- Treated as a
- Supervised Machine Learning Problem
- Goal
- Learn an Input-Output Function
- i.e. Learn from Input to predict Output
- GPA Prediction is a Regression Problem
- GPA Prediction is a Regression Problem because
- Output is Numeric
- GPA Prediction – Input and Output
- Input
- Structured
- Fixed Set of Two Attributes
- Matric Marks
- FSc Marks
- Output
- Numeric
- Research Focus – GPA Prediction System
- Research Focus
- Develop a GPA Prediction system for university students in Pakistan studying Computer Science at Undergrad level in three degree programs: BS(CS), BS(SE) and BS(IT)
- Steps – Treating GPA Prediction as a Regression Problem
- In sha Allah, I will follow the following steps to treat the GPA Prediction Problem as a Regression Problem
- Step 1: Decide the Learning Setting
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Step 1: Decide the Learning Setting
- In sha Allah, I aim to treat GPA Prediction Problem as a
- Supervised Machine Learning Problem
- Since Output is Numeric, it will be treated as a
- Regression Problem
- Step 2: Obtain Sample Data
- Since I am treating GPA Prediction Problem as a Regression Problem, I will need
- Annotated Data
- For more accurate learning, I need
- Large amount of Annotated Data
- High-quality Annotated Data
- Balanced Data
- Note
- For simplicity, In sha Allah I will use a toy Corpus / Dataset of 15 instances only
- i.e. Size of Sample Data = 15 instances
- Two Main Choices to Obtain Sample Data
- Use an Existing Corpus / Dataset
- Develop Your Own Corpus / Dataset
- Since there is no existing Corpus / Dataset available to develop a GPA Prediction system for university students studying in Pakistan
- I developed my own Corpus / Dataset
- For details on how to create a Gold Standard Annotated Corpus
- See Lecture 4 – Data and Annotations – Step by Step Examples
- To develop our GPA Prediction system, we obtained
- Sample Data of 15 instances
- See gpa-sample-data.csv file in Supporting Material
- Sample Data
- Step 3: Understand and Pre-process Data
- Understanding Data
- The Gold Standard Annotated Corpus contains three Attributes / Features
- Matric Marks
- FSc Marks
- GPA (1st semester)
- Separating Input from Output
- Input comprises two Attributes / Features
- Matric Marks
- FSc Marks
- Output comprises a single Attribute
- GPA (1st semester)
- Pre-processing Data
- Gold Standard Annotated Corpus is already pre-processed
- Step 4: Represent Sample Data in Machine Understandable Format
- Feature-based Regression Algorithms (implemented in Scikit-Learn) can understand data in
- Attribute-Value Pair
- Values of Attribute must be Numeric
- Our Gold Standard Annotated Data is already in
- Attribute-Value Pair form with Numerical Values
- Therefore, it is already in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Previous studies have shown that Good Starting Points for Regression Problems are
- Support Vector Regressor
- Linear Regression
- Random Forest Regressor
- Gradient Boosting Regressor
- Note that Logistic Regression, despite its name, is a Classification algorithm and is therefore not listed here
- Step 6: Split Sample Data into Training Data and Testing Data
- Use Standard Practice for Sample Data Split
- i.e. Train-Test Split Ratio of
- 67% – 33%
- In our Corpus / Dataset
- Total Instances = 15
- Splitting Data into Training Data and Testing Data
- Training Data = 10
- Testing Data = 5
- Training Data
- See gpa-training-data.csv file in Supporting Material
- Testing Data
- See gpa-testing-data.csv file in Supporting Material
- Step 7: Select Suitable Evaluation Measure
- I will use Mean Absolute Error (MAE) Evaluation Measure to evaluate the performance of the Model
- Absolute Error
- Absolute Error (AE) is the absolute difference between the Actual Value and the Predicted Value
- Formula
- $AE = |X_{actual} - X_{predicted}|$
- where $X_{actual}$ and $X_{predicted}$ represent Actual Value and Predicted Value respectively
- Mean Absolute Error
- Mean Absolute Error (MAE) is the average of all Absolute Errors
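- Formula
- $MAE = \frac{1}{n} \sum_{i=1}^{n} |X_{actual,i} - X_{predicted,i}|$
- where
- $n$ represents the total number of instances
- $X_{actual,i}$ represents the Actual Value of the i-th instance
- $X_{predicted,i}$ represents the Predicted Value of the i-th instance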
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Note that our aim is to
- Learn an Input-Output Function
- Recall – General Settings of Learning Input-Output Functions
- Training Phase
- Training Examples (TE) = {x1, …, xm} + f(xi) for each xi ∈ TE
- Testing Phase
- Use Testing Data to compute Error in Model using Mean Absolute Error (MAE) measure
- Predictions Returned by Model (h)
- Calculating Mean Absolute Error
- To calculate Mean Absolute Error, we will compare
- Actual Values with Predicted Values
- Step 1: Calculate Absolute Error for each Test Example
- Step 2: Calculate Mean Absolute Error
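The Training and Testing Phases described above can be sketched end to end as follows. This is a minimal illustration only: the marks and GPA values are hypothetical (they are not taken from the lecture's csv files), and a simple k-Nearest Neighbour regressor is swapped in as the learner because it needs no external library.

```python
import math

# Hypothetical Training and Testing Data: (Matric Marks, FSc Marks) -> GPA.
training_data = [
    ((950, 980), 3.8), ((900, 910), 3.5), ((850, 800), 3.0),
    ((780, 760), 2.7), ((700, 720), 2.4), ((650, 600), 2.0),
]
testing_data = [((920, 940), 3.6), ((720, 700), 2.5)]


def train(examples):
    """Training Phase: a k-NN Model simply stores the Training Examples."""
    return list(examples)


def predict(model, x, k=3):
    """Apply the Model (h): average the GPA of the k nearest Training Examples."""
    nearest = sorted(model, key=lambda example: math.dist(example[0], x))[:k]
    return sum(gpa for _, gpa in nearest) / len(nearest)


def mean_absolute_error(actual, predicted):
    """Average of the Absolute Errors over all Test Examples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


h = train(training_data)

# Testing Phase: compare Actual Values with Predicted Values.
actual = [gpa for _, gpa in testing_data]
predicted = [predict(h, x) for x, _ in testing_data]
mae = mean_absolute_error(actual, predicted)
print(f"Predicted: {predicted}")
print(f"MAE: {mae:.3f}")
```

The Testing Data is never shown to `train`, so the MAE reflects how the Model behaves on instances it has not seen.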
- Step 9: Analyze Results
- Assumption for this Example
- Here, I am assuming that Model
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Model is deployed in Real-world to make predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
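The four steps above can be sketched as a small function pipeline. Everything here is illustrative: `placeholder_model` is a hypothetical stand-in for a trained Model (its formula is made up and assumes marks out of 1100 and GPA out of 4), and the user input is hard-coded instead of being read interactively.

```python
def to_feature_vector(matric_marks, fsc_marks):
    """Step 2: build the Feature Vector in the <Matric, FSc> order used for training."""
    return [float(matric_marks), float(fsc_marks)]


def placeholder_model(feature_vector):
    """Stand-in for a trained Model (h); the formula is hypothetical, not learned."""
    matric, fsc = feature_vector
    return round(4.0 * (0.4 * matric + 0.6 * fsc) / 1100.0, 2)


def predict_gpa(raw_matric, raw_fsc, model=placeholder_model):
    """Steps 2 to 4: vectorize the raw input, apply the Model, return the prediction."""
    feature_vector = to_feature_vector(raw_matric, raw_fsc)
    return model(feature_vector)


# Step 1 would normally read these strings with input(); they are hard-coded here.
prediction = predict_gpa("704", "853")
print(f"Predicted GPA: {prediction}")
```

Passing the Model in as a parameter keeps the Application Phase code unchanged when an improved Model is trained after the Feedback Phase.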
Example – Making GPA Prediction on Real-time Data
- Step 1: Take Input from User
- Enter Matric Marks : 704
- Enter FSc Marks : 853
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Feature Vector
- <704, 853>
- Note that the order of Attributes in both Training and Testing Data was
- <Matric, FSc>
- Similarly, the order of Attributes in the unseen instance is exactly the same as in Training and Testing Data
- Step 3: Apply Model on the Feature Vector of unseen instance
- Model (or h) is applied on <704, 853>
- Step 4: Return Prediction to the User
- 86
- Feedback Phase
- A Two Step Process
- Step 1: After some time, take Feedback from
- Domain Experts and Users on deployed GPA Prediction system
- Step 2: Make a List of Possible Improvements based on Feedback received
- Step 11: Improve GPA Prediction System based on Feedback
- Go to Step 1 and improve the GPA Prediction system based on
- List of Possible Improvements made in Step 10
Emotion Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Emotion Prediction Problem
- Task
- Develop an Emotion Prediction system to predict emotion from a written text
- Input
- A text
- Output
- Emotion
- Possible Output Values (12 Categories)
- Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
- Treated as a
- Supervised Machine Learning Problem
- Goal
- Learn an Input-Output Function
- i.e. Learn from Input to predict Output
- Emotion Prediction is a Classification Problem
- Emotion Prediction is a Classification Problem because
- Output is Categorical
- Emotion Prediction – Input and Output
- Input
- Unstructured (Text)
- Output
- Categorical
- Research Focus – Emotion Prediction System
- Research Focus
- Develop an Emotion Prediction system for English Tweets
- Steps – Treating Emotion Prediction as a Classification Problem
- In sha Allah, I will follow the following steps to treat the Emotion Prediction Problem as a Classification Problem
- Step 1: Decide the Learning Setting
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Step 1: Decide the Learning Setting
- In sha Allah, I aim to treat Emotion Prediction Problem as a
- Supervised Machine Learning Problem
- Since Output is Categorical, it will be treated as a
- Classification Problem
- Step 2: Obtain Sample Data
- Since I am treating Emotion Prediction Problem as a Classification Problem, I will need
- Annotated Data
- For more accurate learning, I need
- Large amount of Annotated Data
- High-quality Annotated Data
- Balanced Data
- Note
- For simplicity, In sha Allah I will use a toy Corpus / Dataset of 15 instances only
- Two Main Choices to Obtain Data
- Use an Existing Corpus
- Develop Your Own Corpus
- A Gold Standard Annotated Corpus is available for Emotion Analysis of English Tweets
- Corpus / Dataset Link:
- https://competitions.codalab.org/competitions/17751#learn_the_details-datasets Last visited: 07-04-2020
- Paper Link
- https://www.aclweb.org/anthology/S18-1001.pdf Last visited: 07-04-2020
- Paper Reference
- Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018, June). SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (pp. 1–17).
- We obtained Sample Data of 15 instances
- See emotion-sample-data.csv File in Supporting Material
- Sample Data
- Step 3: Understand and Pre-process Data
- Understanding Data
- The Gold Standard Annotated Corpus contains two Attributes
- Tweet
- Emotion
- Separating Input from Output
- Input comprises one Attribute
- Tweet
- Output comprises a single Attribute
- Emotion
- Pre-processing Data
- Gold Standard Annotated Corpus is already pre-processed
- Therefore, no pre-processing is needed
- Step 4: Represent Data in Machine Understandable Format
- Feature-based Classification Algorithms (implemented in Scikit-Learn) can understand data in
- Attribute-Value Pair
- Values of Attributes / Features must be Numeric
- Problem
- Our Sample Data is not in Attribute-Value Pair form
- We need to transform our Sample Data into Machine Understandable Format
- Solution
- There are many approaches to transform Sample Data into Machine Understandable Format
- Transforming Sample Data into Machine Understandable Format
- In our Sample Data
- Input is Text
- Output is Categorical
- Considering Input (Tweet) and Output (Emotion), we will need to
- Transform Input (Text) into Numerical Representation
- Transform Output (Categorical) into Numerical Representation
- A Two Step Process
- Step 1: Define an Encoding Scheme
- Step 2: Use Encoding Scheme defined in Step 1, to convert Categorical Output Values to Numerical Output Values for all instances in the Sample Data
- Converting Output into Numerical Representation
- Step 1: Define an Encoding Scheme
- Encoding Scheme
- Anger = 0
- Anticipation = 1
- Disgust = 2
- Fear = 3
- Joy = 4
- Love = 5
- Optimism = 6
- Pessimism = 7
- Sadness = 8
- Surprise = 9
- Trust = 10
- Neutral = 11
- Step 2: Use Encoding Scheme defined in Step 1, to convert Categorical Output Values to Numerical Output Values for all instances in the Sample Data
- Sample Data after Encoding Categorical Output Values to Numerical Output Values
- See emotion-sample-data-encoded-output.csv File in Supporting Material
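The two-step Output conversion above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary mirrors the Encoding Scheme defined in Step 1, while the function names and the four example Output values are assumptions made for illustration and are not taken from the Sample Data.

```python
# Step 1: Encoding Scheme (Emotion -> Numerical Output Value), as listed above
ENCODING_SCHEME = {
    "Anger": 0, "Anticipation": 1, "Disgust": 2, "Fear": 3,
    "Joy": 4, "Love": 5, "Optimism": 6, "Pessimism": 7,
    "Sadness": 8, "Surprise": 9, "Trust": 10, "Neutral": 11,
}
# Reverse mapping, useful later to show predictions as Categorical values
DECODING_SCHEME = {code: emotion for emotion, code in ENCODING_SCHEME.items()}


def encode_outputs(emotions):
    """Step 2: convert Categorical Output Values to Numerical Output Values."""
    return [ENCODING_SCHEME[emotion] for emotion in emotions]


def decode_outputs(codes):
    """Convert Numerical Output Values back to Categorical Output Values."""
    return [DECODING_SCHEME[code] for code in codes]


# Hypothetical Output values (not the actual Sample Data)
sample_outputs = ["Joy", "Anger", "Optimism", "Neutral"]
encoded_outputs = encode_outputs(sample_outputs)
print(encoded_outputs)                  # [4, 0, 6, 11]
print(decode_outputs(encoded_outputs))  # ['Joy', 'Anger', 'Optimism', 'Neutral']
```

Keeping the reverse mapping next to the Encoding Scheme makes it easy to report predictions in Categorical form during the Testing and Application Phases.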
- Note
- Alhumdulilah, Output is transformed into Numerical Representation
- In sha Allah, in the next slides I will explain how to transform Input into Numerical Representation
- Converting Input into Numerical Representation
- Considering Feature-based ML Algorithms, an Input can be transformed into Numerical Representation in the following steps
- Step 1: Select a Feature Extraction Method
- Step 2: Extract Features from Input using the Feature Extraction Method selected in Step 1
- Note
- For details on Feature Extraction from Text
- See Tutorial – Feature Extraction from Text
- Step 1: Select a Feature Extraction Method
- In sha Allah, I will use Word Uni-gram Features to transform Sample Data into Numerical Representation
- Feature = Word Uni-gram
- Feature Weight = Frequency Count of a Word in a Tweet
- Maximum Features = 10
- For details on Feature Extraction from Text using N-gram Models
- See Tutorial – Feature Extraction from Text
- Step 2: Extract Features from Input using the Feature Extraction Method selected in Step 1
- After Feature Extraction, Input is transformed into Numerical Representation
- See emotion-sample-data-encoded.csv File in Supporting Material
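The Feature Extraction settings above (Word Uni-gram Features, Frequency Count as Feature Weight, a cap on the number of Features) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the tokenization rule, the tie-breaking rule, the function names and the three example tweets are all illustrative, and a library vectorizer may select or order Features differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into Word Uni-grams."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts, max_features=10):
    """Keep the max_features most frequent words (ties broken alphabetically)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Store the selected Features in alphabetical order
    return sorted(word for word, _ in ranked[:max_features])


def to_feature_vector(text, vocabulary):
    """Feature Weight = Frequency Count of each vocabulary word in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical tweets, not taken from the Sample Data
tweets = ["What an amazing amazing band", "The band fell apart", "I agree"]
# max_features=5 keeps the printed vectors short; the text above uses 10
vocabulary = build_vocabulary(tweets, max_features=5)
feature_vectors = [to_feature_vector(tweet, vocabulary) for tweet in tweets]
print(vocabulary)       # ['agree', 'amazing', 'an', 'apart', 'band']
print(feature_vectors)  # [[0, 2, 1, 0, 1], [0, 0, 0, 1, 1], [1, 0, 0, 0, 0]]
```

The vocabulary is built once from the Sample Data and then reused unchanged, so every instance is represented with the same Attributes in the same order.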
- Recap – Original Sample Data
- Recap – Sample Data in Numerical Representation
- Step 5: Select Suitable Machine Learning Algorithms
- Previous studies have shown that Good Starting Points for Classification Problems are
- Support Vector Classifier
- Naïve Bayes
- Random Forest Classifier
- Gradient Boosting Classifier
- Step 6: Split Sample Data into Training Data and Testing Data
- Use Standard Practice for Data Split
- i.e. Train-Test Split Ratio of
- 67% – 33%
- In our Corpus / Dataset
- Total Instances = 15
- Splitting Data into Training and Testing
- Training Data = 10
- Testing Data = 5
- Training Data
- See emotion-training-data-encoded.csv File in Supporting Material
- Testing Data
- See emotion-testing-data-encoded.csv File in Supporting Material
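The 67% – 33% Train-Test Split described in Step 6 can be sketched as follows. The helper name, the fixed random seed and the use of instance numbers 1 to 15 as stand-ins for the real instances are assumptions made for illustration; the actual split in the Supporting Material files may contain different instances.

```python
import random


def split_train_test(instances, test_ratio=0.33, seed=42):
    """Shuffle the instances and hold out test_ratio of them as Testing Data."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Instance numbers 1..15 stand in for the 15 instances of the Sample Data
instances = list(range(1, 16))
training_data, testing_data = split_train_test(instances)
print(len(training_data), len(testing_data))  # 10 5
```

Shuffling before splitting avoids a Testing Data that only contains the last instances of the file, which matters when the Corpus is sorted by class.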
- Step 7: Select Suitable Evaluation Measure(s)
- I will use Accuracy Evaluation Measure to evaluate the performance of the Model
- Accuracy
- Accuracy is defined as the proportion of correctly classified instances
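The definition of Accuracy above translates directly into code. This is a minimal sketch; the function name and the five Actual and Predicted Values are hypothetical and are not the results reported for the Model in this Chapter.

```python
def accuracy(actual_values, predicted_values):
    """Proportion of instances whose Predicted Value equals the Actual Value."""
    if not actual_values or len(actual_values) != len(predicted_values):
        raise ValueError("Both lists must be non-empty and of equal length")
    correct = sum(1 for actual, predicted in zip(actual_values, predicted_values)
                  if actual == predicted)
    return correct / len(actual_values)


# Hypothetical Actual and Predicted Values for 5 test instances
actual_values = ["Joy", "Anger", "Sadness", "Optimism", "Fear"]
predicted_values = ["Joy", "Anger", "Joy", "Optimism", "Trust"]
print(accuracy(actual_values, predicted_values))  # 0.6 (3 of 5 correct)
```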
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Note that our aim is to
- Learn an Input-Output Function
- General Settings – Learning Input-Output Function
- Training Phase
- Testing Phase
- Predictions Returned by the Model (h)
- Calculating Accuracy
- To calculate Accuracy, we will compare
- Actual Values with Predicted Values
- Note
- To explain the calculations more clearly, I have converted Numerical Predicted Values to Categorical Predicted Values
- Step 9: Analyze Results
- Assumption for this Example
- Here, I am assuming that the Model performed well on large Test Data and we can deploy it in the real-world
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Model is deployed in Real-world to make predictions on Real-time Data
- Steps – Make Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector of the unseen instance
- Step 4: Return Prediction to the User
- Example – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly same as Feature Vectors of Training and Testing Data
- Feature Vector
- <0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
- Note that the order of Attributes in both Training and Testing Data was
- <18, activists, agree, amazing, apart, atsu, baloch, band, basically, battle>
- Similarly, the order of Attributes in the unseen instance is exactly the same as in Training and Testing Data
- Step 3: Apply Model on the Feature Vector of unseen instance
- Model (h) is applied on <0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
- Step 4: Return Prediction to the User
- Optimism
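Step 2 of the Application Phase (converting User Input into a Feature Vector with exactly the same Attributes, in the same order, as the Training and Testing Data) can be sketched as below. The Attribute order is the one listed in the example above; the tokenization rule, the function name and both example inputs are assumptions made for illustration, and applying the Model itself is left out.

```python
import re
from collections import Counter

# Order of Attributes used in the Training and Testing Data (from the example above)
FEATURE_ORDER = ["18", "activists", "agree", "amazing", "apart",
                 "atsu", "baloch", "band", "basically", "battle"]


def user_input_to_feature_vector(text, feature_order=FEATURE_ORDER):
    """Count how often each training-time Attribute occurs in the User Input."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [counts[feature] for feature in feature_order]


# Hypothetical User Inputs
print(user_input_to_feature_vector("Feeling hopeful about tomorrow"))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  (none of the 10 Attributes occur)
print(user_input_to_feature_vector("That band was amazing, I agree"))
# [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
```

Words that were not selected as Features during training are simply ignored, which is why an input can map to an all-zero Feature Vector.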
- Feedback Phase
- A Two Step Process
- Step 1: After some time, take Feedback from Domain Experts and Users on the deployed Emotion Prediction System
- Step 2: Make a List of Possible Improvements based on Feedback received
- Step 11: Improve Emotion Prediction System based on Feedback
- Go to Step 1 and improve the Emotion Prediction System based on the List of Possible Improvements made in Step 10
Text Summarization System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Text Summarization Problem
- Task
- Develop a Text Summarization system to automatically generate (predict) a summary of an Urdu News Article
- Input
- An Urdu News Article
- Output
- Summary
- Treated as a
- Supervised Machine Learning Problem
- Goal
- Learn an Input-Output Function
- i.e. Learn from Input to predict Output
- Important Note
- Be careful in the use of terms
- Example
- Term 01
- A News Article
- Term 02
- An Urdu News Article
- Term 03
- An Urdu News Article on Science and Technology
- Term 04
- An Urdu News Article on Hazrat Jalal.ud.Din Romi R.A.
- Remarks
- Term 01 is very broad
- Term 02 is broad
- Term 03 is specific
- Term 04 is very specific
- Text Summarization is a Sequence to Sequence Problem
- Text Summarization is a Sequence to Sequence Problem because
- Input is Unstructured and of variable length
- Output is Unstructured and of variable length
- Text Summarization – Input and Output
- Input
- Unstructured (Text) – An Urdu News Article
- Output
- Unstructured (Text) – Summary
- Note
- Length of Input is much greater than the Length of Output
- Research Focus – Text Summarization System
- Research Focus
- Develop a Text Summarization system to automatically generate summary of an Urdu news article
- Steps – Treating Text Summarization as a Sequence to Sequence Problem
- In sha Allah, I will use the following steps to treat Text Summarization Problem as a Sequence to Sequence Problem
- Step 1: Decide the Learning Settings
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Step 1: Decide the Learning Setting
- In sha Allah, I aim to treat Text Summarization Problem as a
- Supervised Machine Learning Problem
- Since both Input and Output are Unstructured (Text) and of variable length, it will be treated as a
- Sequence to Sequence Problem
- Step 2: Obtain Sample Data
- Since I am treating Text Summarization Problem as a Supervised Learning Problem, I will need
- Annotated Data
- For more accurate learning, I need
- Large amount of Annotated Data
- High-quality Annotated Data
- Balanced Data
- Note
- For simplicity, In sha Allah I will use a toy Corpus / Dataset of 15 instances only
- Two Main Choices to Obtain Sample Data
- Use an Existing Corpus
- Develop Your Own Corpus
- Since there is a benchmark Corpus / Dataset available for Urdu Text Summarization
- I will use the existing Corpus / Dataset called
- Urdu Text Summarization Corpus
- We obtained Sample Data of 15 instances
- See summarization-sample-data.xlsx File in Supporting Material
- Note
- To save space, I am showing only one instance from the Sample Data below
- The next Slide contains the set of 15 instances in the Sample Data
- To save space, I am only showing the Summary and not the Urdu News Article
- Complete Sample Data is given in
- summarization-sample-data.xlsx File in Supporting Material
- Step 3: Understand and Pre-process data
- Understanding Data
- The Sample Data contains two Attributes
- Urdu News Article
- Summary
- Separating Input from Output
- Input comprises one Attribute
- An Urdu News Article
- Output comprises one Attribute
- Summary
- Pre-processing Data
- Sample Data is already pre-processed
- Therefore, no pre-processing is needed
- Step 4: Represent Data in Machine Understandable Format
- Deep Learning ML Algorithms (implemented in Keras or PyTorch) can understand data in
- Numerical Representations
- Problem
- Our Sample Data is in Textual form
- Therefore, Deep Learning ML Algorithms cannot understand it
- A Possible Solution
- Use Word Embedding Techniques to transform
- Textual Data into Numerical Representation
- Popular Word Embedding Techniques used in Deep Learning ML Algorithms are
- Word2Vec
- GloVe
- FastText
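To make the idea of a Word Embedding concrete, the sketch below maps each word to a small vector of numbers through a lookup table. It is only a stand-in under loud assumptions: the vectors are randomly initialised rather than learned by Word2Vec, GloVe or FastText, the vocabulary, dimension and function names are hypothetical, and English placeholder words are used instead of Urdu text.

```python
import random

# Hypothetical Word to Index Mapping (index 0 is reserved for unknown words)
word_to_index = {"<unk>": 0, "team": 1, "won": 2, "match": 3}


def build_embedding_matrix(vocab_size, dim=4, seed=0):
    """One row of `dim` numbers per word; random here, learned in practice."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(tokens, word_to_index, embedding_matrix):
    """Replace every token with its vector; unknown words share row 0."""
    return [embedding_matrix[word_to_index.get(token, 0)] for token in tokens]


embedding_matrix = build_embedding_matrix(len(word_to_index))
vectors = embed(["team", "won", "final"], word_to_index, embedding_matrix)
print(len(vectors), len(vectors[0]))  # 3 4 (3 tokens, 4 numbers per token)
```

With pre-trained embeddings, only the content of the matrix changes; the lookup step stays the same.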
- Step 5: Select Suitable Machine Learning Algorithms
- Previous studies have shown that Good Starting Points for Sequence to Sequence Problems (considering Textual Data) are
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Gated Recurrent Units (GRU)
- BI-GRU
- Step 6: Split Sample Data into Training Data and Testing Data
- Use Standard Practice for Data Split
- i.e. Train-Test Split Ratio of
- 67% – 33%
- In our Sample Data
- Total Instances = 15
- Splitting Sample Data into Training Data and Testing Data
- Training Data = 10
- Testing Data = 5
- Training Data
- Complete Training Data is given in
- summarization-training-data.xlsx File in Supporting Material
- Testing Data
- Complete Testing Data is given in
- summarization-testing-data.xlsx File in Supporting Material
- Step 7: Select Suitable Evaluation Measure(s)
- Two Choices to Evaluate Text Summarization System
- Manual Approach
- Automatic Approach
- Manual Approach
- A human (Domain Expert) will manually judge the quality of summary automatically generated by Text Summarization System
- Strengths
- Evaluation will be very accurate and of high-quality
- Weaknesses
- It is practically not possible to manually evaluate thousands of summaries
- Automatic Approach
- A program will automatically judge the quality of summary automatically generated by Text Summarization System
- Strengths
- You can quickly and easily evaluate very large Test Data
- Weaknesses
- Evaluation will not be very accurate and of high-quality
- Evaluating Urdu Text Summarization System
- In sha Allah, I will use Automatic Approach to evaluate the performance of Urdu Text Summarization System
- ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems
- In sha Allah, I will use following three metrics of ROUGE to evaluate Urdu Text Summarization System
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Note
- To understand the working of ROUGE-L, ROUGE-1 and ROUGE-2 metrics
- See Tutorial – Evaluating Sequence to Sequence Models using ROUGE
- To summarize
- Average F1 scores will be reported for ROUGE-1, ROUGE-2 and ROUGE-L metrics
- Note
- For details on F1 measure
- See Lecture 13 – Evaluating Hypothesis (Model)
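As a rough illustration of how F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L can be computed, the sketch below counts overlapping n-grams (ROUGE-1 and ROUGE-2) and the longest common subsequence (ROUGE-L) between one generated summary and one reference summary. It is a simplified sketch, not the official ROUGE toolkit: it assumes whitespace tokenization, a single reference, no stemming, and a plain F1 (equal weight on Precision and Recall). The function names and the English example sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(overlap, candidate_total, reference_total):
    """F1 from an overlap count; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1_score(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return f1_score(lcs_length(cand, ref), len(cand), len(ref))


def average(scores):
    """Average F1 over all test instances."""
    return sum(scores) / len(scores)


# Hypothetical reference and generated summaries (English placeholders)
reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print(round(rouge_n_f1(generated, reference, 1), 4))  # 0.8333
print(round(rouge_n_f1(generated, reference, 2), 4))  # 0.6
print(round(rouge_l_f1(generated, reference), 4))     # 0.8333
```

To report results for a whole Testing Data, each metric is computed per instance and the per-instance F1 scores are passed to `average`.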
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Note that our aim is to
- Learn an Input-Output Function
- General Settings – Learning Input-Output Function
- Training Phase
- Testing Phase
- Testing Phase
- Use Testing Data to compute Error in Model using ROUGE-1, ROUGE-2 and ROUGE-L metrics
- Report Average F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L metrics
- Predictions Returned by Model on Test Data
- Calculating Average F1 Scores for ROUGE-1, ROUGE-2 and ROUGE-L
- Step 9: Analyze Results
- Assumption for this Example
- Here, I am assuming that the Model performed well on large Testing Data and we can deploy it in the real-world
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Model is deployed in Real-world to make predictions on Real-time Data
- Steps – Make Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Example – Generating Text Summary of Real-time Data
- Step 1: Enter an Urdu News Article
- Step 2: Tokenize Text
- Step 3: Text Boundary
- Step 4: Word to Index Mapping
- Step 5: Word Embedding
- Step 6: Apply Model on Feature Vector
- Step 7: Predict Summary
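Steps 2 to 4 of the example above (Tokenize Text, Text Boundary, Word to Index Mapping) can be sketched as below. Several points are assumptions made for illustration: "Text Boundary" is interpreted here as adding start and end markers around the token sequence, tokens are split on whitespace, sequences are padded to a fixed length, and the marker names, function names and single-letter placeholder "words" are hypothetical.

```python
START, END, PAD, UNK = "<start>", "<end>", "<pad>", "<unk>"


def tokenize(text):
    """Step 2: split the text into tokens (whitespace tokenization assumed)."""
    return text.split()


def add_boundaries(tokens):
    """Step 3: mark where the text begins and ends."""
    return [START] + tokens + [END]


def build_word_index(training_texts):
    """Word to Index Mapping learned from the Training Data only."""
    word_to_index = {PAD: 0, UNK: 1, START: 2, END: 3}
    for text in training_texts:
        for token in tokenize(text):
            word_to_index.setdefault(token, len(word_to_index))
    return word_to_index


def to_indices(text, word_to_index, max_length):
    """Step 4: map tokens to indices, then pad (or truncate) to max_length."""
    tokens = add_boundaries(tokenize(text))[:max_length]
    indices = [word_to_index.get(token, word_to_index[UNK]) for token in tokens]
    return indices + [word_to_index[PAD]] * (max_length - len(indices))


# Placeholder "words" a, b, c, d stand in for Urdu tokens
word_to_index = build_word_index(["a b c", "b d"])
print(to_indices("b x a", word_to_index, max_length=7))
# [2, 5, 1, 4, 3, 0, 0]  ("x" was never seen in training, so it maps to <unk>)
```

The index sequence produced here is what Step 5 (Word Embedding) turns into vectors before the Model is applied.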
- Feedback Phase
- A Two Step Process
- Step 1: After some time, take Feedback from Domain Experts and Users on the deployed Text Summarization System
- Step 2: Make a List of Possible Improvements based on Feedback received
- Step 11: Improve Text Summarization System based on Feedback
- Go to Step 1 and improve the Text Summarization System based on the List of Possible Improvements made in Step 10
Machine Translation System – Treating a Real-world Problem as a Supervised Machine Learning Problem
- Machine Translation Problem
- Task
- Develop a Machine Translation system for Urdu-English language pair to automatically translate (predict) Source Text (Urdu) into the Target Language (English)
- Input
- A Source Text (Urdu)
- Output
- Translation of Source Text in Target Language (English)
- Treated as a
- Supervised Machine Learning Problem
- Goal
- Learn an Input-Output Function
- i.e. Learn from Input to predict Output
- Machine Translation is a Sequence to Sequence Problem
- Machine Translation is a Sequence to Sequence Problem because
- Input is Unstructured and of variable length
- Output is Unstructured and of variable length
- Machine Translation – Input and Output
- Input
- Unstructured (Source Text in Urdu)
- Output
- Unstructured (Translated Text in English)
- Note
- Length of Input is almost same as the Length of Output
- Recall
- In Text Summarization
- Difference in Lengths of Input and Output was quite high
- Conclusion
- Completely and correctly understand the Input and Output before treating a Real-world Problem as a Machine Learning Problem
- Research Focus – Machine Translation System
- Research Focus
- Develop a Machine Translation system for Urdu-English language pair to automatically translate Source Text (Urdu) into the Target Language (English)
- Steps – Treating Machine Translation as a Sequence to Sequence Problem
- In sha Allah, I will use the following steps to treat Machine Translation Problem as a Sequence to Sequence Problem
- Step 1: Decide the Learning Settings
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Step 1: Decide the Learning Setting
- In sha Allah, I aim to treat Machine Translation Problem as a
- Supervised Machine Learning Problem
- Since both Input and Output are Unstructured (Text) and of variable length, it will be treated as a
- Sequence to Sequence Problem
- Step 2: Obtain Sample Data
- Since I am treating Machine Translation Problem as a Supervised Learning Problem, I will need
- Annotated Data
- For more accurate learning, I need
- Large amount of Annotated Data
- High-quality Annotated Data
- Balanced Data
- Note
- For simplicity, In sha Allah I will use a toy Corpus / Dataset of 15 instances only
- Two Main Choices to Obtain Data
- Use an Existing Corpus
- Develop Your Own Corpus
- Since there is a benchmark Corpus / Dataset available for Urdu Machine Translation
- I will use the existing Corpus / Dataset
- MT-UE-20 Corpus
- We obtained Sample Data of 15 instances
- See mt-sample-data.xlsx File in Supporting Material
- Sample Data
- Step 3: Understand and Pre-process data
- Understanding Data
- Sample Data contains two Attributes
- Source Text (Urdu)
- Target Text (Translation of Source Text in English)
- Separating Input from Output
- Input comprises one Attribute
- Source Text (Urdu)
- Output comprises one Attribute
- Target Text (English)
- Pre-processing Data
- Sample Data is already pre-processed
- Therefore, no pre-processing is needed
- Step 4: Represent Data in Machine Understandable Format
- Statistical Machine Translation (SMT) Techniques and Neural Machine Translation (NMT) Techniques can understand data in
- Numerical Representation
- Problem
- Our Sample Data is in Textual form
- Therefore, SMT and NMT Techniques cannot understand it
- Possible Solutions
- Statistical Machine Translation (SMT) Technique
- Use Probabilities of Words / Phrases to align Words / Phrases for Machine Translation
- Neural Machine Translation Techniques
- Use Word Embedding Techniques to transform
- Textual Data into Numerical Representation
- Popular Word Embedding Techniques are
- Word2Vec
- GloVe
- FastText
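For the Statistical Machine Translation option listed above (using Probabilities of Words to align Words), one classic way to estimate word translation probabilities is the Expectation-Maximization procedure of IBM Model 1. The sketch below is a simplified version under stated assumptions: it has no NULL word, it handles single words rather than phrases, and it uses a hypothetical toy parallel corpus of three romanized Urdu and English sentence pairs that is not taken from the MT-UE-20 Corpus. Real SMT systems add phrase extraction, a language model and a decoder on top of such probabilities.

```python
from collections import defaultdict


def train_word_translation_probabilities(parallel_corpus, iterations=10):
    """Estimate t(target_word | source_word) with IBM Model 1 style EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in parallel_corpus]
    target_vocab = {word for _, tgt in pairs for word in tgt}
    # Start from a uniform probability for every co-occurring word pair
    t = {(e, f): 1.0 / len(target_vocab)
         for src, tgt in pairs for f in src for e in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: re-estimate the probabilities from the expected counts
        for (e, f), value in count.items():
            t[(e, f)] = value / total[f]
    return t


# Hypothetical toy parallel corpus (romanized Urdu -> English)
toy_corpus = [
    ("yeh ghar", "this house"),
    ("yeh kitab", "this book"),
    ("aik kitab", "a book"),
]
t = train_word_translation_probabilities(toy_corpus)
best_for_kitab = max((e for e, f in t if f == "kitab"), key=lambda e: t[(e, "kitab")])
print(best_for_kitab)  # book
```

Even though no pair is labelled word by word, the probabilities sharpen because "kitab" co-occurs with "book" in two sentences while "this" is better explained by "yeh".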
- Step 5: Select Suitable Machine Learning Algorithms
- Previous studies have shown that Good Starting Points for Machine Translation are
- Statistical Machine Translation Techniques
- Neural Machine Translation Techniques
- Step 6: Split Sample Data into Training Data and Testing Data
- Use Standard Practice for Data Split
- i.e. Train-Test Split Ratio of
- 67% – 33%
- In our dataset / corpus
- Total Instances = 15
- Splitting Sample Data into Training Data and Testing Data
- Training Data = 10
- Testing Data = 5
- Training Data
- See mt-training-data.xlsx File in Supporting Material
- Testing Data
- See mt-testing-data.xlsx File in Supporting Material
- Step 7: Select Suitable Evaluation Measure(s)
- Two Choices to Evaluate Machine Translation System
- Manual Approach
- Automatic Approach
- Manual Approach
- A human (Domain Expert) will manually judge the quality of a translation automatically generated by Machine Translation System
- Strengths
- Evaluation will be very accurate and of high-quality
- Weaknesses
- It is practically not possible to manually evaluate thousands of translations
- Automatic Approach
- A program will automatically judge the quality of translation automatically generated by Machine Translation System
- Strengths
- You can quickly and easily evaluate very large Test Data
- Weaknesses
- Evaluation will not be very accurate and of high-quality
- Evaluating Machine Translation System
- In sha Allah, I will use Automatic Approach to evaluate the performance of Machine Translation system
- BLEU is a de facto standard to automatically evaluate the performance of Machine Translation systems
- In sha Allah, I will use following four metrics of BLEU to evaluate Urdu Machine Translation System
- BLEU-1
- BLEU-2
- BLEU-3
- BLEU-4
- Note
- To understand the working of BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics
- See Tutorial – Evaluating Sequence to Sequence Models using BLEU
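As a rough illustration of BLEU-1 to BLEU-4, the sketch below combines clipped (modified) n-gram precisions with a brevity penalty for one generated translation and one reference translation. It is a simplified, sentence-level sketch under stated assumptions: whitespace tokenization, a single reference, uniform weights over the n-gram orders and no smoothing, so the score is 0 whenever any n-gram order has no match. Standard BLEU is normally computed at corpus level. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate_text, reference_text, max_n):
    """BLEU-max_n: brevity penalty times the geometric mean of precisions."""
    candidate, reference = candidate_text.split(), reference_text.split()
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Penalize translations that are shorter than the reference
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


# Hypothetical reference and generated translations
reference = "the cat sat on the mat"
generated = "the cat sat on a mat"
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(generated, reference, n):.4f}")
# BLEU-1: 0.8333
# BLEU-2: 0.7071
# BLEU-3: 0.6300
# BLEU-4: 0.5373
```

The scores drop from BLEU-1 to BLEU-4 because longer n-grams are harder to match; a single wrong word ("a" instead of "the") breaks every n-gram that contains it.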
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Note that our aim is to
- Learn an Input-Output Function
- General Settings – Learning Input-Output Function
- Training Phase
- The Table below shows how Statistical Machine Translation Technique works to learn a Model
- Testing Phase
- Use Testing Data to compute Error in Model using BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics
- Predictions returned by the Model
- Automatic Evaluation using BLEU
- Step 9: Analyze Results
- Assumption for this Example
- Here, I am assuming that the Model performed well on large Testing Data and we can deploy it in the real-world
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Model is deployed in Real-world to make predictions on Real-time Data
- Steps – Make Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Numerical Representation
- Exactly same as Numerical Representation of Training and Testing Data
- Step 3: Apply Model on the Numerical Representation
- Step 4: Return Prediction to the User
- Example – Making Prediction on Real-time Data
- Table below gives a Step by Step example to automatically translate an unseen instance using Statistical Machine Translation Technique
- Feedback Phase
- A Two Step Process
- Step 1: After some time, take Feedback from Domain Experts and Users on the deployed Machine Translation System
- Step 2: Make a List of Possible Improvements based on Feedback received
- Step 11: Improve Machine Translation System based on Feedback
- Go to Step 1 and improve the Machine Translation System based on the List of Possible Improvements made in Step 10
Chapter Summary
In this Chapter, I presented the following main concepts:
- A Real-world Problem can be treated as a Machine Learning Problem using the following Step by Step approach
- Step 1: Decide the Learning Setting
- Step 2: Obtain Sample Data
- Step 3: Understand and Pre-process Sample Data
- Step 4: Represent Sample Data in Machine Understandable Format
- Step 5: Select Suitable Machine Learning Algorithms
- Step 6: Split Sample Data into Training Data and Testing Data
- Step 7: Select Suitable Evaluation Measure(s)
- Step 8: Execute First Two Phases of Machine Learning Cycle
- Training Phase
- Testing Phase
- Step 9: Analyze Results
- Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
- Application Phase
- Feedback Phase
- Step 11: Based on Feedback
- Go to Step 1 and Repeat all the Steps
- Three main Learning Settings are
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- What Type of Data should be obtained depends upon
- Learning Setting you selected in Step 1
- Two Main Choices to Obtain Sample Data are
- Use Existing Corpora / Datasets
- Develop your Own Corpora / Datasets
- If (Corpora / Datasets exist for your Research Problem)
- Then
- Use existing Corpora / Datasets
- Else
- You will need to develop your own Corpora / Datasets
- Very often, Machine Learning Algorithms understand Data represented in the form of
- Attribute-Value Pair
- We should consider the following main points when choosing suitable Machine Learning Algorithms for our Machine Learning Problem
- Type of Machine Learning Problem
- Number of Parameters
- Size of Training and Testing Data
- Number of Features
- Training and Testing Time
- Accuracy
- Speed and Accuracy in Application Phase
- Machine Learning Algorithms are designed to solve specific Machine Learning Problems
- Two Important Points to Know
- Complete and correct understanding of the Type of Machine Learning Problem you are trying to solve using Machine Learning Algorithms
- In previous studies, what Machine Learning Algorithms have proven to be most effective for the Type of Machine Learning Problem you are solving?
- Good Starting Points for Classification Problems
- Feature-based ML Algorithms
- For Textual Data
- Random Forest
- Support Vector Machine
- Logistic Regression
- Naïve Bayes
- Gradient Boost
- For Image / Video Data
- Support Vector Machine
- Regular Neural Networks
- Logistic Regression
- Naive Bayes
- Extreme Learning Machines
- Random Forest
- Extreme Gradient Boost
- Type II Approximate Reasoning
- For Audio Data
- Connectionist Temporal Classification
- Deep Learning ML Algorithms
- For Textual Data
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Gated Recurrent Units (GRU)
- BI-GRU
- For Image / Video Data
- Convolutional Neural Networks (most popular)
- For Audio Data
- Recurrent Neural Networks (RNN)
- Good Starting Points for Regression Problems
- Feature-based ML Algorithms
- Linear Regression
- Regression Trees
- Lasso Regression
- Multivariate Regression
- Good Starting Points for Sequence to Sequence Problems
- For Textual Data
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- BI-LSTM
- Gated Recurrent Units (GRU)
- BI-GRU
- For Image / Video Data
- Convolutional Neural Networks
- For Audio Data
- Recurrent Neural Networks (RNN)
- Good Starting Points for Unsupervised Learning Problems
- Feature based ML Algorithms
- For Textual Data
- K-Means
- Agglomerative Hierarchical Clustering
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- For Image / Video Data
- K-Means
- Fuzzy C Means
- Deep Learning ML Algorithms
- For Image / Video Data
- Generative Adversarial Networks
- Auto-encoders
- Good Starting Points for Semi-supervised Learning Problems
- Feature based ML Algorithms
- Label Spreading Algorithm
- Question
- Which Machine Learning Algorithm(s) is best for a specific Machine Learning Problem?
- Answer
- Apply all available Machine Learning Algorithms and see which performs best
- Problem
- It requires a lot of effort, time and resources to
- Apply all available Machine Learning Algorithms and find the best one
- A Possible Solution
- Start with Good Starting Points
- Machine Learning Experts say that following Machine Learning Algorithms are Good Starting Points
- Feature based ML Algorithms
- For Structured / Unstructured / Semi-structured Data
- Support Vector Machine
- Logistic Regression
- Deep Learning ML Algorithms
- For Textual Data
- Recurrent Neural Network (RNN)
- For Image / Video Data
- Convolutional Neural Network (CNN)
- For Audio Data
- Recurrent Neural Network (RNN)
- An ML Algorithm’s behavior is affected by
- Number of Parameters
- Size of Training Data
- Size of Training Data plays a very important role in the Selection of Suitable ML Algorithms
- Feature-based ML Algorithms
- Feature based ML Algorithms (a.k.a. Classical ML Algorithms) can be accurately trained, even if the Training Data is small
- Deep Learning ML Algorithms
- To accurately train Deep Learning Algorithms, a huge amount of Training Data is required
- Size of Testing Data
- Size of Testing Data plays a very important role when evaluating a Machine Learning Algorithm
- To deploy a Model in the Real-world (Application Phase), it should fulfill the following two conditions
- Model should perform well (Condition 01) on large Test Data (Condition 02)
- Number of Features used to train a Model has a significant impact on the performance of the Model
- Selection of the most discriminating Features is important to get good results
- In Text / Image / Video / Genetic Corpora / Datasets
- Number of Features is very high compared to the Number of Instances in a Corpus / Dataset
- Two popular and widely used approaches to reduce the Number of Features in a Corpus / Dataset are
- Feature Reduction
- Feature Reduction (a.k.a. Dimensionality Reduction) is a process which transforms Features into a lower dimension
- Feature Selection
- Feature Selection is the process of selecting the most discriminating (or important) subset of Features (excluding redundant or irrelevant Features) from the Original Set of Features (without changing them)
- Popular Methods for Feature Reduction are
- Principal Component Analysis
- Generalized Discriminant Analysis
- Auto-encoders
- Non-negative Matrix Factorization
- Popular Methods for Feature Selection are
- Wrapper Methods
- Filter Methods
- Feature Extraction
- Creates new Features
- Feature Selection
- Selects a subset of Features from the Original Set of Features
- Given a Corpus / Dataset
- First carry out
- Feature Extraction then
- Feature Reduction / Feature Selection
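The order above (Feature Extraction first, then Feature Selection) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the four documents, their emotion labels, the bag-of-words extraction, and the scoring rule (absolute difference between class means, a very simple Filter Method for two classes) are all hypothetical choices made for this example, not a method prescribed by the lecture.

```python
from collections import Counter

def extract_features(documents):
    """Feature Extraction: create new Features (word counts) from raw text."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

def select_features(vocabulary, vectors, labels, k):
    """Feature Selection (simple filter): keep the k Features whose mean
    value differs most between the two classes. Assumes exactly two classes."""
    classes = sorted(set(labels))

    def class_mean(column, cls):
        values = [row[column] for row, label in zip(vectors, labels) if label == cls]
        return sum(values) / len(values)

    scores = [abs(class_mean(j, classes[0]) - class_mean(j, classes[1]))
              for j in range(len(vocabulary))]
    # Rank Feature indices by score, keep the top k, restore original order.
    keep = sorted(range(len(vocabulary)), key=lambda j: scores[j], reverse=True)[:k]
    keep.sort()
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_vectors = [[row[j] for j in keep] for row in vectors]
    return kept_vocabulary, kept_vectors

# Hypothetical toy corpus for an emotion prediction task
documents = ["happy happy day", "happy sunny day", "sad rainy day", "sad sad day"]
labels = ["joy", "joy", "sadness", "sadness"]

vocabulary, vectors = extract_features(documents)
selected_vocabulary, selected_vectors = select_features(vocabulary, vectors, labels, k=2)
print(vocabulary)           # all extracted Features
print(selected_vocabulary)  # the subset kept by Feature Selection
```

Note that the selected Features are an unchanged subset of the extracted ones; a Feature Reduction method such as Principal Component Analysis would instead transform them into new, lower-dimensional Features.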
- Training and Testing Time mainly depend upon two main factors
- Size of Training and Testing Data
- Target Accuracy
- Training Time of Deep Learning ML Algorithms is quite high compared to Feature based ML Algorithms
- The Target Accuracy may differ from Machine Learning Problem to Machine Learning Problem
- Speed and Accuracy requirements in the Application Phase may vary from Machine Learning Problem to Machine Learning Problem
- Standard Practice for Splitting Sample Data
- Use a Train-Test Split Ratio of
- 67% – 33%
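A 67% – 33% Train-Test Split can be sketched as follows. This is a minimal sketch using only the Python standard library; the function name `train_test_split`, the fixed `seed`, and the 100 placeholder instances are assumptions made for illustration.

```python
import random

def train_test_split(instances, train_ratio=0.67, seed=42):
    """Shuffle a copy of the Sample Data and split it into Training and Testing Data."""
    shuffled = list(instances)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

sample_data = list(range(100))  # hypothetical Sample Data: 100 instance ids
train_data, test_data = train_test_split(sample_data)
print(len(train_data), len(test_data))
```

Shuffling before the cut matters when the Sample Data is stored in some order (for example, sorted by class), because otherwise the Testing Data may not resemble the Training Data.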
- Selection of Suitable Evaluation Measure(s) is important to
- correctly evaluate the performance of a Model
- Selection of Suitable Evaluation Measure(s) mainly depends on
- Type of Machine Learning Problem
- Some of the most popular and widely used Evaluation Measures for Classification Problems are
- Baseline Accuracy (a.k.a. Most Common Categorization (MCC))
- Accuracy
- True Negative Rate or Specificity
- False Positive Rate
- False Negative Rate
- Recall or True Positive Rate or Sensitivity
- Precision or Positive Predictive Value
- F1
- Area Under the Curve (AUC)
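Most of the Classification measures listed above follow directly from the four confusion-matrix counts (TP, TN, FP, FN). The sketch below computes them for a binary problem; the label lists are made-up values for illustration, and AUC is left out because it needs ranked prediction scores rather than hard labels.

```python
from collections import Counter

def classification_report(actual, predicted, positive=1):
    """Compute common binary Classification measures from actual and predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        # Accuracy obtained by always predicting the most common class
        "baseline_accuracy": Counter(actual).most_common(1)[0][1] / len(actual),
        "accuracy": ratio(tp + tn, len(pairs)),
        "true_positive_rate": recall,               # Recall / Sensitivity
        "true_negative_rate": ratio(tn, tn + fp),   # Specificity
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical Testing Data labels and Model predictions
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
report = classification_report(actual, predicted)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Comparing Accuracy against Baseline Accuracy is a quick sanity check: a Model that does not beat the most-common-class baseline has learned little that is useful.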
- Some of the most popular and widely used Evaluation Measures for Regression Problems are
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
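The Regression measures above can be computed with a few lines of standard-library Python. The GPA values below are invented for illustration, and `num_features` (the number of input Features the Model used, needed for Adjusted R²) is an assumed parameter name.

```python
import math

def regression_report(actual, predicted, num_features):
    """Compute MAE, MSE, RMSE, R² and Adjusted R² for a Regression Model."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                      # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalizes R² for the number of Features used
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - num_features - 1)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse),
            "r2": r2, "adjusted_r2": adjusted_r2}

# Hypothetical actual and predicted GPAs for five students
actual_gpa    = [2.0, 2.5, 3.0, 3.5, 4.0]
predicted_gpa = [2.5, 2.5, 3.0, 3.0, 4.0]
scores = regression_report(actual_gpa, predicted_gpa, num_features=1)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are expressed in the same unit as the target (here, GPA points), which usually makes them easier to interpret than MSE.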
- Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bi-Lingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit Ordering)
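The core idea shared by ROUGE and BLEU is counting n-gram overlap between a system output and a reference. The sketch below is a simplified illustration only: it computes a ROUGE-N style recall and a BLEU-style clipped n-gram precision for a single reference, using whitespace tokenization. It is not the full BLEU score (which combines several n-gram orders and a brevity penalty) and does not cover METEOR; the sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=1):
    """Return (recall, precision) of n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram's count to the smaller of the two
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0        # ROUGE-N style
    precision = overlap / sum(cand.values()) if cand else 0.0   # BLEU-style
    return recall, precision

reference = "the cat sat on the mat"   # hypothetical human-written summary
candidate = "the cat sat"              # hypothetical system output

unigram_recall, unigram_precision = ngram_overlap(candidate, reference, n=1)
bigram_recall, bigram_precision = ngram_overlap(candidate, reference, n=2)
print(unigram_recall, unigram_precision)
print(bigram_recall, bigram_precision)
```

The short candidate scores perfectly on precision but poorly on recall, which is why recall-oriented measures suit Text Summarization while full BLEU adds a brevity penalty for Machine Translation.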
- Recall the Equation
- Training Phase
- Use Training Data to build the Model
- Testing Phase
- Use Testing Data to evaluate the performance of the Model
- i.e. calculate Error in the Model
- When analyzing results, remember the Machine Learning Assumption
- In Application Phase
- Deploy the Model in the Real-world to make predictions on unseen data
- In Feedback Phase
- Take Feedback from
- Domain Experts
- Users of the ML system
- Based on Feedback from Domain Experts and Users
- Improve your Model
- In this Lecture, we treated (Step by Step) the following four Real-world Problems as Machine Learning Problems
- GPA Prediction Problem
- Emotion Prediction Problem
- Text Summarization Problem
- Machine Translation Problem
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present a detailed discussion on
- Concept Learning and Hypothesis Representation