Ilm o Irfan

Technologies

Machine Learning

  • October 26, 2022

Table of Contents

Chapter 6 - Treating a Problem as a Machine Learning Problem – Step by Step Examples

Chapter Outline

  • Chapter Outline
  • Quick Recap
  • Steps – Treating a Real-world Problem as a Machine Learning Problem
  • GPA Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
  • Emotion Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem
  • Text Summarization System – Treating a Real-world Problem as a Supervised Machine Learning Problem
  • Machine Translation System – Treating a Real-world Problem as a Supervised Machine Learning Problem  
  • Chapter Summary

Quick Recap

  • Quick Recap –  Data and Annotations – Step by Step Example
  • The five main Steps for Data Annotation are as follows
    • Step 1: Completely and correctly understand the Real-world Problem
    • Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
      • If Yes
        • Go to Next Step
    • Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
      • Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
    • Step 4: Develop proposed Annotated Corpus at the Prototype Level
      • Record the problems that you faced in developing Annotated Corpus at prototype level
      • Write down Possible Solution(s) to handle the problems encountered in developing the Annotated Corpus at prototype level
    • Step 5: Develop proposed Annotated Corpus at full scale
  • When you create your proposed Annotated Corpus at prototype and / or full scale level, follow these main steps
  • Step 1: Raw Data Collection
    • Step 1.1: Data Cleaning (if needed)
    • Step 1.2: Data Pre-processing (if needed)
  • Step 2: Annotation Process
    • Step 2.1: Annotation Guidelines
    • Step 2.2: Annotations
    • Step 2.3: Inter-Annotator Agreement (IAA)
  • Step 3: Corpus Characteristics and Standardization
  • An authentic and appropriate Data Source is essential to develop a large Gold Standard Annotated Corpus
  • As discussed in Lecture 3 – Data and Annotations, the main types of Data Sources are
    1. Sources with Annotations
    2. Sources without Annotations
  • Sources with / without Annotations can be
    1. Online Digital Repositories
    2. Non-digital Repositories
    3. Existing Corpora
  • Supervised Machine Learning Problems can be broadly categorized into three main types
    1. Classification Problems
    2. Regression Problems
    3. Sequence to Sequence Problems
  • For each type of Machine Learning Problem
    • Suitable Machine Learning Algorithms may differ
  • Classification Problems – Input and Output
    • Input
      • Structured / Unstructured / Semi-structured
    • Output
      • Categorical
  • Regression Problems – Input and Output
    • Input
      • Structured / Unstructured / Semi-structured
    • Output
      • Numeric
  • Sequence to Sequence Problems – Input and Output
    • Input
      • Unstructured (of variable length)
    • Output
      • Unstructured (of variable length)
  • In this Lecture, we have discussed four Step by Step examples to create a Gold Standard Annotated Corpus
    • Data Sources with Annotations
      • Developing a Gold Standard Annotated Corpus for Urdu Text Summarization Task
      • Developing a Gold Standard Annotated Corpus for GPA Prediction Task
    • Data Sources without Annotations
      • Developing a Gold Standard Annotated Corpus for Emotion Prediction on Tweets
      • Developing a Gold Standard Annotated Corpus for Urdu-English Machine Translation Task

Steps - Treating a Real-world Problem as a Machine Learning Problem

  • Steps - Treating a Real-world Problem as a Machine Learning Problem
  • Follow these steps to treat a Real-world Problem as a Machine Learning Problem
    • Step 1: Decide the Learning Setting
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results

    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Step 1 - Decide the Learning Setting
  • Three main Learning Settings are
    1. Supervised Learning
    2. Unsupervised Learning
    3. Semi-supervised Learning
  • Step 2: Obtain Sample Data
  • What Type of Data should be obtained depends upon
    • Learning Settings you selected in Step 1
  • Supervised Learning requires
    • Annotated Data
  • Unsupervised Learning requires
    • Unannotated Data
  • Semi-supervised Learning requires
    • Semi-annotated Data
  • Two Main Choices to Obtain Sample Data
    1. Use Existing Corpora / Datasets
    2. Develop your Own Corpora / Datasets
  • Note

    • For details on how to develop a Gold Standard Annotated Corpus
      • See Lecture 04 – Data and Annotations – Step by Step Examples
  • Remember – Very Important
    • Machine Learning Approaches are Data Driven Approaches

  • In recent years, Research Community has made efforts to develop Gold Standard (or Benchmark) Corpora / Datasets by organizing
    • Shared Tasks
  • Apart from Shared Tasks
    • Researchers have made efforts to develop Gold Standard (or Benchmark) Corpora / Datasets for various tasks
  • For details on Shared Tasks
    • See Lecture 06 – A Template based Approach to Read a Research Paper (Research Methodology in I.T. Course)
    • URL:https://ilmoirfan.com/research-methodology-in-it/ch-methodology-in-it/
  • Some Popular Shared Tasks
    • Popular Shared Tasks in Natural Language Processing (NLP)
      • SemEval
        • URL: http://alt.qcri.org/semeval2020/index.php?id=tasks
      • PAN
        • URL: https://pan.webis.de/
      • Note that these Shared Tasks mainly focus on
        • Text Data
    • Popular Shared Tasks in Information Retrieval (IR)
      • TREC
        • URL: https://trec.nist.gov/pubs/call2020.html
      • Note that this Shared Task mainly focuses on
        • Text Data
        • Image Data
        • Video Data
    • Popular Shared Tasks in Image Processing (IP)
      • BraTS (Brain Tumor Segmentation Challenge)
        • URL: http://braintumorsegmtion.org/
      • SpaceNet
        • URL: https://www.grssieee.org/earthvision2020/challenge.html
      • AI City Challenge
        • URL: https://www.aicitychallenge.org/
      • Note that these Shared Tasks mainly focus on
        • Image Data
        • Video Data
  • Step 3: Understand and Pre-process Sample Data
  • For details on how to understand and pre-process data
    • See Lecture 03 – Data and Annotations
  • Step 4: Represent Sample Data in Machine Understandable Format
  • Machine Understandable Data Format
    • Representing Data in a Format, which a Learner (Machine Learning Algorithm) can use to learn
    • Very often, Machine Learning Algorithms understand Data represented in the form of
      • Attribute-Value Pair
  • For more details on Data Representation
    • See Lecture 03 – Data and Annotations
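As a minimal sketch of Step 4, the snippet below turns instances stored as Attribute-Value Pairs into numeric feature vectors. The attribute names (`matric_marks`, `degree`), the values, and the choice of one-hot encoding for the categorical attribute are assumptions made for illustration; they are not taken from the lecture's data.

```python
# Hypothetical instances stored as attribute-value pairs (not from the lecture data).
instances = [
    {"matric_marks": 850, "degree": "BS(CS)"},
    {"matric_marks": 920, "degree": "BS(SE)"},
    {"matric_marks": 780, "degree": "BS(CS)"},
]


def build_vocabulary(rows, attribute):
    """Collect the sorted distinct values of a categorical attribute."""
    return sorted({row[attribute] for row in rows})


def to_feature_vector(row, degree_values):
    """Turn one attribute-value instance into a list of numbers.

    The numeric attribute is copied as a float; the categorical
    attribute is one-hot encoded against the known values.
    """
    one_hot = [1.0 if row["degree"] == value else 0.0 for value in degree_values]
    return [float(row["matric_marks"])] + one_hot


degree_values = build_vocabulary(instances, "degree")
vectors = [to_feature_vector(row, degree_values) for row in instances]
print(degree_values)
print(vectors)
```

Every instance ends up with the same number of values in the same order, which is what a Learner typically expects.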
  • Step 5: Suitable ML Algorithms for Supervised Learning
  • Scikit-Learn Cheat Sheet is a Good Starting Point for Selecting Suitable ML Algorithms for a specific Machine Learning Problem
  • Problem
    • Scikit-Learn Cheat Sheet will not work for all situations
  • Solution
    • Build a deeper understanding of ML Algorithms
  • Important Question
    • How to choose suitable Machine Learning Algorithm(s) for your Machine Learning Problem?
  • A Possible Answer
    • Consider following main points when choosing suitable Machine Learning Algorithms for your Machine Learning Problem
  • Main Points to Consider
    1. Type of Machine Learning Problem
    2. Number of Parameters
    3. Size of Training and Testing Data
    4. Number of Features
    5. Training and Testing Time
    6. Accuracy
    7. Speed and Accuracy in Application Phase
  • Type of Machine Learning Problem
    • Machine Learning Algorithms are designed to solve specific Machine Learning Problems
    • Two Important Points to Know
      • Complete and correct understanding of the Type of Machine Learning Problem you are trying to solve using Machine Learning Algorithms
      • In previous studies, what Machine Learning Algorithms have proven to be most effective for the Type of Machine Learning Problem you are solving?
    • Three main types of Machine Learning Problems are
      • Supervised Learning
      • Unsupervised Learning (e.g. Clustering)
      • Semi-supervised Learning
  • Supervised Learning Problems
    • A Supervised Learning Problem may fall into one of the following three categories
      • Classification Problem
      • Regression Problem
      • Sequence to Sequence Problem
  • Good Starting Points for Classification Problems
    • Feature-based ML Algorithms
      • For Textual Data
        • Random Forest
        • Support Vector Machine
        • Logistic Regression
        • Naïve Bayes
        • Gradient Boost
      • For Image / Video Data
        • Support Vector Machine
        • Regular Neural Networks
        • Logistic Regression
        • Naive Bayes
        • Extreme Learning Machines
        • Random Forest
        • Extreme Gradient Boost
        • Type II Approximate Reasoning
      • For Audio Data
        • Connectionist Temporal Classification
    • Deep Learning ML Algorithms
      • For Textual Data
        • Recurrent Neural Networks (RNN)
        • Long Short-Term Memory (LSTM)
        • BI-LSTM
        • Gated Recurrent Units (GRU)
        • BI-GRU
      • For Image / Video Data
        • Convolutional Neural Networks (most popular)
      • For Audio Data
        • Recurrent Neural Networks (RNN)
  • Good Starting Points for Regression Problems
    • Feature-based ML Algorithms
      • Linear Regression
      • Regression Trees
      • Lasso Regression
      • Multivariate Regression
  • Good Starting Points for Sequence to Sequence Problems
    • For Textual Data
      • Recurrent Neural Networks (RNN)
      • Long Short-Term Memory (LSTM)
      • BI-LSTM
      • Gated Recurrent Units (GRU)
      • BI-GRU
    • For Image / Video Data
      • Convolutional Neural Networks
    • For Audio Data
      • Recurrent Neural Networks (RNN)
  • Unsupervised Learning Problems
  • Good Starting Points for Unsupervised Learning Problems
    • Feature based ML Algorithms
      • For Textual Data
        • K-Means
        • Agglomerative Hierarchical Clustering
        • Mean-Shift Clustering Algorithm
        • DBSCAN – Density-Based Spatial Clustering of Applications with Noise
        • EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
      • For Image / Video Data
        • K-Means
        • Fuzzy C Means
    • Deep Learning ML Algorithms
      • For Image / Video Data
        • Generative Adversarial Networks
        • Auto-encoders
  • Semi-supervised Learning Problems
  • Good Starting Points for Semi-Supervised Learning Problems
    • Feature based ML Algorithms
      • Label Spreading Algorithm
  • Good Starting Points in Machine Learning
    • Many Machine Learning Algorithms make use of linearity
      • For example, Linear Regression, Logistic Regression, Support Vector Machines etc.
    • Machine Learning Algorithms based on linearity are considered as a Good Starting Point
    • Two main characteristics of Machine Learning Algorithms based on linearity are
      • They are algorithmically simple
      • They are fast to train
  • Selection of Best Machine Learning Algorithm
  • Question
    • Which Machine Learning Algorithm(s) is best for a specific Machine Learning Problem?
  • Answer
    • Apply all available Machine Learning Algorithms and see which performs best
  • Problem
    • It requires a lot of effort, time and resources to
      • Apply all available Machine Learning Algorithms and find the best one
    • A Possible Solution
      • Start with Good Starting Points
    • Machine Learning Experts say that following Machine Learning Algorithms are Good Starting Points
      • Feature based ML Algorithms
        • For Structured / Unstructured / Semi-structured Data
          • Support Vector Machine
          • Logistic Regression
      • Deep Learning ML Algorithms
        • For Textual Data
          • Recurrent Neural Network (RNN)
        • For Image / Video Data
          • Convolutional Neural Network (CNN)
        • For Audio Data
          • Recurrent Neural Network (RNN)
  • Number of Parameters
    • ML Algorithm’s behavior is affected by
      • No. of Parameters
    • ML Algorithms with Small Number of Parameters
      • Strengths
        • Require little trial and error to find a good combination of Parameters (or Model)
      • Weaknesses
        • Do not provide flexibility
    • ML Algorithms with Large Number of Parameters
      • Strengths
        • Provide flexibility
      • Weaknesses
        • Require a lot of trial and error to find a good combination of Parameters (or Model)
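The trade-off above can be made concrete by counting how many Parameter combinations a systematic search would have to try. The sketch below is illustrative only: the parameter names and candidate values in both grids are hypothetical, and exhaustive enumeration is just one way to search.

```python
from itertools import product


def parameter_combinations(grid):
    """Yield every combination of values in a {name: [values]} grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grids: names and values are illustrative only.
small_grid = {"C": [0.1, 1.0, 10.0]}
large_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
    "batch_size": [16, 32],
}

small_trials = list(parameter_combinations(small_grid))
large_trials = list(parameter_combinations(large_grid))
print(len(small_trials), len(large_trials))
```

The number of trials is the product of the number of candidate values per Parameter, so each extra Parameter multiplies the search effort.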
  • Size of Training and Testing Data
    • Size of Training Data
      • Size of Training Data plays a very important role in the Selection of Suitable ML Algorithms
    • Feature-based ML Algorithms
      • Feature based ML Algorithms (a.k.a. Classical ML Algorithms) can be accurately trained, even if the Training Data is small
    • Deep Learning ML Algorithms
      • To accurately train Deep Learning Algorithms, a huge amount of Training Data is required
    • Size of Testing Data
      • Size of Testing Data plays a very important role when evaluating a Machine Learning Algorithm
      • To deploy a Model in Real-world (Application Phase), it should fulfill the following two conditions
        • Model should perform well (Condition 01) on large Test Data (Condition 02)
  • Number of Features
    • Features used to Train a Model, have a significant impact on the performance of the Model
    • Selection of most discriminating Features is important to get good results
    • Problem
      • In some Corpora / Datasets, it may happen that
        • No. of Features is very high compared to the No. of Instances in a Corpus / Dataset
      • Consequently, the Training Time may become infeasibly long
      • Very often, this happens in
        • Textual Data
        • Genetics Data
        • Image / Video Data
    • Possible Solutions
      • Two popular and widely used approaches to reduce Number of Features in a Corpus / Dataset are
        1. Feature Reduction
        2. Feature Selection
    • Feature Reduction
      • Feature Reduction (a.k.a. Dimensionality Reduction) is a process which transforms Features into a lower dimension
      • Popular Methods for Feature Reduction are
        • Principal Component Analysis
        • Generalized Discriminant Analysis
        • Auto-encoders
        • Non-negative Matrix Factorization
    • Feature Selection
      • Feature Selection is the process of selecting the most discriminating (or important) subset of Features (excluding redundant or irrelevant Features) from the Original Set of Features (without changing them)
      • Popular Methods for Feature Selection are
        • Wrapper Methods
        • Filter Methods
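A minimal sketch of one possible Filter Method follows: it scores each Feature independently of any Learner and keeps only the Features whose variance exceeds a threshold. The variance criterion, the threshold value, and the toy rows are assumptions chosen for illustration; other filter criteria (for example correlation with the Output) are equally valid.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Return indices of feature columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [i for i, column in enumerate(columns) if pvariance(column) > threshold]


def keep_columns(rows, indices):
    """Keep only the selected feature columns, leaving their values unchanged."""
    return [[row[i] for i in indices] for row in rows]


# Hypothetical feature vectors: column 1 is constant, column 2 barely varies.
rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 4.9],
    [4.0, 0.0, 5.0],
]

selected = select_by_variance(rows, threshold=0.1)
reduced = keep_columns(rows, selected)
print(selected)
print(reduced)
```

Note that the surviving Features keep their original values, which is what distinguishes Feature Selection from Feature Reduction.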
  • Feature Extraction Vs Feature Selection
    • Feature Extraction
      • Creates new Features
    • Feature Selection
      • Selects a subset of Features from the Original Set of Features
    • Note
      • Given a Corpus / Dataset
      • First carry out
        • Feature Extraction then
        • Feature Reduction / Feature Selection
  • Training and Testing Time
    • Training and Testing Time mainly depends upon two main factors
      • Size of Training and Testing Data
      • Target Accuracy
    • Note
      • Training Time of Deep Learning ML Algorithms is quite high compared to Feature based ML Algorithms
  • Target Accuracy
    • The Target Accuracy may differ from Machine Learning Problem to Machine Learning Problem
    • Machine Learning Problem 01
      • Detection of Enemy Tank from Vehicles
        • i.e. Tank vs Non-Tank
    • Machine Learning Problem 02
      • Gender Identification from Image
        • i.e. Male vs Female
    • We need much higher Accuracy for Machine Learning Problem 01 (Tank vs Non-Tank) than for Machine Learning Problem 02 (Gender Identification)
  • Speed and Accuracy in Application Phase
    • Speed and Accuracy requirements in Application Phase, may vary from Machine Learning Problem to Machine Learning Problem
    • Machine Learning Problem 01
      • Detection of Enemy Tank from Vehicles
        • i.e. Tank vs Non-Tank
    • Machine Learning Problem 02
      • Plagiarism Detection in Students’ Assignments
        • i.e. Plagiarized vs Non-Plagiarized
    • Note that in the Application Phase
      • For Enemy Tank Detection
        • We need both
          • High Accuracy and
          • High Speed
      • For Plagiarism Detection
        • We need
          • High Accuracy
          • Slow Speed is acceptable
  • ML Algorithms – Scikit-Learn Cheat sheet

  • Step 6: Split Sample Data into Training Data and Testing Data
  • Split Sample Data into
    • Training Data
    • Testing Data
  • Standard Practice for Splitting Sample Data
    • Use a Train-Test Split Ratio of
      • 67% – 33%
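The split described above can be sketched in a few lines. The function name, the shuffling step, and the fixed seed are assumptions for illustration; the lecture only fixes the ratio.

```python
import random


def train_test_split(instances, test_ratio=0.33, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical Sample Data: 15 instance identifiers.
sample = list(range(15))
train, test = train_test_split(sample)
print(len(train), len(test))
```

With 15 instances this yields 10 for training and 5 for testing, and no instance appears in both parts.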
  • Step 7: Select Suitable Evaluation Measure(s)
  • Selection of Suitable Evaluation Measure(s) is important to
    • correctly evaluate the performance of a Model
  • Selection of Suitable Evaluation Measure(s) mainly depends on
    • Type of Machine Learning Problem
  • Evaluation Measures for Classification Problem
    • Some of the most popular and widely used Evaluation Measures for Classification Problems are
      • Baseline Accuracy (a.k.a. Most Common Categorization (MCC))
      • Accuracy
      • True Negative Rate or Specificity
      • False Positive Rate
      • False Negative Rate
      • Recall or True Positive Rate or Sensitivity
      • Precision or Positive Predictive Value
      • F1
      • Area Under the Curve (AUC)
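Most of the measures listed above can be derived from the four confusion counts of a binary Classification Problem. The sketch below computes them for a hypothetical set of labels ("P" for Plagiarized, "N" for Non-Plagiarized); the labels, the function names, and the convention of returning 0.0 when a denominator is zero are assumptions. AUC is left out because it needs prediction scores rather than hard labels.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_measures(actual, predicted, positive):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        # Baseline: always predict the most common category.
        "baseline_accuracy": Counter(actual).most_common(1)[0][1] / len(actual),
        "accuracy": (tp + tn) / len(actual),
        "true_negative_rate": safe_div(tn, tn + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical Actual and Predicted labels for 8 test instances.
actual = ["P", "P", "N", "N", "N", "P", "N", "N"]
predicted = ["P", "N", "N", "P", "N", "P", "N", "N"]
measures = classification_measures(actual, predicted, positive="P")
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Comparing Accuracy against Baseline Accuracy shows whether the Model does better than always guessing the majority category.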
  • Evaluation Measures for Regression Problem
    • Some of the most popular and widely used Evaluation Measures for Regression Problems are
      • Mean Absolute Error (MAE)
      • Mean Squared Error (MSE)
      • Root Mean Squared Error (RMSE)
      • R2 or Coefficient of Determination
      • Adjusted R2
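The Regression measures above can be computed directly from Actual and Predicted Values. The following is a sketch under stated assumptions: the values are hypothetical, and `num_features` (needed only for Adjusted R2) is a parameter introduced here for illustration.

```python
from math import sqrt


def regression_measures(actual, predicted, num_features=1):
    """Compute MAE, MSE, RMSE, R2 and Adjusted R2 for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - num_features - 1)
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "r2": r2, "adjusted_r2": adjusted_r2}


# Hypothetical Actual and Predicted Values for 4 test instances.
actual = [3.0, 2.0, 4.0, 3.0]
predicted = [2.5, 2.0, 3.5, 3.0]
measures = regression_measures(actual, predicted, num_features=1)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE and RMSE are errors (lower is better), while R2 and Adjusted R2 measure explained variation (higher is better).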
  • Evaluation Measures for Sequence to Sequence Problem
      • Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
        • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
        • BLEU (Bilingual Evaluation Understudy)
        • METEOR (Metric for Evaluation of Translation with Explicit ORdering)
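To give a feel for how these measures compare a generated sequence with a reference, here is a deliberately simplified sketch based on unigram overlap. It computes a ROUGE-1 style recall and the clipped unigram precision that BLEU builds on; it is not a full implementation of either measure (full BLEU also combines higher-order n-grams and a brevity penalty). Whitespace tokenization, lowercasing, and the example sentences are assumptions.

```python
from collections import Counter


def unigram_overlap(reference, candidate):
    """Count words shared by both texts, clipped to the smaller count per word."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    return sum((ref_counts & cand_counts).values())


def rouge1_recall(reference, candidate):
    """Share of reference words that also appear in the candidate."""
    ref_len = len(reference.split())
    return unigram_overlap(reference, candidate) / ref_len if ref_len else 0.0


def unigram_precision(reference, candidate):
    """Share of candidate words that also appear in the reference."""
    cand_len = len(candidate.split())
    return unigram_overlap(reference, candidate) / cand_len if cand_len else 0.0


# Hypothetical reference and system output.
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(rouge1_recall(reference, candidate))
print(unigram_precision(reference, candidate))
```

Recall-oriented scores suit Text Summarization (did the summary cover the reference?), while precision-oriented scores are traditional for Machine Translation.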
  • Step 8: Execute First Two Phases of Machine Learning Cycle
  • Recall the Equation

  • Training Phase
    • Use Training Data to build the Model
  • Testing Phase
    • Use Testing Data to evaluate the performance of the Model
      • i.e. calculate Error in the Model
  • Step 9: Analyze Results
  • Machine Learning Assumption

  • Question
    • What is a good Model?
  • A Possible Answer
    • It varies from Machine Learning Problem to Machine Learning Problem
    • Generally ,

  • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
  • Application Phase
    • Deploy the Model in the Real-world to make predictions on unseen data
  • Feedback Phase
    • Take Feedback from
      • Domain Experts
      • Users of the ML system
  • Step 11: Improve Model based on Feedback
  • There is Always Room for Improvement 😊
  • Based on Feedback from Domain Experts and Users
    • Improve your Model
  • To learn any task, follow this cycle
    • Plan – in Mind
    • Design – on Paper
    • Execute – at Prototype Level
    • Execute – in Real-world
    • Feedback – from Audience and Domain Experts
  • To be successful in life
    • Be a Learner till Death 😊
    • Just change one person in life i.e. Yourself
  • People fail in life because they
    • Try to Change the World 😊

TODO and Your Turn​

Todo Tasks

TODO Task 1

  • Task 1
    • Consider the following Machine Learning Problems and for each Machine Learning Problem answer the questions given below.
      • Automatically Generating Caption for an Image
      • Speaker Identification (from Audio)
    • Note
      • Your answer should be
        • Well Justified
    • Questions
      • Write Input and Output for each Machine Learning Problem?
      • Decide Learning Settings
        • What Learning Settings will be more suitable?
      • Obtain Sample Data
        • Write names of potential Data Sources?
      • Understand and Pre-process Sample Data
        • What Pre-processing Techniques should be applied to improve the quality of Sample Data?
      • Represent Sample Data in Machine Understandable Format
        • How can Sample Data be represented in Machine Understandable Format?
      • Select Suitable Machine Learning Algorithms
        • Write down suitable Machine Learning Algorithms?
      • Number of Features
        • Do we need to apply Feature Selection or Feature Reduction?
      • Split Sample Data into Training Data and Testing Data
        • How should Sample Data be split into Training Data and Testing Data?
      • Select Suitable Evaluation Measure(s)
        • Write down suitable Evaluation Measure(s)?
      • Training Time
        • How much Training Time will be required to Train selected Machine Learning Algorithms?
      • Number of Parameters
        • How many Parameters need to be tuned for selected Machine Learning Algorithms?
      • Deployment in Application Phase
        • Write two conditions that must be fulfilled before deploying a Model in Real-world?
      • Discuss requirements of Speed and Accuracy in Application Phase?
Your Turn Tasks

Your Turn Task 1

  • Task 1
    • Select any two Machine Learning Problems and for each Machine Learning Problem answer the questions given below.
    • Note
      • Your answer should be
        • Well Justified
      • Questions
        • Write Input and Output for each Machine Learning Problem?
        • Decide Learning Settings
          • What Learning Settings will be more suitable?
        • Obtain Sample Data
          • Write names of potential Data Sources?
        • Understand and Pre-process Sample Data
          • What Pre-processing Techniques should be applied to improve the quality of Sample Data?
        • Represent Sample Data in Machine Understandable Format
          • How can Sample Data be represented in Machine Understandable Format?
        • Select Suitable Machine Learning Algorithms
          • Write down suitable Machine Learning Algorithms?
        • Number of Features
          • Do we need to apply Feature Selection or Feature Reduction?
        • Split Sample Data into Training Data and Testing Data
          • How should Sample Data be split into Training Data and Testing Data?
        • Select Suitable Evaluation Measure(s)
          • Write down suitable Evaluation Measure(s)?
        • Training Time
          • How much Training Time will be required to Train selected Machine Learning Algorithms?
        • Number of Parameters
          • How many Parameters need to be tuned for selected Machine Learning Algorithms?
        • Deployment in Application Phase
          • Write two conditions that must be fulfilled before deploying a Model in Real-world?
        • Discuss requirements of Speed and Accuracy in Application Phase?

GPA Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem

  • GPA Prediction Problem
  • Task
    • Develop a GPA Prediction system to predict the GPA of a university student (1st semester) from his / her Matric and FSc marks
  • Input
    1. Matric Marks
    2. FSc Marks
  • Output
    • GPA (1st semester)
  • Treated as a
    • Supervised Machine Learning Problem
  • Goal
    • Learn an Input-Output Function
      • i.e. Learn from Input to predict Output
  • GPA Prediction is a Regression Problem
  • GPA Prediction is a Regression Problem because
    • Output is Numeric
  • GPA Prediction – Input and Output
  • Input
    • Structured
      • Fixed Set of Two Attributes
        1. Matric Marks
        2. FSc Marks
  • Output
    • Numeric
  • Research Focus – GPA Prediction System
  • Research Focus
    • Develop a GPA Prediction system for university students in Pakistan studying Computer Science at Undergrad level in three degree programs: BS(CS), BS(SE) and BS(IT)
  • Steps – Treating GPA Prediction as a Regression Problem
  • In sha Allah, I will follow these steps to treat the GPA Prediction Problem as a Regression Problem
    • Step 1: Decide the Learning Setting
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results

    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Step 1: Decide the Learning Setting
  • In sha Allah, I aim to treat GPA Prediction Problem as a
    • Supervised Machine Learning Problem
  • Since Output is Numeric, it will be treated as a
    • Regression Problem
  • Step 2: Obtain Sample Data
  • Since I am treating GPA Prediction Problem as a Regression Problem, I will need
    • Annotated Data
  • For more accurate learning, I need
    1. Large amount of Annotated Data
    2. High-quality Annotated Data
    3. Balanced Data
  • Note
    • For simplicity, In sha Allah, I will use a toy Corpus / Dataset of 15 instances only
      • i.e. Size of Sample Data = 15 instances
  • Two Main Choices to Obtain Sample Data
    1. Use an Existing Corpus / Dataset
    2. Develop Your Own Corpus / Dataset
  • Since there is no existing Corpus / Dataset available to develop a GPA Prediction system for university students studying in Pakistan
    • I developed my own Corpus / Dataset
  • For details on how to create a Gold Standard Annotated Corpus
    • See Lecture 4 – Data and Annotations – Step by Step Examples
  • To develop our GPA Prediction system, we obtained
    • Sample Data of 15 instances
  • See gpa-sample-data.csv file in Supporting Material
  • Sample Data

  • Step 3: Understand and Pre-process data
  • Understanding Data
    • The Gold Standard Annotated Corpus contains three Attributes / Features
      • Matric Marks
      • FSc Marks
      • GPA (1st semester)
    • Separating Input from Output
      • Input comprises two Attributes / Features
        • Matric Marks
        • FSc Marks
      • Output comprises a single Attribute
        • GPA (1st semester)
  • Pre-processing Data
    • Gold Standard Annotated Corpus is already pre-processed
  • Step 4: Represent Sample Data in Machine Understandable Format
  • Feature-based Regression Algorithms (implemented in Scikit-Learn) can understand data in
    • Attribute-Value Pair
      • Values of Attribute must be Numeric
  • Our Gold Standard Annotated Data is already in
    • Attribute-Value Pair form with Numerical Values
  • Therefore, it is already in Machine Understandable Format
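As a sketch of Steps 3 and 4 for this problem, the snippet below reads a small CSV and separates Input (Matric and FSc marks) from Output (GPA). The column names and the three rows are hypothetical stand-ins; they are not the contents of gpa-sample-data.csv, whose layout is not shown here.

```python
import csv
import io

# Hypothetical CSV content; column names and values are assumptions,
# not the contents of gpa-sample-data.csv.
csv_text = """matric_marks,fsc_marks,gpa
850,800,3.1
920,870,3.6
780,700,2.7
"""


def load_instances(text):
    """Parse CSV text into Input feature vectors X and Output values y."""
    reader = csv.DictReader(io.StringIO(text))
    X, y = [], []
    for row in reader:
        # Keep the attribute order fixed: <Matric, FSc>.
        X.append([float(row["matric_marks"]), float(row["fsc_marks"])])
        y.append(float(row["gpa"]))
    return X, y


X, y = load_instances(csv_text)
print(X)
print(y)
```

Because every value is already numeric, no further encoding is needed before handing `X` and `y` to a Regression Algorithm.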
  • Step 5: Select Suitable Machine Learning Algorithms
  • Previous studies have shown that Good Starting Points for Regression Problems are
    • Support Vector Regressor
    • Linear Regression
    • Random Forest Regressor
    • Gradient Boosting Regressor
  • Step 6: Split Sample Data into Training Data and Testing Data
  • Use Standard Practice for Sample Data Split
    • i.e. Train-Test Split Ratio of
      • 67% – 33%
  • In our Corpus / Dataset
    • Total Instances = 15
  • Splitting Data into Training Data and Testing Data
    • Training Data = 10
    • Testing Data = 5
  • Training Data
    • See gpa-training-data.csv file in Supporting Material
    • Training Data

  • Testing Data
    • See gpa-testing-data.csv file in Supporting Material
    • Testing Data

  • Step 7: Select Suitable Evaluation Measure
  • I will use Mean Absolute Error (MAE) Evaluation Measure to evaluate the performance of the Model
  • Absolute Error
    • Absolute Error (AE) is the absolute difference between the Actual Value and the Predicted Value
  • Formula
    • $AE = |X_{actual} - X_{predicted}|$
  • where $X_{actual}$ and $X_{predicted}$ represent Actual Value and Predicted Value respectively
  • Mean Absolute Error
    • Mean Absolute Error (MAE) is the average of all Absolute Errors
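  • Formula
    • $MAE = \frac{1}{n} \sum_{i=1}^{n} |X_{actual,i} - X_{predicted,i}|$
    • where
      • n represents the total number of instances
      • $X_{actual,i}$ represents Actual Value of the i-th instance
      • $X_{predicted,i}$ represents Predicted Value of the i-th instance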
  • Step 8: Execute First Two Phases of Machine Learning Cycle
  • Recall the Equation

  • Training Phase
    • Use Training Data to build the Model
  • Note that our aim is to
    • Learn an Input-Output Function
  • Recall – General Settings of Learning Input-Output Functions

  • Training Phase

    • Training Examples (TE) = {x1, …, xm} + f(xi) for each xi ∈ TE
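To illustrate what learning an Input-Output Function can look like, the sketch below fits a Linear Regression Model h to a handful of Training Examples using ordinary least squares (the normal equations solved by Gaussian elimination). This is one possible way to train, not necessarily the procedure used in the lecture, and the training values are hypothetical: they were generated from a made-up linear rule so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear_regression(X, y):
    """Least-squares weights [w0, w1, ...] for y ~ w0 + w1*x1 + ..."""
    rows = [[1.0] + list(x) for x in X]  # prepend 1.0 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, x):
    """Apply the learned Model h to one feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Hypothetical Training Examples <Matric, FSc> with GPA generated from
# the made-up rule 0.5 + 0.002 * Matric + 0.001 * FSc.
X_train = [[800, 700], [900, 850], [700, 750], [850, 900], [750, 650]]
y_train = [2.8, 3.15, 2.65, 3.1, 2.65]

weights = fit_linear_regression(X_train, y_train)
print(weights)
print(predict(weights, [800, 800]))
```

In practice a library implementation (for example Scikit-Learn's Linear Regression) would usually replace the hand-written solver; the point here is only to show that training produces a function h that maps Input to Output.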

  • Testing Phase
    • Use Testing Data to compute Error in the Model using the Mean Absolute Error (MAE) measure

  • Predictions Returned by Model (h)

  • Calculating Mean Absolute Error
    • To calculate Mean Absolute Error, we will compare
      • Actual Values with Predicted Values

  • Step 1: Calculate Absolute Error for each Test Example

  • Step 2: Calculate Mean Absolute Error
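The two steps can be illustrated with a worked example. The five Actual and Predicted GPA values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the lecture's test results.

```latex
% Hypothetical values: Actual = (3.0, 2.5, 3.5, 2.0, 4.0), Predicted = (2.8, 2.9, 3.5, 2.3, 3.6)
\begin{align*}
AE_1 &= |3.0 - 2.8| = 0.2 \\
AE_2 &= |2.5 - 2.9| = 0.4 \\
AE_3 &= |3.5 - 3.5| = 0.0 \\
AE_4 &= |2.0 - 2.3| = 0.3 \\
AE_5 &= |4.0 - 3.6| = 0.4 \\
MAE  &= \frac{0.2 + 0.4 + 0.0 + 0.3 + 0.4}{5} = \frac{1.3}{5} = 0.26
\end{align*}
```

Under these assumed values, the Model's predictions are off by 0.26 GPA points on average.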

  • Step 9: Analyze Results
  • Assumption for this Example
    • Here, I am assuming that Model
  • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
  • Application Phase
    • Model is deployed in Real-world to make predictions on Real-time Data
  • Steps – Making Predictions on Real-time Data
    • Step 1: Take Input from User
    • Step 2: Convert User Input into Feature Vector
      • Exactly same as Feature Vectors of Training and Testing Data
    • Step 3: Apply Model on the Feature Vector
    • Step 4: Return Prediction to the User
  • Example – Making GPA Prediction on Real-time Data

    • Step 1: Take Input from User
      • Enter Matric Marks : 704
      • Enter FSc Marks : 853
    • Step 2: Convert User Input into Feature Vector
      • Exactly same as Feature Vectors of Training and Testing Data
      • Feature Vector
        • <704, 853>
      • Note that the order of Attributes in both Training and Testing Data was
        • <Matric, FSc>
      • Similarly, order of Attributes in unseen instance is exactly same as in Training and Testing Data
    • Step 3: Apply Model on the Feature Vector of unseen instance
      • Model (or h) is applied on <704, 853>

    • Step 4: Return Prediction to the User
      • 86
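The four Application Phase steps can be sketched as follows. The stand-in Model `hypothetical_model` and its weights are made up for illustration, so its output is not the lecture's prediction; the dictionary keys and function names are likewise assumptions. The part worth noting is that the Feature Vector is always built in the same attribute order as the Training and Testing Data.

```python
# Same attribute order as in the Training and Testing Data: <Matric, FSc>.
FEATURE_ORDER = ("matric_marks", "fsc_marks")


def to_feature_vector(user_input):
    """Step 2: convert raw user input (strings) into a numeric Feature Vector."""
    return [float(user_input[name]) for name in FEATURE_ORDER]


def hypothetical_model(feature_vector):
    """Stand-in for the trained Model h; the weights are made up."""
    matric, fsc = feature_vector
    return 0.5 + 0.002 * matric + 0.001 * fsc


def predict_gpa(user_input, model):
    """Steps 2 to 4: build the Feature Vector, apply the Model, return the Prediction."""
    return model(to_feature_vector(user_input))


# Step 1: input taken from the user; keys may arrive in any order.
user_input = {"fsc_marks": "853", "matric_marks": "704"}
vector = to_feature_vector(user_input)
prediction = predict_gpa(user_input, hypothetical_model)
print(vector)
print(round(prediction, 3))
```

Fixing the order in one place (`FEATURE_ORDER`) guards against silently swapping Matric and FSc marks at prediction time.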
  • Feedback Phase
    • A Two Step Process
    • Step 1: After some time, take Feedback from
      • Domain Experts and Users on deployed GPA Prediction system
    • Step 2: Make a List of Possible Improvements based on Feedback received
  • Step 11: Improve GPA Prediction System based on Feedback
  • Go to Step 1 and improve the GPA Prediction system based on
    • List of Possible Improvements made in Step 10

Emotion Prediction System – Treating a Real-world Problem as a Supervised Machine Learning Problem

  • Emotion Prediction Problem
  • Task
    • Develop an Emotion Prediction system to predict emotion from a written text
  • Input
    • A text
  • Output
    • Emotion
  • Possible Output Values (12 Categories)
    • Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
  • Treated as a
    • Supervised Machine Learning Problem
  • Goal
    • Learn an Input-Output Function
      • i.e. Learn from Input to predict Output
  • Emotion Prediction is a Classification Problem
  • Emotion Prediction is a Classification Problem because
    • Output is Categorical
  • Emotion Prediction – Input and Output
  • Input
    • Unstructured (Text)
  • Output
    • Categorical
  • Research Focus – Emotion Prediction System
  • Research Focus
    • Develop an Emotion Prediction system for English Tweets
  • Steps – Treating Emotion Prediction as a Classification Problem
  • In sha Allah, I will follow the following steps to treat the Emotion Prediction Problem as a Classification Problem
    • Step 1: Decide the Learning Settings
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results

    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Step 1: Decide the Learning Setting
  • In sha Allah, I aim to treat Emotion Prediction Problem as a
    • Supervised Machine Learning Problem
  • Since Output is Categorical, it will be treated as a
    • Classification Problem

  • Step 2: Obtain Sample Data
  • Since I am treating Emotion Prediction Problem as a Classification Problem, I will need
    • Annotated Data
  • For more accurate learning, I need
    1. Large amount of Annotated Data
    2. High-quality Annotated Data
    3. Balanced Data
  • Note
    • For simplicity, In sha Allah, I will use a toy Corpus / Dataset of 15 instances only
  • Two Main Choices to Obtain Data
    1. Use an Existing Corpus
    2. Develop Your Own Corpus
  • A Gold Standard Annotated Corpus is available for Emotion Analysis of English Tweets
    • Corpus / Dataset Link:
      • https://competitions.codalab.org/competitions/17751#learn_the_details-datasets Last visited: 07-04-2020
    • Paper Link
      • https://www.aclweb.org/anthology/S18-1001.pdf Last visited: 07-04-2020
    • Paper Reference
      • Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018, June). SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation (pp. 1–17).
  • We obtained a Sample Data of 15 instances
  • See emotion-sample-data.csv File in Supporting Material
  • Sample Data

  • Step 3: Understand and Pre-process Data
  • Understanding Data
    • The Gold Standard Annotated Corpus contains two Attributes
      • Tweet
      • Emotion
    • Separating Input from Output
      • Input comprises one Attribute
        • Tweet
      • Output comprises a single Attribute
        • Emotion
  • Pre-processing Data
    • Gold Standard Annotated Corpus is already pre-processed
      • Therefore, no pre-processing is needed

  • Step 4: Represent Data in Machine Understandable Format
  • Feature-based Classification Algorithms (implemented in Scikit-Learn) can understand data in
    • Attribute-Value Pair
      • Values of Attributes / Features must be Numeric
  • Problem
    • Our Sample Data is not in Attribute-Value Pair form
      • We need to transform our Sample Data into Machine Understandable Format
  • Solution
    • There are many approaches to transform Sample Data into Machine Understandable Format
  • Transforming Sample Data in Machine Understandable Format

    • In our Sample Data
      • Input is Text
      • Output is Categorical
    • Considering Input (Tweet) and Output (Emotion), we will need to
      • Transform Input (Text) into Numerical Representation
      • Transform Output (Categorical) into Numerical Representation
    • A Two Step Process
      • Step 1: Define an Encoding Scheme
      • Step 2: Use Encoding Scheme defined in Step 1, to convert Categorical Output Values to Numerical Output Values for all instances in the Sample Data

        Converting Output into Numerical Representation

    • Step 1: Define an Encoding Scheme
      • Encoding Scheme
        • Anger = 0
        • Anticipation = 1
        • Disgust = 2
        • Fear = 3
        • Joy = 4
        • Love = 5
        • Optimism = 6
        • Pessimism = 7
        • Sadness = 8
        • Surprise = 9
        • Trust = 10
        • Neutral = 11
    • Step 2: Use Encoding Scheme defined in Step 1, to convert Categorical Output Values to Numerical Output Values for all instances in the Sample Data
    • Sample Data after Encoding Categorical Output Values to Numerical Output Values
    • See emotion-sample-data-encoded-output.csv File in Supporting Material
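The two-step output encoding above can be sketched in Python. The Encoding Scheme is exactly the one listed in Step 1; the names `ENCODING`, `encode_labels` and `decode_labels`, and the four sample labels, are hypothetical and used only for illustration.

```python
# Step 1: define the Encoding Scheme (same order as listed above)
EMOTIONS = [
    "Anger", "Anticipation", "Disgust", "Fear", "Joy", "Love",
    "Optimism", "Pessimism", "Sadness", "Surprise", "Trust", "Neutral",
]
ENCODING = {emotion: code for code, emotion in enumerate(EMOTIONS)}
DECODING = {code: emotion for emotion, code in ENCODING.items()}


def encode_labels(labels):
    """Step 2: convert Categorical Output Values to Numerical Output Values."""
    return [ENCODING[label] for label in labels]


def decode_labels(codes):
    """Convert Numerical Output Values back to Categorical Output Values."""
    return [DECODING[code] for code in codes]


# Hypothetical output values for four instances
sample_labels = ["Joy", "Anger", "Neutral", "Optimism"]
encoded = encode_labels(sample_labels)
print(encoded)  # [4, 0, 11, 6]
```

Keeping the reverse mapping (`DECODING`) is useful later, when Numerical Predicted Values have to be shown to the user as emotion names.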

  • Note
    • Alhumdulilah, Output is transformed into Numerical Representation
    • In sha Allah, in next slides I will try to explain how to transform Input into Numerical Representation
  • Converting Input into Numerical Representation
    • Considering Feature-based ML Algorithms, an Input can be transformed into Numerical Representation in the following steps
      • Step 1: Select a Feature Extraction Method
      • Step 2: Extract Features from Input using the Feature Extraction Method selected in Step 1
  • Note
    • For details on Feature Extraction from Text
      • See Tutorial – Feature Extraction from Text
  • Converting Input into Numerical Representation
    • Step 1: Select a Feature Extraction Method
    • In sha Allah, I will use Word Uni-gram Features to transform Sample Data into Numerical Representation
      • Feature = Word Uni-gram
      • Feature Weight = Frequency Count of a Word in a Tweet
      • Maximum Features = 10
    • For details on Feature Extraction from Text using N-gram Models
      • See Tutorial – Feature Extraction from Text
  • Converting Input into Numerical Representation
    • Step 2: Extract Features from Input using the Feature Extraction Method selected in Step 1
    • After Feature Extraction, Input is transformed into Numerical Representation
    • See emotion-sample-data-encoded.csv File in Supporting Material
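Word Uni-gram feature extraction with Frequency Count weights and a Maximum Features limit can be sketched with the Python standard library alone. This is an approximation under my own assumptions, not the exact procedure behind emotion-sample-data-encoded.csv: the tokenizer (lower-casing, letters and digits only), the tie-breaking rule and the two tweets are hypothetical, and `max_features=3` is used only to keep the output short (the lecture uses 10). In practice, scikit-learn's `CountVectorizer` with its `max_features` parameter serves the same purpose.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters and digits as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(tweets, max_features=10):
    """Keep the most frequent words (ties broken alphabetically), sorted A to Z."""
    totals = Counter()
    for tweet in tweets:
        totals.update(tokenize(tweet))
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])


def vectorize(tweet, vocabulary):
    """Feature Weight = Frequency Count of each vocabulary word in the tweet."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]


# Hypothetical tweets, for illustration only
tweets = ["What an amazing amazing day", "This day is a sad day"]

vocabulary = build_vocabulary(tweets, max_features=3)
vectors = [vectorize(tweet, vocabulary) for tweet in tweets]
print(vocabulary)  # ['a', 'amazing', 'day']
print(vectors)     # [[0, 2, 1], [1, 0, 2]]
```

Each tweet becomes a vector with one position per vocabulary word, which is the Attribute-Value form that Feature-based ML Algorithms expect.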

  • Recap – Original Sample Data

  • Recap – Sample Data in Numerical Representation

  • Step 5: Select Suitable Machine Learning Algorithms
  • Previous studies have shown that Good Starting Points for Classification Problems are
    • Support Vector Classifier
    • Naïve Bayes
    • Random Forest Classifier
    • Gradient Boosting Classifier
  • Step 6: Split Sample Data into Training Data and Testing Data
  • Use Standard Practice for Data Split
    • i.e. Train-Test Split Ratio of
      • 67% – 33%
  • In our Corpus / Dataset
    • Total Instances = 15
  • Splitting Data into Training and Testing
    • Training Data = 10
    • Testing Data = 5
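The 67% – 33% split of 15 instances can be checked with a short sketch. The function `split_sample_data` is a hypothetical helper; it rounds the training size to the nearest whole instance and shuffles only when a seed is given, which is my assumption and not something stated in the lecture.

```python
import random


def split_sample_data(instances, train_ratio=0.67, seed=None):
    """Split instances into (Training Data, Testing Data) by the given ratio."""
    items = list(instances)
    if seed is not None:
        # Optional shuffle so that the split does not depend on file order
        random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_ratio)
    return items[:n_train], items[n_train:]


sample_data = list(range(1, 16))  # instance ids 1..15 stand in for the 15 instances
training_data, testing_data = split_sample_data(sample_data)
print(len(training_data), len(testing_data))  # 10 5
```

With 15 instances, 67% is 10.05, which rounds to 10 Training instances and leaves 5 for Testing.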
  • Training Data

    • See emotion-training-data-encoded.csv File in Supporting Material

  • Testing Data
    • See emotion-testing-data-encoded.csv File in Supporting Material

  • Step 7: Select Suitable Evaluation Measure(s)
  • I will use Accuracy Evaluation Measure to evaluate the performance of the Model
  • Accuracy
    • Accuracy is defined as the proportion of correctly classified instances
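As a minimal sketch of this definition, Accuracy is the number of correctly classified instances divided by the total number of instances. The function name `accuracy` and the actual and predicted values below are hypothetical and are not the Model's real predictions.

```python
def accuracy(actual, predicted):
    """Proportion of instances whose predicted value equals the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one test example is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical Actual and Predicted Values for five test instances
actual_values = ["Joy", "Anger", "Fear", "Joy", "Sadness"]
predicted_values = ["Joy", "Anger", "Joy", "Joy", "Neutral"]

acc = accuracy(actual_values, predicted_values)
print(acc)  # 0.6
```

Here 3 of the 5 predictions match the actual values, so Accuracy is 3 / 5 = 0.6.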

  • Step 8: Execute First Two Phases of Machine Learning Cycle
  • Recall the Equation

  • Training Phase
    • Use Training Data to build the Model
  • Note that our aim is to
    • Learn an Input-Output Function
  • General Settings – Learning Input-Output Function

  • Training Phase

  • Testing Phase

  • Predictions Returned by the Model (h)

  • Calculating Accuracy
    • To calculate Accuracy, we will compare
      • Actual Values with Predicted Values
    • Note
      • To explain calculations more clearly, I have converted Numerical Predicted Values to Categorical Predicted Values

  • Step 9: Analyze Results
  • Assumption for this Example
    • Here, I am assuming that Model
      • performed well on large Test Data and we can deploy it in the real-world
  • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
  • Application Phase
    • Model is deployed in Real-world to make predictions on Real-time Data
  • Steps – Make Predictions on Real-time Data
    • Step 1: Take Input from User
    • Step 2: Convert User Input into Feature Vector
      • Exactly same as Feature Vectors of Training and Testing Data
    • Step 3: Apply Model on the Feature Vector of the unseen instance
    • Step 4: Return Prediction to the User
  • Example – Making Predictions on Real-time Data

    • Step 1: Take Input from User

    • Step 2: Convert User Input into Feature Vector
      • Exactly same as Feature Vectors of Training and Testing Data
      • Feature Vector
        • <0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
      • Note that the order of Attributes in both Training and Testing Data was
        • <18, activists, agree, amazing, apart, atsu, baloch, band, basically, battle>
      • Similarly, order of Attributes in unseen instance is exactly same as those of Training and Testing Data
    • Step 3: Apply Model on the Feature Vector of unseen instance
      • Model (h) is applied on <0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    • Step 4: Return Prediction to the User
      • Optimism
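Step 2 of this example can be sketched as follows. The attribute order is the one given above; the function `to_feature_vector`, its simple tokenizer and both tweets are hypothetical. The sketch shows why a tweet that contains none of the ten vocabulary words is represented by a Feature Vector of ten zeros.

```python
import re
from collections import Counter

# Attribute order used in Training and Testing Data (from the example above)
VOCABULARY = ["18", "activists", "agree", "amazing", "apart",
              "atsu", "baloch", "band", "basically", "battle"]


def to_feature_vector(tweet, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the tweet, in fixed order."""
    counts = Counter(re.findall(r"[a-z0-9]+", tweet.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical tweet with no vocabulary words: every count is zero
unseen_vector = to_feature_vector("Feeling hopeful about tomorrow")
print(unseen_vector)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical tweet that does contain vocabulary words
other_vector = to_feature_vector("What an amazing, amazing band")
print(other_vector)  # [0, 0, 0, 2, 0, 0, 0, 1, 0, 0]
```

Words outside the vocabulary are simply ignored, so the vector always has the same length and order as the Feature Vectors seen during training.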
  • Application Phase

  • Feedback Phase
    • A Two Step Process
    • Step 1: After some time, take Feedback from
      • Domain Experts and Users on deployed Emotion Prediction System
    • Step 2: Make a List of Possible Improvements based on Feedback received
  • Step 11: Improve Emotion Prediction System based on Feedback
  • Go to Step 1 and improve the Emotion Prediction System based on
    • List of Possible Improvements made in Step 10

Text Summarization System – Treating a Real-world Problem as a Supervised Machine Learning Problem

  • Text Summarization Problem
  • Task
    • Develop a Text Summarization system to automatically generate (predict) a summary of an Urdu news article
  • Input
    • An Urdu News Article
  • Output
    • Summary
  • Treated as a
    • Supervised Machine Learning Problem
  • Goal
    • Learn an Input-Output Function
      • i.e. Learn from Input to predict Output
  • Important Note
    • Be careful in the use of terms
  • Example
    • Term 01
      • A News Article
    • Term 02
      • An Urdu News Article
    • Term 03
      • An Urdu News Article on Science and Technology
    • Term 04
      • An Urdu News Article on Hazrat Jalal.ud.Din Romi R.A.
    • Remarks
      • Term 01 is very broad
      • Term 02 is broad
      • Term 03 is specific
      • Term 04 is very specific
  • Text Summarization is a Sequence to Sequence Problem
    • Text Summarization is a Sequence to Sequence Problem because
      • Input is Unstructured and of variable length
      • Output is Unstructured and of variable length
  • Text Summarization – Input and Output
    • Input
      • Unstructured (Text) – An Urdu News Article
    • Output
      • Unstructured (Text) – Summary
    • Note
      • Length of Input is much greater than the Length of Output
  • Research Focus – Text Summarization System
    • Research Focus
      • Develop a Text Summarization system to automatically generate summary of an Urdu news article
  • Steps – Treating Text Summarization as a Sequence to Sequence Problem
  • In sha Allah, I will follow the following steps to treat Text Summarization Problem as a Sequence to Sequence Problem
    • Step 1: Decide the Learning Settings
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results

    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Step 1: Decide the Learning Setting
  • In sha Allah, I aim to treat Text Summarization Problem as a
    • Supervised Machine Learning Problem
  • Since both Input and Output are Unstructured (Text) and of variable length, it will be treated as a
    • Sequence to Sequence Problem
  • Step 2: Obtain Sample Data
  • Since I am treating Text Summarization Problem as a Supervised Learning Problem, I will need
    • Annotated Data
  • For more accurate learning, I need
    1. Large amount of Annotated Data
    2. High-quality Annotated Data
    3. Balanced Data
  • Note
    • For simplicity, In sha Allah I will use a toy Corpus / Dataset of 15 instances only
  • Two Main Choices to Obtain Sample Data
    1. Use an Existing Corpus
    2. Develop Your Own Corpus
  • Since there is a benchmark Corpus / Dataset available for Urdu Text Summarization
    • I will use the existing Corpus / Dataset called
      • Urdu Text Summarization Corpus
  • We obtained a Sample Data of 15 instances
    • See summarization-sample-data.xlsx File in Supporting Material
  • Note
    • To save space I am putting below one instance from Sample Data

  • Next Slide contains set of 15 instances in the Sample Data
  • To save space, I am only putting the Summary and not presenting the Urdu News Article
  • Complete Sample Data is given in
    • summarization-sample-data.xlsx File in Supporting Material

  • Step 3: Understand and Pre-process data
  • Understanding Data
    • The Sample Data contains two Attributes
      • Urdu News Article
      • Summary
    • Separating Input from Output
      • Input comprises one Attribute
        • An Urdu News Article
      • Output comprises one Attribute
        • Summary
  • Pre-processing Data
    • Sample Data is already pre-processed
      • Therefore, no pre-processing is needed
  • Step 4: Represent Data in Machine Understandable Format
  • Deep Learning ML Algorithms (implemented in Keras or PyTorch) can understand data in
    • Numerical Representations
  • Problem
    • Our Sample Data is in Textual form
      • Therefore, Deep Learning ML Algorithms cannot understand it
  • A Possible Solution
    • Use Word Embedding Techniques to transform
      • Textual Data into Numerical Representation
    • Popular Word Embedding Techniques used in Deep Learning ML Algorithms are
      • Word2Vec
      • GloVe
      • FastText
  • Step 5: Select Suitable Machine Learning Algorithms

 

  • Previous studies have shown that Good Starting Points for Sequence to Sequence Problems (considering Textual Data) are
    • Recurrent Neural Network (RNN)
    • Long Short-Term Memory (LSTM)
    • BI-LSTM
    • Gated Recurrent Units (GRU)
    • BI-GRU
  • Step 6: Split Sample Data into Training Data and Testing Data
  • Use Standard Practice for Data Split
    • i.e. Train-Test Split Ratio of
      • 67% – 33%
    • In our Sample Data
      • Total Instances = 15
    • Splitting Sample Data into Training Data and Testing Data
      • Training Data = 10
      • Testing Data = 5
  • Training Data

    • Complete Training Data is given in
      • summarization-training-data.xlsx File in Supporting Material

  • Testing Data
    • Complete Testing Data is given in
      • summarization-testing-data.xlsx File in Supporting Material

  • Step 7: Select Suitable Evaluation Measure(s)
  • Two Choices to Evaluate Text Summarization System
    1. Manual Approach
    2. Automatic Approach
  • Manual Approach
    • A human (Domain Expert) will manually judge the quality of summary automatically generated by Text Summarization System
    • Strengths
      • Evaluation will be very accurate and of high-quality
    • Weaknesses
      • It is practically not possible to manually evaluate thousands of summaries
  • Automatic Approach
    • A program will automatically judge the quality of summary automatically generated by Text Summarization System
    • Strengths
      • You can quickly and easily evaluate very large Test Data
    • Weaknesses
      • Evaluation will not be very accurate and of high-quality
  • Evaluating Urdu Text Summarization System

    • In sha Allah, I will use Automatic Approach to evaluate the performance of Urdu Text Summarization System
    • ROUGE is a de facto standard to automatically evaluate the performance of Text Summarization Systems
    • In sha Allah, I will use following three metrics of ROUGE to evaluate Urdu Text Summarization System
      • ROUGE-1
      • ROUGE-2
      • ROUGE-L
  • Note
    • To understand the working of ROUGE-L, ROUGE-1 and ROUGE-2 metrics
      • See Tutorial – Evaluating Sequence to Sequence Models using ROUGE
  • To summarize
    • Average F1 scores will be reported for ROUGE-1, ROUGE-2 and ROUGE-L metrics
  • Note
    • For details on F1 measure
      • See Lecture 13 – Evaluating Hypothesis (Model)
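To make the three metrics concrete, here is a simplified sketch of ROUGE-1, ROUGE-2 and ROUGE-L F1 scores. It is my own illustration and not the official ROUGE toolkit: it assumes whitespace tokenization, a single reference summary per instance and no stemming, and the English sentence pair is a hypothetical stand-in for an Urdu reference summary and a generated summary.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(reference, candidate, n):
    """F1 over n-grams shared by the reference and the generated summary."""
    ref = Counter(ngrams(reference.split(), n))
    cand = Counter(ngrams(candidate.split(), n))
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return f1(overlap / sum(cand.values()), overlap / sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            current.append(previous[j - 1] + 1 if x == y
                           else max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """F1 based on the longest common subsequence (ROUGE-L)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    return f1(lcs / len(cand), lcs / len(ref))


# Hypothetical (reference summary, generated summary) pairs
pairs = [("the cat sat on the mat", "the cat lay on the mat")]

# Average F1 over all test instances, as reported in the Testing Phase
rouge_1 = sum(rouge_n_f1(r, c, 1) for r, c in pairs) / len(pairs)
rouge_2 = sum(rouge_n_f1(r, c, 2) for r, c in pairs) / len(pairs)
rouge_l = sum(rouge_l_f1(r, c) for r, c in pairs) / len(pairs)
print(round(rouge_1, 4), round(rouge_2, 4), round(rouge_l, 4))  # 0.8333 0.6 0.8333
```

For this pair, 5 of 6 uni-grams and 3 of 5 bi-grams match, and the longest common subsequence has 5 words, which gives the three scores shown.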

 

  • Step 8: Execute First Two Phases of Machine Learning Cycle

Recall the Equation

  • Training Phase
    • Use Training Data to build the Model
  • Note that our aim is to
    • Learn an Input-Output Function
  • General Settings – Learning Input-Output Function

  • Training Phase

  • Testing Phase
    • Testing Phase
      • Use Testing Data to compute Error in Model using
        • ROUGE-1, ROUGE-2 and ROUGE-L metrics
      • Report Average F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L metrics
  • Predictions Returned by Model on Test Data

  • Calculating Average F1 Scores for ROUGE-1, ROUGE-2 and ROUGE-L

  • Step 9: Analyze Results
  • Assumption for this Example
    • Here, I am assuming that Model
      • performed well on large Testing Data and we can deploy it in the real-world
  • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
  • Application Phase
    • Model is deployed in Real-world to make predictions on Real-time Data
  • Steps – Make Predictions on Real-time Data
    • Step 1: Take Input from User
    • Step 2: Convert User Input into Feature Vector
      • Exactly same as Feature Vectors of Training and Testing Data
    • Step 3: Apply Model on the Feature Vector
    • Step 4: Return Prediction to the User
  • Example – Generating Text Summary of Real-time Data
  • Step 1: Enter an Urdu News Article

  • Step 2: Tokenize Text

  • Step 3: Text Boundary

  • Step 4: Word to Index Mapping

  • Step 5: Word Embedding

  • Step 6: Apply Model on Feature Vector

  • Step 7: Predict Summary
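Steps 2 to 5 above can be sketched in plain Python. Several things here are my assumptions and not taken from the lecture: I read "Text Boundary" as adding start and end markers, the marker names `<start>`, `<end>` and `<unk>` are hypothetical, the English sentences stand in for Urdu news text, and the random vectors stand in for vectors from a trained Word Embedding Technique such as Word2Vec, GloVe or FastText.

```python
import random

UNK, START, END = "<unk>", "<start>", "<end>"  # hypothetical special tokens


def tokenize(text):
    """Step 2: split the text into tokens on whitespace."""
    return text.split()


def add_boundaries(tokens):
    """Step 3: mark where the text starts and ends."""
    return [START] + tokens + [END]


def build_word_index(training_texts):
    """Assign an integer index to every word seen in the Training Data."""
    word_to_index = {UNK: 0, START: 1, END: 2}
    for text in training_texts:
        for token in tokenize(text):
            if token not in word_to_index:
                word_to_index[token] = len(word_to_index)
    return word_to_index


def to_indices(tokens, word_to_index):
    """Step 4: map each token to its index; unknown words map to <unk>."""
    return [word_to_index.get(token, word_to_index[UNK]) for token in tokens]


def random_embeddings(vocabulary_size, dimension, seed=0):
    """Stand-in for trained embeddings: one random vector per index."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dimension)]
            for _ in range(vocabulary_size)]


training_texts = ["the minister opened the new school"]  # placeholder text
word_to_index = build_word_index(training_texts)

tokens = add_boundaries(tokenize("the minister visited the school"))
indices = to_indices(tokens, word_to_index)
print(indices)  # [1, 3, 4, 0, 3, 7, 2]

# Step 5: look up one embedding vector per index
embeddings = random_embeddings(len(word_to_index), dimension=4)
embedded = [embeddings[i] for i in indices]
```

The word "visited" never appeared in the training text, so it is mapped to index 0 (`<unk>`). The resulting list of vectors is what the Model receives in Step 6.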

  • Feedback Phase

    • A Two Step Process
    • Step 1: After some time, take Feedback from
      • Domain Experts and Users on deployed Text Summarization System
    • Step 2: Make a List of Possible Improvements based on Feedback received
  • Step 11: Improve Text Summarization System based on Feedback
  • Go to Step 1 and improve the Text Summarization System based on
    • List of Possible Improvements made in Step 10

Machine Translation System – Treating a Real-world Problem as a Supervised Machine Learning Problem

  • Machine Translation Problem
  • Task
    • Develop a Machine Translation system for Urdu-English language pair to automatically translate (predict) Source Text (Urdu) into the Target Language (English)
  • Input
    • A Source Text (Urdu)
  • Output
    • Translation of Source Text in Target Language (English)
  • Treated as a
    • Supervised Machine Learning Problem
  • Goal
    • Learn an Input-Output Function
      • i.e. Learn from Input to predict Output
  • Machine Translation is a Sequence to Sequence Problem
  • Machine Translation is a Sequence to Sequence Problem because
    • Input is Unstructured and of variable length
    • Output is Unstructured and of variable length
  • Machine Translation – Input and Output
  • Input
    • Unstructured (Source Text in Urdu)
  • Output
    • Unstructured (Translated Text in English)
  • Note
    • Length of Input is almost same as the Length of Output
    • Recall
      • In Text Summarization
        • Difference in Lengths of Input and Output was quite high
  • Conclusion
    • Completely and correctly understand the Input and Output before treating a Real-world Problem as a Machine Learning Problem
  • Research Focus – Machine Translation System
  • Research Focus
    • Develop a Machine Translation system for Urdu-English language pair to automatically translate Source Text (Urdu) into the Target Language (English)
  • Steps – Treating Machine Translation as a Sequence to Sequence Problem
  • In sha Allah, I will follow the following steps to treat Machine Translation Problem as a Sequence to Sequence Problem
    • Step 1: Decide the Learning Settings
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results

    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Step 1: Decide the Learning Setting
  • In sha Allah, I aim to treat Machine Translation Problem as a
    • Supervised Machine Learning Problem
  • Since both Input and Output are Unstructured (Text) and of variable length, it will be treated as a
    • Sequence to Sequence Problem
  • Step 2: Obtain Sample Data
  • Since I am treating Machine Translation Problem as a Supervised Learning Problem, I will need
    • Annotated Data
  • For more accurate learning, I need
    • Large amount of Annotated Data
    • High-quality Annotated Data
    • Balanced Data
  • Note
    • For simplicity, In sha Allah, I will use a toy Corpus / Dataset of 15 instances only
  • Two Main Choices to Obtain Data
    1. Use an Existing Corpus
    2. Develop Your Own Corpus
  • Since there is a benchmark Corpus / Dataset available for Urdu Machine Translation
    • I will use the existing Corpus / Dataset
      • MT-UE-20 Corpus
  • We obtained Sample Data of 15 instances
  • See mt-sample-data.xlsx File in Supporting Material
  • Sample Data

 

  • Step 3: Understand and Pre-process data
  • Understanding Data
    • Sample Data contains two Attributes
      • Source Text (Urdu)
      • Target Text (Translation of Source Text in English)
    • Separating Input from Output
      • Input comprises one Attribute
        • Source Text (Urdu)
      • Output comprises one Attribute
        • Target Text (English)
  • Pre-processing Data
    • Sample Data is already pre-processed
      • Therefore, no pre-processing is needed
  • Step 4: Represent Data in Machine Understandable Format
  • Statistical Machine Translation (SMT) Techniques and Neural Machine Translation (NMT) Techniques can understand data in
    • Numerical Representation
  • Problem
    • Our Sample Data is in Textual form
      • Therefore, SMT and NMT Techniques cannot understand it
  • Possible Solutions
    • Statistical Machine Translation (SMT) Technique
      • Use Probabilities of Words / Phrases to align Words / Phrases for Machine Translation
    • Neural Machine Translation Techniques
      • Use Word Embedding Techniques to transform
        • Textual Data into Numerical Representation
      • Popular Word Embedding Techniques are
        • Word2Vec
        • GloVe
        • FastText
  • Step 5: Select Suitable Machine Learning Algorithms
  • Previous studies have shown that Good Starting Points for Machine Translation are
    • Statistical Machine Translation Techniques
    • Neural Machine Translation Techniques
  • Step 6: Split Sample Data into Training Data and Testing Data
  • Use Standard Practice for Data Split
    • i.e. Train-Test Split Ratio of
      • 67% – 33%
    • In our dataset / corpus
      • Total Instances = 15
    • Splitting Sample Data into Training Data and Testing Data
      • Training Data = 10
      • Testing Data = 5
  • Training Data

    • See mt-training-data.xlsx File in Supporting Material 

  • Testing Data
    • See mt-testing-data.xlsx File in Supporting Material

  • Step 7: Select Suitable Evaluation Measure(s)
  • Two Choices to Evaluate Machine Translation System
    • Manual Approach
    • Automatic Approach
  • Manual Approach
    • A human (Domain Expert) will manually judge the quality of a translation automatically generated by Machine Translation System
    • Strengths
      • Evaluation will be very accurate and of high-quality
    • Weaknesses
      • It is practically not possible to manually evaluate thousands of translations
  • Automatic Approach
    • A program will automatically judge the quality of translation automatically generated by Machine Translation System
    • Strengths
      • You can quickly and easily evaluate very large Test Data
    • Weaknesses
      • Evaluation will not be very accurate and of high-quality
  • Evaluating Machine Translation System

    • In sha Allah, I will use Automatic Approach to evaluate the performance of Machine Translation system
    • BLEU is a de facto standard to automatically evaluate the performance of Machine Translation systems
    • In sha Allah, I will use following four metrics of BLEU to evaluate Urdu Machine Translation System
      • BLEU-1
      • BLEU-2
      • BLEU-3
      • BLEU-4
    • Note
      • To understand the working of BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics
      • See Tutorial – Evaluating Sequence to Sequence Models using BLEU
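The four BLEU metrics can be illustrated with a simplified sentence-level sketch. The assumptions are mine: BLEU-n is taken to be the cumulative score over 1-grams to n-grams with equal weights, each translation has a single reference, tokens are split on whitespace, and no smoothing is applied, so the score is 0 as soon as one n-gram order has no match. Standard BLEU is normally computed over a whole test corpus. The sentence pair is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference (counts are clipped)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(reference_length, candidate_length):
    """Penalize translations that are shorter than the reference."""
    if candidate_length == 0:
        return 0.0
    if candidate_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / candidate_length)


def bleu(reference, candidate, max_n):
    """Cumulative BLEU-max_n: brevity penalty times geometric mean of precisions."""
    ref, cand = reference.split(), candidate.split()
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(ref), len(cand)) * math.exp(log_mean)


# Hypothetical reference translation and system translation
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

scores = [bleu(reference, candidate, n) for n in range(1, 5)]
print([round(score, 4) for score in scores])  # [0.8333, 0.7071, 0.5, 0.0]
```

The scores fall as n grows because longer n-grams are harder to match. BLEU-4 is 0 here because no 4-gram of the translation occurs in the reference.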
  • Step 8: Execute First Two Phases of Machine Learning Cycle

Recall the Equation

  • Training Phase
    • Use Training Data to build the Model
  • Note that our aim is to
    • Learn an Input-Output Function
  • General Settings – Learning Input-Output Function

  • Training Phase
    • The Table below shows how Statistical Machine Translation Technique works to learn a Model

  • Testing Phase
    • Use Testing Data to compute Error in Model using BLEU-1, BLEU-2, BLEU-3 and BLEU-4 metrics
    • Predictions returned by the Model

  • Automatic Evaluation using BLEU

 

  • Automatic Evaluation using BLEU

 

  • Step 9: Analyze Results
  • Assumption for this Example
    • Here, I am assuming that Model
      • performed well on large Testing Data and we can deploy it in the real-world
  • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
  • Application Phase
    • Model is deployed in Real-world to make predictions on Real-time Data
  • Steps – Make Predictions on Real-time Data
    • Step 1: Take Input from User
    • Step 2: Convert User Input into Numerical Representation
      • Exactly same as Numerical Representation of Training and Testing Data
    • Step 3: Apply Model on the Numerical Representation
    • Step 4: Return Prediction to the User
  • Example – Making Prediction on Real-time Data
    • Table below gives a Step by Step example to automatically translate an unseen instance using Statistical Machine Translation Technique

  • Feedback Phase
    • A Two Step Process
      • Step 1: After some time, take Feedback from
        • Domain Experts and Users on deployed Machine Translation System
      • Step 2: Make a List of Possible Improvements based on Feedback received
  • Step 11: Improve Machine Translation System based on Feedback
  • Go to Step 1 and improve the Machine Translation System based on
    • List of Possible Improvements made in Step 10

Chapter Summary

  • Chapter Summary

In this Chapter, I presented the following main concepts:

  • A Real-world Problem can be treated as a Machine Learning Problem using the following Step by Step approach
    • Step 1: Decide the Learning Setting
    • Step 2: Obtain Sample Data
    • Step 3: Understand and Pre-process Sample Data
    • Step 4: Represent Sample Data in Machine Understandable Format
    • Step 5: Select Suitable Machine Learning Algorithms
    • Step 6: Split Sample Data into Training Data and Testing Data
    • Step 7: Select Suitable Evaluation Measure(s)
    • Step 8: Execute First Two Phases of Machine Learning Cycle
      • Training Phase
      • Testing Phase
    • Step 9: Analyze Results
    • Step 10: Execute 3rd and 4th Phases of Machine Learning Cycle
      • Application Phase
      • Feedback Phase
    • Step 11: Based on Feedback
      • Go to Step 1 and Repeat all the Steps
  • Three main Learning Settings are
    1. Supervised Learning
    2. Unsupervised Learning
    3. Semi-supervised Learning
  • What Type of Data should be obtained depends upon
    • Learning Setting you selected in Step 1
  • Two Main Choices to Obtain Sample Data are
    1. Use Existing Corpora / Datasets
    2. Develop your Own Corpora / Datasets
  • If (Corpora / Datasets Exist for your Research Problem)

      Then

                Use existing Corpora / Datasets

        Else

                You will need to develop your own Corpora / Datasets

  • Very often, Machine Learning Algorithms understand Data represented in the form of
    • Attribute-Value Pair
  • You should consider the following main points when choosing suitable Machine Learning Algorithms for your Machine Learning Problem
    1. Type of Machine Learning Problem
    2. Number of Parameters
    3. Size of Training and Testing Data
    4. Number of Features
    5. Training and Testing Time
    6. Accuracy
    7. Speed and Accuracy in Application Phase
  • Machine Learning Algorithms are designed to solve specific Machine Learning Problems
  • Two Important Points to Know
    1. Complete and correct understanding of the Type of Machine Learning Problem you are trying to solve using Machine Learning Algorithms
    2. In previous studies, what Machine Learning Algorithms have proven to be most effective for the Type of Machine Learning Problem you are solving?
  • Good Starting Points for Classification Problems
    • Feature-based ML Algorithms
      • For Textual Data
        • Random Forest
        • Support Vector Machine
        • Logistic Regression
        • Naïve Bayes
        • Gradient Boost
      • For Image / Video Data
        • Support Vector Machine
        • Regular Neural Networks
        • Logistic Regression
        • Naive Bayes
        • Extreme Learning Machines
        • Random Forest
        • Extreme Gradient Boost
        • Type II Approximate Reasoning
      • For Audio Data
        • Connectionist Temporal Classification
    • Deep Learning ML Algorithms
      • For Textual Data
        • Recurrent Neural Networks (RNN)
        • Long Short-Term Memory (LSTM)
        • BI-LSTM
        • Gated Recurrent Units (GRU)
        • BI-GRU
      • For Image / Video Data
        • Convolutional Neural Networks (most popular)
      • For Audio Data
        • Recurrent Neural Networks (RNN)
  • Good Starting Points for Regression Problems
    • Feature-based ML Algorithms
      • Linear Regression
      • Regression Trees
      • Lasso Regression
      • Multivariate Regression
  • Good Starting Points for Sequence to Sequence Problems
    • For Textual Data
      • Recurrent Neural Networks (RNN)
      • Long Short-Term Memory (LSTM)
      • BI-LSTM
      • Gated Recurrent Units (GRU)
      • BI-GRU
    • For Image / Video Data
      • Convolutional Neural Networks
    • For Audio Data
      • Recurrent Neural Networks (RNN)
  • Good Starting Points for Unsupervised Learning Problems
    • Feature-based ML Algorithms
      • For Textual Data
        • K-Means
        • Agglomerative Hierarchical Clustering
        • Mean-Shift Clustering Algorithm
        • DBSCAN – Density-Based Spatial Clustering of Applications with Noise
        • EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
      • For Image / Video Data
        • K-Means
        • Fuzzy C Means
    • Deep Learning ML Algorithms
      • For Image / Video Data
        • Generative Adversarial Networks
        • Auto-encoders
  • Good Starting Points for Semi-supervised Learning Problems
    • Feature based ML Algorithms
      • Label Spreading Algorithm
  • Question
    • Which Machine Learning Algorithm(s) is best for a specific Machine Learning Problem?
  • Answer
    • Apply all available Machine Learning Algorithms and see which performs best
  • Problem
    • It requires a lot of effort, time and resources to
      • Apply all available Machine Learning Algorithms and find the best one
  • A Possible Solution
    • Start with Good Starting Points
  • Machine Learning Experts say that the following Machine Learning Algorithms are Good Starting Points
    • Feature based ML Algorithms
      • For Structured / Unstructured / Semi-structured Data
        • Support Vector Machine
        • Logistic Regression
    • Deep Learning ML Algorithms
      • For Textual Data
        • Recurrent Neural Network (RNN)
      • For Image / Video Data
        • Convolutional Neural Network (CNN)
      • For Audio Data
        • Recurrent Neural Network (RNN)
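The idea of trying a short list of candidate algorithms and keeping the one that performs best on Test Data can be sketched as follows. This is only an illustration: the two candidates (a majority-class baseline and a nearest-centroid classifier) are simple stand-ins chosen so the example runs with the Python standard library, not the SVM or Logistic Regression starting points named above, and the data is made up.

```python
from collections import Counter


def train_majority(X, y):
    """Stand-in candidate 1: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def train_nearest_centroid(X, y):
    """Stand-in candidate 2: predict the label with the closest mean vector."""
    sums, counts = {}, Counter(y)
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(x):
        def squared_distance(lab):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[lab]))
        return min(centroids, key=squared_distance)

    return predict


def accuracy(model, X, y):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


# Made-up Training and Testing Data with two numeric features.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]]
y_train = ["low", "low", "low", "high", "high"]
X_test = [[1.1, 0.9], [3.8, 4.0], [4.2, 4.1], [0.8, 1.0]]
y_test = ["low", "high", "high", "low"]

candidates = {
    "majority_baseline": train_majority,
    "nearest_centroid": train_nearest_centroid,
}
# Train every candidate on the same Training Data, score on the same Test Data.
scores = {
    name: accuracy(train(X_train, y_train), X_test, y_test)
    for name, train in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Keeping the candidate list short is what makes this affordable; the same loop with every available algorithm is the costly approach described in the Problem above.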
    • An ML Algorithm’s behavior is affected by
      • Number of Parameters
    • Size of Training Data
      • Size of Training Data plays a very important role in the Selection of Suitable ML Algorithms
      • Feature-based ML Algorithms
        • Feature-based ML Algorithms (a.k.a. Classical ML Algorithms) can be accurately trained even if the Training Data is small
      • Deep Learning ML Algorithms
        • To accurately train Deep Learning Algorithms, a huge amount of Training Data is required
      • Size of Testing Data
        • Size of Testing Data plays a very important role when evaluating a Machine Learning Algorithm
        • To be deployed in the Real-world (Application Phase), a Model should fulfill the following two conditions
          • Model should perform well (Condition 01) on large Test Data (Condition 02)
        • Number of Features used to Train a Model has a significant impact on the performance of the Model
        • Selection of the most discriminating Features is important to get good results
        • In Text / Image / Video / Genetic Corpora / Datasets
          • Number of Features is very high compared to the Number of Instances in a Corpus / Dataset
        • Two popular and widely used approaches to reduce the Number of Features in a Corpus / Dataset are
          • Feature Reduction
            • Feature Reduction (a.k.a. Dimensionality Reduction) is a process which transforms Features into a lower dimension
          • Feature Selection
            • Feature Selection is the process of selecting the most discriminating (or important) subset of Features (excluding redundant or irrelevant Features) from the Original Set of Features (without changing them)
          • Popular Methods for Feature Reduction are
            • Principal Component Analysis
            • Generalized Discriminant Analysis
            • Auto-encoders
            • Non-negative Matrix Factorization
          • Popular Methods for Feature Selection are
            • Wrapper Methods
            • Filter Methods
          • Feature Extraction
            • Creates new Features
          • Feature Selection
            • Selects a subset of Features from the Original Set of Features 
          • Given a Corpus / Dataset
            • First carry out
              • Feature Extraction then
              • Feature Reduction / Feature Selection
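The order "Feature Extraction first, then Feature Selection" can be sketched on a toy text corpus. The sketch assumes word counts as the extracted Features and a very simple Filter Method (dropping Features whose value never changes across instances); the three sentences are made up, and real corpora would normally use stronger selection criteria.

```python
from collections import Counter

# Made-up toy corpus.
documents = [
    "i am happy today",
    "i am sad today",
    "i am happy again",
]

# Feature Extraction: create new Features (word counts) from raw text.
vocabulary = sorted({word for doc in documents for word in doc.split()})
count_vectors = [
    [Counter(doc.split())[word] for word in vocabulary] for doc in documents
]


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Feature Selection (simple Filter Method): keep only Features that vary,
# because a constant Feature cannot help discriminate between instances.
columns = list(zip(*count_vectors))
selected = [i for i, column in enumerate(columns) if variance(column) > 0.0]
selected_names = [vocabulary[i] for i in selected]
reduced_vectors = [[row[i] for i in selected] for row in count_vectors]

print(vocabulary)
print(selected_names)
print(reduced_vectors)
```

The selected Features are an unchanged subset of the extracted ones, which is what distinguishes Feature Selection from Feature Reduction.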
            • Training and Testing Time depend mainly on two factors
              1. Size of Training and Testing Data
              2. Target Accuracy
            • Training Time of Deep Learning ML Algorithms is quite high compared to Feature based ML Algorithms

     

    • The Target Accuracy may differ from Machine Learning Problem to Machine Learning Problem
    • Speed and Accuracy requirements in the Application Phase may vary from Machine Learning Problem to Machine Learning Problem
    • Standard Practice for Splitting Sample Data
      • Use a Train-Test Split Ratio of
        • 67% – 33%
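A 67% – 33% Train-Test Split can be sketched with the standard library alone. The function name, the fixed seed and the integer stand-in data are assumptions made for illustration; shuffling before splitting is a common choice, not something the chapter prescribes.

```python
import random


def train_test_split(instances, test_fraction=0.33, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 annotated instances
train, test = train_test_split(data)
print(len(train), len(test))
```

Every instance ends up in exactly one of the two parts, so the Test Data stays unseen during training.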
      • Selection of Suitable Evaluation Measure(s) is important to
        • correctly evaluate the performance of a Model
      • Selection of Suitable Evaluation Measure(s) mainly depends on
        • Type of Machine Learning Problem
      • Some of the most popular and widely used Evaluation Measures for Classification Problems are
        • Baseline Accuracy (a.k.a. Most Common Categorization (MCC))
        • Accuracy
        • True Negative Rate or Specificity
        • False Positive Rate
        • False Negative Rate
        • Recall or True Positive Rate or Sensitivity
        • Precision or Positive Predictive Value
        • F1
        • Area Under the Curve (AUC)
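Most of the Classification measures listed above follow directly from the four confusion-matrix counts of a binary problem. The sketch below assumes binary labels with 1 as the positive class, uses made-up labels and predictions, and leaves out AUC because it needs prediction scores rather than hard labels. Baseline Accuracy is computed here as the share of the most common class in the given labels.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a rate is undefined."""
    return numerator / denominator if denominator else 0.0


def classification_measures(y_true, y_pred, positive=1):
    """Compute common binary Classification measures from labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "baseline_accuracy": Counter(y_true).most_common(1)[0][1] / len(y_true),
        "accuracy": safe_div(tp + tn, len(pairs)),
        "true_negative_rate": safe_div(tn, tn + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up example: 4 positive and 6 negative instances.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
measures = classification_measures(y_true, y_pred)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Comparing Accuracy against Baseline Accuracy shows how much the Model adds over always predicting the most common class.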
      • Some of the most popular and widely used Evaluation Measures for Regression Problems are
        • Mean Absolute Error (MAE)
        • Mean Squared Error (MSE)
        • Root Mean Squared Error (RMSE)
        • R2 or Coefficient of Determination
        • Adjusted R2
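The Regression measures above can be computed from actual and predicted values as in the following sketch. The values are made up (think of GPAs), and `n_features` is an assumed parameter giving the number of input Features used by the Model, which Adjusted R2 needs.

```python
import math


def regression_measures(y_true, y_pred, n_features):
    """Compute MAE, MSE, RMSE, R2 and Adjusted R2."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": r2,
        "adjusted_r2": adjusted_r2,
    }


# Made-up actual and predicted values.
y_true = [3.0, 2.5, 3.5, 4.0]
y_pred = [2.5, 2.5, 4.0, 3.5]
results = regression_measures(y_true, y_pred, n_features=1)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R2 is lower than R2 here because it penalizes the number of Features relative to the number of instances.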
      • Some of the most popular and widely used Evaluation Measures for Sequence-to-Sequence Problems are
        • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
        • BLEU (Bi-Lingual Evaluation Understudy)
        • METEOR (Metric for Evaluation of Translation with Explicit Ordering)
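The intuition behind ROUGE and BLEU is word overlap between a system output and a reference text. The sketch below is deliberately simplified: it computes unigram recall (the idea behind ROUGE-1) and clipped unigram precision (one ingredient of BLEU). Full BLEU also combines longer n-grams and applies a brevity penalty, and METEOR adds stemming and synonym matching, none of which is shown here. The two sentences are made up.

```python
from collections import Counter


def clipped_overlap(candidate_tokens, reference_tokens):
    """Count shared tokens, crediting each token at most as often as it
    appears in the reference."""
    candidate = Counter(candidate_tokens)
    reference = Counter(reference_tokens)
    return sum(min(count, reference[token]) for token, count in candidate.items())


def rouge1_recall(candidate, reference):
    """Share of reference words that also appear in the candidate."""
    c, r = candidate.split(), reference.split()
    return clipped_overlap(c, r) / len(r)


def unigram_precision(candidate, reference):
    """Share of candidate words that also appear in the reference."""
    c, r = candidate.split(), reference.split()
    return clipped_overlap(c, r) / len(c)


reference = "the cat sat on the mat"
candidate = "the cat is on the mat today"
recall = rouge1_recall(candidate, reference)
precision = unigram_precision(candidate, reference)
print(f"ROUGE-1 recall: {recall:.3f}, unigram precision: {precision:.3f}")
```

Recall-oriented overlap suits Text Summarization (did the summary cover the reference?), while precision-oriented overlap suits Machine Translation (is the output supported by the reference?).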
      • Recall the Equation
      • Training Phase
        • Use Training Data to build the Model
      • Testing Phase
        • Use Testing Data to evaluate the performance of the Model
          • i.e. calculate Error in the Model
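The two phases can be sketched with a very small Model: a straight line fitted by least squares on Training Data, whose Error is then measured on separate Testing Data. The choice of a one-feature linear Model, Mean Squared Error as the Error, and the numbers themselves are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Training Phase: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(model, xs, ys):
    """Testing Phase: average squared difference between prediction and truth."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the Model only ever sees the Training Data while being built.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
test_x, test_y = [5.0, 6.0], [10.5, 11.5]

model = fit_line(train_x, train_y)
test_error = mean_squared_error(model, test_x, test_y)
print(model, test_error)
```

The Model fits the Training Data exactly, yet its Error on the Testing Data is non-zero, which is why the Error is measured on data not used for training.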
  • When analyzing results, remember the Machine Learning Assumption

  • In Application Phase
    • Deploy the Model in the Real-world to make predictions on unseen data 
  • In Feedback Phase
    • Take Feedback from
      • Domain Experts
      • Users of the ML system
    • Based on Feedback from Domain Experts and Users
      • Improve your Model
  • In this Lecture, we treated (Step by Step) the following four Real-world Problems as Machine Learning Problems
    1. GPA Prediction Problem
    2. Emotion Prediction Problem
    3. Text Summarization Problem
    4. Machine Translation Problem

In Next Chapter

  • In Sha Allah, in the next Chapter, I will present a detailed discussion on
  • Concept Learning and Hypothesis Representation
© 2022 Ilmo Irfan. All Rights Reserved.