Chapter 3 - Basics of Machine Learning
Chapter Outline
- Chapter Outline
- Quick Recap
- What is Machine Learning?
- Learning Input-Output Functions – General Settings
- Steps to Build Efficient Machine Learning Models
- Treating Real-world Problems as Learning Input-Output Functions
- Types of Machine Learning
- Machine Learning Cycle
- Machine Learning – Training Regimes
- Chapter Summary
Quick Recap
- Quick Recap – Basics of Human Learning
- Develop such a Machine which behaves like Human
- It is essential to first understand
- What is the ultimate goal of Human Learning?
- How does a Human learn?
- What are the main sources of Human Learning?
- How efficiently and quickly can a Human learn?
- How do the Human Heart and other body parts co-ordinate to learn?
- What internal and external factors affect the Human Learning Process?
- The goal of Human Learning is to
- recognize the Creator (God) of heavens and earth
- To have happiness, peace, and prosperity in this life and hereafter,
- use your body, mind, soul and worldly things according to the instructions of the Creator (Allah)
- A human is said to learn if his today (character) is better than his yesterday (character)
- The sign of learning is, purity in thinking
- Advice of My Respected Teacher.
- Adeel! You are a teacher. Always remember.
- When you supervise a female student, you should have same feelings for her as you have for your daughter
- When you work in collaboration with a female colleague, you should have same feelings for her as you have for your sister
- Learning is a Searching Problem, and it continues till death
- To learn any Task, the Human Learning Cycle comprises four main Phases
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
- One of the major problems in Human Learning is how to Quantify the Degree of Learning because
- in the Real-world, majority of things are Subjective
- Generally, to Quantify the Degree of Learning Standard Approaches / Practices are established for a Task
- To systematically learn a Task, use the following Step by Step approach
- Step 1: Define the Task
- Step 2: Define Main Components of Training / Learning and Testing / Evaluation Phases using Standard Approach / Practice
- Main Components of Training / Learning Phase
- Trainer / Instructor
- Standard Approach – Must be a Domain Expert
- Standard Training / Learning Material
- Standard Training / Learning Environment
- Standard Training / Learning Methodology
- Main Components of Testing / Evaluation Phase
- Examiner / Invigilator
- Standard Approach – Must be a Domain Expert
- Standard Testing / Evaluation Material
- Standard Testing / Evaluation Environment
- Standard Testing / Evaluation Methodology
- Standard Evaluation Measure
- Step 3: Trainer will Train the Trainee on the Task during the Training Phase
- Step 4: After the completion of Training Phase
- Examiner will evaluate the performance of the Trainee on the Task that (s)he learned in Step 3 (i.e. Training Phase)
- Step 5: If (Performance in Testing Phase = Good)
- Then
- Allow the Trainee to perform the Task in the real world i.e. Application Phase
- Else
- Ask the Trainee to Go to Step 3 and take more Training and re-appear for Evaluation
- Step 6: After deployment in Real-world i.e. Application Phase
- Take Feedback from both Domain Experts and Users / Audience / Participants (Feedback Phase)
- Step 7: Based on Feedback
- Go to Step 2, and repeat all phases of Human Learning Cycle to further improve learning and keep doing this till death i.e. Be a Learner till Death 😊
- From Machine Learning perspective, Human Learning can be broadly categorized into
- Deductive Learning
- Inductive Learning
- In Deductive Learning Approach, a Concept / Task is learned by using proven knowledge (or success methods)
- To systematically learn a Task through Deductive Learning Approach, a Step by Step approach is as follows,
- Step 1: Define the Learning Task
- Step 2: Search for the proven knowledge (or success methods) used by the most successful person(s) who were an authority in the whole world in the Task you want to learn
- Step 3: Simply follow the proven knowledge (or success methods) used by the successful person(s) in the world and you will be successful in this life and hereafter
- In Inductive Learning Approach, a Human learns from his own experiences
- To systematically learn a Task using Inductive Learning Approach, a Step-by-Step approach is as follows
- Step 1: Define the learning Task
- Step 2: Take examples of the Task to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the task learned from specific examples
What is Machine Learning?
- Machine Learning - Ultimate Goal
- To develop such a Machine which behaves like Human
- Note
- This goal cannot be achieved because humans (creature) can never be perfect like God (Creator)
- Also, Machine Learning is mainly based on Inductive Learning Approach, which has the Scope of Error
- Therefore, all Machine Learning Models will have Scope of Error and Machine cannot be intelligent like human
- Machine Learning
- Definition
- A Machine is said to learn if its today (character) is better than its yesterday (character)
- Purpose
- To develop intelligent programs which can assist (or if possible, replace) humans in various tasks
- Importance
- Information Overload Problem
- In recent years, one of the biggest problems / challenges is information overload
- Practically, it is not possible to manually extract useful information from a massive amount of Data
- To address this problem, we need intelligent programs, which can assist humans to improve the quality of their tasks
- Example
- Task
- Check Plagiarism in an assignment submitted by a university student
- Solution – Two Main Approaches
- Manual Approach
- It is practically not possible for a human to manually identify the source(s) of plagiarism from billions of digital documents (Data)
- Automatic Approach
- A Two-Step Process
- Step 1: An intelligent program automatically searches for a very small subset of the potential source(s) of plagiarism from billions of digital documents and presents that subset to a human
- Step 2: Humans can easily inspect that subset of documents to check whether the assignment is plagiarized or not
- Note
- Automatic approaches mainly assist humans
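The Two-Step Process above can be sketched in a few lines. This is only an illustration under stated assumptions: the three toy documents, the function names `jaccard` and `candidate_sources`, and the choice of word overlap (Jaccard similarity) as the ranking measure are all hypothetical and are not prescribed by the chapter. Real plagiarism detection systems use far more sophisticated retrieval.

```python
def jaccard(a: set, b: set) -> float:
    """Share of words two word sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_sources(assignment: str, documents: dict, top_k: int = 2) -> list:
    """Step 1: rank documents by word overlap and keep the top_k for a human."""
    words = set(assignment.lower().split())
    scored = [
        (jaccard(words, set(text.lower().split())), name)
        for name, text in documents.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical toy collection standing in for "billions of digital documents".
documents = {
    "doc_a": "machine learning is based on inductive learning",
    "doc_b": "the weather today is sunny and warm",
    "doc_c": "inductive learning means learning from examples",
}
assignment = "machine learning is based on inductive learning from examples"

# Step 2 is left to the human: inspect only the shortlisted documents.
shortlist = candidate_sources(assignment, documents)
print(shortlist)  # ['doc_a', 'doc_c']
```

The program does not decide whether the assignment is plagiarized; it only reduces the search space so that a human can make the final judgment.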
- Without Machine Learning, we cannot develop intelligent programs which can assist humans in a range of tasks
- Applications
- Education
- Health Care
- Business
- Agriculture
- Entertainment
- Software Applications
- Natural Language Processing
- Data Science
- Defense
- Disciplines Contributing
- Statistics
- Artificial Intelligence
- Biology
- Cognitive Science
- Information Theory
- Philosophy
- Control Theory
- Computational Complexity
- When to Use?
- When we have Data with three main characteristics
- Large amount of Data
- High-quality Data
- Balanced Data
- When patterns exist in our Data
- Even if we don’t know what they are
- Why is it Hard?
- Example 1
- Example 2
- What is a 2?
- How Machine Learning Works
- Major Challenges in Machine Learning
As discussed earlier, our main goal is to develop a Machine which behaves like a Human
- To achieve this goal
- Need to completely and correctly understand
- How does human learn?
- First Major Problem
- Unfortunately, no one perfectly knows
- What is the structure of the human brain?
- How does the human brain work?
- How different parts of the human body are interacting to learn?
- Second Major Problem
- Human Learning is the most complex task in this world, which makes it practically impossible to identify perfect human learning patterns
- Third Major Problem
- Rate of Human Learning varies from person to person and cannot be judged/predicted accurately
- Fourth Major Problem
- A number of external factors also influence Human Learning
- Three main factors are
- Devil
- Self
- Environment
- Conclusion
- To conclude, it is practically not possible to achieve this goal
- However, we can build intelligent programs, which can do several useful tasks like
- Spam Email Detection
- Gender Identification
- Age Group Identification
- Sentiment Analysis
- Face Recognition
- Machine Translation
- Next Word Prediction
- Text Summarization
- Speech to Text
- Text to Speech
- Program
- Definition
- A program Processes the Data according to a Set of Instructions to produce Output
- Solving a Real-world Problem through Programming
- To solve a Real-world Problem through Programming, you need to know four things,
- Purpose
- Input
- Data
- Instructions
- Processing
- Output
- Example – Solving a Problem through Programming (Cont.)
- Problem
- Write a program, which calculates the sum of two integer numbers
- Purpose
- Find the sum of two Integer numbers
- Input
- Data
- Two Integer numbers
- 5, 7
- Instruction(s)
- Add two Integer numbers
- Processing
- Calculate the sum of two Integer numbers
- 5 + 7
- Output
- Sum of two Integer numbers
- 12
- Solving a Problem through Programming (Cont.)
- Considering example on the previous slide
- The main job of a Program is to
- Process the Data based on Instructions to generate the Output
- Program - Input and Output
- Input (Program)
- Data
- 5, 7
- Instruction(s)
- +
- Output (Program)
- Result(s) obtained after processing the Input (Data + Instructions)
- 12
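The sum example can be written as a short Python program. The function name `add` and the variable names are illustrative choices; the chapter itself only specifies the Data (5, 7), the Instruction (+), and the Output (12).

```python
def add(first: int, second: int) -> int:
    """Instruction: add two Integer numbers."""
    return first + second


data = (5, 7)        # Input: Data
result = add(*data)  # Processing: 5 + 7
print(result)        # Output: 12
```

Note that the programmer wrote the Instruction by hand. The program only applies it to the Data.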
- Summary – Traditional Programming
- Traditional Programming vs. Machine Learning
- The main job of a Machine Learning Algorithm is to
- Learn from Input to predict Output
- Machine Learning Algorithm - Input and Output
- Input (Machine Learning Algorithm)
- Data
- Input, Output
- Output (Machine Learning Algorithm)
- An intelligent Program (a.k.a. Model)
- Example - Machine Learning Algorithm (Input and Output)
- Goal of a Machine Learning Algorithm
- Learn from Input to predict Output
- Input to a Machine Learning Algorithm
- Data
- Output of a Machine Learning Algorithm
- An intelligent Program (a.k.a. Model)
- x^2
- where x is an Integer number
- Note
- Machine Learning Algorithms use Data (Input and Output) to learn an Intelligent Program / Model: x^2
- After learning, if you give an Input (say 10) to Intelligent Program / Model, it will predict the Output (100)
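A minimal sketch of this idea follows. It assumes a tiny, made-up set of (Input, Output) pairs and a hypothetical list of four candidate programs to choose from; the `learn` function simply returns the candidate that makes the fewest mistakes on the Data. Real Machine Learning Algorithms search much larger sets of candidates, so treat this only as an illustration of "Data in, Model out".

```python
# Training Data: (Input, Output) pairs. These values are illustrative.
training_data = [(1, 1), (2, 4), (3, 9), (4, 16)]

# Hypothetical candidate programs the learner may choose from.
candidates = {
    "x": lambda x: x,
    "2x": lambda x: 2 * x,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
}


def learn(data, candidates):
    """Return the name and function of the candidate with the fewest mistakes."""
    def mistakes(fn):
        return sum(1 for x, y in data if fn(x) != y)

    name = min(candidates, key=lambda n: mistakes(candidates[n]))
    return name, candidates[name]


# Output of the learner is itself a program (the Model).
model_name, model = learn(training_data, candidates)
print(model_name, model(10))  # x^2 100
```

Unlike the traditional program, nobody typed the Instruction "square the Input"; it was selected because it agrees with the examples.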
- Summary - Machine Learning
- Inductive Learning Approach and Machine Learning
- Machine Learning is based on Inductive Learning Approach
- i.e. Learn from Examples (Data)
- Concept Learning is a major sub-class of Inductive Learning
- For details on Concept Learning See
- Chapter 7 – Concept Learning and Hypothesis Representation
- Much learning is acquiring general concepts from specific examples
- Steps – How Machines Learn using Inductive Learning Approach
- Step 1: Define the Concept to be learned
- Step 2: Take examples of the Concept to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the Concept learned from specific examples
- Example - Steps (How Machines Learn using Inductive Learning Approach)
- Step 1: Define the Concept to be learned
- What happens when I throw a ball in the air?
- Step 2: Take examples of the Concept to be learned
- 1 example – one time I throw the ball in the air
- 50 examples – 50 times I throw the ball in the air
- 100 examples – 100 times I throw the ball in the air
- Step 3: Learn from Examples
- I went to the nearest park in my colony (specific place) and I threw a ball 100 times in the air and learned that every time (100 times) I threw the ball in the air, it falls downward
- Step 4: Generalize the Concept learned from specific examples
- I conclude, at any place in this world (generalized), if I throw a ball in the air it will fall downwards
- Significance of Inductive Learning Approach in Machine Learning
- Although Inductive Learning has Scope of Error
- Several successful Machine Learning based systems have been developed using Inductive Learning Approach
- Summary – How Machines learn using Inductive Learning Approach
- Learn from Data (Examples)
- In the form of Equation 😊
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 1
- Task 1
- Irfan has three numbers (N1 = 5, N2 = 10, N3 = 7). He wants to write a Program to add the first two numbers and subtract the third number from the sum of the first two numbers
- Write down
- Purpose
- Input
- Data
- Instructions
- Processing
- Output
- Task 2
- Abdul Jabbar scored 80 marks in the Machine Learning Course. He wants to write a Program that tells him whether or not he passed the course. Note: you should use the If-Else Selection Structure in your Program.
- Write down
- Purpose
- Input
- Data
- Instructions
- Processing
- Output
- Task 3
- Consider the following Data (Input and Output)
- What Intelligent Program / Model a Machine Learning Algorithm will learn from Data given in the above Table
- What Output will be predicted by the Model learned from Data for the following unseen instances
- Are the predictions generated by the Model correct for all unseen instances?
- Task 4
- Consider the following Data (Input and Output). Note that both Input and Output are case sensitive
- What Intelligent Program / Model a Machine Learning Algorithm will Learn from Data given in the above Table
- What Output will be predicted by the Model Learned from Data for the following unseen instances
- Are the predictions generated by the Model correct for all unseen instances?
Your Turn Tasks
Your Turn Task 1
- Task 1
- Similar to four Tasks given in TODO
- Task 1
- Find two Real-world Problems and solve them using Programming
- Task 2
- Find two Real-world Problems and solve them using Machine Learning
- Note you must use the same Steps and answer the same questions given in TODO
Learning Input-Output Functions – General Settings
- What is to be Learned?
- Function
- Program
- Finite state machine
- Grammar
- Problem-solving system
- Concept Learning / Function Learning
- As discussed earlier, Concept Learning is a major subclass of Inductive Learning Approach
- Most of Machine Learning revolves around
- Learning Input-Output Functions
- a.k.a. Function Learning or Concept Learning
- Goal – Concept Learning
- In Concept Learning
- Goal of the Learner is to learn a Target Function / Concept from Data (Set of Training Examples)
- Note
- In Machine Learning, Learner refers to a Machine Learning Algorithm
- Learner - Input and Output
- Input to a Learner
- Set of Training Examples (D)
- Set of Functions / Hypothesis (H) (a.k.a. Hypothesis Space)
- Output of a Learner
- A Hypothesis h from H, which is an approximation of the Target Function f
- Note
- h is assumed a priori to be drawn from a Set of Functions / Hypothesis (H)
- Target Function f
- may / may not be in H and
- this may / may not be known
- Why We Cannot Completely Learn a Target Function (f)?
- Question
- Why a Concept cannot be completely Learned?
- Answer
- Inductive Learning Approach has Scope of Error
- Since Concept Learning is a major sub-class of Inductive Learning Approach
- Therefore, Target Function (f) cannot be completely Learned, however, it can be approximated
- Hypothesis (h) is an approximation of the Target Function f
- Learning is a Searching Problem
- Given
- Set of Training Examples (D)
- Set of Hypothesis / Hypothesis Space (H)
- Job of Learner
- Search the Hypothesis Space (H) using the Set of Training Examples (D) and Output a Hypothesis (h) from H which best fits the Set of Training Examples (D)
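The job of the Learner can be sketched as a plain search loop. The example below is an assumption made for illustration: D is a made-up set of marks labelled Pass / Fail, and H is a small Hypothesis Space of threshold rules ("Pass when marks >= t"). Neither comes from the chapter.

```python
# D: Set of Training Examples as (input, f(input)) pairs. Hypothetical data.
D = [(35, "Fail"), (48, "Fail"), (52, "Pass"), (67, "Pass"), (80, "Pass")]

# H: Hypothesis Space. Each hypothesis is a threshold t: "Pass" when marks >= t.
H = list(range(0, 101, 10))


def predict(threshold: int, marks: int) -> str:
    """Apply the hypothesis with the given threshold to one input."""
    return "Pass" if marks >= threshold else "Fail"


def training_errors(threshold: int) -> int:
    """Count the Training Examples this hypothesis gets wrong."""
    return sum(1 for marks, label in D if predict(threshold, marks) != label)


# Search H and output the hypothesis h that best fits D.
best_threshold = min(H, key=training_errors)
print(best_threshold, training_errors(best_threshold))  # 50 0
```

The selected h fits D perfectly, yet it is still only an approximation of f: if the true pass mark were, say, 51, the learner could not tell from these five examples.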
- Major Problem – Machine is Dumb
- A Learner needs
- Set of Training Examples (D)
- Set of Hypothesis (H)
- Problem
- Machine is dumb and cannot understand Set of Training Examples (D) and Set of Hypothesis (H)
- Solution
- Change representation of Set of Training Examples (D) and Set of Hypothesis (H) in a format which Learner (ML Algorithm) can understand
- Representation of Example (x) and Hypothesis (h)
- Representation of Hypothesis (h)
- Will be discussed in next Lecture Insha Allah
- Representation of Example (x)
- Will be discussed in next Slides Insha Allah
- Note
- An Example (x) can be
- Training Example or
- Testing Example
- In this Lecture
- x means Example
- d means Training Example
- Example - Representation
- Example is a.k.a. instance, data point or observation
- Very often, Example is represented as
- Attribute-Value Pair
- In Machine Learning
- Attributes are a.k.a. Features
- Representation of Input
- Attribute-Value pair
- To represent Input, need to decide
- Set of Input Attributes
- Data Type of each Input Attribute
- Possible Values of each Input Attribute
- Input can be
- Single valued
- comprises one Input Attribute
- Vector valued
- comprises multiple Input Attributes
- Note
- Input is mostly vector valued
- Values of Input Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
- Discrete – e.g. 10, 25, 10000
- Continuous – e.g. 3.5, 5.9
- Representation of Output
- Attribute-Value pair
- To represent Output, need to decide
- Set of Output Attributes
- Data Type of each Output Attribute
- Possible Values of each Output Attribute
- Output can be
- Single valued
- comprises one Output Attribute
- Vector valued
- comprises multiple Output Attributes
- Note
- Output is mostly single valued
- Values of Output Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
- Discrete – e.g. 10, 25, 10000
- Continuous – e.g. 3.5, 5.9
- Example – Representation of Instance / Example (x)
- Concept to be Learned (Machine Learning Problem)
- Gender Identification
- Input
- Human
- Output
- Gender of a Human
- Instance = Input + Output
- Example – Representation of Instance / Example (x) (Cont.)
- Representation of Input
- Attribute-Value pair
- Input is represented as a Set of 3 Input Attributes
- Input Attributes
- Height
- Weight
- Beard
- Data Type for each Input Attribute
- Height – Categorical
- Weight – Categorical
- Beard – Categorical
- Possible Values for each Input Attribute
- Height – Short, Medium, Tall
- Weight – Small, Medium, Heavy
- Beard – Yes, No
- HINT: Try to identify the most discriminating Input Attributes for a Machine Learning Problem
- Example – Representation of Instance / Example (x) (Cont.)
- Representation of Output
- Attribute-Value pair
- Output is represented as a single Output Attribute
- Output Attribute
- Gender
- Data Type of Output Attribute
- Gender – Categorical
- Possible Values for Output Attribute
- Gender – Male, Female
- Example – Representation of Instance / Example (x), Cont...
- Below are three possible instances for the Gender Identification learning problem
- Instance / Example (x) Representation - Summary
- To summarize
- Instance is a vector of Attribute values
- Example / Instance (x) – Formal Representation
- X refers to Set of Examples
- x refers to a single Example
- Formal Representation of Example (x)
- (xi, f(xi))
- where xi represents the Input and f(xi) represents the Output
- Example – Formal Representation of Example / Instance (x)
- Consider the following Set of Examples (X)
- X = { (x1, f(x1)), (x2, f(x2)), (x3, f(x3)) }
- Here
- x1 = <Short, Medium, No> and f(x1) = Female
- x2 = <Tall, Heavy, Yes> and f(x2) = Male
- x3 = <Medium, Medium, No> and f(x3) = Female
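The Set of Examples X above maps directly onto a simple data structure. The sketch below uses a `NamedTuple` called `Human` for the Input vector; that class name and the field names are illustrative assumptions, while the attribute values and labels are copied from the three instances listed above.

```python
from typing import NamedTuple


class Human(NamedTuple):
    """Input: a vector of three categorical attribute values."""
    height: str  # Short, Medium, Tall
    weight: str  # Small, Medium, Heavy
    beard: str   # Yes, No


# X: Set of Examples, each stored as (x_i, f(x_i)).
X = [
    (Human("Short", "Medium", "No"), "Female"),
    (Human("Tall", "Heavy", "Yes"), "Male"),
    (Human("Medium", "Medium", "No"), "Female"),
]

for x_i, f_x_i in X:
    print(x_i, "->", f_x_i)
```

Here the Input is vector valued (three Input Attributes) and the Output is single valued (one Output Attribute), which matches the notes on Representation of Input and Output.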
- Steps – Learning Input-Output Functions
- Step 1: Define the Concept to be Learned
- Step 2: Take the examples of the Concept to be Learned
- Step 3: Learn from Examples
- Step 4: Generalize the Concept Learned from specific examples
- Example 1 - Learning Input-Output Functions
- Input – Single
- Output – Single
- Example 2 - Learning Input-Output Functions
- Input – Vector valued
- vector of Input Attribute Values
- Output – Single
- Example 3 - Learning Input-Output Functions
- Input – Vector valued
- vector of input attribute values
- Output – Single
- Comparing Example 1, Example 2 and Example 3
- Example 1 is very simple
- Example 2 is more complex than Example 1
- Example 3 is more complex than Example 2
- Learning Input-Output Functions – Summary
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 2
- Task 1
- Machine Learning Problem
- Gender Identification
- Consider the following Set of Training Examples (D) and answer the questions below,
- Machine Learning Problem
- Identify Input Attributes, their Data Types and Possible Values (or Range of Values)
- Identify Output Attribute, its Data Type and Possible Values (or Range of Values)
- How Input and Output are represented for Gender Identification Problem
- Is Input single-valued or Vector-valued? Explain.
- Is Output single-valued or Vector-valued? Explain.
- Can we treat Gender Identification Problem as a Learning Input-Output Function Problem? Explain.
- Task 2
- Consider the Iris Dataset given at UCI Machine Learning Repository
- Answer the following questions
- Copy complete Iris Dataset in your MS Word file
- Identify Input Attributes, their Data Types and Possible Values (or Range of Values)
- Identify Output Attribute, its Data Type and Possible Values (or Range of Values)
- How Input and Output are represented in Iris Dataset
- Is Input single-valued or Vector-valued? Explain.
- Is Output single-valued or Vector-valued? Explain.
- Can we treat Iris Problem as a Learning Input-Output Function Problem? Explain.
Your Turn Tasks
Your Turn Task 2
- Task 1
- Go to UCI Machine Learning Repository and select two Datasets
- For each Dataset, answer the questions given below
- Copy complete Dataset in your MS Word file
- Note if Dataset is very large, then copy a subset of instances from Dataset
- Identify Input Attributes, their Data Types and Possible Values (or Range of Values)
- Identify Output Attribute, its Data Type and Possible Values (or Range of Values)
- How Input and Output are represented in the selected Dataset
- Is Input single-valued or Vector-valued? Explain.
- Is Output single-valued or Vector-valued? Explain.
- Can we treat the Problem presented in selected Dataset as a Learning Input-Output Function Problem? Explain.
Steps to Build Efficient Machine Learning Models
- Machine Learning
- Steps to Build Efficient Machine Learning Models
- Step 1: Build a strong and accurate understanding of Data
- Step 2: Properly pre-process Data, so that it becomes high-quality data
- Step 3: Represent / Transform Data into a format which Machine Learning Algorithms can understand
- Step 4: Identify what Machine Learning Algorithms will be most suitable for your Data (prepared in Step 3)
- Step 5: Train / Test selected (in Step 4) Machine Learning Algorithms on Data (prepared in Step 3)
- Step 1 – Data Understanding
- Data Understanding
- Two Main Approaches
- Manual Inspection
- Automatic Analysis
- Manual Inspection
- You open the file containing your Data and manually analyze and record your observations about Data
- To carry out good manual inspection, you must have
- strong basic understanding of the domain from which Data is collected
- Manual Inspection is more suitable when we have a small amount of Data
- Automatic Analysis
- Two Main Approaches for Automatic Analysis
- Statistical Analysis Tools
- For example, Five Number Summary
- Data Visualization Tools
- For example, Google Charts, Tableau
- In Automatic Analysis
- You apply automatic data analysis tools on your Data and record your observations based on analysis of the Data
- To carry out a good automatic analysis, you must have
- strong basic understanding of the domain from which Data is collected
- strong basic understanding of the automatic data analysis tools that you are using to analyze your Data
- Automatic Analysis is more suitable when we have a huge amount of Data
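As a small illustration of Statistical Analysis, the Five Number Summary (minimum, first quartile, median, third quartile, maximum) can be computed with Python's standard `statistics` module. The `heights_cm` values are made up for this sketch, and the function name `five_number_summary` is an illustrative choice. Several conventions exist for computing quartiles; the "inclusive" method is used here as one reasonable option, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a list of numbers."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical Data: heights of nine people in centimetres.
heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190]
summary = five_number_summary(heights_cm)
print(summary)
```

A summary like this quickly reveals the range and spread of an attribute, and values far outside the quartiles are candidates for closer manual inspection.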
- Step 1 – Data Understanding, (Cont.)
- Question
- What approach (manual or automatic) should be used to build a strong and more accurate understanding of Data?
- Answer
- Perform both manual inspection and automatic analysis
- First, perform manual inspection and then carry out an automatic analysis
- Step 2 – Data Preprocessing
- Once you have correctly understood your Data, then you can decide
- What type of pre-processing will be suitable to convert your Data into high-quality data
- Data Pre-processing – Definition
- Data Pre-processing refers to the technique which transforms raw data into an understandable format
- Main Steps of Data Pre-processing
- Some of the main steps of Data Pre-processing are as follows:
- Data Cleaning
- Data Cleaning (a.k.a. Data Cleansing) is the process of identifying (incomplete, incorrect, inaccurate or irrelevant) parts of the Data and correcting (replacing, modifying, or deleting) them
- Data Cleaning process may include
- Fill in missing values
- Smooth noisy data
- Identify or remove outliers
- Resolve inconsistencies
- Data Integration
- The process of combining Data from different sources (multiple databases, different cubes or files) into a single, unified view
- Data Transformation
- The process of converting Data from one format to another, typically from the format of a source system into the required format of a destination system
- Data Transformation may include
- Data normalization
- Data aggregation
- Data Reduction
- The process of reducing the huge volume of Data but producing the same or similar analytical results
- We perform Data Reduction when we have a very large amount of Data
- Note
- I have only given a basic overview of Data Pre-processing. If you are interested in learning more about it, you can read tutorials or books on Data Pre-processing
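Two of the steps above, filling in missing values (Data Cleaning) and Data normalization (Data Transformation), can be sketched as follows. The weights are made-up values, and replacing a missing value with the mean and using min-max scaling are assumptions chosen for brevity; they are common options, not the only ones.

```python
def fill_missing_with_mean(values):
    """Data Cleaning: replace missing values (None) with the mean of known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data Transformation: rescale values linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw Data: weights in kilograms with one missing value.
raw_weights_kg = [50, None, 70, 90]
cleaned = fill_missing_with_mean(raw_weights_kg)
normalized = min_max_normalize(cleaned)
print(cleaned)     # [50, 70.0, 70, 90]
print(normalized)  # [0.0, 0.5, 0.5, 1.0]
```

The order matters in this sketch: normalizing before cleaning would fail, because `min` and `max` cannot compare `None` with numbers.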
- Step 3 - Represent Data into Machine Understandable Format
- Machine Learning Algorithms can be broadly categorized as
- Feature-based Machine Learning Algorithms (a.k.a. Classical Machine Learning Algorithms)
- Deep Neural Network Architectures (a.k.a. Deep Learning Algorithms)
- Feature-based Machine Learning Algorithms
- Definition
- Feature-based ML Algorithms are based on Manual Feature Engineering
- Strengths
- Feature-based ML Algorithms can even learn from small Training Data
- Weaknesses
- The process of Manual Feature Engineering requires a lot of time, cost and effort because the set of most discriminating features must be identified manually
- Deep Neural Network Architectures
- Definition
- Deep Neural Network Architectures are based on Automatic Feature Engineering
- Strengths
- In Automatic Feature Engineering, the set of most discriminating features is learned automatically
- Weaknesses
- Deep Neural Network Architectures require a very large amount of Training Data
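To make Manual Feature Engineering concrete, the sketch below turns a product review into a small set of hand-crafted features. Everything in it is an assumption for illustration: the two word lists, the four features, and the function name `extract_features` were chosen by hand, which is exactly the manual effort that Feature-based ML Algorithms depend on.

```python
# Hypothetical, hand-picked word lists (a real lexicon would be much larger).
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate"}


def extract_features(review: str) -> dict:
    """Convert raw text into a fixed set of manually designed features."""
    words = [w.strip(".,!?").lower() for w in review.split()]
    return {
        "num_words": len(words),
        "num_positive": sum(w in POSITIVE_WORDS for w in words),
        "num_negative": sum(w in NEGATIVE_WORDS for w in words),
        "has_exclamation": "!" in review,
    }


features = extract_features("Great phone, I love the camera!")
print(features)
```

A Deep Neural Network Architecture would instead receive the raw text (suitably encoded) and learn its own features, at the price of needing far more Training Data.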
- Step 4 - Selection of Suitable Machine Learning Algorithms
- Very Important Decision
- To make a good decision
- You must have a high level of expertise in Machine Learning or
- Consult a Machine Learning Expert
- A Two-Step Process
- Step 1: Decide whether you will use
- Feature-based ML Algorithms or
- Deep Neural Network Architectures or
- Both
- Step 2: Which Machine Learning Algorithms from Feature-based and/or Deep Neural Network are more suitable for your Machine Learning Problem
- Important Note
- No one has a definite answer about which Machine Learning Algorithms are most suitable for a Machine Learning Problem but we can start with those Machine Learning Algorithms which have proven effective in solving Machine Learning Problem(s) similar to our Machine Learning Problem
- Recall the Deductive Learning Approach – Learn using proven success methods 😊
- Example 1 - Selection of Suitable Machine Learning Algorithms
- Problem
- Sentiment Analysis of User Reviews / Comments on Products
- Input
- Text (Review / Comments)
- Output
- Sentiment (Positive / Negative / Neutral)
- Step 1: I decide to apply both Feature-based and Deep Neural Network Machine Learning Algorithms on my Sentiment Analysis dataset
- Step 2: Previous research/studies have shown that for textual Data some of the
- Suitable Feature-based ML Algorithms are
- Support Vector Machine
- Logistic Regression
- Random Forest
- Naïve Bayes
- Suitable Neural Network ML Algorithms are
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Bidirectional LSTM (Bi-LSTM)
- Example 2 - Selection of Suitable Machine Learning Algorithms
- Problem
- Emotion Analysis from Image
- Input
- Image
- Output
- Emotion (Happy / Sad / Angry)
- Step 1: I decide to apply Deep Neural Network Machine Learning Algorithms on my Emotion Analysis dataset
- Step 2: Previous research/studies have shown that for image Data some of the
- Suitable Deep Neural Network ML Algorithms are
- Convolutional Neural Networks (CNN)
- Important Note
- You can clearly see from these two examples that
- for textual Data the suggested ML Algorithms are entirely different from the ones suggested for image Data
- Conclusion
- To conclude, if you don’t have a strong and accurate understanding of your Data, you will not be able to select suitable Machine Learning Algorithms for your Machine Learning Problem
- Step 5 - Train / Test Selected ML Algorithms
- Train / Test selected Machine Learning Algorithms on your Data
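A minimal sketch of Step 5 is given below, assuming a tiny made-up Gender Identification dataset and a deliberately simple learner that predicts from a single Input Attribute (Beard) by majority vote. This one-attribute rule is a stand-in chosen for brevity, not an algorithm the chapter prescribes, and the function names are hypothetical. The point is the workflow: train on one part of the Data, test on the held-out part.

```python
from collections import Counter, defaultdict

# Hypothetical examples: (height, weight, beard) -> gender
examples = [
    (("Short", "Medium", "No"), "Female"),
    (("Tall", "Heavy", "Yes"), "Male"),
    (("Medium", "Medium", "No"), "Female"),
    (("Tall", "Medium", "Yes"), "Male"),
    (("Medium", "Heavy", "Yes"), "Male"),
    (("Short", "Small", "No"), "Female"),
]
train, test = examples[:4], examples[4:]  # simple hold-out split

BEARD = 2  # index of the Beard attribute in the input vector


def train_one_attribute_rule(data, attribute_index):
    """Training: map each value of one attribute to its most common label."""
    counts = defaultdict(Counter)
    for attributes, label in data:
        counts[attributes[attribute_index]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(model, attributes, default="Female"):
    """Look up the label for the Beard value; fall back to a default if unseen."""
    return model.get(attributes[BEARD], default)


model = train_one_attribute_rule(train, BEARD)

# Testing: measure accuracy on examples the model never saw during training.
correct = sum(predict(model, attrs) == label for attrs, label in test)
accuracy = correct / len(test)
print(model, accuracy)
```

With only two test examples the accuracy figure says very little; in practice the evaluation needs a much larger test set.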
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 3
- Task 1
- Irfan wants to develop a Face Recognition system
- Questions
- Write down Input and Output for the Face Recognition system
- What type of preprocessing is suitable for Image data?
- Write down the Steps to Build Efficient Machine Learning Models for Face Recognition Problem
- Task 2
- Abdul Jabbar wants to develop a Speech to Text system
- Questions
- Write down Input and Output for the Speech to Text system
- What type of preprocessing is suitable for Speech (audio) data?
- Write down the Steps to Build Efficient Machine Learning Models for Speech to Text Problem
Your Turn Tasks
Your Turn Task 3
- Task 1
- Select any Machine Learning Problem and answer the following questions
- Questions
- Write down Input and Output for your Machine Learning Problem
- What type of preprocessing is suitable for your data?
- Write down the Steps to Build Efficient Machine Learning Models for your Machine Learning Problem
Treating Real World Problems as Learning Input-Output Functions
- Learning Input-Output Functions
- To Learn Input-Output Functions
- First Understand Data i.e.
- Input and
- Output
- Then
- Learn Input-Output Function
- Input
- We can categorize Input in three different ways
- Form of Data – First Categorization of Input
- Text
- Image
- Video
- Audio
- Type of Data – Second Categorization of Input
- Structured
- Unstructured
- Semi-structured
- Length of Data – Third Categorization of Input
- Fixed
- Variable Length
- Output
- We can categorize Output in three different ways
- Form of Data – First Categorization of Output
- Text
- Image
- Video
- Audio
- Type of Data – Second Categorization of Output
- Structured
- Unstructured
- Semi-structured
- Length of Data – Third Categorization of Output
- Fixed
- Variable Length
- Possible Combinations of Input and Output
- Considering different forms of Data
- Possible Combinations of Input and Output
- Considering different types of Data
- Possible Combinations of Input and Output
- Considering different lengths of Data
- Important Note
- For the same Machine Learning Problem, we may get Data in different formats
- In the next slides, In Sha Allah I will try to explain this with examples
- Real-World Problem – Gender Identification
- Real-world Problem
- Automatically Predict the Gender of a Human
- One Possible Solution
- Treat the Gender Identification Problem as Learning Input-Output Function
- In Learning Input-output Functions, the first step is to
- Identify input and Output i.e. Understand Data
- Example 1 – Treating Gender Identification Problem as Learning Input-Output Function
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Text
- Output
- Text
- Considering Type of Data
- Input
- Structured
- Output
- Structured
- Considering Length of Data
- Input
- Fixed (Set of 3 Input Attributes)
- Output
- Fixed (Set of 1 Output Attribute)
- Example 2 – Treating Gender Identification Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Text
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Structured
- Considering Length of Data
- Input
- Variable Length
- Output
- Fixed (Set of 1 Output Attribute)
- Example 3 – Treating Gender Identification Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Image
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Structured
- Considering Length of Data
- Input
- Variable Length
- Output
- Fixed (Set of 1 Output Attribute)
- Example 4 – Treating Gender Identification Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Audio
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Structured
- Considering Length of Data
- Input
- Variable Length
- Output
- Fixed (Set of 1 Output Attribute)
- Example 5 – Treating Sentiment Analysis Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Text
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Structured
- Considering Length of Data
- Input
- Variable Length
- Output
- Fixed (Set of 1 Output Attribute)
- Example 6 – Treating Machine Translation Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Text
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Unstructured
- Considering Length of Data
- Input
- Variable Length
- Output
- Variable Length
- Example 7 – Treating Object Detection Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Image
- Output
- Image
- Considering Type of Data
- Input
- Unstructured
- Output
- Unstructured
- Considering Length of Data
- Input
- Variable Length
- Output
- Variable Length
- Example 8 – Treating Natural Language Description Generation from Image Problem as Learning Input-Output Functions
- Available Data
- Analyze and Understand Data
- Considering Form of Data
- Input
- Image
- Output
- Text
- Considering Type of Data
- Input
- Unstructured
- Output
- Unstructured
- Considering Length of Data
- Input
- Variable Length
- Output
- Variable Length
- Comparison of All 8 Examples
As we move from Example 01 to Example 08, the complexity of the Machine Learning Problem increases 😊
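- The eight examples above can be summarised in code. The following is only a sketch: the class name `IOProfile`, the helper `variable_output_problems`, and the short labels in brackets for the first four examples are my own shorthand, not part of the slides. The form, type, and length values are copied from the examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOProfile:
    """Describes one ML problem by Form, Type and Length of its Input and Output."""
    problem: str
    input_form: str
    output_form: str
    input_type: str
    output_type: str
    input_length: str
    output_length: str


# Values taken from Examples 1 to 8; the bracketed labels are illustrative shorthand.
EXAMPLES = [
    IOProfile("Gender Identification (attributes)", "Text", "Text",
              "Structured", "Structured", "Fixed", "Fixed"),
    IOProfile("Gender Identification (free text)", "Text", "Text",
              "Unstructured", "Structured", "Variable", "Fixed"),
    IOProfile("Gender Identification (image)", "Image", "Text",
              "Unstructured", "Structured", "Variable", "Fixed"),
    IOProfile("Gender Identification (audio)", "Audio", "Text",
              "Unstructured", "Structured", "Variable", "Fixed"),
    IOProfile("Sentiment Analysis", "Text", "Text",
              "Unstructured", "Structured", "Variable", "Fixed"),
    IOProfile("Machine Translation", "Text", "Text",
              "Unstructured", "Unstructured", "Variable", "Variable"),
    IOProfile("Object Detection", "Image", "Image",
              "Unstructured", "Unstructured", "Variable", "Variable"),
    IOProfile("Image Description Generation", "Image", "Text",
              "Unstructured", "Unstructured", "Variable", "Variable"),
]


def variable_output_problems(examples):
    """Return the problems whose Output has Variable Length."""
    return [e.problem for e in examples if e.output_length == "Variable"]


if __name__ == "__main__":
    for name in variable_output_problems(EXAMPLES):
        print(name)
```

- Filtering on `output_length` picks out the last three examples, which is one way to see why they are harder: both the Input and the Output vary in length.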
TODO and Your Turn
Todo Tasks
TODO Task 4
- Task 1
- Go to Google Translate
- Questions
- Identify how many Machine Learning Systems are being used in Google Translate?
- For each Machine Learning System
- Write down its Input and Output considering
- Form of Data
- Type of Data
- Length of Data
Your Turn Tasks
Your Turn Task 4
- Task 1
- Similar to Google Translate, identify a Real-world Application which is mainly based on Machine Learning Systems
- Questions
- Identify how many Machine Learning Systems are being used in your selected Real-world Application?
- For each Machine Learning System
- Write down its Input and Output considering
- Form of Data
- Type of Data
- Length of Data
Types of Machine Learning
- Data
- Data
- Raw Facts and Figures
- Varieties of Data
- Structured Data
- Unstructured Data
- Semi-structured Data
- Structured Data
- Definition
- Structured data refers to that Data that has been organized into a formatted repository (typically a Database)
- A Data structure is a kind of repository that organizes information for that purpose
- Purpose
- Data is stored in a structured format so that it can be
- Easily understood
- Quickly stored and accessed
- Effectively analyzed and processed
- Examples
- Databases
- Names
- Dates
- Addresses
- Credit Card Numbers
- Stock information
- Geo-Location etc.
- Unstructured Data
- Definition
- Unstructured Data is information that either does not have a pre-defined Data model or is not organized in a pre-defined manner
- Unstructured information is typically text-heavy but may contain Data such as dates, numbers, and facts as well
- Purpose
- Unstructured Data is “mostly” used in daily life to communicate with one another
- Examples
- Videos
- Photos
- Audio Files
- E-mail Messages
- Word Processing Documents
- Presentations
- Webpages etc.
- Semi-structured Data
- Definition
- Semi-structured Data is a form of structured Data that does not obey the formal structure of Data models associated with Relational Databases or other forms of Data Tables, but nonetheless contains Tags or other Markers to separate semantic elements and enforce hierarchies of records and fields within the Data
- Purpose
- Tags (or Metadata) are added to unstructured Data to make it easier to understand and search
- Examples
- XML Documents
- HTML Documents
- JSON Documents
- NoSQL Databases etc.
- Four main forms of Data are
- Text
- Image
- Video
- Audio
- Information
- Definition
- Processed form of Data
- Purpose
- Information helps us to Learn
- Example
- Raw Data
- Sidra, 70, Mehwish, 80, Adeel, 90, Ayesha, 80, Imran, 70
- Information
- Organize Data in some structured format to extract meaningful information, e.g. Table
- Insights
- Highest marks in Machine Learning course are: 90
- Lowest marks in Machine Learning course are: 70
- Average marks in Machine Learning course are: 78
- Topper in Machine Learning course is: Adeel
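- The step from Raw Data to Information to Insights can be sketched in a few lines of Python. This is only an illustration: the variable names are my own, and the marks are the Raw Data listed above.

```python
from statistics import mean

# Raw Data: an unorganised sequence of names and marks.
raw_data = ["Sidra", 70, "Mehwish", 80, "Adeel", 90, "Ayesha", 80, "Imran", 70]

# Information: organise the Raw Data into a structured format (name -> marks).
marks = dict(zip(raw_data[0::2], raw_data[1::2]))

# Insights: simple statistics computed from the structured Information.
highest = max(marks.values())
lowest = min(marks.values())
average = mean(marks.values())
topper = max(marks, key=marks.get)

if __name__ == "__main__":
    print("Highest marks:", highest)
    print("Lowest marks:", lowest)
    print("Average marks:", average)
    print("Topper:", topper)
```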
- Data Annotation
- Data annotation
- Definition
- Data Annotation (a.k.a. Data labeling / Data tagging) is the process of labeling Data
- Purpose
- Data is annotated so that Machine Learning Algorithms can more accurately learn from annotated Data
- Who Performs Data Annotation?
- Data Annotation is performed by Domain Experts (humans – a.k.a. annotators/taggers/raters)
- Strengths
- When Machine Learning Algorithms use annotated Data to learn, their learning is more accurate
- Weaknesses
- Data Annotation requires a lot of effort, time and cost
- Examples
- See next Slides 😊
- Example 1 – Data Annotation
- Task
- Annotate textual Data (Users Comments / Reviews on Product (iPhone7)) for Sentiment Analysis and Gender Identification tasks
- Raw Data
- Data Annotation – Sentiment Analysis
- Data Annotation – Gender Identification
- Note
- Same textual Data is annotated for two entirely different tasks i.e. Sentiment Analysis and Gender Identification
- Example 2 – Data Annotation
- Task
- Annotate image Data for Gender Identification, Emotion Analysis, and Age Group Identification tasks
- Raw Data
- Data Annotation – Gender Identification
- Data Annotation – Emotion Analysis
- Data Annotation – Age Group Identification
- Note
- Same image Data is annotated for three different tasks i.e. Gender Identification, Emotion Analysis, and Age Group Identification
- Data and Machine Learning
- For Machine Learning Algorithms, Data is mainly available as
- Un-annotated Data
- Annotated Data
- Semi-annotated Data
- Un-annotated Data
- Output is not associated with the Input
- Example
- Annotated Data
- Output is associated with all the Inputs
- Example
- Semi-annotated Data
- Output is associated with some of the Inputs
- Example
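- The three kinds of Data above differ only in how many Inputs have an Output attached. A minimal sketch, assuming a Dataset is a list of (Input, Output) pairs in which a missing Output is stored as `None`; the function name and the sample reviews are hypothetical.

```python
def annotation_status(dataset):
    """Classify a list of (input, output) pairs; output is None when missing."""
    labelled = sum(1 for _, output in dataset if output is not None)
    if labelled == 0:
        return "Un-annotated"
    if labelled == len(dataset):
        return "Annotated"
    return "Semi-annotated"


# Hypothetical product reviews used only for illustration.
unannotated = [("Battery life is great", None), ("Screen cracked in a week", None)]
annotated = [("Battery life is great", "Positive"), ("Screen cracked in a week", "Negative")]
semi_annotated = [("Battery life is great", "Positive"), ("Screen cracked in a week", None)]

if __name__ == "__main__":
    for data in (unannotated, annotated, semi_annotated):
        print(annotation_status(data))
```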
- Characteristics of Data used for Machine Learning
- Recall the Equation
- For more accurate learning, it is important to have
- Large amount of Data
- High-quality Data
- Balanced Data
- Large Amount of Data Needed for Machine Learning
- Question
- How much Data is good enough for accurate learning?
- A Possible Answer
- Varies from Task to Task
- Note
- Insha Allah, in the next slides we will discuss different Machine Learning Problems and see how much Data will be good enough for them
- Example 1 - Large Amount of Data Needed for Machine Learning
- Task
- Sentiment Analysis from Customer Reviews on Products
- Amount of Data Needed
- 10,000 instances (seems to be a good start)
- Example 2 - Large Amount of Data Needed for Machine Learning
- Task
- Machine Translation
- Amount of Data Needed
- 1 Million instances (seems to be a good start)
- Comparing Example 1 and Example 2
- Finding
- It can be noted that the amount of Data required for Machine Translation is much higher than that required for Sentiment Analysis on Customer Reviews
- Reason
- Machine Translation is a very complex task (Sequence to Sequence Problem) compared to Sentiment Analysis task (Classification Problem)
- Consequently, we need more Data to accurately learn a complex task
- Conclusion
- Complex and big tasks require more effort
- Example
- Duration of PhD = 21 years
- Duration of Matric = 10 years
- The amount of effort required to get a Ph.D. degree is much higher than that required for a Matric degree
- When you SET BIG GOALS in life, then you will need two things to maintain
- Patience and
- Consistency 😊
- Majority of people get demotivated because they want things to happen quickly 😉
- High-quality Data Needed for Machine Learning
- Question
- What do you mean by High-quality Data in Machine Learning?
- A Possible Answer
- A Dataset is said to be of high-quality if it contains instances which are
- Noise Free
- Complete and Correct
- Diversified
- Example - High-quality Data Needed for Machine Learning
- Machine Learning Problem
- Gender Identification
- A Dataset is of high-quality if it is
- Noise Free
- Only contains instances related to Gender Identification Problem
- Should not contain any instances which Machine Learning Algorithms cannot understand
- Complete and Correct
- All instances in the Dataset must be complete
- i.e. there should not be any missing values, inconsistencies, outliers, etc.
- All instances in the Dataset must be correct
- i.e. there should not be any errors in the instances
- Diversified
- Dataset should contain instances (humans) from all 7 continents of the world because the characteristics and behavior of humans in different parts of the world are different
- Note
- Data Pre-processing tools are used to improve the quality of Data
- Balanced Data vs Unbalanced Data
- For more accurate learning, it is important to have Balanced Data
- Balanced Data
- For each class, the Dataset must contain the same number of instances
- Example 1 - Balanced Data vs Unbalanced Data
- Machine Learning Problem
- Gender Identification
- No. of Classes
- Class 1 = Male
- Class 2 = Female
- Dataset Size
- 300 Instances
- Examples – Unbalanced Datasets
- Unbalanced Dataset 1
- Male = 50, Female = 250
- Note that this Dataset is highly unbalanced
- Unbalanced Dataset 2
- Male = 200, Female = 100
- Note that this Dataset is moderately unbalanced
- Reason for Unbalanced Datasets
- For Male and Female classes, the number of instances is not the same
- Example – Balanced Dataset
- Male = 150, Female = 150
- Reason for Balanced Dataset
- For Male and Female classes, the number of instances is the same
- i.e. 50% instances are Male and remaining 50% instances are Female
- Note
- Gender Identification is a Binary Classification Problem because there are two Classes i.e. Male and Female
- Example 2 - Balanced Data vs. Unbalanced Data
- Machine Learning Problem
- Sentiment Analysis
- No. of Classes
- Class 01 = Positive
- Class 02 = Negative
- Class 03 = Neutral
- Dataset Size
- 300 Instances
- Examples – Unbalanced Datasets
- Unbalanced Dataset 01
- Positive = 100, Negative = 200, Neutral = 0
- Note that this Dataset is highly unbalanced
- Unbalanced Dataset 02
- Positive = 50, Negative = 100, Neutral = 150
- Note that this Dataset is moderately unbalanced
- Reason for Unbalanced Dataset
- For Positive, Negative and Neutral classes, the number of instances is not the same
- Example – Balanced Dataset
- Positive = 100, Negative = 100, Neutral = 100
- Reason for Balanced Dataset
- For Positive, Negative and Neutral classes, the number of instances is the same
- i.e. 33.33% are Positive, 33.33% are Negative and 33.33% are Neutral
- Note
- Sentiment Analysis is a Multi-class Classification Problem because there are more than two Classes i.e. Positive, Negative and Neutral
- Remarks - Balanced Data vs Unbalanced Data
- Ideal Situation
- Have a large balanced Dataset with high-quality
- Problem
- It is a difficult and challenging task to create large balanced Datasets with high-quality
- A Possible Solution
- Use large and high-quality Datasets, which are moderately balanced
- Example – Moderately Balanced Datasets
- For a Binary Classification Problem, some possible moderately balanced class distributions are as follows
- 55% – 45%
- 60% – 40%
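- To check how balanced a Dataset is, it is enough to count the instances per class. A minimal sketch follows; the function names and the optional `classes` argument (used to report classes that have zero instances) are my own assumptions. The counts are taken from the Gender Identification and Sentiment Analysis examples above.

```python
from collections import Counter


def class_distribution(labels, classes=None):
    """Return the percentage of instances per class, rounded to 2 decimals."""
    counts = Counter(labels)
    # Report the given classes, or every class seen in the labels.
    classes = list(classes) if classes is not None else list(counts)
    total = len(labels)
    return {c: round(100 * counts[c] / total, 2) for c in classes}


def is_balanced(distribution):
    """A Dataset is Balanced when every class has the same share."""
    return len(set(distribution.values())) == 1


# Unbalanced Dataset 1 and the Balanced Dataset from Example 1.
unbalanced = ["Male"] * 50 + ["Female"] * 250
balanced = ["Male"] * 150 + ["Female"] * 150

# Unbalanced Dataset 01 from Example 2 (no Neutral instances at all).
sentiment = ["Positive"] * 100 + ["Negative"] * 200

if __name__ == "__main__":
    print(class_distribution(unbalanced))
    print(class_distribution(balanced))
    print(class_distribution(sentiment, ["Positive", "Negative", "Neutral"]))
```

- The slides do not give numeric cut-offs between moderately and highly unbalanced, so this sketch only reports percentages and leaves that judgement to the reader.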
- Types of Learning
- Three main types of learning are
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Supervised Learning
- Definition
- In Supervised Learning, a Machine Learning Algorithm learns from Annotated Data
- Annotated Data means that for all Training Examples, Output is associated with Inputs
- Strengths
- Learning is more accurate because the quality of Annotated Data is high (annotated by Domain Experts)
- Weaknesses
- Acquiring Annotated Data requires a lot of time, effort and cost
- Types of Supervised Learning
- Two main types of Supervised Learning are
- Classification
- Regression
- Classification
- Definition
- In Classification, the Output is Categorical (or Discrete)
- Example
- Task
- Gender Identification
- Annotated Data
- Regression
- Definition
- In Regression, the Output is Numeric (or Continuous)
- Example
- Task
- House Price Prediction
- Annotated Data
- Unsupervised Learning
- Definition
- In Unsupervised Learning, a Machine Learning Algorithm learns from Unannotated Data
- Unannotated Data means that for all Training Examples, Output is not associated with Inputs
- Strengths
- You can easily and quickly collect a large amount of Unannotated Data
- Weaknesses
- Learning may not be accurate since the quality of Data is low because it is unannotated
- Semi-supervised Learning
- Definition
- In Semi-supervised Learning, a Machine Learning Algorithm learns from Semi-annotated Data
- Semi-annotated Data means that only for some Training Examples, Output is associated with Inputs
- Strengths
- You can quickly collect a large amount of Semi-annotated Data
- Weaknesses
- Learning may not be accurate since the quality of Data is low because not all Training Examples are annotated
- Machine Learning Algorithms for Different Types of Learning
- For each type of learning, different Machine Learning Algorithms are developed
- Next slides present some of the popular and widely used Machine Learning Algorithms
- Note – Here I am considering Scikit-learn Machine Learning Toolkit implementations
- Popular and Widely Used Machine Learning Algorithms for Different Types of Learning
- ML Algorithms for Supervised Learning
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- ML Algorithms for Classification
- Naïve Bayes
- Random Forest Classifier
- Support Vector Machine Classifier
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- ML Algorithms for Regression
- Random Forest Regressor
- Support Vector Machine Regressor
- Linear Regression
- ML Algorithms for Unsupervised Learning
- K-means Clustering Algorithm
- Agglomerative Hierarchical Clustering Algorithm
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- ML Algorithms for Semi-supervised Learning
- Label Propagation
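- To make Supervised Learning from Annotated Data concrete, here is a minimal K-Nearest Neighbors classifier written in plain Python. It is a teaching sketch, not the Scikit-learn implementation: the function name `knn_predict`, the choice of Euclidean distance, and the tiny two-feature Dataset with classes "A" and "B" are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest
    Training Examples. `train` is a list of (features, label) pairs."""
    # Sort Training Examples by Euclidean distance to the query point.
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical Annotated Data: every Input (two numeric features) has an Output.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

if __name__ == "__main__":
    print(knn_predict(train, (1.1, 1.0)))
    print(knn_predict(train, (5.0, 5.1)))
```

- Because the Output here is Categorical, this is a Classification example; with Numeric Outputs the same neighbours could be averaged instead of voted on, which would turn it into Regression.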
Todo Tasks
TODO Task 5
- Task 1
- Consider the following Raw Data comprising 5 Images
- Questions
- For how many Machine Learning Problems the above Image Data can be annotated?
- For each Machine Learning Problem
- Write its Input and Output
- Write what Type of Machine Learning Algorithms will be more suitable?
- Task 2
- Consider the following Raw Data comprising 5 Text Documents collected from Wikipedia
- Questions
- For how many Machine Learning Problems the above Textual Data can be annotated?
- For each Machine Learning Problem
- Write its Input and Output
- Write what Type of Machine Learning Algorithms will be more suitable?
Your Turn Tasks
Your Turn Task 5
- Task 1
- Collect 5 videos (Raw Data) and answer the question given below
- Questions
- For how many Machine Learning Problems your collected Video Data can be annotated?
- For each Machine Learning Problem
- Write its Input and Output
- Write what Type of Machine Learning Algorithms will be more suitable?
- Task 2
- Collect 5 audios (Raw Data) and answer the question given below
- Questions
- For how many Machine Learning Problems your collected Speech (Audio) Data can be annotated?
- For each Machine Learning Problem
- Write its Input and Output
- Write what Type of Machine Learning Algorithms will be more suitable?
Machine Learning Cycle
- Machine Learning Cycle
- Four main phases of Machine Learning Cycle are
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
- Data Split
- Problem
- For both Training Phase and Testing Phase, we need
- Data
- A Possible Solution
- Split available Data into
- Train Data (or Train set)
- Test Data (or Test set)
- Use Train Data in the Training Phase and Test Data in the Testing Phase
- For all types of learning (supervised, unsupervised and semi-supervised), we need both Training Data and Test Data
- Supervised Learning
- Training Data – must be annotated
- Test Data – must be annotated
- Unsupervised Learning
- Training Data – must be unannotated
- Test Data – must be annotated
- Semi-supervised Learning
- Training Data – must be semi-annotated
- Test Data – must be annotated
- Important Note
- Training Data varies for three types of learning
- Test Data must be annotated for all three types of learning
- Standard Approach for Data Split
- Train set – Use 2 / 3 (67%) of Data
- Test set – Use 1 / 3 (33%) of Data
- Important Note
- Train set and Test set must be disjoint
- i.e. examples occurring in the Train set should not occur in the Test set and vice versa
- Two main approaches to Data split
- Random Data Split
- Class Balanced Data Split
- Random Data Split
- In this approach, the Data Distribution (for each class) in the original Dataset is not followed while splitting Data
- Class Balanced Data Split
- In this approach, the Data Distribution (for each class) in the original Dataset is strictly followed while splitting Data
- Example 1 – Data Split
- Machine Learning Problem
- Gender Identification
- No of Classes
- Class 1 = Male
- Class 2 = Female
- Original Dataset Size
- 600 instances
- Data Distribution
- Male = 300 (50%)
- Female = 300 (50%)
- Train-Test Split Ratio
- 67%-33%
- Random Data Split Approach
- Train set = 400 instances (67% of Original Dataset)
- Male = 250, Female = 150
- Test set = 200 instances (33% of Original Dataset)
- Male = 50, Female = 150
- Reason – Why it is a Random Data Split?
- In Original Dataset, the Data Distribution is
- 50% Male instances and 50% Female instances
- In Train set
- Total Instances = 400
- Male Instances = 250
- Female Instances = 150
- In Test set
- Total Instances = 200
- Male Instances = 50
- Female Instances = 150
- Note that both in the Train set and Test set, the Data Distribution does not match with the Data Distribution in the Original Dataset
- Class Balanced Data Split Approach
- Train set = 400 instances (67% of Original Dataset)
- Male = 200, Female = 200
- Test set = 200 instances (33% of Original Dataset)
- Male = 100, Female = 100
- Reason – Why it is a Class Balanced Data Split?
- In Original Dataset, the Data Distribution is
- 50% Male instances and 50% Female instances
- In Train set
- Total Instances = 400
- Male Instances = 200
- Female Instances = 200
- In Test set
- Total Instances = 200
- Male Instances = 100
- Female Instances = 100
- Note that both in the Train set and Test set, the Data Distribution matches with the Data Distribution in the Original Dataset
- Example 2 – Data Split
- Machine Learning Problem
- Gender Identification
- No of Classes
- Class 1 = Male
- Class 2 = Female
- Original Dataset Size
- 900 instances
- Data Distribution
- Male = 600 (67%)
- Female = 300 (33%)
- Train-Test Split Ratio
- 67%-33%
- Random Data Split Approach
- Train set = 600 instances (67% of Original Dataset)
- Male = 500, Female = 100
- Test set = 300 instances (33% of Original Dataset)
- Male = 100, Female = 200
- Reason – Why it is a Random Data Split?
- In Original Dataset, the Data Distribution is
- 67% Male instances and 33% Female instances
- In Train set
- Total Instances = 600
- Male Instances = 500
- Female Instances = 100
- In Test set
- Total Instances = 300
- Male Instances = 100
- Female Instances = 200
- Note that both in the Train set and Test set, the Data Distribution does not match with the Data Distribution in the Original Dataset
- Class Balanced Data Split Approach
- Train set = 600 instances (67% of Original Dataset)
- Male = 400, Female = 200
- Test set = 300 instances (33% of Original Dataset)
- Male = 200, Female = 100
- Reason – Why it is a Class Balanced Data Split?
- In Original Dataset, the Data Distribution is
- 67% Male instances and 33% Female instances
- In Train set
- Total Instances = 600
- Male Instances = 400
- Female Instances = 200
- In Test set
- Total Instances = 300
- Male Instances = 200
- Female Instances = 100
- Note that both in the Train set and Test set, the Data Distribution matches with the Data Distribution in the Original Dataset
- Summary – Data Split
- It is good to split Data in a Train-Test Split Ratio of 67%-33% using the Class Balanced Data Split Approach
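- The two Data Split approaches can be sketched in plain Python. This is an illustration under my own assumptions: the function names, the `label_of` callback, and the fixed random `seed` are not from the slides. The sample Dataset mirrors Example 2 (600 Male and 300 Female instances, 67%-33% split).

```python
import random
from collections import Counter, defaultdict


def random_split(instances, train_fraction=2 / 3, seed=0):
    """Random Data Split: shuffle everything, ignoring the class distribution."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def class_balanced_split(instances, label_of, train_fraction=2 / 3, seed=0):
    """Class Balanced Data Split: split each class separately so that the
    Train set and Test set follow the original class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance in instances:
        by_class[label_of(instance)].append(instance)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Dataset mirroring Example 2: (identifier, class) pairs.
dataset = [(f"m{i}", "Male") for i in range(600)] + \
          [(f"f{i}", "Female") for i in range(300)]

if __name__ == "__main__":
    train, test = class_balanced_split(dataset, label_of=lambda x: x[1])
    print(Counter(label for _, label in train))
    print(Counter(label for _, label in test))
```

- Both functions return disjoint Train and Test sets because every instance is placed in exactly one of them. The random version can still land close to the original distribution by chance; it simply gives no guarantee.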
- Machine Learning Cycle
- Recall – Four Phases of Machine Learning Cycle
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
- Training / Learning Phase
- Definition
- Use Training Data to build a Model
- Purpose
- Build a Model (or Intelligent Program) from Training Data, to make predictions on unseen Data
- An Important Question
- How well has your Model learned?
- A Possible Answer
- Evaluate the performance of Model on unseen Data (Test Data)
- Testing / Evaluation Phase
- Definition
- Use Test Data to evaluate the performance of the Model (built in the Training Phase)
- Purpose
- Judge how well the Model has learned from the Training Data
- An Important Question
- How to quantify the performance of the Model?
- A Possible Answer
- Use an Evaluation Measure
- Standard Evaluation Measures
- Classification – Standard Evaluation Measures
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F-measure
- Area Under the Curve (AUC)
- Regression – Standard Evaluation Measures
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
- Note – I have only mentioned some of the popular and widely used Evaluation Measures, there are also many others 😊
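- Several of the Evaluation Measures listed above are short enough to compute by hand. The sketch below covers Accuracy, Precision, Recall and F-measure for Classification, and MAE, MSE and RMSE for Regression. The function names and the `positive` argument (the class treated as the positive one) are my own choices, and the F-measure shown is the balanced F1 variant.

```python
import math


def classification_scores(y_true, y_pred, positive):
    """Accuracy, Precision, Recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F-measure": f_measure}


def regression_errors(y_true, y_pred):
    """Mean Absolute Error, Mean Squared Error and Root Mean Squared Error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


if __name__ == "__main__":
    print(classification_scores(["P", "P", "N", "N"], ["P", "N", "N", "P"], "P"))
    print(regression_errors([3, 5], [1, 5]))
```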
- Tip
- To learn any task
- Start with the most popular and widely used approaches
- Also, in learning
- Never Compromise on Quality 😊
- Summary - Training and Testing Phases
- Recall the Equation
- In Training Phase, Model is built using the Training Data
- In Testing Phase, Test Data is used to check the Error in the Model
- Application Phase
- Definition
- Deploy the Model in the real world
- Purpose
- Use the Model to make predictions on future unseen Data for a range of Real-world Applications
- Question
- How can we say that our Model is good enough to perform well in the real world?
- A Possible Answer (ML Assumption)
- Note
- Feedback Phase
- Definition
- Take Feedback from Users and Domain Experts on your deployed Model
- Purpose
- To further improve the deployed Model
- After taking Feedback
- Go to Training Phase to further improve your Model based on the Feedback received from Users and Domain Experts
Todo Tasks
TODO Task 6
- Task 1
- Irfan has a Dataset for the Plagiarism Detection task, which comprises 1000 documents (800 are Plagiarized and 200 are Non-Plagiarized). He wants to Split Data using a Train-Test Split Ratio of 80%-20%.
- Question
- Into which of the following categories does the Dataset fall? Explain your answer.
- Highly Unbalanced Dataset
- Moderately Unbalanced Dataset
- Balanced Dataset
- Show Data Split using Random Data Split Approach
- Show Data Split using Class Balanced Data Split Approach
- What tasks will Irfan have to perform in the following four Phases of the Machine Learning Cycle?
- Training Phase
- Testing Phase
- Application Phase
- Feedback Phase
- What conditions should be checked by Irfan before deploying his Model in Real-world?
- Can Irfan deploy his Model in Real-world? Explain your answer.
- Task 2
- Abdul Jabbar has a Dataset for the Age Group Identification task, which comprises 12000 documents (4000 are [18 – 25], 5000 are [26 – 40], and 3000 are [41 – 100]). He wants to Split Data using a Train-Test Split Ratio of 70%-30%.
- Question
- Into which of the following categories does the Dataset fall? Explain your answer.
- Highly Unbalanced Dataset
- Moderately Unbalanced Dataset
- Balanced Dataset
- Show Data Split using Random Data Split Approach
- Show Data Split using Class Balanced Data Split Approach
- What tasks will Abdul Jabbar have to perform in the following four Phases of the Machine Learning Cycle?
- Training Phase
- Testing Phase
- Application Phase
- Feedback Phase
- What conditions should be checked by Abdul Jabbar before deploying his Model in Real-world?
- Can Abdul Jabbar deploy his Model in Real-world? Explain your answer.
Your Turn Tasks
Your Turn Task 6
- Task 1
- Go to Kaggle Website and select 3 Datasets
- 1 Dataset should be Highly Unbalanced
- 1 Dataset should be Moderately Unbalanced
- 1 Dataset should be Balanced
- For each Dataset, Kaggle has released two files
- Training Data File and
- Testing Data File
- URL: https://www.kaggle.com/Datasets
- Question
- For each Dataset, write down
- Input and Output of the Machine Learning Problem
- Type of Data
- Total Number of Instances
- Total Number of Instances in Training Data
- Total Number of Instances in Testing Data
- Number of Classes
- Number of Instances per Class (i.e. Class Distribution)
- Any other information you find useful
- For each Dataset, based on the statistics collected in Question 1, identify whether your selected Dataset is Split using
- Random Data Split Approach or
- Class Balanced Data Split Approach
- For each Dataset, based on statistics collected in Question 01, write down
- Train-Test Split Ratio
- For each Dataset, describe how many Phases of Machine Learning Cycle are applied?
Machine Learning – Training Regimes
- Training Regime
- Definition
- A systematic way in which Training Data is used by a Machine Learning Algorithm to learn from it
- Purpose
- When we learn (or train) systematically, the quality of Training increases
- Types of Training Regimes
- Some of the main types of Training Regimes are
- Batch Method
- Incremental Method
- On-line Method
- Types of Training Regimes
- Batch Method
- In this method, all Training Examples are available and used all at once to build the Model (or Hypothesis h)
- Incremental Method
- In this method, one member (Training Example) of the Training Data is selected at a time and used to modify the current Hypothesis (h)
- On-line Method
- If Training Examples become available one at a time and are used as they become available, the Training Regime is called On-line Method
- Example
- A robot which is learning a Hypothesis (h) from sensory inputs which control its actions (and hence determines its future sensory inputs)
- Note
- Whatever Training Regime you use, the Learner (Machine Learning Algorithm) will return you a Hypothesis (h), which is an approximation of the Target Function (f)
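- The difference between the Batch Method and the Incremental Method can be shown with a deliberately trivial Hypothesis: predict the mean of the Training Examples seen so far. This is my own toy illustration, not an algorithm from the slides; the function names are hypothetical and the numbers reuse the marks from the earlier Raw Data example.

```python
def batch_fit(examples):
    """Batch Method: all Training Examples are used at once to build h."""
    return sum(examples) / len(examples)


def incremental_fit(examples):
    """Incremental Method: one Training Example at a time modifies the
    current Hypothesis h (here, a running mean)."""
    h, n_seen = 0.0, 0
    for x in examples:
        n_seen += 1
        h += (x - h) / n_seen  # move h towards the new example
    return h


training_examples = [70, 80, 90, 80, 70]

if __name__ == "__main__":
    print(batch_fit(training_examples))
    print(incremental_fit(training_examples))
```

- The same update step inside the loop would also serve the On-line Method; the only change is that the examples would arrive over time instead of being drawn from a stored Training set.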
- Direct Training vs. Indirect Training
- Direct Training
- Each Training Example has associated Output value
- Example
- Indirect Training
- Only sequences of Training Examples have associated Output value
- Example – Chess Game
- Goal
- To learn a function that maps a board position to the next move
- Input
- Sequence of Moves
- Output
- Class 01 = Win
- Class 02 = Lose
- Problem
- We may not know whether an individual move is correct or whether it will lead to a Win
- Only the complete sequence of moves leads to a Win or a Loss
- Solution
- Learner must decide which sequences of moves lead to Win and which lead to Lose
- Training Data– Chess Game
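The Indirect Training setting can be sketched as follows. Only whole sequences of moves carry an Output (Win or Lose), so one simple way to spread that Output over individual moves is to count how often each move appears in winning sequences. The game records and the function name are hypothetical, and the sketch ignores the board position, which a real chess Learner would need.

```python
from collections import defaultdict

# Hypothetical Training Data: (sequence of moves, Output of the whole game)
games = [
    (["e4", "Nf3", "Bc4"], "Win"),
    (["e4", "Nf3", "Ng5"], "Lose"),
    (["d4", "c4", "Nc3"], "Win"),
    (["e4", "Qh5", "Bc4"], "Lose"),
]

def move_win_rates(games):
    """Assign the Output of each sequence to every move in it and
    return the fraction of winning sequences per move."""
    wins = defaultdict(int)
    total = defaultdict(int)
    for moves, outcome in games:
        for move in moves:
            total[move] += 1
            if outcome == "Win":
                wins[move] += 1
    return {move: wins[move] / total[move] for move in total}

rates = move_win_rates(games)
for move, rate in sorted(rates.items()):
    print(move, round(rate, 2))
```

No individual move has its own label here. Every per-move score is inferred from the Output of the sequences that contain it, which is the difficulty described under Problem above.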
Todo Tasks
TODO Task 7
- Task 1
- Consider the following Real-world Problems
- Playing Cricket Match
- Path Detection by a Robot
- Getting a Ph.D. Degree
- Questions
- What Training Regime is more suitable for the above Real-world Problems? Explain.
- Incremental Method
- Batch Method
- Online Method
- What Training Regime is more suitable for the above Real-world Problems? Explain.
- Direct Training
- Indirect Training
Your Turn Tasks
Your Turn Task 7
- Task 1
- Identify 6 Real-world Problems
- 2 Real-world Problems should use Incremental Training Regime
- 2 Real-world Problems should use the Batch Training Regime
- 2 Real-world Problems should use Online Training Regime
- Task 2
- Identify 6 Real-world Problems
- 3 Real-world Problems should use Direct Training Regime
- 3 Real-world Problems should use Indirect Training Regime
Chapter Summary
- Chapter Summary
- Before learning / doing a Task, it is essential to know its ultimate goal
- The ultimate goal of Machine Learning is
- To develop a Machine that behaves like a Human
- To stay motivated and successful in this life and hereafter
- Be Realistic
- To be honest, it is not possible to develop a Machine that behaves like a Human because
- Human (creature) can never be perfect like Allah (Creator)
- Machine Learning is mainly based on Inductive Learning Approach, which has Scope of Error
- Therefore, all Machine Learning Models will have Scope of Error and Machine cannot be intelligent like Human
- A Machine is said to learn if its today (character) is better than its yesterday (character)
- Generally, we use Machine Learning
- When we have Data with three main characteristics
- Large amount of Data
- High quality Data
- Balanced Data
- When patterns exist in our Data
- Even if we don’t know what they are
- In Traditional Programming, a Program Processes the Data according to a Set of Instructions to produce Output
- To solve a Real-world Problem through Programming, you need to know four things
- Purpose
- Input (Data and Instructions)
- Processing
- Output
- The diagram below summarizes how Traditional Programming works
- The main job of a Machine Learning Algorithm is to
- Learn from Input to predict Output
- Input to a Machine Learning Algorithm
- Data, i.e. pairs of Input and Output
- Output of a Machine Learning Algorithm
- An intelligent Program (a.k.a. Model)
- The diagram below summarizes how Machine Learning works
- Machine Learning can be summarized in the following Equation
- Data = Model + Error
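The Equation above can be made concrete with a small sketch. It assumes a straight-line Model fitted by ordinary least squares on a handful of hypothetical data points. Each observed value is then exactly the Model's prediction plus a leftover Error term.

```python
# Hypothetical observations (x, y); the values are illustrative only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a line y = slope * x + intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

model = [slope * x + intercept for x in xs]      # what the Model explains
error = [y - m for y, m in zip(ys, model)]       # what it does not explain
reconstructed = [m + e for m, e in zip(model, error)]

for y, m, e in zip(ys, model, error):
    print(f"data={y:.2f}  model={m:.2f}  error={e:+.2f}")
```

A better Model leaves smaller Error terms, but on real Data they rarely all vanish.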
- Most of Machine Learning revolves around
- Learning Input-Output Functions
- a.k.a. Function Learning or Concept Learning
- Learner (Machine Learning Algorithm) – Input and Output
- Input to a Learner
- Set of Training Examples (D)
- Set of Functions / Hypotheses (H) (a.k.a. Hypothesis Space)
- Output of a Learner
- A Hypothesis h from H, which is an approximation of the Target Function f
- The diagram below summarizes the general setting of Learning Input-Output Functions
- In Machine Learning, a Target Function f cannot be completely learned because
- Machine Learning uses Inductive Learning Approach, which has Scope of Error
- Therefore, Target Function (f) cannot be completely learned, however it can be approximated
- Hypothesis (h) is an approximation of the Target Function f
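The general setting can be sketched in a few lines: the Learner receives Training Examples D and a Hypothesis Space H, and returns the h in H with the fewest mistakes on D. The Target Function, the threshold-based Hypothesis Space and all names below are hypothetical choices for illustration.

```python
def target_f(x):
    """Hypothetical Target Function f; in practice it is unknown to the Learner."""
    return 1 if x >= 4.5 else 0

# Training Examples D: (Input, Output) pairs produced by f
D = [(x, target_f(x)) for x in [1, 2, 3, 4, 6, 7, 8]]

# Hypothesis Space H: one threshold rule per integer threshold 0..10
H = list(range(0, 11))

def make_h(threshold):
    """Turn a threshold into a Hypothesis h(x)."""
    return lambda x: 1 if x >= threshold else 0

def training_error(threshold, examples):
    """Number of Training Examples the Hypothesis gets wrong."""
    h = make_h(threshold)
    return sum(1 for x, y in examples if h(x) != y)

def learn(examples, hypothesis_space):
    """Learner: return the Hypothesis with the fewest errors on D."""
    return min(hypothesis_space, key=lambda t: training_error(t, examples))

best_threshold = learn(D, H)
h = make_h(best_threshold)

print("chosen threshold:", best_threshold)
print("training error:", training_error(best_threshold, D))
print("f(4.7) =", target_f(4.7), " h(4.7) =", h(4.7))
```

In this sketch the chosen h makes no mistakes on D, yet it still disagrees with f on the unseen Input 4.7. This is the Scope of Error of Inductive Learning: h approximates f but does not equal it.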
- A Machine is dumb and cannot understand Real-world Objects directly
- Therefore, we need to represent Real-world Objects in a Format which Machine Learning Algorithms can understand
- Very often, Example (a.k.a. instance, data point or observation) is represented as
- Attribute-Value Pair
- Example = Input + Output
- Input is mostly Vector-valued
- Output is mostly Single-valued
- Values of Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
- Discrete – e.g. 10, 25, 10000
- Continuous – e.g. 3.5, 5.9
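As a minimal sketch, an Example can be stored as Attribute-Value Pairs, with a vector-valued Input and a single-valued Output. The attribute names and the helper function are hypothetical. The values reuse the ones listed above.

```python
# One Example = Input (several Attribute-Value Pairs) + Output (single value)
example = {
    "input": {"gender": "Male", "age": 25, "height_ft": 5.9, "smoker": "No"},
    "output": "Yes",
}

def attribute_kind(value):
    """Classify a value as categorical, numeric-discrete or numeric-continuous."""
    if isinstance(value, (bool, str)):
        return "categorical"
    if isinstance(value, int):
        return "numeric-discrete"
    if isinstance(value, float):
        return "numeric-continuous"
    raise TypeError(f"unsupported attribute value: {value!r}")

kinds = {name: attribute_kind(v) for name, v in example["input"].items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```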
- Representation of Hypothesis (h) varies from ML Algorithm to ML Algorithm
- The main steps to Learn an Input-Output Function are as follows
- Step 1: Define the Concept to be learned
- Step 2: Take the examples of the Concept to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the Concept learned from specific examples
- Learning Input-Output Functions can be summarized as
- Learn from Input to predict Output
- To build efficient Machine Learning Models, remember the following Equation
- Machine Learning = Data Understanding and Preprocessing (40%-50%) + Predictive Analysis (50%-60%)
- To build efficient Machine Learning Models, follow these steps
- Step 1: Build strong and accurate understanding of Data
- Step 2: Properly pre-process Data, so that it becomes high-quality data
- Step 3: Represent / Transform Data into a format which Machine Learning Algorithms can understand
- Step 4: Identify what Machine Learning Algorithms will be most suitable for your Data (prepared in Step 3)
- Step 5: Train / Test selected (in Step 4) Machine Learning Algorithms on Data (prepared in Step 3)
- It is very difficult to build an efficient Model unless you have a good understanding of your Data
- To build a strong and accurate understanding of your Data
- First perform manual inspection and then carry out automatic analysis
- Feature-based ML Algorithms are based on Manual Feature Engineering and Deep Learning ML Algorithms are based on Automatic Feature Engineering
- To select most suitable ML Algorithms for a Machine Learning Problem
- You must have a high level of expertise in Machine Learning or consult a Machine Learning Expert
- No one has a definite answer about which Machine Learning Algorithms are most suitable for a given Machine Learning Problem. A practical starting point is the set of Algorithms that have proven effective on similar Problems
- Data is defined as Raw Facts and Figures
- Information is defined as the processed form of Data
- Data can be mainly categorized in three ways
- Form of Data – First Categorization of Data
- Text
- Image
- Video
- Audio
- Type of Data – Second Categorization of Data
- Structured
- Unstructured
- Semi-structured
- Length of Data – Third Categorization of Data
- Fixed Length
- Variable Length
- Data Annotation (a.k.a. data labeling / data tagging) is the process of labeling Data
- Data is annotated so that Machine Learning Algorithms can more accurately learn from annotated data
- Same Data can be annotated for different Tasks
- For Machine Learning Algorithms, Data is mainly available as
- Un-annotated Data
- Output is not associated with Input
- Annotated Data
- Output is associated with Input for all instances
- Semi-annotated Data
- Output is associated with Input for some instances
- For more accurate learning, it is important to have
- Large amount of Data
- High-quality Data
- Balanced Data
- The amount of Data required to accurately learn a Task
- Varies from Task to Task
- Data is said to be of High-quality if it contains instances which are
- Noise Free
- Complete and Correct
- Diversified
- Normally, Data Pre-processing tools are used to improve the quality of Data
- Balanced Data means that the dataset contains the same number of instances for each Class
- It is difficult and challenging to create large, balanced, high-quality datasets
- Therefore, use large and high-quality datasets, which are moderately balanced
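A quick way to judge how balanced a dataset is: count the instances per Class and compare the largest Class with the smallest. The sketch below uses hypothetical labels. The function names and the ratio as a summary number are illustrative choices, not a standard prescribed by the text.

```python
from collections import Counter

def class_distribution(labels):
    """Fraction of instances belonging to each Class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}

def imbalance_ratio(labels):
    """Largest Class size divided by smallest Class size (1.0 = perfectly balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical dataset: 40 instances of one Class, 60 of the other
labels = ["spam"] * 40 + ["ham"] * 60

distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)
print(distribution, ratio)
```

Here the ratio is 1.5, which most practitioners would treat as moderately balanced. A ratio of 1.0 would mean every Class has the same number of instances.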
- A Machine Learning Problem can be mainly categorized as
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- In Supervised Learning, a Machine Learning Algorithm learns from Annotated Data
- In Unsupervised Learning, a Machine Learning Algorithm learns from Unannotated Data
- In Semi-supervised Learning, a Machine Learning Algorithm learns from Semi-annotated Data
- Supervised Learning is broadly categorized into
- Classification
- Output is Categorical
- Regression
- Output is Numeric
- For each type of learning, different Machine Learning Algorithms are developed
- Good Starting Points – ML Algorithms for Supervised Learning
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- Good Starting Points – ML Algorithms for Classification
- Naïve Bayes
- Random Forest Classifier
- Support Vector Machine Classifier
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- Good Starting Points – ML Algorithms for Regression
- Random Forest Regressor
- Support Vector Machine Regressor
- Linear Regression
- Good Starting Points – ML Algorithms for Unsupervised Learning
- K-means Clustering Algorithm
- Agglomerative Hierarchical Clustering Algorithm
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- Good Starting Points – ML Algorithms for Semi-supervised Learning
- Label Propagation
- Four main Phases of Machine Learning Cycle are
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
- In Machine Learning, we split Data into
- Train Data (or Train set)
- Test Data (or Test set)
- Generally, we split Data in a Train-Test Split Ratio of 67%-33%
- Training Data is used in the Training Phase and Test Data is used in the Testing Phase
- Annotation of Training Data varies for three types of learning
- Test Data must be annotated for all three types of learning
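The 67%-33% split can be sketched in two ways. The first shuffles all instances and cuts once (a random split). The second splits each Class separately so that Train and Test keep the same Class proportions. Reading "Class Balanced Data Split" as this proportion-preserving split is an assumption, and the function names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_split(examples, test_fraction=0.33, seed=0):
    """Shuffle all examples together, then cut once into Train and Test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def class_balanced_split(examples, test_fraction=0.33, seed=0):
    """Split each Class separately so both sets keep the Class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical annotated Data: 60 instances of Class A, 30 of Class B
data = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 90)]

rand_train, rand_test = random_split(data)
bal_train, bal_test = class_balanced_split(data)

print("random split test classes:  ", Counter(y for _, y in rand_test))
print("balanced split test classes:", Counter(y for _, y in bal_test))
```

Both approaches give 60 Train and 30 Test instances here. Only the second one guarantees that the Test Data has the same 2:1 Class ratio as the full dataset.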
- In Training Phase, we
- Use Training Data to build a Model
- In Testing Phase, we
- Evaluate the performance of Model on unseen Data (Test Data) using standard Evaluation Measure(s)
- Classification – Standard Evaluation Measures
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F-measure
- Area Under the Curve (AUC)
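Most of the Classification measures above follow directly from counts of correct and incorrect predictions. The sketch below assumes binary labels and takes Baseline Accuracy to mean the Accuracy of always predicting the majority Class, which is an assumption about the intended definition. AUC is omitted because it needs prediction scores rather than hard labels.

```python
from collections import Counter

def classification_report(y_true, y_pred, positive=1):
    """Compute basic Classification measures for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Assumed definition: accuracy of always predicting the majority Class
    baseline = max(Counter(y_true).values()) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "baseline_accuracy": baseline,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical Test Data labels and Model predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

report = classification_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Comparing Accuracy with Baseline Accuracy shows whether the Model has learned more than simply predicting the most frequent Class.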
- Regression – Standard Evaluation Measures
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
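The Regression measures above can be computed from the differences between true and predicted values. The sketch below uses the usual textbook formulas. The data and the `n_features` argument (number of Input attributes, needed for Adjusted R²) are hypothetical.

```python
import math

def regression_report(y_true, y_pred, n_features=1):
    """Compute MAE, MSE, RMSE, R² and Adjusted R² for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes the number of Input attributes used
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adjusted_r2": adj_r2}

# Hypothetical Test Data values and Model predictions
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.0]

scores = regression_report(y_true, y_pred, n_features=1)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE and RMSE are error measures, so lower is better. R² and Adjusted R² measure explained variation, so values closer to 1 are better.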
- Both Training and Testing Phases can be summarized as follows
- Recall the Equation
- Data = Model + Error
- In Training Phase, Model is built using the Training Data
- In Testing Phase, Test Data is used to check the Error in the Model
- If a Model performs well on large Test Data then it is
- Deployed in Real world to make predictions on future unseen instances (Application Phase)
- In Feedback Phase, we take Feedback from Users and Domain Experts on the deployed Model and try to further improve it based on the Feedback received
- Training Regime is a systematic way in which Training Data is used by a Machine Learning Algorithm to learn from it
- Three main types of Training Regimes are
- Batch Method
- In this method, all training examples are available and used all at once to build the Model (or hypothesis h)
- Incremental Method
- In this method, one member (training example) of the Training Data is selected at a time and used to modify the current hypothesis (h)
- On-line Method
- If training instances become available one at a time and are used as they become available, the method is called an on-line method
- Whatever Training Regime you use, the Learner (Machine Learning Algorithm) will return you a hypothesis (h), which is an approximation of the target function
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present a detailed discussion on
- Data and Annotations