Machine Learning

October 18, 2022

Chapter 3 - Basics of Machine Learning

Chapter Outline

Quick Recap
What is Machine Learning?
Learning Input-Output Functions – General Settings
Steps to Build Efficient Machine Learning Models
Treating Real-world Problems as Learning Input-Output Functions
Types of Machine Learning
Machine Learning Cycle
Machine Learning – Training Regimes
Chapter Summary

Quick Recap

Develop such a Machine which behaves like Human
- It is essential to first understand
  - What is the ultimate goal of Human Learning?
  - How Human learns?
  - What are the main sources of Human Learning?
  - How efficiently and quickly a Human can learn?
  - How Human Heart and other body parts co-ordinate to learn?
  - What internal and external factors affect the Human Learning Process?
The goal of Human Learning is to
- recognize the Creator (God) of heavens and earth
To have happiness, peace, and prosperity in this life and hereafter,
- use your body, mind, soul and worldly things according to the instructions of the Creator (Allah)
A human is said to learn if his today (character) is better than his yesterday (character)
The sign of learning is, purity in thinking
Advice of My Respected Teacher.
- Adeel! You are a teacher. Always remember.
- When you supervise a female student, you should have same feelings for her as you have for your daughter
- When you work in collaboration with a female colleague, you should have same feelings for her as you have for your sister
Learning is a Searching Problem, and it continues till death
To learn any Task, the Human Learning Cycle comprises four main Phases

1. Training / Learning Phase
2. Testing / Evaluation Phase
3. Application Phase
4. Feedback Phase

One of the major problems in Human Learning is how to Quantify the Degree of Learning because
- in the Real-world, majority of things are Subjective
Generally, to Quantify the Degree of Learning Standard Approaches / Practices are established for a Task
To systematically learn a Task, use the following Step by Step approach
Step 1: Define the Task
Step 2: Define Main Components of Training / Learning and Testing / Evaluation Phases using Standard Approach / Practice
- Main Components of Training / Learning Phase
  - Trainer / Instructor
    - Standard Approach – Must be a Domain Expert
  - Standard Training / Learning Material
  - Standard Training / Learning Environment
  - Standard Training / Learning Methodology
- Main Components of Testing / Evaluation Phase
  - Examiner / Invigilator
    - Standard Approach – Must be a Domain Expert
  - Standard Testing / Evaluation Material
  - Standard Testing / Evaluation Environment
  - Standard Testing / Evaluation Methodology
  - Standard Evaluation Measure
Step 3: Trainer will Train the Trainee on the Task during the Training Phase
Step 4: After the completion of Training Phase
- Examiner will evaluate the performance of the Trainee on the Task that (s)he learned in Step 3 (i.e. Training Phase)
Step 5: If (Performance in Testing Phase = Good)
- Then
  - Allow the Trainee to perform the Task in the real world i.e. Application Phase
- Else
  - Ask the Trainee to Go to Step 3 and take more Training and re-appear for Evaluation
Step 6: After deployment in Real-world i.e. Application Phase
- Take Feedback from both Domain Experts and Users / Audience / Participants (Feedback Phase)
Step 7: Based on Feedback
- Go to Step 2, and repeat all phases of Human Learning Cycle to further improve learning and keep doing this till death i.e. Be a Learner till Death 😊
From Machine Learning perspective, Human Learning can be broadly categorized into

1. Deductive Learning
2. Inductive Learning

In Deductive Learning Approach, a Concept / Task is learned by using proven knowledge (or success methods)
To systematically learn a Task through Deductive Learning Approach, a Step by Step approach is as follows,
Step 1: Define the Learning Task
Step 2: Search for the proven knowledge (or success methods) used by the most successful person(s) who were an authority in the whole world in the Task you want to learn
Step 3: Simply follow the proven knowledge (or success methods) used by the successful person(s) in the world and you will be successful in this life and hereafter
In Inductive Learning Approach, a Human learns from his own experiences
To systematically learn a Task using Inductive Learning Approach, a Step-by-Step approach is as follows
- Step 1: Define the learning Task
- Step 2: Take examples of the Task to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the task learned from specific examples

What is Machine Learning?

To develop such a Machine which behaves like Human
Note
- This goal cannot be achieved because humans (creature) can never be perfect like God (Creator)
- Also, Machine Learning is mainly based on Inductive Learning Approach, which has the Scope of Error
  - Therefore, all Machine Learning Models will have Scope of Error and Machine cannot be intelligent like human

Definition
- A Machine is said to learn if his today (character) is better than his yesterday (character)
Purpose
- To develop intelligent programs which can assist (or if possible, replace) humans in various tasks
Importance
- Information Overload Problem
  - In recent years, one of the biggest problems/challenges is information overload
  - Practically, it is not possible to manually extract useful information from a massive amount of Data
  - To address this problem, we need intelligent programs, which can assist humans to improve the quality of their tasks
- Example
  - Task
    - Check Plagiarism in an assignment submitted by a university student
  - Solution – Two Main Approaches
    1. Manual Approach
      - It is practically not possible for a human to manually identify the source(s) of plagiarism from billions of digital documents (Data)

    2. Automatic Approach
      - A Two-Step Process
        - Step 1: An intelligent program automatically searches for a very small subset of the potential source(s) of plagiarism from billions of digital documents and presents that subset to a human
        - Step 2: A human can then easily inspect this subset of documents to check whether the assignment is plagiarized or not
      - Note
        - Automatic approaches mainly assist humans
- Without Machine Learning, we cannot develop intelligent programs which can assist humans in a range of tasks
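
Step 1 of the Automatic Approach can be sketched as a tiny retrieval routine. This is only an illustration under stated assumptions: the function names (`tokenize`, `jaccard`, `shortlist_sources`), the sample documents, and the choice of Jaccard word overlap as the similarity score are all hypothetical, and are not the method of any particular plagiarism detection tool.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shortlist_sources(assignment, documents, top_k=2):
    """Step 1: return the top_k document ids most similar to the assignment."""
    query = tokenize(assignment)
    scored = [(jaccard(query, tokenize(text)), doc_id)
              for doc_id, text in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical mini-collection standing in for billions of documents.
documents = {
    "doc_a": "machine learning is learning from examples",
    "doc_b": "the weather today is sunny and warm",
    "doc_c": "inductive learning generalizes from specific examples",
}
assignment = "machine learning is learning from data and examples"

# Step 2 is left to the human, who inspects only this short list.
candidates = shortlist_sources(assignment, documents)
print(candidates)
```

The program only narrows the search; the final judgment on plagiarism stays with the human reviewer.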
Applications
- Education
- Health Care
- Business
- Agriculture
- Entertainment
- Software Applications
- Natural Language Processing
- Data Science
- Defense
Disciplines Contributing
- Statistics
- Artificial Intelligence
- Biology
- Cognitive Science
- Information Theory
- Philosophy
- Control Theory
- Computational Complexity
When to Use?
- When we have Data with three main characteristics
  1. Large amount of Data
  2. High-quality Data
  3. Balanced Data

When patterns exist in our Data
- Even if we don’t know what they are

Why it is Hard?
- Example 1

- Example 2
  - What is a 2?

How Machine Learning Works

As discussed earlier, our main goal is
- To develop such a Machine which behaves like Human

To achieve this goal
- Need to completely and correctly understand
  - How does human learn?
First Major Problem
- Unfortunately, no one perfectly knows
  - What is the structure of the human brain?
  - How does the human brain work?
  - How different parts of the human body are interacting to learn?
Second Major Problem
- Human Learning is the most complex task in this world, which makes it practically impossible to identify perfect human learning patterns
Third Major Problem
- Rate of Human Learning varies from person to person and cannot be judged/predicted accurately
Fourth Major Problem
- A number of external factors also influence Human Learning
  - Three main factors are
    - Devil
    - Self
    - Environment
Conclusion
- To conclude, it is practically not possible to achieve this goal

However, we can build intelligent programs, which can do several useful tasks like
- Spam Email Detection
- Gender Identification
- Age Group Identification
- Sentiment Analysis
- Face Recognition
- Machine Translation
- Next Word Prediction
- Text Summarization
- Speech to Text
- Text to Speech

Definition
- A program Processes the Data according to a Set of Instructions to produce Output

To solve a Real-world Problem through Programming, you need to know four things,

Purpose
Input
- Data
- Instructions
Processing
Output

Problem
- Write a program, which calculates the sum of two integer numbers
Purpose
- Find the sum of two Integer numbers
Input
- Data
  - Two Integer numbers
    - 5, 7
- Instruction(s)
  - Add two Integer numbers
Processing
- Calculate the sum of two Integer numbers
  - 5 + 7
Output
- Sum of two Integer numbers
  - 12

Considering example on the previous slide
The main job of a Program is to
- Process the Data based on Instructions to generate the Output

Input (Program)
- Data
  - 5, 7
- Instruction(s)
  - +
- Output (Program)
  - Result(s) obtained after processing the Input (Data + Instructions)
    - 12
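
The Data + Instructions view of a Program can be written down in a few lines. This is a minimal sketch; the function name `run_program` is a hypothetical helper introduced only for illustration.

```python
import operator


def run_program(data, instruction):
    """Process the Data according to the Instruction and return the Output."""
    return instruction(*data)


data = (5, 7)               # Input: Data (two Integer numbers)
instruction = operator.add  # Input: Instruction (+)

output = run_program(data, instruction)
print(output)
```

Here the instruction is written by a human in advance; nothing is learned from the Data.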

The main job of a Machine Learning Algorithm is to
- Learn from Input to predict Output

Input (Machine Learning Algorithm)
- Data
  - Input, Output
- Output (Machine Learning Algorithm)
  - An intelligent Program (a.k.a. Model)

Goal of a Machine Learning Algorithm
- Learn from Input to predict Output
Input to a Machine Learning Algorithm

- Data

Output of a Machine Learning Algorithm
- An intelligent Program (a.k.a. Model)
  - X²
    - where X is an Integer number
Note
- Machine Learning Algorithms use Data (Input and Output) to learn an Intelligent Program / Model: X²
- After learning, if you give an Input (say 10) to Intelligent Program / Model, it will predict the Output (100)
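
A minimal sketch of this idea follows. It assumes, purely for illustration, that the learner only considers models of the form x**k for a small integer k; the function names `learn` and `model` and the four training pairs are hypothetical. Real Machine Learning Algorithms search much richer families of models.

```python
def learn(examples, max_power=4):
    """Return the exponent k whose model x**k best fits the (input, output) pairs."""
    def squared_error(k):
        return sum((x ** k - y) ** 2 for x, y in examples)
    return min(range(max_power + 1), key=squared_error)


# Input to the Machine Learning Algorithm: Data (Input, Output)
examples = [(1, 1), (2, 4), (3, 9), (4, 16)]

k = learn(examples)  # the learned parameter


def model(x):
    """Output of the Machine Learning Algorithm: the learned Program / Model."""
    return x ** k


print(k, model(10))
```

Unlike the sum program, the instruction (squaring) is never given to the learner; it is recovered from the Input-Output pairs.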

Machine Learning is based on Inductive Learning Approach
- i.e. Learn from Examples (Data)

Concept Learning is a major sub-class of Inductive Learning
- For details on Concept Learning See
  - Chapter 7 – Concept Learning and Hypothesis Representation
Much learning is acquiring general concepts from specific examples

Step 1: Define the Concept to be learned
Step 2: Take examples of the Concept to be learned
Step 3: Learn from Examples
Step 4: Generalize the Concept learned from specific examples

Step 1: Define the Concept to be learned
- What happens when I throw a ball in the air?
Step 2: Take examples of the Concept to be learned
- 1 example – one time I throw the ball in the air
- 50 examples – 50 times I throw the ball in the air
- 100 examples – 100 times I throw the ball in the air
Step 3: Learn from Examples
- I went to the nearest park in my colony (specific place) and I threw a ball 100 times in the air and learned that every time (100 times) I threw the ball in the air, it falls downward
Step 4: Generalize the Concept learned from specific examples
- I conclude, at any place in this world (generalized), if I throw a ball in the air it will fall downwards

Although Inductive Learning has Scope of Error
- Several successful Machine Learning based systems have been developed using Inductive Learning Approach

Learn from Data (Examples)
In the form of Equation 😊

TODO and Your Turn

Todo Tasks

Your Turn Tasks


Learning Input-Output Functions – General Settings

Function
Program
Finite state machine
Grammar
Problem-solving system

As discussed earlier, Concept Learning is a major subclass of Inductive Learning Approach
Most of Machine Learning revolves around
- Learning Input-Output Functions
  - a.k.a. Function Learning or Concept Learning

In Concept Learning
- Goal of the Learner is to learn a Target Function / Concept from Data (Set of Training Examples)
Note
- In Machine Learning, Learner refers to a Machine Learning Algorithm

Input to a Learner
1. Set of Training Examples (D)
2. Set of Functions / Hypothesis (H) (a.k.a. Hypothesis Space)
Output of a Learner
- A Hypothesis h from H, which is an approximation of the Target Function f
Note
- h is assumed a priori to be drawn from a Set of Functions / Hypothesis (H)
- Target Function f
  - may / may not be in H and
  - this may / may not be known

Question
- Why can a Concept not be completely Learned?
Answer
- Inductive Learning Approach has Scope of Error
- Since Concept Learning is a major sub-class of Inductive Learning Approach
- Therefore, Target Function (f) cannot be completely Learned, however, it can be approximated
  - Hypothesis (h) is an approximation of the Target Function f

Given
1. Set of Training Examples (D)
2. Set of Hypothesis / Hypothesis Space (H)
Job of Learner
- Search the Hypothesis Space (H) using the Set of Training Examples (D) and Output a Hypothesis (h) from H which best fits the Set of Training Examples (D)
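
The Job of Learner can be sketched as a search over a small Hypothesis Space. The example below is an illustration only: the Training Examples, the threshold rules in H, and the helper names `training_errors` and `learner` are all hypothetical, and "best fits" is taken to mean "makes the fewest mistakes on D".

```python
def training_errors(h, D):
    """Count the Training Examples (x, f(x)) that hypothesis h gets wrong."""
    return sum(1 for x, fx in D if h(x) != fx)


def learner(D, H):
    """Search H and return the name of the hypothesis that best fits D."""
    return min(H, key=lambda name: training_errors(H[name], D))


# Hypothetical Training Examples: (input number, is it "large"?)
D = [(1, False), (3, False), (6, True), (8, True)]

# Hypothetical Hypothesis Space: one threshold rule per candidate threshold.
H = {f"x >= {t}": (lambda x, t=t: x >= t) for t in (2, 5, 9)}

best = learner(D, H)
print(best, training_errors(H[best], D))
```

The returned h fits D, but it is still only an approximation of the Target Function f: other thresholds between 3 and 6 would fit these four examples equally well.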

A Learner needs
1. Set of Training Examples (D)
2. Set of Hypothesis (H)
Problem
- A Machine is dumb and cannot understand the Set of Training Examples (D) and the Set of Hypothesis (H)
Solution
- Change representation of Set of Training Examples (D) and Set of Hypothesis (H) in a format which Learner (ML Algorithm) can understand

Representation of Hypothesis (h)
- Will be discussed in next Lecture Insha Allah
Representation of Example (x)
- Will be discussed in next Slides Insha Allah
Note
- An Example (x) can be
  1. Training Example or
  2. Testing Example
- In this Lecture
  1. x means Example
  2. d means Training Example

Example is a.k.a. instance, data point or observation
Very often, Example is represented as
- Attribute-Value Pair
In Machine Learning
- Attributes are a.k.a. Features

Representation of Input
- Attribute-Value pair
- To represent Input, need to decide
  - Set of Input Attributes
  - Data Type of each Input Attribute
  - Possible Values of each Input Attribute
- Input can be
  1. Single valued
    - comprises of one Input Attribute
  2. Vector valued
    - comprises of multiple Input Attributes
Note
- Input is mostly vector valued
Values of Input Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
  - Discrete – e.g. 10, 25, 10000
  - Continuous – e.g. 3.5, 5.9
Representation of Output
- Attribute-Value pair
- To represent Output, need to decide
  - Set of Output Attributes
  - Data Types of each Output Attribute
  - Possible Values of each Output Attribute
- Output can be
  - Single valued
    - comprises of one Output Attribute
  - Vector valued
    - comprises of multiple Output Attributes
  - Note
    - Output is mostly single valued
- Values of Output Attributes can be
  - Categorical / Ordinal – e.g. Male, Female, Yes, No
  - Numeric
    - Discrete – e.g. 10, 25, 10000
    - Continuous – e.g. 3.5, 5.9

Concept to be Learned (Machine Learning Problem)
- Gender Identification
Input
- Human
Output
- Gender of a Human
Instance = Input + Output

Representation of Input
- Attribute-Value pair
- Input is represented as a Set of 3 Input Attributes
  - Input Attributes
    - Height
    - Weight
    - Beard
  - Data Type for each Input Attribute
    - Height – Categorical
    - Weight – Categorical
    - Beard – Categorical
  - Possible Values for each Input Attribute
    - Height – Short, Medium, Tall
    - Weight – Small, Medium, Heavy
    - Beard – Yes, No
- HINT: Try to identify the most discriminating Input Attributes for a Machine Learning Problem
Representation of Input

Representation of Output
- Attribute-Value pair
- Output is represented as a single Output Attribute
  - Output Attribute
    - Gender
  - Data Type of Output Attribute
    - Gender – Categorical
  - Possible Values for Output Attribute
    - Gender – Male, Female
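
The Attribute-Value representation above can be turned into a format a Learner can consume. The sketch below assumes one-hot encoding for the Input and an index for the Output; that encoding choice and the names `INPUT_ATTRIBUTES`, `OUTPUT_VALUES`, `encode_input` and `encode_output` are illustrative assumptions, not something prescribed by this chapter.

```python
# Possible Values of each Input Attribute, as listed above.
INPUT_ATTRIBUTES = {
    "Height": ["Short", "Medium", "Tall"],
    "Weight": ["Small", "Medium", "Heavy"],
    "Beard": ["Yes", "No"],
}
# Possible Values of the single Output Attribute.
OUTPUT_VALUES = ["Male", "Female"]


def encode_input(instance):
    """One-hot encode an Attribute-Value dict into a flat list of 0s and 1s."""
    vector = []
    for attribute, values in INPUT_ATTRIBUTES.items():
        value = instance[attribute]
        if value not in values:
            raise ValueError(f"{value!r} is not a valid {attribute}")
        vector.extend(1 if value == v else 0 for v in values)
    return vector


def encode_output(gender):
    """Map the Output Attribute value to its index in OUTPUT_VALUES."""
    return OUTPUT_VALUES.index(gender)


x = {"Height": "Tall", "Weight": "Heavy", "Beard": "Yes"}
print(encode_input(x), encode_output("Male"))
```

Rejecting values outside the declared Possible Values catches data-entry errors before they reach the Learner.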

Representation of Output

Below are three possible instances for the Gender Identification learning problem

To summarize
- Instance is a vector of Attribute values

X refers to Set of Examples
x refers to a single Example
Formal Representation of Example (x)
- (x_i, f(x_i))
- where x_i represents the Input and f(x_i) represents the Output

Consider the following Set of Examples (X)

X = { (x₁, f(x₁)), (x₂, f(x₂)), (x₃, f(x₃)) }
Here
- x₁ = <Short, Medium, No> and f(x₁) = Female
- x₂ = <Tall, Heavy, Yes> and f(x₂) = Male
- x₃ = <Medium, Medium, No> and f(x₃) = Female
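
The Set of Examples X above maps directly onto a list of (Input, Output) pairs. In this sketch the data is copied from the three examples; the hypothesis `h` (predict Male when Beard is Yes) is a hypothetical rule added only to show how a hypothesis is compared with f on the examples.

```python
# Set of Examples X: each element is a pair (x_i, f(x_i)).
# x_i = <Height, Weight, Beard>, f(x_i) = Gender
X = [
    (("Short", "Medium", "No"), "Female"),
    (("Tall", "Heavy", "Yes"), "Male"),
    (("Medium", "Medium", "No"), "Female"),
]


def h(x):
    """Hypothetical hypothesis: predict Male when the Beard attribute is Yes."""
    height, weight, beard = x
    return "Male" if beard == "Yes" else "Female"


inputs = [x_i for x_i, _ in X]
outputs = [fx_i for _, fx_i in X]
agreement = sum(h(x_i) == fx_i for x_i, fx_i in X)
print(f"h agrees with f on {agreement} of {len(X)} examples")
```

Agreement on three examples does not show that h equals f; it only shows that h is consistent with the examples seen so far.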

Step 1: Define the Concept to be Learned
Step 2: Take the examples of the Concept to be Learned
Step 3: Learn from Examples
Step 4: Generalize the Concept Learned from specific examples

Input – Single
Output – Single

Input – Vector valued
- vector of Input Attribute Values
Output – Single

Input – Vector valued
- vector of input attribute values
Output – Single

Example 1 is very simple
Example 2 is more complex than Example 1
Example 3 is more complex than Example 2

TODO and Your Turn

Todo Tasks

Your Turn Tasks


Steps to Build Efficient Machine Learning Models

Step 1: Build a strong and accurate understanding of Data
Step 2: Properly pre-process Data, so that it becomes high-quality data
Step 3: Represent / Transform Data into a format which Machine Learning Algorithms can understand
Step 4: Identify what Machine Learning Algorithms will be most suitable for your Data (prepared in Step 3)
Step 5: Train / Test selected (in Step 4) Machine Learning Algorithms on Data (prepared in Step 3)

Data Understanding
- Two Main Approaches
  1. Manual Inspection
  2. Automatic Analysis
- Manual Inspection
  - You open the file containing your Data and manually analyze and record your observations about Data
  - To carry out good manual inspection, you must have
    - strong basic understanding of the domain from which Data is collected
  - Manual Inspection is more suitable when we have a small amount of Data
- Automatic Analysis
  - Two Main Approaches for Automatic Analysis
    - Statistical Analysis Tools
      - For example, Five Number Summary
    - Data Visualization Tools
      - For example, Google Charts, Tableau
- In Automatic Analysis
  - You apply automatic data analysis tools on your Data and record your observations based on analysis of the Data
  - To carry out a good automatic analysis, you must have
    - strong basic understanding of the domain from which Data is collected
    - strong basic understanding of the automatic data analysis tools that you are using to analyze your Data
  - Automatic Analysis is more suitable when we have a huge amount of Data
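
As one concrete instance of Statistical Analysis, a Five Number Summary (minimum, first quartile, median, third quartile, maximum) can be computed with Python's standard `statistics` module. The helper name `five_number_summary` and the sample heights are hypothetical. Note that several quartile conventions exist; this sketch uses the module's `"inclusive"` method, and other tools may report slightly different quartiles for the same Data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical heights in centimetres, used only for illustration.
heights_cm = [150, 160, 165, 170, 180]
summary = five_number_summary(heights_cm)
print(summary)
```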

Question
- What approach (manual or automatic) should be used to build a strong and more accurate understanding of Data?
Answer
- Perform both manual inspection and automatic analysis
  - First, perform manual inspection and then carry out an automatic analysis

Once you have correctly understood your Data, then you can decide
- What type of pre-processing will be suitable to convert your Data into high-quality data
Data Pre-processing – Definition
- Data Pre-processing refers to the technique which transforms raw data into an understandable format
Main Steps of Data Pre-processing
- Some of the main steps of Data Pre-processing are as follows:
- Data Cleaning
  - Data Cleaning (a.k.a. Data Cleansing) is the process of identifying (incomplete, incorrect, inaccurate or irrelevant) parts of the Data and correcting (replacing, modifying, or deleting) them
  - Data Cleaning process may include
    - Fill in missing values
    - Smooth noisy data
    - Identify or remove outliers
    - Resolve inconsistencies
- Data Integration
  - The process of combining Data from different sources (multiple databases, different cubes or files) into a single, unified view
- Data Transformation
  - The process of converting Data from one format to another, typically from the format of a source system into the required format of a destination system
  - Data Transformation may include
    - Data normalization
    - Data aggregation
- Data Reduction
  - The process of reducing the huge volume of Data but producing the same or similar analytical results
  - We perform Data Reduction when we have a very huge amount of Data
- Note
  - I have only given a basic overview of Data Pre-processing. If you are interested in learning more about it, you can read tutorials or books on Data Pre-processing
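
Two of the steps above, filling in missing values (Data Cleaning) and Data normalization (Data Transformation), can be sketched in plain Python. The helper names, the sample weights, and the specific choices of mean imputation and min-max scaling are assumptions made for illustration; other strategies are equally valid.

```python
def fill_missing(values):
    """Data Cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data Transformation: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw weights in kilograms with one missing value.
raw = [50, None, 70, 90]
cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

Cleaning comes first here because normalization needs every value to be a number.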

Machine Learning Algorithms can be broadly categorized as
1. Feature-based Machine Learning Algorithms (a.k.a. Classical Machine Learning Algorithms)
2. Deep Neural Network Architectures (a.k.a. Deep Learning Algorithms)

Definition
- Feature-based ML Algorithms are based on Manual Feature Engineering
Strengths
- Feature-based ML Algorithms can even learn from small Training Data
Weaknesses
- The process of Manual Feature Engineering requires a lot of time, cost and effort because the set of most discriminating features is identified manually

Definition
- Deep Neural Network Architectures are based on Automatic Feature Engineering
Strengths
- In Automatic Feature Engineering, the set of most discriminating features is learned automatically
Weaknesses
- Deep Neural Network Architectures require a very large amount of Training Data

Very Important Decision
- To make a good decision
  - You must have a high level of expertise in Machine Learning or
  - Consult a Machine Learning Expert
A Two-Step Process
Step 1: Decide whether you will use
- Feature-based ML Algorithms or
- Deep Neural Network Architectures or
- Both
Step 2: Which Machine Learning Algorithms from Feature-based and/or Deep Neural Network are more suitable for your Machine Learning Problem

Important Note
- No one has a definite answer about which Machine Learning Algorithms are most suitable for a Machine Learning Problem but we can start with those Machine Learning Algorithms which have proven effective in solving Machine Learning Problem(s) similar to our Machine Learning Problem
  - Recall the Deductive Learning Approach – Learn using proven success methods 😊

Problem
- Sentiment Analysis of Users Reviews / Comments on Products
Input
- Text (Review / Comments)
Output
- Sentiment (Positive / Negative / Neutral)
Step 1: I decide to apply both Feature-based and Deep Neural Network Machine Learning Algorithms on my Sentiment Analysis dataset
Step 2: Previous research/studies have shown that for textual Data some of the
- Suitable Feature-based ML Algorithms are
  - Support Vector Machine
  - Logistic Regression
  - Random Forest
  - Naïve Bayes
- Suitable Deep Neural Network ML Algorithms are
  - Recurrent Neural Networks (RNN)
  - Long Short-Term Memory (LSTM)
  - Bidirectional LSTM (Bi-LSTM)

Problem
- Emotion Analysis from Image
Input
- Image
Output
- Emotion (Happy / Sad / Angry)
Step 1: I decide to apply Deep Neural Network Machine Learning Algorithms on my Emotion Analysis dataset
Step 2: Previous research/studies have shown that for image Data some of the
- Suitable Deep Neural Network ML Algorithms are
  - Convolutional Neural Networks (CNN)
Important Note
- You can clearly see from these two examples that
  - for textual Data, the suggested ML Algorithms are entirely different from the ones suggested for image Data
Conclusion
- To conclude, if you don’t have a strong and accurate understanding of your Data, you will not be able to select suitable Machine Learning Algorithms for your Machine Learning Problem
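
The two-step decision in the examples above can be recorded as a simple lookup from the form of Data to candidate algorithms. The table below only restates the algorithms named in the Sentiment Analysis and Emotion Analysis examples; the names `CANDIDATES` and `shortlist` are hypothetical, and the empty feature-based list for images means only that none were listed in that example, not that none exist.

```python
# Candidate algorithms named in the two examples above, keyed by form of Data.
CANDIDATES = {
    "text": {
        "feature-based": ["Support Vector Machine", "Logistic Regression",
                          "Random Forest", "Naïve Bayes"],
        "deep": ["RNN", "LSTM", "Bi-LSTM"],
    },
    "image": {
        "feature-based": [],  # none were listed in the image example
        "deep": ["CNN"],
    },
}


def shortlist(form_of_data, families):
    """Return the candidate algorithms for a form of Data and chosen families."""
    by_family = CANDIDATES[form_of_data]
    return [name for family in families for name in by_family[family]]


text_candidates = shortlist("text", ["feature-based", "deep"])
image_candidates = shortlist("image", ["deep"])
print(text_candidates)
print(image_candidates)
```

Such a table is only a starting point; the final choice still depends on experiments with your own Data.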

Train / Test selected Machine Learning Algorithms on your Data

TODO and Your Turn

Todo Tasks

Your Turn Tasks


Treating Real-world Problems as Learning Input-Output Functions

To Learn Input-Output Functions
First Understand Data i.e.
- Input and
- Output
Then
- Learn Input-Output Function

We can categorize Input in three different ways
Form of Data – First Categorization of Input
- Text
- Image
- Video
- Audio
Type of Data – Second Categorization of Input
- Structured
- Unstructured
- Semi-structured
Length of Data – Third Categorization of Input
- Fixed
- Variable Length

We can categorize Output in three different ways
Form of Data – First Categorization of Output
- Text
- Image
- Video
- Audio
Type of Data – Second Categorization of Output
- Structured
- Unstructured
- Semi-structured
Length of Data – Third Categorization of Output
- Fixed
- Variable Length

Considering different forms of Data

Considering different types of Data

Considering different lengths of Data

For the same Machine Learning Problem, we may get Data in different formats
In the next slides, In Sha Allah I will try to explain this with examples

Real-world Problem
- Automatically Predict the Gender of a Human
One Possible Solution
- Treat the Gender Identification Problem as Learning Input-Output Function
In Learning Input-output Functions, the first step is to
- Identify Input and Output i.e. Understand Data

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Text
  - Output
    - Text
- Considering Type of Data
  - Input
    - Structured
  - Output
    - Structured
- Considering Length of Data
  - Input
    - Fixed (Set of 3 Input Attributes)
  - Output
    - Fixed (Set of 1 Output Attribute)

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Text
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Structured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Fixed (Set of 1 Output Attribute)

Available Data
Analyze and Understand Data
- Considering Form of Data
  - Input
    - Image
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Structured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Fixed (Set of 1 Output Attribute)

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Audio
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Structured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Fixed (Set of 1 Output Attribute)

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Text
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Structured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Fixed (Set of 1 Output Attribute)

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Text
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Unstructured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Variable Length

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Image
  - Output
    - Image
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Unstructured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Variable Length

Available Data

Analyze and Understand Data
- Considering Form of Data
  - Input
    - Image
  - Output
    - Text
- Considering Type of Data
  - Input
    - Unstructured
  - Output
    - Unstructured
- Considering Length of Data
  - Input
    - Variable Length
  - Output
    - Variable Length

As we move from Example 01 to Example 08, the complexity of the Machine Learning Problem increases 😊
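
Each "Analyze and Understand Data" block above records the same three properties for Input and Output, so it can be captured in a small data structure. The class and field names below are hypothetical, and the sketch assumes that the first and last analyses in this section correspond to Example 01 and Example 08.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Side:
    """Form, Type and Length of either the Input or the Output."""
    form: str       # Text, Image, Video or Audio
    data_type: str  # Structured, Unstructured or Semi-structured
    length: str     # Fixed or Variable


@dataclass(frozen=True)
class Problem:
    name: str
    input_side: Side
    output_side: Side


# Profiles copied from the first and last analyses above.
example_01 = Problem("Example 01",
                     Side("Text", "Structured", "Fixed"),
                     Side("Text", "Structured", "Fixed"))
example_08 = Problem("Example 08",
                     Side("Image", "Unstructured", "Variable"),
                     Side("Text", "Unstructured", "Variable"))


def describe(problem):
    """Summarize a problem as 'input profile -> output profile'."""
    i, o = problem.input_side, problem.output_side
    return (f"{problem.name}: {i.data_type} {i.form} ({i.length}) -> "
            f"{o.data_type} {o.form} ({o.length})")


print(describe(example_01))
print(describe(example_08))
```

Writing the profile down this way makes the differences between problems explicit before any algorithm is chosen.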

TODO and Your Turn

Todo Tasks

Your Turn Tasks


Types of Machine Learning

Data
- Raw Facts and Figures
Varieties of Data
- Structured Data
- Unstructured Data
- Semi-structured Data
Structured Data
- Definition
  - Structured Data refers to Data that has been organized into a formatted repository (typically a Database)
    - A Data structure is a kind of repository that organizes information for that purpose
- Purpose
  - Data is stored in a structured format so that it can be
    - Easily understood
    - Quickly stored and accessed
    - Effectively analyzed and processed
- Examples
  - Databases
  - Names
  - Dates
  - Addresses
  - Credit Card Numbers
  - Stock information
  - Geo-Location etc.
Unstructured Data
- Definition
  - Unstructured Data is information that either does not have a pre-defined Data model or is not organized in a pre-defined manner
    - Unstructured information is typically text-heavy but may contain Data such as dates, numbers, and facts as well
- Purpose
  - Unstructured Data is “mostly” used in daily life to communicate with one another
- Examples
  - Videos
  - Photos
  - Audio Files
  - E-mail Messages
  - Word Processing Documents
  - Presentations
  - Webpages etc.
Semi-structured Data
- Definition
  - Semi-structured Data is a form of structured Data that does not obey the formal structure of Data models associated with Relational Databases or other forms of Data Tables, but nonetheless contains Tags or other Markers to separate semantic elements and enforce hierarchies of records and fields within the Data
- Purpose
  - Tags (or Metadata) are added to unstructured Data to make it easier to understand and search
- Examples
  - XML Documents
  - HTML Documents
  - JSON Documents
  - NoSQL Databases etc.
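
The three varieties of Data can be compared side by side on one small record. The record itself (a student name and marks) is a hypothetical illustration; the JSON document is parsed with Python's standard `json` module.

```python
import json

# Structured: fixed fields in a fixed order, like a row in a database table.
structured_row = ("Adeel", 90)

# Unstructured: free text with no pre-defined Data model.
unstructured_text = "Adeel scored ninety marks in the Machine Learning course."

# Semi-structured: tags (keys) mark the semantic elements and the hierarchy.
semi_structured = '{"student": {"name": "Adeel", "marks": 90}}'

record = json.loads(semi_structured)
name = record["student"]["name"]
marks = record["student"]["marks"]
print(name, marks)
```

The tags let a program locate the name and marks directly, whereas extracting the same facts from the free-text sentence would require language understanding.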

Four main forms of Data are
1. Text
2. Image
3. Video
4. Audio

Information
- Definition
  - Processed form of Data
- Purpose
  - Information helps us to Learn
- Example
  - Raw Data
    - Sidra, 70, Mehwish, 80, Adeel, 90, Ayesha, 80, Imran, 70
  - Information
    - Organize Data in some structured format to extract meaningful information, e.g. Table

Insights
- Highest marks in Machine Learning course are: 90
- Lowest marks in Machine Learning course are: 70
- Average marks in Machine Learning course are: 78
- Topper in Machine Learning course is: Adeel
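
The path from Raw Data to Information to Insights can be traced in a few lines. The Raw Data is copied from the example above; the variable names are illustrative. The arithmetic mean of the five marks (70, 80, 90, 80, 70) is 390 / 5 = 78.

```python
# Raw Data: a flat sequence of names and marks.
raw_data = ["Sidra", 70, "Mehwish", 80, "Adeel", 90, "Ayesha", 80, "Imran", 70]

# Information: organize the Raw Data as (name, marks) rows.
table = list(zip(raw_data[0::2], raw_data[1::2]))

# Insights: derived from the organized table.
marks = [m for _, m in table]
highest = max(marks)
lowest = min(marks)
average = sum(marks) / len(marks)
topper = max(table, key=lambda row: row[1])[0]

print(highest, lowest, average, topper)
```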

Data annotation
- Definition
  - Data Annotation (a.k.a. Data labeling / Data tagging) is the process of labeling Data
- Purpose
  - Data is annotated so that Machine Learning Algorithms can more accurately learn from annotated Data
- Who Performs Data Annotation?
  - Data Annotation is performed by Domain Experts (humans – a.k.a. annotators/taggers/raters)
- Strengths
  - When Machine Learning Algorithms use annotated Data to learn, their learning is more accurate
- Weaknesses
  - Data Annotation requires a lot of effort, time and cost
- Examples
  - See next Slides 😊

Task
- Annotate textual Data (Users Comments / Reviews on Product (iPhone7)) for Sentiment Analysis and Gender Identification tasks
Raw Data

Data Annotation – Sentiment Analysis

Data Annotation – Gender Identification

Note
- Same textual Data is annotated for two entirely different tasks i.e. Sentiment Analysis and Gender Identification

Task
- Annotate image Data for Gender Identification, Emotion Analysis, and Age Group Identification tasks

Raw Data

Data Annotation – Gender Identification

Data Annotation – Emotion Analysis

Data Annotation – Age Group Identification

Note
- Same image Data is annotated for three different tasks i.e. Gender Identification, Emotion Analysis, and Age Group Identification tasks

For Machine Learning Algorithms, Data is mainly available as
1. Un-annotated Data
2. Annotated Data
3. Semi-annotated Data

Un-annotated Data
- Output is not associated with the Input
- Example

Annotated Data
- Output is associated with all the Inputs
- Example

Semi-annotated Data
- Output is associated with some of the Inputs
- Example
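
The three cases can be told apart by checking how many Inputs have an associated Output. In this sketch a missing Output is represented by `None`; that convention, the function name `annotation_status`, and the sample reviews are hypothetical choices made for illustration.

```python
def annotation_status(dataset):
    """Classify a list of (input, output) pairs; output is None when unlabeled."""
    labeled = sum(1 for _, output in dataset if output is not None)
    if labeled == 0:
        return "Un-annotated"
    if labeled == len(dataset):
        return "Annotated"
    return "Semi-annotated"


# Hypothetical product reviews for a Sentiment Analysis task.
reviews = ["Great phone", "Battery is poor", "Camera is fine"]

unannotated = [(r, None) for r in reviews]
annotated = list(zip(reviews, ["Positive", "Negative", "Neutral"]))
semi_annotated = [(reviews[0], "Positive"), (reviews[1], None), (reviews[2], None)]

for data in (unannotated, annotated, semi_annotated):
    print(annotation_status(data))
```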

Recall the Equation

For more accurate learning, it is important to have
1. Large amount of Data
2. High-quality Data
3. Balanced Data

Question
- How much Data is good enough for accurate learning?
A Possible Answer
- Varies from Task to Task
Note
- Insha Allah, in the next slides we will discuss different Machine Learning Problems and see how much Data will be good enough for them

Task
- Sentiment Analysis from Customer Reviews on Products
Amount of Data Needed
- 10,000 instances (seems to be a good start)

Task
- Machine Translation
Amount of Data Needed
- 1 Million instances (seems to be a good start)

Finding
- It can be noted that the amount of Data required for Machine Translation is very high compared to Sentiment Analysis on Customer Reviews
Reason
- Machine Translation is a very complex task (Sequence to Sequence Problem) compared to Sentiment Analysis task (Classification Problem)
- Consequently, we need more Data to accurately learn a complex task
Conclusion
- Complex and big tasks require more effort
- Example
  - Duration of PhD = 21 years
  - Duration of Matric = 10 years
  - Amount of effort required to get a Ph.D. degree is much higher than that required for a Matric degree
When you SET BIG GOALS in life, then you will need two things to maintain
- Patience and
- Consistency 😊
Majority of people get demotivated because they want things to happen quickly 😉

Question
- What do you mean by High-quality Data in Machine Learning?
A Possible Answer
- A Dataset is said to be of high-quality if it contains instances which are
  - Noise Free
  - Complete and Correct
  - Diversified

Machine Learning Problem
- Gender Identification
A Dataset is of high-quality if it is
- Noise Free
  - Only contains instances related to Gender Identification Problem
  - Should not contain any instances which Machine Learning Algorithms cannot understand
- Complete and Correct
  - All instances in the Dataset must be complete
    - i.e. there should not be any missing values, inconsistencies, outliers, etc.
  - All instances in the Dataset must be correct
    - i.e. there should not be any errors in the instances
- Diversified
  - Dataset should contain instances (humans) from all 7 continents of the world because the characteristics and behavior of humans in different parts of the world are different
Note
- Data Pre-processing tools are used to improve the quality of Data
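As a rough illustration of the "Complete and Correct" checks above, the sketch below filters a tiny, made-up Gender Identification Dataset. The attribute names (`height_cm`, `hair_length`) and the helper functions are assumptions for this example only; real Data Pre-processing tools do far more than this.

```python
# Hypothetical instances: attribute-value pairs plus the Output ("gender").
raw_instances = [
    {"height_cm": 178, "hair_length": "short", "gender": "Male"},
    {"height_cm": None, "hair_length": "long", "gender": "Female"},  # missing value
    {"height_cm": 165, "hair_length": "long", "gender": "Femal"},    # incorrect label
    {"height_cm": 160, "hair_length": "long", "gender": "Female"},
]
VALID_CLASSES = {"Male", "Female"}


def is_complete(instance):
    """An instance is complete if none of its values is missing."""
    return all(value is not None for value in instance.values())


def is_correct(instance):
    """An instance is treated as correct if its Output is one of the known Classes."""
    return instance["gender"] in VALID_CLASSES


clean_instances = [i for i in raw_instances if is_complete(i) and is_correct(i)]
print(len(raw_instances), "raw instances,", len(clean_instances), "kept")
```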

For more accurate learning, it is important to have Balanced Data
Balanced Data
- For each class, the Dataset must contain the same number of instances

Machine Learning Problem
- Gender Identification
No. of Classes
- Class 1 = Male
- Class 2 = Female
Dataset Size
- 300 Instances
Examples – Unbalanced Datasets
- Unbalanced Dataset 1
  - Male = 50, Female = 250
  - Note that this Dataset is highly unbalanced
- Unbalanced Dataset 2
  - Male = 200, Female = 100
  - Note that this Dataset is moderately unbalanced
- Reason for Unbalanced Datasets
  - For Male and Female classes, the number of instances is not the same
Example – Balanced Dataset
- Male = 150, Female = 150
- Reason for Balanced Dataset
  - For Male and Female classes, the number of instances is the same
    - i.e. 50% instances are Male and remaining 50% instances are Female
Note
- Gender Identification is a Binary Classification Problem because there are two Classes i.e. Male and Female
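The class counts in the examples above can be checked with a few lines of Python. This is a minimal sketch; the function names `class_distribution` and `is_balanced` are illustrative and not part of any library.

```python
from collections import Counter


def class_distribution(labels):
    """Return the percentage of instances per Class, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(100 * n / total, 2) for cls, n in counts.items()}


def is_balanced(labels, classes):
    """Balanced means every Class in `classes` has the same number of instances."""
    counts = Counter(labels)
    return len({counts[cls] for cls in classes}) == 1


classes = ["Male", "Female"]
unbalanced_1 = ["Male"] * 50 + ["Female"] * 250
unbalanced_2 = ["Male"] * 200 + ["Female"] * 100
balanced = ["Male"] * 150 + ["Female"] * 150

for name, data in [("Unbalanced 1", unbalanced_1),
                   ("Unbalanced 2", unbalanced_2),
                   ("Balanced", balanced)]:
    print(name, class_distribution(data), is_balanced(data, classes))
```

Passing the list of Classes explicitly means a Class with zero instances is still counted, so a Dataset that lacks one Class entirely is reported as unbalanced.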

Machine Learning Problem
- Sentiment Analysis
No. of Classes
- Class 01 = Positive
- Class 02 = Negative
- Class 03 = Neutral
Dataset Size
- 300 Instances
Examples – Unbalanced Datasets
- Unbalanced Dataset 01
  - Positive = 100, Negative = 200, Neutral = 0
  - Note that this Dataset is highly unbalanced
- Unbalanced Dataset 02
  - Positive = 50, Negative = 100, Neutral = 150
  - Note that this Dataset is moderately unbalanced
- Reason for Unbalanced Dataset
  - For Positive, Negative and Neutral classes, the number of instances is not the same
Example – Balanced Dataset
- Positive = 100, Negative = 100, Neutral = 100
- Reason for Balanced Dataset
  - For Positive, Negative and Neutral classes, the number of instances is the same
    - i.e. 33.33% are Positive, 33.33% are Negative and 33.33% are Neutral
Note
- Sentiment Analysis is a Multi-class Classification Problem because there are more than two Classes i.e. Positive, Negative and Neutral

Ideal Situation
- Have a large balanced Dataset with high-quality
Problem
- It is a difficult and challenging task to create large balanced Datasets with high-quality
A Possible Solution
- Use large and high-quality Datasets, which are moderately balanced
Example – Moderately Balanced Datasets
- For a Binary Classification Problem, some possible moderately balanced class distributions are as follows
  - 55% – 45%
  - 60% – 40%

Three main types of learning are

1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning

Definition
- In Supervised Learning, a Machine Learning Algorithm learns from Annotated Data
  - Annotated Data means that for all Training Examples, Output is associated with Inputs
Strengths
- Learning is more accurate because the quality of Annotated Data is high (annotated by Domain Experts)
Weaknesses
- Acquiring Annotated Data requires a lot of time, effort and cost

Two main types of Supervised Learning are
- Classification
- Regression
Classification
- Definition
  - In Classification, the Output is Categorical (or Discrete)
- Example
  - Task
    - Gender Identification
  - Annotated Data

Regression
- Definition
  - In Regression, the Output is Numeric (or Continuous)
- Example
  - Task
    - House Price Prediction
  - Annotated Data

Definition
- In Unsupervised Learning, a Machine Learning Algorithm learns from Unannotated Data
  - Unannotated Data means that for all Training Examples, Output is not associated with Inputs
Strengths
- You can easily and quickly collect a large amount of Unannotated Data
Weaknesses
- Learning may not be accurate since the quality of Data is low because it is unannotated

Definition
- In Semi-supervised Learning, a Machine Learning Algorithm learns from Semi-annotated Data
  - Semi-annotated Data means that only for some Training Examples, Output is associated with Inputs
Strengths
- You can quickly collect a large amount of Semi-annotated Data
Weaknesses
- Learning may not be accurate since the quality of Data is low because all Training Examples are not annotated

For each type of learning, different Machine Learning Algorithms are developed
Next slides present some of the popular and widely used Machine Learning Algorithms
- Note – Here I am considering Scikit-learn Machine Learning Toolkit implementations

ML Algorithms for Supervised Learning
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
ML Algorithms for Classification
- Naïve Bayes
- Random Forest Classifier
- Support Vector Machine Classifier
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
ML Algorithms for Regression
- Random Forest Regressor
- Support Vector Machine Regressor
- Linear Regression
ML Algorithms for Unsupervised Learning
- K-means Clustering Algorithm
- Agglomerative Hierarchical Clustering Algorithm
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
ML Algorithms for Semi-supervised Learning
- Label Propagation

Todo Tasks

Your Turn Tasks

Machine Learning Cycle

Four main phases of Machine Learning Cycle are

1. Training / Learning Phase
2. Testing / Evaluation Phase
3. Application Phase
4. Feedback Phase

Problem
- For both Training Phase and Testing Phase, we need
  - Data
- A Possible Solution
  - Split available Data into
    - Train Data (or Train set)
    - Test Data (or Test set)
- Use Train Data in the Training Phase and Test Data in the Testing Phase
For all types of learning (supervised, unsupervised and semi-supervised), we need both Training Data and Test Data
Supervised Learning
- Training Data – must be annotated
- Test Data – must be annotated
Unsupervised Learning
- Training Data – must be unannotated
- Test Data – must be annotated
Semi-supervised Learning
- Training Data – must be semi-annotated
- Test Data – must be annotated
Important Note
- Training Data varies for three types of learning
- Test Data must be annotated for all three types of learning
Standard Approach for Data Split
- Train set – Use 2 / 3 (67%) of Data
- Test set – Use 1 / 3 (33%) of Data
Important Note
- Train set and Test set must be disjoint
  - examples occurring in the Train set should not occur in the Test set and vice versa
Two main approaches to Data split
- Random Data Split
- Class Balanced Data Split
Random Data Split
- In this approach, the Data Distribution (for each class) in the original Dataset is not necessarily followed while splitting Data, because instances are assigned to the Train set and Test set at random
Class Balanced Data Split
- In this approach, the Data Distribution (for each class) in the original Dataset is strictly followed while splitting Data

Machine Learning Problem
- Gender Identification
No of Classes
- Class 1 = Male
- Class 2 = Female
Original Dataset Size
- 600 instances
Data Distribution
- Male = 300 (50%)
- Female = 300 (50%)
Train-Test Split Ratio
- 67%-33%

Random Data Split Approach
- Train set = 400 instances (67% of Original Dataset)
  - Male = 250, Female = 150
- Test set = 200 instances (33% of Original Dataset)
  - Male = 50, Female = 150
Reason – Why it is a Random Data Split?
- In Original Dataset, the Data Distribution is
  - 50% Male instances and 50% Female instances
- In Train set
  - Total Instances = 400
  - Male Instances = 250
  - Female Instances = 150
- In Test set
  - Total Instances = 200
  - Male Instances = 50
  - Female Instances = 150
Note that both in the Train set and Test set, the Data Distribution does not match with the Data Distribution in the Original Dataset
Class Balanced Data Split Approach
- Train set = 400 instances (67% of Original Dataset)
  - Male = 200, Female = 200
- Test set = 200 instances (33% of Original Dataset)
  - Male = 100, Female = 100
- Reason – Why it is a Class Balanced Data Split?
  - In Original Dataset, the Data Distribution is
    - 50% Male instances and 50% Female instances
  - In Train set
    - Total Instances = 400
    - Male Instances = 200
    - Female Instances = 200
  - In Test set
    - Total Instances = 200
    - Male Instances = 100
    - Female Instances = 100
Note that both in the Train set and Test set, the Data Distribution matches with the Data Distribution in the Original Dataset

Machine Learning Problem
- Gender Identification
No of Classes
- Class 1 = Male
- Class 2 = Female
Original Dataset Size
- 900 instances
Data Distribution
- Male = 600 (67%)
- Female = 300 (33%)
Train-Test Split Ratio
- 67%-33%

Random Data Split Approach
- Train set = 600 instances (67% of Original Dataset)
  - Male = 500, Female = 100
- Test set = 300 instances (33% of Original Dataset)
  - Male = 100, Female = 200
- Reason – Why it is a Random Data Split?
  - In Original Dataset, the Data Distribution is
    - 67% Male instances and 33% Female instances
  - In Train set
    - Total Instances = 600
    - Male Instances = 500
    - Female Instances = 100
  - In Test set
    - Total Instances = 300
    - Male Instances = 100
    - Female Instances = 200
Note that both in the Train set and Test set, the Data Distribution does not match with the Data Distribution in the Original Dataset
Class Balanced Data Split Approach
- Train set = 600 instances (67% of Original Dataset)
  - Male = 400, Female = 200
- Test set = 300 instances (33% of Original Dataset)
  - Male = 200, Female = 100
- Reason – Why it is a Class Balanced Data Split?
  - In Original Dataset, the Data Distribution is
    - 67% Male instances and 33% Female instances
  - In Train set
    - Total Instances = 600
    - Male Instances = 400
    - Female Instances = 200
  - In Test set
    - Total Instances = 300
    - Male Instances = 200
    - Female Instances = 100
Note that both in the Train set and Test set, the Data Distribution matches with the Data Distribution in the Original Dataset

It is good to split Data in a Train-Test Split Ratio of 67%-33% using the Class Balanced Data Split Approach
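A Class Balanced Data Split can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the function name `class_balanced_split`, the `label_of` parameter and the fixed `seed` are choices made for this example. The Dataset reproduces the second example above (600 Male and 300 Female instances).

```python
import random
from collections import Counter, defaultdict


def class_balanced_split(instances, label_of, test_fraction=1 / 3, seed=0):
    """Split so that each Class keeps (roughly) the same share in Train and Test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance in instances:
        by_class[label_of(instance)].append(instance)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy, so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])   # first part of each Class goes to the Test set
        train.extend(members[n_test:])  # the rest goes to the Train set
    return train, test


# 900 instances: ids 0-599 are Male, ids 600-899 are Female.
dataset = [(i, "Male") for i in range(600)] + [(i, "Female") for i in range(600, 900)]
train, test = class_balanced_split(dataset, label_of=lambda inst: inst[1])

print("Train:", Counter(label for _, label in train))
print("Test: ", Counter(label for _, label in test))
print("Disjoint:", not set(train) & set(test))
```

Because every instance is placed in exactly one of the two lists, the Train set and Test set are disjoint by construction.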

Recall – Four Phases of Machine Learning Cycle

Training / Learning Phase
Testing / Evaluation Phase
Application Phase
Feedback Phase

Definition
- Use Training Data to build a Model
Purpose
- Build a Model (or Intelligent Program) from Training Data, to make predictions on unseen Data
An Important Question
- How well has your Model learned?
A Possible Answer
- Evaluate the performance of Model on unseen Data (Test Data)

Definition
- Use Test Data to evaluate the performance of the Model (built in the Training Phase)
Purpose
- Judge how well the Model has learned from the Training Data
An Important Question
- How to quantify the performance of the Model?
A Possible Answer
- Use an Evaluation Measure

Classification – Standard Evaluation Measures
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F-measure
- Area Under the Curve (AUC)
Regression – Standard Evaluation Measures
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
Note – I have only mentioned some of the popular and widely used Evaluation Measures, there are also many others 😊
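Several of the measures listed above can be computed directly from true and predicted Outputs. The sketch below uses the textbook definitions; the tiny label lists and price values are made up for illustration, and Baseline Accuracy is assumed here to mean the accuracy obtained by always predicting the most frequent Class.

```python
import math
from collections import Counter


def classification_measures(y_true, y_pred, positive):
    """Accuracy, Precision, Recall and F-measure (F1) with respect to one positive Class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def baseline_accuracy(y_true):
    """Assumed definition: accuracy of always predicting the most frequent Class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def regression_measures(y_true, y_pred):
    """MAE, MSE, RMSE and R-squared for numeric Outputs."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_total
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


cls = classification_measures(
    ["Male", "Male", "Female", "Female", "Male"],
    ["Male", "Female", "Female", "Male", "Male"],
    positive="Male",
)
reg = regression_measures([100, 200, 300], [110, 190, 300])
print(cls)
print(reg)
```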

To learn any task
- Start with the most popular and widely used approaches
Also, in learning
- Never Compromise on Quality 😊

Recall the Equation
- Data = Model + Error

In Training Phase, Model is built using the Training Data
In Testing Phase, Test Data is used to check the Error in the Model
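The equation Data = Model + Error can be made concrete with a very small sketch. The Model here is assumed to be a straight line through the origin, y = w * x, fitted by least squares; the numbers are made up, and this is only one of many possible Models.

```python
# Made-up Training Data and Test Data as (x, y) pairs.
train_data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
test_data = [(5, 10.3), (6, 11.6)]

# Training Phase: build the Model (estimate w) from the Training Data only.
w = sum(x * y for x, y in train_data) / sum(x * x for x, _ in train_data)


def model(x):
    """The learned Model (Hypothesis h): predicts y for an input x."""
    return w * x


# Testing Phase: the Error is whatever part of the Test Data the Model does not explain.
errors = [y - model(x) for x, y in test_data]

for (x, y), e in zip(test_data, errors):
    # Data = Model + Error holds for every instance by construction.
    print(f"x={x}: data={y:.2f} model={model(x):.2f} error={e:+.2f}")
```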

Definition
- Deploy the Model in the real world
Purpose
- Use the Model to make predictions on future unseen Data for a range of Real-world Applications
Question
- How can we say that our Model is good enough to perform well in the real world?
A Possible Answer (ML Assumption)

Note

Definition
- Take Feedback from Users and Domain Experts on your deployed Model
Purpose
- To further improve the deployed Model
After taking Feedback
- Go to Training Phase to further improve your Model based on the Feedback received from Users and Domain Experts

Todo Tasks

Your Turn Tasks

Machine Learning – Training Regimes

Definition
- A systematic way in which Training Data is used by a Machine Learning Algorithm to learn from it
Purpose
- When we learn (or train) systematically, the quality of Training increases

Some of the main types of Training Regimes are
1. Batch Method
2. Incremental Method
3. On-line Method

Batch Method
- In this method, all Training Examples are available and used all at once to build the Model (or Hypothesis h)
Incremental Method
- In this method, one member (Training Example) of the Training Data is selected at a time and used to modify the current Hypothesis (h)
On-line Method
- If Training Examples become available one at a time and are used as they become available, the Training Regime is called On-line Method
- Example
  - A robot which is learning a Hypothesis (h) from sensory inputs, where h controls its actions (and hence determines its future sensory inputs)

Note
- Whatever Training Regime you use, the Learner (Machine Learning Algorithm) will return you a Hypothesis (h), which is an approximation of the Target Function (f)
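The difference between the three Training Regimes can be sketched with a deliberately simple Hypothesis: a single number h that estimates the average of the observed values. The data, the `sensor_stream` generator and the update rule are assumptions chosen to keep the example small; real learners update far richer Hypotheses in the same spirit.

```python
def batch_fit(examples):
    """Batch Method: all Training Examples are used at once."""
    return sum(examples) / len(examples)


def incremental_update(h, example, seen):
    """Modify the current Hypothesis h using one example; `seen` counts it too."""
    return h + (example - h) / seen


def sensor_stream():
    """Hypothetical source that delivers Training Examples one at a time."""
    for reading in [4.0, 6.0, 5.0, 9.0]:
        yield reading


training_data = [4.0, 6.0, 5.0, 9.0]

# Batch Method: one pass over the complete Training Data.
h_batch = batch_fit(training_data)

# Incremental Method: select one member of the stored Training Data at a time.
h_incremental = 0.0
for k, example in enumerate(training_data, start=1):
    h_incremental = incremental_update(h_incremental, example, k)

# On-line Method: use each example as soon as it becomes available.
h_online = 0.0
for k, reading in enumerate(sensor_stream(), start=1):
    h_online = incremental_update(h_online, reading, k)

print(h_batch, h_incremental, h_online)
```

For this particular Hypothesis all three regimes arrive at the same value; in general they may return different approximations of the Target Function (f).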

Direct Training
- Each Training Example has an associated Output value
- Example

Indirect Training
- Only sequences of Training Examples have an associated Output value
- Example – Chess Game
  - Goal
    - Learn a function from the board position to the next move
  - Input
    - Sequence of Moves
  - Output
    - Class 01 = Win
    - Class 02 = Lose
  - Problem
    - We may not know whether an individual move is correct or whether it will lead to a Win
    - Only the complete sequence of moves leads to a Win or a Loss
  - Solution
    - Learner must decide which sequences of moves lead to a Win / Loss
- Training Data – Chess Game

Todo Tasks

Your Turn Tasks

Chapter Summary

Before learning / doing a Task, it is essential to know its ultimate goal
The ultimate goal of Machine Learning is
- Develop such a Machine which behaves like Human
To stay motivated and successful in this life and hereafter
- Be Realistic
To be honest, it is not possible to develop such a Machine which behaves like Human because
1. Human (creature) can never be perfect like Allah (Creator)
2. Machine Learning is mainly based on Inductive Learning Approach, which has Scope of Error
  - Therefore, all Machine Learning Models will have Scope of Error and Machine cannot be intelligent like Human

A Machine is said to learn if its today (character) is better than its yesterday (character)
Generally, we use Machine Learning
- When we have Data with three main characteristics
  - Large amount of Data
  - High quality Data
  - Balanced Data
- When patterns exist in our Data
  - Even if we don’t know what they are
In Traditional Programming, a Program processes the Data according to a Set of Instructions to produce Output

To solve a Real-world Problem through Programming, you need to know four things

Purpose
Input
1. Data
2. Instructions
Processing
Output

The diagram below summarizes how Traditional Programming works

The main job of a Machine Learning Algorithm is to
- Learn from Input to predict Output
Input (Machine Learning Algorithm)
- Data
  - Input, Output
Output (Machine Learning Algorithm)
- An intelligent Program (a.k.a. Model)
The diagram below summarizes how Machine Learning works

Machine Learning can be summarized in the following Equation
- Data = Model + Error
Most of Machine Learning revolves around
- Learning Input-Output Functions
  - a.k.a. Function Learning or Concept Learning
Learner (Machine Learning Algorithm) – Input and Output
- Input to a Learner
  1. Set of Training Examples (D)
  2. Set of Functions / Hypothesis (H) (a.k.a. Hypothesis Space)
- Output of a Learner
  - A h from H, which is an approximation of the Target Function f
The diagram below summarizes the general setting of Learning Input-Output Functions

In Machine Learning, a Target Function f cannot be completely learned because
- Machine Learning uses Inductive Learning Approach, which has Scope of Error
- Therefore, Target Function (f) cannot be completely learned, however it can be approximated
  - Hypothesis (h) is an approximation of the Target Function f
A Machine is dumb and cannot understand Real-world Objects directly
- Therefore, we need to represent Real-world Objects in a Format which Machine Learning Algorithms can understand
Very often, an Example (a.k.a. instance, data point or observation) is represented as
- Attribute-Value Pair
Example = Input + Output
- Input is mostly Vector-valued
- Output is mostly Single-valued
Values of Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
  - Discrete – e.g. 10, 25, 10000
  - Continuous – e.g. 3.5, 5.9
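An Example written as Attribute-Value Pairs might look like the sketch below. The attribute names and values are hypothetical, and mapping Python types to value kinds (`str` as Categorical, `int` as Discrete, `float` as Continuous) is a simplification made for this illustration.

```python
# Example = Input + Output; the Input is vector-valued, the Output is single-valued.
example = {
    "input": {"height_cm": 172.5, "age": 25, "has_beard": "Yes"},
    "output": "Male",
}


def attribute_kind(value):
    """Map a Python value to the kind of Attribute value it represents (simplified)."""
    if isinstance(value, str):
        return "Categorical"
    if isinstance(value, float):
        return "Numeric (Continuous)"
    if isinstance(value, int):
        return "Numeric (Discrete)"
    raise TypeError(f"Unsupported attribute value: {value!r}")


for attribute, value in example["input"].items():
    print(f"{attribute} = {value} -> {attribute_kind(value)}")
print("Output =", example["output"], "->", attribute_kind(example["output"]))
```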
Representation of Hypothesis (h) varies from ML Algorithm to ML Algorithm
The main steps to Learn an Input-Output Function are as follows
- Step 1: Define the Concept to be learned
- Step 2: Take the examples of the Concept to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the Concept learned from specific examples
Learning Input-Output Functions can be summarized as
- Learn from Input to predict Output
To build efficient Machine Learning Models, remember the following Equation
Machine Learning = Data Understanding and Preprocessing (40%-50%) + Predictive Analysis (50%-60%)
To build efficient Machine Learning Models follow the following steps
- Step 1: Build strong and accurate understanding of Data
- Step 2: Properly pre-process Data, so that it becomes high-quality data
- Step 3: Represent / Transform Data into a format which Machine Learning Algorithms can understand
- Step 4: Identify what Machine Learning Algorithms will be most suitable for your Data (prepared in Step 3)
- Step 5: Train / Test selected (in Step 4) Machine Learning Algorithms on Data (prepared in Step 3)
It is very difficult to build an efficient Model unless and until you have good understanding of your Data
To have strong and more accurate understanding of your Data
- First perform manual inspection and then carry out automatic analysis
Feature-based ML Algorithms are based on Manual Feature Engineering and Deep Learning ML Algorithms are based on Automatic Feature Engineering
To select most suitable ML Algorithms for a Machine Learning Problem
- You must have high level of expertise in Machine Learning or consult a Machine Learning Expert
No one has a definite answer about which Machine Learning Algorithms are most suitable for a Machine Learning Problem but we can start with those Machine Learning Algorithms which have proven effective in solving Machine Learning Problem(s) similar to our Machine Learning Problem
Data is defined as Raw Facts and Figures
Information is defined as the processed form of Data
Data can be mainly categorized in three ways
Form of Data – First Categorization of Data
- Text
- Image
- Video
- Audio
Type of Data – Second Categorization of Data
- Structured
- Unstructured
- Semi-structured
Length of Data – Third Categorization of Data
- Fixed
- Variable Length
Data Annotation (a.k.a. data labeling / data tagging) is the process of labeling Data
Data is annotated so that Machine Learning Algorithms can more accurately learn from annotated data
Same Data can be annotated for different Tasks
For Machine Learning Algorithms, Data is mainly available as
- Un-annotated Data
  - Output is not associated with Input
- Annotated Data
  - Output is associated with Input for all instances
- Semi-annotated Data
  - Output is associated with Input for some instances
For more accurate learning, it is important to have
- Large amount of Data
- High-quality Data
- Balanced Data
The amount of Data required to accurately learn a Task
- Varies from Task to Task
Data is said to be of High-quality if it contains instances which are
- Noise Free
- Complete and Correct
- Diversified
Normally, Data Pre-processing tools are used to improve the quality of Data
Balanced Data means that for each Class, the dataset must contain the same number of instances
It is a difficult and challenging task to create large balanced datasets with high-quality
- Therefore, use large and high-quality datasets, which are moderately balanced
A Machine Learning Problem can be mainly categorized as

1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning

In Supervised Learning, a Machine Learning Algorithm learns from Annotated Data
In Unsupervised Learning, a Machine Learning Algorithm learns from Unannotated Data
In Semi-supervised Learning, a Machine Learning Algorithm learns from Semi-annotated Data
Supervised Learning is broadly categorized into
- Classification
  - Output is Categorical
- Regression
  - Output is Numeric
For each type of learning, different Machine Learning Algorithms are developed
Good Starting Points – ML Algorithms for Supervised Learning
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
Good Starting Points – ML Algorithms for Classification
- Naïve Bayes
- Random Forest Classifier
- Support Vector Machine Classifier
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
Good Starting Points – ML Algorithms for Regression
- Random Forest Regressor
- Support Vector Machine Regressor
- Linear Regression
Good Starting Points – ML Algorithms for Unsupervised Learning
- K-means Clustering Algorithm
- Agglomerative Hierarchical Clustering Algorithm
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
Good Starting Points – ML Algorithms for Semi-supervised Learning
- Label Propagation
Four main Phases of Machine Learning Cycle are
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
In Machine Learning, we split Data into
- Train Data (or Train set)
- Test Data (or Test set)
Generally, we split Data in a Train-Test Split Ratio of 67%-33%
Training Data is used in the Training Phase and Test Data is used in the Testing Phase
Annotation of Training Data varies for three types of learning
Test Data must be annotated for all three types of learning
In Training Phase, we
- Use Training Data to build a Model
In Testing Phase, we
- Evaluate the performance of Model on unseen Data (Test Data) using standard Evaluation Measure(s)
Classification – Standard Evaluation Measures
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F-measure
- Area Under the Curve (AUC)
Regression – Standard Evaluation Measures
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
Both Training and Testing Phases can be summarized as follows
Recall the Equation
- Data = Model + Error
In Training Phase, Model is built using the Training Data
In Testing Phase, Test Data is used to check the Error in the Model
If a Model performs well on large Test Data, then it is deployed in the Real world to make predictions on future unseen instances (Application Phase)
In Feedback Phase, we take Feedback from Users and Domain Experts on the deployed Model and try to further improve it based on the Feedback received
Training Regime is a systematic way in which Training Data is used by a Machine Learning Algorithm to learn from it
Three main types of Training Regimes are
Batch Method
- In this method, all training examples are available and used all at once to build the Model (or hypothesis h)
Incremental Method
- In this method, one member (training example) of the Training Data is selected at a time and used to modify the current hypothesis (h)
On-line Method
- If training instances become available one at a time and are used as they become available, the method is called an on-line method
Whatever Training Regime you use, the Learner (Machine Learning Algorithm) will return you a hypothesis (h), which is an approximation of the target function
