Chapter 11 – Artificial Neural Networks
Chapter Outline
- Chapter Outline
- Quick Recap
- Artificial Neural Networks
- General Architecture – Artificial Neural Networks
- Perceptron – Two Layer Artificial Neural Networks
- Machine Learning Cycle – Perceptron
- Multi-layer Artificial Neural Networks
- Overfitting – Artificial Neural Networks
- Chapter Summary
Quick Recap
- Quick Recap – Decision Tree Learning
- Main Problems – Candidate Elimination Algorithm
- Cannot handle noisy Data
- Cannot handle Numeric Data
- Cannot work for complex representations of Data
- It may fail to find the Target Function in the Hypothesis Space
- i.e. converge to an empty Version Space
- Proposed Solution
- Decision Tree Learning Algorithms
- Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree
- Learned Decision Trees (or Models) can also be re-represented as
- Sets of If-Then Rules (to improve human readability)
- Decision Tree Learning is most popular in Inductive Learning Algorithms
- ID3 Algorithm – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Disjunction (OR) of Conjunctions (AND) of Input Attributes (a.k.a. Decision Tree)
- Searching Strategy
- Greedy Search
- Simple to Complex (Hill Climbing)
- Training Regime
- Batch Method
- Inductive Bias
- Shorter Decision Trees are preferred over Longer Decision Trees
- Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
- Strengths
- Can handle noisy Data
- Always finds the Target Function because
- ID3 Algorithm searches a complete Hypothesis Space
- Weaknesses
- Cannot handle very complex Numeric Representation
- Note
- State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations
- However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
- To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
- Step 1: Write down all the paths in the Decision Tree
- A path is a Conjunction (AND) of Attributes
- Step 2: Join the paths using Disjunction (OR)
- Decision Tree = Disjunction of Paths
- In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
- Instances / Examples are Represented as Attribute–Value Pair
- Target Function (f) is Discrete Valued
- Disjunctive Hypothesis may be Required
- Possibly Noisy / Incomplete Training Data
- ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question
- Which Attribute should be tested at the root of the tree?
- Generally, a statistical test is used to evaluate each Input Attribute to determine
- How well it alone classifies the Training Examples?
- Best Attribute is the one which alone best classifies the Training Examples
- Best Attribute is selected and used as the test at the root node of the tree
- A descendant of the root node is then created for each possible value of this Attribute, and
- Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
- The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
- This process continues for each new leaf node until either of two conditions is met
- Every Attribute has already been included along this path through the tree, or
- Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)
- ID3 Algorithm performs a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
- Many measures are available for picking the best classifier Attribute
- e.g. Information Gain, Gain Ratio, etc.
- Information Gain is a useful measure for picking the best classifier Attribute
- Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
- Information Gain measures
- how well a given Attribute separates Training Examples with respect to their Target Classification
- Information Gain is defined in terms of
- Entropy
- Entropy gives a measure of purity / impurity of a Set of Training Examples
- Value of Entropy lies in the range [0, 1]
- 0 means the Sample is Pure
- 1 means that Sample has maximum impurity
- Minimum Entropy
- Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category
- Maximum Entropy
- Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
- A limitation of Information Gain measure is that it favors attributes with many values over those with few values
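- As a quick illustration, below is a minimal Python sketch of Entropy and Information Gain for binary-labeled Training Examples (Entropy(S) = −Σ p_i log₂ p_i; the helper names are mine, not from the chapter):

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, in bits; 0 = pure sample,
    # 1 = half Positive / half Negative (maximum impurity, binary case).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    # Expected reduction in Entropy from partitioning the Training Examples
    # on one Attribute; `values` holds that Attribute's value per example.
    total = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

print(entropy(["+", "+", "-", "-"]))  # 1.0 (maximum impurity)
print(information_gain(["hot", "hot", "cold", "cold"], ["+", "+", "-", "-"]))  # 1.0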
- Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
- Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
- It only contains Hypotheses with Conjunction relationship between Attributes
- ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
- This means that we cannot
- determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
- ID3 Algorithm incompletely searches a complete Hypothesis Space (H)
- Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)
- ID3 Algorithm performs no backtracking
- Once an Attribute is selected for testing at a given node, this choice is never reconsidered
- Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than a Globally Optimal Solution
- Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
- Preference Bias only affects the order in which Hypotheses are searched
- In Preference Bias, Hypothesis Space (H) will contain the Target Function
- Restriction Bias affects which Hypotheses are searched
- In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function
- Generally, better to choose Machine Learning Algorithm with Preference Bias rather than Restriction Bias
- Some Machine Learning Algorithms may combine Preference and Restriction Biases
- e.g. Checkers Learning Program
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Overfitting is a real problem for Decision Tree Learning
- For Decision Tree Learning, one empirical study showed that for a range of tasks there was a
- Decrease of 10% – 25% in Accuracy due to Overfitting
- The simple ID3 Algorithm can produce
- Decision Trees that overfit the Training Examples
- Two general approaches to avoid Overfitting in Decision Tree Learning are
- Stop growing Decision Tree before perfectly fitting Training Examples
- e.g. when Data Split is not statistically significant
- Grow full Decision Tree, then prune afterwards
- In practice
- Second approach has been more successful
- Most Common Approach to avoid Overfitting is
- Training and Validation Set Approach
- Using Training and Validation Set Approach, Sample Data is split into three sets
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Holding Data back for a Validation Set reduces Data available for Training
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
- Two approaches for Decision Tree Pruning to avoid Overfitting are
- Reduced Error Pruning Approach
- Rule Post-Pruning Approach
- Rule Post-Pruning Approach is better than Reduced Error Pruning Approach
Artificial Neural Networks
- Artificial Neural Networks (ANNs)
- Definition
- Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)
- Applications of ANNs in Natural Language Processing
- Text Classification
- Information Extraction
- Semantic Parsing
- Question Answering
- Paraphrase Detection
- Natural Language Generation
- Text Summarization
- Machine Translation
- Speech Recognition
- Character Recognition and many more 😉
- Applications of ANNs in Image Processing
- Face Detection
- Face Recognition
- Fake Image / Video Detection
- Object Detection in Image / Video
- Activity Recognition in Image / Video
- Natural language Description Generation from Image
- Captioning of Image / Video and many more 😉
- Artificial Neural Networks – Biological Motivation
- Biological Motivation
- Human Brain can classify / categorize Real-world Objects easily
- Human Brain is made up of Networks of Neurons
- Naturally occurring Neural Networks
- Each Neuron is connected to many others
- Input to one Neuron is the Output from many others
- Like Human Brain
- ANNs are Neural Networks which can categorize / classify Real-world Objects
- Don’t take the analogy too far 😊
- Human Brain has approximately 100,000,000,000 Neurons
- ANNs usually have < 1000 Neurons
- To conclude
- ANNs are a gross simplification of real Neural Networks
- Example 1 – Artificial Neural Networks
- Machine Learning Problem
- Squaring Integers
- Input
- An Integer Number (Numeric)
- Output
- An Integer Number (Numeric)
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D), we have
- Single Input
- Single Output
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h = x²
- where x is an Integer Number
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict Numeric values (Outputs)
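- As a minimal sketch of this example (assuming scikit-learn is available; the hyperparameters are arbitrary choices, not from the chapter), an off-the-shelf ANN regressor can approximately learn the squaring function:

from sklearn.neural_network import MLPRegressor

X = [[x] for x in range(-10, 11)]   # Input: an Integer Number
y = [x[0] ** 2 for x in X]          # Output: its square

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
model.fit(X, y)
print(model.predict([[4]]))         # should come out roughly close to 16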
- Example 2 – Artificial Neural Networks
- Machine Learning Problem
- Learn Relationship between Three Integers
- Input
- Three Integer Numbers (Numeric)
- Output
- An Integer Number (Numeric)
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D), we have
- Multiple Inputs
- Single Output
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h = [A, B, C] → A*C − B
- where A, B and C are Integer Numbers
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict Numeric values (Outputs)
- Also, calculations in this Machine Learning Problem are more complex than the previous one (Squaring Integers Machine Learning Problem)
- Example 3 – Artificial Neural Networks
- Machine Learning Problem
- Categorizing Vehicles
- Input
- An Image
- Output
- Category of Vehicle
- Possible Output Values
- Car
- Bus
- Tank
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D)
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Non-Symbolic Representation
- Question
- How to transform Categories into Numeric Representations?
- A Possible Answer
- Map each Category to
- A Number or
- A Range of Real Valued Numbers (e.g., 0.5 – 0.9)
- Considering Vehicle Categorization Problem
- Map each Category to a Number
- After Mapping, Output in Training Examples will be as follows
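- A minimal sketch of such a mapping (the particular integer codes are an assumption; any distinct numbers would work):

category_to_number = {"Car": 0, "Bus": 1, "Tank": 2}   # assumed codes
number_to_category = {v: k for k, v in category_to_number.items()}

outputs = ["Car", "Tank", "Bus", "Car"]
numeric_outputs = [category_to_number[c] for c in outputs]
print(numeric_outputs)         # [0, 2, 1, 0]
print(number_to_category[2])   # Tank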
- Note
- Alhamdulillah, the Output is transformed into Numeric Representation. In Sha Allah, in the next Slides I will try to explain
- How to transform an Image into Numeric Representation?
- Converting Input (Image) into Non-Symbolic Representation
- Question
- How to transform an Image into Numeric representations?
- A Possible Answer
- Use a Feature Extraction Method
- e.g. Extract four Pixel Values from each Image
- Note
- For simplicity, I have only taken four Pixels (Attributes / Features)
- Considering Vehicle Categorization Problem
- Extract four Pixel Values from each Image
- Range of Pixel Values is [0, 255]
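- A minimal sketch of this Feature Extraction step (the 2×2 image array is hypothetical; numpy is assumed):

import numpy as np

image = np.array([[12, 240],
                  [88, 199]], dtype=np.uint8)   # hypothetical 2x2 image

# Extract Pixel Values Left to Right, Top to Bottom into a feature vector.
features = image.flatten().tolist()
print(features)   # [12, 240, 88, 199]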
- After Converting Image into Pixel Values, Training Examples will be as follows
- Set of Training Examples (D)
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict
- Numeric values (Outputs)
- Also, calculations in this Machine Learning Problem are more complex than the previous two Machine Learning Problems: (1) Squaring Integers and (2) Learning Relationship between Three Integers
- Conclusion
- ANNs are very effective, since they have the capability to learn complex concepts like
- Vehicle Categorization
- Suitable Problems for ANNs
- Training Examples (both Input and Output) can be represented as
- real values (Numeric Representation)
- Hypothesis (h) can be represented as
- real values (Numeric Representation)
- Slow Training Times are OK
- ANNs can take hours and days to train Neural Networks
- Predictive Accuracy is more important than understanding
- What the ANN has learned
- Since ANNs have a Black Box Representation, it is very difficult to understand what they have learned
- Execution of Learned Function (Model / h) must be quick
- In Application Phase, Learned Neural Networks (Model / h) can categorize unseen examples very quickly
- Very useful in time critical situations (e.g. Is that a Car or Tank?)
- ANNs are fairly robust to noise in Training Data
TODO and Your Turn
TODO Task 1
- Task 1
- Consider the following Sample Data and answer the questions given below.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- What is the representation of Input and Output for the above Machine Learning Problem?
- Can you apply ANN on Sample Data? If not, what changes will you have to make in the Input and Output so that ANN can learn from Training Data? Note: use only four Features / Attributes to represent Input.
- Is the above Machine Learning Problem suitable for ANN?
- What potential challenges can we have when we handle above Machine Learning Problem using ANN?
Your Turn Task 1
- Task 1
- Select a Machine Learning Problem (similar to the one given in TODO Task) and answer the questions given below
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- What is the representation of Input and Output for the selected Machine Learning Problem?
- Can you apply ANN on Sample Data obtained for your selected Machine Learning Problem? If not, what changes will you have to make in the Input and Output so that ANN can learn from Training Data? Note: use only four Features / Attributes to represent Input.
- Is the selected Machine Learning Problem suitable for ANN?
- What potential challenges can we have when we handle selected Machine Learning Problem using ANN?
General Architecture - Artificial Neural Networks
- General Architecture - Artificial Neural Networks
- Three Main Layers
- Input Layer
- Hidden Layer
- Output Layer
- Each Layer contains one or more Units
- Input Units
- Units in the Input Layer are called Input Units
- Hidden Units
- Units in the Hidden Layer are called Hidden Units
- Output Units
- Units in the Output Layer are called Output Units
- Number of Units at each Layer
- Question 1
- How many Units should be there at Input Layer?
- Answer
- Number of Input Units = Number of Attribute / Feature Values
- Example – Categorizing Vehicles
- We extracted four Pixels (Attributes / Features)
- Number of Input Units = 4
- Question 2
- How many Units should be there at Output Layer?
- Answer
- Number of Output Units = Number of Classes
- Example – Categorizing Vehicles
- There are 3 Classes (Car, Bus and Tank)
- Number of Output Units = 3
- Question 3
- How many Units should be there at Hidden Layer?
- Answer
- There is no definite answer to this question
- Normally, people select, by trial and error, the
- Number of Hidden Layers and
- Number of Hidden Units
- SoftMax Layer - ANN
- Considering Vehicle Categorization Problem
- Output Units contain the Output generated by ANN
- Problem
- How can we interpret the vectors at Output Units to categorize Image as Car / Bus / Tank?
- A Possible Solution
- Use a Softmax Function
- Softmax Function
- Softmax Function takes the vector of Output-Unit values as input and converts it into probabilities such that the sum of probabilities over all Output Units is 1
- The Output Unit with the highest probability will be the predicted Class / Category of an instance / example
- Softmax Layer is the last Layer of ANN
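- A minimal sketch of the Softmax Function (the raw Output-Unit scores below are hypothetical):

import math

def softmax(scores):
    # Convert raw Output-Unit scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Car", "Bus", "Tank"]
probs = softmax([2.0, 1.0, 0.1])         # hypothetical Output-Unit values
print([round(p, 2) for p in probs])      # [0.66, 0.24, 0.1]
print(classes[probs.index(max(probs))])  # predicted Class: Car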
- Mathematical Functions – ANNs
- An ANN as a whole embeds a giant Mathematical Function, built from simple Unit-level Functions (e.g. ReLU, Sigmoid, etc.)
- All the Hidden Units and Output Units in an ANN have the
- same Mathematical Function
- Input to a Mathematical Function
- Weighted Sum of Inputs (S)
- Fully Connected ANN vs Partially Connected ANN
- Fully Connected ANN
- A Fully Connected Neural Network consists of a series of Fully Connected Layers
- Fully Connected Layer
- In a Fully Connected Layer, each Unit receives input from every Unit of the previous layer
- Partially Connected ANN
- A Partially Connected Neural Network consists of a series of Partially Connected Layers
- Partially Connected Layer
- In a Partially Connected Layer, each Unit does not receive input from every Unit of the previous layer
- Example – Fully Connected ANN
- Example – Partially Connected ANN
- Feed Forward Network
- Question
- Why ANN is called a Feed Forward Network?
- Answer
- A Simple ANN is also called a Feed Forward Network because values in the ANN move in a forward direction
- i.e. Input Layer == > Hidden Layer(s) == > Output Layer
- Note that calculations are performed at each Hidden and Output Unit
- Weights – ANN
- The edges between Input-Hidden Units, Hidden-Hidden Unit and Hidden-Output Unit contain the
- Weights
- Recall
- Hypothesis (h) Representation in ANNs
- Combination of Weights between Units
- To Train ANN
- Initially, Weights are randomly assigned
- Normally, small Weights are assigned, in the range [−0.5, +0.5]
- Hypothesis Space (H) in ANN
- Set of All Possible Combinations of Weights between Units
- Recall – Learning is a Searching Problem
- The main goal of the Learner (ANN) is to search the Hypothesis Space (H) to find a Hypothesis (h / Model) which best fits the Set of Training Examples
- Hypothesis Space (H) - ANNs
- Hypothesis Space (H) in ANN
- Set of All Possible Combinations of Weights between Units
- Note that Hypothesis Space (H) in ANN is very complex
- Question
- What will happen if I increase the Number of Hidden Layers in ANN?
- Answer
- Both complexity of Model (h) and computational cost will increase
- ANNs have Black Box Representation
- Model / h returned by Learner
- Best Combination of Weights between Units
- ANNs are said to have Black Box Representation because
- Useful knowledge about learned concept (or Model / h) is difficult to extract
- Importance – ANN Architecture
- The Architecture of ANN plays an important role on the
- performance and computational cost of ANN
- Important Parameters to consider in designing ANN Architecture
- Main Parameters
- No. of Input Units
- No. of Hidden Layers
- No. of Hidden Units at each Hidden Layer
- No. of Output Units
- Mathematical Function at each Hidden and Output Unit
- Weights between Units
- Whether the ANN will be Fully Connected or not
- Types of Artificial Neural Networks
- ANNs can be broadly categorized based on
- Number of Hidden Layers in an ANN Architecture
- Two Layer Neural Networks (a.k.a. Perceptron)
- Number of Hidden Layers = 0
- Multi-layer Neural Networks
- Regular Neural Network
- Number of Hidden Layers = 1
- Deep Neural Network
- Number of Hidden Layers > 1
- Chapter Focus
- Perceptron (Two Layer ANNs)
- Regular Neural Network (Multi-layer ANNs)
TODO and Your Turn
TODO Task 2
- Task 1
- Adeel wants to develop a Face Recognition System from Images. He wants to recognize the faces of five persons: Adeel, Sohail, Aqeel, Nabeel and Ghufran. Each Image is of 4×4 Pixels. To develop the Face Recognition System, a Regular Neural Network is used.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- How many Input Units will be there in Regular Neural Network?
- How many Hidden Units will be there in Regular Neural Network?
- How many Output Units will be there in Regular Neural Network?
- What will be the Main Parameters of Regular Neural Network?
- Draw architecture of Regular Neural Network for the above Machine Learning Problem?
Your Turn Task 2
- Task 1
- Select a Machine Learning Problem (similar to Face Recognition in TODO Task) and answer the questions given below.
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- How many Input Units will be there in Regular Neural Network?
- How many Hidden Units will be there in Regular Neural Network?
- How many Output Units will be there in Regular Neural Network?
- What will be the Main Parameters of Regular Neural Network?
- Draw architecture of Regular Neural Network for the selected Machine Learning Problem?
Perceptron - Two Layer Artificial Neural Networks
- Perceptron
- Definition
- A Perceptron is a simple Two Layer ANN, with
- One Input Layer
- Multiple Input Units
- One Output Layer
- Single Output Unit
- A Perceptron can be used for Binary Classification Problems
- A Sample Perceptron
- Strengths
- Perceptrons are useful to study because
- We can use Perceptrons to build larger Neural Networks
- Weaknesses
- Perceptron has limited learning abilities
- i.e. fails to learn simple Boolean-valued Functions (e.g. XOR)
- How Perceptron Works?
- A Perceptron works as follows
- Step 1: Random Weights are assigned to edges between Input-Output Units
- Step 2: Input is fed into Input Units
- Step 3: Weighted Sum of Inputs (S) is calculated
- Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit
- Step 5: Mathematical Function calculates Output for Perceptron
- Mathematical Functions - Perceptron
- Some of the Mathematical Functions that can be used in Perceptron are as follows
- Linear Function
- Simply output the Weighted Sum of Inputs (S)
- Step Function
- Output +1 if S > T, and −1 otherwise
- where S represents Weighted Sum of Inputs and
- T represents Threshold
- Sigmoid Function
- Similar to Step Function but differentiable
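- A minimal sketch of a Perceptron with a Step Function (the Weights are the initial Weights used in the Training Phase example later in this section; the Threshold is folded in as weight w0 of a constant-1 input, as explained below):

def perceptron_output(weights, inputs):
    # Weighted Sum of Inputs (S); inputs[0] is the constant 1 for the Threshold.
    s = sum(w * x for w, x in zip(weights, inputs))
    # Step Function: output +1 if S > 0, else -1.
    return +1 if s > 0 else -1

weights = [-0.5, 0.7, -0.2, 0.1, 0.9]       # [w0, w1, w2, w3, w4]
example = [1, -1, +1, +1, -1]               # constant 1, then four pixels
print(perceptron_output(weights, example))  # -1 (S = -2.2)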
- Example – Learning in Perceptron
- Machine Learning Problem
- Gender Identification from Image
- Input
- Photo / Image (Black and White) of a Human
- Output
- Gender of the Human
- Task
- Given a 2×2 Pixel Black and White Image of a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e. Learn from Input to predict Output
- Input
- 2×2 Black and White Image
- Output
- Class / Category 01 = Male
- Class / Category 02 = Female
- Categorization Rule
- If Image contains 2, 3 or 4 White Pixels then
- It is Female
- If Image contains 0 or 1 White Pixels then
- It is Male
- Perceptron Architecture
- Input Layer
- Four Input Units (one for each Pixel)
- Output Layer
- One Output Unit
- +1 for Female and
- -1 for Male
- Mathematical Function
- Step Function
- Perceptron Architecture
- Need to Learn two things
- Weights between Input and Output Units
- Value for the Threshold (T)
- Make calculations easier by
- Thinking of Threshold (T) as a Weight from a special Input Unit, whose
- Output from the Input Unit is always 1
- Exactly the same result, but we only have to learn
- Weights between Input and Output Units
- Updated Perceptron Architecture
- Input Layer
- Five Input Units
- Four Input Units (one for each Pixel)
- One Input Unit (for Threshold)
- Output Layer
- One Output Unit
- +1 for Female and
- -1 for Male
- Mathematical Function
- Step Function
- Updated Perceptron Architecture
Machine Learning Cycle – Perceptron
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in Real-world , to make prediction on Real-time unseen Data
- Feedback Phase
- Take Feedback form the Users and Domain Experts to improve the Model
- Sample Data
- Consider the Sample Data of five Black and White Images
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Numeric Representation
- Female = +1
- Male = -1
- Converting Input into Numeric Representation
- Consider the Sample Data with four Pixels for each Image
- Feature Extraction from Image Data
- Value of Black Pixel
- -1
- Value of White Pixel
- +1
- Note that Pixel values are extracted from
- Left to Right, Top to Bottom
- Sample Data Cont…
- Consider the Sample Data with four Pixels for each Image
- Sample Data – Vector Representation
E1 = <-1, +1, +1, -1> +
E2 = <-1, +1, -1, -1> –
E3 = <-1, +1, +1, +1> +
E4 = <+1, +1, +1, +1> +
E5 = < -1, -1, -1, -1> –
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Training Data – Vector Representation
E1 = <-1, +1, +1, -1> +
E2 = <-1, +1, -1, -1> –
E3 = <-1, +1, +1, +1> +
- Testing Data
- Testing Data – Vector Representation
E4 = <+1, +1, +1, +1> +
E5 = < -1, -1, -1, -1> –
Perceptron – Learning Algorithm
- Perceptron – Learning Algorithm
- Perceptron Training Rule
- Perceptron Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Perceptron Training Rule Works?
- When Target Output t(E) is different from Observed Output o(E)
- Add Δi on to Weight wi
- where Δi = η ( t(E) − o(E) ) xi
- Do this for every Weight in Perceptron (or ANN)
- Interpretation
- Considering the Gender Identification Problem
- (t(E) − o(E)) will either be +2 or −2 (when they differ, t(E) and o(E) cannot have the same sign)
- So we can think of the addition of Δi as the movement of Weights in a direction
- Which will improve the Perceptron (or ANN) performance with respect to E
- Multiplication by xi
- Moves it more if the Input is bigger
- Learning Rate
- η is called the Learning Rate
- Usually set to something small (e.g., 0.1)
- To control the movement of the Weights
- Not to move too far for one Training Example
- which may over-compensate for another Training Example
- If a large movement is actually necessary for the Weights to correctly categorise E
This will occur over time with multiple epochs
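- A minimal sketch of this rule as an update function (η = 0.1; the values match Epoch 01 of the Training Phase below):

def update_weights(weights, inputs, target, observed, eta=0.1):
    # w_i <- w_i + eta * (t(E) - o(E)) * x_i, for every weight.
    return [w + eta * (target - observed) * x
            for w, x in zip(weights, inputs)]

weights = [-0.5, 0.7, -0.2, 0.1, 0.9]
x1 = [1, -1, +1, +1, -1]                     # constant-1 input, then pixels
new_weights = update_weights(weights, x1, target=+1, observed=-1)
print([round(w, 1) for w in new_weights])    # [-0.3, 0.5, 0.0, 0.3, 0.7]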
Training Phase – Perceptron
- Training Phase
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Epoch 01
- Compute Weighted Sum of Inputs
- S = W0* 1 + W1*X1 + W2*X2 + W3*X3 + W4*X4
- S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1)
- S = -2.2
- Apply Step Function to S to get prediction from Perceptron i.e. o(x1)
- Output of Perceptron (or ANN)
- o(x1) = -1
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Perceptron Training Rule
- Calculating the Error Values
- Δ0 = η(t(E)-o(E)) x0
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ1 = η(t(E)-o(E)) x1
- = 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
- Δ2 = η(t(E)-o(E)) x2
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ3 = η(t(E)-o(E)) x3
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ4 = η(t(E)-o(E)) x4
- = 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
- Calculating the New Weights
- w’0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
- w’1 = 0.7 + Δ1 = 0.7 + -0.2 = 0.5
- w’2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
- w’3= 0.1 + Δ3 = 0.1 + 0.2 = 0.3
- w’4 = 0.9 + Δ4 = 0.9 – 0.2 = 0.7
- Perceptron with New Weights
- Training Phase Cont…
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Epoch 02
- Compute Weighted Sum of Inputs
- S = -0.3 (1) + 0.5(-1) + 0*(1) + 0.3*(1) + 0.7(-1)
- S = -1.2
- Apply Step Function to S to get prediction from Perceptron i.e. o(x1)
- If (S > 0) Then
o(x1) = +1
else
o(x1) = -1
- Output of Perceptron (or ANN)
- o(x1) = -1
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Perceptron Training Rule
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Still gets the wrong Classification / Categorization
- But the value is closer to ZERO (from -2.2 to -1.2)
- In a few epochs' time, Training Example x1 will be correctly Classified / Categorized
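- A minimal sketch of this loop over epochs (reusing the hypothetical perceptron_output and update_weights helpers from the earlier sketches):

training_data = [
    ([1, -1, +1, +1, -1], +1),   # E1
    ([1, -1, +1, -1, -1], -1),   # E2
    ([1, -1, +1, +1, +1], +1),   # E3
]
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]

for epoch in range(100):
    mistakes = 0
    for inputs, target in training_data:
        observed = perceptron_output(weights, inputs)
        if observed != target:
            weights = update_weights(weights, inputs, target, observed)
            mistakes += 1
    if mistakes == 0:   # every Training Example correctly Classified
        break
print(epoch, [round(w, 1) for w in weights])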
- Assumption
- I assume that we have trained the Perceptron on all 3 Training Examples and that the Model (or Hypothesis) learned is given below
- Summary - Training Phase of ANN Algorithm
- Recall
- Training Data
- Model
- Note that the Model is a Black Box Representation and it is very difficult to
- understand what the Model has learned
- In sha Allah, in the next Phase i.e. Testing Phase, we will
Evaluate the performance of the Model
Testing Phase – Perceptron
- Testing Phase
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total Number of Test instances
Note
- Accuracy = 1 – Error
- Evaluate Model (Perceptron) Cont…
First Test Example
- x4 = <+1, +1, +1, +1> +
- t(x4) = +1
- Evaluating Test Example x4
- Compute Weighted Sum of Inputs
- S = (-0.8 * 1) + (0.4 * 1) + (0.2 * 1) + (0.8 * 1) + (0.3 * 1)
- S = 0.9
- Apply Step Function to S to get prediction from Perceptron i.e. o(x4)
- Output of Perceptron (or ANN)
- o(x4) = +1
- Actual Value t(x4) is same as Predicted Value o(x4)
- i.e. Test instance is correctly Classified
- Second Test Example
- X5 = <-1, -1, -1, -1> –
- t(x5) = -1
- Evaluating Test Example x5
- Compute Weighted Sum of Inputs
- S = -0.8 (1) + 0.4(-1) + 0.2*(-1) + 0.8*(-1) + 0.3(-1)
- S = -2.5
- Apply Step Function to S to get prediction from Perceptron i.e. o(x5)
- If (S > 0) Then
- o(x5) = +1
- else
- o(x5) = -1
- Output of Perceptron (or ANN)
- o(x5) = -1
- Actual Value t(x5) is same as Predicted Value o(x5)
- i.e. Test instance is correctly Classified
- Summary - Evaluate Model (Perceptron)
Application Phase – Perceptron
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that the order of Attributes / Features must be exactly the same as that of Training and Testing Examples
Step 3: Apply Model on Feature Vector
- Step 4: Return Prediction to the User
- Male
- Note
You can take Input from user, apply Model and return predictions as many times as you like 😊
Feedback Phase – Perceptron
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
- Improve your Model based on Feedback 😊
- Strengths and Weaknesses – Perceptron
- Strengths
- Perceptrons can learn Binary Classification Problems
- For example, we learned the Gender Identification Problem using a simple Perceptron
- Perceptrons can be combined to make larger ANNs
- Perceptrons can learn linearly separable functions
- Weaknesses
- Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable (e.g. XOR)
- Examples – Limitations of Perceptron
- Perceptrons can learn simple Boolean-valued Functions which are linearly separable
- For example, AND Function, OR Function etc.
- Truth Table of AND Function
- Truth Table of OR Function
- Boolean-valued AND Function and OR Function
- Perceptrons cannot learn simple Boolean-valued Functions which are not linearly separable
- For example, XOR Function
- Truth table of XOR Function
- Boolean-valued XOR Function
TODO and Your Turn
TODO Task 3
- Task 1
- Consider the Sample Data of five Black and White Images of 2×2 Pixels. If an Image contains 2, 3 or 4 White Pixels then it will be categorized as Emotion, otherwise it will be categorized as Neutral (or No Emotion).
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- How many Input Units will be there in Perceptron?
- How many Hidden Units will be there in Perceptron?
- How many Output Units will be there in Perceptron?
- What will be the Main Parameters of Perceptron?
- Draw architecture of Perceptron for the above Machine Learning Problem?
- Task 2
- Execute Machine Learning Cycle for the above Machine Learning Problem?
Your Turn Task 3
- Task 1
- Select a Machine Learning Problem (similar to Emotion Prediction Problem given in TODO Task) and answer the questions given below.
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- How many Input Units will be there in Perceptron?
- How many Hidden Units will be there in Perceptron?
- How many Output Units will be there in Perceptron?
- What will be the Main Parameters of Perceptron?
- Draw architecture of Perceptron for the selected Machine Learning Problem?
- Execute Machine Learning Cycle for the selected Machine Learning Problem?
Multi-layer Artificial Neural Networks
- Multi-layer ANN
- Definition
- A Multilayer Feed Forward Neural Network consists of an Input Layer, one or more Hidden Layers, and an Output Layer
- One Input Layer
- Multiple Input Units
- One or More Hidden Layers
- Multiple Hidden Units
- Output Layer
- Multiple Output Units
- A Multi-layer ANN can be used for both
- Binary Classification Problems and
- Multi-class Classification Problems
- Chapter Focus
- The focus of this Chapter is on Multi-layer Neural Networks with one Hidden Layer i.e. Regular Neural Network
- Regular Neural Network
- A Sample Regular Neural Network
- Strengths and Weaknesses - Regular Neural Network
- Strengths
- Can learn both linearly separable and non-linearly separable Target Functions
- Can handle noisy Data
- Can learn Machine Learning Problems with very complex Numerical Representations
- Weaknesses
- Computational Cost and Training Time are high
- How Regular Neural Network Works?
- A Multilayer ANN works as follows
- Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units
- Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)
- Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)
- Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)
- Step 5: Weighted Sum of the Hidden-Unit Outputs is calculated for each Output Unit and fed to the Output Units (making up the Output Layer)
- Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)
- Step 7: Softmax Layer converts the vectors at Output Units into probabilities and Class with highest probability is the Prediction of the Regular Neural Network
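- A minimal sketch of Steps 2–6 for a 4-input, 2-hidden, 2-output network with Sigmoid Units (all weights, biases and inputs below are hypothetical):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weights, biases):
    # One fully connected layer: per Unit, a Weighted Sum of Inputs plus a
    # bias, passed through the Sigmoid Function.
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

x = [0.2, 0.3, 0.1, 0.9]                     # hypothetical input vector
hidden = layer(x, [[0.5, -0.4, 0.3, 0.1],
                   [-0.2, 0.6, 0.7, -0.1]], [0.1, -0.3])
output = layer(hidden, [[0.4, -0.6], [0.2, 0.5]], [0.0, 0.1])
print(output)   # raw Output-Unit values, ready for the Softmax Layer (Step 7)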
- Mathematical Functions – Regular Neural Network
- Some of the popular and widely used Mathematical Functions in Regular Neural Networks are
- Sigmoid
- Formula of Sigmoid Function
- σ(S) = 1 / (1 + e^(−S))
- where ‘S’ is Weighted sum of Inputs
- Example – Learning in Regular Neural Network
- Machine Learning Problem
- Gender Identification from Image
- Input
- Photo / Image (Color) of a Human
- Output
- Gender of the Human
- Task
- Given RGB Color Image of a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e. Learn from Input to predict Output
- Example – Learning in Multilayer ANN Cont…
- Input
- 2×2 RGB Image
- Output
- Class / Category 01 = Male
- Class / Category 02 = Female
- Categorization Rule
- If Image contains 2, 3 or 4 Red Pixels then
- It is Female
- If Image contains 0 or 1 Red Pixels, then
- It is Male
- Multilayer Architecture
- Input Layer
- Four Input Units (one for each Pixel)
- Hidden Layer
- Two Hidden Units that receive input (Weighted Sum of Inputs) from the Input Layer and send their Outputs to the Output Layer
- Output Layer
- Two Output Units
- O1 for Female
- O2 for Male
- Mathematical Function
- Sigmoid Function
- Multilayer ANN Architecture
- Need to Learn
Combination of Weights between Units which best fits the Training Data
Machine Learning Cycle – Regular Neural Network
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in Real-world, to make prediction on Real-time unseen Data
- Feedback Phase
- Take Feedback form the Users and Domain Experts to improve the Model
- Sample Data
- Consider the Sample Data of five Color Images
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Numeric Representation
- Female = +1
- Male = -1
- Converting Input into Numeric Representation
- Feature Extraction from Image Data
- Extract four color Pixels for each Image
- Feature Extraction from Image Data
- Value of Red Pixel
- 0 – 255
- Value of Green Pixel
- 0 – 255
- Value of Blue Pixel
- 0 – 255
- Note that Pixel values are extracted from
- Left to Right, Top to Bottom
- Sample Data Cont…
- Consider the Sample Data with Numeric Representation
- Sample Data – Vector Representation
E1 = < 200, 30, 70, 175 > +
E2 = < 140, 15, 84, 211 > –
E3 = < 25, 78, 158, 125 > +
E4 = < 36, 146, 243, 64 > +
E5 = < 198, 31, 214, 34 > –
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Training Data – Vector Representation
E1 = < 200, 30, 70, 175 > +
E2 = < 140, 15, 84, 211 > –
E3 = < 25, 78, 158, 125 > +
- Testing Data
- Testing Data – Vector Representation
E4 = < 36, 146, 243, 64 > +
E5 = < 198, 31, 214, 34 > –
Multilayer ANN – Learning Algorithm
- Learning Algorithm – Input / Output
- Input
- Set of Training Example (D)
- Learning Rate (l)
- A Regular Neural Network (N)
- Output
- A Trained Neural Network (Model / h)
- Algorithm
- Regular Neural Network learns the Gender Identification Problem using the Backpropagation Algorithm
- Learning and Backpropagation Algorithm
- Note
- The pseudo code below is taken from the following Book
- Data Mining, Third Edition (Page: 398)
- In the given pseudo code
b refers to Bias
- Biases
- Biases are values associated with each Unit in the Hidden Layer and Output Layer of a Regular Neural Network, but in practice they are treated in exactly the same manner as other Weights
- Regular Neural Network Training Rule
- Regular Neural Network Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Regular Neural Network Training Rule Works?
- Step 1: For a given Training Example, set the Target Output Value of Output Unit (which is mapped to the Class of given Training Example) as 1 and
- Set the Target Output Values of remaining Output Units as 0
- Example – Step 1
- Consider our Regular Neural Network for Gender Identification Problem
- If Training Example E is Positive (Female), then
- Set Target Output Value of O1 to 1 and
- Set Target Output Value of O2 to 0
- If Training Example E is Negative (Male), then
- Set Target Output Value of O1 to 0 and
- Set Target Output Value of O2 to 1
- Step 2: Train the Regular Neural Network using Training Example E and get Predictions (Observed Output o(E)) from the Regular Neural Network
- Step 3: If Target Output t(E) is different from Observed Output o(E)
- Calculate the Network Error
- Back propagate the Error by updating weights between Output-Hidden Units and Hidden-Input Units
- Step 4: Keep training Regular Neural Network, until Network Error becomes very small
- Learning Rate
- η is called the Learning Rate
- Usually set to something small (e.g., 0.1)
- To control the movement of the Weights
- Not to move too far for one Training Example
- which may over-compensate for another Training Example
- If a large movement is actually necessary for the Weights to correctly categorize E
- This will occur over time with multiple epochs
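- A minimal sketch of one Backpropagation update for a single Training Example, using the same formulas as the worked example below: Err_j = O_j(1 − O_j)(T_j − O_j) for Output Units, Err_j = O_j(1 − O_j) Σ_k Err_k w_jk for Hidden Units, w_ij += l · Err_j · O_i and b_j += l · Err_j (the 4-2-2 network shape is hypothetical):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, target, w_ih, b_h, w_ho, b_o, l=0.9):
    # Forward pass through Hidden and Output Layers (sigmoid units).
    o_h = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
           for ws, b in zip(w_ih, b_h)]
    o_o = [sigmoid(b + sum(w * oh for w, oh in zip(ws, o_h)))
           for ws, b in zip(w_ho, b_o)]
    # Errors of Output Units: Err_k = O_k (1 - O_k) (T_k - O_k).
    err_o = [o * (1 - o) * (t - o) for o, t in zip(o_o, target)]
    # Errors of Hidden Units: Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk.
    err_h = [oh * (1 - oh) * sum(err_o[k] * w_ho[k][j]
                                 for k in range(len(err_o)))
             for j, oh in enumerate(o_h)]
    # Updates: w_ij += l * Err_j * O_i ; b_j += l * Err_j.
    w_ho = [[w + l * err_o[k] * o_h[j] for j, w in enumerate(ws)]
            for k, ws in enumerate(w_ho)]
    b_o = [b + l * e for b, e in zip(b_o, err_o)]
    w_ih = [[w + l * err_h[j] * x[i] for i, w in enumerate(ws)]
            for j, ws in enumerate(w_ih)]
    b_h = [b + l * e for b, e in zip(b_h, err_h)]
    return w_ih, b_h, w_ho, b_o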
Training Phase – Regular Neural Network
- Training Phase
- First Training Example
- x1= < 200, 30, 70, 175 > +
- t(x1) = +1
- Epoch 01
- Compute Weighted Sum of Inputs (S) as Input to Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-0.2) + (0.2)*(200) + (-0.1)*(30) + (0.4)*(70) + (0.3)*(175)
- IH1 = 117.3
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.9) + (0.7)*(200) + (-0.4)*(30) + (0.8)*(70) + (0.1)*(175)
- IH2 = 202.4
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- Using O_j = 1 / (1 + e^(-I_j))
- OH1 = 0.4670
- OH2 = 0.4433
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) as Input to the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-0.4) + (-0.3)*(0.4670) + (0.9)*(0.4433)
- IO1 = -0.1411
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.7) + (0.6)*(0.4670) + (-0.5)*(0.4433)
- IO2 = 0.7585
- Apply Sigmoid Function to calculate the output from the ANN
- Using O_j = 1 / (1 + e^(-I_j))
- OO1 = 1 / (1 + e^(0.1411)) = 0.4647
- OO2 = 1 / (1 + e^(-0.7585)) = 0.6810
- So, the Multilayer ANN predicts the category associated with O2
- Output of Multilayer ANN
- o(x1) = -1 (Male)
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Backpropagation Algorithm / Method
- Backpropagate the Errors
- Compute the Error of each Unit in the Output Layer
- Using Err_j = O_j * (1 − O_j) * (T_j − O_j)
- Error of Output Unit O1
- Err_O1 = 0.4647 * (1 − 0.4647) * (1 − 0.4647) = 0.1331
- Error of Output Unit O2
- Err_O2 = 0.6810 * (1 − 0.6810) * (−1 − 0.6810) = −0.3651
- Compute the Error of each Unit in the Hidden Layer
- Using Err_j = O_j * (1 − O_j) * Σ_k Err_k * w_jk
- Error of Hidden Unit H1
- Err_H1 = 0.4670 * (1 − 0.4670) * (0.1331 * (−0.3) + (−0.3651) * 0.6) ≈ −0.064
- Error of Hidden Unit H2
- Err_H2 = 0.4433 * (1 − 0.4433) * (0.1331 * 0.9 + (−0.3651) * (−0.5)) ≈ 0.075
- Update the Weights and Biases
- Updated Weights between Hidden and Output Units
- Using w_ij = w_ij + Δw_ij, where Δw_ij = l * Err_j * O_i (here l = 0.9)
- w'11 = −0.3 + 0.9 * 0.1331 * 0.4670 = −0.24
- w'12 = 0.6 + 0.9 * (−0.3651) * 0.4670 = 0.45
- w'21 = 0.6 + 0.9 * 0.1331 * 0.4433 = 0.65
- w'22 = −0.5 + 0.9 * (−0.3651) * 0.4433 = −0.65
- Updated Biases of Output Units
- Using b_j = b_j + Δb_j, where Δb_j = l * Err_j
- b'O1 = −0.4 + 0.9 * 0.1331 = −0.28
- b'O2 = 0.7 + 0.9 * (−0.3651) = 0.37
- Updated Weights between Input and Hidden Units
- Using w_ij = w_ij + l * Err_j * X_i
- w'11 = 0.2 + 0.9 * (−0.064) * 200 = −11.32
- Calculate the Updated Weights of all the remaining Units in the same way
- Updated Biases of Hidden Units
- b'H1 = −0.2 + 0.9 * (−0.064) = −0.26
- b'H2 = 0.9 + 0.9 * 0.075 = 0.97
- Assumption
- I assume that we have trained the Regular Neural Network on all 3 Training Examples and that the Model (or hypothesis) learned is given below
- Summary - Training Phase of Regular Neural Network Algorithm
- Recall
- Data = Model + Error
- Training Data
- Model
- Note that the Model is a Black Box Representation and it is very difficult to
- understand what the Model has learned
- In sha Allah, in the next Phase i.e. Testing Phase, we will
Evaluate the performance of the Model
Testing Phase – Multilayer ANN
- Testing Phase
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total Number of Test instances
- Note
- Accuracy = 1 – Error
- Evaluate Model (Regular Neural Network) Cont…
- First Test Example
- x4 = < 36, 146, 243, 64 > +
- t(x4) = +1 (Female)
- Evaluating Test Example x4
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-1.5) + 0.3*(36) + (-0.2)*(146) + 0.5*(243) + 1.2*(64)
- IH1 = 178.4
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.4) + 0.6*(36) + (-1.3)*(146) + 1.3*(243) + 3.2*(64)
- IH2 = 352.9
- Apply Sigmoid Function O_j = 1 / (1 + e^(-I_j)) to calculate the output from the Hidden Layer
- OH1 = 0.2311
- OH2 = 0.1547
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) into the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.2311) + 0.23*(0.1547)
- IO1 = -0.9181
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.2311) + 1.14*(0.1547)
- IO2 = 0.7364
- Apply Sigmoid Function to calculate the output from the ANN
- OO1 = 1 / (1 + e^(0.9181)) = 0.2853
- OO2 = 1 / (1 + e^(-0.7364)) = 0.6762
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x4) = -1 (Male)
- Actual Value t(x4) is different from Predicted Value o(x4)
- i.e. Test Example is incorrectly Classified
- Second Test Example
- x5 = < 198, 31, 214, 34 > –
- t(x5) = -1 (Male)
- Evaluating Test Example x5
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_{i=1..n} w_ij * O_i, where O_i = X_i for Input Units
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-1.5) + 0.3*(198) + (-0.2)*(31) + 0.5*(214) + 1.2*(34)
- IH1 = 199.5
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.4) + 0.6*(198) + (-1.3)*(31) + 1.3*(214) + 3.2*(34)
- IH2 = 465.9
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- Using O_j = 1 / (1 + e^(-I_j))
- OH1 = 0.3047
- OH2 = 0.1787
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) to the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.3047) + 0.23*(0.1787)
- IO1 = -0.8979
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.3047) + 1.14*(0.1787)
- IO2 = 0.7829
- Apply Sigmoid Function to calculate the output from the ANN
- Using O_j = 1 / (1 + e^(-I_j))
- OO1 = 1 / (1 + e^(0.8979)) = 1 / (1 + 2.4544) = 0.2894
- OO2 = 1 / (1 + e^(-0.7829)) = 1 / (1 + 0.4570) = 0.6863
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x5) = -1 (Male)
- Actual Value t(x5) is same as Predicted Value o(x5)
- Test Example is correctly Classified
- Summary - Evaluate Model (Multilayer ANN)
- Apply Model on Test Data
Application Phase – Regular Neural Network
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that the order of Attributes / Features must be exactly the same as that of Training and Testing Examples
- Step 3: Apply Model on Feature Vector
- Unseen Example
- x = < 235, 64, 159, 41>
- Make Prediction for Unseen Example x
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = -1.5 + 0.3*(235) + (-0.2)*(64) + 0.5*(159) + 1.2*(41)
- IH1 = 184.9
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = 0.4 + 0.6*(235) + (-1.3)*(64) + 1.3*(159) + 3.2*(41)
- IH2 = 396.1
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- OH1 = 0.1666
- OH2 = 0.0955
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) into the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.1666) + 0.23*(0.0955)
- IO1 = -0.9447
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.1666) + 1.14*(0.0955)
- IO2 = 0.6521
- Apply Sigmoid Function to calculate the output from the ANN
- OO1 = 1 / (1 + e^(0.9447)) = 0.2800
- OO2 = 1 / (1 + e^(-0.6521)) = 0.6575
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x) = -1 (Male)
- Step 4: Return Prediction to the User
- Male
- Note
- You can take Input from user, apply Model and return predictions as many times as you like 😊
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
Improve your Model based on Feedback 😊
Overfitting – Artificial Neural Networks
- Overfitting
- Definition
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples or
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Example – Overfitting in ANN
- Plotting Training Example Error versus Test Example Error shows that
- Test Set Error is increasing
- ANN is Overfitting the Training Data
- Problems with Local Minima
- Backpropagation Algorithm is Gradient Descent Search
- Where the height of the hills is determined by Error
- But there are many dimensions to Search Space
- One for each Weight in ANN
- Therefore, Backpropagation Algorithm
- Can find its ways into Local Minima
- Possible Solutions
- There can be many Possible Solutions to overcome the problem of Overfitting in ANN
- I am presenting below only four possible solutions
- Possible Solution 01
- Training and Validation Set Approach
- Possible Solution 02
- Learn Multiple ANNs
- Possible Solution 03
- Momentum
- Possible Solution 04
- Weight Decay Factor
- Avoiding Overfitting - Training and Validation Set Approach
- Using this approach, Sample Data is split into three sets
- Training Set
- Validation Set
- Testing Set
- Strengths
- Validation Set helps to check whether the Model is Overfitting or not during the Training
- Weaknesses
- Holding data back for a Validation Set reduces data available for Training
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Note
- Training Set and Validation Set are used in the
- Training Phase
- Testing Set is used in the
- Testing Phase
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Question 1
- How to split Sample Data using Training and Validation Set Approach when we have very large / huge amount of data?
- Answer 1
- A Good Split may be
- Training Set = 80%
- Validation Set = 10%
- Testing Set = 10%
- Note: to efficiently train Deep Learning Algorithms, we need a huge amount of Training Data
- Question 2
- How to split Sample Data using Training and Validation Set Approach when we have sufficiently large amount of data?
- Answer 2
- A Good Split may be
- Training Set = 80%
- Testing Set = 20%
- Validation Set = 10% of Training Set
- Example 1 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Text Summarization
- Deep learning Algorithm
- LSTM
- Total Sample Data
- 100,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 80,000
- Validation Set = 10% = 10,000
- Testing Set = 10% = 10,000
- Example 2 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Sentiment Analysis
- Machine Learning Algorithm
- Random Forest
- Total Sample Data
- 10,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 8,000
- Testing Set = 20% = 2,000
- Validation Set = 10% of Training Set = 800 (leaving 7,200 instances for actual Training)
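- A minimal sketch of the split in Example 2 using scikit-learn's train_test_split (an assumption; any random-split utility would do):

from sklearn.model_selection import train_test_split

X, y = list(range(10000)), [i % 2 for i in range(10000)]   # dummy data

# 80% Training, 20% Testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Hold out 10% of the Training Set as a Validation Set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.10, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 7200 800 2000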
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Question
- How do I know that Model is Overfitting during Training?
- Answer
- Model is Not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- Model is Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
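- A minimal sketch of this check as an early-stopping loop (train_one_epoch and validation_accuracy are hypothetical callables supplied by the caller):

def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              max_epochs=100, patience=3):
    # Stop Training once Validation Accuracy stops improving, i.e. once
    # the Model has started Overfitting.
    best, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc > best:
            best, bad_epochs = acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best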
- Avoiding Overfitting – Learn Multiple ANNs
- Learn multiple ANNs
- Starting with different random Weight settings
- To make Predictions on Unseen Examples
- Choice No. 1
- Use the best ANN
- Choice No. 2
- Use a Voting Classifier comprising multiple ANNs
- Avoiding Overfitting – Adding Momentum
- Imagine rolling a ball down a hill
- Momentum in Backpropagation Algorithm
- For each Weight
- Remember what was added in the previous Epoch
- In the current epoch
- Add on a small amount of the previous Δ
- The amount is determined by
- Momentum Parameter (denoted by α)
- α is taken to be between 0 and 1
- Caution:
- May not have enough Momentum to
- get out of Local Minima
- Also, too much Momentum might carry the search
- back out of a Global Minimum, into a Local Minimum
- Avoiding Overfitting – Use a Weight Decay Factor
- Using Weight Decay Factor
- Take a small amount off every Weight after each Epoch
- Note that ANNs with smaller Weights aren’t as highly fine-tuned (Overfit)
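- A minimal sketch combining both tweaks (Momentum and the Weight Decay Factor) in a single weight update (the gradient step, α = 0.9 and the decay factor 0.001 are hypothetical values):

def update_weight(w, gradient_step, prev_delta, alpha=0.9, decay=0.001):
    # Momentum: add a small amount (alpha) of the previous epoch's delta.
    delta = gradient_step + alpha * prev_delta
    # Weight Decay Factor: take a small amount off the weight each epoch.
    w = (w + delta) * (1.0 - decay)
    return w, delta

w, prev_delta = 0.7, 0.0
for epoch in range(3):
    w, prev_delta = update_weight(w, gradient_step=0.05, prev_delta=prev_delta)
    print(round(w, 4))   # the weight grows with momentum, minus a small decay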
- Strengths and Weaknesses – ANNs
- Strengths
- Can learn problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Requires a lot of Training Time (particularly Deep Learning Models)
- Computational cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. take a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
TODO and Your Turn
TODO Task 4
- Task 1
- Consider the Sample Data of five Color Images of 2×2 Pixels each. If an Image contains 2, 3, or 4 White Pixels, then it will be categorized as Emotion; otherwise, it will be categorized as Neutral (or No Emotion).
- Note
- Your answer should be
- Well Justified
- Questions
- What are the Input and Output for the above Machine Learning Problem?
- How many Input Units will there be in the Regular Neural Network?
- How many Hidden Units will there be in the Regular Neural Network?
- How many Output Units will there be in the Regular Neural Network?
- What will be the Main Parameters of the Regular Neural Network?
- Draw the architecture of the Regular Neural Network for the above Machine Learning Problem.
- Execute the Machine Learning Cycle for the above Machine Learning Problem.
Your Turn Task 4
- Task 1
- Select a Machine Learning Problem (similar to the Emotion Prediction Problem given in the TODO Task) and answer the questions given below.
- Questions
- What are the Input and Output for the selected Machine Learning Problem?
- How many Input Units will there be in the Regular Neural Network?
- How many Hidden Units will there be in the Regular Neural Network?
- How many Output Units will there be in the Regular Neural Network?
- What will be the Main Parameters of the Regular Neural Network?
- Draw the architecture of the Regular Neural Network for the selected Machine Learning Problem.
- Execute the Machine Learning Cycle for the selected Machine Learning Problem.
Chapter Summary
- Chapter Summary
- Following Machine Learning Algorithms are based on Symbolic Representation
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm
- Symbolic Representation – Representing Training Examples
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Symbolic Representation – Representing Hypothesis (h)
- Two Types of Hypothesis (h) Representations
- Conjunction (AND) of Constraints on Input Attributes
- Disjunction (OR) of Conjunction (AND) of Input Attributes
- Note that both types of Hypothesis (h) Representations are Symbolic
- i.e. based on Symbols (Categorical Values)
- Problem – Symbolic Representation
- Cannot handle Machine Learning Problems with very complex Numeric Representations
- Solution
- Non-Symbolic Representations
- for e.g. Artificial Neural Networks
- Artificial Neural Networks – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Numeric
- Output
- Numeric
- Representation of Hypothesis (h)
- Combination of Weights between Units
- Weights are Numeric values
- Searching Strategy
- Exhaustive Search
- Training Regime
- Batch Method
- Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms which learn from Numeric Input to predict Numeric Output
- ANNs are suitable for those Machine Learning Problems in which
- Training Examples (both Input and Output) can be represented as real values (Numeric Representation)
- Hypothesis (h) can be represented as real values (Numeric Representation)
- Slow Training Times are OK
- Predictive Accuracy is more important than understanding
- When we have noise in Training Data
- The General Architecture of Artificial Neural Networks mainly has three Layers
- Input Layer
- Hidden Layer
- Output Layer
- Each Layer contains one or more Units
- Main Parameters of ANNs are
- No. of Input Units
- No. of Hidden Layers
- No. of Hidden Units at each Hidden Layer
- No. of Output Units
- Mathematical Function at each Hidden and Output Unit
- Weights between Units
- ANN will be Fully Connected or Not?
- Learning Rate
- ANNs can be broadly categorized based on
- Number of Hidden Layers in an ANN Architecture
- Two Layer Neural Networks (a.k.a. Perceptron)
- Number of Hidden Layers = 0
- Multi-layer Neural Networks
- Regular Neural Network
- Number of Hidden Layers = 1
- Deep Neural Network
- Number of Hidden Layers > 1
- A Perceptron is a simple Two Layer ANN, with
- One Input Layer
- Multiple Input Units
- One Output Layer
- Single Output Unit
- A Perceptron can be used for Binary Classification Problems
- We can use Perceptrons to build larger Neural Networks
- A Perceptron has limited learning abilities, i.e. it fails to learn simple Boolean-valued Functions which are not linearly separable (for e.g. XOR)
- A Perceptron works as follows
- Step 1: Random Weights are assigned to edges between Input-Output Units
- Step 2: Input is fed into Input Units
- Step 3: Weighted Sum of Inputs (S) is calculated
- Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit
- Step 5: Mathematical Function calculates Output for Perceptron
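- A minimal Python sketch of Steps 1-5 for a Perceptron with three Input Units, assuming a step function as the Mathematical Function at the Output Unit; all values are illustrative
```python
# Forward pass of a Perceptron: random Weights, Weighted Sum, step function.
import numpy as np

weights = np.random.uniform(-0.5, 0.5, 3)   # Step 1: random Weights on the edges
x = np.array([1.0, 0.0, 1.0])               # Step 2: Input fed into Input Units
S = np.dot(weights, x)                      # Step 3: Weighted Sum of Inputs (S)
output = 1 if S > 0 else 0                  # Steps 4-5: step function at the Output Unit
print(output)
```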
- Before using ANN to build a Model
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Perceptron – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Exhaustive Search
- Training Regime
- Incremental Method
- Strengths
- Perceptrons can learn Binary Classification Problems
- Perceptrons can be combined to make larger ANNs
- Perceptrons can learn linearly separable functions
- Weaknesses
- Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable functions
- A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer
- A Regular Neural Network can be used for both
- Binary Classification Problems and
- Multi-class Classification Problems
- Regular Neural Network – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Exhaustive Search
- Training Regime
- Incremental Method
- Strengths
- Can learn both linearly separable and non-linearly separable Target Functions
- Can handle noisy Data
- Can learn Machine Learning Problems with very complex Numerical Representations
- Weaknesses
- Computational Cost and Training Time are high
- A Regular Neural Network works as follows
- Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units
- Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)
- Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)
- Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)
- Step 5: Weighted Sum of Inputs (S) is calculated from the Hidden Unit outputs and fed to the Output Units (making up the Output Layer)
- Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)
- Step 7: Softmax Layer converts the values at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network (see the sketch below)
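- A minimal NumPy sketch of Steps 1-7 for a Regular Neural Network with 3 Input Units, 4 Hidden Units, and 2 Output Units, assuming sigmoid as the Mathematical Function at the Hidden and Output Units; all sizes and values are illustrative
```python
# Forward pass of a Regular Neural Network: Input -> Hidden -> Output -> Softmax.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W_ih = rng.uniform(-0.5, 0.5, (4, 3))   # Step 1: random Weights, Input->Hidden
W_ho = rng.uniform(-0.5, 0.5, (2, 4))   # Step 1: random Weights, Hidden->Output

x = np.array([0.2, 0.7, 0.1])           # Step 2: Inputs fed into the Input Layer
h = sigmoid(W_ih @ x)                   # Steps 3-4: Weighted Sum + function at Hidden Units
o = sigmoid(W_ho @ h)                   # Steps 5-6: Weighted Sum + function at Output Units

probs = np.exp(o) / np.exp(o).sum()     # Step 7: Softmax Layer -> probabilities
prediction = int(np.argmax(probs))      # Class with the highest probability
print(probs, prediction)
```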
- Regular Neural Network Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Regular Neural Network Training Rule Works?
- Step 1: For a given Training Example, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) to 1 and
- Set the Target Output Values of the remaining Output Units to 0
- Step 2: Train the Regular Neural Network using Training Example E and get the Prediction (Observed Output o(E)) from the Regular Neural Network
- Step 3: If Target Output t(E) is different from Observed Output o(E)
- Calculate the Network Error
- Backpropagate the Error by updating Weights between Output-Hidden Units and Hidden-Input Units
- Step 4: Keep training the Regular Neural Network until the Network Error becomes very small
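- A minimal sketch of Steps 1-4, continuing the network from the previous sketch; the update shown is ordinary Backpropagation (Gradient Descent) for sigmoid Units, and all values are illustrative
```python
# Train on a single Training Example until the Network Error becomes very small.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W_ih = rng.uniform(-0.5, 0.5, (4, 3))
W_ho = rng.uniform(-0.5, 0.5, (2, 4))
x = np.array([0.2, 0.7, 0.1])
t = np.array([1.0, 0.0])                 # Step 1: one-hot Target Output t(E)
eta = 0.5                                # Learning Rate

for epoch in range(1000):                # Step 4: keep training ...
    h = sigmoid(W_ih @ x)                # Step 2: get Observed Output o(E)
    o = sigmoid(W_ho @ h)
    error = t - o                        # Step 3: Network Error
    if np.abs(error).max() < 0.01:       # ... until the Error becomes very small
        break
    delta_o = error * o * (1 - o)                 # Backpropagate: Output-Hidden deltas
    delta_h = (W_ho.T @ delta_o) * h * (1 - h)    # Hidden-Input deltas
    W_ho += eta * np.outer(delta_o, h)   # update Output-Hidden Weights
    W_ih += eta * np.outer(delta_h, x)   # update Hidden-Input Weights
```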
- Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)
- Four Possible Solutions to overcome Overfitting in ANNs are as follows
- Training and Validation Set Approach
- Learn Multiple ANNs
- Momentum
- Weight Decay Factor
- ANNs – Strengths and Weaknesses
- Strengths
- Can learn problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Requires a lot of Training Time (particularly Deep Learning Models)
- Computational cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. take a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present
- Bayesian Learning