Chapter 11 – Artificial Neural Networks
Chapter Outline
- Chapter Outline
- Quick Recap
- Artificial Neural Networks
- General Architecture – Artificial Neural Networks
- Perceptron – Two Layer Artificial Neural Networks
- Machine Learning Cycle – Perceptron
- Multi-layer Artificial Neural Networks
- Overfitting – Artificial Neural Networks
- Chapter Summary
Quick Recap
- Quick Recap – Decision Tree Learning
- Main Problems – Candidate Elimination Algorithm
- Cannot handle noisy Data
- Cannot handle Numeric Data
- Cannot work for complex representations of Data
- It may fail to find the Target Function in the Hypothesis Space
- i.e. converge to an empty Version Space
- Proposed Solution
- Decision Tree Learning Algorithms
- Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree
- Learned Decision Trees (or Models) can also be re-represented as
- Sets of If-Then Rules (to improve human readability)
- Decision Tree Learning is most popular in Inductive Learning Algorithms
- ID3 Algorithm – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Representation of Hypothesis (h)
- Disjunction (OR) of Conjunctions (AND) of Input Attributes (a.k.a. Decision Tree)
- Searching Strategy
- Greedy Search
- Simple to Complex (Hill Climbing)
- Training Regime
- Batch Method
- Inductive Bias
- Shorter Decision Trees are preferred over Longer Decision Trees
- Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not
- Strengths
- Can handle noisy Data
- Always finds the Target Function because
- ID3 Algorithm searches a complete Hypothesis Space
- Weaknesses
- Cannot handle very complex Numeric Representation
- Note
- State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations
- However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)
- To convert a Decision Tree (or Model) into If-Then Rules, follow these steps
- Step 1: Write down all the paths in the Decision Tree
- A path is a Conjunction (AND) of Attributes
- Step 2: Join the paths using Disjunction (OR)
- Decision Tree = Disjunction of Paths
- In general, Decision Tree Learning Algorithms are best for Machine Learning Problems where
- Instances / Examples are Represented as Attribute–Value Pair
- Target Function (f) is Discrete Valued
- Disjunctive Hypothesis may be Required
- Possibly Noisy / Incomplete Training Data
- ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question
- Which Attribute should be tested at the root of the tree?
- Generally, a statistical test is used to evaluate each Input Attribute to determine
- How well it alone classifies the Training Examples?
- Best Attribute is the one which alone best classifies the Training Examples
- Best Attribute is selected and used as the test at the root node of the tree
- A descendant of the root node is then created for each possible value of this Attribute, and
- Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example’s value for this Attribute)
- The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree
- This process continues for each new leaf node until either of two conditions is met
- Every Attribute has already been included along this path through the tree, or
- Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)
- ID3 Algorithm performs a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices
- Many measures are available for picking the best classifier Attribute
- e.g. Information Gain, Gain Ratio, etc.
- Information Gain is a useful measure for picking the best classifier Attribute
- Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute
- Information Gain measures
- how well a given Attribute separates Training Examples with respect to their Target Classification
- Information Gain is defined in terms of
- Entropy
- Entropy gives a measure of purity / impurity of a Set of Training Examples
- Value of Entropy lies in the range [0, 1]
- 0 means the Sample is Pure
- 1 means that Sample has maximum impurity
- Minimum Entropy
- Entropy is minimum (i.e. ZERO), when all the Training Examples fall into one single Class / Category
- Maximum Entropy
- Entropy is maximum (i.e. ONE), when half of the Training Examples are Positive and remaining half are Negative
- A limitation of Information Gain measure is that it favors attributes with many values over those with few values
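- As a quick illustration, below is a minimal Python sketch of Entropy and Information Gain for binary-labeled Training Examples (Entropy(S) = −Σ p_i log₂ p_i; the helper names are mine, not from the chapter):

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, in bits; 0 = pure sample,
    # 1 = half Positive / half Negative (maximum impurity, binary case).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    # Expected reduction in Entropy from partitioning the Training Examples
    # on one Attribute; `values` holds that Attribute's value per example.
    total = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

print(entropy(["+", "+", "-", "-"]))  # 1.0 (maximum impurity)
print(information_gain(["hot", "hot", "cold", "cold"], ["+", "+", "-", "-"]))  # 1.0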
- Hypothesis Space of ID3 Algorithm is complete space of finite, discrete-valued functions w.r.t available Attributes
- Hypothesis Space of Candidate Elimination Algorithm is incomplete space because
- It only contains Hypotheses with Conjunction relationship between Attributes
- ID3 Algorithm maintains only one Hypothesis (h / Decision Tree) at any time, instead of, e.g., all Hypotheses (or Decision Trees) consistent with Training Examples seen so far
- This means that we cannot
- determine how many alternative Hypotheses (Decision Trees) are consistent with Training Examples
- ID3 Algorithm incompletely searches a complete Hypothesis Space (H)
- Candidate Elimination Algorithm completely searches an incomplete Hypothesis Space (H)
- ID3 Algorithm performs no backtracking
- Once an Attribute is selected for testing at a given node, this choice is never reconsidered
- Therefore, ID3 Algorithm is susceptible to converging to a Locally Optimal Solution rather than a Globally Optimal Solution
- Batch Method for Training is more robust to errors in Training Examples compared to Incremental Method
- Preference Bias only affects the order in which Hypotheses are searched
- In Preference Bias, Hypothesis Space (H) will contain the Target Function
- Restriction Bias affects which Hypotheses are searched
- In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function
- Generally, better to choose Machine Learning Algorithm with Preference Bias rather than Restriction Bias
- Some Machine Learning Algorithms may combine Preference and Restriction Biases
- e.g. Checkers Learning Program
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Overfitting is a real problem for Decision Tree Learning
- For Decision Tree Learning, one empirical study showed that for a range of tasks there was a
- Decrease of 10% – 25% in Accuracy due to Overfitting
- The simple ID3 Algorithm can produce
- Decision Trees that overfit the Training Examples
- Two general approaches to avoid Overfitting in Decision Tree Learning are
- Stop growing Decision Tree before perfectly fitting Training Examples
- e.g. when Data Split is not statistically significant
- Grow full Decision Tree, then prune afterwards
- In practice
- Second approach has been more successful
- Most Common Approach to avoid Overfitting is
- Training and Validation Set Approach
- Using Training and Validation Set Approach, Sample Data is split into three sets
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Holding Data back for a Validation Set reduces Data available for Training
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
- Two approaches for Decision Tree Pruning to avoid Overfitting are
- Reduced Error Pruning Approach
- Rule Post-Pruning Approach
- Rule Post-Pruning Approach is better than Reduced Error Pruning Approach
Artificial Neural Networks
- Artificial Neural Networks (ANNs)
- Definition
- Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)
- Applications of ANNs in Natural Language Processing
- Text Classification
- Information Extraction
- Semantic Parsing
- Question Answering
- Paraphrase Detection
- Natural Language Generation
- Text Summarization
- Machine Translation
- Speech Recognition
- Character Recognition and many more 😉
- Applications of ANNs in Image Processing
- Face Detection
- Face Recognition
- Fake Image / Video Detection
- Object Detection in Image / Video
- Activity Recognition in Image / Video
- Natural language Description Generation from Image
- Captioning of Image / Video and many more 😉
- Artificial Neural Networks – Biological Motivation
- Biological Motivation
- Human Brain can classify / categorize Real-world Objects easily
- Human Brain is made up of Networks of Neurons
- Naturally occurring Neural Networks
- Each Neuron is connected to many others
- Input to one Neuron is the Output from many others
- Like Human Brain
- ANNs are Neural Networks which can categorize / classify Real-world Objects
- Don’t take the analogy too far 😊
- Human Brain has approximately 100,000,000,000 Neurons
- ANNs usually have < 1000 Neurons
- To conclude
- ANNs are a gross simplification of real Neural Networks
- Example 1 – Artificial Neural Networks
- Machine Learning Problem
- Squaring Integers
- Input
- An Integer Number (Numeric)
- Output
- An Integer Number (Numeric)
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D), we have
- Single Input
- Single Output
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h = x²
- where x is an Integer Number
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict Numeric values (Outputs)
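- As a minimal sketch of this example (assuming scikit-learn is available; the hyperparameters are arbitrary choices, not from the chapter), an off-the-shelf ANN regressor can approximately learn the squaring function:

from sklearn.neural_network import MLPRegressor

X = [[x] for x in range(-10, 11)]   # Input: an Integer Number
y = [x[0] ** 2 for x in X]          # Output: its square

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
model.fit(X, y)
print(model.predict([[4]]))         # should come out roughly close to 16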
- Example 2 – Artificial Neural Networks
- Machine Learning Problem
- Learn Relationship between Three Integers
- Input
- Three Integer Numbers (Numeric)
- Output
- An Integer Number (Numeric)
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D), we have
- Multiple Inputs
- Single Output
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h = [A, B, C] → A*C − B
- where A, B and C are Integer Numbers
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict Numeric values (Outputs)
- Also, calculations in this Machine Learning Problem are more complex than the previous one (Squaring Integers Machine Learning Problem)
- Example 3 – Artificial Neural Networks
- Machine Learning Problem
- Categorizing Vehicles
- Input
- An Image
- Output
- Category of Vehicle
- Possible Output Values
- Car
- Bus
- Tank
- Set of Training Examples (D)
- Consider the following Set of Training Examples (D)
- In Training Examples (D)
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Non-Symbolic Representation
- Question
- How to transform Categories into Numeric Representations?
- A Possible Answer
- Map each Category to
- A Number or
- A Range of Real Valued Numbers (e.g., 0.5 – 0.9)
- Considering Vehicle Categorization Problem
- Map each Category to a Number
- After Mapping, Output in Training Examples will be as follows
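- A minimal sketch of such a mapping (the particular integer codes are an assumption; any distinct numbers would work):

category_to_number = {"Car": 0, "Bus": 1, "Tank": 2}   # assumed codes
number_to_category = {v: k for k, v in category_to_number.items()}

outputs = ["Car", "Tank", "Bus", "Car"]
numeric_outputs = [category_to_number[c] for c in outputs]
print(numeric_outputs)         # [0, 2, 1, 0]
print(number_to_category[2])   # Tank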
- Note
- Alhamdulillah, the Output is transformed into Numeric Representation. In Sha Allah, in the next Slides I will try to explain
- How to transform an Image into Numeric Representation?
- Converting Input (Image) into Non-Symbolic Representation
- Question
- How to transform an Image into Numeric representations?
- A Possible Answer
- Use a Feature Extraction Method
- e.g. Extract four Pixel Values from each Image
- Note
- For simplicity, I have only taken four Pixels (Attributes / Features)
- Considering Vehicle Categorization Problem
- Extract four Pixel Values from each Image
- Range of Pixel Values is [0, 255]
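- A minimal sketch of this Feature Extraction step (the 2×2 image array is hypothetical; numpy is assumed):

import numpy as np

image = np.array([[12, 240],
                  [88, 199]], dtype=np.uint8)   # hypothetical 2x2 image

# Extract Pixel Values Left to Right, Top to Bottom into a feature vector.
features = image.flatten().tolist()
print(features)   # [12, 240, 88, 199]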
- After Converting Image into Pixel Values, Training Examples will be as follows
- Set of Training Examples (D)
- Job of Learner (Artificial Neural Network)
- Learn from Input (Numeric) to predict Output (Numeric)
- Output of Learner (Artificial Neural Network)
- Model / h
- Note
- You can see in this example that
- ANN has learned from Numeric values (Inputs) to predict
- Numeric values (Outputs)
- Also, calculations in this Machine Learning Problem are more complex than the previous two Machine Learning Problems: (1) Squaring Integers and (2) Learning Relationship between Three Integers
- Conclusion
- ANNs are very effective, since they have the capability to learn complex concepts like
- Vehicle Categorization
- Suitable Problems for ANNs
- Training Examples (both Input and Output) can be represented as
- real values (Numeric Representation)
- Hypothesis (h) can be represented as
- real values (Numeric Representation)
- Slow Training Times are OK
- ANNs can take hours and days to train Neural Networks
- Predictive Accuracy is more important than understanding
- What the ANN has learned
- Since ANNs have a Black Box Representation, it is very difficult to understand what they have learned
- Execution of Learned Function (Model / h) must be quick
- In Application Phase, Learned Neural Networks (Model / h) can categorize unseen examples very quickly
- Very useful in time critical situations (e.g. Is that a Car or Tank?)
- ANNs are fairly robust to noise in Training Data
TODO and Your Turn
TODO Task 1
- Task 1
- Consider the following Sample Data and answer the questions given below.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- What is the representation of Input and Output for the above Machine Learning Problem?
- Can you apply ANN on Sample Data? If not, what changes will you have to make in the Input and Output so that ANN can learn from Training Data? Note: use only four Features / Attributes to represent Input.
- Is the above Machine Learning Problem suitable for ANN?
- What potential challenges can we have when we handle above Machine Learning Problem using ANN?
Your Turn Task 1
- Task 1
- Select a Machine Learning Problem (similar to the one given in TODO Task) and answer the questions given below
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- What is the representation of Input and Output for the selected Machine Learning Problem?
- Can you apply ANN on Sample Data obtained for your selected Machine Learning Problem? If not, what changes will you have to make in the Input and Output so that ANN can learn from Training Data? Note: use only four Features / Attributes to represent Input.
- Is the selected Machine Learning Problem suitable for ANN?
- What potential challenges can we have when we handle selected Machine Learning Problem using ANN?
General Architecture - Artificial Neural Networks
- General Architecture - Artificial Neural Networks
- Three Main Layers
- Input Layer
- Hidden Layer
- Output Layer
- Each Layer contains one or more Units
- Input Units
- Units in the Input Layer are called Input Units
- Hidden Units
- Units in the Hidden Layer are called Hidden Units
- Output Units
- Units in the Output Layer are called Output Units
- Number of Units at each Layer
- Question 1
- How many Units should be there at Input Layer?
- Answer
- Number of Input Units = Number of Attribute / Feature Values
- Example – Categorizing Vehicles
- We extracted four Pixels (Attributes / Features)
- Number of Input Units = 4
- Question 2
- How many Units should be there at Output Layer?
- Answer
- Number of Output Units = Number of Classes
- Example – Categorizing Vehicles
- There are 3 Classes (Car, Bus and Tank)
- Number of Output Units = 3
- Question 3
- How many Units should be there at Hidden Layer?
- Answer
- There is no definite answer to this question
- Normally, people select, by trial and error, the
- Number of Hidden Layers and
- Number of Hidden Units
- SoftMax Layer - ANN
- Considering Vehicle Categorization Problem
- Output Units contain the Output generated by ANN
- Problem
- How can we interpret the vectors at Output Units to categorize Image as Car / Bus / Tank?
- A Possible Solution
- Use a Softmax Function
- Softmax Function
- Softmax Function takes the vector of Output-Unit values as input and converts it into probabilities such that the sum of probabilities over all Output Units is 1
- The Output Unit with the highest probability will be the predicted Class / Category of an instance / example
- Softmax Layer is the last Layer of ANN
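- A minimal sketch of the Softmax Function (the raw Output-Unit scores below are hypothetical):

import math

def softmax(scores):
    # Convert raw Output-Unit scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Car", "Bus", "Tank"]
probs = softmax([2.0, 1.0, 0.1])         # hypothetical Output-Unit values
print([round(p, 2) for p in probs])      # [0.66, 0.24, 0.1]
print(classes[probs.index(max(probs))])  # predicted Class: Car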
- Mathematical Functions – ANNs
- An ANN as a whole embeds a giant Mathematical Function, built from simple Unit-level Functions (e.g. ReLU, Sigmoid, etc.)
- All the Hidden Units and Output Units in an ANN have the
- same Mathematical Function
- Input to a Mathematical Function
- Weighted Sum of Inputs (S)
- Fully Connected ANN vs Partially Connected ANN
- Fully Connected ANN
- A Fully Connected Neural Network consists of a series of Fully Connected Layers
- Fully Connected Layer
- In a Fully Connected Layer, each Unit receives input from every Unit of the previous layer
- Partially Connected ANN
- A Partially Connected Neural Network consists of a series of Partially Connected Layers
- Partially Connected Layer
- In a Partially Connected Layer, each Unit does not receive input from every Unit of the previous layer
- Example – Fully Connected ANN
- Example – Partially Connected ANN
- Feed Forward Network
- Question
- Why ANN is called a Feed Forward Network?
- Answer
- A Simple ANN is also called a Feed Forward Network because values in the ANN move in a forward direction
- i.e. Input Layer == > Hidden Layer(s) == > Output Layer
- Note that calculations are performed at each Hidden and Output Unit
- Weights – ANN
- The edges between Input-Hidden Units, Hidden-Hidden Unit and Hidden-Output Unit contain the
- Weights
- Recall
- Hypothesis (h) Representation in ANNs
- Combination of Weights between Units
- To Train ANN
- Initially, Weights are randomly assigned
- Normally, small Weights are assigned, in the range [−0.5, +0.5]
- Hypothesis Space (H) in ANN
- Set of All Possible Combinations of Weights between Units
- Recall – Learning is a Searching Problem
- The main goal of the Learner (ANN) is to search the Hypothesis Space (H) to find a Hypothesis (h / Model) which best fits the Set of Training Examples
- Hypothesis Space (H) - ANNs
- Hypothesis Space (H) in ANN
- Set of All Possible Combinations of Weights between Units
- Note that Hypothesis Space (H) in ANN is very complex
- Question
- What will happen if I increase the Number of Hidden Layers in ANN?
- Answer
- Both complexity of Model (h) and computational cost will increase
- ANNs have Black Box Representation
- Model / h returned by Learner
- Best Combination of Weights between Units
- ANNs are said to have Black Box Representation because
- Useful knowledge about learned concept (or Model / h) is difficult to extract
- Importance – ANN Architecture
- The Architecture of ANN plays an important role on the
- performance and computational cost of ANN
- Important Parameters to consider in designing ANN Architecture
- Main Parameters
- No. of Input Units
- No. of Hidden Layers
- No. of Hidden Units at each Hidden Layer
- No. of Output Units
- Mathematical Function at each Hidden and Output Unit
- Weights between Units
- Whether the ANN will be Fully Connected or not
- Types of Artificial Neural Networks
- ANNs can be broadly categorized based on
- Number of Hidden Layers in an ANN Architecture
- Two Layer Neural Networks (a.k.a. Perceptron)
- Number of Hidden Layers = 0
- Multi-layer Neural Networks
- Regular Neural Network
- Number of Hidden Layers = 1
- Deep Neural Network
- Number of Hidden Layers > 1
- Chapter Focus
- Perceptron (Two Layer ANNs)
- Regular Neural Network (Multi-layer ANNs)
TODO and Your Turn
TODO Task 2
- Task 1
- Adeel wants to develop a Face Recognition System from Images. He wants to recognize the faces of five persons: Adeel, Sohail, Aqeel, Nabeel and Ghufran. Each Image is of 4×4 Pixels. To develop the Face Recognition System, a Regular Neural Network is used.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- How many Input Units will be there in Regular Neural Network?
- How many Hidden Units will be there in Regular Neural Network?
- How many Output Units will be there in Regular Neural Network?
- What will be the Main Parameters of Regular Neural Network?
- Draw architecture of Regular Neural Network for the above Machine Learning Problem?
Your Turn Task 2
- Task 1
- Select a Machine Learning Problem (similar to Face Recognition in TODO Task) and answer the questions given below.
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- How many Input Units will be there in Regular Neural Network?
- How many Hidden Units will be there in Regular Neural Network?
- How many Output Units will be there in Regular Neural Network?
- What will be the Main Parameters of Regular Neural Network?
- Draw architecture of Regular Neural Network for the selected Machine Learning Problem?
Perceptron - Two Layer Artificial Neural Networks
- Perceptron
- Definition
- A Perceptron is a simple Two Layer ANN, with
- One Input Layer
- Multiple Input Units
- One Output Layer
- Single Output Unit
- A Perceptron can be used for Binary Classification Problems
- A Sample Perceptron
- Strengths
- Perceptrons are useful to study because
- We can use Perceptrons to build larger Neural Networks
- Weaknesses
- Perceptron has limited learning abilities
- i.e. fails to learn simple Boolean-valued Functions (e.g. XOR)
- How Perceptron Works?
- A Perceptron works as follows
- Step 1: Random Weights are assigned to edges between Input-Output Units
- Step 2: Input is fed into Input Units
- Step 3: Weighted Sum of Inputs (S) is calculated
- Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit
- Step 5: Mathematical Function calculates Output for Perceptron
- Mathematical Functions - Perceptron
- Some of the Mathematical Functions that can be used in Perceptron are as follows
- Linear Function
- Simply output the Weighted Sum of Inputs (S)
- Step Function
- Output +1 if S > T, and −1 otherwise
- where S represents Weighted Sum of Inputs and
- T represents Threshold
- Sigmoid Function
- Similar to Step Function but differentiable
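- A minimal sketch of a Perceptron with a Step Function (the Weights are the initial Weights used in the Training Phase example later in this section; the Threshold is folded in as weight w0 of a constant-1 input, as explained below):

def perceptron_output(weights, inputs):
    # Weighted Sum of Inputs (S); inputs[0] is the constant 1 for the Threshold.
    s = sum(w * x for w, x in zip(weights, inputs))
    # Step Function: output +1 if S > 0, else -1.
    return +1 if s > 0 else -1

weights = [-0.5, 0.7, -0.2, 0.1, 0.9]       # [w0, w1, w2, w3, w4]
example = [1, -1, +1, +1, -1]               # constant 1, then four pixels
print(perceptron_output(weights, example))  # -1 (S = -2.2)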
- Example – Learning in Perceptron
- Machine Learning Problem
- Gender Identification from Image
- Input
- Photo / Image (Black and White) of a Human
- Output
- Gender of the Human
- Task
- Given a 2×2 Pixel Black and White Image of a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e. Learn from Input to predict Output
- Input
- 2×2 Black and White Image
- Output
- Class / Category 01 = Male
- Class / Category 02 = Female
- Categorization Rule
- If Image contains 2, 3 or 4 White Pixels then
- It is Female
- If Image contains 0 or 1 White Pixels then
- It is Male
- Perceptron Architecture
- Input Layer
- Four Input Units (one for each Pixel)
- Output Layer
- One Output Unit
- +1 for Female and
- -1 for Male
- Mathematical Function
- Step Function
- Perceptron Architecture
- Need to Learn two things
- Weights between Input and Output Units
- Value for the Threshold (T)
- Make calculations easier by
- Thinking of Threshold (T) as a Weight from a special Input Unit, whose
- Output from the Input Unit is always 1
- Exactly the same result, but we only have to learn
- Weights between Input and Output Units
- Updated Perceptron Architecture
- Input Layer
- Five Input Units
- Four Input Units (one for each Pixel)
- One Input Unit (for Threshold)
- Output Layer
- One Output Unit
- +1 for Female and
- -1 for Male
- Mathematical Function
- Step Function
- Updated Perceptron Architecture
Machine Learning Cycle – Perceptron
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in Real-world , to make prediction on Real-time unseen Data
- Feedback Phase
- Take Feedback form the Users and Domain Experts to improve the Model
- Sample Data
- Consider the Sample Data of five Black and White Images
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Numeric Representation
- Female = +1
- Male = -1
- Converting Input into Numeric Representation
- Consider the Sample Data with four Pixels for each Image
- Feature Extraction from Image Data
- Value of Black Pixel
- -1
- Value of White Pixel
- +1
- Note that Pixel values are extracted from
- Left to Right, Top to Bottom
- Sample Data Cont…
- Consider the Sample Data with four Pixels for each Image
- Sample Data – Vector Representation
E1 = <-1, +1, +1, -1> +
E2 = <-1, +1, -1, -1> –
E3 = <-1, +1, +1, +1> +
E4 = <+1, +1, +1, +1> +
E5 = < -1, -1, -1, -1> –
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Training Data – Vector Representation
E1 = <-1, +1, +1, -1> +
E2 = <-1, +1, -1, -1> –
E3 = <-1, +1, +1, +1> +
- Testing Data
- Testing Data – Vector Representation
E4 = <+1, +1, +1, +1> +
E5 = < -1, -1, -1, -1> –
Perceptron – Learning Algorithm
- Perceptron – Learning Algorithm
- Perceptron Training Rule
- Perceptron Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Perceptron Training Rule Works?
- When Target Output t(E) is different from Observed Output o(E)
- Add Δi on to Weight wi
- where Δi = η ( t(E) − o(E) ) xi
- Do this for every Weight in Perceptron (or ANN)
- Interpretation
- Considering the Gender Identification Problem
- (t(E) − o(E)) will either be +2 or −2 (when they differ, t(E) and o(E) cannot have the same sign)
- So we can think of the addition of Δi as the movement of Weights in a direction
- Which will improve the Perceptron (or ANN) performance with respect to E
- Multiplication by xi
- Moves it more if the Input is bigger
- Learning Rate
- η is called the Learning Rate
- Usually set to something small (e.g., 0.1)
- To control the movement of the Weights
- Not to move too far for one Training Example
- which may over-compensate for another Training Example
- If a large movement is actually necessary for the Weights to correctly categorise E
This will occur over time with multiple epochs
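- A minimal sketch of this rule as an update function (η = 0.1; the values match Epoch 01 of the Training Phase below):

def update_weights(weights, inputs, target, observed, eta=0.1):
    # w_i <- w_i + eta * (t(E) - o(E)) * x_i, for every weight.
    return [w + eta * (target - observed) * x
            for w, x in zip(weights, inputs)]

weights = [-0.5, 0.7, -0.2, 0.1, 0.9]
x1 = [1, -1, +1, +1, -1]                     # constant-1 input, then pixels
new_weights = update_weights(weights, x1, target=+1, observed=-1)
print([round(w, 1) for w in new_weights])    # [-0.3, 0.5, 0.0, 0.3, 0.7]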
Training Phase – Perceptron
- Training Phase
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Epoch 01
- Compute Weighted Sum of Inputs
- S = W0* 1 + W1*X1 + W2*X2 + W3*X3 + W4*X4
- S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1)
- S = -2.2
- Apply Step Function to S to get prediction from Perceptron i.e. o(x1)
- Output of Perceptron (or ANN)
- o(x1) = -1
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Perceptron Training Rule
- Calculating the Error Values
- Δ0 = η(t(E)-o(E)) x0
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ1 = η(t(E)-o(E)) x1
- = 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
- Δ2 = η(t(E)-o(E)) x2
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ3 = η(t(E)-o(E)) x3
- = 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2
- Δ4 = η(t(E)-o(E)) x4
- = 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2
- Calculating the New Weights
- w’0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
- w’1 = 0.7 + Δ1 = 0.7 + -0.2 = 0.5
- w’2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
- w’3= 0.1 + Δ3 = 0.1 + 0.2 = 0.3
- w’4 = 0.9 + Δ4 = 0.9 – 0.2 = 0.7
- Perceptron with New Weights
- Training Phase Cont…
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Epoch 02
- Compute Weighted Sum of Inputs
- S = -0.3 (1) + 0.5(-1) + 0*(1) + 0.3*(1) + 0.7(-1)
- S = -1.2
- Apply Step Function to S to get prediction from Perceptron i.e. o(x1)
- If (S > 0) Then
o(x1) = +1
else
o(x1) = -1
- Output of Perceptron (or ANN)
- o(x1) = -1
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Perceptron Training Rule
- First Training Example
- x1 = <-1, +1, +1, -1> +
- t(x1) = +1
- Still gets the wrong Classification / Categorization
- But the value is closer to ZERO (from -2.2 to -1.2)
- In a few epochs' time, Training Example x1 will be correctly Classified / Categorized
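- A minimal sketch of this loop over epochs (reusing the hypothetical perceptron_output and update_weights helpers from the earlier sketches):

training_data = [
    ([1, -1, +1, +1, -1], +1),   # E1
    ([1, -1, +1, -1, -1], -1),   # E2
    ([1, -1, +1, +1, +1], +1),   # E3
]
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]

for epoch in range(100):
    mistakes = 0
    for inputs, target in training_data:
        observed = perceptron_output(weights, inputs)
        if observed != target:
            weights = update_weights(weights, inputs, target, observed)
            mistakes += 1
    if mistakes == 0:   # every Training Example correctly Classified
        break
print(epoch, [round(w, 1) for w in weights])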
- Assumption
- I assume that we have trained the Perceptron on all 3 Training Examples and that the Model (or Hypothesis) learned is given below
- Summary - Training Phase of ANN Algorithm
- Recall
- Training Data
- Model
- Note that the Model is a Black Box Representation and it is very difficult to
- understand what the Model has learned
- In sha Allah, in the next Phase i.e. Testing Phase, we will
Evaluate the performance of the Model
Testing Phase – Perceptron
- Testing Phase
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total Number of Test instances
Note
- Accuracy = 1 – Error
- Evaluate Model (Perceptron) Cont…
First Test Example
- x4 = <+1, +1, +1, +1> +
- t(x4) = +1
- Evaluating Test Example x4
- Compute Weighted Sum of Inputs
- S = (-0.8 * 1) + (0.4 * 1) + (0.2 * 1) + (0.8 * 1) + (0.3 * 1)
- S = 0.9
- Apply Step Function to S to get prediction from Perceptron i.e. o(x4)
- Output of Perceptron (or ANN)
- o(x4) = +1
- Actual Value t(x4) is same as Predicted Value o(x4)
- i.e. Test instance is correctly Classified
- Second Test Example
- X5 = <-1, -1, -1, -1> –
- t(x5) = -1
- Evaluating Test Example x5
- Compute Weighted Sum of Inputs
- S = -0.8 (1) + 0.4(-1) + 0.2*(-1) + 0.8*(-1) + 0.3(-1)
- S = -2.5
- Apply Step Function to S to get prediction from Perceptron i.e. o(x5)
- If (S > 0) Then
- o(x5) = +1
- else
- o(x5) = -1
- Output of Perceptron (or ANN)
- o(x5) = -1
- Actual Value t(x5) is same as Predicted Value o(x5)
- i.e. Test instance is correctly Classified
- Summary - Evaluate Model (Perceptron)
Application Phase – Perceptron
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that the order of Attributes / Features must be exactly the same as that of Training and Testing Examples
Step 3: Apply Model on Feature Vector
- Step 4: Return Prediction to the User
- Male
- Note
You can take Input from user, apply Model and return predictions as many times as you like 😊
Feedback Phase – Perceptron
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
- Improve your Model based on Feedback 😊
- Strengths and Weaknesses – Perceptron
- Strengths
- Perceptrons can learn Binary Classification Problems
- For example, we learned the Gender Identification Problem using a simple Perceptron
- Perceptrons can be combined to make larger ANNs
- Perceptrons can learn linearly separable functions
- Weaknesses
- Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable (e.g. XOR)
- Examples – Limitations of Perceptron
- Perceptrons can learn simple Boolean-valued Functions which are linearly separable
- For example, AND Function, OR Function etc.
- Truth Table of AND Function
- Truth Table of OR Function
- Boolean-valued AND Function and OR Function
- Perceptrons cannot learn simple Boolean-valued Functions which are not linearly separable
- For example, XOR Function
- Truth table of XOR Function
- Boolean-valued XOR Function
TODO and Your Turn
TODO Task 3
- Task 1
- Consider the Sample Data of five Black and White Images of 2×2 Pixels. If an Image contains 2, 3 or 4 White Pixels then it will be categorized as Emotion, otherwise it will be categorized as Neutral (or No Emotion).
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for the above Machine Learning Problem?
- How many Input Units will be there in Perceptron?
- How many Hidden Units will be there in Perceptron?
- How many Output Units will be there in Perceptron?
- What will be the Main Parameters of Perceptron?
- Draw architecture of Perceptron for the above Machine Learning Problem?
- Task 2
- Execute Machine Learning Cycle for the above Machine Learning Problem?
Your Turn Task 3
- Task 1
- Select a Machine Learning Problem (similar to Emotion Prediction Problem given in TODO Task) and answer the questions given below.
- Questions
- Write Input and Output for the selected Machine Learning Problem?
- How many Input Units will be there in Perceptron?
- How many Hidden Units will be there in Perceptron?
- How many Output Units will be there in Perceptron?
- What will be the Main Parameters of Perceptron?
- Draw architecture of Perceptron for the selected Machine Learning Problem?
- Execute Machine Learning Cycle for the selected Machine Learning Problem?
Multi-layer Artificial Neural Networks
- Multi-layer ANN
- Definition
- A Multilayer Feed Forward Neural Network consists of an Input Layer, one or more Hidden Layers, and an Output Layer
- One Input Layer
- Multiple Input Units
- One or More Hidden Layers
- Multiple Hidden Units
- Output Layer
- Multiple Output Units
- A Multi-layer ANN can be used for both
- Binary Classification Problems and
- Multi-class Classification Problems
- Chapter Focus
- The focus of this Chapter is on Multi-layer Neural Networks with one Hidden Layer i.e. Regular Neural Network
- Regular Neural Network
- A Sample Regular Neural Network
- Strengths and Weaknesses - Regular Neural Network
- Strengths
- Can learn both linearly separable and non-linearly separable Target Functions
- Can handle noisy Data
- Can learn Machine Learning Problems with very complex Numerical Representations
- Weaknesses
- Computational Cost and Training Time are high
- How Regular Neural Network Works?
- A Multilayer ANN works as follows
- Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units
- Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)
- Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)
- Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)
- Step 5: Weighted Sum of the Hidden-Unit Outputs is calculated for each Output Unit and fed to the Output Units (making up the Output Layer)
- Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)
- Step 7: Softmax Layer converts the vectors at Output Units into probabilities and Class with highest probability is the Prediction of the Regular Neural Network
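- A minimal sketch of Steps 2–6 for a 4-input, 2-hidden, 2-output network with Sigmoid Units (all weights, biases and inputs below are hypothetical):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weights, biases):
    # One fully connected layer: per Unit, a Weighted Sum of Inputs plus a
    # bias, passed through the Sigmoid Function.
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

x = [0.2, 0.3, 0.1, 0.9]                     # hypothetical input vector
hidden = layer(x, [[0.5, -0.4, 0.3, 0.1],
                   [-0.2, 0.6, 0.7, -0.1]], [0.1, -0.3])
output = layer(hidden, [[0.4, -0.6], [0.2, 0.5]], [0.0, 0.1])
print(output)   # raw Output-Unit values, ready for the Softmax Layer (Step 7)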
- Mathematical Functions – Regular Neural Network
- Some of the popular and widely used Mathematical Functions in Regular Neural Networks are
- Sigmoid
- Formula of Sigmoid Function
- σ(S) = 1 / (1 + e^(−S))
- where ‘S’ is Weighted sum of Inputs
- Example – Learning in Regular Neural Network
- Machine Learning Problem
- Gender Identification from Image
- Input
- Photo / Image (Color) of a Human
- Output
- Gender of the Human
- Task
- Given RGB Color Image of a Human (Input), predict the Gender of the Human (Output)
- Treated as
- Learning Input-Output Function
- i.e. Learn from Input to predict Output
- Example – Learning in Multilayer ANN Cont…
- Input
- 2×2 RGB Image
- Output
- Class / Category 01 = Male
- Class / Category 02 = Female
- Categorization Rule
- If Image contains 2, 3 or 4 Red Pixels then
- It is Female
- If Image contains 0 or 1 Red Pixels, then
- It is Male
- Multilayer Architecture
- Input Layer
- Four Input Units (one for each Pixel)
- Hidden Layer
- Two Hidden Units that receive input (Weighted Sum of Inputs) from the Input Layer and send their Outputs to the Output Layer
- Output Layer
- Two Output Units
- O1 for Female
- O2 for Male
- Mathematical Function
- Sigmoid Function
- Multilayer ANN Architecture
- Need to Learn
Combination of Weights between Units which best fits the Training Data
Machine Learning Cycle – Regular Neural Network
- Machine Learning Cycle
- Four phases of a Machine Learning Cycle are
- Training Phase
- Build the Model using Training Data
- Testing Phase
- Evaluate the performance of Model using Testing Data
- Application Phase
- Deploy the Model in Real-world, to make prediction on Real-time unseen Data
- Feedback Phase
- Take Feedback form the Users and Domain Experts to improve the Model
- Sample Data
- Consider the Sample Data of five Color Images
- In Sample Data
- Input is Image (Non-numeric)
- Output is Categorical (Non-numeric)
- Problem
- ANNs can only understand Non-Symbolic Representations
- Solution
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Converting Output into Numeric Representation
- Female = +1
- Male = -1
- Converting Input into Numeric Representation
- Feature Extraction from Image Data
- Extract four color Pixels for each Image
- Feature Extraction from Image Data
- Value of Red Pixel
- 0 – 255
- Value of Green Pixel
- 0 – 255
- Value of Blue Pixel
- 0 – 255
- Note that Pixel values are extracted from
- Left to Right, Top to Bottom
- Sample Data Cont…
- Consider the Sample Data with Numeric Representation
- Sample Data – Vector Representation
E1 = < 200, 30, 70, 175 > +
E2 = < 140, 15, 84, 211 > –
E3 = < 25, 78, 158, 125 > +
E4 = < 36, 146, 243, 64 > +
E5 = < 198, 31, 214, 34 > –
- Split the Sample Data
- We split the Sample Data using Random Split Approach into
- Training Data – 2 / 3 of Sample Data
- Testing Data – 1 / 3 of Sample Data
- Training Data
- Training Data – Vector Representation
E1 = < 200, 30, 70, 175 > +
E2 = < 140, 15, 84, 211 > –
E3 = < 25, 78, 158, 125 > +
- Testing Data
- Testing Data – Vector Representation
E4 = < 36, 146, 243, 64 > +
E5 = < 198, 31, 214, 34 > –
Multilayer ANN – Learning Algorithm
- Learning Algorithm – Input / Output
- Input
- Set of Training Example (D)
- Learning Rate (l)
- A Regular Neural Network (N)
- Output
- A Trained Neural Network (Model / h)
- Algorithm
- Regular Neural Network learns the Gender Identification Problem using the Backpropagation Algorithm
- Learning and Backpropagation Algorithm
- Note
- The pseudo code below is taken from the following Book
- Data Mining, Third Edition (Page: 398)
- In the given pseudo code
b refers to Bias
- Biases
- Biases are values associated with each Unit in the Hidden Layer and Output Layer of a Regular Neural Network, but in practice they are treated in exactly the same manner as other Weights
- Regular Neural Network Training Rule
- Regular Neural Network Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Regular Neural Network Training Rule Works?
- Step 1: For a given Training Example, set the Target Output Value of Output Unit (which is mapped to the Class of given Training Example) as 1 and
- Set the Target Output Values of remaining Output Units as 0
- Example – Step 1
- Consider our Regular Neural Network for Gender Identification Problem
- If Training Example E is Positive (Female), then
- Set Target Output Value of O1 to 1 and
- Set Target Output Value of O2 to 0
- If Training Example E is Negative (Male), then
- Set Target Output Value of O1 to 0 and
- Set Target Output Value of O2 to 1
- Step 2: Train the Regular Neural Network using Training Example E and get Predictions (Observed Output o(E)) from the Regular Neural Network
- Step 3: If Target Output t(E) is different from Observed Output o(E)
- Calculate the Network Error
- Back propagate the Error by updating weights between Output-Hidden Units and Hidden-Input Units
- Step 4: Keep training Regular Neural Network, until Network Error becomes very small
- Learning Rate
- η is called the Learning Rate
- Usually set to something small (e.g., 0.1)
- To control the movement of the Weights
- Not to move too far for one Training Example
- which may over-compensate for another Training Example
- If a large movement is actually necessary for the Weights to correctly categorize E
- This will occur over time with multiple epochs
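- A minimal sketch of one Backpropagation update for a single Training Example, using the same formulas as the worked example below: Err_j = O_j(1 − O_j)(T_j − O_j) for Output Units, Err_j = O_j(1 − O_j) Σ_k Err_k w_jk for Hidden Units, w_ij += l · Err_j · O_i and b_j += l · Err_j (the 4-2-2 network shape is hypothetical):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, target, w_ih, b_h, w_ho, b_o, l=0.9):
    # Forward pass through Hidden and Output Layers (sigmoid units).
    o_h = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
           for ws, b in zip(w_ih, b_h)]
    o_o = [sigmoid(b + sum(w * oh for w, oh in zip(ws, o_h)))
           for ws, b in zip(w_ho, b_o)]
    # Errors of Output Units: Err_k = O_k (1 - O_k) (T_k - O_k).
    err_o = [o * (1 - o) * (t - o) for o, t in zip(o_o, target)]
    # Errors of Hidden Units: Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk.
    err_h = [oh * (1 - oh) * sum(err_o[k] * w_ho[k][j]
                                 for k in range(len(err_o)))
             for j, oh in enumerate(o_h)]
    # Updates: w_ij += l * Err_j * O_i ; b_j += l * Err_j.
    w_ho = [[w + l * err_o[k] * o_h[j] for j, w in enumerate(ws)]
            for k, ws in enumerate(w_ho)]
    b_o = [b + l * e for b, e in zip(b_o, err_o)]
    w_ih = [[w + l * err_h[j] * x[i] for i, w in enumerate(ws)]
            for j, ws in enumerate(w_ih)]
    b_h = [b + l * e for b, e in zip(b_h, err_h)]
    return w_ih, b_h, w_ho, b_o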
Training Phase – Regular Neural Network
- Training Phase
- First Training Example
- x1= < 200, 30, 70, 175 > +
- t(x1) = +1
- Epoch 01
- Compute Weighted Sum of Inputs (S) as Input to Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-0.2) + (0.2)*(200) + (-0.1)*(30) + (0.4)*(70) + (0.3)*(175)
- IH1 = 117.3
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.9) + (0.7)*(200) + (-0.4)*(30) + (0.8)*(70) + (0.1)*(175)
- IH2 = 202.4
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- Using O_j = 1 / (1 + e^(-I_j))
- OH1 = 0.4670
- OH2 = 0.4433
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) as Input to the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-0.4) + (-0.3)*(0.4670) + (0.9)*(0.4433)
- IO1 = -0.1411
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.7) + (0.6)*(0.4670) + (-0.5)*(0.4433)
- IO2 = 0.7585
- Apply Sigmoid Function to calculate the output from the ANN
- Using O_j = 1 / (1 + e^(-I_j))
- OO1 = 1 / (1 + e^(0.1411)) = 0.4647
- OO2 = 1 / (1 + e^(-0.7585)) = 0.6810
- So, the Multilayer ANN predicts the category associated with O2
- Output of Multilayer ANN
- o(x1) = -1 (Male)
- Actual Value t(x1) is different from Predicted Value o(x1)
- Tweak Weights using Backpropagation Algorithm / Method
- Backpropagate the Errors
- Compute the Error of each Unit in the Output Layer
- Using Err_j = O_j * (1 − O_j) * (T_j − O_j)
- Error of Output Unit O1
- Err_O1 = 0.4647 * (1 − 0.4647) * (1 − 0.4647) = 0.1331
- Error of Output Unit O2
- Err_O2 = 0.6810 * (1 − 0.6810) * (−1 − 0.6810) = −0.3651
- Compute the Error of each Unit in the Hidden Layer
- Using Err_j = O_j * (1 − O_j) * Σ_k Err_k * w_jk
- Error of Hidden Unit H1
- Err_H1 = 0.4670 * (1 − 0.4670) * (0.1331 * (−0.3) + (−0.3651) * 0.6) ≈ −0.064
- Error of Hidden Unit H2
- Err_H2 = 0.4433 * (1 − 0.4433) * (0.1331 * 0.9 + (−0.3651) * (−0.5)) ≈ 0.075
- Update the Weights and Biases
- Updated Weights between Hidden and Output Units
- Using w_ij = w_ij + Δw_ij, where Δw_ij = l * Err_j * O_i (here l = 0.9)
- w'11 = −0.3 + 0.9 * 0.1331 * 0.4670 = −0.24
- w'12 = 0.6 + 0.9 * (−0.3651) * 0.4670 = 0.45
- w'21 = 0.6 + 0.9 * 0.1331 * 0.4433 = 0.65
- w'22 = −0.5 + 0.9 * (−0.3651) * 0.4433 = −0.65
- Updated Biases of Output Units
- Using b_j = b_j + Δb_j, where Δb_j = l * Err_j
- b'O1 = −0.4 + 0.9 * 0.1331 = −0.28
- b'O2 = 0.7 + 0.9 * (−0.3651) = 0.37
- Updated Weights between Input and Hidden Units
- Using w_ij = w_ij + l * Err_j * X_i
- w'11 = 0.2 + 0.9 * (−0.064) * 200 = −11.32
- Calculate the Updated Weights of all the remaining Units in the same way
- Updated Biases of Hidden Units
- b'H1 = −0.2 + 0.9 * (−0.064) = −0.26
- b'H2 = 0.9 + 0.9 * 0.075 = 0.97
- Assumption
- I assume that we have trained the Regular Neural Network on all 3 Training Examples and that the Model (or hypothesis) learned is given below
- Summary - Training Phase of Regular Neural Network Algorithm
- Recall
- Data = Model + Error
- Training Data
- Model
- Note that the Model is a Black Box Representation and it is very difficult to
- understand what the Model has learned
- In sha Allah, in the next Phase i.e. Testing Phase, we will
Evaluate the performance of the Model
Testing Phase – Multilayer ANN
- Testing Phase
- Question
- How well has the Model learned?
- Answer
- Evaluate the performance of the Model on unseen data (or Testing Data)
- Evaluation Measures
- Evaluation will be carried out using
- Error measure
- Error
- Definition
- Error is defined as the proportion of incorrectly classified Test instances
- Formula
- Error = Number of incorrectly classified Test instances / Total Number of Test instances
- Note
- Accuracy = 1 – Error
- Evaluate Model (Regular Neural Network) Cont…
- First Test Example
- x4 = < 36, 146, 243, 64 > +
- t(x4) = +1 (Female)
- Evaluating Test Example x4
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-1.5) + 0.3*(36) + (-0.2)*(146) + 0.5*(243) + 1.2*(64)
- IH1 = 178.4
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.4) + 0.6*(36) + (-1.3)*(146) + 1.3*(243) + 3.2*(64)
- IH2 = 352.9
- Apply Sigmoid Function O_j = 1 / (1 + e^(-I_j)) to calculate the output from the Hidden Layer
- OH1 = 0.2311
- OH2 = 0.1547
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) into the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.2311) + 0.23*(0.1547)
- IO1 = -0.9181
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.2311) + 1.14*(0.1547)
- IO2 = 0.7364
- Apply Sigmoid Function to calculate the output from the ANN
- OO1 = 1 / (1 + e^(0.9181)) = 0.2853
- OO2 = 1 / (1 + e^(-0.7364)) = 0.6762
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x4) = -1 (Male)
- Actual Value t(x4) is different from Predicted Value o(x4)
- i.e. Test Example is incorrectly Classified
- Second Test Example
- x5 = < 198, 31, 214, 34 > –
- t(x5) = -1 (Male)
- Evaluating Test Example x5
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_{i=1..n} w_ij * O_i, where O_i = X_i for Input Units
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = (-1.5) + 0.3*(198) + (-0.2)*(31) + 0.5*(214) + 1.2*(34)
- IH1 = 199.5
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = (0.4) + 0.6*(198) + (-1.3)*(31) + 1.3*(214) + 3.2*(34)
- IH2 = 465.9
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- Using O_j = 1 / (1 + e^(-I_j))
- OH1 = 0.3047
- OH2 = 0.1787
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) to the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.3047) + 0.23*(0.1787)
- IO1 = -0.8979
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.3047) + 1.14*(0.1787)
- IO2 = 0.7829
- Apply Sigmoid Function to calculate the output from the ANN
- Using O_j = 1 / (1 + e^(-I_j))
- OO1 = 1 / (1 + e^(0.8979)) = 1 / (1 + 2.4544) = 0.2894
- OO2 = 1 / (1 + e^(-0.7829)) = 1 / (1 + 0.4570) = 0.6863
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x5) = -1 (Male)
- Actual Value t(x5) is same as Predicted Value o(x5)
- Test Example is correctly Classified
- Summary - Evaluate Model (Multilayer ANN)
- Apply Model on Test Data
Application Phase – Regular Neural Network
- Application Phase
- We assume that our Model
- performed well on large Test Data and can be deployed in Real-world
- Model is deployed in the Real-world and now we can make
- predictions on Real-time Data
- Steps – Making Predictions on Real-time Data
- Step 1: Take Input from User
- Step 2: Convert User Input into Feature Vector
- Exactly the same as Feature Vectors of Training and Testing Data
- Step 3: Apply Model on the Feature Vector
- Step 4: Return Prediction to the User
- Making Predictions on Real-time Data
- Step 1: Take input from User
- Step 2: Convert User Input into Feature Vector
- Note that the order of Attributes / Features must be exactly the same as that of Training and Testing Examples
- Step 3: Apply Model on Feature Vector
- Unseen Example
- x = < 235, 64, 159, 41>
- Make Prediction for Unseen Example x
- Compute Weighted Sum of Inputs (S) to the Hidden Layer
- Using I_j = b_j + Σ_i w_ij * X_i
- IH1 = bH1 + W11*X1 + W21*X2 + W31*X3 + W41*X4
- IH1 = -1.5 + 0.3*(235) + (-0.2)*(64) + 0.5*(159) + 1.2*(41)
- IH1 = 184.9
- IH2 = bH2 + W12*X1 + W22*X2 + W32*X3 + W42*X4
- IH2 = 0.4 + 0.6*(235) + (-1.3)*(64) + 1.3*(159) + 3.2*(41)
- IH2 = 396.1
- Apply Sigmoid Function to calculate the output from the Hidden Layer
- OH1 = 0.1666
- OH2 = 0.0955
- So, H1 has fired, H2 has not
- Compute Weighted Sum of Inputs (S) into the Output Layer
- IO1 = bO1 + W11*OH1 + W21*OH2
- IO1 = (-1) + 0.2*(0.1666) + 0.23*(0.0955)
- IO1 = -0.9447
- IO2 = bO2 + W12*OH1 + W22*OH2
- IO2 = (0.5) + 0.26*(0.1666) + 1.14*(0.0955)
- IO2 = 0.6521
- Apply Sigmoid Function to calculate the output from the ANN
- OO1 = 1 / (1 + e^(0.9447)) = 0.2800
- OO2 = 1 / (1 + e^(-0.6521)) = 0.6575
- So, the Regular Neural Network predicts the category associated with O2
- Output of Regular Neural Network
- o(x) = -1 (Male)
- Step 4: Return Prediction to the User
- Male
- Note
- You can take Input from user, apply Model and return predictions as many times as you like 😊
- Feedback Phase
- Only Allah is Perfect 😊
- Take Feedback on your deployed Model from
- Domain Experts and
- Users
Improve your Model based on Feedback 😊
Overfitting – Artificial Neural Networks
- Overfitting
- Definition
- Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances
- What Causes Overfitting?
- Noise in Training Examples or
- Number of Training Examples is too small to produce a representative sample of Target Function
- Why Overfitting is a Serious Problem?
- Overfitting is a serious problem for many Machine Learning Algorithms
- For example
- Decision Tree Learning
- Regular Neural Networks
- Deep Learning Algorithm etc.
- Example – Overfitting in ANN
- Plotting Training Example Error versus Test Example Error shows that
- Test Set Error is increasing
- ANN is Overfitting the Training Data
- Problems with Local Minima
- Backpropagation Algorithm is Gradient Descent Search
- Where the height of the hills is determined by Error
- But there are many dimensions to Search Space
- One for each Weight in ANN
- Therefore, Backpropagation Algorithm
- Can find its ways into Local Minima
- Possible Solutions
- There can be many Possible Solutions to overcome the problem of Overfitting in ANN
- I am presenting below only four possible solutions
- Possible Solution 01
- Training and Validation Set Approach
- Possible Solution 02
- Learn Multiple ANNs
- Possible Solution 03
- Momentum
- Possible Solution 04
- Weight Decay Factor
- Avoiding Overfitting - Training and Validation Set Approach
- Using this approach, Sample Data is split into three sets
- Training Set
- Validation Set
- Testing Set
- Strengths
- Validation Set helps to check whether the Model is Overfitting or not during the Training
- Weaknesses
- Holding data back for a Validation Set reduces data available for Training
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Training Set
- is used to build the Model
- Validation Set
- is used to check whether the Model is Overfitting or not during Training
- Testing Set
- is used to evaluate the performance of the Model
- Note
- Training Set and Validation Set are used in the
- Training Phase
- Testing Set is used in the
- Testing Phase
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Question 1
- How to split Sample Data using Training and Validation Set Approach when we have very large / huge amount of data?
- Answer 1
- A Good Split may be
- Training Set = 80%
- Validation Set = 10%
- Testing Set = 10%
- Note: to efficiently train Deep Learning Algorithms, we need a huge amount of Training Data
- Question 2
- How to split Sample Data using Training and Validation Set Approach when we have sufficiently large amount of data?
- Answer 2
- A Good Split may be
- Training Set = 80%
- Testing Set = 20%
- Validation Set = 10% of Training Set
- Example 1 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Text Summarization
- Deep learning Algorithm
- LSTM
- Total Sample Data
- 100,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 80,000
- Validation Set = 10% = 10,000
- Testing Set = 10% = 10,000
- Example 2 – Splitting Data using Training and Validation Set Approach
- Machine Learning Problem
- Sentiment Analysis
- Machine Learning Algorithm
- Random Forest
- Total Sample Data
- 10,000 instances
- Splitting Data using the following Split Ratio
- Training Set = 80% = 8,000
- Testing Set = 20% = 2,000
- Validation Set = 10% of Training Set = 800 (leaving 7,200 instances for actual Training)
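- A minimal sketch of the split in Example 2 using scikit-learn's train_test_split (an assumption; any random-split utility would do):

from sklearn.model_selection import train_test_split

X, y = list(range(10000)), [i % 2 for i in range(10000)]   # dummy data

# 80% Training, 20% Testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Hold out 10% of the Training Set as a Validation Set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.10, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 7200 800 2000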
- Avoiding Overfitting - Training and Validation Set Approach Cont…
- Question
- How do I know that Model is Overfitting during Training?
- Answer
- Model is Not Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting
- Model is Overfitting
- During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting
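- A minimal sketch of this check as an early-stopping loop (train_one_epoch and validation_accuracy are hypothetical callables supplied by the caller):

def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              max_epochs=100, patience=3):
    # Stop Training once Validation Accuracy stops improving, i.e. once
    # the Model has started Overfitting.
    best, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc > best:
            best, bad_epochs = acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best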
- Avoiding Overfitting – Learn Multiple ANNs
- Learn multiple ANNs
- Starting with different random Weight settings
- To make Predictions on Unseen Examples
- Choice No. 1
- Use the best ANN
- Choice No. 2
- Use a Voting Classifier comprising multiple ANNs
- Avoiding Overfitting – Adding Momentum
- Imagine rolling a ball down a hill
- Momentum in Backpropagation Algorithm
- For each Weight
- Remember what was added in the previous Epoch
- In the current epoch
- Add on a small amount of the previous Δ
- The amount is determined by
- Momentum Parameter (denoted by α)
- α is taken to be between 0 and 1
- Caution:
- May not have enough Momentum to
- get out of Local Minima
- Also, too much Momentum might carry the search
- back out of a Global Minimum, into a Local Minimum
- Avoiding Overfitting – Use a Weight Decay Factor
- Using Weight Decay Factor
- Take a small amount off every Weight after each Epoch
- Note that ANNs with smaller Weights aren’t as highly fine-tuned (Overfit)
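- A minimal sketch combining both tweaks (Momentum and the Weight Decay Factor) in a single weight update (the gradient step, α = 0.9 and the decay factor 0.001 are hypothetical values):

def update_weight(w, gradient_step, prev_delta, alpha=0.9, decay=0.001):
    # Momentum: add a small amount (alpha) of the previous epoch's delta.
    delta = gradient_step + alpha * prev_delta
    # Weight Decay Factor: take a small amount off the weight each epoch.
    w = (w + delta) * (1.0 - decay)
    return w, delta

w, prev_delta = 0.7, 0.0
for epoch in range(3):
    w, prev_delta = update_weight(w, gradient_step=0.05, prev_delta=prev_delta)
    print(round(w, 4))   # the weight grows with momentum, minus a small decay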
- Strengths and Weaknesses – ANNs
- Strengths
- Can learn problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Requires a lot of Training Time (particularly Deep Learning Models)
- Computational cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. take a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
TODO and Your Turn
TODO Task 4
- Task 1
- Consider the Sample Data of five Color Images of 2×2 Pixels each. If an Image contains 2, 3, or 4 White Pixels, then it will be categorized as Emotion; otherwise, it will be categorized as Neutral (or No Emotion).
- Note
- Your answer should be
- Well Justified
- Questions
- What are the Input and Output for the above Machine Learning Problem?
- How many Input Units will there be in the Regular Neural Network?
- How many Hidden Units will there be in the Regular Neural Network?
- How many Output Units will there be in the Regular Neural Network?
- What will be the Main Parameters of the Regular Neural Network?
- Draw the architecture of the Regular Neural Network for the above Machine Learning Problem.
- Execute the Machine Learning Cycle for the above Machine Learning Problem.
Your Turn Task 4
- Task 1
- Select a Machine Learning Problem (similar to the Emotion Prediction Problem given in the TODO Task) and answer the questions given below.
- Questions
- What are the Input and Output for the selected Machine Learning Problem?
- How many Input Units will there be in the Regular Neural Network?
- How many Hidden Units will there be in the Regular Neural Network?
- How many Output Units will there be in the Regular Neural Network?
- What will be the Main Parameters of the Regular Neural Network?
- Draw the architecture of the Regular Neural Network for the selected Machine Learning Problem.
- Execute the Machine Learning Cycle for the selected Machine Learning Problem.
Chapter Summary
- Chapter Summary
- Following Machine Learning Algorithms are based on Symbolic Representation
- FIND-S Algorithm
- List Then Eliminate Algorithm
- Candidate Elimination Algorithm
- ID3 Algorithm
- Symbolic Representation – Representing Training Examples
- Attribute-Value Pair
- Input
- Categorical
- Output
- Categorical
- Symbolic Representation – Representing Hypothesis (h)
- Two Types of Hypothesis (h) Representations
- Conjunction (AND) of Constraints on Input Attributes
- Disjunction (OR) of Conjunction (AND) of Input Attributes
- Note that both types of Hypothesis (h) Representations are Symbolic
- i.e. based on Symbols (Categorical Values)
- Problem – Symbolic Representation
- Cannot handle Machine Learning Problems with very complex Numeric Representations
- Solution
- Non-Symbolic Representations
- for e.g. Artificial Neural Networks
- Artificial Neural Networks – Summary
- Representation of Training Examples (D)
- Attribute-Value Pair
- Input
- Numeric
- Output
- Numeric
- Representation of Hypothesis (h)
- Combination of Weights between Units
- Weights are Numeric values
- Searching Strategy
- Exhaustive Search
- Training Regime
- Batch Method
- Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms which learn from Numeric Input to predict Numeric Output
- ANNs are suitable for those Machine Learning Problems in which
- Training Examples (both Input and Output) can be represented as real values (Numeric Representation)
- Hypothesis (h) can be represented as real values (Numeric Representation)
- Slow Training Times are OK
- Predictive Accuracy is more important than understanding
- When we have noise in Training Data
- The General Architecture of Artificial Neural Networks mainly has three Layers
- Input Layer
- Hidden Layer
- Output Layer
- Each Layer contains one or more Units
- Main Parameters of ANNs are
- No. of Input Units
- No. of Hidden Layers
- No. of Hidden Units at each Hidden Layer
- No. of Output Units
- Mathematical Function at each Hidden and Output Unit
- Weights between Units
- ANN will be Fully Connected or Not?
- Learning Rate
- ANNs can be broadly categorized based on
- Number of Hidden Layers in an ANN Architecture
- Two Layer Neural Networks (a.k.a. Perceptron)
- Number of Hidden Layers = 0
- Multi-layer Neural Networks
- Regular Neural Network
- Number of Hidden Layers = 1
- Deep Neural Network
- Number of Hidden Layers > 1
- A Perceptron is a simple Two Layer ANN, with
- One Input Layer
- Multiple Input Units
- One Output Layer
- Single Output Unit
- A Perceptron can be used for Binary Classification Problems
- We can use Perceptrons to build larger Neural Networks
- A Perceptron has limited learning abilities, i.e. it fails to learn simple Boolean-valued Functions which are not linearly separable (for e.g. XOR)
- A Perceptron works as follows
- Step 1: Random Weights are assigned to edges between Input-Output Units
- Step 2: Input is fed into Input Units
- Step 3: Weighted Sum of Inputs (S) is calculated
- Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit
- Step 5: Mathematical Function calculates Output for Perceptron
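- A minimal Python sketch of Steps 1-5 for a Perceptron with three Input Units, assuming a step function as the Mathematical Function at the Output Unit; all values are illustrative
```python
# Forward pass of a Perceptron: random Weights, Weighted Sum, step function.
import numpy as np

weights = np.random.uniform(-0.5, 0.5, 3)   # Step 1: random Weights on the edges
x = np.array([1.0, 0.0, 1.0])               # Step 2: Input fed into Input Units
S = np.dot(weights, x)                      # Step 3: Weighted Sum of Inputs (S)
output = 1 if S > 0 else 0                  # Steps 4-5: step function at the Output Unit
print(output)
```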
- Before using ANN to build a Model
- Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)
- Perceptron – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Exhaustive Search
- Training Regime
- Incremental Method
- Strengths
- Perceptrons can learn Binary Classification Problems
- Perceptrons can be combined to make larger ANNs
- Perceptrons can learn linearly separable functions
- Weaknesses
- Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable functions
- A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer
- A Regular Neural Network can be used for both
- Binary Classification Problems and
- Multi-class Classification Problems
- Regular Neural Network – Summary
- Representation of Training Examples (D)
- Numeric
- Representation of Hypothesis (h)
- Numeric (Combination of Weights between Units)
- Searching Strategy
- Exhaustive Search
- Training Regime
- Incremental Method
- Strengths
- Can learn both linearly separable and non-linearly separable Target Functions
- Can handle noisy Data
- Can learn Machine Learning Problems with very complex Numerical Representations
- Weaknesses
- Computational Cost and Training Time are high
- A Regular Neural Network works as follows
- Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units
- Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)
- Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)
- Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)
- Step 5: Weighted Sum of Inputs (S) is calculated from the Hidden Unit outputs and fed to the Output Units (making up the Output Layer)
- Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)
- Step 7: Softmax Layer converts the values at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network (see the sketch below)
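- A minimal NumPy sketch of Steps 1-7 for a Regular Neural Network with 3 Input Units, 4 Hidden Units, and 2 Output Units, assuming sigmoid as the Mathematical Function at the Hidden and Output Units; all sizes and values are illustrative
```python
# Forward pass of a Regular Neural Network: Input -> Hidden -> Output -> Softmax.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W_ih = rng.uniform(-0.5, 0.5, (4, 3))   # Step 1: random Weights, Input->Hidden
W_ho = rng.uniform(-0.5, 0.5, (2, 4))   # Step 1: random Weights, Hidden->Output

x = np.array([0.2, 0.7, 0.1])           # Step 2: Inputs fed into the Input Layer
h = sigmoid(W_ih @ x)                   # Steps 3-4: Weighted Sum + function at Hidden Units
o = sigmoid(W_ho @ h)                   # Steps 5-6: Weighted Sum + function at Output Units

probs = np.exp(o) / np.exp(o).sum()     # Step 7: Softmax Layer -> probabilities
prediction = int(np.argmax(probs))      # Class with the highest probability
print(probs, prediction)
```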
- Regular Neural Network Training Rule is used to tweak Weights, when
- Actual Value is different from Predicted Value
- How Regular Neural Network Training Rule Works?
- Step 1: For a given Training Example, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) to 1 and
- Set the Target Output Values of the remaining Output Units to 0
- Step 2: Train the Regular Neural Network using Training Example E and get the Prediction (Observed Output o(E)) from the Regular Neural Network
- Step 3: If Target Output t(E) is different from Observed Output o(E)
- Calculate the Network Error
- Backpropagate the Error by updating Weights between Output-Hidden Units and Hidden-Input Units
- Step 4: Keep training the Regular Neural Network until the Network Error becomes very small
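- A minimal sketch of Steps 1-4, continuing the network from the previous sketch; the update shown is ordinary Backpropagation (Gradient Descent) for sigmoid Units, and all values are illustrative
```python
# Train on a single Training Example until the Network Error becomes very small.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W_ih = rng.uniform(-0.5, 0.5, (4, 3))
W_ho = rng.uniform(-0.5, 0.5, (2, 4))
x = np.array([0.2, 0.7, 0.1])
t = np.array([1.0, 0.0])                 # Step 1: one-hot Target Output t(E)
eta = 0.5                                # Learning Rate

for epoch in range(1000):                # Step 4: keep training ...
    h = sigmoid(W_ih @ x)                # Step 2: get Observed Output o(E)
    o = sigmoid(W_ho @ h)
    error = t - o                        # Step 3: Network Error
    if np.abs(error).max() < 0.01:       # ... until the Error becomes very small
        break
    delta_o = error * o * (1 - o)                 # Backpropagate: Output-Hidden deltas
    delta_h = (W_ho.T @ delta_o) * h * (1 - h)    # Hidden-Input deltas
    W_ho += eta * np.outer(delta_o, h)   # update Output-Hidden Weights
    W_ih += eta * np.outer(delta_h, x)   # update Hidden-Input Weights
```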
- Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)
- Four Possible Solutions to overcome Overfitting in ANNs are as follows
- Training and Validation Set Approach
- Learn Multiple ANNs
- Momentum
- Weight Decay Factor
- ANNs – Strengths and Weaknesses
- Strengths
- Can learn problems with very complex Numerical Representations
- Can handle noisy Data
- Execution time in Application Phase is fast
- Good for Machine Learning Problems in which
- Both Training Examples and Hypothesis (h) have Numeric Representation
- Weaknesses
- Requires a lot of Training Time (particularly Deep Learning Models)
- Computational cost for ANNs (particularly Deep Learning Models) is high
- Overfitting is a serious problem in ANNs
- ANNs either reject or accept a Hypothesis (h) during Training i.e. take a Binary Decision
- Accept a Hypothesis (h), if it is consistent with the Training Example
- Reject a Hypothesis (h), if it is not consistent with the Training Example
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present
- Bayesian Learning