**Quick Recap**

**Artificial Neural Networks**

**General Architecture – Artificial Neural Networks**

**Perceptron – Two Layer Artificial Neural Networks**

**Machine Learning Cycle – Perceptron**

**Multi-layer Artificial Neural Networks**

**Overfitting – Artificial Neural Networks**

**Chapter Summary**

## Quick Recap

- Quick Recap – Decision Tree Learning

**Main Problems – Candidate Elimination Algorithm**

**Cannot handle noisy Data**

**Cannot handle Numeric Data**

**Cannot work for complex representations of Data**

**It may fail to find the Target Function in the Hypothesis Space**

**i.e. converge to an empty Version Space**

**Proposed Solution**

**Decision Tree Learning Algorithms**

**Decision Tree Learning is a method for approximating Target Functions / Concepts, in which the learned function (or Model) is represented by a Decision Tree**

**Learned Decision Trees (or Models) can also be re-represented as**

**Sets of If-Then Rules (to improve human readability)**

**Decision Tree Learning is among the most popular Inductive Learning Algorithms**

**ID3 Algorithm – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Representation of Hypothesis (h)**

**Disjunction (OR) of Conjunction (AND) of Input Attribute (a.k.a. Decision Tree)**

**Searching Strategy**

**Greedy Search**

**Simple to Complex (Hill Climbing)**

**Training Regime**

**Batch Method**

**Inductive Bias**

**Shorter Decision Trees are preferred over Longer Decision Trees**

**Decision Trees that place high Information Gain Attributes close to the root are preferred to those that do not**

**Strengths**

**Can handle noisy Data**

**Hypothesis Space always contains the Target Function because**

**ID3 Algorithm searches a complete Hypothesis Space**

**Weaknesses**

**Cannot handle****very complex Numeric Representation**

**Note**

**State-of-the-art Decision Tree Learning Algorithms (like Random Forest) can handle simple Numeric Representations**

**However, they fail to handle very complex Numeric Representations (e.g. Activity Detection in Video / Image)**

**To convert a Decision Tree (or Model) into If-Then Rules, follow these steps**

**Step 1: Write down all the paths in the Decision Tree**

**A path is a Conjunction (AND) of Attributes**

**Step 2: Join the paths using Disjunction (OR)**

**Decision Tree = Disjunction of Paths**
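The two steps above can be sketched in a few lines of Python. This is only an illustration: the tree, its attribute names (`Outlook`, `Humidity`) and the helper names `path_to_conjunction` and `tree_to_rule` are hypothetical and do not come from this chapter.

```python
# Hypothetical Decision Tree, written as a list of paths.
# Each path is (list of attribute tests, Output value at the leaf).
paths = [
    ([("Outlook", "Sunny"), ("Humidity", "Normal")], "Yes"),
    ([("Outlook", "Sunny"), ("Humidity", "High")], "No"),
    ([("Outlook", "Overcast")], "Yes"),
]


def path_to_conjunction(tests):
    """Step 1: a path is a Conjunction (AND) of attribute tests."""
    return " AND ".join(f"{attribute} = {value}" for attribute, value in tests)


def tree_to_rule(paths, target="Yes"):
    """Step 2: join all paths that lead to `target` with Disjunction (OR)."""
    conjunctions = [
        f"({path_to_conjunction(tests)})"
        for tests, label in paths
        if label == target
    ]
    return f"IF {' OR '.join(conjunctions)} THEN {target}"


rule = tree_to_rule(paths)
print(rule)
```

Only the paths ending in the chosen Output value are joined, so the resulting rule describes one class; the remaining paths describe the other class.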

**In****general****, Decision Tree Learning Algorithms are****best****for Machine Learning Problems where**

**Instances / Examples are****Represented****as****Attribute–Value Pair**

**Target Function (f) is****Discrete Valued**

**Disjunctive Hypothesis may be Required**

**Possibly Noisy / Incomplete Training Data**

**ID3 Algorithm learns Decision Trees by constructing them top-down, beginning with the following question**

**Which Attribute should be tested at the root of the tree?**

**Generally, a statistical test is used to evaluate each Input Attribute to determine**

**How well it alone classifies the Training Examples?**

**Best Attribute is the one which alone best classifies the Training Examples**

**Best Attribute is selected and used as the test at the root node of the tree**

**A descendant of the root node is then created for each possible value of this Attribute, and**

**Training Examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the Training Example's value for this Attribute)**

**The entire process is then repeated using the Training Examples associated with each descendant node to select the Best Attribute to test at that point in the tree**

**This process continues for each new leaf node until either of two conditions is met**

**Every Attribute has already been included along this path through the tree, or**

**Training Examples associated with this leaf node all have the same Output value (i.e., their Entropy is Zero)**

**ID3 Algorithm performs a Greedy Search for an acceptable Decision Tree, in which the ID3 Algorithm never backtracks to reconsider earlier choices**

**Many measures are available for picking the best classifier Attribute**

**e.g. Information Gain, Gain Ratio etc.**

**Information Gain is a useful measure for picking the best classifier Attribute**

**Information Gain is the expected reduction in Entropy resulting from partitioning a Set of Training Examples on the basis of an Attribute**

**Information Gain measures**

**how well****a given Attribute****separates****Training Examples****with respect to their Target Classiﬁcation**

**Information Gain is deﬁned in terms of**

**Entropy**

**Entropy gives a measure of purity / impurity of a Set of Training Examples**

**For Binary Classification, the value of Entropy lies in the range [0, 1]**

**0 means the Sample is Pure**

**1 means that the Sample has maximum impurity**

**Minimum Entropy**

**Entropy is minimum (i.e. ZERO) when all the Training Examples fall into one single Class / Category**

**Maximum Entropy**

**Entropy is maximum (i.e. ONE) when half of the Training Examples are Positive and the remaining half are Negative**
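A minimal sketch of Entropy and Information Gain, assuming the standard base-2 definition of Entropy. The function names and the `Colour` attribute are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


def information_gain(examples, labels, attribute):
    """Expected reduction in Entropy after partitioning on `attribute`.

    `examples` is a list of dicts mapping attribute name -> value.
    """
    total = len(labels)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


pure_sample = ["+", "+", "+", "+"]    # all examples in one Class
mixed_sample = ["+", "+", "-", "-"]   # half Positive, half Negative

# Hypothetical attribute that separates the mixed sample perfectly.
examples = [{"Colour": "White"}, {"Colour": "White"},
            {"Colour": "Black"}, {"Colour": "Black"}]

print(entropy(pure_sample), entropy(mixed_sample))
print(information_gain(examples, mixed_sample, "Colour"))
```

The pure sample gives the minimum Entropy and the half-and-half sample gives the maximum, matching the two cases described above.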

**A limitation of Information Gain measure is that it****favors****attributes with****many values****over those with****few values**

**Hypothesis Space of ID3 Algorithm is****complete space****of ﬁnite, discrete-valued functions w.r.t available Attributes**

**Hypothesis Space of Candidate Elimination Algorithm is****incomplete space****because**

**It****only****contains Hypotheses with****Conjunction relationship between Attributes**

**ID3 Algorithm maintains****only one Hypothesis****(h / Decision Tree) at any time, instead of, e.g.,****all Hypotheses (****or Decision Trees) consistent with Training Examples seen so far**

**This means that we****cannot**

**determine****how many alternative Hypotheses****(Decision Trees) are consistent with Training Examples**

**ID3 Algorithm****incompletely searches****a****complete Hypothesis Space (H)**

**Candidate Elimination Algorithm****completely searches****an****incomplete Hypothesis Space (H)**

**ID3 Algorithm performs****no backtracking**

**Once an Attribute is selected for testing at a given node, this choice is****never reconsidered**

**Therefore, ID3 Algorithm is****susceptible****to converging to****Locally Optimal Solution****rather than****Globally Optimal Solutions**

**Batch Method for Training is****more robust to errors in Training Examples****compared to Incremental Method**

**Preference Bias only affects the order in which Hypotheses are searched**

**In Preference Bias, Hypothesis Space (H) will contain the Target Function**

**Restriction Bias affects which Hypotheses are searched**

**In Restriction Bias, Hypothesis Space (H) may / may not contain the Target Function**

**Generally,****better****to****choose****Machine Learning Algorithm with Preference Bias rather than Restriction Bias**

**Some****Machine Learning Algorithms may****combine****Preference and Restriction Biases**

**for e.g. Checker’s Learning Program**

**Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances**

**What Causes Overfitting?**

**Noise****in Training Examples**

**Number of Training Examples is****too small****to produce a****representative sample****of Target Function**

**Why Overfitting is a Serious Problem?**

**Overfitting is a****serious****problem for****many****Machine Learning Algorithms**

**For example**

**Decision Tree Learning**

**Regular Neural Networks**

**Deep Learning Algorithm etc.**

**Overﬁtting is a****real problem****for Decision Tree Learning**

**For Decision Tree Learning, one****empirical study****showed that for a****range of tasks****there was a**

**Decrease of 10% – 25% in Accuracy due to Overfitting**

**The simple ID3 Algorithm****can produce**

**Decision Trees that****overfit****the Training Examples**

**Two general approaches to avoid Overfitting in Decision Tree Learning are**

**Stop growing****Decision Tree before perfectly ﬁtting Training Examples**

**e.g. when Data Split is not statistically signiﬁcant**

**Grow full****Decision Tree, then****prune****afterwards**

**In practice**

**Second approach has been****more successful**

**Most Common****Approach to avoid Overfitting is**

**Training and Validation Set Approach**

**Using Training and Validation Set Approach, Sample Data is split into three sets**

**Training Set**

**is used to****build the Model**

**Validation Set**

**is used to check whether the****Model is Overfitting or not during Training**

**Testing Set**

**is used to****evaluate the performance of the Model**

**Holding Data back for a Validation Set****reduces****Data available for Training**

**During Training, if Training Accuracy is increasing and Validation Accuracy is also increasing then Model is not Overfitting**

**During Training, if Training Accuracy is increasing and Validation Accuracy is decreasing then Model is Overfitting**

**Two approaches for Decision Tree Pruning to avoid Overfitting are**

**Reduced Error Pruning Approach**

**Rule Post-Pruning Approach**

**Rule Post-Pruning Approach is better than Reduced Error Pruning Approach**

## Artificial Neural Network

- Artificial Neural Networks (ANNs)

**Definition**

**Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)**

**Applications of ANNs in Natural Language Processing**

**Text Classification**

**Information Extraction**

**Semantic Parsing**

**Question Answering**

**Paraphrase Detection**

**Natural Language Generation**

**Text Summarization**

**Machine Translation**

**Speech Recognition**

**Character Recognition**

**and many more 😉**

**Applications of ANNs in Image Processing**

**Face Detection**

**Face Recognition**

**Fake Image / Video Detection**

**Object Detection in Image / Video**

**Activity Recognition in Image / Video**

**Natural language Description Generation from Image**

**Captioning of Image / Video**

**and many more 😉**

- Artificial Neural Networks – Biological Motivation

**Biological Motivation**

**Human Brain can****classify / categorize****Real-world Objects easily**

**Human Brain is made up of****Networks of Neurons**

**Naturally occurring****Neural Networks**

**Each Neuron is****connected****to****many****others**

**Input****to one Neuron is the****Output****from many others**

**Like Human Brain**

**ANNs are Neural Networks which can****categorize / classify****Real-world Objects**

**Don't take the analogy too far 😊**

**Human Brain has****approximately****100,000,000,000 Neurons**

**ANNs****usually****have < 1000 Neurons**

**To conclude**

**ANNs are a****gross simplification****of****real****Neural Networks**

- Example 1 – Artificial Neural Networks

**Machine Learning Problem**

**Squaring Integers**

**Input**

**An Integer Number (Numeric)**

**Output**

**An Integer Number (Numeric)**

**Set of Training Examples (D)**

**Consider the following Set of Training Examples (D)**

**In Training Examples (D), we have**

**Single Input**

**Single Output**

**Job of Learner (Artificial Neural Network)**

**Learn from Input (Numeric) to****predict****Output (Numeric)**

**Output of Learner (Artificial Neural Network)**

**Model / h = $x^2$**

**where x is an Integer Number**

**Note**

**You can see in this example that**

**ANN has****learned****from Numeric values (Inputs) to****predict****Numeric values (Outputs)**

- Example 2 – Artificial Neural Networks

**Machine Learning Problem**

**Learn Relationship between Three Integer**

**Input**

**Three Integer Numbers (Numeric)**

**Output**

**An Integer Number (Numeric)**

**Set of Training Examples (D)**

**Consider the following Set of Training Examples (D)**

**In Training Examples (D), we have**

**Multiple Inputs**

**Single Output**

**Job of Learner (Artificial Neural Network)**

**Learn from Input (Numeric) to****predict****Output (Numeric)**

**Output of Learner (Artificial Neural Network)**

**Model / h = [A, B, C] -> A*C – B**

**where A, B and C are Integer Numbers**

**Note**

**You can see in this example that**

**ANN has****learned****from Numeric values (Inputs) to****predict****Numeric values (Outputs)**

**Also, calculations in this Machine Learning Problem are more complex than in the previous one (Squaring Integers Machine Learning Problem)**

- Example 3 – Artificial Neural Networks

**Machine Learning Problem**

**Categorizing Vehicles**

**Input**

**An Image**

**Output**

**Category of Vehicle**

**Possible Output Values**

**Car**

**Bus**

**Tank**

**Set of Training Examples (D)**

**Consider the following Set of Training Examples (D)**

**In Training Examples (D)**

**Input is Image (Non-numeric)**

**Output is Categorical (Non-numeric)**

**Problem**

**ANNs can****only understand****Non-Symbolic Representations**

**Solution**

**Convert both****Input and Output into****Numeric Representation****(or Non-Symbolic Representation)**

**Converting****Output into Non-Symbolic Representation**

**Question**

**How to****transform****Categories into Numeric Representations?**

**A Possible Answer**

**Map****each Category to**

**A Number or**

**A Range of Real Valued Numbers (e.g., 0.5 – 0.9)**

**Considering Vehicle Categorization Problem**

**Map each Category to a Number**

**After****Mapping,****Output****in Training Examples will be as follows**

**Note**

**Alhamdulillah (praise be to God), Output is transformed into Numeric Representation**

**In sha Allah (God willing), the next Slides explain how to transform an Image into Numeric Representation**

**Converting****Input (Image) into Non-Symbolic Representation**

**Question**

**How to****transform****an Image into Numeric representations?**

**A Possible Answer**

**Use a Feature Extraction Method**

**e.g. Extract four****Pixel Values****from each Image**

**Note**

**For****simplicity****, I have only taken four Pixels (Attributes / Features)**

**Considering Vehicle Categorization Problem**

**Extract four****Pixel Values****from each Image**

**Range of Pixel Values is [0 – 255]**

**After Converting Image into Pixel Values****, Training Examples will be as follows**

**Set of Training Examples (D)**

**Job of Learner (Artificial Neural Network)**

**Learn from Input (Numeric) to****predict****Output (Numeric)**

**Output of Learner (Artificial Neural Network)**

**Model / h**

**Note**

**You can see in this example that**

**ANN has learned from Numeric values (Inputs) to predict Numeric values (Outputs)**

**Also, calculations in this Machine Learning Problem are more complex than in the previous two Machine Learning Problems: (1) Squaring Integers and (2) Learning Relationship between Three Integers**

**Conclusion**

**ANNs are very****efficient****, since they have the****capability****to learn****complex concepts****like**

**Vehicle Categorization**

- Suitable Problems for ANNs

**Training Examples (both Input and Output) can be****represented****as**

**real values (Numeric Representation)**

**Hypothesis (h) can be****represented****as**

**real values (Numeric Representation)**

**Slow Training Times are OK**

**Training an ANN can take hours or even days**

**Predictive Accuracy is more important than understanding**

**What the ANN has learned?**

**Since ANNs have a Black Box Representation, it is very difficult to understand what they have learned**

**Execution of Learned Function (Model / h) must be quick**

**In Application Phase, Learned Neural Networks (Model / h) can categorise unseen examples very quickly**

**Very useful in time critical situations (e.g. Is that a Car or Tank?)**

**ANNs are fairly robust to noise in Training Data**

## General Architecture - Artificial Neural Networks

- General Architecture - Artificial Neural Networks

**Three Main Layers**

**Input Layer**

**Hidden Layer**

**Output Layer**

**Each Layer contains one or more****Units**

**Input Units**

**Units in the Input Layer are called Input Units**

**Hidden Units**

**Units in the Hidden Layer are called Hidden Units**

**Output Units**

**Units in the Output Layer are called Output Units**

- Number of Units at each Layer

**Question 1**

**How many****Units****should be there at Input Layer?**

**Answer**

**Number of Input Units = Number of****Attribute / Feature Values**

**Example – Categorizing Vehicles**

**We extracted****four Pixels****(Attributes / Features)**

**Number of Input Units = 4**

**Question 2**

**How many****Units****should be there at Output Layer?**

**Answer**

**Number of Output Units = Number of Classes**

**Example – Categorizing Vehicles**

**There are 3 Classes (Car, Bus and Tank)**

**Number of Output Units = 3**

**Question 3**

**How many****Units****should be there at Hidden Layer?**

**Answer**

**There is****no definite answer****to this question**

**Normally, people****randomly****select**

**Number of Hidden Layers and**

**Number of Hidden Units**

- SoftMax Layer - ANN

**Considering Vehicle Categorization Problem**

**Output Units****contain the****Output generated by ANN**

**Problem**

**How can we****interpret****the****vectors at Output Units****to****categorize Image****as Car / Bus / Tank?**

**A Possible Solution**

**Use a Softmax Function**

**Softmax Function**

**Softmax Function takes the vector of values at the Output Units as input and converts it into probabilities such that sum of probabilities of all Output Units is 1**

**The Output Unit with the highest probability will be the predicted Class / Category of an instance / example**

**Softmax Layer is the last Layer of ANN**
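A small sketch of the Softmax Function for the Vehicle Categorization Problem. The raw values in `scores` are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability choice that the slides do not mention.

```python
import math


def softmax(scores):
    """Convert raw Output Unit values into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score does not change the result,
    # it only avoids very large exponentials.
    exponentials = [math.exp(score - largest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


classes = ["Car", "Bus", "Tank"]
scores = [2.0, 1.0, 0.1]          # hypothetical values at the three Output Units

probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]

print(probabilities)
print(predicted)
```

The predicted Class is simply the one whose Output Unit has the highest probability.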

- Mathematical Functions – ANNs

**ANNs****embed****giant Mathematical Functions (e.g. relu, sigmoid etc.)**

**All****the Hidden Units and Output Units in an ANN, have the**

**same****Mathematical Function**

**Input to a Mathematical Function**

**Weighted Sum of Inputs (S)**

- Fully Connected ANN vs Partially Connected ANN

**Fully Connected ANN**

**A Fully Connected Neural Network consists of a****series****of Fully Connected Layers**

**Fully Connected Layer**

**In a Fully Connected Layer,****each Unit****receives input from****every Unit****of the previous layer**

**Partially Connected ANN**

**A Partially Connected Neural Network consists of a****series****of Partially Connected Layers**

**Partially Connected Layer**

**In a Partially Connected Layer, each Unit****does not****receive input from****every Unit****of the previous layer**

- Example – Fully Connected ANN

- Example – Partially Connected ANN

- Feed Forward Network

**Question**

**Why ANN is called a Feed Forward Network?**

**Answer**

**A Simple ANN is also called Feed Forward Network because numbers in the ANN move in a forward direction**

**i.e. Input Layer → Hidden Layer(s) → Output Layer**

**Note that calculations are performed at each Hidden and Output Unit**
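The forward movement of numbers can be sketched for a tiny Fully Connected ANN with four Input Units, two Hidden Units and three Output Units. All Weights and pixel values below are arbitrary illustrative numbers, and the choice of sigmoid as the Mathematical Function is an assumption made for this sketch.

```python
import math


def sigmoid(s):
    """Mathematical Function applied to the Weighted Sum of Inputs (S)."""
    return 1.0 / (1.0 + math.exp(-s))


def layer_forward(inputs, weights):
    """Compute the output of every Unit in one Fully Connected Layer.

    weights[j] holds the Weights on the edges coming into Unit j.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)))
        for unit_weights in weights
    ]


# Hypothetical Weights: 4 Input Units -> 2 Hidden Units -> 3 Output Units.
hidden_weights = [[0.1, -0.2, 0.3, 0.4],
                  [-0.5, 0.2, 0.1, -0.3]]
output_weights = [[0.3, -0.1],
                  [-0.2, 0.4],
                  [0.5, 0.5]]

pixels = [0.2, 0.8, 0.5, 0.1]                      # values at the Input Units
hidden = layer_forward(pixels, hidden_weights)     # Input Layer -> Hidden Layer
outputs = layer_forward(hidden, output_weights)    # Hidden Layer -> Output Layer

print(hidden)
print(outputs)
```

Each layer only uses the outputs of the layer before it, which is why nothing ever flows backwards during prediction.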

- Weights – ANN

**The****edges****between Input-Hidden Units, Hidden-Hidden Unit and Hidden-Output Unit contain the**

**Weights**

**Recall**

**Hypothesis (h) Representation in ANNs**

**Combination of Weights between Units**

**To****Train****ANN**

**Initially, Weights are****randomly****assigned**

**Normally,****small****Weights are assigned in a range of [-0.5 to +0.5]**

**Hypothesis Space (H) in ANN**

**Set of****All****Possible Combinations of Weights between Units**

**Recall – Learning is a Searching Problem**

**The main goal of the Learner (ANN) is to search the Hypothesis Space (H) to find a Hypothesis (h / Model), which best fits the Set of Training Examples**

- Hypothesis Space (H) - ANNs

**Hypothesis Space (H) in ANN**

**Set of****All****Possible Combinations of Weights between Units**

**Note that Hypothesis Space (H) in ANN is****very complex**

**Question**

**What will happen if I increase the Number of Hidden Layers in ANN?**

**Answer**

**Both****complexity of Model (h)****and****computational cost****will increase**

- ANNs have Black Box Representation

**Model / h returned by Learner**

**Best****Combination of Weights between Units**

**ANNs are said to have Black Box Representation because**

**Useful knowledge****about learned concept (or Model / h) is****difficult****to****extract**

- Importance – ANN Architecture

**The****Architecture of ANN****plays an****important****role on the**

**performance****and****computational cost****of ANN**

**Important****Parameters****to consider in****designing****ANN Architecture**

**Main Parameters**

**No. of Input Units**

**No. of Hidden Layers**

**No. of Hidden Units at each Hidden Layer**

**No. of Output Units**

**Mathematical Function at each Hidden and Output Unit**

**Weights between Units**

**ANN will be Fully Connected or Not?**

- Types of Artificial Neural Networks

**ANNs can be****broadly****categorized based on**

**Number of Hidden Layers****in an****ANN Architecture**

**Two Layer Neural Networks (a.k.a. Perceptron)**

**Number of Hidden Layers = 0**

**Multi-layer Neural Networks**

**Regular Neural Network**

**Number of Hidden Layers = 1**

**Deep Neural Network**

**Number of Hidden Layers > 1**

**Chapter Focus**

**Perceptron (Two Layer ANNs)**

**Regular Neural Network (Multi-layer ANNs)**

## Perceptron - Two Layer Artificial Neural Networks

- Perceptron

**Definition**

**A Perceptron is a****simple****Two Layer ANN, with**

**One Input Layer**

**Multiple Input Units**

**Output Layer**

**Single Output Unit**

**A Perceptron can be used for Binary Classification Problems**

**A Sample Perceptron**

**Strengths**

**Perceptrons are useful to study because**

**We can use Perceptrons to build larger Neural Networks**

**Weaknesses**

**Perceptron has limited learning abilities**

**i.e. Fails to learn some simple Boolean-valued Functions (e.g. XOR)**

- How Perceptron Works?

**A Perceptron works as follows**

**Step 1: Random Weights are assigned to edges between Input-Output Units**

**Step 2: Input is fed into Input Units**

**Step 3: Weighted Sum of Inputs (S) is calculated**

**Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit**

**Step 5: Mathematical Function calculates Output for Perceptron**
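The five steps above can be sketched as follows. The Step Function with outputs +1 / -1 and the initial Weight range of [-0.5, +0.5] follow the rest of this chapter; the function names, the fixed random seed and the sample Input are illustrative assumptions.

```python
import random


def step(s, threshold=0.0):
    """Step Function: +1 if S is above the Threshold, otherwise -1."""
    return 1 if s > threshold else -1


def perceptron_output(inputs, weights, threshold=0.0):
    # Step 3: calculate the Weighted Sum of Inputs (S)
    s = sum(w * x for w, x in zip(weights, inputs))
    # Steps 4 and 5: give S to the Mathematical Function to get the Output
    return step(s, threshold)


random.seed(0)  # fixed seed only so the sketch is repeatable

# Step 1: assign small random Weights to the Input-Output edges
weights = [random.uniform(-0.5, 0.5) for _ in range(4)]

# Step 2: feed an Input into the four Input Units
inputs = [-1, 1, 1, -1]

output = perceptron_output(inputs, weights)
print(weights, output)
```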

- Mathematical Functions - Perceptron

**Some of the Mathematical Functions that can be used in Perceptron are as follows**

**Linear Function**

**Simply output the Weighted Sum of Inputs (S)**

**Step Function**

$$
\text{Output} = \begin{cases} +1 & \text{if } S > T \\ -1 & \text{otherwise} \end{cases}
$$

**where S represents Weighted Sum of Inputs and**

**T represents Threshold**

**Sigmoid Function**

**Similar to Step Function but differentiable**
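The three Mathematical Functions can be written as small Python functions. This is a sketch under assumptions: the Step Function returns +1 / -1 as in the Gender Identification example of this chapter, and the smooth function is taken to be the standard logistic sigmoid, whose Output lies between 0 and 1.

```python
import math


def linear(s):
    """Linear Function: simply output the Weighted Sum of Inputs (S)."""
    return s


def step(s, threshold):
    """Step Function: jump from -1 to +1 once S exceeds the Threshold (T)."""
    return 1 if s > threshold else -1


def sigmoid(s):
    """Logistic sigmoid: a smooth, differentiable alternative to the step."""
    return 1.0 / (1.0 + math.exp(-s))


for s in (-2.0, 0.0, 2.0):
    print(s, linear(s), step(s, threshold=0.0), round(sigmoid(s), 3))
```

Being differentiable matters because gradient-based training of Multi-layer ANNs needs the slope of the function at every point, which the Step Function cannot provide at its jump.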

- Example – Learning in Perceptron

**Machine Learning Problem**

**Gender Identification from Image**

**Input**

**Photo / Image (Black and White) of a Human**

**Output**

**Gender of the Human**

**Task**

**Given 2×2 Pixel Black and White Image of a Human (Input), predict the Gender of the Human (Output)**

**Treated as**

**Learning Input-Output Function**

**i.e. Learn from Input to****predict****Output**

**Input**

**2×2 Black and White Image**

**Output**

**Class / Category 01 = Male**

**Class / Category 02 = Female**

**Categorization Rule**

**If Image contains 2, 3 or 4 White Pixels then**

**It is Female**

**If Image contains 0 or 1 White Pixels then**

**It is Male**
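The Categorization Rule above is simple enough to state directly in code. In this sketch an Image is assumed to be a list of four pixels written as `"W"` (White) or `"B"` (Black); that encoding and the function name `categorize` are illustrative choices.

```python
def categorize(pixels):
    """Apply the Categorization Rule to a 2x2 Image given as four pixels."""
    white_pixels = sum(1 for pixel in pixels if pixel == "W")
    # 2, 3 or 4 White Pixels -> Female; 0 or 1 White Pixels -> Male
    return "Female" if white_pixels >= 2 else "Male"


print(categorize(["B", "W", "W", "B"]))   # two White Pixels
print(categorize(["B", "W", "B", "B"]))   # one White Pixel
```

This rule is the Target Function that the Perceptron has to approximate from Training Examples alone.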

**Perceptron Architecture**

**Input Layer**

**Four Input Units (one for each Pixel)**

**Output Layer**

**One Output Unit**

**+1 for Female and**

**-1 for Male**

**Mathematical Function**

**Step Function**

**Perceptron Architecture**

**Need to Learn two things**

**Weights between Input and Output Units**

**Value for the Threshold (T)**

**Make calculations easier by**

**Thinking of Threshold (T) as a Weight from a special Input Unit, whose**

**Output is always 1**

**Exactly the same result, but we only have to learn**

**Weights between Input and Output Units**

**Updated****Perceptron Architecture**

**Input Layer**

**Five Input Units**

**Four Input Units (one for each Pixel)**

**One Input Unit (for Threshold)**

**Output Layer**

**One Output Unit**

**+1 for Female and**

**-1 for Male**

**Mathematical Function**

**Step Function**

**Updated****Perceptron Architecture**
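The claim that the Updated Perceptron Architecture gives exactly the same result can be checked with a short sketch. It assumes the special Input Unit carries the Weight `-T`, so that testing `S > T` becomes testing `S - T > 0`; the Weight and Threshold values are arbitrary illustrative numbers.

```python
from itertools import product


def output_with_threshold(pixels, weights, threshold):
    """Original architecture: compare the Weighted Sum against T."""
    s = sum(w * x for w, x in zip(weights, pixels))
    return 1 if s > threshold else -1


def output_with_special_unit(pixels, weights, threshold):
    """Updated architecture: extra Input Unit that always outputs 1.

    Its Weight is -T, so the comparison is always against zero.
    """
    extended_inputs = [1] + list(pixels)
    extended_weights = [-threshold] + list(weights)
    s = sum(w * x for w, x in zip(extended_weights, extended_inputs))
    return 1 if s > 0 else -1


weights = [0.4, 0.2, 0.8, 0.3]   # hypothetical Weights for the four Pixels
threshold = 0.8                  # hypothetical Threshold (T)

# Compare both architectures on every possible 2x2 Black and White Image.
all_images = list(product([-1, 1], repeat=4))
agreement = all(
    output_with_threshold(image, weights, threshold)
    == output_with_special_unit(image, weights, threshold)
    for image in all_images
)
print(agreement)
```

Because the Threshold is now just another Weight, the Learning Algorithm can treat all five Weights in the same way.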

## Machine Learning Cycle – Perceptron

- Machine Learning Cycle

**Four phases of a Machine Learning Cycle are**

**Training Phase**

**Build the Model****using Training Data**

**Testing Phase**

**Evaluate the performance of Model****using Testing Data**

**Application Phase**

**Deploy the Model in Real-world, to make predictions on Real-time unseen Data**

**Feedback Phase**

**Take Feedback from the Users and Domain Experts to improve the Model**

- Sample Data

**Consider the Sample Data of five Black and White Images**

**In Sample Data**

**Input is Image****(Non-numeric)**

**Output is Categorical****(Non-numeric)**

**Problem**

**ANNs can****only understand****Non-Symbolic Representations**

**Solution**

**Convert****both****Input and Output into****Numeric Representation****(or Non-Symbolic Representation)**

**Converting Output into****Numeric Representation**

**Female****= +1**

**Male****= -1**

**Converting Input into****Numeric Representation**

**Consider the Sample Data with****four Pixels****for each Image**

- Feature Extraction from Image Data

**Value of Black Pixel**

**-1**

**Value of White Pixel**

**+1**

**Note that****Pixel values****are extracted from**

**Left to Right, Top to Bottom**
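A minimal sketch of this Feature Extraction step, assuming a 2×2 Image is stored as two rows of `"B"` (Black) and `"W"` (White) pixels. The representation and the name `image_to_vector` are illustrative assumptions.

```python
def image_to_vector(image):
    """Read pixels Left to Right, Top to Bottom: White -> +1, Black -> -1."""
    return [1 if pixel == "W" else -1 for row in image for pixel in row]


# Hypothetical 2x2 Image: top row Black, White; bottom row White, Black.
image = [["B", "W"],
         ["W", "B"]]

vector = image_to_vector(image)
print(vector)
```

Keeping one fixed reading order is what guarantees that the same Pixel always arrives at the same Input Unit.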

- Sample Data Cont…

**Consider the Sample Data with****four Pixels****for each Image**

- Sample Data – Vector Representation

**$E_1$ = <-1, +1, +1, -1>, Output = +1**

**$E_2$ = <-1, +1, -1, -1>, Output = -1**

**$E_3$ = <-1, +1, +1, +1>, Output = +1**

**$E_4$ = <+1, +1, +1, +1>, Output = +1**

**$E_5$ = <-1, -1, -1, -1>, Output = -1**

- Split the Sample Data

**We split the Sample Data using****Random Split Approach****into**

**Training Data****– 2 / 3 of Sample Data**

**Testing Data****– 1 / 3 of Sample Data**
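One way to sketch a Random Split in Python is shown below. The helper name `random_split` and the seed are assumptions, and a random shuffle will not necessarily reproduce the particular split used in the following slides; with five examples, two thirds rounds to three Training Examples.

```python
import random


def random_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the data, then cut it into Training and Testing Data."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


sample_data = ["E1", "E2", "E3", "E4", "E5"]
training_data, testing_data = random_split(sample_data)

print(training_data)
print(testing_data)
```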

- Training Data

- Training Data – Vector Representation

**$E_1$ = <-1, +1, +1, -1>, Output = +1**

**$E_2$ = <-1, +1, -1, -1>, Output = -1**

**$E_3$ = <-1, +1, +1, +1>, Output = +1**

- Testing Data

- Testing Data – Vector Representation

**$E_4$ = <+1, +1, +1, +1>, Output = +1**

**$E_5$ = <-1, -1, -1, -1>, Output = -1**

### Perceptron – Learning Algorithm

- Perceptron – Learning Algorithm

- Perceptron Training Rule

**Perceptron Training Rule is used to tweak Weights, when**

**Actual Value is different from Predicted Value**

**How Perceptron Training Rule Works?**

**When Target Output t(E) is different from Observed Output o(E)**

**Add $\Delta_i$ to weight $w_i$**

**where $\Delta_i = \eta \, (t(E) - o(E)) \, x_i$**

**Do this for every Weight in Perceptron (or ANN)**

**Interpretation**

**Considering the Gender Identification Problem**

**(t(E) – o(E)) will either be +2 or –2 [cannot be the same sign]**

**So we can think of the addition of $\Delta_i$ as the movement of Weights in a direction**

**Which will improve the Perceptron (or ANN) performance with respect to E**

**Multiplication by $x_i$**

**Moves it more if the Input is bigger**
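The Perceptron Training Rule can be sketched as a single function. It assumes the Step Function with outputs +1 / -1 and an Input vector that already includes the special Input Unit (always 1) as its first element; the name `perceptron_update` and the small example values are illustrative.

```python
def perceptron_update(weights, inputs, target, learning_rate=0.1):
    """Return the Weights after one application of the Training Rule."""
    s = sum(w * x for w, x in zip(weights, inputs))
    observed = 1 if s > 0 else -1           # o(E) from the Step Function
    if observed == target:                  # prediction correct: no change
        return list(weights)
    # delta_i = eta * (t(E) - o(E)) * x_i, added to every Weight w_i
    return [
        w + learning_rate * (target - observed) * x
        for w, x in zip(weights, inputs)
    ]


# Wrong prediction: S = 0, so o(E) = -1 while t(E) = +1, and Weights move.
updated = perceptron_update([0.0, 0.0], [1, 1], target=1)

# Correct prediction: S = 1, so o(E) = +1 = t(E), and Weights stay the same.
unchanged = perceptron_update([0.5, 0.5], [1, 1], target=1)

print(updated, unchanged)
```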

- Learning Rate

**η is called the Learning Rate**

**Usually****set to something****small****(e.g., 0.1)**

**To****control****the movement of the Weights**

**Not to move****too far****for one Training Example**

**which may****over-compensate****for another Training Example**

**If a****large movement****is actually necessary for the Weights to correctly categorise E**

**This will occur over time with multiple epochs**

### Training Phase – Perceptron

- Training Phase

**First Training Example**

**$x_1$ = <-1, +1, +1, -1>, Output = +1**

**$t(x_1) = +1$**

**Epoch 01**

**Compute Weighted Sum of Inputs**

$S = W_0 \times 1 + W_1 X_1 + W_2 X_2 + W_3 X_3 + W_4 X_4$

$S = (-0.5 \times 1) + (0.7 \times -1) + (-0.2 \times +1) + (0.1 \times +1) + (0.9 \times -1)$

**S = -2.2**

**Apply Step Function to S to get prediction from Perceptron i.e. $o(x_1)$**

**Output of Perceptron (or ANN)**

**$o(x_1) = -1$**

**Actual Value $t(x_1)$ is different from Predicted Value $o(x_1)$**

**Tweak Weights using Perceptron Training Rule**

- Calculating the Error Values

**$\Delta_0 = \eta \, (t(E) - o(E)) \, x_0$**

**= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2**

**$\Delta_1 = \eta \, (t(E) - o(E)) \, x_1$**

**= 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2**

**$\Delta_2 = \eta \, (t(E) - o(E)) \, x_2$**

**= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2**

**$\Delta_3 = \eta \, (t(E) - o(E)) \, x_3$**

**= 0.1 * (1 – (-1)) * (1) = 0.1 * (2) = 0.2**

**$\Delta_4 = \eta \, (t(E) - o(E)) \, x_4$**

**= 0.1 * (1 – (-1)) * (-1) = 0.1 * (-2) = -0.2**

- Calculating the New Weights

**$w'_0 = -0.5 + \Delta_0 = -0.5 + 0.2 = -0.3$**

**$w'_1 = 0.7 + \Delta_1 = 0.7 - 0.2 = 0.5$**

**$w'_2 = -0.2 + \Delta_2 = -0.2 + 0.2 = 0$**

**$w'_3 = 0.1 + \Delta_3 = 0.1 + 0.2 = 0.3$**

**$w'_4 = 0.9 + \Delta_4 = 0.9 - 0.2 = 0.7$**

- Perceptron with New Weights

- Training Phase Cont…

**First Training Example**

**$x_1$ = <-1, +1, +1, -1>, Output = +1**

**$t(x_1) = +1$**

**Epoch 02**

**Compute Weighted Sum of Inputs**

$S = (-0.3 \times 1) + (0.5 \times -1) + (0 \times 1) + (0.3 \times 1) + (0.7 \times -1)$

**S = -1.2**

**Apply Step Function to S to get prediction from Perceptron i.e. $o(x_1)$**

**If (S > 0) Then $o(x_1) = +1$**

**else $o(x_1) = -1$**

**Output of Perceptron (or ANN)**

**$o(x_1) = -1$**

**Actual Value $t(x_1)$ is different from Predicted Value $o(x_1)$**

**Tweak Weights using Perceptron Training Rule**

**Still gets the wrong Classification / Categorisation**

**But the value is closer to ZERO (from -2.2 to -1.2)**

**In a few epochs' time, Training Example $x_1$ will be correctly Classified / Categorised**
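The walkthrough above can be replayed in code, using the initial Weights (-0.5, 0.7, -0.2, 0.1, 0.9), the Learning Rate 0.1 and the first Training Example from this chapter. The loop simply repeats the Perceptron Training Rule on that one example until it is correctly classified; the variable names are illustrative.

```python
def weighted_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


eta = 0.1                                   # Learning Rate
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]       # w0 (Threshold Weight) to w4
x1 = [1, -1, 1, 1, -1]                      # special Input 1, then four Pixels
target = 1                                  # t(x1) = +1

history = []                                # value of S at each epoch
weights_after_first_update = None

for epoch in range(1, 101):                 # upper bound only as a safety net
    s = weighted_sum(weights, x1)
    history.append(round(s, 6))
    observed = 1 if s > 0 else -1           # Step Function
    if observed == target:
        break
    # Perceptron Training Rule applied to every Weight
    weights = [w + eta * (target - observed) * x for w, x in zip(weights, x1)]
    if weights_after_first_update is None:
        weights_after_first_update = [round(w, 6) for w in weights]

print(history)
print(weights_after_first_update)
```

Every update raises S by the same amount here, because each Weight moves by 0.2 in the direction of its Input and all five Inputs have magnitude 1.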

- Assumption

**I assume that we have trained the Perceptron on all 3 Training Examples and the Model (or hypothesis) learned is given below**

**Learned Weights: $w_0 = -0.8$, $w_1 = 0.4$, $w_2 = 0.2$, $w_3 = 0.8$, $w_4 = 0.3$**

- Summary - Training Phase of ANN Algorithm

**Recall**

**Training Data**

**Model**

**Note that Model is a Black Box Representation and it is very difficult to**

**understand what the Model has learned**

**In sha Allah (God willing), in the next Phase i.e. Testing Phase, we will**

**Evaluate the performance of the Model**

### Testing Phase – Perceptron

- Testing Phase

**Question**

**How well has the Model learned?**

**Answer**

**Evaluate the performance of the Model on****unseen****data (or Testing Data)**

- Evaluation Measures

**Evaluation will be carried out using**

**Error measure**

- Error

**Definition**

**Error is defined as the proportion of incorrectly classified Test instances**

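**Formula**

$\text{Error} = \dfrac{\text{Number of incorrectly classified Test instances}}{\text{Total number of Test instances}}$

**Note**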

**Accuracy = 1 – Error**

- Evaluate Model (Perceptron) Cont…

**First Test Example**

**x****4****= <+1, +1, +1, +1> +**

**t(x****4****) = +1**

**Evaluating Test Example x****4**

**Compute Weighted Sum of Inputs**

$S = (-0.8)(1) + (0.4)(1) + (0.2)(1) + (0.8)(1) + (0.3)(1) = 0.9$

**Apply Step Function to S to get prediction from Perceptron i.e. o(x****4****)**

**Output of Perceptron (or ANN)**

**o(x****4****) = +1**

**Actual Value t(x****4****) is****same****as Predicted Value o(x****4****)**

**i.e. Test instance is correctly Classified**

**Second Test Example**

**x****5****= <-1, -1, -1, -1> –**

**t(x****5****) = -1**

**Evaluating Test Example x****5**

**Compute Weighted Sum of Inputs**

$S = (-0.8)(1) + (0.4)(-1) + (0.2)(-1) + (0.8)(-1) + (0.3)(-1) = -2.5$

**Apply Step Function to S to get prediction from Perceptron i.e. o(x****5****)**

**If (S > 0) Then****o(x****5****) = +1**

**else****o(x****5****) = -1**

**Output of Perceptron (or ANN)**

**o(x****5****) = -1**

**Actual Value t(x****5****) is****same****as Predicted Value o(x****5****)**

**i.e. Test instance is correctly Classified**

- Summary - Evaluate Model (Perceptron)
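As a small sketch of the evaluation above, the snippet below applies the learned weights to the two Test Examples and computes Error and Accuracy. The function names are illustrative; the weights are the ones used in the two weighted-sum computations.

```python
def predict(weights, x):
    """Weighted sum followed by the step function (+1 if S > 0, else -1)."""
    s = sum(w * xi for w, xi in zip(weights, [1] + list(x)))
    return 1 if s > 0 else -1

def error_rate(targets, predictions):
    """Proportion of incorrectly classified Test instances."""
    wrong = sum(1 for t, o in zip(targets, predictions) if t != o)
    return wrong / len(targets)

model = [-0.8, 0.4, 0.2, 0.8, 0.3]           # w0..w4 of the learned Perceptron
test_data = [([1, 1, 1, 1], 1),              # x4, t(x4) = +1
             ([-1, -1, -1, -1], -1)]         # x5, t(x5) = -1

targets = [t for _, t in test_data]
predictions = [predict(model, x) for x, _ in test_data]
error = error_rate(targets, predictions)
accuracy = 1 - error
print(predictions, error, accuracy)
```

Both Test Examples are classified correctly, so Error is 0 and Accuracy is 1 on this very small Test Data.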

### Application Phase – Perceptron

- Application Phase

**We assume that our Model**

**performed well****on****large Test Data****and can be****deployed in Real-world**

**Model is****deployed in the Real-world****and now we can make**

**predictions****on Real-time Data**

- Steps – Making Predictions on Real-time Data

**Step 1: Take Input from User**

**Step 2: Convert****User Input****into****Feature Vector**

**Exactly same****as****Feature Vectors****of Training and Testing Data**

**Step 3:****Apply****Model on the****Feature Vector**

**Step 4: Return****Prediction****to the User**

- Making Predictions on Real-time Data

**Step 1: Take input from User**

**Step 2: Convert User Input into Feature Vector**

**Note that the order of Attributes / Features must be exactly the same as that of the Training and Testing Examples**

**Step 3: Apply Model on Feature Vector**

**Step 4: Return Prediction to the User**

**Male**

**Note**

**You can take Input from user, apply Model and return predictions as many times as you like ****😊**

### Feedback Phase – Perceptron

- Feedback Phase

**Only Allah is Perfect****😊**

**Take Feedback on your****deployed****Model from**

**Domain Experts and**

**Users**

**Improve****your Model based on Feedback****😊**

- Strengths and Weaknesses – Perceptron

**Strengths**

**Perceptrons can learn Binary Classification Problems**

**For example, we learned Gender Identification Problem using a simple Perceptron**

**Perceptrons can be combined to make larger ANNs**

**Perceptrons can learn linearly separable functions**

**Weaknesses**

**Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable**

- Examples – Limitations of Perceptron

**Perceptrons can learn simple Boolean-valued functions which are linearly separable**

**For example, AND Function, OR Function etc.**

**Truth Table of AND Function**

**Truth Table of OR Function**

**Boolean-valued AND Function and OR Function**

**Perceptrons cannot learn simple Boolean-valued functions which are not linearly separable**

**For example, XOR Function**

**Truth table of XOR Function**

**Boolean-valued XOR Function**
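The limitation can be checked with a short experiment. The sketch below trains a Perceptron on AND, OR and XOR. The 0/1 input encoding, the ±1 targets, the learning rate of 1.0 and the cap of 100 epochs are all assumptions made for illustration.

```python
def train_perceptron(examples, eta=1.0, max_epochs=100):
    """Train with the Perceptron Training Rule.

    Returns (weights, converged); converged is True once a full epoch
    passes without a single misclassification.
    """
    weights = [0.0, 0.0, 0.0]                      # w0 (bias), w1, w2
    for _ in range(max_epochs):
        mistakes = 0
        for (a, b), target in examples:
            inputs = [1.0, a, b]
            s = sum(w * xi for w, xi in zip(weights, inputs))
            output = 1 if s > 0 else -1
            if output != target:
                mistakes += 1
                weights = [w + eta * (target - output) * xi
                           for w, xi in zip(weights, inputs)]
        if mistakes == 0:
            return weights, True
    return weights, False

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
OR  = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

and_ok = train_perceptron(AND)[1]
or_ok = train_perceptron(OR)[1]
xor_ok = train_perceptron(XOR)[1]
print(and_ok, or_ok, xor_ok)
```

AND and OR converge because a single straight line separates their positive and negative examples; no such line exists for XOR, so training never reaches an error-free epoch.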

## Multi-layer Artificial Neural Networks

- Multi-layer ANN

**Definition**

**A Multilayer Feed Forward Neural Network consists of an Input Layer, one or more Hidden Layers, and an Output Layer**

**One Input Layer**

**Multiple Input Units**

**One or More Hidden Layers**

**Multiple Hidden Units**

**Output Layer**

**Multiple Output Units**

**A Multi-layer ANN can be used for both**

**Binary Classification Problems and**

**Multi-class Classification Problems**

- Chapter Focus

**The focus of this Chapter is on Multi-layer Neural Networks with****one****Hidden Layer i.e. Regular Neural Network**

- Regular Neural Network

**A Sample Regular Neural Network**

- Strengths and Weaknesses - Regular Neural Network

**Strengths**

**Can learn both****linearly separable****and****non-linearly separable****Target Functions**

**Can handle****noisy****Data**

**Can learn Machine Learning Problems with****very complex Numerical Representations**

**Weaknesses**

**Computational Cost and Training Time are****high**

- How Regular Neural Network Works?

**A Multilayer ANN works as follows**

**Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units**

**Step 2: Inputs are****fed simultaneously****into the Input Units (making up the Input Layer)**

**Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)**

**Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 5: Weighted Sum of the Hidden Unit outputs is calculated and fed to the Output Units (making up the Output Layer)**

**Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 7: Softmax Layer converts the****vectors****at Output Units into****probabilities****and****Class with highest probability****is the****Prediction****of the Regular Neural Network**
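Step 7 can be illustrated with a minimal softmax sketch. The two Output Unit values and the class order (O1 = Female, O2 = Male) are illustrative assumptions.

```python
import math

def softmax(values):
    """Convert a list of real values into probabilities that sum to 1."""
    largest = max(values)                       # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Female", "Male"]                    # assumed order: O1, O2
output_units = [0.4647, 0.6810]                 # illustrative Output Unit values
probabilities = softmax(output_units)
prediction = classes[probabilities.index(max(probabilities))]
print(probabilities, prediction)
```

The class with the highest probability is returned as the Prediction. Softmax preserves the ordering of its inputs, so the largest Output Unit always wins.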

- Mathematical Functions – Regular Neural Network

**Some of the popular and widely used Mathematical Functions in Regular Neural Network are**

**Sigmoid**

**Formula of Sigmoid Function**

$\sigma(S) = \dfrac{1}{1 + e^{-S}}$

**where ‘S’ is Weighted sum of Inputs**

- Example – Learning in Regular Neural Network

**Machine Learning Problem**

**Gender Identification from Image**

**Input**

**Photo / Image (Color) of a Human**

**Output**

**Gender of the Human**

**Task**

**Given****RGB Color Image****of a Human (Input),****predict****the Gender of the Human (Output)**

**Treated as**

**Learning Input-Output Function**

**i.e. Learn from Input to****predict****Output**

- Example – Learning in Multilayer ANN Cont…

**Input**

**2×2 RGB Image**

**Output**

**Class / Category 01****= Male**

**Class / Category 02****= Female**

**Categorization Rule**

**If Image contains 2, 3 or 4 Red Pixels then**

**It is Female**

**If Image contains 0 or 1 Red Pixels, then**

**It is Male**

**Multilayer Architecture**

**Input Layer**

**Four Input Units (one for each Pixel)**

**Hidden Layer**

**Two Hidden Units, each of which receives the Weighted Sum of Inputs from the Input Layer and sends its Output to the Output Layer**

**Output Layer**

**Two Output Units**

**O****1****for Female**

**O****2****for Male**

**Mathematical Function**

**Sigmoid Function**

**Multilayer ANN Architecture**

**Need to Learn**

**Combination of Weights between Units which best fits the Training Data**

### Machine Learning Cycle – Regular Neural Network

- Machine Learning Cycle

**Four phases of a Machine Learning Cycle are**

**Training Phase**

**Build the Model****using Training Data**

**Testing Phase**

**Evaluate the performance of Model****using Testing Data**

**Application Phase**

**Deploy the Model in Real-world****, to****make prediction****on Real-time unseen Data**

**Feedback Phase**

**Take Feedback from the Users and Domain Experts to improve the Model**

- Sample Data

**Consider the Sample Data of five Color Images**

**In Sample Data**

**Input is Image****(Non-numeric)**

**Output is Categorical****(Non-numeric)**

**Problem**

**ANNs can****only understand****Non-Symbolic Representations**

**Solution**

**Convert both****Input and Output into****Numeric Representation****(or Non-Symbolic Representation)**

**Converting****Output into****Numeric Representation**

**Female****= +1**

**Male****= -1**

**Converting****Input into****Numeric Representation**

**Feature Extraction from Image Data**

**Extract four color Pixels for each Image**

- Feature Extraction from Image Data

**Value of Red Pixel**

**0 – 255**

**Value of Green Pixel**

**0 – 255**

**Value of Blue Pixel**

**0 – 255**

**Note that****Pixel values****are extracted from**

**Left to Right, Top to Bottom**

- Sample Data Cont…

**Consider the Sample Data with****Numeric Representation**

- Sample Data – Vector Representation

**E****1**** = < 200, 30, 70, 175 > +**

**E****2**** = < 140, 15, 84, 211 > –**

**E****3**** = < 25, 78, 158, 125 > +**

**E****4**** = < 36, 146, 243, 64 > +**

**E****5**** = < 198, 31, 214, 34 > –**

- Split the Sample Data

**We split the Sample Data using****Random Split Approach****into**

**Training Data****– 2 / 3 of Sample Data**

**Testing Data****– 1 / 3 of Sample Data**

- Training Data

- Training Data – Vector Representation

**E****1**** = < 200, 30, 70, 175 > +**

**E****2**** = < 140, 15, 84, 211 > –**

**E****3**** = < 25, 78, 158, 125 > +**

- Testing Data

- Testing Data – Vector Representation

**E****4**** = < 36, 146, 243, 64 > +**

**E****5**** = < 198, 31, 214, 34 > –**

### Multilayer ANN – Learning Algorithm

- Learning Algorithm – Input / Output

**Input**

**Set of Training Example (D)**

**Learning Rate (l)**

**A Regular Neural Network (N)**

**Output**

**A Trained Neural Network (Model / h)**

**Algorithm**

**Regular Neural Network****learning****the Gender Identification Problem****using****Backpropagation Algorithm (Backpropagation)**

- Learning and Backpropagation Algorithm

**Note**

**The pseudo code below is taken from the following Book**

**Data Mining, Third Edition (Page: 398)**

**In the given pseudo code**

**b refers to Bias**

**Biases**

**Biases are values associated with each Unit in the Hidden Layer and Output Layer of a Regular Neural Network, but in practice are treated in exactly the same manner as other Weights**

- Regular Neural Network Training Rule

**Regular Neural Network Training Rule is used to tweak Weights, when**

**Actual Value****is****different****from****Predicted Value**

**How Regular Neural Network Training Rule Works?**

**Step 1: For a given Training Example, set the Target Output Value of Output Unit (which is mapped to the****Class of given Training Example****) as 1 and**

**Set the Target Output Values of remaining Output Units as 0**

**Example – Step 1**

**Consider our Regular Neural Network for Gender Identification Problem**

**If Training Example E is Positive (Female), then**

**Set Target Output Value of O****1****to 1 and**

**Set Target Output Value of O****2****to 0**

**If Training Example E is Negative (Male), then**

**Set Target Output Value of O****1****to 0 and**

**Set Target Output Value of O****2****to 1**

**Step 2: Train the Regular Neural Network using Training Example E and get Predictions (Observed Output o(E)) from the Regular Neural Network**

**Step 3: If Target Output t(E) is different from Observed Output o(E)**

**Calculate the****Network Error**

**Back propagate the Error****by****updating weights****between Output-Hidden Units and Hidden-Input Units**

**Step 4: Keep****training****Regular Neural Network, until****Network Error****becomes****very small**

- Learning Rate

**η (denoted by l in the Learning Algorithm above) is called the Learning Rate**

**Usually****set to something****small****(e.g., 0.1)**

**To****control****the movement of the Weights**

**Not to move****too far****for one Training Example**

**which may****over-compensate****for another Training Example**

**If a****large movement****is actually necessary for the Weights to correctly categorize E**

**This will occur over time with****multiple epochs**

### Training Phase – Regular Neural Network

- Training Phase

**First Training Example**

$x_1 = \langle 200, 30, 70, 175 \rangle$ +

$t(x_1) = +1$

**Epoch 01**

**Compute Weighted Sum of Inputs (S) as Input to the Hidden Layer**

Using $I_j = b_j + \sum_{i=1}^{n} w_{ij} O_i$, where $O_i = X_i$

$I_{H1} = b_{h1} + W_{11} X_1 + W_{21} X_2 + W_{31} X_3 + W_{41} X_4$

$I_{H1} = (-0.2) + (0.2)(200) + (-0.1)(30) + (0.4)(70) + (0.3)(175) = 117.3$

$I_{H2} = b_{h2} + W_{12} X_1 + W_{22} X_2 + W_{32} X_3 + W_{42} X_4$

$I_{H2} = (0.9) + (0.7)(200) + (-0.4)(30) + (0.8)(70) + (0.1)(175) = 202.4$

**Apply Sigmoid Function to calculate the output from the Hidden Layer**

Using $O_j = \dfrac{1}{1 + e^{-I_j}}$

$O_{H1} = 0.4670$

$O_{H2} = 0.4433$

**So, H1 has fired, H2 has not**

**Compute Weighted Sum of Inputs (S) as Input into the Output Layer**

$I_{O1} = b_{O1} + W_{11} O_{H1} + W_{21} O_{H2}$

$I_{O1} = (-0.4) + (-0.3)(0.4670) + (0.9)(0.4433) = -0.1411$

$I_{O2} = b_{O2} + W_{12} O_{H1} + W_{22} O_{H2}$

$I_{O2} = (0.7) + (0.6)(0.4670) + (-0.5)(0.4433) = 0.7585$

**Apply Sigmoid Function to calculate the output from the ANN**

Using $O_j = \dfrac{1}{1 + e^{-I_j}}$

$O_{O1} = \dfrac{1}{1 + e^{0.1411}} = 0.4647$

$O_{O2} = \dfrac{1}{1 + e^{-0.7585}} = 0.6810$

**So, the Multilayer ANN predicts the category associated with O2**

**Output of Multilayer ANN**

$o(x_1) = -1$ (Male)

**Actual Value $t(x_1)$ is different from Predicted Value $o(x_1)$**

**Tweak Weights using Backpropagation Algorithm / Method**
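The forward pass above can be sketched as follows. The helper names are illustrative, and the weights and biases are the ones in the worked computation. One caveat: an exact sigmoid is essentially 1 for weighted sums as large as 117.3 and 202.4, so the sketch takes the Hidden Unit outputs stated in the worked example as given when computing the Output Layer. Scaling pixel values to the range 0 to 1 before training is a common way to avoid this saturation, but that step is an assumption and not part of the example.

```python
import math

def sigmoid(s):
    """Numerically stable sigmoid 1 / (1 + e^-s)."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)

def net_input(bias, weights, inputs):
    """I_j = b_j + sum_i w_ij * O_i."""
    return bias + sum(w * o for w, o in zip(weights, inputs))

x1 = [200, 30, 70, 175]
hidden_params = [(-0.2, [0.2, -0.1, 0.4, 0.3]),    # (b_h1, weights into H1)
                 (0.9, [0.7, -0.4, 0.8, 0.1])]     # (b_h2, weights into H2)
output_params = [(-0.4, [-0.3, 0.9]),              # (b_O1, weights into O1)
                 (0.7, [0.6, -0.5])]               # (b_O2, weights into O2)

hidden_inputs = [net_input(b, w, x1) for b, w in hidden_params]
hidden_exact = [sigmoid(i) for i in hidden_inputs]  # saturates close to 1

hidden_stated = [0.4670, 0.4433]                   # Hidden Unit outputs as stated in the example
output_inputs = [net_input(b, w, hidden_stated) for b, w in output_params]
outputs = [sigmoid(i) for i in output_inputs]
predicted = "Female" if outputs[0] > outputs[1] else "Male"
print(hidden_inputs, output_inputs, outputs, predicted)
```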

- Backpropagate the Errors

**Compute the Error of each Unit in the Output Layer**

Using $Err_j = O_j (1 - O_j)(T_j - O_j)$

**Error of Output Unit O1**

$Err_{O1} = 0.4647 \, (1 - 0.4647)(1 - 0.4647) = 0.1331$

**Error of Output Unit O2**

$Err_{O2} = 0.6810 \, (1 - 0.6810)(-1 - 0.6810) = -0.3651$

**Compute the Error of each Unit in the Hidden Layer**

Using $Err_j = O_j (1 - O_j) \sum_{k} Err_k \, w_{jk}$

**Error of Hidden Unit H1**

$Err_{H1} = -0.063$

**Error of Hidden Unit H2**

$Err_{H2} = 0.074$
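The error terms can be reproduced with a short sketch. It follows the worked example in using -1 as the target for O2, and takes the Hidden-to-Output weights from the forward pass; the function names are illustrative. Small differences in the third decimal place against the rounded figures are expected.

```python
def output_error(o, target):
    """Err_j = O_j * (1 - O_j) * (T_j - O_j) for an Output Unit."""
    return o * (1 - o) * (target - o)

def hidden_error(o, downstream_errors, downstream_weights):
    """Err_j = O_j * (1 - O_j) * sum_k Err_k * w_jk for a Hidden Unit."""
    return o * (1 - o) * sum(e * w for e, w in
                             zip(downstream_errors, downstream_weights))

o_out = [0.4647, 0.6810]          # outputs of O1 and O2
targets = [1, -1]                 # targets as used in the worked example
err_out = [output_error(o, t) for o, t in zip(o_out, targets)]

o_hid = [0.4670, 0.4433]          # outputs of H1 and H2
w_hid_to_out = [[-0.3, 0.6],      # weights from H1 to (O1, O2)
                [0.9, -0.5]]      # weights from H2 to (O1, O2)
err_hid = [hidden_error(o, err_out, w) for o, w in zip(o_hid, w_hid_to_out)]
print(err_out, err_hid)
```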

- Update the Weights and Bias

**Updated Weights between Hidden and Output Units**

Using $w_{ij} = w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = l \cdot Err_j \cdot O_i$

So, $w_{ij} = w_{ij} + l \cdot Err_j \cdot O_i$, with Learning Rate $l = 0.9$

$w'_{11} = w_{11} + 0.9 \cdot Err_{O1} \cdot O_{H1} = -0.3 + (0.9)(0.133)(0.4670) = -0.24$

$w'_{12} = w_{12} + 0.9 \cdot Err_{O2} \cdot O_{H1} = 0.6 + (0.9)(-0.365)(0.4670) = 0.45$

$w'_{21} = w_{21} + 0.9 \cdot Err_{O1} \cdot O_{H2} = 0.9 + (0.9)(0.133)(0.4433) = 0.95$

$w'_{22} = w_{22} + 0.9 \cdot Err_{O2} \cdot O_{H2} = -0.5 + (0.9)(-0.365)(0.4433) = -0.65$

**Updated Biases of Output Units**

Using $b_j = b_j + \Delta b_j$, where $\Delta b_j = l \cdot Err_j$

So, $b_j = b_j + l \cdot Err_j$

$b'_{O1} = b_{O1} + l \cdot Err_{O1} = -0.4 + (0.9)(0.1331) = -0.28$

$b'_{O2} = b_{O2} + l \cdot Err_{O2} = 0.7 + (0.9)(-0.365) = 0.37$

**Updated Weights between Input and Hidden Units**

Using $w_{ij} = w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = l \cdot Err_j \cdot O_i$ and $O_i = X_i$

$w'_{11} = w_{11} + 0.9 \cdot Err_{H1} \cdot X_1 = 0.2 + (0.9)(-0.063)(200) = -11.14$

**The updated Weights of all the remaining connections are calculated in the same way**

**Updated Biases of Hidden Units**

Using $b_j = b_j + \Delta b_j$, where $\Delta b_j = l \cdot Err_j$

$b'_{h1} = b_{h1} + l \cdot Err_{H1} = -0.2 + (0.9)(-0.063) = -0.26$

$b'_{h2} = b_{h2} + l \cdot Err_{H2} = 0.9 + (0.9)(0.074) = 0.97$
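A minimal sketch of the update rule, limited to a few of the weights and biases above. The learning rate of 0.9 and the rounded error values are taken from the worked example; the function and variable names are illustrative.

```python
def updated_weight(w, rate, err_j, o_i):
    """w_ij <- w_ij + l * Err_j * O_i, where O_i is the output of the sending unit."""
    return w + rate * err_j * o_i

def updated_bias(b, rate, err_j):
    """b_j <- b_j + l * Err_j."""
    return b + rate * err_j

RATE = 0.9                         # learning rate l used in the example
err_o1, err_o2 = 0.133, -0.365     # Output Unit errors (rounded)
err_h1, err_h2 = -0.063, 0.074     # Hidden Unit errors (rounded)
o_h1 = 0.4670                      # output of Hidden Unit H1
x_1 = 200                          # first input value of x1

w11_out = updated_weight(-0.3, RATE, err_o1, o_h1)   # H1 -> O1
w12_out = updated_weight(0.6, RATE, err_o2, o_h1)    # H1 -> O2
w11_in = updated_weight(0.2, RATE, err_h1, x_1)      # X1 -> H1
b_o1 = updated_bias(-0.4, RATE, err_o1)
b_h2 = updated_bias(0.9, RATE, err_h2)
print(w11_out, w12_out, w11_in, b_o1, b_h2)
```

The large jump in the Input-to-Hidden weight (from 0.2 to about -11.14) comes from multiplying by the raw pixel value 200, which is one reason inputs are often rescaled in practice.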

- Assumption

**I assume that we have trained the Regular Neural Network on all 3 Training Examples and the Model (or hypothesis) learned is given below**

- Summary - Training Phase of Regular Neural Network Algorithm

**Recall**

**Data = Model + Error**

**Training Data**

**Model**

**Note that Model is Black Box Representation and it is****very difficult****to**

**Understand****what the Model has****learned****?**

**In sha Allah, In the next Phase i.e. Testing Phase, we will**

**Evaluate the performance of the Model**

### Testing Phase – Multilayer ANN

- Testing Phase

**Question**

**How well has the Model learned?**

**Answer**

**Evaluate the performance of the Model on****unseen****data (or Testing Data)**

- Evaluation Measures

**Evaluation will be carried out using**

**Error measure**

- Error

**Definition**

**Error is defined as the proportion of incorrectly classified Test instances**

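**Formula**

$\text{Error} = \dfrac{\text{Number of incorrectly classified Test instances}}{\text{Total number of Test instances}}$

**Note**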

**Accuracy = 1 – Error**

- Evaluate Model (Regular Neural Network) Cont…

**First Test Example**

$x_4 = \langle 36, 146, 243, 64 \rangle$ +

$t(x_4) = +1$ (Female)

**Evaluating Test Example x4**

**Compute Weighted Sum of Inputs (S) to the Hidden Layer**

Using $I_j = b_j + \sum_{i=1}^{n} w_{ij} O_i$, where $O_i = X_i$

$I_{H1} = b_{h1} + W_{11} X_1 + W_{21} X_2 + W_{31} X_3 + W_{41} X_4$

$I_{H1} = (-1.5) + (0.3)(36) + (-0.2)(146) + (0.5)(243) + (1.2)(64) = 178.4$

$I_{H2} = b_{h2} + W_{12} X_1 + W_{22} X_2 + W_{32} X_3 + W_{42} X_4$

$I_{H2} = (0.4) + (0.6)(36) + (-1.3)(146) + (1.3)(243) + (3.2)(64) = 352.9$

**Apply Sigmoid Function to calculate the output from the Hidden Layer**

Using $O_j = \dfrac{1}{1 + e^{-I_j}}$

$O_{H1} = 0.2311$

$O_{H2} = 0.1547$

**So, H1 has fired, H2 has not**

**Compute Weighted Sum of Inputs (S) into the Output Layer**

$I_{O1} = b_{O1} + W_{11} O_{H1} + W_{21} O_{H2}$

$I_{O1} = (-1) + (0.2)(0.2311) + (0.23)(0.1547) = -0.9181$

$I_{O2} = b_{O2} + W_{12} O_{H1} + W_{22} O_{H2}$

$I_{O2} = (0.5) + (0.26)(0.2311) + (1.14)(0.1547) = 0.7364$

**Apply Sigmoid Function to calculate the output from the ANN**

$O_{O1} = \dfrac{1}{1 + e^{0.9181}} = 0.2853$

$O_{O2} = \dfrac{1}{1 + e^{-0.7364}} = 0.6762$

**So, the Regular Neural Network predicts the category associated with O2**

**Output of Regular Neural Network**

$o(x_4) = -1$ (Male)

**Actual Value $t(x_4)$ is different from Predicted Value $o(x_4)$**

**Test Example is incorrectly Classified**

**Second Test Example**

$x_5 = \langle 198, 31, 214, 34 \rangle$ –

$t(x_5) = -1$ (Male)

**Evaluating Test Example x5**

**Compute Weighted Sum of Inputs (S) to the Hidden Layer**

Using $I_j = b_j + \sum_{i=1}^{n} w_{ij} O_i$, where $O_i = X_i$

$I_{H1} = b_{h1} + W_{11} X_1 + W_{21} X_2 + W_{31} X_3 + W_{41} X_4$

$I_{H1} = (-1.5) + (0.3)(198) + (-0.2)(31) + (0.5)(214) + (1.2)(34) = 199.5$

$I_{H2} = b_{h2} + W_{12} X_1 + W_{22} X_2 + W_{32} X_3 + W_{42} X_4$

$I_{H2} = (0.4) + (0.6)(198) + (-1.3)(31) + (1.3)(214) + (3.2)(34) = 465.9$

**Apply Sigmoid Function to calculate the output from the Hidden Layer**

Using $O_j = \dfrac{1}{1 + e^{-I_j}}$

$O_{H1} = \dfrac{1}{1 + 2.2816} = 0.3047$

$O_{H2} = \dfrac{1}{1 + 4.5941} = 0.1787$

**So, H2 has fired, H1 has not**

**Compute Weighted Sum of Inputs (S) to the Output Layer**

$I_{O1} = b_{O1} + W_{11} O_{H1} + W_{21} O_{H2}$

$I_{O1} = (-1) + (0.2)(0.3047) + (0.23)(0.1787) = -0.8979$

$I_{O2} = b_{O2} + W_{12} O_{H1} + W_{22} O_{H2}$

$I_{O2} = (0.5) + (0.26)(0.3047) + (1.14)(0.1787) = 0.7829$

**Apply Sigmoid Function to calculate the output from the ANN**

Using $O_j = \dfrac{1}{1 + e^{-I_j}}$

$O_{O1} = \dfrac{1}{1 + e^{0.8979}} = \dfrac{1}{1 + 2.4544} = 0.2894$

$O_{O2} = \dfrac{1}{1 + e^{-0.7829}} = \dfrac{1}{1 + 0.4570} = 0.6863$

**So, the Regular Neural Network predicts the category associated with O2**

**Output of Regular Neural Network**

$o(x_5) = -1$ (Male)

**Actual Value $t(x_5)$ is the same as Predicted Value $o(x_5)$**

**Test Example is correctly Classified**

- Summary - Evaluate Model (Multilayer ANN)

**Apply Model on Test Data**
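Applying the Model to the Test Data can be sketched as below. The weights and biases are those used in the two test computations, and the class mapping (O1 = Female = +1, O2 = Male = -1) follows the text. The sketch evaluates the sigmoid exactly, which saturates for weighted sums this large, so intermediate activations may differ from the figures quoted above; only the predicted classes and the resulting Error are compared.

```python
import math

def sigmoid(s):
    """Numerically stable sigmoid 1 / (1 + e^-s)."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)

HIDDEN = [(-1.5, [0.3, -0.2, 0.5, 1.2]),   # (b_h1, weights into H1)
          (0.4, [0.6, -1.3, 1.3, 3.2])]    # (b_h2, weights into H2)
OUTPUT = [(-1.0, [0.2, 0.23]),             # (b_O1, weights into O1)
          (0.5, [0.26, 1.14])]             # (b_O2, weights into O2)

def predict(x):
    """Return +1 (Female) if O1 has the larger output, else -1 (Male)."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x))) for b, ws in HIDDEN]
    out = [sigmoid(b + sum(w * h for w, h in zip(ws, hidden))) for b, ws in OUTPUT]
    return 1 if out[0] > out[1] else -1

test_data = [([36, 146, 243, 64], 1),      # x4, Female
             ([198, 31, 214, 34], -1)]     # x5, Male

predictions = [predict(x) for x, _ in test_data]
wrong = sum(1 for (_, t), p in zip(test_data, predictions) if p != t)
error = wrong / len(test_data)
accuracy = 1 - error
print(predictions, error, accuracy)
```

One of the two Test Examples is misclassified, so Error = 1/2 = 0.5 and Accuracy = 0.5 on this Test Data.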

### Application Phase – Regular Neural Network

- Application Phase

**We assume that our Model**

**performed well****on****large Test Data****and can be****deployed in Real-world**

**Model is****deployed in the Real-world****and now we can make**

**predictions****on Real-time Data**

- Steps – Making Predictions on Real-time Data

**Step 1: Take Input from User**

**Step 2: Convert****User Input****into****Feature Vector**

**Exactly same****as****Feature Vectors****of Training and Testing Data**

**Step 3:****Apply****Model on the****Feature Vector**

**Step 4: Return****Prediction****to the User**

- Making Predictions on Real-time Data

**Step 1: Take input from User**

**Step 2: Convert User Input into****Feature Vector**

**Note that****order****of Attributes / Feature must be****exactly****same as that of Training and Testing Examples**

**Step 3: Apply Model on****Feature Vector**

**Unseen Example**

$x = \langle 235, 64, 159, 41 \rangle$

**Make Prediction for Unseen Example x**

**Compute Weighted Sum of Inputs (S) to the Hidden Layer**

Using $I_j = b_j + \sum_{i=1}^{n} w_{ij} O_i$, where $O_i = X_i$

$I_{H1} = b_{h1} + W_{11} X_1 + W_{21} X_2 + W_{31} X_3 + W_{41} X_4$

$I_{H1} = (-1.5) + (0.3)(235) + (-0.2)(64) + (0.5)(159) + (1.2)(41) = 184.9$

$I_{H2} = b_{h2} + W_{12} X_1 + W_{22} X_2 + W_{32} X_3 + W_{42} X_4$

$I_{H2} = (0.4) + (0.6)(235) + (-1.3)(64) + (1.3)(159) + (3.2)(41) = 396.1$

**Apply Sigmoid Function to calculate the output from the Hidden Layer**

$O_{H1} = 0.1666$

$O_{H2} = 0.0955$

**So, H1 has fired, H2 has not**

**Compute Weighted Sum of Inputs (S) into the Output Layer**

$I_{O1} = b_{O1} + W_{11} O_{H1} + W_{21} O_{H2}$

$I_{O1} = (-1) + (0.2)(0.1666) + (0.23)(0.0955) = -0.9447$

$I_{O2} = b_{O2} + W_{12} O_{H1} + W_{22} O_{H2}$

$I_{O2} = (0.5) + (0.26)(0.1666) + (1.14)(0.0955) = 0.6521$

**Apply Sigmoid Function to calculate the output from the ANN**

$O_{O1} = \dfrac{1}{1 + e^{0.9447}} = 0.2800$

$O_{O2} = \dfrac{1}{1 + e^{-0.6521}} = 0.6575$

**So, the Regular Neural Network predicts the category associated with O2**

**Output of Regular Neural Network**

$o(x) = -1$ (Male)

**Step 4: Return Prediction to the User**

**Male**

**Note**

**You can take Input from user, apply Model and return predictions as many times as you like 😊**

- Feedback Phase

**Only Allah is Perfect****😊**

**Take Feedback on your****deployed****Model from**

**Domain Experts and**

**Users**

**Improve**** your Model based on Feedback ****😊**

## Overfitting – Artificial Neural Networks

- Overfitting

**Definition**

**Given a hypothesis space H, a hypothesis h ∈ H overfits the Training Examples if there is another hypothesis h′ ∈ H such that h has smaller Error than h′ over the Training Examples, but h′ has a smaller Error over the entire distribution of instances**

**What Causes Overfitting?**

**Noise****in Training Examples or**

**Number of Training Examples is****too small****to produce a****representative sample****of Target Function**

**Why Overfitting is a Serious Problem?**

**Overfitting is a problem for many Machine Learning Algorithms**

**For example**

**Decision Tree Learning**

**Regular Neural Networks**

**Deep Learning Algorithm etc.**

- Example – Overfitting in ANN

**Plot Training Example Error versus Test Example Error:**

**Test Set Error is****increasing**

**ANN is Overfitting the Training Data**

- Problems with Local Minima

**Backpropagation Algorithm is Gradient Descent Search**

**Where the****height of the hills****is determined by****Error**

**But there are****many dimensions****to Search Space**

**One for each Weight in ANN**

**Therefore, Backpropagation Algorithm**

**Can find its ways into Local Minima**

**Possible Solutions**

**There can be****many****Possible Solutions to overcome the problem of Overfitting in ANN**

**I am presenting below only four possible solutions**

**Possible Solution 01**

**Training and Validation Set Approach**

**Possible Solution 02**

**Learn Multiple ANNs**

**Possible Solution 03**

**Momentum**

**Possible Solution 04**

**Weight Decay Factor**

- Avoiding Overfitting - Training and Validation Set Approach

**Using this approach, Sample Data is split into three sets**

**Training Set**

**Validation Set**

**Testing Set**

**Strengths**

**Validation Set helps to****check****whether the Model is Overfitting or not during the Training**

**Weaknesses**

**Holding data back for a Validation Set****reduces****data available for Training**

- Avoiding Overfitting - Training and Validation Set Approach Cont…

**Training Set**

**is used to****build****the Model**

**Validation Set**

**is used to****check****whether the Model is Overfitting or not during Training**

**Testing Set**

**is used to****evaluate****the performance of the Model**

**Note**

**Training Set and Validation Set are used in the**

**Training Phase**

**Testing Set is used in the**

**Testing Phase**

- Avoiding Overfitting - Training and Validation Set Approach Cont…

**Question 1**

**How to split Sample Data using Training and Validation Set Approach when we have****very large / huge****amount of data?**

**Answer 1**

**A Good Split may be**

**Training Set****= 80%**

**Validation Set****= 10%**

**Testing Set****= 10%**

**Note to****efficiently train****Deep Learning Algorithms, we need****huge****amount of Training Data**

**Question 2**

**How to split Sample Data using Training and Validation Set Approach when we have****sufficiently large****amount of data?**

**Answer 2**

**A Good Split may be**

**Training Set****= 80%**

**Testing Set****= 20%**

**Validation Set****= 10% of Training Set**

- Example 1 – Splitting Data using Training and Validation Set Approach

**Machine Learning Problem**

**Text Summarization**

**Deep learning Algorithm**

**LSTM**

**Total Sample Data**

**100,000 instances**

**Splitting Data using the following Split Ratio**

**Training Set****= 80%****= 80,000**

**Validation Set****= 10%****= 10,000**

**Testing Set****= 10%****= 10,000**

- Example 2 – Splitting Data using Training and Validation Set Approach

**Machine Learning Problem**

**Sentiment Analysis**

**Machine Learning Algorithm**

**Random Forest**

**Total Sample Data**

**10,000 instances**

**Splitting Data using the following Split Ratio**

**Training Set = 80% = 8,000 (7,200 remain after holding out the Validation Set)**

**Testing Set = 20% = 2,000**

**Validation Set = 10% of Training Set = 800**
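The two split schemes in Example 1 and Example 2 can be sketched as follows. The function names, the fixed random seed and the use of a shuffled list are illustrative assumptions, not part of the examples.

```python
import random

def three_way_split(data, train=0.8, validation=0.1, seed=0):
    """Split into Training / Validation / Testing Sets (e.g. 80% / 10% / 10%)."""
    items = list(data)
    random.Random(seed).shuffle(items)            # random split, reproducible via seed
    n_train = round(len(items) * train)
    n_val = round(len(items) * validation)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def split_with_validation_from_training(data, train=0.8, validation_of_train=0.1, seed=0):
    """Split into Training / Testing (e.g. 80% / 20%), then hold out part of Training for Validation."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    training, testing = items[:n_train], items[n_train:]
    n_val = round(len(training) * validation_of_train)
    return training[n_val:], training[:n_val], testing

sizes_1 = tuple(len(part) for part in three_way_split(range(100_000)))
sizes_2 = tuple(len(part) for part in split_with_validation_from_training(range(10_000)))
print(sizes_1, sizes_2)
```

The first call returns Training, Validation and Testing Sets of 80,000, 10,000 and 10,000 instances; the second returns 7,200, 800 and 2,000.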

- Avoiding Overfitting - Training and Validation Set Approach Cont…

**Question**

**How do I know that Model is Overfitting****during Training****?**

**Answer**

**Model is Not Overfitting**

**During Training, if****Training Accuracy****is****increasing****and****Validation Accuracy****is also****increasing****then Model is****not****Overfitting**

**Model is Overfitting**

**During Training, if****Training Accuracy****is****increasing****and****Validation Accuracy****is****decreasing****then Model is Overfitting**
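The check described above can be written as a small helper that scans the accuracy recorded after each epoch. The function name and the accuracy values are hypothetical and used only for illustration.

```python
def first_overfitting_epoch(train_acc, val_acc):
    """Return the first epoch (0-based index) at which Training Accuracy
    increases while Validation Accuracy decreases, or None if never."""
    for epoch in range(1, len(train_acc)):
        training_up = train_acc[epoch] > train_acc[epoch - 1]
        validation_down = val_acc[epoch] < val_acc[epoch - 1]
        if training_up and validation_down:
            return epoch
    return None

# Hypothetical accuracy curves recorded after each epoch
train_acc = [0.60, 0.70, 0.80, 0.88, 0.93]
val_acc = [0.58, 0.66, 0.72, 0.70, 0.67]

epoch = first_overfitting_epoch(train_acc, val_acc)
print(epoch)
```

In practice, training is often stopped near this point, or the weights from the epoch with the best Validation Accuracy are kept.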

- Avoiding Overfitting – Learn Multiple ANNs

**Learn****multiple****ANNs**

**Starting with****different random Weight****settings**

**To make****Predictions****on Unseen Examples**

**Choice No. 1**

**Use the****best****ANN**

**Choice No. 2**

**Use a****Voting Classifier****comprising of****multiple****ANNs**
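Choice No. 2 can be sketched with a simple majority vote. The three "models" below are hypothetical stand-ins for ANNs trained from different random Weight settings; an odd number of voters avoids ties in a binary problem.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

def voting_classifier(models, x):
    """Ask every model for a prediction on x and return the majority class."""
    return majority_vote([model(x) for model in models])

# Hypothetical stand-ins for three separately trained ANNs (+1 = Female, -1 = Male)
models = [
    lambda x: 1 if x[0] > 100 else -1,
    lambda x: 1 if x[3] > 100 else -1,
    lambda x: 1 if sum(x) > 500 else -1,
]

prediction = voting_classifier(models, [200, 30, 70, 175])
print(prediction)
```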

- Avoiding Overfitting – Adding Momentum

**Imagine rolling a ball down a hill**

- Momentum in Backpropagation Algorithm

**For each Weight**

**Remember what was added in the****previous****Epoch**

**In the****current****epoch**

**Add on a****small****amount of the previous Δ**

**The amount is determined by**

**Momentum Parameter (denoted by α)**

**α is taken to be between 0 and 1**

**Caution:**

**May not have****enough****Momentum to**

**get out of Local Minima**

**Also, too much Momentum might carry search**

**Back out of the Global Minimum, into a Local Minimum**
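A minimal sketch of a weight update with Momentum. It assumes the common form Δw(t) = η · g + α · Δw(t−1), where g stands for the usual Backpropagation term (for example Err_j · O_i); the function name and the numeric values are illustrative.

```python
def momentum_step(weight, gradient_term, previous_delta, rate=0.1, alpha=0.9):
    """Return (new_weight, delta) where
    delta = rate * gradient_term + alpha * previous_delta."""
    delta = rate * gradient_term + alpha * previous_delta
    return weight + delta, delta

w, delta = 0.5, 0.0
w, delta = momentum_step(w, gradient_term=1.0, previous_delta=delta)   # first epoch
first = (w, delta)
w, delta = momentum_step(w, gradient_term=1.0, previous_delta=delta)   # second epoch
second = (w, delta)
print(first, second)
```

With the same gradient term in both epochs, the second step is larger than the first, because a fraction α of the previous Δ is added on top.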

- Avoiding Overfitting – Use a Weight Decay Factor

**Using Weight Decay Factor**

**Take a****small****amount****off****every Weight after each Epoch**

**Note that ANNs with****smaller****Weights aren’t as highly fine-tuned (Overfit)**
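Weight decay can be sketched as a multiplicative shrink applied after each epoch. The decay factor of 0.01 is an assumed illustrative value, and the function name is hypothetical.

```python
def apply_weight_decay(weights, decay=0.01):
    """Take a small fraction off every Weight: w <- w * (1 - decay)."""
    return [w * (1 - decay) for w in weights]

weights = [1.0, -2.0, 0.5]
for epoch in range(3):
    # ... the usual Backpropagation updates for this epoch would go here ...
    weights = apply_weight_decay(weights, decay=0.01)
print(weights)
```

Weights that the Training Data keeps reinforcing are restored by the regular updates, while Weights that are not needed gradually shrink towards zero.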

- Strengths and Weaknesses – ANNs

**Strengths**

**Can learn problems with****very complex Numerical Representations**

**Can handle****noisy****Data**

**Execution****time in Application Phase is****fast**

**Good for Machine Learning Problems in which**

**Both Training Examples and Hypothesis (h) have****Numeric Representation**

**Weaknesses**

**Requires a****lot****of Training Time (particularly Deep Learning Models)**

**Computational cost for ANNs (particularly Deep Learning Models) is high**

**Overfitting is a****serious problem****in ANNs**

**ANNs either reject or accept a Hypothesis (h) during Training, i.e. they take a Binary Decision**

**Accept a Hypothesis (h), if it is****consistent****with the Training Example**

**Reject a Hypothesis (h), if it is ****not consistent**** with the Training Example**

## Chapter Summary

- Chapter Summary

**Following Machine Learning Algorithms are based on Symbolic Representation**

**FIND-S Algorithm**

**List Then Eliminate Algorithm**

**Candidate Elimination Algorithm**

**ID3 Algorithm**

**Symbolic Representation – Representing Training Examples**

**Attribute-Value Pair**

**Input**

**Categorical**

**Output**

**Categorical**

**Symbolic Representation – Representing Hypothesis (h)**

**Two Types of Hypothesis (h) Representations**

**Conjunction (AND) of Constraints on Input Attributes**

**Disjunction (OR) of Conjunction (AND) of Input Attributes**

**Note that****both****types of Hypothesis (h) Representations are****Symbolic**

**i.e. based on****Symbols****(Categorical Values)**

**Problem – Symbolic Representation**

**Cannot handle****Machine Learning Problems with****very complex Numeric Representations**

**Solution**

**Non-Symbolic Representations**

**e.g. Artificial Neural Networks**

**Artificial Neural Networks – Summary**

**Representation of Training Examples (D)**

**Attribute-Value Pair**

**Input**

**Numeric**

**Output**

**Numeric**

**Representation of Hypothesis (h)**

**Combination of Weights between Units**

**Weights are****Numeric****values**

**Searching Strategy**

**Exhaustive Search**

**Training Regime**

**Batch Method**

**Artificial Neural Networks (a.k.a. Neural Networks) are Machine Learning Algorithms, which learn from Input (Numeric) to predict Output (Numeric)**

**ANNs are suitable for those Machine Learning Problems in which**

**Training Examples (both Input and Output) can be****represented****as real values (Numeric Representation)**

**Hypothesis (h) can be****represented****as real values (Numeric Representation)**

**Slow****Training Times are****OK**

**Predictive Accuracy is more important than understanding**

**When we have noise in Training Data**

**The diagram below shows the General Architecture of Artificial Neural Networks**

**ANNs mainly have three Layers**

**Input Layer**

**Hidden Layer**

**Output Layer**

**Each Layer contains one or more****Units**

**Main Parameters of ANNs are**

**No. of Input Units**

**No. of Hidden Layers**

**No. of Hidden Units at each Hidden Layer**

**No. of Output Units**

**Mathematical Function at each Hidden and Output Unit**

**Weights between Units**

**ANN will be Fully Connected or Not?**

**Learning Rate**

**ANNs can be****broadly****categorized based on**

**Number of Hidden Layers****in an****ANN Architecture**

**Two Layer Neural Networks (a.k.a. Perceptron)**

**Number of Hidden Layers = 0**

**Multi-layer Neural Networks**

**Regular Neural Network**

**Number of Hidden Layers = 1**

**Deep Neural Network**

**Number of Hidden Layers > 1**

**A Perceptron is a****simple****Two Layer ANN, with**

**One Input Layer**

**Multiple Input Units**

**Output Layer**

**Single Output Unit**

**A Perceptron can be used for Binary Classification Problems**

**We can use Perceptrons to build larger Neural Networks**

**Perceptron has limited learning abilities i.e. fails to learn simple Boolean-valued Functions (e.g. XOR)**

**A Perceptron works as follows**

**Step 1: Random Weights are assigned to edges between Input-Output Units**

**Step 2: Input is fed into Input Units**

**Step 3: Weighted Sum of Inputs (S) is calculated**

**Step 4: Weighted Sum of Inputs (S) is given as Input to the Mathematical Function at Output Unit**

**Step 5: Mathematical Function calculates Output for Perceptron**
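The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the threshold function `step`, the +1 / -1 output coding, the weight range, and the input values are all assumptions chosen for the example.

```python
import random


def step(s, threshold=0.0):
    """Assumed Mathematical Function: +1 if S exceeds the threshold, else -1."""
    return 1 if s > threshold else -1


def perceptron_output(inputs, weights, activation=step):
    # Step 3: Weighted Sum of Inputs (S)
    s = sum(w * x for w, x in zip(weights, inputs))
    # Steps 4-5: S is given to the Mathematical Function, which produces the Output
    return activation(s)


rng = random.Random(0)

# Step 1: random Weights on the edges between Input Units and the Output Unit
weights = [rng.uniform(-0.5, 0.5) for _ in range(3)]

# Step 2: Input is fed into the Input Units (illustrative values)
inputs = [1.0, 0.0, 2.5]

print(perceptron_output(inputs, weights))
```

Because the Weights are random, the Output at this point is arbitrary; training is what moves the Weights towards useful values.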

**Before using ANN to build a Model**

**Convert both Input and Output into Numeric Representation (or Non-Symbolic Representation)**
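One common way to carry out this conversion for categorical attributes is one-hot encoding. The sketch below assumes that technique; the attribute values and class labels are made up for illustration.

```python
def one_hot(value, categories):
    """Encode a symbolic value as a list of 0.0/1.0, one position per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical symbolic Input attribute and Output classes
outlook_values = ["sunny", "overcast", "rain"]
class_labels = ["no", "yes"]

numeric_input = one_hot("overcast", outlook_values)
numeric_output = one_hot("yes", class_labels)
print(numeric_input, numeric_output)
```

Attributes that are already numeric can be passed through unchanged, although in practice they are often rescaled to a common range first.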

**Perceptron – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Exhaustive Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Perceptrons can learn Binary Classification Problems**

**Perceptrons can be combined to make larger ANNs**

**Perceptrons can learn linearly separable functions**

**Weaknesses**

**Perceptrons fail to learn simple Boolean-valued Functions which are not linearly separable**
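The contrast between linearly separable and non-linearly separable functions can be checked with a small experiment. The sketch below assumes the standard perceptron training rule (each Weight is moved by learning rate × (target − output) × input), a bias weight, 0/1 outputs, and zero initial Weights so the run is deterministic; none of these details are fixed by this summary.

```python
def predict(weights, bias, x):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0


def train_perceptron(examples, learning_rate=1.0, max_epochs=100):
    """Return (weights, bias, converged). `converged` is True once a full
    pass over the Training Examples produces no misclassification."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in examples:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * delta
        if errors == 0:
            return weights, bias, True
    return weights, bias, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND)[2])
print("XOR converged:", train_perceptron(XOR)[2])
```

AND is linearly separable, so training reaches an error-free pass. No single line separates the XOR classes, so no setting of the Weights classifies all four examples correctly, however many epochs are allowed.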

**A Regular Neural Network consists of an Input Layer, one Hidden Layer, and an Output Layer**

**A Regular Neural Network can be used for both**

**Binary Classification Problems and**

**Multi-class Classification Problems**

**Regular Neural Network – Summary**

**Representation of Training Examples (D)**

**Numeric**

**Representation of Hypothesis (h)**

**Numeric (Combination of Weights between Units)**

**Searching Strategy**

**Exhaustive Search**

**Training Regime**

**Incremental Method**

**Strengths**

**Can learn both linearly separable and non-linearly separable Target Functions**

**Can handle noisy Data**

**Can learn Machine Learning Problems with very complex Numerical Representations**

**Weaknesses**

**Computational Cost and Training Time are high**

**A Regular Neural Network works as follows**

**Step 1: Random Weights are assigned to edges between Input, Hidden, and Output Units**

**Step 2: Inputs are fed simultaneously into the Input Units (making up the Input Layer)**

**Step 3: Weighted Sum of Inputs (S) is calculated and fed as input to the Hidden Units (making up the Hidden Layer)**

**Step 4: Mathematical Function (at each Hidden Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 5: Weighted Sum of the Hidden Unit outputs is calculated and fed to the Output Units (making up the Output Layer)**

**Step 6: Mathematical Function (at each Output Unit) is applied to the Weighted Sum of Inputs (S)**

**Step 7: Softmax Layer converts the vector of values at the Output Units into probabilities, and the Class with the highest probability is the Prediction of the Regular Neural Network**
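The seven steps above amount to one forward pass through the network. The following sketch assumes a sigmoid as the Mathematical Function at every Hidden and Output Unit, no bias weights, and arbitrary layer sizes and input values; all of these are illustrative choices.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def layer(inputs, weights):
    """weights[j][i] is the Weight on the edge from input i to Unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(inputs, w_hidden, w_output):
    hidden = layer(inputs, w_hidden)    # Steps 3-4
    output = layer(hidden, w_output)    # Steps 5-6
    probs = softmax(output)             # Step 7
    prediction = max(range(len(probs)), key=probs.__getitem__)
    return probs, prediction


rng = random.Random(42)
n_in, n_hidden, n_out = 3, 4, 2

# Step 1: random Weights between Input-Hidden and Hidden-Output Units
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

# Step 2: all Inputs are fed in together
probs, prediction = forward([0.2, 0.7, 1.0], w_hidden, w_output)
print(probs, prediction)
```

The probabilities always sum to 1, and `prediction` is the index of the Output Unit with the highest probability.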

**Regular Neural Network Training Rule is used to tweak Weights when**

**Actual Value is different from Predicted Value**

**How does the Regular Neural Network Training Rule work?**

**Step 1: For a given Training Example, set the Target Output Value of the Output Unit (which is mapped to the Class of the given Training Example) as 1 and**

**Set the Target Output Values of the remaining Output Units as 0**

**Step 2: Feed Training Example E into the Regular Neural Network and get Predictions (Observed Output o(E))**

**Step 3: If Target Output t(E) is different from Observed Output o(E)**

**Calculate the Network Error**

**Back propagate the Error by updating Weights between Output-Hidden Units and Hidden-Input Units**

**Step 4: Keep training the Regular Neural Network until the Network Error becomes very small**
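A compact sketch of this training loop is given below. It assumes sigmoid Units, a squared-error Network Error, no bias weights, and a single made-up Training Example; the summary does not fix any of these, so treat the code as one possible instance of backpropagation rather than the exact rule intended here.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, w_hidden, w_output):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def network_error(target, output):
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


def train_step(x, target, w_hidden, w_output, learning_rate=0.5):
    """One update on one Training Example; returns the error before the update."""
    hidden, output = forward(x, w_hidden, w_output)          # Step 2: o(E)
    # Step 3: error terms at the Output Units, then propagated back to Hidden Units.
    # With real-valued outputs t(E) and o(E) practically always differ,
    # so this sketch updates on every call.
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(output, target)]
    delta_hid = [h * (1 - h) * sum(delta_out[k] * w_output[k][j]
                                   for k in range(len(w_output)))
                 for j, h in enumerate(hidden)]
    for k, row in enumerate(w_output):                       # Output-Hidden Weights
        for j in range(len(row)):
            row[j] += learning_rate * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):                       # Hidden-Input Weights
        for i in range(len(row)):
            row[i] += learning_rate * delta_hid[j] * x[i]
    return network_error(target, output)


rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
w_output = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]

x = [1.0, 0.0]
target = [1.0, 0.0]   # Step 1: the example's Class maps to Output Unit 0

# Step 4: keep training and watch the Network Error shrink
errors = [train_step(x, target, w_hidden, w_output) for _ in range(500)]
print(errors[0], errors[-1])
```

In a real setting the loop would cycle over all Training Examples and stop on a criterion such as a small Network Error on held-out data, rather than after a fixed number of steps.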

**Overfitting is a serious problem in ANNs (particularly Deep Learning algorithms)**

**Four Possible Solutions to overcome Overfitting in ANNs are as follows**

**Training and Validation Set Approach**

**Learn Multiple ANNs**

**Momentum**

**Weight Decay Factor**
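Two of the four solutions above lend themselves to a short sketch: choosing when to stop training from Validation Set errors, and shrinking Weights with a Weight Decay Factor. The function names, the `patience` parameter, and the error values below are hypothetical and only illustrate the idea.

```python
def best_epoch_by_validation(validation_errors, patience=3):
    """Training and Validation Set Approach: return the epoch with the lowest
    validation error seen before it failed to improve for `patience` epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch


def decayed_update(weight, gradient, learning_rate=0.1, weight_decay=0.01):
    """Gradient step with a Weight Decay Factor: besides following the error
    gradient, every Weight is pulled slightly towards zero."""
    return weight - learning_rate * (gradient + weight_decay * weight)


# Made-up validation errors: they fall, then rise as the network starts to overfit
validation_errors = [0.9, 0.7, 0.5, 0.6, 0.65, 0.7, 0.4]
print(best_epoch_by_validation(validation_errors))
print(decayed_update(weight=1.0, gradient=0.0, weight_decay=0.5))
```

The Weights saved at the returned epoch would be the ones kept for the final Model. With a zero gradient, the decayed update simply shrinks the Weight in proportion to its size.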

**ANNs – Strengths and Weaknesses**

**Strengths**

**Can learn problems with very complex Numerical Representations**

**Can handle noisy Data**

**Execution time in Application Phase is fast**

**Good for Machine Learning Problems in which**

**Both Training Examples and Hypothesis (h) have Numeric Representation**

**Weaknesses**

**Requires a lot of Training Time (particularly Deep Learning Models)**

**Computational cost for ANNs (particularly Deep Learning Models) is high**

**Overfitting is a serious problem in ANNs**

**ANNs either reject or accept a Hypothesis (h) during Training i.e. take a Binary Decision**

**Accept a Hypothesis (h), if it is consistent with the Training Example**

**Reject a Hypothesis (h), if it is not consistent with the Training Example**

