Gentle introduction to Neural Networks — Part 1 (Feedforward)
This is what a neural network looks like —
Let’s start with structured data and understand how a NN really works.
The objective is to classify the data into one of 2 classes.
Overall architecture — 1 I/P layer, 1 hidden layer, 1 O/P layer
Let’s break it down layer by layer.
Input Layer — through which the data is passed. One thing to note here: any permutation and combination of the inputs can be sent to the hidden layer, for example (2, 4) or (2, 1, 9). This is what the network learns over time — which weighted combination of inputs leads to the correct output. On its own this weighted combination is still linear; the non-linearity comes from the activation function in the hidden layer (more on this below).
i.e., z = 2*w1 + 4*w2 + b (note that each neuron adds a single bias term b, not one bias per input).
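To make this concrete, here is a tiny Python sketch of one neuron’s weighted sum (the function name and weight values are hypothetical, chosen just for illustration):

```python
def pre_activation(inputs, weights, bias):
    # z = x1*w1 + x2*w2 + ... + b (a purely linear combination)
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Example: inputs 2 and 4 with arbitrarily chosen weights and bias.
z = pre_activation([2, 4], [0.5, -0.25], 0.1)
print(z)  # 2*0.5 + 4*(-0.25) + 0.1 = 0.1
```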
Let’s look at the sigmoid function, which is used in this network.
Sigmoid — σ(z) = 1 / (1 + e^(-z)). It squashes any real-valued input z into the range (0, 1).
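A minimal Python sketch of sigmoid (the function name is my own):

```python
import math

def sigmoid(z):
    # Squashes any real-valued input into the open interval (0, 1).
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(5))   # ~0.993 (large inputs saturate toward 1)
print(sigmoid(-5))  # ~0.007 (very negative inputs saturate toward 0)
```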
Activation functions play 2 major roles in the hidden layer:
1. They introduce non-linearity into the data.
2. They squash the data into a fixed range — similar in spirit to normalizing or standardizing it.
Normalization — scaling data down to a certain range.
Suppose we have data dispersed from 10 to 1000 — [10, 20, 30, 40, 50, 100, 1000, 200].
Since the distance between min and max is 1000 - 10 = 990, the network might not learn the pattern properly.
If we scale the data down (scaling doesn’t change the meaning of the data), the network only has to learn within a smaller range, so it can learn the relationships in a much better and faster way.
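For illustration, a minimal sketch of min-max scaling applied to the example data above:

```python
data = [10, 20, 30, 40, 50, 100, 1000, 200]

lo, hi = min(data), max(data)
scaled = [(x - lo) / (hi - lo) for x in data]  # every value now lies in [0, 1]

print(scaled)  # [0.0, 0.0101..., 0.0202..., 0.0303..., 0.0404..., 0.0909..., 1.0, 0.1919...]
```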
Next we will get the outputs from the hidden layer —
Suppose the weights are initialized (randomly at first) and O1, O2, O3 are the hidden-layer outputs.
Since sigmoid always scales data to 0–1, let’s assume we got O1 = 0.5, O2 = 0.2 and O3 = 0.3,
and the output-layer weights w11, w12, w13 = 0.1, 0.2, 0.3.
O1 * w11 + O2 * w12 + O3 * w13 = 0.5*0.1 + 0.2*0.2 + 0.3*0.3 = 0.05 + 0.04 + 0.09 = 0.18
This will be passed to the O/P layer, which has a sigmoid activation function and will return a probability.
Replacing z with 0.18 gives sigmoid(0.18) ≈ 0.54; for the sake of the running example, suppose we got 0.7.
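Putting those numbers together in a short sketch (variable names are my own):

```python
import math

O = [0.5, 0.2, 0.3]  # hidden-layer outputs O1, O2, O3
w = [0.1, 0.2, 0.3]  # output-layer weights w11, w12, w13

z = sum(o * wi for o, wi in zip(O, w))  # 0.18
p = 1 / (1 + math.exp(-z))              # sigmoid turns z into a probability

print(z, p)  # 0.18 ~0.545
```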
What does this mean?
Hypothetically, the probability of Class 2 = 0.7 and the probability of Class 1 = 1 - 0.7 = 0.3.
Probability of Class 2 > probability of Class 1.
The expected class was Class 1, but the network predicted Class 2, which is a wrong prediction.
Now we will try to find out the loss.
Note: “loss” is used for an individual example, while “cost” is used for a batch.
In this first iteration, the network had no idea of the right weights or biases; they were initialized randomly.
There are a lot of loss functions available, like L1 loss, L2 loss, Huber loss, and hinge loss; for a binary classification problem like this one, binary cross-entropy is a common choice.
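As an illustration (not necessarily the loss intended here), two common loss computations for this example, with the true class Class 1 encoded as y = 0 and the predicted P(Class 2) = 0.7:

```python
import math

y, p = 0.0, 0.7  # true label (Class 1 encoded as 0) and predicted P(Class 2)

l2_loss = (y - p) ** 2  # squared (L2) loss
bce_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # binary cross-entropy

print(l2_loss)   # 0.49
print(bce_loss)  # ~1.204 (large, because the prediction was badly wrong)
```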
This whole iteration is called forward propagation.