Gentle introduction to Neural Networks — Part 1 (Feedforward)

Shubham Kanwal
Jun 2, 2021
https://playground.tensorflow.org

This is what a neural network looks like:

(Network architecture visualized with Netron)

Let’s start with structured data and understand how a neural network really works.

The objective is to classify each record into one of two classes.

Overall architecture — 1 input (I/P) layer, 1 hidden layer, and 1 output (O/P) layer.

Class 2 is the prediction made by the network for the 1st record. This is a wrong prediction, which the network learns to correct over time through backpropagation.

Let’s break it down layer by layer.

Input Layer - The layer through which data is passed into the network. One thing to note here is that any permutation and combination of the inputs can be sent to the hidden layer, for example 2,4 or 2,1,9, etc. What the network learns over time is which combination of inputs (through the weights) leads to the correct output. Non-linearity is then introduced on top of these combinations by the activation functions in the hidden layer.

Let’s assume the values 2 and 4 will be passed to the 1st neuron of the hidden layer. At the same time, weights and a bias (error term) are passed along with the inputs. These are learnable parameters which the network learns over time.
So the total input that goes into the 1st neuron of the hidden layer is ∑(wi * xi) + b,
i.e., (2*w1 + 4*w2 + b) = 2w1 + 4w2 + b.
The value coming out of the input layer, 2w1 + 4w2 + b, is then passed to an activation function. There are many activation functions available today (ReLU, tanh, sigmoid, and more).
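Below is a minimal sketch of this pre-activation step for the 1st hidden neuron, assuming the example inputs 2 and 4 and made-up values for the weight and bias parameters (in the real network these are initialized randomly and then learned):

```python
# Pre-activation of one hidden neuron: z = w1*x1 + w2*x2 + b
x = [2.0, 4.0]     # example inputs from the article
w = [0.5, -0.25]   # hypothetical weights (randomly initialized in practice)
b = 0.1            # hypothetical bias

z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(z)           # 2*0.5 + 4*(-0.25) + 0.1 = 0.1
```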
Let’s look at the sigmoid function, which is used in this network.

Sigmoid -

It gives you an output in the range 0 to 1, no matter what z is, even if z goes to plus or minus infinity.
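A quick sketch of the sigmoid function, just to show that even extreme values of z stay inside the 0 to 1 range:

```python
import math

def sigmoid(z):
    # 1 / (1 + e^-z): maps any real z into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))      # 0.5
print(sigmoid(100))    # ~1.0 (saturates near 1 for large positive z)
print(sigmoid(-100))   # ~0.0 (saturates near 0 for large negative z)
```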

Activation functions play 2 major roles in the hidden layer:

1. They introduce non-linearity into the data.
2. They normalize or standardize the data (squash it into a fixed range).

Normalization — scaling data down to a certain range.
Suppose we have data dispersed from 10 to 1000 — [10, 20, 30, 40, 50, 100, 1000, 200].
Since the distance between the min and the max is 1000 − 10 = 990, the network might not learn the pattern properly.
If we scale the data down (scaling doesn’t change the meaning of the data), the network has to learn within a smaller range, so it will be able to learn the relationships in a much better and faster way.
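As an illustration, here is a simple min-max scaling of the example data above into the 0 to 1 range (one common way to normalize; inside the network, the sigmoid activation provides a similar squashing effect):

```python
data = [10, 20, 30, 40, 50, 100, 1000, 200]

lo, hi = min(data), max(data)          # 10 and 1000
scaled = [(x - lo) / (hi - lo) for x in data]

print(scaled)
# [0.0, 0.0101..., 0.0202..., 0.0303..., 0.0404..., 0.0909..., 1.0, 0.1919...]
```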

Next, we will get the outputs from the hidden layer:

Suppose the weights are initialized (they will be random at first) and O1, O2, O3 are the outputs of the three hidden neurons.
Since sigmoid always scales data to 0–1, let’s assume we got O1 = 0.5, O2 = 0.2 and O3 = 0.3,
and w11, w12, w13 = 0.1, 0.2, 0.3.

O1 * w11 + O2 * w12 + O3 * w13 = 0.5*0.1 + 0.2*0.2 + 0.3*0.3 = 0.05 + 0.04 + 0.09 = 0.18
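A small sketch of this step with the numbers assumed above (the hidden outputs and weights are the example values, not anything learned):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

O = [0.5, 0.2, 0.3]   # hidden-layer outputs assumed above
w = [0.1, 0.2, 0.3]   # w11, w12, w13 assumed above

z = sum(o * wi for o, wi in zip(O, w))
print(z)              # 0.18
print(sigmoid(z))     # ~0.545, the probability the O/P layer will produce (next step)
```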

This will be passed to the O/P layer, which has a sigmoid activation function and will return a probability.

Formula of the sigmoid function: σ(z) = 1 / (1 + e^(−z))

Replacing z with 0.18, we get roughly 0.54.
What does this mean?

The probability of Class 2 ≈ 0.54 and the probability of Class 1 = 1 − 0.54 ≈ 0.46.

Probability of Class 2 > Probability of Class 1

The expected class was Class 1, but the network predicted Class 2, which is a wrong prediction.

Now we will try to find out the loss.
Note: “loss” is usually used for an individual example, while “cost” is used for batches.

In the first iteration, which we just saw, the network had no idea of the right weights or biases; those were initialized randomly.

There are a lot of loss functions available, like L1 loss, L2 loss, Huber loss, and Hinge loss.
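As a small illustration, here is the L2 (squared-error) loss for the prediction above, assuming the target probability of Class 2 is 0 because the record actually belongs to Class 1. This is just one of the losses listed, not necessarily the one this network would use:

```python
def l2_loss(y_true, y_pred):
    # Squared error between the target and the predicted probability
    return (y_true - y_pred) ** 2

y_true = 0.0    # the record actually belongs to Class 1, so P(Class 2) should be 0
y_pred = 0.54   # probability of Class 2 from the forward pass above

print(l2_loss(y_true, y_pred))   # 0.2916, the error that backpropagation will try to reduce
```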

This whole iteration is called forward propagation.
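Putting it all together, below is a minimal sketch of one full forward pass through the 2-input, 3-hidden-neuron, 1-output network described above. All weights and biases here are made-up values standing in for the random initialization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    # Hidden layer: weighted sum plus bias for each neuron, then sigmoid
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(W_hidden, b_hidden)]
    # Output layer: weighted sum of hidden outputs plus bias, then sigmoid gives P(Class 2)
    z_out = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(z_out)

x = [2.0, 4.0]                                      # the example record
W_hidden = [[0.5, -0.25], [0.1, 0.2], [-0.3, 0.4]]  # hypothetical hidden-layer weights
b_hidden = [0.1, 0.0, -0.1]                         # hypothetical hidden-layer biases
w_out = [0.1, 0.2, 0.3]                             # hypothetical output-layer weights
b_out = 0.0                                         # hypothetical output-layer bias

p_class2 = forward(x, W_hidden, b_hidden, w_out, b_out)
print("P(Class 2) =", p_class2)   # 1 - p_class2 is P(Class 1)
```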
