Why is the activation function result (y) of Layer 1 used as a parameter (x) in Layer 2?

Could anyone help me understand the logic of passing the activation (a) from Layer 1 as a parameter to a function in Layer 2?

I’ll try to explain what confuses me.

  1. Every neuron on every layer uses the same activation function. For simplicity, let it be f(x) = wx + b.
  2. In a plot, ‘x’ is on the X axis and f(x) is on the Y axis. So when we use the function, we calculate a Y value from an X parameter.
  3. BUT, in neural networks, we then pass the f(x) (or Y) value as the X parameter to the next layer of the network. So a neuron in Layer 2 takes Y_L1, puts it in the place of X_L2, and then calculates Y_L2.

This flipping of Y to X doesn’t make any sense to me. I would appreciate it if someone could explain the logic behind it.



The layers are chained together.

  • X is the input to the first layer. Its output is A1.
  • Then A1 is the input to the next layer. Its output is A2.
  • Repeat this process for each layer.
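
The chaining above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the weights, biases, layer sizes, and the helper name `dense` are made up for the example, and a sigmoid is assumed as the activation. The point it shows is that `a1` is simply a list of numbers, so Layer 2 treats it exactly the way Layer 1 treated `x`.

```python
import math

def sigmoid(z):
    """Standard logistic function, used here as the assumed activation."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One layer: each neuron computes sigmoid(w . inputs + b).

    `dense` is a hypothetical helper name, not a library function.
    """
    return [
        sigmoid(sum(w * v for w, v in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# Made-up values, chosen only for illustration.
x = [0.5, -1.0]                                # 2 input features
W1 = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]    # Layer 1: 3 neurons, 2 weights each
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.8, 0.9]]                        # Layer 2: 1 neuron, 3 weights
b2 = [0.05]

a1 = dense(x, W1, b1)    # output of Layer 1 (3 numbers)
a2 = dense(a1, W2, b2)   # a1 is the input of Layer 2 (1 number out)

print(a1)
print(a2)
```

Nothing is "flipped" here: each layer is its own function with its own input and output, and the network is the composition of those functions.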

I understand the “picture”: X comes in, A1 comes out; A1 comes in, A2 comes out.
But I can’t understand the logic and the sense of it. I don’t believe the NN builders just accidentally bumped many layers together and saw that it somehow gives correct results. They put some logic into the chain structure, and I’m trying to understand that logic.


They started with no hidden layers, that’s just simple regression.

Then someone had the idea that by adding a hidden layer that includes a non-linear function, the hidden layer could learn non-linear combinations of the input features.
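
A small sketch of why the non-linear function matters, using one-neuron layers with made-up weights (the names `linear`, `stacked_linear`, `collapsed`, and `stacked_relu` are hypothetical, and ReLU is assumed as the non-linearity). If every layer were just f(x) = wx + b, two stacked layers would collapse into a single line, so the extra layer would add nothing. With a non-linearity in between, the result is no longer a straight line.

```python
def linear(w, b):
    """Return the function x -> w*x + b for one single-input neuron."""
    return lambda x: w * x + b

def relu(z):
    """Rectified linear unit, assumed here as the non-linear activation."""
    return max(0.0, z)

# Made-up weights, chosen only for illustration.
layer1 = linear(2.0, -1.0)    # a1 = 2x - 1
layer2 = linear(-3.0, 0.5)    # a2 = -3*a1 + 0.5

def stacked_linear(x):
    """Two purely linear layers chained together."""
    return layer2(layer1(x))

def collapsed(x):
    """The same thing as ONE linear layer: (w2*w1)*x + (w2*b1 + b2)."""
    return (-3.0 * 2.0) * x + (-3.0 * -1.0 + 0.5)

def stacked_relu(x):
    """Same two layers, but with ReLU applied to Layer 1's output."""
    return layer2(relu(layer1(x)))

xs = [-1.0, 0.0, 1.0, 2.0]
print([stacked_linear(x) for x in xs])   # identical to the collapsed line
print([collapsed(x) for x in xs])
print([stacked_relu(x) for x in xs])     # flat, then falling: not one line
```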

Once that was seen to work, it’s a simple matter for someone to ask “If one hidden layer is good, then what happens if we add more hidden layers?”.


Thank you @TMosh for the answers. I guess I understand it now. Could you confirm that the below is correct?

  1. There is no flipping between X and Y. Layer 1 learns from the input dataset (which is why the data goes into the X parameter), and Layer 2 learns from the data produced by the previous layer (a1). So each subsequent layer looks for the best weights for the output of the previous layer. This improves the final prediction, because each layer builds on the findings of the previous one.

  2. A layer with only 3 neurons would never be used in a real case. Looking for deep dependencies with just 3 data points doesn’t make sense. But if there are tens or hundreds of neurons, the next layer gets tens or hundreds of values, and statistics can draw valuable conclusions from them.

Did I get it right?


Yes, I agree with your summaries.
