Intuition of nodes, forward prop, backward prop in a particular layer

I was trying to understand the implementation details of a particular hidden layer l having nl nodes. Intuitively, from a linear algebra point of view, a layer just transforms an input X (nx × m) into Z (nl × m). Or we can say that initially the data points (to which the dependent or target variable y corresponds) are represented in nx dimensions by X, and then a matrix transformation (i.e. σ(W * X + b)) converts them to a representation Z in nl dimensions.
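For example, here is a rough NumPy sketch of that shape change for a single layer (the sizes nx = 4, nl = 3, m = 5 are made up):

```python
import numpy as np

nx, nl, m = 4, 3, 5          # made-up sizes: input features, nodes in layer l, examples

X = np.random.randn(nx, m)   # m data points, each in nx dimensions
W = np.random.randn(nl, nx)  # weights of layer l
b = np.random.randn(nl, 1)   # bias, broadcast across the m columns

Z = W @ X + b                # affine part: shape (nl, m)
A = 1 / (1 + np.exp(-Z))     # sigma = sigmoid here; the same m points, now in nl dimensions

print(X.shape, Z.shape, A.shape)   # (4, 5) (3, 5) (3, 5)
```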

This also connects to a concept Prof. Andrew explains in the ML Specialization: each dimension can analogously be seen as a ‘feature’ before the transformation, and after each layer the dimensions can be seen as somewhat more generalized features. E.g. housing price prediction (6 features: area, # of floors, # of beds, …) → (3 hypothetical features: house locality, ease of connectivity, …).

Then with each layer we keep getting more and more generalized features, until we eventually arrive at our (1-D) approximate hypothesis yhat, which we compare to the actual values of y.

Question: Is my intuition for forward prop headed in the correct direction? Also, it is a bit difficult for me to come up with an intuition for backward prop.

Thank you.

Item Link: https://www.coursera.org/learn/neural-networks-deep-learning/lecture/Y20qP/explanation-for-vectorized-implementation

Yes, at each layer, you perform a linear transformation (really an “affine” transformation) and then apply a non-linear activation function to the output of each neuron. So the equation at layer l is what you showed:

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})

For the first layer, we have A^{[0]} = X. One notational subtlety is that you used * to express matrix multiplication, but please be aware that in Prof Ng’s notation * always means “elementwise” multiply, i.e. the Hadamard product. When he means real dot-product-style multiplication, he just writes the operands adjacent with no explicit operator. In the formulas above, I’ve made it a bit more explicit by using the LaTeX \cdot operator.
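In NumPy terms (just an illustrative sketch, not the course’s assignment code), that distinction shows up as np.dot (or @) for W^{[l]} A^{[l-1]} versus * for elementwise products:

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    """One layer: Z = W . A_prev + b (matrix product), then A = g(Z) elementwise."""
    Z = np.dot(W, A_prev) + b      # np.dot / @, NOT the elementwise *
    return g(Z)

relu    = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# toy sizes: 4 input features, 3 hidden units, 1 output unit, 5 examples
A0 = np.random.randn(4, 5)                        # A^[0] = X
W1, b1 = np.random.randn(3, 4), np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3), np.zeros((1, 1))

A1 = layer_forward(A0, W1, b1, relu)              # shape (3, 5)
A2 = layer_forward(A1, W2, b2, sigmoid)           # shape (1, 5) -- this is yhat
```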

Prof Ng does cover this. Have you gotten all the way through the Week 3 lectures? The point of back prop is to use the derivatives of the cost function to figure out which direction you need to push each of the parameters in order to get a lower cost. This is an iterative process in which you need to take small steps to avoid overshooting.

The intuitive picture most commonly used for Gradient Descent (the iterative process of decreasing the cost) is that the cost is a surface, since it’s a scalar function of all the parameters. Of course we can only visualize that in 3D, so we have 2 input dimensions (x and y) and the output is z. So you’re standing on a surface and you put a soccer ball down at your feet and let go. Which direction is it going to roll? In the direction that most rapidly decreases the cost, right? The direction that gives the steepest descent at that point on the surface. So you take a small step in that best direction and then repeat a few thousand times. :nerd_face:
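As a toy illustration of that picture (a made-up bowl-shaped cost with two parameters, not anything from the course), gradient descent just repeats “take a small step downhill”:

```python
import numpy as np

# Toy "surface": J(w1, w2) = w1^2 + 3*w2^2 (a bowl), so grad J = (2*w1, 6*w2)
def cost(w):
    return w[0] ** 2 + 3 * w[1] ** 2

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])

w = np.array([3.0, -2.0])      # where we "drop the soccer ball"
alpha = 0.1                    # learning rate: the size of each small step

for _ in range(50):
    w = w - alpha * grad(w)    # move in the direction of steepest descent

print(w, cost(w))              # w is now very close to (0, 0), the bottom of the bowl
```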

I’m sure Prof Ng goes through everything I just said someplace in DLS C1. It’s been a while since I watched all the lectures, so I don’t remember exactly where he paints that picture.


Intuitively, the process of ‘learning’ is to get yhat close to y by learning good representations (this is a very loose term). However, from a mathematically rigorous standpoint there are 2 points to note:

  1. The term “generalized feature” is not quantifiable. The hope is that the network learns generalized ‘features’, but there’s no guarantee (even if the model converges to a global optimum of the loss). The learning process only ensures that the ‘features’ are good enough to perform the learning task, but recent experiments on overparameterized models suggest that very high dimensional projections are not necessarily overfits - so the features may be generalizable.
  2. The ‘features’ produced by the intermediate layers aren’t necessarily interpretable. They can be complex transformations of the underlying input features.

Thank you for your answers.

I agree, phrasing it as a “generalized feature” was not a good choice, and I agree with both your points.

Please ignore the training part for now. Let me add an example to illustrate the intuition I have and why it makes sense:


Item Link: https://www.coursera.org/learn/advanced-learning-algorithms/lecture/MsbrF/demand-prediction

Prof. Andrew’s idea in the above slide was that we could ‘possibly’ learn intermediate quantities (let us call them hypothetical features) such as ‘affordability’, ‘awareness’, etc.

Let’s assume that, after model training, we take a single test sample x_test[i] in 4-D space. Initially, this sample is represented as a (4, 1) vector where each dimension is a feature (assuming feature independence).

Now, after training we have some transformation matrix (here, w[1] of shape (4, 3)), which is used for the transformation and produces a[1] of shape (3, 1). For this vector a[1], we do not care what each dimension represents, but it is, in a way, approximately representing (hence the error) the same vector x (4, 1), now as a[1] (3, 1) in 3-D (n[l] dimensions). We can kind of say we’re reducing (or ‘changing’) the number of dimensions necessary to represent the same data.

Small note: the number of hypothetical features in a layer = the number of nodes nl in that layer.

Then the same is repeated for layer 2.
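With w[1] of shape (4, 3) as written above (one column per hidden unit, in the ML Specialization’s convention), the shape bookkeeping would look roughly like this sketch; the numbers are random stand-ins, not learned weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(4, 1)       # one sample with 4 input features

W1 = np.random.randn(4, 3)      # one column per hidden unit ("affordability", "awareness", ...)
b1 = np.zeros((3, 1))
a1 = sigmoid(W1.T @ x + b1)     # (3, 4) @ (4, 1) -> (3, 1): same sample, now in 3 dimensions

W2 = np.random.randn(3, 1)      # layer 2: 3 hypothetical features -> 1 output
b2 = np.zeros((1, 1))
a2 = sigmoid(W2.T @ a1 + b2)    # (1, 1): yhat

print(x.shape, a1.shape, a2.shape)   # (4, 1) (3, 1) (1, 1)
```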

Point 1: Agreed, these hypothetical features don’t come with any guarantee of an optimum, but I am actually focused on the count nl of such features/dimensions.

Point 2: Agreed, these hypothetical features aren’t entirely interpretable in most cases. I used the word ‘entirely’ because we at least get some sense that they initially represent simple features (e.g. edges at different angles or similar features in the initial layers of a face-recognition network) and later more complex ones.

Let’s extend the intuition to convnets. In an object detection problem, the image information is initially stored as intensity values represented in a small number of channels (i.e. the 3 RGB channels, or similar). Later on we represent the same information in some hypothetical features, whose count equals the number of filters applied (multiplied by the spatial resolution, of course). We perform max pooling and similar operations to, in a sense, reduce the dimensions.

Then each of the values in this hypothetical feature space would be considered a dimension itself when the volume computed by the conv layers is fed into the dense network. This is a bit hard to explain in words, but it is quite interesting to imagine how the dimensions might be changing behind the scenes! :blush:

E.g. take an image with resolution (na, nb, nc), where na = width, nb = height and nc = channels. One intuition here: the total number of features is na * nb * nc. If nf filters are applied in a layer, the same information can be represented in these new nf hypothetical features (hence we get an output of shape (na, nb, nf), assuming the spatial size is preserved). But here we are not reducing the dimensions, just changing their representation into new hypothetical features: from pixel intensity in the R, G and B spectrum to a kind of ‘pixel intensity’ in hypothetical dimensions like vertical edges, horizontal edges and so on (notice there are nf such features/dimensions here). I say a ‘kind of’ intensity because, after applying the filters, I don’t think they represent intensity anymore.
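A naive NumPy sketch of that channel change (all sizes are made up: na = nb = 32, nc = 3, nf = 8, 3×3 filters; stride 1 with zero ‘same’ padding is assumed so the spatial size stays (na, nb)):

```python
import numpy as np

na, nb, nc = 32, 32, 3          # width, height, channels (made-up resolution)
nf, f = 8, 3                    # nf filters, each spanning an f x f x nc window

image = np.random.rand(na, nb, nc)          # "pixel intensity" in R, G, B
filters = np.random.randn(f, f, nc, nf)     # one (f, f, nc) filter per output feature

# 'same' convolution (stride 1, zero padding) so the spatial size is preserved
pad = f // 2
padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
out = np.zeros((na, nb, nf))
for i in range(na):
    for j in range(nb):
        patch = padded[i:i + f, j:j + f, :]                  # (f, f, nc) window
        out[i, j, :] = np.tensordot(patch, filters, axes=3)  # one value per filter

print(image.shape, out.shape)   # (32, 32, 3) -> (32, 32, 8): nc channels -> nf "hypothetical" ones

# Flattening for the dense layers: every (i, j, filter) value becomes its own dimension
flat = out.reshape(-1, 1)
print(flat.shape)               # (32 * 32 * 8, 1) = (8192, 1)
```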

Intuitively yes, your understanding is correct.

Everything about using the intermediate-layer activations as ‘training features’ (assuming we converge to the global optimum) that serve as the ‘input layer’ to the downstream neural network architecture should work, per the KKT conditions. Loosely, this is like the “optimal path” condition of dynamic programming.

In simple terms, assume we “magically know a transformation” of the input variables that ultimately leads to “features” like a_1 = “affordability”, a_2 = “awareness” and a_3 = “perceived quality” produced by the original network. Assume we truncate the neural network as shown below and train the weights “w” (starting from a random 3x1 matrix) with the same loss and the same activation function:

Assuming the solution converges to a global optimum, the weights w learnt by the truncated network should be exactly equal to the weights learnt by the original neural network.

And finally, yes - ultimately, if the final layer is a softmax or a logit, and if you “magically know a transformation” that converts your input (say, an image, text, video, …) into the pre-final-layer activation, learning the final layer is a simple “logistic regression” optimization task. Also, yes - the hypothetical dimensions extracted using your procedure for the convnet can be treated as the input vector space for the downstream fully connected layer(s). In fact, a deep conv net is just an abstraction of a deep multi-layer neural network with “weight constraints”.
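A toy sketch of that last point, assuming we somehow already have the pre-final activations A for m samples (here just random stand-ins with made-up labels): training only the final sigmoid layer reduces to ordinary logistic regression by gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Stand-ins for the "magically known" pre-final activations (3 features, m samples)
# and binary labels; in reality these would come from the trained upstream layers.
m = 200
A = np.random.rand(3, m)
y = (A.sum(axis=0) > 1.5).astype(float).reshape(1, m)   # made-up, linearly separable labels

w, b, alpha = np.zeros((3, 1)), 0.0, 0.5

# Training just the final layer = logistic regression on the activations A
for _ in range(2000):
    yhat = sigmoid(w.T @ A + b)          # (1, m) predictions
    dz = yhat - y                        # gradient of cross-entropy loss w.r.t. z
    w -= alpha * (A @ dz.T) / m
    b -= alpha * np.mean(dz)

accuracy = np.mean((sigmoid(w.T @ A + b) > 0.5) == (y > 0.5))
print(accuracy)                          # should be well above chance on this toy data
```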