Just to make sure: forward prop would be implemented on W matrices/tensors that have already been trained using some deep learning algorithm? i.e., each layer has to find an optimal W and b for its units given the input activation vector? And to achieve this, each layer would have to systematically do feature engineering and then implement some form of gradient descent automatically? What I mean is that forward prop is used to make predictions after the algorithm has found an optimal W and b for each layer.
Hey @Werner_Pierce,
Welcome to the community. You have summarised a lot of information in a few words, where each of those few words could have its own separate discussion, but let me try to point out where you are going right and where you are going wrong.
- Forward Propagation
- Cost Function Computation (using the predicted and true values)
- Back Propagation (computing the gradients and updating the parameters, i.e., weights & bias using various optimisation algorithms such as gradient descent, RMSProp, Adam)
These 3 steps together train a typical neural network. Now when you say “some deep learning algorithm”, I am assuming you are referring to these 3 steps. So, you can clearly see that the forward prop is an essential part of the “deep learning algorithm” that you are referring to.
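To make these 3 steps concrete, here is a minimal NumPy sketch of a training loop for a single-layer (logistic regression) network. The toy data, learning rate, and sizes are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy data: 4 examples, 3 features each, binary labels
X = np.random.randn(4, 3)
y = np.array([[0.], [1.], [1.], [0.]])

W = np.zeros((3, 1))  # weights
b = 0.0               # bias
lr = 0.1              # learning rate

for epoch in range(100):
    # 1. Forward propagation: compute the predictions
    z = X @ W + b
    a = 1 / (1 + np.exp(-z))          # sigmoid activation

    # 2. Cost computation: binary cross-entropy between predicted and true values
    cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    # 3. Back propagation: gradients of the cost w.r.t. W and b,
    #    followed by a plain gradient-descent update
    dz = a - y
    dW = X.T @ dz / len(X)
    db = dz.mean()
    W -= lr * dW
    b -= lr * db
```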
However, once the neural network has been trained using these 3 steps (including forward propagation), we also use forward propagation at the time of inference to find the predicted labels. In fact, at the time of inference, back propagation is not carried out at all, and cost computation may or may not happen depending on the scenario. For instance, if we are carrying out inference on the dev/test set (in which case we have the true labels), we compute the cost just to give us a measure of how well our model is performing. But if we are carrying out inference after the model has been deployed (in which case we don't have the true labels), cost computation can't be performed, unless and until we acquire the labels for the data that the model receives post-deployment.
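As a small illustration, inference on the hypothetical model trained in the sketch above involves only the forward pass; back prop is never invoked:

```python
# Inference after training: forward propagation only, no gradients.
def predict(X_new, W, b):
    z = X_new @ W + b
    a = 1 / (1 + np.exp(-z))      # forward prop only
    return (a > 0.5).astype(int)  # predicted labels

y_pred = predict(X, W, b)
```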
P.S. - A small note here regarding the use of cost as a performance metric: it is not very interpretable, and hence metrics such as accuracy, F1-score, recall, precision, MSE, RMSE, etc., which are designed to be used as performance metrics, are more suitable for judging the model.
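Just as a rough illustration, a few of those metrics computed by hand on the hypothetical predictions from the sketch above (ignoring division-by-zero edge cases for brevity):

```python
# Accuracy, precision, recall, and F1 on the toy dev-set predictions
accuracy = (y_pred == y).mean()

tp = ((y_pred == 1) & (y == 1)).sum()  # true positives
fp = ((y_pred == 1) & (y == 0)).sum()  # false positives
fn = ((y_pred == 0) & (y == 1)).sum()  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```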
Ques - So is the above statement correct?
Ans - Partially, since forward prop is also used to find out the suitable values of W and b.
Stay tuned for the next part
Cheers,
Elemento
Now, let’s come to the second statement,
Now, this is pretty much correct. The aim of the neural network is indeed to find the optimal W and b given the input activation vectors, so that it can perform as well as possible on the task at hand. Now, if you refer to this as the aim of the “layer”, I don't think one would mind, since a layer is a part of the neural network.
Ques - So is the above statement correct?
Ans - I would say Yes.
So, let’s come to the next statement,
Now let’s focus on the word “systematically”. I am not really sure the feature engineering would be “systematic” in the way human beings define this word. This is because neural networks, especially large ones, tend to come up with such complex functions of the input features that they escape human interpretability. Now, in the last decade, there has been some fascinating research trying to unravel the mysterious functions that neural networks come up with, in order to boost their interpretability, but to what extent this has been achieved, you might have to find out for yourself by looking at the latest research focused on increasing the interpretability of neural networks.
Ques - So is the above statement correct?
Ans - I am ambivalent about this. The neural network does follow a system of basic mathematics (linear algebra, activation functions, differentiation, etc.), but, using this system, I guess it escapes the definition of “systematic” as human beings define it.
Let’s move on,
Once again, let’s focus on “automatically”. Gradient descent is done by computing the gradients of the loss function with respect to the parameters and then updating the estimates of the parameters using these gradients and the various optimisation algorithms mentioned above, but frameworks such as TensorFlow and PyTorch do this for us in the background, and we don't have to write any code for this. However, if you implement a neural network from scratch in, say, Python or C, you would have to implement gradient descent yourself.
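Here is a minimal sketch of what TensorFlow does for us in the background; the toy model and batch are made up for illustration:

```python
import tensorflow as tf

# GradientTape records the forward pass so TensorFlow can compute the
# gradients for us; the optimizer then applies the gradient-descent update.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

x_batch = tf.random.normal((4, 3))              # made-up toy batch
y_batch = tf.constant([[0.], [1.], [1.], [0.]])

with tf.GradientTape() as tape:
    y_hat = model(x_batch)             # forward propagation
    loss = loss_fn(y_batch, y_hat)     # cost computation
grads = tape.gradient(loss, model.trainable_variables)             # back propagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))   # parameter update
```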
Ques - So is the above statement correct?
Ans - No, gradient descent has to be implemented, although the popular frameworks do this for us.
Coming to the last statement,
I guess I have covered this already in my first post. I hope this helps.
Cheers,
Elemento
Ah okay, awesome, that makes sense. Big thanks!
Thanks for the reply! What I meant by “automatically” is that if I were to build a framework or a neural network from scratch, I would have to implement, in simple terms (because there is a lot that goes into it), an algorithm that computes the gradients of the loss function and then updates the parameters accordingly for each layer, so that whenever I string layers together and train my model, it seems like it is doing gradient descent automatically, if that makes sense. Or, in this case, TensorFlow does gradient descent for me whenever I give it an input vector to train a model built using its framework.
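In code, I imagine something roughly like this hypothetical layer class, where each layer implements its own backward pass so that a stack of layers trains “automatically” once strung together:

```python
import numpy as np

class Dense:
    """A hypothetical sketch of one layer in a from-scratch framework."""

    def __init__(self, n_in, n_out, lr=0.1):
        self.W = np.random.randn(n_in, n_out) * 0.01
        self.b = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.x = x                 # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients of the loss w.r.t. this layer's parameters
        dW = self.x.T @ grad_out
        db = grad_out.sum(axis=0)
        grad_in = grad_out @ self.W.T   # gradient to pass to the previous layer
        self.W -= self.lr * dW          # gradient-descent update
        self.b -= self.lr * db
        return grad_in
```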
Hey @Werner_Pierce,
Indeed it makes sense, but it really depends on you. If you want your framework not to support automatic backward propagation, you have complete freedom to do so.
Cheers,
Elemento
By “input activation vector”, do you mean the x_train vector?