While fitting the data, we input feature matrix X and target vector y but we do not specify the outputs for the hidden layers. How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?
There are no outputs for the hidden layers; that's why they're called "hidden layers".
The process of computing the gradients for the hidden layer weights uses a mathematical method called "backpropagation of errors".
This calculation is built into the TensorFlow layer objects.
Hi @Naman_Chhibbbar, great question!
The way it works is by using backpropagation. In this context, the output values of the hidden layers are produced by the process itself; the focus of the model is not on the hidden layers but on the mapping from inputs to outputs. Hidden layers can be seen as intermediate steps that allow optimizing the weights of your model.
Please let me know if this answers your question
Hello!
Firstly, thanks for the compliment and your time. Your answer certainly brings some clarity to my doubt, but I still can't understand it completely. I'm still not sure what the cost function in a neural network is, or how we optimise it.
It would be helpful if you could provide an explanation for this or some resources which I can explore.
Thanks again!
This is covered in more detail in Week 2.
These concepts are very important. If in doubt, feel free to rewatch the course or ask any open question in this forum.
A cost function provides a metric you can use to drive your optimization. When fitting a model you actually want to minimise the cost (the model error). See also:
Gradient descent is a powerful method for the above-mentioned optimization when fitting your model. It is really well explained by Andrew Ng in this video:
Also this thread might be interesting for you:
see also this source.
Happy learning!
Best regards
Christian
Yes. I also strongly suggest using ChatGPT; it is especially good at answering these types of questions, plus you can customize the answer in any format you want.
I hope it helps
I strongly recommend against using any chatbot tools, as you have no way to validate whether the answer is correct or a confabulation.
Here's how I would think about this.
Neural networks are created to make predictions, right? Take an input matrix, run it through some matrix computations (or transformations) and produce a resultant matrix. That resultant matrix, the predictions, is often called \hat{y} (I'm deliberately avoiding the word output for now).
If we know in advance what the prediction should be, we can compare what the prediction should be with what it actually is to give us a measure of prediction error. Or loss. In these courses we typically call the total of the error or loss for a collection of predictions cost. The error is calculated by comparing the known correct values, the y, with the predicted values, the \hat{y}. Conceptually, error = (y - \hat{y}). The cost is computed by aggregating the loss of many predictions, and often involves an average. So you end up with something like \frac{\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)}{n}, but be aware that there are many options and specifics vary depending on the problem under analysis.
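The aggregation described above can be sketched in a few lines. This is a minimal example with made-up numbers, using the mean-squared-error form mentioned (the names y and y_hat are just illustrative):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # known correct values (labels)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions

errors = y - y_hat                       # per-example error, y - y_hat
cost = np.mean(errors ** 2)              # mean squared error: one common choice
print(cost)                              # 0.375
```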
While it is true that matrix operations at every layer of a neural network produce resultant matrices, we only measure error at the last layer because that is our set of predictions; the output. Forward propagation produces one final resultant matrix, the output, which during training can be compared with known correct values (the ground truth or labels). If the computed error, or loss, or cost, is higher than desired, steps can be taken to adjust parameters and make a new prediction. The steps of adjusting the parameters are what is known as the backpropagation mentioned in the replies above. The increment of adjustment, both magnitude and direction, is driven by the exact expression of the cost function and its partial derivatives. It's these partial derivatives that play a key role in the minimization of the cost.
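The adjust-and-predict-again loop described above can be sketched for a single parameter. This is a hedged illustration with invented numbers, not how any framework implements it; it just shows the partial derivative of the cost driving the updates:

```python
# Minimal sketch: adjust one parameter w to minimize a squared-error cost.
# The numbers are invented; x is one input, y its known correct value.
x, y = 2.0, 10.0
w = 0.0                          # initial parameter guess
learning_rate = 0.1

for step in range(50):
    y_hat = w * x                # forward pass: make a prediction
    cost = (y - y_hat) ** 2      # measure the error at the output
    grad = -2 * x * (y - y_hat)  # partial derivative of the cost w.r.t. w
    w -= learning_rate * grad    # adjust the parameter, then predict again

print(round(w, 4))   # 5.0, since 5.0 * 2.0 reproduces y exactly
```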
Note that all this is true for machine learning in general, regardless of the implementation technology. That is TensorFlow, Keras, PyTorch, etc all have to accomplish this same process and they do it in largely similar ways. In some of the deeplearning.ai courses you implement all these computations and components directly in Python to get a better feel for how they work. Maybe come back to this thread after completing a few more of them, and see if the picture is clearer. Let us know what you find out!
Q: How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?
ChatGPT answer:
TensorFlow uses a technique called backpropagation to calculate the cost (also known as the loss or error) for neurons inside hidden layers without the output and optimize it.
During the forward pass of training, the input data is fed through the neural network, and the output is calculated. The cost function is then applied to the output, and the error is computed as the difference between the predicted output and the actual output.
During the backward pass, the error is propagated back through the network, starting from the output layer and moving towards the input layer. The weights and biases of the neurons inside the hidden layers are updated based on the error calculated for the output layer. This process is repeated for each training example in the dataset.
The backpropagation algorithm uses a technique called the chain rule to compute the gradients of the cost function with respect to the weights and biases of each neuron in the hidden layers. These gradients are used to update the weights and biases in a way that minimizes the cost function.
The optimization process typically involves using an algorithm such as gradient descent, which adjusts the weights and biases in the direction of the steepest descent of the cost function. This process continues until the cost function converges to a minimum value, indicating that the network has learned to make accurate predictions for the given input data.
Maybe I'm hallucinating, but that's a pretty good answer, no? Think maybe deeplearning will stop using humans to answer forum questions and just use LLM chat bots from now on?
Nope.
Forward propagation computes the cost.
Backpropagation computes the gradients.
Exactly. It's not (yet?) 100% accurate. So its responses should be read critically/skeptically and validated by comparing with one's own knowledge and that gleaned from other sources. Assimilate what is useful, discard the rest. If one doesn't know what one doesn't know, a ChatGPT response might serve as a good jumping off point. Where did all those assertions come from? Can I find corroboration? Which source do I trust more?
The library's own documentation can be a great place to look. The way I usually use ChatGPT is to go to the main source of the information (if I know what I am looking for) and ask ChatGPT for help if I don't understand something. I usually ask things like "explain this to a 9-year-old", "summarize this text", or "create bullet points from this text".
I hope this helps!
I realize this question is a couple months old. But I don't see that it got an answer.
Backpropagation computes the gradient
But isnât the gradient just the derivative of the cost?
If there are 100 training examples, and 25 units in the first layer, how are the 25 weight vectors calculated if not by using a cost function? And how do you get the cost function without the original supervision signal?
I am in week 3 btw and I have been through the videos of earlier weeks; I'm not seeing this part explained. We send the number of hidden units into the TF model, but it's not clear to me how this calculation happens. (That said, I accept the possibility that it was explained and I missed it.)
Yes, the gradients are the partial derivatives of the cost equation, with respect to each weight value.
In a NN, the method is called "backpropagation of errors". The gradients are first computed at the output layer (where we have labels), and these errors are applied to the hidden layer (where we don't have any labels).
There is calculus that shows how this process works. It's complicated.
TensorFlow automates this process for us, as its layer classes already include code to compute the gradients.
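One way to convince yourself that the calculus does what it claims, without trusting any framework's internals, is to compare an analytic gradient against a finite-difference approximation. A small sketch (the cost function and numbers here are invented for illustration):

```python
import numpy as np

def cost(w):
    # A made-up scalar cost: mean squared error of a tiny one-weight model.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    return np.mean((w * x - y) ** 2)

def analytic_grad(w):
    # Derivative of the cost above with respect to w, done by hand.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    return np.mean(2 * x * (w * x - y))

w = 0.5
eps = 1e-6
# Central finite difference: nudge w both ways and see how the cost changes.
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(abs(numeric - analytic_grad(w)) < 1e-4)   # True: the two agree
```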
Hello @Aaron_Newman,
To supplement this discussion, I am sharing with you some of the formulae that are discussed in the Deep Learning Specialization, which are the courses that really go into deeper neural networks.
I want us to focus on just the 1st and the 4th equations.

Superscript [L] refers to the last layer (output layer), [L-1] the second last (a hidden layer, which is the discussion in this thread)

The 1st is the gradient in the output layer, which looks pretty simple: the difference between the predictions and their true labels. This simple form is inherited from, for example in a regression problem, the Mean Squared Error, which is the square of the same difference.

The 4th is the gradient in the hidden layer, which looks much more complex; it is the result of back propagation (that is, the result of a chain of derivative computations).

Gradient of something tells us how to update that something in order to minimize the cost. If the 1st tells us that the update is driven by the difference, then the 4th tells us that the update is driven by a much more complicated form that is derived through propagating the error from the Lth layer to the (L-1)th layer.

Our dataset has only labels for the output layer because it is what we want from the neural network. We don't and can't care about what the hidden layers should give us, and we don't have labels for those hidden layers. However, by mathematics or by back propagation, we know how each layer (including hidden layers) should change in order for the output layer to predict something closer to the labels.
Cheers,
Raymond
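To make the 1st and 4th equations discussed above concrete, here is a hedged NumPy sketch of one backward pass through a tiny invented 2-layer network with sigmoid activations. The shapes and values are made up; the point is only that the hidden-layer gradient dZ1 is derived entirely from the output-layer gradient dZ2, with no hidden-layer labels anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny made-up network: 3 inputs -> 4 hidden units -> 1 output, 5 examples.
X = rng.standard_normal((3, 5))
Y = rng.integers(0, 2, (1, 5)).astype(float)   # labels exist only for the output
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

# Forward propagation
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# "1st equation": output-layer gradient is just predictions minus labels
dZ2 = A2 - Y
# "4th equation": hidden-layer gradient is propagated back through W2,
# scaled by the derivative of the hidden layer's sigmoid activation
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)

print(dZ1.shape)   # (4, 5): one gradient entry per hidden unit per example
```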
This is a good question. I notice he just goes to the weights in the hidden layers, and doesn't discuss how they got there. The inputs to the hidden layers are x, the weight vectors, and the bias.
The outputs for the hidden layers in TensorFlow are calculated by the Dense layer.
According to the documentation, the output is: activation(np.dot(input, kernel) + bias). Input here is the X for layer 1 (for other layers it is just the output of the previous layer). Activation is a function like sigmoid or ReLU that you provide (the default is linear, f(x) = x). dot is the dot product. Kernel is the weight matrix (the W values) generated by the system: it uses the input shape to determine the size of the kernel, and the initial values come from some type of probability distribution.
There are other types of layers but this is the layer that is used for the first few weeks of this class.
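The documented formula for the Dense output can be reproduced by hand. A minimal sketch with invented shapes (a real Keras layer draws the kernel from a random initializer; all-zero weights are used here only so the result is predictable):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0],        # 3 examples, 2 features each
              [0.5, -1.0],
              [3.0, 0.0]])
kernel = np.zeros((2, 4))        # weight matrix: (input features, units)
bias = np.zeros(4)               # one bias per unit

# The documented Dense computation: activation(np.dot(input, kernel) + bias)
out = sigmoid(np.dot(X, kernel) + bias)
print(out.shape)   # (3, 4): one row per example, one column per unit
print(out[0, 0])   # 0.5, i.e. sigmoid(0), since the weights are all zero
```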