# How does TensorFlow compute cost in hidden layers?

While fitting the data, we input feature matrix X and target vector y but we do not specify the outputs for the hidden layers. How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?

There are no outputs for the hidden layers; that's why they're called "hidden layers".

The process of computing the gradients for the hidden layer weights uses a mathematical method called "backpropagation of errors".

This calculation is built into the TensorFlow layer objects.


Hi @Naman_Chhibbbar, great question!

The way it works is by using backpropagation. In this context, the output values of the hidden layers are determined by the training process itself; the focus of the model is not on the hidden layers but on the mapping from inputs to outputs. Hidden layers can be seen as intermediate steps that allow optimizing the weights of your model.


Hello!

Firstly, thanks for the compliment and your time. Your answer certainly brings some clarity to my doubt, but I still can't understand it completely. I'm still not sure what the cost function in a neural network is and how we optimise it.

It would be helpful if you could provide an explanation for this or some resources which I can explore.

Thanks again!

This is covered in more detail in Week 2.


These concepts are very important. If in doubt, feel free to re-watch the course or ask any open question in this forum.

A cost function provides a metric you can use to improve your optimization. When fitting a model you actually want to minimise the cost (the model error). See also:

Gradient descent is a powerful method for the above-mentioned optimization when fitting your model. It is really well explained by Andrew Ng in this video:
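As a toy illustration of the idea, here is gradient descent on a made-up one-dimensional cost function (the function and learning rate are illustrative assumptions, not anything from the course):

```python
# Minimize J(w) = (w - 3)^2 by repeatedly stepping against the gradient.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)         # dJ/dw, the slope of the cost at the current w
    w -= learning_rate * grad  # step downhill
# After 100 steps, w has converged very close to the minimizer, w = 3.
```

The same update rule, applied to every weight of a network at once, is what "fitting" does under the hood.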

Also this thread might be interesting for you:

Happy learning!

Best regards
Christian


Yes. I also strongly suggest using ChatGPT; it is especially good at answering these types of questions, plus you can customize the answer in any format you want.

I hope it helps

I strongly recommend against using any chatbot tools, as you have no way to validate whether the answer is correct or a confabulation.

Neural networks are created to make predictions, right? Take an input matrix, run it through some matrix computations (or transformations) and produce a resultant matrix. That resultant matrix, the predictions, is often called \hat{y} (I'm deliberately avoiding the word "output" for now).

If we know in advance what the prediction should be, we can compare what the prediction should be with what it actually is to give us a measure of prediction error. Or loss. In these courses we typically call the total of the error or loss for a collection of predictions cost. The error is calculated by comparing the known correct values, the y, with the predicted values, the \hat{y}. Conceptually, error = (y - \hat{y}). The cost is computed by aggregating the loss of many predictions, and often involves an average. So you end up with something like \frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i), but be aware that there are many options and specifics vary depending on the problem under analysis.
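As a minimal sketch of that aggregation, using mean squared error as the cost (the numbers here are made up for illustration):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # known correct values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions
errors = y - y_hat                       # per-example error, (y - y_hat)
cost = np.mean(errors ** 2)              # aggregate into a single cost (MSE)
# cost == 0.375
```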

While it is true that matrix operations at every layer of a neural network produce resultant matrices, we only measure error at the last layer because that is our set of predictions; the output. Forward propagation produces one final resultant matrix, the output, which during training can be compared with known correct values (the ground truth or labels). If the computed error, or loss, or cost, is higher than desired, steps can be taken to adjust parameters and make a new prediction. The steps of adjusting the parameters are what is known as the backpropagation mentioned in the replies above. The increment of adjustment, both magnitude and direction, is driven by the exact expression of the cost function and its partial derivatives. It's these partial derivatives that play a key role in the minimization of the cost.

Note that all this is true for machine learning in general, regardless of the implementation technology. That is, TensorFlow, Keras, PyTorch, etc. all have to accomplish this same process, and they do it in largely similar ways. In some of the deeplearning.ai courses you implement all these computations and components directly in Python to get a better feel for how they work. Maybe come back to this thread after completing a few more of them, and see if the picture is clearer. Let us know what you find out!


Q: How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?

TensorFlow uses a technique called backpropagation to calculate the cost (also known as the loss or error) for neurons inside hidden layers without the output and optimize it.

During the forward pass of training, the input data is fed through the neural network, and the output is calculated. The cost function is then applied to the output, and the error is computed as the difference between the predicted output and the actual output.

During the backward pass, the error is propagated back through the network, starting from the output layer and moving towards the input layer. The weights and biases of the neurons inside the hidden layers are updated based on the error calculated for the output layer. This process is repeated for each training example in the dataset.

The backpropagation algorithm uses a technique called the chain rule to compute the gradients of the cost function with respect to the weights and biases of each neuron in the hidden layers. These gradients are used to update the weights and biases in a way that minimizes the cost function.

The optimization process typically involves using an algorithm such as gradient descent, which adjusts the weights and biases in the direction of the steepest descent of the cost function. This process continues until the cost function converges to a minimum value, indicating that the network has learned to make accurate predictions for the given input data.

Maybe I'm hallucinating, but that's a pretty good answer, no? Think maybe deeplearning will stop using humans to answer forum questions and just use LLM chat bots from now on?

Nope.

Forward propagation computes the cost.

Exactly. It's not (yet?) 100% accurate. So its responses should be read critically/skeptically and validated by comparing with one's own knowledge and that gleaned from other sources. Assimilate what is useful, discard the rest. If one doesn't know what one doesn't know, a ChatGPT response might serve as a good jumping-off point. Where did all those assertions come from? Can I find corroboration? Which source do I trust more?


The official documentation can be a great place to look. The way I usually use ChatGPT is to go to the main source of the information (if I know what I am looking for) and ask ChatGPT for help if I don't understand something. I usually ask things like "explain this to a 9-year-old", "summarize this text", or "create bullet points from this text".

I hope this helps!


I realize this question is a couple months old. But I don't see that it got an answer.

But isnât the gradient just the derivative of the cost?

If there are 100 training examples, and 25 units in the first layer, how are the 25 weight vectors calculated if not by using a cost function? And how do you get the cost function without the original supervision signal?

I am in week 3 btw and I have been through the videos of earlier weeks; I'm not seeing this part explained. We send the number of hidden units into the TF model, but it's not clear to me how this calculation happens. (That said, I accept the possibility that it was explained and I missed it.)

Yes, the gradients are the partial derivatives of the cost equation with respect to each weight value.
In a NN, the method is called "backpropagation of errors". The gradients are first computed at the output layer (where we have labels), and these errors are then propagated back to the hidden layers (where we don't have any labels).

There is calculus that shows how this process works. It's complicated.

TensorFlow automates this process for us, as its layer classes already include code to compute the gradients.
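To make this concrete, here is a minimal NumPy sketch of the chain-rule computation that TensorFlow's layer classes perform automatically. The network shape, random data, and sigmoid/log-loss pairing are illustrative assumptions, not anything specified in the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 examples, 3 features
y = rng.integers(0, 2, size=(100, 1)).astype(float)  # labels for the OUTPUT only

W1 = rng.normal(size=(3, 25)) * 0.1   # hidden layer: 25 units, no labels for it
b1 = np.zeros((1, 25))
W2 = rng.normal(size=(25, 1)) * 0.1   # output layer
b2 = np.zeros((1, 1))

# Forward pass: compute the predictions (and, from them, the cost).
Z1 = X @ W1 + b1
A1 = sigmoid(Z1)
Z2 = A1 @ W2 + b2
A2 = sigmoid(Z2)                      # y_hat

# Backward pass: the chain rule, starting where the labels exist.
dZ2 = A2 - y                          # output-layer error (sigmoid + log loss)
dW2 = A1.T @ dZ2 / len(X)             # gradient for the output-layer weights
dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)    # error propagated INTO the hidden layer
dW1 = X.T @ dZ1 / len(X)              # gradient for the hidden-layer weights
```

Note that the hidden-layer gradient dW1 is obtained without any hidden-layer labels; it is derived entirely from the output-layer error.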

Hello @Aaron_Newman,

To supplement this discussion, I am sharing with you some of the formulae that are discussed in the Deep Learning Specialization, which are the courses that really go into deeper neural networks.

I want us to focus on just the 1st and the 4th equations.
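For readers who cannot see the image, these are, approximately, the four equations as they are typically written in the Deep Learning Specialization (reconstructed here, so treat the exact notation as approximate):

```latex
dZ^{[L]}   = A^{[L]} - Y                                  % (1)
dW^{[L]}   = \tfrac{1}{m}\, dZ^{[L]} A^{[L-1]T}           % (2)
db^{[L]}   = \tfrac{1}{m} \textstyle\sum_i dZ^{[L](i)}    % (3)
dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g'^{[L-1]}(Z^{[L-1]})    % (4)
```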

• Superscript [L] refers to the last layer (output layer), [L-1] the second last (a hidden layer, which is the discussion in this thread)

• The 1st is the gradient in the output layer, which looks pretty simple: the difference between the predictions and their true labels. Such a simple form derives from, for example in a regression problem, the Mean Squared Error, which is the square of the same difference.

• The 4th is the gradient in the hidden layer, which looks much more complex; it is the result of back propagation (a series of mathematical steps).

• The gradient of something tells us how to update that something in order to minimize the cost. If the 1st tells us that the update is driven by the difference, then the 4th tells us that the update is driven by a much more complicated form, derived by propagating the error from the L-th layer to the (L-1)-th layer.

• Our dataset only has labels for the output layer, because that output is what we want from the neural network. We don't, and can't, care about what the hidden layers should give us, and we don't have labels for those hidden layers. However, through the mathematics of back propagation, we know how each layer (including the hidden layers) should change in order for the output layer to predict something closer to the labels.

Cheers,
Raymond


This is a good question. I notice he just goes to the weights in the hidden layers, and doesn't discuss how they got there. The inputs to the hidden layers are x, the weight vectors, and the bias.

The outputs of the hidden layers in TensorFlow are calculated by the Dense layer.

According to the documentation, the output is: activation(np.dot(input, kernel) + bias). Input here is X (for layer 1; for later layers it is the output of the previous layer). Activation is a function like sigmoid or ReLU that you provide (the default is linear, f(x) = x). dot is the dot product. Kernel is the weight matrix W, generated and managed by the layer; the layer uses the input shape to determine the size of the kernel, and the initial values come from a probability distribution (Glorot uniform by default).
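A small NumPy sketch of that formula (the numbers and shapes below are made up for illustration; in Keras the kernel and bias live inside the layer object rather than being passed in):

```python
import numpy as np

def dense(x, kernel, bias, activation=lambda z: z):
    # Mirrors the documented Dense computation: activation(dot(input, kernel) + bias)
    return activation(np.dot(x, kernel) + bias)

relu = lambda z: np.maximum(0.0, z)

x = np.array([[1.0, 2.0]])               # one example, two input features
kernel = np.array([[0.5, -1.0, 0.0],     # weight matrix, shape (2 inputs, 3 units)
                   [0.25, 1.0, -0.5]])
bias = np.array([0.0, 0.1, 0.2])         # one bias per unit

out = dense(x, kernel, bias, relu)
# dot product -> [1.0, 1.0, -1.0]; + bias -> [1.0, 1.1, -0.8]; relu -> [1.0, 1.1, 0.0]
```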

There are other types of layers but this is the layer that is used for the first few weeks of this class.
