How does TensorFlow compute cost in hidden layers?

Naman_Chhibbbar · April 1, 2023, 7:34am

While fitting the data, we input feature matrix X and target vector y but we do not specify the outputs for the hidden layers. How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?

TMosh · April 1, 2023, 7:38am

There are no target outputs (labels) for the hidden layers, that’s why they’re called ‘hidden layers’.

The process of computing the gradients for the hidden layer weights uses a mathematical method called “backpropagation of errors”.

This calculation is built into the TensorFlow layer objects.

pastorsoto · April 1, 2023, 3:03pm

Hi @Naman_Chhibbbar Great question:

The way it works is by using backpropagation. In this context, the output values of the hidden layers are determined by the training process itself rather than specified by you. The focus of the model is not on the hidden layers but on the mapping from input to output. Hidden layers can be seen as intermediate steps whose weights are optimized along the way.

Please let me know if this answers your question

Naman_Chhibbbar · April 1, 2023, 3:51pm

Hello!

Firstly, thanks for the compliment and your time. Your answer certainly brings some clarity to my doubt, but I still can’t understand it completely. I’m still not sure what the cost function in a neural network is or how we optimise it.

It would be helpful if you could provide an explanation for this or some resources which I can explore.

Thanks again!

TMosh · April 2, 2023, 5:43am

This is covered in more detail in Week 2.

Christian_Simonis · April 2, 2023, 6:29am

Hi @Naman_Chhibbbar

These concepts are very important. If in doubt, feel free to re-watch the course or ask any open question in this forum.

A cost function provides a metric you can use to guide your optimization. When fitting a model, you want to minimise the cost (the model error). See also:

https://towardsdatascience.com/cost-functions-the-underpinnings-of-machine-learning-549ac5edb211

Gradient descent is a powerful method for the above-mentioned optimization when fitting your model. It is really well explained by Andrew Ng in this video:

Gradient Descent (C1W2L04) - YouTube
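
As a rough illustration of the idea (a minimal sketch, not taken from the video; the toy data, the learning rate, and the names `cost` and `grad` are assumptions made up for this example), gradient descent on a one-weight model could look like this:

```python
# Toy data generated by y = 2x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def cost(w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # arbitrary starting guess
learning_rate = 0.05    # hypothetical step size
history = [cost(w)]
for _ in range(100):
    w -= learning_rate * grad(w)   # step against the gradient
    history.append(cost(w))
```

Each step moves `w` in the direction that lowers the cost, so `history` shrinks towards zero and `w` approaches 2.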

Also this thread might be interesting for you:

How different initialization of centroids of K-means results in drastic different clusters ? They all share common cost function - #5 by Christian_Simonis

see also this source.

Happy learning!

Best regards
Christian

pastorsoto · April 6, 2023, 3:11am

Yes. I also strongly suggest using ChatGPT. It is especially good at answering these types of questions, plus you can customize the answer in any format you want.

I hope it helps

TMosh · April 6, 2023, 3:34am

I strongly recommend against using any chatbot tools, as you have no way to validate whether the answer is correct or a confabulation.

ai_curious · April 6, 2023, 2:00pm

Here’s how I would think about this.

Neural networks are created to make predictions, right? Take an input matrix, run it through some matrix computations (or transformations) and produce a resultant matrix. That resultant matrix, the predictions, is often called \hat{y} (I’m deliberately avoiding the word output for now)

If we know in advance what the prediction should be, we can compare what the prediction should be with what it actually is to give us a measure of prediction error. Or loss. In these courses we typically call the total of the error or loss for a collection of predictions cost. The error is calculated by comparing the known correct values, the y, with the predicted values, the \hat{y}. Conceptually, error = (y - \hat{y}). The cost is computed by aggregating the loss of many predictions, and often involves an average. So you end up with something like \frac{\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)}{n} but be aware that there are many options and specifics vary depending on the problem under analysis.
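
To make the loss versus cost distinction concrete, here is a small sketch with made-up numbers (the labels and predictions are hypothetical). It also shows one reason the specifics vary: plain signed errors can cancel out when averaged, which is why a squared error is a common choice of loss.

```python
y = [3.0, 5.0, 7.0]        # known correct values (labels)
y_hat = [4.0, 5.0, 6.0]    # hypothetical predictions

# Per-example signed errors: one too high, one exact, one too low.
errors = [yi - yh for yi, yh in zip(y, y_hat)]
mean_error = sum(errors) / len(errors)   # the signed errors cancel

# Squared-error loss per example, then the cost as their average.
losses = [e ** 2 for e in errors]
cost = sum(losses) / len(losses)
```

Here `mean_error` comes out as zero even though two of the three predictions are wrong, while the averaged squared loss does not.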

While it is true that matrix operations at every layer of a neural network produce resultant matrices, we only measure error at the last layer because that is our set of predictions; the output. Forward propagation produces one final resultant matrix, the output, which during training can be compared with known correct values (the ground truth or labels). If the computed error, or loss, or cost, is higher than desired, steps can be taken to adjust parameters and make a new prediction. Working out how each parameter should be adjusted is the job of the backpropagation mentioned in the replies above, which computes the gradients; an optimizer such as gradient descent then applies the adjustment. The increment of adjustment, both magnitude and direction, is driven by the exact expression of the cost function and its partial derivatives. It’s these partial derivatives that play a key role in the minimization of the cost.

Note that all this is true for machine learning in general, regardless of the implementation technology. That is, TensorFlow, Keras, PyTorch, etc. all have to accomplish this same process and they do it in largely similar ways. In some of the deeplearning.ai courses you implement all these computations and components directly in Python to get a better feel for how they work. Maybe come back to this thread after completing a few more of them, and see if the picture is clearer. Let us know what you find out!
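
For a feel of what that looks like, here is a minimal pure-Python sketch of the whole loop. It is not how TensorFlow is implemented internally; the toy data, the network with one tanh hidden unit, the learning rate, and every name in it are assumptions chosen for illustration. Note that the cost only ever looks at the output, yet the hidden-layer weight `w1` still gets a gradient.

```python
import math

# Toy data (an assumption for illustration): targets come from y = tanh(2x).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [math.tanh(2.0 * x) for x in xs]

# One hidden unit (tanh) feeding one linear output unit.
params = {"w1": 0.5, "b1": 0.0, "w2": 0.5, "b2": 0.0}

def forward(p, x):
    """Return the hidden activation and the prediction for one input."""
    h = math.tanh(p["w1"] * x + p["b1"])   # hidden layer: no label exists for h
    y_hat = p["w2"] * h + p["b2"]          # output layer: compared with y
    return h, y_hat

def cost(p):
    """Half mean squared error, measured at the output layer only."""
    return sum((forward(p, x)[1] - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def gradients(p):
    """Backpropagation: start from the output error, apply the chain rule."""
    g = {name: 0.0 for name in p}
    for x, y in zip(xs, ys):
        h, y_hat = forward(p, x)
        d_out = y_hat - y                          # error at the output
        d_hidden = d_out * p["w2"] * (1 - h * h)   # passed back through w2 and tanh'
        g["w2"] += d_out * h
        g["b2"] += d_out
        g["w1"] += d_hidden * x
        g["b1"] += d_hidden
    return {name: value / len(xs) for name, value in g.items()}

initial_cost = cost(params)
for _ in range(2000):
    grads = gradients(params)           # backward pass: gradients only
    for name in params:
        params[name] -= 0.1 * grads[name]   # gradient descent update
final_cost = cost(params)
```

Comparing `final_cost` with `initial_cost` shows the cost going down, and `params["w1"]` has moved away from its starting value without any label ever being supplied for the hidden unit.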

ai_curious · April 6, 2023, 11:58pm

Q: How does TensorFlow calculate cost for neurons inside hidden layers without the output and optimise it?

ChatGPT answer:

TensorFlow uses a technique called backpropagation to calculate the cost (also known as the loss or error) for neurons inside hidden layers without the output and optimize it.

During the forward pass of training, the input data is fed through the neural network, and the output is calculated. The cost function is then applied to the output, and the error is computed as the difference between the predicted output and the actual output.

During the backward pass, the error is propagated back through the network, starting from the output layer and moving towards the input layer. The weights and biases of the neurons inside the hidden layers are updated based on the error calculated for the output layer. This process is repeated for each training example in the dataset.

The backpropagation algorithm uses a technique called the chain rule to compute the gradients of the cost function with respect to the weights and biases of each neuron in the hidden layers. These gradients are used to update the weights and biases in a way that minimizes the cost function.

The optimization process typically involves using an algorithm such as gradient descent, which adjusts the weights and biases in the direction of the steepest descent of the cost function. This process continues until the cost function converges to a minimum value, indicating that the network has learned to make accurate predictions for the given input data.

Maybe I’m hallucinating, but that’s a pretty good answer, no? Think maybe deeplearning will stop using humans to answer forum questions and just use LLM chat bots from now on?

TMosh · April 7, 2023, 12:52am

Nope.

Forward propagation computes the cost.
Backpropagation computes the gradients.

ai_curious · April 7, 2023, 1:51pm

Exactly. It’s not (yet?) 100% accurate. So its responses should be read critically/ skeptically and validated by comparing with one’s own knowledge and that gleaned from other sources. Assimilate what is useful, discard the rest. If one doesn’t know what one doesn’t know, a ChatGPT response might serve as a good jumping off point. Where did all those assertions come from? Can I find corroboration? Which source do I trust more?

pastorsoto · April 7, 2023, 3:57pm

The library’s own documentation can be a great place to look. The way I usually use ChatGPT is to go to the main source of the information (if I know what I am looking for) and ask ChatGPT for help if I don’t understand something. I usually ask things like “explain to a 9-year-old”, “summarize this text”, or “create bullet points of this text”.

I hope this helps!

Aaron_Newman · June 12, 2023, 5:56pm

I realize this question is a couple months old. But I don’t see that it got an answer.

Back-propagation computes the gradient

But isn’t the gradient just the derivative of the cost?

If there are 100 training examples, and 25 units in the first layer, how are the 25 weight vectors calculated if not by using a cost function? And how do you get the cost function without the original supervision signal?

I am in week 3 btw and I have been through the videos of earlier weeks, but I’m not seeing this part explained. We send the number of hidden units into the TF model, but it’s not clear to me how this calculation happens. (That said, I accept the possibility that it was explained and I missed it.)

TMosh · June 12, 2023, 6:11pm

Yes, the gradients are the partial derivatives of the cost equation with respect to each weight value.
In a NN, the method is called “backpropagation of errors”. The gradients are first computed at the output layer (where we have labels), and these errors are applied to the hidden layer (where we don’t have any labels).

There is calculus that shows how this process works. It’s complicated.

TensorFlow automates this process for us, as its layer classes already include code to compute the gradients.
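
One way to convince yourself that the propagated error really is the derivative of the cost is a numerical check. The sketch below is only an illustration under assumptions (a two-weight network with a sigmoid hidden unit, one made-up training example, and hypothetical function names); it compares the chain-rule gradient for the hidden weight with a finite-difference estimate taken directly from the cost.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w1, w2, x=1.5, y=1.0):
    """Squared-error cost of a two-weight network on one labelled example."""
    a1 = sigmoid(w1 * x)      # hidden activation (no label for this value)
    y_hat = w2 * a1           # output, the only place y is used
    return 0.5 * (y_hat - y) ** 2

def backprop_grad_w1(w1, w2, x=1.5, y=1.0):
    """Gradient of the cost w.r.t. the hidden weight via the chain rule."""
    a1 = sigmoid(w1 * x)
    y_hat = w2 * a1
    d_output = y_hat - y                       # dJ/dy_hat at the output layer
    d_hidden = d_output * w2 * a1 * (1 - a1)   # back through w2 and sigmoid'
    return d_hidden * x

w1, w2, eps = 0.3, -0.8, 1e-6
analytic = backprop_grad_w1(w1, w2)
# Central difference: nudge w1 slightly and see how the cost changes.
numeric = (cost(w1 + eps, w2) - cost(w1 - eps, w2)) / (2 * eps)
```

The two values agree to many decimal places, which is the sense in which the hidden layer is trained "using the cost function" even though it has no labels of its own.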

rmwkwok · June 13, 2023, 1:07am

Hello @Aaron_Newman,

To supplement this discussion, I am sharing with you some of the formulae that are discussed in the Deep Learning Specialization, which are the courses that really go deeper into neural networks.

I want us to focus on just the 1st and the 4th equations.

Superscript [L] refers to the last layer (output layer), [L-1] the second last (a hidden layer, which is the discussion in this thread)
The 1st is the gradient in the output layer, which looks pretty simple: the difference between the predictions and their true labels. This simple form follows from, for example in a regression problem, the Mean Squared Error, which is the square of that same difference.
The 4th is the gradient in the hidden layer, which looks much more complex; it is the result of back propagation (that is, of a series of mathematical steps).
Gradient of something tells us how to update that something in order to minimize the cost. If the 1st tells us that the update is driven by the difference, then the 4th tells that the update is driven by a much more complicated form that is derived through propagating the error from the L-th layer to the (L-1)-th layer.
Our dataset has only labels for the output layer because it is what we want from the neural network. We don’t and can’t care about what the hidden layers should give us, and we don’t have labels for those hidden layers. However, by mathematics or by back propagation, we know how each layer (including hidden layers) should change in order for the output layer to predict something closer to the labels.
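
The equations themselves are not reproduced in the text of this post. As an assumption about what they look like, here is the standard form of the backward-pass equations in the notation described above, where m is the number of training examples, g' is the derivative of the activation function, and \odot is element-wise multiplication. The first line holds as written for common pairings such as a sigmoid output with cross-entropy loss or a linear output with half squared error.

```latex
\begin{aligned}
dZ^{[L]}   &= A^{[L]} - Y \\
dW^{[L]}   &= \frac{1}{m}\, dZ^{[L]} A^{[L-1]T} \\
db^{[L]}   &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[L](i)} \\
dZ^{[L-1]} &= W^{[L]T} dZ^{[L]} \odot g'^{[L-1]}\!\left(Z^{[L-1]}\right)
\end{aligned}
```

Read this way, the 1st line uses the labels Y directly, while the 4th line contains no Y at all: it reuses dZ^{[L]} from the output layer and pushes it back through the weights W^{[L]}.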

Cheers,
Raymond

Aaron_Newman · June 14, 2023, 7:51pm

This is a good question. I notice he just goes to the weights in the hidden layers, and doesn’t discuss how they got there. The inputs to the hidden layers are x, the weight vectors, and the bias.

The outputs for the hidden layers in TensorFlow are calculated by the Dense layer.

According to the documentation, the output is: activation(np.dot(input, kernel) + bias). Input here is X for layer 1; for other layers it is just the output of the previous layer. Activation is a function like sigmoid or ReLU that you provide (if you provide none, it is linear, f(x)=x). dot is the dot product. Kernel is the weight matrix W, which the layer creates for you. It uses the input shape to determine the size of the kernel, and the initial values are drawn from a random distribution (Glorot uniform by default); training then updates them.
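
Here is a rough pure-Python sketch of that formula for a single example. It is not the Keras implementation: `dense_forward` is a hypothetical helper, and the plain uniform random kernel is an assumption standing in for the real initializer.

```python
import random

def relu(z):
    return max(0.0, z)

def dense_forward(inputs, kernel, bias, activation=lambda z: z):
    """Compute activation(dot(inputs, kernel) + bias) for one example.

    inputs: list of n_in features; kernel: n_in x n_units; bias: n_units.
    The default activation is linear, f(z) = z.
    """
    outputs = []
    for j in range(len(bias)):
        z = sum(inputs[i] * kernel[i][j] for i in range(len(inputs))) + bias[j]
        outputs.append(activation(z))
    return outputs

random.seed(0)
n_in, n_units = 3, 2
# Random starting weights; plain uniform here is an assumption, not Keras's initializer.
kernel = [[random.uniform(-0.5, 0.5) for _ in range(n_units)] for _ in range(n_in)]
bias = [0.0] * n_units

x = [1.0, 2.0, 3.0]
hidden_output = dense_forward(x, kernel, bias, activation=relu)
```

The kernel has one row per input feature and one column per unit, so 3 inputs and 2 units give a 3 x 2 matrix and an output of length 2. Nothing in this forward computation needs a label; the labels only enter later, through the cost at the output layer.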

There are other types of layers but this is the layer that is used for the first few weeks of this class.
