How to compute J(w) in gradient checking

I have a question related to
"Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization".
I know how to compute \frac{\partial J}{\partial w}, but how do I compute J(\theta)? Is it Z = WX + b or not?

Would you like to discuss J(\theta) in general, or would you like us to focus on a particular video of the course? If you want to focus on a video, please share with us the name of the video.

Otherwise, generally speaking, \theta represents the model parameters, which means all the weights and biases of our neural network. J means cost, and it can be anything such as the averaged squared loss (for a regression problem) or the averaged log loss (for a binary classification problem). If we choose the averaged squared loss, then J(\theta) = \frac{1}{m} \sum_{i=0}^{m-1}(y^{(i)} - \hat{y}^{(i)}(x^{(i)}; \theta))^2. Essentially, you take the average (over all samples) of the squared errors.
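To make that concrete, here is a minimal sketch (the function names and toy data are my own, not from the course) of evaluating J(\theta) as the averaged squared loss for a tiny linear model:

```python
import numpy as np

def predict(x, w, b):
    # y_hat = w*x + b for a one-parameter linear model
    return w * x + b

def cost_mse(x, y, w, b):
    # J(theta) = (1/m) * sum over all m samples of (y - y_hat)^2
    y_hat = predict(x, w, b)
    return np.mean((y - y_hat) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(cost_mse(x, y, w=2.0, b=0.0))  # perfect fit -> 0.0
print(cost_mse(x, y, w=1.0, b=0.0))  # errors 1, 2, 3 -> (1+4+9)/3
```

The point is simply that J is a single scalar: run the model on every sample, score each prediction with the loss, and average.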



It sounds like you are talking about the Gradient Checking assignment in DLS C2 W1. You filed this under “General Discussion”, so I moved it for you to DLS C2 by using the little “edit pencil” on the title.

What they do in the notebook for the simple “1D” case of Gradient Checking is basically “fake”. They just define J to be a linear function of the inputs to demonstrate the concept of approximating derivatives by doing “finite differences”. So this is not a real example of a neural network. Please check out the second section of the exercise in which they do the full case: there we build a real multilayer network to do a classification, so they use the normal “cross entropy” loss function, as you would expect.
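The "1D" idea can be sketched in a few lines (this is a hedged illustration of the concept, not the notebook's actual code): define a fake linear J, then compare its analytic derivative against the centered finite difference.

```python
# Toy 1D gradient check: J(theta) = theta * x, so dJ/dtheta = x.
def J(theta, x=2.0):
    return theta * x

def dJ_dtheta(theta, x=2.0):
    return x

theta, eps = 3.0, 1e-7

# Centered finite difference: (J(theta+eps) - J(theta-eps)) / (2*eps)
grad_approx = (J(theta + eps) - J(theta - eps)) / (2 * eps)
grad_exact = dJ_dtheta(theta)

# Normalized difference, as in the course's gradient-checking formula
diff = abs(grad_approx - grad_exact) / (abs(grad_approx) + abs(grad_exact))
print(diff < 1e-7)  # True: the analytic derivative checks out
```

Because J here is linear, the two-sided difference is essentially exact; for a real network the same comparison is done per parameter.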

One other point to make here is that they wrote the network code in the simplest and hard-coded way, just to keep the code small. Notice that they didn’t do the full generality that we saw in DLS C1 W4 for how to build a flexible multilayer network. So what they are showing is an example of a “real” neural network, but this is not how we would really build it for the general case.


I mean, when I want to compute \frac{\partial J}{\partial w}, if there is one layer in the network:

Z= WX + b \rightarrow A= \sigma(Z ) \rightarrow J(\hat{Y}, Y)

\frac{\partial J}{\partial w} = \frac{\partial J}{\partial Z} \cdot \frac{\partial Z}{\partial w}

now I want to compute the value of
\frac{\partial J}{\partial w} \approx \frac{J (w + \varepsilon ) - J (w - \varepsilon )}{ 2\varepsilon }. What are the values of J (w + \varepsilon ) and J (w - \varepsilon )?

This is to check whether the difference between the approximate and the analytic gradient is less than 10^{-7} or not.

The cost is the average of the loss values over all the inputs, so you'd need to know the X values. As you have formulated it above, it's not clear what you mean by J. It looks like you've applied sigmoid to Z, although that's not how they did it in the notebook for the 1D case. So if you are adding sigmoid, are you also adding the cross entropy loss?

Also please note that if you are only talking about how this works in the notebook, that is all covered in the instructions. If that’s the case, then the best idea is to read them again carefully and it all should make sense.

I think your J(\hat{Y}, Y) should be replaced with J(\hat{Y} \mid_{w}, Y), because your \hat{Y} depends on w. Consequently, your J(w+\varepsilon) should be replaced with J(\hat{Y} \mid_{w+\varepsilon}, Y).
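In other words, "evaluating J(w + \varepsilon)" means rerunning the whole forward pass with the perturbed weight and then computing the cost on the resulting \hat{Y}. A minimal sketch of that for a one-layer sigmoid network with averaged cross-entropy loss (my own toy example, not the assignment's code; only one weight, W[0, 0], is perturbed here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_cost(W, b, X, Y):
    # Full forward pass: Z = WX + b, A = sigmoid(Z) = Y_hat,
    # then the averaged cross-entropy cost J(Y_hat, Y).
    Z = W @ X + b
    A = sigmoid(Z)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))                    # 3 features, 5 samples
Y = (rng.random((1, 5)) > 0.5).astype(float)       # binary labels
W = rng.standard_normal((1, 3))
b = 0.1
eps = 1e-7

# J(w + eps) and J(w - eps): rerun the forward pass with W[0, 0] nudged.
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
grad_approx = (forward_cost(W_plus, b, X, Y) - forward_cost(W_minus, b, X, Y)) / (2 * eps)

# Analytic gradient for sigmoid + cross-entropy: dJ/dW = (1/m) (A - Y) X^T
A = sigmoid(W @ X + b)
dW = (A - Y) @ X.T / Y.shape[1]
print(abs(grad_approx - dW[0, 0]) < 1e-6)  # True: the two agree
```

So J(w + \varepsilon) is not a formula you evaluate in isolation; it is the cost of the whole network after perturbing that one parameter, which is why you need the data X and Y to compute it.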