In the second week we learned that by taking the derivative of the loss/cost with respect to w (dw), we can measure the effect of w on the loss.
ReLU is a good choice because when z is negative the derivative is zero and when z is positive the derivative is one.
My question is: why are we concerned with the derivative of the activation function when we usually take the derivative of the loss/cost? And will the loss function change if we use ReLU instead of sigmoid?
Yes, in order to calculate the gradients, we need the derivative of every function that is in the “chain” of functions that take us from a given layer all the way to the cost J at the final output. So that includes every activation function in addition to the derivatives of the “linear” portion of the calculation at each layer. This is all a big application of the Chain Rule of calculus.
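Here is a rough sketch of what that looks like in code, just to make the chain rule concrete. This is my own toy example (random data, made-up layer sizes), not the assignment implementation, although the variable names follow the course conventions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # derivative is 0 where z < 0 and 1 where z > 0
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy setup: 3 input features, 4 hidden units, 1 output unit, 5 samples
np.random.seed(0)
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5).astype(float)
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))
m = X.shape[1]

# Forward pass: ReLU at the hidden layer, sigmoid at the output
Z1 = W1 @ X + b1
A1 = relu(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)          # this is y_hat

# Backward pass: every line below is one link in the chain
dZ2 = A2 - Y                         # dLoss/dA2 combined with sigmoid'(Z2)
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2                     # gradient flowing back into the hidden layer
dZ1 = dA1 * relu_derivative(Z1)      # the hidden activation's derivative shows up here
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```

Notice that you literally cannot compute dZ1 without knowing the derivative of the hidden layer's activation function: that is why the choice of activation matters for back propagation.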
Note that everything we have discussed up to this point involves networks that do classification. That means we cannot use ReLU as the activation at the output layer: we need either sigmoid (for binary classifications) or softmax (for multi-class classifiers). But we can still choose ReLU as one of the possible activation functions at any of the internal or “hidden” layers of the network. We need the derivatives of all the functions at the hidden layers as well.
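Just to make the multi-class case concrete, here is a minimal sketch of softmax (again my own illustration, not course code). Each column of the output is a probability distribution over the classes, which is what makes it suitable for a classifier's output layer in a way that ReLU output never could be:

```python
import numpy as np

def softmax(z):
    # z has shape (num_classes, num_samples); subtract the max for numerical stability
    z_shifted = z - np.max(z, axis=0, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Output of the final linear layer for 3 classes and 2 samples
Z = np.array([[ 2.0, -1.0],
              [ 0.5,  0.0],
              [-1.0,  3.0]])

A = softmax(Z)
print(A)              # entries are all between 0 and 1
print(A.sum(axis=0))  # each column sums to 1, i.e. a valid probability distribution
```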
Will the loss formula and the resulting derivatives also change if we change the activation function?
It depends on which layer you are talking about. If you are discussing internal hidden layers, then changing the activation function does not change the loss function itself. Of course the complete “end to end” function that takes the data as input and produces the loss as the final result is changed, since that is a huge function composition including every step in every layer. But the actual function used to compute the loss based on the y and \hat{y} values does not change.
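Here is one way to see that concretely. In this toy sketch (my own made-up numbers and layer sizes), the only thing that changes between the two runs is the hidden activation; the exact same loss function is applied to whatever \hat{y} comes out:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy(Y, Y_hat):
    # the loss formula itself: it only ever sees Y and Y_hat
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

def forward(X, params, hidden_activation):
    W1, b1, W2, b2 = params
    A1 = hidden_activation(W1 @ X + b1)   # only this line depends on the hidden choice
    return sigmoid(W2 @ A1 + b2)          # output layer stays sigmoid

np.random.seed(1)
X = np.random.randn(3, 4)
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
params = (np.random.randn(5, 3), np.zeros((5, 1)),
          np.random.randn(1, 5), np.zeros((1, 1)))

# Same loss function either way; only the predictions it is fed are different
for g in (lambda z: np.maximum(0, z), np.tanh):   # ReLU vs tanh hidden layer
    Y_hat = forward(X, params, g)
    print(binary_cross_entropy(Y, Y_hat))
```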
But it can matter if you change the activation function at the output layer. There are different loss functions that you use depending on the nature of the predictions that your network needs to make. As I mentioned above, classifiers will always use sigmoid for the binary case and softmax for the multiclass case. With each of those output activations, the values look like probabilities and we use the appropriate version of cross entropy loss (“log loss”), as Prof Ng discusses in the lectures.
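For reference, the two cross entropy cost formulas look like this, averaged over the m training samples (C is the number of classes and the multiclass labels are one-hot vectors):

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right] \quad \text{(binary, sigmoid output)}$$

$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{C} y^{(i)}_k\log\hat{y}^{(i)}_k \quad \text{(multiclass, softmax output)}$$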
But not all networks are classifiers. We don’t really cover this case in DLS Course 1, but you can also have networks that predict some continuous numeric output like the price of a house or a stock or the temperature at noon tomorrow. In that case, you can’t use sigmoid, but you might well choose either a linear output (no activation at all) or ReLU as the output layer activation when only positive values are meaningful. In those cases, using cross entropy loss does not make any sense. Most commonly with a “regression” (the term for a model that predicts a continuous value), the loss function will be based on some form of Euclidean distance. The most commonly used loss function in those cases is MSE (Mean Squared Error), which is just the average of the squared Euclidean distance between the y and \hat{y} values over all the samples. That is computationally efficient and has nice mathematical properties for running gradient descent.
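For completeness, the MSE cost over m samples is just:

$$J = \frac{1}{m}\sum_{i=1}^{m}\left\|y^{(i)} - \hat{y}^{(i)}\right\|^2$$

which reduces to the average of (y^{(i)} - \hat{y}^{(i)})^2 when the output is a single number. Some treatments include an extra factor of \frac{1}{2} just to make the derivative cleaner; either way the gradient is simple and smooth, which is part of why it works so nicely with gradient descent.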
In DLS Course 4, we will see some examples of more complex cases in which the outputs of the network are actually a mix of classifications and regressions and that will require “hybrid” loss functions that incorporate several terms. YOLO in DLS C4 W3 is the example I was thinking of there, so “hold that thought” and stay tuned for that.