Hi there,
I’m doing the programming assignment from week 4, and while looking into the code for sigmoid_backward I’m not sure why the activation A is not in the cache, so that the extra computation
s = 1/(1+np.exp(-Z))
would not need to be performed from scratch. Am I missing something? Isn’t s = A for that layer?
import numpy as np

def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z', stored during forward propagation for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache

    # sigmoid'(Z) = s * (1 - s), where s = sigmoid(Z)
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)

    assert (dZ.shape == Z.shape)

    return dZ
Hi Mark, welcome to the community!
I don’t think you’re missing anything; your understanding is correct. The output s is A for that layer, since A = \sigma(Z). Caching Z instead of A is, I think, a design choice that balances memory usage against computational cost. Both approaches are valid, and in practice you might choose one over the other based on the specific requirements of your application.
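Just to make the idea concrete, here is a rough sketch of what a variant that caches A instead of Z could look like (the name sigmoid_backward_from_A is made up for this example; it is not part of the assignment):

import numpy as np

def sigmoid_backward_from_A(dA, cache):
    """
    Backward propagation for a single SIGMOID unit, reusing the cached
    activation A = sigmoid(Z) instead of recomputing it from Z.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'A', the sigmoid output stored during forward propagation

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    A = cache

    # sigmoid'(Z) = A * (1 - A), so no call to np.exp is needed here
    dZ = dA * A * (1 - A)

    assert (dZ.shape == A.shape)

    return dZ

Functionally it gives the same dZ as the assignment’s version; the only difference is what the forward pass stores.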
Hope this helps!
Thank you for your answer!
I believe the extra computation doesn’t have a significant impact here, as only the last layer uses a sigmoid activation function. However, for a network where the activation function is sigmoid throughout, this might have a more noticeable effect. Am I correct in my understanding?
Yes, in networks where the sigmoid activation function is used in multiple layers, recomputing s = \sigma(Z) during backpropagation at each layer can have a more noticeable impact on computational efficiency. You are right, and caching A is a valid optimization to consider, especially in resource-constrained environments or large networks. That said, to follow the instructions given in the assignment, you are likely expected to recompute s from Z rather than store both Z and A in the cache just to avoid the extra computation.
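If you want a rough feel for the cost on your own machine, here is an informal comparison (the array size is arbitrary and the exact numbers will vary with hardware):

import time
import numpy as np

Z = np.random.randn(1000, 5000)
A = 1 / (1 + np.exp(-Z))          # pretend this was cached during forward prop
dA = np.random.randn(*Z.shape)

t0 = time.perf_counter()
s = 1 / (1 + np.exp(-Z))          # recompute the sigmoid from Z
dZ_recomputed = dA * s * (1 - s)
t1 = time.perf_counter()

dZ_cached = dA * A * (1 - A)      # reuse the cached activation
t2 = time.perf_counter()

print(f"recompute from Z: {t1 - t0:.4f}s, reuse cached A: {t2 - t1:.4f}s")
np.testing.assert_allclose(dZ_recomputed, dZ_cached)

The difference comes almost entirely from the np.exp call, which is the expensive part of recomputing s.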
The point of the cache values is that they need to cover the general case. The general formula involves g'(Z), right? You just happen to get lucky in the sigmoid case, because g'(Z) can be computed more cheaply directly from A rather than from Z. The problem is that this is not true for all activation functions, although it also happens to be true for tanh.
Maybe that is because it turns out that tanh and sigmoid are very closely related mathematically.
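For reference, the relationship is \tanh(z) = 2\sigma(2z) - 1, and the derivatives of both can be written purely in terms of the activation value itself: \sigma'(z) = \sigma(z)(1 - \sigma(z)) = A(1 - A) with A = \sigma(z), and \tanh'(z) = 1 - \tanh^2(z) = 1 - A^2 with A = \tanh(z). That is exactly why the cached A would be enough to compute g'(Z) for these two activations.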
One way to get both a general solution and avoid recomputing in some cases would be to include both Z and A in the cache. That would cost a little more memory, but save compute in some cases. If we were considering implementing this on our own, we could add that feature.
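As a rough sketch of that idea (the helper names here are just illustrative, not from the assignment), the forward functions could cache both Z and A, and each backward function could then use whichever is cheaper:

import numpy as np

def sigmoid_forward(Z):
    A = 1 / (1 + np.exp(-Z))
    cache = (Z, A)              # keep both: a little more memory, less compute later
    return A, cache

def relu_forward(Z):
    A = np.maximum(0, Z)
    cache = (Z, A)
    return A, cache

def sigmoid_backward(dA, cache):
    Z, A = cache
    # sigmoid'(Z) = A * (1 - A): the cached A is all we need here
    return dA * A * (1 - A)

def relu_backward(dA, cache):
    Z, A = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0              # the ReLU mask is built from Z
    return dZ

The memory overhead is one extra array per layer, which is the trade-off mentioned above.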