Hi there,
I’m doing the programming assignment from week 4, and while looking into the code for sigmoid_backward I’m not sure why the activation A is not in the cache, so that the extra computation
s = 1/(1+np.exp(-Z))
would not need to be performed from scratch. Am I missing something? Isn’t s = A for that layer?
import numpy as np

def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z', stored during forward propagation for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache

    # sigmoid'(Z) = s * (1 - s), where s = sigmoid(Z)
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)

    assert (dZ.shape == Z.shape)

    return dZ
Hi Mark, welcome to the community!
I don’t think you’re missing anything; your understanding is correct. The output s is A for that layer, since A = \sigma(Z). Caching Z instead of A is, I think, a design choice that balances memory usage against computational cost. Both approaches are valid, and in practice you might choose one over the other based on the specific requirements of your application.
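Just to make the idea concrete, here is a rough sketch of what a variant that caches A instead of Z could look like (the name sigmoid_backward_from_A is made up for this example; it is not part of the assignment):

import numpy as np

def sigmoid_backward_from_A(dA, cache):
    """
    Backward propagation for a single SIGMOID unit, reusing the cached
    activation A = sigmoid(Z) instead of recomputing it from Z.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'A', the sigmoid output stored during forward propagation

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    A = cache

    # sigmoid'(Z) = A * (1 - A), so no call to np.exp is needed here
    dZ = dA * A * (1 - A)

    assert (dZ.shape == A.shape)

    return dZ

Functionally it gives the same dZ as the assignment’s version; the only difference is what the forward pass stores.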
Hope this helps!
Thank you for your answer!
I believe the extra computation doesn’t have a significant impact here, as only the last layer uses a sigmoid activation function. However, for a network where the activation function is sigmoid throughout, this might have a more noticeable effect. Am I correct in my understanding?
Yes, in networks where the sigmoid activation function is used in multiple layers, recomputing s = \sigma(Z) during backpropagation at each layer can have a more noticeable impact on computational efficiency. You are right, and caching A is a valid optimization to consider, especially in resource-constrained environments or large networks. That said, to follow the instructions given in the assignment, you are likely expected to recompute s from Z rather than store both Z and A in the cache just to avoid the extra computation.
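If you want a rough feel for the cost on your own machine, here is an informal comparison (the array size is arbitrary and the exact numbers will vary with hardware):

import time
import numpy as np

Z = np.random.randn(1000, 5000)
A = 1 / (1 + np.exp(-Z))          # pretend this was cached during forward prop
dA = np.random.randn(*Z.shape)

t0 = time.perf_counter()
s = 1 / (1 + np.exp(-Z))          # recompute the sigmoid from Z
dZ_recomputed = dA * s * (1 - s)
t1 = time.perf_counter()

dZ_cached = dA * A * (1 - A)      # reuse the cached activation
t2 = time.perf_counter()

print(f"recompute from Z: {t1 - t0:.4f}s, reuse cached A: {t2 - t1:.4f}s")
np.testing.assert_allclose(dZ_recomputed, dZ_cached)

The difference comes almost entirely from the np.exp call, which is the expensive part of recomputing s.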
The point of the cache values is that they need to cover the general case. The general formula involves g'(Z), right? You just happen to get lucky in the sigmoid case, because g'(Z) can be computed more cheaply directly from A rather than from Z. The problem is that this is not true for all activation functions, although it also happens to be true for tanh.
Maybe that is because it turns out that tanh and sigmoid are very closely related mathematically.
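For reference, the relationship is \tanh(z) = 2\sigma(2z) - 1, and the derivatives of both can be written purely in terms of the activation value itself: \sigma'(z) = \sigma(z)(1 - \sigma(z)) = A(1 - A) with A = \sigma(z), and \tanh'(z) = 1 - \tanh^2(z) = 1 - A^2 with A = \tanh(z). That is exactly why the cached A would be enough to compute g'(Z) for these two activations.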
One way to get both a general solution and avoid recomputing in some cases would be to include both Z and A in the cache. That would cost a little more memory, but save compute in some cases. If we were considering implementing this on our own, we could add that feature.
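As a rough sketch of that idea (the helper names here are just illustrative, not from the assignment), the forward functions could cache both Z and A, and each backward function could then use whichever is cheaper:

import numpy as np

def sigmoid_forward(Z):
    A = 1 / (1 + np.exp(-Z))
    cache = (Z, A)              # keep both: a little more memory, less compute later
    return A, cache

def relu_forward(Z):
    A = np.maximum(0, Z)
    cache = (Z, A)
    return A, cache

def sigmoid_backward(dA, cache):
    Z, A = cache
    # sigmoid'(Z) = A * (1 - A): the cached A is all we need here
    return dA * A * (1 - A)

def relu_backward(dA, cache):
    Z, A = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0              # the ReLU mask is built from Z
    return dZ

The memory overhead is one extra array per layer, which is the trade-off mentioned above.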