W4 - Shouldn't the activation A also be cached?

The point of the cache is that it needs to cover the general case. The general formula involves g'(Z), right? You just happen to get lucky in the sigmoid case: g'(Z) can be computed more cheaply directly from A than from Z. The problem is that this is not true for all activation functions, although it also happens to be true for tanh. :nerd_face:

Maybe that is because tanh and sigmoid are very closely related mathematically.
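
To make that relationship concrete, here are the standard identities, writing $\sigma$ for sigmoid and $a = g(z)$ for the activation output (the notation is mine, not from the course materials):

$$
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = a\,(1 - a)
$$

$$
\tanh'(z) = 1 - \tanh^2(z) = 1 - a^2
$$

$$
\tanh(z) = 2\,\sigma(2z) - 1
$$

The last line shows that tanh is just a shifted and rescaled sigmoid, which is at least consistent with both derivatives being expressible in terms of the output alone.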

One way to get a general solution and still avoid recomputing in some cases would be to include both Z and A in the cache. That would cost a little more memory, but save compute in the cases where g'(Z) can be written in terms of A. If we were implementing this on our own, we could add that feature.
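
Here is a rough sketch of what caching both values might look like. The function names and the `(Z, A)` cache layout are my own assumptions for illustration, not the code from the assignment. The sigmoid backward step reads A from the cache, while the ReLU backward step is written in terms of Z, following the general g'(Z) form.

```python
import numpy as np

# Hypothetical helpers: names and cache layout are illustrative assumptions,
# not the course's utility functions.

def sigmoid_forward(Z):
    A = 1.0 / (1.0 + np.exp(-Z))
    cache = (Z, A)  # keep both the pre-activation and the activation
    return A, cache

def sigmoid_backward(dA, cache):
    _, A = cache
    # g'(Z) = A * (1 - A), so no second call to exp is needed
    return dA * A * (1.0 - A)

def relu_forward(Z):
    A = np.maximum(0.0, Z)
    return A, (Z, A)

def relu_backward(dA, cache):
    Z, _ = cache
    # general form: evaluate g'(Z) from Z; here g'(Z) is 1 where Z > 0, else 0
    return dA * (Z > 0)
```

The memory cost is one extra array the size of A per layer, which is the tradeoff mentioned above.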
