Why was l1[z1<0] = 0 used, rather than just taking the max between 0 and the l1 value itself (not z1)? What’s the idea here? For context, l1 is W2.T x (yhat - y), and z1 is W1 x X + bias1.
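Hi! l1[z1<0] = 0 multiplies the gradient by the derivative of ReLU, which is a step function of z1: 1 where z1 is positive and 0 where z1 is negative. That is why the mask is built from z1 and not from l1.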
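To make that concrete, here is a minimal NumPy sketch. The values of z1 and l1 are made up purely for illustration (they are not from the original code), and the names masked, relu_grad, multiplied and clipped are mine. It shows that masking by z1 gives the same result as multiplying by the step function, while taking max(0, l1) gives something different.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
z1 = np.array([[-1.0, 2.0], [0.5, -3.0]])  # pre-activation: W1 @ X + bias1
l1 = np.array([[0.7, -0.4], [-0.2, 0.9]])  # upstream gradient: W2.T @ (yhat - y)

# The line from the question: zero the gradient wherever the ReLU input was negative.
masked = l1.copy()
masked[z1 < 0] = 0

# Same thing written as a multiplication by the ReLU derivative.
# Assumption: the derivative at exactly z1 == 0 is taken as 1, matching the z1 < 0 mask.
relu_grad = (z1 >= 0).astype(l1.dtype)
multiplied = l1 * relu_grad

# What the question proposed instead: clipping the gradient itself at zero.
clipped = np.maximum(0, l1)

print(np.array_equal(masked, multiplied))  # the two forms agree
print(np.array_equal(masked, clipped))     # clipping l1 is a different operation
```

Note that clipping l1 throws away every negative gradient, even for units that were active in the forward pass, and keeps positive gradients for units that were switched off.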
What I mean is that when we compute the gradients with the chain rule, we arrive at a point where we have to perform: