ReLU Lab Questions and joy

I love the C2_W2_Relu Lab! As always, it spurred some questions and insights. First question: Can we think of ReLU giving us the ability to minimize an intermediary cost function that is piece-wise linear, where the relationship of w and b to each other determines the decision boundary, where the decision is “When does this feature start contributing to a?” So we’re trying to figure out an optimal a for this layer, which needs to be optimized before any later a’s can be optimized. Once the neural network is fitted, I bet these a’s themselves can be analyzed to see how well they “fit”. Then the question becomes what exactly is this intermediary loss function that is being optimized? Does it have the same characteristics as the final loss function? This looks a lot like piece-wise linear regression.

It would be really cool if the lab allowed you to set a relationship between w and b, so that you could solve for the decision boundary, and then quickly match the slope by varying w. It would be insane to watch this ReLU function fitting a in real-time. Amazing stuff!

Hi @pritamdodeja,

For “intermediary”, if you are contrasting the linear piecewise boundary to the assumption that the ideal boundary should be curved, then I can see why you would say that. I cannot comment on whether we can or we cannot interpret the problem this way or that way because there are infinite ways, but at least I can see some rationale behind yours. Nice try!

I think you are talking about something like an animation that shows the piecewise linear solution changing over the training process. That would be a very nice animation indeed, and you actually can find this here.


You know how in an earlier lecture Dr. Ng talked about the decision boundary being where wx+b = 0? In the ReLU case, the transition point is a decision boundary. My point was that the lab can emphasize this by having the student solve for the decision boundary and use this information to modify w and b, because fixing the decision boundary forces w and b into a certain linear relationship. The first time I tried to solve it, I was fumbling around and missing the point that was being made. Once I thought in terms of decision boundaries, the problem became much easier: match the slope w first, and then set b such that the decision boundary/transition point matches.
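To make the "slope first, then b" idea concrete, here is a minimal sketch for a single ReLU unit with one input. The transition point sits where wx+b = 0, i.e. at x = -b/w, so once w is chosen, b = -w * x0 places the kink at x0. The names `unit`, `bias_for_kink`, `target_slope` and `target_kink` are made up for illustration and are not from the lab.

```python
def relu(z):
    # ReLU activation: passes positive values, clips negatives to 0
    return max(0.0, z)

def unit(x, w, b):
    # One ReLU unit with a single input feature
    return relu(w * x + b)

def bias_for_kink(w, x0):
    # The unit switches on where w*x + b = 0, so b = -w * x0
    return -w * x0

# Hypothetical target: a segment with slope 2 that starts at x = 1
target_slope = 2.0
target_kink = 1.0

w = target_slope                   # step 1: match the slope
b = bias_for_kink(w, target_kink)  # step 2: place the transition point

for x in (0.0, 1.0, 2.0, 3.0):
    print(x, unit(x, w, b))
```

With these values the unit stays at 0 up to x = 1 and then rises with slope 2, which is the by-hand procedure described above.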

I will definitely have to spend some time on tensorflow playground, but I think I need to understand the concept of activations better. So much to learn!

Actually, what we do is a method called backpropagation, where in each run, we work backwards from the cost function all the way to the first layer and modify the parameters (w,b) at every layer in a single shot. And then we forward propagate the outputs “a” at each layer to evaluate the cost function at the output layer for this updated set of parameters.
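As a rough illustration of "every layer in a single shot", here is a hand-written sketch for a tiny network with one hidden ReLU unit and one linear output unit, trained on a single example with a squared-error cost. The network shape, starting values, learning rate and the cost J = 0.5 * (y_hat - y)^2 are all assumptions chosen to keep the arithmetic readable; a real framework computes the same gradients automatically.

```python
def relu(z):
    return z if z > 0 else 0.0

def forward(x, p):
    # Layer 1: one ReLU unit; layer 2: one linear unit
    z1 = p["w1"] * x + p["b1"]
    a1 = relu(z1)
    y_hat = p["w2"] * a1 + p["b2"]
    return z1, a1, y_hat

def cost(x, y, p):
    # Assumed cost: J = 0.5 * (y_hat - y)^2
    return 0.5 * (forward(x, p)[2] - y) ** 2

def backward(x, y, p):
    # Work backwards from the cost to the first layer (chain rule)
    z1, a1, y_hat = forward(x, p)
    d_yhat = y_hat - y                       # dJ/dy_hat
    grads = {"w2": d_yhat * a1, "b2": d_yhat}
    d_a1 = d_yhat * p["w2"]                  # dJ/da1
    d_z1 = d_a1 * (1.0 if z1 > 0 else 0.0)   # ReLU derivative gates the signal
    grads["w1"] = d_z1 * x
    grads["b1"] = d_z1
    return grads

# Hypothetical starting parameters and a single training example
params = {"w1": 0.5, "b1": 0.1, "w2": 1.0, "b2": 0.0}
x, y = 2.0, 3.0
lr = 0.05

before = cost(x, y, params)
grads = backward(x, y, params)
# All four parameters are updated together from the same backward pass
params = {k: v - lr * grads[k] for k, v in params.items()}
after = cost(x, y, params)
print(before, after)
```

Note that w and b of the first layer are never tuned one after the other; both move at once, driven only by the cost at the output.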

Food for thought: How do we analyse these “a’s” in the hidden layers? Do we know what portion of the final output is coming from each of these layers, and from each unit of these layers? Just supposing that we did know this, do we still have a target output for each of these hidden layers so that we can calculate error/cost at THAT layer?

If I think of the f(X) with parameters w, b as L3(L2(L1(A0))), with the activations being included in L, then A1, thanks to ReLU, should learn the most foundational features first, which is what I meant by having to optimize A1 first. The set of foundational features should be common across models; just their positioning should vary due to randomization. As such, I think that by optimizing J on the outermost function, some other J_1 must be getting minimized for the innermost function. I’m not even trying to think in terms of the final output, I’m just imagining the network being chopped off or made transparent at the earlier layers.
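One way to picture the network "made transparent" is a forward pass that keeps every intermediate activation instead of returning only the final output. The sketch below uses two layers rather than three, and the weights are made-up numbers, not anything learned; `forward_transparent` and `dense` are hypothetical helper names.

```python
def relu(v):
    return [max(0.0, z) for z in v]

def identity(v):
    return list(v)

def dense(a_in, W, b):
    # Each row of W holds the weights of one unit in the layer
    return [sum(w_ij * a for w_ij, a in zip(row, a_in)) + b_j
            for row, b_j in zip(W, b)]

def forward_transparent(a0, layers):
    # Returns [A0, A1, A2, ...] so each layer's output can be inspected
    activations = [a0]
    a = a0
    for W, b, act in layers:
        a = act(dense(a, W, b))
        activations.append(a)
    return activations

# Hypothetical weights: L1 has two ReLU units, L2 is one linear unit
layers = [
    ([[1.0], [-1.0]], [0.0, 1.0], relu),
    ([[1.0, 2.0]], [0.5], identity),
]

acts = forward_transparent([2.0], layers)
for i, a in enumerate(acts):
    print(f"A{i} =", a)
```

This only exposes A1; it does not supply a target for it. Whether some implicit J_1 exists for that layer is the open question here, and the code takes no position on it.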

Hey @pritamdodeja, maybe in the ReLU lab you have found a very good way: first set the w, and then it is easier to set the b. I find it a good way too, and this is the human way. As @shanup pointed out, we use back propagation to optimize those parameters and, as you know, back propagation updates both parameters at the same time instead of first getting to the best w and only then to the best b. We can look at the boundary by eye with our 2D visual processing power, but back propagation doesn’t have that luxury.

I would say back propagation has infinitely good eyesight, because it’s able to see infinitely small df/dx to compute derivatives. Moreover, it has photographic memory in the form of the gradient tape :).