I know that this is just an off-hand example that was probably created on the fly, but I want to confirm my understanding: applying 2 ReLU activation functions directly in a row is equivalent to applying just one, right? So in a real deep neural network, we would most likely not do this.
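That part of the understanding is correct: ReLU is idempotent, so applying it twice back-to-back (with nothing in between) gives the same result as applying it once. A quick numeric check, using a plain scalar ReLU for simplicity:

```python
def relu(z):
    # ReLU clamps negative inputs to zero and passes non-negatives through
    return max(0.0, z)

# ReLU applied twice equals ReLU applied once, for any input
for z in [-2.0, -0.5, 0.0, 0.5, 3.0]:
    assert relu(relu(z)) == relu(z)
print("ReLU(ReLU(z)) == ReLU(z) for all tested z")
```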

However, that is NOT what the slide is showing. Each box in the slide represents a Dense layer, and writing "ReLU" in a box means that the Dense layer uses a ReLU activation. So the slide is actually talking about ReLU( W^{[2]} \, ReLU( W^{[1]} x + b^{[1]}) + b^{[2]}): there is a linear transformation between the two ReLUs.

But W^{[1]} x + b^{[1]} is also a linear transformation on x, right? So without the inner ReLU it would be two nested linear transformations, which collapse into a single one. Wouldn't that make it equivalent to using ReLU in one node?

No, because the inner ReLU changes the value before the second linear transformation. For example, if the inner W^{[1]} x + b^{[1]} is -1, and W^{[2]} is -1 and b^{[2]} is 0, then the inner ReLU(-1) is 0, so the whole expression becomes ReLU( -1 \times 0 + 0) = ReLU(0) = 0.

If we take away the inner ReLU, so it becomes ReLU( W^{[2]} ( W^{[1]} x + b^{[1]}) + b^{[2]}), then the answer changes to ReLU( -1 \times -1 + 0) = ReLU(1) = 1.

So, if we take away the inner ReLU, we are computing a different function. In this case, the two are not equivalent.
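The counterexample above can be checked directly. The sketch below plugs in the same made-up numbers (inner pre-activation = -1, W^{[2]} = -1, b^{[2]} = 0):

```python
def relu(z):
    return max(0.0, z)

# Values from the worked example: W1*x + b1 = -1, W2 = -1, b2 = 0
inner = -1.0      # the inner pre-activation W1*x + b1
W2, b2 = -1.0, 0.0

with_inner_relu = relu(W2 * relu(inner) + b2)   # ReLU(-1 * 0 + 0) = 0
without_inner_relu = relu(W2 * inner + b2)      # ReLU(-1 * -1 + 0) = 1

print(with_inner_relu, without_inner_relu)  # → 0.0 1.0
```

Two different outputs for the same input, so the inner ReLU genuinely matters.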