If we use ReLU in all the hidden layers and sigmoid for the output layer, wouldn't that be almost like using a plain sigmoid activation function without an ANN?
I understood how a linear activation function in the hidden layers with sigmoid in the output layer would make using an ANN pointless, not to mention a waste of resources, as Prof. Ng mentioned.
Wouldn't replacing the linear activation function with ReLU have the same effect (at least in some cases)?
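To make the collapse I'm describing concrete, here is a small NumPy sketch (the weights and layer sizes are made up purely for illustration): with a linear (identity) activation in the hidden layer, two stacked layers followed by a sigmoid give exactly the same output as one combined linear layer followed by a sigmoid, i.e. plain logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                                 # one input with 3 features
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # hidden layer (hypothetical)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))   # output layer (hypothetical)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer with a *linear* (identity) activation:
a1 = W1 @ x + b1
y_two_layer = sigmoid(W2 @ a1 + b2)

# The same result from a single combined linear layer + sigmoid,
# i.e. logistic regression -- the hidden layer adds nothing.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
y_one_layer = sigmoid(W_combined @ x + b_combined)

print(np.allclose(y_two_layer, y_one_layer))  # True
```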
I believe this confusion stems from the idea that ReLU is a linear function, but that is only half the story.
ReLU is linear on [0, \infty) but outputs a constant 0 on (-\infty, 0], which makes it non-linear over the whole real line. While we generally focus on the [0, \infty) part, the 0 output on the negative side is just as important: it quietly controls where the kinks (the points where the slope changes) are located, and those kinks are what allow a neural network to model just about any output function.
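To make that concrete, here is a minimal NumPy sketch (the weights and kink locations are made up for illustration): each unit relu(w*x + b) outputs exactly 0 until x = -b/w and is linear afterwards, so the bias picks where the kink sits, and summing a few units bends the curve at those chosen points, which no purely linear combination can do.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(-2.0, 2.0, 9)

# Each unit is flat (exactly 0) until x = -b/w, then grows linearly;
# the bias chooses the kink location.
h1 = relu(1.0 * x + 1.0)   # kink at x = -1
h2 = relu(1.0 * x + 0.0)   # kink at x =  0
h3 = relu(1.0 * x - 1.0)   # kink at x = +1

# A weighted sum of the units gives a piecewise-linear curve
# that changes slope at -1, 0 and +1.
y = 1.0 * h1 - 2.0 * h2 + 2.0 * h3
print(np.round(y, 2))
```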
That clears my doubt! I had that doubt because I assumed ReLU is a linear function and neglected to consider the negative range. The lab session that week cleared it up as well; there's a beautiful explanation with a graph that gives the intuition for why ReLU is non-linear.