Professor said that Leaky ReLU is slightly better than ReLU, but then why do people most frequently use ReLU? For ReLU, if the input is negative, don't the parameters end up not updating, because the derivative of ReLU is 0 there?
I had the same question, and I found this, which might be interesting to you as well. There's a paper that studied different types of leaky ReLU here. According to the paper, the three leaky ReLU variants studied consistently outperformed the original ReLU. However, the reasons for their superior performance still lack rigorous theoretical justification, and how these activations perform on large-scale data still needs to be investigated; maybe that's why ReLU is used more frequently. It's worth mentioning that the paper was published in 2015, and I'm not sure whether these problems have been solved since.
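If it helps, here's a minimal sketch (assuming PyTorch, not from the paper) of the gradient behaviour you're asking about: for a negative input, ReLU's gradient is 0, so the corresponding weights stop updating, while Leaky ReLU still passes a small gradient.

```python
import torch
import torch.nn.functional as F

# A single negative input so we can inspect the gradient directly
x = torch.tensor([-2.0], requires_grad=True)

# ReLU: output is 0 and the gradient w.r.t. x is 0 for negative inputs
y = torch.relu(x)
y.backward()
print(x.grad)  # tensor([0.])

x.grad = None  # reset the gradient before the next backward pass

# Leaky ReLU: output is negative_slope * x, so the gradient is negative_slope
y = F.leaky_relu(x, negative_slope=0.01)
y.backward()
print(x.grad)  # tensor([0.0100])
```

So with ReLU a unit whose input stays negative gets no gradient at all (the "dying ReLU" issue), whereas Leaky ReLU keeps a small gradient flowing.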