Hello @Zephyrus,

This post explained why neurons can behave differently: they are initialized to different values.

Then gradient descent guides the neurons' parameters to change so that the cost is minimized.
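To make that concrete, here is a minimal sketch of gradient descent on a toy cost J(w) = (w - 3)^2 (a hypothetical cost chosen only for illustration; the initial value and learning rate are assumptions):

```python
# Gradient descent on the toy cost J(w) = (w - 3)**2,
# whose gradient is dJ/dw = 2 * (w - 3).
w = 0.0    # hypothetical initial value
lr = 0.1   # assumed learning rate
for _ in range(100):
    grad = 2 * (w - 3)  # gradient of the cost at the current w
    w -= lr * grad      # step downhill
# w converges toward 3, the minimizer of J
```

The same update rule, applied to the w and b of every neuron, is what moves a network's parameters toward lower cost.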

ReLU itself is a piecewise linear function (it changes direction at x=0), and this property is "inherited" by any function that is a sum of ReLU functions. For example, suppose you have two ReLUs: ReLU(x) and ReLU(x-1).

ReLU(x) turns at x=0, and ReLU(x-1) turns at x=1. If you add the two up, the resulting ReLU(x) + ReLU(x-1) turns first at x=0 and then again at x=1. In general, the location of each turning point is decided by the parameters w and b in ReLU(wx+b), and those parameters are adjusted by gradient descent.
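You can verify the turning points numerically; a small sketch using NumPy (the sample x values are arbitrary):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), elementwise
    return np.maximum(0.0, z)

xs = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
f = relu(xs) + relu(xs - 1.0)
# Piecewise linear with kinks at x=0 and x=1:
#   slope 0 for x < 0, slope 1 for 0 < x < 1, slope 2 for x > 1
print(f)  # [0.  0.  0.5 1.  3. ]
```

Each ReLU contributes one kink, so a network with many ReLU units can bend its output at many points, and gradient descent moves those kinks by updating each unit's w and b.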

Raymond