Hi there,

In addition to the other answers:

Whereas in a linear regression model the parameters are fitted (so, for example, the gradient is generally not equal to 1, and, as @Elemento pointed out, this can also be the case in an n-dimensional space), ReLU has a fixed definition as a function of a single input: it passes positive numbers through unchanged (gradient = 1, since y = 1·x) and maps everything else to zero. The ability of a neural net to describe and learn non-linear characteristics and cause-effect relationships comes from the combination of many neurons, where the non-linearity emerges from the negative part of the ReLU function being set to zero. During training, the "best" parameters (weights and biases) are learned by minimizing a cost function.
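To make the contrast concrete, here is a minimal sketch in plain Python. The function names (`relu`, `relu_grad`, `linear`) are just illustrative choices of mine, and setting the gradient to 0 exactly at x = 0 is a common convention rather than the only option:

```python
def relu(x: float) -> float:
    """Pass positive inputs through unchanged; map everything else to zero."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise.

    The value at exactly x == 0 is a convention (assumed to be 0 here).
    """
    return 1.0 if x > 0.0 else 0.0


def linear(x: float, w: float, b: float) -> float:
    """A linear model y = w * x + b; its slope w is a fitted parameter."""
    return w * x + b
```

Note that `relu` has nothing to fit, whereas the slope of `linear` is whatever value of `w` the training procedure ends up with.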

Each neuron has its own weights and bias, followed by an activation function. Combining many such neurons, as the neural net as a whole does, makes it possible to learn highly non-linear behavior, even though the activation function of a single neuron is only piecewise linear in the case of ReLU. See also:

- Choice of activation function - #8 by Christian_Simonis
- Differences between ReLU and linear for positive values - #3 by Christian_Simonis
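As a small illustration of how combining neurons yields non-linear behavior, the sketch below wires two ReLU neurons together so that the output equals |x|. The weights and biases are hand-picked by me for the example (not learned), and `neuron` and `tiny_net` are hypothetical helper names:

```python
def relu(x: float) -> float:
    """ReLU activation: positive inputs pass through, the rest become zero."""
    return max(0.0, x)


def neuron(x: float, w: float, b: float) -> float:
    """One neuron: weighted input plus bias, followed by ReLU."""
    return relu(w * x + b)


def tiny_net(x: float) -> float:
    """Two hidden ReLU neurons summed by a linear output layer.

    With these hand-picked weights the net computes |x|.
    """
    h1 = neuron(x, w=1.0, b=0.0)   # active for positive x
    h2 = neuron(x, w=-1.0, b=0.0)  # active for negative x
    return 1.0 * h1 + 1.0 * h2     # output weights, also hand-picked


for x in (-3.0, -1.0, 0.0, 1.0, 2.0):
    print(x, tiny_net(x))
```

Each neuron on its own is piecewise linear, but their sum already violates linearity: `tiny_net(1.0) + tiny_net(-1.0)` is 2.0, whereas `tiny_net(0.0)` is 0.0. With more neurons and learned weights, the same mechanism lets the net approximate much more complicated shapes.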

Best regards

Christian