Exactly as Tom says: a function is either linear or it isn't, and ReLU is "piecewise" linear, which makes it nonlinear. It might seem counterintuitive, but it works. You can think of ReLU as the "minimalist" activation function: it's incredibly cheap to compute and provides just the bare minimum of nonlinearity. It acts like what they call a "half-wave rectifier" in the signal processing world (that's where the name Rectified Linear Unit comes from): it zeros all negative values and passes the positive values through unchanged.

It doesn't always work, because returning zero for all negative inputs is a version of what Prof Ng will later call the "dead neuron" problem. I haven't taken MLS, so I'm not sure if he discusses that there, but he does in DLS.

Because of ReLU's low compute cost, it is common to try it first as the hidden layer activation, and in a lot of cases it works just fine. If you don't get good training results with it, you then try Leaky ReLU, which is almost as cheap to compute. If that also doesn't give good results, only then do you graduate to more computationally expensive functions like tanh, sigmoid, swish and others.
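Just to make the behavior concrete, here's a minimal NumPy sketch (my own illustration, not from the course materials) of ReLU and Leaky ReLU. The 0.01 negative-side slope in Leaky ReLU is a commonly used default, but it's really a hyperparameter you can tune:

```python
import numpy as np

def relu(z):
    # Zero out the negatives, pass the positives through unchanged
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Same idea, but negatives get a small slope instead of a hard zero,
    # which keeps some gradient flowing and helps avoid "dead" neurons
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))        # roughly [0, 0, 0, 1.5, 3]
print(leaky_relu(z))  # roughly [-0.02, -0.005, 0, 1.5, 3]
```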