I remember that the reason we have a sigmoid function is that the outcome y_hat is a probability between 0 and 1. If we use ReLU to approximate the sigmoid function, and ReLU is defined as y_hat = max(0, z), then z is not capped at 1. What happens when z > 1 and so y_hat > 1?

Hi @MonsterCookieJar, thanks for your question and welcome to the community. I hope the course is interesting and useful for you.

With respect to your question: ReLU is not approximating sigmoid. Sigmoid indeed has a value between 0 and 1, which is useful for binary classification. Tanh is another activation function that is similar to sigmoid, except that it ranges from -1 to 1. The problem with both sigmoid and tanh is that the slope (gradient) of the curve goes rapidly to 0 for large positive and large negative values of z (the sigmoid curve flattens out toward 1 or 0), which in practice slows down learning. With ReLU, the slope is fixed at 1 for all positive z and does not shrink toward 0 as z grows, so in practice the model can learn much faster. That is one reason why ReLU is used most often in the hidden layers of a neural network.
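To make the point about slopes concrete, here is a minimal sketch in plain Python that prints the derivative of each activation at a few values of z. The function names (`sigmoid_grad`, `tanh_grad`, `relu_grad`) and the sample z values are just my own choices for illustration, not something from the course notebooks.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), at most 0.25 (reached at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, at most 1 (reached at z = 0)."""
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, else 0 (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


# Illustrative sample points only: compare slopes near 0 and far from 0.
for z in (-10.0, -2.0, 0.5, 2.0, 10.0):
    print(
        f"z={z:6.1f}  sigmoid'={sigmoid_grad(z):.6f}  "
        f"tanh'={tanh_grad(z):.6f}  relu'={relu_grad(z):.1f}"
    )
```

If you run it, you should see the sigmoid and tanh slopes become tiny at z = 10 and z = -10, while the ReLU slope stays at 1 for every positive z. Note that the ReLU slope is 0 for negative z, so the "does not go to 0" argument only applies on the positive side.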

Exactly. Please note that we do *not* use ReLU to approximate sigmoid. They are completely different. We use ReLU only as the activation function in the “hidden” layers of a network, where we do not need the property of sigmoid that matters in the output layer: that its result looks like the probability that the prediction is “yes”. A hidden-layer ReLU value greater than 1 is therefore not a problem, because it is never interpreted as a probability. That property is what makes sigmoid the activation function of choice for the output layer anytime the network’s goal is a binary (“yes/no”) classification.
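Here is a rough sketch of that division of labor: a tiny forward pass with a ReLU hidden layer and a sigmoid output unit. The weights, biases, input, and the `forward` helper are all made up for illustration (hypothetical values, not a trained model and not the course's code). The hidden activations come out well above 1, yet y_hat still lands strictly between 0 and 1.

```python
import math


def relu(z):
    """ReLU: max(0, z). The output is not capped above."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Forward pass: ReLU hidden layer followed by a single sigmoid output."""
    # Hidden layer: one weighted sum per hidden unit, passed through ReLU.
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    # Output layer: weighted sum of hidden activations, passed through sigmoid.
    z_out = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, sigmoid(z_out)


# Hypothetical parameters and input, chosen only so the hidden values exceed 1.
W1 = [[2.0, -1.0], [0.5, 3.0]]
b1 = [0.0, -1.0]
w2 = [1.5, -0.5]
b2 = 0.1
x = [3.0, 1.0]

hidden, y_hat = forward(x, W1, b1, w2, b2)
print("hidden activations:", hidden)  # values above 1 are fine here
print("y_hat:", y_hat)                # still strictly between 0 and 1
```

The hidden values are just intermediate features, so nothing requires them to stay below 1. Only the final sigmoid output is read as a probability.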

Ah! Very helpful. Thank you so much.