In the first lecture Prof. Ng specifies that, for the first region, ReLU is defined as ‘max of 0’. I am not sure why it is not just defined as zero. How does the ‘max’ part fit in? It would seem that any other value would give a curve that does not look like ReLU (?).
relu(x) = max(0, x)
The input, i.e. x, can take on any real value. The output is non-negative. Read this link to learn more about ReLU.
The point is that ReLU is 0 for all negative input values, but for positive input values it does not modify the input value. So saying:
ReLU(z) = max(0, z)
is one nice simple way to express that. You could also use a more complex “conditional” way to express it like this:
ReLU(z) = \begin{cases} z & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}
The latter formulation may be more clear, but the advantage of the first way is that it translates directly into the implementation in numpy.
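For example, a minimal numpy sketch along those lines might look like this (the sample values are just for illustration):

```python
import numpy as np

def relu(z):
    # np.maximum compares 0 and z element-wise: negative entries
    # become 0 and positive entries pass through unchanged.
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))  # [0.  0.  0.  1.5 3. ]
```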
Oh, okay, now this makes more sense. Perhaps I misheard him say ‘max 0’ or something, but expressed like this the equation is much clearer now. Thanks both.
This is a very clear representation of ReLU. And when you look at the ReLU graph you’ll see that any input less than or equal to zero maps to 0 (flat on the x-axis), while only values greater than 0 take on nonzero values on the y-axis.
To the right of the origin, the graph of ReLU is just the 45 degree line y = x, because ReLU passes positive values through unchanged. So it is “piecewise linear” with the “kink” at the origin. But that still makes it a non-linear function, which is what we require for an activation function. In math there is no such thing as “almost linear”: it’s either linear or it’s not.
Hmm… So even in the first lecture I realize I have kind of waded into a big subject with only a ‘beginner’s’ mind. So I figure I might ask: yes, I understand, running a tanh every time is very difficult computationally. Yet ReLU kind of suggests a neuron is ‘always on’ (or not). I understand this is what practitioners have found works best in implementation. Yet it strikes me a little strange that you can’t ‘temper’ the signal, only drive it to zero.
For the hidden layers, you have quite a few choices for the activation functions. What you will see in this and the subsequent courses is that Prof Ng normally uses ReLU for the hidden layer activations. You can think of ReLU as the “minimalist” activation function: it’s dirt cheap to compute and provides just the minimum required amount of non-linearity. But it has some limitations as well: it has the “dead neuron” or “vanishing gradient” problem for all z < 0, so it may not work well in all cases. But it seems to work remarkably well in lots of cases. So it looks like there is a natural order in which you try the possible hidden layer activation functions: start with ReLU, if that doesn’t work well then try Leaky ReLU, which is almost as cheap to compute and eliminates the “dead neuron” problem. With Leaky ReLU you also can try different values of the slope for negative values. If that doesn’t work, then you try the more expensive functions like tanh, sigmoid, swish or other possibilities.
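To make the Leaky ReLU idea concrete, here is a minimal sketch; the slope value alpha = 0.01 is just a common default, not a prescribed choice:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Same as ReLU for z > 0, but negative inputs are scaled by a
    # small slope alpha instead of being zeroed, so the gradient
    # for z < 0 is alpha rather than 0 (no "dead neurons").
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(leaky_relu(z))  # [-0.02  -0.005  0.     1.5    3.   ]
```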
Thanks Paulin,
Again, interestingly, related to what you mention, I recently stumbled upon this old article by Karpathy where he brings this issue up too:
Thus, I think this will be the hardest part for me to wrap my brain around. I mean, if somehow you have the formula (your target), then you can ‘divine’ the derivative.
Though to the best of my understanding thus far, you just don’t have that, so you kind of poke and prod. Or perhaps I am already misunderstanding ‘gradient descent’?
I haven’t read the Karpathy article yet, but gradient descent is not something mysterious. The gradients are just the derivatives of the various functions that you use and those are known. You don’t “divine” the derivatives: it’s just calculus. You are defining the various activation functions. Isaac Newton could have figured out that part in 1710 if you had given him the above formula for ReLU. That plus the Chain Rule and you’re there.
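For instance, here is a sketch of what the ReLU derivative might look like in code; treating the derivative at z == 0 as 0 is just a common convention, since it is technically undefined there:

```python
import numpy as np

def relu_grad(z):
    # dReLU/dz is 1 where z > 0 and 0 where z < 0. At exactly z == 0
    # the derivative is undefined; by convention we use 0 there.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu_grad(z))  # [0. 0. 1.]
```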
Ok, the article is great. The general rule is “anything written by Andrej Karpathy is worth your time to read”.
He’s just explaining why back prop is not guaranteed to work in all cases. When we design networks, we are making a lot of choices. There are right and wrong choices that can be made and the problem is that what is right and what is wrong is “situational”: there is not one magic “silver bullet” solution that works in all cases. So you need to understand some of the pitfalls to guide you in making choices. I just waved my hands at that in my earlier response where I mentioned that ReLU doesn’t always work, but it’s worth trying as your first choice because in cases where it does work, its computational efficiency is a big win.
The question of how to make these choices will be a continuing theme of the courses. Here in Course 1, we are just getting started and there is just too much more fundamental material to cover first, but choosing what Prof Ng calls “hyperparameters” will be a major theme of Courses 2 and 3, so “hold that thought” and stay tuned! There is lots more interesting material ahead.
I don’t disagree, Paulin… But if you knew the equation you were trying to solve, you wouldn’t need a ‘neural net’ to solve it, right? That would be like a standard calculus problem…
Ok, I think we are saying different things. I thought you were saying that Karpathy said that we didn’t know the derivatives of our functions. That’s not what he’s saying: he’s saying what I said in my immediately previous reply: you have lots of different choices for the architecture of your network and that is what you need to experiment with. For any given set of choices, that gives you the function and you take its derivative. The problem is that those derivatives sometimes have properties that interact badly with your particular network architecture and data (e.g. the flattening of the tails of sigmoid and tanh or the “dead neuron” problem of ReLU).
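To see what that “flattening of the tails” looks like numerically, here is a small sketch for sigmoid (the sample z values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d(sigmoid)/dz = sigmoid(z) * (1 - sigmoid(z)); it peaks at
    # 0.25 at z = 0 and shrinks toward 0 in both tails.
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(np.array([0.0, 5.0, 10.0])))
# approximately [0.25, 0.0066, 0.000045]: the gradient vanishes for large |z|
```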
And the point of “Machine Learning” is that the machine (the algorithm) actually learns the function. The function is determined by the values of both the hyperparameters (things you need to choose) and the parameters (the weight and bias values that are actually learned through back propagation). So you don’t know the function a priori and the algorithm creates it for you.
Oh no friend, like I said I am still trying to learn, so sorry if I ask ‘dumb questions’. However, a Turing/von Neumann machine is not so ‘great’ at doing calculus compared to a pure analog machine (you can do all the smooth curves, and I’ve thought a bit about this with either FPGAs or Analog Devices chips).
Not a priori is fine, but this strikes me as a ‘little bit’ devious. In any case, I think that is why we play (?).
These are all good, relevant and constructive thoughts. Maybe I said this incorrectly:
We do have a function at each stage of the training: it is determined by the current values of the weights that we pick. We start with random values for the weights but within a well-defined structure (number of layers, choices of activation functions and so forth) and the function doesn’t work very well. Then the training happens by computing the derivatives of the cost w.r.t. the current (not very good) parameters and then we use the gradients to push them incrementally in a better direction. If we have made good choices as Karpathy was describing, then those incremental training steps will eventually get us to a function that really works well and solves our problem.
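As a minimal sketch of that incremental step, here is gradient descent on a toy one-parameter cost; the quadratic cost and learning rate are purely illustrative, since a real network computes gradients of its actual cost w.r.t. all the weights via back propagation:

```python
import numpy as np

def cost(w):
    # Toy cost with its minimum at w = 3.0
    return (w - 3.0) ** 2

def cost_grad(w):
    # Analytic derivative of the toy cost: dJ/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = np.random.randn()                    # random starting value
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * cost_grad(w)    # nudge w in the downhill direction

print(w, cost(w))  # w converges toward 3.0; the cost approaches 0
```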
Anyway, this is just Week 1 of Course 1, so maybe we are getting ahead of ourselves a bit here. Just “hold those thoughts” and learn all that Prof Ng has to tell us here. Things will be much more concrete by the time you get through Course 1.
Also note that it’s not the machine that’s doing calculus: we do that, either by working out the analytic derivatives of functions like sigmoid, ReLU and tanh and then writing the code to implement those derivative functions, or by writing the code to approximate derivatives using “finite difference” methods, which we’ll see in Course 2. Then we have to write the code to implement back propagation. But once we’ve done all that, then we turn the algorithm loose on the training data and the learning takes place.
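For reference, a minimal sketch of a two-sided finite difference approximation might look like this (the epsilon value is a typical but arbitrary choice):

```python
import numpy as np

def finite_difference(f, x, eps=1e-7):
    # Two-sided finite difference: approximate f'(x) by evaluating f
    # slightly above and below x, as in the gradient checks of Course 2.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(finite_difference(sigmoid, 1.0))  # ~0.19661, matches the analytic derivative
```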
But as I said a minute ago, it probably makes more sense to just cruise ahead with the rest of Course 1 instead of listening to me. It will all make more sense once you’ve heard what Prof Ng says and worked the assignments.
I am in full agreement. It is just nice to have/find a good community. Otherwise, not a big deal. I know I may be wrong, but I am thinking. I mean, I recently completed the HarvardX Professional Certificate in Data Science. I thought it was excellent, but we never really dived into ‘deep learning’, so I am trying to make up for that a little bit.