In the first question of the graded quiz we are asked what is true about gradient descent. One of the possible answers is “*It only works for differentiable functions*”, which is accepted as true. However, in the videos and practice lab we have the function *e^x - log(x)*, whose derivative is *e^x - 1/x*. We can see that the derivative **does not** exist at *x = 0*, yet we still **do** apply gradient descent.

Why is it so? Does gradient descent work for non-differentiable functions or do we apply gradient descent where we should not?

The function e^x-\log(x) is not defined at x=0. The function is differentiable everywhere it is defined.
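A quick numerical check makes this concrete (plain Python with only the standard `math` module; the names `f` and `df` are my own, not from the course): log(x) requires x > 0, so f and its derivative share the same domain.

```python
import math

def f(x):
    # f(x) = e^x - log(x); log(x) requires x > 0, so f is undefined for x <= 0
    return math.exp(x) - math.log(x)

def df(x):
    # f'(x) = e^x - 1/x, which exists everywhere f itself is defined (x > 0)
    return math.exp(x) - 1.0 / x

print(f(1.0))  # e^1 - log(1) = e
# f(0.0) raises ValueError ("math domain error"): x = 0 is not in the domain,
# so there is no point where f exists but its derivative does not.
```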

Can you provide a link to the videos or lab work where we apply gradient descent despite an undefined gradient? My access to the course material is now limited, but perhaps with the appropriate context I can still help.

As for your original question: technically speaking, the gradient descent update formula depends by definition on the derivative, so if the (partial) derivative does not exist at a point, the update itself is undefined there in theory.
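In symbols, one update is x ← x − α f′(x). A minimal sketch of that step (the names `gradient_descent_step`, `grad`, and `lr` are my own, not from the course):

```python
def gradient_descent_step(x, grad, lr=0.1):
    # One iteration of gradient descent: x_new = x - lr * grad(x).
    # The step is only defined where grad(x) exists, which is the point above.
    return x - lr * grad(x)

# e.g. for f(x) = x^2, whose derivative is 2x:
x_new = gradient_descent_step(3.0, lambda v: 2 * v)  # 3.0 - 0.1 * 6.0 = 2.4
```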

Again, I don't have access to the material any longer, but if, for instance, you are referring to the ReLU activation function, whose derivative is technically undefined precisely at x=0, then in practice treating the derivative as either 0 (as is the case when x<0) or 1 (as is the case when x>0) will be fine in either of the following two scenarios:

A) You find yourself in the practically impossible scenario where x is exactly zero. If you do, arbitrarily pick a side (i.e. choose either 0 or 1 as the slope of ReLU at x=0). After an iteration of gradient descent, it is almost guaranteed that you will not face the same x=0 scenario the following iteration.

B) x is not exactly 0, so you don't need to worry about the undefined derivative at x=0.

The point is, whether you treat the derivative at x=0 as 1 or 0, you will still end up converging to the same value.
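Here's a sketch of how this arbitrary choice can look in code (the `at_zero` parameter is my own name for the choice, not an API from the course; most frameworks hard-code 0):

```python
def relu(x):
    # ReLU activation: max(0, x)
    return max(0.0, x)

def relu_grad(x, at_zero=0.0):
    # The derivative is 0 for x < 0 and 1 for x > 0; at exactly x = 0 it is
    # undefined, so we arbitrarily pick a side via at_zero.
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero

print(relu_grad(0.0))               # 0.0, the conventional choice
print(relu_grad(0.0, at_zero=1.0))  # 1.0 also works fine in practice
```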

I may have missed some points trying to explain this, but I’m open to hearing what you all have to say. And I absolutely encourage anyone who understands this stuff better than I do to chime in, especially if my explanation is off, because I’d love to learn from you as well!

So basically a function can be undefined at some points, or even over an interval, yet still be differentiable where it is defined, and that’s enough for gradient descent? On the other hand, if a function is defined on some interval but its derivative is not defined on that same interval, then the function is not differentiable there and gradient descent wouldn’t work. Is this correct?

Here is a video: https://www.coursera.org/learn/machine-learning-calculus/lecture/daSiv/optimization-using-gradient-descent-in-one-variable-part-2

It uses the formula I mentioned in the original post, but @Titus_Teodorescu’s hints suggest I might have gotten it wrong. See my understanding so far in the comment right above this one.

Ok, thanks for that link. For some reason I was under the impression that you were talking about a function that is defined everywhere but whose derivative is undefined at a specific point, such as the ReLU activation function.

As @Titus_Teodorescu pointed out, f’(x) is defined everywhere that f(x) is defined. Therefore, gradient descent will work on the parts of the domain where the function is defined. Gradient descent needs a derivative value in order to iterate to the next step, so yes, the function needs to be differentiable over a region in order to use gradient descent over that region.

I came here with the same question. Now that I’ve read the thread, I understand I was confusing a function’s differentiability with “difficult to optimize.” To summarize:

- A differentiable function is one whose derivative exists at every point in its domain.
- The function e^x - log(x) IS differentiable, and its derivative is e^x - 1/x.
- Minimizing e^x - log(x) analytically is hard (e^x - 1/x = 0 has no closed-form solution), so we use gradient descent.
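To tie it together, here is a short run of gradient descent on f(x) = e^x - log(x) (the starting point and learning rate are my own choices, not the course's): setting the derivative to zero gives e^x = 1/x, whose solution is roughly x ≈ 0.567, and the iterates converge there.

```python
import math

def df(x):
    # f'(x) = e^x - 1/x for f(x) = e^x - log(x)
    return math.exp(x) - 1.0 / x

x, lr = 1.0, 0.05  # start inside the domain (x > 0); step size picked by hand
for _ in range(1000):
    x -= lr * df(x)

print(round(x, 4))  # ≈ 0.5671, the point where e^x = 1/x
```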