Hey Sir,
I have noticed that the method of gradient descent requires the cost function to be differentiable at the parameter values theta. But, in general, there will be cases where the cost function is not differentiable at some theta.

I wonder if there are any other optimisation methods we could use to find the optimal solution?

The cost functions used in ML/DL are all differentiable. That is required because we need the gradients in order to perform the optimization. Strictly speaking, every function applied throughout the sequence needs to be differentiable, but we can sort of “skate by” with a few isolated points of non-differentiability. The most obvious examples are the ReLU and Leaky ReLU activation functions: each of those is non-differentiable at z = 0, but in the ReLU case you can return either 0 or 1 as g'(0), and it turns out that this does not cause any problems in practice.
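To make that concrete, here is a minimal sketch of ReLU and its derivative in NumPy. The `grad_at_zero` parameter is just an illustrative name for the arbitrary value we assign at the single non-differentiable point z = 0:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_grad(z, grad_at_zero=0.0):
    """Derivative of ReLU. The true derivative does not exist at z = 0,
    so we return an arbitrary chosen value (0 or 1) there -- in practice
    the choice does not matter."""
    g = (z > 0).astype(float)
    g[z == 0] = grad_at_zero
    return g

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                        # [0. 0. 3.]
print(relu_grad(z))                   # [0. 0. 1.]
print(relu_grad(z, grad_at_zero=1.0)) # [0. 1. 1.]
```

Either choice at z = 0 is a valid subgradient, which is why training works fine despite the kink.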

Gradient descent is just one example of the general class of gradient-based optimization algorithms (which also includes the Conjugate Gradient Methods), but all of those involve gradients. I have not done any research on this specific question, but what you are trying to do is find a minimum point on a high-dimensional surface. The other general minimization approach is the multidimensional equivalent of “set the derivative to zero and solve”. Note that for the type of problem we have, that approach makes things more, not less, complicated: we still end up with an equation for which there is no analytic solution. So you need an iterative approximation of some sort, e.g. the multidimensional analog of the Newton-Raphson Method for finding the zeros of a univariate function. Of course that also involves calculus: now we need the second derivative, which we use to approximate the zeros of the first derivative (the gradient).
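Here is a small univariate sketch of that idea, assuming a toy cost function f(x) = x^4 - 3x^3 + 2 (chosen just for illustration): we apply Newton-Raphson to f'(x) = 0, so each update uses the second derivative:

```python
def newton_minimize(f1, f2, x0, tol=1e-10, max_iter=100):
    """Find a stationary point of f by running Newton-Raphson on f'(x) = 0.
    f1 is the first derivative, f2 the second; each step is
    x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = f1(x) / f2(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = x^4 - 3x^3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
# Since f'(x) = x^2 (4x - 9), the minimum is at x = 9/4 = 2.25.
x_min = newton_minimize(lambda x: 4 * x**3 - 9 * x**2,
                        lambda x: 12 * x**2 - 18 * x,
                        x0=3.0)
print(x_min)  # ~2.25
```

In the multidimensional case the second derivative becomes the Hessian matrix and the division becomes a linear solve, which is exactly why second-order methods are expensive for the huge parameter counts we see in DL.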