When designing a cost function with many features, we cannot practically visualize what the function looks like. In that case, do we use the gradient ∇f (first-order derivatives) and the Hessian matrix (second-order derivatives) to determine whether the function is convex (concave up), concave (concave down), or neither?

For example, where ∇f = 0:
H > 0 (convex) or H < 0 (concave)

Can you please help me understand how to decide on a cost function? With the gradient descent algorithm, we run it multiple times to be sure we are not stuck in a local minimum or local maximum. How do gradient descent and concave/convex functions go hand in hand?

In practice, local minima aren’t much of a problem. As long as you find a minimum that gives “good enough” performance, you don’t need to find a better one.
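To make the "run it multiple times" idea from the question concrete, here is a minimal sketch of gradient descent restarted from several starting points. The one-dimensional cost `w**4 - 3*w**2 + w`, the learning rate, the step count and the starting points are all made up for illustration; they do not come from any course material. This toy cost has two minima, so different starts can settle in different places.

```python
# Hypothetical non-convex cost with two local minima (illustration only).
def cost(w):
    return w**4 - 3 * w**2 + w

# Its derivative, worked out by hand.
def grad(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Restart from several (arbitrary) starting points and compare the results.
starts = [-2.0, -0.5, 0.5, 2.0]
results = {w0: gradient_descent(w0) for w0 in starts}

# Keep the start whose final point has the lowest cost.
best_start = min(results, key=lambda w0: cost(results[w0]))

for w0, w in results.items():
    print(f"start={w0:+.1f} -> w={w:+.4f}, cost={cost(w):+.4f}")
```

The negative starts and the positive starts end up in different minima here, and comparing the final costs picks the better one. With a real model you would usually just keep whichever run gives good enough performance.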

Yes, as Tom says, the normal cost functions we use for linear regression (MSE) or logistic regression (BCE) are convex in those cases. But note that if we use those same cost functions with multi-layer neural networks (perceptrons), then they are no longer convex. The point is that the cost function is the complete function that takes the parameters of the network (weights and bias values at all layers) as input and maps that to the cost, based on the training data. Of course we are also in very high dimensions, since it’s typical for a neural network to have thousands or even millions of parameters. The cost surfaces are very complex and impossible for us to really visualize with our human brains evolved to perceive in only 3 spatial dimensions.
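As a small sketch of why MSE with linear regression is convex: a function is convex when its Hessian is positive semidefinite everywhere, not only at points where ∇f = 0 (at such a point, a positive definite Hessian only tells you it is a local minimum). Assuming the cost is written as J(w) = (1/m) Σ (xᵢ·w − yᵢ)², the Hessian is (2/m) XᵀX, which does not depend on w at all. The toy data and the helper names `mse_hessian` and `is_psd_2x2` below are hypothetical, and the bias term is left out for brevity.

```python
# Hypothetical toy data: 4 examples, 2 features (no bias term, for brevity).
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]]

def mse_hessian(X):
    """Hessian of J(w) = (1/m) * sum((x_i . w - y_i)^2), which is (2/m) * X^T X.

    Note that neither w nor y appears: the Hessian is the same at every w.
    """
    m, n = len(X), len(X[0])
    return [[2.0 / m * sum(row[i] * row[j] for row in X) for j in range(n)]
            for i in range(n)]

def is_psd_2x2(H, tol=1e-12):
    """A symmetric 2x2 matrix is positive semidefinite iff both diagonal
    entries and the determinant are non-negative."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    return a >= -tol and d >= -tol and a * d - b * b >= -tol

H = mse_hessian(X)
print("Hessian:", H)
print("Positive semidefinite everywhere:", is_psd_2x2(H))
```

With more than two features you would check that all eigenvalues of the Hessian are non-negative using a linear algebra library instead of the 2x2 shortcut. For a multi-layer network the Hessian does depend on the parameters and can have negative eigenvalues at some points, which is one way to see that the cost is no longer convex.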

There is a lot of math going on here and the questions have been studied and (as Tom also pointed out) the experts have figured out how to make gradient descent work in a lot of cases. If you take DLS, for example, you’ll learn about more sophisticated techniques like Adam, RMSprop and so forth that are useful for getting efficient convergence. Here’s a thread which discusses the general point in more detail and also links to some other information about this, including a paper by Yann LeCun’s group showing that for sufficiently complex networks, there are good solutions that we have a good probability of finding with gradient descent.