In addition to @gent.spah's great answer:

- Side note: in linear regression the optimum can be calculated analytically with the normal equation, which works particularly well if the number of features is not too large and the data set is not huge. Otherwise (very many features and really big data), gradient descent can be superior: it optimizes iteratively and needs no matrix inversion step (roughly cubic complexity in the number of features), in contrast to the analytical solution with the normal equation.
- In general: for very complex optimization problems you cannot just plot the cost, on the one hand because it is usually multi-dimensional (as @gent.spah correctly stated), and on the other hand because the cost is often not cheap or simple to compute across the whole parameter space. Instead, we use gradient descent to take the next step in the direction of the negative gradient, i.e. the direction in which we expect the cost to decrease towards the optimum, and then carefully check again (over and over).
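To make the comparison concrete, here is a minimal sketch for the simplest case (one feature plus an intercept), written in plain Python so it runs without extra packages. The function names, the toy data, the learning rate and the step count are my own assumptions for illustration, not something from the course material. With many features you would typically solve the normal equation with a linear algebra library instead of the hand-written 2x2 solution used here.

```python
def normal_equation(xs, ys):
    """Closed-form least-squares fit of y = b0 + b1 * x.

    Solves (X^T X) theta = X^T y by hand for the 2x2 case
    (intercept + one feature). Illustrative helper, not a library API.
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


def gradient_descent(xs, ys, lr=0.1, steps=2000):
    """Iterative fit of the same model by batch gradient descent.

    Minimizes J = 1/(2n) * sum((b0 + b1 * x - y)^2).
    lr and steps are assumed values that happen to suit this toy data.
    """
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / n                             # dJ/db0
        g1 = sum(e * x for e, x in zip(errors, xs)) / n  # dJ/db1
        # Step against the gradient, then re-evaluate in the next iteration.
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


# Toy data generated from y = 1 + 2x without noise (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]

print("normal equation: ", normal_equation(xs, ys))
print("gradient descent:", gradient_descent(xs, ys))
```

Both approaches should end up at (approximately) the same parameters here. The difference is in how they get there: the normal equation solves a linear system once, while gradient descent only needs the gradient at the current point and repeats small steps.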

These threads might be interesting for you, too:

- Gradient Descent for multiple feature linear regression - #9 by AbdElRhaman_Fakhry
- Supervised learning - #7 by Christian_Simonis

Hope that helps!

Best regards

Christian