Dumb question alert. Once upon a time, I used to collect data, graph it, curve fit it (straight line, polynomial, or exponential) and then make predictions via interpolation and extrapolation. I forget how I did it, but I don’t remember it being difficult.
My first question is, what is the difference between what I used to do and “machine learning”?
Second, in the course, a great deal of time is spent on gradient descent, etc. in order to find the best fit. But I did not do that back in the day. Perhaps I was using a library that performed the optimization without me knowing. I computed the cost function (it wasn’t called that back then) to determine the quality of the fit, but that’s all. Why does it appear to be so much more difficult in relation to ML?
It sounds like you used all the data (100% training data) to fit the model and then went straight to extrapolation. A good practice in ML is to use:
a training set to fit the model (e.g. 70% or so of the data)
a validation set to tune hyperparameters or improve features
a final test set, never seen by the model, as the final litmus test before deploying
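The three-way split above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `split_indices`, the 70/15/15 proportions, and the fixed seed are my assumptions, not something prescribed by the course.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into train / validation / test parts.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # random order, no repeats
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]          # whatever is left over
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

You would fit on the rows in `train_idx`, compare model variants on `val_idx`, and look at `test_idx` only once at the very end.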
Here you can find a nice outline from Prof. Andrew Ng:
Note: ML is a highly iterative process. One reason is that in practice you will often observe a distribution shift in the data. The CRISP-DM method is often used to work iteratively from the business problem through several steps to deployment and operation, so that the actual benefit is realised in a data-driven way.
To fit the model you do not necessarily need gradient descent. There are several other ways to solve the optimization problem to obtain your model parameters. E.g. in a linear regression you can calculate them analytically.
It sounds like you were probably using a technique like Least Squares Curve fitting in the “back in the day” version that you are remembering. Prof Ng does show us Linear Regression, which is the simplest case of Least Squares curve fitting. In that case, there actually is a “closed form” solution called the Normal Equations, so you either used that and didn’t have to deal with iterative approximations or were using a package that hid the solution method.
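To make the comparison concrete, here is a minimal sketch that fits the same straight line twice: once with the Normal Equations and once with plain batch gradient descent. The synthetic data, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption for this sketch): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, 200)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# 1) Closed form: solve the Normal Equations (X^T X) theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Iterative: batch gradient descent on the mean squared error cost.
theta_gd = np.zeros(2)
learning_rate = 0.5            # assumed value, small enough to be stable here
for _ in range(5000):
    gradient = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= learning_rate * gradient

print("normal equations :", theta_closed)
print("gradient descent :", theta_gd)
```

Both approaches minimize the same least-squares cost, so they land on the same intercept and slope. The closed form does it in one step, which is probably why the "back in the day" version felt effortless.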
You could consider that the simplest version of ML or the precursor to ML. What happens after that is that we “graduate” to data like images, in which the level of complexity is so high that we need a much more complex function in order to “fit” the data. Just a polynomial curve (even one of high degree) is not going to give us a complex enough “decision boundary”.
Thanks for all your comments. @paulinpaloalto , yes, I believe it was the least squares curve fitting method. Everyone’s comments and links have been helpful!
When I studied Physics at university, Wikipedia hardly mentioned “Machine Learning” on any of the pages for those popular statistics concepts. I remember the moment I was shocked by how it had become ubiquitous. Back then, we could use Microsoft Excel Solver to do fitting, or as you said, we could call libraries in MATLAB/C/LabVIEW/Mathematica/etc. to do linear or non-linear fitting, and fitting was just called fitting, never training or learning.
At that time I did look into the algorithms behind the scenes, and if you ask me about the difference in non-technical terms, I would say gradient descent is as classic as those algorithms. It turned out to be the popular one today because it can fit (or train) the layers of a neural network using so-called “backpropagation”, and it has also proven to scale well to large amounts of data. So gradient descent can work with complex neural networks and big data, and it wins.
Talking about curve fitting in the past, did you have to assume what the equation should look like before you even started to fit? For example, we plotted the data, saw what it looked like, and pieced components together. If it looked periodic, we might want something sinusoidal. If it decayed, then an exponential… Nowadays this process isn’t common for every problem, both because many-dimensional data is hard to visualize and because of the generalization potential of neural networks.
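As a reminder of what that older workflow looks like, here is a small sketch where the model form is assumed up front. The data "looks like" exponential decay, so we assume y = a * exp(-b * t), take logs to turn it into a straight line, and fit that with ordinary least squares. The synthetic data and the parameter values are made up for illustration.

```python
import numpy as np

# Synthetic decaying data (an assumption): y = 4 * exp(-0.8 t) with small
# multiplicative noise, which keeps every y positive so log(y) is defined.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 50)
y = 4.0 * np.exp(-0.8 * t) * np.exp(rng.normal(0.0, 0.02, t.size))

# Assumed model: y = a * exp(-b t)  =>  log(y) = log(a) - b * t,
# which is a straight line in t, so a degree-1 polynomial fit is enough.
slope, intercept = np.polyfit(t, np.log(y), 1)
a_hat = np.exp(intercept)
b_hat = -slope

# Extrapolate beyond the measured range, as in the original question.
y_at_6 = a_hat * np.exp(-b_hat * 6.0)
print(a_hat, b_hat, y_at_6)
```

The prior knowledge lives entirely in the choice of the equation. A neural network skips that step, at the price of needing more data and more care in training.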
However, for all the generalization power an NN can give us, there are skills we need to learn in order to train a useful NN. It might be more difficult, but I think it is just a different approach. In the past, we had to make many model assumptions to feed our prior knowledge about the problem into the curve-fitting process, and now we have to be able to coach an NN to go the right way. Both require an understanding of what we are doing.
Thanks Raymond. Yes, as you say, back then (in the 80’s and 90’s) it started with a graphical approach. I may have even tried a few different fits (straight line, exponential, etc.) to determine which worked best. But you are also correct in saying that the problems were much simpler - less complex, far fewer dimensions, and far less data. But it is also very interesting to learn that gradient descent was utilized in those “black box” curve-fitting math routines.