This is not a DeepLearning.AI or Coursera official post, but I thought it might be helpful for people in the community to share their notes and questions.
The topic for this post is MLS Week 1 which covers linear regression, the cost function, and gradient descent.
Do you have any notes or major takeaways you would like to share? What about questions you still have regarding these topics?
Hey @jesse,
Welcome to the community. A great initiative indeed, and thanks a lot for your contributions to the community. Looking forward to having more of them.
One of my observations regarding linear regression is that optimizing your cost function J:
does not mean your cost function = 0 with optimized parameters, but
does mean the gradient used in gradient descent = 0
The distinction here is that you may have optimized your cost function with some parameters, but there will likely still be some cost remaining even if gradient descent worked to get your cost function to the minimum (i.e., “where the gradient is zero”).
TL;DR: minimum cost \ne 0 cost, but minimum cost does mean the gradient is 0 at those values of your cost function's parameters.
The value of your cost function for any combination of w,b depends on how well the Linear Regression line (or any other model, for that matter) matches the target variable “y”. If you are able to find a Line that EXACTLY goes through ALL the target values y, then your Cost will be 0.
To give you an example:
x = [1,2]
y = [300,500]
w = 200, b =100
At this combination of w and b values, the gradient = 0 and Cost = 0
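The two-point example above can be checked in a few lines of Python (the variable names are my own):

```python
# Checking the worked example above: with w = 200, b = 100 the line
# f(x) = w*x + b passes exactly through both points, so the cost J = 0.
x = [1, 2]
y = [300, 500]
w, b = 200, 100
m = len(x)

cost = sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)
print(cost)  # 0.0
```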
Now, what we showed above is a controlled mathematical exercise just to illustrate the edge case. In a real scenario with a lot of data points, we would seldom come across a case where all the target values fall exactly along a line (or any model function), so the cost will be non-zero.
In Week 1, there is an optional lab: Cost Function. It has a section called Cost Function Intuition. Run the plt_intuition cell, and it will allow you to play around with the values of "w" and see how the cost changes.
@shanup @Elemento I have a question about defining the term learning algorithm. So far in Week 1 of the course, we have discussed linear regression (specifically univariate linear regression). In this context, we have talked about a few functions:
the function of a straight line f_{wb}(x)=wx+b
the cost function J(w,b)=\frac{1}{2m}\sum\limits_{i=1}^m(f_{wb}(x^{(i)})-y^{(i)})^2
the function for gradient descent to continuously update parameters w and b
When we talk about a learning algorithm here, what exactly do we mean? Is (3), gradient descent, the learning algorithm? Or is it more abstract, like "the implementation of gradient descent using cost function J with the straight-line function f_{wb}(x) is the learning algorithm for linear regression"?
My intuition says it must be some combination of all the functions, but I am hoping for a more formal definition.
In a nutshell, the learning algorithm stands for the entire process by which we find the optimal parameters of the model which satisfy the convergence conditions that we have specified.
In the context of what you have mentioned, the learning algorithm would be:
1. Start out with random initial values for w and b (0 works fine too for Linear Regression).
2. Identify the direction of steepest descent at the current value of (w, b).
3. Take a step in that direction from the current value of (w, b): w := w - alpha * dJ/dw; b := b - alpha * dJ/db.
4. Check if the convergence conditions have been met. If yes, exit with the latest values of w and b that we got from step 3.
5. If not, go back to step 2, but use the latest values of w and b from step 3, and continue the steps.
Whenever the check in step 4 is satisfied, the learning algorithm is complete. The output of the learning algorithm is the optimal model parameters "w" and "b".
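The steps above can be sketched in plain Python. This is a minimal illustration, not the lab's code: the data reuses the two-point example from earlier in the thread, and the learning rate, iteration cap, and stopping threshold are made-up choices:

```python
# Minimal sketch of the learning algorithm for univariate linear regression.
# Data from the earlier two-point example; hyper-parameters are illustrative.
x = [1.0, 2.0]
y = [300.0, 500.0]
m = len(x)

w, b = 0.0, 0.0   # step 1: initial values (0 works fine for linear regression)
alpha = 0.1       # learning rate (hypothetical choice)

for _ in range(10000):
    # step 2: the gradient points in the direction of steepest ascent
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    # step 3: take a step in the opposite (steepest descent) direction
    w -= alpha * dj_dw
    b -= alpha * dj_db
    # step 4: a simple convergence condition -- stop when the gradient is tiny
    if abs(dj_dw) < 1e-8 and abs(dj_db) < 1e-8:
        break
    # step 5: otherwise, loop back with the updated (w, b)

print(w, b)  # approaches w = 200, b = 100, where the cost is 0
```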
Thanks for this reply. It seems the definition is very nuanced after all. How about this:
learning algorithm
: the combination of a model and its parameters that result from a process that minimized the cost of using the model on its respective training set
I like that you want to encapsulate it in a single sentence.
In later lectures, you will see situations where minimal cost might not always be an indicator of the best model - it is for Linear Regression and for a simple Logistic Regression model. If you notice, in my earlier post I put it roughly as "meeting the convergence conditions" - in the upcoming videos, Prof. Andrew will elaborate on this topic.
Aha, this seems to be the trickster in my definition. I will work on this more. Also, I am realizing I had a weak understanding of the definition of algorithm which was hurting my formation of this definition. For now I will refine to this:
learning algorithm
: a process, involving a model, its parameters, and a training set, that meets some specified convergence conditions with respect to the parameters and the model's cost; the resulting parameters can then be used in the model for the purpose of reduced-error predictions
Now it should be general enough to cover more than linear regression?
an algorithm is something that takes input(s) and produces output(s)
for our case, inputs are (1) a model assumption (e.g. linear regression, logistic regression, decision tree, random forest, neural network, etc), (2) a dataset, and (3) hyper-parameters
output can just be a trained model
an algorithm can include many parts that make use of the inputs for specific purposes. For example, gradient descent (C1W1; for updating parameters in models such as linear/logistic regression and neural networks, but not for decision trees or random forests), feature scaling (for …, C1W2), feature engineering (for …, C1W2), regularization techniques (C1W3), and so on.
hyper-parameters specify how the different parts of the algorithm work; for example, the cost function formula (C1W1) together with the learning rate (C1W1) are needed for gradient descent, the regularization parameter \lambda (C1W3), and so on.
Above I tried to put together some topics covered in course one, because our courses are for ML so everything here matters. You might want to scan through both courses again to put everything in. “convergence” should be a part of the algorithm and there are also controlling parameters for how to define “at which point we consider it converged” and those parameters should belong to hyper-parameters.
Each part of the algorithm serves for a purpose and is for you to decide when to use it (that’s why it’s helpful to listen to Professor Andrew Ng discussing the rationale behind them), and the list of hyper-parameters should change with the content of the algorithm.
Your list will grow as you proceed through this specialization, and further into the more extensive Deep Learning Specialization.
After we have this, we might think about which key elements to put into the definition so that it is neither too vague nor too detailed.
The algorithm is deemed to have learnt when, at the end, it spits out the values of w and b. Finding the optimal values of w and b is the ultimate aim (for this univariate linear regression case).
Now, there may be more complex regression models with more features (the size of the home is the only feature in our case), and we would say the algorithm has learnt when we are able to find the optimized parameter values for all those features.
Identifying the features itself is another thing to learn though, which we will come to know later in the course.
I have a question: can someone explain how the numbers in the cost function were calculated? Like (500 + 1831 + …), because when I compute ((209 * x + b) - 250)^2 + … it is not the same as in the pictures. Can someone explain?
w = 209 and b = 0
The values that are shown on the graph are as per the below equation:
\dfrac{1}{2}(w.x^{(i)} + b - y^{(i)})^{2}
Your calculation was missing the division by 2. If you remember, the cost function is as follows:
\dfrac{1}{2m}\sum_{i=1} ^{m} (w.x^{(i)} + b - y^{(i)})^{2}
Substitute the exact values of w and b, and you will get the numbers that are shown on the graph.
Keep in mind that as you click on the contour plot to change the values of w and b, you should retrieve the exact values, including the decimal places, to be able to match the exact loss values shown on the graph. Any slight difference between the actual and retrieved values of w and b will get magnified when we take the square in the loss expression shown above.
Except for the missing denominator 2, your thought process is right. Give it a shot with this change and see if you are able to get the values displayed on the graph.
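To make the divide-by-2 point concrete, here is a sketch computing a single per-example term \frac{1}{2}(w \cdot x^{(i)} + b - y^{(i)})^{2} with w = 209 and b = 0. The (x, y) pair below is a made-up illustration, not the lab's actual data:

```python
# Hedged sketch: one per-example cost term (1/2)(w*x + b - y)^2 with w = 209, b = 0.
# The training example (x_i, y_i) is hypothetical, chosen only to show the formula.
w, b = 209, 0
x_i, y_i = 1.0, 250.0   # hypothetical data point

term = 0.5 * (w * x_i + b - y_i) ** 2
print(term)  # 0.5 * (209 - 250)^2 = 0.5 * 1681 = 840.5
```

Without the factor of 1/2 the same term would come out as 1681, which is the kind of mismatch described in the question.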
Thanks, I understand what I was missing: the divide by 2, and I was also multiplying w by 1000 lol. Thanks for the explanation btw. You helped me understand better.