In MLS Course 1 I learned about two kinds of supervised machine learning problems: (1) regression (here, linear regression) and (2) classification. Here, supervised means that the learning algorithm is provided with labels (the correct answers) during training.
Linear Regression
univariate linear regression means only one input feature x
General Notation — f_{w,b}(x)=wx+b
Training Set Prediction Notation — f_{w,b}(x^{(i)})=wx^{(i)}+b
multiple linear regression means multiple input features \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} where n is the number of features
General Notation — f_{\vec{w},b}(x_1, x_2, ..., x_n)=w_1x_1+w_2x_2+...+w_nx_n+b
Vectorized Notation — f_{\vec{w},b}(\vec{x})=\vec{w} \cdot \vec{x} + b
Training Set Prediction Notation — f_{\vec{w},b}(\vec{x}^{(i)})=\vec{w} \cdot \vec{x}^{(i)} + b
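To make the vectorized notation concrete, here is a minimal NumPy sketch (the function and variable names are my own, purely for illustration):

```python
import numpy as np

def predict(x, w, b):
    """f_wb(x) = w . x + b for a single example x with n features."""
    return np.dot(w, x) + b

w = np.array([0.5, -1.2, 3.0])   # hypothetical learned weights
b = 4.0                          # hypothetical learned bias
x = np.array([2.0, 1.0, 0.5])    # one new example with n = 3 features
print(predict(x, w, b))          # 0.5*2 - 1.2*1 + 3.0*0.5 + 4.0 = 5.3
```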
The goal here is to fit a line or curve to a dataset (i.e., build a model) that can be used to predict “reasonable” values for new values of x (in the univariate case) or \vec{x} (in the multiple case).
With this goal in mind, the question is, “How do we accomplish this goal?”
- Define what it means to have a better or worse prediction — cost function
- Define a goal for getting better predictions — minimize the cost function
- Determine how we will minimize the cost function — gradient descent
Cost Function
Note: From here on out I will only use multiple linear regression notation.
In the context of our linear regression problem, we use the squared error cost function which looks like this:
J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2
This can be summarized as:
The cost of our model given specific values for the parameters w and b
It is important to note that this cost is computed over the model's predictions for the entire training set, not for any single example.
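As a concrete illustration, the squared error cost above translates to a few lines of NumPy. This is a sketch under my own naming assumptions (compute_cost, X, y are not course code):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost J(w, b) over all m training examples.

    X is the (m, n) matrix of inputs, y the (m,) vector of targets.
    """
    m = X.shape[0]
    errors = (X @ w + b) - y          # f_wb(x^(i)) - y^(i) for every i
    return np.sum(errors ** 2) / (2 * m)
```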
Minimizing Cost
Minimizing the cost of our model means minimizing our cost function J(\vec{w},b). In other words, trying to get our model to do the very best it can at predicting “acceptable” values for new values of \vec{x}.
Gradient Descent
Gradient descent is a method of iterating over values for the parameters of the model until a specific goal is satisfied. That goal is convergence. Convergence here means “at a local minimum cost” where the parameters no longer change much with each additional step.
Gradient descent works well with convex (bowl-shaped) functions, which have a single global minimum. The squared error cost function for linear regression produces this shape.
The update rule is w_j = w_j - \alpha \frac{\partial J(\vec{w},b)}{\partial w_j} (and likewise for b), applied simultaneously to all parameters. The parts of the gradient descent equation that stick out to me are:
- \alpha — the learning rate (changing this gives bigger and smaller steps)
- the gradient — the partial derivative \frac{\partial J}{\partial w_j} of the cost function with respect to a parameter (e.g., w_j), i.e., the slope of the cost at the current parameter values
- it is ideal to graph the cost function with each step of gradient descent to ensure the cost is going down and at a reasonable rate
Once the goal of gradient descent is satisfied (i.e., convergence) then that model with those specific parameters is considered ready to produce “acceptable” predictions.
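Here is one way the full loop could look as a NumPy sketch under the same assumptions as the earlier snippets (batch gradient descent with the squared error cost; all names are mine). It records the cost at each step so you can plot it, per the note above:

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Batch gradient descent for linear regression with squared error cost."""
    m = X.shape[0]
    cost_history = []                       # track J each step to plot later
    for _ in range(num_iters):
        errors = (X @ w + b) - y            # (m,) prediction errors
        dj_dw = (X.T @ errors) / m          # partial of J w.r.t. each w_j
        dj_db = np.sum(errors) / m          # partial of J w.r.t. b
        w = w - alpha * dj_dw               # simultaneous parameter update
        b = b - alpha * dj_db
        cost_history.append(np.sum(errors ** 2) / (2 * m))
    return w, b, cost_history
```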
Improving Gradient Descent
Feature Scaling
When your features \vec{x} are on very different scales (e.g., x^{(i)}_1=1000 while x^{(i)}_2=0.01) gradient descent can take a long time to converge. Scaling these features can help overcome this problem. The three common ways to do this (one is sketched in code after this list) are:
- scaling → divide a given x_j by its corresponding max value, normalizing all values of x_j to between 0 and 1
- mean normalization → \frac{x_j - \mu_j}{x_{j \text{ range}}} normalizing values to roughly between -1 and 1 with a mean of 0
- z-score normalization → \frac{x_j - \mu_j}{\sigma_j} where \sigma_j is the standard deviation of feature j, normalizing values to have a mean of 0 and a standard deviation of 1
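As an example, z-score normalization might look like this in NumPy (a sketch; names are mine). You would save mu and sigma so that new inputs can be scaled the same way before prediction:

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column of X) to mean 0 and unit variance."""
    mu = np.mean(X, axis=0)       # per-feature mean
    sigma = np.std(X, axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma
```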
Improving The Model
So you have a model, you have a cost function, you minimize the cost function using gradient descent, and your predictions are not that great… Now what?
- underfit → when a model has high bias (it misses the underlying pattern even on the training set)
- overfit → when a model has high variance (it fits the training set too closely and generalizes poorly to new data)
3 Ways To Improve
- More training data
- Different features or better engineered features can help (e.g., removing features)
- Regularization, which reduces the size / impact of the parameters (i.e., the weights)
Regularization
The process of reducing the impact of less important input features on predictions without removing any of them from your learning algorithm. This is applied at the level of the cost function, which in turn has an impact on the calculation of your gradient during gradient descent.
Note: You do not need to apply regularization to the parameter b.
\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}
where if,
\lambda is close to 0, the regularization term has almost no effect and the model can still overfit
or if,
\lambda is a very large number, the parameters \vec{w} are driven toward 0 (basically eliminating them) and the model will underfit
Regularization (w/ Squared Error Cost)
J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2+\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}
Regularization (w/ Squared Error Cost in Gradient Descent)
\frac{\partial J(\vec{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j
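Putting the two regularized formulas together, a NumPy sketch might look like the following (names are mine; note that only the w gradient gets the lambda term, matching the note about b above):

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_):
    """Squared error cost plus the L2 penalty on w (b is not regularized)."""
    m = X.shape[0]
    errors = (X @ w + b) - y
    return np.sum(errors ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def compute_gradient_reg(X, y, w, b, lambda_):
    """Gradients of the regularized cost; only dj_dw gains the lambda term."""
    m = X.shape[0]
    errors = (X @ w + b) - y
    dj_dw = (X.T @ errors) / m + (lambda_ / m) * w
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db
```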
Logistic Regression (Binary Classification)
This is a valid approach when your y is binary, 0 or 1. This methodology uses the sigmoid function, which produces values between 0 and 1.
Sigmoid function → g(z)=\frac{1}{1+e^{-z}}
We use this function by applying it to our existing linear model, taking z=\vec{w} \cdot \vec{x} + b (the output of the linear model). So logistic regression looks like,
Logistic regression → f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}
The above equation produces values between 0 and 1 which can be thought of as:
The probability that y will be 1, given the input \vec{x}
The output itself is not really binary though, at least not until we apply something called the decision boundary, which is a heuristic threshold (a custom rule). The most common rule is: for values of f_{\vec{w},b}(\vec{x}) \geq 0.5, predict 1; otherwise predict 0.
We happen to know that f_{\vec{w},b}(\vec{x}) \geq 0.5 exactly when z \geq 0 (i.e., \vec{w} \cdot \vec{x} + b \geq 0), because the sigmoid crosses 0.5 at z=0. This is better understood in a picture, which won’t be drawn here.
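In code, the sigmoid and the 0.5 decision boundary together might look like this (a hypothetical sketch; names are mine):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Apply the decision boundary: predict 1 when P(y=1) >= threshold."""
    probabilities = sigmoid(X @ w + b)    # f_wb(x^(i)) for every example
    return (probabilities >= threshold).astype(int)
```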
Cost Function
Something important to consider here is that we need a cost function that captures the unique characteristics of our model predicting a binary target AND produces a convex surface to descend. Unfortunately, the squared error cost function does not work here (combined with the sigmoid, it creates many local minima). This introduces a new concept, the loss function, which is combined with a cost function similar to (but not the same as) the squared error cost function.
The new cost function looks like this,
J(\vec{w},b) = \frac{1}{m} \sum\limits_{i = 1}^{m} \left[ loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) \right]
Loss Function
The loss function is applied to each example i of the training set and looks like this,
loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right)
While this looks overwhelming, think about our specific cases for y. When y^{(i)}=1 the second term drops out, leaving -\log(f); when y^{(i)}=0 the first term drops out, leaving -\log(1-f). Either way, this function penalizes the model heavily when it is further from the truth (e.g., y^{(i)}=1 and the model predicts a value near 0, where -\log(f) grows without bound).
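A direct translation of the loss into NumPy, averaged into the cost from above (a sketch; in practice you would guard against log(0), which I've done here with a small epsilon of my own choosing):

```python
import numpy as np

def logistic_cost(X, y, w, b, eps=1e-15):
    """Average logistic loss over the m training examples."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid predictions
    f = np.clip(f, eps, 1 - eps)                 # avoid log(0) blowing up
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return np.mean(loss)
```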
Convenient Reality
Other concepts from linear regression carry over, like feature scaling and regularization, and even the gradient descent update is written the same way for the parameters w_j. However, remember that the cost function behind the scenes is different and f_{\vec{w},b}(\vec{x}^{(i)}) is now the sigmoid function.
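To illustrate that carry-over, compare this gradient sketch with the linear regression one earlier: the form is identical, and only the line computing f changes (names are mine):

```python
import numpy as np

def logistic_gradient(X, y, w, b):
    """Same shape as the linear regression gradient; only f_wb differs."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid, not the raw linear model
    errors = f - y
    dj_dw = (X.T @ errors) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db
```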
Major Takeaways
- Seeing i typically means we are talking about iterating through the training set. So in the context of prediction, i=2 means we are talking about the model’s prediction given the values of \vec{x} at training set example 2
I will come back and finish this.