I don’t think you can post the exercise solution code like that, but:
This is definitely “batch gradient descent”, as you take the whole batch (X, y), compute a gradient, and then apply that gradient to the current parameters (as opposed to cutting the batch into mini-batches and processing those, or even processing the (X[i], y[i]) one by one).
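For comparison, here is a minimal sketch of the three variants. It assumes a gradient() that returns the summed per-example gradients; the helper names are mine, not from the exercise:

```python
import numpy as np

def gradient(params, X, y):
    # Summed logistic-regression gradient over all rows: sum_i (y_hat_i - y_i) * x_i
    y_hat = 1.0 / (1.0 + np.exp(-X @ params))
    return X.T @ (y_hat - y)

def batch_step(params, X, y, lr):
    # Batch gradient descent: one update from the gradient of the whole batch
    return params - lr * gradient(params, X, y)

def minibatch_steps(params, X, y, lr, batch_size):
    # Mini-batch gradient descent: one update per slice of the batch
    for start in range(0, len(X), batch_size):
        params = params - lr * gradient(params, X[start:start + batch_size],
                                        y[start:start + batch_size])
    return params

def sgd_steps(params, X, y, lr):
    # Stochastic gradient descent: one update per single example (X[i], y[i])
    for i in range(len(X)):
        params = params - lr * gradient(params, X[i:i + 1], y[i:i + 1])
    return params
```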
As I see it, you may divide the accumulated gradient by the number of examples, but that is just a scaling factor. You may consider it to be already included in self.learning_rate.
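Concretely, with learning rate \eta and batch size m, dividing by m is equivalent to keeping the sum and rescaling the learning rate:

\theta \leftarrow \theta - \frac{\eta}{m} \sum_{i=1}^{m} \nabla_\theta loss(x_i) \quad \equiv \quad \theta \leftarrow \theta - \eta' \sum_{i=1}^{m} \nabla_\theta loss(x_i) with \eta' = \frac{\eta}{m}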
Computing the gradient
Is the formula in gradient() for sample_gradient correct? It computes (\hat{y} - y) * x, where \hat{y} is the prediction. That doesn’t look like the right formula. (Update: I am wrong, it’s correct!)
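For anyone else wondering: it is indeed the gradient of the per-example loss with respect to params. With \hat{y} = \sigma(params \cdot x) and \sigma'(z) = \sigma(z) * (1 - \sigma(z)), the chain rule gives:

d/d(params) loss(x) = (- y / \hat{y} + (1-y) / (1-\hat{y})) * \hat{y} * (1-\hat{y}) * x
= (- y * (1-\hat{y}) + (1-y) * \hat{y}) * x
= (\hat{y} - y) * x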
There is also an unnecessary call to current_x = np.asarray(current_X) right before that.
Incrementing the error
The criterion for whether to increment the error count doesn’t look right to me. From loss > 0.5 we can’t conclude much. The correct criterion would be (with log the natural logarithm) loss > log(2) \approx 0.693147.
Because the decision table is:

| y | \hat{y}   | error? |
|---|-----------|--------|
| 1 | \geq 1/2  | no     |
| 1 | < 1/2     | yes    |
| 0 | \leq 1/2  | no     |
| 0 | > 1/2     | yes    |

This can be handled with an explicit if/else on y and \hat{y} (see the sketch after the derivation below). Or else, using the loss:
With the prediction being:
\hat{y} = \sigma(params \cdot x)
and
loss(x) = - y * log(\hat{y}) - (1-y) * log(1-\hat{y})
In case of an error with y = 1 and \hat{y} < \frac{1}{2}, we can write \hat{y} = \frac{1}{2} * \mu with 0 < \mu < 1, so that:
loss(x) = - log(\hat{y}) = - log(\frac{1}{2} * \mu)
loss(x) = log(2) - log(\mu) with log(\mu) \in ]-\infty, 0[
loss(x) > log(2)
Similarly for y = 0 and \hat{y} > \frac{1}{2}.
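Here is a minimal sketch of both options (the helper names are mine, not your method names):

```python
import math

def is_error(y, y_hat):
    # Explicit if/else on the decision table
    if y == 1:
        return y_hat < 0.5
    else:
        return y_hat > 0.5

def is_error_via_loss(y, y_hat):
    # Equivalent criterion through the loss, per the derivation above
    loss = -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return loss > math.log(2)
```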
PS
You want to rename the method cost() to loss(). It is computing the loss of a single example, after all.
PPS
In train():

```python
para = initial_para  # Avoid modifying initial parameters
```

I don’t think that is going to work. para will just be another reference to initial_para, not a copy of initial_para.
You want:

```python
para = initial_para.copy()  # Avoid modifying initial parameters
```

On the other hand, a new array is created anyway in gradient(), so it doesn’t even matter.
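A quick standalone illustration of the reference-versus-copy point (not your code):

```python
import numpy as np

initial_para = np.zeros(3)

para = initial_para         # Just another reference to the same array
para += 1.0                 # In-place update also changes initial_para
print(initial_para)         # [1. 1. 1.]

initial_para = np.zeros(3)
para = initial_para.copy()  # Independent copy
para += 1.0
print(initial_para)         # [0. 0. 0.]

# Note: an update like `para = para - lr * grad` rebinds para to a brand-new
# array and leaves initial_para untouched either way, which is why it
# doesn't matter here.
```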