Classification confusion

I decided to do a bit of extra work and implement the linear regression and classification tasks on my computer from scratch. I downloaded a couple of data sets from the internet, one for each task.

The linear regression one went well, but the classification one has me confused.

As I step through gradient descent, my cost increases, but the percentage of correct predictions increases too! The algorithm goes from a cost of 0.13 with 34% correct to a cost of 1.48 with 96% correct.

I’ve gone through all of my functions repeatedly and can’t find my error. I feel like I must have made two different mistakes that somehow cancel each other out, since the process still produces a good final result.

Unfortunately, I don’t think that I can attach my Jupyter notebook to this post.

Edit: I uploaded my notebook to GitHub here: GitHub - compilebunny/ML-learning: Temporary ML learning repository

Are there any common errors that could produce this result? A common pair of errors perhaps?
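
For concreteness, the kind of loop I mean looks roughly like the sketch below (a simplified stand-in with placeholder names, not my exact notebook code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b, eps=1e-12):
    # average logistic loss (binary cross-entropy)
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient_descent(X, y, alpha=0.01, iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for i in range(iters):
        p = sigmoid(X @ w + b)
        w -= alpha * (X.T @ (p - y)) / m   # step opposite the gradient
        b -= alpha * np.mean(p - y)
        if i % 100 == 0:
            print(i, cost(X, y, w, b))     # should trend down if everything is right
    return w, b
```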


Hi @Jonathan_Germain,

That’s interesting.

  1. By “goes from a cost of 0.13 with 34% correct to a cost of 1.48 with 96% correct”, were you saying that your gradient descent increased the cost over iterations?

  2. Continuing with the above quote, were 0.13 and 1.48 training cost or validation cost; and were 34% and 96% training accuracy or validation accuracy?

Well, you may upload your code to your GitHub and share the link here. Just to note that we cannot share any course’s lab anywhere, but I suppose you were developing your own code, right?

Raymond


Thanks for the idea. I’ve uploaded my notebook to GitHub here:


To answer your other questions:

  1. Yes, my gradient descent increases the cost over iterations.
  2. The 34% and 96% are the percent of predicted values that match the actual values at 0 and 1000 iterations, respectively (computed as in the short sketch below). Since this is an elementary implementation, I didn’t use separate training and validation data sets.
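
A minimal sketch of that computation (assuming the usual 0.5 threshold on the model’s probabilities; the names are placeholders):

```python
import numpy as np

def fraction_correct(probs, y):
    """Fraction of examples where the 0.5-thresholded prediction matches the 0/1 label."""
    preds = (probs >= 0.5).astype(int)
    return np.mean(preds == y)
```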

Hi @Jonathan_Germain,

Then this tells us that you should double-check your gradient descent algorithm. You need to make sure that the training cost goes down over iterations.

Are you saying that the costs are training set costs, and the accuracies are training set accuracies?

That’s bad. Either there is an error in your gradient descent code, or your learning rate is too high.
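
One quick way to tell those two cases apart: rerun with a few learning rates and watch the cost curve. If a small enough rate makes the cost decrease, the step size was the issue; if the cost climbs even for tiny rates, suspect the gradient or loss code. A rough sketch, where `train`, `X_train`, and `y_train` are hypothetical stand-ins for your own loop and data:

```python
# Hypothetical sketch: `train` stands in for your own loop and should
# return the list of per-iteration costs; X_train, y_train are your data.
for alpha in (1.0, 0.1, 0.01, 0.001):
    costs = train(X_train, y_train, alpha=alpha, iters=200)
    trend = "decreasing" if costs[-1] < costs[0] else "increasing"
    print(f"alpha={alpha}: cost {costs[0]:.3f} -> {costs[-1]:.3f} ({trend})")
```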

From reading your notebook, I saw that somewhere in there you created a dataset using random numbers, and this plot confused me:

[plot from the notebook]

I don’t think a set of random numbers is a very good test. Random data doesn’t contain very much information from which you can learn a model.

The dataset is not a set of random numbers. It is from the Wisconsin Diagnostic Breast Cancer (WDBC) database; however, I changed the output to 0/1 for benign/malignant.
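
As a side note for anyone reproducing this: scikit-learn also ships a copy of the WDBC data with the diagnosis already encoded as integer labels, which is handy for cross-checking. A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_breast_cancer

# WDBC with the diagnosis already encoded as integer labels;
# data.target_names shows which class each integer corresponds to.
data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape, y.shape)      # (569, 30) (569,)
print(data.target_names)
```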

I found the problem. It was an error in my logistic loss function.

With that fixed, the cost decreases from 1.2 to 0.2 and the fraction of correct predictions increases from 0.35 to 0.96 over the course of 1000 iterations.
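
For anyone who hits the same symptom: the reported cost should be the standard binary cross-entropy. A minimal sketch of the correct form (not my original buggy version):

```python
import numpy as np

def logistic_loss(p, y, eps=1e-12):
    """Average binary cross-entropy for predicted probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1.0 - eps)   # keep log() away from 0
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

If the gradient is coded directly from the (p - y) form rather than by differentiating the reported loss, the weight updates can still be correct, which would explain why the accuracy kept improving even while the printed cost climbed.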

That’s good news that you found the problem.