How to understand accuracy of a binary classifier vs random

Based on Course 1 Logistic Regression, I’ve built my own binary classifier and gathered my own data.
Since my learning task is very, very difficult, I’m hoping to get my accuracy just above random chance (I’m hoping to find some patterns in the data that would give me just a little edge over picking at random).
So, to establish a baseline, right after random initialization I make predictions with these untrained weights on my dev set. In theory, that should give 50%, right?
When the dev set is 67 examples, the dev “random” accuracy is 48%.
When the dev set is 134 examples, the dev “random” accuracy is 52%.
I would call that close enough; I understand that the bigger the dev set, the closer it will get to 50%.
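A quick sanity check on how close is "close enough" (this assumes each prediction from the untrained model is effectively a fair coin flip, so dev accuracy is a binomial proportion):

```python
import math

# If the untrained model is effectively guessing, dev accuracy has
# standard error sqrt(p * (1 - p) / n) with p = 0.5.
for n in (67, 134):
    se = math.sqrt(0.25 / n)
    print(f"n={n}: 50% +/- {se:.1%} (one standard error)")
```

With n=67 the one-standard-error band is roughly 44%–56%, so observing 48% or 52% is entirely consistent with random guessing.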

Now, after training the weights for just a little while, I get these results:
When the dev set is 67 examples, the dev accuracy is 58%.
When the dev set is 134 examples, the dev accuracy is 46%.

My question is: could a trained model do worse on the dev set than randomly initialized weights? I imagine that training accuracy should only get better, since the cost is decreasing, but what about dev/test accuracy? Or are these fluctuations due to the dev set being too small?


Hello, Martin. That’s an interesting exercise and I encourage you to keep exploring offline as the specialization progresses. Well done!

You have come across an interesting feature of ML/DL, and one that will be addressed in Course 2 and emphasized throughout the Specialization. Your cost is decreasing nicely (monotonically) during training, so your learning rate is well-chosen. The longer one trains, the better the model parameters fit the training set. Which is great. But here is the rub: a better fit to the training set will, at some point, lead to a deterioration in the model’s ability to generalize to the dev/test set, and degrade out-of-sample accuracy.

You have discovered the problem of “overfitting,” and in Course 2 you will learn ways to address it. But if you want to get right into it, you can compute the cost on the dev/test set at each iteration and plot it alongside the training cost curve. Note how the two costs behave differently as the iterations increase. You should see what I mean.
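As a rough sketch of what that looks like (this uses synthetic random data and a minimal NumPy reimplementation in the style of the Course 1 model, not your actual setup, so the names and shapes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    # Cross-entropy cost, as in the Course 1 logistic-regression model.
    A = sigmoid(w.T @ X + b)
    eps = 1e-12  # avoid log(0)
    return float(-np.mean(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)))

# Synthetic stand-ins for the data (course convention: features x examples).
# Labels are pure noise, so any fit beyond 50% is overfitting by construction.
n_x, m_train, m_dev = 50, 80, 134
X_train = rng.standard_normal((n_x, m_train))
Y_train = (rng.random((1, m_train)) < 0.5).astype(float)
X_dev = rng.standard_normal((n_x, m_dev))
Y_dev = (rng.random((1, m_dev)) < 0.5).astype(float)

w, b, lr = np.zeros((n_x, 1)), 0.0, 0.1
train_costs, dev_costs = [], []

for i in range(500):
    A = sigmoid(w.T @ X_train + b)
    dZ = A - Y_train
    w -= lr * (X_train @ dZ.T) / m_train
    b -= lr * float(np.mean(dZ))
    train_costs.append(cost(w, b, X_train, Y_train))
    dev_costs.append(cost(w, b, X_dev, Y_dev))

# Plot both curves: the training cost keeps falling, while the dev cost
# flattens or rises once the model starts fitting noise.
# import matplotlib.pyplot as plt
# plt.plot(train_costs, label="train"); plt.plot(dev_costs, label="dev")
# plt.legend(); plt.show()
```

The training curve only ever goes down, but the dev curve is the one that tells you when to stop trusting the fit.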

Finally, more data will naturally result in fewer “anomalous” results. Since the costs are averages, the law of large numbers guarantees it. What is your training/test split, i.e. the proportion of all examples dedicated to training the model vs. those set aside for testing? As a first pass, a 2/3 - 1/3 training/test split will suffice. There will be more on that later, too.
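If it helps, a shuffled split along the examples axis can be done in a few lines (a sketch assuming the course convention of columns-as-examples; the function name and shapes are my own, not from the assignments):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_test_split(X, Y, test_frac=1/3, rng=rng):
    # Shuffle example indices, then carve off test_frac of them for testing.
    m = X.shape[1]  # columns are examples
    perm = rng.permutation(m)
    m_test = int(round(m * test_frac))
    test_idx, train_idx = perm[:m_test], perm[m_test:]
    return X[:, train_idx], Y[:, train_idx], X[:, test_idx], Y[:, test_idx]

X = rng.standard_normal((5, 201))
Y = (rng.random((1, 201)) < 0.5).astype(float)
X_tr, Y_tr, X_te, Y_te = train_test_split(X, Y)
print(X_tr.shape, X_te.shape)  # (5, 134) (5, 67)
```

Shuffling before splitting matters: if the examples are ordered (by class, by date), a straight slice gives a test set that doesn't represent the training distribution.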