I tried ReLU as well, and my results look exactly like yours. The one observation is that in the standard n = 4 case, learning is really slow, so it might be worth fiddling with more iterations or a higher learning rate, although the learning rate already defaults to 1.2 here, which is pretty high. But then I just ran hidden layer sizes from 1 to 50 with no hyperparameter changes, and the n = 50 case actually works pretty well at about 85% accuracy:
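For anyone who wants to reproduce the sweep, here's a rough, self-contained sketch of what I mean. To be clear, this isn't the notebook's code: it uses sklearn's `make_moons` as a stand-in dataset and a bare-bones 1-hidden-layer ReLU net, so the exact accuracies won't match the exercise, but the loop over hidden layer sizes is the idea.

```python
# Rough sketch of the hidden-unit sweep (not the notebook's code).
# make_moons is a stand-in dataset; lr=1.2 mirrors the default mentioned above.
import numpy as np
from sklearn.datasets import make_moons

def train_relu_net(X, Y, n_h, lr=1.2, iters=10000, seed=3):
    """One hidden layer (ReLU) + sigmoid output, plain batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    for _ in range(iters):
        # forward pass
        Z1 = W1 @ X + b1
        A1 = np.maximum(0, Z1)              # ReLU
        Z2 = W2 @ A1 + b2
        A2 = 1 / (1 + np.exp(-Z2))          # sigmoid output
        # backward pass (binary cross-entropy)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.mean(axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * (Z1 > 0)       # ReLU derivative
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.mean(axis=1, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2)

def accuracy(params, X, Y):
    W1, b1, W2, b2 = params
    A1 = np.maximum(0, W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    return float(((A2 > 0.5) == Y).mean())

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
X, Y = X.T, y.reshape(1, -1)                # shape to (features, examples)
for n_h in [1, 4, 10, 20, 50]:
    params = train_relu_net(X, Y, n_h)
    print(f"n_h = {n_h:2d}: accuracy = {accuracy(params, X, Y):.2f}")
```

Swapping the list for something like [20, 30, 40, 50] would be the quick way to hunt for the sweet spot I mention below.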
But the training takes forever. It might be worth a bit more fiddling to see if there's a sweet spot between n = 20 and n = 50 where we could get good accuracy at reasonable compute cost.

The other thing to notice is that the n = 50 ReLU case takes a fundamentally different approach from either sigmoid or tanh to discriminating that cluster of red dots right at the origin, so it looks like ReLU really does give a qualitatively different solution. Maybe we could get it to do something similar with the blue dots in the upper center of the picture if we gave it even more neurons to work with. Or, as @kenb said, maybe the smarter thing would be to try 2 hidden layers; we could probably get much greater complexity with fewer than 50 total neurons. Worth another try after we finish Week 4 and learn how to build the fully general case!
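Just to make the 2-hidden-layer idea concrete before we have the Week 4 tooling, here's roughly what the forward pass would look like. The layer sizes (10 and 10, i.e. 20 total hidden neurons vs. 50) are made up, not tuned, and backprop for the general L-layer case is exactly what Week 4 covers, so this is only the shape of it.

```python
# Sketch of a 2-hidden-layer forward pass: ReLU -> ReLU -> sigmoid.
# Layer sizes n1 = n2 = 10 are assumptions for illustration only.
import numpy as np

def forward_two_hidden(X, params):
    """Forward pass for a net with two ReLU hidden layers and a sigmoid output."""
    W1, b1, W2, b2, W3, b3 = params
    A1 = np.maximum(0, W1 @ X + b1)             # first hidden layer (ReLU)
    A2 = np.maximum(0, W2 @ A1 + b2)            # second hidden layer (ReLU)
    A3 = 1 / (1 + np.exp(-(W3 @ A2 + b3)))      # sigmoid output probabilities
    return A3

n_x, n1, n2 = 2, 10, 10
rng = np.random.default_rng(0)
params = (rng.standard_normal((n1, n_x)) * 0.01, np.zeros((n1, 1)),
          rng.standard_normal((n2, n1)) * 0.01, np.zeros((n2, 1)),
          rng.standard_normal((1, n2)) * 0.01, np.zeros((1, 1)))
X_demo = rng.standard_normal((n_x, 5))          # 5 dummy 2-D points
print(forward_two_hidden(X_demo, params).shape) # (1, 5) predicted probabilities
```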
