General implementation of a deep neural network for a multi-class classification problem, using Course 1 and Course 2

So, using the concepts I learnt from Course 1 and Course 2, I tried to implement a deep neural network for a multi-class classification problem, making the relevant tweaks to the model (I tried gradient descent and Adam optimization, Xavier initialization, and regularization). But I've noticed that even though some versions of my model gave around 80+% test accuracy, the cost keeps increasing with every 1000 iterations. Logically this seems flawed to me, but it would be great to get some guidance on it.


Cost should not be increasing.

Unless there is some defect in your model, here are two common ways to fix this:

  • Normalize the features.
  • Use a lower learning rate.
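For the first point, here is a minimal sketch of per-feature standardization, assuming NumPy arrays in the course's (features × examples) layout (the function name is my own, not from the assignments):

```python
import numpy as np

def normalize_features(X):
    """Standardize each feature (row) of X, shape (n_features, m),
    to zero mean and unit variance."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (X - mu) / sigma

# Features on very different scales, as often causes divergence:
X = np.array([[1.0, 200.0, 3000.0],
              [0.1,   0.2,    0.3]])
X_norm = normalize_features(X)
```

Apply the same mean and standard deviation (computed on the training set) to the test set as well.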

What method did you use (sklearn, TensorFlow, or something else)?

No, I didn't use any library; I implemented the whole thing on my own, like what was taught in the course. Could it be an error in the implementation, then?


Implementation errors are always possible.

Fair enough. I will debug it and try to find the error. I just wanted some conceptual clarity: I read somewhere that sometimes the optimization algorithm might not minimize the cost directly, but will instead optimize some other function, and because of this the cost might rise. Could that be the case here?


That’s not true in my experience.

Well, you’re saying that you wrote the optimization algorithm in this case, right? So what is it minimizing? Is it the cost or something else? :grin:

All the algorithms that Prof Ng has shown us at least so far up through DLS C2 minimize the cost. Of course that propagates backwards through the gradients, but it is minimization of the cost that drives everything.

But there is no guarantee that you always get convergence …


OK, yeah, it is an extension of what was taught in the course, so it is minimizing the cost. I will get to the debugging. Thank you!


The lack of a convergence guarantee might be due to the choice of hyperparameters, is it? Like the learning rate?


Yes, exactly. Unfortunately the search space for hyperparameters has lots of dimensions, but learning rate is a key one. Note that there are more sophisticated optimization algorithms which dynamically manage the learning rate. We see a couple of such techniques in the DLS C2 W2 Optimization assignment.
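As an illustration of what "dynamically managing the learning rate" means, here is a rough NumPy sketch of an Adam-style update, where per-parameter moment estimates rescale the effective step size (this is a simplified sketch, not the assignment code):

```python
import numpy as np

def adam_update(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. m and v are running moment estimates; t is the
    1-based iteration count used for bias correction."""
    m = beta1 * m + (1 - beta1) * dw           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * dw ** 2      # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    dw = 2 * w
    w, m, v = adam_update(w, dw, m, v, t, lr=0.05)
```

The division by the square root of the second moment is what damps the step size for parameters with large or noisy gradients.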

But if your cost is monotonically increasing rather than decreasing and I were the one writing the code, my first step would be the debugging that you have been mentioning. There's a lot of code you need to build all this stuff from scratch, which of course means lots of ways things can go off the rails. :scream_cat:
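One debugging technique worth mentioning for the backprop code is gradient checking, which compares your analytic gradients against a centered numerical estimate. A minimal sketch (the helper names are my own):

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-7):
    """Relative difference between the analytic gradient grad_f(w)
    and a centered finite-difference estimate of the gradient of f."""
    num = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy();  w_plus.flat[i]  += eps
        w_minus = w.copy(); w_minus.flat[i] -= eps
        num.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    ana = grad_f(w)
    return np.linalg.norm(ana - num) / (np.linalg.norm(ana) + np.linalg.norm(num))

# Toy check: f(w) = sum(w**2) has gradient 2*w, so the difference should be tiny.
w = np.array([0.5, -1.2, 3.0])
diff = grad_check(lambda w: np.sum(w ** 2), lambda w: 2 * w, w)
```

If the relative difference is much larger than about 1e-7, the analytic gradient for that parameter is suspect.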


It’s admirable that you are digging in with this level of detail. In the “real world” these days, the SOTA is so complicated that nobody can really build everything from scratch themselves; instead, they use “platforms” like TensorFlow or PyTorch or a bunch of others. We start learning about TF in Week 3 of DLS C2, of course, and then continue to learn more through C4 and C5.

One approach to understanding where your problem is would be to build a parallel version of the same network architecture with TF and then compare the results. Of course you are describing a pretty sophisticated solution by the time you get to Adam Optimization, Xavier Initialization and Regularization, so you’ll need to be more familiar with TF than we will get here in DLS C2.


Yeah, I have gone through the TensorFlow implementation that comes up a little later in the course, but I wanted to see if I could do the same in Python from scratch. The explanation in the course was very intriguing; I had to at least give it a try. :sweat_smile:


It is an interesting and valid idea to try this. If you are already familiar enough with TF, my point was just that you could use it as a comparison. If you implement the exact same architecture in TF and directly in python, then you can compare both the results and the performance in terms of training time and the like. If the TF implementation converges and yours does not, that would indicate something wrong in your implementation. But if you get the same symptom of monotonically increasing cost even with the TF implementation, then it indicates that your data is in some way not suitable for the architecture you have chosen, meaning the solution is to explore different architectures instead of trying to debug your code.

But notice that you’ve got a lot of work to do if you are doing Adam Optimization and Regularization: those all affect backprop as well, right? So you need to implement the gradients through all those layers and functions. In TF that is all just magically handled for you.
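To illustrate how regularization touches both sides, here is a sketch of how L2 regularization adds a term to the cost and a matching term to each weight gradient (helper names are illustrative, not from the assignments):

```python
import numpy as np

def cost_with_l2(cross_entropy_cost, weight_matrices, lambd, m):
    """Add the L2 penalty (lambda / 2m) * sum ||W_l||_F^2 to the
    unregularized cross-entropy cost."""
    l2 = sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost + (lambd / (2 * m)) * l2

def dW_with_l2(dW, W, lambd, m):
    """The matching backprop change: each dW_l gains (lambda / m) * W_l."""
    return dW + (lambd / m) * W
```

Forgetting the gradient term while keeping the cost term (or vice versa) is a classic way to get a cost curve that misbehaves.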

Yeah, fair point. I will use the TensorFlow implementation to validate the results I get from implementing it from scratch. Also, the error was in computing the cost: the target variable in my dataset had dimension (1, m), whereas the activations of the final layer were (k, m), where k denotes the number of classes. So I had to one-hot encode the true labels in order to calculate the cost. Now it seems to be decreasing with every iteration. Yeah, Adam and regularization were pretty intense to implement, so I tried to make it a little easier by creating a lot of helper functions.
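For anyone hitting the same shape mismatch, a minimal one-hot encoding sketch in NumPy (names are illustrative):

```python
import numpy as np

def one_hot(y, k):
    """Convert integer labels y of shape (1, m) to a (k, m) one-hot
    matrix matching the softmax output shape."""
    m = y.shape[1]
    Y = np.zeros((k, m))
    Y[y.flatten(), np.arange(m)] = 1.0  # set one entry per column
    return Y

y = np.array([[0, 2, 1, 2]])  # labels for m = 4 examples, k = 3 classes
Y = one_hot(y, 3)
# Cross-entropy against softmax activations A of shape (k, m) would then be:
#   cost = -np.sum(Y * np.log(A)) / m
```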