Week 2, Assignment 2: changing code for dataset with many classes

Okay, so while doing this assignment, I am also editing the code to work on an image classification problem of my own, and I would like a little nudge in the right direction. Right now, I am trying to figure out how to use the labels generated by image_dataset_from_directory for a multi-class softmax classifier. Instead of using the final dense layer to do binary classification, I want to create a layer with the 224 classes found from the subdirectories in my dataset.


You just need to create an output dense layer with 224 output neurons and a softmax activation. The loss function is the softmax version of cross entropy loss (categorical cross entropy). Then the only further tricky bit is making sure you can handle converting the data from the directory format to arrays and labels. You can start with the labels in integer (“sparse categorical”) form, one integer between 0 and 223 per label, and then convert to “one hot” on the fly in the places where you need that (typically when computing the loss, although if you are using TF there is a “sparse” version of the loss function that can handle the labels in integer form directly).
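To make the label handling concrete, here is a minimal sketch in TensorFlow (assuming 224 classes; `tf.one_hot` does the on-the-fly conversion, and the “sparse” loss skips it entirely):

```python
import tensorflow as tf

NUM_CLASSES = 224

# Integer labels as produced by image_dataset_from_directory with
# label_mode="int": one integer in [0, 223] per image.
labels = tf.constant([0, 5, 223, 17])

# Convert to one-hot on the fly where you need that form.
one_hot = tf.one_hot(labels, depth=NUM_CLASSES)  # shape (4, 224)

# With integer labels, the sparse loss handles the conversion internally.
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()

# Equivalently, with one-hot labels, use the plain categorical loss.
dense_loss = tf.keras.losses.CategoricalCrossentropy()

# Dummy uniform predictions just to show both losses agree.
probs = tf.fill([4, NUM_CLASSES], 1.0 / NUM_CLASSES)
print(float(sparse_loss(labels, probs)))
print(float(dense_loss(one_hot, probs)))
```

The two loss values come out identical, which is the point: pick whichever label form is more convenient and use the matching loss.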

Most of what I just described was covered in DLS Course 2 Week 3, other than the issues of getting the data loaded in the correct form and the point about the categorical version of the loss function. The whole network in that case was fully connected Dense layers, but all you need to change in the convnet architecture is the output layer and that’s already a Dense layer. So you just need to convert from binary/sigmoid/cross entropy to multiclass/softmax/multiclass cross entropy. The fact that the earlier internal layers of the network include Conv and Pooling layers doesn’t change anything w.r.t. how the output layer needs to work.
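In code, that conversion from binary/sigmoid to multiclass/softmax is a one-line change to the output layer plus the matching loss. A rough sketch (the conv/pool stack here is just a placeholder, not your actual architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 224  # one per subdirectory

# Binary version would end in Dense(1, activation="sigmoid") with
# BinaryCrossentropy. For multiclass, only the head changes:
model = tf.keras.Sequential([
    layers.Input(shape=(160, 160, 3)),
    layers.Conv2D(8, 3, activation="relu"),  # placeholder internal layers
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # the actual change
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])
print(model.output_shape)  # (None, 224)
```

The Conv and Pooling layers are untouched; only the final Dense layer and the loss change.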

Thanks! yeah… I actually figured it out on my own in the time it took to get a response :slight_smile:

Super mentor… woah… hey are you a Stanford professor too?

That’s great! Onward!

No, just a retired software engineer. Not sure I remember why they came up with that “super mentor” title, but there are a number of us who have been mentoring DLS for a while that got “promoted” to that title. Please note that the mentors also have no employment relationship with DeepLearning.AI: we are fellow students who volunteer our time to support these courses.


Well, I appreciate you doing that. I have a follow up question. I was able to start fine tuning the MobileNetV2, but quickly realized some snags in my plan. Let me explain the project first, then my problems.

Project: mushroom classifier
I am building an edge application similar to an image classifier like the mobile applications “picture this” or “picture mushroom”. ImageNet is unable to identify, or identify with enough specificity, various plants and mushrooms. I am using an iNaturalist dataset from a computer vision competition that took place in 2021 (if memory serves). It is roughly 224 GB and contains 1,787,841 images. My goal is to use that dataset to train a model to classify a picture as one of the 223 listed species of fungi, or as “not a fungus”.

I structured the directory into subdirectories for all of the species of mushrooms, plus a subdirectory containing all of the specimens from other kingdoms of life. Also, I did notice that there is now MobileNetV3, so I adjusted to use that (V3Large) architecture. Next, I got the code working on a smaller dataset as a toy example, noticed a few things, like overfitting after a number of epochs, and gained some intuitions. Finally, I made a few edits that seemed reasonable, ran my code on the large dataset, and went to sleep.

The dataset is pretty big, so this is going to take a long time running on my machine. I was actually trying to use the terminal emulator in VSC to run the code, but it seems to have stopped running overnight; it was stuck at the end of the third epoch of training the new output layer.
So, for one thing, do I really need the initial phase of training just the new output layer on top of the frozen ImageNet weights, or can I skip that altogether if I am going to unfreeze some of the earlier layers anyway?
Further, should I be running this somewhere besides the terminal emulator in VSC, maybe CMD or some other way? I’ve never run a script that took nearly this long and suppose I’m a little hacky in my current methods.
Do I even need all of this data? I feel like I do for the ID of different fungi, but for the “not a fungus” class, I’m not so sure. Is less sometimes more? Should I just focus on making this work with the larger volume of data or could I do well to shrink my dataset?
Does it make sense in this instance to still use a pretrained model? Would it make sense to train from scratch, or is ImageNet still a good starting place? My intuition tells me that, even though I have a lot of data, I will need fewer training iterations for similar results than I would if training from scratch. I am curious, though, whether training from scratch might yield EVEN BETTER results than fine tuning. Unfreezing SOME of the layers at least improved performance in our lab example.
Finally… I’m wondering if we can do even better than MobileNetV3 architecture for this project…

Hi, Richard.

Wow, that sounds like a very interesting project that could have real utility, as opposed to just a learning exercise. Well, first I should make the important disclaimer that I’ve never really tried applying the course materials to a “real” problem of the scale that you are doing here. The most I ever did was play around on my own with the MNIST handwritten digits database. Those images are pretty small and greyscale, so it was not hard to train on my own computer (just a MacBook Pro with no additional GPU). So the most I can do to help will be based on what I’ve heard Prof Ng say and not on any real world experience that is comparable to what you are doing. There are several other mentors and students I know of who have done serious work of the scale that you’re dealing with, so we can also hope that they will notice this thread and chime in as well.

224GB is pretty large. I assume you must be doing minibatch and using the Yann LeCun rule “friends don’t let friends use minibatch sizes greater than 32”. But anytime you run a large job like that on your local computer, you can go off the “swapping” cliff. That’s something I actually do know about, since operating systems were my area of expertise during my career. It’s all “virtual memory”, and when the size of that which is actively being referenced gets close to or larger than your actual physical RAM, things go off a cliff in a really serious way, because you’re reading and writing virtual memory pages to disk, which is milliseconds instead of nanoseconds. Meaning that the machine may just seem to be hung and nothing’s happening. As mentioned above, I’m a MacOS user, so I can use the Activity Monitor app to see what’s going on with memory and cpu utilization. Dunno what the equivalent tool would be on Windows, but there must be one. But generally speaking, if you don’t have a GPU rig, I think most people use Cloud services for running big training jobs, e.g. Google Colab or AWS. The latter costs money, although it’s way cheaper than buying your own h/w. Colab can be used for free, but you queue up behind the paying customers and they only let your job run continuously for a few hours on the free tier. So if you’re going that route, you need to jazz up your training to be restartable: checkpoint the model every few iterations and write the logic to restart from a previous model.
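If you go the checkpoint route, Keras has callbacks that do most of the work for you; here’s a rough sketch I haven’t battle-tested, with placeholder paths (`BackupAndRestore` resumes an interrupted `fit` from the last completed epoch, and `ModelCheckpoint` keeps the best weights on disk):

```python
import tensorflow as tf

# Resumes training where it left off if the process was killed
# mid-fit (e.g. a Colab free-tier timeout). On restart, rerun the
# same script and fit() picks up from the saved epoch.
backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir="training_backup")

# Separately, keep the best model seen so far on disk.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_model.keras",
    monitor="val_accuracy",
    save_best_only=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           callbacks=[backup_cb, ckpt_cb])
```

Point `backup_dir` and `filepath` at storage that survives the runtime being recycled (e.g. a mounted Google Drive folder on Colab), otherwise the checkpoints disappear along with the VM.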

For all your other questions, my belief from what I’ve heard Prof Ng say is that there is no single “silver bullet” answer that is correct in all or even most cases. The only generalizable answer is: try it and see what happens. E.g. on the question of training from scratch, versus only training your new specific output layer or freezing at some intermediate point in the layers and training everything after that, he does discuss that in the Transfer Learning and MobileNet section of DLS C4 W2. And my memory of what he said (although I confess it’s been a couple of years since I actually watched those lectures) is that there is no right answer that works in all cases. I guess your dataset is large enough that you could start with the pretrained model and unfreeze and train all the layers and it should not make the solution worse than it would have been if you had kept most of the earlier layers frozen. But it’s a question of whether the additional compute cost and time that would imply would make enough of an improvement to be justifiable. Maybe in your case, if you’re running it on your own hardware and have the patience, that doesn’t matter. But when OpenAI is training something like GPT-4, we’re talking tens of millions of dollars of compute cost or more, so these decisions are not trivial in that context.

You could also try subsetting the dataset, both overall and in the “non-fungi” labels specifically, and see if that gives sufficiently good results with a noticeably smaller compute cost and time. Meaning instead of doing a toy subset, try randomly selecting half of all the class samples and see what happens both in terms of training cost and the prediction accuracy of the resulting model.

In terms of what other models might be better for your problem, there again I have no relevant experience. I’ve looked briefly at the list of pretrained models that Keras offers and there may be other resources like that to consult.

Sorry, there’s probably not much in my response that you hadn’t already considered, but I see you posted this as a thread in a more general forum, so I hope you’ll get responses from people with more experience than I have.

It would be great to hear what you learn in this process. It could help others wanting to build real solutions.