Preparing Data for Deep Learning

Hello fellow Deep Learners,

Where can I find information on how to prepare my data (X and Y) to start my own Deep Learning project?

So far – I’m on DLS Course 2 – all data has been provided and imported in each programming exercise. Now I’m looking for a way to learn how to get my own data set up.

Thank you and best regards,


Hey @Bakir ,

I’m really happy to see that you are so interested in machine learning.

I did a quick google search and this is the top pick.

I want to bring up something important here: your question has quite a lot of answers.

You can start by preparing your own dataset and storing it in H5 files, the kind that are shared in our courses. H5 files are a great way to store and read data. Even .csv files work.

That’s what I personally did: I took an H5 file from this course, understood how it worked by reverse engineering it, and then made my own for my own dataset.
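A minimal sketch of writing and reading such a file, assuming the `h5py` and NumPy packages are installed. The dataset names `"X"` and `"Y"`, the file name, and the random placeholder data are all hypothetical; the course files may use different names and shapes.

```python
import os
import tempfile

import h5py
import numpy as np


def save_dataset(path, X, Y):
    """Write features X and labels Y into a single H5 file."""
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=X)  # "X" is an assumed dataset name
        f.create_dataset("Y", data=Y)  # "Y" is an assumed dataset name


def load_dataset(path):
    """Read both arrays back into memory as NumPy arrays."""
    with h5py.File(path, "r") as f:
        return f["X"][:], f["Y"][:]


# Placeholder data: 10 fake 64x64 RGB images with binary labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)
Y = rng.integers(0, 2, size=(10,), dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "my_dataset.h5")
save_dataset(path, X, Y)
X_loaded, Y_loaded = load_dataset(path)
print(X_loaded.shape, Y_loaded.shape)
```

To inspect an existing H5 file before copying its layout, `list(f.keys())` on the opened file shows which dataset names it contains.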

You can even read data from a text file.

You can search websites like Kaggle for existing datasets, or use TensorFlow Datasets.

Really, there are a lot of possibilities. Everything is a Google search away.

To learn more about TensorFlow Datasets, you can take our TensorFlow Specialisations.


TensorFlow has a very neat way of reading data. For example, for “a cat, dog, or none”, all you need to do is place all the pictures of the same kind in one folder: all the cat pictures in one folder, all the dog pictures in another, and all the other pictures in a third. Then all you have to do is point towards these folders and TensorFlow takes care of the rest. It automatically reads them, shuffles them to make a dataset, and assigns them labels.
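To show what the folder-per-class convention amounts to, here is a plain-Python sketch that walks such a directory, assigns one integer label per subfolder, and shuffles the result. The function name `list_labeled_files` and the temporary demo folders are hypothetical. In recent TensorFlow versions the utility that does this for real (and also decodes and batches the images) is, as far as I know, `tf.keras.utils.image_dataset_from_directory`; the sketch below does not call it.

```python
import os
import random
import tempfile


def list_labeled_files(root, seed=0):
    """Return shuffled (path, label) pairs, one label per subfolder of root."""
    # Sorting the folder names gives each class a stable integer label.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, name in enumerate(class_names):
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            samples.append((os.path.join(folder, fname), label))
    # Shuffle so that classes are mixed rather than grouped by folder.
    random.Random(seed).shuffle(samples)
    return samples, class_names


# Demo on a throwaway directory with empty placeholder files.
root = tempfile.mkdtemp()
for name, count in [("cat", 2), ("dog", 3), ("other", 1)]:
    os.makedirs(os.path.join(root, name))
    for i in range(count):
        open(os.path.join(root, name, f"{i}.jpg"), "w").close()

samples, class_names = list_labeled_files(root)
print(class_names)
print(len(samples))
```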


Thank you for your reply, @Mubsi – very helpful indeed :slight_smile:

Hi community! I would like opinions on data preparation for image data. My dataset consists of around 500 thousand chest X-ray images. I’ll be using the EfficientNet architecture for my model. While preparing data for training, I came across some images which are unlike the others (see the attached pic). In some of these anomalous images, a large portion of the image is just black pixels, in others it is white, and in some others there is high noise, etc.
How should I go about finding these images in my dataset, given that it isn’t possible to look at each image manually? Also, should I reject these images outright or include them in my dataset after some modifications?

Hello @Harshit1097

Identify the characteristics of those images. From what I have seen, those images always seem to have rows (or columns) that are completely dark. You might develop an algorithm that scans each image, counts the number of dark rows and the number of dark columns, and finally tags the images whose counts exceed some threshold value that you choose by experiment.
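A minimal NumPy sketch of that row/column scan, assuming each image is already loaded as a 2-D grayscale array. The function names, the darkness level of 10, and the 20% threshold are placeholder assumptions to be tuned on the real data.

```python
import numpy as np


def count_dark_lines(img, dark_level=10):
    """Count rows and columns whose pixels are all at or below dark_level."""
    dark = img <= dark_level
    dark_rows = int(dark.all(axis=1).sum())  # rows that are entirely dark
    dark_cols = int(dark.all(axis=0).sum())  # columns that are entirely dark
    return dark_rows, dark_cols


def is_suspect(img, max_fraction=0.2, dark_level=10):
    """Tag an image if too large a share of its rows or columns is dark."""
    rows, cols = count_dark_lines(img, dark_level)
    height, width = img.shape
    return rows / height > max_fraction or cols / width > max_fraction


# Synthetic examples: a uniform gray image, and one with a black band on top.
normal = np.full((100, 100), 128, dtype=np.uint8)
padded = normal.copy()
padded[:40, :] = 0

print(is_suspect(normal), is_suspect(padded))
```

The same pattern extends to white regions by testing `img >= bright_level` instead; noisy images would need a different statistic.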

You can then take out those tagged images and visually inspect them yourself. This inspection step is important for building confidence in the tagging algorithm. Rely on yourself first before relying on the algorithm.

It would be irresponsible to answer a question like that without a thorough understanding of the situation.

The decision depends on whether you need them or not. For example, do you know why those images exist? Do the test samples or the real-world samples contain something similar, or share similar characteristics? Does removing those images result in any drop in performance?

Therefore, only you can answer it for yourself.


Thanks @rmwkwok for the detailed reply!