After each week I do a personally defined exercise on top of the end-of-week assignment to test out my knowledge.
This week I finished the Course 2, Week 2 program, so I wanted to prepare my own dataset, train a model, and test it. On my tablet, using Samsung Notes, I wrote 20x10 lines of numbers in a different style with a different pen:
I split the image in GIMP into multiple smaller images named prefix_row_col.jpg, then used rename.py to change the names to the actual labels, in the form row_value.jpg.
The idea of the exercise was:
Load the images from the filesystem
Visualise the dataset
Load images in grayscale and into a numpy array
Train the model and assess the quality of the model
The method createDatasetFromImages loads the images from the folder in grayscale into a numpy array. Once loaded, each image is an array of grayscale values, 0 to 255, one per pixel in each row. Since these are the features of the image, I ravel them into a 1D array.
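A minimal sketch of what such a dataset-building step might look like. This is not the poster's createDatasetFromImages: the helper names label_from_name and build_dataset are hypothetical, the row_value.jpg naming is taken from the description above, and random arrays stand in for the real images (in practice each array would come from something like np.asarray(Image.open(path).convert("L")) with Pillow).

```python
from pathlib import Path

import numpy as np


def label_from_name(file_name: str) -> int:
    """Parse the digit label from a name like '0_3.jpg' (assumed row_value.jpg)."""
    return int(Path(file_name).stem.split("_")[-1])


def build_dataset(images: dict[str, np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    """Turn {file_name: 2-D grayscale array} into flat features X and labels y."""
    names = sorted(images)
    # ravel() flattens each H x W image into a single row of H*W pixel features
    X = np.stack([images[name].ravel() for name in names])
    y = np.array([label_from_name(name) for name in names])
    return X, y


# Stand-in data: two random 85x85 "images" with values 0..255
rng = np.random.default_rng(0)
fake_images = {
    "0_3.jpg": rng.integers(0, 256, size=(85, 85), dtype=np.uint8),
    "1_7.jpg": rng.integers(0, 256, size=(85, 85), dtype=np.uint8),
}
X, y = build_dataset(fake_images)
print(X.shape, y)  # (2, 7225) [3 7]
```

An 85x85 image gives 7225 features per example, which is where a count of "7000+" features would come from.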
Here is my code:
The predictions are very much a miss. I played around with the epochs, the learning rate, and the number of layers. Could the issue be that I’m using Dense layers when I should be using convolutional layers? I’m not 100% sure. Or is it possibly because my dataset is too small?
In the multi-class example, only Dense layers are used but the model still works.
Could it be that having 7000+ X values as features is simply too many?
@TMosh so one thing I wanted to attempt was to reduce the number of features by using the NxM output of the np.asarray(Image.open().convert()) so my features shape becomes:
(N, NxM) or from my example: (50000, 85, 85)
However, I’m not sure how to train the model with such a shape, where NxM is basically row x col pixels.
I have also tried normalization but was unsuccessful. The results were a tiny bit better, but not really. I tried increasing the dataset by adding more copies, tho now that I think about it, that won’t really help as it won’t introduce more variance.
I suggest you also watch the videos in week 3, which may motivate you to split your data into a training set and a cv set. Tensorflow allows us to supply not just the training set but also a validation set, so we can monitor how the performance of the model changes on each of the training and cv sets. This will be helpful in deciding what to do next.
Without a cv set I cannot be very confident, but I might consider adding more neurons/layers to your model, since your loss stopped improving after 7 rounds.
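As a rough sketch of the split suggested above, assuming the features and labels are already numpy arrays. The function name train_cv_split, the 20% hold-out fraction, and the stand-in data are illustrative choices, not anything from the notebook.

```python
import numpy as np


def train_cv_split(X, y, cv_fraction=0.2, seed=0):
    """Shuffle once, then hold out `cv_fraction` of the examples as a cv set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_cv = int(round(len(X) * cv_fraction))
    cv_idx, train_idx = order[:n_cv], order[n_cv:]
    return X[train_idx], y[train_idx], X[cv_idx], y[cv_idx]


# Stand-in data: 200 examples with 4 features each, labels 0..9
X = np.arange(200 * 4).reshape(200, 4)
y = np.arange(200) % 10

X_train, y_train, X_cv, y_cv = train_cv_split(X, y)
print(X_train.shape, X_cv.shape)  # (160, 4) (40, 4)
```

With Keras, the held-out arrays can then be passed as model.fit(X_train, y_train, validation_data=(X_cv, y_cv)), so that the training history reports the loss on both sets after every epoch. Note that if the dataset contains exact copies of the same image, copies can land on both sides of the split and make the cv loss look better than it really is.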
@TMosh @rmwkwok thank you for the help. @rmwkwok I provided a split of the dataset as the validation set to monitor the model. So yes, the dataset in question was way too small, and just having copies made no sense and thus did not improve the model.
I was however able to improve the model a tiny bit. First I reverted to the 1D representation of the images in the dataset. Before that, however, I reduced all of my images to 20x20, which also reduced the number of features, and I removed padding.
After that, I scaled down the feature values and I went down to just a single dense layer with number of units matching the number of pixels in the image.
I noticed that predictions for 0, 1, 4, 7 worked somewhat well: the values in the prediction result were far apart from the others. For 2, 3, 5, 6, 8, 9, however, the values were all relatively close. Say I provide an image with the number 3 drawn: the values for 2, 5, 8, 9 were all very close.
@rmwkwok @TMosh I finished the Advanced Learning Algorithms course (2/3) and went back to this project. I basically just changed my dataset back to the original and used a different feature scaling, and now I get extremely good results, even on new images that were not part of the original dataset's train/test split.
Since at first my images were 85x85 pixels and resulted in quite a lot of features I scaled the images down to 20x20 but this did not have a drastic effect on the model precision.
I played around with the number of Dense layers and units but nothing really yielded decent results. So I added the metrics to observe how the model training is doing.
I recalled Andrew saying we should ideally reduce our features to 0.0 - 1.0 value range. Initially I did this by just using the feature scaling that was recommended by Andrew in the course:
x = (x - x.mean()) / (x.max() - x.min())
The values of images are basically grayscale intensity 0 - 255. This seems to have caused issues. So I changed to basically:
x = x / 255
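To make the two formulas concrete, here is a small numpy comparison on made-up pixel data (the array below is random, not the real dataset). One thing it shows: when the statistics are taken over a whole array that spans the full 0 to 255 range, the two scalings differ only by a constant shift. A possible pitfall, stated as a guess rather than a diagnosis, is that x.mean(), x.max() and x.min() change with whatever array they are computed on, so new images scaled on their own end up shifted differently from the training data, while x / 255 is always the same transformation.

```python
import numpy as np


def mean_normalize(x):
    """Mean normalization from the course: (x - mean) / (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())


def scale_by_255(x):
    """Fixed scaling for 8-bit grayscale pixels."""
    return x / 255.0


rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(200, 400)).astype(np.float64)
pixels[0, 0], pixels[0, 1] = 0.0, 255.0  # make sure the full 0..255 range occurs

a = mean_normalize(pixels)
b = scale_by_255(pixels)

print("mean-normalized range:", a.min(), a.max())  # centered on 0, width 1
print("x / 255 range:        ", b.min(), b.max())  # exactly 0.0 to 1.0
# With min = 0 and max = 255, the two differ only by the constant b.mean()
print(np.allclose(a, b - b.mean()))  # True
```

If the mean-normalization route is used, the mean, max and min would need to be computed once on the training set and reused unchanged for the test set and for any new image.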
I played around a bit with the number of Dense layers and units per layer and found that 400 → 25 → 15 → 10 gives the best results. Now on my incredibly small dataset, I’m getting exceptionally good results.
So I’m trying to figure out two things now:
Why did the proposed scaling not work while X / 255 worked so well? Is it normal practice to basically try out different feature scaling options, or did I create an issue by using that approach to scale the values?
What I’m trying to understand is what is happening inside a Dense layer. Say on the first layer, Dense(units=400, activation='relu'), we are taking 400 features and running gradient descent so that each unit produces a new feature, but what is this feature? What does setting it to 400 different units actually do? Won’t gradient descent come to somewhat the same values for each run? Or is each unit in a dense layer attempting, say, a higher-order polynomial function?
@rmwkwok I forgot to commit the updated notebook and the images in dataset/. I ran the resize.py script to scale them down to 20x20 so I had fewer features. It should work now if you pull the latest images.
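A small numpy sketch of what one Dense layer with ReLU computes in the forward pass may help with this question. Each unit has its own weight vector (one column of W) and its own bias, and outputs relu(w . x + b): a weighted sum of all input pixels passed through ReLU, not a polynomial of the inputs. The weight scale and the random input below are assumptions for illustration only.

```python
import numpy as np


def dense_relu(x, W, b):
    """Forward pass of one Dense layer; column j of W holds the weights of unit j."""
    return np.maximum(0.0, x @ W + b)


rng = np.random.default_rng(0)
n_features, n_units = 400, 400

x = rng.random((1, n_features))                   # one flattened 20x20 image in [0, 1)
W = rng.normal(0.0, 0.05, (n_features, n_units))  # random initial weights (assumed scale)
b = np.zeros(n_units)

a = dense_relu(x, W, b)
print(a.shape)  # (1, 400): one new "feature" per unit

# If every unit started with the same weights, every unit would output the same value
W_same = np.tile(W[:, :1], (1, n_units))
a_same = dense_relu(x, W_same, b)
print(np.allclose(a_same, a_same[0, 0]))  # True
```

Because the units start from different random weights, they produce different outputs, receive different gradients, and gradient descent moves them toward different detectors. Units that started out identical would get identical updates and stay identical, which is why random initialization matters.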
I made the loss curves for both the training and testing sets.
The plot below used x = x / 255.
The plot below used x = (x - x.mean()) / (x.max() - x.min()).
We can see that
both achieve a pretty similar loss level on the testing set, somewhere in the middle between 1.5 and 2.0, so I would not conclude that either way of normalization is significantly better. I encourage you to run this experiment in your notebook and compare the testing set’s loss.
The second plot reached its best loss value in fewer epochs.
1000 epochs are too many: both plots show the testing set loss increasing. The first plot shows it after ~400 epochs, whereas the second shows it after ~150 epochs. Beyond those epochs, the models started to overfit (training loss decreases but testing loss increases).
From now on I will focus on the 2nd normalization approach. There is a large gap between the training curve and the testing curve, and it continues to grow over the epochs. In the spirit of closing the gap, I removed your first dense layer, which has 400 neurons, and got the next plot:
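The point where the testing loss turns upward can also be found programmatically instead of by eye. Below is a minimal sketch of patience-based early stopping on a list of per-epoch validation losses. The function name, the patience value, and the synthetic curve (built to bottom out at epoch 150) are all made up for illustration and are not the losses from the plots.

```python
def early_stop_epoch(val_losses, patience=20):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training "stops" once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic validation curve: falls until epoch 150, then rises again
val_losses = [1.7 + ((e - 150) / 150) ** 2 for e in range(1, 1001)]

best, stopped = early_stop_epoch(val_losses, patience=20)
print(best, stopped)  # 150 170
```

Keras offers the same idea as a callback, tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=..., restore_best_weights=True), passed to model.fit through the callbacks argument.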
Now the best testing loss is closer to 1.5 and is obviously better than before taking away the first dense layer. Also, the gap is smaller.
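One way to see why dropping that layer might narrow the gap is to count trainable parameters. The sketch below assumes 400 input features (20x20 images) and the 400 → 25 → 15 → 10 stack mentioned earlier in the thread. The helper dense_param_count is hypothetical, written only for this comparison.

```python
def dense_param_count(n_inputs, layer_units):
    """Total weights + biases for a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weight matrix plus one bias per unit
        n_inputs = units                   # this layer's output feeds the next
    return total


n_pixels = 20 * 20  # assumed input size: flattened 20x20 images

print(dense_param_count(n_pixels, [400, 25, 15, 10]))  # 170975
print(dense_param_count(n_pixels, [25, 15, 10]))       # 10575
```

Under these assumptions the 400-unit layer alone accounts for 160,400 of the roughly 171,000 parameters. If the dataset really is only a few hundred handwritten digits, that is far more capacity than examples, which would be consistent with the overfitting seen in the curves.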
This is not over, but I will let you explore further the tuning of your neural network and any hyper-parameters related to the training process itself. Please feel free to share your findings, preferably with curves like mine and an explanation of your thought process. That way, if you would like to, we can discuss how to make further improvements.
@rmwkwok This is awesome, thank you so very much. I completely forgot I could have plotted the loss to see what’s happening. I was too focused on the end result/prediction and forgot to pay attention to how the loss changes with the number of epochs.
You explain these subjects really well. Thank you so much, you really are awesome.