Is there a method to get a rough estimate on how wide and deep my neural network should be based on my input feature size and training examples? I am hand crafting a neural network based on my specific data implementation and I currently have 160 x 160 x 3 features and 3600 examples. I’ve learned the basics of constructing a DNN and CNN in DLS course but I don’t know the designing aspects of it.
The number of examples has no bearing on the design of the network.
Here are my tips. Others may have different guidance.
 Start with one hidden layer.
 Do as much as you can to get good performance from this one hidden layer, before you add any more.
 The number of units in the hidden layer can be adjusted. Start with either “the average between the number of input features and the number of output labels” or “the square root of the number of input features”. Maybe try both.
 Adjust the number of hidden layer units, and experiment with regularization appropriately (using a validation set) to try to get good-enough performance on the test set.
 Use a learning curve to see if the training set is large enough. If not, then get more data.
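For concreteness, the two sizing heuristics above can be computed directly. This is a purely illustrative sketch using the poster's input size (160 x 160 x 3, flattened) and the six output labels mentioned later in the thread:

```python
import math

# Illustrative numbers: 160 x 160 x 3 input features flattened to a
# vector, and 6 output labels (taken from later in this thread).
n_inputs = 160 * 160 * 3   # 76800
n_outputs = 6

# Heuristic 1: average of input features and output labels.
avg_rule = (n_inputs + n_outputs) // 2

# Heuristic 2: square root of the number of input features.
sqrt_rule = round(math.sqrt(n_inputs))

print(avg_rule)   # 38403 units -- far too many for this input size
print(sqrt_rule)  # 277 units -- a much more practical starting point
```

With an input this large, the two heuristics disagree wildly, which is itself a hint that the raw 76,800-feature input may need shrinking before a Dense network becomes practical.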
Thank you sir for reading my question. I also have some questions that I want to ask referring to your points:
 I’m using a CNN for my implementation. Does one hidden layer mean CONV → RELU → MAXPOOL → DENSE (Softmax)?
 What do you mean by “do as much as I can to get good performance”? Since I won’t have many units to start with, I wouldn’t be able to use advanced techniques such as dropout to a large degree. The most I can think of is fine-tuning and some data regularization techniques. I assume you are referring to increasing performance on the training set.
 How do I use a learning curve to see if the training set is large enough? I’m guessing you mean that if the network overfits the training set, I should increase the data, but if my network underfits the set, decrease the data.
 If I do add layers: I found that adding a Dense (1D) layer to a network increases the trained parameters by a huge amount. Should I add a Dense layer or a Conv block?
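On the learning-curve question above, the usual recipe is: train on increasing subsets of the data and track training vs. validation error. If the validation error has flattened out, more data is unlikely to help; if it is still falling, more data probably will. A minimal sketch on synthetic data (plain numpy least-squares, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, purely for illustration.
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.5 * rng.normal(size=500)

X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

def mse(w, Xs, ys):
    return float(np.mean((Xs @ w - ys) ** 2))

# Train on growing subsets and record train/validation error.
for m in [25, 50, 100, 200, 400]:
    w, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    print(m, mse(w, X_train[:m], y_train[:m]), mse(w, X_val, y_val))
# When validation error stops improving as m grows, more data is
# unlikely to help; if it is still dropping, collect more data.
```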
Thanks,
Yuhan
If you’re just doing a simple NN model, you don’t need conv or maxpool layers.
Those are only important if you can’t get good enough results with a Dense NN with one or maybe a few hidden layers.
Dropout is one of many forms of regularization.
Decreasing the amount of data is never necessary, unless the amount of data makes it too difficult for training (i.e. takes too long).
Use Conv blocks if fully-connected Dense layers aren’t feasible (i.e. it takes too long to train, because of the number of parameters).
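To make the parameter-count argument concrete, here is a rough comparison on a 160 x 160 x 3 input; the layer widths (256 Dense units, 32 3x3 filters) are illustrative assumptions, not values from this thread:

```python
# Rough parameter-count comparison (illustrative, assumed shapes).
# Dense: flatten 160*160*3 inputs into a 256-unit layer.
n_in = 160 * 160 * 3             # 76800 inputs
dense_params = n_in * 256 + 256  # weights + biases

# Conv: 32 filters of size 3x3 over 3 input channels.
conv_params = (3 * 3 * 3) * 32 + 32  # weights + biases

print(dense_params)  # 19661056
print(conv_params)   # 896
# The conv layer shares its small filters across the whole image,
# which is why it needs orders of magnitude fewer parameters.
```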
Designing a NN is more a matter of experience than following a cookbook.
Hello @TMosh
I am trying your advice and building a DNN with one hidden layer in Keras. This is my first time doing this on my own, so could you help me check whether my implementation is right? Here are some details about my model.
I am using a 1D signal for this implementation (I used a spectrogram before), but since I’m using a DNN I decided to feed in the 1D signal.
After some signal preprocessing, the signal length is 600. The signal shape is (1, 600).
I used one hidden layer of 30 units to ensure that I don’t have too many parameters. I used a relu activation function, as I recall Professor Andrew recommending.
The output layer is a softmax layer with 6 units.
The following is a screenshot of my model:
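In Keras, the model as described (600 input features, one 30-unit relu hidden layer, 6-unit softmax output) can be sketched roughly like this; the optimizer and loss choices below are assumptions for illustration, not taken from the original post:

```python
from tensorflow import keras

# Sketch of the described model: 600 inputs, one 30-unit relu hidden
# layer, and a 6-unit softmax output. Optimizer/loss are assumptions.
model = keras.Sequential([
    keras.Input(shape=(600,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # should report 18,216 trainable parameters
```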
Thanks for reading my message!
Best,
Yuhan
What exactly do you mean by “signal”? Is each example a set of time-series data?
Yes, indeed! It’s vibrational time-series data, which I ran through a Fourier transform and concatenated across different channels. Also, is the network correct? Judging by the parameter count, it matches what I expected.
Also, I have another question: are there any sizing concerns for the one hidden layer? I set it to 30 units because there will then only be around 18,000 parameters, which might be good enough for my implementation.
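The rough count of 18,000 parameters checks out; with 600 inputs, a 30-unit hidden layer, and 6 outputs, the exact total is:

```python
# Parameters per Dense layer = weights + biases.
hidden = 600 * 30 + 30   # 18030
output = 30 * 6 + 6      # 186
total = hidden + output
print(total)  # 18216
```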
You will not get very good results with a simple fully-connected NN to create a time-series model.
The fully-connected NN does not have any concept of relationships where the values depend on the previous values.
Time-series data needs a sequence model, like a Recurrent Neural Network. That’s a different level of complexity. Typically it’s covered in a course on deep learning.
I used a Fourier transform to convert the time series to the frequency domain. I am not attempting an implementation as complex as speech generation. I just want to extract and identify vibrational features from mechanical structures using a deep neural network, because of its benefits.
If you have a spectrograph (i.e. a different spectrum at each instant of time), then it’s still a time series.
If you took a time series and did Fourier analysis on it, then it became a frequency plot. Do you have a different plot for every example? If so, then it’s OK for a NN to try to model.
What are the output labels for each example? You say you have six outputs; are they one-hot codes for different conditions?
You say each example has size (160x160x3). Why is it that size?
Thank you sir for reading the messages!
I would firstly like to add some clarifications about my implementation:

My original description of the data was wrong: the data is vibrational. I was trying out a CNN, so I wrote some code to transform it into images (160 x 160 x 3). I did not put much thought into the size; I just wanted to use the CNN that I practiced in the DLS course.

Now, based on what you’ve recommended, I decided to use the Fourier-transformed coefficients of the time series and feed them into my network. There are 150 coefficients and 4 channels, so each sample has 600 features.
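That preprocessing step can be sketched with numpy's real FFT. The 150-coefficient and 4-channel numbers follow the description above; the raw signal here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# One example: 4 channels of a (synthetic, illustrative) time series.
n_channels, n_samples = 4, 1024
signal = rng.normal(size=(n_channels, n_samples))

# Real FFT per channel, keep the magnitudes of the first 150
# coefficients, then concatenate the channels into one feature vector.
n_coeffs = 150
spectrum = np.abs(np.fft.rfft(signal, axis=1))[:, :n_coeffs]
features = spectrum.reshape(-1)

print(features.shape)  # (600,)
```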

I have different raw data for every example. The labels are one-hot encoded, and there are 6 conditions.

I recently tried out my simple network and it has performed surprisingly well on my small dataset. Here are my results:
 It seems that I’ve encountered an exploding gradient problem on my validation set. I would need to try out some regularization technique.
I don’t see that; you only get exploding gradients during training, and for that you use the training set (not the validation set).
The issue I see is overfitting the training set. So regularization would be worth trying.
Also, since the training cost is still decreasing when you reach 100 epochs, I’d say the solution hasn’t converged yet, and you could try more epochs.
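One way to try regularization on a model of this shape in Keras is dropout and/or an L2 weight penalty. A hedged sketch, keeping the 600 → 30 → 6 architecture described earlier; the dropout rate and L2 strength below are illustrative guesses to tune on a validation set, not values from this thread:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Same 600 -> 30 -> 6 shape as described earlier, with two common
# regularizers added. The 0.2 dropout rate and 1e-3 L2 strength are
# assumptions to tune against the validation set.
model = keras.Sequential([
    keras.Input(shape=(600,)),
    layers.Dense(30, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dropout(0.2),
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Neither regularizer adds trainable parameters, so the model stays at 18,216 parameters; only the training dynamics change.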
Is your time-series signal in steady state?
If so, I believe using the Fourier transform for feature engineering makes total sense.
But for transient signals, a Fourier transform is not the best tool in my opinion, since by its nature it assumes periodicity, i.e. that the (transient) signal will reappear periodically, even though this is not the case. You can push the period toward infinity with zero padding, but I believe feature engineering with a Fourier transform only makes sense if the signal is in a steady state.
In case of a transient signal, a wavelet transform might be more suitable.
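For intuition, one level of the simplest wavelet transform (Haar) just splits a signal into local averages and local differences, so transients stay localized in time. A pure-numpy sketch, illustrative only; in practice a library such as PyWavelets offers proper wavelet families:

```python
import numpy as np

def haar_level(x):
    """One level of the Haar wavelet transform: pairwise averages
    (approximation) and pairwise differences (detail), scaled by
    1/sqrt(2) so the transform preserves energy."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = pairs.sum(axis=1) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

# A transient blip: mostly zeros, with a short burst in the middle.
sig = np.zeros(16)
sig[6:10] = [1.0, 2.0, 2.0, 1.0]
a, d = haar_level(sig)
# The nonzero detail coefficients localize the transient in time,
# which a global Fourier transform cannot do.
print(a)
print(d)
```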
Best regards
Christian
Hello @Christian_Simonis
I have tried using a spectrogram before, and I found that my signal is in steady state, although it is excited by random forces. I’m relatively new to deep learning, so if you have more to share about feature engineering, please give me some sources that I could look into in the future. It would be greatly appreciated.
Thank you for your feedback!
Best,
Yuhan
Hello @TMosh
I have tried training it with more epochs; my model has now reached 97% accuracy on my testing set and 65% accuracy on my validation set. It seems I have now encountered a high-variance problem. According to the course, I could either try (1) getting more data, or (2) trying out regularization techniques. I’m not sure whether my model is ‘overfitting’ my dataset, though; would you mind elaborating on how you came to this conclusion? It would be greatly appreciated.
[Screenshots: loss curve, accuracy curve, and confusion matrix]
Thanks for guiding me on this project!
Best,
Yuhan
Ok, very good. Then the Fourier transform can be a powerful tool for feature engineering.
Here some threads that might be interesting for you:
 Can we start with the circle equation as decision boundary?  #12 by Christian_Simonis
 Question on Linear and logistic regression  #3 by Christian_Simonis
In the 2nd link, a residual analysis (per feature) is also described. Often this makes sense, to evaluate whether you still have systematic patterns left that you could potentially exploit in your feature engineering: in a perfect world you would just see a random (Gaussian) distribution in your residuals and no systematic patterns.
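The residual check described above can be sketched in a few lines: fit a simple model, subtract its predictions, and look for structure left over. Here, on synthetic illustrative data, just the residual mean and a lag-1 correlation are checked; in the ideal case both are near zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: a linear feature plus pure Gaussian noise.
x = rng.uniform(-1, 1, size=300)
y = 2.0 * x + 0.3 * rng.normal(size=300)

# Fit a straight line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Ideal residuals look like random noise: mean near zero and little
# correlation between neighboring residuals (no systematic pattern).
r = residuals - residuals.mean()
lag1 = float(np.corrcoef(r[:-1], r[1:])[0, 1])
print(residuals.mean(), lag1)
```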
Also a feature importance analysis like SHAP (SHapley Additive exPlanations) can be a powerful tool in your feature engineering process.
Happy learning and good luck with your project!
Best regards
Christian