How do I think about what’s happening in the hidden layers of a neural net? For example, in the softmax lab in week 2 we start with some training data, run it through the first layer to get 25 outputs, then 15 outputs on the second layer, and finally 10 outputs.
Questions:
Why 25 neurons? Why not 100? Is there a rule of thumb?
How does one determine the best type of architecture for a neural network? I understand the very basics of the in/out (i.e. categorical will have more than one output, binary will have one), but I am having a tough time discovering what’s going on in the hidden layers and how one might set them up.
Is there anything interesting going on in the hidden layers that would be human readable? Are they meaningful at all?
Overall I can see how to use tf in a super basic way, but I am really missing the “how.”
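(Purely for my own reference, here’s a minimal Keras sketch of the layer stack I’m describing -- the 25/15/10 sizes come from the lab description above, but the activations here are just my guess:)

```python
import tensorflow as tf

# Rough sketch of the stack I'm asking about: 25 -> 15 -> 10 units.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per class
])
```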
@ColinMcEnroe as to your second question, I feel this will be much more clear if you decide to go on to take DLS. In short, in many cases the type of architecture you choose will depend on the problem you are trying to solve.
A DNN might be used for classification, say, whereas CNNs are typically used for image/audio models, and RNNs (GRUs, LSTMs, etc.) are used for data with a time-series component.
None of which is to say this is an explicit restriction on which architecture you choose for which type of problem. For example, handwritten digit identification with MNIST is essentially an image classification problem; however, you can still get really good results with just a DNN, without needing a CNN.
You can also ‘reformulate’ your data to better match the type of model you want to work with.
As to ‘what’s happening’ in the hidden nodes, I think it is generally agreed that we are performing ‘feature detection’, or finding the parts of our data that most affect our outputs.
Towards your third question, is there anything ‘humanly readable’— Mmm, generally I would say ‘no’. But as you’ll also see in DLS, say you’re dealing with a CNN, there are ways you can kind of ‘reconstruct’ the features at different levels to get a bit of a sense of what is going on. But the results of this are entirely data/problem/model specific.
There is no ‘general rule of thumb’ for interpreting model weights in a human understandable way, at least on their own.
Thanks for sharing! This class is obviously introductory, so I need to keep my expectations in line with that. I look forward to thinking about this more deeply as I work through the different modules. I suspect that actually building some of my own workbooks with different datasets will be very helpful in building more intuition.
Thinking of DL and MNIST data, is it that both a CNN and a DNN can be trained to output valid and accurate results, but one might be more costly to run and optimize? Or is it something else? Is this generally true (i.e. for datasets besides MNIST)?
Why 25 neurons? Why not 100? Is there a rule of thumb?
The number of hidden layer units is determined by experimentation. The more units, the more complexity the model can learn, but also the longer it can take to train, and very complex models may overfit the training set.
One rule of thumb for the initial number of hidden layer units is the square root of the number of input features. This may (or may not) be a good starting point.
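For example, a minimal sketch of that starting point (the 400-feature input here is hypothetical, just to make the rule of thumb concrete):

```python
import math
import tensorflow as tf

num_features = 400                              # hypothetical input size
hidden_units = round(math.sqrt(num_features))   # rule-of-thumb starting point: 20

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_units, activation="relu",
                          input_shape=(num_features,)),
    tf.keras.layers.Dense(10),                  # one unit per class; adjust for your task
])
```

From there, you adjust the hidden layer size up or down based on training/validation performance.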
How does one determine the best type of architecture for a neural network?
Experience. The kind of NN discussed in this course is a Dense fully-connected network. It’s good for certain types of tasks. More complex tasks will require other architectures, like Convolutional NN’s or Recurrent NN’s. You can learn about those in the Deep Learning Specialization.
Is there anything interesting going on in the hidden layers that would be human readable?
Not necessarily. Don’t worry about this too much. The concept is that the hidden layers learn to identify some non-linear combinations of the input features. The traditional way this is explained (which is nice for being intuitive but is not necessarily true) might be that if the inputs are images, the hidden layers might identify particular shapes or patterns or edges in the images.
As Anthony and Tom have said, what happens there is basically not that human interpretable in general. I guess at a high level you can say that it must be “meaningful” if the networks produce useful results, even if we can’t be very specific about the actual details. But this is an interesting question that researchers have thought about. If you take DLS and get as far as DLS Course 4 (Convolutional Networks), Prof Ng will show some research that was done to instrument neurons in the hidden layers of ConvNets to see what “triggers” them the most strongly in the inputs. If you want a preview, he has put a lot of the lectures out on YouTube and here’s the lecture I mentioned (it’s in Week 4 of DLS C4 when you actually take the course). Even though this is talking about ConvNets, which have a somewhat different architecture than you’ve seen yet, the fundamental ideas discussed in the lecture should make some intuitive sense. If you’re curious, it would also give you more of a sense of what you’ll learn by taking DLS after MLS as Anthony mentioned.
As Paul says, in the case of image detection you can see that the hidden layers learn low- and high-level features of the image; this can be seen if you visualize the activations at those hidden layers. In the 3rd course of the TensorFlow Advanced Techniques Specialization they create exactly such visualizations!
What is basically happening in the hidden layers is just mathematical transformations: you input something and you get an output. The issue is that we are dealing with high-dimensional matrices, and for us humans it’s not easy to visualize beyond 3 dimensions.
More simply, I would say that the hidden layers are basically trying to find coefficients to fit a function (of whatever dimension), very simply put something like y = ax + b (where a and b are the coefficients). The coefficients give a good fit between input and output. Instead of finding them by solving an equation, the computer performs a trial-and-error process (using predefined rules), so you don’t have to do the heavy lifting.
Instead of one high-dimensional polynomial, the entire process is broken down into linear-regression-plus-activation steps, which is basically doing the same thing, though not exactly; it’s like a circle drawn with many small straight segments. It’s an approximation, but it does a pretty nice job for most input cases, and generally speaking most cases are what we care about!
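As a toy illustration of that trial-and-error idea for y = ax + b (plain NumPy, made-up data, not from the course):

```python
import numpy as np

# Made-up data roughly following y = 3x + 2, with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + 0.1 * rng.normal(size=100)

a, b = 0.0, 0.0                   # start with arbitrary coefficients
lr = 0.1                          # learning rate: the "predefined rule"
for _ in range(500):
    y_hat = a * x + b             # current guess
    error = y_hat - y
    a -= lr * np.mean(error * x)  # nudge the coefficients to reduce the squared error
    b -= lr * np.mean(error)

print(a, b)  # ends up close to 3 and 2
```

A neural network does this same kind of iterative adjustment, just with many more coefficients and a non-linear activation after each linear step.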
Great questions. I am only addressing the points in your questions that I noticed no one had replied to yet.
As per your description, softmax activation was used; that means the data has more than 2 classes, i.e. it is multi-class data. This is confirmed by the output layer having 10 units: remember, the last dense layer always has as many units as the number of classes in your data, and is not just chosen as per your preference.
But if we were analyzing the same data with only 2 classes, then the activation would change, as well as the number of units in the last dense layer.
A correction is required here: categorical crossentropy is used when your data has more than 2 classes. It also has a variant, sparse categorical crossentropy; with both you still use softmax activation, but in model.compile() the loss changes depending on how your labels are encoded.
And data with 2 classes will use binary crossentropy as the loss and sigmoid as the activation for the last dense layer.
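A minimal Keras sketch of those pairings (layer sizes and input shapes here are arbitrary):

```python
import tensorflow as tf

# Binary (2 classes): sigmoid output + binary crossentropy.
binary_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
binary_model.compile(loss="binary_crossentropy", optimizer="adam")

# Multi-class with integer labels: softmax output + sparse categorical crossentropy.
multi_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
multi_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# With one-hot encoded labels you would use loss="categorical_crossentropy" instead.
```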
Trying to figure out what the hidden layers do in a neural network is a bit like trying to understand how the brain functions: knowing the significance of each hidden layer type, be it a convolutional layer, pooling layer, or fully connected layer, and the computation each one performs. (You will come across this in DLS, and get to practice some great model algorithms when you do the TensorFlow Developer Professional Certificate and the TensorFlow Advanced Techniques Specialisation, as mentioned by every mentor here.)
@gent.spah so, not to ‘detour’ a little from @ColinMcEnroe’s question (though he might find this interesting, and I’m going to at least pull in @paulinpaloalto and @tmosh and @ai_curious on this because I want to get to the bottom of this)–
Gent, at first I felt I had the same insight you express: Like, okay, wow, this is kind of like a really big linear regression with non-linear activations-- Or, at least it appears that way.
However from various outside sources, including the related MITx class I tried before I showed up at DLAI and before I knew anything about DL-- Well, they spend a whole lot of time talking about the Perceptron, and classify a neural network as an MLP (Multi-layer Perceptron).
Now, I’d roughly studied Perceptrons many years prior, but it has only been much more recently that I’ve finally figured out what is going on.
I mean, if the point/purpose of regression is to find a ‘trend’, perceptrons on the other hand are almost the ‘anti-trend’. On the surface they look very similar: they are both equations of a line and are oriented based on the underlying data. Yet while regression looks for commonalities, a perceptron does the exact opposite: it is optimally trying to split classes, or find the boundary that most completely separates the data in two.
And, I suppose, at least part of the question here as well is… obviously the MLP is seen as a foundational model of Deep Learning, but I’ve never heard of a ‘Neural Linear Regression Network’, or anything of the sort.
Even though it really, really does look like linear regression, I’ve never heard any teacher, or seen in any book, say ‘well, to go from this node to the next one we apply linear regression from the previous activation against the current set of weights’.
In a way, that would be such an easy way to explain it-- only, no one does or says that.
So, at least in part, it is this disparity that had me start to wonder: ‘hmm, what exactly do we really have going on here between nodes?’
I think of architectural choices in terms of a benefit/cost tradeoff. In your example, you have two hidden layers. For a given training regime, ask: what accuracy does that produce? What is the runtime throughput? Are they sufficient for your operational requirements?

If you need more accuracy, maybe more layers and/or more neurons would help, especially if the input is complex. But this gain comes at the expense of additional computation and perhaps runtime throughput. Also, as you add layers, you may need to add some other architectural tricks such as dropout, regularization, or pooling to avoid well-known mathematical gotchas* of deep networks. As mentioned elsewhere, too many additional layers/neurons/parameters may improve training accuracy but degrade generalization due to overfitting of the training data.

*Avoiding specifics here since I don’t remember if they are taught in MLS… you will certainly learn about them in the Deep Learning Specialization.
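As an illustration only, here is roughly what adding a couple of those tricks might look like in Keras (all sizes and values below are arbitrary, not from any course lab):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,),
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 regularization on the weights
    layers.Dropout(0.3),                                     # randomly drop 30% of units during training
    layers.Dense(32, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```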
I suggest starting with a model that has been used before (not necessarily by you) as a guideline, potentially reducing its complexity to something minimally viable, then adding back in if you run into unacceptable performance.
@TMosh so I’ve thought about it a bit more during the day, and unless anyone has a better suggestion, this is what I’ve concluded:
The perceptron is a more apt description because:
It includes an activation function, whereas regression does not.
There is a standard ‘update’ procedure to incrementally adjust your weights and your bias (see the sketch below); in linear regression, there is no such thing.
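Here is a rough sketch of that classic perceptron update rule, purely as my own illustration on made-up data:

```python
import numpy as np

# Toy linearly separable data; labels are +1 / -1.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
for _ in range(10):                           # a few passes over the data
    for x_i, y_i in zip(X, y):
        if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified point
            w += y_i * x_i                    # the perceptron 'update'
            b += y_i

print(w, b)  # a separating line for this toy data
```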
So WX + b, yes, it is at minimum the description of a line (or more likely, a hyperplane)-- but it is not linear regression, because linear regression is actually this formula:
\hat{\beta} = (X^TX)^{-1}X^TY
That is linear regression: in one fell swoop, with no ‘updating’, you have your least-squares formulation (your optimization).
… WX + b… Is just a ‘line’ (plane, etc).
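To make that concrete, here is a small NumPy sketch of the one-shot least-squares solution on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # column of 1s handles the intercept
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.05 * rng.normal(size=50)

# One shot: beta_hat = (X^T X)^{-1} X^T Y -- no iterative updating at all.
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)
print(beta_hat)  # close to [2.0, -1.0, 0.5]
```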
P.s. @ColinMcEnroe sorry to ‘hijack’ your post a little. You were wondering what happens in the hidden layers, and I started wondering this too, so I thought it relevant to bring up. Best, -A
What I mean is that many small segments, added together, form the fit of the relationship between input and output. That is speaking in 2D, though; the problem might have many more dimensions! Anyway, this is my idea of how I would think about it.