I have completed the MLS and begun week one of the DLS, and I now have a question going back to this module from week one of the Advanced Learning Algorithms course.
Please tell me if this is correct:
When one feeds a feature vector X as input to a dense hidden layer with the ReLU activation function, each node will fit a ReLU function through every feature in vector X and provide a prediction. Each of these predictions is organized into a new vector a[1], which is used as input for the next hidden layer, and so on.
If this is correct, wouldn't the output vector a[1] be a vector of length equal to the number of nodes in hidden layer one, and wouldn't each element of vector a[1] be the same number?
In a dense hidden layer, every feature from vector X is used in every node, so every node will produce the same output value in vector a[1]. Specifically, elements a[1]_1, a[1]_2, …, a[1]_j would all be the same number.
That just feels like it can't be correct. Can someone help me understand better? Thanks for the help!
No, that's not what a hidden layer does. Predictions are only formed at the output layer. Every unit has an activation value, but only at the output layer is that activation a prediction.
Each pair of adjacent layers in a neural network is connected by a weight matrix.
The size of the weight matrix is {outputs by inputs}, where outputs is the number of units in the next layer, and inputs is the number of units in the previous layer.
The number of units in each hidden layer is independent. You select that as part of designing the model.
The numbers of units in the input and output layers are determined by the number of features (the input) and the number of outputs (i.e. the number of labels), respectively.
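As a quick sanity check on those shapes, here's a minimal NumPy sketch (the layer sizes here are made up purely for illustration):

```python
import numpy as np

# Hypothetical layer sizes, just to illustrate the {outputs by inputs} rule
n_in, n_h1, n_out = 4, 3, 1

W1 = np.zeros((n_h1, n_in))    # connects input layer -> hidden layer: (3, 4)
W2 = np.zeros((n_out, n_h1))   # connects hidden layer -> output layer: (1, 3)

print(W1.shape, W2.shape)      # (3, 4) (1, 3)
```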
Suppose each hidden layer in this NN diagram is a dense hidden layer with the same activation function, ReLU. Also, suppose that vector X has 100 features.
Please tell me if these statements are correct:
Vector a[1] is the output of hidden layer 1 and will have length 25. Correct.
Vector a[2] is the output of hidden layer 2 and will have length 15. Correct.
Elements a[1]_1, a[1]_2, …, a[1]_25 are all the same value. Incorrect.
Elements a[2]_1, a[2]_2, …, a[2]_15 are all the same value. Incorrect.
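To see why those last two statements are incorrect, here's a small NumPy sketch using the sizes from your example (random weights stand in for whatever a trained network would learn). Each unit has its own row of weights, so each element of a[1] generally comes out different, even though every unit sees all 100 features:

```python
import numpy as np

rng = np.random.default_rng(0)

x  = rng.normal(size=(100,))     # one example with 100 features
W1 = rng.normal(size=(25, 100))  # hidden layer 1: 25 units, one row of weights per unit
b1 = rng.normal(size=(25,))

a1 = np.maximum(0, W1 @ x + b1)  # ReLU activation, one value per unit
print(a1.shape)                  # (25,) -> length equals the number of units
print(a1[:5])                    # the first few activations are all different values
```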
Thanks for your patience and willingness to help me out.
Here's a simple depiction of the connections in a neural network. This one is from the "coffee roasting" lab; I'm not sure which course has that.
This NN has two input units, three hidden layer units, and one output unit.
The units are the circles; every straight line represents a weight value.
The size of W1 is (3 x 2).
The size of W2 is (1 x 3).
The weight (and bias) values are learned so that they minimize the cost at the output layer. The method is called "backpropagation of errors". The math involves a lot of calculus, which most ML courses take as already proven and treat as fact.
This diagram doesn't show the bias value that is included with each unit.
Each unit computes the sum of the products of its weights and the input values, plus a bias, with some activation g(…) applied. Using vector algebra, you get the whole layer's result for every example at once.
For the hidden layer, it's a1 = g(W1 * x + b1) for one example x; stacking the m training examples as the rows of X, that becomes A1 = g(X * W1.T + b1), where W1.T is the transpose of W1 so the shapes line up.
So what you actually get at A1 is a matrix of size (m x 3), since 'm' is the number of examples (rows) in the training set X.
At the output layer, there will be an additional process to compute predictions. For example, if you're doing binary classification, then A2 will be turned into logical True/False values by applying a threshold operation, e.g. >= 0.5. This works because sigmoid() has a range of 0 (False) to 1 (True), and 0.5 is the boundary right in the middle.
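Putting this walkthrough together, here's a minimal sketch of the forward pass for this 2-3-1 network in NumPy (random weights and inputs purely for illustration; I'm using ReLU in the hidden layer and sigmoid at the output, as discussed above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

m  = 5                                 # number of training examples
X  = rng.normal(size=(m, 2))           # two input features per example
W1 = rng.normal(size=(3, 2))           # (outputs x inputs) for the hidden layer
b1 = rng.normal(size=(3,))
W2 = rng.normal(size=(1, 3))           # (outputs x inputs) for the output layer
b2 = rng.normal(size=(1,))

A1 = np.maximum(0, X @ W1.T + b1)      # hidden layer with ReLU; shape (m, 3)
A2 = sigmoid(A1 @ W2.T + b2)           # output layer with sigmoid; shape (m, 1)
preds = A2 >= 0.5                      # threshold into True/False predictions
print(A1.shape, A2.shape, preds[:3])
```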
@cwebster, one more key point is that the weights need to be randomly initialized before training (the biases can simply start at zero). Otherwise, if the weights all start identical and the inputs are the same, every node in a layer computes the same activation and receives the same gradient update, so they would all learn the same thing. It sounds like this may be the missing piece that was puzzling you.
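Here's a quick sketch of that symmetry problem (made-up sizes, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4,))            # one example with 4 features

# If every unit starts with identical weights, every activation is identical...
W_same = np.ones((3, 4))
print(np.maximum(0, W_same @ x))     # three copies of the same number

# ...while random initialization breaks the symmetry
W_rand = rng.normal(size=(3, 4))
print(np.maximum(0, W_rand @ x))     # three (generally) different numbers
```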