Question about Neural Network input and output vectors

I have completed the MLS and begun week one of the DLS and now have a question going back to this module from week one of the Advanced Learning Algorithms course.

Please tell me if this is correct:

When one feeds a feature vector X as input to a dense hidden layer with the ReLU activation function, each node will fit a ReLU function through every feature in vector X and provide a prediction. Each of these predictions is organized into a new vector a[1], which is used as input for the next hidden layer, and so on.

If this is correct, wouldn't the output vector a[1] be a vector of length equal to the number of nodes in hidden layer one, with each element in vector a[1] being the same number?

In a dense hidden layer, every feature from vector X was used in every node, so every node will produce the same output value in vector a[1]. Specifically, elements a[1]_1, a[1]_2, … a[1]_j are all the same number.

That just feels like it can't be correct. Can someone help me understand better? Thanks for the help!

No, that's not what a hidden layer does. Predictions are only formed at the output layer. Every unit has an activation value, but only in the output layer is that a prediction.

Each pair of adjacent layers in a neural network is connected by a weight matrix.

The size of the weight matrix is {outputs by inputs}, where outputs is the number of units in the next layer, and inputs is the number of units in the previous layer.

The number of units in each hidden layer is independent. You select that as part of designing the model.

The number of units in the input layer is determined by the number of features (the input), and the number of units in the output layer by the number of outputs (i.e. the number of labels).
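As a rough NumPy sketch of those shape rules (the layer sizes 100, 25, 15, and 1 are just the ones discussed in this thread; the code itself is my own illustration, not from the course):

```python
import numpy as np

# Hypothetical network: 100 input features, hidden layers of 25 and 15
# units, and a single output unit.
n_x, n_h1, n_h2, n_y = 100, 25, 15, 1

rng = np.random.default_rng(0)
# Each weight matrix has shape (units in next layer, units in previous layer).
W1 = rng.standard_normal((n_h1, n_x))   # (25, 100)
W2 = rng.standard_normal((n_h2, n_h1))  # (15, 25)
W3 = rng.standard_normal((n_y, n_h2))   # (1, 15)

print(W1.shape, W2.shape, W3.shape)     # (25, 100) (15, 25) (1, 15)
```

Only the input size (100) and output size (1) are fixed by the problem; the 25 and 15 are design choices.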

This is correct, thank you.

@TMosh You've lost me a bit, here. Allow me to rephrase my question and include a visual reference:

Suppose each hidden layer in this NN diagram is a dense hidden layer with the same activation function, ReLU. Also, suppose that vector X has 100 features.

Please tell me if these statements are correct:

  1. Vector a[1] is the output of hidden layer 1 and will have length 25. Correct.
  2. Vector a[2] is the output of hidden layer 2 and will have length 15. Correct.
  3. Elements a[1]_1, a[1]_2, … a[1]_25 are all the same value. Incorrect.
  4. Elements a[2]_1, a[2]_2, … a[2]_15 are all the same value. Incorrect.

Thanks for your patience and willingness to help me out.

That's a really bad diagram, and I wish they'd replace it with something better.

Iā€™ll see if I can find a better one and post it here.

Your notes 3 and 4 are incorrect. Every unit has a unique output value. Otherwise there's no point in having multiple units per layer.


Thank you! That is exactly my point: it makes no sense to have multiple units per layer if points 3 and 4 are correct.

Can you explain in more detail how the hidden layers compute the output vectors a[1] and a[2]?

Here's a simple depiction of the connections in a neural network. This one is from the "coffee roasting" lab; I'm not sure which course has that.

This NN has two input units, three hidden layer units, and one output.
The units are the circles, every straight line represents a weight value.

The size of W1 is (3 x 2).
The size of W2 is (1 x 3).

The weight (and bias) values are learned so as to minimize the cost at the output layer. The method is called "backpropagation of errors". The math involves a lot of calculus, which most ML courses assume is already proven and take as fact.

This diagram doesnā€™t show the bias value that is included with each unit.

Each unit computes the sum of the products of its weights and the input values, plus a bias, with some activation g(…) applied. In vector algebra, this gives you the whole result as a vector for each example.

For the hidden layer, with each example as a row of X, it's A1 = g(X · W1ᵀ + b1).

So what you actually get at A1 is a matrix of size (m x 3), since 'm' is the number of examples (rows) in the training set X.

A similar process happens to compute A2.

At the output layer, there will be an additional step to compute predictions. For example, if you're doing classification, then A2 will be turned into logical true/false values by applying a threshold operation (>= 0.5). This works because sigmoid() has a range of 0 (False) to 1 (True), and 0.5 is the boundary right in the middle.
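To make that concrete, here's a minimal NumPy sketch of the forward pass for the little 2-3-1 network above. The random weights, the ReLU hidden activation, the sigmoid output, and the row-per-example layout are my illustrative choices, not taken from the lab:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m = 5                                   # number of examples
X = rng.standard_normal((m, 2))         # two input features per example

W1 = rng.standard_normal((3, 2))        # hidden layer: 3 units, 2 inputs each
b1 = rng.standard_normal(3)
W2 = rng.standard_normal((1, 3))        # output layer: 1 unit, 3 inputs
b2 = rng.standard_normal(1)

A1 = relu(X @ W1.T + b1)                # shape (m, 3): one row per example
A2 = sigmoid(A1 @ W2.T + b2)            # shape (m, 1): values in (0, 1)
predictions = (A2 >= 0.5)               # threshold into True/False labels

print(A1.shape, A2.shape)               # (5, 3) (5, 1)
```

Note that each column of A1 comes from a different row of W1, which is why the three hidden-unit activations differ for a given example.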


This is helpful. Thank you for taking the time to explain this to me.

Can you point me to a good resource that I could use to read more myself? Beyond the MLS and DLS, which I'm already working through.

Sorry, I don't have any other references.

@cwebster, one more key point is that the weights and biases need to be randomly initialized before training. Otherwise, if they are all the same and the inputs are the same, then each node in a layer will learn the same thing. It sounds like this may be the missing piece that was puzzling you.
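A quick NumPy toy example of why that matters (my own sketch, not from the course): with identical starting weights, every unit in a layer computes the same activation, so their gradient updates are identical too and they can never learn different things.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, 1.2])                 # one example with two features

# Symmetric initialization: all three hidden units share the same weights.
W_same = np.full((3, 2), 0.1)
b_same = np.zeros(3)
a_same = relu(W_same @ x + b_same)
print(a_same)                            # [0.17 0.17 0.17] -- all identical

# Random initialization breaks the symmetry: each unit starts out different.
rng = np.random.default_rng(0)
W_rand = rng.standard_normal((3, 2)) * 0.01
a_rand = relu(W_rand @ x + b_same)
print(a_rand)                            # generally three different values
```

This is exactly the scenario in the original question: it's only when every unit has identical weights that every element of a[1] comes out the same.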


This is the key point that I misunderstood! Thank you for pointing that out.

For anyone else, this is the video that answers my question: https://www.coursera.org/learn/neural-networks-deep-learning/lecture/XtFPI/random-initialization
