C2_W1_Lab02_CoffeeRoasting_TF weights and randomization

Hello, I have questions about weights:

  1. Why do we have 3 weights for L2? According to this slide, we should have 1 weight multiplied by 3 activation values.

  2. I see that we should initially set some values for the weights, and they can’t be 0 - we use some random function to do this. The question is: is it possible that the random initialization gives us weights such that our algorithm will not reach convergence?

There is also a question about this part:


I don’t understand how this helps us, since we just repeated our data.

The arrows in that diagram don’t show the weights. They show the data flow.

In a Dense network, each unit in a layer is connected to all of the units in the adjacent layers.

Possible, yes. Likely, not.
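For what it’s worth, the initial weights are small random values rather than zeros so that the units in a layer don’t all start out identical. If you want the random draw to be repeatable, you can fix a seed. A minimal sketch (the seed value and layer sizes here are arbitrary, just for illustration):

```python
import tensorflow as tf

tf.random.set_seed(1234)  # makes the random weight initialization reproducible

layer = tf.keras.layers.Dense(3, activation="sigmoid")
layer.build(input_shape=(None, 2))  # creates the weights right now
W, b = layer.get_weights()
print(W)  # small random values (Glorot uniform by default), never all zeros
print(b)  # biases start at zero
```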

In a Dense network, each unit in a layer is connected to all of the units in the adjacent layers.

I see, but the formula confuses me, because our L2 has only 1 unit (the j-index), so it seems we should have only 1 weight. Do I understand correctly that the number of connections (and weights) differs from one layer type to another, and that I will learn this later?

For counting the weights, it’s not the type of layer that matters. It’s the connections between layers.

Each place where you can draw a line between units in adjacent layers represents a weight value.

This figure from the Coffee Roasting lab is an over-simplification, because it doesn’t show the W1 weights effectively.

This is a better representation (omitting the bias units for clarity):
[diagram of the network]
There are two input features.
There are three hidden layer units.
There is one output layer unit.

Each unit in A1 and A2 also has a bias value (not shown here).
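If you want to check this in the lab itself, here is a minimal sketch along the lines of the lab’s model (the layer names are just for illustration) that prints the weight and bias shapes Keras allocates:

```python
import tensorflow as tf

# Two input features, a 3-unit hidden layer, a 1-unit output layer,
# matching the figure above (sigmoid activations as in the lab).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation="sigmoid", name="layer1"),
    tf.keras.layers.Dense(1, activation="sigmoid", name="layer2"),
])

W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()

print(W1.shape, b1.shape)  # (2, 3) and (3,) -> 2 lines into each of 3 hidden units
print(W2.shape, b2.shape)  # (3, 1) and (1,) -> 3 lines into the single output unit
```

The (3, 1) kernel of the output layer is exactly the three weights from the original question: one per hidden-layer activation feeding the single output unit.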

Hi @DagerD ,

j is just a variable addressing a particular unit in a layer. If you look at the enlarged portion of the diagram you posted, you can see that layer 3 has 3 units. Each circle within a layer is a unit.

So for layer 2, there are 5 units. If we were to take layer 2 unit 4, j would be 4 and l would be 2.
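If it helps to see that notation in code: with the convention that a layer’s weight matrix has one column per unit, w^[2]_4 is just column 4 of that layer’s matrix. A hypothetical NumPy sketch with made-up sizes (5 units in layer 2; the 4 incoming activations are an assumption for illustration):

```python
import numpy as np

n_in, n_units = 4, 5                 # hypothetical: layer 2 has 5 units, fed by 4 activations
W2 = np.random.randn(n_in, n_units)  # one column of W2 per unit in layer 2
b2 = np.zeros(n_units)

j = 4                                # unit 4 of layer 2, as in the example above
w_2_4 = W2[:, j - 1]                 # the weight vector w^[2]_4 (1-indexed in the notation)
b_2_4 = b2[j - 1]
print(w_2_4.shape)                   # (4,) -> one weight per incoming connection
```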

Got it, thank you!

Could you please also clarify this part?

Is it true that there are two options (in the lab’s case):

  1. work with the initial data for more epochs
  2. duplicate the data and train for fewer epochs

and both are equally valid to work with, but the second is chosen just for optimization?

Iteration is a slow process, so putting multiple copies of the same data into the training set allows us to make fewer iterations.

Either way, the model sees the same amount of data in total - it’s just in bigger chunks if you duplicate the examples.
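For reference, the lab does this kind of duplication with np.tile. A rough sketch of the idea (the sample values and copy count here are just illustrative):

```python
import numpy as np

# X has shape (m, 2) and Y has shape (m, 1) for the coffee-roasting examples.
X = np.array([[200.0, 13.9],
              [225.0, 12.0]])        # tiny made-up sample for illustration
Y = np.array([[1], [0]])

copies = 1000                        # the lab tiles the data many times over
Xt = np.tile(X, (copies, 1))         # stack `copies` copies of the rows
Yt = np.tile(Y, (copies, 1))

print(Xt.shape, Yt.shape)            # (2000, 2) (2000, 1)
# Training on Xt for 10 epochs shows the model the same examples as
# training on X for 10 * copies epochs, just in fewer (larger) passes.
```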