Coffee Roasting Example. How come 3 neurons with the same activation function (sigmoid) provided outputs for 3 different regions?


“C2_W1_Lab02_CoffeeRoasting_TF” Lab.

How come 3 neurons with the same activation function (sigmoid) produced outputs for 3 different regions? The three identical functions all received the same input data. How did the units draw different “conclusions” from that data, and why do the three conclusions match the three regions (time, temperature, time*temperature)?



They have the same activation function, but not the same weights.

Each unit’s weights are randomly initialized. Because the units start from different values, gradient descent updates each one differently, and since the cost function is not convex, each unit tends to settle on a separate feature.

This method is called “breaking symmetry”.
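
To make the symmetry point concrete, here is a minimal sketch in plain Python (not the lab's TensorFlow code). It builds a tiny 2-input, 3-hidden-unit, 1-output sigmoid network and compares the gradients of the hidden units for one training example. The input values, the weight values, and the helper name `hidden_gradients` are all made up for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_gradients(W1, b1, W2, b2, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. each hidden unit's input weights.

    W1[j] holds the two input weights of hidden unit j.
    W2[j] is that unit's weight into the single output unit.
    """
    n_hidden = len(W1)
    # Forward pass
    a1 = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
          for j in range(n_hidden)]
    a2 = sigmoid(sum(W2[j] * a1[j] for j in range(n_hidden)) + b2)
    # Backward pass: for a sigmoid output with cross-entropy, dL/dz2 = a2 - y
    dz2 = a2 - y
    grads = []
    for j in range(n_hidden):
        dz1 = dz2 * W2[j] * a1[j] * (1.0 - a1[j])
        grads.append([dz1 * xi for xi in x])
    return grads

# One made-up, already-normalized (temperature, duration) example with label 1
x, y = [0.5, -1.2], 1.0

# Case 1: every hidden unit starts with the same weights
same = hidden_gradients([[0.1, 0.1]] * 3, [0.0] * 3, [0.2] * 3, 0.0, x, y)

# Case 2: random starting weights (seeded so the run is repeatable)
rng = random.Random(0)
W1_rand = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2_rand = [rng.uniform(-1, 1) for _ in range(3)]
rand = hidden_gradients(W1_rand, [0.0] * 3, W2_rand, 0.0, x, y)

print("identical start ->", same)
print("random start    ->", rand)
```

With the identical start, all three units get exactly the same gradient, so after any number of updates they would still be copies of each other. With the random start the gradients differ, which is what lets the units drift toward different features.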

Aha… In this case, does this mean that:

  1. in the example, unit 0 covered duration, unit 1 covered temperature and unit 2 covered time*temperature by accident? If the random starting weights were different, then e.g. unit 0 could cover temperature, and so on?

  2. Due to the random nature, could it happen that 2 units learn the same feature AND a feature is not covered at all? Is the situation that we see in the example artificially created, and in a real situation with 3 neurons in a layer would we probably get different results?


There is a better way to show the NN architecture than what is in the Lab02 notebook. Lab02 doesn’t really show the weights between the input and hidden layer.

Note that W1 has six weights - that’s all of the combinations of the two inputs (temperature and duration) and the three hidden layer units.

W2 are the weights that are used to compute the A2 value “good or bad coffee”. In this example, that’s True/False for whether the coffee is good.

For simplicity I’m not showing the bias weights b1 (a 3-element vector) and b2 (a scalar).
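
As a rough sketch of how those pieces fit together, here is the forward pass written out in plain Python. The numbers in `W1`, `b1`, `W2` and `b2` are hypothetical placeholders, not the trained values from the lab, and the input is assumed to be already normalized.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameter values, chosen only to show the shapes.
W1 = [[0.5, -1.0, 2.0],    # input 0 (temperature) -> hidden units 0, 1, 2
      [1.5, 0.3, -0.7]]    # input 1 (duration)    -> hidden units 0, 1, 2
b1 = [0.0, 0.1, -0.2]      # one bias per hidden unit
W2 = [1.0, -2.0, 0.5]      # hidden units 0, 1, 2 -> output unit
b2 = 0.3                   # single output bias

def forward(x):
    """Return (A1, A2) for one example x = [temperature, duration]."""
    a1 = [sigmoid(sum(x[i] * W1[i][j] for i in range(2)) + b1[j])
          for j in range(3)]
    a2 = sigmoid(sum(a1[j] * W2[j] for j in range(3)) + b2)
    return a1, a2

a1, a2 = forward([0.2, -0.4])
n_w1 = sum(len(row) for row in W1)

print("number of W1 weights:", n_w1)   # 2 inputs x 3 units
print("A1:", a1)
print("A2:", a2, "-> good roast?", a2 >= 0.5)
```

Thresholding A2 at 0.5 is how the probability turns into the True/False decision.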

The A1 units (the hidden layer) compute non-linear combinations of the two input features. In general these combinations don’t have any physical meaning.

In this simple coffee roasting example, it turns out that the three hidden layer units do have some explainable relationship.

Nowhere do we specifically compute a feature that is (temperature * duration). That behavior comes out of the hidden layer itself: each unit applies a non-linear activation function to a weighted sum of both inputs.
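
Here is a small illustration of that last point, again in plain Python with made-up weights (`w_temp`, `w_dur` and `b` are hypothetical, not taken from the lab). A single hidden unit only ever sees a weighted sum of the two normalized inputs, yet its output still depends on temperature and duration jointly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for one hidden unit acting on normalized inputs.
w_temp, w_dur, b = -8.0, -8.0, -4.0

def unit(temp, dur):
    # No temp * dur term anywhere: just a weighted sum followed by a sigmoid.
    return sigmoid(w_temp * temp + w_dur * dur + b)

# The unit switches along the line w_temp*temp + w_dur*dur + b = 0,
# i.e. temp + dur = -0.5 for these weights.
low_both = unit(-1.0, -1.0)      # low temperature and short duration
high_both = unit(1.0, 1.0)       # high temperature and long duration
on_line = unit(-0.25, -0.25)     # a point exactly on the boundary

print(low_both, high_both, on_line)
```

For these weights the unit is close to 1 when both inputs are low and close to 0 when both are high, so it responds to the combination of the two inputs even though no product is ever formed.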


I think I get it now. Thank you @TMosh for your help!