Let’s consider an example where I’d like to train a neural network and I have a dataset of 100 examples. I wonder how the 100 examples are fed into the first layer. Is every unit of the first layer processing all 100 examples, or are the 100 examples spread across however many units there are in the first layer? And why would I get different sets of parameters for the units if they all receive the same data and use the same optimisation algorithm?
Thank you for any insight into this!
Luca
Every unit processes every example.
Each unit ends up with a different set of weights because the weights are initialized to different random values.
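To make that concrete, here is a minimal NumPy sketch (the layer sizes and data are made up for illustration): if all units in a layer started from identical weights, they would produce identical outputs and receive identical gradient updates, so they could never differentiate; random initialization breaks that symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 examples, 4 features (made-up data)

# Case 1: all 3 units share the same initial weights -> their outputs are identical
W_same = np.full((4, 3), 0.5)
Z_same = X @ W_same
print(np.allclose(Z_same[:, 0], Z_same[:, 1]))   # True: the units are redundant copies

# Case 2: small random initial weights -> every unit computes something different
W_rand = rng.normal(scale=0.01, size=(4, 3))
Z_rand = X @ W_rand
print(np.allclose(Z_rand[:, 0], Z_rand[:, 1]))   # False: symmetry is broken from the start
```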
Hi @Luca_Pellegrini, great question!
The examples (data points) in your dataset are not spread across the units (also known as neurons or nodes) in the first layer. Instead, each example is fed into the network one at a time (or in batches) during training, and every unit in the first layer processes all of the features of that example. This repeats until every example in your dataset has been fed through the network.
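To make the shapes concrete, here is a small sketch (assuming 100 examples, 4 features, and 3 first-layer units, which are just illustrative numbers): the whole batch goes through the first layer as one matrix product, and every unit sees every feature of every example.

```python
import numpy as np

rng = np.random.default_rng(1)

n_examples, n_features, n_units = 100, 4, 3   # illustrative sizes only
X = rng.normal(size=(n_examples, n_features)) # dataset: one row per example
W = rng.normal(size=(n_features, n_units))    # one weight column per unit
b = np.zeros(n_units)                         # one bias per unit

# First-layer forward pass: every unit processes every example's full feature vector.
Z = X @ W + b                                 # shape (100, 3): 100 examples x 3 unit outputs
A = np.maximum(Z, 0)                          # ReLU activation applied element-wise
print(A.shape)                                # (100, 3)
```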
Here’s a more detailed breakdown:
- Each data point (or example) in your dataset typically represents a vector of features. For instance, if you’re trying to classify images of digits, each image can be thought of as a vector where each element is a pixel value.
- Each unit in the first layer of your neural network is connected to all elements of this vector via a set of weights (one for each feature). When a data point is fed into the network, each unit in the first layer computes a weighted sum of all input features (a dot product of the feature vector and the weight vector), applies an activation function to this sum, and passes the result to the next layer.
- During the backpropagation phase of training, the network adjusts the weights in a way that minimizes the difference between its predictions and the actual values (the error). Each unit’s weights are adjusted independently, depending on that unit’s contribution to the total error. This is why different units end up with different sets of weights even though they all receive the same data and use the same optimization algorithm. (There’s a rough sketch of one such update right after this list.)
- The process of feeding examples into the network and adjusting weights based on the error is repeated many times (epochs) until the network’s performance plateaus.
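Putting those steps together, here is a rough sketch of a training loop for a tiny network with one hidden layer. The dataset, layer sizes, learning rate, and number of epochs are all made-up assumptions, and the backward pass is written out by hand just to show how each unit’s weights receive their own update:

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny made-up dataset: 100 examples, 4 features, binary labels.
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer with 3 units and one output unit (arbitrary sizes).
W1 = rng.normal(scale=0.1, size=(4, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(3, 1)); b2 = np.zeros(1)
lr = 0.1

for epoch in range(100):
    # Forward pass: every hidden unit sees every example and every feature.
    Z1 = X @ W1 + b1
    A1 = np.maximum(Z1, 0)                 # ReLU
    Z2 = A1 @ W2 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))         # sigmoid prediction

    # Backpropagation for the binary cross-entropy loss.
    dZ2 = (A2 - y) / len(X)                # error at the output
    dW2 = A1.T @ dZ2; db2 = dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)          # each hidden unit gets its own share of the error
    dW1 = X.T @ dZ1;  db1 = dZ1.sum(axis=0)

    # Gradient-descent update: units drift apart because their gradients differ.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", float(((A2 > 0.5) == y).mean()))
```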
The reason we have multiple units in a layer is to capture different features or aspects of the data. For instance, in an image classification task, different units might specialize in detecting different types of features, such as edges, corners, or color blobs. This allows the network as a whole to capture more complex patterns in the data. Each unit starts with random weights, and during training those weights evolve differently because of error backpropagation, allowing the units to learn different things from the same data.
I hope this helps!
Thank you (and @TMosh) for the prompt and clear reply. What I find most fascinating is the spontaneous ‘division of labor’ among neurons to detect different aspects of the data. If you can recommend any further reading on this please let me know.
I think the ‘division of labor’ is a concept used for teaching, but it can be somewhat misleading. Every neuron performs the same kind of computation on its input; by chance, some neurons’ outputs end up detecting certain patterns better than others. It is a bit like fitting several linear regression models to the same data: by chance, one model may end up relying on features that explain the target better, while another ends up relying on different features. So, in a sense, the labor does get divided among the neurons, but an individual neuron does not know about or deliberately pick features of the dataset.
The Deep Learning Specialization can be a great path for you to learn more about this!
I hope this helps