Why does the course emphasise the formula Z = Wx + B so much? I know this is analogous to logistic regression, but nowhere in Course 1 is it explained why it can't be something else, like Z = W1*x^2 + W2*sin(x) + B, or why we chose logistic regression as the starting point for deriving deep neural networks. Sorry for this argument; I don't have good prior knowledge of AI, but this doubt came naturally to my mind.

The technology here has been worked out over quite a number of years. The fundamental building block of a Neural Network is a "layer", which maps the inputs to the outputs in such a way that every neuron sees every input, and takes the simple form of a linear transformation:

Z = W dot A + b

Then this must be followed by a non-linear activation function at each layer to compute the output that then feeds into the next layer. There is a wide choice of activation functions in the inner or “hidden” layers of the network and then you need either sigmoid or softmax at the output layer depending on whether you are doing binary or multiclass classification.
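To make that concrete, here is a minimal sketch of one layer's forward step in NumPy. The shapes (3 input features, 4 neurons, 5 examples) and the use of sigmoid here are just illustrative choices, not anything fixed by the architecture:

```python
import numpy as np

def sigmoid(z):
    """Squash values into (0, 1); used at the output layer for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 input features, 4 neurons in this layer, 5 examples.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))       # one row of weights per neuron
b = np.zeros((4, 1))                  # one bias per neuron
A_prev = rng.standard_normal((3, 5))  # activations arriving from the previous layer

Z = W @ A_prev + b   # the linear step: Z = W dot A + b
A = sigmoid(Z)       # the non-linear activation step

print(A.shape)  # (4, 5): one activation per neuron, per example
```

The output A then plays the role of A_prev for the next layer, which is all "stacking" means.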

Those building blocks turn out to be incredibly useful, and they have tractable properties at each layer that permit "back propagation" to train the parameters to achieve better results, because the derivatives (gradients) of the linear and activation functions are well behaved. Then you get complexity by "stacking" several of these simple layers together.
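As a small illustration of what "well behaved" means in practice: sigmoid has the closed-form derivative sigma'(z) = sigma(z) * (1 - sigma(z)), which back propagation can evaluate cheaply. This sketch checks that formula against a finite-difference approximation (the value z = 0.7 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.7
eps = 1e-6
# Central finite-difference estimate of the slope at z.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(abs(numeric - sigmoid_grad(z)) < 1e-8)  # the two agree closely
```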

So what you are proposing is to make the individual layers more complicated. It’s entirely possible that you could come up with a strategy that works using a more complex function than the linear transformation. But then you have to deal with whether the behavior is tractable from a training perspective and whether you can come up with a function that is as generally applicable (works in as many different problems) as the architecture we are learning here.
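It's also worth noting that you can get terms like x^2 and sin(x) without changing the layer itself: just compute them as extra input features and feed them to an ordinary linear layer. This is a common feature-engineering trick, sketched here with made-up numbers; the layer stays linear in its weights, so training remains tractable:

```python
import numpy as np

x = np.array([[0.5, -1.2, 2.0]])  # a batch of 3 scalar inputs (made up for illustration)

# Build engineered features [x, x^2, sin(x)] and stack them as the layer's input.
features = np.vstack([x, x**2, np.sin(x)])  # shape (3, 3): 3 features, 3 examples

rng = np.random.default_rng(1)
W = rng.standard_normal((1, 3))
b = np.zeros((1, 1))

Z = W @ features + b  # still Z = W dot A + b; the non-linearity lives in the inputs
print(Z.shape)  # (1, 3)
```

In deep networks the hidden layers effectively learn such transformations themselves, which is one reason the simple linear-plus-activation form is general enough.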

I don’t personally know the history of all the different ideas that have been tried in this space. My guess is that people like Yann LeCun, Geoffrey Hinton and Andrew Ng would probably have thought of alternative architectures. There is probably a reason that they have ended up doing it the way they do. Of course you are welcome to experiment with more complex functions. Maybe you’ll come up with a recipe that works better. If so, please publish the paper and maybe in a year everyone will be raving about how great Kumawat networks are and why didn’t we think of that sooner?

Thanks for this great answer, but I still want to add that:

I know these scientists did a lot of background work to arrive at these algorithms, but the basic intuition behind that result should be mentioned at the beginning of the course, so that people like me don't end up thinking all of this is just too hypothetical unless they post a doubt in the discussion forum.

Hi,

If you want to know a little bit more about the background of neural networks you could have a look at:

- Neural Networks and Deep Learning Chapter 1 | Perceptrons section
- Dive into Deep Learning in particular section 3.1.4

These two books are free online and they are good references, so I hope they help.