The lecture notes say that, if you have a sigmoid activation function, you don’t want your values to always be clustered here. You might want them to have a larger variance or a mean different from 0, in order to take better advantage of the nonlinearity of the sigmoid function rather than have all your values fall in just this linear regime.

Our question is: what is the advantage of operating in the nonlinear region of the sigmoid function rather than in the linear region?

We need nonlinearities for the network to be able to act as a universal function approximator. If you use only linear functions, then no matter how many layers your network has, it only performs a linear transformation of the input, so it is not well suited to solving nonlinear problems.
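
To make that concrete, here is a small sketch in plain Python showing that two stacked linear layers with no activation give the same output as one linear layer. The weights `W1`, `W2` and the input `x` are made-up values for illustration only, and biases are left out to keep it short.

```python
def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical weights: a 3x2 first layer and a 1x3 second layer.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[2.0, -1.0, 0.5]]
x = [0.7, -1.2]

# Two layers applied one after the other, with no activation in between.
two_layers = matvec(W2, matvec(W1, x))

# A single layer whose weight matrix is the product W2 * W1.
W_combined = matmul(W2, W1)
one_layer = matvec(W_combined, x)

print(two_layers, one_layer)
```

Both results agree up to floating-point rounding, so the extra layer adds no expressive power unless a nonlinearity sits between the two.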

In the example you mention, if all the values are clustered in the central part of the graph, the sigmoid is essentially a linear function of its input, so it is not helping you much.
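
You can check this numerically. The sketch below compares the sigmoid with its tangent line at 0, which is 0.5 + x/4, over two input ranges. The ranges [-0.5, 0.5] and [-4, 4] are arbitrary choices of mine, standing in for "clustered" and "spread out" values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tangent_at_zero(x):
    # First-order Taylor expansion of the sigmoid around 0.
    return 0.5 + x / 4.0

def max_linear_error(inputs):
    # Largest gap between the sigmoid and its linear approximation.
    return max(abs(sigmoid(x) - tangent_at_zero(x)) for x in inputs)

# Assumed ranges for illustration: tightly clustered vs. widely spread inputs.
clustered = [i / 100.0 for i in range(-50, 51)]  # values in [-0.5, 0.5]
spread = [i / 10.0 for i in range(-40, 41)]      # values in [-4, 4]

err_clustered = max_linear_error(clustered)
err_spread = max_linear_error(spread)

print(err_clustered, err_spread)
```

For the clustered inputs the straight line is off by less than 0.01, so the sigmoid is behaving like a linear map there. For the spread inputs the gap grows to roughly 0.5, which is where the curvature of the sigmoid actually comes into play.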

I think there’s much more to it, but I hope that helps you to understand better.