I don’t understand how this is derived mathematically. Could anyone please explain it, and likewise how the normalization formula for tanh is derived? Additionally, is there a variance normalization formula for sigmoid?

This is all explained in the lectures. The point is not that the variance is 2/n a priori; rather, the weights are deliberately initialized so that their variance is 2/n, as a way to get better convergence behavior. Also note that Prof. Ng is not saying that this always works, either. There are several different algorithms, e.g. He Initialization and Xavier Initialization, and there is no “silver bullet” solution that works best in all cases. Please watch the relevant lecture again with the above thoughts in mind and hopefully it will make sense the second time through.
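
For anyone who wants to see the variance argument in action, here is a minimal sketch, not the course's code. The usual reasoning is that for z = w_1 x_1 + ... + w_n x_n with independent zero-mean weights, Var(z) = n * Var(w) * E[x^2]. If x is the ReLU of a zero-mean symmetric pre-activation, E[x^2] is half that pre-activation's variance, so choosing Var(w) = 2/n keeps the scale roughly constant from layer to layer. For tanh, which is approximately linear near zero, the same argument gives Var(w) = 1/n. The function name `mean_square_after`, the layer width, the depth, and the Gaussian inputs below are all assumptions made for illustration.

```python
import math
import random


def relu(v):
    return v if v > 0.0 else 0.0


def mean_square_after(depth, n, var_scale, seed=0, samples=20):
    """Mean squared activation after `depth` ReLU layers of width `n`.

    Hypothetical helper: weights are i.i.d. N(0, var_scale / n), biases are zero.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(var_scale / n)
    # Unit-variance Gaussian inputs, one list of n features per sample.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(samples)]
    for _ in range(depth):
        # Fresh n-by-n weight matrix for each layer.
        w = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]
        xs = [[relu(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
              for x in xs]
    return sum(v * v for x in xs for v in x) / (samples * n)


if __name__ == "__main__":
    depth, n = 8, 100
    he = mean_square_after(depth, n, var_scale=2.0)     # Var(w) = 2/n
    naive = mean_square_after(depth, n, var_scale=1.0)  # Var(w) = 1/n
    print(f"Var(w) = 2/n: mean square after {depth} layers = {he:.4g}")
    print(f"Var(w) = 1/n: mean square after {depth} layers = {naive:.4g}")
```

With Var(w) = 2/n the mean squared activation should stay within a modest factor of its starting value of about 1, while with Var(w) = 1/n it shrinks by a factor of 2 per layer. Because both runs use the same seed and ReLU is positively homogeneous, the two results differ by 2**depth up to rounding. This only illustrates the forward-pass scale; as noted above, it does not guarantee better convergence in every case.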