Whether we initialize W from a plain normal distribution or apply a scaling factor afterwards, as in Xavier initialization, does it make any sense or difference to sample each column of W from a normal distribution independently?

I’m not sure I understand what you mean by normal distribution per column independently. We’re talking about a random normal distribution, right? It’s either random or it’s not. If it’s really random, then what is the difference if you call it once for all columns or n times for each column?
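
Here is a quick sketch of that point, assuming NumPy. The function names, matrix shape and seed are made up for illustration and are not from any course code. It fills one matrix with a single call and another one column at a time, then compares the sample statistics. Any scaling factor (Xavier or otherwise) would be multiplied in afterwards in both cases, so it is left out.

```python
import numpy as np

def init_all_at_once(n_rows, n_cols, rng):
    # One call draws every entry of W as an independent N(0, 1) sample.
    return rng.standard_normal((n_rows, n_cols))

def init_column_by_column(n_rows, n_cols, rng):
    # One call per column; each entry is still an independent N(0, 1) sample.
    W = np.empty((n_rows, n_cols))
    for j in range(n_cols):
        W[:, j] = rng.standard_normal(n_rows)
    return W

rng = np.random.default_rng(0)  # hypothetical seed, only for repeatability
W_once = init_all_at_once(500, 400, rng)
W_cols = init_column_by_column(500, 400, rng)

# Both matrices should show mean close to 0 and standard deviation close to 1.
print("all at once:     ", W_once.mean(), W_once.std())
print("column by column:", W_cols.mean(), W_cols.std())
```

The individual numbers differ between the two matrices because they are different draws, but both are samples from the same distribution, which is all that matters for initialization.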

But maybe I’m missing your point and what you really meant was using different algorithms for different columns. I’ve never heard anyone discuss that, but maybe there is something interesting to be learned there. You could try some experiments and see if you see any interesting results when you try that. This is an experimental science: give it a try and see what you learn! Please share your results!

You are right, initializing each column independently doesn’t make it any more like a normal distribution! Maybe I didn’t have a clear understanding of how random number generation works at that time.