I decided to dive deeper into Xavier initialization to try to build a more mathematical intuition for why the formula works. The paper that introduced the formula says that, to compromise between 1/n_in and 1/n_out, it just uses 2/(n_in + n_out). But why does it do this? It doesn't look like an average or a harmonic mean to me; the paper just calls it a compromise.
This might be a useful read:
If I just look at the equations presented, without other context, equation [12] seems to be the result of adding up equations [10] and [11]. You see, equations [10] and [11] imply two different variance values, so a compromise amounts to assuming a third, common value that “satisfies” both equations, and doing so ends up as equation [12].
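Written out, and assuming equations [10] and [11] are the forward and backward conditions on the weight variance of layer l (that is my reading of the paper, with n_l the fan-in and n_{l+1} the fan-out), the two conditions are:

$$
n_l \,\mathrm{Var}[W^l] = 1, \qquad n_{l+1} \,\mathrm{Var}[W^l] = 1
$$

Adding them and solving for one shared variance gives:

$$
(n_l + n_{l+1}) \,\mathrm{Var}[W^l] = 2 \quad\Longrightarrow\quad \mathrm{Var}[W^l] = \frac{2}{n_l + n_{l+1}}
$$

As a side note, this value does coincide with the harmonic mean of the two candidate variances: with a = 1/n_l and b = 1/n_{l+1}, the harmonic mean 2ab/(a+b) simplifies to 2/(n_l + n_{l+1}). Equivalently, it is the reciprocal of the arithmetic mean of n_l and n_{l+1}. Whether the authors had that in mind is still a guess.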
Cheers,
Raymond
But when the layers are different sizes, won't simply adding up the two equations fail to work, since n_l and n_l+1 are different? Or did the researchers just find that the formula was good enough even when the sizes didn't match up?
You really need to ask the authors if you want to know their intention. We are all just making guesses.
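That said, the effect of mismatched layer sizes is easy to check numerically. Below is a minimal sketch, assuming a single linear layer with no activation and i.i.d. zero-mean, unit-variance inputs. The function names `xavier_gains` and `empirical_forward_gain` are made up for illustration and do not come from the paper or from any library.

```python
import numpy as np


def xavier_gains(n_in, n_out):
    """Analytic (forward, backward) variance gains of a linear layer
    whose weights have variance 2 / (n_in + n_out)."""
    var_w = 2.0 / (n_in + n_out)
    # Forward pass scales variance by n_in * Var[W];
    # backward pass scales gradient variance by n_out * Var[W].
    return n_in * var_w, n_out * var_w


def empirical_forward_gain(n_in, n_out, n_samples=2000, seed=0):
    """Estimate Var[output] / Var[input] by sampling (assumed setup)."""
    rng = np.random.default_rng(seed)
    std_w = np.sqrt(2.0 / (n_in + n_out))
    W = rng.normal(0.0, std_w, size=(n_in, n_out))
    x = rng.normal(0.0, 1.0, size=(n_samples, n_in))
    y = x @ W
    return y.var() / x.var()


if __name__ == "__main__":
    for n_in, n_out in [(256, 256), (512, 128)]:
        fwd, bwd = xavier_gains(n_in, n_out)
        est = empirical_forward_gain(n_in, n_out)
        print(f"n_in={n_in} n_out={n_out} "
              f"forward={fwd:.2f} backward={bwd:.2f} empirical_forward={est:.2f}")
```

With equal sizes both analytic gains are exactly 1. With n_in = 512 and n_out = 128 the forward gain is 1.6 and the backward gain is 0.4, so neither condition holds exactly, but the two gains always sum to 2, which is one way to read "compromise": the error is split between the forward and backward directions.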