Regression Trees video
The formula for choosing a split is
\sigma^2_{root}-\left(w^{left}\sigma^2_{left}+w^{right}\sigma^2_{right}\right)
where \sigma^2 appears to be the sample variance. A quick check with a variance calculator confirmed it is the sample variance rather than the population variance:
\sigma^2=\frac{\sum_{i=1}^N (X_i-\mu)^2}{N-1}
and \mu=\frac{1}{N}\sum_{i=1}^N X_i is the average.
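The split formula above can be sketched in code. This is a minimal sketch assuming numpy; the function name `split_score` is mine, not the course's:

```python
import numpy as np

def split_score(y_left, y_right):
    """Variance reduction for a candidate split: the parent's sample
    variance minus the weighted sum of the children's sample variances
    (ddof=1 gives the sample variance, dividing by N-1)."""
    y = np.concatenate([y_left, y_right])
    n = len(y)
    w_left, w_right = len(y_left) / n, len(y_right) / n
    return y.var(ddof=1) - (w_left * y_left.var(ddof=1)
                            + w_right * y_right.var(ddof=1))

# A split that separates low targets from high targets scores well above zero.
print(split_score(np.array([1.0, 1.0, 2.0]), np.array([10.0, 11.0, 10.0])))
```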
My question was: why choose the sample variance rather than the variance of the mean? To be clear, the variance of the mean is the same as the sample variance, but scaled down by the number of elements in the sample (or leaf):
\sigma^2_{mean}=\frac{\sigma^2_{sample}}{N}
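The relation \sigma^2_{mean}=\sigma^2_{sample}/N is easy to check numerically: draw many samples of size N and look at the spread of their means. A sketch assuming numpy; the distribution and sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
pop_var = 4.0  # variance of the underlying distribution

# Many independent samples of size N; the variance of their means
# should come out close to pop_var / N.
samples = rng.normal(scale=np.sqrt(pop_var), size=(100_000, N))
var_of_means = samples.mean(axis=1).var()

print(var_of_means)  # close to pop_var / N = 0.08
```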
Variance of the mean
Relative to the variance of the mean, the sample variance is always larger by a factor of the number of elements in the leaf, while the variance of the mean reflects the inherent underlying variance of the population.
That means the split criterion effectively weights each leaf's inherent population variance by the square of the number of elements in the leaf (since it also multiplies by w^{left} or w^{right}). Why that choice?
Thanks,
Steven
From what I remember of statistical analysis, we use the sample variance when we have limited data, and the population variance when complete data is available.
Linear combinations of random variables: what is the variance?
I think maybe this is the answer.
The variance of a linear combination of random variables depends on the squares of the coefficients.
So if
N=N_{left}+N_{right}
then if the average values of the target variable in the two child leaves are \hat{x}_{left} and \hat{x}_{right},
N\hat{x}=N_{left}\hat{x}_{left}+N_{right}\hat{x}_{right}
or
\hat{x}=w^{left}\hat{x}_{left}+w^{right}\hat{x}_{right}
Then,
\sigma^2=w_{left}^2\sigma^2_{left}+w_{right}^2\sigma^2_{right}
The equation above takes each \sigma^2_{leaf} to be the variance of the MEAN (with the two leaf means independent). Now define H_{leaf}=w_{leaf}\sigma^2_{mean,leaf}; since w_{leaf}^2\sigma^2_{mean,leaf}=w_{leaf}H_{leaf}, the parent's variance of the mean is
w_{left}H_{left}+w_{right}H_{right}
Expanding H for one leaf, using \sigma^2_{mean,leaf}=\sigma^2_{sample,leaf}/N_{leaf}:
H_{leaf}=w_{leaf}\sigma_{mean,leaf}^2=\frac{N_{leaf}}{N}\cdot\frac{\sigma_{sample,leaf}^2}{N_{leaf}}=\frac{\sigma_{sample,leaf}^2}{N}
So each term w_{leaf}\sigma^2_{sample,leaf} in the split formula equals N\,w_{leaf}H_{leaf}, and the two weighted sums are equivalent up to a factor of N (the number of objects in the parent node), which is the same for all terms.
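The equivalence up to a factor of N can be sanity-checked numerically. A sketch assuming numpy; the leaf sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
y_left, y_right = rng.normal(size=7), rng.normal(loc=3.0, size=5)
n_l, n_r = len(y_left), len(y_right)
n = n_l + n_r
w_l, w_r = n_l / n, n_r / n

# Sample variance of each leaf, and the corresponding variance of its mean.
s2_l, s2_r = y_left.var(ddof=1), y_right.var(ddof=1)
m2_l, m2_r = s2_l / n_l, s2_r / n_r

term_sample = w_l * s2_l + w_r * s2_r          # child term of the split formula
term_mean = w_l**2 * m2_l + w_r**2 * m2_r      # linear-combination form
print(np.isclose(term_sample, n * term_mean))  # True: equal up to a factor of n
```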