Regression Trees video
The formula for choosing a split is
\sigma^2_{root}-\left(w^{left}\sigma^2_{left}+w^{right}\sigma^2_{right}\right)
where \sigma^2 appears to be the sample variance. A quick check with a variance calculator confirmed it is the sample variance rather than the population variance:
\sigma^2=\frac{\sum_{i=1}^N (X_i-\mu)^2}{N-1}
and \mu=\frac{1}{N}\sum_{i=1}^N X_i is the average.
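The split formula above can be sketched in code. This is a minimal sketch assuming numpy; the function name `split_score` is mine, not the course's:

```python
import numpy as np

def split_score(y_left, y_right):
    """Variance reduction for a candidate split: the parent's sample
    variance minus the weighted sum of the children's sample variances
    (ddof=1 gives the sample variance, dividing by N-1)."""
    y = np.concatenate([y_left, y_right])
    n = len(y)
    w_left, w_right = len(y_left) / n, len(y_right) / n
    return y.var(ddof=1) - (w_left * y_left.var(ddof=1)
                            + w_right * y_right.var(ddof=1))

# A split that separates low targets from high targets scores well above zero.
print(split_score(np.array([1.0, 1.0, 2.0]), np.array([10.0, 11.0, 10.0])))
```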
My question was: why choose the sample variance rather than the variance of the mean? To be clear, the variance of the mean is the same as the sample variance, but scaled down by the number of elements in the sample (or leaf):
\sigma^2_{mean}=\frac{\sigma^2_{sample}}{N}
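The relation \sigma^2_{mean}=\sigma^2_{sample}/N is easy to check numerically: draw many samples of size N and look at the spread of their means. A sketch assuming numpy; the distribution and sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
pop_var = 4.0  # variance of the underlying distribution

# Many independent samples of size N; the variance of their means
# should come out close to pop_var / N.
samples = rng.normal(scale=np.sqrt(pop_var), size=(100_000, N))
var_of_means = samples.mean(axis=1).var()

print(var_of_means)  # close to pop_var / N = 0.08
```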
Variance of the mean
Relative to the variance of the mean, the sample variance is always larger by a factor of the number of elements in the leaf, while the variance of the mean reflects the inherent underlying variance of the population.
That means the split criterion effectively weights each leaf's inherent population variance by the square of the number of elements in the leaf (since it also multiplies by w^{left} or w^{right}). Why that choice?
Thanks,
Steven
From what I remember of statistical analysis, we use the sample variance when we have limited data, and the population variance when complete data is available.
Linear combinations of random variables: what is the variance?
I think maybe this is the answer.
The variance of a linear combination of random variables depends on the squares of the coefficients.
So if
N=N_{left}+N_{right}
then if the average values of the target variable in the two child leaves are \hat{x}_{left} and \hat{x}_{right},
N\hat{x}=N_{left}\hat{x}_{left}+N_{right}\hat{x}_{right}
or
\hat{x}=w^{left}\hat{x}_{left}+w^{right}\hat{x}_{right}
Then,
\sigma^2=w_{left}^2\sigma^2_{left}+w_{right}^2\sigma^2_{right}
The equation above takes each \sigma^2_{leaf} to be the variance of the MEAN (with the two leaf means independent). Now define H_{leaf}=w_{leaf}\sigma^2_{mean,leaf}; since w_{leaf}^2\sigma^2_{mean,leaf}=w_{leaf}H_{leaf}, the parent's variance of the mean is
w_{left}H_{left}+w_{right}H_{right}
Expanding H for one leaf, using \sigma^2_{mean,leaf}=\sigma^2_{sample,leaf}/N_{leaf}:
H_{leaf}=w_{leaf}\sigma_{mean,leaf}^2=\frac{N_{leaf}}{N}\cdot\frac{\sigma_{sample,leaf}^2}{N_{leaf}}=\frac{\sigma_{sample,leaf}^2}{N}
So each term w_{leaf}\sigma^2_{sample,leaf} in the split formula equals N\,w_{leaf}H_{leaf}, and the two weighted sums are equivalent up to a factor of N (the number of objects in the parent node), which is the same for all terms.
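The equivalence up to a factor of N can be sanity-checked numerically. A sketch assuming numpy; the leaf sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
y_left, y_right = rng.normal(size=7), rng.normal(loc=3.0, size=5)
n_l, n_r = len(y_left), len(y_right)
n = n_l + n_r
w_l, w_r = n_l / n, n_r / n

# Sample variance of each leaf, and the corresponding variance of its mean.
s2_l, s2_r = y_left.var(ddof=1), y_right.var(ddof=1)
m2_l, m2_r = s2_l / n_l, s2_r / n_r

term_sample = w_l * s2_l + w_r * s2_r          # child term of the split formula
term_mean = w_l**2 * m2_l + w_r**2 * m2_r      # linear-combination form
print(np.isclose(term_sample, n * term_mean))  # True: equal up to a factor of n
```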