C2_W4_Decision_Tree_with_Markdown - why do we need X_node

In Exercise 3 of the final practice lab, we calculate the information gain. The exercise states:

Some useful variables

X_node, y_node = X[node_indices], y[node_indices]

X_left, y_left = X[left_indices], y[left_indices]
X_right, y_right = X[right_indices], y[right_indices]

I got the expected answers using only the variables y_node, y[left_indices] and y[right_indices].

So why do we need X_node, X[left_indices], and X[right_indices]?
Don’t they contain the same list of node indices as their y counterparts?

Hey @Sumitro_Sarkar,
Indeed, you can code the compute_information_gain function without using the X_node, X_left and X_right variables. This is because the only information we need from X_node, X_left and X_right is their lengths (the number of examples in each), which can also be calculated from their y counterparts.

Here, I am assuming you know that X contains the data and y contains the target variable. X is still needed to work out left_indices and right_indices: it has to be passed to the split_dataset function, which is called inside compute_information_gain, and that’s why X is passed to compute_information_gain. I hope this helps.
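To make that concrete, here is a minimal sketch of the idea, not the lab's reference solution. The names entropy, split_indices and information_gain, the toy data, and the convention that feature value 1 goes to the left branch are all assumptions made for illustration. X is used only to decide the split; the gain itself is computed from the y subsets alone.

```python
import numpy as np

def entropy(y):
    """Binary entropy of a 0/1 label array; 0 for an empty or pure node."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

def split_indices(X, node_indices, feature):
    """Hypothetical stand-in for the lab's split_dataset: X is needed here."""
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] == 0]
    return np.array(left, dtype=int), np.array(right, dtype=int)

def information_gain(X, y, node_indices, feature):
    left_indices, right_indices = split_indices(X, node_indices, feature)
    # Only the y subsets are used from here on.
    y_node, y_left, y_right = y[node_indices], y[left_indices], y[right_indices]
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    weighted = w_left * entropy(y_left) + w_right * entropy(y_right)
    return entropy(y_node) - weighted

# Made-up toy data: 6 examples, 2 binary features.
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
root = list(range(len(y)))

gain_f0 = information_gain(X, y, root, feature=0)  # 1.0: feature 0 separates the labels perfectly
gain_f1 = information_gain(X, y, root, feature=1)
print(gain_f0, gain_f1)
```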

Cheers,
Elemento

Elemento, thank you for clarifying that we only need the variables y_node, y[left_indices] and y[right_indices].

Yes, I understand that the complete data matrix X has to be passed to the compute_information_gain function to be further passed down to the split_dataset function.

But as for the variables X_node, X[left_indices], and X[right_indices], I think Exercise 3 would be clearer if it simply did not list them at all, especially not under the comment “Some useful variables”, as these variables are not useful 🙂.

Just my suggestion.

Thanks again

Hey @Sumitro_Sarkar,

As I just said, it’s up to the learner how they would like to find the lengths. One can use either the X variables or the y variables. Additionally, in the videos, Prof Andrew hasn’t mentioned that we specifically need to use the y variables, so removing the X variables completely might trip up more learners than it helps, by leading them to believe that the lengths can only be found using the y variables. I suppose you get my point now.
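A small check of this, using made-up data and index lists (everything below is assumed for illustration, not taken from the lab): the X subsets hold feature rows and the y subsets hold labels, so their contents and shapes differ, but indexing with the same list gives the same number of rows.

```python
import numpy as np

# Hypothetical toy data: 5 examples, 3 binary features.
X = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]])
y = np.array([1, 1, 0, 0, 1])

node_indices = [0, 1, 3, 4]
left_indices = [0, 3, 4]

X_node, y_node = X[node_indices], y[node_indices]
X_left, y_left = X[left_indices], y[left_indices]

# X_left is 2-D (rows of features), y_left is 1-D (labels).
print(X_left.shape, y_left.shape)

# Either pair gives the same weight for the left branch.
w_left_from_X = len(X_left) / len(X_node)
w_left_from_y = len(y_left) / len(y_node)
print(w_left_from_X, w_left_from_y)
```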

Cheers,
Elemento

Understood, thanks a lot.

Hi,

Following on from this conversation, I also don’t fully understand why we need the X_node, y_node, X_left, y_left, X_right, and y_right variables. Is it just to make it slightly easier/clearer when feeding in the relevant data as arguments into the information gain calculations?

e.g.
[1] left_split = len(y_left) / len(y_node)

is easier to read/write and follow than

[2] left_split = len(y[left_indices]) / len(y[node_indices])

Or would [2] not actually work due to syntax issues?

Jem

Hi @Jem_Lane,

The variables X_node, y_node, X_left, y_left, X_right, and y_right hold the feature rows and target values at the current node and in its left and right branches. The information gain calculation uses the target values and the sizes of these subsets, and the information gain of each feature is what the algorithm uses to decide which feature to split on.


Form [2] will work. However, breaking a problem into smaller parts by naming the variables before the computation can help with understanding and organizing the steps needed to solve it. By focusing on each part separately, we can approach the problem in a more logical and organized way, which helps us avoid mistakes and makes the process clearer.
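As a quick illustration with made-up labels and index lists (assumed here, not from the lab), both forms run and give the same value; the named version is simply easier to read:

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0, 1])  # hypothetical labels
node_indices = [0, 2, 3, 5, 6]
left_indices = [0, 3, 6]

# Form [1]: name the subsets first, then compute.
y_node = y[node_indices]
y_left = y[left_indices]
left_split_1 = len(y_left) / len(y_node)

# Form [2]: index inline.
left_split_2 = len(y[left_indices]) / len(y[node_indices])

print(left_split_1, left_split_2)
```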

Regards,
Mujassim

Thanks for explaining Mujassim.

Jem