C2_W4_Decision_Tree_with_Markdown - why do we need X_node

Sumitro_Sarkar · September 11, 2022, 2:13am

In the final practice Lab, Exercise 3, we calculate the information gain. The exercise states

Some useful variables

X_node, y_node = X[node_indices], y[node_indices]

X_left, y_left = X[left_indices], y[left_indices]
X_right, y_right = X[right_indices], y[right_indices]

I got the expected answers using only the variables y_node, y[left_indices] and y[right_indices].

So why do we need X_node, X[left_indices], and X[right_indices]?
Don’t they contain the same list of node indices as their y counterparts?

Elemento · September 11, 2022, 7:01am

Hey @Sumitro_Sarkar,
Indeed, you can code the compute_information_gain function without the use of the X_node, X_left and X_right variables. This is because the only information that we need from X_node, X_left and X_right is their length, which can also be calculated from their y-counterparts.

Here, I am assuming you know that X contains the data and y contains the target variable. X is still needed to work out left_indices and right_indices: it has to be passed to the split_dataset function, which is called inside the compute_information_gain function. That is why X is passed to compute_information_gain. I hope this helps.
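A minimal sketch of the point above, assuming binary features and 0/1 labels. The function names come from this thread, but the bodies are illustrative assumptions, not the lab's reference solution. X is only touched inside split_dataset; the gain itself is computed from the y slices alone.

```python
import numpy as np

def compute_entropy(y):
    """Binary entropy of a 0/1 label array; 0 for empty or pure nodes."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

def split_dataset(X, node_indices, feature):
    """This is where X is needed: feature value 1 goes left, the rest go right."""
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] != 1]
    return np.array(left, dtype=int), np.array(right, dtype=int)

def compute_information_gain(X, y, node_indices, feature):
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    # Only the y slices are used from here on.
    y_node, y_left, y_right = y[node_indices], y[left_indices], y[right_indices]
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    weighted = w_left * compute_entropy(y_left) + w_right * compute_entropy(y_right)
    return compute_entropy(y_node) - weighted

# Toy data (made up for illustration, not the lab's dataset).
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
gain_f0 = compute_information_gain(X, y, [0, 1, 2, 3], feature=0)
gain_f1 = compute_information_gain(X, y, [0, 1, 2, 3], feature=1)
```

On this toy data, feature 0 separates the labels perfectly and feature 1 does not separate them at all, so the two gains differ even though X_node, X_left and X_right are never built.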

Cheers,
Elemento

Sumitro_Sarkar · September 11, 2022, 8:17am

Elemento, thank you for clarifying that we only need the variables y_node, y[left_indices] and y[right_indices].

Yes, I understand that the complete data matrix X has to be passed to the compute_information_gain function to be further passed down to the split_dataset function.

But as for the variables X_node, X[left_indices], and X[right_indices], I think Exercise 3 would be clearer if it simply did not list them at all, especially not under the comment “Some useful variables”, as these variables are not useful.

Just my suggestion.

Thanks again

Elemento · September 11, 2022, 10:12am

Hey @Sumitro_Sarkar,

As I just said, it’s up to the learner how they would like to find the length. One can use either the X variables or the y variables. Additionally, in the videos, Prof Andrew hasn’t said that we specifically need to use the y variables. Removing the X variables completely might therefore confuse more learners than it helps, by leading them to believe that the length can only be found from the y variables. I suppose you get my point now.
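As a small illustration of why either choice works, here is a sketch on made-up toy arrays (not the lab's data): slicing X and y with the same index list gives the same number of rows, so the split weight comes out the same either way.

```python
import numpy as np

# Toy data, assumed purely for illustration.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([1, 0, 1, 0, 1])
node_indices = [0, 1, 2, 3, 4]
left_indices = [0, 2, 4]

X_node, y_node = X[node_indices], y[node_indices]
X_left, y_left = X[left_indices], y[left_indices]

# len() of a 2-D array counts rows, so both weights count examples.
w_from_X = len(X_left) / len(X_node)
w_from_y = len(y_left) / len(y_node)
```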

Cheers,
Elemento

Sumitro_Sarkar · September 11, 2022, 1:06pm

Understood, thanks a lot.

Jem_Lane · April 24, 2023, 10:19am

Hi,

Following on from this conversation, I also don’t fully understand why we need the X_node, y_node, X_left, y_left, X_right, and y_right variables. Is it just to make it slightly easier/clearer when feeding in the relevant data as arguments into the information gain calculations?

e.g.
[1] left_split = len(y_left) / len(y_node)

is easier to read/write and follow than

[2] left_split = len(y[left_indices]) / len(y[node_indices])

Or would [2] not actually work due to syntax issues?

Jem

Mujassim_Jamal · April 24, 2023, 12:05pm

Hi @Jem_Lane,

The variables X_node, y_node, X_left, y_left, X_right, and y_right are used to represent different sets of data and target values at each node. They are important for calculating the information gain of each feature, which helps the algorithm to decide which feature to split on. By using these variables, the algorithm can make better decisions about how to split the data as it builds the tree.

Form [2] will work. However, breaking a problem into smaller parts by assigning the variables before the computation can help with understanding and organizing the steps needed to solve it. By focusing on each part separately, we can approach the problem in a more logical and organized way, which helps us avoid mistakes and makes the process clearer.
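A quick check of the two forms from the question, assuming y is a NumPy array and the index lists are plain Python lists (the values here are made up):

```python
import numpy as np

# Toy labels and indices, assumed for illustration only.
y = np.array([1, 1, 0, 0, 1, 0])
node_indices = [0, 1, 2, 3, 4]
left_indices = [0, 1, 4]

# Form [1]: name the slices first, then compute.
y_node = y[node_indices]
y_left = y[left_indices]
split_named = len(y_left) / len(y_node)

# Form [2]: index inline; same values, no syntax problem.
split_inline = len(y[left_indices]) / len(y[node_indices])
```

Both expressions evaluate the same slices, so the only difference is readability.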

Regards,
Mujassim

Jem_Lane · April 24, 2023, 12:42pm

Thanks for explaining Mujassim.

Jem
