I am having a problem with the last test for “get_best_split” for Lab week 4: C2_W4_Decision_Tree_with_Markdown. I got the “2” right. But you are running some other test that I am failing. I tried to create a “pure” sample (all 0 or all 1 ) to debug. I am not making any progress. Can you please give me some insight into the test case that I am failing?
If I can reproduce the test case, I can fix the problem.
The best feature to split on: 2

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>
      3
      4 # UNIT TESTS
----> 5 get_best_split_test(get_best_split)

~/work/public_tests.py in get_best_split_test(target)
    129     result = target(X, y, node_indexes)
    130
--> 131     assert result == -1, f"When the target variable is pure, there is no best split to do. Expected -1, got {result}"
    132
    133     y = X[:,0]

AssertionError: When the target variable is pure, there is no best split to do. Expected -1, got 1
The above part of the traceback tells us that get_best_split_test(...) comes from a file called public_tests.py, which can be found by clicking “File” > “Open” on the menu bar of the Jupyter notebook opened on Coursera.
It also gives the line number of the code that triggered the error. If the line number is not convenient, you can also copy part of that code and search the file for it. Once you have found the triggering line, you should be able to locate the test case that is complaining.
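To follow the mentor's advice and reproduce the failing case locally, you can run your function on a node whose labels are all the same. The get_best_split below is my own minimal entropy-based sketch, not the assignment's hidden solution, and the X values are made up; the point is only that a pure node leaves every information gain at zero, so the function must fall through to the initial value of -1.

```python
import numpy as np

def entropy(y):
    """Entropy of a binary label array (0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def get_best_split(X, y, node_indices):
    """Return the feature index with the largest information gain, or -1."""
    num_features = X.shape[1]
    best_feature = -1
    max_info_gain = 0.0
    H_node = entropy(y[node_indices])
    for feature in range(num_features):
        left = [i for i in node_indices if X[i, feature] == 1]
        right = [i for i in node_indices if X[i, feature] == 0]
        w_left = len(left) / len(node_indices)
        w_right = len(right) / len(node_indices)
        info_gain = H_node - (w_left * entropy(y[left]) + w_right * entropy(y[right]))
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature
    return best_feature

# A pure node: every label is 1, so no split can reduce the entropy.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.ones(4, dtype=int)
print(get_best_split(X, y, list(range(4))))  # -1
```

If this prints -1 but your notebook version returns a feature index for the same input, the bug is in how your loop updates (or fails to skip) features with zero gain.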
Hi, I have a conceptual understanding of the nature of the error.
Unfortunately, this is not something the professor taught.
The error says, “If the target is fully correlated with other feature, that feature must be the best split”
What is happening is that the test script takes a column of “X” as “y”. Hence the correlation comment.
There is nothing in the class on how to treat the correlation between target y and features X.
This is uncharted territory for me. I can force my code to give the right results, but I still wouldn’t understand what is going on. I don’t have a conceptual framework.
By painstakingly looking at the test cases, running them, and putting print statements everywhere, I was able to solve the issues mentioned above. I am willing to teach the approach to anyone stuck.
Basic Solution
You have two variables, X and y. Specific test cases create y instances that are equal to columns of X, or are linear combinations of a column of X (hence the text in the error about correlated data). As a pre-processing step, check whether any column of X is correlated with y (see Pearson correlation). If a perfect correlation exists (a coefficient of 1 or -1), return that column’s index and exit the function.
If there is no strong correlation, proceed as usual, using what you’ve learned.
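As a sketch of the pre-processing step described above (and only as an illustration of this poster's workaround, not the approach the course intends), one could scan the columns of X with np.corrcoef. The function name correlated_feature and the data values are my own invention:

```python
import numpy as np

def correlated_feature(X, y):
    """Return the index of a feature perfectly (anti-)correlated with y, or -1."""
    for feature in range(X.shape[1]):
        col = X[:, feature]
        if np.std(col) == 0 or np.std(y) == 0:
            continue  # the correlation coefficient is undefined for constant arrays
        r = np.corrcoef(col, y)[0, 1]
        if np.isclose(abs(r), 1.0):
            return feature
    return -1

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 0]])
y = X[:, 1]          # y is an exact copy of feature 1
print(correlated_feature(X, y))  # 1
```

Note the guard against constant arrays: a pure y has zero variance, so the coefficient is undefined and the function correctly falls through to -1.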
Maybe you’re right. It’s not that complex, actually.
My code is written from the hints. Thus my code is very small and simple in general.
I just added an extra loop of ~7 lines to check for correlation. If I find a correlation, I do exactly what the error prescribes. I always listen to the code.
Anyway, the proof of the pudding is in the eating. Also, there are thousands of roads that lead to Rome.
I just wanted to reply to your earlier question about correlation here.
The key thing is, even though we have used “correlation” to describe the situation, we didn’t mean to introduce a new way of determining which feature to split on. At all times, as Tom said, we stick to the approach taught in the lecture: computing information gain.
All roads lead to Rome. There are many ways to describe that test case verbally, including calling it “perfectly correlated”. Since that description isn’t wrong, it’s fine to implement your code that way. But I wanted you to know that we don’t compute the correlation coefficient to check for the special case of perfect correlation; computing the information gain itself is sufficient to discover that special case, because a perfectly correlated feature has the maximum gain.
In computing, we prefer the simplest road. If computing the information gain alone is enough, we don’t compute anything else.
It would be great if you could check your code once again to see whether it can pass all test cases without computing correlation coefficients. I would be happy to help you through it if needed. Moreover, I appreciate your willingness to share and to try out different approaches; I believe it is already a useful experience.
Forgive my brain; it is old and maybe rotten. Of course I would like the simple road, but I do see connections everywhere. So when the test case told me, “If the target is fully correlated with other features…”, I took it at its word, started looking for correlations, and singled them out.
Just out of curiosity, how in the world did this logic happen to work? It is such a fluke. The test designers were not really thinking about Pearson correlations.
My code is straightforward. I am more than glad to share it. I follow the hints as if I were a compiler/interpreter. The only thing that made the code look ugly was all the print statements I had put all over.
How do you want me to post it? After all, it is an exam. You probably have access to it on the testing server, or I can post it here. Let me know.
I like elegance too. But I have to see it to appreciate it.
@ajeancharles:
If a mentor needs to see your code, we’ll contact you via a private message with instructions.
We do not have access to the grading server.
After reading your replies, I can see how much the test designer’s use of the word “correlation” affected your approach to coding the solution. Perhaps you have already got what I wanted to say, but just to reiterate:
there is no need to compute the correlation coefficient, even though doing so is valid, and
without the coefficient, computing only the information gain is enough to pass all the tests.
To get started, we can look at the following piece of code, which is actually provided to you in the assignment as a hint (in a collapsed code block underneath the exercise’s code cell):
def get_best_split(X, y, node_indices):
    # Some useful variables
    num_features = X.shape[1]

    # You need to return the following variables correctly
    best_feature = -1

    ### START CODE HERE ###
    max_info_gain = 0

    # Iterate through all features
    for feature in range(num_features):
        # Your code here to compute the information gain from splitting on this feature
        info_gain =

        # If the information gain is larger than the max seen so far
        if info_gain > max_info_gain:
            # Your code here to set the max_info_gain and best_feature
            max_info_gain =
            best_feature =
    ### END CODE HERE ###

    return best_feature
If we examine the above skeleton closely, we find that for each feature we compute the gain, and then, through the if statement, we keep updating the best gain value and the associated best feature.
We know that for a perfectly correlated feature, the gain has to be the maximal one (can you figure out why?). Therefore, once that perfect feature is considered, it will satisfy the condition in the if block, be remembered in the variable best_feature, and set max_info_gain to a value that no other feature can exceed!
If you just compare what I have said with the code skeleton above, do you have any disagreement? If no, then perhaps you can start from the skeleton and see if you can complete it?
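To make the "maximal gain" claim concrete, here is a toy computation (my own example with made-up data, not the course's hidden test). When y is an exact copy of a feature, splitting on that feature produces two pure children, so the weighted child entropy is zero and the gain equals the full node entropy, which is the largest value any feature can achieve:

```python
import numpy as np

def entropy(y):
    """Entropy of a binary label array (0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(X, y, node_indices, feature):
    """Gain from splitting the node on one binary feature."""
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] == 0]
    w_left = len(left) / len(node_indices)
    w_right = len(right) / len(node_indices)
    return entropy(y[node_indices]) - (w_left * entropy(y[left]) + w_right * entropy(y[right]))

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 0]])
y = X[:, 1]                      # y perfectly correlated with feature 1
idx = np.arange(len(y))
gains = [information_gain(X, y, idx, f) for f in range(X.shape[1])]
print(gains)
# Feature 1 splits the node into two pure children, so its gain equals
# the parent entropy -- the theoretical maximum for this node.
```

So the plain gain comparison in the skeleton already handles the "perfectly correlated" test case; no correlation coefficient is needed.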
I filled in the template you showed above using the hints in this exercise and previous exercises.
All other tests were successful except for this “correlated” stuff. I looked over all my notes from the professor’s videos; I could not find anything about correlation.
Then I decided to pay attention to the wording of the test case response.
“Perfectly correlated” implies a maximal amount of information: no entropy, no surprise. You could use regression theory to approximate a curve.
The problem must be somewhere else in my code, since we are using the same template.
Maybe it is where I calculate the information gain or split the data.
for feature in range(num_features):
    # Your code here to compute the information gain from splitting on this feature
    info_gain = <Here I compute the information gain>

    # If the information gain is larger than the max seen so far
    if info_gain > max_info_gain:
        max_info_gain = <I put the new info gain, using other hints>
        best_feature = <I update the feature, using other hints>
This is basically a repeat of the template you showed before, with my own comments and lots of print statements.
Thanks @ajeancharles for the update. Then I think one natural thing to check is whether the gains you computed are correct. For you to compare against, I am sharing my results:
the feature with the best gain is chosen as the best feature.
Note that test cases 2, 3, and 4 use perfectly correlated features.
I am sharing the three printing lines I have added, and maybe you can do the same and see if you spot anything different in the printed results?
def get_best_split(X, y, node_indices):
    # Some useful variables
    num_features = X.shape[1]

    # You need to return the following variables correctly
    best_feature = -1

    ### START CODE HERE ###
    max_info_gain = 0

    print('\nBegin to loop through the features')  # ADDED, please remove this line before submission.

    # Iterate through all features
    for feature in range(num_features):
        # Your code here to compute the information gain from splitting on this feature
        info_gain =

        print(feature, info_gain)  # ADDED, please remove this line before submission.

        # If the information gain is larger than the max seen so far
        if info_gain > max_info_gain:
            # Your code here to set the max_info_gain and best_feature
            max_info_gain =
            best_feature =
    ### END CODE HERE ###

    print(f'Best feature to split on: {best_feature}')  # ADDED, please remove this line before submission.

    return best_feature
Please feel free to share your printed results for follow-up discussion.
By the way, @ajeancharles, once you are able to pass all the public tests and are ready to submit your work, please remember to remove all the print lines you added, as they will interfere with the autograder and cause your submission to fail.
I will do it over the weekend. I am doing this for fun; I have been taking classes for a few days (knowledge binging) and need to catch up on some sleep.
You will hear from me in a few days.