Criteria to stop splitting

Hi there,

When we try to use criteria to decide when to stop splitting, we have two choices:

  1. When the tree has reached the maximum depth, we stop splitting;

  2. When the number of examples in a node is below a threshold, we stop splitting.

But how do we know when the tree has reached the maximum depth? And what should the threshold be?

I have a vague impression that Andrew mentioned this in the video.

Thanks

Hi!

The tree is built recursively, and each call further down the recursion increments the current depth. When the current depth equals the max depth, we return.

# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape (n_samples, n_features)
        y (array like):         List or ndarray of length n_samples containing the target variable
        node_indices (ndarray): List containing the active indices, i.e. the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    tree.append((left_indices, right_indices, best_feature))
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

In this version of the tree-building function we do not set a sample-count threshold.
But in order to do that, simply add a parameter with the threshold to check against.
If neither the left nor the right side reaches the threshold, then return from the function; see the sketch below.
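Here is a minimal sketch of that change (not graded, and not the assignment solution). The min_samples parameter is hypothetical, and the sketch reuses the same helpers (get_best_split, split_dataset) as the function above:

# Sketch only: same recursive structure, plus a hypothetical min_samples threshold
def build_tree_with_threshold(X, y, node_indices, branch_name,
                              max_depth, current_depth, min_samples):
    formatting = " "*current_depth + "-"*current_depth

    # Stopping criterion 1: maximum depth reached
    if current_depth == max_depth:
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    # Find the best feature and split the node, exactly as before
    best_feature = get_best_split(X, y, node_indices)
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)

    # Stopping criterion 2: neither child reaches the sample threshold,
    # so keep the current node as a leaf instead of splitting
    if len(left_indices) < min_samples and len(right_indices) < min_samples:
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    print("%s Depth %d, %s: Split on feature: %d"
          % ("-"*current_depth, current_depth, branch_name, best_feature))

    # Continue splitting the left and right children, incrementing the depth
    build_tree_with_threshold(X, y, left_indices, "Left",
                              max_depth, current_depth+1, min_samples)
    build_tree_with_threshold(X, y, right_indices, "Right",
                              max_depth, current_depth+1, min_samples)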

Now, to answer the question of what the depth and threshold should be: we can do something called hyperparameter search. A simple one to understand is grid search. Take the two parameters we want to find optimal values for, depth and threshold, generate a set of candidate values for each, and combine them in every possible combination. Then use those combinations to build a set of decision trees and select the one that performs best, ideally on held-out validation data.
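As a rough illustration of grid search (not part of the assignment), the sketch below uses scikit-learn's DecisionTreeClassifier and a tiny synthetic dataset just so it runs end to end; the course code above does not use scikit-learn, and the candidate values are made up:

# Grid search sketch: try every (max_depth, min_samples_split) combination
# and keep the one with the best accuracy on a held-out validation set.
import itertools
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset, only so the sketch is runnable
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
y = (X[:, 0] & X[:, 1]) | X[:, 2]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

depth_candidates = [2, 3, 5, 8]
min_samples_candidates = [2, 5, 10, 20]

best_score, best_params = -1.0, None
for max_depth, min_samples in itertools.product(depth_candidates, min_samples_candidates):
    model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)   # accuracy on the validation set
    if score > best_score:
        best_score, best_params = score, (max_depth, min_samples)

print("Best (max_depth, min_samples_split):", best_params, "validation accuracy:", best_score)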

There are other ways to do hyperparameter search that you can look up if interested.

Hope this helps!
Sam

Hello @Nick_Han,

I have moved your thread to Course 2 Week 4.

Raymond

Thanks Sam.

I think hyperparameter search is pretty much like how we compare information gain values and then pick the one with the highest value.

It seems the question is out of scope for the current course.

I will do some more research to figure out how to code this.

Thanks

Regards,
Nick

No problem. Thanks Raymond.

Hello @Nick_Han,

For the depth, first, we need to tell the algorithm what the maximum is. We can set it to 3, or we can set it to 10, but whether 3 or 10 is better, we rely on hyperparameter searching to tell us. As for how the tree knows the maximum depth has been reached - we count it. In the code that Sam provided, there is a line that checks whether the maximum has been reached:

 if current_depth == max_depth

As for the threshold, again, we provide it; the algorithm then counts the number of samples in a node and compares it against the threshold with a conditional check like the one above. As for which threshold value is best to use, we need hyperparameter searching to tell us.
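For example, with the hypothetical min_samples parameter from the sketch above, that check is just another comparison:

 if len(node_indices) < min_samples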

Now, you might ask whether we can reason out the values of those hyperparameters without hyperparameter searching. We can try, but it will be challenging because those and the other hyperparameters are coupled together. Take the threshold as an example: if our dataset has 90 positive samples and 10 negative samples, then we might want the threshold to be less than 10, but such a threshold might mean nothing if our max_depth is only 2, because the threshold might never be reached without enough splitting. You might then want to increase max_depth step by step until the threshold is triggered, but you might have to give up halfway because the tree starts to overfit.

Therefore, if by “how do we know …” you mean “how do we choose the best values of …”, then the answer is hyperparameter searching. If you meant “how does the algorithm catch the moment the threshold is reached”, then the answer is that the algorithm counts.

Cheers,
Raymond

Brilliant, Raymond.

This is a very clear and smart answer, which resolves all of the questions in my mind.

First of all, when we provide the depth and threshold to the model, it is easy for the algorithm to detect when the condition is met and then stop.

My question is more about how we can decide the initial depth and threshold for the model, as, given the very large size of the datasets, it is normally impossible to do that completely manually.

As you and Sam mentioned, hyperparameter search is the solution for that, which will be another interesting topic to dig into.

Thanks @rmwkwok @SamReiswig

Nick

You are welcome, Nick @Nick_Han, and thank you for letting us know your thoughts! I just have one more point, which I think you would also agree with: we still want to understand the purpose of each hyperparameter to make the hyperparameter search process more informed. So, even though we can’t reason out the best set of hyperparameters, we still want to understand how each hyperparameter works.

Absolutely Raymond.

Just like when I build complex financial models to predict production costs, I do two things to ensure the correctness of the model:

  1. make sure the logic of building the model is correct (which is like understanding the hyperparameter search process);
  2. compare the actuals against the predictions to validate the accuracy of the model's predictions.

The logic of the two processes is pretty much the same. Even with unsupervised learning, we still need to control our algorithms and the computation.

When we lose control of the data-processing procedure, there are always surprises we don't want to see.

Completely agree with you Nick @Nick_Han!

Cheers,
Raymond