Hi!
The tree is built recursively, and with each call further down the recursion the current depth is increased. When the current depth is the same as the max depth we return.
# Not graded
tree = []
def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
"""
Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.
This function just prints the tree.
Args:
X (ndarray): Data matrix of shape(n_samples, n_features)
y (array like): list or ndarray with n_samples containing the target variable
node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
branch_name (string): Name of the branch. ['Root', 'Left', 'Right']
max_depth (int): Max depth of the resulting tree.
current_depth (int): Current depth. Parameter used during recursive call.
"""
# Maximum depth reached - stop splitting
if current_depth == max_depth:
formatting = " "*current_depth + "-"*current_depth
print(formatting, "%s leaf node with indices" % branch_name, node_indices)
return
# Otherwise, get best split and split the data
# Get the best feature and threshold at this node
best_feature = get_best_split(X, y, node_indices)
formatting = "-"*current_depth
print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
# Split the dataset at the best feature
left_indices, right_indices = split_dataset(X, node_indices, best_feature)
tree.append((left_indices, right_indices, best_feature))
# continue splitting the left and the right child. Increment current depth
build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
in this instantiation of the create tree function we do not use set a threshold.
But in order to do that simply add in a parameter, with the threshold to check against.
If both the left and right side don’t reach the threshold then return from the function.
Now, to answer the question of what should the depth and threshold be? We can do something called hyperparameter search. A simple one to understand is grid search. Take the two parameters that we want to find optimal values for, depth and threshold. Generate a set of values for the two parameters and combine them in every possible combination. Then use those values to create a bunch of decision trees and select the one that best fits the data.
There are other ways to do hyperparameter search that you can look up if interested.
Hope this helps!
Sam