Unlike a single decision tree, random forest and XGBoost build many small trees and produce their final estimate by combining the information gained from every individual tree. For a decision tree you can plot the final model with plot_tree, from the root all the way down to the leaves. In ensemble tree models, however, you can only visualize a single tree at a time, e.g. plot_tree(X, num_trees=n). I was wondering if there is a method to construct this final estimated tree for ensemble models such as random forest and XGBoost. If no code is available for the task, plausible ideas are welcome.
Hi @zeno3175, I have not built either a random forest or XGBoost from scratch, but here is a blog in which the author builds a random forest from scratch in Python.
If you would like to visualize your decision trees, whether from a random forest or a boosting model, you can use sklearn.tree. You can see this post here
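As a minimal sketch of what sklearn.tree offers (assuming the iris dataset just as an example): plot_tree draws the tree with matplotlib, and export_text prints the same root-to-leaf structure as plain text, which is handy when you have no display.

```python
# Minimal sketch: inspect a single sklearn decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text dump of the fitted tree: one "|---" line per split/leaf.
rules = export_text(clf, feature_names=load_iris().feature_names)
print(rules)
```

The same call works on any fitted sklearn tree, including a single member of an ensemble.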
- that method can only be applied to sklearn's built-in tree or ensemble classifiers.
- the blog post above is quite old, so some of the function parameters might have changed.
Thanks for the code @Moaz_Elesawey, but take a close look: the poster extracts a single tree with # Extract single tree: estimator = model.estimators_ before plotting it with export_graphviz. So it is still a plot of a single tree, not of the final estimate.
You're right, but you can simply loop over each tree in the ensemble and visualize it. By the way, visualizing the trees is not very helpful: if your data contains a lot of features and the trees are deep, interpreting the results from the graph becomes quite hard. It is very helpful, though, when you are trying to explain the model to non-technical users.
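The loop over the ensemble can be sketched like this (a minimal example, assuming a small random forest on the iris data; export_text is used instead of a graphical plot so it runs without a display):

```python
# Sketch: a random forest is just a list of fitted decision trees
# (model.estimators_), so each member tree can be exported on its own.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=5, max_depth=3,
                               random_state=0).fit(X, y)

for i, estimator in enumerate(model.estimators_):
    print(f"--- tree {i} ---")
    print(export_text(estimator, feature_names=load_iris().feature_names))
```

Swap export_text for sklearn.tree.plot_tree (or export_graphviz) in the loop body if you want one figure per tree instead.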
Though I have a lot of features in training, in the end the feature importance is positive for only a limited number of features (< 35) after the random forest fit. So I was trying to link the feature importances (model.feature_importances_) to the tree visualization, but failed to do so. Is there any way to start from the feature importances and build the final estimated tree?
I see now what you want to do.
What I would do is run the ensemble on all the features, use feature_importances_ to select the most important ones, and then fit a new ensemble on those features only. I believe it might decrease the model's performance (e.g. accuracy score), although it will increase its speed.
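A minimal sketch of that idea (the dataset, the cutoff k, and the forest settings are just placeholders for illustration):

```python
# Sketch: rank features by importance, then refit on the top k only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
full = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Indices of the k most important features, highest importance first.
k = 2
top = np.argsort(full.feature_importances_)[::-1][:k]

# Refit a new forest on the reduced feature matrix.
reduced = RandomForestClassifier(n_estimators=50,
                                 random_state=0).fit(X[:, top], y)
print("kept feature indices:", top)
```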
@Moaz_Elesawey But how does that help with visualizing how the model splits the data on those important features?
I don’t think we can plot one tree diagram to completely describe how a tree ensemble works.
@Moaz_Elesawey I found something in R (plot.multi.trees) that may help with visualizing the ensemble in one figure. I will test it out first and hope it succeeds. One question: if the same parameters are fed in, will R and Python generate the same results?
As Raymond said, you cannot visualize the ensemble as a whole, but you can visualize each estimator in the ensemble individually using the method shown above. It does not work the way I thought it would, though.
This example uses DecisionTreeRegressor on the iris dataset.
They should; the only differences come from the random_state and how each language generates random numbers. Both of them use cblas in the backend, so they should give the same results.
Please do share your findings with us. I think it is not for random forest; it looks like it can do something for gradient boosted trees.
For ensemble-based methods, instance-based explanation of the decisions made is the way to go; this falls under explainable AI.
For random forest see: View article
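One minimal instance-based starting point, assuming sklearn's random forest (dedicated tools like SHAP or treeinterpreter go further, but this uses only sklearn): decision_path reports exactly which nodes a given sample visits in every tree of the forest.

```python
# Sketch: instance-based inspection of a random forest's decision
# with sklearn's decision_path.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Explain the first sample: indicator is a sparse matrix with one row
# per sample and one column per node across all trees; n_nodes_ptr
# marks where each tree's columns start.
indicator, n_nodes_ptr = model.decision_path(X[:1])
print("nodes visited across the forest:", indicator.sum())
```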