I’m looking for anyone who has experience using this model (TabTransformer) on supervised or semi-supervised applications. I plan to run both scenarios and am curious whether its accuracy is superior to boosted trees for tabular data. I’m also interested in whether the model is interpretable. I understand how transformer models work, but I haven’t built one yet and haven’t tried to interpret one. Any insight is appreciated. I’ll include more info on my project below.
I’m currently working on a personal project on substance abuse treatment, analyzing about 400 features for roughly 2,000 patients receiving treatment over 24 weeks.
My goal is to build a model that uses baseline categorical data to predict treatment outcomes early in the course of care. By detecting negative outcomes (defined as patient dropout) early, the model could become clinically useful, helping healthcare providers spot risk signals in time to adjust care, improve treatment response, and reduce dropout.
I’m going to compare three models: Random Forest, an XGBoost classifier, and Keras structured-data classification with TabTransformer.
I’m looking to show incremental improvement as model complexity increases; a rough sketch of the comparison setup is below.
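For context, here’s a minimal sketch of how I’m thinking about the head-to-head for the first two models. The real data isn’t shown here, so a toy stand-in of roughly the same shape (about 2,000 rows and 400 features) is generated with `make_classification`; the hyperparameters are placeholders, not tuned values.

```python
# Toy comparison of Random Forest vs. XGBoost with cross-validated AUC.
# The generated data is only a stand-in with the same rough shape as my
# real dataset (~2000 patients, 400 features, imbalanced binary outcome).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=400, n_informative=25,
                           weights=[0.7, 0.3], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
    "xgboost": XGBClassifier(n_estimators=500, learning_rate=0.05,
                             eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```

The TabTransformer model would then be scored on the same folds and the same metric so the comparison stays apples-to-apples.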
Thanks for chiming in. I see where you’re coming from.
For XGBoost, I think I will be fine. I’ve worked with it before on wide datasets and it does well at finding signal. I’m currently experimenting with different data models; I’ll start with a large feature set and iterate down to the best predictors.
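One simple way to do that iterate-down step is to fit XGBoost on everything, rank features by importance, and re-score on progressively smaller subsets. This is only a sketch on toy data; the "f0".."f399" column names are placeholders.

```python
# Rank features by XGBoost importance, then re-evaluate on smaller subsets.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X_arr, y = make_classification(n_samples=2000, n_features=400,
                               n_informative=25, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

model = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
model.fit(X, y)
ranked = X.columns[np.argsort(model.feature_importances_)[::-1]]

for k in (400, 200, 100, 50, 25):
    cols = list(ranked[:k])
    auc = cross_val_score(model, X[cols], y, cv=5, scoring="roc_auc").mean()
    print(f"top {k:>3} features: mean AUC = {auc:.3f}")
```

One caveat: ranking features on the full data and then cross-validating on the same data leaks information, so in the real project the selection step should happen inside the CV loop or on a separate split.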
You may be right about TabTransformer. I went through the TabTransformer tutorial, and they used 50k examples.
I’ve considered creating a synthetic dataset. I’m not familiar with how that works, but I’m going to do some research soon.
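From the little reading I’ve done so far, one option looks like the SDV library’s single-table synthesizers. I haven’t used it yet, so treat this as a sketch assuming the SDV 1.x API; the tiny frame below is a placeholder for the real patient data.

```python
# Sketch: fit a copula-based synthesizer on real rows, then sample more rows.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

df = pd.DataFrame({
    "age_group": ["18-25", "26-35", "36-50", "26-35"],  # placeholder columns
    "prior_episodes": [0, 2, 1, 3],
    "dropout": [0, 1, 0, 1],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)   # infer column types from the frame

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)
synthetic = synthesizer.sample(num_rows=10_000)  # expand toward TabTransformer scale
print(synthetic.head())
```

If I go this route, I’d only fit the synthesizer on the training split and keep a real held-out test set untouched, so the evaluation isn’t inflated by synthetic rows.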
Mr TMosh, what I learned in this scenario is that each individual model determines its own requirements for features and sample size.
After speaking with an industry expert on XGBoost, I came away with the following guidelines.
The number of observations should exceed the number of features.
Plan on a minimum of roughly N=100 training samples.
It does not work well for computer vision or NLP; those are better suited to deep learning.
For tabular data, XGBoost is portable to distributed systems, scales to datasets with several hundred million instances, and typically delivers superior accuracy and performance compared with most other models.
Keras offers structured-data classification with FeatureSpace and shows good accuracy with a low number of samples. However, you are right to point out that this may not hold for TabTransformer.
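For anyone finding this thread later, the FeatureSpace workflow from the Keras structured-data classification tutorial looks roughly like the sketch below. The feature names, toy rows, and the small MLP head are made up for illustration, not my real schema or model.

```python
# Minimal Keras FeatureSpace sketch: declare per-column encodings, adapt on
# the training data, then train a small classifier on the encoded features.
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import FeatureSpace

# Toy placeholder rows; the real baseline features and "dropout" label differ.
train_df = pd.DataFrame({
    "age_group": ["18-25", "26-35", "36-50", "26-35"],
    "prior_episodes": [0, 2, 1, 3],
    "severity": [4.0, 7.5, 6.0, 8.2],
    "dropout": [0, 1, 0, 1],
})
labels = train_df.pop("dropout")
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_df), labels)).batch(2)

# Declare how each column should be encoded.
feature_space = FeatureSpace(
    features={
        "age_group": FeatureSpace.string_categorical(num_oov_indices=1),
        "prior_episodes": FeatureSpace.integer_categorical(),
        "severity": FeatureSpace.float_normalized(),
    },
    output_mode="concat",
)
feature_space.adapt(train_ds.map(lambda x, _: x))  # learn vocabularies and stats

# Small dense head on top of the encoded features.
encoded = feature_space.get_encoded_features()
x = keras.layers.Dense(32, activation="relu")(encoded)
x = keras.layers.Dropout(0.5)(x)
output = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=encoded, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.fit(train_ds.map(lambda x, y: (feature_space(x), y)), epochs=5)
```

Swapping the dense head for a TabTransformer block is where the sample-size concern comes in, which is why I’m still weighing that option.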
That’s my analysis from the short research I just completed. Hopefully it’s helpful to anyone with similar questions when forming their ML strategy.