How to Deal with an Unlabeled Dataset?

Hello everyone. I am new to NLP, and I am currently working on an NLP-based project. I am facing an issue with the data: I have to train or fine-tune a model on hardware datasheets that contain only text. Can anybody let me know how to deal with this type of data?

Hi @rujalinagbhidkar,
I am new to the AI world, so please don't take my word blindly. But do you think a classification model might help you with the unlabelled dataset you currently have? You might need to analyse the data a bit before defining the classes. Just thinking out loud; please feel free to correct me.

@mamba824

  1. A Classification model might be helpful if the problem we’re trying to solve is classification, but the original post doesn’t say, so we can’t recommend anything specific.

  2. A classification model needs to be trained on examples. Labelled examples. In the image world it might be cat/non-cat; in NLP, a classic example is spam/not-spam. The classification model learns to predict (classify) from many labelled examples (see the sketch below). Since the OP's data is unlabelled, a classification model isn't suitable.
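For concreteness, here is a minimal sketch (using scikit-learn and made-up spam examples, not the OP's data) of what "labelled examples" means in practice: every training text comes paired with a label, and without the `labels` list there is nothing for the model to learn to predict.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples (the part an unlabelled dataset lacks).
texts = [
    "WIN a FREE prize, click now!!!",
    "Limited offer: claim your reward",
    "Meeting moved to 3 pm tomorrow",
    "Please review the attached datasheet",
]
labels = ["spam", "spam", "not_spam", "not_spam"]

# TF-IDF features + logistic regression: a classic supervised pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # the supervised step: requires the labels
print(model.predict(["Claim your free prize today"]))
```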

@rujalinagbhidkar, you could help us help you if you described the objective. Many NLP tasks do work directly on unlabelled data, but not all (see above).


Can you be more specific about what you want this model to do?

The main objective of my project is to create a chatbot that will generate a test plan. These are very helpful in the electronics field. So what I am trying to do is train or fine-tune a model on electronics data; take the example of the Raspberry Pi (to get details of the RPi we need their documents, mostly paragraphs, datasheets, some research papers, and more).
Now, as you know, these docs don't have any labels. What type of approach should I use, or which model should be picked to train or fine-tune?

This isn’t really my area of expertise, but here are some thoughts.

(Assuming this capability doesn’t already exist and I’m just not aware of it)

You’re going to have to start with a large language model, and then teach it to recognize what the testable requirements are in a component spec sheet.

That training is going to require some labeled data. Probably a lot of it.

Once you've identified the testable requirements, you'll then need some means of writing a test procedure that uses the appropriate equipment for each type of measurement (voltages, power, timing, etc.) and verifies each requirement in detail.
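As a rough illustration of that first extraction step, here is a hedged sketch of prompting an instruction-tuned LLM, via the Hugging Face `transformers` pipeline, to pull testable requirements out of a datasheet excerpt. The model name and the excerpt are placeholders, not recommendations:

```python
from transformers import pipeline

# Placeholder: substitute any capable instruction-following model here.
generator = pipeline("text-generation", model="YOUR-INSTRUCTION-TUNED-MODEL")

# A made-up datasheet fragment standing in for real spec-sheet text.
datasheet_excerpt = (
    "Supply voltage: 4.75 V to 5.25 V. "
    "Max operating temperature: 85 C. "
    "GPIO output current: 16 mA max per pin."
)

prompt = (
    "Extract each testable requirement from the following datasheet text "
    "and propose a measurement to verify it:\n\n" + datasheet_excerpt
)

# The pipeline returns a list of dicts containing the generated text.
print(generator(prompt, max_new_tokens=300)[0]["generated_text"])
```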

I am not sure what you’re proposing is feasible at this time without using a labeled data set.

If you're using Python and NumPy, there should be a way to reorganize the data. But if you're looking for a fast solution, look into Pandas to label and reorganize the data as you see fit. Here is a link to the documentation:

User Guide — pandas 2.2.2 documentation (pydata.org)
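For instance, here is a small hypothetical sketch of loading raw text rows into a DataFrame and attaching labels with a rule-based pass; the column names and the rule itself are made up, and in practice a human annotator (or better rules) would define what counts as what:

```python
import pandas as pd

# Toy stand-in for raw datasheet sentences.
df = pd.DataFrame({
    "text": [
        "Supply voltage: 4.75 V to 5.25 V",
        "The board features a quad-core CPU",
    ]
})

# A crude rule: rows containing a digit are tagged as requirements.
df["label"] = df["text"].str.contains(r"\d").map(
    {True: "requirement", False: "description"}
)
print(df)
```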

Is this an example of one of your inputs?

Is your objective to do something humans cannot do? Or to do the same thing humans can already do, but do it better/faster/cheaper?

Yes, exactly, this is an example of my dataset. Humans are doing this work, but it takes a lot of ramp-up time to execute. Involving AI here could be very useful for them.

If you haven't yet worked through the short courses available on Generative AI, you absolutely should do so before going much further. Focus especially on the segments of the lifecycle related to fine-tuning.

I have found this linked article a good source for high-level explanations of the concepts and primary use cases:

Notice that two of the justifications they offer for performing fine-tuning on an existing foundational/pre-trained LLM are:

  • Customization: improving LLM performance in a specialized domain, and
  • Limited labelled data,

which both seem to me relevant to your objective.
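Note also that adapting a base model to your domain does not itself require labels: continued pre-training on raw datasheet text is self-supervised, because the training target is simply the next token. Here is a minimal sketch with Hugging Face `transformers`, where the base model and the corpus file are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; pick a suitable base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Hypothetical plain-text corpus of datasheets, one passage per line.
dataset = load_dataset("text", data_files={"train": "datasheets.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# mlm=False selects causal (next-token) language modelling: no labels needed.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```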

There has also been a lot of recent activity around applying LLMs to tabular data. A survey article is here:

https://arxiv.org/html/2402.17944v2
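One recurring technique in that literature is serialising table rows into natural-language text so an LLM can consume them. A toy sketch, with made-up spec columns:

```python
import pandas as pd

# Hypothetical spec table, the kind found in a datasheet.
specs = pd.DataFrame({
    "parameter": ["Supply voltage", "GPIO current"],
    "min": ["4.75 V", ""],
    "max": ["5.25 V", "16 mA"],
})

def row_to_text(row):
    # Turn one table row into a sentence an LLM can read.
    return f"{row['parameter']}: min {row['min'] or 'n/a'}, max {row['max'] or 'n/a'}."

for _, row in specs.iterrows():
    print(row_to_text(row))
```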
