How to Deal with an Unlabeled Dataset?

Hello everyone. I am new to NLP, and I am currently working on an NLP-based project. I am facing an issue with the data: I have to train or fine-tune a model on hardware datasheets that contain only text. Can anybody let me know how to deal with this type of data?

Hi @rujalinagbhidkar,
I am new to the AI world, so please don't take my word blindly. But do you think a classification model might help you with the unlabelled dataset you currently have? You might need to analyse the data a bit before defining the classes. Just thinking out loud; please feel free to correct me.

@mamba824

  1. A Classification model might be helpful if the problem we’re trying to solve is classification, but the original post doesn’t say, so we can’t recommend anything specific.

  2. A classification model needs to be trained on examples. Labelled examples. In the image world it might be cat/non-cat; in NLP, a classic example is spam/not-spam. The classification model learns to predict (classify) from many labelled examples (see the sketch below). Since the OP's data is unlabelled, a classification model isn't suitable.
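For concreteness, here is a minimal sketch (using scikit-learn and made-up spam examples, not the OP's data) of what "labelled examples" means in practice: every training text comes paired with a label, and without the `labels` list there is nothing for the model to learn to predict.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples (the part an unlabelled dataset lacks).
texts = [
    "WIN a FREE prize, click now!!!",
    "Limited offer: claim your reward",
    "Meeting moved to 3 pm tomorrow",
    "Please review the attached datasheet",
]
labels = ["spam", "spam", "not_spam", "not_spam"]

# TF-IDF features + logistic regression: a classic supervised pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # the supervised step: requires the labels
print(model.predict(["Claim your free prize today"]))
```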

@rujalinagbhidkar, you could help us help you if you described the objective. Many NLP tasks do work directly on unlabelled data, but not all (see above).


Can you be more specific about what you want this model to do?

The main objective of my project is to create a chatbot that will generate a test plan. These are very helpful in the electronics field. So what I am trying to do is train or fine-tune a model on electronics data; take the example of the Raspberry Pi (to get details of the RPi we need their documents, mostly paragraphs, datasheets, some research papers, and more).
Now, as you know, these docs don't have any labels. What type of approach should I use, or which model should be picked to train or fine-tune?

This isn’t really my area of expertise, but here are some thoughts.

(Assuming this capability doesn’t already exist and I’m just not aware of it)

You’re going to have to start with a large language model, and then teach it to recognize what the testable requirements are in a component spec sheet.

That training is going to require some labeled data. Probably a lot of it.

Once you've identified the testable requirements, you'll then need some means of writing a test procedure that uses the appropriate equipment for each type of measurement (voltages, power, timing, etc.) and verifies each requirement in detail.
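As a rough illustration of that first extraction step, here is a hedged sketch of prompting an instruction-tuned LLM, via the Hugging Face `transformers` pipeline, to pull testable requirements out of a datasheet excerpt. The model name and the excerpt are placeholders, not recommendations:

```python
from transformers import pipeline

# Placeholder: substitute any capable instruction-following model here.
generator = pipeline("text-generation", model="YOUR-INSTRUCTION-TUNED-MODEL")

# A made-up datasheet fragment standing in for real spec-sheet text.
datasheet_excerpt = (
    "Supply voltage: 4.75 V to 5.25 V. "
    "Max operating temperature: 85 C. "
    "GPIO output current: 16 mA max per pin."
)

prompt = (
    "Extract each testable requirement from the following datasheet text "
    "and propose a measurement to verify it:\n\n" + datasheet_excerpt
)

# The pipeline returns a list of dicts containing the generated text.
print(generator(prompt, max_new_tokens=300)[0]["generated_text"])
```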

I am not sure what you’re proposing is feasible at this time without using a labeled data set.

If you're using Python and NumPy, there should be a way to reorganize the data. But if you're looking for a fast solution, look into Pandas to label and reorganize the data as you see fit. Here is a link to the documentation:

User Guide — pandas 2.2.2 documentation (pydata.org)
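For instance, here is a small hypothetical sketch of loading raw text rows into a DataFrame and attaching labels with a rule-based pass; the column names and the rule itself are made up, and in practice a human annotator (or better rules) would define what counts as what:

```python
import pandas as pd

# Toy stand-in for raw datasheet sentences.
df = pd.DataFrame({
    "text": [
        "Supply voltage: 4.75 V to 5.25 V",
        "The board features a quad-core CPU",
    ]
})

# A crude rule: rows containing a digit are tagged as requirements.
df["label"] = df["text"].str.contains(r"\d").map(
    {True: "requirement", False: "description"}
)
print(df)
```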

Is this an example of one of your inputs?

Is your objective to do something humans cannot do? Or to do the same thing humans can already do, but do it better/faster/cheaper?

Yes, exactly, this is an example of my dataset. Humans are doing this work, but it takes a lot of ramp-up time to execute. Involving AI here could be very useful for them.

If you haven't yet worked through the short courses available on Generative AI, you absolutely should do so before going much further. Focus especially on the segments of the lifecycle related to fine-tuning.

I have found this linked article a good source for high-level explanations of the concepts and primary use cases:

Notice that two of the justifications they offer for performing fine-tuning on an existing foundational/pre-trained LLM are:

  • Customization: improving LLM performance in a specialized domain, and
  • Limited labelled data,

which both seem to me relevant to your objective.
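Note also that adapting a base model to your domain does not itself require labels: continued pre-training on raw datasheet text is self-supervised, because the training target is simply the next token. Here is a minimal sketch with Hugging Face `transformers`, where the base model and the corpus file are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; pick a suitable base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Hypothetical plain-text corpus of datasheets, one passage per line.
dataset = load_dataset("text", data_files={"train": "datasheets.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# mlm=False selects causal (next-token) language modelling: no labels needed.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```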

There has also been a lot of recent activity around applying LLMs to tabular data. A survey article is here:

https://arxiv.org/html/2402.17944v2
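One recurring technique in that literature is serialising table rows into natural-language text so an LLM can consume them. A toy sketch, with made-up spec columns:

```python
import pandas as pd

# Hypothetical spec table, the kind found in a datasheet.
specs = pd.DataFrame({
    "parameter": ["Supply voltage", "GPIO current"],
    "min": ["4.75 V", ""],
    "max": ["5.25 V", "16 mA"],
})

def row_to_text(row):
    # Turn one table row into a sentence an LLM can read.
    return f"{row['parameter']}: min {row['min'] or 'n/a'}, max {row['max'] or 'n/a'}."

for _, row in specs.iterrows():
    print(row_to_text(row))
```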
