Why do we need the data .csv in the lab?

Do I understand correctly that we have this file and this variable data = pd.read_csv("data/ner_dataset.csv", encoding = "ISO-8859-1") only to look at the head? :grinning:
print('ORIGINAL DATA:\n', data.head(5))

And we deleted it after that. It looks really strange. Can you explain to me why we need it at all?

Hi @someone555777

You should read the paragraph above the code cell, which states:

1 - Exploring the Data

We will be using a dataset from Kaggle, which we will preprocess for you. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tags.

So the print('ORIGINAL DATA:\n', data.head(5)) shows you the first 5 rows of the original dataset, which after the pre-processing (preparing it for the model) looks a bit different (for example, we don’t need the “Sentence #” column, and the sentence words are split off as inputs while the tags become the targets).
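For intuition, here is a minimal sketch (not the actual lab code) of how a table like that can be turned into per-sentence inputs and targets with pandas. The column names “Sentence #”, “Word” and “Tag” come from the Kaggle file; the lab’s own preprocessing may differ in the details:

import pandas as pd

# Load the original Kaggle-style table (one word per row).
data = pd.read_csv("data/ner_dataset.csv", encoding="ISO-8859-1")

# In the Kaggle file "Sentence #" is only filled on the first word of
# each sentence, so forward-fill it before grouping.
data["Sentence #"] = data["Sentence #"].ffill()

# Collect words and tags per sentence: words become inputs, tags become targets.
grouped = data.groupby("Sentence #", sort=False)
sentences = grouped["Word"].apply(list).tolist()
labels = grouped["Tag"].apply(list).tolist()

print(sentences[0])  # ['Thousands', 'of', 'demonstrators', ...]
print(labels[0])     # ['O', 'O', 'O', ..., 'B-geo', ...]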

Cheers

So, you have just shown what the original table looks like? Was it just for demonstration purposes, to show what the initial table for NER could look like?

Yes - we took someone’s data from Kaggle and showed how it looked prior to pre-processing for our needs.

So, in real life I could just create two .txt files from the start: the first with one sentence per row, and the second with the tags for each word in that row, separated by spaces. Correct?

I’m not sure I understand what you mean. You can open the files and look at what’s inside them (File → Open … then navigate to these .txt files). And yes, here is this particular dataset for reference (the small one, 10 sentences and their labels).
Sentences:

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "
They marched from the Houses of Parliament to a rally in Hyde Park .
Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .
The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton .
The party is divided over Britain 's participation in the Iraq conflict and the continued deployment of 8,500 British troops in that country .
The London march came ahead of anti-war protests today in other cities , including Rome , Paris , and Madrid .
The International Atomic Energy Agency is to hold second day of talks in Vienna Wednesday on how to respond to Iran 's resumption of low-level uranium conversion .
Iran this week restarted parts of the conversion process at its Isfahan nuclear plant .
Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday , after an IAEA surveillance system begins functioning .

Labels:

O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
O O O O O O O O O O O O O O O O O O B-per O O O O O O O O O O O
O O O O O O O O O O O B-geo I-geo O
O O O O O O O O O O O O O O O
O O O O O O O O O O O B-geo O O B-org I-org O O O B-gpe O O O B-geo O
O O O O O B-gpe O O O O B-geo O O O O O O O B-gpe O O O O O
O B-geo O O O O O O O O O O O O B-geo O B-geo O O B-geo O
O B-org I-org I-org I-org O O O O O O O O B-geo B-tim O O O O O B-gpe O O O O O O O
B-gpe O O O O O O O O O O B-geo O O O
B-gpe O O O O O O O O O O O O O O B-tim O O O B-org O O O O O

There is also the full dataset in the “data/large/…” directory, which you can open and look at for yourself too.
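If you go with the two-.txt-files layout you described (one sentence per line, one line of space-separated tags per sentence), reading them back takes only a few lines. A minimal sketch, with sentences.txt and labels.txt as placeholder file names:

# One sentence per line -> list of token lists.
with open("sentences.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# One space-separated tag sequence per line -> list of tag lists.
with open("labels.txt", encoding="utf-8") as f:
    labels = [line.split() for line in f if line.strip()]

# Every sentence should have exactly one tag per word.
assert all(len(s) == len(t) for s, t in zip(sentences, labels))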

I mean that in real life, if I generate the sentences and labels from scratch, I can just skip the creation of this .csv dataset.

Maybe it would not be easier for a human to assign tags to words, but it would be easier for a programmer to deserialize them.

By the way, do you have any recommendations on human tagging for NLP tasks? What are the best practices?

Yes, of course, you can create the dataset without the need for a .csv file.

Usually, the problem is labeling the dataset (in this case, defining tags for words) and not “deserialization” as you call it.

Creating datasets is a lot of work for simple tasks and a tremendous amount of work for LLMs like ChatGPT, with everything in between. In any case, it is a lot of work.

So, can you recommend any services for text labeling? Or do you think an approach like this .csv table is good enough? It would be really nice to have something that is both easy to tag and easy to deserialize.

Can ChatGPT generate datasets? :thinking:

I’m not sure I understand what you are asking.

I cannot recommend any particular service because I have not used many of them. Every text corpus is different in some way, and every task is also different in some way. Some companies label their datasets themselves, some hire outside companies to do it for them (like Mechanical Turk or Wow ai). In any case, it is an expensive endeavour.

I don’t really understand what you mean by that. This dataset is for learning purposes, and it is fine for that.

And .csv files are very easy to deserialize and serialize (pd.read_csv(...) (docs) and DataFrame.to_csv(...) (docs) are super easy ways to handle .csv files).
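As a tiny illustration (the toy table below is made up), a word/tag table round-trips through .csv in one line each way:

import pandas as pd

# Build a toy word/tag table and write it to disk.
df = pd.DataFrame({"Word": ["They", "marched", "to", "Hyde", "Park"],
                   "Tag":  ["O", "O", "O", "B-geo", "I-geo"]})
df.to_csv("toy_ner.csv", index=False)

# Read it back.
restored = pd.read_csv("toy_ner.csv")
print(restored.head())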

It can. But what I meant was that ChatGPT was trained on a very large and expensive (a lot of human labour involved) dataset.

I meant: what platform is used for data labeling? For example, I have people who can label. Where should they work? How should they label the data?

So, is an approach like in this .csv from the lab good enough, for example for NER tasks? We arrange the words of the sentences vertically in a separate column, write the labels next to the words, and separate one sentence from another with a Sentence # in another column. What do you think about this? My intuition tells me that we could do something better if we built a labeling system for the NER task from scratch, something that is both easy for a programmer to deserialize and easy for a human to label.

Not that easy, though. If we want to get train-ready sentences one by one as strings in a .txt file (this is what I call deserialization in this case), we have to do extra work: we have to group the words by the Sentence # separator, which sits in another column.

I think I better understand your question now. And I think you won’t find my answer very useful.

The reality is that the majority of companies have different goals and different people working there on different legacy systems. Most of them have databases (SQL, Mongo, etc.), some have other types of data stored in files like .pdf or .doc or other formats. The biggest ones (Google, Amazon, Meta, etc.) have their own solutions. So it depends on what is already there.

The other thing is the goal - maybe Netflix needs to recommend a movie based on reviews… then the labeling is kind of already there - the users rate, review, and watch movies, and those are the labels. Or maybe Google needs to pick the best advertisement for the email the user is currently reading; then the labels are not that explicit. The same logic goes for smaller companies: maybe they have a directory structure of files, maybe the database tables already have label columns.

If we are talking only about Named Entity Recognition, nowadays a company would probably take an off-the-shelf model and fine-tune it iteratively. But also, as I think your idea is, they might have some software into which the documents come and where the labeler can easily assign the correct category. In general, though, .csv files are a simple and “good enough” medium for the data to come from.
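For example, the off-the-shelf route could look roughly like this with the Hugging Face transformers library (the model name is just one publicly available example, not a recommendation):

from transformers import pipeline

# Pretrained NER model from the Hugging Face Hub.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Thousands of demonstrators have marched through London ."))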

Cheers

The same one that we use in the lab? Or something more minimalistic? Because I really don’t like the approach with that huge data_generator function that I saw in the lab. Can you maybe give me an example?

So, is the structure of the .csv from the lab fine? Or can it be better?

data_generator is pretty simple. In any case (whether you take the model from the lab, from Hugging Face, or anywhere else) you need to prepare the inputs and targets according to the model - the way it was trained.

Here is the basic PyTorch DataLoader tutorial, which is very well explained.
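A rough sketch of that idea for NER (assuming the words and tags have already been mapped to integer ids; the padding value 0 is just an example):

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class NERDataset(Dataset):
    # sentences and labels are lists of lists of integer ids
    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return (torch.tensor(self.sentences[idx]),
                torch.tensor(self.labels[idx]))

def collate(batch):
    # Pad the variable-length sentences within a batch; 0 is just an example pad id.
    words, tags = zip(*batch)
    return (pad_sequence(list(words), batch_first=True, padding_value=0),
            pad_sequence(list(tags), batch_first=True, padding_value=0))

loader = DataLoader(NERDataset([[4, 8, 15], [16, 23]], [[0, 0, 3], [0, 1]]),
                    batch_size=2, collate_fn=collate)
for x, y in loader:
    print(x.shape, y.shape)  # both torch.Size([2, 3]) after padding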

Everything can be better - nothing is perfect. It depends on how much better you want it (how much more money you will make, or whatever “better” means). But for most cases (especially learning goals) it is good (and clear) enough.
