Ungraded lab 2: clean_dataset function

Could a mentor or classmate give me some insight into what the clean_dataset function is meant to achieve?

I think we need to clean the training data so that each word can be labeled with the right category, like
“Abhishek” → Name
“Jha” → Name
“Application” → Designation
“Development” → Designation
“Associate” → Designation
“-” → Unknown

Yet when I print out the result of cleaned_DF, the order of the labels is very confusing:
Name Name Designation Designation Designation Empty Empty Empty Empty Empty Empty Empty Empty Email Address …

Many words are marked “Empty”, and I can’t understand why “Empty” would appear.

I would appreciate any help. Many thanks.

To understand why so many “Empty” labels appear, you need to look more closely at the function clean_dataset and pay attention to the variable emptyList.
As for your thoughts on labeling, focus on the variable start in clean_dataset (included in the image below). My guess is that whether a word gets labeled depends on whether start satisfies the condition if (start>=int(entList[0]) and i<=int(entList[1])):. You may want to print data[0][1] in another cell to see this (data[0][1] is an example of strDictData in clean_dataset, and I have included its output below). For example, with [1296, 1622, 'Skills'], start needs to fall within the range (1296, 1622) for the label to be added to emptyList. In short, where “Name”, “Designation”, “Skills”, … end up in cleanedDF depends entirely on the entities in the data passed to clean_dataset. Hopefully this rough explanation helps.
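
To make the idea concrete, here is a minimal sketch of span-based labeling. It is not the lab's actual code: the helper name label_words, the sample text, the filler words and the character offsets are all made up for illustration, and I am assuming the end offset of an entity is inclusive. Every word starts out as “Empty” and only receives a label if it lies entirely inside one of the entity spans.

```python
def label_words(text, entities, default="Empty"):
    """Return one label per whitespace-separated word in text.

    entities is a list of [start_char, end_char, label] entries;
    end_char is treated as inclusive (an assumption for this sketch).
    """
    labels = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)  # offset of the word's first character
        end = start + len(word) - 1    # offset of the word's last character
        pos = end + 1
        label = default                # stays "Empty" unless a span matches
        for ent_start, ent_end, ent_label in entities:
            if start >= ent_start and end <= ent_end:
                label = ent_label
                break
        labels.append(label)
    return labels


# Hypothetical example: offsets chosen by hand to cover the first five words.
text = "Abhishek Jha Application Development Associate - extra words"
entities = [[0, 11, "Name"], [13, 45, "Designation"]]
print(label_words(text, entities))
```

In this sketch the words “-”, “extra” and “words” fall outside both spans, so they keep the default “Empty” label, which is the same kind of pattern you see in your output.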




Hi Sonnh1902, thanks so much for your help.
Though I still can't quite follow what the program is doing, I will go on to the NLP course on transformers to strengthen my understanding, and then come back to revisit this challenge. :grinning:
Thanks so much indeed.