Could fellow mentors/classmates give me some insight into what the clean_dataset function is meant to achieve?
I think we need to clean the training data so that we can label each word with the right category, like
“Abhishek” → Name
“Jha” → Name
“Application” → Designation
“Development” → Designation
“Associate” → Designation
“-” → Unknown
Yet after I print out cleaned_DF, the order of the labels is confusing:
Name Name Designation Designation Designation Empty Empty Empty Empty Empty Empty Empty Empty Email Address …
Many entries are marked “Empty”. I can’t understand why “Empty” would appear.
Any help is appreciated. Many thanks.
To answer your question about the many “Empty” labels, you need to dive deeper into the function clean_dataset and pay attention to the variable emptyList.
About your thoughts on labeling, you can focus on the variable start in the function clean_dataset (included in the image below). My guess is that the function labels the data depending on whether the variable start satisfies the if statement if (start>=int(entList[0]) and i<=int(entList[1])): or not. You may want to print out data[0][1] in another cell to figure it out (data[0][1] is actually an example of strDictData in the function clean_dataset, and I have included the result of data[0][1] below). For example, with [1296, 1622, 'Skills'], the variable start needs to be in the range (1296, 1622) to be added to the emptyList. In conclusion, the way “Name”, “Designation”, “Skills”, … appear in cleanedDF depends entirely on the entities from data passed to the function clean_dataset. Hopefully, the rough explanation above helps you out.
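To make the idea more concrete, here is a minimal sketch of span-based labeling. It is not the actual clean_dataset code: the function name label_tokens, the sample text, and the character offsets are all made up for illustration, and I am assuming each entity is a [start, end, label] triple like [1296, 1622, 'Skills']. A token gets an entity label only if its starting character offset falls inside one of the spans; everything else falls back to "Empty", which is probably why you see so many of them.

```python
def label_tokens(text, entities, default="Empty"):
    """Label each whitespace-separated token with the entity whose
    character span contains the token's start offset.

    `entities` is assumed to be a list of [start, end, label] triples.
    """
    labels = []
    pos = 0  # where to resume searching in `text`
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        pos = start + len(token)
        label = default
        for ent_start, ent_end, ent_label in entities:
            # Token belongs to the entity if it starts inside the span.
            if int(ent_start) <= start < int(ent_end):
                label = ent_label
                break
        labels.append((token, label))
    return labels


# Hypothetical sample; the offsets were worked out by hand for this string.
text = "Abhishek Jha Application Development Associate -"
entities = [[0, 12, "Name"], [13, 46, "Designation"]]

for token, label in label_tokens(text, entities):
    print(token, "->", label)
```

In this sketch the "-" token is outside both spans, so it ends up as "Empty" rather than "Unknown". Any text in a resume that was not annotated with an entity would be labeled the same way.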
Hi Sonnh1902, thanks so much for your help.
Though I still can't work out what the program is really doing, I will take an NLP course on transformers to strengthen my understanding and then come back to revisit this challenge.
Thanks so much indeed.