Ungraded lab 2: clean_dataset function

jackchan.hk · October 30, 2022, 1:43pm

Could fellow mentors or classmates give me some insight into what the clean_dataset function is meant to achieve?

I think we need to clean the training data so that we can label each word with the right category, like
“Abhishek” → Name
“Jha” → Name
“Application” → Designation
“Development” → Designation
“Associate” → Designation
“-” → Unknown

Yet after I print out the result of cleaned_DF, the order of the labels is confusing:
Name Name Designation Designation Designation Empty Empty Empty Empty Empty Empty Empty Empty Email Address …

Many words are marked “Empty”. I can’t understand why “Empty” would appear.

I would appreciate any help. Many thanks.

sonnh1902 · October 30, 2022, 4:07pm

To answer your question about the many “Empty” labels, you need to dive deeper into the function clean_dataset and pay attention to the variable emptyList.
As for your thoughts on labeling, you can focus on the variable start in clean_dataset (included in the image below). I guess that, to label the data, the function checks whether start satisfies the condition if (start>=int(entList[0]) and i<=int(entList[1])): or not.
You may want to print out data[0][1] in another cell to figure it out (data[0][1] is an example of strDictData in clean_dataset, and I have included the result of data[0][1] below). For example, take the entity [1296, 1622, 'Skills']: start needs to fall within the range 1296 to 1622 (and i must not exceed 1622) for that label to be added to emptyList.
In conclusion, where “Name”, “Designation”, “Skills”, … end up in cleanedDF depends entirely on the entities in the data passed to clean_dataset. Hopefully this rough explanation helps you out.
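To make the idea above concrete, here is a minimal sketch of span-based word labeling. It is not the lab's clean_dataset code: the function name label_words, the sample text, and the character offsets are all assumptions made up for illustration. It only shows the general pattern, where every word starts out as “Empty” and is relabeled when its character range falls inside an annotated entity range.

```python
import re

def label_words(text, entities, default="Empty"):
    """Assign one label per whitespace-separated word.

    `entities` is a list of [start, end, label] with inclusive character
    offsets (a hypothetical format, loosely modeled on [1296, 1622, 'Skills']).
    """
    labels = []
    for match in re.finditer(r"\S+", text):
        word_start = match.start()
        word_end = match.end() - 1  # inclusive index of the word's last character
        label = default             # stays "Empty" if no entity covers the word
        for ent_start, ent_end, ent_label in entities:
            # Same shape of test as in the thread: the word must lie
            # entirely inside the entity's character range.
            if word_start >= ent_start and word_end <= ent_end:
                label = ent_label
                break
        labels.append(label)
    return labels

# Hypothetical sample: offsets were counted by hand for this string.
text = "Abhishek Jha Application Development Associate -"
entities = [[0, 11, "Name"], [13, 45, "Designation"]]
print(label_words(text, entities))
# ['Name', 'Name', 'Designation', 'Designation', 'Designation', 'Empty']
```

In this sketch the trailing “-” is not covered by any entity range, so it keeps the default “Empty” label, which would explain runs of “Empty” between labeled sections.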

jackchan.hk · November 2, 2022, 2:57pm

Hi Sonnh1902, thanks so much for your help.
Though I still can’t quite follow what the program is doing, I will take the NLP course on transformers to strengthen my understanding and then come back to revisit this challenge.
Thanks so much indeed.
