Course 5 Week 4: Is clean_dataset() buggy in the Named-Entity Recognition notebook?

Referring to def clean_dataset: I tried to visually check the alignment between the word and entity arrays like this:

k = 1
words = data[k][0].split()
ents = cleanedDF.iloc[k].setences_cleaned
pd.DataFrame(data={'word': words, 'entity': ents})

In this sample resume, “Graduation Year” is entirely missing. “Bachelor of Engg in Information Technology” should all be tagged as “Degree” (according to data), but the last word, “Technology”, is “Empty”. This is just one sample I eyeballed; there are many other misalignments.

I seriously believe there’s a bug in clean_dataset. In particular, “i” is used to loop through each individual resume, but it is also reused in an inner loop, “for i in range(lenOfString):”. This is extremely worrisome. I can guess at what this method is trying to achieve, but its detailed logic is very hard to follow.

I decided to actually rewrite this method. I’ll post it here, and I’d like to verify that it does what the original intended, without the bugs.

You could be right. I think the mentors have not paid much attention to the optional labs.

I wrote this; if anyone is interested, please review. The general approach I used is: (1) replace the words inside the full text with their corresponding entities (according to data), then split into an array; (2) loop through the array and replace all remaining words (those that aren’t labelled) with “Empty”. This requires renaming/transforming the entity names so they can’t collide with literal words, and then mapping them back to the originals at the end.

I spot-tested a few samples and this seems to align and account for all the entities a bit better. One thing I am not certain about is whether commas (and other punctuation) are handled correctly; that sort of depends on how the tokenizer works and what the model expects as input.
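To make the punctuation concern concrete: a plain str.split() splits on whitespace only, so punctuation stays glued to the word, while a subword tokenizer would split it off. A minimal sketch (the sample text is made up):

```python
text = "Bachelor of Engg in Information Technology, 2014"

# str.split() splits on whitespace only, so the comma stays attached to the word
tokens = text.split()
print(tokens)
# ['Bachelor', 'of', 'Engg', 'in', 'Information', 'Technology,', '2014']

# A subword tokenizer (e.g. BERT's) would instead emit 'Technology' and ','
# as separate tokens, so whether the word/label alignment survives depends on
# which tokenization the downstream model pipeline actually uses.
```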

import re
import pandas as pd
from tqdm import tqdm

def generate_ner_labels_from_dataset(sentences_ner_data):

  entity_tagged_sentences = []
  for data in tqdm(sentences_ner_data):  # loop through each sample
    text = data[0]
    entities = data[1]['entities']     # IMPORTANT: assume 'entities' is the only entry in the data[1] dict

    # sort in desc order of 'start': replacements change the text length, which would
    # invalidate subsequent start/end offsets, so work from the back of the string.
    entities = sorted(entities, key=lambda tup: tup[0], reverse=True)

    for entity in entities:
      start, end, name = entity

      # escape underscores, then turn spaces in the entity name into underscores (undone later)
      name = name.replace('_', '!').replace(' ', '_')
      name = f'<*entity*>{name}</*entity*>'   # html-like tag marks this as an entity name, not literal text

      # replace each word of the real text span with its entity label
      replacement = ' '.join([name] * len(text[start:end].split()))

      text = text[:start] + replacement + text[end:]

    texts = text.split()

    # map any untagged word to 'Empty'; strip the <*entity*>...</*entity*> markers
    # and reverse the _ / ' ' substitution for tagged words
    def to_entity(word):
      if '<*entity*>' not in word:
        return 'Empty'
      pattern = r"<\*entity\*>(.*?)<\/\*entity\*>"
      word = re.findall(pattern, word)[0]
      return word.replace('_', ' ').replace('!', '_')

    ents = list(map(to_entity, texts))

    entity_tagged_sentences.append(ents)

  # build the dataframe once, after all samples are processed
  ner_labels_df = pd.DataFrame(data={'sentences_cleaned': entity_tagged_sentences})

  return ner_labels_df

I guess so. When I first went through this part of the course, I didn’t pay much attention beyond what was needed to pass it. But now that I actually want to work on a real project, I am scrutinizing all the logic more closely to understand it better. HuggingFace’s own tutorial doesn’t seem to have a complete end-to-end example in notebook form; they only provide a training script, which I have yet to go through. I was hoping the notebook format here would be better, but… anyway, I have posted my version of “clean_dataset()”. If you can connect with other mentors and course instructors, I would like to get feedback on it, and possibly use it in my real project.

Hi kechan,
thanks for posting this! I am also struggling with clean_dataset and what it tries to achieve. Sadly there are no function descriptions like there were for the other notebooks. Like you, I suspect that it is buggy. @TMosh, can this be brought up with the creators of the notebook?

kechan, I will try your functions later and report back.

Hi Melanie,
Yes. On occasion I don’t mind zero code comments if the code is straightforward. But that def is deeply nested, with if-elses all over, and it becomes quite hard to follow. At the very least, the intended behavior of the function should be explained. I originally didn’t bother even reading that messy code until I smelled/suspected a bug; then it got a bit more frustrating.
Andrew Ng is an excellent instructor and did everything with love. But unfortunately, the optional labs didn’t get as much love.

I agree, Andrew Ng is great at teaching complex topics!

Here is the promised feedback on your function. I called it with

cleaned_df2 = generate_ner_labels_from_dataset(data)

and the returned dataframe had only shape (1,1) :slightly_frowning_face:

I did not do any further debugging. Without knowing exactly what this function is supposed to do, it does not make sense to me. For example: both your function and the original function return “Name” twice in “sentences_cleaned”, but in “entities”, “Name” is present only once. To me this looks like an error, but is it really? That’s where a function description would be handy.
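For what it’s worth, the doubled “Name” falls out of the span-expansion step: one entity span covering N whitespace-separated words yields N copies of the label, one per word, so a two-word name is labelled “Name” twice. A minimal illustration of just that step (the text and span are hypothetical):

```python
text = "John Smith Software Engineer"
start, end, label = 0, 10, "Name"   # the span covers the two words "John Smith"

# one label per whitespace-separated word inside the span
n_words = len(text[start:end].split())
labels = [label] * n_words
print(labels)   # ['Name', 'Name']
```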

I noticed there was a missing line that’s supposed to call mergeInterval(…). I also found what could be a bug in it, so I rewrote my own version, called better_mergeIntervals.

Entities are demarcated by text[start:end], but they can overlap, and so need to be resolved/merged:
case 1: overlap of entities of the same type; replace the [start, end] intervals by their union.
case 2: overlap of entities of different types; this is ambiguous, and resolving it is somewhat arbitrary.

def better_mergeIntervals(entities):
  entities = sorted(entities, key=lambda tup: tup[0])   # sort by 'start', the lower bound
  merged_entities = []

  for current_entity in entities:
    if not merged_entities:
      merged_entities.append(current_entity)        # the very first entity, no merging consideration is needed
    else:
      prev_entity = merged_entities[-1]

      if current_entity[0] <= prev_entity[1]:      # overlapped
        if prev_entity[2] == current_entity[2]:    # same type ('is' would compare identity, not equality)
          upper_bound = max(prev_entity[1], current_entity[1])
          merged_entities[-1] = (prev_entity[0], upper_bound, prev_entity[2])   # simple merge by union
        elif prev_entity[1] >= current_entity[1]:
          pass   # current interval is a subset of the previous entity's, so drop the current entity
        else:
          # union, adopting the current entity's type (this rule is somewhat arbitrary)
          merged_entities[-1] = (prev_entity[0], current_entity[1], current_entity[2])
      else:
        merged_entities.append(current_entity)
  return merged_entities
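To sanity-check the two merge cases, here are a few made-up intervals run through the function (restated below so the snippet runs standalone; the entity names are hypothetical):

```python
def better_mergeIntervals(entities):
  entities = sorted(entities, key=lambda tup: tup[0])  # sort by 'start'
  merged_entities = []
  for current_entity in entities:
    if not merged_entities:
      merged_entities.append(current_entity)
    else:
      prev_entity = merged_entities[-1]
      if current_entity[0] <= prev_entity[1]:            # overlapped
        if prev_entity[2] == current_entity[2]:          # same type: union
          upper_bound = max(prev_entity[1], current_entity[1])
          merged_entities[-1] = (prev_entity[0], upper_bound, prev_entity[2])
        elif prev_entity[1] >= current_entity[1]:
          pass                                           # subset: drop current
        else:                                            # union, adopt current's type
          merged_entities[-1] = (prev_entity[0], current_entity[1], current_entity[2])
      else:
        merged_entities.append(current_entity)
  return merged_entities

# case 1: same-type overlap -> union of the intervals
print(better_mergeIntervals([(0, 10, 'Name'), (5, 15, 'Name')]))
# [(0, 15, 'Name')]

# case 2: different-type overlap, current not a subset -> union, current's type wins
print(better_mergeIntervals([(0, 10, 'Name'), (5, 20, 'Degree')]))
# [(0, 20, 'Degree')]

# non-overlapping intervals pass through untouched
print(better_mergeIntervals([(0, 5, 'Name'), (10, 20, 'Degree')]))
# [(0, 5, 'Name'), (10, 20, 'Degree')]
```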

def generate_ner_labels_from_dataset(sentences_ner_data):

  entity_tagged_sentences = []
  for data in tqdm(sentences_ner_data):  # loop through each sample
    text = data[0]
    entities = data[1]['entities']     # IMPORTANT: assume 'entities' is the only entry in the data[1] dict

    # cleanse bad labels: entity intervals (start, end) can overlap, so merge them
    # (this is where the original notebook's mergeIntervals(entities) call belongs)
    entities = better_mergeIntervals(entities)

    # sort in desc order of 'start': replacements change the text length, which would
    # invalidate subsequent start/end offsets, so work from the back of the string.
    entities = sorted(entities, key=lambda tup: tup[0], reverse=True)

    for entity in entities:
      start, end, name = entity

      # escape underscores, then turn spaces in the entity name into underscores (undone later)
      name = name.replace('_', '!').replace(' ', '_')
      name = f'<*entity*>{name}</*entity*>'   # html-like tag marks this as an entity name, not literal text

      # replace each word of the real text span with its entity label
      replacement = ' '.join([name] * len(text[start:end].split()))

      text = text[:start] + replacement + text[end:]

    texts = text.split()

    # map any untagged word to 'Empty'; strip the <*entity*>...</*entity*> markers
    # and reverse the _ / ' ' substitution for tagged words
    def to_entity(word):
      if '<*entity*>' not in word:
        return 'Empty'
      pattern = r"<\*entity\*>(.*?)<\/\*entity\*>"
      word = re.findall(pattern, word)[0]
      return word.replace('_', ' ').replace('!', '_')

    ents = list(map(to_entity, texts))

    entity_tagged_sentences.append(ents)

  # build the dataframe once, after all samples are processed
  ner_labels_df = pd.DataFrame(data={'sentences_cleaned': entity_tagged_sentences})

  return ner_labels_df

My cleanedDF = generate_ner_labels_from_dataset(data) has sensible stuff (not shape (1, 1)).
Please try this version and let me know.