C5 W4 lab2 What should I equate previous_word_idx to?

Code given is:

label_all_tokens = True
def tokenize_and_align_labels(tokenizer, examples, tags):
    tokenized_inputs = tokenizer(examples, truncation=True, is_split_into_words=False, padding='max_length', max_length=512)
    labels = []
    for i, label in enumerate(tags):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

What am I meant to type in this line instead of None?

        previous_word_idx = None

Also

I have no idea what’s going on in this lab either. I imagine an expert would follow it, but as a learner I’m completely lost.

They say a few random things I don’t understand, jump right into heavy code, and then the lab finishes without giving any example of what the whole thing is for.

If it’s a lab about named entity recognition, it would be good to see an example of a sentence going in to some code and then some named entities coming out, to see the whole thing work.

Hi Jaime_Gonzales,

previous_word_idx = None should stay exactly as it is: the None is the deliberate starting value, and it is what makes the elif that follows work.

On the first real token, word_idx is not None while previous_word_idx still is, so word_idx != previous_word_idx is true and label[word_idx] is appended to label_ids. At the end of every iteration previous_word_idx is set to word_idx, so from then on later sub-tokens of the same word fall through to the else branch instead.

So None is used actively in the code and is not to be changed. It’s not really a coding exercise.
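To see why the initial None matters, here is the inner loop run by hand on a made-up example (the word_ids list, labels, and tag ids below are invented for illustration; in the lab they come from the tokenizer and the dataset):

```python
# Hypothetical output of tokenized_inputs.word_ids() for one example:
# the special tokens map to None, and the middle word is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
label = [3, 0, 7]          # one (made-up) tag id per original word
label_all_tokens = True

label_ids = []
previous_word_idx = None   # deliberately None: no word has been seen yet
for word_idx in word_ids:
    if word_idx is None:                 # special token -> ignored by the loss
        label_ids.append(-100)
    elif word_idx != previous_word_idx:  # first sub-token of a new word
        label_ids.append(label[word_idx])
    else:                                # continuation sub-token of the same word
        label_ids.append(label[word_idx] if label_all_tokens else -100)
    previous_word_idx = word_idx

print(label_ids)  # [-100, 3, 0, 0, 7, -100]
```

On the very first iteration word_idx is None, and on the second it is 0, which differs from the initial previous_word_idx of None — so the word’s real label is used. With label_all_tokens = False the duplicated 0 would become -100 instead.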

The way I look at this lab is that it starts with some data cleaning that is rather specific to the dataset. So I don’t spend too much time trying to understand the specifics, but I’m sure the data is cleaned in the process. For me it’s mainly an illustration of how relevant data cleaning is; because it is not presented as such, it may be confusing to learners.

Next, there is some padding and tags are linked to the data. The rationale is clear and the code does not seem too hard to understand (although it would not be easy to create from scratch). Still some additional clarification could have been useful.

The tokenization part is more general and requires some knowledge of the general approach taken by HuggingFace. This part could certainly use some elaboration and a reference to the HuggingFace course on using the transformers library.

So I agree that the lab could certainly have been presented more clearly. As a first contact with the HuggingFace transformers library, and simply as a demonstration, it could be OK — if it were presented more clearly as such.
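On the missing end-to-end picture: once a model is trained, it predicts one tag id per token, and turning those tags back into entities is just a small decoding step. Here is a minimal sketch with an invented sentence, tag vocabulary, and “predictions” (no model involved — in the real lab the ids would come from the trained token-classification model):

```python
# Invented BIO tag vocabulary and per-word "predictions".
id2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}
words = ["Jaime", "Gonzales", "lives", "in", "New", "York"]
pred_ids = [1, 2, 0, 0, 3, 4]

# Group consecutive B-/I- tags of the same type into entities.
entities = []
current = None
for word, tag in zip(words, (id2tag[i] for i in pred_ids)):
    if tag.startswith("B-"):                 # start of a new entity
        current = [word, tag[2:]]
        entities.append(current)
    elif tag.startswith("I-") and current and current[1] == tag[2:]:
        current[0] += " " + word             # continuation of the current entity
    else:
        current = None                       # "O" tag: outside any entity

print([tuple(e) for e in entities])
# [('Jaime Gonzales', 'PER'), ('New York', 'LOC')]
```

That is the “sentence in, named entities out” demonstration the lab arguably should have ended with.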
