C2_W1 Question about BERT Embeddings

In the C2_W1 video called ‘BERT: Example,’ there is a slide called ‘BERT Embeddings.’

I’ve completed quite a few courses in machine learning, deep learning, TensorFlow, PyTorch, etc., but I still don’t understand the presenter’s explanation of this graphic. Could someone please explain where the numbers shown in the graphic came from?

I understand the ‘Raw Input’ part (at the bottom of the slide), and the ‘Word Piece Tokenization’ annotation (2nd row up), but after that, I don’t understand the speaker’s explanation. Some specific questions I have include:

What is represented by the blue thing with dimensions (1, 4, 768)? Earlier in the video, the speaker said, “Word Piece Tokenization is a technique that is used to segment words into sub-words and is based on pre-trained models with the dimension of 768,” but that’s not the same as (1, 4, 768). What is in each element of this array, and where did it come from? What are the pre-trained models she is referring to? Are we simply leveraging the embeddings the BERT inventors produced when their model was invented?

Where did the numbers 101, 2293, 2023, and 4377 in the graphic come from? The speaker seems to suggest they are positional index numbers from the object with the dimensions (1, 4, 768), but how do you get a positional index of 4377 from something with up to 1 x 4 x 768 = 3072 positions in it? I’m confused.

Where does the object with the numbers 0, 0, 0, and 0 come from? What does it represent? I see the notations at the right about 0 = Sentence 1 and 1 = Sentence 2, but I’m not sure what this means. Is she trying to say that “Love this dress” is the first sentence being considered, so because of that, each token associated with it (including ‘[CLS]’) gets a “0” positional indicator? The tokens from the next sentence would each get a “1” positional indicator for ‘Segment ID’?

I think I understand the ‘Position Embedding Input ID’, but I have at least one other question:

Even supposing I understood where each of the three gray ‘vectors’ in the graphic comes from (which I do not yet): how do they relate to the (1, 4, 768) green object that is the ‘input for the BERT model’ (at the top of the graphic)? If we computed the element-wise sum of three (1, 4) arrays, that would still produce a (1, 4) array, no? Where does the 768 come from?

So I am not fully tracking how the 3-word raw input gets converted into a (1, 4, 768) input to BERT. Does the same process ensure that a 5-word phrase also gets converted into a (1, 4, 768) BERT input? How about two 3-word phrases input at the same time, with [CLS] and [SEP] tokens in the mix? Would the two phrases together still end up as a single (1, 4, 768) BERT input?

Thank you!

Hi Andrew,

Great questions. They touch on core NLP concepts, which the presenter reviewed at a not-for-beginners pace. Let me break it down here.

The green object is the element-wise sum of the blue sequences: each of the ID vectors (token IDs, segment IDs, position IDs) is first looked up in its own embedding table, which turns it into a (1, 4, 768) tensor, and those three tensors are summed.

(1, 4, 768) is (batch_size, length_of_sequence, embedding dimension)

batch_size is the number of sentences you pass through the model at one time; since you are passing only one sentence, it is 1. In practice, you would want a higher batch size so that training goes faster, but GPU memory is a constraint.
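
Here is a minimal sketch with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint (the course slide may be based on a different model, so treat the exact numbers as illustrative):

```python
# Minimal sketch (assumes the 'bert-base-uncased' checkpoint).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# One sentence -> batch_size = 1
inputs = tokenizer("Love this dress", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch_size, length_of_sequence, embedding_dimension)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 5, 768])
# The length is 5 here because the tokenizer also appends a [SEP] token;
# the slide shows 4 because it only counts [CLS] plus the three word tokens.
```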

Your sentence has been split into four tokens. The number of tokens is not necessarily equal to the number of words: a word can be split into sub-word tokens, so number_of_tokens >= number_of_words. For convenience here, each of the three words is represented as one token, plus a [CLS] token that marks the beginning of the sequence. So a 5-word phrase will have at least 6 tokens.
These tokens are looked up in a ‘dictionary’ (the model’s vocabulary) where each token has an index. In the slide, the indices for the four tokens are 101, 2293, 2023, and 4377, respectively. Different models will have different index numbers.
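
For instance, with the uncased base BERT tokenizer (my assumption; a different checkpoint would give different IDs), you can see the token-to-index mapping directly:

```python
# Sketch of the token -> vocabulary index lookup (assumes 'bert-base-uncased').
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Love this dress")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'love', 'this', 'dress', '[SEP]']
print(encoded["input_ids"])
# e.g. [101, 2293, 2023, 4377, 102] -- 101 is [CLS], 102 is the [SEP]
# token that the slide omits; the middle three are the word indices.
```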

Imagine yourself on Earth: you live in a 3D world, so you need three coordinates to locate yourself precisely. Similarly, each token is represented by 768 dimensions/coordinates. Again, 768 is defined by the model; it could be 512 or any other number.
This is separate from the position embedding, which encodes where each token sits in the sequence. The position IDs start at 0 and go up to (number_of_tokens - 1); in this example they are 0, 1, 2, 3.
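
You can also see that 768 is simply the hidden size the model was built with by inspecting its embedding tables (again assuming bert-base-uncased):

```python
# Each embedding table has 768 columns, i.e. one 768-dimensional vector per row
# (assumes 'bert-base-uncased'; a different model may use another hidden size).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.embeddings.word_embeddings.weight.shape)        # torch.Size([30522, 768])
print(model.embeddings.position_embeddings.weight.shape)    # torch.Size([512, 768])
print(model.embeddings.token_type_embeddings.weight.shape)  # torch.Size([2, 768])
```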

For some types of input, e.g. question answering or sentence entailment, you concatenate a pair, say a question and an answer, into one sequence. The question is the first segment and the answer is the second. The former is indicated by a segment ID of 0 and the latter by 1, so the segment ID input looks something like (0, 0, 0, 0, 0, 1, 1, 1, 1), depending on the number of tokens in each segment.
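
For example, if you tokenize a pair of sentences (the second sentence below is just something I made up for illustration), the tokenizer produces exactly this kind of segment ID pattern:

```python
# Segment IDs (token_type_ids) for a sentence pair (assumes 'bert-base-uncased').
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pair = tokenizer("Love this dress", "Does it come in blue?")

print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# ['[CLS]', 'love', 'this', 'dress', '[SEP]', 'does', 'it', 'come', 'in', 'blue', '?', '[SEP]']
print(pair["token_type_ids"])
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1] -- 0 for the first segment, 1 for the second
```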

note: number_of_tokens = length_of_sequence
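
Putting it all together, this is why summing the three ID vectors still gives 768 dimensions: each ID is first used to look up a 768-dimensional row of an embedding table, so the element-wise sum happens between three (1, 4, 768) tensors, not between the (1, 4) ID arrays themselves. A toy sketch (the table sizes match bert-base, but these freshly initialized layers are illustrative stand-ins for BERT’s internal embedding layers):

```python
# Toy illustration of how the three (1, 4) ID vectors become a (1, 4, 768) input.
import torch
import torch.nn as nn

vocab_size, max_positions, num_segments, hidden_dim = 30522, 512, 2, 768

token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_positions, hidden_dim)
segment_embedding = nn.Embedding(num_segments, hidden_dim)

token_ids = torch.tensor([[101, 2293, 2023, 4377]])  # shape (1, 4)
position_ids = torch.tensor([[0, 1, 2, 3]])          # shape (1, 4)
segment_ids = torch.tensor([[0, 0, 0, 0]])           # shape (1, 4)

# Each lookup returns (1, 4, 768); the element-wise sum keeps that shape.
bert_input = (token_embedding(token_ids)
              + position_embedding(position_ids)
              + segment_embedding(segment_ids))
print(bert_input.shape)  # torch.Size([1, 4, 768])
```

With a longer phrase or a sentence pair, only the middle dimension (length_of_sequence) grows; the 768 stays fixed, and the batch dimension stays at however many sequences you pass in together.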

I hope I have answered all of your questions. Let me know if I missed something.

Muhammad

Muhammad,

Thank you for the very thorough and helpful response, especially the explanation of the dimensions (i.e., batch_size, length_of_sequence, embedding dimension). I’ll follow up if I have more questions.

Andrew