This code is slightly odd.
This code uses a Hugging Face dataset. If you look at a line above, you see
train_ds.set_format(type='tf', columns=columns_to_return)
By specifying type='tf', both 'input_ids' and 'attention_mask' are already tf.Tensor objects, so calling "to_tensor" does not make sense at all. This code may work on a particular version or combination of packages, but it may not work in the majority of environments.
Here is a fix.
This code assumes a ragged tensor (tf.RaggedTensor), which can hold records of varying length. But if you look at 'input_ids' and 'attention_mask', both have a fixed length, as shown below. (You can also see that this is already a tf.Tensor.)
train_ds['input_ids']
<tf.Tensor: shape=(1000, 26), dtype=int64, numpy=
array([[ 101, 1996, 2436, ..., 3829, 1029, 102],
[ 101, 1996, 3829, ..., 1997, 1029, 102],
[ 101, 1996, 3871, ..., 3871, 1029, 102],
...,
[ 101, 1996, 6797, ..., 5010, 1029, 102],
[ 101, 1996, 6797, ..., 5723, 1029, 102],
[ 101, 1996, 3829, ..., 1997, 1029, 102]])>
So what we need to do is zero-padding, so that the length of each row equals "tokenizer.model_max_length".
# The original line fails because train_ds[x] is already a dense tf.Tensor:
#train_features = {x: train_ds[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'attention_mask']}
# Zero-pad the sequence (second) dimension up to tokenizer.model_max_length.
padding = tf.constant([[0, 0], [0, tokenizer.model_max_length - train_ds['input_ids'].shape[1]]])
train_features = {x: tf.pad(train_ds[x], padding) for x in ['input_ids', 'attention_mask']}
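As a quick sanity check (a sketch; the shapes assume the 1000-row sample above and a BERT-style tokenizer whose model_max_length is 512):

print(train_features['input_ids'].shape)        # expected: (1000, 512)
print(train_features['attention_mask'].shape)   # expected: (1000, 512)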
Then we can create the same "train_features" as we created on the Coursera platform.
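For reference, here is a hypothetical sketch of how such features are typically fed to a Keras model; train_labels is an assumption (it is not in the original code) and would come from another column of the dataset:

# Hypothetical usage sketch: wrap the padded features and an assumed label
# tensor into a tf.data.Dataset that can be passed to model.fit.
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels)).batch(16)
# model.fit(train_tf_dataset, epochs=3)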
Finally, I was able to get the same result.
Then the next question is: do we really need the padding?
Usually, the Transformer model handles this internally. So let me try simply removing "to_tensor" and the zero padding:
# No extra padding: keep the fixed length produced by the tokenizer.
train_features = {x: train_ds[x] for x in ['input_ids', 'attention_mask']}
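As a quick check (a sketch; shapes follow the 1000-row sample above), both columns keep the tokenizer's fixed length and still match each other, which is what the model needs:

print(train_features['input_ids'].shape)        # (1000, 26)
print(train_features['attention_mask'].shape)   # (1000, 26)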
Then I actually got a very similar result.
Of course, a quick test to get "Garden" also worked.
Anyway, I think the first implementation, which adds zero padding, should be the first thing to try; the second one is optional.
Hope this helps.