How to tokenize data for NER

As I understand it, a word in the input sentence can sometimes be represented by several token IDs after tokenization. But I have a fixed number of labels, which depends on the spaces between the words. So the tokenized sentences and the labels can have different shapes before padding. And, as far as I understand, that can lead to nasty situations after padding, such as a padding label ([PAD]) ending up matched to a word that was actually labelled by a human.
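
To make it concrete, one way I thought about handling it is to align the labels to the token pieces, roughly like this sketch (the helper and its names are mine, not from any library):

def align_labels_to_pieces(word_labels, pieces_per_word, pad_label="[PAD]"):
    # word_labels: one label per whitespace-separated word
    # pieces_per_word: how many tokens the tokenizer produced for each word
    aligned = []
    for label, n_pieces in zip(word_labels, pieces_per_word):
        if n_pieces == 0:
            continue  # the tokenizer dropped the word entirely
        aligned.append(label)  # the first piece keeps the real label
        aligned.extend([pad_label] * (n_pieces - 1))  # extra pieces get the pad label
    return aligned

print(align_labels_to_pieces(["N", "N", "N", "N", "N"], [1, 1, 0, 2, 1]))
# ['N', 'N', 'N', '[PAD]', 'N'] -- now the same length as the token sequence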

So, should I think about this at all?

Also, I would like to know how to prevent shape mismatches. For example, I call
model.fit(train_sentences_tags_zipped.padded_batch(64), validation_data=val_sentences_tags_zipped.padded_batch(64), epochs=3)

And after 2000 training steps I get an error like:

InvalidArgumentError: Graph execution error:
...
Node: 'Equal'
required broadcastable shapes
	 [[{{node Equal}}]] [Op:__inference_train_function_900677]

Should there be some kind of anomaly-detection mechanism for cases like this?
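
For now, the only thing I can think of is a manual check like this before batching (a rough sketch; train_sentences_vec and train_tags_vec are the per-example datasets from my pipeline, and I'm assuming each element is a 1-D tensor):

import tensorflow as tf

for i, (x, y) in enumerate(tf.data.Dataset.zip((train_sentences_vec, train_tags_vec))):
    if int(tf.shape(x)[0]) != int(tf.shape(y)[0]):
        print("mismatch at example", i,
              "- tokens:", int(tf.shape(x)[0]), "tags:", int(tf.shape(y)[0]))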

So, the error was exactly what I was afraid of. I had an input like
Trüb GmbH & Co. OHG

The labeller produced something like
N N N N N

But the TextVectorization tokenizer turned the sentence into
[ 1, 1569, 1072, 3119],

And tags like
[3, 3, 3, 3, 3],
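
If I understand the TextVectorization docs correctly, the default standardize='lower_and_strip_punctuation' is what eats the &, so a sketch like this (not yet tested on my full pipeline) should keep the token count equal to the label count:

import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(standardize=None, split="whitespace")
vectorizer.adapt(["Trüb GmbH & Co. OHG"])
print(vectorizer(["Trüb GmbH & Co. OHG"]))  # 5 ids now, one per whitespace-separated word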

Also, I found very strange behavior of tf.data.Dataset.zip(train_sentences_vec, train_tags_vec).padded_batch(64).

It looks like it doesn't take the maximum length across both X and Y. It pads each X row against the other X rows in the batch, and each Y row against the other Y rows, but it never pads X against Y.

For this reason I had batches like

(<tf.Tensor: shape=(64, 4), dtype=int64, numpy=
  array([[    1,     0,     0,     0],
         [    1,     0,     0,     0],
         [ 1953,     0,     0,     0],
...
 [    1,     0,     0,     0]])>,
  <tf.Tensor: shape=(64, 5), dtype=int64, numpy=
  array([[1, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [5, 5, 0, 0, 0],
...])

So, the inputs and the tags do not have the same shape after .padded_batch, because the & symbol was simply dropped by the tokenizer.
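
One workaround I'm considering (not sure it's the right approach, and it throws away labelled rows) is to simply drop the mismatched pairs before batching, again assuming each element is a 1-D tensor:

import tensorflow as tf

matched = tf.data.Dataset.zip((train_sentences_vec, train_tags_vec)).filter(
    lambda x, y: tf.shape(x)[0] == tf.shape(y)[0])
batches = matched.padded_batch(64)  # x and y now have equal lengths per example, so the padded shapes agree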

So, my main question still stands: is there any way to pad sequences without defining exact shapes via the padded_shapes parameter, while still keeping batches of different shapes for the inputs?
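
In other words, I'd like something like this sketch to just work, with each batch padded only to its own maximum length (val_sentences_vec and val_tags_vec are just placeholder names for my validation datasets, and I'm assuming the per-example token and tag lengths already agree):

train_batches = tf.data.Dataset.zip((train_sentences_vec, train_tags_vec)).padded_batch(64)
val_batches = tf.data.Dataset.zip((val_sentences_vec, val_tags_vec)).padded_batch(64)
model.fit(train_batches, validation_data=val_batches, epochs=3)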