Something about the way the data is packed seems off to me. I wouldn’t expect an eos_token_id to be followed directly by a bos_token_id and a new sentence. I would expect an eos_token_id to be followed by more eos_token_id tokens until the sequence length is reached. Otherwise, it might train the model to assume that any sentence can follow any other sentence.
Have you had the chance to try this in production to see the impact? By that I mean explicitly inserting a kind of full stop, i.e. filling the rest of the sequence with eos_token_id up to the end. I’m curious because this approach seems more logical to me, but I’m unsure whether it would actually improve the results.
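For concreteness, here is a minimal sketch of the two layouts I mean (the token ids and sentence contents are made up purely for illustration):

```python
eos_token_id = 2
bos_token_id = 1
seq_len = 8

sent_a = [bos_token_id, 11, 12, 13, eos_token_id]  # <bos> ... <eos>
sent_b = [bos_token_id, 21, 22, eos_token_id]

# Current packing: sentences are concatenated, so an <eos> is immediately
# followed by the <bos> of an unrelated sentence.
packed = (sent_a + sent_b)[:seq_len]
# -> [1, 11, 12, 13, 2, 1, 21, 22]

# What I would have expected: one sentence per sequence, with the tail
# filled with eos_token_id up to seq_len, acting as a full stop.
padded = sent_a + [eos_token_id] * (seq_len - len(sent_a))
# -> [1, 11, 12, 13, 2, 2, 2, 2]

print(packed)
print(padded)
```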
By the way, I really enjoyed the course—great job!