I would like to know if it is a good idea to pre-process the raw data, before using BERT to transform it into embeddings. For example, I have raw data like this:
“My name is Paul and this summer, on Friday, I went to the lake and I loved it.”
I think the relevant part of this sentence should be “I went to the lake and I loved it”. But I do not know if BERT has already handled this or not. If not, how could I extract the most relevant part of the text?
First, relevant in what sense? If I asked you your name, I don’t think I would care about the lake.
Second, raw BERT embeddings are, well, trash for this. If you want to cast a whole sentence into a fixed-size vector, you are looking for sentence-transformers.
Third, pre-processing raw data for transformers is hard to define in general and will always depend on the task at hand, plus any cleaning the data itself needs, as long as it's actually "dirty" (containing HTML code, weird formatting, etc.). Otherwise you just lose information.
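For the "dirty data" case, a minimal cleaning sketch using only the standard library (the function name `clean_text` and the specific steps are my own choices; what counts as dirt depends on your corpus):

```python
import html
import re

def clean_text(raw: str) -> str:
    """Strip HTML tags, unescape entities, and normalize whitespace."""
    text = html.unescape(raw)              # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

print(clean_text("<p>Hello &amp; world</p>"))  # Hello & world
```

Deliberately, nothing here removes "irrelevant" words: that kind of semantic filtering is what the model itself should handle, and stripping content before encoding usually just throws information away.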