Hi @Deepti_Prasad
I am getting clarity on where you are taking this. Your approach is not alien, and various versions of it have been tried and tested before attention models became a thing. If you look back at the NLP progress made between the late 90s and the early 2010s, you will see more work done in this direction. So I am amazed that you were able to figure out some parts of it without being aware of the history of how today’s attention models came into existence.
This technique is based on word statistics in a document and, eventually, across a corpus. The last successful work of this kind that I know of was GloVe. I would recommend reading its summary and looking at its references. It evolved from Term Frequency–Inverse Document Frequency (TF-IDF), Word2Vec (a parallel effort) and other concepts. You might learn how statistics based on “line_presence” were deployed.
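If it helps to see the flavor of those word statistics, here is a minimal TF-IDF sketch in plain Python (the toy corpus and the exact weighting choices are mine, just for illustration):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of whitespace-tokenized documents."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps all day",
    "a fox is a wild animal",
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

Words that appear in every document get a weight of zero, which is exactly the “common words carry little signal” intuition behind these statistics.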
What if I told you that this approach is still in production in many systems world-wide? In fact, it used to be the most common way to do it. Although using all the words from a dictionary is not a bad idea, the downside is that it creates a sparse matrix (or call it a table), since only a fraction of those words actually appear in the documents used for training (remember, in those days we could not have huge training datasets like we do now, so the word matrix gets sparse). Such a sparse matrix is expensive when computing gradients (earlier, the inverse of the Hessian was computed; you can look that up). Regardless, using POS and other grammar concepts as features with an “attention model” is quite well known. Earlier we used CNNs as attention mechanisms (although we didn’t call them attention), coupled with such features.
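To make the sparsity point concrete, here is a toy bag-of-words sketch (the vocabulary and documents are made up) showing how most cells of a full-dictionary word matrix stay at zero:

```python
# Toy illustration of why a full-dictionary word matrix gets sparse:
# each document only uses a small fraction of the vocabulary.
vocabulary = sorted({
    "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
    "sleeps", "all", "day", "wild", "animal", "cat", "runs", "fast",
})
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps all day",
]
word_to_col = {word: i for i, word in enumerate(vocabulary)}

matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for token in doc.split():
        row[word_to_col[token]] += 1
    matrix.append(row)

nonzero = sum(1 for row in matrix for cell in row if cell)
total = len(matrix) * len(vocabulary)
print(f"{nonzero}/{total} cells are non-zero "
      f"({100 * nonzero / total:.0f}% dense)")  # the rest is zeros
```

With a real dictionary of tens of thousands of words and short documents, the fraction of non-zero cells becomes tiny, which is what made the old training procedures so expensive.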
After tokenizing (either splitting text on spaces and newlines, or going crazy with something like BPE), you may vectorize the tokens based on various grammatical features (like you mentioned) of their lemma or base word. Many tokens can end up with the same features if you are not careful about adding features that make them unique. An example:
“The quick brown fox jumps over the lazy dog.” Here the token “brown” can get an embedding based on the facts that it is an ADJ (adjective), the 3rd token in its sentence, 5 characters long, has no prefix, no suffix, no root (it is its own), and its shape is ‘ccccc’ (c means a character), etc. Now use a hashmap for each feature to assign a number to each feature value. Say N (noun) is 0, V (verb) is 1, … ADJ (adjective) is 5. Similarly, assuming ‘no’ is given 0 and ‘ccccc’ is given 6234, the concatenated vector for “brown” in the above sentence becomes 5350006234… You can also append the vectors of its previous 2 tokens (The, quick) and its next 2 tokens (fox, jumps) after this vector. Now your complete vector for “brown” is ready! NOTE: how many tokens before and after to append to the original vector is a parameter you will have to play with.
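Here is a rough sketch of that hand-crafted vectorization. The feature maps, the IDs (like 6234) and the helper names are mine, only to mirror the “brown” example; in practice you would append the neighbours’ own feature vectors, not just list them:

```python
# Hypothetical feature maps, mirroring the example above.
POS_IDS   = {"N": 0, "V": 1, "ADJ": 5}     # part-of-speech tag -> id
SHAPE_IDS = {"no": 0, "ccccc": 6234}       # prefix/suffix/root/shape value -> id

def token_shape(token):
    """'brown' -> 'ccccc' (every alphabetic character becomes 'c')."""
    return "".join("c" if ch.isalpha() else ch for ch in token)

def token_vector(tokens, index, pos_tag, window=2):
    """Concatenate hand-crafted features for tokens[index] as a digit string."""
    token = tokens[index]
    features = [
        POS_IDS[pos_tag],               # grammatical role
        index + 1,                      # position in the sentence (1-based)
        len(token),                     # character length
        SHAPE_IDS["no"],                # no prefix
        SHAPE_IDS["no"],                # no suffix
        SHAPE_IDS["no"],                # no root other than itself
        SHAPE_IDS[token_shape(token)],  # word shape
    ]
    vector = "".join(str(f) for f in features)
    # The context window: these neighbours would get the same treatment
    # and their vectors would be appended after `vector`.
    left  = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return vector, left, right

tokens = "The quick brown fox jumps over the lazy dog .".split()
vec, left, right = token_vector(tokens, index=2, pos_tag="ADJ")
print(vec)           # -> 5350006234
print(left, right)   # -> ['The', 'quick'] ['fox', 'jumps']
```

The `window=2` argument is exactly the “how many tokens before and after” parameter mentioned in the note above.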
But you can quickly see how fixed (not very dynamic in its logic) this embedding space has become. Remember that languages also evolve over time (unlike images): the meaning and grammar of the same word in 2020s text have changed compared to the 90s. Of course, work has been done to make time-aware embeddings (link).
It can be, and it was, as I just mentioned above. We have since moved on to grammar-agnostic models, which can do a variety of different tasks, like NER, POS tagging and classification (BERTs, encoder-heavy). Luckily, the same architectures are now used as text generators (GPTs, decoder-heavy) and translators.
When you say that the model “breaks the question…”, you are inherently doing tokenization.
So we are not going independent of tokens anytime soon (or perhaps ever).
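Just to make that concrete: even splitting on spaces is already tokenization, and a subword scheme like BPE only breaks the pieces further. A minimal greedy subword split over a made-up vocabulary (not a trained BPE merge table) looks like this:

```python
def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match subword split over a toy vocabulary."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            # Take the longest known prefix; fall back to a single character.
            if word[:end] in vocab or end == 1:
                pieces.append(word[:end])
                word = word[end:]
                break
    return pieces

# Toy vocabulary; a real BPE vocabulary is learned from corpus statistics.
vocab = {"what", "is", "atten", "tion", "?"}

question = "what is attention ?"
tokens = [p for w in question.split() for p in greedy_subword_tokenize(w, vocab)]
print(tokens)   # -> ['what', 'is', 'atten', 'tion', '?']
```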
To understand NLP in general, I would suggest starting with one task, let’s say NER, and reading about its history and current progress. The NLP field evolved from very task-specific architectures to task-agnostic architectures (transformers). Reading surveys is a good source of knowledge:
https://arxiv.org/abs/2101.11420
https://arxiv.org/abs/1901.09069
Regards,
Jay