Can an alphabetical vectorisation be created as a base model for word prediction, like a dictionary of any word?

From what I have understood till now, words are vectorised, then tokenised, then passed through models, using attention models or transformers, to create a language model.

I was wondering, or rather have a doubt: why can’t the alphabet a-z plus A-Z be vectorised and then fed with examples of any language, say English or French, and a transformer model then be created to generate any kind of words or sentences using an encoder-decoder model, where the attention mechanism would include this language-specific model, letting the decoder output the required language model?

I am asking this question to the NLP mentors, but would welcome anyone’s input. I know the variation in probability needed to create such a tokenisation might go to infinity and the model could collapse, but could such a tokenisation, combined with next-word prediction, be created at scale?

I don’t know if this thought is unrealistic, but it came to mind, so I wanted other views on it.

Regards
DP

Hi, Deepti.

I’m not an NLP guru, so not sure if I really understand your question, but are you suggesting a replacement for the word embedding models here? My understanding of how all the NLP models involving RNNs and Attention in DLS C5 work is that they start from a learned “word embedding” as the foundation. Then all the learning of the Transformer model itself just takes those word embeddings as input values. The training of word embedding models is a completely separate step and Prof Ng devotes week 2 of DLS C5 to covering various ways to train word embeddings and also how to use them. I haven’t taken the NLP Specialization beyond C1, so I’m not sure what they say on that topic here, but they must cover it at some level.

Of course, as I said at the beginning, I’m probably completely missing your real point here. :grinning:

Regards,
Paul

Hi Deepti! This is a good question and has always been an active area of research.
If I understood you correctly (and if I try to rephrase your question): what will happen if we have a tokenizer with character-level tokens (a-z, A-Z, and other non-English chars) but trained on full-sentence examples, and this tokenizer is then used to train LMs?

My answer:
The research effort has always been to encode as much text as possible with as few tokens as possible (BPE is widely used now). But why is that? Why not assign each character of every language its own separate token? The answer lies in the fact that attention models (and hence transformers) need to attend over these tokens to understand the context. If each word were split into as many tokens as it has characters, this would bloat up the attention model’s context window, and it would attend to fewer words at a time. Having fewer tokens per word also makes sense given how different languages are constructed and how our brains attend to sentences at the word or sub-sentence level rather than the character level.
For example (https://platform.openai.com/tokenizer):


For the same sentence in English and Chinese, the tokenizer split the text into tokens in different ways. Chinese was given tokens at the character level rather than at the sub-word level like English. Perhaps this is the result of using character-level vectorization and training on full-sentence examples. Tokenizers like this make the LM perform poorly, since the same architecture (size-wise) attends to less content for Chinese than for English.
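
To make the contrast concrete, here is a minimal sketch in plain Python (not the OpenAI tokenizer linked above) of how many positions attention has to cover under character-level versus word-level tokenization:

```python
# Compare how many tokens one sentence produces under character-level
# vs. simple word-level (whitespace) tokenization.
sentence = "The quick brown fox jumps over the lazy dog."

char_tokens = list(sentence)    # one token per character
word_tokens = sentence.split()  # one token per whitespace-separated word

print(len(char_tokens))  # 44 tokens -> attention must span far more positions
print(len(word_tokens))  # 9 tokens  -> same content in far fewer positions
```

A fixed context window therefore “sees” far less text when every character is its own token.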

Hello @paulinpaloalto and @jyadav202

Thank you for your response.

I think I should have phrased the question more clearly.

What I meant was the alphabet a-z and the capitals A-Z being calculated as multidimensional vectors,

then feeding the word embedding with an Oxford dictionary that has all the words from a to z, and then creating an attention model which has features of parts of speech and/or grammatical vocabulary, so that the decoder can produce any kind of sentence based on the query asked.

The basic issue, I think, would be a token vocabulary of infinite size, which would cause the model to collapse, or, as the instructor said about long sequences, such models fail.

@jyadav202 this one has a translation model. I am trying to describe a model which creates a dictionary of its own in English, and then that base model could be used universally in any language model.

Regards
DP

I am not sure I follow your question completely. Can you break it down for me in terms of, say, a seq-to-seq translation task or any other task of your liking? I might be able to relate it to existing models and give my opinions.
Questions that popped up in my head were:

  1. How do you construct the tokenizer?
  2. What is the input at the various stages: pre/post-processed, to the encoder/decoder?
  3. In “…attention model which has features of parts of speech…”, how are these features either learnt or fed in? Statistical models like HMMs or CRFs were fed hand-engineered features like these (see the sketch below) to understand the syntactic nuances of a language. Now we have semantic models like attention models that (thanks to scaling laws) can learn the syntax of a language inherently in their hidden layers.

I am not convinced that the basic issue is the huge token vocabulary in the first place. The basic issue for me is the overall architectural design, because each component has an important role to play.
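
For point 3, here is a hedged sketch of what that older, feature-engineered pipeline could look like; the feature set, the toy sentence and labels, and the choice of sklearn-crfsuite are my own illustrative assumptions, not a prescribed recipe:

```python
# Hand-engineered, per-token feature dicts fed to a CRF sequence tagger
# (the classic pre-attention setup). Toy data only; a real setup would
# train on a full annotated corpus.
import sklearn_crfsuite  # pip install sklearn-crfsuite

def word2features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],                                    # crude morphology feature
        "is_capitalized": word[0].isupper(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# One toy training sentence with POS labels.
sentence = ["The", "quick", "brown", "fox", "jumps"]
labels = ["DET", "ADJ", "ADJ", "NOUN", "VERB"]

X_train = [[word2features(sentence, i) for i in range(len(sentence))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # predicted POS tags for the same toy sentence
```

The point is that every “feature of parts of speech” had to be designed by hand here, whereas an attention model learns such regularities from the raw tokens.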

Your question is a good thought experiment, and I would like to explore it more if you would as well.


Hey arvy @arvyzukai

As you are back and I can annoy you with my doubts, what do you have to say on this topic? Looking forward to your views too.

Regards
DP

Hello @jyadav202

This is the part I also have a query about, and that is why I am asking. I had two approaches in mind: either tokenize the words from the Oxford dictionary based on their line_presence, or, if an attention model over parts of speech and other grammatical significance is used, create the tokenisation from the words of the Oxford dictionary together with that attention model, since the dictionary already mentions which part of speech a word is, for example whether it is a noun, pronoun, adjective, or has some other grammatical significance. But I still don’t know how to put this idea into tokenised form, so I wanted to ask in the forum here.

If you ask me, it would actually be kind of two models working together, or in parallel, to encode or decode the sequence of words.

Like the semantic model you mention, sentence structure could be used for the attention model, where the significance of sentence structure is compatible with grammatical significance, as well as checking whether the model can cope when asked “Why was the Taj Mahal built?”: the model would break the question into “Taj Mahal” being a noun and “built” being understood as past tense because of the use of “was”, and then work out how to respond. This is just an example; I am pretty sure I am still missing some important part to crack for the model to be independent of tokenised words and still stay close to the sentence asked and answered.

I asked this thought here, so others could also put forth their thoughts.

Thank you Jayant, but I am pretty sure arvy will have a quirky response to this :rofl: which I am looking forward to :sweat_smile:

Regards
DP

Hi @Deepti_Prasad

I’m not exactly sure what you are asking about. I would highly recommend watching “Let’s build the GPT Tokenizer” by Andrej Karpathy.

I’m not sure if this is a typo, but it’s the other way around: text is converted to tokens, and tokens have their own representations in the embedding table.
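
As a minimal sketch of that pipeline (the toy vocabulary, the split-on-spaces tokenizer, and the NumPy numbers are made up for illustration):

```python
# text -> tokens -> token ids -> rows of an embedding table
import numpy as np

vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}  # toy tokenizer vocabulary
embedding_table = np.random.randn(len(vocab), 8)      # one 8-dim vector per token id

text = "the quick brown fox"
token_ids = [vocab[tok] for tok in text.split()]      # toy tokenisation: split on spaces
embeddings = embedding_table[token_ids]               # embedding lookup

print(token_ids)         # [0, 1, 2, 3]
print(embeddings.shape)  # (4, 8) -> one vector per token, fed into the model
```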

Cheers

Hi @Deepti_Prasad
I am getting clarity on where you are taking this. Your approach is not alien, and various versions of it were tried and tested before attention models became a thing. If you look back at the NLP progress made between the late 90s and the early 2010s, you will see more work done in this direction. So I am amazed that you were able to figure out some parts of it without being aware of the history of how today’s attention models came into existence.

This technique is based on word statistics in a document and eventually in a corpus. The last successful work of this kind that I know of was GloVe. I would recommend reading its summary and looking at its references. It evolved from Term Frequency, Inverse Document Frequency, Word2Vec (a parallel effort) and other concepts. You might learn how statistics based on “line_presence” were deployed.

What if I told you that this approach is still in production in many systems world-wide? In fact, it was the most common way to do it. Although using all the words from a dictionary is not a bad idea, the downside is that it creates a sparse matrix (or call it a table) over the words actually used in the training documents (remember, in those days we could not have huge training datasets like we do now, so the word matrix gets sparse). Such a sparse matrix is expensive when computing gradients (earlier, the inverse of the Hessian was computed; you can look that up). Regardless, using POS and other grammar concepts as features with an “attention model” is quite well known. Earlier we used CNNs as attention mechanisms (although we didn’t call them attention), coupled with such features.
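
As a rough illustration of the count-based, sparse-matrix side of this (the two toy documents are made up, and scikit-learn’s TfidfVectorizer is just one convenient stand-in for the TF-IDF idea):

```python
# Term-frequency / inverse-document-frequency over a tiny corpus.
# Note that the result is a sparse matrix: most vocabulary words do not
# appear in most documents, which is exactly the sparsity issue above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy dog sleeps all day.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix, shape (n_docs, vocab_size)

print(X.shape)
print(vectorizer.get_feature_names_out())
```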

After tokenizing (either splitting text on spaces and newlines, or going crazy like BPE), you may vectorize the tokens based on the various grammatical features (like you mentioned) of their lemma or base word. Many tokens can end up with the same features if you are not careful about adding features that make them unique. An example:
In “The quick brown fox jumps over the lazy dog.”, the token “brown” can have an embedding based on the facts that it is ADJ (adjective), the 3rd token in its sentence, 5 characters long, has no prefix, no suffix, no root (it is its own), shape ‘ccccc’ (c means a character), etc. Now use a hashmap for each feature to assign it a number. Say N (noun) is 0, V (verb) is 1, …, ADJ (adjective) is 5; similarly, assume ‘no’ is given 0 and ‘ccccc’ is given 6234. Then the concatenated vector for “brown” in the above sentence becomes 5350006234. You can also append the vectors of its previous 2 tokens (The, quick) and its next 2 tokens (fox, jumps) after this vector. Now your complete vector for “brown” is ready! NOTE: how many tokens before and after to append to the original vector is a parameter you will have to play with.
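
A small sketch of that “brown” vector, where the hashmaps and the digit-concatenation scheme are only illustrative (exactly as in the description above, not a standard library):

```python
# Build the hand-crafted feature vector for "brown" described above.
pos_ids = {"N": 0, "V": 1, "ADJ": 5}    # part-of-speech -> id
affix_ids = {"no": 0}                   # "no prefix/suffix/root" -> 0
shape_ids = {"ccccc": 6234}             # character-shape string -> id

features = [
    pos_ids["ADJ"],      # adjective -> 5
    3,                   # 3rd token in its sentence
    5,                   # 5 characters long
    affix_ids["no"],     # no prefix -> 0
    affix_ids["no"],     # no suffix -> 0
    affix_ids["no"],     # no root (it is its own) -> 0
    shape_ids["ccccc"],  # shape 'ccccc' -> 6234
]

print("".join(str(f) for f in features))  # "5350006234"
# The vectors of the 2 previous and 2 following tokens would be appended next.
```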
But you can quickly see how fixed (not very dynamic in logic) this embedding space has become. Remember that languages also evolve over time (unlike images). The meaning and grammar of the same word in 2020s text change compared to the 90s. Of course, work has been done to make time-aware embeddings (link).

It can be, and it was, as I just mentioned above. We have moved on to grammar-agnostic models, which can do a variety of different tasks, like NER, POS tagging, and classification (BERTs, encoder-heavy). Luckily, the same architectures are now used as text generators (GPTs, decoder-heavy) and translators.

When you say that the model “breaks the question…”, you are inherently doing tokenization :slight_smile: So we are not going to be independent of tokens anytime soon (or perhaps ever).

To understand NLP in general, I would suggest starting with one task, let’s say NER, and reading about its history and current progress. The NLP field evolved from very task-specific architectures to task-agnostic architectures (transformers). Reading surveys is a good source of knowledge:

https://arxiv.org/abs/2101.11420
https://arxiv.org/abs/1901.09069

Regards,
Jay


Hello jayant @jyadav202

That was an interesting read.

This thought came to me in the last week of the last course of the NLP specialisation, while listening to the instructor, who kept mentioning the tokenisation vocabulary-size issue. I did hear about GloVe earlier, though I don’t remember in what context, but the links you have sent look great.

OK, I somehow already felt that this must have been thought of before, since when language models were being considered in the earlier days, the basis of any language was its grammar and its characteristics in framing a sentence.

You seem to have good knowledge about this at a scholarly level; I hope to get some good reads, books, and articles on the same.

Thank you again for those links you shared.

Regards
DP

I am glad I could help. It was good to have a discussion on this though. :slight_smile: