Use case: An AI model that can translate English text to Japanese text.
I’ve thought of 2 ways to train an LLM for the above use case. I’m curious whether the approaches I’ve thought of are correct, or whether there are any issues with them.
Approach 1:
- Train an AI model from scratch on a huge corpus of English and Japanese text, so that the model can learn the tokens and the basics of both languages - in simple terms, so the model can develop a vocabulary for the required languages.
- Then fine-tune the base model for the specific task of translating a given English text to Japanese. Instruction fine-tuning should require far less data if the model already knows the languages. (A rough sketch of what I mean is below.)
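To make Approach 1 concrete, here is a minimal sketch of the two stages using Hugging Face `transformers`/`datasets`. It's illustrative only, not a production recipe: the file names `en_ja_corpus.txt` and `translation_pairs.json` (with `english`/`japanese` fields) are hypothetical, the GPT-2 tokenizer is just a placeholder (a real bilingual tokenizer trained on the corpus would handle Japanese far better), and a fuller setup would mask the prompt tokens out of the loss during instruction fine-tuning.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# --- Stage 1: pretrain a small causal LM from scratch on raw EN + JA text ---
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; ideally train a bilingual tokenizer
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("text", data_files={"train": "en_ja_corpus.txt"})  # hypothetical raw corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

model = GPT2LMHeadModel(GPT2Config(vocab_size=len(tokenizer)))  # small model, random init
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="base_model", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
).train()
model.save_pretrained("base_model")

# --- Stage 2: instruction fine-tune the base model on translation prompts ---
pairs = load_dataset("json", data_files={"train": "translation_pairs.json"})  # hypothetical paired data

def to_prompt(batch):
    texts = [
        f"Translate English to Japanese:\n{en}\n### Japanese:\n{ja}{tokenizer.eos_token}"
        for en, ja in zip(batch["english"], batch["japanese"])
    ]
    return tokenizer(texts, truncation=True, max_length=512)

sft = pairs["train"].map(to_prompt, batched=True, remove_columns=pairs["train"].column_names)

Trainer(
    model=GPT2LMHeadModel.from_pretrained("base_model"),
    args=TrainingArguments(output_dir="translator", num_train_epochs=3),
    train_dataset=sft,
    data_collator=collator,
).train()
```

The point of the sketch is the split: stage 1 only sees raw monolingual text in both languages, and stage 2 only needs a comparatively small set of prompt/response translation pairs.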
Approach 2:
- A model can be trained directly on a dataset of English texts and their corresponding Japanese translations. The drawback (I think) is that this needs a huge amount of paired data, and getting that data might be challenging.
Whereas in Approach 1, getting a separate corpus for each language is simpler than getting the paired data mentioned in Approach 2. (A sketch of this approach is also below.)
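For comparison, here is a minimal sketch of Approach 2: a randomly initialised encoder-decoder (seq2seq) model trained only on paired data, with no separate pretraining stage. Again, the `translation_pairs.json` file with `english`/`japanese` fields is a hypothetical placeholder, and the mT5 tokenizer is borrowed purely because it already covers Japanese.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, T5Config, T5ForConditionalGeneration,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

# Multilingual tokenizer reused only for its vocabulary (covers Japanese).
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

pairs = load_dataset("json", data_files={"train": "translation_pairs.json"})  # hypothetical paired data

def preprocess(batch):
    model_inputs = tokenizer(batch["english"], truncation=True, max_length=256)
    labels = tokenizer(text_target=batch["japanese"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = pairs["train"].map(preprocess, batched=True,
                              remove_columns=pairs["train"].column_names)

# Randomly initialised encoder-decoder: no pretraining, only the paired data.
model = T5ForConditionalGeneration(T5Config(vocab_size=len(tokenizer)))

Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="direct_translator", num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
).train()
```

Because the model starts with no knowledge of either language, everything (vocabulary, grammar, and the translation mapping) has to come from the parallel pairs, which is exactly why this approach tends to need so much paired data.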
I’m new to Gen AI & had this thought. Would love to hear your thoughts on my thinking process.
Cheers!