Curious about a thought process around developing an AI model (Translation)

Use case: An AI model that can translate English text to Japanese text.

I’ve thought of two ways to train an LLM for the above use case. I’m curious whether these approaches are correct, or whether there are any issues with them.

Approach 1:

  • Train an AI model from scratch on a huge corpus of English and Japanese text, so that the model can learn the tokens and the basics of each language. In simple terms, the model develops a vocabulary for the required languages.

  • Then the base model can be fine-tuned for the specific task of translating a given English text to Japanese, since instruction fine-tuning requires far less data if the model already knows the languages (see the sketch below).
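To make that second step concrete, here is a minimal sketch of the fine-tuning stage using the Hugging Face `transformers` and `datasets` libraries. The checkpoint name `my-bilingual-base-model` is a placeholder for whatever comes out of your pretraining run, and the two translation pairs stand in for a real instruction dataset:

```python
# Minimal sketch of Approach 1's fine-tuning step, assuming a causal LM
# already pretrained on English and Japanese text.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "my-bilingual-base-model"  # hypothetical pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
if tokenizer.pad_token is None:  # GPT-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

# A couple of instruction-style pairs; a real run would use thousands,
# but still far fewer examples than pretraining needs.
pairs = [
    {"en": "Good morning.", "ja": "おはようございます。"},
    {"en": "Where is the station?", "ja": "駅はどこですか。"},
]

def to_example(pair):
    # Format each pair as a prompt followed by the expected translation.
    text = (
        "Translate English to Japanese.\n"
        f"English: {pair['en']}\nJapanese: {pair['ja']}"
    )
    return tokenizer(text, truncation=True, max_length=256)

dataset = Dataset.from_list(pairs).map(to_example, remove_columns=["en", "ja"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-translate", num_train_epochs=3),
    train_dataset=dataset,
    # mlm=False gives plain next-token prediction, with inputs copied to labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point is just the shape of the pipeline: the heavy lifting already happened during pretraining, and this stage only teaches the model the translation prompt format.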

Approach 2:

  • A model can be trained directly on a dataset of English texts and their corresponding Japanese translations. The drawback, I think, is that you have to feed it a huge amount of data, and getting that paired data might be challenging.

Meanwhile, in Approach 1, getting a monolingual corpus for each language is simpler than getting the paired data mentioned in Approach 2. (A sketch of what Approach 2’s training could look like follows below.)
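For comparison, here is a rough sketch of Approach 2: a sequence-to-sequence Transformer initialized with random weights and trained directly on parallel pairs. The mT5 tokenizer is borrowed purely for its English and Japanese vocabulary; a genuine from-scratch setup would train its own tokenizer, and all the model sizes here are arbitrary small values for illustration:

```python
# Minimal sketch of Approach 2: a seq2seq Transformer trained from scratch
# (random initialization) on English-Japanese sentence pairs.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5Config,
    T5ForConditionalGeneration,
)

# Borrowed tokenizer that covers both languages; a true from-scratch
# pipeline would train its own vocabulary on the raw corpora instead.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# Small randomly initialized model; no pretraining involved.
config = T5Config(
    vocab_size=len(tokenizer),
    d_model=256,
    d_ff=1024,
    num_layers=4,
    num_heads=4,
)
model = T5ForConditionalGeneration(config)

pairs = [
    {"en": "Good morning.", "ja": "おはようございます。"},
    {"en": "Where is the station?", "ja": "駅はどこですか。"},
]

def to_example(pair):
    # English goes in as the source; Japanese becomes the target labels.
    inputs = tokenizer(pair["en"], truncation=True, max_length=128)
    labels = tokenizer(text_target=pair["ja"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

dataset = Dataset.from_list(pairs).map(to_example, remove_columns=["en", "ja"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="scratch-translate", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

With only two pairs this obviously learns nothing; the sketch just shows the pipeline shape, and it makes the data problem visible: every bit of training signal here has to come from paired sentences, which is exactly the drawback noted above.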

I’m new to Gen AI and had this thought. Would love to hear your thoughts on my thinking process.

Cheers!


This is what’s done for translation between languages, so the model learns the corresponding patterns!