I’m trying to better understand the results of the Chinchilla paper. As stated, it was compute-optimally trained on 1.4T tokens. How does that compare to other LLMs? For example, it was covered in the same week’s material that the GPT and T5 models tokenize text differently and use different special tokens; are they still compared the same way, simply by the number of tokens (see the quick check below)? Also, is the 1.4T figure an optimal estimate for English specifically, or more of a general estimate for language understanding, if that makes sense? For example, if I wanted to train a model that processes text in, say, Polish, would the compute-optimal number of tokens still be 1.4T? Thanks!
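
To make the tokenizer part of my question concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 and t5-small checkpoints (my own choice of models, not something from the assignment), showing that the same sentence gets split into a different number of tokens:

```python
from transformers import AutoTokenizer

# Two tokenizers with different schemes: GPT-2 uses byte-level BPE,
# while T5 uses a SentencePiece vocabulary.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
t5_tok = AutoTokenizer.from_pretrained("t5-small")

text = "Chinchilla was trained compute-optimally on 1.4T tokens."

# Encode the same sentence with both tokenizers and compare token counts.
gpt2_ids = gpt2_tok.encode(text)
t5_ids = t5_tok.encode(text)  # T5 appends its </s> special token by default

print(len(gpt2_ids), gpt2_tok.convert_ids_to_tokens(gpt2_ids))
print(len(t5_ids), t5_tok.convert_ids_to_tokens(t5_ids))
```

The counts (and the special tokens that get added) come out different for the two models, which is why I’m unsure whether “1.4T tokens” corresponds to the same amount of text for every model.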