How to discover the architecture of GPT-3

I want to take a moment to appreciate how elegantly we can search for the best architecture for a model. However, for models that take a long time to train, this approach quickly becomes inefficient.
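As a concrete illustration of what such a search involves, here is a minimal random-search sketch in plain Python. The hyperparameter names, ranges, and scoring function are made up for illustration and are not taken from any specific course notebook or model:

```python
import random

# Hypothetical search space -- names and ranges are illustrative only.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "hidden_units": [64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def train_and_evaluate(config):
    """Stand-in for a real training run; returns a validation score.
    In practice this step is where almost all of the time and money goes."""
    # Placeholder heuristic: pretend deeper/wider models with a moderate LR do better.
    return (config["num_layers"] * config["hidden_units"]) / (
        1 + abs(config["learning_rate"] - 1e-3) * 1e3
    )

best_config, best_score = None, float("-inf")
for trial in range(10):  # each trial is one full training run
    config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print("Best config found:", best_config, "score:", best_score)
```

The point is that every trial requires a complete training run, which is exactly what becomes prohibitive at very large scale.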

Recently, I learned about GPT-3, a model with 175 billion parameters. Astonishing!
I've read that a single training run takes about 34 days and costs around $5M.

With that in mind, I suspect that OpenAI did not use the approach we learned this week to find the optimal architecture. Instead, they must have relied on informed assumptions; otherwise it would have cost them far too much money and time.
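A quick back-of-the-envelope calculation makes the point. The per-run figures below are the rough estimates quoted above, and the search budget of 20 trials is an arbitrary assumption for illustration:

```python
# Rough estimates quoted above, not official figures.
cost_per_run_usd = 5_000_000
days_per_run = 34

# Assume a modest search budget of 20 full-scale trials.
trials = 20
print(f"Total cost: ${cost_per_run_usd * trials:,}")                    # $100,000,000
print(f"Total time: {days_per_run * trials} days if run sequentially")  # 680 days
```

Even a small number of full-scale trials would cost on the order of $100M and take years if run one after another.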

Hello @popaqy

The final model would not have been based purely on assumption. Rather, they would have tried many options and combinations before concluding that the final configuration was the best of the lot.

The cost that has been reported is for training this final model only. It would have been good to know how much time and money was spent to arrive at that final model in the first place.