Transformer Summarizer (C4-W2) - How long was the model pretrained to achieve this performance?

In Course 4, Week 2 - Transformer Summarization, a pre-trained model was used for demonstration. The lab ran only 10 steps of training owing to time and resource constraints, and a pretrained model with weights was provided for the demo.

I would like to know how long the training took for the pretrained transformer summarizer used in this lab - size of dataset / epochs / steps / loss achieved / GPU / RAM resources used / overall time taken for full training. I would like to explore / play with this model by training it externally in the cloud to understand this better - can I get this information? (Kindly note that I completed this course a long time back on financial aid.)


I think you would have to research the origin of the pre-trained model, and perhaps find some information about it online.

I'm pretty sure that no one at DLAI created that model specifically for this course.


Hi, thanks a lot for the immediate reply.

I hope we are referring to the same model - just to be sure, here is the relevant excerpt from the assignment notebook, which I have fortunately kept downloaded:

4.1 - Loading in a Trained Model
In this part you will evaluate by loading in an almost exact version of the model you coded, but we trained it for you to save you time. Please run the cell below to load in the model.
As you may have already noticed the model that you trained and the pretrained model share the same overall architecture but they have different values for some of the parameters:
Original (pretrained) model:
```
TransformerLM(vocab_size=33300, d_model=512, d_ff=2048, n_layers=6, n_heads=8,
              dropout=0.1, max_len=4096, ff_activation=tl.Relu)
```
Your model:
```
TransformerLM(d_model=4, d_ff=16, n_layers=1, n_heads=2)
```
Only the parameters shown for your model were changed. The others stayed the same.
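
For reference, this is roughly how I would expect that pretrained configuration to be instantiated, assuming it is the standard `trax.models.TransformerLM` (only a sketch based on the excerpt above; the notebook may define its own class):

```python
import trax
from trax import layers as tl

# Sketch of the pretrained configuration quoted above, assuming the
# standard Trax TransformerLM constructor matches what the notebook uses.
model = trax.models.TransformerLM(
    vocab_size=33300,
    d_model=512,
    d_ff=2048,
    n_layers=6,
    n_heads=8,
    dropout=0.1,
    max_len=4096,
    ff_activation=tl.Relu,
    mode='eval',  # evaluation/decoding mode, as used when loading the weights in the lab
)
```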

Have you checked with the course authors that this toy model - the transformer summarizer - was not created and trained at DLAI for the purpose of this lab?
Assuming it was not created / trained at DLAI: since the assignment notebook only says

```python
# Load the pre-trained weights
model.init_from_file('model.pkl.gz', weights_only=True)
```

and does not give a model name, if we could learn the details about the model, I might search HF / the internet for that model or a similar one.
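
Just to make concrete what I mean by "training it externally": something along these lines is what I have in mind, using the Trax training loop from the assignment. This is only a sketch - `train_stream`, `eval_stream`, the learning rate, the checkpoint directory and the step count are all placeholders, because the real values are exactly what I am asking about:

```python
import trax
from trax import layers as tl
from trax.supervised import training

# Rebuild the pretrained configuration in training mode and load the released weights.
model = trax.models.TransformerLM(
    vocab_size=33300, d_model=512, d_ff=2048, n_layers=6, n_heads=8,
    dropout=0.1, max_len=4096, ff_activation=tl.Relu, mode='train')
model.init_from_file('model.pkl.gz', weights_only=True)

# train_stream / eval_stream: hypothetical generators yielding tokenized
# (input, target, mask) batches, prepared as in the assignment notebook.
train_task = training.TrainTask(
    labeled_data=train_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),  # placeholder learning rate
    n_steps_per_checkpoint=100,
)
eval_task = training.EvalTask(
    labeled_data=eval_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
)

# Assuming Loop keeps the weights already loaded above rather than re-initializing them.
loop = training.Loop(model, train_task, eval_tasks=[eval_task],
                     output_dir='~/summarizer_ckpt')  # placeholder directory
loop.run(n_steps=1000)  # placeholder; the real step count is what I want to know
```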

Thanking you in advance,


Sorry, I do not know the details. Perhaps one of the course staff will stop by with more information.

That's ok, thanks anyway.

My query arises from this context: when I was doing these courses last year, my understanding was that all the capabilities of LLMs came from next-token prediction over an attention-based context. At that time, I did not really believe in the concept of 'emergent behavior', which was said to appear when the dataset size and the scale of training grow to several billion parameters. Later, when ChatGPT came out and other advancements kicked in, I realized that it is far more than just next-token prediction, and I saw the importance of the 'emergent behavior' that gives a consciousness-like ability. Of late, I am wondering how much size and scale are really related to such behavior. I wonder whether even this toy transformer summarizer has more than the summarizing ability that arises from just next-token prediction... So, I thought of getting more details on the model.

I hope the course staff will enlighten us on this...

Thanks