In the week 1 videos mentioned in the title, the explanation of the Zero Redundancy Optimizer (ZeRO) stages does not take into account the memory occupied by activations and intermediate states, even though it can be significant (up to around 8 bytes per parameter).
Could you please explain why it is omitted? Maybe my understanding is limited; please help me with an explanation.
Excellent question! The size of the forward activations depends on many factors, the key ones being sequence length, hidden size, and batch size. They include the inputs and outputs passed to and returned by the forward and backward functions, as well as the activations saved for gradient computation. In the paper discussed in this module, activations, temporary buffers, and fragmented memory are collectively called the residual states. The video focuses on ZeRO-DP, which has three main optimization stages: partitioning of the optimizer states, gradients, and parameters. Another method covered in the paper, ZeRO-R, targets residual memory consumption through activation partitioning and offloading activations to the CPU. The paper also discusses combining these methods; for details, I recommend checking the paper.
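To make the two kinds of memory concrete, here is a rough back-of-the-envelope sketch in Python (my own illustration, not code from the course or the paper). The model-state formulas follow the ZeRO paper's mixed-precision Adam accounting (2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and KΨ bytes of optimizer states with K = 12), and the activation estimate uses the paper's rough approximation of about 12 × hidden × batch × seq_length × layers fp16 elements; the GPT-2 1.5B-like configuration is taken from the paper's activation example.

```python
def zero_dp_model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam,
    following the ZeRO paper: 2 bytes/param fp16 weights + 2 bytes/param
    fp16 gradients + K bytes/param optimizer states (K = 12 for Adam:
    fp32 weights, momentum, and variance)."""
    psi, k, nd = num_params, 12, num_gpus
    if stage == 0:                      # baseline DP: everything replicated
        total = (2 + 2 + k) * psi
    elif stage == 1:                    # P_os: partition optimizer states
        total = (2 + 2) * psi + k * psi / nd
    elif stage == 2:                    # P_os+g: also partition gradients
        total = 2 * psi + (2 + k) * psi / nd
    elif stage == 3:                    # P_os+g+p: also partition parameters
        total = (2 + 2 + k) * psi / nd
    else:
        raise ValueError("stage must be 0-3")
    return total / 1024 ** 3


def activation_gb(batch, seq_len, hidden, layers):
    """Rough fp16 activation footprint without checkpointing, using the
    ZeRO paper's estimate of ~12 * hidden * batch * seq * layers elements."""
    return 12 * hidden * batch * seq_len * layers * 2 / 1024 ** 3


# GPT-2 1.5B-like configuration from the paper's activation example
params, layers, hidden = 1.5e9, 48, 1600
batch, seq_len, gpus = 32, 1024, 64

for stage in range(4):
    print(f"ZeRO-DP stage {stage}: "
          f"{zero_dp_model_state_gb(params, gpus, stage):6.2f} GB/GPU")
print(f"Activations (residual states): "
      f"{activation_gb(batch, seq_len, hidden, layers):6.2f} GB")
```

Running this reproduces the paper's headline numbers: roughly 22 GB of model states per GPU at stage 0 shrinking to about 0.35 GB at stage 3 across 64 GPUs, while the unpartitioned activations come out near 56 GB, close to the ~60 GB the paper quotes. That gap is exactly why the paper treats residual states separately with ZeRO-R.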