Computational challenges of training LLMs

In the week 1 videos referenced in the title, the explanation of the Zero Redundancy Optimizer (ZeRO) stages does not account for the memory occupied by activations and intermediate states, even though that memory can be significant (up to around 8 bytes per parameter).
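For concreteness, here is a minimal back-of-the-envelope sketch (in Python) of the per-parameter memory accounting I have in mind. The byte counts are the approximate figures from the lecture, not exact values for any particular framework, and the function `training_memory_gb` is just an illustrative name:

```python
# Rough per-parameter memory estimate for mixed Adam training, using the
# approximate byte counts from the lecture (assumptions, not exact values):
#   - model weights (FP32):             4 bytes
#   - Adam optimizer states (2x FP32):  8 bytes
#   - gradients (FP32):                 4 bytes
#   - activations / temp buffers:      ~8 bytes (varies with batch size,
#                                       sequence length, and architecture)

def training_memory_gb(num_params: float,
                       weight_bytes: int = 4,
                       optimizer_bytes: int = 8,
                       gradient_bytes: int = 4,
                       activation_bytes: int = 8) -> dict:
    """Estimate training memory (in GB) for a model with num_params parameters."""
    gb = 1e9  # using decimal gigabytes for simplicity
    breakdown = {
        "weights": num_params * weight_bytes / gb,
        "optimizer_states": num_params * optimizer_bytes / gb,
        "gradients": num_params * gradient_bytes / gb,
        "activations_and_temp": num_params * activation_bytes / gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown

# Example: a 1B-parameter model comes out to roughly 24 GB total under these
# assumptions, of which ~8 GB is activations/temp memory -- the part that the
# ZeRO stage explanation seems to leave out.
for name, size in training_memory_gb(1e9).items():
    print(f"{name:>22}: {size:5.1f} GB")
```

Under these rough numbers, activations and intermediate states would be about a third of the total, which is why their omission from the ZeRO breakdown surprised me.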

Could someone please explain why it is omitted? Maybe my understanding is incomplete; I would appreciate a clarification.