Hello everyone!
I’m a learner and I want to try training a storytelling AI (a decoder-only Transformer). The goal is to balance size, efficiency, and performance while keeping it trainable on Colab Free (T4, 16 GB VRAM) within a month.
- Current Architecture:
- Layers: 10
- Embedding Size: 512
- Heads: 8 (64 dim/head)
- FFN Size: 2,048
- Vocab Size: 16,000
- Context Window: 1,024
- Attention: MLA (multi-head latent attention)
- Params: ~40M with tied embeddings (see the sanity check after this list)
- Memory: FP16 weights are only ~80 MB; the full training footprint (weights + AdamW states + activations) should stay around 1-1.2 GB with activation checkpointing (see the training-step sketch near the end)
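For reference, here's the quick sanity check I did on the parameter count: with these dimensions a standard decoder stack lands near 40M (not 100-125M as I first guessed), and MLA would shrink the attention side rather than grow it. A minimal sketch in PyTorch, assuming plain multi-head attention in place of MLA and learned positional embeddings (both are placeholders, not the final design):

```python
import torch
import torch.nn as nn

# Dimensions from the list above.
N_LAYERS, D_MODEL, N_HEADS, D_FFN = 10, 512, 8, 2048
VOCAB, CTX = 16_000, 1_024

class Block(nn.Module):
    """Pre-norm decoder block; standard MHA stands in for MLA here."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(D_MODEL)
        self.ffn = nn.Sequential(
            nn.Linear(D_MODEL, D_FFN), nn.GELU(), nn.Linear(D_FFN, D_MODEL)
        )

    def forward(self, x, mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.ln2(x))

class TinyStoryLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(CTX, D_MODEL)   # learned positions (assumption)
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYERS))
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB, bias=False)
        self.head.weight = self.tok.weight      # tied input/output embeddings

    def forward(self, idx):
        T = idx.size(1)
        # True above the diagonal = future positions are masked out.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for b in self.blocks:
            x = b(x, mask)
        return self.head(self.ln_f(x))

print(sum(p.numel() for p in TinyStoryLM().parameters()) / 1e6)  # ≈ 40.3M
```

That breaks down as ~31.5M in the 10 blocks plus ~8.7M of (tied) embedding and positional parameters.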
- Training:
- Dataset: not decided yet, around 20 GB of raw text…
- Batch Size: 16
- Time: unknown.
- Epochs: 7-8
- Schedule: around a month
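Before committing to 7-8 epochs, I tried to run the token budget, since 20 GB of raw text is on the order of 5B tokens with a 16k BPE vocab. A back-of-envelope estimate below; every number is an assumption to replace with measured values (especially `tokens_per_sec`, which I'd benchmark on the actual T4 first):

```python
# Back-of-envelope schedule check; all inputs are assumptions,
# not measurements.
dataset_bytes = 20e9        # ~20 GB of raw text
bytes_per_token = 4.0       # rough average for a 16k BPE vocab
epochs = 7
tokens_per_sec = 12_000     # guessed T4 throughput; benchmark this first
gpu_hours_per_day = 8       # optimistic for Colab Free session limits

total_tokens = dataset_bytes / bytes_per_token * epochs
days = total_tokens / tokens_per_sec / 3600 / gpu_hours_per_day
print(f"{total_tokens / 1e9:.0f}B tokens -> ~{days:.0f} days")  # 35B -> ~101 days
```

Under these assumptions that's roughly 3x over the one-month budget, so I may have to cut epochs (for LM pretraining, 1-2 passes over more data usually beats many repeats of less), train on a subset, or shorten the context early on.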
- Overall:
- Size on disk: well under 1 GB (~80 MB of FP16 weights at ~40M params, more if the checkpoint also stores AdamW state)
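Since the memory estimate leans on FP16 plus activation checkpointing, here's a minimal sketch of the training step I have in mind, continuing from the `TinyStoryLM` sketch above (the batch shape and learning rate are placeholders):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

model = TinyStoryLM().cuda()                 # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler("cuda")        # FP16 needs loss scaling on a T4

def forward_checkpointed(idx):
    T = idx.size(1)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
    x = model.tok(idx) + model.pos(torch.arange(T, device=idx.device))
    for b in model.blocks:
        # Recompute each block's activations during backward to save VRAM.
        x = checkpoint(b, x, mask, use_reentrant=False)
    return model.head(model.ln_f(x))

idx = torch.randint(0, VOCAB, (16, CTX), device="cuda")  # stand-in batch of 16
with torch.autocast("cuda", dtype=torch.float16):        # T4 has no bf16
    logits = forward_checkpointed(idx[:, :-1])           # predict next token
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), idx[:, 1:].reshape(-1))
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
opt.zero_grad(set_to_none=True)
```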
If you have any suggestions on the architecture, training, platform, or any other aspect, please share; I'm open to any ideas or tweaks.
Thanks for your time! I really appreciate it!