Seeking Suggestions: Storytelling Transformer Architecture Optimization

Hello everyone!

I’m a learner and I want to try training a storytelling AI (a decoder-only Transformer). The goal is to balance size, efficiency, and performance while keeping it trainable on the Colab free tier (T4, 16 GB VRAM) within a month.

  • Current Architecture (a PyTorch sketch follows after this list):
    • Layers: 10
    • Embedding Size: 512
    • Heads: 8 (64 dim/head)
    • FFN Size: 2,048
    • Vocab Size: 16,000
    • Context Window: 1,024
    • Attention: MLA (multi-head latent attention)
    • Params: ~100-125M
    • Memory: ~1-1.2 GB (FP16) with checkpointing
  • Training:
    • Dataset: Not decided yet, around 20GB…
    • Batch Size: 16
    • Time: unknown.
    • Epochs: 7-8
    • Schedule: around a month
  • Overall:
    • Size: ~1 GB
    • :slight_smile:
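
For anyone who wants to poke at the exact shapes, here is a minimal PyTorch sketch of the config above. Standard causal self-attention stands in for MLA here, and the names (`StoryConfig`, `StoryDecoder`) are just placeholders I made up, not an existing library:

```python
# Minimal sketch of the planned decoder-only model.
# Standard causal self-attention stands in for MLA; names are placeholders.
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


@dataclass
class StoryConfig:
    n_layers: int = 10
    d_model: int = 512
    n_heads: int = 8              # 512 / 8 = 64 dims per head
    d_ffn: int = 2048
    vocab_size: int = 16_000
    max_seq_len: int = 1024
    dropout: float = 0.1
    grad_checkpoint: bool = True  # recompute activations to save VRAM


class Block(nn.Module):
    def __init__(self, cfg: StoryConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.attn = nn.MultiheadAttention(
            cfg.d_model, cfg.n_heads, dropout=cfg.dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.ffn = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_ffn),
            nn.GELU(),
            nn.Linear(cfg.d_ffn, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.ln2(x))


class StoryDecoder(nn.Module):
    def __init__(self, cfg: StoryConfig):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layers)])
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying shrinks the checkpoint

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        # True above the diagonal = "may not attend to future tokens".
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1
        )
        for blk in self.blocks:
            if self.cfg.grad_checkpoint and self.training:
                x = checkpoint(blk, x, mask, use_reentrant=False)
            else:
                x = blk(x, mask)
        return self.head(self.ln_f(x))


if __name__ == "__main__":
    model = StoryDecoder(StoryConfig())
    print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```

The print at the end is just a quick way for me to double-check the parameter count against my estimate above.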

If you have any suggestions on the Architecture, Training, Platform, or any other aspect, please share them; I’m open to any ideas or tweaks. For context, a rough sketch of the training step I’m planning is below.
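
This is roughly the training step I have in mind for the T4: FP16 autocast with loss scaling, plus gradient accumulation so the effective batch can go beyond 16 without extra VRAM. `train_one_epoch` and `accum_steps=4` are placeholder names/values, `model` is assumed to be the `StoryDecoder` sketch above, and `loader` is assumed to yield token batches of shape (B, T+1):

```python
# Rough sketch of an FP16 training step with gradient accumulation.
import torch
import torch.nn.functional as F


def train_one_epoch(model, loader, optimizer, accum_steps=4, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    optimizer.zero_grad(set_to_none=True)

    for step, tokens in enumerate(loader):
        tokens = tokens.to(device, non_blocking=True)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]

        # FP16 forward/backward; the loss is scaled to avoid gradient underflow.
        with torch.cuda.amp.autocast(dtype=torch.float16):
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        scaler.scale(loss / accum_steps).backward()

        # Step only every accum_steps mini-batches:
        # effective batch = loader batch size * accum_steps.
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)  # unscale before clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```

My thinking is that accumulation is usually a safer way to grow the effective batch on a 16 GB card than raising the per-step batch size, but I’d love corrections on that too.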

Thanks for your time! Much appreciated!
