How do you generate images one token at a time?
I recently built an autoregressive image generation pipeline from scratch on CIFAR-10 to explore this idea.
Instead of working on the whole image in parallel, the way diffusion models do at every denoising step, this approach treats an image more like a sentence: the model generates it one token at a time.
First, a VQ-VAE compresses a 32×32 image into an 8×8 grid of discrete tokens, so each image becomes a sequence of 64 tokens. Then a GPT-style transformer learns to predict the next token given the previous ones. During inference, tokens are sampled sequentially and decoded back into an image.
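The sampling step can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from the repo: `next_token_logits` is a hypothetical stand-in for the trained transformer, the codebook size of 512 is an assumption (the post does not state it), and row-major token order is assumed as well. Only the 8 by 8 grid and the one-token-at-a-time loop come from the pipeline described above.

```python
import math
import random

GRID = 8                 # 8 by 8 token grid from the VQ-VAE encoder
SEQ_LEN = GRID * GRID    # 64 tokens per image
CODEBOOK_SIZE = 512      # assumption: not stated in the post


def next_token_logits(prefix):
    """Hypothetical stand-in for the transformer: one logit per codebook entry.

    A real model would attend over `prefix`; here we return pseudo-random
    logits seeded by the prefix length so the sketch runs on its own.
    """
    rng = random.Random(len(prefix))
    return [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE)]


def sample_from_logits(logits, rng, temperature=1.0):
    """Softmax sampling with a temperature, computed stably."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_tokens(seed=0):
    """Sample SEQ_LEN tokens one at a time; each step sees all earlier tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(SEQ_LEN):
        logits = next_token_logits(tokens)   # one model call per token
        tokens.append(sample_from_logits(logits, rng))
    return tokens


tokens = generate_tokens()
# Reshape the flat sequence into the grid the VQ-VAE decoder would consume
# (row-major order is an assumption).
grid = [tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]
```

In a real pipeline, `grid` would be mapped to codebook vectors and passed through the VQ-VAE decoder to get pixels. The loop makes 64 model calls per image, and none of them can start before the previous token has been sampled.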
A few things I learned:
- Codebook collapse is real. Most of my codebook entries were unused at first because of poor initialization and scale mismatch.
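- Backprop through discrete tokens is tricky. The nearest-code lookup has no useful gradient, so the straight-through estimator, which copies the gradient at the quantized output back to the encoder output unchanged, is what makes training work.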
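- Training is fast but generation is slow. During training all positions are predicted in parallel from the ground-truth tokens, but at inference each token depends on the ones sampled before it, so generating an image takes one model call per token.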
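The first two points can be illustrated with a small sketch. It is pure Python with made-up sizes and scales (64 codes, 4 dimensions, and the standard deviations below are all assumptions), and it is not the repo's implementation. `quantize` does the nearest-code lookup, `codebook_usage` counts how many entries ever get selected, and `straight_through_backward` shows the estimator's rule: the backward pass treats quantization as the identity. The scale mismatch here is assumed to be between encoder outputs and codebook vectors.

```python
import random

NUM_CODES = 64   # assumption for illustration
DIM = 4          # assumption for illustration


def make_vectors(n, std, rng):
    """Draw n random DIM-dimensional vectors with the given scale."""
    return [[rng.gauss(0.0, std) for _ in range(DIM)] for _ in range(n)]


def quantize(z, codebook):
    """Return the index of the codebook entry nearest to z (squared L2)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(z, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def codebook_usage(encodings, codebook):
    """Number of distinct codebook entries selected for these encodings."""
    return len({quantize(z, codebook) for z in encodings})


def straight_through_backward(grad_wrt_quantized):
    """Straight-through estimator: pass the gradient through unchanged.

    The true derivative of the nearest-code lookup is zero almost everywhere,
    so without this rule the encoder would receive no learning signal.
    """
    return list(grad_wrt_quantized)


rng = random.Random(0)
codebook = make_vectors(NUM_CODES, std=1.0, rng=rng)

# Encoder outputs on the same scale as the codebook.
matched = codebook_usage(make_vectors(1000, std=1.0, rng=rng), codebook)
# Encoder outputs on a much smaller scale: only the codes nearest the
# origin can win the lookup, which is one way collapse shows up.
mismatched = codebook_usage(make_vectors(1000, std=0.05, rng=rng), codebook)

print(f"used with matched scale:    {matched}/{NUM_CODES}")
print(f"used with mismatched scale: {mismatched}/{NUM_CODES}")

# The gradient arriving at the quantized output is handed to the encoder as is.
grad_to_encoder = straight_through_backward([0.5, -1.0, 0.0, 2.0])
```

Tracking a usage count like this during training is a cheap way to notice collapse early. In an autograd framework such as PyTorch, the straight-through rule is commonly written as `z + (z_q - z).detach()`, which returns the quantized value in the forward pass while gradients flow to `z` as if no quantization had happened.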
This project gave me a much better intuition for how autoregressive models can be used beyond text.
Full implementation and results in the repo below. Would love to hear your thoughts.
GitHub repo: jasmeetsingh-028/vqvae-ar-implementation