Generating Images One Token at a Time

Jasmeet_Singh2 · June 13, 2026, 3:54pm

How do you generate images one token at a time?

I recently built an autoregressive image generation pipeline from scratch on CIFAR-10 to explore this idea.

Instead of refining the whole image at once, as diffusion models do, this approach treats an image more like a sentence: the model generates it one token at a time.

First, a VQ-VAE compresses a 32×32 image into an 8×8 grid of discrete tokens, so each image becomes a sequence of 64 tokens. Then a GPT-style transformer learns to predict the next token given the previous ones. During inference, tokens are sampled sequentially and the VQ-VAE decoder turns them back into an image.
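
The inference step above can be sketched roughly as follows. This is only a minimal illustration, not the code from the repo: `dummy_logits` is a hypothetical stand-in for the trained transformer, the codebook size of 512 is an assumption (the post does not state it), and the row-by-row token order is also assumed.

```python
import math
import random

GRID = 8                 # 8 x 8 token grid, as described in the post
SEQ_LEN = GRID * GRID    # 64 tokens per image
CODEBOOK_SIZE = 512      # assumption: the post does not give the codebook size


def dummy_logits(prefix):
    """Hypothetical stand-in for the trained transformer: uniform logits."""
    return [0.0] * CODEBOOK_SIZE


def sample_from_logits(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_tokens(logits_fn, seq_len=SEQ_LEN, temperature=1.0, seed=0):
    """Sample tokens one at a time; each step sees every earlier token."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(seq_len):
        logits = logits_fn(tokens)  # one model call per generated token
        tokens.append(sample_from_logits(logits, temperature, rng))
    return tokens


def to_grid(tokens, side=GRID):
    """Reshape the flat sequence into rows (assumed row-by-row order)."""
    return [tokens[row * side:(row + 1) * side] for row in range(side)]


tokens = generate_tokens(dummy_logits)
grid = to_grid(tokens)  # this grid is what the VQ-VAE decoder would consume
```

The loop makes one model call per token, which is where the sequential cost at inference comes from.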

A few things I learned:

Codebook collapse is real. Most of my codebook entries were unused at first because of poor initialization and scale mismatch.
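Backprop through discrete tokens is tricky. The nearest-code lookup is not differentiable, so the straight-through estimator, which copies the decoder's gradient directly to the encoder output, is what makes training work.
Training is fast but generation is slow. During training all positions are predicted in parallel, but at inference each token depends on the ones before it, so the 64 tokens have to be sampled one by one.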
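
To make the first two points concrete, here is a small sketch of nearest-code quantization with a usage count (one way to notice codebook collapse), plus the gradient copy that the straight-through estimator performs. It is a toy illustration on random vectors, not the repo's code: the function names, the dimensions, and the codebook size are all hypothetical.

```python
import random
from collections import Counter


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def codebook_usage(encoder_outputs, codebook):
    """Fraction of codebook entries picked at least once, plus raw counts."""
    counts = Counter(nearest_code(z, codebook) for z in encoder_outputs)
    return len(counts) / len(codebook), counts


def straight_through_grad(grad_wrt_quantized):
    """Backward pass of the straight-through estimator.

    The nearest-code lookup has no useful gradient, so the gradient that
    arrives at the quantized vector is passed to the encoder output unchanged.
    """
    return list(grad_wrt_quantized)


rng = random.Random(0)
DIM, NUM_CODES = 4, 32  # hypothetical sizes, chosen only for the demo
encoder_outputs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CODES)]

usage, counts = codebook_usage(encoder_outputs, codebook)
print(f"{usage:.0%} of codebook entries were used")
```

Tracking a number like `usage` during training is one way to spot collapse early: if it stays low, most entries are never selected.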

This project gave me a much better intuition for how autoregressive models can be used beyond text.

Full implementation and results in the repo below. Would love to hear your thoughts.

GitHub repo: jasmeetsingh-028/vqvae-ar-implementation
