Qwen3-Next employs hybrid attention for long-context inputs


Alibaba introduced Qwen3-Next-80B-A3B, a sparse mixture-of-experts model that activates only 3 billion of its 80 billion parameters during inference. The architecture interleaves Gated DeltaNet linear attention with standard attention layers in a 3:1 ratio, reaching performance comparable to dense 32-billion-parameter models while using less than 10 percent of their training compute. For contexts longer than 32,000 tokens, Qwen3-Next delivers more than 10 times faster inference and supports context lengths up to 256,000 tokens. The result suggests that sparse architectures with hybrid attention can match larger models' performance while drastically cutting computational costs. The models are available on Hugging Face and ModelScope, with API access through Alibaba Cloud and NVIDIA. (Qwen)
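The 3:1 interleaving of linear and standard attention layers can be illustrated with a minimal sketch. The `GatedLinearAttention` module below is a hypothetical, heavily simplified stand-in for Gated DeltaNet (a gated recurrent state built from rank-1 key-value updates in linear time), not Qwen3-Next's actual implementation; the module names, layer counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FullAttention(nn.Module):
    """Standard softmax attention (single head, no masking, for brevity)."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        return self.out(torch.softmax(scores, dim=-1) @ v)


class GatedLinearAttention(nn.Module):
    """Toy gated linear attention: a recurrent (dim x dim) state updated with
    gated rank-1 key-value outer products, so cost grows linearly with sequence
    length instead of quadratically. Illustrative only, not Gated DeltaNet."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        g = torch.sigmoid(self.gate(x))        # per-token decay gate
        batch, seq, dim = x.shape
        state = x.new_zeros(batch, dim, dim)   # recurrent state
        outputs = []
        for t in range(seq):
            # Decay the state, then add the rank-1 update k_t^T v_t.
            state = g[:, t].unsqueeze(-1) * state + \
                k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
            outputs.append(q[:, t].unsqueeze(-2) @ state)  # (batch, 1, dim)
        return self.out(torch.cat(outputs, dim=-2))


def build_hybrid_stack(dim, num_layers, ratio=3):
    """Interleave `ratio` linear-attention layers per one full-attention layer."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:
            layers.append(FullAttention(dim))
        else:
            layers.append(GatedLinearAttention(dim))
    return nn.Sequential(*layers)


model = build_hybrid_stack(dim=64, num_layers=8)
x = torch.randn(2, 16, 64)
print(model(x).shape)  # torch.Size([2, 16, 64])
```

Because the linear-attention layers carry a fixed-size state rather than a growing attention matrix, they dominate the stack (three out of every four layers here), which is what makes the long-context speedup plausible while the occasional full-attention layer preserves global token-to-token interaction.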
