NVIDIA Nemotron-Labs-TwoTower: Split a 30B Model in Two for 2.42× Faster Diffusion Generation
NVIDIA Research adapts Nemotron-3-Nano-30B into Nemotron-Labs-TwoTower: a frozen context tower plus trainable diffusion denoiser that writes token blocks in parallel. 98.7% benchmark quality retained at 2.42× throughput. Hugging Face weights, vLLM, and SGLang support.
Autoregressive LLMs generate one token at a time, left to right. Every agent loop, every chat turn, every code completion pays that sequential tax, and when you stack 20 inference calls, latency compounds.
On July 2, 2026, NVIDIA AI posted a different bet: take a 30B Nemotron Nano, split it in two, and let one half hold context while the other writes tokens in parallel through block-wise diffusion.
Nemotron-Labs-TwoTower-30B-A3B-Base-BF16 is on Hugging Face now. NVIDIA's headline numbers: 98.7% of the autoregressive baseline's aggregate benchmark quality at 2.42× wall-clock generation throughput.
License: NVIDIA Open Model License (commercial use allowed)
The Tweet in Plain English
NVIDIA's pitch in one sentence:
One half holds the context, the other writes the tokens; both reuse the pretrained model instead of training a new one from scratch.
That last clause matters. Diffusion language models (dLLMs) have been interesting for years, but most paths require training a dedicated diffusion model from zero. Nemotron-Labs-TwoTower is a retrofit: clone your existing Nemotron Nano weights twice, freeze one tower for causal context encoding, and fine-tune only the denoiser on ~2.1T tokens (the backbone itself saw ~25T in pretraining).
Context tower (causal)
Runs causally over the clean prompt and every already committed token block.
Exports per-layer KV cache (attention) and Mamba-2 boundary states.
Stays frozen during TwoTower training, so there is no catastrophic forgetting of the pretrained backbone.
Denoiser tower (diffusion)
Fills one block of block_size positions at a time (default S=16).
Starts with all [MASK] tokens; runs steps_per_block denoising iterations.
Uses bidirectional in-block self-attention, layer-aligned cross-attention to the context tower, and context-seeded Mamba-2 states.
adaLN-single time conditioning maps diffusion timestep t to per-layer scale/shift/gate (PixArt-α style; ~1.5M added params).
Only the denoiser tower learns the mask-diffusion objective. The context tower is a read-only memory of what has been said so far.
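The adaLN-single conditioning mentioned above can be illustrated with a small pure-Python sketch. This is not NVIDIA's implementation: `toy_time_mlp`, `layer_conditioning`, and `adaln_block` are hypothetical names, the timestep "MLP" is a fixed toy function rather than a learned network, and a single shift/scale/gate triple per layer is used for simplicity. The idea it shows is the general PixArt-α pattern: compute one conditioning signal from the timestep, add a small per-layer offset, then use it to modulate the normalized input and gate the residual branch.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer norm over one feature vector, without learned affine terms."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_time_mlp(t, dim):
    """Hypothetical stand-in for the learned timestep network.

    Evaluated once per denoising step; the result is shared by every layer.
    """
    return {"shift": [0.1 * t] * dim, "scale": [0.5 * t] * dim, "gate": [1.0 - t] * dim}


def layer_conditioning(shared, layer_offset):
    """adaLN-single idea: shared signal plus a small per-layer offset."""
    return {k: [s + o for s, o in zip(shared[k], layer_offset[k])] for k in shared}


def adaln_block(x, sublayer, cond):
    """Residual block: x + gate * sublayer(norm(x) * (1 + scale) + shift)."""
    normed = layer_norm(x)
    modulated = [n * (1.0 + sc) + sh
                 for n, sc, sh in zip(normed, cond["scale"], cond["shift"])]
    out = sublayer(modulated)
    return [xi + g * oi for xi, g, oi in zip(x, cond["gate"], out)]


x = [1.0, 2.0, 3.0]
zero_offset = {k: [0.0] * 3 for k in ("shift", "scale", "gate")}
cond = layer_conditioning(toy_time_mlp(t=0.25, dim=3), zero_offset)
y = adaln_block(x, lambda h: h, cond)  # identity sublayer, for illustration only
```

Because the heavy timestep network is shared and each layer only adds a small offset, the extra parameter count stays tiny relative to the backbone, which is consistent with the ~1.5M figure quoted above.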
How Mask Diffusion Generation Works
Generation is block-wise autoregressive at the macro level, parallel inside each block:
1. Encode prompt once through the context tower.
2. For each new block:
   a. Initialize 16 positions as [MASK] (mask_token_id=3).
   b. For up to 16 denoising steps:
      - Compute timestep from the fraction still masked.
      - Run denoiser over the whole block (parallel).
      - Commit positions above confidence threshold γ=0.8; re-mask the rest.
   c. Advance context tower over the committed block (update KV + Mamba caches).
3. Repeat until max_new_tokens or EOS.
Multiple tokens can commit per step; that is where 2.42× comes from versus strict one-token AR decoding.
Tradeoff: Lower γ → more tokens committed per step → higher throughput, lower quality. NVIDIA documents this as a tunable operating point, not a fixed constant.
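The block-wise loop and the threshold tradeoff described above can be sketched with a toy decoder. Everything here is illustrative: `toy_denoiser` is a hypothetical stand-in that fabricates token ids and confidences, `context.extend` stands in for advancing the KV and Mamba caches, EOS handling is omitted, and `max_new_tokens` is assumed to be a multiple of the block size. Two behaviours are assumptions rather than documented details: committing at least the single most confident position when nothing clears the threshold, and committing everything on the last allowed step.

```python
MASK = 3  # mask_token_id used in the article


def toy_denoiser(context, block, t):
    """Hypothetical stand-in for the denoiser tower.

    Returns one (token_id, confidence) proposal per block position.
    Confidence is higher for earlier positions and rises as t shrinks.
    """
    size = len(block)
    return [(100 + len(context) + i, 1.0 - t * i / size) for i in range(size)]


def decode_block(context, block_size=16, steps_per_block=16, threshold=0.8):
    """Fill one block; returns (tokens, denoising steps used)."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block and steps < steps_per_block:
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        t = len(masked) / block_size  # timestep = fraction still masked
        proposals = toy_denoiser(context, block, t)
        accepted = [i for i in masked if proposals[i][1] >= threshold]
        if steps == steps_per_block - 1:
            accepted = masked  # assumption: last step commits everything left
        elif not accepted:
            # assumption: always commit the most confident masked position
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            block[i] = proposals[i][0]  # unaccepted positions stay masked
        steps += 1
    return block, steps


def generate(prompt_ids, max_new_tokens=32, block_size=16, threshold=0.8):
    """Block-wise loop; returns (new tokens, total denoising steps)."""
    context = list(prompt_ids)
    total_steps = 0
    while len(context) - len(prompt_ids) < max_new_tokens:
        block, steps = decode_block(context, block_size, threshold=threshold)
        context.extend(block)  # stand-in for updating KV + Mamba caches
        total_steps += steps
    return context[len(prompt_ids):], total_steps


tokens_strict, steps_strict = generate([1, 2, 3], threshold=0.8)
tokens_loose, steps_loose = generate([1, 2, 3], threshold=0.5)
```

Dividing the number of new tokens by the total step count gives tokens per denoising step. In this toy the looser threshold finishes in fewer steps, which mirrors the throughput side of the tradeoff; the quality side cannot be shown with a fake denoiser.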
Three Generation Modes
| Mode | API | Tokens per step | Use case |
| --- | --- | --- | --- |
| Mask diffusion | generate_mask_diffusion() | up to block_size | Default: parallel block decode |
| Mock-AR | generate_mock_ar() | 1 | Two-tower but autoregressive stepping |
| AR-only | generate_ar() | 1 | Single GPU; context tower only |
Mock-AR and AR modes help debug quality or run on one GPU. Mask diffusion is the speed story, and it requires two GPUs in the reference implementation.
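A rough weight-memory estimate suggests why one tower per GPU is a natural layout. This is back-of-the-envelope arithmetic, not a figure from the model card. It assumes about 30B parameters per tower (roughly 60B across both), 2 bytes per parameter for BF16, and a hypothetical 80 GB accelerator, and it ignores KV cache, Mamba states, and activations.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Weight-only footprint in decimal GB (BF16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


GPU_MEMORY_GB = 80  # assumption: one 80 GB accelerator per tower

per_tower_gb = weight_memory_gb(30)    # one tower, ~30B parameters (assumed)
both_towers_gb = weight_memory_gb(60)  # both towers together
headroom_gb = GPU_MEMORY_GB - per_tower_gb  # left for caches and activations

fits_on_one_gpu = both_towers_gb <= GPU_MEMORY_GB
```

Under these assumptions a single tower fits on one device with some headroom while both towers together do not, which matches the single-GPU AR-only mode and the two-GPU diffusion mode in the table.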
Benchmarks: What 98.7% Means
Default eval: γ=0.8, S=16, BF16, 2× H100. Per-task snapshots from the model card:

| Task | Nemotron-3-Nano (AR) | TwoTower (diffusion) |
| --- | --- | --- |
| MMLU (5-shot) | 78.56 | 78.24 |
| MMLU-Pro (CoT) | 62.59 | 60.93 |
| ARC-Challenge | 91.72 | 92.66 |
| HumanEval | 79.27 | 75.58 |
| GSM8K | 92.49 | 90.14 |
| MATH-500 | 84.40 | 80.60 |
| Aggregate quality | 100% | 98.7% |
| Throughput | 1.0× | 2.42× |
Reading the table: MMLU holds within about a third of a point and ARC-Challenge improves slightly, while MMLU-Pro slips ~1.7 points. Code and math take the biggest hits: HumanEval drops ~3.7 points, MATH-500 ~3.8, and GSM8K ~2.4. If your workload is agentic coding loops, benchmark on your repo before swapping decoders.
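The per-task gaps quoted above can be recomputed directly from the table. The sketch below only uses the six scores listed in this article; it does not try to reproduce the 98.7% aggregate, which comes from the model card and may be computed over a different set of tasks.

```python
# (autoregressive baseline, TwoTower diffusion) scores from the table above
scores = {
    "MMLU (5-shot)": (78.56, 78.24),
    "MMLU-Pro (CoT)": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "HumanEval": (79.27, 75.58),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
}

# Absolute change in points (negative means TwoTower scores lower).
deltas = {task: round(tt - ar, 2) for task, (ar, tt) in scores.items()}

# Share of the baseline score that TwoTower keeps, in percent.
retention = {task: round(100.0 * tt / ar, 1) for task, (ar, tt) in scores.items()}

largest_drop = min(deltas, key=deltas.get)

for task in scores:
    print(f"{task:16s} delta={deltas[task]:+.2f}  retained={retention[task]:.1f}%")
```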
Compare to other parallel-generation research: Mercury 2 chases 1,000+ tok/sec as a commercial API; DiffusionGemma targets 4× on small open models. TwoTower's angle is to reuse Nemotron Nano quality at ~2.4× without abandoning the pretrained stack.
Run It Locally
Transformers (two-GPU diffusion)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # custom generation code ships in the model repo
)
# One tower per GPU.
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

outputs = model.generate_mask_diffusion(
    inputs["input_ids"],
    max_new_tokens=128,
    block_size=16,             # S: positions filled per block
    steps_per_block=16,        # upper bound on denoising iterations per block
    mask_token_id=3,
    temperature=0.1,
    confidence_threshold=0.8,  # gamma: lower commits more tokens per step
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires trust_remote_code=True because the custom generation paths live in the model repo.
The Hugging Face model card includes SGLang launch commands and docker model run hf.co/nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16. Quantized variants for llama.cpp, Ollama, and LM Studio are listed under model quantizations.
Context window: up to 128K tokens in/out, the same class as the Nano backbone and relevant for long agent harness runs.
Why NVIDIA Cares Now
Nemotron is NVIDIA's open-weights agent stack: Nano for edge/single-GPU, Super/Ultra for datacenter agents. Computex 2026 put Nemotron 3 Ultra at the center of the agentic story.
TwoTower attacks the other bottleneck: once you have a capable 30B MoE, decoding speed still limits interactive agents and high-QPS APIs. Splitting the model lets NVIDIA experiment with diffusion decoding without throwing away 25T tokens of pretraining.
For teams already on Nemotron Nano or NIM containers, TwoTower is the research preview of "same brain, faster mouth."
Limitations
Two GPUs for default mode: not a laptop model; AR-only fallback is sequential.
~60B total weights: you pay storage and memory for both towers even though only ~3B activate per token per tower.
Early release: Hugging Face notes a config parsing warning on config.json; expect rough edges in vLLM/SGLang integrations.
Not a chat-tuned instruct model: this is Base-BF16; wrap with your own SFT/RL pipeline for assistants.
Diffusion hype cycle: parallel decode helps generation-bound workloads; single-turn chat often waits on the human reader, not the model (same caveat as Mercury 2).
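The "generation-bound" caveat in the last limitation can be made concrete with an Amdahl-style estimate. This is a simplified model, not a measurement: it assumes the 2.42× figure applies only to the decode share of a request's latency and that everything else (prefill, tool calls, network, waiting on the user) is unchanged. The function name and the example fractions are hypothetical.

```python
def end_to_end_speedup(decode_fraction, decode_speedup=2.42):
    """Amdahl-style estimate: only the decode share of latency gets faster.

    decode_fraction is the share of baseline wall-clock time spent
    generating tokens, between 0.0 and 1.0.
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - decode_fraction) + decode_fraction / decode_speedup)


# Hypothetical workloads: mostly waiting, balanced, and generation-bound.
for fraction in (0.2, 0.5, 0.9):
    print(f"decode share {fraction:.0%}: {end_to_end_speedup(fraction):.2f}x overall")
```

The closer the decode share is to 100%, the closer the overall gain gets to the headline number; a workload that mostly waits on something else sees far less.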
Where Parallel Decode Actually Wins
TwoTower is most compelling when tokens out × calls per task dominates your bill: