What is Nemotron-Labs-TwoTower?

Nemotron-Labs-TwoTower is a block-wise diffusion language model from NVIDIA Research built on the Nemotron-3-Nano-30B-A3B backbone. It uses two copies of the same 30B hybrid Mamba-2/attention/MoE stack — a frozen autoregressive context tower and a trainable diffusion denoiser tower — to generate text by iteratively unmasking blocks of tokens in parallel instead of one token at a time.

How fast is Nemotron-Labs-TwoTower compared to autoregressive decoding?

At the default operating point (confidence threshold 0.8, block size 16, BF16 on 2× H100 GPUs), NVIDIA reports 2.42× the wall-clock generation throughput of the autoregressive Nemotron-3-Nano baseline while retaining 98.7% of aggregate benchmark quality.

Do you need to train a new model from scratch?

No. Both towers initialize from the same pretrained Nemotron-3-Nano-30B-A3B checkpoint. The AR/context tower stays frozen; only the diffusion/denoiser tower is trained (~2.1T tokens) on a masked-diffusion objective conditioned on the context tower's per-layer KV cache and Mamba states.

What hardware does Nemotron-Labs-TwoTower require?

Full two-tower mask-diffusion inference uses 2 GPUs (AR tower on cuda:0, denoiser on cuda:1) — roughly 59GB per GPU in BF16. AR-only mode runs on a single 80GB GPU (H100 or A100). NVIDIA targets H100/A100 for production inference via Transformers, vLLM, or SGLang.

Where can I download Nemotron-Labs-TwoTower?

Weights are on Hugging Face at nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16 under the NVIDIA Open Model License. The paper is arXiv:2606.26493 (Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context).

How is this different from Mercury 2 or DiffusionGemma?

Mercury 2 (Inception) is a commercial diffusion LLM API optimized for extreme tok/sec on Blackwell. DiffusionGemma is Google's small open diffusion model for research. Nemotron-Labs-TwoTower is NVIDIA's research path for adapting an existing large open Nemotron backbone into parallel block diffusion without discarding pretrained weights — closer to a retrofit than a from-scratch dLLM.

NVIDIA Nemotron-Labs-TwoTower: 2.42× Diffusion LLM Guide (2026) | explainx.ai Blog


Autoregressive LLMs generate one token at a time, left to right. Every agent loop, every chat turn, every code completion pays that sequential tax — and when you stack 20 inference calls, latency compounds.

On July 2, 2026, NVIDIA AI posted a different bet: take a 30B Nemotron Nano, split it in two, and let one half hold context while the other writes tokens in parallel through block-wise diffusion.

Nemotron-Labs-TwoTower-30B-A3B-Base-BF16 is on Hugging Face now. NVIDIA's headline numbers: 98.7% of the autoregressive baseline's aggregate benchmark quality at 2.42× wall-clock generation throughput.

TL;DR


Model	`nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16`
Backbone	Nemotron-3-Nano-30B-A3B
Architecture	Two towers — frozen AR/context + trainable diffusion/denoiser
Total params	~60B (30B per tower); ~3B active per token per tower (MoE)
Quality vs AR	98.7% aggregate benchmark retention (default settings)
Speed vs AR	2.42× generation throughput
Default decode	Block size 16, confidence γ=0.8, 16 denoising steps per block
Hardware	2× 80GB GPU for full diffusion; 1× GPU for AR-only mode
Paper	arXiv:2606.26493
License	NVIDIA Open Model License (commercial use allowed)

The Tweet in Plain English

NVIDIA's pitch in one sentence:

One half holds the context, the other writes the tokens — both reuse the pretrained model instead of training a new one from scratch.

That last clause matters. Diffusion language models (dLLMs) have been interesting for years, but most paths require training a dedicated diffusion model from zero. Nemotron-Labs-TwoTower is a retrofit: make two copies of your existing Nemotron Nano weights, freeze one as the causal context tower, and train only the other, the denoiser, on ~2.1T tokens (the backbone itself saw ~25T in pretraining).

The June 2026 paper Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context formalizes the design. The July 2 Hugging Face release makes it runnable.

Two Towers, One Backbone

Both towers are copies of the same 52-layer hybrid stack — 23 Mamba-2, 6 self-attention, 23 MoE layers — derived from Nemotron 3 Nano:

  AR / Context Tower              Diffusion / Denoiser Tower
  (frozen, causal)                (trainable, block diffusion)

  clean prompt + committed        noisy [MASK] block
  tokens  ──────────────────▶     bidirectional in-block attn
  KV cache + Mamba states  ─────▶ cross-attn + Mamba seed
                                  confidence unmasking → commit block

Context tower (AR)

Runs causally over the clean prompt and every already committed token block.
Exports per-layer KV cache (attention) and Mamba-2 boundary states.
Stays frozen during TwoTower training — no catastrophic forgetting of the pretrained backbone.

Denoiser tower (diffusion)

Fills one block of block_size positions at a time (default S=16).
Starts with all [MASK] tokens; runs steps_per_block denoising iterations.
Uses bidirectional in-block self-attention, layer-aligned cross-attention to the context tower, and context-seeded Mamba-2 states.
adaLN-single time conditioning maps diffusion timestep t to per-layer scale/shift/gate (PixArt-α style; ~1.5M added params).
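To make the adaLN-single idea concrete, here is a minimal pure-Python sketch of the PixArt-α pattern: one shared projection maps the timestep embedding to shift/scale/gate, and each layer adds a small learned offset. The toy sizes, the sinusoidal embedding, the timestep scaling, and the use of LayerNorm are all assumptions for illustration; none of this is taken from NVIDIA's implementation.

```python
import math
import random

HIDDEN = 4      # toy width; the real model is far wider
NUM_LAYERS = 3  # toy depth
EMB_DIM = 8     # toy timestep-embedding size

rng = random.Random(0)


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def timestep_embedding(t, dim=EMB_DIM):
    """Sinusoidal embedding of a diffusion timestep t in [0, 1]."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    scaled = t * 1000.0  # assumed scaling, as in common diffusion codebases
    return [math.sin(scaled * f) for f in freqs] + [math.cos(scaled * f) for f in freqs]


# One projection shared by every layer: embedding -> (shift, scale, gate).
shared_proj = [[rng.gauss(0.0, 0.02) for _ in range(EMB_DIM)] for _ in range(3 * HIDDEN)]
# Small per-layer learned offsets added to the shared output.
layer_offsets = [[rng.gauss(0.0, 0.02) for _ in range(3 * HIDDEN)] for _ in range(NUM_LAYERS)]


def modulation(t, layer):
    """Return (shift, scale, gate) vectors for one layer at timestep t."""
    base = matvec(shared_proj, timestep_embedding(t))
    mixed = [b + o for b, o in zip(base, layer_offsets[layer])]
    return mixed[:HIDDEN], mixed[HIDDEN:2 * HIDDEN], mixed[2 * HIDDEN:]


def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def modulated_residual(x, t, layer, sublayer):
    """x + gate * sublayer(norm(x) * (1 + scale) + shift)."""
    shift, scale, gate = modulation(t, layer)
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]
```

Because the heavy projection is shared and only the per-layer offsets are layer-specific, the conditioning adds few parameters, which fits the small added-parameter figure quoted above.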

Only the denoiser tower learns the mask-diffusion objective. The context tower is a read-only memory of what has been said so far.

How Mask Diffusion Generation Works

Generation is block-wise autoregressive at the macro level, parallel inside each block:

Encode prompt once through the context tower.
For each new block:
- Initialize 16 positions as [MASK] (mask_token_id=3).
- For up to 16 denoising steps:
  - Compute timestep from the fraction still masked.
  - Run denoiser over the whole block (parallel).
  - Commit positions above confidence threshold γ=0.8; re-mask the rest.
- Advance context tower over the committed block (update KV + Mamba caches).
Repeat until max_new_tokens or EOS.

Multiple tokens can commit per step — that is where 2.42× comes from versus strict one-token AR decoding.

Tradeoff: Lower γ → more tokens committed per step → higher throughput, lower quality. NVIDIA documents this as a tunable operating point, not a fixed constant.
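The per-block loop and the γ tradeoff can be sketched with a toy denoiser. Everything here is illustrative: `toy_denoiser` is a hypothetical stand-in that emits random tokens and confidences, and the two fallback rules (commit at least one position per step, force-commit on the final step) are assumptions, not documented behavior of the reference implementation.

```python
import random

MASK = 3  # mask_token_id used in the article's example


def toy_denoiser(block, t, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns one (token, confidence) pair per position. Confidence is random,
    skewed upward as t (the fraction still masked) shrinks.
    """
    return [(rng.randrange(100, 200), rng.random() ** t) for _ in block]


def decode_block(block_size=16, steps_per_block=16, gamma=0.8, seed=0):
    """Fill one block; return (tokens, number of denoising steps used)."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    steps_used = 0
    for step in range(steps_per_block):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break  # block fully committed early
        t = len(masked) / block_size  # timestep from the fraction still masked
        preds = toy_denoiser(block, t, rng)
        # Commit masked positions whose confidence clears the threshold.
        to_commit = [i for i in masked if preds[i][1] >= gamma]
        if step == steps_per_block - 1:
            to_commit = masked  # assumption: force-commit on the final step
        elif not to_commit:
            # Assumption: always commit the single most confident position.
            to_commit = [max(masked, key=lambda i: preds[i][1])]
        for i in to_commit:
            block[i] = preds[i][0]
        steps_used += 1
    return block, steps_used
```

Calling `decode_block(gamma=0.5)` and `decode_block(gamma=0.95)` with the same seed shows the throughput side of the tradeoff: the lower threshold never needs more denoising steps than the higher one in this toy setup. The quality side cannot be shown with random tokens.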

Three Generation Modes

Mode	API	Tokens per step	Use case
Mask diffusion	`generate_mask_diffusion()`	up to `block_size`	Default — parallel block decode
Mock-AR	`generate_mock_ar()`	1	Two-tower but autoregressive stepping
AR-only	`generate_ar()`	1	Single GPU; context tower only

Mock-AR helps debug quality, because it keeps both towers but steps one token at a time; AR-only runs on a single GPU. Mask diffusion is the speed story, and it requires two GPUs in the reference implementation.
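One way to wire the three modes into a launcher is a small selector keyed on available GPUs. `pick_generation_method` is a hypothetical helper, not part of the model repo, and it assumes Mock-AR needs two GPUs because it is described as a two-tower mode. It only returns the method name from the table above, so it makes no claim about each method's arguments.

```python
def pick_generation_method(num_gpus, debug_stepwise=False):
    """Return the name of the generation method to call on the model.

    num_gpus: GPUs available to the process.
    debug_stepwise: prefer one-token-at-a-time stepping through both towers.
    """
    if num_gpus >= 2:
        # Both towers fit: parallel block decode by default.
        return "generate_mock_ar" if debug_stepwise else "generate_mask_diffusion"
    if num_gpus == 1:
        # Only the context tower fits: plain autoregressive decoding.
        return "generate_ar"
    raise ValueError("at least one GPU is required")
```

With a loaded model, `getattr(model, pick_generation_method(torch.cuda.device_count()))` would fetch the matching method.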

Benchmarks: What 98.7% Means

Default eval: γ=0.8, S=16, BF16, 2× H100. Per-task snapshots from the model card:

Task	Nemotron-3-Nano (AR)	TwoTower (diffusion)
MMLU (5-shot)	78.56	78.24
MMLU-Pro (CoT)	62.59	60.93
ARC-Challenge	91.72	92.66
HumanEval	79.27	75.58
GSM8K	92.49	90.14
MATH-500	84.40	80.60
Aggregate quality	100%	98.7%
Throughput	1.0×	2.42×

Reading the table: Knowledge and reading tasks hold almost perfectly, with MMLU slipping 0.32 points and ARC-Challenge rising 0.94. Code and math take the biggest hits: HumanEval drops ~3.7 points, MATH-500 ~3.8, and GSM8K 2.35. If your workload is agentic coding loops, benchmark on your repo before swapping decoders.
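The per-task gaps are easy to recompute from the table. This sketch uses only the six rows listed above. The unweighted mean of per-task ratios is one possible way to aggregate, not necessarily NVIDIA's, so it need not reproduce the 98.7% headline, which presumably covers a wider suite.

```python
# (AR baseline, TwoTower diffusion) scores copied from the table above.
SCORES = {
    "MMLU (5-shot)": (78.56, 78.24),
    "MMLU-Pro (CoT)": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "HumanEval": (79.27, 75.58),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
}


def point_deltas(scores):
    """Absolute change in points (TwoTower minus AR) per task."""
    return {task: round(tt - ar, 2) for task, (ar, tt) in scores.items()}


def retention(scores):
    """TwoTower score as a fraction of the AR score, per task."""
    return {task: tt / ar for task, (ar, tt) in scores.items()}


def mean_retention(scores):
    """Unweighted mean of per-task retention (an assumed aggregation)."""
    ratios = retention(scores)
    return sum(ratios.values()) / len(ratios)


deltas = point_deltas(SCORES)
worst_task = min(deltas, key=deltas.get)  # largest regression in points
```

On these six rows the largest regression is MATH-500 and the unweighted mean retention lands near 97.7%, a reminder that an aggregate depends on which tasks go into it.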

Compared with other parallel-generation efforts: Mercury 2 chases 1,000+ tok/sec as a commercial API; DiffusionGemma targets 4× on small open models. TwoTower's angle is to keep Nemotron Nano-class quality at ~2.4× throughput without abandoning the pretrained stack.

Run It Locally

Transformers (two-GPU diffusion)

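import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code loads the custom two-tower generation code from the model repo.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# AR/context tower on the first GPU, denoiser tower on the second.
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
# The prompt goes to cuda:0, where the context tower lives.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

outputs = model.generate_mask_diffusion(
    inputs["input_ids"],
    max_new_tokens=128,
    block_size=16,             # positions filled per block
    steps_per_block=16,        # maximum denoising iterations per block
    mask_token_id=3,
    temperature=0.1,
    confidence_threshold=0.8,  # γ: commit positions above this confidence
    eos_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))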

Requires trust_remote_code=True — custom generation paths live in the model repo.

vLLM

pip install vllm
vllm serve "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"

This serves an OpenAI-compatible /v1/completions endpoint on port 8000.
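Once the server is up, any OpenAI-style completions client should work. Below is a minimal standard-library sketch. It assumes the server is reachable at `http://localhost:8000` and that the model is registered under its Hugging Face name; `build_completion_request` and `complete` are illustrative helper names. Which decoding mode the server uses internally is not something this client controls or verifies.

```python
import json
import urllib.request

MODEL = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"


def build_completion_request(prompt, max_tokens=128, temperature=0.1,
                             base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("France is a country ")` returns the continuation as a string. Since this is a base model, use raw completions rather than a chat endpoint.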

SGLang and Docker

The Hugging Face model card includes SGLang launch commands and docker model run hf.co/nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16. Quantized variants for llama.cpp, Ollama, and LM Studio are listed under model quantizations.

Context window: up to 128K tokens in/out — same class as the Nano backbone, relevant for long agent harness runs.

Why NVIDIA Cares Now

Nemotron is NVIDIA's open-weights agent stack — Nano for edge/single-GPU, Super/Ultra for datacenter agents. Computex 2026 put Nemotron 3 Ultra at the center of the agentic story.

TwoTower attacks the other bottleneck: once you have a capable 30B MoE, decoding speed still limits interactive agents and high-QPS APIs. Splitting the model lets NVIDIA experiment with diffusion decoding without throwing away 25T tokens of pretraining.

For teams already on Nemotron Nano or NIM containers, TwoTower is the research preview of "same brain, faster mouth."

Limitations

Two GPUs for default mode — not a laptop model; AR-only fallback is sequential.
~60B total weights — you pay storage and memory for both towers even though only ~3B activate per token per tower.
Code/math regression — aggregate 98.7% hides task-level drops; coding agents should validate.
Early release — Hugging Face notes a config parsing warning on config.json; expect rough edges in vLLM/SGLang integrations.
Not a chat-tuned instruct model — this is Base-BF16; wrap with your own SFT/RL pipeline for assistants.
Diffusion hype cycle — parallel decode helps generation-bound workloads; single-turn chat often waits on the human reader, not the model (same caveat as Mercury 2).
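The two-GPU requirement in the list above follows from simple weight arithmetic. This back-of-the-envelope sketch counts weights only and uses the nominal 30B parameter count, so it ignores activations, KV cache, and Mamba states; treat the outputs as ballpark figures rather than measured footprints.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Weight-only memory in decimal gigabytes (BF16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def weight_memory_gib(params_billion, bytes_per_param=2):
    """Same estimate in binary gibibytes, the unit most GPU tools report."""
    return params_billion * 1e9 * bytes_per_param / 2**30


per_tower_gb = weight_memory_gb(30)    # one tower in BF16
both_towers_gb = weight_memory_gb(60)  # context tower + denoiser tower
fits_on_one_80gb_gpu = both_towers_gb <= 80
```

One tower comes to about 60 GB (roughly 56 GiB) of weights, in the same range as the per-GPU figure quoted earlier, while both towers together exceed a single 80GB card before any cache is allocated.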

Where Parallel Decode Actually Wins

TwoTower is most compelling when tokens out × calls per task dominates your bill:

Loop engineering harnesses with dozens of short model turns
Tool-heavy agents that emit long structured JSON each step
Batch summarization and log compression at scale
Serving Nemotron Nano-class quality when AR decoding is the QPS cap

It is less compelling for single long CoT answers where you read slower than either decoder generates.
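A quick way to judge whether a workload is generation-bound is an Amdahl's-law estimate: only the share of wall-clock time spent generating tokens benefits from the 2.42× figure. The function below is a simplified model that assumes prefill, tool calls, and network time are unaffected; `gen_fraction` is something you would measure on your own traces.

```python
def end_to_end_speedup(gen_fraction, gen_speedup=2.42):
    """Overall speedup when only the generation share of time gets faster.

    gen_fraction: share of task wall-clock time spent generating tokens (0..1).
    gen_speedup: generation throughput multiplier (2.42 is NVIDIA's reported figure).
    """
    if not 0.0 <= gen_fraction <= 1.0:
        raise ValueError("gen_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - gen_fraction) + gen_fraction / gen_speedup)


def task_latency_s(calls, overhead_s, tokens_out, tokens_per_s):
    """Total seconds for a task: calls x (fixed overhead + generation time)."""
    return calls * (overhead_s + tokens_out / tokens_per_s)
```

If generation is half of each task's wall-clock time, the end-to-end gain is about 1.42× rather than 2.42×. The full figure only applies when decoding dominates, as in the harness and batch workloads listed above.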
