1,009 Tokens Per Second: Mercury 2 and What Diffusion LLMs Change for Agent Loops

Inception's Mercury 2 is a diffusion-based language model hitting 1,009 tokens per second on Blackwell GPUs — over 5x faster than autoregressive models at competitive quality. It is not fast inference via quantization or speculative decoding. It is a fundamentally different generation algorithm. Here is what that means for the AI applications where latency compounds.

Jun 23, 2026·7 min read·Yash Thakker
Tags: AI Models · LLM · AI Agents · Generative AI · Inference Speed

Every agentic workflow is a latency multiplication problem.

If your agent makes 20 LLM calls per task, and each call takes 3 seconds for generation, that's 60 seconds of model waiting per task — before any actual work happens. Cut that to 0.6 seconds per call and you cut 60 seconds to 12. The task goes from slow to fast enough to rethink what's worth automating.

Mercury 2 — 1,009 tokens per second on NVIDIA Blackwell GPUs — is built for exactly that arithmetic.

But the speed is not from the usual tricks. It's from a fundamentally different generation algorithm.


What Diffusion LLMs Actually Do

Standard language models are autoregressive: they generate tokens one at a time, left to right. Token N depends on tokens 1 through N-1, so generation is inherently sequential: you cannot generate token 5 until you have tokens 1 through 4. Generation time therefore grows linearly with output length.

Diffusion models work differently. They were originally developed for image generation (Stable Diffusion, DALL-E 3), where the process starts with random noise and iteratively denoises it toward a coherent image. Each denoising step refines the entire image simultaneously.

Mercury 2 applies this approach to text:

  1. Start with a noisy token sequence of the target length
  2. Run a refinement pass over all positions simultaneously
  3. Repeat for a small number of steps until the sequence converges
  4. Output the result

The key distinction: token positions are refined in parallel, not generated sequentially. The speed advantage is architectural. It's not approximating a slower process — it's a different process with a different scaling behavior.
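
To make the scaling difference concrete, here is a toy sketch that only counts sequential model passes. It is not Inception's algorithm: the stand-in "model" simply copies from a known target, and the names `autoregressive_decode` and `parallel_refine` are hypothetical. The point is that one loop runs once per token while the other runs once per refinement step.

```python
import random

MASK = "_"  # placeholder for a position that has not been resolved yet


def autoregressive_decode(target):
    """One sequential pass per token, left to right."""
    out, passes = [], 0
    for tok in target:
        out.append(tok)  # stand-in for "predict the next token"
        passes += 1
    return out, passes


def parallel_refine(target, steps, seed=0):
    """Start fully masked; each pass may update many positions at once."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    order = list(range(len(target)))
    rng.shuffle(order)                   # positions resolve in no fixed order
    per_step = -(-len(target) // steps)  # ceiling division
    passes = 0
    for s in range(steps):
        # Stand-in for one refinement pass over the whole sequence.
        for i in order[s * per_step:(s + 1) * per_step]:
            seq[i] = target[i]
        passes += 1
    return seq, passes


target = "the quick brown fox jumps over the lazy dog".split()
_, ar_passes = autoregressive_decode(target)
_, pr_passes = parallel_refine(target, steps=3)
print(f"autoregressive passes: {ar_passes}, refinement passes: {pr_passes}")
```

In a real diffusion LLM each pass is a full forward pass of the network over every position, so a single pass costs more than one autoregressive step. The gain comes from needing far fewer sequential passes.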

Inception describes it as "less typewriter, more editor revising a full draft at once." The analogy is apt. An autoregressive model writes word by word. Mercury 2 starts with a draft and edits the whole thing simultaneously until it's coherent.


The Numbers

| Metric | Value |
|---|---|
| Generation speed | 1,009 tokens/sec (NVIDIA Blackwell) |
| Speed advantage vs autoregressive | >5x |
| Input pricing | $0.25/1M tokens |
| Output pricing | $0.75/1M tokens |
| Context window | 128K tokens |
| Tool use | Native |
| JSON output | Schema-aligned |
| Reasoning | Tunable |

The pricing is competitive with speed-tier models. At $0.75/1M output tokens, Mercury 2 is priced comparably to many quantized fast models while generating more than 5x faster than autoregressive ones.
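
As a rough sketch of what those prices mean per task, the snippet below applies the listed input and output rates to an agent loop. The workload (20 steps, 2,000 input tokens and 200 output tokens per step) is an illustrative assumption, not a figure from Inception.

```python
INPUT_USD_PER_M = 0.25   # listed input price per 1M tokens
OUTPUT_USD_PER_M = 0.75  # listed output price per 1M tokens


def request_cost_usd(input_tokens, output_tokens):
    """Cost of a single call at the listed rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def task_cost_usd(steps, input_tokens_per_step, output_tokens_per_step):
    """Cost of an agent task made of identical calls."""
    return steps * request_cost_usd(input_tokens_per_step, output_tokens_per_step)


# Hypothetical workload: 20 steps, 2,000 tokens in and 200 tokens out per step.
print(f"${task_cost_usd(20, 2000, 200):.4f} per task")
```

Note that in this example input tokens dominate the bill, which is typical for agent loops that resend context on every step.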


Why Speed Compounding Changes the Calculus

A single LLM call at 1,009 tok/sec vs 200 tok/sec: you notice the difference, but it's not transformative.

A multi-step agent loop changes the math:

| Scenario | 200 tok/sec | 1,009 tok/sec |
|---|---|---|
| Single 200-token response | 1.0 sec | 0.2 sec |
| 10-step agent loop (200 tokens/step) | 10 sec | 2 sec |
| 50-step agent loop (200 tokens/step) | 50 sec | 10 sec |
| Real-time voice transcript (continuous) | Falls behind | Keeps up |
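
The timed rows in the table follow from dividing total output tokens by throughput. A minimal sketch, assuming pure generation time only (no prompt processing, network latency, or tool execution); `generation_seconds` is a hypothetical helper name:

```python
def generation_seconds(steps, tokens_per_step, tokens_per_second):
    """Time spent generating output tokens across a whole loop."""
    return steps * tokens_per_step / tokens_per_second


for steps in (1, 10, 50):
    slow = generation_seconds(steps, 200, 200)
    fast = generation_seconds(steps, 200, 1009)
    print(f"{steps:>2} step(s) x 200 tokens: {slow:5.1f} s vs {fast:5.1f} s")
```

The table rounds the results to the nearest second for the longer loops.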

The speed advantage doesn't save time uniformly. It saves time in proportion to how many inference calls you stack. This makes Mercury 2 especially valuable for:

Coding tools: Autocomplete and next-edit suggestions need to land before the developer moves on. A suggestion that takes 2 seconds arrives after the developer has already typed ahead. At 1,009 tok/sec, generating a short completion takes tens of milliseconds.

Agent loops: Agentic workflows that chain dozens of inference calls per task benefit the most. The gain is not only raw speed: faster loops allow more steps within the same latency budget, which means better quality through more iteration.

Voice interfaces: Voice pipelines have some of the tightest latency budgets in AI. Natural speech cadence allows about 200 ms between turns before the pause becomes noticeable. Mercury 2's speed makes reasoning-quality responses viable within that window.

RAG pipelines: Multi-hop retrieval, reranking, and summarization latencies stack. At 1,009 tok/sec, adding reasoning to the search loop becomes possible without blowing the latency budget.
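
The latency-budget argument can be turned around: given a fixed budget, how much generation fits? The sketch below is illustrative only. The helper names are hypothetical, and it counts generation time alone, ignoring time to first token, network, and tool calls.

```python
def tokens_within(budget_seconds, tokens_per_second):
    """Whole output tokens that can be generated inside a time budget."""
    return int(budget_seconds * tokens_per_second)


def steps_within(budget_seconds, tokens_per_step, tokens_per_second):
    """Whole agent steps that fit inside a time budget."""
    return int(budget_seconds * tokens_per_second // tokens_per_step)


# Voice: roughly 200 ms between turns.
print(tokens_within(0.2, 200), "vs", tokens_within(0.2, 1009), "tokens in 200 ms")

# Agent loop: a 10-second budget with 200-token steps.
print(steps_within(10, 200, 200), "vs", steps_within(10, 200, 1009), "steps in 10 s")
```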


What the Quality Tier Actually Is

Inception positions Mercury 2 as competitive with "leading speed-optimized models." That's the honest bracket: not frontier reasoning (Claude Opus 4.8, GPT-5.5), but competitive with fast models such as Gemini Flash, GPT-4o-mini, or a served Llama-3.1 8B.

What this means practically:

| Use case | Mercury 2 fit |
|---|---|
| Code autocomplete | Strong: speed is the primary value |
| Agent loop reasoning (non-critical) | Strong |
| Voice response generation | Strong |
| RAG summarization | Strong |
| Frontier reasoning (complex math, code) | Not the right tool |
| Long-horizon planning | Not the right tool |
| Deep analysis requiring extended context comprehension | Depends: test it |

The tunable reasoning feature (the reasoning_effort parameter in the OpenAI-compatible API) lets you trade some speed for more reasoning quality within Mercury 2 itself, which widens the range of use cases it can cover.
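
A minimal sketch of how that parameter might be passed, assuming an OpenAI SDK client pointed at Inception's endpoint. The helper names `build_request` and `ask` are hypothetical, the accepted effort values are an assumption to verify against Inception's documentation, and so is whether every model honors the parameter.

```python
def build_request(prompt, effort=None, model="mercury-coder-small"):
    """Build keyword arguments for client.chat.completions.create."""
    kwargs = {
        "model": model,  # the only model ID this article names
        "messages": [{"role": "user", "content": prompt}],
    }
    if effort is not None:
        # Assumption: values such as "low" or "high" are accepted.
        kwargs["reasoning_effort"] = effort
    return kwargs


def ask(client, prompt, effort=None):
    """Send the request with an already configured OpenAI-compatible client."""
    response = client.chat.completions.create(**build_request(prompt, effort))
    return response.choices[0].message.content


print(build_request("Classify this ticket", effort="low"))
```

Keeping the request builder separate from the network call makes it easy to pick a lower effort for high-frequency loop steps and a higher one for the few steps that need it.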


Real-World Validation

The most meaningful signal is who is using it:

Zed editor (Max Brunsfeld, Co-Founder): "Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for." — The autocomplete use case where speed determines whether the tool is useful at all.

Skyvern (Suchintan Singh, CTO): "Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us." — Agent automation where generation speed compounds across task steps.

Wispr Flow (Sahaj Garg, CTO): "No other model has come close to the speed Mercury can provide!" — Real-time transcript cleanup that must run at speech rate.

OpenCall (Oliver Silverstein, CEO): "Mercury 2 quality is excellent, and the model's low latency enables more responsive voice agents." — Voice agents where response delay destroys the conversational feel.

The pattern: every validated use case involves either real-time interaction (voice, autocomplete) or agentic loops where generation calls compound. These are the cases where the speed advantage is load-bearing, not marginal.


The OpenAI-Compatible API

Mercury 2 exposes an OpenAI-compatible API:

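from openai import OpenAI

# Point the standard OpenAI SDK at Inception's endpoint.
client = OpenAI(
    api_key="your_inception_api_key",  # placeholder; load from a secret store in real code
    base_url="https://api.inceptionlabs.ai/v1",
)

# Same chat-completions call shape as with the OpenAI API.
response = client.chat.completions.create(
    model="mercury-coder-small",
    messages=[{"role": "user", "content": "Write a Python function to parse JSON safely"}],
)

# The generated text is in the first choice.
print(response.choices[0].message.content)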

Because the API follows the OpenAI schema, existing OpenAI integrations should work after changing the base URL, API key, and model name, with no rewrites of the calling code.
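
Published throughput figures are measured on specific hardware, so it is worth measuring what your own integration sees. The sketch below times one call end to end. It assumes the response includes the standard `usage.completion_tokens` field, and `measure` is a hypothetical helper that takes the `client` from the snippet above.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Observed output throughput for one call."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(client, model, prompt):
    """Time a single chat completion and return end-to-end tokens per second."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Assumption: the endpoint reports token usage like the OpenAI API does.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Because the timer wraps the whole request, the result includes network and queueing time and will come in below the raw generation speed, especially for short outputs.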

Available models:

  • mercury-coder-small — fastest, best for autocomplete and short tasks
  • Standard model — balanced quality/speed for most agent use cases

When to Use Mercury 2 vs Frontier Models

The decision is not "is Mercury 2 good?" It is "does this use case need Mercury 2's specific advantage?"

Use Mercury 2 when:

  • Your pipeline has 10+ chained inference calls
  • Real-time responsiveness is required (voice, autocomplete)
  • You're optimizing for throughput at scale with speed-tier quality requirements
  • Latency is a hard constraint, not a preference

Use frontier models (Claude, GPT-5.5) when:

  • Reasoning depth matters more than speed
  • The task is a single, complex prompt — not a loop
  • You need the best quality output, not the fastest adequate output
  • Generated code needs to be correct, not just fast

Many production systems will end up using both: Mercury 2 for the high-frequency loop steps that don't need frontier quality, and frontier models for the final synthesis or critical reasoning steps.
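
One way to sketch that split is a small router that picks a model per step. Everything here is illustrative: the step names, the `pick_model` helper, and the `frontier-model` ID are hypothetical placeholders, and only `mercury-coder-small` comes from this article.

```python
FAST_MODEL = "mercury-coder-small"  # model ID named in this article
FRONTIER_MODEL = "frontier-model"   # hypothetical placeholder ID

# Hypothetical step kinds that tolerate speed-tier quality.
FAST_STEPS = {"extract", "classify", "summarize", "autocomplete"}


def pick_model(step_kind, latency_critical=False):
    """Route high-frequency or latency-bound steps to the fast model."""
    if latency_critical or step_kind in FAST_STEPS:
        return FAST_MODEL
    return FRONTIER_MODEL


plan = [("extract", False), ("summarize", False), ("final_synthesis", False)]
models = [pick_model(kind, critical) for kind, critical in plan]
print(models)
```

A static lookup like this is the simplest option. Routing on measured quality per step is a natural refinement once you have evaluation data.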


Getting Started

Try Mercury 2 at chat.inceptionlabs.ai or via API at api.inceptionlabs.ai. The API is OpenAI-compatible — replace api.openai.com with the Inception endpoint and update the API key.


Related

  • AI models directory — full landscape of language models including speed comparisons
  • AI agent tools — autonomous agent tools that benefit from fast inference
  • AI skills registry — reusable skills for agent pipelines
