What is xAI Voice Agent Builder?

Voice Agent Builder is xAI's no-code beta platform for configuring production voice agents on Grok Voice. It bundles telephony, knowledge retrieval, tools, guardrails, MCP connectors, and call observability in one console — so operators do not stitch together separate speech-to-text, LLM, and text-to-speech APIs.

How much does Grok Voice Agent Builder cost?

Agents bill at the Grok Voice API rate of $0.05 per minute of audio, with voices included and no separate platform fee. Telephony on a free provisioned number adds $0.01/min. xAI positions this as fewer meters than stacks that bill recognition, reasoning, synthesis, and platform separately.

Does Voice Agent Builder include a phone number?

Yes. Every account includes a free phone number for testing or production traffic. You can also bring an existing number over SIP from major telephony providers, or connect your own client over WebSocket without using telephony at all.

How does Grok Voice score on voice-agent benchmarks?

On xAI's τ-voice Bench (hard telephony audio, accents, interruptions, multi-tool workflows), Grok Voice Think Fast 1.0 scored 67.3% overall versus Gemini 3.1 Flash Live at 43.8% and GPT Realtime 1.5 at 35.3%. xAI publishes category splits for retail, airline, and telecom scenarios.

Can I connect MCP servers to a Grok voice agent?

Yes. Voice Agent Builder supports MCP connectors alongside REST API tools, calendar and email integrations, web search, X search, Linear, Notion, and cloud file sources — the same MCP ecosystem xAI documented for Grok and X API clients in June 2026.

How does this compare to OpenAI GPT-Realtime-2?

OpenAI's [GPT-Realtime-2](/blog/openai-gpt-realtime-2-voice-models-api-2026) is a developer API for speech-to-speech models with reasoning tiers and per-token pricing. xAI's Voice Agent Builder is an operator-facing platform on top of Grok Voice with telephony, no-code setup, and flat per-minute billing — closer to a turnkey contact-center stack than a raw model endpoint.

Grok Voice Agent Builder: $0.05/min No-Code Agents | explainx.ai Blog


On July 1, 2026, xAI announced Voice Agent Builder in beta: a no-code platform to configure production voice agents on Grok Voice — telephony, knowledge bases, tools, MCP connectors, guardrails, and call observability in one place. Pricing starts at $0.05/min of audio with voices included; a free phone number ships with every account.

The pitch is familiar to anyone who has shipped phone agents: most stacks stitch three APIs — speech-to-text, a language model, and text-to-speech — often from different vendors. Each hop adds latency, cost, and failure modes. xAI's answer is a single speech-to-speech path built for Grok Voice, exposed through an operator console rather than a bag of SDKs.

TL;DR — what people are asking

Question	Answer
When did it launch?	July 1, 2026 — beta at console.x.ai/voice/agents
What does it cost?	$0.05/min audio (API rate, voices included) + $0.01/min on free provisioned telephony
Do I get a phone number?	Yes — free number per account; SIP import or WebSocket client also supported
Is it really no-code?	Plain-language agent prompt + document uploads + tool/MCP wiring; xAI claims ~2 minutes to first agent
MCP support?	Yes — alongside API tools, calendars, X search, Linear, Notion, Drive, OneDrive
Benchmark claim?	τ-voice Bench: Grok Voice Think Fast 1.0 67.3% vs Gemini 3.1 Flash Live 43.8%, GPT Realtime 1.5 35.3%
vs GPT-Realtime-2?	OpenAI = model API; xAI = platform + telephony + ops on Grok Voice — see comparison below

Why xAI built a platform, not just a model

Voice agents that handle real phone calls face conditions demo videos skip:

Low-quality PSTN audio, background noise, strong accents
Interruptions and mid-sentence topic changes
Ambiguous workflows spanning dozens of tools
25+ languages on the same line

xAI says Grok Voice was trained on that traffic. They publish τ-voice Bench scores under those conditions:

Model	Overall (τ-voice Bench)
Grok Voice Think Fast 1.0	67.3%
Gemini 3.1 Flash Live	43.8%
GPT Realtime 1.5	35.3%

Category breakdowns (retail, airline, telecom) appear on x.ai/voice. Treat vendor benchmarks as directional — run your hardest call flows before production — but the gap xAI quotes is large enough to force a bake-off if you are re-platforming support lines.
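One way to run that bake-off is to place the same hard call flows against each vendor and compare pass rates. The following is a minimal sketch, assuming you log one pass/fail result per vendor per scenario; the tuple format, vendor labels, and scenario names are hypothetical placeholders, not benchmark data.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {vendor: fraction of scenarios passed}.

    `results` is an iterable of (vendor, scenario, passed) tuples
    collected from your own test calls (hypothetical log format).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for vendor, _scenario, ok in results:
        total[vendor] += 1
        if ok:
            passed[vendor] += 1
    return {vendor: passed[vendor] / total[vendor] for vendor in total}


# Placeholder results for illustration only, not measured data.
sample = [
    ("vendor_a", "refund_with_interruption", True),
    ("vendor_a", "booking_change_noisy_line", False),
    ("vendor_b", "refund_with_interruption", True),
    ("vendor_b", "booking_change_noisy_line", True),
]
rates = pass_rates(sample)
```

Scoring on your own scenarios keeps the comparison tied to the calls you actually receive rather than to a vendor's test set.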

Two minutes to an agent — what the console actually includes

xAI's setup flow:

Write a plain-language prompt — how calls should flow; Grok reasons in real time over long instructions.
Attach a knowledge base — upload docs (Markdown, Word, PowerPoint, Excel, HTML, JSON, plain text, and others). Collections can be shared across agents so policies and runbooks live in one place.
Wire tools — calendars (Google, Outlook), email, custom API requests, web search, X search, tickets (Linear, Notion), files (Drive, OneDrive).
Add MCP connectors — reuse servers you already run; same week xAI pushed hosted MCP for the X API (our setup guide).
Pick a voice — 80+ built-in voices or a ~2-minute clone from brand audio.
Assign a number — free provisioned line, SIP for an existing carrier number, or browser test without dialing.

Actions during calls: look up records, change CRM rows, transfer to a human, end the call, and send real-time notifications so a team can see tool use and intervene.
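A "look up records" action implies a backend endpoint for a custom API tool to call. Below is a rough sketch of such an endpoint using only the Python standard library. The request and response shapes (`order_id`, `status`, `eta`), the `ORDERS` table, and the port are all assumptions for illustration; the announcement does not specify a payload schema, so check the console's tool configuration before relying on any of this.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a real order database (hypothetical data).
ORDERS = {"A1001": {"status": "shipped", "eta": "2026-07-05"}}


def lookup_order(payload):
    """Return (HTTP status, JSON-serializable body) for a lookup request."""
    if not isinstance(payload, dict):
        return 400, {"error": "JSON object expected"}
    order_id = payload.get("order_id")
    if not order_id:
        return 400, {"error": "order_id is required"}
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"error": "order not found"}
    return 200, {"order_id": order_id, **order}


class ToolHandler(BaseHTTPRequestHandler):
    """Minimal POST handler that wraps lookup_order."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None  # rejected with a 400 by lookup_order
        status, body = lookup_order(payload)
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


# To serve locally:
# HTTPServer(("127.0.0.1", 8080), ToolHandler).serve_forever()
```

Keeping the lookup logic in a plain function makes it testable without starting a server, and a production endpoint would add authentication in front of it.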

After calls: recordings, transcripts, and a trace of which tools fired. Guardrails block off-script topics or sensitive reads (e.g. reading back full card numbers).
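Guardrails in the console are configured by xAI's own mechanism, which the announcement does not detail. As a complementary layer on your side, a tool backend can mask card numbers before any text reaches the agent or a transcript store. This is a minimal sketch of that idea, assuming card numbers appear as 13 to 19 digits optionally separated by spaces or hyphens; the function names are illustrative.

```python
import re

# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_cards(text):
    """Mask likely card numbers, keeping only the last four digits."""

    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        if not luhn_ok(digits):
            return match.group()  # leave non-card digit runs untouched
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_PATTERN.sub(_mask, text)
```

The Luhn check reduces false positives on order IDs and phone numbers, but it is not a substitute for a PCI review of where card data flows.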

Pricing — one meter vs three

Line item	Rate (xAI, July 2026)
Grok Voice audio (agent runtime)	$0.05 / min
Voices	Included (no separate TTS meter)
Platform fee	None (per xAI announcement)
Telephony (free provisioned number)	+$0.01 / min

xAI contrasts this with stacks that bill STT + LLM + TTS + platform separately. Your finance team should still model tool-call costs, MCP/backend API usage, and human transfer time — the $0.05/min line is the voice path, not the whole COGS story.
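To turn the voice-path rates into a budget line, the arithmetic is minutes times the per-minute rate. The sketch below uses the two rates from the table; the function name and the call-volume inputs are illustrative, and tool, backend, and human-transfer costs are deliberately left out.

```python
AUDIO_RATE = 0.05      # USD per minute of Grok Voice audio (per the table)
TELEPHONY_RATE = 0.01  # USD per minute on the free provisioned number


def monthly_voice_cost(calls_per_month, avg_minutes, use_provisioned_number=True):
    """Estimate the monthly voice-path cost in USD, rounded to cents."""
    minutes = calls_per_month * avg_minutes
    rate = AUDIO_RATE + (TELEPHONY_RATE if use_provisioned_number else 0.0)
    return round(minutes * rate, 2)
```

For example, 10,000 calls averaging 4 minutes come to $2,400 a month on the provisioned number and $2,000 over your own WebSocket client.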

For comparison, OpenAI GPT-Realtime-2 uses per-token audio pricing ($32/M input audio tokens, $64/M output, as of its May 2026 launch) plus whatever telephony vendor you add yourself.
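Comparing per-token pricing with a flat per-minute rate requires knowing how many audio tokens a minute of conversation consumes, which depends on the model and on how much each side talks. The sketch below converts the token prices quoted above into a per-minute figure. The tokens-per-minute inputs are an assumption you would need to measure on your own traffic; the article does not provide them.

```python
INPUT_RATE = 32 / 1_000_000   # USD per input audio token (figure quoted above)
OUTPUT_RATE = 64 / 1_000_000  # USD per output audio token (figure quoted above)


def per_minute_cost(input_tokens_per_min, output_tokens_per_min):
    """Convert measured token throughput into USD per minute of call time."""
    return (input_tokens_per_min * INPUT_RATE
            + output_tokens_per_min * OUTPUT_RATE)


# Placeholder throughput for illustration only, not a measured value.
example = per_minute_cost(600, 600)
```

With those placeholder inputs the result is roughly $0.058 per minute before telephony. The real number can land well above or below that, depending on measured token usage.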

vs OpenAI GPT-Realtime-2 and stitched stacks

Dimension	xAI Voice Agent Builder	OpenAI Realtime API (GPT-Realtime-2)	DIY STT + LLM + TTS
Primary buyer	Operators, support leads, light devs	Developers building custom clients	Platform teams with integration budget
Telephony	Built-in (+ free number)	Bring your own	Bring your own
Setup	No-code console	Code + WebSocket	Three integrations minimum
MCP	First-class in console	Via your orchestration layer	Via your orchestration layer
Pricing shape	$/min flat on audio	Per-token audio + your infra	Three vendor meters
Model path	Speech-to-speech Grok Voice	Speech-to-speech GPT-Realtime-2	Often three-hop latency

Neither replaces the other automatically: teams already deep on OpenAI Realtime keep their client investments; teams drowning in Twilio + Whisper + GPT + ElevenLabs glue may prefer xAI's bundle.

What X users asked in the first hours (and honest answers)

From the launch thread and replies — the questions that actually showed up:

"How do I get the free phone number?"
Sign up at console.x.ai/voice/agents, create an agent, and provision a number in-console. xAI states every account includes one; exact UI steps may shift during beta.

"Can it navigate phone trees (IVR)?"
Not highlighted in the launch post. Assume outbound/inbound agent flows first; IVR traversal is a verify-in-beta item if your use case is "press 2 for billing."

"Can it call my Hermes agent / sub tools via API?"
Custom API tools and MCP are supported — if Hermes exposes HTTP or MCP, wiring is plausible. There is no Hermes-specific connector in the announcement; plan an integration test.

"Personal assistant for unknown callers → email me details"
Early user reports (e.g. capture name, phone, purpose → email) match the tool + notification pattern Voice Agent Builder describes. Good first-hour beta project; add guardrails before exposing a personal line publicly.
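The capture-then-email pattern can be sketched on the receiving side with the Python standard library. This assumes a custom API tool hands your backend the three captured fields; the field names, subject line, and addresses are hypothetical, and actually sending the message (for example via `smtplib`) is left out.

```python
from email.message import EmailMessage


def build_caller_summary(name, phone, purpose, to_addr, from_addr):
    """Build a plain-text email summarizing an unknown caller."""
    msg = EmailMessage()
    msg["Subject"] = f"Missed call from {name}"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(f"Name: {name}\nPhone: {phone}\nPurpose: {purpose}\n")
    return msg


# Example values are placeholders.
summary = build_caller_summary(
    "Dana", "+1 555 0100", "Quote for kitchen remodel",
    to_addr="me@example.com", from_addr="agent@example.com",
)
```

Because caller-supplied text ends up in the message, validate or truncate the fields before putting them in headers.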

"Is this just a wrapper?"
One reply on X: "functionality wrapper with smart defaults." Fair framing — the differentiation is Grok Voice quality + integrated telephony + MCP at $/min, not a novel agent paradigm. Compare to building on raw GPT-Realtime-2 where you own the wrapper.

Where this fits in the 2026 voice-agent stack

By explainx.ai's read, voice is splitting into three lanes:

Model APIs — OpenAI Realtime, open TTS like Miso-TTS, local studios like Voicebox
Agent harnesses — loop engineering, MCP directories, agent skills
Turnkey operator platforms — Voice Agent Builder lands here

If your team lives in Claude Code + MCP for coding, a Grok voice line for phone support is complementary — not a migration. Browse live MCP servers on explainx.ai/mcp-servers and skills on /skills for the coding side; use Voice Agent Builder when the channel is PSTN or browser voice, not IDE chat.

Security and production checklist

Before pointing production traffic at a beta console:

Guardrails: script boundaries, PCI-style blocks, escalation paths
Tool scopes: least-privilege API keys; MCP servers are a supply-chain attack surface
Recording retention: transcripts may contain PII; align with your data policy
Human transfer: test failure modes when the agent loops or mishears numbers
Cost caps: $0.05/min adds up on long calls, and slow tool calls extend the billed minutes; monitor the per-agent dashboards xAI ships
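The retention item in the checklist can be enforced with a scheduled job that flags recordings older than your policy window. This is a minimal sketch, assuming you keep your own index of recording IDs and creation timestamps; whether and how the console exposes recordings for export or deletion is not covered in the announcement.

```python
from datetime import datetime, timedelta, timezone


def expired_recordings(recordings, retention_days, now=None):
    """Return IDs of recordings older than the retention window.

    `recordings` is an iterable of (recording_id, created_at) pairs with
    timezone-aware datetimes (hypothetical index format).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [rec_id for rec_id, created_at in recordings if created_at < cutoff]
```

Passing `now` explicitly makes the policy easy to test against fixed dates before it deletes anything.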

For workshop-style "use Claude as a work partner" without telephony, explainx.ai runs Claude for Work live sessions — different channel, same "AI as colleague" goal.

Try it

Console: console.x.ai/voice/agents
Product page: x.ai/voice
Announcement: x.ai/news/grok-voice-agent-builder

Build one agent on your hardest workflow (refund line, booking change, tier-1 triage) and call it. Benchmark tables alone will not tell you how the agent sounds on a live call.

Related on explainx.ai

OpenAI GPT-Realtime-2: Voice Models API — speech-to-speech model tier and token pricing
X Hosted MCP: Cursor, Claude, Grok → X API — MCP in the xAI/X stack the same week
What Is MCP? Complete Guide — architecture behind Voice Agent Builder connectors
Loop Engineering: Coding Agent Loops — when voice agents need backend agent loops, not just prompts
Voicebox: Open-Source Voice Studio + MCP — self-hosted alternative lane
Claude for Work Workshop — live training for knowledge workers (no telephony required)
MCP server directory · Agent skills registry

Pricing, benchmark figures, and beta feature availability are accurate as of July 2, 2026 per xAI's announcement. Verify current rates in the xAI console before budgeting production call volume.

