xAI launched Voice Agent Builder in beta β a no-code platform for production Grok Voice agents with telephony, MCP, guardrails, and observability at $0.05/min. Benchmarks, pricing, and how it compares to stitched STT+LLM+TTS stacks.
On July 1, 2026, xAI announced Voice Agent Builder in beta: a no-code platform to configure production voice agents on Grok Voice β telephony, knowledge bases, tools, MCP connectors, guardrails, and call observability in one place. Pricing starts at $0.05/min of audio with voices included; a free phone number ships with every account.
The pitch is familiar to anyone who has shipped phone agents: most stacks stitch three APIs β speech-to-text, a language model, and text-to-speech β often from different vendors. Each hop adds latency, cost, and failure modes. xAI's answer is a single speech-to-speech path built for Grok Voice, exposed through an operator console rather than a bag of SDKs.
xAI says Grok Voice was trained on that traffic. They publish Ο-voice Bench scores under those conditions:
Model
Overall (Ο-voice Bench)
Grok Voice Think Fast 1.0
67.3%
Gemini 3.1 Flash Live
43.8%
GPT Realtime 1.5
35.3%
Category breakdowns (retail, airline, telecom) appear on x.ai/voice. Treat vendor benchmarks as directional β run your hardest call flows before production β but the gap xAI quotes is large enough to force a bake-off if you are re-platforming support lines.
Two minutes to an agent β what the console actually includes
xAI's setup flow:
Write a plain-language prompt β how calls should flow; Grok reasons in real time over long instructions.
Attach a knowledge base β upload docs (Markdown, Word, PowerPoint, Excel, HTML, JSON, plain text, and others). Collections can be shared across agents so policies and runbooks live in one place.
Wire tools β calendars (Google, Outlook), email, custom API requests, web search, X search, tickets (Linear, Notion), files (Drive, OneDrive).
Add MCP connectors β reuse servers you already run; same week xAI pushed hosted MCP for the X API (our setup guide).
Pick a voice β 80+ built-in voices or a ~2-minute clone from brand audio.
Assign a number β free provisioned line, SIP for an existing carrier number, or browser test without dialing.
Actions during calls: lookup records, change CRM rows, transfer to human, end call, and real-time notifications so a team can see tool use and intervene.
After calls: recordings, transcripts, and a trace of which tools fired. Guardrails block off-script topics or sensitive reads (e.g. reading back full card numbers).
Pricing β one meter vs three
Line item
Rate (xAI, July 2026)
Grok Voice audio (agent runtime)
$0.05 / min
Voices
Included (no separate TTS meter)
Platform fee
None (per xAI announcement)
Telephony (free provisioned number)
+$0.01 / min
xAI contrasts this with stacks that bill STT + LLM + TTS + platform separately. Your finance team should still model tool-call costs, MCP/backend API usage, and human transfer time β the $0.05/min line is the voice path, not the whole COGS story.
For comparison, OpenAI GPT-Realtime-2 uses per-token audio pricing ($32/M input audio tokens, $64/M output, as of its May 2026 launch) plus whatever telephony vendor you add yourself.
vs OpenAI GPT-Realtime-2 and stitched stacks
Dimension
xAI Voice Agent Builder
OpenAI Realtime API (GPT-Realtime-2)
DIY STT + LLM + TTS
Primary buyer
Operators, support leads, light devs
Developers building custom clients
Platform teams with integration budget
Telephony
Built-in (+ free number)
Bring your own
Bring your own
Setup
No-code console
Code + WebSocket
Three integrations minimum
MCP
First-class in console
Via your orchestration layer
Via your orchestration layer
Pricing shape
$/min flat on audio
Per-token audio + your infra
Three vendor meters
Model path
Speech-to-speech Grok Voice
Speech-to-speech GPT-Realtime-2
Often three-hop latency
Neither replaces the other automatically: teams already deep on OpenAI Realtime keep their client investments; teams drowning in Twilio + Whisper + GPT + ElevenLabs glue may prefer xAI's bundle.
What X users asked in the first hours (and honest answers)
From the launch thread and replies β the questions that actually showed up:
"How do I get the free phone number?"
Sign up at console.x.ai/voice/agents, create an agent, and provision a number in-console. xAI states every account includes one; exact UI steps may shift during beta.
"Can it navigate phone trees (IVR)?"
Not highlighted in the launch post. Assume outbound/inbound agent flows first; IVR traversal is a verify-in-beta item if your use case is "press 2 for billing."
"Can it call my Hermes agent / sub tools via API?" Custom API tools and MCP are supported β if Hermes exposes HTTP or MCP, wiring is plausible. There is no Hermes-specific connector in the announcement; plan an integration test.
"Personal assistant for unknown callers β email me details"
Early user reports (e.g. capture name, phone, purpose β email) match the tool + notification pattern Voice Agent Builder describes. Good first-hour beta project; add guardrails before exposing a personal line publicly.
"Is this just a wrapper?"
One reply on X: "functionality wrapper with smart defaults." Fair framing β the differentiation is Grok Voice quality + integrated telephony + MCP at $/min, not a novel agent paradigm. Compare to building on raw GPT-Realtime-2 where you own the wrapper.
Where this fits in the 2026 voice-agent stack
Voice is splitting into three lanes on explainx.ai's read:
Model APIs β OpenAI Realtime, open TTS like Miso-TTS, local studios like Voicebox
Turnkey operator platforms β Voice Agent Builder lands here
If your team lives in Claude Code + MCP for coding, a Grok voice line for phone support is complementary β not a migration. Browse live MCP servers on explainx.ai/mcp-servers and skills on /skills for the coding side; use Voice Agent Builder when the channel is PSTN or browser voice, not IDE chat.
Security and production checklist
Before pointing production traffic at a beta console:
For workshop-style "use Claude as a work partner" without telephony, explainx.ai runs Claude for Work live sessions β different channel, same "AI as colleague" goal.
Build one agent on your hardest workflow β refund line, booking change, tier-1 triage β and call it. Voice quality does not survive benchmark tables alone.
Pricing, benchmark figures, and beta feature availability are accurate as of July 2, 2026 per xAI's announcement. Verify current rates in the xAI console before budgeting production call volume.