xAI Grok Audio APIs Complete Guide — TTS ($4.20/M chars) + STT ($0.10/hour) Undercutting Competitors by 60% [2026]
Complete guide to xAI's Grok TTS and STT APIs, officially bundled on April 17, 2026. TTS at $4.20/1M characters and STT at $0.10/hour (batch) undercut competitors by roughly 60%. Grok STT achieves a 5.0% entity recognition error rate, the lowest among the services benchmarked. Covers API usage, benchmarks, and real-world use cases.
What Are xAI Grok Audio APIs? — TTS + STT Bundled at 60% Below Competitors
On April 17, 2026, xAI officially launched Grok TTS (Text-to-Speech) and Grok STT (Speech-to-Text) as a bundled audio API suite. At $4.20 per 1M characters for TTS and $0.10/hour for STT (batch), xAI undercuts ElevenLabs, Deepgram, and AssemblyAI by approximately 60%. In a phone call benchmark, Grok STT achieves a 5.0% entity recognition error rate, roughly 2.4 to 4.3 times lower than the competitors tested (12.0% to 21.3%). The same technology already powers Grok Voice, Tesla in-vehicle systems, and Starlink customer support.
Both APIs at a Glance
| API | Function | Price | Highlights |
|---|---|---|---|
| Grok TTS | Text → Speech | $4.20 / 1M chars | 5 voices, 20+ languages, inline tags, MP3/WAV/PCM/G.711 |
| Grok STT | Speech → Text | $0.10/hour (batch), $0.20/hour (streaming) | 25+ languages, word timestamps, speaker diarization |
[Figure: Grok Audio API Ecosystem Overview]
Grok TTS: 5 Voice Profiles
Grok TTS offers five fixed voice profiles:
| Voice ID | Character | Recommended Use |
|---|---|---|
| ara | Clear and professional | Customer support, business announcements |
| eve | Warm and approachable | e-Learning, virtual assistants |
| leo | Calm and authoritative male | Narration, podcasts |
| rex | Energetic male | Game NPCs, entertainment content |
| sal | Neutral and versatile | General TTS, IVR |
Grok TTS: Inline Voice Tags for Fine-Grained Control
Grok TTS supports inline tags embedded directly in the input text to control pauses, laughter, whispers, and emphasis — a feature unavailable in most competing services.

    Hello! [pause:500ms] Today we have a special offer. [whisper]This is just for you.[/whisper] [laugh] Amazing, right? [emphasis]Don't miss out![/emphasis]

This allows highly natural-sounding output for customer support scripts, podcast intros, and interactive voice applications.
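When scripts are generated programmatically, small helpers can keep the tag syntax consistent. The following is a minimal sketch, assuming the tag spellings shown in the example above (`[pause:500ms]`, `[whisper]...[/whisper]`, `[laugh]`, `[emphasis]...[/emphasis]`). The helper names `pause` and `wrap` are hypothetical and not part of any xAI SDK.

```python
def pause(ms: int) -> str:
    """Return a pause tag such as [pause:500ms]."""
    if ms <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause:{ms}ms]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a paired tag, e.g. [whisper]...[/whisper]."""
    return f"[{tag}]{text}[/{tag}]"


# Rebuild the example script from its parts.
script = " ".join([
    "Hello!",
    pause(500),
    "Today we have a special offer.",
    wrap("whisper", "This is just for you."),
    "[laugh]",
    "Amazing, right?",
    wrap("emphasis", "Don't miss out!"),
])
```

The resulting `script` string would then be passed as the `input` text of a TTS request.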
Grok TTS: 20+ Languages and BCP-47 Code Support
Grok TTS supports over 20 languages with automatic language detection (`auto`) or explicit BCP-47 code specification (e.g., `ja` for Japanese, `en` for English). For multilingual content pipelines, explicitly specifying the language code is recommended to ensure consistent output quality.
Grok TTS: Output Formats
| Format | Best For |
|---|---|
| MP3 | General web and mobile apps |
| WAV | High-quality recording and editing |
| PCM (Linear16) | Real-time audio streaming |
| G.711 μ-law | North American telephony (VoIP, IVR) |
| G.711 A-law | European and Asian telephony systems |
G.711 support enables direct integration with existing telephone infrastructure without additional audio conversion.
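For capacity planning on the telephony side, the raw payload size follows from the G.711 standard itself (8 kHz sampling, 8 bits per sample, hence 64 kbit/s per channel). The sketch below only does that arithmetic; the function names are illustrative, and nothing in it depends on the Grok API.

```python
G711_SAMPLE_RATE_HZ = 8_000   # G.711 samples narrowband audio at 8 kHz
G711_BYTES_PER_SAMPLE = 1     # mu-law and A-law both use 8 bits per sample


def g711_payload_bytes(duration_seconds: float) -> int:
    """Raw G.711 payload size for one mono channel, excluding container or RTP overhead."""
    return int(duration_seconds * G711_SAMPLE_RATE_HZ * G711_BYTES_PER_SAMPLE)


def g711_bitrate_kbps() -> float:
    """Bitrate of a single G.711 channel in kbit/s."""
    return G711_SAMPLE_RATE_HZ * G711_BYTES_PER_SAMPLE * 8 / 1000
```

One minute of G.711 audio is therefore about 480 KB of payload per channel.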
Grok STT: 25+ Languages with Seamless Language Switching
Grok STT covers 25+ languages and automatically handles mid-conversation language switches. Mixed-language audio — such as conversations alternating between Japanese, English, and Mandarin — is processed accurately. This makes it ideal for global enterprise call centers and multilingual media workflows.
Grok STT: Word-Level Timestamps and Speaker Diarization
Grok STT provides word-level timestamps and multi-channel speaker diarization — automatically identifying who spoke when. This unlocks automated meeting minute generation, post-call quality analysis, and podcast editing assistance, dramatically reducing the manual review effort required.
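To illustrate how word timestamps and speaker labels combine into meeting minutes, the sketch below merges consecutive words from the same speaker into turns. It assumes each word arrives as a dict with `word`, `start`, `end` (seconds), and `speaker` keys; that shape is a hypothetical stand-in, since the actual response schema is not described here.

```python
from typing import List


def group_into_turns(words: List[dict]) -> List[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns


# Hypothetical sample data in the assumed shape.
sample_words = [
    {"word": "Hello", "start": 0.00, "end": 0.40, "speaker": "A"},
    {"word": "there", "start": 0.45, "end": 0.80, "speaker": "A"},
    {"word": "Hi", "start": 1.10, "end": 1.30, "speaker": "B"},
]
turns = group_into_turns(sample_words)
```

Each turn keeps its start and end time, so the output can be rendered as a timestamped transcript or fed into post-call analysis.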
Grok STT: Industry-Leading Benchmark Accuracy
In a phone call entity recognition benchmark (names, account numbers, dates), Grok STT achieved a 5.0% error rate — significantly outperforming all major competitors:
| Service | Error Rate | Difference vs. Grok |
|---|---|---|
| Grok STT | 5.0% | — |
| ElevenLabs | 12.0% | +7.0pp |
| Deepgram | 13.5% | +8.5pp |
| AssemblyAI | 21.3% | +16.3pp |
The accuracy advantage is especially impactful in finance, insurance, and healthcare where precise entity extraction is critical.
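The table's figures can also be restated as ratios, which is often easier to reason about than percentage points. A short sketch using only the error rates listed above:

```python
# Error rates (%) from the benchmark table above.
error_rates = {"Grok STT": 5.0, "ElevenLabs": 12.0, "Deepgram": 13.5, "AssemblyAI": 21.3}
baseline = error_rates["Grok STT"]

# For each competitor: gap in percentage points and error-rate ratio vs. Grok STT.
comparison = {
    name: {"gap_pp": round(rate - baseline, 1), "ratio": round(rate / baseline, 2)}
    for name, rate in error_rates.items()
    if name != "Grok STT"
}
```

On these numbers, the competitors' error rates are between 2.4 and 4.26 times the Grok STT rate.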
[Figure: Grok STT Error Rate Comparison]
Grok STT: Batch vs. Streaming — Which to Choose
| Mode | Price | Latency | Primary Use Cases |
|---|---|---|---|
| Batch | $0.10/hour | Higher (async) | Recorded file processing, meeting transcripts, post-call analysis |
| Streaming | $0.20/hour | Low (real-time) | Live captions, real-time meeting transcription, voice UI |
Choose batch for cost-sensitive workloads and streaming when immediacy is required.
TTS Competitive Comparison
| Service | Price | Voices | Languages | Inline Tags | G.711 |
|---|---|---|---|---|---|
| Grok TTS | $4.20/M chars | 5 | 20+ | Yes | Yes |
| OpenAI TTS | $15/M chars | 6 | 57 | No | No |
| ElevenLabs | ~$11/M chars | Many (cloning) | 32 | Partial | No |
| Google WaveNet | $16/M chars | Many | 40+ | SSML | No |
| Azure TTS | $16/M chars | Many | 140+ | SSML | No |
Grok TTS has the lowest list price in this comparison: $4.20 per 1M characters, versus roughly $11 to $16 for the other services. Voice cloning is not supported, but the five fixed voices deliver business-grade quality.
STT Competitive Comparison
| Service | Batch Price | Streaming Price | Error Rate (Phone) | Word Timestamps | Speaker ID |
|---|---|---|---|---|---|
| Grok STT | $0.10/hr | $0.20/hr | 5.0% | Yes | Yes |
| OpenAI Whisper API | $0.36/hr ($0.006/min) | N/A | N/A (unpublished) | Segment | No |
| Deepgram | ~$0.25/hr | ~$0.35/hr | 13.5% | Yes | Yes |
| AssemblyAI | ~$0.37/hr | ~$0.45/hr | 21.3% | Yes | Yes |
| ElevenLabs STT | ~$0.25/hr | N/A | 12.0% | Yes | Yes |
The 60% Undercut Pricing Strategy
xAI's pricing is a deliberate strategy to enter the market at roughly 60% below established players. At the batch rates listed above, an enterprise processing 1,000 hours of audio per month would pay about $100 with Grok STT, compared with about $250 on Deepgram or about $370 on AssemblyAI. The barrier to adoption is low: the API is REST-based and compatible with existing tooling, making it straightforward to integrate into current stacks.
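The savings can be estimated for any monthly volume with a few lines of arithmetic. This is a minimal sketch, assuming the approximate batch prices from the STT comparison table; the competitor figures are the article's estimates rather than quoted rates, and the function name is illustrative.

```python
# Approximate batch prices in USD per audio hour, taken from the STT comparison table.
BATCH_PRICE_PER_HOUR = {
    "Grok STT": 0.10,
    "Deepgram": 0.25,
    "AssemblyAI": 0.37,
    "ElevenLabs STT": 0.25,
}


def monthly_savings(hours: float, current_provider: str, target: str = "Grok STT") -> dict:
    """Estimate the monthly cost change when moving `hours` of batch audio to `target`."""
    current = hours * BATCH_PRICE_PER_HOUR[current_provider]
    new = hours * BATCH_PRICE_PER_HOUR[target]
    return {
        "current_usd": round(current, 2),
        "new_usd": round(new, 2),
        "saved_usd": round(current - new, 2),
        "saved_pct": round(100 * (current - new) / current, 1),
    }
```

For example, `monthly_savings(1000, "Deepgram")` reports a 60% reduction, while the same volume moved from AssemblyAI comes out at about 73%.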
Proven at Commercial Scale
The technology behind Grok Audio APIs is already deployed in production: Grok Voice (xAI's AI assistant), Tesla's in-vehicle voice interface, and Starlink customer support. This is not a research prototype — it is a battle-tested system operating at commercial scale, which is a key trust signal for enterprise procurement.
How to Use the API: TTS
Grok TTS uses an OpenAI-compatible REST API. You can test it immediately with the following curl command:

    curl https://api.x.ai/v1/audio/speech \
      -H "Authorization: Bearer $XAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "grok-tts",
        "voice": "ara",
        "input": "Hello from Grok TTS!",
        "language": "en",
        "format": "mp3"
      }' --output speech.mp3

For telephony integration, set `"format": "ulaw"` for G.711 μ-law. Maximum input is 15,000 characters per request, with a limit of 100 concurrent requests per team.
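Longer documents have to be split before they fit the 15,000-character limit. The sketch below shows one possible approach: it prefers sentence boundaries and falls back to a hard split for any single sentence that exceeds the limit. The function name and the splitting heuristic are assumptions for illustration, not an official recommendation.

```python
import re
from typing import List

MAX_TTS_CHARS = 15_000  # per-request limit quoted above


def chunk_text(text: str, limit: int = MAX_TTS_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate      # sentence still fits in the current chunk
        else:
            chunks.append(current)   # flush and start a new chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate request and the resulting audio files concatenated in order.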
How to Use the API: STT
Grok STT is also accessed via REST API. Here is a Python example with word timestamps and speaker diarization enabled:
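
    import os

    import requests

    # Read the key from the environment instead of hard-coding it.
    XAI_API_KEY = os.environ["XAI_API_KEY"]

    url = "https://api.x.ai/v1/audio/transcriptions"
    headers = {"Authorization": f"Bearer {XAI_API_KEY}"}

    # Upload the recording as multipart form data.
    with open("meeting.mp3", "rb") as f:
        response = requests.post(
            url,
            headers=headers,
            files={"file": f},
            data={
                "model": "grok-stt",
                "language": "en",
                "timestamps": "word",    # word-level timestamps
                "diarization": "true",   # speaker labels
            },
        )

    response.raise_for_status()  # fail loudly on HTTP errors
    print(response.json())

Streaming mode is available via a WebSocket endpoint for real-time transcription.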
Related: Grok Voice Agent API ($0.05/min)
Beyond the individual TTS and STT APIs, xAI also offers the Grok Voice Agent API at $0.05 per minute — a fully integrated real-time conversational voice agent that combines both TTS and STT. It is suited for use cases requiring turn-based dialogue: automated customer inquiry handling, voice-driven booking systems, and interactive voice guides.
5 Real-World Use Cases
1. Call Center Automation: Transcribe calls with STT → extract entities (names, account numbers, dates) at 5.0% error rate → respond via TTS in G.711 format. Fully automated pipeline that dramatically reduces operator workload.
2. Multilingual Podcast Production: STT transcribes the original recording → translate via translation API → TTS generates dubbed audio in 20+ languages. Reduces production cost while expanding global reach.
3. Live Captioning: Streaming STT at $0.20/hour delivers real-time captions for meetings, webinars, and live events. Word-level timestamps enable precise caption timing.
4. IVR / Auto-Attendant Systems: G.711 format support allows direct connection to existing PBX infrastructure without audio conversion. STT recognizes caller intent, TTS delivers responses.
5. Educational e-Learning: TTS auto-generates multilingual narration. STT evaluates and scores learner speech. Enables scalable voice-interactive learning content at a fraction of traditional production cost.
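For the captioning use case, word timestamps have to be turned into caption cues. The sketch below renders words as SubRip (SRT) text, grouping a fixed number of words per cue. It assumes each word is a dict with `word`, `start`, and `end` (seconds) keys; those field names are hypothetical, since the response schema is not specified here.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[dict], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0]["start"], group[-1]["end"]
        text = " ".join(w["word"] for w in group)
        # Each cue: index line, time range line, text line.
        cues.append(f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)
```

A production captioner would more likely break cues on pauses or punctuation than on a fixed word count; the fixed grouping just keeps the example short.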
Japanese Language Support Quality
Japanese (BCP-47: `ja`) is included in Grok TTS's 20+ language support, with reliable auto-detection. Grok STT includes Japanese in its 25+ language coverage, delivering business-grade accuracy for professional conversations. Bilingual Japanese-English audio is handled through automatic language switching.
Beta Limitations to Be Aware Of
- Maximum TTS input: 15,000 characters per request
- Concurrent request limit: 100 per team
- Voice cloning is not supported; only the five fixed voices are available
- Pricing and specifications may change during the Beta period
- Always verify the latest details at the official page: https://x.ai/news/grok-stt-and-tts-apis
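One way to stay under the concurrency limit on the client side is to cap the number of worker threads. The following is a minimal sketch using the standard library; `fake_synthesize` is a hypothetical stand-in for a real TTS call, and the default of 32 workers is an arbitrary choice below the 100-request ceiling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

MAX_CONCURRENT_REQUESTS = 100  # per-team limit listed above


def run_bounded(task: Callable[[str], bytes], inputs: Iterable[str],
                max_workers: int = 32) -> List[bytes]:
    """Run `task` over inputs with a capped number of calls in flight, preserving order."""
    workers = min(max_workers, MAX_CONCURRENT_REQUESTS)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, inputs))


def fake_synthesize(text: str) -> bytes:
    # Hypothetical placeholder for a real TTS request; returns dummy bytes.
    return text.encode("utf-8")


results = run_bounded(fake_synthesize, ["first", "second", "third"])
```

Because the limit applies per team, several services sharing one account would need to divide the budget between them rather than each assuming the full 100.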
Setup: From Zero to First Request
1. Create an account at x.ai
2. Generate an API key from the dashboard
3. Configure billing (pay-as-you-go, no minimum commitment)
4. Send requests with the `Authorization: Bearer <YOUR_KEY>` header
5. TTS endpoint: `https://api.x.ai/v1/audio/speech` | STT endpoint: `https://api.x.ai/v1/audio/transcriptions`
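Steps 4 and 5 can be sketched in Python with only the standard library. The example below builds, but does not send, a TTS request using the same JSON fields as the curl example in this guide. The function name is hypothetical, and the field names are taken from that example rather than from an independently verified schema.

```python
import json
import os
import urllib.request

TTS_URL = "https://api.x.ai/v1/audio/speech"


def build_tts_request(text: str, voice: str = "ara", api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a TTS request with the Bearer header from step 4."""
    key = api_key or os.environ.get("XAI_API_KEY", "")
    if not key:
        raise RuntimeError("Set XAI_API_KEY or pass api_key explicitly")
    body = json.dumps({
        "model": "grok-tts",
        "voice": voice,
        "input": text,
        "language": "en",
        "format": "mp3",
    }).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body could then be written to a file such as `speech.mp3`.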
Frequently Asked Questions
Q1. Can I use TTS and STT independently?
Yes. Each is a separate API, so you can subscribe to and use only what you need.

Q2. How does Grok STT compare to OpenAI Whisper in accuracy?
In the phone call entity recognition benchmark, Grok STT achieves a 5.0% error rate. OpenAI has not published equivalent benchmarks for Whisper under the same conditions.

Q3. Is the 60% cost reduction claim verified?
According to xAI's official announcement, pricing is set to undercut ElevenLabs, Deepgram, and AssemblyAI by approximately 60%.

Q4. Is Japanese STT accurate enough for business use?
Japanese is among the 25+ supported languages, and the STT is designed for business-grade professional conversation transcription including proper nouns and numbers.

Q5. How many speakers can diarization identify?
Multi-channel support is confirmed, but the specific speaker count limit is not yet publicly documented. Check the official docs for the latest specification.

Q6. Is it true the same technology runs in Tesla vehicles?
Yes. The Grok audio technology stack is deployed in Tesla's in-vehicle voice assistant, confirming production-grade reliability at scale.

Q7. Can I clone a custom voice?
Not in the current Beta. Only the five fixed voices (Ara, Eve, Leo, Rex, Sal) are available. If voice cloning is essential, ElevenLabs remains an alternative.

Q8. What is the difference between Voice Agent API and individual TTS/STT APIs?
The Voice Agent API ($0.05/min) integrates TTS and STT into a real-time conversational interface. The individual APIs are better suited for batch processing, custom pipelines, and flexible architectural combinations.
Oflight Audio AI Integration Services
Oflight provides end-to-end consulting for integrating Grok TTS and STT APIs into your business workflows — from call center automation and multilingual content pipelines to voice interface development. For technical consultation and implementation support, visit our AI Consulting Services.