Oflight Inc.
AI | 2026-04-17

xAI Grok Audio APIs Complete Guide — TTS ($4.20/M chars) + STT ($0.10/hour) Undercutting Competitors by 60% [2026]

Complete guide to xAI's Grok TTS and STT APIs, officially bundled on April 17, 2026. TTS at $4.20 per 1M characters and STT at $0.10/hour (batch) undercut competitors by roughly 60%. Grok STT posts a 5.0% entity recognition error rate on a phone call benchmark, the lowest among the services compared. Covers API usage, benchmarks, and real-world use cases.


What Are xAI Grok Audio APIs? — TTS + STT Bundled at 60% Below Competitors

On April 17, 2026, xAI officially launched Grok TTS (Text-to-Speech) and Grok STT (Speech-to-Text) as a bundled audio API suite. At $4.20 per 1M characters for TTS and $0.10/hour for STT (batch), xAI undercuts ElevenLabs, Deepgram, and AssemblyAI by approximately 60%. In a phone call benchmark, Grok STT achieves a 5.0% entity recognition error rate, against 12.0% to 21.3% for those competitors (roughly 2.4 to 4.3 times lower). The same technology already powers Grok Voice, Tesla in-vehicle systems, and Starlink customer support.

Both APIs at a Glance

| API | Function | Price | Highlights |
| --- | --- | --- | --- |
| Grok TTS | Text → Speech | $4.20 / 1M chars | 5 voices, 20+ languages, inline tags, MP3/WAV/PCM/G.711 |
| Grok STT | Speech → Text | $0.10/hour (batch), $0.20/hour (streaming) | 25+ languages, word timestamps, speaker diarization |

[Diagram: Grok Audio API Ecosystem Overview]

Grok TTS: 5 Voice Profiles

Grok TTS offers five fixed voice profiles:

| Voice ID | Character | Recommended Use |
| --- | --- | --- |
| ara | Clear and professional | Customer support, business announcements |
| eve | Warm and approachable | e-Learning, virtual assistants |
| leo | Calm and authoritative male | Narration, podcasts |
| rex | Energetic male | Game NPCs, entertainment content |
| sal | Neutral and versatile | General TTS, IVR |

Grok TTS: Inline Voice Tags for Fine-Grained Control

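Grok TTS supports inline tags embedded directly in the input text to control pauses, laughter, whispers, and emphasis. Among the services compared later in this guide, OpenAI TTS has no equivalent, ElevenLabs offers partial support, and Google and Azure rely on SSML markup instead.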

Hello! [pause:500ms] Today we have a special offer. [whisper]This is just for you.[/whisper] [laugh] Amazing, right? [emphasis]Don't miss out![/emphasis]

This allows highly natural-sounding output for customer support scripts, podcast intros, and interactive voice applications.
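As a rough sketch, tagged input like the example above could be assembled programmatically rather than typed by hand. The helper names `pause` and `wrap` are hypothetical and not part of any xAI SDK; only the tag syntax itself is taken from the example above.

```python
def pause(ms: int) -> str:
    """Return a pause tag such as [pause:500ms]."""
    if ms <= 0:
        raise ValueError("pause length must be positive")
    return f"[pause:{ms}ms]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a paired tag such as [whisper]...[/whisper]."""
    return f"[{tag}]{text}[/{tag}]"


# Rebuild the sample script from the example above, piece by piece.
script = " ".join([
    "Hello!",
    pause(500),
    "Today we have a special offer.",
    wrap("whisper", "This is just for you."),
    "[laugh]",
    "Amazing, right?",
    wrap("emphasis", "Don't miss out!"),
])
print(script)
```

Building the string from small helpers keeps opening and closing tags paired, which is easy to get wrong when editing long scripts manually.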

Grok TTS: 20+ Languages and BCP-47 Code Support

Grok TTS supports over 20 languages with automatic language detection (`auto`) or explicit BCP-47 code specification (e.g., `ja` for Japanese, `en` for English). For multilingual content pipelines, explicitly specifying the language code is recommended to ensure consistent output quality.

Grok TTS: Output Formats

| Format | Best For |
| --- | --- |
| MP3 | General web and mobile apps |
| WAV | High-quality recording and editing |
| PCM (Linear16) | Real-time audio streaming |
| G.711 μ-law | North American telephony (VoIP, IVR) |
| G.711 A-law | European and Asian telephony systems |

G.711 support enables direct integration with existing telephone infrastructure without additional audio conversion.

Grok STT: 25+ Languages with Seamless Language Switching

Grok STT covers 25+ languages and automatically handles mid-conversation language switches. Mixed-language audio — such as conversations alternating between Japanese, English, and Mandarin — is processed accurately. This makes it ideal for global enterprise call centers and multilingual media workflows.

Grok STT: Word-Level Timestamps and Speaker Diarization

Grok STT provides word-level timestamps and multi-channel speaker diarization — automatically identifying who spoke when. This unlocks automated meeting minute generation, post-call quality analysis, and podcast editing assistance, dramatically reducing the manual review effort required.
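To illustrate how word timestamps and speaker labels feed something like meeting minutes, the sketch below merges consecutive words from the same speaker into turns. The response schema is not documented in this guide, so the field names `word`, `start`, `end`, and `speaker` are assumptions made for illustration, and the sample data is invented.

```python
from itertools import groupby

# Hypothetical word-level output; the real response schema may differ.
words = [
    {"word": "Hi", "start": 0.00, "end": 0.20, "speaker": "A"},
    {"word": "there.", "start": 0.25, "end": 0.60, "speaker": "A"},
    {"word": "Hello!", "start": 0.90, "end": 1.30, "speaker": "B"},
    {"word": "How", "start": 1.50, "end": 1.65, "speaker": "A"},
    {"word": "are", "start": 1.70, "end": 1.80, "speaker": "A"},
    {"word": "you?", "start": 1.85, "end": 2.10, "speaker": "A"},
]


def speaker_turns(words):
    """Merge consecutive words by the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],   # start of the first word in the turn
            "end": group[-1]["end"],      # end of the last word in the turn
            "text": " ".join(w["word"] for w in group),
        })
    return turns


for turn in speaker_turns(words):
    print(f'[{turn["start"]:.2f}-{turn["end"]:.2f}] {turn["speaker"]}: {turn["text"]}')
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and talks again produces two separate turns, which is usually what a transcript needs.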

Grok STT: Industry-Leading Benchmark Accuracy

In a phone call entity recognition benchmark (names, account numbers, dates), Grok STT achieved a 5.0% error rate, lower than each of the three competitors tested:

| Service | Error Rate | Difference vs. Grok |
| --- | --- | --- |
| Grok STT | 5.0% | (baseline) |
| ElevenLabs | 12.0% | +7.0pp |
| Deepgram | 13.5% | +8.5pp |
| AssemblyAI | 21.3% | +16.3pp |

The accuracy advantage is especially impactful in finance, insurance, and healthcare where precise entity extraction is critical.
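The percentage-point gaps in the table can also be read as ratios. The short calculation below uses only the error rates listed above and shows that the competitors' rates come out at 2.4x, 2.7x, and about 4.3x the Grok STT figure.

```python
# Entity recognition error rates (%) from the benchmark table above.
error_rates = {
    "Grok STT": 5.0,
    "ElevenLabs": 12.0,
    "Deepgram": 13.5,
    "AssemblyAI": 21.3,
}

baseline = error_rates["Grok STT"]
for service, rate in error_rates.items():
    gap_pp = rate - baseline   # absolute gap in percentage points
    ratio = rate / baseline    # error rate relative to Grok STT
    print(f"{service:<11} {rate:5.1f}%  +{gap_pp:4.1f}pp  {ratio:.2f}x")
```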

[Diagram: Grok STT Error Rate Comparison]

Grok STT: Batch vs. Streaming — Which to Choose

| Mode | Price | Latency | Primary Use Cases |
| --- | --- | --- | --- |
| Batch | $0.10/hour | Higher (async) | Recorded file processing, meeting transcripts, post-call analysis |
| Streaming | $0.20/hour | Low (real-time) | Live captions, real-time meeting transcription, voice UI |

Choose batch for cost-sensitive workloads and streaming when immediacy is required.

TTS Competitive Comparison

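| Service | Price | Voices | Languages | Inline Tags | G.711 |
| --- | --- | --- | --- | --- | --- |
| Grok TTS | $4.20/M chars | 5 | 20+ | Yes | Yes |
| OpenAI TTS | $15/M chars | 6 | 57 | No | No |
| ElevenLabs | ~$11/M chars | Many (cloning) | 32 | Partial | No |
| Google WaveNet | $16/M chars | Many | 40+ | SSML | No |
| Azure TTS | $16/M chars | Many | 140+ | SSML | No |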

Grok TTS has the lowest per-character price in this comparison. Voice cloning is not supported, but the five fixed voices deliver business-grade quality.

STT Competitive Comparison

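| Service | Batch Price | Streaming Price | Error Rate (Phone) | Word Timestamps | Speaker ID |
| --- | --- | --- | --- | --- | --- |
| Grok STT | $0.10/hr | $0.20/hr | 5.0% | Yes | Yes |
| OpenAI Whisper API | $0.006/min ($0.36/hr) | N/A | N/A (unpublished) | Segment | No |
| Deepgram | ~$0.25/hr | ~$0.35/hr | 13.5% | Yes | Yes |
| AssemblyAI | ~$0.37/hr | ~$0.45/hr | 21.3% | Yes | Yes |
| ElevenLabs STT | ~$0.25/hr | N/A | 12.0% | Yes | Yes |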

The 60% Undercut Pricing Strategy

xAI's pricing is a deliberate strategy to enter the market at roughly 60% below established players. For an enterprise processing 1,000 hours of audio per month, the batch rates in the table above work out to about $100 on Grok STT, compared with about $250 on Deepgram and about $370 on AssemblyAI. The barrier to adoption is low: the API is REST-based and compatible with existing tooling, making it straightforward to integrate into current stacks.
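For other monthly volumes, the comparison can be recomputed with a few lines of Python. This is only a sketch: the rates are the batch prices from the STT comparison table above, the competitor figures are the approximate values listed there, and `monthly_cost` is a helper name chosen for illustration.

```python
# Batch prices in USD per audio hour, from the STT comparison table above.
# Competitor figures are the approximate ("~") values listed there.
BATCH_PRICE_PER_HOUR = {
    "Grok STT": 0.10,
    "Deepgram": 0.25,
    "AssemblyAI": 0.37,
}


def monthly_cost(service: str, hours: float) -> float:
    """Return the batch transcription cost in USD for a month of audio."""
    return BATCH_PRICE_PER_HOUR[service] * hours


hours = 1_000
grok = monthly_cost("Grok STT", hours)
for service in ("Deepgram", "AssemblyAI"):
    other = monthly_cost(service, hours)
    saving_pct = (other - grok) / other * 100  # relative saving vs. competitor
    print(f"{service}: ${other:,.0f} vs ${grok:,.0f} on Grok STT ({saving_pct:.0f}% lower)")
```

Since every rate is a flat per-hour price, the percentage saving stays the same at any volume; only the absolute dollar gap grows with usage.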

Proven at Commercial Scale

The technology behind Grok Audio APIs is already deployed in production: Grok Voice (xAI's AI assistant), Tesla's in-vehicle voice interface, and Starlink customer support. This is not a research prototype — it is a battle-tested system operating at commercial scale, which is a key trust signal for enterprise procurement.

How to Use the API: TTS

Grok TTS uses an OpenAI-compatible REST API. You can test it immediately with the following curl command:

```bash
curl https://api.x.ai/v1/audio/speech \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-tts",
    "voice": "ara",
    "input": "Hello from Grok TTS!",
    "language": "en",
    "format": "mp3"
  }' --output speech.mp3
```

For telephony integration, set `"format": "ulaw"` for G.711 μ-law. Maximum input is 15,000 characters per request, with 100 concurrent requests per team.
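Text longer than the 15,000-character limit has to be split across several requests. The following is a minimal sketch of one way to do that; `chunk_text` is a hypothetical helper, not part of any xAI SDK, and it assumes the limit is counted the same way as Python's `len()`, which this guide does not confirm.

```python
MAX_CHARS = 15_000  # per-request TTS limit stated above


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Prefers to break after sentence-ending punctuation, then at whitespace,
    and only cuts mid-word when no earlier break point exists.
    """
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        window = rest[:limit]
        # Position just after the last sentence end inside the window.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? ")) + 1
        if cut <= 0:
            cut = window.rfind(" ")  # fall back to the last space
        if cut <= 0:
            cut = limit              # no break point at all: hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks


print(chunk_text("One two. Three four! Five six? Seven.", limit=12))
```

Breaking at sentence boundaries matters for speech output, because each request is synthesized on its own and a mid-sentence cut would likely produce an audible break in intonation.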

How to Use the API: STT

Grok STT is also accessed via REST API. Here is a Python example with word timestamps and speaker diarization enabled:

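```python
import os

import requests

# Read the key from the environment rather than hard-coding it.
XAI_API_KEY = os.environ["XAI_API_KEY"]

url = "https://api.x.ai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {XAI_API_KEY}"}

with open("meeting.mp3", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data={
            "model": "grok-stt",
            "language": "en",
            "timestamps": "word",   # word-level timestamps
            "diarization": "true",  # label who spoke when
        },
        timeout=300,
    )

response.raise_for_status()  # fail loudly on HTTP errors
print(response.json())
```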

Streaming mode is available via a WebSocket endpoint for real-time transcription.

Related: Grok Voice Agent API ($0.05/min)

Beyond the individual TTS and STT APIs, xAI also offers the Grok Voice Agent API at $0.05 per minute — a fully integrated real-time conversational voice agent that combines both TTS and STT. It is suited for use cases requiring turn-based dialogue: automated customer inquiry handling, voice-driven booking systems, and interactive voice guides.

5 Real-World Use Cases

1. Call Center Automation: Transcribe calls with STT → extract entities (names, account numbers, dates) at 5.0% error rate → respond via TTS in G.711 format. Fully automated pipeline that dramatically reduces operator workload.
2. Multilingual Podcast Production: STT transcribes the original recording → translate via translation API → TTS generates dubbed audio in 20+ languages. Reduces production cost while expanding global reach.
3. Live Captioning: Streaming STT at $0.20/hour delivers real-time captions for meetings, webinars, and live events. Word-level timestamps enable precise caption timing.
4. IVR / Auto-Attendant Systems: G.711 format support allows direct connection to existing PBX infrastructure without audio conversion. STT recognizes caller intent, TTS delivers responses.
5. Educational e-Learning: TTS auto-generates multilingual narration. STT evaluates and scores learner speech. Enables scalable voice-interactive learning content at a fraction of traditional production cost.

Japanese Language Support Quality

Japanese (BCP-47: `ja`) is included in Grok TTS's 20+ language support, with reliable auto-detection. Grok STT includes Japanese in its 25+ language coverage, delivering business-grade accuracy for professional conversations. Bilingual Japanese-English audio is handled through automatic language switching.

Beta Limitations to Be Aware Of

- Maximum TTS input: 15,000 characters per request
- Concurrent request limit: 100 per team
- Voice cloning is not supported; five fixed voices only
- Pricing and specifications may change during the Beta period
- Always verify the latest details at the official page: https://x.ai/news/grok-stt-and-tts-apis
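One way to stay under the concurrency limit when synthesizing many text chunks is to cap the size of a worker pool. The sketch below uses Python's standard `ThreadPoolExecutor`; `synthesize` is a placeholder that stands in for a real HTTP call, and both it and `synthesize_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 100  # team-wide limit stated above; leave headroom in practice


def synthesize(chunk: str) -> bytes:
    """Placeholder for one TTS request; swap in a real HTTP call."""
    return chunk.encode("utf-8")


def synthesize_all(chunks, workers: int = 8):
    """Run requests in parallel without exceeding the concurrency limit."""
    workers = max(1, min(workers, MAX_CONCURRENT))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with the chunks.
        return list(pool.map(synthesize, chunks))


audio_parts = synthesize_all(["First chunk.", "Second chunk."])
print(len(audio_parts))
```

Because the limit applies per team rather than per process, a pool well below 100 workers is the safer default when several services share the same API key.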

Setup: From Zero to First Request

1. Create an account at x.ai
2. Generate an API key from the dashboard
3. Configure billing (pay-as-you-go, no minimum commitment)
4. Send requests with the `Authorization: Bearer <YOUR_KEY>` header
5. TTS endpoint: `https://api.x.ai/v1/audio/speech` | STT endpoint: `https://api.x.ai/v1/audio/transcriptions`
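Steps 4 and 5 can be sketched in Python with only the standard library. The payload fields mirror the curl example earlier in this guide; `build_tts_request` and `save_speech` are hypothetical helper names, and reading the key from an `XAI_API_KEY` environment variable is an assumption carried over from that example.

```python
import json
import urllib.request

TTS_URL = "https://api.x.ai/v1/audio/speech"  # TTS endpoint from step 5


def build_tts_request(text: str, api_key: str, voice: str = "ara") -> urllib.request.Request:
    """Build (but do not send) a TTS request mirroring the curl example."""
    payload = {
        "model": "grok-tts",
        "voice": voice,
        "input": text,
        "language": "en",
        "format": "mp3",
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # header from step 4
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

A first request would then be `save_speech(build_tts_request("Hello from Grok TTS!", os.environ["XAI_API_KEY"]), "speech.mp3")`, after importing `os`. Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access.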

Frequently Asked Questions

Q1. Can I use TTS and STT independently?
Yes. Each is a separate API, so you can subscribe to and use only what you need.

Q2. How does Grok STT compare to OpenAI Whisper in accuracy?
In the phone call entity recognition benchmark, Grok STT achieves a 5.0% error rate. OpenAI has not published equivalent benchmarks for Whisper under the same conditions.

Q3. Is the 60% cost reduction claim verified?
According to xAI's official announcement, pricing is set to undercut ElevenLabs, Deepgram, and AssemblyAI by approximately 60%.

Q4. Is Japanese STT accurate enough for business use?
Japanese is among the 25+ supported languages, and the STT is designed for business-grade professional conversation transcription including proper nouns and numbers.

Q5. How many speakers can diarization identify?
Multi-channel support is confirmed, but the specific speaker count limit is not yet publicly documented. Check the official docs for the latest specification.

Q6. Is it true the same technology runs in Tesla vehicles?
Yes. The Grok audio technology stack is deployed in Tesla's in-vehicle voice assistant, confirming production-grade reliability at scale.

Q7. Can I clone a custom voice?
Not in the current Beta. Only the five fixed voices (Ara, Eve, Leo, Rex, Sal) are available. If voice cloning is essential, ElevenLabs remains an alternative.

Q8. What is the difference between Voice Agent API and individual TTS/STT APIs?
The Voice Agent API ($0.05/min) integrates TTS and STT into a real-time conversational interface. The individual APIs are better suited for batch processing, custom pipelines, and flexible architectural combinations.

Oflight Audio AI Integration Services

Oflight provides end-to-end consulting for integrating Grok TTS and STT APIs into your business workflows — from call center automation and multilingual content pipelines to voice interface development. For technical consultation and implementation support, visit our AI Consulting Services.
