Chinese TTS Voices

Chinese text-to-speech voices with precise tone rendering

Providers: Telnyx, InWorld, MiniMax, Rime, Azure, AWS
Top 7 TTS for Chinese
Name | Provider
Tao - Lecturer | telnyx
Radio Host | minimax
Mei - Expressive Assistant | telnyx
Xiaoyin | inworld
Yunyi Multilingual | azure
Xinyi | inworld
Jing | inworld
[ VOICE AI PLATFORM ]

From text to talk.
Pick your path.

Call our TTS & STT endpoints directly, wire voice into LiveKit rooms with one plug-in, or spin up an AI assistant on a real phone number.

TTS & STT Endpoints

Production-grade streaming and batch TTS/STT. Low latency, 50+ languages, customizable voices, and SDKs for Node/Python/Browser.

  • Streaming for live apps
  • Multi-speaker diarization & punctuation
  • SDKs, code samples, and latency benchmarks
TTS — CURL
$ curl -X POST \
  ".../v1/tts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice": "alloy_female_v1",
    "language": "en-US",
    "format": "mp3",
    "text": "Hello, welcome..."
  }' --output speech.mp3

Sends text to the TTS endpoint and saves the synthesized audio as an MP3 file.
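
For Node callers, the same request can be assembled in code. The sketch below is a minimal, hedged example: the field names (`voice`, `language`, `format`, `text`) mirror the curl call above, `buildTtsRequest` is a hypothetical helper rather than part of any SDK, and the base URL is a required parameter because the example above elides it.

```javascript
// Hypothetical helper: builds the arguments for fetch() from the fields
// shown in the curl example. Not part of any official SDK.
function buildTtsRequest({ baseUrl, apiKey, voice, language, format = "mp3", text }) {
  if (!baseUrl) throw new TypeError("baseUrl is required (elided in the curl example)");
  if (!text) throw new TypeError("text must be a non-empty string");
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/tts`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // JSON.stringify escapes quotes and newlines in the text for us.
      body: JSON.stringify({ voice, language, format, text }),
    },
  };
}
```

On Node 18 or later, the result can be passed straight to the built-in `fetch(url, init)`, and the response body written to a file such as `speech.mp3`.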

View TTS docs →

LiveKit Plug-in

Plug our real-time speech pipeline into LiveKit rooms — transcribe live sessions, synthesize responses and stream audio back into the room.

  • One-line install, example room demo
  • WebRTC + server bridge patterns
  • Works in browser & mobile
LIVEKIT — NODE.JS
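import { Room } from "livekit-client";
import { TelnyxSpeechPlugin } from "@telnyx/livekit-plugin";

// Connect to the room. URL and token are placeholders for your own
// LiveKit server URL and access token.
const room = new Room();
await room.connect(URL, token);

// Attach the speech plug-in so the room gets real-time TTS/STT.
const plugin = new TelnyxSpeechPlugin({
  apiKey: process.env.TELNYX_API_KEY,
  voice: "alloy_female_v1",
});
plugin.attach(room);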

Connects to a LiveKit room and attaches real-time TTS/STT — transcribes audio in, synthesizes audio out.

Try LiveKit demo →

AI-Assistants (Phone)

Deploy a phone-number based AI assistant in minutes — inbound/outbound calls, IVR, call recording, and DTMF support.

  • Purchase & map a phone number
  • Templates: Support Bot, Sales Assistant, Reminder Bot
  • PSTN reliability & compliance tools
AI-ASSISTANT — CURL
$ curl -X POST \
  ".../v1/assistants" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Bot",
    "phone_number": "+18005551234",
    "voice": "alloy_female_v1",
    "system_prompt": "You are a helpful support agent.",
    "capabilities": ["inbound", "recording", "dtmf"]
  }'

Creates an AI assistant bound to a phone number with inbound call handling, recording, and DTMF support.
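
Before sending a request like the one above, it can help to check the payload locally. The sketch below is illustrative only: `buildAssistantPayload` is a hypothetical helper, the JSON field names are copied from the curl example, and the only rule applied to the phone number is the general E.164 shape (a plus sign, then up to 15 digits, the first of them non-zero). Whether the service enforces the same rule is an assumption.

```javascript
// E.164 shape: "+" followed by 2 to 15 digits, first digit non-zero.
const E164 = /^\+[1-9]\d{1,14}$/;

// Hypothetical helper: validates the inputs and returns the JSON body
// as a string, using the field names from the curl example.
function buildAssistantPayload({ name, phoneNumber, voice, systemPrompt, capabilities = [] }) {
  if (!E164.test(phoneNumber)) {
    throw new RangeError(`phone number is not in E.164 form: ${phoneNumber}`);
  }
  if (!Array.isArray(capabilities) || capabilities.some((c) => typeof c !== "string")) {
    throw new TypeError("capabilities must be an array of strings");
  }
  // JSON.stringify escapes newlines inside systemPrompt, so a multi-line
  // prompt still produces valid JSON.
  return JSON.stringify({
    name,
    phone_number: phoneNumber,
    voice,
    system_prompt: systemPrompt,
    capabilities,
  });
}
```

Building the body with `JSON.stringify` rather than by hand avoids a common failure mode: a raw line break inside a string value, which makes the JSON invalid.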

Create your assistant →

Spanish voices

294 TTS voices

Español

Browse →

French voices

98 TTS voices

Français

Browse →

German voices

82 TTS voices

Deutsch

Browse →

Indonesian voices

31 TTS voices

Bahasa Indonesia

Browse →

Italian voices

51 TTS voices

Italiano

Browse →

Japanese voices

85 TTS voices

日本語

Browse →

Korean voices

171 TTS voices

한국어

Browse →

Portuguese voices

277 TTS voices

Português

Browse →

Russian voices

34 TTS voices

Русский

Browse →

Chinese voices

189 TTS voices

中文

Browse →

Chinese phonology and prosody

Pitch that carries the dictionary

Mandarin is a tone language[1]: nearly every syllable carries one of four fixed pitch contours (high, rising, dipping, or falling), and changing the contour changes the word. Depending on its tone, the syllable "ma" means mother (mā), hemp (má), horse (mǎ), or scold (mà). English uses pitch to signal stress and focus[2]; in Mandarin, pitch is built into the word itself[3]. A TTS system that misshapes a single tone doesn't just sound unnatural; it says the wrong word. Producing accurate Mandarin requires inference that resolves tone at the syllable level, with no inter-provider routing to degrade the pitch signal.
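
The "ma" example can be written down as a small lookup to make the point concrete: the key is the pair (syllable, tone), not the syllable alone. This is an illustrative sketch, not a pronunciation lexicon. The names `TONES`, `LEXICON`, and `lookup` are made up for the example, and the pitch digits are the commonly cited five-level (Chao) descriptions of the four tones in citation form, where 1 is the lowest pitch and 5 the highest.

```javascript
// The four lexical tones: contour names follow the text above; the
// "chao" digits are the commonly cited five-level pitch descriptions.
const TONES = {
  1: { contour: "high", chao: "55" },
  2: { contour: "rising", chao: "35" },
  3: { contour: "dipping", chao: "214" },
  4: { contour: "falling", chao: "51" },
};

// Toy lexicon holding only the "ma" example from the text.
const LEXICON = {
  ma: {
    1: { pinyin: "mā", gloss: "mother" },
    2: { pinyin: "má", gloss: "hemp" },
    3: { pinyin: "mǎ", gloss: "horse" },
    4: { pinyin: "mà", gloss: "scold" },
  },
};

// Resolve a (syllable, tone) pair to a word plus its pitch description.
function lookup(syllable, tone) {
  const entry = LEXICON[syllable] && LEXICON[syllable][tone];
  if (!entry || !TONES[tone]) {
    throw new RangeError(`no entry for ${syllable} with tone ${tone}`);
  }
  return { ...entry, ...TONES[tone] };
}
```

Calling `lookup("ma", 3)` and `lookup("ma", 4)` returns different words for the same segmental syllable, which is why an error in the contour is a lexical error rather than a stylistic one.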

Evenly chopped, not bouncy

English is stress-timed[1]: stressed syllables land at regular intervals while unstressed syllables compress between them, creating a strong-weak-weak bounce. Mandarin is syllable-timed[2]: syllables stay closer to equal in duration with far less reduction, producing what sounds like a row of similarly sized beats[3] carrying different pitch shapes. A voice engine trained on English stress-timing will squeeze and stretch syllables that should stay even. Getting this right requires models built for Mandarin rhythm, running co-located with the audio pipeline.
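
One common way to put a number on this contrast is the normalized Pairwise Variability Index (nPVI), which averages how much each interval's duration differs from the next one, relative to the mean of the pair. The sketch below is a hedged illustration: the function name and the durations are invented for the example and are not measurements. nPVI is usually computed over vocalic intervals, so applying it to whole syllables here is a simplification.

```javascript
// nPVI = 100 * mean( |d[k] - d[k+1]| / ((d[k] + d[k+1]) / 2) )
// Lower values mean neighbouring intervals are closer in duration.
function nPVI(durations) {
  if (durations.length < 2) {
    throw new RangeError("nPVI needs at least two durations");
  }
  let sum = 0;
  for (let k = 0; k < durations.length - 1; k++) {
    const a = durations[k];
    const b = durations[k + 1];
    sum += Math.abs(a - b) / ((a + b) / 2);
  }
  return (100 * sum) / (durations.length - 1);
}

// Made-up syllable durations in milliseconds, for illustration only.
const evenBeats = [200, 210, 190, 205];   // near-equal syllables
const bouncyBeats = [220, 90, 80, 240];   // long-short-short-long pattern
```

With these invented inputs, `nPVI(evenBeats)` comes out far lower than `nPVI(bouncyBeats)`. A check of this kind could flag synthesized Mandarin whose syllable durations swing more than expected, though any threshold would have to be calibrated against real recordings.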

Sentence melody on a tonal tightrope

English intonation is relatively free: pitch accents move around a sentence to mark focus or signal questions[1] without changing word identity. Mandarin intonation must ride on top of lexical tones[2], using post-focus pitch compression to convey emphasis while keeping each syllable's tone intact. The same discourse functions (focus, question, statement) are realized through different prosodic strategies[3] than English uses. Imposing English-style rising question contours onto Mandarin warps lexical tones into wrong words. This two-layer pitch system demands synthesis infrastructure where tone and intonation are resolved together, not split across providers.
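
The two-layer idea can be sketched as a toy model: the tone layer supplies each syllable's pitch targets, and the intonation layer only widens or narrows the pitch range around a midline. Everything here is an assumption made for illustration, including the function name, the scale factors, and the use of five-level pitch targets. It is not how any particular synthesis engine works.

```javascript
// Tone layer: five-level pitch targets per tone (1 = lowest, 5 = highest).
const TONE_TARGETS = { 1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1] };
const MID = 3; // midline that the pitch range is scaled around

// Intonation layer: expand the range on the focused syllable and compress
// it on every syllable after the focus. Scaling around MID with a positive
// factor changes how far the pitch moves but never its direction.
function renderPitch(tones, focusIndex = null, { expand = 1.5, compress = 0.5 } = {}) {
  return tones.map((tone, i) => {
    const targets = TONE_TARGETS[tone];
    if (!targets) throw new RangeError(`unknown tone: ${tone}`);
    let scale = 1;
    if (focusIndex !== null) {
      if (i === focusIndex) scale = expand;
      else if (i > focusIndex) scale = compress;
    }
    return targets.map((level) => MID + (level - MID) * scale);
  });
}
```

For three falling-tone syllables with focus on the first, `renderPitch([4, 4, 4], 0)` yields a wide fall followed by two narrow ones. Every syllable still falls, so the words survive. Adding a rising slope to the final syllable, as an English-style question contour would, could instead flatten or reverse that fall.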