Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech (TTS) system. Supertonic 3 ships with support for 31 languages, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets, positioning it as a lightning-fast, on-device, multilingual, and accurate TTS system.
What Changed from v2 to v3
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for a total of 31 supported ISO language codes. There is also a special "na" fallback for text whose language is unknown or outside the supported set.
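The fallback behavior can be sketched with a small helper. Everything below is illustrative: the ISO 639-1 codes are inferred from the language names above, and the identifiers actually accepted by the SDK may differ.

```python
# Assumed ISO 639-1 codes for the 30 languages named above; the SDK's
# actual identifiers may differ. "na" is the article's stated fallback.
SUPPORTED_LANGS = {
    "en", "ko", "es", "pt", "fr",                    # carried over from v2
    "ja", "ar", "bg", "cs", "da", "de", "el", "et",  # added in v3
    "fi", "hr", "hu", "id", "it", "lt", "lv", "nl",
    "pl", "ro", "ru", "sk", "sl", "sv", "tr", "uk",
    "vi",
}

def resolve_lang(code: str) -> str:
    """Map a language code to itself if supported, else to the
    'na' fallback for unknown or unsupported languages."""
    return code if code in SUPPORTED_LANGS else "na"
```

In practice the resolved code would be passed as the lang argument at synthesis time.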
The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is much smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched the Voice Builder, allowing developers to create custom, edge-native TTS models from their own voice recordings.
One new capability in v3 that wasn’t present in v2 is expressive tag support. Supertonic 3 supports simple expression tags such as <laugh>, <breath>, and <sigh>. These let you embed prosodic cues directly into input text without a separate preprocessing step or a separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.
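A hypothetical usage sketch of the tags in practice: the cues sit inline in the input string, and for a visible transcript (captions, logs) they can be stripped back out. Only the three tag names come from the article; the helper below is invented for the example.

```python
import re

# Expression tags are embedded directly in the text payload.
tagged = "That's a great story. <laugh> Give me a moment. <breath> Okay, <sigh> let's go on."

# The same string would be passed straight to synthesis, e.g.:
#   wav, duration = tts.synthesize(tagged, voice_style=style, lang="en")

def strip_expression_tags(text: str) -> str:
    """Remove prosodic cue tags to recover a display transcript."""
    return re.sub(r"\s*<(?:laugh|breath|sigh)>\s*", " ", text).strip()
```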
Architecture and Runtime
The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution — it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for superior text-speech alignment and utilizes a Self-Purifying Flow Matching technique during training to remain robust against noisy data labels.
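To make the sampling idea concrete, here is a toy, model-free sketch of few-step flow-matching sampling: with straight-line probability paths, the vector field points from the current sample toward the data, so a couple of Euler steps already land on the target. The field below is a hand-written stand-in, not Supertonic's learned network.

```python
def sample_flow(vector_field, x0: float, steps: int = 2) -> float:
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1 in `steps` steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += dt * vector_field(x, t)
    return x

# Toy conditional vector field for a straight path toward a known target:
# v(x, t) = (target - x) / (1 - t). A real model learns this field from data.
target = 3.0
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

print(sample_flow(v, x0=0.0, steps=2))  # reaches 3.0 even with only 2 steps
```

Because the toy path is exactly straight, even 2 Euler steps are exact; real speech latents are not this clean, but the same low-step property is what makes flow matching fast at inference.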
On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on A100 GPU, and uses substantially less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.
Reading Accuracy
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS readability metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries, such as Japanese; the other languages use WER. The system's efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6, an E-ink e-reader, in airplane mode.

The ecosystem has also expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation leverages onnxruntime-web for pure client-side execution.
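As a reference for how these readability numbers are produced, here is a minimal WER implementation (word-level Levenshtein distance divided by reference length). This is a generic sketch, not the exact scoring script used in the benchmark.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

CER is the same computation applied to characters instead of words, which is why it suits languages without whitespace-delimited word boundaries.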
Text Normalization
A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms without any preprocessing pipeline or phonetic annotations: financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. The financial expression "$5.2M" must read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars." The technical unit "2.3h" must read as "two point three hours," and "30kph" as "thirty kilometers per hour." All four competing systems evaluated (ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft's TTS) failed both categories.
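To illustrate the kind of expansion involved, here is a toy normalizer for the financial pattern only. Supertonic performs this internally, so none of this code is needed when using the SDK; all function names below are invented for the example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def int_to_words(n: int) -> str:
    """Spell out an integer below 1000."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    rest = "" if n % 100 == 0 else " " + int_to_words(n % 100)
    return ONES[n // 100] + " hundred" + rest

def number_to_words(s: str) -> str:
    """Spell out a number string, reading digits after the decimal point."""
    if "." in s:
        whole, frac = s.split(".")
        return int_to_words(int(whole)) + " point " + " ".join(ONES[int(d)] for d in frac)
    return int_to_words(int(s))

def expand_money(text: str) -> str:
    """Expand $<number>M / $<number>K abbreviations into spoken form."""
    scale = {"M": "million", "K": "thousand"}
    return re.sub(
        r"\$(\d+(?:\.\d+)?)([MK])",
        lambda m: f"{number_to_words(m.group(1))} {scale[m.group(2)]} dollars",
        text,
    )
```

A full normalizer must also resolve context-dependent cases (dates, phone numbers, units), which is why doing this inside the model is a meaningful integration saving.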

Getting Started
The Python SDK is installed with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:

```python
from supertonic import TTS

# Downloads the public ONNX assets from Hugging Face on first run
tts = TTS(auto_download=True)

# Select one of the bundled voice styles
style = tts.get_voice_style(voice_name="M1")

text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
```
Key Takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: expressive tags (<laugh>, <breath>, <sigh>), more stable reading on short and long utterances, and improved speaker similarity vs. v2
- v2-compatible public ONNX interface: existing integrations upgrade without changing inference code
- Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being substantially smaller
- v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
- Native 16-bit WAV output, providing high-fidelity audio for engineering applications
Check out the GitHub Repo and Hugging Face Space.