ARIA — interpretable, controllable neural speech synthesis

Synthesis quality

Near-transparent resynthesis

Before any manipulation, how close is ARIA's analysis–resynthesis to the natural recording? Three sustained Mandarin vowels are shown below, natural on the left and ARIA on the right. The harmonic structure and formant pattern are reproduced.

Neural-MOS snapshot

Predicted naturalness (UTMOS22), mean ± std over N = 50 utterances, for the analysis–resynthesis against the natural recording (the reference). The interactive control studio uses the 5 h (v4) models, the higher-UTMOS variant; the 1 h (v2) rows are kept for the low-resource comparison, and the v4-vs-v2 gain partly reflects the ≈ 5× larger training set. Read the resynth-vs-natural gap within a voice, not absolute scores across voices: UTMOS is a neural proxy (not a human MOS test) and is wideband-trained, so the 16 kHz F024 scores sit low for both the resynthesis and the reference. Their near-equal means (2.53 vs 2.70) indicate near-transparent resynthesis. On CSMSC and LJSpeech, 5 h of training data markedly narrows the resynth–natural gap relative to 1 h (0.91 → 0.55 and 0.94 → 0.37).

UTMOS22 predicted naturalness (mean ± std, N = 50); the reference column is the natural recording.
Voice	Language	Sample rate	Model	Train data	UTMOS · resynth	UTMOS · natural (ref)
F024	Mandarin	16 kHz	v4	1 h	2.53 ± 0.63	2.70 ± 0.68
CSMSC	Mandarin	24 kHz	v2	1 h	2.99 ± 0.39	3.90 ± 0.37
CSMSC	Mandarin	24 kHz	v4	5 h	3.35 ± 0.44	3.90 ± 0.37
LJSpeech	English	22 kHz	v2	1 h	3.43 ± 0.31	4.37 ± 0.09
LJSpeech	English	22 kHz	v4	5 h	4.00 ± 0.23	4.37 ± 0.09
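
The within-voice reading suggested above can be made explicit with a few lines of arithmetic. The sketch below copies the means from the table and prints the natural-minus-resynth gap per row; the helper name `utmos_gaps` is illustrative and not part of ARIA.

```python
# Means copied from the table above:
# (voice, model, training data, UTMOS resynth, UTMOS natural reference).
rows = [
    ("F024", "v4", "1 h", 2.53, 2.70),
    ("CSMSC", "v2", "1 h", 2.99, 3.90),
    ("CSMSC", "v4", "5 h", 3.35, 3.90),
    ("LJSpeech", "v2", "1 h", 3.43, 4.37),
    ("LJSpeech", "v4", "5 h", 4.00, 4.37),
]


def utmos_gaps(table):
    """Return {(voice, model): natural mean - resynth mean}, rounded to 2 decimals."""
    return {(voice, model): round(nat - syn, 2) for voice, model, _, syn, nat in table}


gaps = utmos_gaps(rows)
for (voice, model), gap in gaps.items():
    print(f"{voice:9s} {model}: resynth-natural gap = {gap:.2f}")
```

A smaller gap means the resynthesis is scored closer to its own natural reference, which is the comparison the table is meant to support; the standard deviations are left out of this sketch.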

Interactive

Control studio

Pick a phonetic control and drag the slider along its seven-step continuum. Each control is shown live on several Mandarin syllables at once (varied onset, rime and tone, including compound finals), so you can see it generalise. The demonstration is simple: the targeted control sweeps while everything else holds.

Drag to sweep this control through all seven steps; one parameter moves while the rest stay fixed.

← → step · ↑ ↓ control · R source

A1–P0 nasalisation spectrum — Measured trajectory of the three syllables; the current step is ringed. On a vowel control the formants trace a path while F0 stays fixed; on the pitch control the reverse holds.

Evidence

Near-exact, by construction

Because each control is a physical DSP parameter (a pole frequency, a glottal shape, a gain), not a latent fed to a black box, the synthesiser realises the requested value closely and predictably. Over the full F024 test set (1517 utterances × six target scales), the median residual |measured − target| is small: F0 ≈ 0.4 Hz, F1 ≈ 31 Hz, F2 ≈ 40 Hz (vs the ≈ 150 Hz F2 manipulation error reported for HiFi-Glot). F1 still under-travels at the extreme scales (slope < 1, R² ≈ 0.89).

Manipulation fidelity over the full F024 test set (1517 utterances): the per-utterance |measured − target| for each control across target scales 0.7–1.3.
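
A minimal sketch of how such fidelity numbers could be computed, assuming the slope and R² come from an ordinary least-squares fit of measured on target values (the page does not state the exact procedure). The function name `fidelity` and the toy F1 values are illustrative, not ARIA measurements.

```python
import numpy as np


def fidelity(target_hz, measured_hz):
    """Median |measured - target|, plus slope and R^2 of a linear fit.

    A slope below 1 means the measured value under-travels the request.
    """
    target = np.asarray(target_hz, dtype=float)
    measured = np.asarray(measured_hz, dtype=float)
    median_abs = float(np.median(np.abs(measured - target)))
    slope, intercept = np.polyfit(target, measured, 1)
    predicted = slope * target + intercept
    ss_res = float(np.sum((measured - predicted) ** 2))
    ss_tot = float(np.sum((measured - measured.mean()) ** 2))
    return median_abs, float(slope), 1.0 - ss_res / ss_tot


# Toy input only: a 500 Hz formant requested at six scales, with the
# measured value covering just 80 % of the requested excursion.
scales = np.array([0.7, 0.8, 0.9, 1.1, 1.2, 1.3])
toy_target = 500.0 * scales
toy_measured = 500.0 + 0.8 * (toy_target - 500.0)

median_abs, slope, r2 = fidelity(toy_target, toy_measured)
print(f"median |err| = {median_abs:.1f} Hz, slope = {slope:.2f}, R^2 = {r2:.3f}")
```

In the toy case the residual grows towards the extreme scales while the fit stays perfectly linear, which is the pattern "slope < 1" describes.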

Generalisation

The same instrument, other voices

The control surface is not tuned to one speaker. The identical pipeline drives two more voices, CSMSC (Mandarin) and LJSpeech (English), both with 5 h models, here on full sentences, with the three simplest plain-scale controls: pitch, the F1+F2 vowel space, and modal ↔ breathy phonation. The controls are the same; only the speaker and language change. The three examples per control are the three highest-UTMOS utterances from a 15-utterance pool.

Ablations

Robustness to data, training & model size

F024, 16 kHz, single speaker. Control fidelity = median |measured − target| (Hz) over voiced frames, N ≈ 15; quality = MCD (dB, vs. natural reference) and UTMOS, N = 50 (reference UTMOS = 3.14). Control fidelity is realised by the analytic DSP, not learned, so it stays flat across all three budgets (training data, encoder size, training steps); quality (MCD) degrades gracefully.
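
The page does not say which MCD variant is used. As a hedged sketch, the widely used definition averages, over frames, (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) with the energy coefficient c_0 excluded. The code below assumes the two mel-cepstral sequences are already frame-aligned, which is plausible for analysis–resynthesis with fixed timing; `mcd_db` is an illustrative name.

```python
import numpy as np


def mcd_db(ref_mcep, syn_mcep):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    Both inputs have shape (frames, coefficients); column 0 (energy) is
    skipped, following the common convention.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("sequences must be frame-aligned and equally sized")
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())


# Toy check: one frame, one coefficient off by 0.1.
toy_ref = np.zeros((1, 5))
toy_syn = toy_ref.copy()
toy_syn[0, 2] = 0.1
print(f"{mcd_db(toy_ref, toy_syn):.3f} dB")
```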

Training data (model fixed: v4, 5.6 M)

Control fidelity (median |measured − target|, Hz) and quality (MCD, UTMOS) vs. training-set size.
Data	F0 (Hz)	F1 (Hz)	F2 (Hz)	MCD ↓ (dB)	UTMOS
5 min	0.3	30.0	29.4	4.16	2.94
10 min	0.5	33.2	22.9	3.95	2.94
20 min	0.3	42.8	31.2	3.84	—
1 h	0.5	28.9	25.7	3.78	2.94

Encoder size (data fixed: 1 h)

The same metrics vs. encoder capacity (training data fixed at 1 h).
Encoder	Params	F0 (Hz)	F1 (Hz)	F2 (Hz)	MCD ↓ (dB)	UTMOS
small	1.0 M	0.4	33.7	24.0	3.88	2.92
base	5.6 M	0.5	28.9	25.7	3.78	2.94
large	14.8 M	0.4	39.9	26.5	3.75	2.92

Training steps (control fidelity flat from 5 k: F0 < 0.5 Hz, F1/F2 ≈ 25–35 Hz throughout)

Quality (MCD, dB) over training steps; control fidelity is flat from 5 k (see heading).
Steps	5 k	10 k	20 k	40 k
MCD ↓ (dB)	4.24	4.06	3.99	3.82

F0/F1/F2 in Hz; F1/F2 vary within the N ≈ 15 measurement noise (no systematic trend; the signal is the absence of degradation). MCD spread is near the ≈ 0.3 dB just-noticeable difference (JND). UTMOS is wideband-trained and weakly discriminative at 16 kHz (all configs ≈ 2.9, reference 3.14). Inference ≈ 28× real time on CPU, 16× on an L4 GPU.

Method

How it works

The phonetic control surface

Every control is a parameter phoneticians already reason with, estimated by the network, then realised directly by deterministic DSP.

The six phonetic dimensions, the DSP parameter each maps to, and its perceptual meaning.
Phonetic dimension	Control	Meaning
Pitch	F0	intonation & register — scales the natural F0 (real tone shape preserved)
Vowel quality	F1 F2	tongue height & backness — the vowel space
Phonation / voice quality	R_d + aperiodicity	breathy ↔ modal ↔ pressed (glottal shape and HNR/noise)
Spectral tilt / effort	α	vocal effort / source spectral balance
Loudness · prominence	energy (+F0+α)	stress as a joint cue (timing fixed, so duration excluded)
Nasality	anti-formant	oral ↔ nasal(ised): the zero of a learned pole–zero (ARMA) section, controllable
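
To illustrate what a plain-scale control means for pitch, the sketch below multiplies the voiced frames of an F0 contour by a constant, which shifts the contour on a log scale without changing its shape. It assumes the common convention that unvoiced frames are stored as 0 Hz, and it assumes the seven slider steps span the 0.7–1.3 scale range quoted in the evidence section; neither is confirmed by the page, and `scale_f0` is an illustrative name.

```python
import numpy as np


def scale_f0(f0_hz, scale):
    """Scale the voiced frames of an F0 contour; unvoiced frames (0 Hz) stay 0."""
    f0 = np.asarray(f0_hz, dtype=float)
    out = f0.copy()
    voiced = f0 > 0
    out[voiced] *= scale
    return out


# Hypothetical seven-step continuum; step 4 of 7 is the unmodified voice.
steps = np.linspace(0.7, 1.3, 7)

toy_contour = np.array([0.0, 200.0, 220.0, 0.0, 180.0])  # toy values, in Hz
raised = scale_f0(toy_contour, steps[-1])
print(raised)
```

Because every voiced frame is multiplied by the same factor, the ratios between frames, and hence the tone shape, are preserved.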

Architecture

Analytic source–filter core

A differentiable LF glottal-flow source excites an analytic vocal-tract cascade: spectral tilt + range-constrained F1/F2 resonator biquads + learned higher poles. The formant biquads are the control surface: physical, not latent.
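
As a rough illustration of why a formant biquad is a physical control, the sketch below builds a classic two-pole digital resonator from a centre frequency and a bandwidth and locates the peak of its magnitude response. The parameterisation (pole radius exp(−π·B/fs), unity gain at DC) is the textbook Klatt-style form and is an assumption here; ARIA's actual biquads, bandwidths and range constraints are not given on this page, and all names and values are illustrative.

```python
import numpy as np


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Two-pole resonator H(z) = b0 / (1 + a1 z^-1 + a2 z^-2)."""
    r = np.exp(-np.pi * bw_hz / fs)        # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq_hz / fs     # pole angle set by the frequency
    a1 = -2.0 * r * np.cos(theta)
    a2 = r * r
    b0 = 1.0 + a1 + a2                     # normalises the gain at DC to 1
    return b0, a1, a2


def magnitude(coeffs, freqs_hz, fs):
    """Magnitude response of the resonator at the given frequencies."""
    b0, a1, a2 = coeffs
    z_inv = np.exp(-1j * 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs)
    return np.abs(b0 / (1.0 + a1 * z_inv + a2 * z_inv * z_inv))


fs = 16000
freqs = np.arange(0.0, fs / 2, 1.0)        # 1 Hz grid up to Nyquist


def peak_hz(freq_hz, bw_hz):
    """Frequency (Hz) at which the resonator's response is largest."""
    return float(freqs[np.argmax(magnitude(resonator_coeffs(freq_hz, bw_hz, fs), freqs, fs))])


# Toy F1 of 700 Hz with an 80 Hz bandwidth, then the same pole scaled by 1.2.
print(peak_hz(700.0, 80.0), peak_hz(700.0 * 1.2, 80.0))
```

Moving the requested pole frequency moves the spectral peak by nearly the same amount, so the control value and the measurable formant stay closely tied.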

Decoupled by construction

Because F1/F2 are dedicated, range-bounded poles, scaling one does not leak into the other: F0/F1/F2 are cleanly independent. Energy, tilt and glottal R_d show modest, partly measurement-induced cross-talk; the studio above demonstrates the clean F0/F1/F2 cases.

Interpretable aperiodicity

A band-aperiodicity head, supervised by WORLD D4C, controls the harmonic-vs-noise balance per frequency band: an explicit, observable breathiness / HNR dimension.

Low-resource

As little as one hour of single-speaker audio per voice (the cross-language models use 5 h). No text and no phonetic labels at synthesis time; analysis–resynthesis with true F0.

Scope & limitations

The oral tract is all-pole (clean, decoupled formants); a learned pole–zero (ARMA) section adds the anti-formant that nasals need, exposed as a nasalisation control.

The learned nasal zero mostly activates under manipulation; aggregate reconstruction gain is modest.
A dedicated low nasal pole (~250 Hz) and anti-formant supervision are future work.
Single speaker, ~1 h; timing is fixed (prominence excludes duration); rounding via F2 only.

SLT 2026 · Paper (coming soon) · Code (coming soon)