IEEE SLT 2026 · interactive demo

ARIA

An analytic source–filter neural vocoder with an interpretable, decoupled control surface, built for phonetic research.

ARIA exposes the parameters phoneticians already reason with (F0, F1, F2, phonation and prominence, plus an exploratory nasalisation control) as controls over a neural vocoder. The network estimates them; deterministic DSP places each pole/source parameter directly. Move the pitch or vowel control and the others hold: F0, F1 and F2 are cleanly independent (the source and prominence controls overlap by design).

6: independent phonetic dimensions (7 studio controls)
≈ 40 Hz: median F2 manipulation error
From 1 h: single-speaker audio per voice

Synthesis quality

Near-transparent resynthesis

Before any manipulation, how close is ARIA's analysis–resynthesis to the natural recording? Three sustained Mandarin vowels are shown below, natural on the left and ARIA on the right. The harmonic structure and formant pattern are reproduced.

Neural-MOS snapshot

Predicted naturalness (UTMOS22), mean ± std over N = 50 utterances, for the analysis–resynthesis and for the natural recording (the reference). The interactive control studio uses the 5 h (v4) models, the higher-UTMOS variant; the 1 h (v2) rows are kept for the low-resource comparison. The v4-vs-v2 gain partly reflects the ≈ 5× larger training set. Read the resynth-vs-natural gap within a voice, not absolute scores across voices: UTMOS is a neural proxy (not a human MOS test) and is wideband-trained, so the 16 kHz F024 numbers sit low for both resynthesis and reference; their near-equal means indicate near-transparent resynthesis. On CSMSC and LJSpeech, 5 h of training data markedly narrows the resynth–natural gap relative to 1 h.

UTMOS22 predicted naturalness (mean ± std, N = 50); the reference column is the natural recording.
| Voice | Language | Sample rate | Model | Train data | UTMOS · resynth | UTMOS · natural (ref) |
|---|---|---|---|---|---|---|
| F024 | Mandarin | 16 kHz | v4 | 1 h | 2.53 ± 0.63 | 2.70 ± 0.68 |
| CSMSC | Mandarin | 24 kHz | v2 | 1 h | 2.99 ± 0.39 | 3.90 ± 0.37 |
| CSMSC | Mandarin | 24 kHz | v4 | 5 h | 3.35 ± 0.44 | 3.90 ± 0.37 |
| LJSpeech | English | 22 kHz | v2 | 1 h | 3.43 ± 0.31 | 4.37 ± 0.09 |
| LJSpeech | English | 22 kHz | v4 | 5 h | 4.00 ± 0.23 | 4.37 ± 0.09 |
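
The within-voice gap that the text recommends reading can be computed directly from the table means. The sketch below does only that arithmetic; the names `ROWS` and `utmos_gaps` are illustrative, and the reported standard deviations are ignored.

```python
# Mean UTMOS22 values copied from the table above:
# (voice, model, training data, resynth mean, natural mean).
ROWS = [
    ("F024", "v4", "1 h", 2.53, 2.70),
    ("CSMSC", "v2", "1 h", 2.99, 3.90),
    ("CSMSC", "v4", "5 h", 3.35, 3.90),
    ("LJSpeech", "v2", "1 h", 3.43, 4.37),
    ("LJSpeech", "v4", "5 h", 4.00, 4.37),
]


def utmos_gaps(rows):
    """Return {(voice, model): natural - resynth}, rounded to 2 decimals."""
    return {
        (voice, model): round(natural - resynth, 2)
        for voice, model, _data, resynth, natural in rows
    }


if __name__ == "__main__":
    for (voice, model), gap in utmos_gaps(ROWS).items():
        print(f"{voice:9s} {model}: natural - resynth = {gap:.2f}")
```

On these means the gap shrinks from 0.91 to 0.55 on CSMSC and from 0.94 to 0.37 on LJSpeech between the 1 h and 5 h models, while F024 sits at 0.17.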

Interactive

Control studio

Pick a phonetic control and drag the slider along its seven-step continuum. Each control is shown live on several Mandarin syllables at once (varied onset, rime and tone, including compound finals), so you can see it generalise. The demonstration is simple: the targeted control sweeps while everything else holds.

Control


Drag to sweep this control through all seven steps; one parameter moves while the rest stay fixed.

Keyboard shortcuts: ← → step · ↑ ↓ control · R source
Measured trajectory of the three syllables; the current step is ringed. On a vowel control the formants trace a path while F0 stays fixed; on the pitch control the reverse holds.

Evidence

Near-exact, by construction

Because each control is a physical DSP parameter (a pole frequency, a glottal shape, a gain), not a latent fed to a black box, the synthesiser realises the requested value closely and predictably. Over the full F024 test set (1517 utterances × six target scales), the median residual |measured − target| is small: F0 ≈ 0.4 Hz, F1 ≈ 31 Hz, F2 ≈ 40 Hz (vs the ≈ 150 Hz F2 manipulation error reported for HiFi-Glot). F1 still under-travels at the extreme scales (slope < 1, R² ≈ 0.89).

Figure: manipulation fidelity over the full F024 test set (1517 utterances), showing the per-utterance |measured − target| for each control across target scales 0.7–1.3.
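
The fidelity metric quoted in this section, the median of |measured − target|, is simple to reproduce. The following is a minimal sketch, not the evaluation code behind the reported numbers: the function name and the example F2 values are hypothetical, and how the measured values are extracted (formant tracker, voiced-frame selection) is outside its scope.

```python
from statistics import median


def median_abs_residual(measured_hz, target_hz):
    """Median of |measured - target| over paired values, in Hz."""
    if len(measured_hz) != len(target_hz):
        raise ValueError("measured and target must have the same length")
    if not measured_hz:
        raise ValueError("at least one pair is required")
    return median(abs(m - t) for m, t in zip(measured_hz, target_hz))


if __name__ == "__main__":
    # Hypothetical F2 targets for six illustrative scales, and hypothetical
    # measurements; these are NOT values from the F024 evaluation.
    target_f2 = [1190.0, 1360.0, 1530.0, 1870.0, 2040.0, 2210.0]
    measured_f2 = [1215.0, 1330.0, 1545.0, 1905.0, 2000.0, 2255.0]
    print(median_abs_residual(measured_f2, target_f2))
```

The median is used rather than the mean so that a few badly tracked frames do not dominate the summary.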

Generalisation

The same instrument, other voices

The control surface is not tuned to one speaker. The identical pipeline drives two more voices, CSMSC (Mandarin) and LJSpeech (English), 5 h models, here on full sentences, with the three simplest plain-scale controls: pitch, the F1+F2 vowel space, and modal ↔ breathy phonation. The same controls, a different speaker and language. The three examples per control are the highest-UTMOS of a 15-utterance pool.

Ablations

Robustness to data, training & model size

F024, 16 kHz, single speaker. Control fidelity = median |measured − target| (Hz) over voiced frames, N ≈ 15; quality = MCD (dB, vs. natural reference) and UTMOS, N = 50 (reference UTMOS = 3.14). Control fidelity is realised by the analytic DSP, not learned, so it is flat across all three budgets; quality (MCD) degrades gracefully.

Training data  (model fixed: v4, 5.6 M)

Control fidelity (median |measured − target|, Hz) and quality (MCD, UTMOS) vs. training-set size.
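| Data | F0 (Hz) | F1 (Hz) | F2 (Hz) | MCD ↓ (dB) | UTMOS |
|---|---|---|---|---|---|
| 5 min | 0.3 | 30.0 | 29.4 | 4.16 | 2.94 |
| 10 min | 0.5 | 33.2 | 22.9 | 3.95 | 2.94 |
| 20 min | 0.3 | 42.8 | 31.2 | 3.84 | |
| 1 h | 0.5 | 28.9 | 25.7 | 3.78 | 2.94 |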

Encoder size  (data fixed: 1 h)

The same metrics vs. encoder capacity (training data fixed at 1 h).
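| Encoder | Params | F0 (Hz) | F1 (Hz) | F2 (Hz) | MCD ↓ (dB) | UTMOS |
|---|---|---|---|---|---|---|
| small | 1.0 M | 0.4 | 33.7 | 24.0 | 3.88 | 2.92 |
| base | 5.6 M | 0.5 | 28.9 | 25.7 | 3.78 | 2.94 |
| large | 14.8 M | 0.4 | 39.9 | 26.5 | 3.75 | 2.92 |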

Training steps  (control fidelity flat from 5 k: F0 < 0.5 Hz, F1/F2 ≈ 25–35 Hz throughout)

Quality (MCD, dB) over training steps; control fidelity is flat from 5 k (see heading).
| Steps | 5 k | 10 k | 20 k | 40 k |
|---|---|---|---|---|
| MCD ↓ (dB) | 4.24 | 4.06 | 3.99 | 3.82 |

F0/F1/F2 in Hz; F1/F2 vary within the N ≈ 15 measurement noise (no systematic trend; the signal is the absence of degradation). MCD spread is near the ≈ 0.3 dB just-noticeable difference (JND). UTMOS is wideband-trained and weakly discriminative at 16 kHz (all configs ≈ 2.9, reference 3.14). Inference ≈ 28× real time on CPU, 16× on an L4 GPU.
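
For readers unfamiliar with MCD, the sketch below implements the commonly used definition of mel-cepstral distortion. The page does not state which variant, cepstral order, or alignment the reported numbers use, so treat this as an assumption: frames are taken to be already time-aligned, and coefficient 0 (energy) is skipped.

```python
import math


def mcd_db(ref_frames, syn_frames):
    """Mean mel-cepstral distortion (dB) over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Index 0 (energy)
    is excluded, which is a common convention and an assumption here.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")
    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += const * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


if __name__ == "__main__":
    # Toy frames: one coefficient differs by 0.5 in each of two frames.
    ref = [[1.0, 0.2, -0.1], [1.1, 0.3, 0.0]]
    syn = [[0.9, 0.7, -0.1], [1.0, 0.3, 0.5]]
    print(f"{mcd_db(ref, syn):.2f} dB")
```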

Method

How it works

The phonetic control surface

Every control is a parameter phoneticians already reason with, estimated by the network, then realised directly by deterministic DSP.

The six phonetic dimensions, the DSP parameter each maps to, and its perceptual meaning.
| Phonetic dimension | Control | Meaning |
|---|---|---|
| Pitch | F0 | intonation & register; scales the natural F0 (real tone shape preserved) |
| Vowel quality | F1, F2 | tongue height & backness: the vowel space |
| Phonation / voice quality | Rd + aperiodicity | breathy ↔ modal ↔ pressed (glottal shape and HNR/noise) |
| Spectral tilt / effort | α | vocal effort / source spectral balance |
| Loudness · prominence | energy (+F0+α) | stress as a joint cue (timing fixed, so duration excluded) |
| Nasality | anti-formant | oral ↔ nasal(ised); a learned, controllable pole–zero (ARMA) zero |
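
As an illustration of the pitch row, here is a minimal sketch of a plain-scale F0 control. It assumes the contour is a per-frame list in Hz with 0.0 marking unvoiced frames; that convention and both function names are hypothetical. Multiplying every voiced frame by the same factor shifts the register while leaving the frame-to-frame intervals, and so the tone shape, unchanged.

```python
import math


def scale_f0(f0_hz, scale):
    """Scale voiced F0 values; frames marked 0.0 (unvoiced) stay unvoiced."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return [f * scale if f > 0.0 else 0.0 for f in f0_hz]


def interval_semitones(f_from_hz, f_to_hz):
    """Musical interval between two voiced frames, in semitones."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


if __name__ == "__main__":
    contour = [200.0, 220.0, 0.0, 180.0]  # hypothetical rising-then-low contour
    raised = scale_f0(contour, 1.2)
    # The interval between the first two voiced frames is the same
    # before and after scaling.
    print(interval_semitones(contour[0], contour[1]))
    print(interval_semitones(raised[0], raised[1]))
```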

Architecture

Analytic source–filter core

A differentiable LF glottal-flow source excites an analytic vocal-tract cascade: spectral tilt + range-constrained F1/F2 resonator biquads + learned higher poles. The formant biquads are the control surface: physical, not latent.
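
To make "the formant biquads are the control surface" concrete, the sketch below builds a textbook two-pole digital resonator from a centre frequency and bandwidth and locates the peak of its magnitude response. It is an illustration under assumptions, not ARIA's implementation: the parameterisation (pole radius from bandwidth, unity gain at DC), the bandwidth value and all function names are hypothetical.

```python
import cmath
import math


def resonator_coeffs(freq_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator H(z) = b0 / (1 + a1 z^-1 + a2 z^-2), unity gain at DC."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz        # pole angle from frequency
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    b0 = 1.0 + a1 + a2
    return b0, a1, a2


def magnitude(coeffs, freq_hz, fs_hz):
    """Magnitude of H(z) evaluated on the unit circle at freq_hz."""
    b0, a1, a2 = coeffs
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs_hz)  # z^-1
    return abs(b0 / (1.0 + a1 * z1 + a2 * z1 * z1))


def peak_frequency(coeffs, fs_hz, step_hz=10.0):
    """Frequency (on a coarse grid below Nyquist) where the response peaks."""
    grid = [k * step_hz for k in range(1, int(fs_hz / 2.0 / step_hz))]
    return max(grid, key=lambda f: magnitude(coeffs, f, fs_hz))


if __name__ == "__main__":
    fs = 16000.0
    f1 = 700.0  # hypothetical F1 target; 80 Hz bandwidth is also hypothetical
    print(peak_frequency(resonator_coeffs(f1, 80.0, fs), fs))
    print(peak_frequency(resonator_coeffs(f1 * 1.2, 80.0, fs), fs))
```

Because the requested frequency sets the pole angle directly, moving the control moves the spectral peak by construction, which is the sense in which the parameter is physical rather than latent.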

Decoupled by construction

Because F1/F2 are dedicated, range-bounded poles, scaling one does not leak into the other: F0/F1/F2 are cleanly independent. Energy, tilt and glottal Rd show modest, partly measurement-induced cross-talk; the studio above demonstrates the clean F0/F1/F2 cases.
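
One way to picture "dedicated, range-bounded poles" is a per-formant mapping from an unconstrained network output to a bounded frequency. The sketch below uses a sigmoid for that mapping, which is an assumption and not something the page specifies. The ranges are hypothetical too. Each formant has its own value and its own bounds, so scaling one never touches the other.

```python
import math

# Hypothetical bounds in Hz; the page does not state the actual ranges.
F1_RANGE = (200.0, 1000.0)
F2_RANGE = (800.0, 2800.0)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bounded_frequency(raw, lo_hz, hi_hz):
    """Map an unconstrained value to a frequency inside [lo_hz, hi_hz]."""
    return lo_hz + (hi_hz - lo_hz) * _sigmoid(raw)


def apply_scale(freq_hz, scale, lo_hz, hi_hz):
    """Scale a formant target and clamp it to that formant's own range."""
    return min(max(freq_hz * scale, lo_hz), hi_hz)


if __name__ == "__main__":
    f1 = bounded_frequency(0.3, *F1_RANGE)   # hypothetical network outputs
    f2 = bounded_frequency(-0.5, *F2_RANGE)
    f1_scaled = apply_scale(f1, 1.2, *F1_RANGE)
    # F2 is not an input to the F1 scaling, so it is unchanged.
    print(round(f1, 1), round(f1_scaled, 1), round(f2, 1))
```

The clamp also shows why a control can under-travel at extreme scales: a request that leaves the range is held at the boundary.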

Interpretable aperiodicity

A band-aperiodicity head, supervised by WORLD D4C, controls the harmonic-vs-noise balance per frequency band: an explicit, observable breathiness / HNR dimension.
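
The harmonic-vs-noise balance can be sketched per band as a power-preserving mix. The code below assumes the band-aperiodicity value is the noise share of band power in [0, 1]; that interpretation, and both function names, are assumptions for illustration and may differ from how ARIA or WORLD D4C define the quantity.

```python
import math


def band_gains(aperiodicity):
    """Return (harmonic_gain, noise_gain) amplitude gains for one band.

    Assumes `aperiodicity` is the fraction of band power assigned to noise,
    so the two gains always satisfy h**2 + n**2 == 1.
    """
    if not 0.0 <= aperiodicity <= 1.0:
        raise ValueError("aperiodicity must lie in [0, 1]")
    return math.sqrt(1.0 - aperiodicity), math.sqrt(aperiodicity)


def band_hnr_db(aperiodicity):
    """Harmonics-to-noise ratio of the band in dB, under the same assumption."""
    if not 0.0 < aperiodicity < 1.0:
        raise ValueError("HNR is unbounded at aperiodicity 0 or 1")
    return 10.0 * math.log10((1.0 - aperiodicity) / aperiodicity)


if __name__ == "__main__":
    # Hypothetical per-band values, low to high frequency: a breathier voice
    # would carry larger values, especially in the upper bands.
    for ap in (0.05, 0.2, 0.6):
        h, n = band_gains(ap)
        print(f"ap={ap:.2f} harmonic={h:.3f} noise={n:.3f} HNR={band_hnr_db(ap):.1f} dB")
```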

Low-resource

As little as one hour of single-speaker audio per voice (the cross-language models use 5 h). No text and no phonetic labels at synthesis time; analysis–resynthesis with true F0.

Scope & limitations

The oral tract is all-pole (clean, decoupled formants); a learned pole–zero (ARMA) section adds the anti-formant that nasals need, exposed as the nasalisation control.

  • The learned nasal zero mostly activates under manipulation; aggregate reconstruction gain is modest.
  • A dedicated low nasal pole (~250 Hz) and anti-formant supervision are future work.
  • Single speaker, ~1 h; timing is fixed (prominence excludes duration); rounding via F2 only.