Direct Articulatory-to-Acoustic Mapping from Ultrasound Tongue Imaging

Supplementary material — Frigyes Viktor Arthur and Dávid Sztahó

Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Listen to the samples below. All predicted speech is synthesized from midsagittal ultrasound video of the tongue.

How the system works

A short window of ~40 raw ultrasound frames (~500 ms) is processed by a convolutional encoder, aggregated, and decoded into an 80-channel mel-spectrogram frame. A pre-trained WaveGlow vocoder then converts the predicted mel-spectrogram into an audible waveform. The same vocoder is applied to the natural-speech reference, so listening tests compare two conditions that differ only in the articulatory-to-acoustic mapping (AAM) stage (the matched-vocoder protocol).
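The windowing step described above can be sketched as follows. This is only an illustration: the function name `window_indices`, the hop of one frame, and the choice to drop incomplete windows at the edges are assumptions, not details taken from the manuscript. The only figure taken from the text is the window length of roughly 40 frames.

```python
from typing import List


def window_indices(num_frames: int, window: int = 40, hop: int = 1) -> List[range]:
    """Return the frame-index range of every full window over a recording.

    Assumptions (not from the manuscript): windows advance by `hop` frames
    and incomplete windows at the end of the recording are dropped.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [
        range(start, start + window)
        for start in range(0, num_frames - window + 1, hop)
    ]


# Hypothetical 100-frame recording: each window would be fed to the
# encoder and decoded into one predicted mel-spectrogram frame.
windows = window_indices(100)
print(len(windows), windows[0].start, windows[-1].stop)  # 61 0 100
```

A recording shorter than one window yields an empty list under these assumptions; padding the edges instead would be an equally plausible design.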

System architecture. A temporal window of ~40 UTI frames is processed by a 4-layer CNN encoder, aggregated via flatten + linear projection, and decoded to an 80-dimensional mel-spectrogram frame. WaveGlow converts the predicted spectrogram to a 22 kHz audio waveform. The reference path (bottom) extracts mel-spectrograms from recorded audio for evaluation.
Example utterances. Two test items are shown side by side: rain (sub-07, session 1, MCD = 4.1 dB) on the left and sky (sub-03, session 1, MCD = 3.5 dB) on the right. The top rows are four UTI frames sampled at different time points within the utterance; below are the original and predicted mel-spectrograms.
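For readers unfamiliar with the MCD values quoted in the caption, the sketch below shows the commonly used definition of Mel-Cepstral Distortion. The page does not state how the authors computed it (number of coefficients, treatment of the energy coefficient, time alignment), so everything here is an assumption: the function name `mel_cepstral_distortion` is hypothetical, and the caller is assumed to pass frames that are already time-aligned and already stripped of the 0th coefficient.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    pred: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Common definition: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    averaged over frames. Alignment and coefficient selection are left
    to the caller (an assumption of this sketch).
    """
    if len(ref) != len(pred) or not ref:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, p in zip(ref, pred):
        if len(r) != len(p):
            raise ValueError("frames must have the same number of coefficients")
        total += scale * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, p)))
    return total / len(ref)


# Toy check with made-up coefficients: identical frames give 0 dB.
print(mel_cepstral_distortion([[1.0, 2.0]], [[1.0, 2.0]]))  # 0.0
```

Lower values indicate a closer spectral match between the predicted and the reference frames.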

Abstract

Articulatory-to-acoustic mapping (AAM), the reconstruction of speech from articulator measurements, is a core building block for silent speech interfaces (SSIs). Most ultrasound tongue imaging (UTI)-based AAM systems predict classical vocoder parameters (MGC, LSP, F0) that require brittle voicing/pitch decisions and bypass modern neural waveform generators. We propose a temporal convolutional network that maps short windows of UTI frames directly to 80-channel mel-spectrograms, which a pre-trained WaveGlow vocoder then inverts to a waveform. The system is trained and evaluated on a custom multimodal corpus of four speakers across three sessions each (1,200 utterances of isolated English words; Hungarian L1 / English L2). The same vocoder is applied to natural and predicted mel-spectrograms, so vocoder artifacts are shared across conditions and the evaluation isolates the AAM stage. With this matched-vocoder protocol the system reaches a mean Mel-Cepstral Distortion of 4.46 dB, with no significant inter-speaker differences (p = 0.838). A blinded 8-alternative forced-choice listening test with 31 listeners (1,240 trials, IEEE Std 1329 conformant) reaches 84.0% word recognition accuracy for synthesized speech against 99.7% for natural recordings (Cohen’s h = 0.71), a Word Error Rate of 16.0%, macro-AUC of 0.895, and substantial inter-rater agreement (Fleiss’ κ = 0.75). Recognition errors are phonetically structured and concentrate on minimal pairs such as door/four, consistent with the limited visibility of the tongue tip and lips in midsagittal UTI. The matched-vocoder protocol and the listener-response data are released, providing a reproducible reference point for UTI-based AAM and a foundation for cross-speaker pooling, continuous-speech extension, and target-population studies.
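The effect size and error rate quoted in the abstract can be reproduced from the two reported accuracies. The sketch below uses the standard arcsine-transform definition of Cohen's h. The function name `cohens_h` is ours, and treating the Word Error Rate as the complement of recognition accuracy is an assumption that fits a forced-choice test on isolated words but is not spelled out on this page.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (arcsine transform)."""
    def phi(p: float) -> float:
        return 2.0 * math.asin(math.sqrt(p))
    return phi(p1) - phi(p2)


natural_accuracy = 0.997      # natural recordings, from the abstract
synthesized_accuracy = 0.840  # synthesized speech, from the abstract

print(round(cohens_h(natural_accuracy, synthesized_accuracy), 2))  # 0.71

# Assumed relation for isolated-word forced choice: WER = 1 - accuracy.
print(round((1.0 - synthesized_accuracy) * 100, 1))  # 16.0
```

Both printed values agree with the figures reported above (h = 0.71, WER = 16.0%).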

A few audio examples

PREDICTED: synthesized from ultrasound
ORIGINAL: natural recording

bring

sub-06 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: bring sub-06 ses-01
ORIGINAL natural recording
Original mel: bring sub-06 ses-01

dance

sub-02 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: dance sub-02 ses-01
ORIGINAL natural recording
Original mel: dance sub-02 ses-01

four

sub-03 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: four sub-03 ses-01
ORIGINAL natural recording
Original mel: four sub-03 ses-01

laugh

sub-06 / ses-02

PREDICTED synthesized from ultrasound
Predicted mel: laugh sub-06 ses-02
ORIGINAL natural recording
Original mel: laugh sub-06 ses-02

rain

sub-03 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: rain sub-03 ses-01
ORIGINAL natural recording
Original mel: rain sub-03 ses-01

sad (3 examples)

sub-03 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: sad sub-03 ses-01
ORIGINAL natural recording
Original mel: sad sub-03 ses-01

sub-06 / ses-02

PREDICTED synthesized from ultrasound
Predicted mel: sad sub-06 ses-02
ORIGINAL natural recording
Original mel: sad sub-06 ses-02

sub-06 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: sad sub-06 ses-01
ORIGINAL natural recording
Original mel: sad sub-06 ses-01

sky

sub-02 / ses-01

PREDICTED synthesized from ultrasound
Predicted mel: sky sub-02 ses-01
ORIGINAL natural recording
Original mel: sky sub-02 ses-01

How to cite

Until the article is published, please cite this repository directly via the GitHub “Cite this repository” button (powered by the CITATION.cff file) or use the placeholder BibTeX entry below:

@unpublished{arthur2026directaam,
 author = {Arthur, Frigyes Viktor and Sztah{\'{o}}, D{\'{a}}vid},
 title = {Direct Articulatory-to-Acoustic Mapping from Ultrasound Tongue Imaging},
 year = {2026},
 note = {Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.}
}

License

Unless otherwise stated, the contents of this site and the linked repository are released under Creative Commons Attribution 4.0 International (CC BY 4.0).