Conference paper

Can we use Common Voice to train a Multi-Speaker TTS system?

Abstract

Training multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points compared with training on all the samples and by 0.35 MOS points compared with training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
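The selection step described in the abstract can be pictured as a filter over per-clip quality estimates. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `predict_mos` is a hypothetical callable standing in for a non-intrusive MOS estimator such as WV-MOS, the threshold of 4.0 is an illustrative value, and the clip names and scores are made up.

```python
from typing import Callable, Iterable, List, Tuple


def select_high_quality(
    clips: Iterable[str],
    predict_mos: Callable[[str], float],
    threshold: float = 4.0,  # assumed cutoff, chosen for illustration only
) -> List[Tuple[str, float]]:
    """Keep the clips whose estimated MOS is at least `threshold`."""
    selected = []
    for path in clips:
        score = predict_mos(path)  # hypothetical estimator call
        if score >= threshold:
            selected.append((path, score))
    return selected


# Made-up scores standing in for the output of a real MOS estimator.
fake_scores = {"clip_a.mp3": 4.3, "clip_b.mp3": 2.1, "clip_c.mp3": 4.0}

kept = select_high_quality(fake_scores, fake_scores.__getitem__, threshold=4.0)
print(kept)
```

In practice the estimator would take audio as input, and the cutoff trades the amount of retained data (and the number of speakers) against its quality.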
Main file
Can_we_use_Mozilla_Common_Voice_for_TTS_CC (1).pdf (213.02 KB)
Origin : Files produced by the author(s)

Dates and versions

hal-03812715, version 1 (13-10-2022)

Identifiers

  • HAL Id: hal-03812715, version 1

Cite

Sewade Ogun, Vincent Colotte, Emmanuel Vincent. Can we use Common Voice to train a Multi-Speaker TTS system?. The 2022 IEEE Spoken Language Technology Workshop (SLT 2022), Jan 2023, Doha, Qatar. ⟨hal-03812715⟩