DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features

Tianyi Tan, Haoxin Ruan, Xinan Chen, Kai Chen, Zhibin Lin and Jing Lu
NCA Lab

0. Contents

  1. Abstract
  2. Demos -- Baselines
  3. Demos -- DistillW2N
  4. Demos -- Sample Whisper
  5. References


1. Abstract

Whisper-to-normal voice conversion (W2N) holds great promise for assistive communication and healthcare, making it an exciting area of research and development. Recent advances in W2N are predominantly driven by self-supervised speech representation learning (SSL). While effective, SSL models require a large number of parameters and incur high computational costs, making them impractical to deploy in real-world applications. We propose DistillW2N, a lightweight one-shot W2N model. DistillW2N consists of a speech-to-unit (S2U) encoder, which closes the gap between whispered and normal content units by distilling HuBERT-Soft representations from both normal and pseudo-whispered speech, and a unit-to-speech (U2S) decoder, which combines content units with timbre units via Style-Adaptive Layer Normalization (SALN) and leverages the SoundStream decoder for lightweight, high-quality speech synthesis. Moreover, we find that the S2U encoder can learn from a voice activity detection (VAD) model to trim noise. Experiments show that DistillW2N significantly improves the intelligibility of whispered speech while preserving speaker similarity, compared with prevailing SSL approaches. The resulting model requires only 10.91 M parameters and 1.63 GMACs per second, and runs approximately 5 times faster than QuickVC. All samples and code are available at https://github.com/tan90xx/distillw2n.
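
To make the two mechanisms above concrete, here is a minimal PyTorch sketch of (a) SALN conditioning in the U2S decoder and (b) the feature-distillation objective of the S2U encoder. Module names, dimensions, and the L1 distance are illustrative assumptions, not the released implementation; see the repository above for the actual code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SALN(nn.Module):
        """Style-Adaptive Layer Normalization: layer-normalize content
        features, then rescale/shift them with a gain and bias predicted
        from a timbre (style) embedding."""

        def __init__(self, content_dim: int, style_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(content_dim, elementwise_affine=False)
            # One linear layer predicts per-channel gain and bias jointly.
            self.affine = nn.Linear(style_dim, 2 * content_dim)

        def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
            # content: (batch, frames, content_dim); style: (batch, style_dim)
            gain, bias = self.affine(style).chunk(2, dim=-1)
            return gain.unsqueeze(1) * self.norm(content) + bias.unsqueeze(1)

    def distill_loss(student_units: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
        # Regress S2U outputs onto frozen HuBERT-Soft features. Feeding the
        # student both normal and pseudo-whispered audio with the *same*
        # targets is what closes the whisper/normal unit gap. (The L1
        # distance is an assumption here, not necessarily the paper's choice.)
        return F.l1_loss(student_units, soft_targets)

Because SALN injects the speaker embedding through predicted gain and bias rather than concatenation, the timbre vector can be swapped at inference time, which is what enables one-shot conversion to an unseen target speaker.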

2. Demos -- Baselines

Target Speaker   Source Speech   Pseudo-whisper   Soft-VC   WESPER    FreeVC    QuickVC
s000u003n        s000u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s001u003n        s001u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s002u003n        s002u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s003u003n        s003u003w       [audio]          [audio]   [audio]   [audio]   [audio]
(Each [audio] cell is an embedded sample on the demo page.)

3. Demos -- DistillW2N

Target Speaker   Source Speech   S2U-FS2-HiFiGAN   S2U-MS-iSTFT-VITS   S2U-U2S
s000u003n        s000u003w       [audio]           [audio]             [audio]
s001u003n        s001u003w       [audio]           [audio]             [audio]
s002u003n        s002u003w       [audio]           [audio]             [audio]
s003u003n        s003u003w       [audio]           [audio]             [audio]
All three systems are our proposed models; S2U-U2S is the full DistillW2N.

4. Demos -- Sample Whisper

Source Speech   Comparison with WESPER and DistillW2N (S2U-U2S)
DistillW2N (ours):   S2U-U2S-s000   S2U-U2S-s001   S2U-U2S-s002   S2U-U2S-s003
WESPER:              QuickVC-s000   QuickVC-s001   QuickVC-s002   QuickVC-s003
Audio data are obtained from the WESPER demo page.

5. References

Pseudo-whisper: Z. Lin, T. Patel, and O. Scharenborg, “Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation,” in ASRU 2023.
Soft-VC: B. van Niekerk, M.-A. Carbonneau, J. Zaïdi, M. Baas, H. Seuté, and H. Kamper, “A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion,” in ICASSP 2022.
WESPER: J. Rekimoto, “WESPER: Zero-shot and realtime whisper to normal voice conversion for whisper-based speech interactions,” in ACM CHI 2023.
FreeVC: J. Li, W. Tu, and L. Xiao, “FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion,” in ICASSP 2023.
QuickVC: H. Guo, C. Liu, C. T. Ishi, and H. Ishiguro, “QUICKVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model using ISTFT for Faster Conversion,” in ASRU 2023.