DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features
0. Contents
1. Abstract
2. Demos -- Baselines
3. Demos -- DistillW2N
4. Demos -- Sample Whisper
5. References
1. Abstract
Whisper-to-normal voice conversion (W2N) holds great promise for assistive communication and healthcare, making it an exciting area of research and development. Recent advances in W2N are driven predominantly by self-supervised speech representation learning (SSL). While effective, SSL models require extensive parameters and high computational cost, making them impractical to deploy in real-world applications. We propose DistillW2N, a lightweight one-shot W2N model. DistillW2N consists of a speech-to-unit (S2U) encoder, which closes the gap between whispered and normal content units by distilling HuBERT-Soft representations from both normal speech and pseudo-whisper, and a unit-to-speech (U2S) decoder, which fuses content units with timbre units via Style-Adaptive Layer Normalization (SALN) and leverages a SoundStream decoder for lightweight, high-quality speech synthesis. Moreover, we find that the S2U encoder can learn from a voice activity detection (VAD) model to trim noise. Experiments show that DistillW2N significantly improves the intelligibility of whispered speech while preserving speaker similarity, compared with prevailing SSL approaches. The resulting model requires only 10.91 M parameters and 1.63 GMACs per second and runs approximately 5 times faster than QuickVC. All samples and code are available at https://github.com/tan90xx/distillw2n.
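To make the SALN conditioning in the U2S decoder concrete, here is a minimal, illustrative PyTorch sketch of style-adaptive layer normalization, i.e. a layer norm whose scale and shift are predicted from a timbre embedding. This is not the released implementation; the module name, tensor shapes, and the source of the timbre embedding are assumptions.

```python
# Illustrative SALN sketch (assumed shapes, not the official DistillW2N code).
import torch
import torch.nn as nn


class SALN(nn.Module):
    """Layer norm whose affine parameters are predicted from a style (timbre) vector."""

    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        # Normalize without learned affine; gamma/beta come from the style vector.
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)  # predicts gamma and beta

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) content units; style: (batch, style_dim) timbre embedding
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


# Toy usage: condition 256-d content units on a 192-d timbre embedding.
content = torch.randn(1, 100, 256)   # e.g. units from the S2U encoder (assumed dimension)
timbre = torch.randn(1, 192)         # e.g. a speaker/timbre embedding (assumed dimension)
out = SALN(256, 192)(content, timbre)
print(out.shape)  # torch.Size([1, 100, 256])
```

The modulated features would then be fed to a lightweight waveform decoder (a SoundStream-style decoder in the paper) to synthesize normal speech in the target timbre.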
2. Demos -- Baselines
Target Speaker | Source Speech | Pseudo-whisper | Soft-VC | WESPER | FreeVC | QuickVC
s000u003n | s000u003w | (audio) | (audio) | (audio) | (audio) | (audio)
s001u003n | s001u003w | (audio) | (audio) | (audio) | (audio) | (audio)
s002u003n | s002u003w | (audio) | (audio) | (audio) | (audio) | (audio)
s003u003n | s003u003w | (audio) | (audio) | (audio) | (audio) | (audio)
3. Demos -- DistillW2N
Target Speaker | Source Speech | S2U-FS2-HiFiGAN | S2U-MS-iSTFT-VITS | S2U-U2S
s000u003n | s000u003w | (audio) | (audio) | (audio)
s001u003n | s001u003w | (audio) | (audio) | (audio)
s002u003n | s002u003w | (audio) | (audio) | (audio)
s003u003n | s003u003w | (audio) | (audio) | (audio)
4. Demos -- Sample Whisper
Source Speech | Comparison with WESPER and DistillW2N (S2U-U2S)
(audio) | S2U-U2S-s000 | S2U-U2S-s001 | S2U-U2S-s002 | S2U-U2S-s003
WESPER | QuickVC-s000 | QuickVC-s001 | QuickVC-s002 | QuickVC-s003
5. References
Pseudo-whisper: Z. Lin, T. Patel, and O. Scharenborg, “Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation,” in ASRU 2023.
Soft-VC: B. van Niekerk, M.-A. Carbonneau, J. Zaïdi, M. Baas, H. Seuté, and H. Kamper, “A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion,” in ICASSP 2022.
WESPER: J. Rekimoto, “WESPER: Zero-shot and realtime whisper to normal voice conversion for whisper-based speech interactions,” in ACM CHI 2023.
FreeVC: J. Li, W. Tu, and L. Xiao, “FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion,” in ICASSP 2023.
QuickVC: H. Guo, C. Liu, C. T. Ishi, and H. Ishiguro, “QUICKVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model using ISTFT for Faster Conversion,” in ASRU 2023.