DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features

Tianyi Tan, Haoxin Ruan, Xinan Chen, Kai Chen, Zhibin Lin and Jing Lu
NCA Lab

0. Contents

  1. Abstract
  2. Demos -- Baselines
  3. Demos -- DistillW2N
  4. Demos -- Sample Whisper
  5. References


1. Abstract

Whisper-to-normal voice conversion (W2N) holds great promise for assistive communication and healthcare, making it an exciting area of research and development. Recent advances in W2N are predominantly driven by self-supervised speech representation learning (SSL). While effective, SSL models require a large number of parameters and incur high computational costs, making them impractical to deploy in real-world applications. We propose DistillW2N, a lightweight one-shot W2N model. DistillW2N consists of a speech-to-unit (S2U) encoder, which closes the gap between whispered and normal content units by distilling HuBERT-Soft representations from both normal and pseudo-whispered speech, and a unit-to-speech (U2S) decoder, which combines content units with timbre units via Style-Adaptive Layer Normalization (SALN) and leverages the SoundStream decoder for lightweight, high-quality speech synthesis. Moreover, we find that the S2U encoder can learn from a voice activity detection (VAD) model to trim noise. Experiments show that DistillW2N significantly improves the intelligibility of whispered speech while preserving speaker similarity, compared with prevailing SSL approaches. The resulting model requires only 10.91 M parameters and 1.63 GMACs per second, and runs approximately 5 times faster than QuickVC. All samples and code are available at https://github.com/tan90xx/distillw2n.
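
To make the two mechanisms above concrete, here is a minimal PyTorch sketch of (a) SALN conditioning in the U2S decoder and (b) the feature-distillation objective of the S2U encoder. Module names, dimensions, and the L1 distance are illustrative assumptions, not the released implementation; see the repository above for the actual code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SALN(nn.Module):
        """Style-Adaptive Layer Normalization: layer-normalize content
        features, then rescale/shift them with a gain and bias predicted
        from a timbre (style) embedding."""

        def __init__(self, content_dim: int, style_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(content_dim, elementwise_affine=False)
            # One linear layer predicts per-channel gain and bias jointly.
            self.affine = nn.Linear(style_dim, 2 * content_dim)

        def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
            # content: (batch, frames, content_dim); style: (batch, style_dim)
            gain, bias = self.affine(style).chunk(2, dim=-1)
            return gain.unsqueeze(1) * self.norm(content) + bias.unsqueeze(1)

    def distill_loss(student_units: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
        # Regress S2U outputs onto frozen HuBERT-Soft features. Feeding the
        # student both normal and pseudo-whispered audio with the *same*
        # targets is what closes the whisper/normal unit gap. (The L1
        # distance is an assumption here, not necessarily the paper's choice.)
        return F.l1_loss(student_units, soft_targets)

Because SALN injects the speaker embedding through predicted gain and bias rather than concatenation, the timbre vector can be swapped at inference time, which is what enables one-shot conversion to an unseen target speaker.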

2. Demos -- Baselines

Target Speaker   Source Speech   Pseudo-whisper   Soft-VC   WESPER    FreeVC    QuickVC
s000u003n        s000u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s001u003n        s001u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s002u003n        s002u003w       [audio]          [audio]   [audio]   [audio]   [audio]
s003u003n        s003u003w       [audio]          [audio]   [audio]   [audio]   [audio]
(Each [audio] cell is an embedded sample on the demo page.)

3. Demos -- DistillW2N

Target Speaker   Source Speech   S2U-FS2-HiFiGAN   S2U-MS-iSTFT-VITS   S2U-U2S
s000u003n        s000u003w       [audio]           [audio]             [audio]
s001u003n        s001u003w       [audio]           [audio]             [audio]
s002u003n        s002u003w       [audio]           [audio]             [audio]
s003u003n        s003u003w       [audio]           [audio]             [audio]
All three systems are our proposed models; S2U-U2S is the full DistillW2N.

4. Demos -- Sample Whisper

Source Speech   Comparison with WESPER and DistillW2N (S2U-U2S)
DistillW2N (ours):   S2U-U2S-s000   S2U-U2S-s001   S2U-U2S-s002   S2U-U2S-s003
WESPER:              QuickVC-s000   QuickVC-s001   QuickVC-s002   QuickVC-s003
Audio data are obtained from the WESPER demo page.

5. References

Pseudo-whisper: Z. Lin, T. Patel, and O. Scharenborg, “Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation,” in ASRU 2023.
Soft-VC: B. van Niekerk, M.-A. Carbonneau, J. Zaïdi, M. Baas, H. Seuté, and H. Kamper, “A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion,” in ICASSP 2022.
WESPER: J. Rekimoto, “WESPER: Zero-shot and realtime whisper to normal voice conversion for whisper-based speech interactions,” in ACM CHI 2023.
FreeVC: J. Li, W. Tu, and L. Xiao, “FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion,” in ICASSP 2023.
QuickVC: H. Guo, C. Liu, C. T. Ishi, and H. Ishiguro, “QUICKVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model using ISTFT for Faster Conversion,” in ASRU 2023.