Publications
2022
- Single-Channel Target Speaker Separation Using Joint Training with Target Speaker's Pitch Information
  Jincheng He, Yuanyuan Bao, Na Xu, Hongfeng Li, Shicong Li, Linzhang Wang, Fei Xiang, and Ming Li
  In Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022
Despite the great progress achieved on the target speaker separation (TSS) task, robust ways to improve performance that are independent of the model architecture and the training loss are still being sought. Pitch extraction plays an important role in applications such as speech enhancement and speech separation, and it becomes a challenging task when multiple speakers are present in the same utterance. In this paper, we explore whether target speaker pitch extraction is possible and how the extracted target pitch can help improve TSS performance. A target pitch extraction model is built and incorporated into different TSS models using two strategies, namely concatenation and joint training. Experimental results on the LibriSpeech dataset show that both training strategies bring significant improvements to the TSS task, even when the precision of the target pitch extraction module is not high.
@inproceedings{he22_odyssey,
  author    = {He, Jincheng and Bao, Yuanyuan and Xu, Na and Li, Hongfeng and Li, Shicong and Wang, Linzhang and Xiang, Fei and Li, Ming},
  title     = {{Single-Channel Target Speaker Separation Using Joint Training with Target Speaker's Pitch Information}},
  year      = {2022},
  booktitle = {Proc. The Speaker and Language Recognition Workshop (Odyssey 2022)},
  pages     = {301--305},
  doi       = {10.21437/Odyssey.2022-42},
}
2021
- Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication
  Yuanyuan Bao, Yanze Xu, Na Xu, Wenjing Yang, Hongfeng Li, Shicong Li, Yongtao Jia, Fei Xiang, Jincheng He, and Ming Li
  arXiv preprint arXiv:2106.02934, 2021
Nowadays, there is a strong need to deploy the target speaker separation (TSS) model on mobile devices under constraints on model size and computational complexity. To better perform TSS for mobile voice communication, we first create a dual-channel dataset, LibriPhone, based on this specific scenario. To better mimic the real-world case, instead of simulating from a single-channel dataset, LibriPhone is made by simultaneously replaying pairs of utterances from LibriSpeech through two professional artificial heads and recording them with the two built-in microphones of a mobile phone. We then propose a lightweight time-frequency domain separation model, LSTM-Former, which is based on the LSTM framework and trained with the scale-invariant signal-to-noise ratio (SI-SNR) loss. In the experiments on LibriPhone, we evaluate the dual-channel LSTM-Former model and a single-channel version trained on a randomly selected single channel of LibriPhone. Experimental results show that the dual-channel LSTM-Former outperforms the single-channel LSTM-Former by a relative 25%. This work provides a feasible solution for the TSS task on mobile devices: playing back and recording multiple data sources in real application scenarios to obtain dual-channel real data can help a lightweight model achieve higher performance.
@misc{bao2021lightweight,
  title         = {Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication},
  author        = {Bao, Yuanyuan and Xu, Yanze and Xu, Na and Yang, Wenjing and Li, Hongfeng and Li, Shicong and Jia, Yongtao and Xiang, Fei and He, Jincheng and Li, Ming},
  year          = {2021},
  eprint        = {2106.02934},
  archiveprefix = {arXiv},
  primaryclass  = {cs.SD},
}
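The SI-SNR loss mentioned in the abstract above has a standard closed form: project the zero-mean estimate onto the zero-mean reference to get the target component, treat the residual as noise, and take the log power ratio. A minimal pure-Python sketch of that standard definition (the function name, signature, and `eps` value are illustrative, not taken from the paper's implementation):

```python
import math

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between an
    estimated signal and a reference signal, given as float lists."""
    # Zero-mean both signals: SI-SNR is invariant to DC offset.
    est = [x - sum(est) / len(est) for x in est]
    ref = [x - sum(ref) / len(ref) for x in ref]
    # Project the estimate onto the reference to get the target part.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    s_target = [dot / ref_energy * r for r in ref]
    # Everything orthogonal to the reference counts as noise.
    e_noise = [e - t for e, t in zip(est, s_target)]
    target_pow = sum(t * t for t in s_target)
    noise_pow = sum(n * n for n in e_noise) + eps
    return 10 * math.log10(target_pow / noise_pow + eps)
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why SI-SNR (negated) is a popular training objective for separation models: the network is not rewarded for simply matching the reference gain.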