This repo is the official PyTorch implementation of 'VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification', which has been submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP).
This work proposes a variational Bayesian inference (VBI) framework with a neural speech prior for joint ASR-effective speech dereverberation and blind room impulse response (RIR) identification. VINP combines the prior distribution of anechoic speech, predicted by an arbitrary discriminative dereverberation DNN, with the reverberant recording, and uses VBI to solve a probabilistic graphical model based on the convolutive transfer function (CTF) approximation, which yields estimates of both the anechoic speech and the RIR. Because VBI does not use the DNN output directly yet still exploits its nonlinear modeling capability, the method proves effective for ASR without any joint training with the ASR system. Experimental results show that the proposed method achieves superior or competitive performance against state-of-the-art (SOTA) approaches in both tasks.
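For readers unfamiliar with the CTF approximation mentioned above: in each STFT frequency band, the reverberant signal is modeled as a convolution along the frame axis between the anechoic speech and a short band-wise filter. The following is a minimal, noise-free NumPy sketch of that observation model only. The names `ctf_observation`, `S`, and `H` are illustrative assumptions and are not taken from this repository, and the VBI inference itself is not shown.

```python
import numpy as np


def ctf_observation(S, H):
    """Apply a band-wise CTF filter to an anechoic STFT (illustrative only).

    S: complex array of shape (F, T), anechoic speech STFT.
    H: complex array of shape (F, L), CTF filter taps per frequency band.
    Returns X with X[f, t] = sum_l H[f, l] * S[f, t - l].
    """
    S = np.asarray(S, dtype=complex)
    H = np.asarray(H, dtype=complex)
    F, T = S.shape
    L = H.shape[1]
    X = np.zeros((F, T), dtype=complex)
    # Taps with a delay of T frames or more cannot contribute to X.
    for l in range(min(L, T)):
        # Delay the speech by l frames and weight it by the l-th tap.
        X[:, l:] += H[:, l:l + 1] * S[:, :T - l]
    return X
```

In this picture, dereverberation amounts to recovering `S` from `X`, and blind RIR identification amounts to recovering the filter `H` (and from it a time-domain RIR) without knowing `S`.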
Speech Dereverberation Results
Blind RIR Identification Results
Computational Cost
- Please see
Prepare Training Set and Validation Set
Prepare clean source speech and noise recordings in .wav or .flac format.
Prepare reverberant and direct-path RIRs
python dataset/ -c [config/config_gen_rir.json]
- Save the lists of file paths for the source speech, simulated RIRs (.npz), and noise to .txt files
python dataset/ -i [dirpath] -o [.txt filepath] -e [filename extension]
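The file-list step above can be pictured with the short sketch below, which recursively collects every file with a given extension and writes one path per line. It is a hypothetical stand-in, not the repository's script: the function name `write_filepath_list` and the one-path-per-line layout are assumptions.

```python
from pathlib import Path


def write_filepath_list(dirpath, txt_path, extension):
    """Write the sorted paths of all files under dirpath with the given extension.

    Hypothetical helper: assumes a plain-text list with one path per line.
    """
    ext = extension if extension.startswith(".") else "." + extension
    paths = sorted(
        str(p) for p in Path(dirpath).rglob("*" + ext) if p.is_file()
    )
    # End the file with a newline only when there is at least one entry.
    Path(txt_path).write_text("\n".join(paths) + ("\n" if paths else ""))
    return paths
```

Under these assumptions, calling it once per data type (speech, RIR, noise) produces the three lists.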
Prepare Test Set for Dereverberation
- Prepare the official single-channel test sets of the REVERB Challenge dataset.
Prepare Test Set for Blind RIR Identification
Prepare the RIRs in the 'Single' subfolder of the ACE Challenge dataset.
Downsample the RIRs to 16 kHz
python dataset/ -i [ACE 'Single' dirpath] -o [saved dirpath]
- Save the lists of file paths for the source speech, ACE RIRs, and noise to .txt files
python dataset/ -i [dirpath] -o [.txt filepath] -e [filename extension]
- Generate the test set (consisting of reverberant speech and labels)
python dataset/ --[keyword] [arg]
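The steps above (downsampling an RIR, then generating reverberant speech) can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code. It assumes the source RIRs are sampled at an integer multiple of 16 kHz (48 kHz is used as an example), that the noise is at least as long as the speech, and that noise is added at a chosen SNR. The function names, the windowed-sinc filter design, and the SNR parameter are all assumptions made for illustration.

```python
import numpy as np


def decimate_rir(rir, factor=3, num_taps=101):
    """Low-pass filter and keep every `factor`-th sample (e.g. 48 kHz -> 16 kHz).

    Uses a hand-rolled Hamming-windowed sinc filter; num_taps must be odd so
    that the filter delay is an integer number of samples.
    """
    rir = np.asarray(rir, dtype=float)
    half = (num_taps - 1) // 2
    n = np.arange(num_taps) - half
    # Ideal low-pass with cutoff at the new Nyquist frequency, then windowed.
    h = np.sinc(n / factor) / factor * np.hamming(num_taps)
    h /= h.sum()  # unit gain at DC
    # Trim the full convolution to compensate for the filter delay.
    filtered = np.convolve(rir, h)[half:half + len(rir)]
    return filtered[::factor]


def make_reverberant(speech, rir, noise, snr_db):
    """Convolve speech with an RIR and add noise scaled to the target SNR."""
    speech = np.asarray(speech, dtype=float)
    reverberant = np.convolve(speech, rir)[:len(speech)]
    noise = np.asarray(noise, dtype=float)[:len(reverberant)]
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that makes signal_power / (gain^2 * noise_power) equal the target.
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```

A library resampler such as `scipy.signal.resample_poly` is a common alternative to the hand-written filter and also handles non-integer rate ratios.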
- Edit the config file (for example: ).
- Run
# train from scratch
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] -c [config filepath] -p [saved dirpath]
# resume training
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] -c [config filepath] -p [saved dirpath] -r
# use pretrained checkpoints
torchrun --standalone --nnodes=1 --nproc_per_node=[number of GPUs] -c [config filepath] -p [saved dirpath] --start_ckpt [pretrained model filepath]
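As background for the commands above: `torchrun` starts one process per GPU and passes each process its identity through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows how a training entry point might read those variables alongside the flags listed above. It is a hypothetical illustration and not this repository's training script: the `dest` names and helper functions are assumptions, and only the flag spellings come from the commands.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags shown in the commands above (dest names are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", dest="config", required=True)
    parser.add_argument("-p", dest="save_dir", required=True)
    parser.add_argument("-r", dest="resume", action="store_true")
    parser.add_argument("--start_ckpt", default=None)
    return parser.parse_args(argv)


def distributed_env(environ=os.environ):
    """Read the per-process variables that torchrun exports."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }
```

With `--nproc_per_node=2`, for example, two processes would see `WORLD_SIZE=2` and `LOCAL_RANK` values 0 and 1, and each would typically bind to the GPU matching its local rank.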
- Run
python -c [config filepath] --ckpt [list of checkpoints] -i [reverberant speech dirpath] -o [output dirpath] -d [GPU id]
Speech Quality
- For SimData, run
bash eval/ -i [speech dirpath] -r [reference dirpath]
- For RealData, the reference is not available. Run
bash eval/ -i [speech dirpath]
- For ASR evaluation on SimData, run
python eval/ -i [speech dirpath] -m [Whisper model name (tiny, small, or medium)]
- For ASR evaluation on RealData, run
python eval/ -i [speech dirpath] -m [Whisper model name (tiny, small, or medium)]
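Assuming the ASR evaluation reports word error rate (WER), the metric itself can be sketched as the word-level edit distance between the reference transcript and the Whisper output, divided by the number of reference words. The function below is an illustrative implementation and not the repository's evaluation script. Text normalization (casing, punctuation), which usually precedes scoring, is omitted.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Corpus-level WER is usually computed by summing the edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.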
RT60 and DRR
- Estimate RT60 and DRR using
python -i [estimated RIR dirpath]
- Run
python eval/ -o [estimated RT60 or DRR .json] -r [reference RT60 or DRR .json]
Evaluation results are saved to the output folder.
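For reference, one common way to obtain RT60 and DRR from an RIR is sketched below: RT60 from a line fit to the Schroeder energy decay curve, and DRR from the energy in a short window around the direct-path peak relative to the remaining energy. This is a minimal sketch under stated assumptions and may differ from the estimator used in this repository. The -5 dB to -25 dB fit range, the 2.5 ms window, and all function names are assumptions.

```python
import numpy as np


def schroeder_edc_db(rir):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    return 10.0 * np.log10(edc / edc[0])


def estimate_rt60(rir, fs, start_db=-5.0, end_db=-25.0):
    """Fit a line to the EDC between start_db and end_db, extrapolate to -60 dB."""
    edc_db = schroeder_edc_db(rir)
    idx = np.where((edc_db <= start_db) & (edc_db >= end_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # slope in dB per second
    return -60.0 / slope


def estimate_drr(rir, fs, window_ms=2.5):
    """Direct-to-reverberant ratio in dB, direct path = window around the peak."""
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))
    lo, hi = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverberant = np.sum(rir ** 2) - direct
    return 10.0 * np.log10(direct / reverberant)
```

Both quantities are invariant to the overall scale of the RIR, so an estimated RIR does not need to be amplitude-matched to the reference before comparison.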
If you find our work helpful, please cite
title={VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification},
author={Pengyu Wang and Ying Fang and Xiaofei Li},