SPEAKER SELECTIVE BEAMFORMER WITH KEYWORD MASK ESTIMATION
Yusuke Kida, Dung Tran, Motoi Omachi, Toru Taniguchi, and Yuya Fujita
2018 IEEE Workshop on Spoken Language Technology (SLT), 2018/12
Speech Processing, Machine Learning
- This paper addresses the problem of automatic speech recognition (ASR) of a target speaker in the presence of background speech. The novelty of our approach is that we focus on the wakeup keyword, which is commonly used to activate ASR systems such as smart speakers. The proposed method first uses a DNN-based mask estimator to separate the mixture signal into the keyword signal uttered by the target speaker and the remaining background speech. The separated signals are then used to calculate a beamforming filter that enhances the subsequent utterances from the target speaker. Experimental evaluations show that the trained DNN-based mask can selectively separate the keyword and background speech from the mixture signal. The effectiveness of the proposed method is also verified with Japanese ASR experiments, and we confirm that the character error rates are significantly improved by the proposed method for both simulated and real recorded test sets.
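The second step of the abstract, computing a beamforming filter from the mask-separated keyword and background signals, can be illustrated with a minimal sketch. The abstract does not state which beamformer the authors use, so the mask-weighted spatial covariance estimate and the MVDR formulation below are assumptions chosen for illustration, as are all function names, array layouts, and the diagonal loading constant. The DNN mask estimator itself is not shown; the masks are assumed to be given.

```python
import numpy as np


def masked_covariance(Y, mask, eps=1e-10):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT of the keyword segment, shape (channels, frames, freqs)
    mask: time-frequency mask in [0, 1], shape (frames, freqs)
    Returns an array of shape (freqs, channels, channels).
    """
    weighted = Y * mask[None, :, :]
    cov = np.einsum("ctf,dtf->fcd", weighted, Y.conj())
    norm = mask.sum(axis=0)  # total mask weight per frequency bin
    return cov / (norm[:, None, None] + eps)


def mvdr_weights(cov_target, cov_noise, ref_channel=0, diag_load=1e-6):
    """MVDR filter per frequency bin (assumed formulation, not from the paper).

    w(f) = [Phi_N(f)^-1 Phi_S(f)] u_ref / trace(Phi_N(f)^-1 Phi_S(f))
    Returns weights of shape (freqs, channels).
    """
    n_ch = cov_target.shape[-1]
    # Diagonal loading keeps the noise covariance well conditioned.
    trace_n = np.trace(cov_noise, axis1=-2, axis2=-1).real
    load = (diag_load * trace_n / n_ch + 1e-10)[:, None, None] * np.eye(n_ch)
    numerator = np.linalg.solve(cov_noise + load, cov_target)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    # Select the column belonging to the reference microphone.
    return numerator[..., ref_channel] / (trace[:, None] + 1e-10)


def apply_beamformer(weights, Y):
    """Apply w(f)^H y(t, f) to a multichannel STFT; returns (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), Y)
```

In use, the keyword mask and the background mask (hypothetical arrays `keyword_mask` and `background_mask` produced by a mask estimator) would each be passed to `masked_covariance` together with the STFT of the keyword segment. The resulting filter from `mvdr_weights` would then be applied with `apply_beamformer` to the STFT of the utterance that follows the keyword.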
SPEAKER SELECTIVE BEAMFORMER WITH KEYWORD MASK ESTIMATION (External Site Link)