End-to-end models have shown significant potential in most languages and have recently proved robust in automatic speech recognition (ASR) tasks. Many architectures have been proposed, and among them the Recurrent Neural Network Transducer (RNN-T) has shown remarkable success. However, on spontaneous speech with background noise or reverberation, this architecture generally suffers from a high rate of deletion errors.