- Abstract: Robust Automatic Speech Recognition (ASR) systems driven by speech enhancement typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, even though they improve speech quality, do not always improve recognition performance. On the other hand, multi-condition training of ASR systems offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with a unified enhancement and sequence classification block that is trained end-to-end using only noisy speech, without any information about the clean speech. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, and it is followed by an LSTM sequence classification block. The T-F masking formulation makes it possible to visualize the learned mask and thereby identify the T-F points that are important for classifying a speech command. Experiments on the Google Speech Commands dataset show that the proposed network achieves better results than a baseline model without an enhancement front-end.
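
To make the T-F masking idea concrete, the following is a minimal sketch of an element-wise masking step and of how a mask can be inspected for its most important T-F points. It is not the architecture proposed in the paper: `mask_logits` is a hypothetical stand-in for the output of the convolutional enhancement block, the sigmoid squashing and the helper names `apply_tf_mask` and `top_tf_points` are assumptions made for illustration, and the LSTM classifier is omitted.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_tf_mask(noisy_spec, mask_logits):
    """Apply a T-F mask element-wise: enhanced[t][f] = mask[t][f] * noisy[t][f].

    noisy_spec  -- list of frames, each a list of magnitudes per frequency bin
    mask_logits -- same shape; assumed here to come from an enhancement network
    """
    mask = [[sigmoid(z) for z in row] for row in mask_logits]
    enhanced = [
        [m * y for m, y in zip(m_row, y_row)]
        for m_row, y_row in zip(mask, noisy_spec)
    ]
    return mask, enhanced


def top_tf_points(mask, k):
    """Return the (frame, bin) indices of the k largest mask values."""
    points = [(m, t, f) for t, row in enumerate(mask) for f, m in enumerate(row)]
    points.sort(key=lambda p: p[0], reverse=True)
    return [(t, f) for _, t, f in points[:k]]


# Toy spectrogram with 2 frames and 3 frequency bins (made-up values).
noisy_spec = [[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3]]
# Made-up scores standing in for the enhancement block's output.
mask_logits = [[4.0, -4.0, 0.0],
               [-3.0, 5.0, -1.0]]

mask, enhanced = apply_tf_mask(noisy_spec, mask_logits)
important = top_tf_points(mask, 2)
```

In an end-to-end setup like the one described, the masked representation `enhanced` would be passed to the sequence classifier, and the mask would be learned only through the classification loss. Listing the largest mask values, as `top_tf_points` does, is one simple way to see which T-F points the model retains.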