Sparsity in Recurrent Neural Networks

Sharan Narang, Greg Diamos, Shubho Sengupta & Erich Elsen @ Baidu Research:

Recent advances in multiple fields such as speech recognition (Graves & Jaitly, 2014; Amodei et al., 2015), language modeling (Jozefowicz et al., 2016) and machine translation (Wu et al., 2016) can be at least partially attributed to larger training datasets, larger models and more compute that allows larger models to be trained on larger datasets.

For example, the deep neural network used for acoustic modeling in Hannun et al. (2014) had 11 million parameters, which grew to approximately 67 million for bidirectional RNNs and further to 116 million for the latest forward-only GRU models in Amodei et al. (2015). In language modeling, the size of the non-embedding parameters (mostly in the recurrent layers) has exploded, even as various ways of hand-engineering sparsity into the embeddings have been explored in Jozefowicz et al. (2016) and Chen et al. (2015a).

These large models face two significant challenges in deployment. Mobile phones and embedded devices have limited memory and storage, and in some cases network bandwidth is also a concern. In addition, the evaluation of these models requires a significant amount of computation. Even in those cases where the networks can be evaluated fast enough, evaluation has a significant impact on battery life in mobile phones (Han et al., 2015).