RNNs have shown remarkable effectiveness in tasks such as music generation, speech recognition, and machine translation. However, RNN computations involve both intra-timestep and inter-timestep dependencies, and because of these dependencies, hardware acceleration of RNNs is more challenging than that of CNNs.
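To make the two kinds of dependency concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass (sizes and names are illustrative, not taken from the survey). The matrix-vector products inside a timestep are independent of each other (intra-timestep parallelism), but each hidden state depends on the previous one (inter-timestep dependency), so the time loop is inherently sequential, unlike a CNN's layer-wise computation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 64, 128, 10

# Illustrative weights for a vanilla RNN cell.
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
b_h = np.zeros(hidden_size)

xs = rng.standard_normal((seq_len, input_size))  # input sequence
h = np.zeros(hidden_size)                        # initial hidden state

for t in range(seq_len):
    # Intra-timestep: the two matrix-vector products are independent
    # and can run in parallel on an accelerator.
    # Inter-timestep: the W_hh @ h term uses h from step t-1, so
    # iterations of this loop cannot be parallelized across t.
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
```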
We survey 100+ papers on GPU-, FPGA-, and ASIC-based accelerators and optimization techniques for RNNs. We review techniques for simplifying RNN architectures (e.g., pruning, low-precision arithmetic, leveraging similarity), optimizing them (e.g., pipelining, parallelism, batching, scheduling), and various RNN accelerator architectures (e.g., those for n-dimensional LSTMs, multi-FPGA designs, and so on). This survey seeks to synergize the efforts of researchers in the areas of deep learning, computer architecture, and chip design.
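As one example of the model-simplification techniques covered, the sketch below shows generic magnitude-based weight pruning. The threshold rule and function name are illustrative assumptions, not any specific paper's method from the survey.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (a sketch)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Usage: prune a hypothetical recurrent weight matrix to ~90% sparsity.
W = np.random.default_rng(0).standard_normal((128, 128))
W_sparse = magnitude_prune(W, sparsity=0.9)
print(f"non-zeros remaining: {np.count_nonzero(W_sparse) / W.size:.2%}")
```

The resulting sparse matrix reduces both storage and the number of multiply-accumulate operations, which is what makes pruning attractive for the accelerators surveyed.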
The PDF is here; the paper has been accepted in JSA 2020.