Attention is an integral part of state-of-the-art architectures for NLP. At a high level, an attention mechanism enables a neural network to “focus” on the relevant parts of its input more than on the irrelevant parts. When an attention mechanism can attend to different aspects of the input simultaneously, it is called “multi-head attention”. Multi-head attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. Multi-head attention has been shown to make more efficient use of the model’s capacity, but its importance for translation and the roles of individual “heads” remain unclear.
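To make “focusing on different aspects simultaneously” concrete, here is a minimal NumPy sketch of multi-head scaled dot-product attention: each head computes its own attention distribution over the input, and the per-head outputs are concatenated. The dimensions and random projections are illustrative stand-ins for learned weights, not details from the talk.

```python
# Minimal multi-head scaled dot-product attention sketch (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """x: (seq_len, d_model). Random matrices stand in for learned projections."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        # Each head gets its own query/key/value projections,
        # so each head can attend to a different aspect of the input.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len)
        outputs.append(weights @ v)                   # (seq_len, d_head)
    # Concatenate per-head outputs back to the model dimension.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
out = multi_head_attention(rng.standard_normal((5, 16)), num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```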
In this talk, I will briefly describe standard attention in sequence-to-sequence models, as well as the Transformer architecture with multi-head self-attention. We will then evaluate the contribution of individual attention heads to the overall performance of the Transformer and analyze the roles they play. I will show that the most important and confident heads play consistent and often linguistically interpretable roles. When pruning heads with a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are the last to be pruned. This pruning method removes the vast majority of heads without seriously affecting performance.
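The pruning idea above can be sketched as follows: each head is multiplied by a stochastic gate drawn from a hard-concrete distribution, which can be exactly 0 or 1 with nonzero probability, and a differentiable proxy for the L0 norm penalizes the expected number of open gates. The sampling below follows the standard hard-concrete formulation (Louizos et al.); the hyperparameter values and the three-head example are illustrative, not the talk’s exact settings.

```python
# Toy sketch of hard-concrete gates for differentiable L0-style head pruning.
import numpy as np

def hard_concrete_sample(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample one gate per head in [0, 1]; exactly 0 (head pruned) with nonzero prob."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    # Binary-concrete sample, reparameterized so gradients flow into log_alpha.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_stretched = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return np.clip(s_stretched, 0.0, 1.0)      # rectify into [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable proxy for P(gate != 0), i.e. the expected number of kept heads."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

rng = np.random.default_rng(0)
log_alpha = np.array([-5.0, 0.0, 5.0])  # one learnable parameter per head
gates = hard_concrete_sample(log_alpha, rng)
keep_prob = expected_l0(log_alpha)
print(gates, keep_prob)
```

During training, the sum of `expected_l0` over heads is added to the loss with a coefficient controlling the sparsity/quality trade-off; heads whose gates collapse to 0 are removed.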
The talk is based on my recent work, which will appear at ACL 2019 (https://arxiv.org/abs/1905.09418).