Transformer Reading Notes
Questions
- How does the number of operations required to relate signals from two arbitrary input or output positions grow with the distance between those positions?
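
One way to explore the question above is to count how many layers it takes before two positions can interact. The sketch below is only an illustration under simplifying assumptions: the function names are hypothetical, the contiguous-convolution case assumes a stack of undilated kernels of width k (receptive field 1 + n(k - 1) after n layers), the dilated case assumes the dilation grows as k^i per layer (receptive field k^n), and self-attention is assumed to connect any pair of positions in a single layer.

```python
def contiguous_conv_layers(distance: int, kernel_size: int) -> int:
    """Layers of undilated convolutions needed to span `distance` positions.

    Each layer widens the receptive field by (kernel_size - 1), so the
    count grows linearly with the distance.
    """
    return -(-distance // (kernel_size - 1))  # ceiling division


def dilated_conv_layers(distance: int, kernel_size: int) -> int:
    """Layers of dilated convolutions (dilation kernel_size**i at layer i).

    The receptive field multiplies by kernel_size per layer, so the
    count grows logarithmically with the distance.
    """
    layers, receptive_field = 0, 1
    while receptive_field < distance + 1:
        receptive_field *= kernel_size
        layers += 1
    return layers


def self_attention_layers(distance: int) -> int:
    """Self-attention relates any two positions directly: a constant count."""
    return 1


if __name__ == "__main__":
    for d in (8, 100, 1000):
        print(d, contiguous_conv_layers(d, 3), dilated_conv_layers(d, 3),
              self_attention_layers(d))
```

Running it for a few distances shows the three growth patterns side by side: linear, logarithmic, and constant.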
Extended Learning
Attention mechanisms with a recurrent network.
Bahdanau, Dzmitry, et al. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
Using convolutional neural networks to compute hidden representations in parallel.
Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” International conference on machine learning. PMLR, 2017.
Why it is difficult to learn dependencies between distant positions.
Hochreiter, Sepp, et al. “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.” (2001).
Self-attention.
Cheng, Jianpeng, et al. “Long short-term memory-networks for machine reading.” arXiv preprint arXiv:1601.06733 (2016).