[R] How does GPT-2/GPT-3 differ from the Transformer in Vaswani et al.?
submitted by /u/white0clouds
I'm quite new to NLP/transformers and work mostly in computer vision, so here's a basic question.
How does the transformer architecture in GPT-2/GPT-3 differ from the original Transformer in "Attention Is All You Need" (Vaswani et al., NeurIPS 2017)? What additional techniques are incorporated there? To make the question concrete, my understanding of the original block is sketched below.
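Here is a minimal PyTorch sketch of one post-LayerNorm block as I read Vaswani et al.; the class name, dimensions, and the choice of an encoder-style block (no causal mask) are just for illustration, not from any released implementation. What I'd like to know is what GPT-2/GPT-3 change relative to this pattern.

```python
import torch
import torch.nn as nn

class PostLNEncoderBlock(nn.Module):
    """One block as I understand it from Vaswani et al. (2017):
    each sub-layer is followed by a residual add, then LayerNorm
    (post-LN). Names and sizes are illustrative only.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # the original paper uses ReLU in the feed-forward sub-layer
            nn.Linear(d_ff, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-LN: LayerNorm is applied *after* each residual addition.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.ffn(x))
        return x
```

As a sanity check, `PostLNEncoderBlock()(torch.randn(1, 10, 512))` returns a tensor of the same shape.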