June 8, 2024, 5:24 p.m. | /u/ReinforcedKnowledge

Machine Learning www.reddit.com


I was reading the paper [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) and found several puzzling elements that I'd like to discuss with you. I have read the paper again and again, and I'm still not sure whether my understanding is correct, so I'd be grateful to be corrected.

Here's the general idea for those who aren't familiar with the sparse transformer. Instead of using the multi-head approach that uses h heads in parallel, it uses p (just to …

