June 8, 2024, 5:24 p.m. | /u/ReinforcedKnowledge

Machine Learning www.reddit.com


I was reading the paper [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) and I found many elements I consider odd, and I'd like to discuss them with you. I have read the paper again and again, and I'm still unsure whether my interpretations are correct, so I'd be grateful to be corrected.

Just to give a global idea for those who aren't familiar with the sparse transformer: instead of using the multi-head approach that uses h heads in parallel, it uses p (just to …
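For readers who want something concrete, here is a minimal sketch (not from the post) of the two factorized "strided" attention patterns the paper describes, where one head attends to the previous `stride` positions and the other to every `stride`-th position, both causally. The function name `strided_masks` is my own choice for illustration.

```python
import numpy as np

def strided_masks(n, stride):
    """Build the two factorized (strided) causal attention masks
    sketched in the Sparse Transformer paper:
      - `local`:   position i attends to the previous `stride` positions
      - `strided`: position i attends to every `stride`-th earlier position
    Returns two (n, n) boolean arrays; True means "may attend".
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                              # never attend to the future
    local = causal & (i - j < stride)            # A^(1): a sliding local window
    strided = causal & ((i - j) % stride == 0)   # A^(2): a strided "column" pattern
    return local, strided

local, strided = strided_masks(8, 4)
```

Composing these two patterns lets information flow between any pair of positions in two steps, which is the intuition behind replacing dense attention with p sparse heads.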
