Nov. 15, 2023, 4:39 p.m. | /u/TheRealBracketMaster

Machine Learning www.reddit.com

The Mistral 7B paper claims a theoretical attention span of 131K tokens for Mistral 7B, achieved by propagating information up through the layers via sliding-window attention. I'm trying to figure out how this is achieved in practice. The trick seems to be on [line 128](https://github.com/mistralai/mistral-src/blob/main/one_file_ref.py#L129), with the branch `if positions.shape[0] > 1:`, which would typically be taken when the model is first called. From my understanding, taking this branch would compute k/v values for all provided tokens, which could then propagate information for an …
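To make the prefill/decode distinction concrete, here is a minimal sketch of how a rolling-buffer k/v cache with a sliding window behaves, based on my reading of the reference code. This is not the actual mistral-src implementation; the names `update_cache`, `W`, `N_LAYERS`, and `HEAD_DIM` are my own, and the toy dimensions are just for illustration.

```python
import numpy as np

# Sketch (not the real mistral-src code): a rolling-buffer k/v cache with
# sliding-window size W. The prefill call passes positions for every prompt
# token (positions.shape[0] > 1) and fills the cache for all of them at once;
# later decode calls pass a single position and overwrite one slot.
W = 4096          # sliding-window size from the Mistral 7B paper
N_LAYERS = 32     # number of transformer layers in Mistral 7B
HEAD_DIM = 8      # toy head dimension for illustration

cache_k = np.zeros((W, HEAD_DIM))
cache_v = np.zeros((W, HEAD_DIM))

def update_cache(positions: np.ndarray, k: np.ndarray, v: np.ndarray) -> None:
    """Write new k/v rows into the rolling buffer at position % W."""
    if positions.shape[0] > 1:
        # Prefill branch: k/v for every prompt token are computed and cached.
        slots = positions % W
        cache_k[slots] = k
        cache_v[slots] = v
    else:
        # Incremental decode: only the newest token's k/v is written.
        slot = positions[0] % W
        cache_k[slot] = k[0]
        cache_v[slot] = v[0]

# Prefill with a 6-token prompt, then decode one more token.
update_cache(np.arange(6), np.random.randn(6, HEAD_DIM), np.random.randn(6, HEAD_DIM))
update_cache(np.array([6]), np.random.randn(1, HEAD_DIM), np.random.randn(1, HEAD_DIM))

# Each layer attends up to W tokens back into the previous layer's outputs,
# so after N_LAYERS layers the theoretical attention span is W * N_LAYERS.
print(W * N_LAYERS)  # 131072, i.e. the ~131K figure from the paper
```

With a 4096-token window and 32 layers, 4096 × 32 = 131072, which is where the ~131K theoretical attention span comes from: the k/v values cached during the prefill branch let each layer pull in information from a window of the layer below, compounding across depth.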

