May 8, 2024, 11:45 a.m. | /u/kiockete

Machine Learning www.reddit.com

According to [this post on X](https://x.com/PMinervini/status/1781080046972604739), Llama 3 used [intra-document causal masking](https://arxiv.org/abs/2402.13991) during pretraining to avoid [cross-contamination when using sequence packing](https://arxiv.org/abs/2107.02027):

[Causal Masking vs. Intra-Document Causal Masking](https://preview.redd.it/14e5ml4bs6zc1.jpg?width=680&format=pjpg&auto=webp&s=58e7e1764aeafe412237fb731488873aeb911e88)

This seems to improve the final performance of the model across a wide range of tasks, in contrast to simply putting a separator token (e.g. an end-of-text token) between documents and hoping that the model learns not to attend to unrelated documents during pre-training, as stated in the [GPT-3 paper](https://arxiv.org/abs/2005.14165):

>During training …
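For intuition, here is a minimal sketch of how an intra-document causal mask could be built for a packed sequence. This is not Llama 3's actual implementation; the function name `intra_document_causal_mask` and the `doc_ids` convention (one document index per token) are just illustrative assumptions. The idea is to intersect the usual lower-triangular causal mask with a "same document" mask, so a token can only attend to earlier tokens from its own document:

```python
import torch


def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask for a packed sequence (illustrative sketch).

    doc_ids: (seq_len,) tensor assigning each token to the document it came
             from, e.g. tensor([0, 0, 0, 1, 1, 2, 2, 2]).
    Returns: (seq_len, seq_len) mask where mask[i, j] is True iff token i may
             attend to token j, i.e. j <= i AND both tokens are in the same document.
    """
    seq_len = doc_ids.shape[0]
    # Standard causal (lower-triangular) constraint: attend only to the past.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Intra-document constraint: attend only within the same packed document.
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc


if __name__ == "__main__":
    # Three documents of lengths 3, 2, 3 packed into one sequence of 8 tokens.
    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
    mask = intra_document_causal_mask(doc_ids)
    print(mask.int())  # block-diagonal, lower-triangular within each block

    # Additive form for attention implementations that expect a float bias:
    attn_bias = torch.where(mask, 0.0, float("-inf"))
```

With plain causal masking the mask would be a full lower triangle, so tokens of document 2 could still attend to documents 0 and 1 that happen to share the packed sequence; the block-diagonal version above removes exactly that cross-document attention.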
