May 8, 2024, 11:45 a.m. | /u/kiockete

Machine Learning

According to a post on X, LLaMa 3 used intra-document causal masking during pretraining to avoid cross-contamination when using sequence packing.

Figure: Causal Masking vs. Intra-Document Causal Masking

This seems to improve the final performance of the model across a wide range of tasks, in contrast to simply putting a separator token (e.g., an end-of-text token) between documents and hoping the model learns not to attend to unrelated documents during pretraining, as described in the GPT-3 paper:

>During training …
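To make the difference concrete, here is a minimal sketch of how an intra-document causal mask could be constructed for a packed sequence. It assumes each token is labeled with the ID of the document it came from; the function name and shapes are illustrative, not taken from the LLaMa 3 implementation. A plain causal mask keeps only the lower-triangular part; the intra-document variant additionally zeroes out positions belonging to other documents.

```python
import numpy as np

def intra_document_causal_mask(doc_ids):
    """Boolean attention mask for one packed sequence.

    mask[i, j] is True when token i may attend to token j:
    j must not be in the future (causal constraint) and must
    belong to the same document (intra-document constraint).
    """
    doc_ids = np.asarray(doc_ids)
    n = doc_ids.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))       # j <= i
    same_doc = doc_ids[:, None] == doc_ids[None, :]     # same document
    return causal & same_doc

# Two documents of three tokens each, packed into one sequence.
doc_ids = [0, 0, 0, 1, 1, 1]
mask = intra_document_causal_mask(doc_ids)
# Token 3 (first token of document 1) cannot attend to token 2
# (last token of document 0), even though token 2 is in its past.
```

With a plain causal mask, `mask[3, 2]` would be True and document 1 would "see" document 0; the block-diagonal structure above is what prevents that cross-contamination.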

