May 8, 2024, 11:45 a.m. | /u/kiockete

Machine Learning

According to a post on X, LLaMa 3 used intra-document causal masking during pretraining to avoid cross-contamination when using sequence packing.

Figure: Causal Masking vs. Intra-Document Causal Masking

This seems to improve the final performance of the model across a wide range of tasks, in contrast to simply putting a separator token (e.g., an end-of-text token) between documents and hoping the model learns not to attend to unrelated documents during pretraining, as described in the GPT-3 paper:

>During training …
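To make the difference concrete, here is a minimal sketch of how an intra-document causal mask could be constructed for a packed sequence. It assumes each token is labeled with the ID of the document it came from; the function name and shapes are illustrative, not taken from the LLaMa 3 implementation. A plain causal mask keeps only the lower-triangular part; the intra-document variant additionally zeroes out positions belonging to other documents.

```python
import numpy as np

def intra_document_causal_mask(doc_ids):
    """Boolean attention mask for one packed sequence.

    mask[i, j] is True when token i may attend to token j:
    j must not be in the future (causal constraint) and must
    belong to the same document (intra-document constraint).
    """
    doc_ids = np.asarray(doc_ids)
    n = doc_ids.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))       # j <= i
    same_doc = doc_ids[:, None] == doc_ids[None, :]     # same document
    return causal & same_doc

# Two documents of three tokens each, packed into one sequence.
doc_ids = [0, 0, 0, 1, 1, 1]
mask = intra_document_causal_mask(doc_ids)
# Token 3 (first token of document 1) cannot attend to token 2
# (last token of document 0), even though token 2 is in its past.
```

With a plain causal mask, `mask[3, 2]` would be True and document 1 would "see" document 0; the block-diagonal structure above is what prevents that cross-contamination.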

