Feb. 13, 2024, 5:44 a.m. | Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré

cs.LG updates on arXiv.org

Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to fine-tune this …


