Feb. 13, 2024, 5:44 a.m. | Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré

cs.LG updates on arXiv.org arxiv.org

Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to fine-tune this …

