March 5, 2024, 2:51 p.m. | Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji

cs.CL updates on arXiv.org

arXiv:2403.00791v1 Announce Type: new
Abstract: Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail …
