Aug. 1, 2022, 1:11 a.m. | Tejas Srinivasan, Xiang Ren, Jesse Thomason

cs.CL updates on arXiv.org

Aligning image and text encoders from scratch using contrastive learning
requires large amounts of paired image-text data. We alleviate this need by
aligning individually pre-trained language and vision representation models
using a much smaller amount of paired data, augmented with a curriculum
learning algorithm to learn fine-grained vision-language alignments. TOnICS
(Training with Ontology-Informed Contrastive Sampling) initially samples
minibatches whose image-text pairs contain a wide variety of objects to learn
object-level alignment, and progressively samples minibatches where all
image-text pairs contain …
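The abstract is truncated above and describes the curriculum only at a high level, so the following is a minimal Python sketch of ontology-informed minibatch sampling under stated assumptions: the function name sample_minibatch, the object_labels annotation, and the harder stage that restricts a minibatch to pairs sharing a single object are illustrative guesses, not the paper's exact method.

import random
from collections import defaultdict

def sample_minibatch(pairs, object_labels, batch_size, hard_stage):
    """Pick indices of image-text pairs for one contrastive minibatch.

    pairs:         list of (image, text) examples
    object_labels: object_labels[i] is a set of object classes present in pair i
    hard_stage:    False -> cover a wide variety of objects (object-level alignment)
                   True  -> all pairs share one object, so in-batch negatives are
                            hard and push toward finer-grained alignment (assumption)
    """
    if not hard_stage:
        # Easy stage: greedily add pairs whose objects have not been seen yet,
        # so in-batch negatives differ at the object level.
        seen, batch = set(), []
        order = random.sample(range(len(pairs)), len(pairs))
        for i in order:
            if len(batch) == batch_size:
                break
            if object_labels[i].isdisjoint(seen):
                batch.append(i)
                seen |= object_labels[i]
        # Top up with arbitrary pairs if strict diversity cannot fill the batch.
        batch += [i for i in order if i not in batch][: batch_size - len(batch)]
        return batch

    # Hard stage (assumed): sample every pair from a single object class, so the
    # model must rely on cues beyond the shared object to match images and text.
    by_object = defaultdict(list)
    for i, objs in enumerate(object_labels):
        for obj in objs:
            by_object[obj].append(i)
    eligible = [o for o, idxs in by_object.items() if len(idxs) >= batch_size]
    anchor = random.choice(eligible or list(by_object))
    idxs = by_object[anchor]
    return random.sample(idxs, min(batch_size, len(idxs)))

In training, a curriculum schedule (not specified in the truncated abstract) would switch or interpolate between the two stages as alignment improves.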

alignment, arxiv, curriculum, curriculum learning, cv, data, language, learning, vision
