Web: http://arxiv.org/abs/2201.12086

Jan. 31, 2022, 2:10 a.m. | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

cs.CV updates on arXiv.org

Vision-Language Pre-training (VLP) has advanced the performance on many
vision-language tasks. However, most existing pre-trained models only excel in
either understanding-based tasks or generation-based tasks. Furthermore,
performance improvement has been largely achieved by scaling up the dataset
with noisy image-text pairs collected from the web, which is a suboptimal
source of supervision. In this paper, we propose BLIP, a new VLP framework
which transfers flexibly to both vision-language understanding and generation
tasks. BLIP effectively utilizes the noisy web data by …
