Aug. 30, 2022, 1:14 a.m. | Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu

cs.CV updates on arXiv.org

Image captioning models are usually trained on human-annotated
ground-truth captions, which leads them to generate accurate but generic
captions. In this paper, we focus on generating distinctive captions that can
distinguish the target image from other similar images. To evaluate the
distinctiveness of captions, we introduce a series of metrics that use the
large-scale vision-language pre-training model CLIP to quantify
distinctiveness. To further improve the distinctiveness of captioning models,
we propose a simple and effective training strategy that trains the …
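
Below is a minimal sketch of how a CLIP-based distinctiveness score might be computed, assuming the Hugging Face `transformers` CLIP implementation. The checkpoint name, the `distinctiveness` helper, and the target-minus-best-distractor scoring rule are illustrative assumptions, not the paper's exact metric definitions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP checkpoint (assumed; the paper may use a different one).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def distinctiveness(caption, target_image, similar_images):
    """Score how much better the caption matches the target image than
    the closest distractor image (higher = more distinctive)."""
    images = [target_image] + list(similar_images)
    inputs = processor(text=[caption], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the caption embedding and each image embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)          # one similarity per image
    # Margin between the target image and the hardest distractor.
    return (sims[0] - sims[1:].max()).item()

# Hypothetical usage with placeholder file names.
score = distinctiveness(
    "a brown dog catching a red frisbee mid-air",
    Image.open("target.jpg"),
    [Image.open(p) for p in ("similar1.jpg", "similar2.jpg")],
)
print(f"distinctiveness margin: {score:.4f}")
```

A generic caption that fits every image in the group scores near zero under this margin, while a caption tied to details unique to the target image scores higher, which is the intuition behind using CLIP to quantify distinctiveness.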

Tags: arxiv, captioning, clip, image, optimization
