June 2, 2022, 1:12 a.m. | Ning Han, Jingjing Chen, Chuhao Shi, Yawen Zeng, Guangyi Xiao, Hao Chen

cs.CV updates on arXiv.org

Text-video retrieval, which aims to understand the correspondence
between language and vision, has gained increasing attention in recent years.
Previous studies either adopt off-the-shelf 2D/3D CNNs and then use average/max
pooling to directly capture spatial features with aggregated temporal
information as global video embeddings, or introduce graph-based models and
expert knowledge to learn local spatial-temporal relations. However,
existing methods have two limitations: 1) the global video representations
learn video temporal information through simple average/max pooling and …
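
As a rough illustration of the average/max pooling baseline mentioned above, the sketch below collapses per-frame feature vectors into a single global video embedding. It is not the authors' code: the function name pool_frames, the toy feature values, and the 2-dimensional features are all assumptions made for the example. It also shows one well-known property of such pooling, namely that reordering the frames leaves the embedding unchanged.

```python
from typing import List


def pool_frames(frame_features: List[List[float]], mode: str = "mean") -> List[float]:
    """Collapse per-frame feature vectors into one global video embedding.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension across all frames.
    columns = list(zip(*frame_features))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


# Assumed toy clip: 3 frames, each with a 2-dimensional feature vector.
frames = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]
# The same frames played in reverse order.
reordered = list(reversed(frames))

mean_embedding = pool_frames(frames, "mean")
max_embedding = pool_frames(frames, "max")
# Pooling ignores frame order, so both orderings give the same embedding.
same_after_reorder = pool_frames(reordered, "mean") == mean_embedding
```

In practice the per-frame features would come from a 2D/3D CNN backbone rather than hand-written lists; plain Python lists are used here only to keep the sketch dependency-free.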
