Web: http://arxiv.org/abs/2205.03039

May 9, 2022, 1:10 a.m. | Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang

cs.CV updates on arXiv.org

Video captioning aims to understand the spatio-temporal semantics of a video and generate descriptive sentences. The de-facto approach to this task trains a text generator on offline-extracted motion or appearance features from pre-trained vision models. However, these methods may suffer from the so-called "couple" drawbacks in both video spatio-temporal representation and sentence generation. For the former, "couple" means learning the spatio-temporal representation within a single model (3D CNN), which leads to the problems of "disconnection in task/pre-train domain" and "hard for
end-to-end …
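For context, here is a minimal sketch of that de-facto two-stage pipeline, written in PyTorch with torchvision's r3d_18 as an illustrative stand-in for the pre-trained backbone; the module and variable names are assumptions for illustration, not the paper's code. Features are extracted offline by a frozen vision model, and a separate Transformer text generator is trained on them, so the two stages cannot be optimized end-to-end:

```python
# Sketch of the criticized two-stage ("coupled" representation, decoupled
# training) pipeline; all names are illustrative, not from the paper.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Stage 1: offline feature extraction with a frozen pre-trained 3D CNN
# (r3d_18 stands in for the appearance/motion backbone).
backbone = r3d_18(weights=R3D_18_Weights.DEFAULT)
backbone.fc = nn.Identity()  # expose the pooled 512-d clip features
backbone.eval()

@torch.no_grad()
def extract_features(clip):  # clip: (B, 3, T, H, W)
    # In practice these features are cached to disk once and reused.
    return backbone(clip)    # -> (B, 512)

# Stage 2: a separate text generator trained on the cached features.
# Gradients never reach the vision backbone, which is the "couple"
# drawback: representation and generation cannot be tuned end-to-end.
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, video_feat):
        memory = video_feat.unsqueeze(1)  # (B, 1, d_model) cross-attn memory
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        x = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.lm_head(x)            # next-token logits

# Example: one forward pass on pre-extracted features.
feats = extract_features(torch.randn(2, 3, 16, 112, 112))
model = CaptionDecoder(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 12)), feats)
```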

arxiv captioning cv transformer video
