Jan. 5, 2022, 2:10 a.m. | Tonmoay Deb, Akib Sadmanee, Kishor Kumar Bhaumik, Amin Ahsan Ali, M Ashraful Amin, A K M Mahbubur Rahman

cs.CL updates on arXiv.org

While describing spatio-temporal events in natural language, video captioning
models mostly rely on the encoder's latent visual representation. Recent
progress in encoder-decoder models attends to encoder features mainly through
linear interactions with the decoder. However, the growing complexity of
visual data calls for more explicit feature interaction to capture
fine-grained information, which is currently absent in the video captioning
domain. Moreover, feature aggregation methods have been used to unveil richer
visual representations, either by concatenation or through a linear layer. Though …
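The contrast the abstract draws — linear feature aggregation (concatenation followed by a linear layer) versus more explicit feature interaction (e.g., the decoder attending over encoder features) — can be sketched briefly. The following is a minimal PyTorch sketch, not the paper's actual architecture; all class names, tensor shapes, and the choice of multi-head cross-attention as the "explicit interaction" are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LinearAggregation(nn.Module):
    """Baseline aggregation the abstract mentions: concatenate two
    encoder feature streams, then project with a linear layer."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, time, dim_a); feats_b: (batch, time, dim_b)
        return self.proj(torch.cat([feats_a, feats_b], dim=-1))


class CrossAttention(nn.Module):
    """One form of more explicit feature interaction: decoder states
    attend over encoder features instead of receiving a single fused
    vector. Hypothetical stand-in, not the paper's mechanism."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, decoder_states: torch.Tensor, encoder_feats: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, tgt_len, dim); encoder_feats: (batch, src_len, dim)
        out, _ = self.attn(decoder_states, encoder_feats, encoder_feats)
        return out


if __name__ == "__main__":
    b, t, d = 2, 8, 256
    fused = LinearAggregation(d, d, d)(torch.randn(b, t, d), torch.randn(b, t, d))
    ctx = CrossAttention(d)(torch.randn(b, 5, d), fused)
    print(fused.shape, ctx.shape)  # torch.Size([2, 8, 256]) torch.Size([2, 5, 256])
```

The design difference is that the linear route fixes the interaction at aggregation time, while attention lets each decoding step re-weight the visual features — the kind of fine-grained interaction the abstract argues is missing.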

arxiv attention captioning cv local attention networks video
