Variational Stacked Local Attention Networks for Diverse Video Captioning. (arXiv:2201.00985v1 [cs.CV])
cs.CL updates on arXiv.org
When describing spatio-temporal events in natural language, video captioning
models rely mostly on the encoder's latent visual representation. Recent
progress on encoder-decoder models attends to encoder features mainly through
linear interaction with the decoder. However, growing model complexity for
visual data calls for more explicit feature interaction to capture fine-grained
information, which is currently absent in the video captioning domain. Moreover,
feature aggregation methods have been used to unveil richer visual
representations, either by concatenation or through a linear layer. Though …
arxiv attention captioning cv local attention networks video
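The abstract contrasts aggregating visual feature streams by plain concatenation versus projecting them through a linear layer. Below is a minimal PyTorch sketch of that generic idea, not the paper's implementation; the module name `FeatureAggregator`, the two-stream setup, and all tensor shapes are illustrative assumptions.

```python
# Sketch of the two feature-aggregation routes mentioned in the abstract:
# raw concatenation of encoder features vs. a learned linear projection.
# This is NOT the paper's model; names and shapes are assumptions.
import torch
import torch.nn as nn


class FeatureAggregator(nn.Module):
    """Combine two visual feature streams (e.g. appearance and motion)."""

    def __init__(self, dim: int, mode: str = "linear"):
        super().__init__()
        self.mode = mode
        # Concatenation doubles the feature size, so the linear route
        # maps it back to the original dimension.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([feat_a, feat_b], dim=-1)  # (batch, frames, 2*dim)
        if self.mode == "concat":
            return fused                              # plain concatenation
        return self.proj(fused)                       # linear aggregation


if __name__ == "__main__":
    appearance = torch.randn(2, 16, 512)  # (batch, frames, dim)
    motion = torch.randn(2, 16, 512)
    agg = FeatureAggregator(dim=512, mode="linear")
    print(agg(appearance, motion).shape)  # torch.Size([2, 16, 512])
```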