Multimodal video captioning systems use video frames and speech to generate natural language descriptions of videos. Such systems are stepping stones toward the long-term objective of developing multimodal conversational systems that communicate effortlessly with users while perceiving their environments through multimodal input streams. In contrast to video understanding tasks, where the primary challenge lies in […]

