June 7, 2022, 5:24 p.m. | Google AI (noreply@blogger.com)

Google AI Blog ai.googleblog.com

Posted by Paul Hongsuck Seo and Arsha Nagrani, Research Scientists, Google Research, Perception Team

Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams.

Unlike video understanding tasks (e.g., video classification and retrieval) where the key challenge lies in processing and understanding multimodal input …

captioning cvpr multimodal multimodal learning pre-training training video

Data Scientist (m/f/x/d)

@ Symanto Research GmbH & Co. KG | Spain, Germany

Enterprise Data Architect

@ Pathward | Remote

Diagnostic Imaging Information Systems (DIIS) Technologist

@ Nova Scotia Health Authority | Halifax, NS, CA, B3K 6R8

Intern Data Scientist - Residual Value Risk Management (f/m/d)

@ BMW Group | Munich, DE

Analytics Engineering Manager

@ PlayStation Global | United Kingdom, London

Junior Insight Analyst (PR&Comms)

@ Signal AI | Lisbon, Lisbon, Portugal