Aug. 1, 2023, 1:02 a.m. | /u/Yip37

Machine Learning www.reddit.com

I was wondering if there's already something like CLIP (the model that matches an image against text descriptions), but for videos. So you show it a video of, say, a dog jumping and grabbing a tennis ball, and it outputs "dog grabbing a tennis ball", something like that.

My first thought was to run object detection and feed the interactions between the detected objects (tennis ball, dog) to a model, with "dog grabbing tennis ball" as the target. My ultimate goal being real-time …
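
One common baseline for the "CLIP but for videos" idea, sketched here under loud assumptions: embed a few sampled frames, average the embeddings, and pick the candidate caption whose text embedding is closest. Note that CLIP scores image-text similarity rather than generating text, so this selects from a fixed list of captions. The vectors below are hand-made toy values standing in for real encoder outputs, and `mean_pool`, `cosine`, and `best_caption` are hypothetical helper names, not part of any library.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(frame_embeddings, caption_embeddings):
    """Return the caption whose embedding is closest to the pooled video embedding."""
    video = mean_pool(frame_embeddings)
    return max(caption_embeddings, key=lambda c: cosine(video, caption_embeddings[c]))

# Toy stand-ins (assumption): in practice these would come from an image
# encoder applied to sampled frames and a text encoder applied to captions.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.3, 0.0]]
captions = {
    "dog grabbing a tennis ball": [1.0, 0.2, 0.0],
    "cat sleeping on a sofa": [0.0, 0.1, 1.0],
}

result = best_caption(frames, captions)
print(result)
```

Averaging frames throws away temporal order, so it cannot tell "grabbing" from "dropping"; it is only a starting point before moving to a model that looks at the frame sequence.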

