March 29, 2024, 4:46 a.m. | Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

cs.CV updates on arXiv.org

arXiv:2312.02051v2 Announce Type: replace
Abstract: This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to …
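The abstract names two architectural pieces: a timestamp-aware frame encoder that fuses each frame's visual features with its timestamp, and a sliding video Q-Former that turns an arbitrary number of frames into a token sequence whose length scales with video duration. Below is a minimal PyTorch sketch of those two ideas as described in the abstract; it is not the authors' implementation, and the module names, dimensions, window size, and the use of nn.TransformerDecoder as a stand-in for the Q-Former are illustrative assumptions.

```python
# Minimal sketch (assumptions, not TimeChat's actual code) of:
#  (1) binding frame features to their timestamps, and
#  (2) a sliding-window Q-Former that yields duration-dependent token counts.
import torch
import torch.nn as nn


class TimestampAwareFrameEncoder(nn.Module):
    """Fuses each frame's visual features with an embedding of its timestamp."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.time_proj = nn.Linear(1, dim)   # hypothetical: scalar seconds -> dim
        self.fuse = nn.Linear(2 * dim, dim)  # hypothetical fusion of visual + time features

    def forward(self, frame_feats: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim); timestamps: (num_frames,) in seconds
        t = self.time_proj(timestamps.unsqueeze(-1))            # (num_frames, dim)
        return self.fuse(torch.cat([frame_feats, t], dim=-1))   # (num_frames, dim)


class SlidingVideoQFormer(nn.Module):
    """Compresses each window of frames into a fixed set of query tokens,
    so the total number of video tokens grows with video duration."""

    def __init__(self, dim: int = 256, num_queries: int = 8, window: int = 16):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (num_frames, dim) -> (num_windows * num_queries, dim)
        outputs = []
        for start in range(0, frame_tokens.size(0), self.window):
            chunk = frame_tokens[start:start + self.window].unsqueeze(0)  # (1, w, dim)
            q = self.queries.unsqueeze(0)                                  # (1, num_queries, dim)
            outputs.append(self.qformer(q, chunk).squeeze(0))
        return torch.cat(outputs, dim=0)


# Toy usage: a 2-minute video sampled at 1 fps gives 120 frames.
frames = torch.randn(120, 256)
times = torch.arange(120, dtype=torch.float32)
video_tokens = SlidingVideoQFormer()(TimestampAwareFrameEncoder()(frames, times))
print(video_tokens.shape)  # 8 windows of 16 frames * 8 queries -> (64, 256)
```

The point of the sliding design is that a longer video yields more windows, hence more tokens, instead of forcing every video into a fixed-length representation.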

