March 14, 2024, 4:46 a.m. | Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexand

cs.CV updates on arXiv.org

arXiv:2310.05737v2 Announce Type: replace
Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using …
