all AI news
Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training
April 24, 2024, 4:47 a.m. | Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang
cs.CL updates on arXiv.org arxiv.org
Abstract: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline …
abstract arxiv cs.cl excel fine-tuning gemini gpt gpt-4v improving inputs intermediate language language models large language large language models mathematical reasoning mllms multimodal proprietary proprietary models reason reasoning skills struggle tasks textual through training type visual
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Founding AI Engineer, Agents
@ Occam AI | New York
AI Engineer Intern, Agents
@ Occam AI | US
AI Research Scientist
@ Vara | Berlin, Germany and Remote
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne