Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training
April 24, 2024, 4:47 a.m. | Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang
cs.CL updates on arXiv.org arxiv.org
Abstract: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline …
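To make the idea of "fine-tuning with intermediate steps (i.e., rationales)" concrete, here is a minimal sketch of how a rationale-augmented multimodal training record might be assembled. The field names and helper function are illustrative assumptions, not the paper's actual data format or pipeline.

```python
# Hypothetical sketch of rationale-augmented fine-tuning data:
# the model is trained to emit intermediate reasoning before the
# final answer. Field names are assumptions for illustration only.

def build_example(image_path: str, question: str,
                  rationale: str, answer: str) -> dict:
    """Pack one multimodal training record whose target contains
    the step-by-step rationale followed by the final answer."""
    return {
        "image": image_path,
        "prompt": f"Question: {question}\nLet's reason step by step.",
        "target": f"{rationale}\nFinal answer: {answer}",
    }

example = build_example(
    "figures/triangle.png",
    "What is the area of the triangle shown?",
    "The base is 6 and the height is 4, so area = 1/2 * 6 * 4 = 12.",
    "12",
)
print(example["target"])
```

The abstract's point is that supervision of this shape improves reasoning but not visual comprehension of the figure itself, which motivates the paper's additional visual-centric training step.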