April 2, 2024, 7:47 p.m. | Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He

cs.CV updates on arXiv.org

arXiv:2404.00906v1
Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advances, existing methods struggle to generate scene graphs that contain novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLMs) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with a VLM and then …
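
To make the idea of a "scene graph sequence" concrete, the sketch below shows one way a graph of (subject, predicate, object) triplets could be flattened into text and parsed back. The abstract is truncated and does not give the sequence format, so the separators and the function names `graph_to_sequence` and `sequence_to_graph` are assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple

# A relation triplet: (subject, predicate, object).
Triplet = Tuple[str, str, str]

# Hypothetical separators, chosen for illustration only;
# the paper's actual sequence format is not given in the abstract.
TRIPLET_SEP = " ; "
FIELD_SEP = " | "


def graph_to_sequence(triplets: List[Triplet]) -> str:
    """Flatten a list of triplets into one text sequence."""
    return TRIPLET_SEP.join(FIELD_SEP.join(t) for t in triplets)


def sequence_to_graph(sequence: str) -> List[Triplet]:
    """Parse a text sequence back into triplets, skipping malformed chunks."""
    triplets: List[Triplet] = []
    for chunk in sequence.split(";"):
        fields = [f.strip() for f in chunk.split("|")]
        # Keep only chunks with exactly three non-empty fields.
        if len(fields) == 3 and all(fields):
            triplets.append((fields[0], fields[1], fields[2]))
    return triplets


if __name__ == "__main__":
    graph = [("man", "riding", "horse"), ("horse", "standing on", "grass")]
    seq = graph_to_sequence(graph)
    print(seq)
    print(sequence_to_graph(seq))
```

Because the predicate is free text rather than an index into a fixed label set, a text generator is not restricted to relation names seen during training, which is the property an open-vocabulary setting relies on. The parser drops malformed chunks because generated text is not guaranteed to be well-formed.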


