Transforming Visual Scene Graphs to Image Captions. (arXiv:2305.02177v3 [cs.CV] UPDATED)
cs.CV updates on arXiv.org
We propose to Transform Scene Graphs (TSG) into more descriptive captions. In
TSG, we apply multi-head attention (MHA) to design the Graph Neural Network
(GNN) that embeds scene graphs. After embedding, the different graph embeddings
carry distinct knowledge for generating words with different parts of speech,
e.g., object and attribute embeddings are suited to generating nouns and
adjectives, respectively. Motivated by this, we design a Mixture-of-Experts
(MoE)-based decoder, where each expert is built on MHA, that discriminates
among the graph embeddings to generate different kinds of words. …
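
The decoder idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes one expert per embedding type (object, attribute, relation), replaces MHA with single-head scaled dot-product attention without learned projections, and uses a soft gate to mix expert outputs. The names `attend`, `moe_decode_step`, and `gate_vectors`, and all numbers, are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, nodes):
    # Single-head scaled dot-product attention of one query over node
    # embeddings (a simplification of MHA; no learned projections).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, n) / scale for n in nodes])
    return [sum(w * n[i] for w, n in zip(weights, nodes))
            for i in range(len(query))]

def moe_decode_step(query, graph_embeddings, gate_vectors):
    # One expert per embedding type; a soft gate (assumed here, not taken
    # from the paper) decides how much each expert contributes.
    names = sorted(graph_embeddings)
    expert_outputs = [attend(query, graph_embeddings[n]) for n in names]
    gate = softmax([dot(query, gate_vectors[n]) for n in names])
    mixed = [sum(g * out[i] for g, out in zip(gate, expert_outputs))
             for i in range(len(query))]
    return mixed, dict(zip(names, gate))

# Toy 2-d embeddings; values are made up for illustration.
embeddings = {
    "object": [[1.0, 0.0], [0.8, 0.2]],
    "attribute": [[0.0, 1.0]],
    "relation": [[0.5, 0.5]],
}
gates = {"object": [2.0, 0.0], "attribute": [0.0, 2.0], "relation": [0.0, 0.0]}

# A decoder state that "looks like" a noun query should favor the object expert.
context, weights = moe_decode_step([1.0, 0.0], embeddings, gates)
```

In this toy setup the gate gives the object expert the largest weight, which mirrors the abstract's intuition that object embeddings help most when generating nouns. A real model would learn the gate and use full multi-head attention with trained projections.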