STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization. (arXiv:2306.10608v2 [cs.CV] UPDATED)
Source: cs.CV updates on arXiv.org
This report introduces our novel method named STHG for the Audio-Visual
Diarization task of the Ego4D Challenge 2023. Our key innovation is that we
model all the speakers in a video using a single, unified heterogeneous graph
learning framework. Unlike previous approaches that require a separate
component solely for the camera wearer, STHG can jointly detect the speech
activities of all people including the camera wearer. Our final method obtains
61.1% DER on the test set of Ego4D, which significantly …
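
For readers unfamiliar with the metric quoted above: DER (diarization error rate) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time, so lower is better. The snippet below is a minimal sketch of that standard definition only; the function name and the durations are hypothetical and are not taken from the report or from the Ego4D evaluation code.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of the total reference speech time.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, chosen only for illustration.
der = diarization_error_rate(
    missed=12.0, false_alarm=8.0, confusion=5.0, total_speech=100.0
)
print(f"DER = {der:.1%}")  # prints "DER = 25.0%"
```

Because false alarms are counted against reference speech time, DER can exceed 100% when a system produces a large amount of spurious speech.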