April 16, 2024, 4:43 a.m. | Otto Brookes, Majid Mirmehdi, Hjalmar Kühl, Tilo Burghardt

cs.LG updates on arXiv.org arxiv.org

arXiv:2404.08937v1 Announce Type: cross
Abstract: We show that chimpanzee behaviour understanding from camera traps can be enhanced by providing visual architectures with access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model which employs multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than using random or name-based …

