April 16, 2024, 4:43 a.m. | Otto Brookes, Majid Mirmehdi, Hjalmar Kühl, Tilo Burghardt

cs.LG updates on arXiv.org

arXiv:2404.08937v1 Announce Type: cross
Abstract: We show that chimpanzee behaviour understanding from camera traps can be enhanced by providing visual architectures with access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model which employs multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than using random or name-based …

