Feb. 26, 2024, 9:44 a.m. | /u/Basic_AI

Computer Vision www.reddit.com

Models like CLIP wowed us by responding seamlessly to text prompts without any training samples. But CLIP's weak spatial representations make dense prediction tasks like image segmentation tough without extensive fine-tuning, which can dampen that zero-shot flair. Self-supervised models like DINO, by contrast, learn robust spatial representations without relying on labels.

Bringing these strengths together, the new CLIP-DINOiser framework fuses DINO’s self-supervised image features with CLIP’s zero-shot classifier to pull off zero-shot segmentation that can hold its own against fully-supervised approaches. …
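The core recipe can be illustrated in a few lines: compare dense, DINO-style patch features against CLIP-style per-class text embeddings and label each patch with its best match. This is a minimal numpy sketch of that idea under assumed shapes, not the actual CLIP-DINOiser implementation; the function name and toy data are illustrative.

```python
import numpy as np

def zero_shot_segment(patch_feats, text_embs):
    """Assign each image patch to the class whose text embedding is
    most similar (cosine similarity), giving a coarse zero-shot
    segmentation map.

    patch_feats: (H, W, D) dense image features (DINO-style)
    text_embs:   (C, D) per-class text embeddings (CLIP-style)
    returns:     (H, W) class index per patch
    """
    # L2-normalize both sides so dot products equal cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = p @ t.T                 # (H, W, C) similarity per patch/class
    return sims.argmax(axis=-1)    # pick the best-matching class

# Toy example: a 4x4 patch grid with 8-dim features and 3 classes.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
# Every patch copies class 1's embedding, so all patches map to class 1.
patch_feats = np.tile(text_embs[1], (4, 4, 1))
seg = zero_shot_segment(patch_feats, text_embs)
```

The real framework adds a DINO-guided refinement of CLIP's dense features rather than this direct argmax, but the similarity-based labeling above is the zero-shot mechanism that makes text-prompted segmentation possible without training samples.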

