April 6, 2022, 6:55 a.m. | /u/mavericknathan1

Computer Vision www.reddit.com

I am working on an object detection model based on ViT, and I want to add a detection head to my decoder so that I can get bounding boxes. The output feature tensor of my decoder has shape (12, 96, 224, 224): batch size = 12, features per pixel = 96, and spatial size = (224, 224), the same as the image.

How do I pass this feature map, together with the annotations, through a CNN to obtain multiple bounding boxes?

Edit: …
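
One possible way to approach this, as a minimal sketch rather than a definitive answer: treat the (B, 96, 224, 224) tensor as a dense feature map and attach an anchor-free, FCOS-style convolutional head that predicts class scores and four box distances (left, top, right, bottom) at every location. Everything named here is an assumption for illustration and not from the post: the `DenseDetectionHead` and `decode_boxes` names, `num_classes=3`, the stride of 8, and the hidden width of 128. The annotations are not fed through the CNN; they would only enter through the training loss (for example, a classification loss plus a box regression loss on locations that fall inside a ground-truth box), which is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseDetectionHead(nn.Module):
    """Hypothetical anchor-free head: per-location class logits and box distances."""

    def __init__(self, in_channels=96, num_classes=3, stride=8, hidden=128):
        super().__init__()
        self.stride = stride
        # A strided conv shrinks 224x224 to 28x28 (for stride 8), so each
        # output location is responsible for one 8x8 patch of the image.
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=stride, stride=stride),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.cls_head = nn.Conv2d(hidden, num_classes, kernel_size=3, padding=1)
        self.box_head = nn.Conv2d(hidden, 4, kernel_size=3, padding=1)

    def forward(self, feats):
        x = self.trunk(feats)
        cls_logits = self.cls_head(x)  # (B, num_classes, H/stride, W/stride)
        # softplus keeps the (left, top, right, bottom) distances non-negative;
        # multiplying by the stride expresses them in input pixels.
        ltrb = F.softplus(self.box_head(x)) * self.stride  # (B, 4, H/stride, W/stride)
        return cls_logits, ltrb


def decode_boxes(cls_logits, ltrb, stride, score_thresh=0.5):
    """Convert dense predictions to per-image (boxes_xyxy, scores, labels)."""
    b, _, h, w = cls_logits.shape
    device = cls_logits.device
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device), torch.arange(w, device=device), indexing="ij"
    )
    # Centre of each location in input-pixel coordinates.
    cx = (xs.flatten().float() + 0.5) * stride
    cy = (ys.flatten().float() + 0.5) * stride
    results = []
    for i in range(b):
        # Best class and its score at every location: both have shape (h*w,).
        scores, labels = cls_logits[i].sigmoid().flatten(1).max(dim=0)
        d = ltrb[i].flatten(1)  # (4, h*w)
        boxes = torch.stack([cx - d[0], cy - d[1], cx + d[2], cy + d[3]], dim=1)
        keep = scores > score_thresh
        results.append((boxes[keep], scores[keep], labels[keep]))
    return results
```

With the decoder output `feats` of shape (12, 96, 224, 224), `cls_logits, ltrb = head(feats)` followed by `decode_boxes(cls_logits, ltrb, stride=8)` would give one variable-length set of boxes per image. The surviving boxes would normally still go through non-maximum suppression (for instance `torchvision.ops.nms`) to remove duplicates. An untrained head produces meaningless boxes, so this only shows the tensor plumbing.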

computervision detection head transformer vision
