Aug. 19, 2023, 10:48 a.m. | /u/vov_or


A Vision-Language understanding Transformer (UForm) with 40% fewer parameters than vanilla CLIP that performs noticeably better on text-to-image retrieval. Its output embeddings are also half the size (256 vs. 512 dimensions), which is a further benefit for retrieval workloads.

Moreover, it supports 21 languages, including widely spoken ones such as English, Hindi, Chinese, and Arabic, as well as lower-resource languages like Ukrainian, Hebrew, and Armenian. A rough sketch of how such compact multilingual embeddings could be used for retrieval follows below.
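A minimal sketch of text-to-image retrieval with compact embeddings, assuming you already have 256-dimensional vectors from the model's text and image encoders. The embedding arrays here are random placeholders, not the actual uform API; see the GitHub repo for the real usage.

```python
import numpy as np

# UForm's output dimensionality (vs. 512 for vanilla CLIP).
EMBED_DIM = 256

def cosine_similarity(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of image vectors."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

# Placeholder: pre-computed image embeddings for a gallery of 10k images.
image_embeddings = np.random.randn(10_000, EMBED_DIM).astype(np.float32)

# Placeholder: a text embedding for a query (in any of the 21 supported languages).
query_embedding = np.random.randn(EMBED_DIM).astype(np.float32)

# Rank images by similarity; halving the embedding size halves index memory
# and roughly halves the cost of each similarity computation.
scores = cosine_similarity(query_embedding, image_embeddings)
top_k = np.argsort(-scores)[:10]
print(top_k)
```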

Demo: [http://usearch-images.com/](http://usearch-images.com/)

Github: [https://github.com/unum-cloud/uform](https://github.com/unum-cloud/uform)

https://i.redd.it/3zhc95uvs1jb1.gif
