April 26, 2024, 11:09 a.m. | /u/themathstudent

Machine Learning www.reddit.com

I am attempting to train CLIP from scratch, but few suitable datasets are available. The one dataset that seemed diverse and clean, `laion-400m`, appears to have been taken down. Looking at HF datasets, these two seem promising, but I am wondering whether there is anything better or cleaner:
- conceptual captions: uses alt-text.
- red_caps: Reddit threads, but the text is mostly the first comment on the image rather than an actual caption.

TIA
