July 9, 2022, 10:35 p.m. | /u/impurekitkat

Both setups use the `vqgan_imagenet_f16_16384` checkpoint. I'm not sure what Dalle-mini does differently, but the results it produces are so much better. VQGAN+CLIP produces images with nothing in focus, even when the prompt is just a single object. I'm not sure whether this comes from the augmentation randomization (affine, sharpness, color jitter) or something else; a sketch of that augmentation stack follows.
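
For reference, the popular VQGAN+CLIP notebooks run each batch of image cutouts through a kornia augmentation stack before CLIP scores them. Below is a minimal sketch of that idea; the parameter values and the cutout shape are assumptions for illustration, not the notebooks' exact settings. One plausible reading is that because the CLIP loss is averaged over these randomly perturbed views, nothing pushes any single view to be sharp.

```python
import torch
import torch.nn as nn
import kornia.augmentation as K

# The augmentations named in the post; parameter values are illustrative guesses.
augs = nn.Sequential(
    K.RandomAffine(degrees=15.0, translate=(0.1, 0.1), p=0.8),  # random shift/rotate
    K.RandomSharpness(sharpness=0.3, p=0.4),                    # random sharpness jitter
    K.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1, p=0.7),
)

# Stand-in batch of random cutouts from the image being optimized
# (real notebooks crop these from the current VQGAN decode).
cutouts = torch.rand(32, 3, 224, 224)

# Each cutout gets an independent random perturbation before CLIP scores it.
# Averaging the loss over all these jittered views can reward an image that
# matches the prompt "on average" rather than one with a single sharply
# focused subject, which could explain the soft results.
augmented = augs(cutouts)
```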

For example, here are both models' results on the prompt "an art deco car driving down the street":

…
