April 7, 2024, 7:14 p.m. | /u/Jordanoer

Machine Learning www.reddit.com

At the moment, I feel I have a decent understanding of how cross-attention is employed in a U-Net at each downsampling and upsampling block. The cross-attention eventually seems to produce attention heat maps that indicate how relevant each pixel in the image is to each word in the prompt.

My confusion lies in how this attention map is used to produce the final image. I.e., how is the cross-attention integrated with the …
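
For reference, here is a minimal NumPy sketch of a single cross-attention layer, assuming the standard scaled dot-product formulation: queries come from the image features, and keys and values come from the text embeddings. In this formulation the map is never drawn into the image. It only serves as the weights for mixing the text value vectors at each spatial position, and the result is added back onto the image features through a residual connection. All names, shapes, and random weights below are hypothetical and for illustration only; a real U-Net block would also use multiple heads and normalization, which are left out here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_emb, w_q, w_k, w_v, w_out):
    """Single-head cross-attention (illustrative sketch, not a library API).

    image_feats: (n_pixels, c_img) flattened spatial features from the U-Net
    text_emb:    (n_tokens, c_txt) text encoder outputs, one row per token
    """
    q = image_feats @ w_q                    # queries from the image: (n_pixels, d)
    k = text_emb @ w_k                       # keys from the text:     (n_tokens, d)
    v = text_emb @ w_v                       # values from the text:   (n_tokens, d)

    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_pixels, n_tokens)
    attn_map = softmax(scores, axis=-1)      # the "heat map": each row sums to 1

    # The map is used only as mixing weights over the text values.
    mixed = attn_map @ v                     # (n_pixels, d)

    # Project back to image channels and add residually to the image features.
    out = image_feats + mixed @ w_out        # (n_pixels, c_img)
    return out, attn_map

# Toy sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
n_pixels, n_tokens, c_img, c_txt, d = 16, 5, 8, 12, 8
image_feats = rng.normal(size=(n_pixels, c_img))
text_emb = rng.normal(size=(n_tokens, c_txt))
w_q = rng.normal(size=(c_img, d))
w_k = rng.normal(size=(c_txt, d))
w_v = rng.normal(size=(c_txt, d))
w_out = rng.normal(size=(d, c_img))

out, attn_map = cross_attention(image_feats, text_emb, w_q, w_k, w_v, w_out)
print(out.shape, attn_map.shape)
```

The output has the same shape as the input image features, so the next block of the U-Net just keeps processing it. The heat maps people plot are a by-product: one column of `attn_map`, reshaped back to the spatial grid, gives the map for a single token.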

