April 7, 2024, 7:14 p.m. | /u/Jordanoer

Machine Learning www.reddit.com

At the moment, I feel I have a decent understanding of how cross-attention is actually employed in a UNet at each upsampling and downsampling block. Eventually, the cross-attention produces attention heat maps which basically indicate the relevancy of each pixel in the image to the words in the prompt.

My confusion lies in how this attention map is used to produce the final image. I.e., how is the cross-attention integrated with the …
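For what it's worth, here is a minimal NumPy sketch of the mechanism being described: queries come from the image features, keys/values come from the text embeddings, and the attention output is folded back into the feature stream via a residual connection. All names, shapes, and weight matrices here are illustrative assumptions, not Stable Diffusion's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(pixel_feats, text_embeds, Wq, Wk, Wv):
    # pixel_feats: (num_pixels, d_model) -- flattened spatial features of a UNet block
    # text_embeds: (num_tokens, d_text)  -- e.g. CLIP token embeddings of the prompt
    Q = pixel_feats @ Wq               # queries from the image
    K = text_embeds @ Wk               # keys from the text
    V = text_embeds @ Wv               # values from the text
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (num_pixels, num_tokens) "heat map"
    out = attn @ V                        # per-pixel weighted mix of token values
    # Residual connection: the text-conditioned signal is added back into the
    # image feature stream, which then flows on through the rest of the UNet.
    return pixel_feats + out, attn
```

So the attention map itself is never rendered directly; it weights the text value vectors, and that weighted sum is added to the pixel features that the subsequent UNet layers keep processing.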

