If you’ve been reading some of my recent blogs, you know I’m a big user of the new AI tools that generate images from prompts, part of the recent AI spring (I pay a monthly fee to Midjourney for the privilege).
A perception I’ve had about AI research is that it’s a race to hack on complicated models until some benchmarks move, without spending enough time understanding why the models do what they do. In this sense, research on these models sometimes feels more like an art than a science. This is the reason I’ve had a minor revulsion towards digging too deep into them, and the speed at which development seems to be happening only makes it worse. What if I spend a lot of effort understanding some new-fangled model that becomes obsolete tomorrow?
But the recent advances in image-generating models, where one can just enter a description of an image and get a really high-quality piece in response (Midjourney, Dall-E and the open-source Stable Diffusion being some of the players), have forced me to come out of my cave and pay attention. And attention was all I needed.
In doing this, I was pleasantly surprised to note that the theory behind these diffusion models is pretty deep, motivated by a branch of physics called statistical thermodynamics, and it involves time travel. This doesn’t mean there wasn’t some of the “hacking for results” art I referred to earlier going on as well.
In this article, I’ll summarize what I’ve gleaned so far from frantically reading the papers for a few weeks (linking the most important ones along the way) and paint a high-level, end-to-end picture of what’s going on.
All images unless otherwise stated are by me, the author.
Say we were starting from scratch. We’d like to develop a model that takes a text prompt and spits out an image. Let’s develop a model that achieves this without worrying about the quality or performance. Just something that will work mechanically.
As described in section III here, we have a tool called neural networks that can map vectors from one space to those of another space, and are able to learn all…