


“Vibrant portrait painting of Salvador Dalí with a robotic half face.” Credit: OpenAI

DALL·E 2 is the newest AI model by OpenAI. If you’ve seen some of its creations and think they’re amazing, keep reading to understand why you’re totally right - but also wrong.

OpenAI published a blog post and a paper entitled “Hierarchical Text-Conditional Image Generation with CLIP Latents” on DALL·E 2. The post is fine if you want to get a glimpse at the results and the paper is great for understanding the technical details, but neither explains DALL·E 2’s amazingness - and the not-so-amazing - in depth. That’s what this article is for.

DALL·E 2 is the new version of DALL·E, a generative language model that takes sentences and creates corresponding original images. At 3.5B parameters, DALL·E 2 is a large model but not nearly as large as GPT-3 and, interestingly, smaller than its predecessor (12B). Despite its size, DALL·E 2 generates 4x better resolution images than DALL·E and it’s preferred by human judges over 70% of the time in both caption matching and photorealism.

As they did with DALL·E, OpenAI didn’t release DALL·E 2 (you can always join the never-ending waitlist). However, they open-sourced CLIP which, although only indirectly related to DALL·E, forms the basis of DALL·E 2. Still, OpenAI’s CEO, Sam Altman, said they’ll eventually release DALL·E models through their API - for now, only a select few have access to it (they’re opening the model to 1,000 people each week). (CLIP is also the basis of the apps and notebooks people who can’t access DALL·E 2 are using.)

This is surely not the first DALL·E 2 article you’ve seen, but I promise not to bore you. I’ll give you new insights to ponder and will add depth to ideas others have touched on only superficially. Also, I’ll go light on this one (although it’s quite long), so don’t expect a highly technical article - DALL·E 2’s beauty lies in its intersection with the real world, not in its weights and parameters.

And it’s at the intersection of AI with the real world where I focus my Substack newsletter, The Algorithmic Bridge. I write exclusive content about the AI that you use and the AI that’s used on you. How it influences our lives and how we can learn to navigate the complex world we’re building. Thanks for your support! (End of advertisement)

This article is divided into four sections:

1. How DALL·E 2 works: What the model does and how it does it. I’ll add at the end an “explain like I’m five” practical analogy that anyone can follow and understand.
2. DALL·E 2 variations, inpainting, and text diffs: What are the possibilities beyond text-to-image generation. These techniques generate the most stunning images, videos, and murals.
3. My favorite DALL·E 2 creations: I’ll show you my personal favorites that many of you might not have seen.
4. DALL·E 2 limitations and risks: I’ll talk about DALL·E 2’s shortcomings, which harms it can cause, and what conclusions we can draw. This section is subdivided into social and technical aspects.

I’ll explain DALL·E 2 more intuitively soon, but I want you to form a general idea now of how it works without resorting to too much simplification. These are the four key high-level concepts you have to remember:

1. CLIP: Model that takes image-caption pairs and creates “mental” representations in the form of vectors, called text/image embeddings (figure 1, top).
2. Prior model: Takes a caption/CLIP text embedding and generates CLIP image embeddings.
3. Decoder Diffusion model (unCLIP): Takes a CLIP image embedding and generates images.
4. DALL·E 2: Combination of prior + diffusion decoder (unCLIP) models.

DALL·E 2 is a particular instance of a two-part model (figure 1, bottom) made of a prior and a decoder. By concatenating both models we can go from a sentence to an image. We input a sentence into the “black box” and it outputs a well-defined image. It’s interesting to note that the decoder is called unCLIP because it does the inverse process of the original CLIP model - instead of creating a “mental” representation (embedding) from an image, it creates an original image from a generic mental representation. The mental representation encodes the main features that are semantically meaningful: People, animals, objects, style, colors, background, etc., so that DALL·E 2 can generate a novel image that retains these characteristics while varying the non-essential features.

Figure 1: CLIP (top).
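To make the prior + decoder flow concrete, here is a toy Python sketch of the data flow only. Everything in it is a stand-in I made up for illustration: the real CLIP encoder, prior, and unCLIP decoder are large neural networks, embeddings have hundreds of dimensions, and the decoder runs a diffusion process - here the “image” is just a grid of numbers.

```python
# Toy sketch of the prior + decoder (unCLIP) pipeline. Each stage below is a
# tiny made-up stand-in function that only shows how data flows:
# caption -> CLIP text embedding -> CLIP image embedding -> image.

def clip_text_encoder(caption: str) -> list[float]:
    """Stand-in for CLIP's text encoder: caption -> fixed-size text embedding."""
    codes = [float(ord(c)) for c in caption] or [0.0]
    # A fake 4-dimensional "mental representation" (real CLIP uses far more dims).
    return [sum(codes) % 7.0, len(codes) % 5.0, codes[0] % 3.0, codes[-1] % 2.0]

def prior(text_embedding: list[float]) -> list[float]:
    """Stand-in for the prior: CLIP text embedding -> CLIP image embedding."""
    return [0.5 * v + 1.0 for v in text_embedding]

def unclip_decoder(image_embedding: list[float], size: int = 4) -> list[list[float]]:
    """Stand-in for the diffusion decoder (unCLIP): image embedding -> image.

    Here the "image" is just a size x size grid of numbers derived from the
    embedding; the real decoder runs a diffusion process conditioned on it.
    """
    return [
        [image_embedding[(row + col) % len(image_embedding)] for col in range(size)]
        for row in range(size)
    ]

def dalle2(caption: str) -> list[list[float]]:
    """Concatenate both models: sentence in, image out."""
    text_emb = clip_text_encoder(caption)   # CLIP: caption -> text embedding
    image_emb = prior(text_emb)             # prior: text -> image embedding
    return unclip_decoder(image_emb)        # unCLIP: image embedding -> image

image = dalle2("vibrant portrait of Salvador Dali with a robotic half face")
print(len(image), len(image[0]))  # -> 4 4
```

The point of the sketch is only the shape of the system: two separately trained parts glued at the CLIP image embedding, which is exactly why you can swap the decoder's input to get variations of an existing image.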
