How to see the future using DALL-E 3
To understand how quickly AI is improving, forget ChatGPT — use an image generator
Earlier this month, I wrote about how you can fast-forward through time by using the paid versions of frontier AI products. Today, let’s talk about another of those products, the text-to-image generator DALL-E. The latest version tells us a lot about how OpenAI is thinking about product, policy, and safety as its core technology continues to improve.
These days, ChatGPT gets all the attention. But six months before OpenAI’s chatbot arrived on the scene and shifted the tech world’s collective focus to generative artificial intelligence, the company released another tool that spoke to similar possibilities. DALL-E 2, which I got access to in June of last year, captivated me from its first release.
As a kid, I was an enthusiastic artist, delighting in drawing my own comic books on weekends and after school. But I quickly found the limits of my talent, and despite trying my best to follow the instructions in books designed to make me a better illustrator, I never really got very far. Fast-forward a few decades, and suddenly I could conjure whole worlds just by typing words into a box. It felt like the kind of magic the tech industry often promises but rarely delivers.
In the 18 months since DALL-E 2 emerged, that picture has become complicated by questions about copyright, permissions, and what — if anything — the makers of text-to-image generators like DALL-E owe the artists whose work their models are trained on. Stock photo company Getty Images sued Stability AI, the maker of Stable Diffusion, earlier this year, saying the company’s model had improperly been trained on its photos. Similar lawsuits seem likely to follow. Meanwhile, Adobe demonstrated an alternative path forward by creating its own Firefly image generator using only licensed imagery, and says it will compensate creators whose work was used in the training process.
Despite the legal and ethical uncertainties around text-to-image generators, though, the field has continued to develop rapidly. Midjourney, which launched shortly after DALL-E 2, attracted 15 million users and is generating hundreds of millions of dollars in annual revenue with its own image generator, despite currently being available exclusively on Discord. Stable Diffusion reached 10 million users last October. As those rivals grew, the quality of their images improved dramatically, while DALL-E’s, though still impressive by pre-2022 standards, stagnated.
Then, on Thursday, DALL-E 3 arrived. After a short time in public beta, the next generation of OpenAI’s image generator is now available to enterprise customers and to subscribers to ChatGPT Plus. (You can also use a free version through the Bing Image Creator.)
I’ve been using DALL-E 3, both through Bing and ChatGPT, for the past few weeks. On Wednesday, I was also briefed on the new version by Gabriel Goh, a research scientist at OpenAI who helped build the new model, and Sandhini Agarwal, who works on AI policy at the company.
Here are five surprising things I’ve learned about DALL-E 3 in my first weeks of using it.
When used inside ChatGPT, DALL-E 3 aggressively rewrites shorter prompts.
One of the core product challenges in AI so far has been its missing user interface. Whether you’re using a text-to-image generator or a chatbot, you typically have nothing more to work with than a blank box to type in. Getting quality images out of tools like Midjourney can require significant prompt engineering, with power users memorizing dozens of modifying tags (“photorealistic,” “depth of field,” “sharp focus”) to get what they want.
DALL-E 3 confronts this problem by simply rewriting prompts for you. When I asked it to show me “a teddy bear detective in a noir film, meeting a new client in his office,” for example, it quickly spun up much more detailed queries. Here was one:
Illustration of a 1940s-style detective office. The room is adorned with black and white checkered flooring, dark wooden walls, and a ceiling fan slowly rotating. A teddy bear detective, donning a sharp suit and fedora, sits behind an old-fashioned desk. He listens intently to a male teddy bear client who stands nervously before him. The client wears a bow tie and suspenders and clutches a briefcase. A lit desk lamp casts long shadows, and outside, the silhouette of rain-soaked buildings can be seen.
And here was another:
Photo of an atmospheric detective office with greenish-blue lighting, reminiscent of a noir film. A teddy bear detective, wearing glasses and a tie, sits at his desk, which is cluttered with papers and a typewriter. He's looking up to meet the gaze of a teddy bear client, a woman in a glamorous hat and fur shawl. The room is filled with smoke, possibly from the detective's cigarette. The ambiance is tense, and the city's skyline can be glimpsed through a half-drawn blind.
You can avoid this kind of rewriting by entering longer prompts. But it’s been fun to let ChatGPT run wild and suggest a more diverse array of artistic styles than I would have imagined myself.
In my testing, I found ChatGPT also rewrites prompts with an eye toward diversity. If you describe a generic group (“doctors,” say, or “students”), ChatGPT will rewrite the prompt to ensure the resulting image contains racial and gender diversity.
DALL-E 3 can be surprisingly restrictive.
Typically DALL-E 3 returns four responses to your query. But for the teddy bear noir example above, it returned just three. “One of the descriptions didn't align with our content policy, so I was unable to generate an image for that,” the bot told me. Just so we’re clear here: ChatGPT rewrote my prompt about a teddy bear detective to something that broke its own content policy, and so it refused to generate an image. If you’re curious what exactly ChatGPT wanted to show the teddy bear doing, so am I.
Perhaps that one was just a random bug. But I’ve found ChatGPT restrictive in other ways, too. It’s highly attuned to requests for images that might be construed to sexualize the subject matter — if you’re looking for a photorealistic image of a man with his shirt off, look elsewhere. And it’s more sensitive about copyrighted images than its predecessor as well. When I asked it to depict corporate logos — something I did often with DALL-E 2 — it simply refused.
All of this is probably good from a legal liability perspective. But I suspect many adults will find it off-putting, at least compared to more permissive rivals. At the very least, OpenAI could do more to explain to users how their prompt or resulting image broke its content policies, and to suggest permitted alternatives.
DALL-E now bans requests for art in the style of living artists.
Text-to-image generators have drawn criticism for the way that they can be used to create infinite knock-off versions of individual artists’ work. Greg Rutkowski, whose name has been used in more Stable Diffusion prompts than any other artist’s, is among many creatives who object to the way these tools abstract away their entire styles and turn them into software features, with no compensation to the artist.
DALL-E’s solution to this is to simply deny requests for art in the style of a living artist. Ask for a dragon in the style of Greg Rutkowski, and it responds “I'm sorry, I cannot directly reference the style of Greg Rutkowski as it's too recent. However, I can generate an image of a dragon inspired by general contemporary digital art styles. Would you like me to proceed with that?”
Living artists can also now opt out of having their work used to train future models, though there’s no way to opt out of the existing training sets.
OpenAI is working on a novel way to identify AI-generated images.
A leading concern about AI safety is that photorealistic fakes will be widely circulated in an effort to sow doubt or cause chaos, and we won’t be able to tell what is real and what is fake. Researchers are actively working on various forms of digital watermarking, but those watermarks often cannot survive being screenshotted. (Doing so removes the metadata in which watermarks are often stored.)
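The fragility of metadata-based watermarks is easy to illustrate. Here is a toy sketch — the dict below is a stand-in for a real image file, where the watermark would live in something like an EXIF tag or a C2PA manifest — showing that a screenshot re-renders only the pixels, so anything stored alongside them is lost:

```python
# Toy illustration: a metadata watermark doesn't survive a screenshot,
# because a screenshot captures only what's rendered on screen.
# Real watermarks live in EXIF tags or C2PA manifests attached to the
# file; the dict here is just a stand-in for that structure.

original = {
    "pixels": [[0, 255], [255, 0]],  # the visible image
    "metadata": {"watermark": "generated-by-dall-e"},
}

def screenshot(image):
    """A screenshot copies the rendered pixels and nothing else."""
    return {"pixels": [row[:] for row in image["pixels"]], "metadata": {}}

copy = screenshot(original)
print("watermark" in original["metadata"])  # True
print("watermark" in copy["metadata"])      # False: the mark is gone
```

The pixels are identical in both versions; only the provenance information disappears, which is exactly why metadata-only watermarking is considered weak.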
OpenAI is trying a different approach: training a machine-learning classifier that attempts to learn what DALL-E images look like. That’s counterintuitive, at least to me, since DALL-E can create so many different styles. But show a model enough images and eventually it learns which came from DALL-E with 99 percent accuracy, the company says. If you crop it, resize it, or compress it into a JPEG, OpenAI’s model is still 95 percent accurate, it says.
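OpenAI hasn’t said how its classifier is built, but the general recipe for a binary classifier is standard. Below is a minimal sketch, assuming made-up two-dimensional feature vectors in place of real images and plain logistic regression in place of a deep network; every name and number in it is illustrative, not OpenAI’s actual system.

```python
# Minimal sketch of a binary "was this image AI-generated?" classifier.
# Real detectors train a deep network on millions of images; here we use
# synthetic 2-D feature vectors and logistic regression purely to show
# the shape of the approach.
import math
import random

random.seed(0)

def make_samples(n, center, label):
    """Synthetic 2-D feature vectors clustered around `center`."""
    return [([random.gauss(c, 1.0) for c in center], label) for _ in range(n)]

# Pretend features from AI-generated images (label 1) cluster away from
# features extracted from real photos (label 0).
train = make_samples(200, (2.0, 2.0), 1) + make_samples(200, (-2.0, -2.0), 0)
random.shuffle(train)

# Logistic regression trained with plain stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in train:
        z = w[0] * x[0] + w[1] * x[1] + b
        z = max(min(z, 30.0), -30.0)    # clamp to avoid math.exp overflow
        p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label == 1)
        err = p - y
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

test = make_samples(100, (2.0, 2.0), 1) + make_samples(100, (-2.0, -2.0), 0)
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The hard part in the real system isn’t this final classification step but learning features that survive crops, resizes, and JPEG compression — which is what OpenAI’s reported 95 percent figure is about.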
For now, the company is not making its tool available for external use. If it works, though, it would suggest a useful new approach for identifying deepfakes that come from tools like DALL-E.
Text-to-image generators turn out to be a better way to understand how quickly AI is developing than chatbots.
How much better is GPT-4 than GPT-3.5? Ask anyone who has used both for an extended period of time, and they’ll tell you that GPT-4 is far better. And yet when you ask why, the answer will probably be vibes-based: it just feels smarter, and it seems less likely to get stuck or to hallucinate the way its predecessor does.
Trying to understand the difference between DALL-E 2 and 3, on the other hand, is as simple as using them both once each.
When I got access to DALL-E 2 last year, I asked it to show me “bulldog mad scientist mixing sparkling chemicals, digital art.” Here are the images it gave me:
I entered the same prompt into DALL-E 3 last week. This was my favorite result:
The time between these two images was 16 months. In that time, OpenAI has added automatic prompt engineering and radically improved the resulting image.
I understand that many people have feelings of visceral disdain for synthetic images like this. They rightly worry about what the arrival of this technology means for people who make their living as artists and about the cultural and political impact of the world being flooded with artificial media.
I worry about those things, too. And part of being an effective worrier is observing just how quickly AI is getting better. As good as I think the bulldog above is, it would not be surprising if, another 16 months from now, text-to-image generators had made a similarly dramatic leap forward.
Talk about this edition with us in Discord: This link will get you in for the next week.
Governing
The US Supreme Court has allowed Biden administration officials to contact social media companies about misinformation, temporarily lifting limits previously imposed by an appeals court. (Adam Liptak / The New York Times)
Google must pay female executive Ulku Rowe $1 million after a jury found it discriminated against her based on her gender. (Beth Wang / Bloomberg Law)
A new report found that verified accounts on X are among the biggest spreaders of misinformation about the Israel-Hamas conflict — with X profiting from the engagement on those posts. (Donie O'Sullivan / CNN)
Meta is divided internally over how to regulate content from the Palestinian territories, with some of its Tel Aviv and Dubai teams disagreeing about content in the region. (Sam Schechner, Jeff Horwitz and Newley Purnell / The Wall Street Journal)
Israeli company Percepto International, a disinformation-for-hire business, was hired in Burkina Faso to keep the government in power and fight Islamic extremist rhetoric. But the operatives quickly discovered they weren’t the only players in the disinformation business in the region. (Elizabeth Dwoskin / Washington Post)
Tim Cook’s trip to China included a meeting with commerce minister Wang Wentao, underscoring the increasingly complex relationship between Apple and the Chinese economy. (Yang Jie and Liza Lin / The Wall Street Journal)
Microsoft said it will spend about $3.2 billion in Australia to expand its AI and cloud computing abilities. (Byron Kaye / Reuters)
Japan’s Fair Trade Commission is investigating Google and looking into whether the company abuses its market position to block rivals, adding to the tech giant’s antitrust troubles. (Ryotaro Nakamaru, Eddy Duan, and Grace Huang / Bloomberg)
Industry
TikTok is testing 15-minute uploads with a limited group of users, which would raise its video upload limit from 10 minutes. (Aisha Malik / TechCrunch)
Threads is attempting to boost its growth by showing Threads posts on Facebook, after making a similar move on Instagram in August. (Ivan Mehta / TechCrunch)
Instagram head Adam Mosseri said the app is testing a toggle for a verified-only feed, which could incentivize more users to pay for the Meta Verified subscription. (Jay Peters / The Verge)
Meta researchers say they’re releasing a more advanced version of Habitat, an AI simulation environment, along with a 3D dataset that can train AI navigation agents including robots. (Mike Wheatley / Silicon Angle)
Meta chief product officer Chris Cox discusses the company’s plans to incorporate AI into its products, including the rollout of chatbots and its partnerships with celebrities. (The Wall Street Journal)
OpenAI is talking to major publishers about paying to use news stories to train ChatGPT, and possibly for ChatGPT to link to individual stories. (Nitasha Tiku / Washington Post)
Internal prompts between the DALL-E 3 AI image generator and ChatGPT included commands written in all caps for emphasis — a possible preview of the future of programming. (Benj Edwards / Ars Technica)
A new study by Stanford researchers outlines concerns that AI chatbots perpetuate racist medical ideas, which could worsen racial disparities in healthcare. (Garance Burke and Matt O’Brien / Associated Press)
Microsoft CEO Satya Nadella talked about how excited he is about the Activision Blizzard deal, the biggest opportunity in AI, and where to pick his zero-sum battles in this interview. (Mathias Döpfner / Insider)
Ali Farhadi, CEO of the Allen Institute for AI, is leading an effort to build a freely available AI alternative to those built by big tech giants, sparking a debate over innovation and public data access. (Steve Lohr / The New York Times)
X is partnering with Google Ad Manager to allow companies to advertise on X through Google Ads, following a steep decline in ad revenue. (Nicola Agius / Search Engine Land)
Elon Musk said he is planning two new tiers of X Premium: one that costs less than the current $8/month plan but doesn’t reduce ad load, and one that costs more with no ads. (Ivan Mehta / TechCrunch)
Amazon is launching passkey support for its online site and mobile shopping apps. (Umar Shakir / The Verge)
Apple is trying to catch up with the AI craze — revamping Siri, adding AI to the next iOS version, and exploring ways to add AI to as many apps as possible. (Mark Gurman / Bloomberg)
Twitch is now allowing creators to stream simultaneously across any other live streaming platforms. It also updated its off-platform behavior policies. (Jay Peters / The Verge)
Google is preparing to test “IP Protection”, a new feature for Chrome that aims to hide users’ locations using proxy servers. (Mayank Parmar / Bleeping Computer)
Cameo’s meteoric rise during the pandemic has given way to cost-cutting measures, but its co-founder says he is still optimistic. (Erin Griffith and Sapna Maheshwari / The New York Times)
Tinder Matchmaker, a new feature on the dating app, will allow users to invite their friends and families to view and recommend potential matches. Let me know if you do this! (Jess Weatherbed / The Verge)
Those good posts
For more good posts every day, follow Casey’s Instagram stories.
Talk to us
Send us tips, comments, questions, and DALL-E prompts: casey@platformer.news and zoe@platformer.news.
With the advent of DALL-E 3, MidJourney, Stable Diffusion and its variants, alongside Adobe Firefly, the discourse on generative art is over. The tech is so groundbreaking, there's no turning back now. Pandora's Box literally cannot be closed. It would be like saying "Yeah I know electricity is nice and all, but we're going to turn it off".
Thank you for this deep-dive on DALL-E 3! I have been playing around with it a bit and generally enjoying it, although I have similarly run into hard roadblocks with no explanation. I wrote this prompt "For light-hearted Halloween fun, generate an image of the scariest thing a pathologist can imagine," and when the first responses were mediocre, I clarified "Try again, this time creating something clever that is a visual pun or play on concepts in pathology" and it slammed right into warnings about aligning with their content policy. Overall, I think the new work-around of DALL-E taking initial text and making their own prompt makes sense, but it can lead to baffling results like this where it over-extrapolates a benign request into something verboten on its own...
I have also noticed a few times where it just refuses to perform as specified. A few times I asked for 3 or 4 examples and it would either return 0-1 or just say flatly "I can generate two." Uhh, ok??
Finally, I have noticed that regardless of the style of image generated, compared to Stable Diffusion or Midjourney, DALL-E 3 images tend to have a subtle soft-light "glow" effect, almost like old school soap operas. Human figures also have a distinctive, almost-realistic-but-video-game-y look different from competitor algorithms. These features seem like they would be easy enough to detect at mass scale.