How to see the future using DALL-E 3
To understand how quickly AI is improving, forget ChatGPT — use an image generator
Earlier this month, I wrote about how you can fast-forward through time by using the paid versions of frontier AI products. Today, let’s talk about another of those products, the text-to-image generator DALL-E. The latest version tells us a lot about how OpenAI is thinking about product, policy, and safety as its core technology continues to improve.
These days, ChatGPT gets all the attention. But six months before OpenAI’s chatbot arrived on the scene and shifted the tech world’s collective focus to generative artificial intelligence, the company released another tool that spoke to similar possibilities. DALL-E 2, which I got access to in June of last year, captivated me from its first release.
As a kid, I was an enthusiastic artist, delighting in drawing my own comic books on weekends and after school. But I quickly found the limits of my talent, and despite trying my best to follow the instructions in books designed to make me a better illustrator, I never really got very far. Fast-forward a few decades, and suddenly I could conjure whole worlds just by typing words into a box. It felt like the kind of magic the tech industry often promises but rarely delivers.
In the 18 months since DALL-E 2 emerged, that picture has become complicated by questions about copyright, permissions, and what — if anything — the makers of text-to-image generators like DALL-E owe the artists whose work their models are trained on. Stock photo company Getty Images sued Stability AI, the maker of Stable Diffusion, earlier this year, saying the company’s model had improperly been trained on its photos. Similar lawsuits seem likely to follow. Meanwhile, Adobe demonstrated an alternative path forward by creating its own Firefly image generator using only licensed imagery, and says it will compensate creators whose work was used in the training process.
Despite the legal and ethical uncertainties around text-to-image generators, though, the field has continued to develop rapidly. Midjourney, which launched shortly after DALL-E 2, attracted 15 million users and is generating hundreds of millions of dollars in annual revenue with its own image generator, despite currently being available exclusively on Discord. Stable Diffusion reached 10 million users last October. As those rivals grew, the quality of their images improved dramatically, while DALL-E’s, though still impressive by pre-2022 standards, stagnated.
Then, on Thursday, DALL-E 3 arrived. After a short time in public beta, the next generation of OpenAI’s image generator is now available to enterprise customers and to subscribers to ChatGPT Plus. (You can also use a free version through the Bing Image Creator.)
I’ve been using DALL-E 3, both through Bing and ChatGPT, for the past few weeks. On Wednesday, I was also briefed on the new version by Gabriel Goh, a research scientist at OpenAI who helped build the new model, and Sandhini Agarwal, who works on AI policy at the company.
Here are five surprising things I’ve learned about DALL-E 3 in my first weeks of using it.
When used inside ChatGPT, DALL-E 3 aggressively rewrites shorter prompts.
One of the core product challenges in AI so far has been its missing user interface. Whether you’re using a text-to-image generator or a chatbot, you typically have nothing more to work with than a blank box to type in. Getting quality images out of tools like Midjourney can require significant prompt engineering, with power users memorizing dozens of modifying tags (“photorealistic,” “depth of field,” “sharp focus”) to get what they want.
DALL-E 3 confronts this problem by simply rewriting prompts for you. When I asked it to show me “a teddy bear detective in a noir film, meeting a new client in his office,” for example, it quickly spun up much more detailed queries. Here was one:
Illustration of a 1940s-style detective office. The room is adorned with black and white checkered flooring, dark wooden walls, and a ceiling fan slowly rotating. A teddy bear detective, donning a sharp suit and fedora, sits behind an old-fashioned desk. He listens intently to a male teddy bear client who stands nervously before him. The client wears a bow tie and suspenders and clutches a briefcase. A lit desk lamp casts long shadows, and outside, the silhouette of rain-soaked buildings can be seen.
And here was another:
Photo of an atmospheric detective office with greenish-blue lighting, reminiscent of a noir film. A teddy bear detective, wearing glasses and a tie, sits at his desk, which is cluttered with papers and a typewriter. He's looking up to meet the gaze of a teddy bear client, a woman in a glamorous hat and fur shawl. The room is filled with smoke, possibly from the detective's cigarette. The ambiance is tense, and the city's skyline can be glimpsed through a half-drawn blind.
You can avoid this kind of rewriting by entering longer prompts. But it’s been fun to let ChatGPT run wild and suggest a more diverse array of artistic styles than I would have imagined myself.
In my testing, I found ChatGPT also rewrites prompts with an eye toward diversity. If you describe a generic group (“doctors,” say, or “students”), ChatGPT will rewrite the prompt to ensure the resulting image contains racial and gender diversity.
DALL-E 3 can be surprisingly restrictive.
Typically DALL-E 3 returns four responses to your query. But for the teddy bear noir example above, it returned just three. “One of the descriptions didn't align with our content policy, so I was unable to generate an image for that,” the bot told me. Just so we’re clear here: ChatGPT rewrote my prompt about a teddy bear detective to something that broke its own content policy, and so it refused to generate an image. If you’re curious what exactly ChatGPT wanted to show the teddy bear doing, so am I.
Perhaps that one was just a random bug. But I’ve found ChatGPT restrictive in other ways, too. It’s highly attuned to requests for images that might be construed to sexualize the subject matter — if you’re looking for a photorealistic image of a man with his shirt off, look elsewhere. And it’s more sensitive about copyrighted images than its predecessor as well. When I asked it to depict corporate logos — something I did often with DALL-E 2 — it simply refused.
All of this is probably good from a legal liability perspective. But I suspect many adults will find it off-putting, at least compared to more permissive rivals. At the very least, OpenAI could do more to explain to users how their prompt or resulting image broke its content policies, and to suggest permitted alternatives.
DALL-E now bans requests for art in the style of living artists.
Text-to-image generators have drawn criticism for the way that they can be used to create infinite knock-off versions of individual artists’ work. Greg Rutkowski, whose name has been used in more Stable Diffusion prompts than any other artist’s, is among many creatives who object to the way these tools abstract away their entire styles and turn them into software features, with no compensation to the artist.
DALL-E’s solution to this is to simply deny requests for art in the style of a living artist. Ask for a dragon in the style of Greg Rutkowski, and it responds “I'm sorry, I cannot directly reference the style of Greg Rutkowski as it's too recent. However, I can generate an image of a dragon inspired by general contemporary digital art styles. Would you like me to proceed with that?”
Living artists can also now opt out of having their work used to train future models, though there’s no way to opt out of the existing training sets.
OpenAI is working on a novel way to identify AI-generated images.
A leading concern about AI safety is that photorealistic fakes will be widely circulated in an effort to sow doubt or cause chaos, and we won’t be able to tell what is real and what is fake. Researchers are actively working on various forms of digital watermarking, but those watermarks often cannot survive being screenshotted. (Doing so removes the metadata in which watermarks are often stored.)
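The fragility of metadata-based watermarks is easy to illustrate. Here is a toy sketch — the dict below is a stand-in for a real image file, where the watermark would live in something like an EXIF tag or a C2PA manifest — showing that a screenshot re-renders only the pixels, so anything stored alongside them is lost:

```python
# Toy illustration: a metadata watermark doesn't survive a screenshot,
# because a screenshot captures only what's rendered on screen.
# Real watermarks live in EXIF tags or C2PA manifests attached to the
# file; the dict here is just a stand-in for that structure.

original = {
    "pixels": [[0, 255], [255, 0]],  # the visible image
    "metadata": {"watermark": "generated-by-dall-e"},
}

def screenshot(image):
    """A screenshot copies the rendered pixels and nothing else."""
    return {"pixels": [row[:] for row in image["pixels"]], "metadata": {}}

copy = screenshot(original)
print("watermark" in original["metadata"])  # True
print("watermark" in copy["metadata"])      # False: the mark is gone
```

The pixels are identical in both versions; only the provenance information disappears, which is exactly why metadata-only watermarking is considered weak.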
OpenAI is trying a different approach: training a machine-learning classifier that attempts to learn what DALL-E images look like. That’s counterintuitive, at least to me, since DALL-E can create so many different styles. But show a model enough images and eventually it learns which came from DALL-E with 99 percent accuracy, the company says. If you crop it, resize it, or compress it into a JPEG, OpenAI’s model is still 95 percent accurate, it says.
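OpenAI hasn’t said how its classifier is built, but the general recipe for a binary classifier is standard. Below is a minimal sketch, assuming made-up two-dimensional feature vectors in place of real images and plain logistic regression in place of a deep network; every name and number in it is illustrative, not OpenAI’s actual system.

```python
# Minimal sketch of a binary "was this image AI-generated?" classifier.
# Real detectors train a deep network on millions of images; here we use
# synthetic 2-D feature vectors and logistic regression purely to show
# the shape of the approach.
import math
import random

random.seed(0)

def make_samples(n, center, label):
    """Synthetic 2-D feature vectors clustered around `center`."""
    return [([random.gauss(c, 1.0) for c in center], label) for _ in range(n)]

# Pretend features from AI-generated images (label 1) cluster away from
# features extracted from real photos (label 0).
train = make_samples(200, (2.0, 2.0), 1) + make_samples(200, (-2.0, -2.0), 0)
random.shuffle(train)

# Logistic regression trained with plain stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in train:
        z = w[0] * x[0] + w[1] * x[1] + b
        z = max(min(z, 30.0), -30.0)    # clamp to avoid math.exp overflow
        p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label == 1)
        err = p - y
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

test = make_samples(100, (2.0, 2.0), 1) + make_samples(100, (-2.0, -2.0), 0)
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The hard part in the real system isn’t this final classification step but learning features that survive crops, resizes, and JPEG compression — which is what OpenAI’s reported 95 percent figure is about.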
For now, the company is not making its tool available for external use. If it works, though, it would suggest a useful new approach for identifying deepfakes that come from tools like DALL-E.
Text-to-image generators turn out to be a better way to understand how quickly AI is developing than chatbots.
How much better is GPT-4 than GPT-3.5? Ask anyone who has used both for an extended period of time, and they’ll tell you that GPT-4 is far better. And yet when you ask why, the answer will probably be vibes-based: it just feels smarter, and it seems less likely to get stuck or to hallucinate the way its predecessor does.
Trying to understand the difference between DALL-E 2 and 3, on the other hand, is as simple as using them both once each.
When I got access to DALL-E 2 last year, I asked it to show me “bulldog mad scientist mixing sparkling chemicals, digital art.” Here are the images it gave me:
I entered the same prompt into DALL-E 3 last week. This was my favorite result:
The time between these two images was 16 months. In that time, OpenAI has added automatic prompt engineering and radically improved the resulting image.
I understand that many people have feelings of visceral disdain for synthetic images like this. They rightly worry about what the arrival of this technology means for people who make their living as artists and about the cultural and political impact of the world being flooded with artificial media.
I worry about those things, too. And part of being an effective worrier is observing just how quickly AI is getting better. As good as I think the bulldog above is, it would not be surprising if, another 16 months from now, text-to-image generators had made a similarly dramatic leap forward.
Talk about this edition with us in Discord: This link will get you in for the next week.
Governing
The US Supreme Court has allowed Biden administration officials to contact social media companies about misinformation, temporarily lifting limits previously imposed by an appeals court. (Adam Liptak / The New York Times)
Google must pay female executive Ulku Rowe $1 million after a jury found it discriminated against her based on her gender. (Beth Wang / Bloomberg Law)
A new report found that verified accounts on X are among the biggest spreaders of misinformation about the Israel-Hamas conflict — with X profiting from the engagement on those posts. (Donie O'Sullivan / CNN)
Meta is divided internally over how to regulate content from the Palestinian territories, with some of its Tel Aviv and Dubai teams disagreeing about content in the region. (Sam Schechner, Jeff Horwitz and Newley Purnell / The Wall Street Journal)
Israeli company Percepto International, a disinformation-for-hire business, was hired in Burkina Faso to keep the government in power and fight Islamic extremist rhetoric. But the operatives quickly discovered they weren’t the only players in the disinformation business in the region. (Elizabeth Dwoskin / Washington Post)
Tim Cook’s trip to China included a meeting with commerce minister Wang Wentao, underscoring the increasingly complex relationship between Apple and the Chinese economy. (Yang Jie and Liza Lin / The Wall Street Journal)
Microsoft said it will spend about $3.2 billion in Australia to expand its AI and cloud computing abilities. (Byron Kaye / Reuters)
Japan’s Fair Trade Commission is investigating Google and looking into whether the company abuses its market position to block rivals, adding to the tech giant’s antitrust troubles. (Ryotaro Nakamaru, Eddy Duan, and Grace Huang / Bloomberg)
Industry
TikTok is testing 15-minute uploads with a limited group of users, which would raise its video upload limit from 10 minutes. (Aisha Malik / TechCrunch)
Threads is attempting to boost its growth by showing Threads posts on Facebook, after making a similar move on Instagram in August. (Ivan Mehta / TechCrunch)
Instagram head Adam Mosseri said the app is testing a toggle for a verified-only feed, which could incentivize more users to pay for the Meta Verified subscription. (Jay Peters / The Verge)
Meta researchers say they’re releasing a more advanced version of Habitat, an AI simulation environment, along with a 3D dataset that can train AI navigation agents including robots. (Mike Wheatley / Silicon Angle)
Meta chief product officer Chris Cox discusses the company’s plans to incorporate AI into its products, including the rollout of chatbots and its partnerships with celebrities. (The Wall Street Journal)
OpenAI is talking to major publishers about paying to use news stories to train ChatGPT, and possibly for ChatGPT to link to individual stories. (Nitasha Tiku / Washington Post)
Internal prompts between the DALL-E 3 AI image generator and ChatGPT included commands written in all caps for emphasis — a possible preview of the future of programming. (Benj Edwards / Ars Technica)
A new study by Stanford researchers outlines concerns that AI chatbots perpetuate racist medical ideas, which could worsen racial disparities in healthcare. (Garance Burke and Matt O’Brien / Associated Press)
Microsoft CEO Satya Nadella talked about how excited he is about the Activision Blizzard deal, the biggest opportunity in AI, and where to pick his zero-sum battles in this interview. (Mathias Döpfner / Insider)
Ali Farhadi, CEO of the Allen Institute for AI, is leading an effort to build a freely available AI alternative to those built by big tech giants, sparking a debate over innovation and public data access. (Steve Lohr / The New York Times)
X is partnering with Google Ad Manager to allow companies to advertise on X through Google Ads, following a steep decline in ad revenue. (Nicola Agius / Search Engine Land)
Elon Musk said he is planning two new tiers of X Premium: one that costs less than the current $8/month plan but doesn’t reduce ad load, and one that costs more with no ads. (Ivan Mehta / TechCrunch)
Amazon is launching passkey support for its online site and mobile shopping apps. (Umar Shakir / The Verge)
Apple is trying to catch up with the AI craze — revamping Siri, adding AI to the next iOS version, and exploring ways to add AI to as many apps as possible. (Mark Gurman / Bloomberg)
Twitch is now allowing creators to stream simultaneously across any other live streaming platforms. It also updated its off-platform behavior policies. (Jay Peters / The Verge)
Google is preparing to test “IP Protection”, a new feature for Chrome that aims to hide users’ locations using proxy servers. (Mayank Parmar / Bleeping Computer)
Cameo’s meteoric rise during the pandemic has given way to cost-cutting measures, but its co-founder says he is still optimistic. (Erin Griffith and Sapna Maheshwari / The New York Times)
Tinder Matchmaker, a new feature on the dating app, will allow users to invite their friends and families to view and recommend potential matches. Let me know if you do this! (Jess Weatherbed / The Verge)
Those good posts
For more good posts every day, follow Casey’s Instagram stories.
Talk to us
Send us tips, comments, questions, and DALL-E prompts: casey@platformer.news and zoe@platformer.news.
With the advent of DALL-E 3, MidJourney, Stable Diffusion and its variants, alongside Adobe Firefly, the discourse on generative art is over. The tech is so groundbreaking, there's no turning back now. Pandora's Box literally cannot be closed. It would be like saying "Yeah I know electricity is nice and all, but we're going to turn it off".
Thank you for this deep-dive on DALL-E 3! I have been playing around with it a bit and generally enjoying it, although I have similarly run into hard roadblocks with no explanation. I wrote this prompt "For light-hearted Halloween fun, generate an image of the scariest thing a pathologist can imagine," and when the first responses were mediocre, I clarified "Try again, this time creating something clever that is a visual pun or play on concepts in pathology" and it slammed right into warnings about aligning with their content policy. Overall, I think the new work-around of DALL-E taking initial text and making their own prompt makes sense, but it can lead to baffling results like this where it over-extrapolates a benign request into something verboten on its own...
I have also noticed a few times where it just refuses to perform as specified. A few times I asked for 3 or 4 examples and it would either return 0-1 or just say flatly "I can generate two." Uhh, ok??
Finally, I have noticed that regardless of the style of image generated, compared to Stable Diffusion or Midjourney, DALL-E 3 images tend to have a subtle soft-light "glow" effect, almost like old school soap operas. Human figures also have a distinctive, almost-realistic-but-video-game-y look different from competitor algorithms. These features seem like they would be easy enough to detect at mass scale.