Crafting Prompts for Text-to-Image Models

DALL·E can generate whatever you want, as long as you know the right incantation

Julia Turc
Towards Data Science


Illustration produced via Midjourney (generative AI). Text prompt: “a boy desperately typing on a laptop keyboard, frustrated face, visibly angry, cartoon style”.

The advent of text-conditioned image generation models like DALL·E is undoubtedly going to change the traditional creative process. However, art will not necessarily come for free: the burden will simply shift from drawing or wrangling complex graphic design software to crafting effective text prompts that control the whims of text-to-image models. This article discusses potential ways in which users and companies will address the challenge of prompt engineering or prompt design.


The Ancestry of Prompting

One can argue that prompting is the newest and most extreme form of transfer learning: a mechanism that allows previously-trained model weights to be reused in a novel context. Over the years, we have found ways to reuse more and more pre-trained weights when building task-specific models. In 2013, word2vec [1] bundled general-purpose word embeddings into a static library; people used them as off-the-shelf inputs to their NLP models. In the late 2010s, models like ELMo [2] and BERT [3] introduced fine-tuning: the entire architecture of the pre-trained model could be reused, with only a minimal number of additional weights added for each task. Finally, GPT-3 [4] closed the transfer learning chapter in 2020 via prompting: a single pre-trained model could now perform virtually any specific task, without additional parameters or re-training; it just had to be guided in the right direction through its text input. Text-to-image models like DALL·E sit at the same end of the transfer learning spectrum: each request for an image can be seen as a new task to be accomplished by the model.

The Current State of Affairs

In a way, prompting has democratized transfer learning: users no longer need ML engineering skills or expensive fine-tuning datasets in order to leverage the power of large models. However, making use of generative AI is not yet effortless. Today, 1.5 years after the first DALL·E paper was published and 3 months after DALL·E 2 was made accessible to a select few, writing effective prompts can require as much effort as picking up a new hobby. There is a learning curve at play: people tinker with the models and, through iterative experimentation, discover correlations between inputs and model behavior. They also immerse themselves in text-to-image communities (e.g., Reddit, Twitter) to learn the tricks of the trade and share their own discoveries. Among other things, they argue over whether or not DALL·E 2 has a secret language.

As a data-driven person, I wanted to find out whether data can offer a shortcut to acquiring the elusive skill of prompt engineering. A friend and I scraped Midjourney’s public Discord server, where users interact with a bot to issue prompts and get AI-generated images in return. We collected 4 months’ worth of requests and responses across 10 channels, and made the dataset available on Kaggle: Midjourney User Prompts & Generated Images (250k).
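If you want to poke at the data yourself, a minimal sketch along these lines is enough to get started. Note that the file name and column name below are assumptions about how the Kaggle dataset is exported, not guaranteed by the dataset itself:

```python
# Minimal sketch: load the scraped Midjourney prompts with pandas.
# "midjourney_messages.csv" and the "content" column are assumptions
# about the exported file layout; adjust to match the actual dataset.
import pandas as pd

df = pd.read_csv("midjourney_messages.csv")
print(f"{len(df):,} messages collected")
print(df["content"].dropna().sample(5, random_state=0))
```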

The most commonly used phrases in text prompts issued by Midjourney users. See the full dataset on Kaggle: Midjourney User Prompts & Generated Images (250k). Illustration made by the author.

The word cloud above illustrates the most commonly used phrases in the text prompts issued by Midjourney users. Some of these are unexpected, at least for a non-connoisseur. Instead of animals, robots, or other entities we humans find endearing (i.e., content), the terms that make it to the top are modifiers (i.e., descriptors of the style or quality of the desired output). They include application names like Octane Render or Unreal Engine and artist names like Craig Mullins. You can find a more detailed prompt analysis in this notebook. A disclaimer: it is unclear how generalizable these findings are. They might simply reflect the taste of a potentially biased user base, and the phrases might only elicit a strong visual response from the Midjourney model in particular. If you have access to DALL·E 2, let me know whether they have any effect on it!

Top artists mentioned by Midjourney users in their text prompts (the y-axis shows counts within a random subsample of 10k prompts). You can find more statistics in this notebook.
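As a rough illustration (not the exact analysis behind the charts above), phrase frequencies like these can be computed with a simple n-gram count over the scraped prompts. The file name and column name are, again, assumptions about the dataset layout:

```python
# Rough sketch: count frequent 1- to 3-word phrases in the scraped prompts.
# File name and "content" column are assumptions about the Kaggle export.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("midjourney_messages.csv")
prompts = df["content"].dropna().astype(str).str.lower()

vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english", min_df=50)
counts = vectorizer.fit_transform(prompts)
totals = counts.sum(axis=0).A1

top = sorted(zip(vectorizer.get_feature_names_out(), totals),
             key=lambda pair: -pair[1])[:25]
for phrase, count in top:
    print(f"{phrase}: {count}")  # expect modifiers like "octane render" near the top
```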

Admittedly overwhelmed by the complexity of the prompts we observed, we decided to fine-tune GPT-2, a large language model, on these user-generated text prompts. Instead of learning the tricks of the trade by ourselves, we can now rely on it to auto-complete our meager prompts into creative and sophisticated inputs. Our model is freely available on HuggingFace at succinctly/text2image-prompt-generator. Feel free to interact with the demo!
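The model can be queried with a few lines of code via the transformers library; the sampling parameters below are illustrative choices, not the exact settings used in the hosted demo:

```python
# Query the fine-tuned GPT-2 prompt generator hosted on HuggingFace.
# Sampling parameters (max_length, temperature) are illustrative choices.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="succinctly/text2image-prompt-generator")

seed = "a boy typing on a laptop"
for output in generator(seed, max_length=60, num_return_sequences=3,
                        do_sample=True, temperature=0.9):
    print(output["generated_text"])
```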

Sample usage of our prompt autocompletion model, available at succinctly/text2image-prompt-generator. The three images were generated via Midjourney. The illustration itself was made by the author.

Prompting Meets Capitalism

In business, time is money, and so is text prompting.

As DALL·E 2 and competing services like Midjourney are becoming more widely available (the former is currently rolling out to its first million users, while the latter is in open beta), professionals are starting to evaluate the potential of incorporating generative AI into their workflows. Here is, for instance, a Twitter thread from a graphic designer probing DALL·E 2’s ability to create unique mockups:

As text-to-image models enter capitalism (professional design, content marketing, ad creatives), prompting becomes less of an entertaining hobby and more of a job to be completed effectively. In business, time is money, and so is text prompting. Some people are predicting that, similar to other types of manual work, prompt engineering will be offloaded to lower-income countries: workers would be paid ~$10/hour to issue as many queries as possible and select the best visual outputs. However, with OpenAI controversially announcing a credits-based pricing model (which essentially charges per use instead of offering a subscription), users are incentivized to issue as few prompts as possible. So, instead of the brute-force approach above, we might see a new profession emerging: the prompt engineer, a person well-versed in the capabilities and whims of generative AI who can produce the illustration you need in three attempts or fewer.

How Research Will Step In

Prompting doesn’t necessarily need to remain manual labor forever. In fact, when the prompting practice first emerged in the text generation field, researchers studied it extensively. This collection, which might not even be complete, lists 86 papers as of the end of July 2022. Many of these papers propose methods that automatically rephrase the input in a more model-friendly way, embrace redundancy by generating additional tokens that make the task more explicit to the model, produce soft prompts (i.e., modify the internal representation of the original input prompt rather than its text), or design frameworks for interactive sessions in which the model remembers user preferences and feedback over a longer sequence of requests. It is likely that a similar amount of research will go into taming text-to-image models.
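To make the "soft prompt" idea concrete, here is a minimal sketch of prompt tuning with a frozen GPT-2: a handful of trainable embedding vectors is prepended to the input, and only those vectors are updated during training. The model choice and hyperparameters are illustrative assumptions, not taken from any particular paper.

```python
# Minimal sketch of soft prompting (prompt tuning) with a frozen GPT-2.
# Only the `soft_prompt` tensor would be optimized; the model, prompt
# length, and initialization scale are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
for param in model.parameters():  # freeze the pre-trained weights
    param.requires_grad = False

n_virtual_tokens = 8              # length of the learned soft prompt
soft_prompt = torch.nn.Parameter(
    torch.randn(n_virtual_tokens, model.config.n_embd) * 0.02)

def forward_with_soft_prompt(text: str):
    ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.transformer.wte(ids)                 # (1, seq_len, dim)
    inputs = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    return model(inputs_embeds=inputs)

# During training, only the soft prompt would receive gradient updates:
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```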

Acknowledgements

The Midjourney scraping project was completed in collaboration with Gaurav Nemade.

Resources / Links

Midjourney User Prompts & Generated Images (250k): the scraped dataset, available on Kaggle
succinctly/text2image-prompt-generator: the fine-tuned prompt autocompletion model, available on HuggingFace

References

[1] Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space

[2] Peters et al. (2018), Deep contextualized word representations

[3] Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[4] Brown et al. (2020), Language Models are Few-Shot Learners
