How to use VQGAN+CLIP to generate images from a text prompt. A complete, non-technical guide to go from beginner to expert.

Angus Russell
Published in NightCafe
11 min read · Aug 15, 2021


Update August 2022 — the hottest new algorithm on the AI Art scene is Stable Diffusion. Click here to read my Stable Diffusion tutorial.

This image was generated from a text prompt. “Geometric glass city from the future at dusk”.

In this tutorial I’ll show you how to use the state-of-the-art in AI image generation technology — VQGAN and CLIP — to create unique, interesting and in many cases mind-blowing artworks. No technical knowledge required.

I’ll show you two ways to use the technology. The first is Google Colab, an online programming environment (it’s not as scary as it sounds; you don’t need to know how to code). The second is an app called NightCafe Creator (disclaimer: I built the app), which is faster and easier than Colab but does eventually require payment for extended use. Don’t worry, you won’t need to pay anything to complete the tutorial.

Text prompt: “It’s like that drug trip I saw in that movie while I was on a drug trip. Trending on Artstation”. VQGAN+CLIP art on NightCafe Creator.

First, an intro to VQGAN and CLIP

Feel free to jump straight to method 1 or 2 if you’re just here for the tutorial.

VQGAN and CLIP are actually two separate machine learning algorithms that can be used together to generate images from a text prompt. VQGAN is a generative adversarial neural network that is good at generating images that look similar to the ones it was trained on (but it can’t generate from a prompt on its own), and CLIP is another neural network that can judge how well a caption (or prompt) matches an image.
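To make the pairing concrete, here’s a minimal sketch of the feedback loop in Python. The real notebooks use PyTorch models and gradient descent; the two networks are stubbed out here so only the structure shows, and all of the function names are illustrative rather than taken from any actual notebook.

```python
# A minimal, illustrative sketch of the VQGAN+CLIP loop.
# The three functions are stubs standing in for real PyTorch models.

def vqgan_decode(latent):
    # VQGAN: turn a latent code into an image (stub).
    return latent

def clip_similarity(image, prompt):
    # CLIP: score how well the image matches the prompt (stub).
    return 0.0

def gradient_step(latent, score):
    # Nudge the latent code in the direction that raises the CLIP score (stub).
    return latent

latent = [0.0] * 256          # the real thing starts from random noise
prompt = "A dog on the beach"

for iteration in range(300):  # "max_iterations" in the Colab notebook
    image = vqgan_decode(latent)
    score = clip_similarity(image, prompt)
    latent = gradient_step(latent, score)

# After enough iterations, vqgan_decode(latent) is your artwork.
```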

The two algorithms were combined in various forms by AI-generated art enthusiasts like Ryan Murdock and Katherine Crowson. The implementations of VQGAN+CLIP were made public on Google Colab, meaning anyone could run their code to generate their own art. This soon resulted in a viral explosion of people using this technique to create incredible artworks and sharing them on platforms like Twitter and Reddit.

Read on to find out how to do it yourself… Remember, no coding required!

Text prompt: “A colourful cubist painting of a parrot in a cage”. VQGAN+CLIP made with NightCafe Creator.

Method 1. VQGAN+CLIP in Google Colab

Note: Google Colab is designed primarily to be accessed from a computer. If you’re on your phone, you should probably skip to Method 2. NightCafe Creator.

If at any time you feel that Colab is too complicated, jump straight to Method 2. NightCafe Creator.

Google Colaboratory (usually referred to as Colab) is a cloud-based programming environment that allows you to run Python code on servers that have access to GPUs (fast processors originally created for graphics). That last part is important because VQGAN+CLIP (and machine learning in general) takes a lot of processing power. So much so that it’s impractical to run it on a CPU.
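As an aside, if you ever want to confirm that Colab has actually given you a GPU, you can paste this into any code cell (PyTorch comes preinstalled on Colab):

```python
# Quick GPU check for a Colab session. If this prints False, go to
# Runtime -> Change runtime type and select GPU.
import torch

print(torch.cuda.is_available())          # True when a GPU is attached
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
```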

When Katherine Crowson first combined VQGAN and CLIP, she made it public in a Google Colab notebook (a notebook is the name for a program written in Colab) so that anyone could use it. Her original implementation has been copied and tweaked many times since then, so there are lots of different versions that you can use. Here’s a list compiled by Reddit user u/Wiskkey. For this tutorial, we’ll be using this version (go ahead and open it in a new tab).

It will help to understand a bit about how Google Colab works in general. Remember, Colab is a general-purpose online programming environment; it’s not made specifically for making AI art, so some things might seem unnecessary, and the interface can be a bit confusing for newcomers.

Colab notebooks are made up of “cells”. Most cells contain a block of code and can have a text description; once the programmer has written the code, they can hide it and show just the description of what the cell does. You can execute the code in a cell by clicking the “Play” icon.

Cells in Google Colab

So the way that you run a Colab notebook is by running each cell (i.e. clicking play) one after the other. The notebook that we’re using has 9 cells. Follow along with the instructions below to complete your first run. Each instruction is for a single cell, so there are 9 instructions.

  1. The license. This cell doesn’t do anything, and you don’t actually have to run it.
  2. A code cell with a single command — !nvidia-smi. Running this cell just gives you information about the GPU that Colab has assigned to you. You can skip this if you like.
  3. A code cell that starts with !git clone https://github.com/openai/CLIP . This cell downloads and installs some external code packages (like CLIP and the VQGAN code) that the rest of the cells depend on. You have to run this one, but only once per session. This cell will take a while to execute because it’s downloading a lot of code.
  4. A text cell with information about the “models” (different versions of the AI trained on different datasets) that you can download. You can’t run this cell, but you should read it.
  5. Selection of models to download — This cell allows you to choose which models to download by selecting the checkboxes and then clicking the Play button. You have to run this cell, but only once per session unless you want to try a different model. I recommend just checking the “imagenet_16384” box and then clicking Play. This one will take a while, because it’s downloading quite a big file.
  6. Load libraries and variables — This cell just executes some code in the background. Run it and continue.
  7. Settings for this run — This is an important one. This is where you specify your text prompt and some other variables before doing the actual run. For your first go, I recommend just setting a text prompt, setting the width and height to 400, and setting max_iterations to 300; these settings will give you a pretty good result in a relatively short time. Also make sure you choose the vqgan_imagenet_f16_16384 model, which is the one you downloaded in step 5. If you want to try a different model later, you’ll have to choose it in step 5 and run that cell again first to actually download it. Leave the rest of the options at their defaults for now (there’s a sketch of what this cell boils down to just after this list).
  8. Actually do the run — This is the cell that runs VQGAN+CLIP with your chosen parameters. It will print some information as it goes. It will run the algorithm for the number of max_iterations you specified in step 7, and will display a “progress image” every 50 iterations (or whatever you specified in images_interval in step 7). This will take a while to run, because it takes a lot of computational power. When it’s done, it will simply stop, and the last image displayed is your generated image. Note that you can scroll up and down within this cell to see all the images.
  9. Generate a video with the result — This is an optional step you can run after your image has generated. It will create a video out of all the progress images generated in the process of generating your final image.
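For reference, the “Settings for this run” cell (step 7) boils down to a handful of Python variables behind the form fields. Parameter names vary between notebook versions, so treat this as an illustrative sketch rather than the exact code:

```python
# Illustrative sketch of the "Settings for this run" cell (step 7).
# Exact variable names differ between notebook versions.
texts = "Geometric glass city from the future at dusk"  # your text prompt
width = 400
height = 400
model = "vqgan_imagenet_f16_16384"  # must match a model downloaded in step 5
images_interval = 50                # show a progress image every 50 iterations
max_iterations = 300                # more iterations = more detail, more time
init_image = ""                     # optional start image (see below)
target_images = ""                  # optional target image(s)
seed = -1                           # -1 means pick a random seed
```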

A few more things to know

After you’ve run all the cells once, to create something new you will only need to run cells 7 and 8 again (and 9 if you want a video). However, if you want to try selecting a different model in cell 7, you’ll need to first check the corresponding box in cell 5 and then run that cell again.

The notebook allows you to (optionally) use “start” and “target” images. A start image will initialise the algorithm with your image (rather than random pixels) and a target image will act as another prompt in the form of an image, steering the algorithm towards an output that looks like the target. To use the start and target images in cell 7, you first need to click the “files” tab (folder icon) in the left sidebar, and then the “upload to session storage” icon. You can upload an image here, and then enter its filename into the “start image” or “target images” parameters in cell 7.
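In the settings cell, that looks something like this (the filename is hypothetical, and again the parameter names vary between notebook versions):

```python
# After uploading "portrait.jpg" (a hypothetical filename) via the Files
# sidebar, point the settings cell at it:
init_image = "portrait.jpg"     # initialise the canvas with this image
target_images = "portrait.jpg"  # steer the output towards this image
```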

Method 2. VQGAN+CLIP on NightCafe Creator

NightCafe Creator is a web-based app for creating AI-generated art. It started as a neural style transfer app (another form of AI art where one image is reimagined in the style of another — think “a photo of the Eiffel tower in the style of Van Gogh’s Starry Night”), and has recently also added text-to-image art generation which uses VQGAN+CLIP under the hood.

Using NightCafe Creator is much easier than Google Colab, and is also a lot faster. Plus, unlike Colab, it works just as well from your phone.

To start, head to VQGAN+CLIP on NightCafe Creator and click on the main “Start Creating” button. If you’re faced with a choice between “Style transfer” and “Text to image”, click on “Text to image”. Otherwise if you’ve arrived at a form, you’re in the right place already!

The NightCafe Creator interface for creating text-to-image artworks.

From here, the rest is fairly self-explanatory. Start by typing a text prompt into the first text box. Feel free to be creative, or just use something simple like “A dog”.

Next, expand the list of modifiers by clicking “Show more” under the “Add some modifiers” section. Choose one or two and click them. They’ll be added to your prompt in the text box. This part is optional, but you really shouldn’t skip it — keyword modifiers are very powerful and will really improve the final result.

I suggest that on your first go, you leave the rest of the settings unchanged. Resolution and runtime are both set to the minimum by default, and increasing either setting will cost you more credits.

Scroll to the bottom of the form and click the black “Create” button. Your creation will be queued for a short time, and then will start running. It will take about a minute to generate (once it starts running) and you’ll see the progress every 15 seconds or so. You don’t have to wait for it, you can go ahead and start another creation if you like.

A word about the runtime option. “Runtime” refers to the number of times that the algorithm does a loop, or “iterations”. It usually takes about 200 iterations to get an idea of whether your creation will turn out well or not, and 400–1000 iterations for it to get about as good as it’s going to get. For this reason I recommend pretty much always starting with the shortest runtime, and using the “Evolve” button to run it longer if you think it’s going to be a good one. More on the evolve button later.

Using a Start Image

In NightCafe Creator, if you turn on the “Show advanced options” switch at the top of the page, you’ll see an option for “Start image”. Using this allows you to choose an image to kick-start the algorithm (without it, the algorithm starts from randomly generated pixels). There’s a bit of an art to using a start image, because the algorithm will quickly deviate from it towards something totally different. I’ve found that you get the best results by fully describing the content of the image and then adding some modifiers. For example, in this artwork I started with the image on the left, and used the prompt “Old man with a moustache wearing a hat and smoking a cigarette art deco trending on Artstation”. That is, a full description of the initial image, plus the modifiers “art deco” and “trending on Artstation”.

Left: start image. Right: result for “Old man with a moustache wearing a hat and smoking a cigarette art deco trending on Artstation”

Another tip for using start images is to only run them for a small number of iterations, otherwise they tend to deviate further than you’d like from the start image.

How/why do VQGAN+CLIP keyword modifiers work?

Modifiers are just keywords that have been found to have a strong influence on how the AI interprets your prompt. In most cases, using one or more modifiers in your prompt will dramatically improve the resulting image. Here’s an example using the text prompt “A dog on the beach”. The image on the left (without any modifiers) is noticeably worse than the others.

Left to right: “A dog on the beach”, “A dog on the beach Thomas Kinkade”, “A dog on the beach detailed painting”, “A dog on the beach Unreal Engine”.

So why do modifiers have such a dramatic effect? It’s to do with the data that the CLIP network was trained on — millions of image and caption pairs from the internet. CLIP has seen a huge number of images on the internet, and the ones that include the words “Thomas Kinkade” in the caption tend to be nicely textured paintings like those shown in the centre-left image. Likewise, the images that were paired with a caption containing the words “Unreal Engine” tend to look like scenes from a video game (because Unreal Engine is a video game rendering engine).

Thus, when you include modifiers like “Thomas Kinkade” or “Unreal Engine”, CLIP knows that the image should look a certain way. Note that in the examples above, it’s not so much the shapes that are better with modifiers, it’s the finer textures that make it look better.
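You can see this scoring at work for yourself with the open-source CLIP package (the same one the Colab notebook downloads in cell 3). Here’s a minimal sketch, assuming PyTorch and CLIP are installed and that “dog.png” is a hypothetical image in your working directory:

```python
# Score how well different prompts match an image with CLIP.
# Assumes the openai/CLIP package and PyTorch are installed;
# "dog.png" is a hypothetical image file.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.png")).unsqueeze(0).to(device)
prompts = [
    "A dog on the beach",
    "A dog on the beach Thomas Kinkade",
    "A dog on the beach Unreal Engine",
]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for prompt, prob in zip(prompts, probs[0]):
    print(f"{prob:.3f}  {prompt}")
```

Higher-scoring prompts are the ones CLIP considers better captions for the image; VQGAN+CLIP runs that judgement in reverse, nudging the image until your prompt scores highly.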

To see how 100+ different modifiers affected the same base image, check out our VQGAN+CLIP keyword modifier comparison, which was itself inspired by this list of 200+ modifiers and how they affected 4 different base text prompts, by Reddit user u/kingdomakrillic.

Earning Credits on NightCafe Creator

Because running the VQGAN+CLIP algorithm requires so much processing power, NightCafe Creator runs on a credit system. When you sign up, you’re given a small number of free credits, and your credit balance is topped up to 2 credits every day (if you log in that day).

If you run out of credits and want to keep creating, you do have a few options. If you’d like a lot more credits, you can buy them. If you just want a few more credits to keep creating casually, you don’t have to buy them, you can earn them!

If you click/tap on your profile image in the top right, then click on your credit balance, you’ll be taken to the pricing page, which lists how you can get free credits by earning “badges”. Some examples include:

  • Completing your profile
  • Sharing one of your creations on Twitter
  • “Liking” a certain number of other users’ creations
  • Publishing a certain number of your own creations
  • Getting a certain number of likes on your creations

Most badges can only be earned once, but until you’re a big power user, there’s usually a bigger badge to earn! The badge for sharing a creation on Twitter can be earned multiple times.

“Evolving” Creations

Once an image has finished generating on NightCafe Creator, you can use the “Evolve” button to do things like:

  • Run it for longer
  • Increase the resolution
  • Tweak the prompt (e.g. add or remove modifiers)

“Evolve” just duplicates the creation, but importantly it also uses the output of the original creation as the start image for the next. It’s a very useful tool in your arsenal if you really want to explore how the AI works and refine your creations.

Conclusion

Whether you choose to create text-to-image art with Google Colab or NightCafe Creator, you’re now armed not only with the knowledge of how to use these tools, but also with a better understanding of how they work and how to get the best out of them.

Feel free to leave any questions, corrections or comments below. They’re encouraged and cherished!
