
DALL E 2: AI That Can Render Masterpieces from Text!

Apr 8, 2022


Human-level intelligence may be getting closer, and this time the news comes not from Google but from OpenAI. Yes, the researchers have once again beaten their own record for generating realistic images from text with sophisticated neural network techniques, this time through their new AI system, DALL E 2.

Courtesy of Sam Altman's Twitter

Building upon DALL E from 2021, the newer architecture not only produces more accurate, original, and artistic pieces but does so at 4 times the original resolution!

Let us dive deeper into what OpenAI has achieved with this architecture, how it works, and how it follows the legacy set by the GPT series, DALL E, and CLIP, through some questions that you and I both thought of after seeing the buzz around the network.

What is OpenAI?

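OpenAI is an AI research lab whose stated mission is to promote and provide friendly AI that benefits humanity as a whole. It started out as a non-profit and has, since 2019, operated a capped-profit company, OpenAI LP, under the control of its non-profit parent.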


Much like Google's DeepMind, OpenAI earned its reputation through its amazing team and remarkable research and engineering. The GPT models, and GPT-3 in particular, are considered among the most revolutionary NLP models to date. GPT-3, the third-generation language prediction model in the GPT-n series, has 175 billion parameters, and the text it generates can be hard to tell apart from the work of accomplished human writers.

With its "deep"-est roots in Natural Language Processing, the company has lately been trying to carry this success into other fields of automation and intelligence. Some examples are MuseNet for music generation, CLIP for connecting images with text, and Codex, which powers tools like GitHub Copilot.

What is DALL E and did we need a new network?

The first iteration of DALL E is a 12-billion-parameter version of the GPT-3 architecture, tweaked to generate images from text and with roots in Image GPT. Much like a language transformer model, DALL E treats an image as a sequence of tokens and generates those image tokens conditioned on language tokens, such as the words of a caption.

DALL·E 2 is here! It can generate images from text, like "teddy bears working on new AI research on the moon in the 1980s".

It's so fun, and sometimes beautiful. https://t.co/XZmh6WkMAS pic.twitter.com/3zOu30IqCZ
— Sam Altman (@sama) April 6, 2022

DALL E has two parts. The first is a discrete VAE, in the style of VQ-VAE, that encodes the image into 1024 tokens and decodes it back from them. The second is a decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens (256 for the text and 1024 for the image) and models all of them autoregressively, predicting each token from the ones before it. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens.
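To make the token layout concrete, here is a minimal sketch in plain Python. The 256 and 1024 token counts come from the description above; everything else is an assumption for illustration. In particular, it uses one plain causal mask over the whole stream, which is a simplification and not necessarily how the real model masks image-to-image attention.

```python
# Token counts quoted in the text above.
TEXT_TOKENS = 256
IMAGE_TOKENS = 1024
STREAM_LENGTH = TEXT_TOKENS + IMAGE_TOKENS  # 1280 tokens in a single stream


def build_causal_mask(length):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: a plain causal mask, where each position sees itself and
    everything before it. This is a simplification for illustration.
    """
    return [[j <= i for j in range(length)] for i in range(length)]


def image_attends_to_all_text(mask):
    """Check that every image position can see every text position."""
    return all(
        mask[i][j]
        for i in range(TEXT_TOKENS, STREAM_LENGTH)
        for j in range(TEXT_TOKENS)
    )


mask = build_causal_mask(STREAM_LENGTH)
print(STREAM_LENGTH)                    # total stream length
print(image_attends_to_all_text(mask))  # image tokens see all text tokens
print(mask[0][1])                       # the first token cannot look ahead
```

Because the text tokens come first in the stream, even this simple causal mask already lets every image token attend to the full caption, while no token can peek at tokens that have not been generated yet.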

The launch of DALL E itself pointed to the gates the architecture opened for a revolution in art, photorealism, and generative models. The newer DALL E 2 only reinforces this, with higher resolution, greater comprehension, and new features like image inpainting.

Why has DALL E 2 taken your YouTube feed by storm?

The most basic thing DALL E 2 adds on top of its predecessor is the ability to use an image as an input to the model. This not only makes the previous functionality smoother and more efficient, but also enables tasks like image inpainting, advanced image augmentation, and scene touch-ups and generation.


The network's photorealistic abilities include, but are not limited to, adding or removing elements while taking shadows, reflections, and textures into account. These abilities come from a new design built on CLIP and diffusion rather than from a larger GPT-3-style transformer.

DALL E was created by training a neural network on images and their text descriptions. Through deep learning, the architecture is not only able to understand and generate individual objects but also to relate them to each other by understanding the verbs that connect them.

DALL E exists with two major motives. Let us take a look at them:

It can help people express themselves creatively in a way they may not have been able to before.
The accuracy with which the architecture generates images gives us an insight into how machines are capable of interpreting our world and not just regurgitating what is taught to them.

How does DALL E 2 Achieve this?

The architecture, as mentioned before, derives a lot of its functionality from OpenAI's CLIP architecture. CLIP essentially learns to match an image with the caption that describes it best, by embedding images and text in a shared space. DALL E 2 broadly works towards reversing this process to get pictures from captions, and is hence also called the unCLIP architecture.
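The matching idea behind CLIP can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the three-dimensional embeddings are made up, and the helper names `cosine_similarity` and `best_caption` are hypothetical, not part of any OpenAI library. A real system would get its embeddings from trained image and text encoders.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def best_caption(image_embedding, caption_embeddings):
    """Return the index of the closest caption and all similarity scores."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Made-up embeddings: one image and three candidate captions.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

index, scores = best_caption(image_embedding, caption_embeddings)
print(index, scores)
```

Here the second caption wins because its embedding points in almost the same direction as the image embedding. unCLIP runs the other way around: it starts from the caption side and ends with an image.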

The unCLIP architecture encodes the information received from both the image and the text into a condensed representation called a latent code. This latent code goes on to act as the input for the decoder and is used to generate new images. The latent code heavily influences what the decoder produces, so the resulting image can be manipulated by introducing small random changes into the encoding.
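As a rough picture of what "small randomization" of a latent code could look like, here is a sketch in plain Python. It is an assumption-heavy toy: the 8-dimensional latent, the `perturb_latent` helper, and the noise scale are all hypothetical, and the actual system may vary its encodings differently.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def perturb_latent(latent, scale, rng):
    """Add small Gaussian noise to a latent code and renormalize it.

    Assumption: latents are unit-length vectors, and `scale` is the
    standard deviation of the noise added to each coordinate.
    """
    noisy = [x + rng.gauss(0.0, scale) for x in latent]
    return normalize(noisy)


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
latent = normalize([0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.6])
variation = perturb_latent(latent, scale=0.05, rng=rng)

# A small scale keeps the variation close to the original latent,
# so a decoder would be expected to produce a related image.
print(cosine_similarity(latent, variation))
```

Raising `scale` moves the variation further from the original code, which in a real decoder would correspond to images that drift further from the source.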

The key factor that sets the architecture apart is the decoder, which compiles this information into images. The system uses a diffusion decoder: a model that starts with random noise and learns to change this noise iteratively until it becomes an image. It learns to do so by practicing on the opposite process. Random Gaussian noise is applied to an input image until nothing recognizable is left, and the model is trained to work its way back, step by step, to a comprehensible image.
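The noising half of this idea is simple enough to sketch in plain Python. The numbers below are assumptions, not DALL E 2's actual settings: 1000 steps and a linear noise schedule from 0.0001 to 0.02, a common choice in diffusion papers. The "image" is just a short list of pixel values.

```python
import math
import random

T = 1000  # number of noising steps (assumed)

# Assumed linear schedule for the noise added at each step.
BETAS = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# ALPHA_BAR[t] is the fraction of the original signal variance left after step t.
ALPHA_BAR = []
running = 1.0
for beta in BETAS:
    running *= 1.0 - beta
    ALPHA_BAR.append(running)


def add_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(ALPHA_BAR[t])
    sigma = math.sqrt(1.0 - ALPHA_BAR[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps, t):
    """Undo add_noise given the exact noise.

    A trained network would have to predict eps; here we cheat and reuse it.
    """
    signal = math.sqrt(ALPHA_BAR[t])
    sigma = math.sqrt(1.0 - ALPHA_BAR[t])
    return [(x - sigma * e) / signal for x, e in zip(xt, eps)]


rng = random.Random(0)
pixels = [0.2, -0.5, 0.9, 0.0]  # made-up pixel values
noisy, noise = add_noise(pixels, 500, rng)
restored = recover_x0(noisy, noise, 500)
print(ALPHA_BAR[0], ALPHA_BAR[-1])
print(restored)
```

Almost all of the signal survives the first step and almost none survives the last, which is why the final step looks like pure noise. The learning problem is to predict the noise well enough that the restoration works without being handed it.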

Ethical boundaries for such a powerful network

OpenAI itself has taken care to discuss the ethical questions linked to such a powerful tool. Let us go through them:

Preventing Harmful Generations: The team has filtered the training dataset from the get-go to limit the generation of violent, hateful, or adult images. They have also gone the extra mile and used techniques to avoid the photorealistic generation of real individuals' faces, including those of public figures.
Curbing Misuse: On top of the safeguards against generating harmful images, the organization has implemented automated and human monitoring against misuse of the model.
Phased Deployment based on Learning: The deployment of the network is being handled in a very controlled manner, much like that of GPT-3, with access entrusted only to people who pass OpenAI's checks.

However, the existence of such a powerful system raises questions about three major things: privacy and surveillance, bias and discrimination, and perhaps the deepest, most difficult philosophical question of the era, the role of human judgment.


Written By

Aryan Kargwal, Data Evangelist