In the last few months, with OpenAI's DALL-E 2 and DeepMind's Gato, the world has had a taste of the sheer generality and potential of artificial neural networks in creative fields. Joining this long line of complex, exclusive models is Stable Diffusion, one of the most talented open-source AI artists out there.
With open-sourced weights and compute requirements that are friendly to standard hardware, it may just be the eye-catcher for your next personal project! So let us dive deeper into the workings and potential of Stable Diffusion and how you can use it to boost the creative side of your pipeline or startup.
Ever wondered what “Mario eating a croissant in front of the Taj Mahal” may look like? Me neither 🙄…
Stable Diffusion is a machine-learning text-to-image model that generates images from text prompts. Until now, such models (at least at this level of quality) have been controlled by big organizations like OpenAI and Google (with its model Imagen). The restriction is primarily meant to prevent unethical use of the models, but it kind of sucks for genuinely curious people, who are left with limited access.
However, this status quo has been challenged by Stability AI, which publicly released its model, complete with weights and API compatibility, in collaboration with Hugging Face on the 22nd of August 2022. Since then, we have seen an outburst of AI-generated art on big platforms like Instagram and Twitter.
The origins of the network and the API applications can be traced back to the research paper High-Resolution Image Synthesis with Latent Diffusion Models, written by researchers from the CompVis group at LMU Munich and Runway ML. The network is an example of a latent text-to-image diffusion model.
The model was trained on image-text pairs from the LAION dataset. More specifically, it was trained, at considerable compute cost, on 512x512 images drawn from a subset of the larger LAION-5B dataset.
The model architecture has its roots in the original diffusion models from 2015 and adds a twist in the form of Latent Diffusion Models. Rather than running the noising and denoising process directly on the pixels of the image, the model first compresses the image into a lower-dimensional latent space. Once this latent representation has been obtained, the familiar procedure of adding and removing noise is applied there, and the result is finally decoded back into pixel space.
This final decoding maps the denoised latent, which carries the context of the prompt, back to an image, and eventually generates the artistic marvels we have witnessed over the past few weeks.
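To make the latent-space idea concrete, here is a minimal sketch in Python with NumPy. It is not the actual Stable Diffusion code: `toy_encode` is a hypothetical stand-in (plain 8x8 average pooling) for the learned encoder, the linear noise schedule and its constants are assumptions borrowed from common diffusion setups, and the "noise estimate" is simply the true noise, where a trained network would normally supply a prediction.

```python
import numpy as np

def toy_encode(image, factor=8):
    """Hypothetical stand-in for the learned encoder: average-pool factor x factor patches."""
    h, w, c = image.shape
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward (noising) step: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def remove_noise(zt, t, alpha_bar, eps_pred):
    """Invert the forward step given a noise estimate (here we pass the true noise)."""
    return (zt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3))      # fake 512x512 RGB image
z0 = toy_encode(image)                 # compressed latent, 8x smaller per side
alpha_bar = alpha_bar_schedule()
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)
z0_rec = remove_noise(zt, 500, alpha_bar, eps)

print(z0.shape, image.size // z0.size)  # (64, 64, 3) 64
```

In this toy setup the diffusion steps touch 64 times fewer values than they would in pixel space, which is the main reason working in a latent space is cheaper.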
Before discussing the differences between the two, let us take a quick look at DALL-E 2. DALL-E 2 is the second generation of OpenAI's text-to-image generative models, smaller than its predecessor yet arguably better. Check out our article on it to learn more about the network.
Cat and Mouse in front of Pyramids (DALL-E 2 vs Stable Diffusion)
Now let us look at the differences between the two:
| Stable Diffusion | DALL-E 2 |
|---|---|
| Completely open-source code and data. | The idea has been published, but the code and data are not available, and access to the model itself is minimal. |
| Trained on data from LAION-5B, a dataset of about 5.85 billion image-text pairs. | Trained on millions of undisclosed stock images, building on OpenAI's CLIP model. |
| The data is largely uncurated, so the model can generate inappropriate and ethically unacceptable images. | The data and the model are heavily curated and strictly try to avoid ethical issues, which makes it a robust model. |
| Since the model follows a diffusion technique on existing data, it lacks generality in creating things it hasn't seen before, like logos and fonts. | Built on OpenAI's CLIP, the model serves as a conceptual milestone that sets a precedent for upcoming general models. |
The core idea behind open-sourcing the model and network was to let users run the model on their own systems. There is a hardware requirement, however: generating heavier, more intricate images may call for a GPU with at least 6GB of VRAM.
The model is also very portable, in the sense that it can easily be run on hosted notebook services like Google Colab.
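As a rough illustration of what running the model yourself can look like, here is a minimal sketch using the Hugging Face `diffusers` library. It assumes a CUDA GPU, that `torch` and `diffusers` are installed, and that the `CompVis/stable-diffusion-v1-4` checkpoint is the one you want (you may need to accept its license and log in to Hugging Face first). The function name `generate_image`, the output path, and the step and guidance values are illustrative choices, not anything prescribed by Stability AI.

```python
def generate_image(prompt, out_path="output.png",
                   model_id="CompVis/stable-diffusion-v1-4"):
    """Generate one image from a text prompt and save it as a PNG (sketch)."""
    # Imported lazily so that defining the function does not require the GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    # Half precision roughly halves the memory taken by the weights.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    pipe.enable_attention_slicing()  # trades some speed for lower peak VRAM

    # Assumed settings: 50 denoising steps and a guidance scale of 7.5.
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(out_path)
    return out_path

# Example call (downloads the weights on first use):
# generate_image("Mario eating a croissant in front of the Taj Mahal")
```

On a Colab notebook with a GPU runtime, the same function should work after installing the two libraries.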
The model that set the precedent before Stable Diffusion was DALL-E 2, but it came with over 3.5 billion trainable parameters, whereas Stable Diffusion stands at a measly 890 million parameters.
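A quick back-of-the-envelope calculation shows why the parameter count matters for consumer GPUs. The sketch below only counts the memory needed to hold the weights, using the parameter figures quoted above and assuming 4 bytes per parameter in float32 and 2 bytes in float16. Activations and intermediate buffers come on top, so actual VRAM usage is higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts as quoted in this article.
MODELS = {"Stable Diffusion": 890e6, "DALL-E 2": 3.5e9}

for name, params in MODELS.items():
    fp32 = weight_memory_gb(params, 4)  # float32: 4 bytes per parameter
    fp16 = weight_memory_gb(params, 2)  # float16: 2 bytes per parameter
    print(f"{name}: {fp32:.2f} GB in float32, {fp16:.2f} GB in float16")

# Stable Diffusion: 3.56 GB in float32, 1.78 GB in float16
# DALL-E 2: 14.00 GB in float32, 7.00 GB in float16
```

By this estimate, the Stable Diffusion weights fit comfortably on a 6GB card in half precision, while a 3.5-billion-parameter model would not.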
Stability AI has also built a platform where you can easily try out the model and tweak the results without touching the actual code. You can check out DreamStudio for that.
Just as at the end of the DALL-E 2 hype, we are again at a crossroads, pondering the ethical implications of enabling a machine to be creative and where that takes the future of art and artists.
But, that question aside, the model is letting even non-artists discover their artistic side, and it may become an inspiration for more.