A Guide to Open-Source Image Generation Models

March 27, 2024 • Written By Sherlock Xu

In my previous article, I talked about the world of Large Language Models (LLMs), introducing some of the most advanced open-source text generation models released over the past year. However, LLMs are only one of the important players in today’s rapidly evolving AI world. Equally transformative and innovative are the models designed for visual creation, like text-to-image, image-to-image, and image-to-video models. They have opened up new opportunities for creative expression and visual communication, enabling us to generate beautiful visuals, change backgrounds, inpaint missing parts, replicate compositions, and even turn simple scribbles into professional images.

One of the most frequently mentioned names in this field is Stable Diffusion, a series of open-source visual generation models, like Stable Diffusion 1.4, 2.0, and XL, mostly developed by Stability AI. However, in the expansive universe of AI-driven image generation, they represent only a fraction of what is available, and choosing the right model for serving and deployment can quickly become complicated. A quick search on Hugging Face returns over 18,000 text-to-image models alone.

In this blog post, we will provide a curated list of open-source models that stand out for their ability to generate creative visuals. Just like the previous blog post, we will also answer frequently asked questions to help you navigate this exciting yet complex domain, providing insights into using these models in production.

Stable Diffusion

Stable Diffusion (SD) has quickly become a household name in generative AI since its launch in 2022. It is capable of generating photorealistic images from both text and image prompts. You might often hear the term “diffusion models” used together with Stable Diffusion; diffusion is the base AI technology that powers it. Simply put, diffusion models generate images by starting with a pattern of random noise and gradually shaping it into a coherent image through a learned denoising process that reverses the gradual addition of noise. This process is computationally intensive, but Stable Diffusion optimizes it by working in a latent space.

Latent space is like a compact, simplified map of all the possible images that the model can create. Instead of dealing with every tiny detail of an image (which takes a lot of computing power), the model uses this map to find and create new images more efficiently. It's a bit like sketching out the main ideas of a picture before filling in all the details.
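
To make the compression idea concrete, here is a minimal sketch that encodes an image into Stable Diffusion's latent space using the Hugging Face diffusers library. The model ID and the local file name photo.png are assumptions for illustration; the point is simply how much smaller the latent representation is than the pixel representation.

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image

# Load only the VAE from Stable Diffusion 1.5, the component that maps
# between pixel space and latent space.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# "photo.png" is a placeholder for any local image you want to encode.
img = load_image("photo.png").resize((512, 512))
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print(x.shape)        # torch.Size([1, 3, 512, 512])  -> pixel space
print(latents.shape)  # torch.Size([1, 4, 64, 64])    -> latent space, ~48x fewer values
```

A 512x512 RGB image holds roughly 786,000 pixel values, while its latent representation holds only about 16,000, which is why running diffusion in latent space is so much cheaper.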

In addition to static images, Stable Diffusion can also produce videos and animations, making it a comprehensive tool for a variety of creative tasks.

Why should you use Stable Diffusion:

  • Multiple variants: Stable Diffusion comes with a variety of popular base models, such as Stable Diffusion 1.4, 1.5, 2.0, and 2.1, Stable Diffusion XL, Stable Diffusion XL Turbo, and Stable Video Diffusion. According to this evaluation graph, the SDXL base model performs significantly better than the previous variants. Nevertheless, it is not always easy to say which model generates better images than another, as the results can be impacted by various factors, like the prompt, the number of inference steps, and LoRA weights. Some models also have more LoRAs available, which is an important factor when choosing the right model. For beginners, I recommend starting with SD 1.5 or SDXL 1.0. They're user-friendly and rich in features, perfect for exploring without getting into the technical details.
  • Customization and fine-tuning: Stable Diffusion base models can be fine-tuned with as few as five images to generate visuals in specific styles or of particular subjects, enhancing the relevance and uniqueness of generated images. One of my favorites is SDXL-Lightning, built upon Stable Diffusion XL; it is known for its lightning-fast capability to generate high-quality images in just a few steps (1, 2, 4, or 8).
  • Controllable: Stable Diffusion provides you with extensive control over the image generation process. For example, you can adjust the number of steps the model takes during the diffusion process, set the image size, specify the seed for reproducibility, and tweak the guidance scale to influence adherence to the input prompt (see the sketch after this list).
  • Future potential: There's vast potential for integration with animation and video AI systems, promising even more expansive creative possibilities.
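
To illustrate these knobs, here is a minimal text-to-image sketch using the Hugging Face diffusers library with the SDXL base model. It assumes a CUDA GPU and the torch, diffusers, and transformers packages; the prompt and parameter values are only examples, not recommendations.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# A fixed seed makes the result reproducible.
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    prompt="a fluffy calico cat lounging in the afternoon sun by a window",
    negative_prompt="blurry, distorted hands, extra fingers",  # steer away from common artifacts
    num_inference_steps=30,   # number of denoising steps
    guidance_scale=7.5,       # how strongly to follow the prompt
    width=1024,
    height=1024,
    generator=generator,
).images[0]
image.save("cat.png")
```

Fixing the generator seed is what makes a run reproducible; adjusting num_inference_steps and guidance_scale is usually the quickest way to trade speed for fidelity.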

Points to be cautious about:

  • Distortion: Stable Diffusion can sometimes inaccurately render complex details, particularly faces, hands, and legs. These mistakes might not be immediately noticeable. To improve the generated images, you can try to add a negative prompt or use specific fine-tuned versions.
  • Text generation: Stable Diffusion has difficulty understanding and rendering legible text within images, which is not uncommon for image generation models.
  • Legal concerns: Using AI-generated art could pose long-term legal challenges, especially if the training data wasn't thoroughly vetted for copyright issues. This isn’t specific to Stable Diffusion and I will talk more about it in an FAQ later.
  • Similarity risks: Given the data Stable Diffusion was trained on, there's a possibility of generating similar or duplicate results when artists and creators use similar keywords or prompts.

Note: Stable Diffusion 3 was announced last month, but it is currently available only as an early preview.

DeepFloyd IF

DeepFloyd IF is a text-to-image generation model developed by Stability AI and the DeepFloyd research lab. It stands out for its ability to produce images with remarkable photorealism and nuanced language understanding.

DeepFloyd IF's architecture is particularly noteworthy for its approach to diffusion in pixel space. Specifically, it contains a text encoder and three cascaded pixel diffusion modules. Each module plays a unique role in the process: Stage 1 creates a base 64x64 px image, which is then progressively upscaled to 1024x1024 px across Stage 2 and Stage 3. This distinguishes it from latent diffusion models like Stable Diffusion. Pixel-level processing allows DeepFloyd IF to generate or enhance visuals directly, without translating into and out of a compressed latent representation.
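
A condensed sketch of that cascade, adapted from the DeepFloyd IF example in the Hugging Face diffusers documentation, looks roughly like this. It assumes you have accepted the model license on the Hub and have enough VRAM; Stage 3 (the final x4 upscaler) is omitted for brevity.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil

# Stage 1: text-conditioned 64x64 base image in pixel space.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()

# Stage 2: upscales the base image to 256x256, reusing the same prompt embeddings.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

prompt = "a red panda reading a newspaper on a park bench, photorealistic"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images

pt_to_pil(image)[0].save("panda_256.png")
```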

Why should you use DeepFloyd IF:

  • Text understanding: DeepFloyd IF integrates the T5-XXL-1.1 large language model as its text encoder for deep prompt understanding, enabling it to create images that closely match input descriptions.
  • Text rendering: DeepFloyd IF showcases tangible progress in rendering text with better coherence than previous models in the Stable Diffusion series and other text-to-image models. While it has its flaws, DeepFloyd IF marks a significant step forward in the evolution of image generation models in text rendering.
  • High photorealism: DeepFloyd IF achieves an impressive zero-shot FID score of 6.66, which means it is able to create high-quality, photorealistic images. The FID score is used to evaluate the quality of images generated by text-to-image models, and lower scores typically mean better quality.

Points to be cautious about:

  • Content sensitivity: DeepFloyd IF was trained on a subset of the LAION-5B dataset, known for its wide-ranging content, including adult, violent, and sexual themes. Efforts have been made to mitigate the model's exposure to such content, but you should remain cautious and review output if necessary.
  • Bias and cultural representation: The model's training on LAION-2B(en), a dataset with English-centric images and text, introduces a bias towards white and Western cultures, often treating them as defaults. This bias affects the diversity and cultural representation in the model's output.
  • Hardware requirements: You need a GPU with at least 24 GB of VRAM to run all of its variants, making it resource-intensive.

ControlNet

ControlNet can be used to enhance the capabilities of diffusion models like Stable Diffusion, allowing for more precise control over image generation. It operates by duplicating a model's neural network blocks into a "locked" copy and a "trainable" copy: the trainable copy learns the specific conditions you set, while the locked copy preserves the integrity of the original model. This structure allows you to train the model with small datasets without compromising its performance, making it practical to train on personal or small-scale devices.

Why should you use ControlNet:

  • Enhanced control over image generation: ControlNet introduces a higher degree of control by allowing additional conditions, such as edge detection or depth maps, to steer the final image output (see the sketch after this list). This makes ControlNet a good choice when you want to clone image compositions, dictate specific human poses, or produce similar images.
  • Efficient and flexible: The model architecture ensures minimal additional GPU memory requirements, making it suitable even for devices with limited resources.
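
As a concrete example of edge-map conditioning, here is a minimal sketch using the Hugging Face diffusers library with a Canny ControlNet. It assumes the opencv-python, torch, and diffusers packages, a CUDA GPU, and a local file reference.png that stands in for whatever image you want to use as the composition guide.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a Canny-edge ControlNet to a Stable Diffusion 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# "reference.png" is a placeholder for the image whose composition you want to clone.
source = np.array(load_image("reference.png"))
edges = cv2.Canny(source, 100, 200)
edges = np.stack([edges] * 3, axis=-1)  # 1-channel edge map -> 3-channel conditioning image

result = pipe(
    prompt="a futuristic city at dusk, detailed, cinematic lighting",
    image=Image.fromarray(edges),
    num_inference_steps=30,
).images[0]
result.save("controlnet_output.png")
```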

Points to be cautious about:

  • Dependency on Stable Diffusion: ControlNet relies on Stable Diffusion to function. This dependency could affect its usage in environments where Stable Diffusion might not be the preferred choice for image generation. In addition, the limitations of Stable Diffusion mentioned above could also impact the generated images, like distortion and legal concerns.

Animagine XL

Text-to-image AI models hold significant potential for the animation industry. Artists can quickly generate concept art from simple descriptions, allowing for rapid exploration of visual styles and themes. Animagine XL, a series of open-source anime text-to-image generation models, is one of the projects leading innovation in this area. Built upon Stable Diffusion XL, its latest release, Animagine XL 3.1, adopts tag ordering for prompts, which means the sequence of tags in a prompt significantly impacts the output. To ensure the generated results are aligned with your intention, you may need to follow a certain prompt template, as the model was trained this way.
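
As an illustration of that tag ordering, here is a rough sketch with the Hugging Face diffusers library. The model ID and prompt structure follow the Animagine XL 3.1 model card as I understand it; double-check the card for the current recommended template and tags.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.1", torch_dtype=torch.float16
).to("cuda")

# Tag ordering matters: subject count first, then character, series, and
# descriptive tags; the specific tags below are just an example.
prompt = (
    "1girl, souryuu asuka langley, neon genesis evangelion, solo, upper body, "
    "red plugsuit, night city, looking at viewer"
)
negative_prompt = (
    "lowres, bad anatomy, bad hands, text, error, worst quality, "
    "low quality, jpeg artifacts, watermark"
)

image = pipe(
    prompt, negative_prompt=negative_prompt, num_inference_steps=28, guidance_scale=7.0
).images[0]
image.save("animagine.png")
```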

Why should you use Animagine XL:

  • Tailored anime generation: Designed specifically for anime-style image creation, it offers superior quality in this genre. If you are looking for a model to create this type of image, Animagine XL can be the go-to choice.
  • Expanded knowledge base: Animagine XL integrates a large number of anime characters, enhancing the model's familiarity across a broader range of anime styles and themes.

Points to be cautious about:

  • Niche focus: Animagine XL is primarily designed for anime-style images, which might limit its application for broader image generation needs.
  • Learning curve: Mastering tag ordering and prompt interpretation for optimal results may require familiarity with anime genres and styles.

Stable Video Diffusion

Stable Video Diffusion (SVD) is a video generation model from Stability AI, aiming to provide high-quality videos from still images. As mentioned above, this model is a part of Stability AI’s suite of AI tools and represents their first foray into open video model development.

Stable Video Diffusion is capable of generating 14 or 25 frames at customizable frame rates between 3 and 30 frames per second. According to this evaluation graph, SVD was preferred by more human voters for video quality than GEN-2 and PikaLabs.
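
For reference, image-to-video generation with the XT variant can be sketched like this using the Hugging Face diffusers library; the file names and frame settings are illustrative, not recommendations.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# "still.png" is a placeholder for the source image; SVD works best around 1024x576.
image = load_image("still.png").resize((1024, 576))

# 25 frames played back at 7 fps gives a clip of roughly 3.5 seconds.
frames = pipe(image, num_frames=25, fps=7, decode_chunk_size=8).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```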

Stability AI is still working on the model to improve both its safety and quality. The company emphasized that “this model is not intended for real-world or commercial applications at this stage and it is exclusively for research”. That said, it is one of the few open-source video generation models available in this industry. If you just want to play around with it, pay attention to the following:

  • Short video length: The model can only generate short video sequences, with a maximum length of around 4 seconds, limiting the scope for longer narrative or detailed exploration.
  • Motion limitations: Some generated videos may lack dynamic motion, resulting in static scenes or very slow camera movements that might not meet the expectations in certain use cases.
  • Distortion: Stable Video Diffusion may not accurately generate faces and people, often resulting in less detailed or incorrect representations, posing challenges for content focused on human subjects.

Now let’s answer some of the frequently asked questions for open-source image generation models. Questions like “Why should I choose open-source models over commercial ones?” and “What should I consider when deploying models in production?” are already covered in my previous blog post, so I do not list them here.

What is LoRA? What can I do with it and Stable Diffusion?

LoRA, or Low-Rank Adaptation, is an advanced technique designed for fine-tuning machine learning models, including generative models like Stable Diffusion. It works by using a small number of trainable parameters to fine-tune these models on specific tasks or to adapt them to new data. As it significantly reduces the number of parameters that need to be trained, it does not require extensive computational resources.

With LoRA, you can enhance Stable Diffusion models by customizing generated content with specific themes and styles. If you don’t want to create LoRA weights yourself, check out the LoRA resources on Civitai.
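
Loading LoRA weights into a pipeline typically takes only a couple of lines with the Hugging Face diffusers library. The repository and file names below are placeholders rather than real artifacts; substitute a LoRA you downloaded from Civitai or the Hugging Face Hub.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Placeholder repository and file name for illustration only.
pipe.load_lora_weights("some-user/watercolor-style-lora", weight_name="watercolor.safetensors")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    cross_attention_kwargs={"scale": 0.8},  # 0.0 ignores the LoRA, 1.0 applies it fully
).images[0]
image.save("lora_output.png")
```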

How can I create high-quality images?

Creating high-quality images with image generation models involves a blend of creativity, precision, and technical understanding. Some key strategies to improve your outcomes:

  • Be detailed and specific: Use detailed and specific descriptions in your prompt. The more specific you are about the scene, subject, mood, lighting, and style, the more accurately the model can generate your intended image. For example, instead of saying "a cat", input something like "a fluffy calico cat lounging in the afternoon sun by a window with sheer curtains".
  • Layered prompts: Break down complex scenes into layered prompts. First, describe the setting, then the main subjects, followed by details like emotions or specific actions. This helps guide the model toward understanding your prompt.
  • Reference artists or works: Including the names of artists or specific art pieces can help steer the style of the generated image. However, be mindful of copyright considerations and use this approach for inspiration rather than replication.

Should I worry about copyright issues when using image generation models?

The short answer is YES.

Copyright concerns are a significant aspect to consider when using image generation models, not just open-source models but commercial ones as well. There have been lawsuits against companies behind popular image generation models like this one.

Many models are trained on vast datasets that include copyrighted images. This raises questions about the legality of using these images as part of the training process.

Another thing is that determining the copyright ownership of AI-generated images can be complex. If you're planning to use these images commercially, it's important to consider who holds the copyright — the user who inputs the prompt, the creators of the AI model, or neither.

So, what can you do?

At this stage, the best suggestion I can give to someone using these models and the images they create is to stay informed. The legal landscape around AI-generated images is still evolving. Keep abreast of ongoing legal discussions and rulings related to AI and copyright law. Understanding your rights and the legal status of AI-generated images is crucial for using these tools ethically and legally.

What is the difference between deploying LLMs and image generation models in production?

Deploying LLMs and image generation models in production requires similar consideration of factors like scalability and observability, but each also has its own challenges and requirements.

  • Resource requirements: Image generation models, especially high-resolution video or image models, typically demand more computational power and memory than LLMs due to the need to process and generate complex visual data. LLMs, while also resource-intensive, often have more predictable computational and memory usage patterns.
  • Latency and throughput: Image generation tasks can have higher latency due to the processing involved in creating detailed visuals. Optimizing latency and throughput might require different strategies for image models compared to LLMs, such as adjusting model size or using specialized hardware accelerators (GPUs). A minimal serving sketch follows this list.
  • Data sensitivity and privacy: Deploying both types of models in production requires careful data handling and privacy measures. However, image generation models may require additional considerations due to the potential for generating images that include copyrighted elements.
  • User experience: For image generation models, I recommend providing users with guidance on creating effective prompts, which can enhance the quality of generated images. You may also need to design the user interface around the model's response time and output characteristics.
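
To make the serving side concrete, here is a minimal sketch that wraps an SDXL pipeline as an API, assuming BentoML 1.2's service decorators; the class name, resource hints, and default parameters are illustrative rather than a production configuration.

```python
import bentoml
from PIL.Image import Image

# Resource hints, timeout, and defaults are illustrative; tune them for your GPU and workload.
@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class SDXLService:
    def __init__(self) -> None:
        import torch
        from diffusers import StableDiffusionXLPipeline

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
        ).to("cuda")

    @bentoml.api
    def txt2img(
        self, prompt: str, num_inference_steps: int = 30, guidance_scale: float = 7.5
    ) -> Image:
        return self.pipe(
            prompt, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale
        ).images[0]
```

You can then serve it locally with the bentoml serve command and deploy the same service to GPU-backed infrastructure such as BentoCloud.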

Final thoughts

Just like LLMs, choosing the right image generation model requires us to understand each model's strengths and limitations. Each model brings its unique capabilities to the table, supporting different real-world use cases. Currently, I believe the biggest challenge for image generation models lies in ethical and copyright concerns. As we embrace their potential to augment our creative process, it's equally important to use these tools responsibly and respect copyright laws, privacy rights, and ethical guidelines.

More on image generation models

  • If you are looking for a way to deploy diffusion models in production, feel free to try these tutorials.
  • Try BentoCloud and get $30 in free credits on signup! Experience a serverless platform tailored to simplify the building and management of your AI applications, ensuring both ease of use and scalability.
  • Join our Slack community to get help and the latest information on BentoML!