An Icebreaker for Precisely Controlled AI Image Generation: ControlNet and T2I-Adapter

  • What is the breakthrough of ControlNet and T2I-Adapter? What’s the difference?

  • Other related studies imposing conditional guidance on T2I diffusion models

  • What is the actual application effect of ControlNet and T2I-Adapter?

  • In terms of user experience, what is the difference from img2img natively supported by SD?

  • The Potential of ControlNet for Illustration Creation

  • Combination of multiple conditional guides

  • The Potential of ControlNet for 3D and Animation Creation

  • Where can I play for free without installation?

  • Papers and Models

The excitement in the AIGC community has reached a peak these days not seen since the first release of Stable Diffusion last year. The main character is ControlNet, a lightweight pre-trained model built on Stable Diffusion 1.5 that can take the edge features, depth features, or pose-skeleton features of an input image and, together with the text prompt, precisely guide the generated result in SD 1.5.

The picture below, from the demo in the ControlNet paper, uses Canny edge detection to extract the outline of the deer in the input picture and the prompt "a high-quality, detailed, and professional image" to generate 4 result pictures in SD 1.5.
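For readers who want to reproduce this workflow in code rather than through a WebUI, here is a minimal sketch using OpenCV and Hugging Face's diffusers library. The checkpoint names are the community ports of the paper's weights, and the thresholds, file names, and library version are assumptions, not something prescribed by the paper:

```python
# A minimal sketch of the Canny-edge workflow described above, using OpenCV and
# diffusers. Checkpoint names are community ports of the paper's weights.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# 1. Extract Canny edges from the input image (thresholds are a guess; tune per image).
image = cv2.imread("deer.png")
edges = cv2.Canny(image, 100, 200)
edges = np.stack([edges] * 3, axis=-1)           # 1-channel edge map -> 3-channel image
control_image = Image.fromarray(edges)

# 2. Load SD 1.5 with the Canny ControlNet attached.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# 3. Generate: the prompt steers content and style, the edge map pins the layout.
result = pipe(
    "a high-quality, detailed, and professional image",
    image=control_image,
    num_inference_steps=30,
).images[0]
result.save("deer_controlnet.png")
```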

The preprint of ControlNet was released on February 10, and the weights of the pre-trained models and of all the input-condition detectors in the paper were open sourced. The community quickly deployed a trial demo on Hugging Face and packaged it as a plug-in for the Stable Diffusion WebUI.

Six days later, Tencent ARC released a similar solution, T2I-Adapter.

What is the breakthrough of ControlNet and T2I-Adapter? What’s the difference?

Leaving aside the technical details of how the additional modal inputs are incorporated into the diffusion model (which I can't claim to fully follow), the two ideas are broadly very similar. The breakthrough is how to add trainable parameters on top of an existing model so as to control a pre-trained large-scale diffusion model, support additional input conditions, and transfer the effect to new tasks, achieving robust learning even when the training dataset is small.

The framework they establish retains the advantages and capabilities of large models trained on billions of images while offering a fast training path: within acceptable time and compute budgets, the pre-trained weights are reused and the large model is adapted to a specific task with fine-tuning or transfer learning. It balances the ability to handle general problems with the flexibility to meet the user's demand for generation control on specific tasks, while preserving the generative ability of the original model to the greatest extent.
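Conceptually, the recipe both papers follow can be sketched in a few lines of PyTorch: freeze every parameter of the pre-trained diffusion model and hand the optimizer only the newly added control branch. The snippet below is an illustrative sketch under that assumption, not either paper's actual training code; the tiny convolutional "control branch" and the way it is injected are deliberately simplified placeholders.

```python
# Illustrative sketch only (not ControlNet's or T2I-Adapter's real code):
# freeze the pre-trained UNet and train just a small added control branch,
# so the original model's generative ability stays untouched.
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)          # the large pre-trained prior stays frozen

# Hypothetical stand-in for ControlNet's trainable copy / T2I-Adapter's adapter:
# it encodes the condition image (edge map, depth map, ...) into a residual signal.
control_branch = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 4, 3, padding=1),  # 4 = SD latent channels (simplified injection)
)
optimizer = torch.optim.AdamW(control_branch.parameters(), lr=1e-5)

def training_step(batch):
    # batch["condition"] is assumed already resized to the 64x64 latent resolution.
    noise_pred = unet(
        batch["noisy_latents"] + control_branch(batch["condition"]),  # simplified
        batch["timesteps"],
        encoder_hidden_states=batch["text_emb"],
    ).sample
    loss = nn.functional.mse_loss(noise_pred, batch["noise"])
    loss.backward()                  # gradients flow only into the new branch
    optimizer.step()
    optimizer.zero_grad()
    return loss
```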

The frameworks of ControlNet and T2I-Adapter are both flexible and compact: training is fast, costs are low, the number of added parameters is small, and they can easily be inserted into existing text-to-image diffusion models without affecting the original network topology or generative capacity of the large model. At the same time, both are compatible, without retraining, with other image generation models fine-tuned from Stable Diffusion, such as Anything v4.0 (an anime-style SD 1.5 fine-tune).

Under this framework, training a model for a new input condition, such as a new edge or depth detection algorithm, can be as fast as an ordinary fine-tune.

The ControlNet paper mentions that training the Canny Edge model used a corpus of 3 million edge-image-caption pairs and 600 GPU hours on an A100 80G. The Human Pose (pose skeleton) model used a corpus of 80,000 pose-image-caption pairs and 400 GPU hours on an A100 80G.

Training T2I-Adapter took only 2 days on 4 Tesla V100 32G GPUs, covering 3 kinds of guidance conditions: sketch (a corpus of 150,000 images), semantic segmentation map (160,000 images), and keypose (150,000 images).

The difference between the two: the pre-trained models currently provided by ControlNet are more production-ready and support more kinds of conditional guidance (9 categories).

T2I-Adapter, on the other hand, "is more concise and flexible in engineering design and implementation, and easier to integrate and extend" (according to virushuo, who has read its code). In addition, T2I-Adapter supports more than one guidance condition at once: for example, a sketch and a segmentation map can be used as input conditions at the same time, or sketch guidance can be applied inside a masked area (that is, inpainting).

It is also worth mentioning that the first authors of these two papers are young Chinese AI researchers. Lvmin Zhang, the first author of ControlNet, graduated in 2021 and is now a PhD student at Stanford. As a sophomore in 2018 he published a highly cited ACM Transactions on Graphics paper and was regarded as a "genius" in the AI field with independent research ability at the undergraduate level. His best-known previous project is Style2paints, which uses an enhanced residual U-net and an auxiliary-classifier GAN to color grayscale anime line art. As the founder of this small research organization, he has long focused on AI model training, corpus curation, and tool development for anime-style image generation.

Tencent ARC, which released T2I-Adapter, is Tencent’s business group focusing on smart media-related technologies, with vision, audio and natural language processing as its main directions.

Other related studies applying input-condition guidance to T2I diffusion models

Of course, no ML solution appears out of nowhere. In December last year, Google released the paper Sketch-Guided Text-to-Image Diffusion Model, which borrowed the idea of classifier guidance and designed a latent edge predictor framework: on Stable Diffusion's noisy latent vectors, it predicts at each step whether the generation matches the sketch edges detected in the input image. The predictions are then used to guide the generation of the diffusion model.

But the biggest problem with this framework is that the edge guidance (gradient guidance) does not take the text information into account, and there is no interaction between the two. The result of this independent guidance is that the edges of the generated image match the guidance input, but do not fit the corresponding semantic information well.

https://ift.tt/Q42uPNG

Another paper, GLIGEN: Open-Set Grounded Text-to-Image Generation, was released in January this year. It "fine-tunes the Stable Diffusion model with a parameter-efficient idea similar to transformer adapters in the NLP field (that is, freeze the parameters of the existing model and only train the additional components added to it), and successfully makes the SD model refer to bounding-box location information to generate different entities."

https://ift.tt/wkZTAbc

The paper comes with a runnable demo, and the effect has been confirmed. Nakamori, an NLP algorithm engineer on Zhihu, believes this paper demonstrates "the high scalability of existing pre-trained text-to-image models, and the high feasibility of adding control information of various modalities to open-source models for continued training."

demo: https://ift.tt/useA2nf

For a comparison of the results in these three papers, please see his column: https://ift.tt/U9WNq07

What is the actual application effect of ControlNet and T2I-Adapter?

For Stable Diffusion, a single input image can guide generation more accurately and efficiently than a hundred text prompts. And to see the practical effect, a few sets of result pictures are clearer than a thousand lines of text.

(Except for pictures marked with a reference source, all images are raw outputs generated by the author, essentially unselected single-generation results.)

Portrait category:

Input image

ControlNet test: convert the original image into an HED map (Holistically-nested Edge Detection, a deep-learning edge-detection model whose accuracy is much higher than Canny edge detection), and use its edge features for guidance.
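If you want to reproduce this preprocessing step outside the WebUI, the community's controlnet_aux package wraps the paper's annotators. The sketch below assumes that package, the lllyasviel/Annotators weights, and the lllyasviel/sd-controlnet-hed community checkpoint; treat the names and the input file as assumptions.

```python
# Sketch of the HED preprocessing step, paired with the matching ControlNet checkpoint.
import torch
from PIL import Image
from controlnet_aux import HEDdetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
hed_map = hed(Image.open("portrait.png"))       # soft edge map, much richer than Canny

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-hed", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "portrait, half body, wearing a delicate shirt, highly detailed face, "
    "beautiful detail, sharp focus, by Alphonso Mucha",
    image=hed_map,
).images[0]
```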

Prompt: portrait, half body, wearing a delicate shirt, highly detailed face, beautiful detail, sharp focus, by HR Giger

Prompt: portrait, half body, wearing a delicate shirt, highly detailed face, beautiful detail, sharp focus, by an artist whose name I forget

Prompt: portrait, half body, wearing a delicate shirt, highly detailed face, beautiful detail, sharp focus, by Alphonso Mucha

T2I-Adapter test: use Sketch-guided Synthesis to convert the original image into a line draft and capture its edge features for guidance. (The edge detection algorithm chosen by the Adapter is PiDiNet, a lightweight pixel-difference network based on CNNs: https://ift.tt/QnpiSKl)

Prompt: portrait, half body, wearing a delicate shirt, highly detailed face, beautiful detail, sharp focus (same for the following 3 images)

Architecture category:

Input image (Le Corbusier's Villa Savoye)

ControlNet test: convert the original image into Hough lines. (The Hough transform is an algorithm patented in 1962, originally invented to identify complex lines in photos. It is good at detecting straight lines and geometric shapes, which makes it suitable for capturing the edges of architectural images.)
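In the released weights this straight-line condition is implemented with the M-LSD line detector, and the general pattern is always the same: pair a preprocessor with the ControlNet checkpoint trained on that condition. Below is a hedged sketch of that mapping; it assumes the community controlnet_aux package and the lllyasviel/sd-controlnet-* checkpoint ports.

```python
# The general pattern: each conditional guide pairs a preprocessor with the
# ControlNet checkpoint trained on that condition. Names are assumed community ports.
from controlnet_aux import CannyDetector, HEDdetector, MLSDdetector, OpenposeDetector

CONDITIONS = {
    "canny": (CannyDetector(), "lllyasviel/sd-controlnet-canny"),
    "hed": (HEDdetector.from_pretrained("lllyasviel/Annotators"),
            "lllyasviel/sd-controlnet-hed"),
    # straight lines / architecture ("Hough line" in the paper's demo)
    "mlsd": (MLSDdetector.from_pretrained("lllyasviel/Annotators"),
             "lllyasviel/sd-controlnet-mlsd"),
    "openpose": (OpenposeDetector.from_pretrained("lllyasviel/Annotators"),
                 "lllyasviel/sd-controlnet-openpose"),
}

def make_control_image(name, pil_image):
    """Run the chosen annotator; feed the returned map to the pipeline together
    with the matching checkpoint id stored in CONDITIONS[name][1]."""
    detector, _checkpoint = CONDITIONS[name]
    return detector(pil_image)
```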

Prompt: building, super detail, by Giorgio de Chirico

Prompt: building, super detail, by Charles Addams

Prompt: building, super detail, by Alena Aenami

T2I-Adapter test: use Sketch-guided Synthesis to convert the original image into a line draft (I got the aspect ratio wrong).

Prompt: building, super detail, by Giorgio de Chirico (the same below)

Landscape category:

Input image (generated by SD2.0)

ControlNet test: convert the original image into a semantic segmentation map, and use the shape blocks in it for guidance.

Prompt: artwork by Eyvind Earle, stunning city landscape, street view, detailed

Prompt: artwork by John Berkey, stunning city landscape, street view, detailed

Prompt: artwork by Alphonso Mucha, stunning city landscape, street view, detailed

T2I-Adapter test: use Sketch-guided Synthesis to convert the original image into a line draft, and capture its edge features for guidance.

The generation results in SD 1.4 (T2I-Adapter's pre-trained model only supports PLMS sampling, which may affect generation quality):

Prompt: artwork by Eyvind Earle, stunning city landscape, street view, detailed

Prompt: artwork by Eyvind Earle, stunning city landscape, street view, detailed

Human pose skeleton:

Input image

ControlNet test: convert the original image into a human pose skeleton, and use the pose bones in it for guidance.
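A hedged sketch of this pose-guided setup is below. The checkpoint names (including the Anything v4.0 upload) and the input file are assumptions; the point it illustrates is that the same frozen ControlNet plugs into any SD-1.5-based fine-tune without retraining.

```python
# Pose-skeleton guidance: extract an OpenPose skeleton, then reuse the same
# ControlNet with different SD-1.5-based checkpoints. Model IDs are assumed.
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = openpose(Image.open("dancer.png"))   # stick-figure skeleton image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

# The same frozen ControlNet works with the original SD 1.5 ...
sd15 = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# ... and, without retraining, with an anime-style fine-tune such as Anything v4.0
# (repository id assumed; point it at whatever checkpoint you actually use).
anything = StableDiffusionControlNetPipeline.from_pretrained(
    "andite/anything-v4.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

for pipe, tag in [(sd15, "sd15"), (anything, "anything")]:
    pipe("1girl, dancing, full body, detailed", image=pose_map).images[0].save(
        f"pose_{tag}.png"
    )
```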

Generated results in SD 1.5:

Generated results in Anything 4.0:

T2I-Adapter test:

The generated results in SD 1.4 using the same skeleton guide map

Use a hand-drawn sketch as a guide:

The last group is a test that uses a user scribble (sketch) as the guide for generation. I drew a sketch of a paw-nibbling octopus cat.

ControlNet generation results in SD 1.5:

Prompt: Octocat, cat head, cat face, Octopus tentacles, by an artist I forget

Prompt: Octocat, cat head, cat face, Octopus tentacles, by HR Giger

The generation result of T2I-Adapter in SD 1.4:

Prompt: Octocat, cat head, cat face, Octopus tentacles, oil painting

T2I-Adapter supports a matching-strength parameter. The upper picture uses 50% strength and the lower picture 40%, with the same prompt.

The upper image matches the outline of the octopus in the sketch more closely, while the lower image generates tentacles that drift further from it.
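In code, this matching-strength knob corresponds to an adapter conditioning scale. The sketch below is hedged: it assumes a diffusers version that ships T2I-Adapter support and TencentARC's sketch adapter checkpoint for SD 1.4, and the input file name is made up.

```python
# T2I-Adapter sketch guidance with two different matching strengths.
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_sketch_sd14v1", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# The sketch adapter expects a 1-channel sketch image (assumption).
sketch = Image.open("octocat_sketch.png").convert("L")

for strength in (0.5, 0.4):                      # 50% vs. 40% matching strength
    image = pipe(
        "Octocat, cat head, cat face, Octopus tentacles, oil painting",
        image=sketch,
        adapter_conditioning_scale=strength,     # lower = looser fit to the sketch
    ).images[0]
    image.save(f"octocat_{int(strength * 100)}.png")
```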

Depth-based guidance

In addition to the three basic input conditions of edge detection, sketch, and pose skeleton, ControlNet also supports another very useful guide: depth.

Input image:

In ControlNet, the original image is converted into a normal map (a technique for simulating the lighting of surface bumps, an implementation of bump mapping). Compared with the depth-map model, the normal-map model seems a little better at preserving details.
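A hedged sketch of the depth and normal-map variants follows the same pattern as the earlier examples. MiDaS-based depth estimation via controlnet_aux is used here as an assumed stand-in for the paper's depth annotator, and the normal-map checkpoint name simply follows the same community naming convention.

```python
# Depth- / normal-based guidance: estimate the map, pick the matching ControlNet.
import torch
from PIL import Image
from controlnet_aux import MidasDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

midas = MidasDetector.from_pretrained("lllyasviel/Annotators")
depth_map = midas(Image.open("medusa_input.png"))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth",            # or "lllyasviel/sd-controlnet-normal"
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "by HR Giger, portrait of Snake hair Medusa, snake hair, realistic wild eyes, "
    "evil, angry, black and white, detailed, high contrast, sharp edge",
    image=depth_map,
).images[0]
```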

Prompt: by HR Giger, portrait of Snake hair Medusa, snake hair, realistic wild eyes, evil, angry, black and white, detailed, high contrast, sharp edge

by Alberto Seveso, portrait of Snake hair Medusa, snake hair, photography realistic, beautiful eyes and face, evil, black and white, detailed, high contrast, sharp edge, studio light

by Alphonso Mucha, portrait of Snake hair Medusa, snake hair, beautiful eyes and face of a young girl, peaceful, calm face, black and white, detailed, high contrast, sharp edge

In terms of user experience, what is the difference between the above guidance controls and the img2img natively supported by SD?

The picture below is a rough sketch I painted in 5 minutes. Used as the input image, with Canny edge detection in ControlNet as the input condition, it produced the 3 results below.

Prompt: a deer standing on the end of a road, super details, by Alice Nee

a deer standing on the end of a road, super details, by C215

a deer standing on the end of a road, super details, by Canaletto

In the sketch, the edges of the deer's hind legs were not well defined, but with the text prompt, all the resulting images restored a well-structured deer.

The following pictures are results generated with img2img guidance. Comparing the input image and the results, it is easy to see that the guidance img2img takes from the input image is mainly the noise distribution, which affects composition and color; the shapes (edges) of the generated objects do not fit the input image well (the antlers are particularly noticeable).

Prompt: a vibrant digital illustration of a deer standing on the end of a road, top of mountain, Twilight, Huge antlers like tree branches, giant moon, art by James jean, exquisite details, low poly, isometric art, 3D art, high detail, concept art, sharp focus, ethereal lighting

Generated results in SD 1.5

Generated results in SD 2.0

The denoising (noise) strength parameter (0.0 – 1.0) of img2img determines how closely the result follows the input image: the lower the value, the closer the result stays to the input. If you want a shape that fits the input image more closely, you have to sacrifice some of the diffusion model's "generative ability"; the color and composition of the guide image, however, are preserved throughout.

Input Image

Output: Noise Strength Parameter: 0.8

Output: Noise Strength Parameter: 0.5
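The same comparison can be reproduced in code. Below is a hedged sketch using diffusers' img2img pipeline (where the knob is called strength, i.e. denoising strength), running the two values shown above; file names are assumptions.

```python
# img2img: only the denoising-strength knob controls how far the result may
# drift from the input image; there is no separate edge or pose constraint.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("deer_sketch.png").convert("RGB").resize((512, 512))
prompt = (
    "a vibrant digital illustration of a deer standing on the end of a road, "
    "top of mountain, Twilight, Huge antlers like tree branches, giant moon, "
    "art by James jean, exquisite details, sharp focus"
)

for strength in (0.8, 0.5):                      # higher = more freedom, less fidelity
    image = pipe(prompt, image=init_image, strength=strength).images[0]
    image.save(f"img2img_strength_{strength}.png")
```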

The Potential of ControlNet Conditional Guidance for Image Creation

Below is a series of experiments and explorations from the community using ControlNet to guide AI for creative generation.

Use a pose reference tool to generate the guide image and precisely control the perspective and pose of the generated character. This is almost impossible to do with a text prompt alone.

https://ift.tt/3U8eyHa

Another example: using a pose reference tool (the MagicPoser app) to produce the guide map, then generating with the SD fine-tune model Realistic Vision.

https://ift.tt/6L1zEPp

https://ift.tt/R9F0cBV

Use the depth-map guide in ControlNet to precisely control perspective and scenes.

ControlNet's Depth is so much fun!! #Aiart pic.twitter.com/dVlyAYkXI2

— jurai (@cambri_ai) February 16, 2023

pic.twitter.com/fA4jk4QfVF

— toyxyz (@toyxyz3) February 14, 2023

Use the human pose guide to control the generation of scenes with multiple characters

This evening I’ve installed #ControlNet as an extension for #automatic1111
These images use OpenPose to duplicate the pose. Still working on using with 2.1 but I’m enjoying seeing what it can do.
Details here: https://t.co/Vh5J0u8vPG #stablediffusion #aiart pic.twitter.com/wlrRYCUCLQ

— TomLikesRobots (@TomLikesRobots) February 16, 2023

Japanese Twitter user @toyxyz3 has done a series of valuable pose-skeleton-guided experiments.

After removing part of the limbs from the pose skeleton, ControlNet treats the missing limbs as occluded and the head as turned to a side angle during guided generation (auxiliary prompt guidance may be required).

Changing the scale of the limbs in the pose skeleton is handled by ControlNet as foreshortened perspective during guided generation.

Changing the head size in the pose skeleton leads ControlNet to render the characters as different ages (or chibi versions) during guided generation.

Change the number of limbs in the pose skeleton... and ControlNet will handle it as, uh, an Orc during guided generation.

@toyxyz3 also tested whether a larger number of characters can be reasonably accommodated in the frame.

Combination of multiple conditional guides

Although ControlNet does not natively support multiple input conditions, with some manual post-processing we can already see its application potential.

Use two guiding conditions to generate characters and scenes separately

Characters are guided by the pose skeleton and scenes by a depth map, generated separately and then composited. Separate guidance works better and allows more flexibility in creative design. (The figure needs to be cut out before compositing; also don't forget to add a cast shadow for the figure.)

Simultaneously use different guide maps to cover and meet two control needs

Reddit user Ne_Nel uses two guide maps at the same time (an SD generation tool that supports two input images is required): one for the ControlNet guide, and one, after coloring, for the img2img guide, so that the object outline and the color/shading of the result can be controlled simultaneously.

https://ift.tt/xAktXPf

This is also a guidance method I very much hope to have: reading both edge and color guidance conditions from the input image at the same time. Based on the frameworks of ControlNet and T2I-Adapter, perhaps we will soon see such a new guidance model being trained.
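Something close to this is already possible in a single pass with diffusers' ControlNet img2img pipeline, where an edge map constrains shape while the colored init image supplies color and shading. The sketch below is hedged: it assumes a diffusers version that ships StableDiffusionControlNetImg2ImgPipeline, and the input file names are placeholders.

```python
# Hedged sketch: combine an edge-map ControlNet (shape) with an img2img init
# image (color/shading) in one pass.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("lineart_canny.png")       # controls outlines
color_rough = Image.open("color_rough.png")      # rough color blocking, controls palette

image = pipe(
    "a vibrant character illustration, detailed, sharp focus",
    image=color_rough,                           # img2img init image
    control_image=edge_map,                      # ControlNet condition
    strength=0.7,                                # how freely the colors may be repainted
).images[0]
```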

In the following experiment, @toyxyz3 also tested whether ControlNet can pick up depth or color information when reading the segments of a semantic segmentation map (it can't).

But the next day, the community discovered a useful property of semantic segmentation. Semantic segmentation is a class of deep-learning algorithms, and the word "semantic" in the name matters: these algorithms associate a label or category with every pixel in an image and are used to identify the sets of pixels that form different classes. Common applications include autonomous driving, medical imaging, and industrial inspection, for example helping self-driving cars recognize vehicles, pedestrians, traffic signs, and road surfaces. Each label has a corresponding marker color.

From ControlNet's paper we know that its segmentation-map model follows the ADE20K protocol, and ADE20K publishes the color codes it uses to label different semantic segments.

https://ift.tt/VGLdCT8

This means that when designing a segmentation guide map, the creator can use the color codes in reverse: paint a segment with the color that ADE20K assigns to the semantics you want. For example, if ADE20K labels "clock" with grass green, painting a shape block in the background grass green makes it more likely that the block will be steered toward generating a clock, even if the block doesn't match the usual round shape of a clock.

I have to say, the hacking ability of Stable Diffusion players is really impressive.

The Google Doc link and the picture below are the color codes used by ADE20K for labeling.

https://ift.tt/IbU913q
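As a hedged sketch of this trick: hand-build a segmentation guide whose block colors reuse ADE20K's own class colors, then feed it to the seg ControlNet. The RGB values below are placeholders, not the real ADE20K codes; look up the actual triplets in the color table linked above.

```python
# Hedged sketch: a hand-built segmentation guide map using ADE20K-style class
# colors. CLOCK_COLOR / WALL_COLOR are placeholders; substitute the real RGB
# triplets from the ADE20K color table.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

CLOCK_COLOR = (120, 200, 120)        # PLACEHOLDER, not the real ADE20K value
WALL_COLOR = (120, 120, 120)         # PLACEHOLDER

seg = Image.new("RGB", (512, 512), WALL_COLOR)
draw = ImageDraw.Draw(seg)
# A deliberately non-round block that we want the model to read as a clock.
draw.rectangle([180, 120, 330, 300], fill=CLOCK_COLOR)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe("an interior wall with a clock, detailed", image=seg).images[0]
```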

The Potential of ControlNet Conditional Guidance for 3D and Animation Creation

Combining Blender with ControlNet for 3D creation

Create a 3D model in Blender, export a still image as the input image, use ControlNet's depth detection to generate an image, and paste the result back onto the original model in Blender as a texture. Bingo! Although it will be rough for complex surfaces such as the human body, it should be very practical for simple geometry such as packaging boxes or buildings.

Using ControlNet's Normal mode, I converted a 3D model into an illustration / anime style and then pasted the result back onto the original 3D model as a texture.

Normal mode reflects the fine structure well, so the texturing comes out quite accurately; with some adjustment it could probably even follow shape keys. #aiart #b3d pic.twitter.com/7cF7Qqp8PX

— TDS (@TDS_95514874) February 16, 2023

Combining Blender with ControlNet for animation

After building the 3D model in Blender, mark each part with a different color, then export the animation sequence and feed it to ControlNet as the segmentation-map condition. The generated animation has better stability and consistency in the structure of each part, which is especially useful for motions where body parts occlude each other.

Color-coding the material of each part before rendering, then running i2i with ControlNet's Segmentation, raises the precision considerably.

The effect seems especially strong where parts intersect and for subtle facial angles. #aiart #b3d pic.twitter.com/boT7Ht4QG3

— TDS (@TDS_95514874) February 18, 2023

Create an animation using a combination of two input guides

Character actions are guided by the pose skeleton and the scene by a depth map, generated separately and then composited. Although this is not true text-to-animation generation, the method already achieves better results than before: less glitching (the sense of skipped frames), more fluid character movement, and a more stable background.

ControlNet Pose&depth2image animation test #stablediffusion #AIイラスト#depth2image #pose2image pic.twitter.com/YWoUi5IVWv

— toyxyz (@toyxyz3) February 19, 2023

Where to play with ControlNet and T2I-Adapter for free, without installation

https://ift.tt/SpabBc3

  • ControlNet + Anything v4.0

https://ift.tt/luMHWY5

https://ift.tt/pd395Dv

  • Integrated into Stable Diffusion WebUI

1. Update the WebUI to the latest version, download or install the extension from https://ift.tt/aDEqGC4, and put it in the WebUI's extensions folder

2. Download the files from https://ift.tt/tiC8TSW and put them in the ckpts directory under annotator inside the plug-in directory

3. Download the models from https://ift.tt/wrmeo16 (700M) or https://ift.tt/PIk9qyz (5.7G) and put them in the models directory inside the plug-in directory
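If you prefer to script the downloads rather than click through the links, here is a hedged sketch using huggingface_hub. The repository layout and file names are assumptions based on the lllyasviel/ControlNet Hugging Face repo, and the extension path assumes the common sd-webui-controlnet plug-in; adjust both to your own install.

```python
# Hedged sketch: fetch one ControlNet weight and one annotator checkpoint into
# the WebUI extension's folders. Repo layout / file names are assumptions.
from pathlib import Path
from huggingface_hub import hf_hub_download

EXT_DIR = Path("stable-diffusion-webui/extensions/sd-webui-controlnet")

# One of the pre-trained ControlNet models (canny here), into <ext>/models
hf_hub_download(
    repo_id="lllyasviel/ControlNet",
    filename="models/control_sd15_canny.pth",
    local_dir=EXT_DIR,                 # requires a recent huggingface_hub version
)

# One of the annotator checkpoints, into <ext>/annotator/ckpts
hf_hub_download(
    repo_id="lllyasviel/ControlNet",
    filename="annotator/ckpts/body_pose_model.pth",
    local_dir=EXT_DIR,
)
```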

It is foreseeable that there will be an explosion of plug-ins, APIs, and specialized tools that integrate similar guidance controls, for example:

https://ift.tt/ZbtfjlX

https://ift.tt/hSZsGKN

https://ift.tt/N7yIvja

Papers and Models

  • Adding Conditional Control to Text-to-Image Diffusion Models

https://ift.tt/8dLySeA

https://ift.tt/7PLhNid

  • T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

https://ift.tt/CB3g9xl

https://ift.tt/s7gquxH

Have fun, everyone!

Greetings from the accidentally acquired octopus cat.

I just released lib.KALOS.art, an AIGC artist style library. Our small team of 4 has been busy with it for 4 weeks.

– Currently the largest of its kind in the world, with more than 30,000 four-in-one style images from 1,300+ artists,

– Covers three mainstream image generation models

– 8~11 common themes are generated for each artist, such as portraits, landscapes, science fiction, street scenes, animals, flowers, etc.

The combination of artists and various themes will bring many unexpected results

A postmodern stage designer drawing wasteland sci-fi scenes? Or a Cubist sculptor painting a cat?

Following habitual thinking, using portrait painters to generate portraits and landscape painters to generate landscapes actually limits the creativity and possibility of AI models. I hope lib.kalos.art helps you discover the potential of AIGC and find more creative inspiration.

Click to read the original text and visit the latest and most complete AIGC art style database

The text and pictures in this article are from mysterious programmers

This article is transferred from https://www.techug.com/post/the-ice-breaking-scheme-for-precise-control-of-ai-image-generation-controlnet-and-t2i-adap8bc362e825d640c9ecd0/