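Recently 硅星人 has reported on AI image generation many times, covering well-known products such as DALL·E, Midjourney, DALL·E mini (now called Craiyon), Imagen, and TikTok's AI green screen.
In fact, Stable Diffusion has strong generative capabilities and a wide range of possible uses. The model can run directly on consumer GPUs, and generation is quite fast. Because it is free and open, AI image generation models no longer have to be a toy for a handful of industry insiders.
In a field crowded with strong players and big companies, Stability AI, the "mysterious" organization behind Stable Diffusion, comes across like a reclusive monk. Its founder is not especially famous, and its founding story and funding details are not public. Add the charitable act of open-sourcing Stable Diffusion for free, and people have become even more curious about this mysterious AI research organization.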
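The project has two development leads: Patrick Esser of Runway, a startup working on AI video editing, and Robin Rombach of the machine vision and learning group at the University of Munich. The technical foundation comes mainly from the Latent Diffusion Model research that the two of them published together at the computer vision conference CVPR 2022.
For training, the model used a cluster of 4,000 A100 GPUs for about a month. The training data comes from LAION-Aesthetics, an aesthetics-focused subset from the Large-scale Artificial Intelligence Open Network (LAION) project, whose LAION-5B dataset contains nearly 5.9 billion image-text pairs.
Although training demands a huge amount of compute, Stable Diffusion is quite approachable to use: it runs on ordinary GPUs, and even with less than 10GB of VRAM it can still produce high-resolution images within a few seconds.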
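A diffusion model is trained to predict how to slightly denoise a sample at each step, so that after a number of iterations a result is obtained. Diffusion models have already been applied to a variety of generation tasks, such as image, speech, 3D shape, and graph synthesis.
A diffusion model consists of two steps: a forward process that gradually adds noise to the data, and a learned reverse process that removes that noise step by step.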
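The two directions of a diffusion model, adding noise and removing it, can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions: the helper names add_noise and remove_noise and the value alpha_bar_t = 0.5 are made up for illustration, and the "noise prediction" is simply the true noise, whereas a real model would estimate it with a neural network.

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    """Forward direction: blend the clean sample with Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise

def remove_noise(x_t, predicted_noise, alpha_bar_t):
    """Reverse direction: recover the clean sample from a noise estimate."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))   # hypothetical toy "image"
x_t, noise = add_noise(x0, alpha_bar_t=0.5, rng=rng)

# Assumption: we cheat and pass the true noise as the "prediction".
x0_hat = remove_noise(x_t, noise, alpha_bar_t=0.5)
print(np.allclose(x0, x0_hat))
```

The better the noise estimate, the closer the recovered sample is to the original; in practice the reverse direction is applied over many small steps rather than in one jump.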
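This process is actually quite cumbersome, and for exactly that reason Stable Diffusion builds its diffusion model in a more efficient way, as follows (taken from the model's paper):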
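The reason I separate the v1.1 environment from the later v1.4 environment is that the v1.1 repository seems to serve only as a test: it does not contain the complete v1.4 code, and both the model weights and the installation difficulty are much smaller.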
- sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
- sd-v1-2.ckpt: Resumed from sd-v1-1.ckpt. 515k steps at resolution 512x512 on laion-aesthetics v2 5+ (a subset of laion2B-en with estimated aesthetics score > 5.0, and additionally filtered to images with an original size >= 512x512, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using the LAION-Aesthetics Predictor V2).
- sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
- sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
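As a quick sanity check on the list, here is a small sketch of my own that adds up the training steps each checkpoint has accumulated, following the "Resumed from" relationships. The dictionary layout and the function name total_ksteps are hypothetical; only the step counts and parent checkpoints come from the list.

```python
# (parent checkpoint, thousands of steps added in each training stage),
# copied from the checkpoint list; the structure itself is illustrative.
CHECKPOINTS = {
    "sd-v1-1": (None, [237, 194]),
    "sd-v1-2": ("sd-v1-1", [515]),
    "sd-v1-3": ("sd-v1-2", [195]),
    "sd-v1-4": ("sd-v1-2", [225]),
}

def total_ksteps(name):
    """Sum this checkpoint's own steps plus everything inherited from its parent."""
    parent, stages = CHECKPOINTS[name]
    inherited = total_ksteps(parent) if parent else 0
    return inherited + sum(stages)

for name in CHECKPOINTS:
    print(name, total_ksteps(name), "k steps")
```

Note that sd-v1-3 and sd-v1-4 both branch from sd-v1-2, so v1-4 does not include the 195k steps of v1-3.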
The checkpoint list above comes from GitHub. In short, sd-v1-1.ckpt is about 1.3G, sd-v1-4.ckpt is 4G, and full-v1.4 is 7.4G, so let's start with installing the v1.1 environment.
pip install --upgrade diffusers transformers scipy
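That's right, just one line. The v1.1 environment is only a simplified version of v1.4; v1.4 is the full version.
The v1.4 environment has more problems. Access to overseas sites is slow, and some packages really are hard to install, so a proxy would probably speed things up a lot. Since I am working on a server, here are my notes on the pitfalls I ran into.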
git clone https://github.com/CompVis/stable-diffusion.git
conda env create -f environment.yaml
conda activate ldm
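The problem is mainly in the second step, where downloads are very slow. Here are a few workarounds. The channels the author set in the yaml are pytorch and the conda defaults, but without a proxy that is clearly not only slow but also much more likely to time out. Consider changing the channel addresses to: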
name: ldm
channels:
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
# - defaults
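I don't know whether it was only my setup, but I hit the error "Solving environment: failed" together with ResolvePackageNotFound. The packages that still failed were CLIP and taming-transformers; nothing else caused further problems.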
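Those last two packages failed with "error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.", and the error output also included "note: This error originates from a subprocess, and is likely not a problem with pip." The version that worked was pip==20.3: after downgrading pip to it, the installation succeeded.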
First, to download the Stable Diffusion models you must go to huggingface and accept the download agreement. The links are:
stable-diffusion-v1-1:
https://huggingface.co/CompVis/stable-diffusion-v1-1
stable-diffusion-v1-4:
https://huggingface.co/CompVis/stable-diffusion-v1-4
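When you open either of these pages, the agreement pops up first. Roughly, it says no commercial use, nothing illegal, and so on. That said, after reading the 量子位 (QbitAI) article "Stable Diffusion Got So Popular That Artists Collectively Reported It, and a Netizen's Explanation of the Mechanism Was Liked by LeCun", my feeling is that companies that want to use it commercially will still do so under a different skin, because it is just too popular? emmm... Back to the topic: only after clicking to accept the agreement can you download the model on the server.
On the server, enter: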
huggingface-cli login
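A login prompt will appear. Go to User Access Tokens in your Hugging Face account settings, copy a token, and paste it into the prompt to log in. If you don't have a User Access Token yet, create one first. Once you have logged in with the token, you can test the model.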
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-1"
device = "cuda"

# requires a prior `huggingface-cli login`
pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
pipe = pipe.to(device)

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    # the diffusers release used here returns a dict with a "sample" key;
    # newer releases expose the images as `.images` instead
    image = pipe(prompt, guidance_scale=7.5)["sample"][0]

image.save("astronaut_rides_horse.png")
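If nothing goes wrong, progress bars for the model download will scroll by. I won't demonstrate it again: although this model is only 1.3G, my connection is a bit slow, and after downloading v1.4 my patience was already wearing thin...
Of course, that is only the most basic way to download the model. There are other options for downloading different weights: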
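The guidance_scale=7.5 argument in the text-to-image call controls classifier-free guidance. The formula below is the one that appears in the img2img pipeline source later in this post; the function name apply_guidance and the two-element toy vectors are my own illustrative assumptions.

```python
import numpy as np

def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Push the prediction away from the unconditional one, toward the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Hypothetical noise predictions for an empty prompt and for the real prompt.
noise_uncond = np.array([0.0, 1.0])
noise_text = np.array([1.0, 1.0])

print(apply_guidance(noise_uncond, noise_text, 1.0))   # scale 1 returns the text prediction unchanged
print(apply_guidance(noise_uncond, noise_text, 7.5))   # larger scale exaggerates the difference
```

Where the two predictions agree (the second component), the scale has no effect; it only amplifies what the prompt changes.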
# If you are limited by GPU memory and have less than 10GB of GPU RAM available,
# make sure to load the StableDiffusionPipeline in float16 precision instead of
# the default float32 precision used above.
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-1"
device = "cuda"

pipe = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, revision="fp16", use_auth_token=True
)
pipe = pipe.to(device)

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt, guidance_scale=7.5)["sample"][0]

image.save("astronaut_rides_horse.png")

# To swap out the noise scheduler, pass it to from_pretrained:
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

model_id = "CompVis/stable-diffusion-v1-1"
# Use the K-LMS scheduler here instead
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt, guidance_scale=7.5)["sample"][0]

image.save("astronaut_rides_horse.png")
Finally, if your connection is really too slow, you can download the weights directly from the web page:
https://huggingface.co/CompVis/stable-diffusion-v-1-1-original
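The K-LMS scheduler shown earlier is constructed with beta_start=0.00085, beta_end=0.012 and beta_schedule="scaled_linear". As far as I understand it, "scaled_linear" means the betas are linear in their square root and then squared. The sketch below reproduces that in NumPy; the function name scaled_linear_betas is hypothetical, and this is an illustration rather than the library's own code.

```python
import numpy as np

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Interpolate linearly between sqrt(beta_start) and sqrt(beta_end), then square."""
    return np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_train_timesteps) ** 2

betas = scaled_linear_betas()
# Cumulative product of (1 - beta): how much of the original signal survives at each step.
alphas_cumprod = np.cumprod(1.0 - betas)

print(betas[0], betas[-1])
print(alphas_cumprod[0], alphas_cumprod[-1])
```

The surviving signal fraction shrinks monotonically, so by the last training timestep the sample is almost pure noise.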
As with v1.1, the first step for v1.4 is downloading the model. Again there are many options, and I won't list them one by one:
# make sure you're logged in with `huggingface-cli login`
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]

image.save("astronaut_rides_horse.png")

# device = "cuda"
# model_path = "CompVis/stable-diffusion-v1-4"
#
# # Using DDIMScheduler as an example, this also works with PNDMScheduler
# # uncomment this line if you want to use it.
#
# # scheduler = PNDMScheduler.from_config(model_path, subfolder="scheduler", use_auth_token=True)
#
# scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
# pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
#     model_path,
#     scheduler=scheduler,
#     revision="fp16",
#     torch_dtype=torch.float16,
#     use_auth_token=True
# ).to(device)
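Here I used the first download method, which defaults to 32-bit precision, and left the other parameters alone. It means downloading a bit more than 4G of model weights:
Whichever way you choose, as long as it works it's fine. Next we can test the text-to-image example. I wrote two prompts of my own, and I also borrowed the prompt and run command from the post "模型方法–Stable Diffusion", because it looked very thorough. The example is: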
python txt2img.py --prompt "Asia girl, glossy eyes, face, long hair, fantasy, elegant, highly detailed, digital painting, artstation, concept art, smooth, illustration, renaissance, flowy, melting, round moons, rich clouds, very detailed, volumetric light, mist, fine art, textured oil over canvas, epic fantasy art, very colorful, ornate intricate scales, fractal gems, 8 k, hyper realistic, high contrast"
--plms
--outdir ./output/
--ckpt ./models/sd-v1-4.ckpt
--ddim_steps 100
--H 512
--W 512
--seed 8
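The parameters are wrapped onto separate lines here for readability; remove the line breaks if you run the command directly. See GitHub for what each parameter means; none of them is hard to set. Once it starts running in the terminal, it also needs to download a HardNet model:
Two more prompts that I wrote casually: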
prompt = "women, pink hair, ArtStation, on the ground, open jacket, video game art, digital painting, digital art, video game girls, sitting, game art, artwork"
prompt = "fantasy art, women, ArtStation, fantasy girl, artwork, closed eyes, long hair. 4K, Alec Tucker, pipes, fantasy city, fantasy art, ArtStation"
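It looks like something strange got mixed in? emmm, I don't know why it came out like that either...
That was the text-to-image use case. The other one is image + text to image, which is launched like this: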
python img2img.py --prompt "magic fashion girl portrait, glossy eyes, face, long hair, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, renaissance, flowy, melting, round moons, rich clouds, very detailed, volumetric light, mist, fine art, textured oil over canvas, epic fantasy art, very colorful, ornate intricate scales, fractal gems, 8 k, hyper realistic, high contrast"
--init-img ./ceshi/33.jpg
--strength 0.8
--outdir ./output/
--ckpt ./models/sd-v1-4.ckpt
--ddim_steps 100
I thought the demo would wrap up smoothly here, but sadly the GPU ran out of resources. The card was just a few GB short (PS: in other words, v1.4 needs more than 15G of VRAM):
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: CUDA out of memory. Tried to allocate 2.44 GiB (GPU 0; 14.75 GiB total capacity; 11.46 GiB already allocated; 1.88 GiB free; 11.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
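To read the numbers in that message: the failing allocation asked for more memory than was free at that moment. A small sketch that pulls the two figures out of the message text and computes the gap; the helper name gib and the regular expressions are my own, and the message string is copied from the error above.

```python
import re

message = (
    "Tried to allocate 2.44 GiB (GPU 0; 14.75 GiB total capacity; "
    "11.46 GiB already allocated; 1.88 GiB free; "
    "11.75 GiB reserved in total by PyTorch)"
)

def gib(pattern, text):
    """Return the first number (in GiB) captured by the given pattern."""
    return float(re.search(pattern, text).group(1))

requested = gib(r"allocate ([\d.]+) GiB", message)
free = gib(r"([\d.]+) GiB free", message)
shortfall = requested - free

print(f"requested {requested} GiB, free {free} GiB, short by {shortfall:.2f} GiB")
```

This only describes the single allocation that failed; the peak memory needed for the whole run may be higher.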
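So I stopped agonizing over it and switched straight to FP16 precision. Looking at the experiments on Colab, I saw that someone had succeeded on a T4, so without further ado I moved to a jupyter notebook.
First, the imports: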
import inspect
import warnings
from typing import List, Optional, Union

import torch
from torch import autocast
from tqdm.auto import tqdm

from diffusers import (
    AutoencoderKL,
    DDIMScheduler,
    DiffusionPipeline,
    PNDMScheduler,
    UNet2DConditionModel,
)
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
Then add the pipeline source code, download the pretrained weights, and specify float16 for the model:
class StableDiffusionImg2ImgPipeline(DiffusionPipeline):
    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        unet: UNet2DConditionModel,
        scheduler: Union[DDIMScheduler, PNDMScheduler],
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPFeatureExtractor,
    ):
        super().__init__()
        scheduler = scheduler.set_format("pt")
        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
        )

    @torch.no_grad()
    def __call__(
        self,
        prompt: Union[str, List[str]],
        init_image: torch.FloatTensor,
        strength: float = 0.8,
        num_inference_steps: Optional[int] = 50,
        guidance_scale: Optional[float] = 7.5,
        eta: Optional[float] = 0.0,
        generator: Optional[torch.Generator] = None,
        output_type: Optional[str] = "pil",
    ):
        if isinstance(prompt, str):
            batch_size = 1
        elif isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if strength < 0 or strength > 1:
            raise ValueError(f'The value of strength should in [0.0, 1.0] but is {strength}')

        # set timesteps
        accepts_offset = "offset" in set(inspect.signature(self.scheduler.set_timesteps).parameters.keys())
        extra_set_kwargs = {}
        offset = 0
        if accepts_offset:
            offset = 1
            extra_set_kwargs["offset"] = 1

        self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)

        # encode the init image into latents and scale the latents
        init_latents = self.vae.encode(init_image.to(self.device)).sample()
        init_latents = 0.18215 * init_latents

        # prepare init_latents noise to latents
        init_latents = torch.cat([init_latents] * batch_size)

        # get the original timestep using init_timestep
        init_timestep = int(num_inference_steps * strength) + offset
        init_timestep = min(init_timestep, num_inference_steps)
        timesteps = self.scheduler.timesteps[-init_timestep]
        timesteps = torch.tensor([timesteps] * batch_size, dtype=torch.long, device=self.device)

        # add noise to latents using the timesteps
        noise = torch.randn(init_latents.shape, generator=generator, device=self.device)
        init_latents = self.scheduler.add_noise(init_latents, noise, timesteps)

        # get prompt text embeddings
        text_input = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        )
        text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]

        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
        # corresponds to doing no classifier free guidance.
        do_classifier_free_guidance = guidance_scale > 1.0
        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance:
            max_length = text_input.input_ids.shape[-1]
            uncond_input = self.tokenizer(
                [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
            )
            uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]
        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        latents = init_latents
        t_start = max(num_inference_steps - init_timestep + offset, 0)
        for i, t in tqdm(enumerate(self.scheduler.timesteps[t_start:])):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents

            # predict the noise residual
            noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]

            # perform guidance
            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            # compute the previous noisy sample x_t -> x_t-1
            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs)["prev_sample"]

        # scale and decode the image latents with vae
        latents = 1 / 0.18215 * latents
        image = self.vae.decode(latents)

        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.cpu().permute(0, 2, 3, 1).numpy()

        # run safety checker
        safety_checker_input = self.feature_extractor(self.numpy_to_pil(image), return_tensors="pt").to(self.device)
        image, has_nsfw_concept = self.safety_checker(images=image, clip_input=safety_checker_input.pixel_values)

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        return {"sample": image, "nsfw_content_detected": has_nsfw_concept}


device = "cuda"
model_path = "CompVis/stable-diffusion-v1-4"

# Using DDIMScheduler as an example, this also works with PNDMScheduler
# uncomment this line if you want to use it.
# scheduler = PNDMScheduler.from_config(model_path, subfolder="scheduler", use_auth_token=True)

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_path,
    scheduler=scheduler,
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
).to(device)
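To see what strength actually does in the pipeline's __call__, the sketch below repeats just its step arithmetic in a standalone function. The function name img2img_steps is hypothetical, and I am assuming that the scheduler exposes exactly num_inference_steps timesteps and accepts an offset (so offset=1), as in the code above.

```python
def img2img_steps(num_inference_steps=50, strength=0.8, offset=1):
    """Number of denoising iterations the loop runs for a given strength."""
    init_timestep = int(num_inference_steps * strength) + offset
    init_timestep = min(init_timestep, num_inference_steps)
    t_start = max(num_inference_steps - init_timestep + offset, 0)
    # Assumption: len(scheduler.timesteps) == num_inference_steps,
    # so slicing from t_start leaves this many iterations.
    return num_inference_steps - t_start

for strength in (0.25, 0.5, 0.75, 1.0):
    print(strength, img2img_steps(strength=strength))
```

A lower strength adds less noise to the input image and runs fewer denoising steps, so the output stays closer to the original; a strength near 1.0 runs almost the full schedule.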
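The fp16 model downloaded here is also close to 3G. Once it loads without errors, load an image and preprocess it so that we can pass it to the pipeline. You can start by testing with the official image:
Preprocessing: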
import PIL
from PIL import Image
import numpy as np

def preprocess(image):
    w, h = image.size
    w, h = map(lambda x: x - x % 32, (w, h))  # resize to integer multiple of 32
    image = image.resize((w, h), resample=PIL.Image.LANCZOS)
    image = np.array(image).astype(np.float32) / 255.0
    image = image[None].transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return 2.*image - 1.
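The two numeric steps inside preprocess, rounding the size down to a multiple of 32 and mapping pixel values to [-1, 1] in NCHW layout, can be checked without PIL or torch. This is a NumPy-only sketch; the helper names round_down_to_multiple and to_model_range and the fake image are my own.

```python
import numpy as np

def round_down_to_multiple(value, base=32):
    """Same arithmetic as `x - x % 32` in preprocess."""
    return value - value % base

def to_model_range(pixels_uint8):
    """HWC uint8 image -> NCHW float32 array scaled to [-1, 1]."""
    scaled = pixels_uint8.astype(np.float32) / 255.0
    batched = scaled[None].transpose(0, 3, 1, 2)
    return 2.0 * batched - 1.0

print(round_down_to_multiple(770), round_down_to_multiple(515))

# Hypothetical 768x512 RGB image: all black except one white pixel.
fake_image = np.zeros((512, 768, 3), dtype=np.uint8)
fake_image[0, 0] = 255
out = to_model_range(fake_image)
print(out.shape, out.min(), out.max())
```

Note that a PIL image reports its size as (width, height), while the array is (height, width, channels), which is why the batch ends up as (1, 3, 512, 768) for a 768x512 picture.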
Load the official image. You can download it manually and upload it, or fetch it directly over the network:
import requests
from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_img = Image.open(BytesIO(response.content)).convert("RGB")
init_img = init_img.resize((768, 512))
init_img
init_image = preprocess(init_img)
prompt = "A fantasy landscape, trending on artstation"
generator = torch.Generator(device=device).manual_seed(1024)
with autocast("cuda"):
    images = pipe(prompt=prompt, init_image=init_image, strength=0.75, guidance_scale=7.5, generator=generator)["sample"]
However, the prompt I actually used here was a different one:
prompt = "Anime, Comic, pink hair, ArtStation, on the ground,cartoon, Game "
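The result:
That looks okay, but then I downloaded a few anime images and planned to keep using the prompt above, mainly for the pink hair keyword. The first characters that came to mind were Kuriyama Mirai and Saint Megumi (I noticed the problem while proofreading, but the sakura + Megumi combination left a deep impression on me). In the end, my jupyter notebook, which only has a handful of code cells, was run close to 80 times, and more than 60 of those runs were me fine-tuning... I ran out of words and suspect the prompt itself is the problem, but that's how it went. One of the better results: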
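What other people have made online really does look good, though. Judging by the results, the first reason may be that I chose a lower model precision, and the second is that my vocabulary is a bit limited. I tuned this example while writing the blog and had other things to do, so the tuning got a little tedious, but I'm reasonably satisfied. (PS: what could I do if I weren't? emmm)
Everything above is about setting up the environment and tuning it yourself, which means you can adjust the model parameters by hand and steer the results in the direction you want. Below I introduce some already-tuned online platforms: ones on huggingface, plus one commercial one.
I recommend two addresses here. One is the official demo: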
https://huggingface.co/spaces/stabilityai/stable-diffusion
I entered "Anime, Comic, on the ground,cartoon, Game" and the result was, well, indescribable. The official online deployment is probably a smaller model, and generation is very slow.
https://huggingface.co/spaces/huggingface/diffuse-the-rest
Finally, I found a non-open-source project called stable-diffusion-animation:
https://replicate.com/andreasjansson/stable-diffusion-animation