Open-Sora, an Efficient Open-Source Reproduction of Sora-like Video Generation: A ModelScope Community Best-Practice Tutorial

2024-03-20 | 碼農 (Coder)

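Technical Overview

Recently, the HPC-AI Tech team publicly released the Open-Sora project on GitHub (https://github.com/hpcaitech/Open-Sora). The project aims to reproduce the core techniques of OpenAI's Sora model and has already made substantial progress. As pioneering work in the open-source community, Open-Sora is the first to offer a Sora-like video generation solution. The ModelScope community quickly followed up and studied this work in depth, hoping to promote technical exchange and practical applications.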

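Open-Sora's work consists mainly of the following parts:

1. Variational Autoencoder (VAE)

To reduce compute cost, Open-Sora uses a VAE to map video from the raw pixel space into a latent space. Sora's technical report describes a spatial-temporal VAE that also compresses the time dimension. Through research and experiments, the Open-Sora team found that no high-quality open-source spatial-temporal VAE (3D-VAE) is currently available: the 4x4x4 VAE used by Google Research's MAGVIT is not open source, and the 2x4x4 VAE from VideoGPT showed low quality in their experiments. Open-Sora v1 therefore uses a 2D VAE from Stability-AI (sd-vae-ft-mse).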

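2. Diffusion Transformers - STDiT

Video training involves a very large number of tokens. A one-minute video at 24 frames per second has 1440 frames. After 4x VAE downsampling and 2x patch downsampling, this yields roughly 1440x1024≈1.5 million tokens. Running full attention over 1.5 million tokens is extremely expensive, so Open-Sora follows the Latte project and uses spatial-temporal attention to reduce the cost.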
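The arithmetic above can be checked with a short sketch. It assumes 256x256 frames (the passage does not state the resolution, but this is the size that gives 1024 tokens per frame under 4x VAE and 2x patch downsampling), and the function name is hypothetical.

```python
def video_token_count(seconds, fps, height, width, vae_down, patch_down):
    """Count tokens for a clip after spatial VAE and patch downsampling."""
    frames = seconds * fps
    factor = vae_down * patch_down          # total downsampling per spatial side
    tokens_per_frame = (height // factor) * (width // factor)
    return frames, tokens_per_frame, frames * tokens_per_frame


# Assumption: 256x256 frames; one minute at 24 FPS as in the text.
frames, per_frame, total = video_token_count(60, 24, 256, 256, vae_down=4, patch_down=2)
print(frames, per_frame, total)  # 1440 1024 1474560
```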

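As shown in the figure, the STDiT (spatial-temporal) architecture inserts a temporal attention module after each spatial attention module. This design is similar to variant 3 in the Latte paper, although the parameter counts were not strictly matched. Open-Sora's experiments on 16x256x256 video show that, for the same number of iterations, performance ranks as: DiT (full attention) > STDiT (sequential) > STDiT (parallel) ≈ Latte. For efficiency, Open-Sora chose STDiT (sequential) for this release.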
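To see why factorized attention is cheaper, the sketch below counts query-key pairs for full attention versus one spatial pass followed by one temporal pass over a grid of T frames with S tokens each. It is a rough cost model only (heads, projections, and the MLP are ignored), the function names are illustrative rather than taken from the Open-Sora code, and the value of 256 tokens per frame is an assumption.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Every token attends to every other token in the clip."""
    n = num_frames * tokens_per_frame
    return n * n


def spatial_temporal_pairs(num_frames, tokens_per_frame):
    """Spatial attention inside each frame, then temporal attention
    across frames at each spatial location (sequential order)."""
    spatial = num_frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * num_frames ** 2
    return spatial + temporal


# 16 frames; 256 tokens per frame is an assumed value for illustration.
T, S = 16, 256
full = full_attention_pairs(T, S)
factored = spatial_temporal_pairs(T, S)
print(full, factored)  # 16777216 1114112
```

Under these assumptions the factorized form needs about 15 times fewer pairs, and the gap widens as the clip gets longer.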

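Open-Sora focuses on video generation and trains its model on top of PixArt-α, a strong image generation model that uses a T5-conditioned DiT architecture. Open-Sora initializes its model from PixArt-α and initializes the inserted temporal attention layers to zero. This ensures the model retains its image generation ability from the very start of training. The text encoder is T5. The inserted temporal attention layers raise the parameter count from 580M to 724M.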
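The effect of zero initialization can be illustrated with a toy residual branch. This is a minimal sketch, assuming the temporal attention sits on a residual path whose output projection starts at zero; the class and its averaging stand-in for attention are simplifications for illustration, not Open-Sora's implementation.

```python
class ToyTemporalBranch:
    """Residual branch: x + proj_weight * mix_over_time(x)."""

    def __init__(self, proj_weight=0.0):
        # Zero output projection: the new branch contributes nothing at the start.
        self.proj_weight = proj_weight

    def __call__(self, frames):
        # frames: one scalar feature per frame, to keep the example small.
        mean = sum(frames) / len(frames)  # stand-in for attention over the time axis
        return [x + self.proj_weight * mean for x in frames]


image_features = [0.5, -1.0, 2.0]  # features as produced by the pretrained image model
at_init = ToyTemporalBranch()(image_features)
after_training = ToyTemporalBranch(proj_weight=0.1)(image_features)

print(at_init == image_features)         # True: pretrained behaviour is preserved
print(after_training != image_features)  # True: the branch now mixes information across frames
```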

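Inspired by the success of PixArt-α and Stable Video Diffusion, Open-Sora adopts a progressive training strategy: it first trains at 16x256x256 on a 366K-sample pretraining dataset, then continues training on a 20K-sample dataset at 16x256x256, 16x512x512, and 64x512x512 in turn. Combined with scaled position embeddings, this strategy significantly reduces compute cost.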
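One way to picture a scaled position embedding is sketched below. The value space_scale=0.5 appears in the 16x256x256 inference config later in this article; the convention of dividing coordinates by the scale, the value 1.0 for 512x512, and the patch counts (8x VAE and 2x patch downsampling) are assumptions made for this illustration. The helper names are hypothetical.

```python
import math


def scaled_positions(num_positions, scale):
    """Position coordinates divided by a scale factor."""
    return [i / scale for i in range(num_positions)]


def sincos_embedding(position, dim=8, base=10000.0):
    """Standard 1D sinusoidal embedding of a single coordinate."""
    half = dim // 2
    freqs = [1.0 / (base ** (k / half)) for k in range(half)]
    return [math.sin(position * f) for f in freqs] + [math.cos(position * f) for f in freqs]


# Assumed: 256 px -> 16 patches per side, 512 px -> 32 patches per side.
low_res = scaled_positions(16, scale=0.5)   # 0, 2, 4, ..., 30
high_res = scaled_positions(32, scale=1.0)  # 0, 1, 2, ..., 31

# Patch 1 at low resolution and patch 2 at high resolution share an embedding.
print(sincos_embedding(low_res[1]) == sincos_embedding(high_res[2]))  # True
```

Because both resolutions cover nearly the same coordinate range, weights learned at the lower resolution remain meaningful when training continues at the higher one.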

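3. Patch Embedding

Open-Sora tried a 3D patch embedding in DiT, but the 2x downsampling along the time dimension led to low-quality video. In the next version, Open-Sora therefore plans to leave temporal downsampling to a spatial-temporal VAE. In V1, training samples one frame out of every 3 (16-frame training) or one out of every 2 (64-frame training).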
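The fixed-interval frame sampling described above can be sketched as follows. The helper name is hypothetical, and the 24 FPS source frame rate is an assumption (it matches the `fps = 24 // 3` line in the inference config later in this article).

```python
def sample_frame_indices(start, num_frames, interval):
    """Indices of the frames kept when taking every `interval`-th frame."""
    return [start + i * interval for i in range(num_frames)]


source_fps = 24  # assumed frame rate of the source video

idx_16 = sample_frame_indices(0, num_frames=16, interval=3)
idx_64 = sample_frame_indices(0, num_frames=64, interval=2)

print(idx_16[:4], idx_16[-1])            # [0, 3, 6, 9] 45
print(idx_64[-1])                        # 126
print(source_fps // 3, source_fps // 2)  # effective FPS: 8 12
```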

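4. Video Caption

Open-Sora uses LLaVA-1.6-Yi-34B (an image captioning model) to annotate videos. The caption is based on three consecutive frames together with a carefully designed prompt, which allows LLaVA to produce high-quality video descriptions.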

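In summary, Open-Sora V1 reproduces the Transformer-based video generation pipeline quite completely:

It compared several STDiT variants, adopted STDiT (sequential), and validated the results.

With scaled position embeddings, it supports video generation at different resolutions and durations.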

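The experiments also ran into some difficulties. For example, we noticed that:

Open-Sora initially used VideoGPT's spatial-temporal VAE and, after finding its quality unsatisfactory, fell back to the 2D VAE from Stable Diffusion.

Likewise, the 3D patch embedding produced low-quality video because of the 2x downsampling along the time dimension, so V1 still samples frames at a fixed interval.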

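Open-Sora also builds on open-source projects and models, including but not limited to:

The multimodal LLM LLaVA-1.6-Yi-34B, used for video captioning to produce high-quality video-text pairs.

A T5-conditioned DiT architecture, inspired by the success of PixArt-α and Stable Video Diffusion.

With strong engineering, the Open-Sora project quickly built and validated the Sora technical pipeline and advanced open-source video generation. We also look forward to V2 making further progress on open problems such as the spatial-temporal VAE.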

ModelScope Best Practices

Step 1: Download the code and install it:

# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .

Step 2: Download the models and place them in the corresponding folders

cd Open-Sora/opensora/models
# Download the VAE model
git clone https://www.modelscope.cn/AI-ModelScope/sd-vae-ft-ema.git
# Download the ST-dit model
git clone https://www.modelscope.cn/AI-ModelScope/Open-Sora.git
# Download the text-encoder model
cd text-encoder
git clone https://www.modelscope.cn/AI-ModelScope/t5-v1_1-xxl.git

Step 3: Edit the config file /mnt/workspace/Open-Sora/configs/opensora/inference/16x256x256.py

num_frames = 16
fps = 24 // 3
image_size = (256, 256)

# Define model
model = dict(
    type="STDiT-XL/2",
    space_scale=0.5,
    time_scale=1.0,
    enable_flashattn=True,
    enable_layernorm_kernel=True,
    from_pretrained="/mnt/workspace/Open-Sora/opensora/models/stdit/OpenSora-v1-HQ-16x256x256.pth",
)
vae = dict(
    type="VideoAutoencoderKL",
    from_pretrained="/mnt/workspace/Open-Sora/opensora/models/sd-vae-ft-ema",
)
text_encoder = dict(
    type="t5",
    from_pretrained="/mnt/workspace/Open-Sora/opensora/models/text_encoder",
    model_max_length=120,
)
scheduler = dict(
    type="iddpm",
    num_sampling_steps=100,
    cfg_scale=7.0,
)
dtype = "fp16"

# Others
batch_size = 2
seed = 42
prompt_path = "./assets/texts/t2v_samples.txt"
save_dir = "./outputs/samples/"
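The paths in this config have to match where the models actually ended up in step 2, so it can help to verify them before launching inference. The following is a small optional sketch, not part of Open-Sora; the helper name is hypothetical and the paths are simply copied from the config above.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Paths copied from the config above; adjust them to your own checkout.
required = [
    "/mnt/workspace/Open-Sora/opensora/models/stdit/OpenSora-v1-HQ-16x256x256.pth",
    "/mnt/workspace/Open-Sora/opensora/models/sd-vae-ft-ema",
    "/mnt/workspace/Open-Sora/opensora/models/text_encoder",
]

for path in missing_paths(required):
    print("missing:", path)
```

If anything is printed, move or rename the downloaded folder, or update the corresponding `from_pretrained` entry.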

Run the inference code:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py

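Peak GPU memory usage is about 30 GB. Notably, most of it is taken by t5-v1_1-xxl. When we switched to t5-v1_1-large, inference failed with an error saying the conditioning tensor shape is wrong. The ModelScope community will keep experimenting with smaller text encoders, with the goal of letting developers run the model on a single consumer-grade GPU.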
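One plausible explanation for the shape error is a mismatch in embedding width: t5-v1_1-xxl produces 4096-dimensional hidden states, while t5-v1_1-large produces 1024-dimensional ones. The sketch below encodes that check. The expected dimension of 4096 for the pretrained STDiT weights is an assumption, and the helper is hypothetical rather than an Open-Sora API.

```python
# Hidden sizes (d_model) of the two T5 v1.1 checkpoints.
T5_HIDDEN_SIZE = {"t5-v1_1-xxl": 4096, "t5-v1_1-large": 1024}

# Assumption: the pretrained STDiT weights expect 4096-dimensional caption embeddings.
EXPECTED_CAPTION_DIM = 4096


def check_text_encoder(name, expected_dim=EXPECTED_CAPTION_DIM):
    """Return True if the encoder's output width matches what the model expects."""
    return T5_HIDDEN_SIZE[name] == expected_dim


for name in T5_HIDDEN_SIZE:
    status = "ok" if check_text_encoder(name) else "shape mismatch"
    print(f"{name}: {status}")
```

If this is indeed the cause, swapping in a smaller encoder would require retraining or adapting the caption projection rather than only changing the config.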

Results generated with the officially provided prompts:

prompt: A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell, is the main focus of the video, swimming gracefully towards the right side of the frame. The coral reef, teeming with life, is visible in the background, providing a vibrant and colorful backdrop to the turtle's journey. Several small fish, darting around the turtle, add a sense of movement and dynamism to the scene. The video is shot from a slightly elevated angle, providing a comprehensive view of the turtle's surroundings. The overall style of the video is calm and peaceful, capturing the beauty and tranquility of the underwater world.

prompt: The dynamic movement of tall, wispy grasses swaying in the wind. The sky above is filled with clouds, creating a dramatic backdrop. The sunlight pierces through the clouds, casting a warm glow on the scene. The grasses are a mix of green and brown, indicating a change in seasons. The overall style of the video is naturalistic, capturing the beauty of the landscape in a realistic manner. The focus is on the grasses and their movement, with the sky serving as a secondary element. The video does not contain any human or animal elements.

prompt: A serene night scene in a forested area. The first frame shows a tranquil lake reflecting the star-filled sky above. The second frame reveals a beautiful sunset, casting a warm glow over the landscape. The third frame showcases the night sky, filled with stars and a vibrant Milky Way galaxy. The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. The style of the video is naturalistic, emphasizing the beauty of the night sky and the peacefulness of the forest.

prompt: The Grandeur of a Venetian Canal at Twilight. As evening descends upon Venice, the Grand Canal shimmers with the reflections of historic palazzos and gently bobbing gondolas. The fading light turns the water to liquid gold, while the sky transitions from a soft lavender to the deep blue of the coming night. Lanterns begin to glow from wrought iron posts, casting a warm light on the faces of lovers walking hand in hand along the cobblestone paths. The gentle lapping of the water against the stone walls of the canal carries the soft melodies of a distant accordion, evoking a timeless romance only Venice can offer.

prompt: A Rustic Vineyard at Harvest Time. Nestled in the rolling hills of the countryside, the vineyard is a patchwork of vines laden with clusters of ripe, plump grapes. Workers move between the rows, baskets in hand, as they gather the fruits of their labor under the autumn sun. The vines are a cascade of golden leaves, ready to fall at the slightest touch. In the distance, a stone farmhouse rests, its walls aged by time, with smoke rising from the chimney into the clear blue sky. The air is filled with the sweet scent of grapes and the earthy aroma of the soil, inviting a sense of home and the promise of a bountiful harvest.

64 frames

Switch the config script to /configs/opensora/inference/64x512x512.py

Remember to change the model to OpenSora-v1-HQ-16x512x512.pth

Inference uses about 44 GB of GPU memory.