
OpenAI Sora Video Generation Model Technical Report: Full Text, Summary, and Impact Analysis

2024-02-29

Translator: Simon Me, PhD student. Reposted from DataWhale.

01. Summary of the OpenAI Sora Video Generation Model Technical Report

• Sora sets the state of the art (SOTA) in video fidelity, length, stability, consistency, resolution, text understanding, and more.

• The technical details are written rather loosely (likely to deter imitation). Roughly: videos of different formats are encoded via visual patches into embeddings that a transformer architecture can train on; a diffusion-like (U-Net-style) process then adds and removes noise during dimensionality reduction and expansion; and the model is scaled up until emergent capabilities appear. A minimal sketch of this recipe follows this list.

• Simply put, while others were still building video models with a "small"-model mindset (predicting the next frame from the previous one, constrained by text or brush masks), OpenAI approached video generation with a "large"-model mindset: gather a sufficiently large corpus of videos, annotate them with a multimodal model, encode videos of different formats into unified visual-patch embeddings, then use a large enough network architecture, a large enough batch size, and enough compute to let the model fit (understand) the entire training set globally. Alongside better reconstruction of detail, emergent intelligence appears, such as a degree of understanding of real-world physics and causal relationships.

• Most exciting (and unsettling) of all: this video generation model appears to be just one achievement unlocked on OpenAI's road toward a world model (a general-purpose model that understands and simulates the complex causal relationships of the real world), not the end point.
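To make the recipe in the bullets above concrete, here is a minimal, hypothetical PyTorch sketch, not OpenAI's code: visual-patch tokens pass through a transformer trained to denoise them. All module sizes and shapes are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyVideoDiffusionTransformer(nn.Module):
    """Toy denoiser over visual-patch tokens (illustrative only)."""
    def __init__(self, patch_dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(patch_dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(patch_dim, patch_dim)  # predicts the "clean" patches

    def forward(self, noisy_patches):  # (batch, num_patches, patch_dim)
        return self.head(self.backbone(noisy_patches))

model = TinyVideoDiffusionTransformer()
noisy = torch.randn(2, 256, 512)   # 256 spacetime patch tokens per clip
print(model(noisy).shape)          # torch.Size([2, 256, 512])
```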

02. Potential Impact of the Sora Release

C-end / For ordinary people

• This may be the best era yet for independent creators. With Sora released, usable AI tools for copywriting, sound effects, and video generation are all in place; one person can painlessly carry a short film, good stories will be worth a fortune, and talent will be harder to bury. On the other hand, as the barrier to creation falls, competition between stories will become fierce.

• The XR industry, with Vision Pro as its flagship, will get another boost: content scarcity will no longer be a problem.

• The currently dominant short-video recommendation format may change: instead of the system recommending short videos based on user preferences, videos could be generated specifically for each user. Or the same short video could have different (real-time) fine-tuned versions for different users.

B-end / For businesses

• All AI video generation companies face a first wave of crisis, but there is opportunity in the danger. OpenAI has proven that the large-model approach to video works, so what these companies must now prove is "I can do large-model video too." Compare how, after ChatGPT took off, the number of companies building large language models grew rather than shrank.

• AI 3D generation companies face a second wave of impact: given multi-view reconstruction techniques, the boundary between video generation and 3D generation is blurry. 3D generation may need to rethink the soundness of its current technical route and its business narrative.

• Although OpenAI does not say so explicitly, the compute Sora needs will not be small, so GPU companies are in for another wave of good news, though not necessarily NVIDIA. Compute increasingly looks like infrastructure, and infrastructure is a national lifeline: even setting embargoes aside, China will not be the only country demanding sovereign, controllable compute, and major companies are starting to build their own GPUs or dedicated AI accelerators (see Google, Tesla, OpenAI, Alibaba). So the number of competitors in the compute market will keep growing.

03. Technical Report: Full Text

Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks,[4-7] autoregressive transformers,[8,9] and diffusion models.[10-12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.


Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.[13,14] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[15-18] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,[19] and subsequently decomposing the representation into spacetime patches.
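As a concrete (assumed) illustration of that decomposition, the sketch below cuts a compressed latent video into non-overlapping spacetime blocks and flattens each block into one token. The patch sizes are my own choices, not published values.

```python
import torch

def to_spacetime_patches(latent, t=2, h=4, w=4):
    """Split a latent video (channels, frames, H, W) into flattened patches."""
    c, T, H, W = latent.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    x = latent.reshape(c, T // t, t, H // h, h, W // w, w)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)       # (nT, nH, nW, c, t, h, w)
    return x.reshape(-1, c * t * h * w)      # one row per spacetime patch

latent = torch.randn(16, 8, 32, 32)          # assumed compressed latent clip
tokens = to_spacetime_patches(latent)
print(tokens.shape)                          # (4*8*8, 16*2*4*4) = (256, 512)
```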

Video compression network

We train a network that reduces the dimensionality of visual data.[20] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
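The report gives no architecture for this network. The sketch below is a stand-in: a small 3D convolutional autoencoder that compresses a clip in time and space, plus a decoder mapping latents back to pixels. Channel counts and strides are assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
    nn.SiLU(),
    nn.Conv3d(64, 16, kernel_size=3, stride=(2, 2, 2), padding=1),  # latent
)
decoder = nn.Sequential(
    nn.ConvTranspose3d(16, 64, kernel_size=4, stride=(2, 2, 2), padding=1),
    nn.SiLU(),
    nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
)

video = torch.randn(1, 3, 16, 128, 128)   # (batch, rgb, frames, H, W)
latent = encoder(video)                   # temporally and spatially compressed
recon = decoder(latent)
print(latent.shape, recon.shape)          # (1,16,8,32,32) and (1,3,16,128,128)
```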

Spacetime Latent Patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
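The size-control idea at inference time can be illustrated as follows: pick a token grid matching the desired duration and aspect ratio, fill it with Gaussian noise, and let the denoiser refine it. Grid, patch, and latent sizes here are illustrative assumptions.

```python
import torch

def init_noise_tokens(frames, height, width, t=2, h=4, w=4, latent_c=16):
    """Randomly initialized patch tokens covering the requested output size."""
    n_tokens = (frames // t) * (height // h) * (width // w)
    patch_dim = latent_c * t * h * w
    return torch.randn(n_tokens, patch_dim)

widescreen = init_noise_tokens(frames=8, height=32, width=56)  # 16:9-ish latent
vertical = init_noise_tokens(frames=8, height=56, width=32)    # 9:16-ish latent
print(widescreen.shape, vertical.shape)  # token count varies with the grid
```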

Scaling transformers for video generation

Sora is a diffusion model;[21-25] given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer.[26] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,[13,14] computer vision,[15-18] and image generation.[27-29]
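A toy version of that training objective, with a simplified linear corruption standing in for a real noise schedule: noisy tokens go in, a prediction of the original "clean" tokens comes out, trained with mean squared error.

```python
import torch
import torch.nn.functional as F

def diffusion_step(model, clean_patches, optimizer):
    b = clean_patches.shape[0]
    t = torch.rand(b, 1, 1)                       # random corruption level per clip
    noise = torch.randn_like(clean_patches)
    noisy = (1 - t) * clean_patches + t * noise   # simplified, assumed noise schedule
    pred_clean = model(noisy)                     # text conditioning omitted here
    loss = F.mse_loss(pred_clean, clean_patches)  # predict the original "clean" patches
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with any token-denoising model, e.g. a small MLP stand-in:
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU(),
                            torch.nn.Linear(512, 512))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
print(diffusion_step(model, torch.randn(4, 256, 512), opt))
```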

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3[30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.
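In outline, that pipeline looks like the loop below; `DescriptiveCaptioner` is a hypothetical stand-in for the trained captioner model, not a real API.

```python
class DescriptiveCaptioner:
    """Hypothetical stand-in for a trained, highly descriptive captioner."""
    def caption(self, video_path: str) -> str:
        # a real model would return a long, detailed description of the clip
        return f"A detailed description of the contents of {video_path}"

captioner = DescriptiveCaptioner()
training_videos = ["clip_0001.mp4", "clip_0002.mp4"]   # hypothetical file names
captions = {v: captioner.caption(v) for v in training_videos}
# the resulting (video, caption) pairs form the text-conditioned training set
```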

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
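That prompt up-sampling step could look roughly like this. The system instruction and model choice are assumptions; the call shape follows the published OpenAI v1 Python SDK.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model choice
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's prompt as a long, richly detailed video "
                "description, preserving its intent.")},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

detailed_caption = expand_prompt("a corgi surfing at sunset")
# `detailed_caption` is what would then be sent on to the video model
```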

Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2[31] and DALL·E 3[30] images.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,[32] to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
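To make the idea concrete, here is a toy SDEdit-style procedure over patch tokens, under the same simplified linear-corruption assumption used earlier: partially noise the source video's tokens, then denoise them under the new text condition, so global structure survives while style changes. The stub denoiser is for illustration only.

```python
import torch

def sdedit(model, source_tokens, condition, strength=0.6, steps=30):
    t = strength                                       # partial, not full, noising
    x = (1 - t) * source_tokens + t * torch.randn_like(source_tokens)
    for i in range(steps):
        t_cur = strength * (1 - i / steps)
        t_next = strength * (1 - (i + 1) / steps)
        pred_clean = model(x, t_cur, condition)        # denoiser's estimate of x0
        eps_hat = (x - (1 - t_cur) * pred_clean) / max(t_cur, 1e-6)
        x = (1 - t_next) * pred_clean + t_next * eps_hat   # DDIM-like update
    return x

stub_model = lambda x, t, cond: (1 - t) * x            # placeholder denoiser
edited = sdedit(stub_model, torch.randn(256, 512), condition="make it snowy")
print(edited.shape)
```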

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.
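The report does not say how the interpolation is done. One plausible reading, sketched below, blends the two videos' latent tokens at ramping weights and would let the denoiser turn each blend into a coherent intermediate clip.

```python
import torch

def connect(tokens_a, tokens_b, n_mid=5):
    """Blend two clips' latent tokens at ramping weights (assumed mechanism)."""
    blends = []
    for k in range(1, n_mid + 1):
        alpha = k / (n_mid + 1)
        blends.append((1 - alpha) * tokens_a + alpha * tokens_b)
    return blends  # each blend would then be refined by the denoiser

mid_clips = connect(torch.randn(256, 512), torch.randn(256, 512))
print(len(mid_clips))  # 5 intermediate clips between the two inputs
```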

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
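In the patch view, an image is just the single-frame special case of the noise-grid initialization shown earlier; the sketch below fixes the temporal extent to one frame. Sizes remain illustrative assumptions.

```python
import torch

def init_image_tokens(height, width, h=4, w=4, latent_c=16):
    n_tokens = (height // h) * (width // w)     # spatial grid, one frame deep
    patch_dim = latent_c * 1 * h * w            # temporal extent of one frame
    return torch.randn(n_tokens, patch_dim)

tokens = init_image_tokens(256, 256)            # e.g. 2048x2048 pixels if the
print(tokens.shape)                             # decoder upsamples 8x (assumed)
```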

Close-up portrait shot of a woman in autumn, extreme detail, shallow depth of field

Vibrant coral reef teeming with colorful fish and sea creatures

Digital art of a young tiger under an apple tree in a matte painting style with gorgeous details

A snowy mountain village with cozy cabins and a northern lights display, high detail and photorealistic dslr, 50mm f/1.2

Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning "Minecraft."

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.


Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page .

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.

Original link:

https://openai.com/research/video-generation-models-as-world-simulators (the reference list appears at the end of the original article)
