Flash-VStream: AI Can Actually Understand Movies Now!! An Open-Source Video-Language Model, and a New Concept Is Born

2024-07-17 碼農

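Have you ever thought about using AI to watch movies in the future? 🎬

This isn't about watching with a product like Vision Pro; it's about actually letting AI do the watching for us. 👀

Tsinghua University and ByteDance have just released an open-source video-language model: Flash-VStream!

A new concept enters the picture: the video-language model.

Flash-VStream mimics human memory mechanisms, which lets it process extremely long video streams in real time while responding to user queries.

Put simply, you can use Flash-VStream to understand what happens in a long video, and you can ask it about that content at any time.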

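Project Overview

Online video streams are one of the most common forms of media, and their dynamic nature makes video understanding a real challenge.

Traditional video understanding models often struggle with the continuity of video content and the asynchronous nature of user questions, especially when information has to be stored and processed over long periods.

By mimicking human memory mechanisms, the Flash-VStream model significantly reduces inference latency and GPU memory (VRAM) consumption.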
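To make the latency and memory argument concrete, here is a minimal sketch of a fixed-size memory that is updated per frame and queried at arbitrary times. It is not Flash-VStream's code: the class `StreamingMemory`, its `capacity` parameter, and the use of a single float as a "frame feature" are all assumptions made for illustration.

```python
from collections import deque


class StreamingMemory:
    """Toy fixed-size memory for a frame stream (hypothetical, illustrative only)."""

    def __init__(self, capacity: int = 4) -> None:
        # deque(maxlen=...) drops the oldest entry automatically once it is full,
        # so memory use stays constant however long the stream runs.
        self.frames = deque(maxlen=capacity)
        self.frames_seen = 0

    def on_frame(self, feature: float) -> None:
        """Called once per incoming frame; the cost does not grow with stream length."""
        self.frames.append(feature)
        self.frames_seen += 1

    def answer(self, question: str) -> str:
        """Answer from the current memory only; the stream is never replayed."""
        if not self.frames:
            return f"{question!r}: no frames seen yet"
        mean = sum(self.frames) / len(self.frames)
        return f"{question!r}: {len(self.frames)} entries in memory, mean feature {mean:.2f}"


memory = StreamingMemory(capacity=4)
early = ""
for t in range(1000):
    memory.on_frame(float(t))
    if t == 9:
        # A user question can arrive at any point in the stream.
        early = memory.answer("what is happening?")
late = memory.answer("what is happening?")
```

Both questions are answered from at most four stored entries, so the work per query depends on the memory size rather than on how many frames have gone by. A real model would store learned visual features and compress them instead of simply discarding old frames.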

Demo

The team has published a model demo on Hugging Face, so I ran a few videos through it to see how it performs.

Q:what he is doing?

A:The man is sitting in front of a fire with two apples in front of him. He is holding one apple in his hand and using it to roast the other apple.

Q:tell me what the two girls is doing and the difference between two girls.

A:The two girls are dancing in front of a window. One of the girls is wearing a white dress, while the other is wearing a blue dress. The girl in the white dress is wearing a bow and is dancing in a more energetic way. The girl in the blue dress is also dancing, but she is not wearing a bow.

Q:Describe the video briefly.

A:The video shows a black and white panda bear eating bamboo in a zoo. The panda is seen with its mouth open, and the bamboo is green. The video captures the panda's natural behavior in a zoo setting.

Q:Why is this video funny?

A:The video is funny because it shows a person wearing a teddy bear costume dancing in a room. The person's movements are exaggerated and silly, which adds to the humor of the video. The video captures the person's playful and lighthearted nature, making it entertaining to watch.

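Judging by these answers, the model still gets some of the finer details wrong, but it captures the overall gist of each video well and its answers are reasonably detailed.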

How It Works

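The model introduces a learnable memory mechanism called STAR, short for Spatial-Temporal-Abstract-Retrieved. This memory compresses the necessary visual information and updates itself online, in real time.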
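The idea of a memory that compresses and updates itself online can be sketched in a few lines. The article only names the four parts of STAR, so everything below is an assumed stand-in rather than the model's actual method: the class `ToyStarMemory`, merging the closest neighbouring entries, the moving-average summary, and nearest-neighbour retrieval are all hypothetical choices, and frames are plain lists of floats instead of learned features.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average(a: Vector, b: Vector) -> Vector:
    """Element-wise mean of two feature vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


@dataclass
class ToyStarMemory:
    temporal_budget: int = 4      # hypothetical cap on temporal entries
    abstract_rate: float = 0.1    # hypothetical moving-average rate
    spatial: Vector = field(default_factory=list)         # newest frame, uncompressed
    temporal: List[Vector] = field(default_factory=list)  # compressed history
    abstract: Vector = field(default_factory=list)        # running summary

    def update(self, frame: Vector) -> None:
        """Fold one new frame into every memory part."""
        self.spatial = list(frame)
        self.temporal.append(list(frame))
        if len(self.temporal) > self.temporal_budget:
            self._merge_closest_neighbours()
        if not self.abstract:
            self.abstract = list(frame)
        else:
            r = self.abstract_rate
            self.abstract = [(1 - r) * a + r * f for a, f in zip(self.abstract, frame)]

    def _merge_closest_neighbours(self) -> None:
        """Compress by averaging the two most similar adjacent entries."""
        gaps = [distance(self.temporal[i], self.temporal[i + 1])
                for i in range(len(self.temporal) - 1)]
        i = gaps.index(min(gaps))
        self.temporal[i:i + 2] = [average(self.temporal[i], self.temporal[i + 1])]

    def retrieve(self, query: Vector, k: int = 2) -> List[Vector]:
        """Return the k temporal entries closest to a query vector."""
        return sorted(self.temporal, key=lambda v: distance(v, query))[:k]


memory = ToyStarMemory(temporal_budget=4)
for t in range(100):
    memory.update([float(t), float(t % 5)])
closest = memory.retrieve([50.0, 0.0], k=2)
```

After 100 frames the temporal part still holds only four entries, because each update that exceeds the budget merges one pair. That bounded size is what keeps per-frame and per-query cost flat in this toy version.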