Goal: Complete a demo where the input is Chinese text and the output is the corresponding video.
Project team members: FanXie, Hongxun, Mingjian
Currently required tasks:
- Have a general understanding of the project. Two seed papers:
- Survey: find more papers in the field of Text-to-Video (T2V) and identify ideas that can serve as reference.
- Collect Text-to-Video (T2V) related datasets.
- Configure the running environment. Then try a model.
Whenever the three of you have a discussion, please upload the NOTES of the group meeting. FanXie, HongxunDing, Mingjian
Assignment of tasks
-
- Present papers
- CogVideo & Make-A-Video (including Transformer, Swin Transformer,
- CogVideo inference
- Look at the current results of existing T2I work (Chinese input); run them to compare results
- Find other T2V methods
- Present papers
-
- paper reading
- CogVideo & diffusion model
- CogVideo inference
- paper reading
-
- Study the CogVideo input interface
- Common metrics for the T2V task
-
Compile the datasets used in MAKE-A-VIDEO
- LAION-5B (2.3B), WebVid-10M (10M), HD-VILA-100M (10M)
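Among the common T2V metrics mentioned above, the Inception Score (IS) is simple enough to sketch directly. The following is a minimal numpy sketch, assuming the per-frame class probabilities (e.g. Inception-v3 softmax outputs) have already been computed; the function name `inception_score` is our own, not from any of the papers:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from per-frame class probabilities.

    probs: (N, C) array; each row is a classifier's softmax output
           (e.g. Inception-v3) on one generated frame.
    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] )
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y)
    # Per-sample KL divergence between the conditional and the marginal
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Uniform predictions give IS = 1 (the minimum), while confident and diverse predictions push IS toward the number of classes.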
Abstract
Version 1.0 (Written by Hongxun):
Video generation is a long-standing research topic. Recently, text-to-video generation has emerged as a promising direction. Our project is to build an efficient model that generates low-FVD (Fréchet Video Distance), high-IS (Inception Score) video sequences from input texts. The task remains challenging due to the scarcity and weak relevance of text-video data, as well as the high variation within videos, which destroys the alignment between the text and its temporal counterparts. Our solution attempts to build a latent diffusion model, drawing inspiration from the leading methods.
Version 2.0 (Revised by FanXie):
In recent years, video generation has become a popular research topic, with one approach being the generation of videos from text descriptions. This text-to-video task is challenging and involves several obstacles, such as semantic understanding, realism, temporal coherence, data availability, computational complexity, and evaluation. Common metrics for evaluating the generated videos are FVD (Fréchet Video Distance) and IS (Inception Score). Our project aims to address these challenges by developing an efficient model that generates videos with low FVD and high IS scores from input texts. While VQ-VAE and autoregressive transformers have been widely used in recent text-to-video methods, our project will adopt the more recent latent diffusion model, taking inspiration from leading methods in the field. Latent diffusion has shown promise in improving the quality and realism of generated videos.
Final Version (Revised by DZeng):
Text-to-video generation aims to generate videos from input text descriptions. It remains challenging due to the scarcity and weak relevance of text-video data, as well as the high variation in videos, which can cause misalignment between the text and its temporal counterparts. Existing methods such as VQ-VAE and autoregressive transformers are widely used for text-to-video generation. In contrast, our project aims to address this task from a fresh perspective. Specifically, we propose to use the latent diffusion model to generate videos from input text, as it has shown promising results in enhancing the realism of generated videos. Our goal is to generate videos from input text with low FVD (Fréchet Video Distance) and high IS (Inception Score).
What is the current progress on text-to-video inference?
Inference has been completed.
Weekly task:
- Finish inference on CogVideo
- Survey on T2V (focus on abstracts) and diffusion model (focus on implementation)
- Some tests on CogVideo
Updated on 2023/3/8
The IS/FVD evaluation has been completed.
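For reference on what the FVD evaluation above measures: FVD is the Fréchet distance between Gaussians fitted to I3D features of real and generated videos. Below is a numpy-only sketch of the distance itself (I3D feature extraction omitted); the eigendecomposition-based matrix square root and the function name are our own simplifications — standard implementations use `scipy.linalg.sqrtm`:

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets.

    FVD applies this to I3D features of real vs. generated videos;
    the feature extraction step is omitted in this sketch.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of s1 @ s2 via eigendecomposition; assumes the
    # product is diagonalizable, which holds for covariances in practice.
    vals, vecs = np.linalg.eig(s1 @ s2)
    covmean = (vecs * np.sqrt(vals.astype(complex))) @ np.linalg.inv(vecs)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean).real)
```

Identical feature sets give a distance near zero; shifting one set's mean increases the distance by the squared shift.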
Slides: https://docs.google.com/presentation/d/16Lj0BJqiV796iL9o9iO7jJskWQYcJx2wSMmNUZGWGkY/edit?usp=sharing
Meeting on 3.9
Future tasks:
- Give a detailed walkthrough of the Make-A-Video and CogVideo techniques, including how key frames are generated. Analyze the strengths and weaknesses of the models, and think about possible directions for optimization.
- Collect sufficient datasets: 1) from other papers; 2) crawl images from the web and generate our own;
- Build a rough baseline model;
- Experiment and optimize.
2023/3/17 Progress
OK, that looks like quite a lot. Add the defense date to the PPT filename, and upload the report as well.