Currently required tasks:
- Gain a general understanding of the project, starting from the two seed papers:
- Survey: find more papers in the Text-to-Video (T2V) field and identify points worth referencing.
- Collect T2V-related datasets.
- Configure the running environment, then try running a model.
@FanXie Please assign specific tasks to everyone (including yourself)!
Assignment of tasks
- CogVideo & Make-A-Video (including Transformer, Swin Transformer, …)
- CogVideo inference
- Review the current results of T2I work (with Chinese input); run it and compare the outputs
- paper reading
- CogVideo & diffusion model
- CogVideo inference
- paper reading
- Study the CogVideo input interface
- Common metrics for the T2V task
- Organize the datasets used in Make-A-Video
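One of the common T2V metrics listed above, FVD, is the Fréchet distance between Gaussian fits of real and generated video features (typically extracted with an I3D network). A minimal sketch of the distance itself, assuming precomputed feature matrices `feats_real` and `feats_gen` (hypothetical names; shape `(num_videos, feature_dim)`):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    FD = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * (C_r C_g)^(1/2))
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    # sqrtm can return a complex matrix due to numerical error; keep the real part
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Identical feature sets give a distance near zero; the reported FVD additionally depends on the choice of feature extractor and the number of sampled videos.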
Version 1.0 (Written by Hongxun):
Video generation is a long-standing research topic, and text-to-video generation has recently emerged as a promising direction. Our project is to build an efficient model that generates video sequences with low FVD (Fréchet Video Distance) and high IS (Inception Score) from input texts. The task remains challenging due to the scarcity and weak relevance of text-video data, as well as the high variation within videos, which breaks the alignment between the text and its temporal counterparts. Our solution attempts to build a latent diffusion model, drawing inspiration from the leading methods.
Version 2.0 (Revised by FanXie):
In recent years, video generation has become a popular research topic, with one approach being the generation of videos from text descriptions. This text-to-video task is challenging and involves several obstacles, such as semantic understanding, realism, temporal coherence, data availability, computational complexity, and evaluation. Common metrics for evaluating the generated videos are FVD (Fréchet Video Distance) and IS (Inception Score). Our project aims to address these challenges by developing an efficient model that generates videos with low FVD and high IS scores from input texts. While VQVAE and autoregressive transformers have been widely used in recent text-to-video methods, our project will adopt the more recent latent diffusion model, taking inspiration from leading methods in the field. This model has proven to be a powerful generative framework and has shown promise in improving the quality and realism of generated videos.
Final Version (Revised by DZeng):
Text-to-video generation aims to generate videos from input text descriptions. It remains challenging due to the scarcity and weak relevance of text-video data, as well as the high variation in videos, which can cause misalignment between the text and its temporal counterparts. Existing methods such as VQVAE and autoregressive transformers are widely used for text-to-video generation. In contrast, our project addresses this task from a fresh perspective: we propose to use a latent diffusion model to generate videos from input text, as it has shown promising results in enhancing the realism of generated videos. Our goal is to generate videos with low FVD (Fréchet Video Distance) and high IS (Inception Score).
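The IS target mentioned above is defined as IS = exp(E_x[KL(p(y|x) ‖ p(y))]), where p(y|x) is the class distribution an Inception network predicts for a generated sample and p(y) is the marginal over all samples. A minimal sketch, assuming a precomputed probability matrix `probs` (hypothetical name; one softmax row per generated sample):

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception Score from per-sample class probabilities.

    probs: array of shape (N, C); each row is p(y|x) for one sample.
    High scores need both confident per-sample predictions (sharp rows)
    and diversity across samples (uniform marginal).
    """
    marginal = probs.mean(axis=0)  # p(y), averaged over all samples
    # KL(p(y|x) || p(y)) per sample, then exponentiate the mean
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

As a sanity check, N confident samples spread evenly over C classes give a score of C, while identical uniform predictions give 1, the minimum.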
Weekly_report_3.3.pptx (4.8 MB)
- Finish inference on CogVideo
- Survey on T2V (focus on abstracts) and diffusion model (focus on implementation)
- Some tests on CogVideo
ppt: Fan's meeting minutes - #5 by FanXie
Updated on 2023/3/8
The IS/FVD evaluation has been completed.
Meeting on 3.9
- Give a detailed interpretation of the Make-A-Video and CogVideo techniques, including how key frames are generated. Analyze the strengths and weaknesses of the models, and consider possible directions for optimization.
创新实践第一次答辩v1.0.pdf (1.9 MB)