Goal: Complete a demo where the input is Chinese text and the output is the corresponding video.
Project team members: FanXie, Hongxun, Mingjian

Currently required tasks:

Whenever the three of you hold a discussion, please upload the NOTES of the group meeting. FanXie, HongxunDing, Mingjian

@FanXie Please assign specific tasks to each member (including yourself)!

Assignment of tasks

  • FanXie

    • Paper presentation
      • CogVideo & Make-A-Video (including Transformer and Swin Transformer)
      • CogVideo inference
    • Review the current results of existing T2I work with Chinese input; run them and compare the outputs
    • Survey other T2V methods
  • Hongxun

    • paper reading
      • CogVideo & diffusion model
    • CogVideo inference
  • Mingjian

    • Study the CogVideo input interface
    • Common metrics for the T2V task
    • Compile the datasets used in Make-A-Video
      • LAION-5B (2.3B), WebVid-10M (10M), HD-VILA-100M (10M)


Version 1.0 (Written by Hongxun):
Video generation is a long-standing research topic, and text-to-video generation has recently emerged as a promising direction. Our project aims to build an efficient model that generates video sequences with low FVD (Fréchet Video Distance) and high IS (Inception Score) from input texts. The task remains challenging due to the scarcity and weak relevance of text-video data, as well as the high variation within videos, which can break the alignment between the text and its temporal counterparts. Our solution attempts to build a latent diffusion model, drawing inspiration from the leading methods.

Version 2.0 (Revised by FanXie):
In recent years, video generation has become a popular research topic, and one approach is generating videos from text descriptions. This text-to-video task is challenging and involves several obstacles, such as semantic understanding, realism, temporal coherence, data availability, computational complexity, and evaluation. Common metrics for evaluating the generated videos are FVD (Fréchet Video Distance) and IS (Inception Score). Our project aims to address these challenges by developing an efficient model that generates videos with low FVD and high IS based on input texts. While VQ-VAE and autoregressive transformers have been widely used in recent text-to-video methods, our project will adopt the more recent latent diffusion model, taking inspiration from leading methods in the field. This class of models has shown promise in improving the quality and realism of generated videos.

Final Version (Revised by DZeng):
Text-to-video generation aims to generate videos from input text descriptions. It remains challenging due to the scarcity and weak relevance of text-video data as well as the high variation in videos, which can cause misalignment between the text and its temporal counterparts. Existing methods such as VQVAE and autoregressive transformers are widely used for text-to-video generation. In contrast, our project aims to address this task from a fresh perspective. Specifically, we propose to use the latent diffusion model to generate video from the input text as it has shown promising results in enhancing the realism of generated videos. Our goal is to generate videos from input text that have low FVD (Fréchet Video Distance) and high IS (Inception Score).
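Since the project targets low FVD and high IS, here is a minimal numpy sketch of how the Inception Score is computed. The `probs` input stands in for the softmax outputs of a pretrained classifier (e.g. Inception-v3) on the generated frames; the sample arrays below are purely illustrative.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ).

    probs: (N, C) array of per-sample class probabilities
    from a pretrained classifier (illustrative input here).
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident AND diverse predictions -> high IS (close to the class count, 10)
sharp = np.eye(10)[np.arange(100) % 10]
# Uniform, uninformative predictions -> IS close to 1
flat = np.full((100, 10), 0.1)
print(inception_score(sharp), inception_score(flat))
```

In practice libraries such as torchmetrics provide ready-made IS/FID implementations; this sketch only shows what the score measures.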



Weekly_report_3.3.pptx (4.8 MB)

Weekly task:

  1. Finish inference on CogVideo
  2. Survey on T2V (focus on abstracts) and diffusion model (focus on implementation)
  3. Some tests on CogVideo
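For the diffusion-model survey item above, a minimal numpy sketch of the DDPM-style forward-noising and reverse-denoising steps that latent diffusion builds on. The schedule values, shapes, and the use of the true noise as `eps_hat` are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear noise schedule (values are illustrative)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def p_step(x_t, t, eps_hat, z):
    """One reverse step: posterior mean given predicted noise, plus fresh noise z."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    return mean + np.sqrt(betas[t]) * z

x0 = rng.standard_normal((4, 8))       # stand-in for a latent video frame
eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, T - 1, eps)         # fully noised latent
# With the true noise as eps_hat and z = 0, one step moves x_t back toward x0.
x_prev = p_step(x_t, T - 1, eps, np.zeros_like(x0))
```

In a real latent diffusion model, `eps_hat` comes from a learned (text-conditioned) denoising network, and the loop runs from t = T-1 down to 0.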

ppt: Fan's meeting minutes - #5 by FanXie

Updated on 2023/3/8
IS/FVD evaluation has been completed.
Slides: https://docs.google.com/presentation/d/16Lj0BJqiV796iL9o9iO7jJskWQYcJx2wSMmNUZGWGkY/edit?usp=sharing

Meeting on 3.9
Future tasks:

  1. Produce a detailed analysis of the Make-A-Video and CogVideo techniques, especially how key frames are generated. Analyze the strengths and weaknesses of each model and consider possible directions for optimization.
  2. Collect a sufficiently large dataset: 1) from other papers; 2) by crawling images from the web and generating our own.
  3. Build a rough baseline model.
  4. Experiment and optimize.

2023/3/17 Progress

创新实践第一次答辩v1.0.pdf (1.9 MB)


创新实践第一次报告.pdf (111.4 KB)
创新实践第一次答辩 3.30.pdf (1.9 MB)