Fan’s Reading List

Continuously updating…

01 TokenPose: Learning Keypoint Tokens for Human Pose Estimation

ICCV2021
Paper link

1. Introduction

1.1 Motivation

Since heatmap representation has become the standard label representation to encode the positions of keypoints, most existing models tend to use fully convolutional layers to maintain the 2D-structure of feature maps until the network output.

Nevertheless, there are usually no concrete variables abstracted by such CNN models to directly represent the keypoint entities, which limits the ability of the model to explicitly capture constraint relationships between parts.

Conventional CNN models usually do not abstract concrete variables that directly represent individual keypoints, so they cannot explicitly capture the constraint relationships between keypoints very well.

1.2 Contribution

  • We propose to use token to represent each keypoint entity. In this way, visual cue and constraint cue learning are explicitly incorporated into a unified framework.

    Tokens are used to explicitly represent keypoints (these are the keypoint tokens). Note: there are also visual tokens.

2. Method

2.1 Keypoint Representation

2.1.1 Visual tokens

Following an approach similar to ViT, the image is divided into patches. Each patch is flattened into a vector and mapped to a d-dimensional embedding through a linear projection. A position embedding is then added.

In this way, each visual token represents a specific area of the original image.

[visual] = {v1 + pe1, v2 + pe2, …, vL + peL}

(This mirrors ViT: split the image into patches, flatten each patch, apply a linear projection to d dimensions, then add the position embedding.)

2.1.2 Keypoint tokens

We prepend N learnable d-dimensional embedding vectors to represent N target keypoints.
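
To make the two token types concrete, here is a minimal PyTorch sketch of the token construction described above (my own illustration; sizes such as `patch=16`, `d=192`, `num_keypoints=17` are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class TokenBuilder(nn.Module):
    """Builds visual tokens from an image (or feature map) and prepends learnable
    keypoint tokens. Hypothetical sketch; all sizes are illustrative."""
    def __init__(self, in_ch=3, patch=16, d=192, num_patches=256, num_keypoints=17):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_ch * patch * patch, d)                   # linear projection of flattened patches
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, d))       # position embedding for visual tokens
        self.kpt_tokens = nn.Parameter(torch.randn(1, num_keypoints, d))  # N learnable keypoint tokens

    def forward(self, x):                                  # x: (B, C, H, W), with (H/patch)*(W/patch) == num_patches
        B, C, H, W = x.shape
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # flatten each patch
        visual = self.proj(patches) + self.pos_emb         # visual tokens with position embedding
        kpt = self.kpt_tokens.expand(B, -1, -1)
        return torch.cat([kpt, visual], dim=1)             # keypoint tokens prepended to visual tokens

tokens = TokenBuilder()(torch.randn(2, 3, 256, 256))       # -> (2, 17 + 256, 192)
```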

2.2 Architecture

[Figure: TokenPose architecture]

A CNN first extracts feature maps. Each feature map is split into patches; every patch is flattened and projected to a fixed dimension, and a position embedding is added to the resulting vector. These visual tokens are then fed into the Transformer together with the keypoint tokens. The keypoint tokens output by the last Transformer layer are used to generate the heatmaps.

3. Experiments

[Table]

Configurations of the different model sizes.

3.1 COCO

[Table]

Comparison results on the COCO val set. The numbers in red are the comparisons against SimpleBaseline-Res50.

3.2 Ablation Study

3.2.1 Keypoint token fusion

We propose to concatenate keypoint tokens outputted by different layers of the Transformer encoder correspondingly, namely ‘keypoint token fusion’, to help model training.

Concatenate the keypoint tokens output by the different layers of the Transformer.

Taking TokenPose-L+/D12 with 12 Transformer layers as an example, the keypoint tokens output by the 4th, 8th and 12th layers are concatenated correspondingly. The resulting keypoint tokens, which are three times longer, are then sent into the MLP head to obtain the final heatmaps.
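
A rough PyTorch sketch of how such fusion could be wired up (hypothetical; the fused layer indices (4, 8, 12), `d=384`, and the MLP head shape are illustrative assumptions rather than the released implementation):

```python
import torch
import torch.nn as nn

class KeypointTokenFusion(nn.Module):
    """Sketch of 'keypoint token fusion': concatenate the keypoint tokens produced
    by intermediate Transformer layers and predict heatmaps from the result."""
    def __init__(self, d=384, num_layers=12, fuse_layers=(4, 8, 12), heatmap_size=(64, 48)):
        super().__init__()
        self.fuse_layers = set(fuse_layers)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        h, w = heatmap_size
        self.mlp_head = nn.Linear(d * len(fuse_layers), h * w)   # one heatmap per keypoint token
        self.heatmap_size = heatmap_size

    def forward(self, tokens, num_keypoints):
        # tokens: (B, N + L, d), keypoint tokens first
        fused = []
        for i, layer in enumerate(self.layers, start=1):
            tokens = layer(tokens)
            if i in self.fuse_layers:
                fused.append(tokens[:, :num_keypoints])            # keep only the keypoint tokens
        fused = torch.cat(fused, dim=-1)                           # three-times-longer keypoint tokens
        h, w = self.heatmap_size
        return self.mlp_head(fused).view(-1, num_keypoints, h, w)  # (B, N, h, w) heatmaps
```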

[Table]

For TokenPose-L+/D12, keypoint token fusion improves the result by 0.2 AP. However, for small models such as TokenPose-S, fusion instead degrades performance.

3.2.2 Position embedding

[Table]

In particular, 2D sine position embedding performs better than learnable position embedding, which is as expected since the 2D spatial information is required for predicting heatmaps.
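
For reference, a small sketch of a 2D sine position embedding of the kind referred to here (a minimal version, assuming half of the channels encode the y coordinate and half the x coordinate; not necessarily the exact formulation used in the paper):

```python
import torch

def sine_position_embedding_2d(h, w, d):
    """2D sine position embedding for an h x w grid of visual tokens (d divisible by 4)."""
    assert d % 4 == 0
    d_half = d // 2
    freq = 1.0 / (10000 ** (torch.arange(0, d_half, 2).float() / d_half))     # (d/4,)
    y = torch.arange(h).float()[:, None] * freq                               # (h, d/4)
    x = torch.arange(w).float()[:, None] * freq                               # (w, d/4)
    pe_y = torch.cat([y.sin(), y.cos()], dim=-1)[:, None, :].expand(h, w, d_half)
    pe_x = torch.cat([x.sin(), x.cos()], dim=-1)[None, :, :].expand(h, w, d_half)
    return torch.cat([pe_y, pe_x], dim=-1).reshape(h * w, d)                  # (h*w, d), fixed, no parameters

pe = sine_position_embedding_2d(16, 12, 192)   # e.g. a 16 x 12 grid of patches, d = 192
```

Unlike a learnable embedding, this adds no trainable parameters and encodes the row/column structure of the patch grid explicitly.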

3.3 Visualization

[Figure]

The attention maps between keypoint tokens and visual tokens of different Transformer layers.

i) The deeper the layer, the more refined the attention becomes. ii) When inferring an occluded keypoint, the model can draw on information from other keypoints; for example, when inferring an occluded ankle(l), ankle(r) also receives a high attention score.

[Figure]

In the first few layers, each keypoint attends to almost all other keypoints to build global context. As the network gets deeper, each keypoint tends to rely mainly on a few parts to produce its final prediction.

Rethinking

  • The method mainly borrows the idea of ViT and adds keypoint tokens on top. The keypoint token is the highlight: it explicitly represents each keypoint and learns the constraint relationships between keypoints, which improves accuracy. In essence it exploits the Transformer's attention mechanism: first, it lets the keypoint tokens build connections with one another, which helps predict occluded keypoints; second, the keypoint tokens act like "pick-up codes" that go and extract further features from the image.
  • I notice that people like to explore explicit relationships. Turning an originally black-box operation into something people can better understand, combined with some advanced techniques to improve performance, already makes for good work.

Great job on the first reading list.

It could be even better with the following suggestions:

  1. The Rethinking part should point out the paper's shortcomings and the implications for our project, i.e. how to apply it in our solution; this is what deserves the most thought after reading a paper, so put this part at the beginning of the reading list;

  2. Highlight the fundamental differences (2-3 points) between this paper and other existing work:

Highlights:

a) Existing methods mostly use heatmaps to represent keypoints for model learning; this paper proposes a new keypoint representation: visual token + keypoint token;
b) It explores the applicable scope of TokenPose, which works for large models, because: xxxx
c) TokenPose can exploit the correlations between keypoints when predicting them, which is especially useful for detection under keypoint occlusion;

  3. Provide precise information about the paper, authors, and institution, for example, whether this is an ICCV'21 Oral or Poster paper. Also, give the GitHub link if there is one. If there is, did you successfully run the code? Are the results as good as the paper says? Can this code be used for further research?

  4. Why did you choose this paper, given that it is not the latest? Which papers from 2022 follow this work, and what fundamental innovations do they make in their solutions?

Why is the learnable position embedding worse?

How are the "Large" and "Small" models distinguished? Why does TokenPose not work well for all model sizes?

Could the fusion of the two kinds of tokens be improved further? Is the current approach too straightforward?

  1. Regarding the rethinking and highlights parts, thank you for the advice; I will improve them next time.
  2. As for the details of the paper's information, I will also add them as far as possible next time. For the parts of a paper I do not understand, I should also pull the code and see how it runs, and accumulate more coding patterns along the way; I will improve this next time as well.
  3. I chose this paper because a paper I read earlier cited it; I found I had not read it, so I saved it. In the future I will pick more recent papers. Since 2022, 74 works have followed this paper. I have not read much about their solutions yet, but I will gradually follow up.

On the sine method versus the learnable method: I think that, given enough training data, the learnable method would surpass the sine method, but the sine method still has many advantages and works very well. In practice both methods should be tried and the better one chosen.

[Table]

As shown above, the distinction between large and small should lie in the backbone.

[Table]

From this table we can see that the small models use a CNN without pretraining, while the large models use a pretrained CNN.

[Table]

In the table above, TokenPose-L+/D12 is a large model with 12 decoder layers, an embedding size of 384, and a pretrained CNN.
I think the difference comes down to the backbone: a good backbone extracts better features that fusion can exploit, whereas a weaker backbone extracts features that are less reliable and may even be misleading, making the results worse.

From the paper's description, the visual tokens and the keypoint tokens are fed into the Transformer layers together, and from the Transformer layer diagram we can see that each layer mainly consists of Multi-Head Attention and an FFN.
I think it might work better to first run self-attention on the visual tokens and the keypoint tokens separately, and then apply cross-attention between them, as sketched below.
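
A minimal PyTorch sketch of that proposed variant (my own illustration of the idea, not anything from the paper):

```python
import torch.nn as nn

class SelfThenCrossBlock(nn.Module):
    """Variant proposed above: self-attention within each token group first,
    then cross-attention from keypoint tokens to visual tokens."""
    def __init__(self, d=192, nhead=8):
        super().__init__()
        self.self_vis = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.self_kpt = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.cross = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, kpt, vis):                          # kpt: (B, N, d), vis: (B, L, d)
        vis = vis + self.self_vis(vis, vis, vis)[0]       # self-attention over visual tokens
        kpt = kpt + self.self_kpt(kpt, kpt, kpt)[0]       # self-attention over keypoint tokens
        kpt = kpt + self.cross(kpt, vis, vis)[0]          # keypoint tokens query the visual tokens
        kpt = kpt + self.ffn(kpt)
        return kpt, vis
```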

When choosing papers, consider three aspects: first, recency (very latest work); second, the citation count (more than 10); third, the number of stars on the provided GitHub code (a few dozen). In particular, if the GitHub repo has many followers, that basically means we can reuse the code well.

What exactly are the advantages of the sine method? Is the learnable method's weaker performance related to the burden of learning the extra parameters during training?

Checked, well explained.

02 Pose for Everything: Towards Category-Agnostic Pose Estimation

ECCV2022 Oral

GitHub

1. Rethinking

  • This paper introduces a new research task, Category-Agnostic Pose Estimation (CAPE), and proposes its own solution, which improves considerably over the baselines. In my opinion, the shortcomings are:
    • Performance on the vehicle and furniture categories is still relatively poor.
    • The 5-shot setting improves only slightly over 1-shot; the current 5-shot processing simply averages the supports, and other aggregation schemes might exploit the 5-shot information better.
  • Implications for our project:
    • With the method of this work, the scope of APE estimation could be extended further, from mammals to all animals.
    • Different categories have different keypoints; how this work handles the keypoints and incorporates them into training is worth learning from.

2. Introduction

2.1 Motivation

In this paper, we introduce an important yet challenging task, termed Category-Agnostic Pose Estimation (CAPE).

CAPE aims at using a single model for detecting poses of any category.

Input: a support image of a novel category, the corresponding keypoint definition and the query image

Output: the class-agnostic pose estimator predicts the pose of the query image

Conventional methods generally use one model to solve pose estimation for a single category. The method proposed here can handle pose estimation for multiple categories, even categories that do not appear in the training set.

[Figure]

2.2 Challenge

  • Most pose estimation approaches treat it as a supervised regression task, requiring thousands of labeled images to learn to map an input image to keypoint locations.

  • Different objects may have different keypoint definition and unknown number of keypoints. It is non-trivial to learn the unique output representations and utilize the structural information.

  • There are few to none large-scale pose estimation datasets with many visual categories for the development of a general pose estimation method. Previous datasets mostly consist of only one category (e.g. human body).

2.3 Contribution

  • We introduce an important yet challenging task termed Category-Agnostic Pose Estimation (CAPE). CAPE requires the model to predict the poses of any objects given a few support images with keypoint definition.

  • We propose the novel CAPE framework, namely POse Matching Network (POMNet), and formulate the keypoint detection task as a matching problem. Keypoint Interaction Module (KIM) is proposed to capture both the keypoint-level relationship and the support-query relationship.

  • We build the first large-scale multi-(super-)category dataset for the task of CAPE, termed Multi-category Pose (MP-100), to boost the related research.

3. Class-Agnostic Pose Estimation (CAPE)

3.1 Problem Definition

In this sense, CAPE task can be viewed as a K-shot pose estimation problem. Especially, when K = 1, it is one-shot pose estimation.

3.2 POse Matching Network (POMNet)

[Figure: POMNet architecture]

3.2.1 Feature Extractor

[Figure]

3.2.2 Keypoint Interaction Module (KIM)

As the keypoint numbers of different categories are different, several dummy features with padding mask are added at the end to keep a fixed number L of input features (L = 100 in our implementation), which enables KIM to adapt to various keypoint numbers. KIM has three transformer blocks, each of which consists of two major components, i.e. Self-Attn. and Cross-Attn.

The keypoint features are input as query, and the flattened query image features are input as the key and the value.

3.2.3 Matching Head (MH)

[Figures]

4. Multi-category Pose (MP-100) Dataset

Over 18K images and 20K annotations are collected from several popular 2D pose datasets, including COCO, 300W, AFLW, OneHand10K, DeepFashion2, AP-10K, MacaquePose, Vinegar Fly, Desert Locust, CUB-200, CarFusion, AnimalWeb, and Keypoint-5. Keypoint numbers are diverse across different categories, ranging from 8 to 68.

We split the collected 100 categories into train/val/test sets (70 for train, 10 for val, and 20 for test). Following the common settings, we form five splits whose test sets are non-overlapping and evaluate the average model performance on the five splits.
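
A small sketch of how such splits could be formed (my own illustration of the 70/10/20 protocol with five non-overlapping test sets; the helper and the seed are hypothetical):

```python
import random

def make_mp100_splits(categories, num_splits=5, n_val=10, n_test=20, seed=0):
    """Each split holds out a different block of 20 categories for testing."""
    cats = list(categories)
    random.Random(seed).shuffle(cats)
    splits = []
    for i in range(num_splits):
        test = cats[i * n_test:(i + 1) * n_test]           # non-overlapping across splits
        rest = [c for c in cats if c not in test]          # the remaining 80 categories
        splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
    return splits

splits = make_mp100_splits(range(100))                      # 70 train / 10 val / 20 test per split
```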

[Figures]

5. Experiments

5.1 Benchmark Results on MP-100 Dataset

[Table]

The 5-shot setting does not improve much over 1-shot.

[Table]

Train on seven super-categories, test on one super-category.

All methods perform poorly on the vehicle and furniture super-categories. This is probably because these categories are very different from the training categories, so the extracted features are not discriminative enough. Vehicles have many invisible keypoints, and furniture images show large intra-class variation, which makes these two super-categories challenging.

5.2 Ablation Study

[Table: ablation study]

Comparing rows 1 and 5: KIM significantly improves the performance of the CAPE model (by 13.2%).

Comparing rows 3 and 4: Self-Attn. matters more than Cross-Attn.

Comparing rows 1 and 4: passing messages between keypoints via self-attention greatly benefits keypoint localization.

Comparing rows 2 and 5: this shows the necessity of MH.

6. Learned from the code

I will update here after I debug the code.

How is this implemented at the code level? For example, the keypoints of any category are padded to 100; at test time, how is it ensured that the features of the first n keypoints are effective representations?

And how could this design be used in our project?

Here you should clearly mark which datasets are animal-body related and which are not.

03 CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

arXiv

GitHub

Wenyi Hong, Ming Ding

Tsinghua University

1. Rethinking

  • This is the first open-source work on text-to-video generation based on a pretrained transformer, which is a nice contribution. It also cleverly reuses a pretrained text-to-image model, saving a large amount of computation. As for the shortcomings, I think:
    • Although, judged by humans, this work outperforms all the compared methods, its quantitative results are not particularly outstanding.
    • The model is very large, which means the input sequence cannot be very long.
  • Implications:

2. Introduction

2.1 Previous & Present Work

Previous:

  • Most previous works focus on the next-frame prediction task — forecasting the future frames based on the first video frame. (CNNs or RNNs) e.g. CDNA(2016) and PredRNN(2017)

    • These deterministic models are unable to capture the stochastic temporal patterns and synthesize coherent complex scenes.
  • GANs, begin to dominate the area as they can perform unconditional or class-conditional video synthesis without the first frames. e.g. VGAN(2016), TGAN(2017), MoCoGAN(2017), DIGAN(2022)

  • The framework of VQVAE and autoregressive transformers quickly becomes the mainstream method.

  • Video Diffusion Models

The method used in this work is autoregressive transformers.

Present:

We present a large-scale pretrained text-to-video generative model, CogVideo, which is of 9.4 billion parameters and trained on 5.4 million text-video pairs. We build CogVideo based on a pretrained text-to-image model, CogView2, in order to inherit the knowledge learned from the text-image pretraining.

2.2 Contribution

  • We present CogVideo, which is the largest and the first open-source pretrained transformer for text-to-video generation in the general domain.
  • CogVideo elegantly and efficiently finetunes a pretrained text-to-image generative model for text-to-video generation, avoiding the expensive full pretraining from scratch.
  • We propose the multi-frame-rate hierarchical training to better align text-clip pairs, which significantly improves the generation accuracy, in particular for movements of complex semantics.

3. Method

[Figure]

Multi-frame-rate hierarchical generation framework in CogVideo. Input sequence includes frame rate, text, frame tokens. [B] (Begin-of-image) is a separator token, inherited from CogView2. In stage 1, Ts frames are generated sequentially on condition of frame rate and text. Then in stage 2, generated frames are re-input as bidirectional attention regions to recursively interpolate frames. Frame rate can be adjusted during both stages. Bidirectional attention regions are highlighted in blue, and unidirectional regions are highlighted in green.

3.1 Multi-frame-rate Hierarchical Training

[Figure]

Ts keyframes are generated sequentially, conditioned on the frame rate and the text. The input order is [{frame rate}{text}[B]{frame 1}…{frame Ts}]. In practice, Ts = 5 and the minimum sampling frame rate is 1 frame per second.

The frames are then interpolated. In each round of interpolation, the generated frames are split into multiple chunks that overlap at their start and end frames, of length ⌈2.5⌉ = 3, and one frame is interpolated between every pair of consecutive frames in each chunk; within a chunk, frames 2 and 4 are generated autoregressively. By recursively halving the frame rate, increasingly fine interpolation can be performed to generate a video with many frames.
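
A quick sanity check of how the frame count grows, assuming each interpolation round inserts one frame between every pair of consecutive frames (my own illustration):

```python
def frames_after_rounds(keyframes: int, rounds: int) -> int:
    """Each round turns n frames into 2n - 1 by inserting one frame per gap."""
    n = keyframes
    for _ in range(rounds):
        n = 2 * n - 1
    return n

print([frames_after_rounds(5, r) for r in range(4)])  # [5, 9, 17, 33]
```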

The effect of CogLM. Tasks such as frame interpolation rely heavily on bidirectional information. However, most previous works use GPT, which is unidirectional. CogLM unites bidirectional context-aware mask prediction and autoregressive generation by dividing tokens into unidirectional and bidirectional attention regions. While bidirectional regions can attend to all bidirectional regions, unidirectional regions can attend to all bidirectional regions and previous unidirectional regions.

The Cross-Modal General Language Model (CogLM) comes from CogView2 (a text-to-image work).

Frame interpolation relies heavily on bidirectional information, whereas most previous work uses only unidirectional information. To exploit bidirectional information, CogLM is used: it divides the tokens into unidirectional and bidirectional attention regions, combining bidirectional context-aware masked prediction with autoregressive generation.

Bidirectional regions can attend to all bidirectional regions, while unidirectional regions can attend to all bidirectional regions and to the preceding unidirectional regions.

All frames in stage 1, and frames 2 and 4 in stage 2, are unidirectional regions. The frame rate, the text, and all other frames are bidirectional regions. A sketch of such a mask is given below.
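
A small PyTorch sketch of such an attention mask, following the rule stated above (my own illustration; `coglm_attention_mask` is a hypothetical helper, not taken from the released code):

```python
import torch

def coglm_attention_mask(is_unidirectional: torch.Tensor) -> torch.Tensor:
    """is_unidirectional: (S,) bool, True for tokens in unidirectional regions.
    Returns an (S, S) bool mask where mask[i, j] = True means token i may attend to token j."""
    S = is_unidirectional.numel()
    idx = torch.arange(S)
    bidir = ~is_unidirectional
    mask = bidir.unsqueeze(0).expand(S, S).clone()               # everyone attends to bidirectional tokens
    uni_pair = is_unidirectional.unsqueeze(0) & is_unidirectional.unsqueeze(1)
    mask |= uni_pair & (idx.unsqueeze(1) >= idx.unsqueeze(0))    # unidirectional: also earlier unidirectional tokens
    return mask

# toy example: 3 bidirectional tokens (frame rate, text, [B]) followed by 4 unidirectional frame tokens
flags = torch.tensor([False, False, False, True, True, True, True])
print(coglm_attention_mask(flags).int())
```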

3.2 Dual-channel Attention

image

In this paper, we propose to leverage pretrained image generation models instead of image data.

For training a video-only generation model, adding image data in the large-scale pretraining setting would significantly increase the training cost.

The proposed technique is dual-channel attention, where we only add a new spatial-temporal attention channel to the pretrained CogView2 at each transformer layer. All the parameters in the CogView2 are frozen in the training, and only the parameters in the newly added attention layer (attention-plus) are trainable.

Moreover, directly finetuning CogView2 for text-to-video generation would not inherit its knowledge well, because temporal attention follows a different attention pattern and, with its large gradients, would quickly ruin the pretrained weights in the initial phase of training.

[Formula: the outputs of the frozen attention channel and the new attention-plus channel are combined, weighted by α]

α = sigmoid(a), a is a learnable parameter.
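
A minimal sketch of how such a dual-channel layer could look (my reading of the description; the gating form (1 − α)·base + α·plus is an assumption inferred from α = sigmoid(a), not confirmed against the paper's formula, and the sizes here are scaled down from the paper's hidden size of 3,072 with 48 heads):

```python
import torch
import torch.nn as nn

class DualChannelAttention(nn.Module):
    """Frozen pretrained (spatial) attention channel plus a trainable
    spatial-temporal 'attention-plus' channel, mixed by a learnable sigmoid gate."""
    def __init__(self, pretrained_attn: nn.Module, d=256, nhead=8):
        super().__init__()
        self.base = pretrained_attn                       # frozen CogView2 attention channel
        for p in self.base.parameters():
            p.requires_grad = False
        self.plus = nn.MultiheadAttention(d, nhead, batch_first=True)  # new trainable channel
        self.a = nn.Parameter(torch.zeros(1))             # alpha = sigmoid(a), starts at 0.5

    def forward(self, x):                                 # x: (B, S, d)
        alpha = torch.sigmoid(self.a)
        base_out = self.base(x)                           # pretrained spatial attention
        plus_out = self.plus(x, x, x)[0]                  # spatial-temporal attention-plus
        return (1 - alpha) * base_out + alpha * plus_out  # assumed mixing form

# toy usage with a stand-in for the frozen pretrained channel
layer = DualChannelAttention(nn.Linear(256, 256))
out = layer(torch.randn(2, 10, 256))
```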

3.3 Shifted Window Attention in Auto-regressive Generation

The original Swin attention is only applied to non-autoregressive scenarios, we extend it to the autoregressive and temporal scenario by applying an auto-regressive attention mask in the shifted windows.

(I.e., the Swin attention, originally used only in non-autoregressive settings, is extended to the autoregressive and temporal setting by applying an autoregressive attention mask within the shifted windows.)

An interesting finding is that the Swin attention provides a chance for parallel generation in faraway regions of different frames, which further accelerates the auto-regressive generation.

[Figure]

An example of 3D autoregressive Swin attention (window size 2 × 2). The tokens in the red box can only attend, directly or indirectly, to the yellow or green tokens. The gray tokens in frame i and the tokens in the red box can be generated in parallel.

Suppose X,Y is the height and width of each frame, and Ax,Ay are the height and width of shifted window. For two tokens at (t1, x1, y1) and (t2, x2, y2), t1 < t2, the latter cannot attend to the former either directly or indirectly if

[Formula: the condition, stated in terms of X, Y, Ax and Ay, under which the later token cannot attend to the earlier one]

Tokens at different positions in different frames that satisfy this condition can be generated simultaneously, which greatly speeds up inference.

4. Experiments

4.1 Setting

Model

The backbone of CogVideo in both stages is a Transformer with dual-channel attention. The Transformer has 48 layers, with a hidden size of 3,072 in each attention channel, 48 attention heads and 9.4 billion parameters in total.

Dataset

We pretrain our model on a dataset of 5.4 million captioned videos with a spatial resolution of 160 × 160 (can be upsampled to 480 × 480 by CogView2).

Pretraining

The sequence lengths in both stages are 2,065, consisting of 64 text tokens, 5 (frames) × 400 (tokens per frame) image tokens, and 1 separator token.
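
(Sanity check: 64 + 5 × 400 + 1 = 2,065 tokens.)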

4.2 Evaluation

Two popular benchmarks for video generation: UCF101 and Kinetics-600

UCF-101 is a human action dataset consisting of 13,320 videos annotated with 101 action classes.

Kinetics-600 contains 600 classes of human action videos, with roughly 350,000 train and 50,000 test videos in total.

Metrics

  • Fréchet Video Distance (FVD) ↓

    FVD is calculated based on I3D model trained on Kinetics-400.

  • Inception score (IS) ↑

    IS is based on C3D model which was first trained on the Sports-1M dataset and then finetuned on the UCF101 dataset.

4.3 Results

[Table]

Judging from these numbers, CogVideo's results are middling.

[Figure]

Human evaluation results. From the human raters' perspective, the videos generated by CogVideo are generally judged to be better.

4.4 Ablation Study

Hierarchical multi-frame-rate generation

In comparison with CogVideo, we finetune a 1-stage video generation model on Kinetics-600 from the sequential generation model in CogVideo, which generates long videos by sliding windows. In each window, we generate the rest frames based on N_overlap previous known frames. Larger N_overlap means more previous frames can be utilized during the inference, but will increase time overhead.

As mentioned in Section 3.1, different frame chunks have overlapping regions and interpolation is performed over them; enlarging the overlap allows more previous frames to be used, but obviously increases the computational cost.

Dual-channel attention with CogView2’s weights.

To highlight the effectiveness of our finetuning strategy, we additionally finetune

(1) a randomly initialized model

(2) a model incorporating CogView2’s weights but leaving the temporal channel unfixed (equivalent to CogVideo without pretraining on videos) on Kinetics-600.

Here, to show that the adopted finetuning strategy is the right one, two additional comparison experiments are added: one model is randomly initialized, and the other freezes only part of the model, leaving the temporal channel unfrozen.

[Table]

Add the year, and number the reading list entries, e.g. 03. CogVideo: xxx.

How much training data was used for CogView2 here? How many text-image pairs were used?

Is the main role of the pretrained model to provide more frame candidates? How large is the difference between [B] and the input frames?