Zelin Reading List

Starting from 2/11/2023, a weekly updated reading list

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

Publication: ECCV 2022

Institution: Tencent ARC Lab


Recent works have tried to exploit a facial reference prior (e.g., DFDNet) to cope with the challenging Blind Face Restoration (BFR) task. However, current solutions focus on only a few facial components, which motivates the author to explore incorporating a Vector-Quantized (VQ) codebook into the method design.

The key challenges are:

  1. A proper patch size should be decided to construct the codebook.

  2. The fusion of encoder feature and codebook feature.


  1. Propose and analyze the use of VQ codebook in face restoration.

  2. Architecture design

  3. Performance gain


Brief intro:

VQGAN is a CNN-based auto-encoder that learns a codebook to quantize its latent representations; its input and output are both HR images. The author first trains a VQGAN and then fixes the codebook. By adding an extra decoder and feature fusion modules, and then tuning on LR-HR image pairs, the author turns VQGAN into a model that restores LR images.
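The quantization step can be illustrated with a minimal sketch: each latent vector is replaced by its nearest codebook entry. The function name `quantize`, the squared-L2 distance, and the toy 2-D values are assumptions for illustration, not the paper's implementation.

```python
def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry (squared L2)."""
    indices = []
    quantized = []
    for z in latents:
        # Squared Euclidean distance from z to every code in the codebook.
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]
        k = dists.index(min(dists))
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]
indices, quantized = quantize(latents, codebook)
print(indices)  # [1, 0, 2]
```

Because the decoder only ever sees codebook entries, a codebook learned from HR faces constrains what the decoder can output, which is the property the restoration model relies on.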

Parallel decoder:

The encoder features are adjusted using the features of an extra CNN-based decoder and fed into the VQGAN decoder by concatenation. The extra decoder takes the same input as the VQGAN decoder (the codebook features). The adjustment is performed by a texture warping module.

Texture Warping Module:


decoder and encoder features

two steps:

  1. concatenation

  2. deformable convolution

Model Objectives:

A combination of VQGAN loss and the common losses for image restoration.


Patch Size Analysis

Obs1: Degradations in LQ faces can be removed by VQ codebooks trained only with HQ faces, when we adopt a proper compression patch size f.

VQGAN is used directly to restore LR images. The author found that a VQGAN trained only on HR images, with a suitable patch size, can restore LR images!

Patch size too small: restoration fails

Patch size too big: identity changes

Obs2: When training for the restoration task, there is also a trade-off between improved detailed textures and fidelity changes.

Fix the codebook and tune VQGAN by LR-HR image pairs.

The influence of the patch size still exists. Based on the restored images, the author selects patch size = 32, which they consider the most suitable.
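To make the patch size concrete, the sketch below counts how many codebook entries describe one image for a given compression patch size f. The 512x512 input size and the helper name `code_grid` are assumptions for illustration; the notes do not state the resolution.

```python
def code_grid(height, width, f):
    """Rows and columns of codebook entries for compression patch size f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return height // f, width // f


# Assumed 512x512 input; each code covers an f x f pixel patch.
for f in (8, 16, 32, 64):
    rows, cols = code_grid(512, 512, f)
    print(f"f={f}: {rows} x {cols} = {rows * cols} codes")
```

Under this assumption f = 32 gives a 16 x 16 grid of 256 codes. A smaller f means many small patches, so degradations can survive quantization; a larger f means each code covers a large face region, so swapping a code is more likely to alter identity.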

Performance Comparisons

Trained on the whole FFHQ dataset and tested on CelebA-Test and three real-world datasets: LFW-Test, CelebChild-Test and WebPhoto-Test. Methods are evaluated by FID, NIQE, LPIPS, PSNR, SSIM, identity (ArcFace), and shape (landmark).
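Of the metrics listed, PSNR has a simple closed form, so a small sketch is given as a reminder of what it measures. It uses the standard definition 10 * log10(MAX^2 / MSE); flat pixel lists and the 8-bit peak value of 255 are assumptions for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(round(psnr([100, 100], [110, 90]), 2))  # 28.13
```

PSNR only rewards pixel-wise agreement, which is why a blurry but well-aligned result can score higher than a sharp, realistic one.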

Baselines (source of model weights is unknown): HiFaceGAN, DFDNet, PSFRGAN, mGANprior, PULSE, GFPGAN

Ablation Studies

Three designs are involved in the ablation study:

  1. Fusion of encoder feature and decoder feature

  2. Parallel decoder

  3. Texture warping module

Fusing the encoder and decoder features improves restoration fidelity but sacrifices realness. An extra decoder alleviates the loss of realness at the cost of reduced fidelity. The texture warping module increases both realness and fidelity.

While simply fusing the encoder and decoder features achieves the best restoration performance, the current design is better when both realness and fidelity are considered.

Critical Thoughts

  1. The selection of patch size is a trivial design point. However, the author devotes a lot of space to introducing it and analyzing it through observation experiments. This effectively completes the storyline of the research and makes it reasonable.

  2. The parallel decoder and texture warping module increase the design complexity, and their motivation is logical (increasing realness). Although this actually reduces the restoration fidelity, it shows the author’s thinking on architecture design, which makes the work more complete.

  3. The reliance on codebooks can actually decrease robustness to the data prior, which is a drawback that I think we can explore.

Great job on the first reading list.

It could be even better with the following suggestions:

  1. Put the important information at the very beginning, for example, ‘critical thinking’ could go first.

The reliance on codebooks can actually decrease robustness to the data prior, which is a drawback that I think we can explore. This can be extended!

  1. Use 2-3 sentences to highlight why the proposed method differs from the existing works.


a) Existing FR methods use either a shape prior or a GAN prior; the proposed method is the first to use the vector-quantized codebook (a new prior) to enhance FR. (Novel idea)

b) Incorporating the vector-quantized codebook for FR is not straightforward. Discuss when VQFR would work and how to make it work. (More like a rethinking paper, which will lower expectations for good performance and arouse the interest of the reviewers)

c) Focus on both image quality (metrics: LPIPS, FID, NIQE, PSNR, SSIM) and fidelity (metrics: Degree, Landmark Distance) for performance evaluation.

  1. List the weaknesses; these could be used for the related work of our future paper.

  2. Attach the necessary figure as a good reminder

  3. Provide precise information about the paper, for example, this is an ECCV’22 Oral paper, and the corresponding author Mingming Cheng is a senior expert in CV. Also, give the GitHub link if there is any. If there is, did you successfully run the code? Are the results as good as the paper says? Can this code be used for further research?

  4. Correct typos: Institusion → Institution, alleviate → alleviates, etc.

To read:

    Posted on arXiv; apparently the first work to use a diffusion model for face image restoration.

  2. Improving Person Re-identification by Attribute and Identity Learning.
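    CVPR 2019. I hope to learn which attributes are commonly used in person re-ID and how they are used. So far it appears to involve a confidence mechanism between attributes, which may be worth learning from.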

  3. Does face restoration improve face verification?
    A journal paper from 2021; the title is very appealing.

  4. Jointly De-Biasing Face Recognition and Demographic Attribute Estimation
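    ECCV 2020. Studies how to disentangle the face recognition and face attribute recognition tasks. I hope to understand how identity information and attribute information are defined and how they are distinguished.

Paper 1 can be read as the seed paper, and the reading list can be built around it; I will read it too.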


Zongsheng Yue(research fellow@mmlab) & Chen Change Loy(director of mmlab@NTU, a senior expert in SR field)

S-Lab, Nanyang Technological University

Github: https://github.com/zsyOAOA/DifFace

Critical Thinking

  1. The two derivations are the same. They both conclude that as the time steps increase, the error between predicted $x_N$ and real $x_N$ will reduce. However, it is easy to come to this conclusion intuitively. Because the diffusion process is a process of losing information in essence, it is unsurprising that two images will become similar as they are degraded in the same manner.

A more useful feature should reduce the difference between predicted $x_0$ and real $x_0$.

  1. The essence of this work is image post-processing to remove the over-smoothing problem. I am surprised by its performance and plan to reproduce this work.

  2. The superiority of the work is that it can restore the severely degraded image. This capability may come from two aspects:

    1). The data pre-processing

    2). The use of the diffusion model

    I wonder about its base SR model performance on severely degraded images(32x)(an over-smooth face? fails?). If it is an over-smooth face, can it work in more difficult situations (e.g., 64x)? If it fails, will the restoration result make sense?

  3. The effect of the time step on PSNR and SSIM is ignored. Why?

  4. The performance comparison experiment has a similar problem to our work. Is it fair?


Existing Blind Face Restoration(BFR) methods suffer from two limitations:

1). Can’t restore severely degraded images.

2). Need a combination of loss functions, which increases the difficulty of tuning models.


The author proposes DIFFACE, which brings a pretrained diffusion model into the BFR task. It has two advantages and one drawback:

1). The framework lets any SR model incorporate a pretrained diffusion model, and it needs only an L2 loss (because the diffusion model solves the over-smoothing problem).

2). The balance between realness and fidelity can be adjusted by the number of steps of the Markov chain.


1). The inference speed of DIFFACE is much slower than that of end-to-end models, and speeding up inference comes at the cost of realness.


Preliminary(diffusion model principle)

The diffusion model consists of two periods, i.e., diffusion and reconstruction.

If we define $x_0$ as the high-quality (HQ) image, the diffusion model first degrades it, step by step, into a sample that follows an isotropic Gaussian distribution (a Markov chain with transition kernel $q(x_t|x_{t-1})$). [diffusion]

The number of time steps and the variance schedule are set before training, and the diffusion model learns the reverse of the transition kernel $q(x_t|x_{t-1})$ so that it can produce an HQ image from a Gaussian sample. [reconstruction]
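The diffusion phase can be sketched in a few lines, assuming the standard DDPM formulation in which $x_t$ can be sampled from $x_0$ in closed form. The linear variance schedule, its endpoints, T = 1000, and all function names are assumptions for illustration; the notes do not record the schedule used in the paper.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (an assumed, commonly used choice)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); t is 1-indexed."""
    a = abar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]          # toy "image" with three pixels
x_400 = q_sample(x0, 400, abar, rng)
print(abar[399], abar[-1])     # signal weight shrinks toward 0 as t grows
```

As t grows, `abar[t - 1]` decays toward zero, so the sample keeps less of $x_0$ and approaches pure Gaussian noise.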

How does DIFFACE work?

The essence of DIFFACE is that it first uses a base SR model, e.g., SRCNN, to estimate $x_0$ from a low-quality (LQ) image $y_0$. Then, it uses the estimate $f(y_0;w)$ rather than $x_0$ to “diffuse” to $x_N$ ($w$ represents the model parameters, $N$ is the time step). The figure is a little misleading, since it hides the estimation result from the base SR model:


The interesting point is that the diffusion model is partially used.
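The control flow described above can be sketched as follows. This is only a structural outline: `base_sr`, `diffuse_to`, and `reverse_step` are hypothetical callables standing in for the base SR model, the forward sampling to step N, and one learned reverse step, and the toy lambdas at the bottom are deterministic placeholders rather than real models.

```python
def difface_restore(y0, base_sr, diffuse_to, reverse_step, N):
    """Estimate x_0, diffuse the estimate to step N, then denoise back to step 0."""
    x0_hat = base_sr(y0)              # f(y_0; w), the base SR estimate
    if N == 0:
        return x0_hat                 # no diffusion: plain base SR output
    x = diffuse_to(x0_hat, N)         # x_N is built from the estimate, not the true x_0
    for t in range(N, 0, -1):         # only N reverse steps are run, not the full chain
        x = reverse_step(x, t)
    return x


# Deterministic toy placeholders, used only to exercise the control flow.
calls = []
toy_sr = lambda y: [2.0 * v for v in y]
toy_diffuse = lambda x, n: [v + n for v in x]
toy_reverse = lambda x, t: (calls.append(t), [v - 1.0 for v in x])[1]

print(difface_restore([1.0, 2.0], toy_sr, toy_diffuse, toy_reverse, N=3))  # [2.0, 4.0]
print(calls)  # [3, 2, 1]
```

The loop makes the "partially used" point explicit: the reverse chain starts at step N, so the steps between N and the end of the full chain are never executed.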


Although the method is simple, the author offers two derivations of the framework features which increase the theoretical value of the work:


: The trade-off between fidelity and realness can be adjusted by the time step $N$.

Since if $p(x_N|y_0)\equiv p(x_N|x_0)$, the final restoration result will be optimal, the author uses the KL divergence to evaluate the difference between the two distributions and obtain:
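A sketch of that calculation, assuming the standard DDPM forward marginal and writing $\bar\alpha_N = \prod_{t=1}^{N}(1-\beta_t)$ for the cumulative signal weight (this notation and the closed form below are assumptions derived from that setup, not copied from the paper). Both distributions are Gaussians with the same covariance and differ only in their means:

$$
p(x_N \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_N}\,x_0,\ (1-\bar\alpha_N)I\right), \qquad
p(x_N \mid y_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_N}\,f(y_0;w),\ (1-\bar\alpha_N)I\right)
$$

For two Gaussians with equal covariance $\sigma^2 I$, the KL divergence is the squared distance between the means divided by $2\sigma^2$, so

$$
D_{\mathrm{KL}}\!\left(p(x_N \mid y_0)\,\|\,p(x_N \mid x_0)\right)
= \frac{\left\|\sqrt{\bar\alpha_N}\,\big(f(y_0;w) - x_0\big)\right\|_2^2}{2(1-\bar\alpha_N)}
= \frac{\bar\alpha_N}{2(1-\bar\alpha_N)}\,\left\|x_0 - f(y_0;w)\right\|_2^2
$$

Since $\bar\alpha_N$ decreases monotonically in $N$, the coefficient $\bar\alpha_N / (2(1-\bar\alpha_N))$ shrinks as $N$ grows.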


It shows that the error decreases as the time step $N$ increases, which can produce a more realistic result.

The author also claims that more noise is added to the process as the time step increases, so the fidelity is affected. (N↑, realness↑, fidelity↓)


The error of predicted $x_N$ contracts by a factor less than 1.
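A quick numeric check of this contraction, assuming the standard DDPM forward process, where the gap between the means of $x_N$ is $\sqrt{\bar\alpha_N}$ times the gap between $x_0$ and $f(y_0;w)$. The linear schedule, its endpoints, and T = 1000 are assumptions for illustration, so the printed values are indicative only.

```python
import math


def alpha_bar(N, T=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha_bar_N = prod_{t=1..N} (1 - beta_t) under an assumed linear schedule."""
    if not 1 <= N <= T:
        raise ValueError("N must be in [1, T]")
    prod = 1.0
    for t in range(N):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * t / (T - 1))
    return prod


def mean_contraction(N):
    """Factor by which the error in the estimate of x_0 shrinks in the mean of x_N."""
    return math.sqrt(alpha_bar(N))


def kl_coefficient(N):
    """Coefficient multiplying ||x_0 - f(y_0; w)||^2 in the KL divergence."""
    a = alpha_bar(N)
    return a / (2.0 * (1.0 - a))


for N in (100, 200, 400, 800):
    print(N, round(mean_contraction(N), 4), round(kl_coefficient(N), 4))
```

Both quantities fall as N grows, which matches the stated trend: a larger N hides more of the base model's error, at the price of discarding more of the information in the estimate.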



The time step is set to 400

  1. Performance comparison

    Synthetic images: Select 700 images from CelebA-Test in each scale(4,8,16,24,32) (5x700). Use the complex degradation model to degrade them(Eq11).


The restoration results have good PSNR and SSIM. Since the comparison is on synthetic images involving degradations (noise, blur, …), it is unsurprising that they beat the bicubic result in PSNR and SSIM. Since the author admits that DifFace is slightly inferior to VQFR and GFPGAN under scale factors 4 and 8 (mild degradations), the performance superiority may come from the larger downsampling range ([1,8]->[0.8,32]) and the adjusted test image distribution. (Should we also adjust our test set? Is it fair?)

Real images: LFW-Test, WebPhotoTest, Widerface


  1. Effect of time steps


  1. Robust to a scale factor


  1. Varied sample seed


To read:
1. Image Super-Resolution Via Iterative Refinement
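TPAMI 2022; DifFace cites it, so I would like to read it.

Added: arXiv, Dec 2022

Good point. They should include both PSNR and SSIM effects in Figure 6. The main reason for removing PSNR and SSIM may be their inferior performance. You can refer to Table 4 in the supplementary to see that both PSNR and SSIM are worse than those of the baseline methods.

When looking at applications of diffusion models, focus on efficient diffusion models, since iterating step by step is too time-consuming. How to apply diffusion models in latent space is currently a popular research topic; the latent diffusion work from Germany at CVPR 2022, applied to image super-resolution, is a good example. This paper also makes a small, indirect efficiency improvement by restoring from step N rather than step T.

This paper has a major loophole: the method itself is actually designed for general image restoration, yet the whole paper discusses only the face domain. Its comparisons also exclude many excellent natural image restoration works. I especially doubt its effectiveness on general images!!!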



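The forward and reverse diffusion in the figure above gives the upper-bound performance of the whole SR pipeline. The estimate in the branch is effectively approximating the upper-left part.


  1. N=0: the framework degenerates to the base SR model;
  2. N=T: the base SR model is not used, because $x_T$ follows a Gaussian distribution and does not need to be approximated;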

Strictly speaking, it is unfair. The question is really about how much the SR images are affected by the training data processing.

Improving Person Re-identification by Attribute and Identity Learning

CVPR 2019

Yutian Lin, Yi Yang(senior expert).

the University of Technology Sydney

Critical Thoughts

  1. Age is divided into (child, teenager, adult, and old), which should be a good coarse-grained attribute input, and I intend to test it.
  2. The author mentioned that the attribute is local, while identity is global. The identity recognition task also improves the performance of the attribute classification task. So shall I unfreeze the classifier weights in my work🤔?
  3. The work is complete because the experiments are extensive. I think it is kind of similar to our “shape + identity” work. However, once the training cost is high, trying this kind of experiment seems hard, e.g., the balance between tasks.


Most existing re-ID methods only consider the identity labels of pedestrians.


  1. Manually labeled a set of pedestrian attributes for the Market-1501 dataset and the DukeMTMC-reID dataset.
  2. Incorporate attribute classification into identity recognition to accelerate it by reducing the gallery.
  3. Attribute reweight module


  1. the method is simple…(yet effective?)



Balance between tasks

A parameter is used to balance the loss for attribute and identity.
The author finds out that 0.9 id + 0.1 attributes is the best setting in the validation set.
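The weighting can be written as a one-line combination. In this sketch the function name `joint_loss` is hypothetical, and averaging the per-attribute losses before weighting is an assumption; the notes only record the 0.9 / 0.1 split.

```python
def joint_loss(id_loss, attr_losses, lam=0.9):
    """lam * identity loss + (1 - lam) * mean attribute loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if not attr_losses:
        raise ValueError("need at least one attribute loss")
    attr_mean = sum(attr_losses) / len(attr_losses)  # assumed: average over attributes
    return lam * id_loss + (1.0 - lam) * attr_mean


# Toy loss values: identity loss 2.0, two attribute losses 1.0 and 3.0.
print(joint_loss(2.0, [1.0, 3.0]))
```

With lam = 0.9 the identity term dominates, so the attribute branch acts as an auxiliary signal rather than an equal partner.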

Attribute reweight

A self-attention layer: Sigmoid(layer MLP(x))*(x)

Utilize the correlation between attributes. For instance, when the prediction
scores of “pink upper-body clothes” and “long hair” are very
high, the network may tend to up-weight the prediction scores
for the attribute “female.”
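The gating formula Sigmoid(MLP(x)) * x can be sketched as below. A single linear layer stands in for the MLP, and the weights, bias, and attribute order are made-up illustrative values, not learned parameters from the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def reweight(scores, weights, bias):
    """Gate each attribute score by sigmoid(W @ scores + b), elementwise."""
    gated = []
    for row, b, s in zip(weights, bias, scores):
        logit = sum(w * x for w, x in zip(row, scores)) + b
        gated.append(sigmoid(logit) * s)
    return gated


# Attribute order (assumed): pink upper-body clothes, long hair, female.
scores = [0.9, 0.8, 0.5]
weights = [
    [0.0, 0.0, 0.0],   # gate for "pink upper-body clothes" ignores other scores
    [0.0, 0.0, 0.0],   # gate for "long hair" ignores other scores
    [2.0, 2.0, 0.0],   # gate for "female" rises with the other two scores
]
bias = [0.0, 0.0, 0.0]
print(reweight(scores, weights, bias))
```

With zero weights a gate is exactly 0.5; the third row pushes the "female" gate well above 0.5 because the two correlated attributes score highly, which is the up-weighting effect described above.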

Accelerate identity recognition

A possible usage for classifying attributes.
Set the threshold for attribute classification(if attribute confidence> threshold, use the attribute to filter the dataset).
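A minimal sketch of this filtering step: only attributes predicted with confidence above the threshold are used, and gallery entries that disagree on any of them are dropped before identity matching. The data layout, the function name `filter_gallery`, and the threshold value 0.8 are hypothetical.

```python
def filter_gallery(query_attrs, gallery, threshold=0.8):
    """Keep gallery ids that agree with the query on every confident attribute."""
    confident = {
        name: label
        for name, (label, conf) in query_attrs.items()
        if conf > threshold
    }
    return [
        item_id
        for item_id, attrs in gallery
        if all(attrs.get(name) == label for name, label in confident.items())
    ]


# Query predictions as attribute -> (label, confidence); toy values.
query_attrs = {"gender": ("female", 0.95), "bag": ("backpack", 0.55)}
gallery = [
    ("g1", {"gender": "female", "bag": "none"}),
    ("g2", {"gender": "male", "bag": "backpack"}),
    ("g3", {"gender": "female", "bag": "backpack"}),
]
print(filter_gallery(query_attrs, gallery))  # ['g1', 'g3']
```

The low-confidence "bag" prediction is ignored, so g1 survives; a higher threshold keeps the filter conservative, since a wrong attribute prediction would remove the true match from the gallery.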



(Baseline1 : Only identity recognition, Baseline2 : Only attribute classification)

Performance Comparison


Ablation Studies

Effect of single attribute

Remove a single attribute from the dataset and observe its effect on recognition.

It is interesting to know that “The most influencing attribute of the two datasets
are bag types and the color of shoes, which lead to a rank-1
decrease of 2.14% and 1.49% on the two datasets.”

Effect of the Attribute Re-weighting Module

Without the ARM, performance drops on all three datasets.

Micro Results

Robustness of the learned representation in the Wild

Report results on the Market-1501+500k dataset. The 500k distractor dataset is composed of background images and a large number of irrelevant pedestrians.

As more images from the 500k distractor dataset are added, the author finds:

The method is robust…?(why does it drop faster?)

Improve Attribute Recognition

Next time, label each paper with its number; this one is 03. Improving xxx.

You could try. But I think it really depends on what attributes we are using. For example, can we say gender is a local region-related attribute? For identity, if we mention soft biometrics, then it should also be local region-related. We could recognize a person by a unique mole without using the global features.

Is any explanation given in the paper?