Mingjian's Reading List

This is a weekly-updated reading list, started on March 23, 2023.

01. Difficulty-Net: Learning to Predict Difficulty for Long-Tailed Recognition: WACV 2023

Authors: Saptarshi Sinha, Hiroki Ohashi
Corporation: Hitachi Ltd., Tokyo, Japan
Mail: saptarshi.sinha.hx@hitachi.com, hiroki.ohashi.uo@hitachi.com


Long-tailed recognition (LTR):
Among long-tailed datasets, a few classes (called head classes) have many more training samples than the rest (called tail classes). This can bias recognition models towards the head classes.
This is one of the reasons why many deep learning (DL) models suffer a performance drop when deployed in the real world: public datasets are usually balanced, while real-world data are generally long-tailed. The LTR research domain aims at addressing this issue.


Recognition models tend to get biased towards the head classes when trained on long-tailed datasets.


Cost-sensitive learning modifies the loss to learn a better model. A common approach is to multiply each sample's loss by a weight, and a common way to distribute the weights is by class frequency.
The authors' research team recently found that class difficulty works better than class frequency, and proposed several pre-determined quantifications of class difficulty.
In this work, they further propose a meta-learning method to learn the optimal quantification for different situations.
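As a toy illustration of the two weighting criteria mentioned above, here is a minimal sketch; the normalization and the CDB-style exponent tau are my own choices, not necessarily the papers' exact formulas:

```python
import numpy as np

def frequency_weights(class_counts):
    """Inverse class-frequency weights, normalized to average 1."""
    counts = np.asarray(class_counts, dtype=float)
    w = 1.0 / counts
    return w * len(w) / w.sum()

def difficulty_weights(class_accuracies, tau=1.0):
    """Difficulty-based weights in the spirit of CDB loss: w_c proportional to (1 - a_c)**tau."""
    a = np.asarray(class_accuracies, dtype=float)
    w = (1.0 - a) ** tau
    return w * len(w) / w.sum()

# toy long-tailed setup: the head class has 10x the samples of the tail class
counts = [1000, 300, 100]   # class frequencies
accs = [0.95, 0.80, 0.60]   # head classes are typically easier
print(frequency_weights(counts))
print(difficulty_weights(accs))
```

Both schemes give the tail/hard classes the largest weights here; they differ when frequency and difficulty disagree, which is exactly the case the papers discuss.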

This diagram is drawn on CIFAR-100-LT; the horizontal axis is the imbalance factor (size of the most frequent class / size of the least frequent class), the vertical axis is classification accuracy, and the line colors represent the four pre-determined quantifications and Difficulty-Net. Different imbalance factors correspond to different datasets, and no single pre-determined quantification is optimal across all of them.

Before this work, Meta-Weight-Net (MWN) was proposed, which predicts sample-level, absolute difficulties. Sample-level weighting is worse than class-level weighting: head classes contain more hard samples simply because they have more samples, so in total higher weights are assigned to head classes and the model tends to overfit to them. Absolute difficulties are also worse than relative difficulties: it is reasonable to assign a high difficulty score to a class with high accuracy (i.e., an easy class) if the other classes have even higher accuracies. This work therefore proposes Difficulty-Net to predict class-level, relative difficulties.
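To illustrate the relative-vs-absolute point with toy numbers (my own; the mean-normalization below is just one simple way to make difficulty relative, not necessarily the paper's formula):

```python
import numpy as np

accs = np.array([0.90, 0.95, 0.99])    # every class is "easy" in absolute terms

absolute = 1.0 - accs                  # small for all classes
relative = absolute / absolute.mean()  # each class judged against the others

# class 0 still stands out as the hardest once difficulty is made relative
print(absolute, relative)
```

With absolute difficulty, all three classes look easy and get similarly tiny scores; with relative difficulty, the 90%-accuracy class is clearly flagged as the one to focus on.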

Contributions (quoted from the paper)

• We propose Difficulty-Net, which learns to predict class-difficulty in a meta-learning framework.
• We argue that relative difficulty is more important and effective than absolute difficulty, and provide an empirical evidence for the argument.
• We propose a new loss function, called driver loss, that guides the learning process in a reasonable direction.
• We conducted extensive experiments on multiple long-tail benchmark datasets and achieved state-of-the-art results. In addition, we provide in-depth analysis on the effect and property of the proposed method in comparison to previous works, which revealed the effectiveness of our method.

Proposed method

Difficulty-Net is an MLP with 2 hidden layers; the input is the vector of class-wise accuracies computed on a validation set, and the output is the vector of difficulty scores. The hidden layer dimension H is defined by H = 2^n such that 2^(n−1) ≤ C < 2^n, where C is the number of classes. (This forum does not support markdown math input.)
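A minimal sketch of the hidden-dimension rule and a 2-hidden-layer forward pass; the ReLU activations and the lack of any output normalization are my assumptions, not details confirmed by the paper:

```python
import math
import numpy as np

def hidden_dim(num_classes: int) -> int:
    """H = 2**n with 2**(n-1) <= C < 2**n, i.e. the smallest power of two
    strictly greater than C. E.g. C=100 gives H=128."""
    n = math.ceil(math.log2(num_classes))
    if 2 ** n == num_classes:   # C itself a power of two: take the next one
        n += 1
    return 2 ** n

def difficulty_net_forward(accuracies, params):
    """Forward pass of a 2-hidden-layer MLP mapping the C class-wise
    accuracies to C difficulty scores (ReLU hiddens are an assumption)."""
    h = np.asarray(accuracies, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)   # hidden layers
    W, b = params[-1]
    return W @ h + b                     # linear output layer
```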
For each epoch, a three-way update (similar to Face2exp) is applied to modify the classifier's weights ϕ and Difficulty-Net's weights θ, as below.


w_i is the output of Difficulty-Net on the validation set.
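The three-way update can be illustrated on a toy scalar problem with hand-written derivatives; the quadratic losses, the sigmoid weight parameterization, and the learning rates below are all illustrative choices of mine, not the paper's:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

lr, meta_lr = 0.1, 0.5
phi = 0.0      # toy "classifier" parameter
theta = 0.0    # toy "Difficulty-Net" parameter

for _ in range(100):
    w = sigmoid(theta)                          # weight produced by the meta-net
    # step 1: pseudo-update of the classifier on the weighted train loss w*(phi-1)^2
    phi_hat = phi - lr * 2.0 * w * (phi - 1.0)
    # step 2: update the meta-net on the meta (validation) loss (phi_hat-2)^2,
    #         differentiating through the pseudo-update (chain rule by hand)
    dmeta_dphihat = 2.0 * (phi_hat - 2.0)
    dphihat_dtheta = -lr * 2.0 * (phi - 1.0) * w * (1.0 - w)
    theta -= meta_lr * dmeta_dphihat * dphihat_dtheta
    # step 3: real classifier update using the refreshed weight
    phi -= lr * 2.0 * sigmoid(theta) * (phi - 1.0)

print(phi, theta)
```

The key mechanic is step 2: the meta parameter is updated by back-propagating the meta loss through the *pseudo*-updated classifier, which is exactly what makes the weighting learnable.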


The driver loss is defined as below:
where â_c is defined as:
The meta loss is optimized as below:
Here the meta set reuses the validation set.



Experiment results

This work outperforms the previous SOTA methods.


The effect of driver loss

This diagram is drawn on CIFAR-100-LT with imbalance factor 100. Without the driver loss, the difficulty scores become nearly uniform and lose their meaning.

Difficulty-Net learns reasonable difficulty scores

This diagram is drawn on CIFAR-100-LT with imbalance factor 100, where E is defined as:
The minimum of E is 0, attained when d is uniformly distributed; as d becomes more and more uniform, E decreases. The authors explain this process as "the model's class-wise performance gradually gets balanced".

Compared with CDB-CE

This diagram is also drawn on CIFAR-100-LT; three classes are picked out, and their assigned weights during training are plotted, with CDB-CE on the left and Difficulty-Net on the right.
The curves on the right are smoother and more stable; the authors explain this as "Difficulty-Net can remember which class is difficult whereas CDB-CE weighting tends to be heavily affected by quick accuracy change at each time step".

Ablation study



  1. Read the code and write a demo; this should be done within 2~3 weeks.
  2. While studying the code, complete the take-away and rethinking parts, and consider how this work can benefit our topic.


Why? Could you explain it?

Classes with more samples tend to have larger intra-class variation and therefore contain more hard samples. Hard samples in head classes, like samples in tail classes, are relatively rare, so the model struggles to fit them. But does the claim here, that the model fits head classes more easily, refer to sample-level weighting or class-level weighting?


Let me know if there are new CVPR23 papers about FER.

First, in rows #8 and #9 of Table 4, the authors modified the code for a controlled experiment (i.e., the only difference between #8 and #9 is that #9 is Ours while #8 is a sample-level version of Ours), and the results show that class-level is better than sample-level.
Second, the explanation of the hard-sample point goes like this: because head classes have more samples, they contain a larger absolute number of hard samples, so sample-level weighting gives head classes a higher total weight and the model becomes biased towards them. This was pointed out by the same author in an earlier ACCV 2020 work, Class-Wise Difficulty-Balanced Loss for Solving Class-Imbalance.

First, the same work, Class-Wise Difficulty-Balanced Loss for Solving Class-Imbalance, together with another work by the same author and his team, Class-Difficulty Based Methods for Long-Tailed Visual Recognition (International Journal of Computer Vision, 2022), points out that class difficulty is a better weighting criterion than class frequency, and proposes several formulas to quantify class difficulty, such as 1 − a_c and other quantities negatively correlated with a_c, where a_c is the accuracy of class c.
Then, from the accuracy-vs-imbalance line chart, we can see that under different imbalance factors each of the pre-defined d_c formulas has its own strengths and weaknesses, so a meta-learning approach is proposed to let the model learn the relationship between d_c and a_c by itself. Prior works along this line include Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from a Domain Adaptation Perspective (CVPR 2020), Learning to Reweight Examples for Robust Deep Learning (ICML 2018), and MWN (NeurIPS 2019) from the XJTU team.
This paper builds on the views that class-level difficulty is better than sample-level difficulty and relative difficulty is better than absolute difficulty, and additionally proposes the driver loss to train the model better.


In other words, sample-level weighting, given that head classes contain more hard samples, assigns head classes a higher total weight and thereby aggravates the bias.
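A quick back-of-the-envelope illustration of this point (the counts, fraction, and per-sample weights are made-up numbers):

```python
# same fraction of hard samples in each class, very different totals
head_samples, tail_samples = 1000, 100
hard_fraction = 0.1
hard_weight, easy_weight = 5.0, 1.0   # sample-level: hard samples weighted up

def total_weight(n):
    """Total sample-level weight a class of size n receives."""
    hard = n * hard_fraction
    return hard * hard_weight + (n - hard) * easy_weight

print(total_weight(head_samples))  # 1400.0
print(total_weight(tail_samples))  # 140.0
```

Even though both classes have the same *fraction* of hard samples, the head class receives 10x the total weight, so sample-level weighting reinforces rather than corrects the head bias.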

Class-Difficulty Based Methods for Long-Tailed Visual Recognition (International Journal of Computer Vision, 2022): IJCV is a top journal, and its bar for novelty is usually even higher than CVPR's. Nicely explained!

Twin Contrastive Learning with Noisy Labels: CVPR 2023

Authors: Zhizhong Huang, Junping Zhang, Hongming Shan
Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
Institute of Science and Technology for Brain-inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
Shanghai Center for Brain Science and Brain-inspired Technology, Shanghai 200031, China

Standard Benchmarks Versus Real-world Datasets

TCL (this work); DivideMix; Learning from Noisy Data with Robust Representation Learning: ICCV 2021; Early-Learning Regularization Prevents Memorization of Noisy Labels: NeurIPS 2020; Multi-Objective Interpolation Training for Robustness to Label Noise: CVPR 2021

  • Standard benchmarks: CIFAR-10/100.
  • Real-world datasets: WebVision and Clothing1M (MOIT doesn’t use Clothing1M).

Conclusion: standard benchmarks (CIFAR) cannot represent all situations, so experiments on real-world datasets (e.g., WebVision, Clothing1M) must also be conducted. This is a convention.

TCL Versus DivideMix

  1. The OOD label-noise detection method proposed in TCL, which uses a two-component GMM to model the samples with clean and wrong labels, is similar to the "divide" part of DivideMix.
  2. Like DivideMix, TCL also performs label correction, replacing the noisy labels with a weighted sum of model predictions and the noisy labels.
  3. TCL also uses MixUp.
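The divide step and the weighted-sum correction described above can be sketched with a small 1-D two-component EM over per-sample losses; this is my own minimal numpy sketch, not the actual TCL or DivideMix code:

```python
import numpy as np

def fit_two_component_gmm(losses, n_iter=50):
    """Fit a 1-D two-component GMM to per-sample losses with EM and return,
    for every sample, the posterior probability of the low-mean ("clean")
    component."""
    x = np.asarray(losses, dtype=float)
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.full(2, x.std() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities (the 1/sqrt(2*pi) constant cancels out)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the mixture parameters
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return resp[:, np.argmin(mu)]

def correct_labels(clean_prob, noisy_onehot, model_probs):
    """Weighted-sum label correction: trust the given (noisy) label in
    proportion to its probability of being clean."""
    w = clean_prob[:, None]
    return w * noisy_onehot + (1.0 - w) * model_probs
```

Samples with clean labels tend to have small losses early in training, so the low-mean component's posterior acts as a per-sample "clean" probability, which then serves as the mixing weight in the correction step.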

TCL's Contributions (quoted from the paper)

• We present TCL, a novel twin contrastive learning model that explores the label-free GMM for unsupervised representations and label-noisy annotations for learning from noisy labels.
• We propose a novel OOD label noise detection method by modeling the data distribution, which excels at handling extremely noisy scenarios.
• We propose an effective cross-supervision, which can bootstrap the true targets with an entropy loss to regularize the model.
• Experimental results on several benchmark datasets and real-world datasets demonstrate that our method outperforms the existing state-of-the-art methods by a significant margin. In particular, we achieve 7.5% improvements in extremely noisy scenarios.

TCL’s Framework





  1. Learn contrastive learning after the midterm exam; otherwise I can't understand some terms used in this paper, such as "label-free unsupervised representations", "discriminative image representations", and so on.
  2. Review some math; otherwise I can't understand the formulas in the method part.
  3. The codes of Difficulty-Net and this paper are both published on GitHub; reading them may help me better understand these two papers.


So standard benchmarks serve to evaluate methods on synthetic noise with specified ratios, while real-world datasets serve to verify the noise robustness of the proposed method?

Good comparison, but what's the real difference between TCL and DivideMix? Did the authors mention the disadvantages of GMMs? Have they tried class-wise GMMs (e.g., one per facial-expression class) like we have done before?

Based on the figure, it is hard to fully understand how the method works. A few quick questions before I check the paper: 1) Why do we need two MLPs? Why not use the feature embedding in the classification network (g) for the GMM? 2) What is the contribution of the mixup view? 3) How do L_{align} and L_{ctr} work? 4) Can we use images from the same class to form the mixup view?

You can discuss this paper with me this Friday afternoon! Or let me know if you need more time.

For experiment setting of TCL, these datasets with noise are used:

Simulated datasets

  1. Symmetric noise: CIFAR-10/100 (noise rates 20%, 50%, 80%, 90%)

  2. Asymmetric noise: CIFAR-10 (noise rate 40%)

Real-world datasets

  1. WebVision: the first 50 classes of the Google image subset are used, termed WebVision (mini); evaluated on both the WebVision and ImageNet validation sets.
  2. Clothing1M: only the noisy training set is used.
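For reference, the two simulated-noise protocols can be sketched as follows; note that my symmetric variant always flips to a different class (some papers instead sample uniformly over all classes, original included), and the asymmetric class mapping is the commonly used CIFAR-10 one (truck to automobile, bird to airplane, deer to horse, cat and dog swapped):

```python
import numpy as np

def symmetric_noise(labels, num_classes, rate, seed=0):
    """Flip each label with probability `rate` to a uniformly chosen *other* class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(labels).copy()
    flip = rng.random(len(y)) < rate
    offsets = rng.integers(1, num_classes, size=len(y))  # 1..num_classes-1
    y[flip] = (y[flip] + offsets[flip]) % num_classes
    return y

def asymmetric_noise_cifar10(labels, rate, seed=0):
    """CIFAR-10-style asymmetric noise: with probability `rate`, flip a label
    to a visually similar class (truck->automobile, bird->airplane,
    deer->horse, cat<->dog); other classes are left untouched."""
    mapping = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}
    rng = np.random.default_rng(seed)
    y = np.asarray(labels).copy()
    for i in range(len(y)):
        if y[i] in mapping and rng.random() < rate:
            y[i] = mapping[y[i]]
    return y
```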