HotPapers_May20

1. What Do Self-Supervised Vision Transformers Learn? ICLR, 2023.

  • Summary [src]
    • The paper compares two classes of self-supervised methods for vision transformers, namely contrastive learning and masked reconstruction.
    • Through several quantitative analyses of pre-trained models, the authors expose a series of findings on the learning properties of these methods (a rough sketch of one such analysis follows this list).
    • The main findings for contrastive learning: it captures global, low-frequency patterns; its intermediate representations are rather homogeneous; and self-attention is its key component.
    • Conversely, for masked reconstruction: it focuses on local, high-frequency patterns; its intermediate representations maintain diversity; and the MLPs are its key component.
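
A rough illustration (my own sketch, not the authors' exact protocol) of the kind of frequency analysis behind the low- vs. high-frequency findings: reshape a layer's patch tokens into a spatial grid, take a 2D FFT, and compare low- vs. high-frequency energy.

```python
import torch

def frequency_profile(feat_map: torch.Tensor):
    """Split the spectral energy of a feature map into low- and high-frequency parts.

    feat_map: (C, H, W) spatial feature map, e.g. ViT patch tokens reshaped to a grid.
    Returns (low_energy, high_energy), a crude proxy for whether the representation
    emphasizes global (low-frequency) or local (high-frequency) patterns.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(feat_map), dim=(-2, -1)).abs()
    C, H, W = spec.shape
    cy, cx = H // 2, W // 2
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = ((yy - cy) ** 2 + (xx - cx) ** 2).float().sqrt()
    low_mask = radius <= min(H, W) / 4            # inner quarter-radius counted as "low frequency"
    low = spec[:, low_mask].sum().item()
    high = spec[:, ~low_mask].sum().item()
    return low, high

# Toy usage: a 14x14 grid of 768-dim ViT tokens reshaped to (768, 14, 14).
tokens = torch.randn(768, 14, 14)
low, high = frequency_profile(tokens)
print(f"low/high frequency energy ratio: {low / high:.2f}")
```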

image-20230511140659637

 

 

2. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. Google Research, 2023.

  • Motivation
    • 1) Fine-tuning requires expensive human labels, and 2) distillation requires large amounts of unlabeled data, which can be hard to obtain.
    • We introduce Distilling Step-by-Step, a new simple mechanism for training smaller models with less (labeled and unlabeled) training data.
  • Method
    • Core to the mechanism is the view that LLMs can produce natural-language rationales justifying their predictions.
    • (Details are in the paper) Step 1: extract rationales from LLMs via Chain-of-Thought (CoT) prompting (Wei et al., 2022). Step 2: train smaller models with the rationales as an additional supervision signal (see the sketch below).
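
A minimal sketch of the Step 2 training objective, assuming a T5-style seq2seq student and two targets per example (label and rationale); the prefixes, toy example, and loss weight below are illustrative, not the paper's exact settings.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Two-task objective: the student learns to output the label and, separately,
# the LLM-extracted rationale. Prefixes and `lam` are illustrative choices.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
student = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "A coin is heads up. Ana flips it. Is it still heads up?"
label = "no"
rationale = "Flipping a coin reverses its face, so it is now tails up."  # obtained from the LLM via CoT prompting

def seq2seq_loss(input_text: str, target_text: str) -> torch.Tensor:
    inputs = tokenizer(input_text, return_tensors="pt")
    targets = tokenizer(target_text, return_tensors="pt").input_ids
    return student(**inputs, labels=targets).loss

lam = 1.0  # rationale-loss weight (assumed)
loss = seq2seq_loss("[label] " + question, label) + lam * seq2seq_loss("[rationale] " + question, rationale)
loss.backward()
```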

 

image-20230511174940686

 

 

3. IMAGEBIND: One Embedding Space To Bind Them All. Meta, 2023.

  • Summary
    • Contribution: We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together (a rough sketch of the alignment objective follows this list).
    • Results: The emergent capabilities improve with the strength of the image encoder. We achieve a new state-of-the-art on emergent zero-shot recognition tasks.
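
A rough sketch of the binding objective, assuming a symmetric InfoNCE loss that aligns each non-image modality to the embedding of its paired image; this is my simplification, not IMAGEBIND's full training recipe.

```python
import torch
import torch.nn.functional as F

def infonce_align(image_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Pull each non-image modality (audio, depth, ...) toward its paired image embedding.

    image_emb, other_emb: (B, D) batches from the image encoder and the other
    modality's encoder; the diagonal entries of the similarity matrix are positives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img, aud = torch.randn(8, 512), torch.randn(8, 512)
print(infonce_align(img, aud).item())
```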

image-20230512203420802

 

 

4. Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime. VGG group and DeepMind, 2023.

  • Introduction
    • While humans are known to efficiently learn new tasks from a few examples, deep learning models struggle with adaptation from a few examples.
    • We show the important benefits of self-labelling: using the model’s own predictions to self-improve when a larger pool of unlabelled images is available. >> Our work focuses on data-efficient adaptation.
  • Methods
    • Fine-tuning: We start from Flamingo (which showed that a pre-trained model can be adapted to new tasks with in-context learning).
    • Self-training: (1) train a pseudo-labeller, (2) use it to obtain pseudo-labels and then (3) re-train the original model using the pseudo-labels and the small amount of annotated data.
    • Adapter: We follow the approach in [26] and add MLP adapter layers.
    • Threshold: We choose the top N samples with the highest likelihood (see the self-training sketch after this entry's notes).
  • My thought
    • Main differences with DA: (1) experiments are conducted on multiple tasks, (2) both visual & language encoders are used, (3) new setup: small labeled set + large unlabeled set.
    • Overfitting is not considered. (Is there a regularization / de-overfitting technique that applies to all UDA and semi-supervised learning settings? Is "overfitting" even the right term here, or is it really collapse? What is the underlying cause?)
    • Note that the CLIP model already has zero-shot prediction ability.
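
A minimal sketch of the self-training selection step (pseudo-label, then keep the top-N most confident samples); `model` here is a generic classifier standing in for the visual-language model, and the re-training loops are omitted.

```python
import torch
import torch.nn.functional as F

def select_top_n_pseudo_labels(model, unlabeled_x: torch.Tensor, n: int):
    """Pseudo-label unlabelled inputs and keep only the N most confident predictions."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=-1)     # (N_unlabeled, num_classes)
    confidence, pseudo_labels = probs.max(dim=-1)
    keep = confidence.topk(n).indices                     # top-N highest-likelihood samples
    return unlabeled_x[keep], pseudo_labels[keep]

# Toy usage: (1) train a pseudo-labeller on the labelled set, (2) pseudo-label,
# (3) re-train on labelled + selected pseudo-labelled data.
model = torch.nn.Linear(16, 4)                            # stand-in classifier
x_unlab = torch.randn(100, 16)
x_sel, y_sel = select_top_n_pseudo_labels(model, x_unlab, n=20)
print(x_sel.shape, y_sel.shape)
```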

image-20230512204427918

 

 

5. DINOv2: State-of-the-art computer vision models with self-supervised learning. Meta, 2023.

  • Project page
  • [Diff1] Unlike many recent reconstruction-based self-supervised learning methods, our model requires no fine-tuning.
  • [Diff2] CLIP-based models ignore important information that typically isn’t explicitly mentioned in the text descriptions.
  • [Approach1] Making progress from DINO to DINOv2 required overcoming several challenges:
    1. Creating a large and curated training dataset
      • Two key ingredients: 1) discarding irrelevant images, 2) balancing the dataset across concepts
      • 25 public datasets -> a set of seed images -> extend it by retrieving images (from crawled web data) sufficiently close to those seed images -> 142 million images produced out of the 1.2 billion source images (see the retrieval sketch after this list).
    2. Improving the training algorithm and implementation
      • Challenge 1: Increasing the model size makes training more challenging because of potential instability. -> additional regularization techniques
      • Challenge 2: Larger models require more efficient implementations. -> PyTorch 2, xFormers
    3. Designing a functional distillation pipeline
      • Model distillation allows us to compress our highest-performance architecture into significantly smaller ones at only a minimal cost in accuracy, for a dramatically decreased inference cost.
  • [Result1] DINOv2 combines with simple linear classifiers to achieve strong results across multiple tasks beyond the segmentation sub-field.
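
A rough sketch of the retrieval step in 1., assuming cosine similarity in a pretrained encoder's embedding space; the threshold and encoder choice are my assumptions, and the real pipeline also deduplicates and balances concepts.

```python
import torch
import torch.nn.functional as F

def retrieve_near_seed(seed_emb: torch.Tensor, pool_emb: torch.Tensor, threshold: float = 0.5):
    """Keep uncurated-pool images whose embedding is sufficiently close to any seed image.

    seed_emb: (S, D) embeddings of curated seed images; pool_emb: (P, D) embeddings of
    crawled web images (both from a pretrained image encoder).
    """
    seed_emb = F.normalize(seed_emb, dim=-1)
    pool_emb = F.normalize(pool_emb, dim=-1)
    sims = pool_emb @ seed_emb.t()              # (P, S) cosine similarities
    best = sims.max(dim=-1).values              # closest seed for each pool image
    return (best >= threshold).nonzero(as_tuple=True)[0]

# Toy usage with random vectors standing in for encoder features.
kept = retrieve_near_seed(torch.randn(10, 256), torch.randn(1000, 256))
print(f"{kept.numel()} of 1000 pool images retained")
```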

 

 

 

6. Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning. ACL, 2023.

  • In this paper, we systematically study the effect of catastrophic forgetting and mitigation strategies in a massively multilingual setting covering up to 51 languages on three different tasks.
  • We propose LR ADJUST. The intuition is that models are more susceptible to catastrophic forgetting when the learning rate is higher, so the learning rate should be toned down (a rough sketch of this idea follows the list).
  • The continual-learning baselines are (1) Experience Replay, (2) Averaged GEM (A-GEM), and (3) Elastic Weight Consolidation (which uses the Fisher information matrix).
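
A minimal sketch of the LR ADJUST intuition (tone the learning rate down as new languages arrive); the decay factor and schedule below are illustrative, not the paper's exact method.

```python
import torch

def adjust_lr(optimizer: torch.optim.Optimizer, base_lr: float, phase: int, decay: float = 0.5) -> float:
    """Scale the learning rate down each time training moves to a new language/phase."""
    lr = base_lr * (decay ** phase)
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr

model = torch.nn.Linear(10, 2)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for phase, lang in enumerate(["en", "de", "sw"]):        # languages seen sequentially in continual learning
    lr = adjust_lr(opt, base_lr=3e-4, phase=phase)
    print(f"phase {phase} ({lang}): lr={lr:.1e}")
    # ... fine-tune on this language's data ...
```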

image-20230601155300755

 

 

7. A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering. arXiv, 2023.

  • SAM
    • SAM performs promptable segmentation, where the prompt plays a role similar to attention (a usage sketch of the promptable interface follows the project list below).
    • SAM exclusively solves mask prediction (not label prediction), which may seem like a trivial task but contributes to the development of vision foundation models.
  • Can SAM really segment anything?
  • Projects (X anything)
    • Grounded SAM
    • SAM + Label
    • Image editing
    • Tracking in Video
    • 3D
    • Medical
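
A minimal usage sketch of SAM's promptable interface with a single point prompt, via Meta's segment-anything package; the checkpoint path and dummy image below are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (download separately) and wrap it in the promptable predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)           # replace with a real RGB image (H, W, 3)
predictor.set_image(image)

point = np.array([[320, 240]])                            # (x, y) prompt on the object of interest
label = np.array([1])                                     # 1 = foreground point
masks, scores, logits = predictor.predict(point_coords=point, point_labels=label, multimask_output=True)
print(masks.shape, scores)                                # 3 candidate binary masks with quality scores
```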

image-20230601165131644