[Note] Hot papers in April 2024

240425_Trend

A. Retrieval augmented generation (RAG) Architecture

What is RAG? (LinkedIn)
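How does RAG work? For LLMs, similar samples are retrieved and fed in together with the question during inference.
"Retrieve" literally means "to search for; to recover and bring back", but in this field it is used in the sense of "finding similar data".
For VLMs, what is retrieval augmentation (RA)? During training, data similar to the existing train set is retrieved and used.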
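
The inference-time retrieval step described above can be sketched roughly as follows. This is only a toy illustration: the bag-of-words `embed` function, the tiny `corpus`, and the prompt template are all assumptions made here, standing in for a real encoder, a real document store, and a real LLM prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (assumed stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, corpus, k=2):
    """Return the k corpus entries most similar to the question."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus, k=2):
    """Put the retrieved samples into the prompt together with the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical mini corpus, for illustration only.
corpus = [
    "CLIP is trained on image-text pairs",
    "ResNet-50 is a convolutional network",
    "LAION is a large image-text dataset",
]
question = "which dataset has image-text pairs"
retrieved = retrieve(question, corpus)
prompt = build_prompt(question, corpus)
print(prompt)
```

The resulting `prompt` is what would be passed to the LLM in place of the bare question.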

◽️ REACT Learning Customized Visual Models with Retrieval-Augmented Knowledge. CVPR, 2023

The paper defines three stages for building a model: Pretraining → Customization → Finetuning. The Customization stage uses retrieval-augmented knowledge (e.g., web-crawled data) to make the pre-trained model more specialized for the downstream tasks.
An example dataset configuration for this setting: Pretraining: CLIP weight → Customization: LAION (400M large image-text pairs, COCO) → Finetuning (ImageNet, 20 datasets in ELEVATER)
There is no special method per se. Instead, the paper shows through many experiments the effect of training a model under the setting above (the weights used for PEFT are as shown in the lower-right image below; this is called locked-text gated-image tuning). It obtains better zero-shot and few-shot performance on downstream datasets for classification, retrieval, detection, and segmentation tasks.

◽️ Retrieval-Augmented Multimodal Language Modeling. ICML, 2023

Foundation models hold world knowledge, but this has two drawbacks: (1) it is hard to learn and serve, and (2) it is mostly a black box. Instead of packing world knowledge into model weights, the paper considers storing world images as a massive key-value memory and making use of it. How it is used is almost entirely captured in the Overview below.
It focuses on using external knowledge (assumed in this paper to be WebLI, LAION, YFCC100M, and ImageNet) for the tail classes in long-tailed recognition.
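
To make the "massive key, value" idea concrete, here is a minimal sketch of a key-value memory readout. It is not the paper's architecture: the 2-D keys, the one-hot label values, and the top-k softmax readout are all assumptions chosen to keep the example small.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_memory(query, keys, values, k=2):
    """Soft readout: attend over the k keys most similar to the query
    and return the weighted sum of their values."""
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]

# Hypothetical memory: keys are image embeddings, values are one-hot labels.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
readout = read_memory([1.0, 0.0], keys, values, k=2)
print(readout)
```

Because the knowledge lives in `keys` and `values` rather than in weights, entries for rare (tail) classes can be added or inspected without retraining.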

B. AI agentic workflows (LLM in the Loop)

What is LLM in the Loop? (Andrew Ng)
- A current LLM system is akin to asking someone to compose an essay from start to finish, typing straight through with no backspacing allowed, and expecting a high-quality result. But the iterative process (rereading and revising the text several times) is critical even for most human writers to write good text.

◽️ LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement. Berkeley, Mar 2024. (arXiv)

A targeted and iterative LLM-based data augmentation technique that efficiently and effectively augments small task-specific datasets. These datasets are used to fine-tune a student LLM.
As in the figure below, the teacher generates 1) additional and 2) task-specific data for the examples the student gets wrong. Because the teacher is an LLM rather than a human annotator, this can be described as LLM in the loop.
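
The teacher-student loop can be sketched as below. Everything here is a toy stand-in: `fine_tune`, `student_is_correct`, and `teacher_generate` are hypothetical names, and the rule that a topic is "learned" once it has two training examples is an assumption made only so the loop has something to react to.

```python
from collections import Counter

def fine_tune(train_set):
    """Toy stand-in for fine-tuning the student: a topic counts as
    'learned' once the training set holds at least two examples of it."""
    counts = Counter(topic for topic, _ in train_set)
    return {topic for topic, n in counts.items() if n >= 2}

def student_is_correct(student, example):
    """The toy student answers correctly only on learned topics."""
    topic, _ = example
    return topic in student

def teacher_generate(example, round_idx):
    """Toy stand-in for the teacher LLM: one new example on the same topic."""
    topic, text = example
    return (topic, f"{text} [teacher variant {round_idx}]")

def llm2llm(seed_data, max_rounds=3):
    """Fine-tune, find what the student gets wrong, let the teacher
    generate targeted data for those cases, and repeat."""
    train_set = list(seed_data)
    for round_idx in range(max_rounds):
        student = fine_tune(train_set)
        failures = [ex for ex in seed_data if not student_is_correct(student, ex)]
        if not failures:
            break
        train_set += [teacher_generate(ex, round_idx) for ex in failures]
    return fine_tune(train_set), train_set

seed = [("addition", "1+1=2"), ("addition", "2+3=5"), ("division", "8/2=4")]
student, train_set = llm2llm(seed)
print(student, len(train_set))
```

Note that new data is generated only for the failed examples, which is what makes the augmentation targeted rather than uniform.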

◽️ Improving Text-to-Image Consistency via Automatic Prompt Optimization. Meta, Mar 2024. (arXiv)

Prompt-image consistency: producing images that are consistent with the input prompt.
Method:
- A T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models.
- Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score.
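
The iterative loop in the Method above can be sketched as follows. The scorer and the reviser are toy assumptions, not OPT2I components: `toy_consistency_score` pretends an object is depicted only if the prompt mentions it twice, and `toy_llm_revise` stands in for the LLM by appending one word of the user prompt to the best prompt so far.

```python
def toy_consistency_score(prompt, objects):
    """Toy stand-in for (T2I model + consistency metric): an object counts
    as depicted only if the prompt mentions it at least twice."""
    words = prompt.replace(",", " ").split()
    return sum(words.count(obj) >= 2 for obj in objects) / len(objects)

def toy_llm_revise(user_prompt, history):
    """Toy stand-in for the LLM: revise the best prompt so far by
    appending one word of the user prompt as emphasis."""
    best_prompt, _ = max(history, key=lambda pair: pair[1])
    vocab = list(dict.fromkeys(user_prompt.split()))  # unique words, original order
    return [f"{best_prompt}, {word}" for word in vocab]

def optimize_prompt(user_prompt, objects, max_iters=5):
    """Start from the user prompt and iteratively generate revised
    prompts, keeping a history of (prompt, score) pairs."""
    history = [(user_prompt, toy_consistency_score(user_prompt, objects))]
    for _ in range(max_iters):
        for candidate in toy_llm_revise(user_prompt, history):
            history.append((candidate, toy_consistency_score(candidate, objects)))
        if max(score for _, score in history) == 1.0:
            break
    return max(history, key=lambda pair: pair[1])

best_prompt, best_score = optimize_prompt("a dog chasing a ball", ["dog", "ball"])
print(best_prompt, best_score)
```

The `history` of prompt-score pairs is the only feedback the reviser sees, so the T2I model itself is never modified.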

◽️ Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. arXiv, 2024.

C. VLM pretraining

ViTamin: Designing Scalable Vision Models in the Vision-Language Era (project)
- Developing a ViT replacement model, ViTamin.
- Develops a ViT that works well not only on classification but also on zero-shot, ImageNet-R, retrieval, etc.

D. Bias and Generalization

Can Biases in ImageNet Models Explain Generalization?, CVPR 2024.
- Defines distribution shift in several cases (e.g., texture, critical, high-frequency) and explains the relationship between resolving these biases and generalization.
- Uses ResNet-50 models trained with 48 DG methods to explain this relationship.
- The way they organize their key insights, as in the paper's Contributions, seems very good.
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?, arXiv 2024.
- A new paper posted one month later by the authors of the paper above.
- We find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. (Giving an appropriate text prompt has been a way to address shape bias.)
- One important visual bias is the texture vs. shape bias, or the dominance of local over global information.
- The models compared are as follows: (Instruct-VLMs) GPT, LLaVA / CLIP / ImageNet pretrained model / Human
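
For reference, texture vs. shape bias is commonly measured on cue-conflict images (the shape of one class rendered with the texture of another) as the fraction of shape decisions among all decisions that match either cue. The sketch below assumes that definition; the sample decisions are made up for illustration and are not results from the paper.

```python
def shape_bias(decisions):
    """decisions: list of (predicted, shape_label, texture_label) tuples
    collected on cue-conflict images, where shape_label != texture_label.
    Returns shape decisions / (shape decisions + texture decisions)."""
    shape_hits = sum(pred == shape for pred, shape, _ in decisions)
    texture_hits = sum(pred == texture for pred, _, texture in decisions)
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")

# Hypothetical predictions on four cat-shaped, elephant-textured images.
decisions = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "elephant"),       # shape decision
    ("dog", "cat", "elephant"),       # matches neither cue, ignored
]
bias = shape_bias(decisions)
print(bias)
```

A value near 1 means the model mostly follows shape, and a value near 0 means it mostly follows texture.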

◽️ Model Stock: All we need is just a few fine-tuned models, NAVER AI, 2024 (arxiv)

CLIP ViT-L is fine-tuned on ImageNet. Suppose we obtain several fine-tuned models by using several seeds. By weight-averaging them, we can build a model that performs well both ID and OOD. (No DG or similar techniques are used in this fine-tuning process.)
This approach and task were first introduced in Model soups (ICML, 2022). Model soups needed many fine-tuned models, around 48, whereas Model Stock introduces a technique that can find a sufficiently good model with only 2-3 models.
It is impressive that the observations are listed regardless of section.
1. The angle and norm differences among the weights of the fine-tuned models are consistent.
2. Weight averaging leads to good performance on all of the ImageNet val-set (ID) and IN-R, IN-A (OOD).
3. Through geometric analysis and a hypothesis on the pretrained model (W_0) and a small number of fine-tuned models (W_1, W_2), the paper presents a method for obtaining the optimal averaged model (W_H).
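
A minimal sketch of the weight-averaging idea, using flat Python lists as stand-ins for model weights. `uniform_average` is plain weight averaging. `model_stock_two` interpolates between the average of two fine-tuned models and the pretrained weights with ratio t = 2cosθ / (1 + cosθ), where θ is the angle between (W_1 - W_0) and (W_2 - W_0). That ratio is quoted from memory of the paper and is not stated in this note, so treat it as an assumption to verify; the paper also applies its merging per layer, which this sketch ignores.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_angle(u, v):
    """Cosine of the angle between two weight-difference vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def uniform_average(weights):
    """Plain weight averaging over a list of models."""
    n = len(weights)
    return [sum(col) / n for col in zip(*weights)]

def model_stock_two(w0, w1, w2):
    """Interpolate between the average of two fine-tuned models and the
    pretrained weights w0 (assumed ratio, see lead-in).
    Undefined when cos(theta) == -1."""
    cos_t = cos_angle(sub(w1, w0), sub(w2, w0))
    t = 2 * cos_t / (1 + cos_t)
    w12 = uniform_average([w1, w2])
    return [t * a + (1 - t) * b for a, b in zip(w12, w0)]

# Hypothetical 2-D "weights", for illustration only.
w0 = [0.0, 0.0]
w_h = model_stock_two(w0, [1.0, 0.0], [1.0, 1.0])
print(w_h)
```

When the two fine-tuned models agree (θ near 0) the result approaches their average; when their updates are orthogonal it falls back to the pretrained weights.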

◽️ Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. Meta. Apr, 2024. (X link)

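The paper proposes a new evaluation metric that measures how much knowledge LLMs hold. According to their measurements, in LLMs (current foundation models) a single parameter in int8 format reportedly stores only 2 bits of data/knowledge.
It analyzes how the "amount of stored knowledge" changes with training duration, model architecture, and so on. A summary is well organized in the Twitter post above.
The last result in particular seems to emphasize the need for fine-tuning. "Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model’s knowledge capacity."
Devising a way to apply the evaluation used here to CV also seems like a good research direction.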
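
As a quick back-of-the-envelope reading of the 2-bits-per-parameter figure quoted above: the 7B parameter count below is a hypothetical example, not a model from the paper.

```python
BITS_PER_PARAM = 2  # figure quoted in the note above

def knowledge_capacity_bits(num_params, bits_per_param=BITS_PER_PARAM):
    """Total knowledge capacity in bits under the quoted scaling law."""
    return num_params * bits_per_param

def storage_utilization(bits_per_param=BITS_PER_PARAM, stored_bits=8):
    """Fraction of an int8 parameter's 8 storage bits that carry knowledge."""
    return bits_per_param / stored_bits

params = 7_000_000_000  # hypothetical 7B-parameter model
capacity = knowledge_capacity_bits(params)
capacity_gb = capacity / 8 / 1e9  # bits -> bytes -> GB
print(capacity, capacity_gb, storage_utilization())
```

Under this reading, an int8 model uses only a quarter of its storage bits for knowledge.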

◽️ DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models. NeurIPS 2023

project: https://weijiawu.github.io/DatasetDM_page/
A small module called the P-decoder, which takes SD's understanding (the cross-attention maps and multi-scale feature maps in the figure) as input, is trained by few-shot learning.
It shows good performance on many visual perception tasks such as semantic segmentation, instance segmentation, and depth estimation.
From the perspective of "Where is SD's understanding stored? Which features should be used?", this paper and the following one are worth consulting: Unleashing Text-to-Image Diffusion Models for Visual Perception (ICCV 2023)

◽️ DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

project: https://dginstyle.github.io/
Question: Are diffusion models usable as large-scale data generators, e.g., to improve tasks in the perception stack? Yes!
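Motive1: Images generated with ControlNet look just like GTA itself. They fail to use the diffusion model's foundation knowledge.
Motive2: SD does not generate small objects well.
Method: To mitigate the two problems above, the paper proposes Style Swap, a technique that makes ControlNet fine-tuning more refined.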

Posted by Junha Song