This post summarizes material presented at two workshops (CVinW, VFoundationM). As a result, the same paper may appear in both Section 2 and Section 4.
The post that summarizes the papers themselves is located at: Computers/projects/Google_blog_posts/AI/1230_CVinW2.md

Computer vision in the wild

0. Overview (Workshop, Awesome)

Limitation: Visual recognition models are typically trained to predict a fixed set of pre-determined object categories, which limits their usability in real-world applications since additional labeled data are needed to generalize to new visual concepts and domains.
Interest in the vision-language model:
- Open-vocabulary recognition models can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. / The pre-trained CLIP model is so semantically rich that it can be easily transferred to downstream image classification and text-image retrieval tasks in zero-shot settings.
- As language supervision significantly increases the coverage of visual concepts for model training, model generalization is improved.
Dataset: ELEVATER (project, paper)
Challenges: Classification, Detection, Segmentation, Roboflow(Detection+)
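
To make the zero-shot transfer mentioned above concrete, here is a minimal sketch of how an open-vocabulary classifier picks a label: embed one text prompt per class, embed the image, and choose the class with the highest cosine similarity. The 3-d embeddings, the class names, and the `temperature` value are made-up assumptions for illustration; a real setup would take the embeddings from pre-trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (best_label, probabilities) for one image embedding.

    text_embs maps a class name to the embedding of its text prompt
    (for example, the embedding of "a photo of a dog").
    temperature is an assumed value, not one taken from any paper.
    """
    image_emb = l2_normalize(image_emb)
    labels = list(text_embs)
    logits = []
    for name in labels:
        text_emb = l2_normalize(text_embs[name])
        cosine = sum(a * b for a, b in zip(image_emb, text_emb))
        logits.append(cosine / temperature)
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for real encoder outputs (hypothetical values).
toy_text_embs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
toy_image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(toy_image_emb, toy_text_embs)
print(label, probs)
```

Adding a new category only requires adding a new text prompt to `toy_text_embs`; no labeled images or retraining are involved, which is the point of the open-vocabulary setting.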

1. ELEVATER (model adaptation, paper, github)

Refer to the next post.

2. Open-vocabulary recognition

Classification: CLIP, ALIGN and Florence
Detection: ViLD, RegionCLIP, GLIP and OWL-ViT
Segmentation: GroupViT, OpenSeg, MaskCLIP, X-Decoder, Segment Anything (SAM) and SEEM
Chatbot: Multimodal GPT-4, LLaVA and MiniGPT4

REACT: Learning Customized Visual Models with Retrieval-Augmented Knowledge. CVPR, 2023.
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection. NeurIPS, 2022.

3. Efficient Adaptation (Awesome)

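The papers above are interesting, useful, and meaningful. However, I am not sure whether this kind of research is feasible with my own resources. Let's go through all of the papers and check how much GPU they use.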

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control. ICCV, 2023.
Adapter is All You Need for Tuning Visual Tasks. Alibaba, submitted to CVPR, 2024.
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models. ICLR, 2023.
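
As a rough way to reason about the resource question above, the sketch below counts how many parameters a generic bottleneck adapter adds to a frozen Transformer backbone. It is not the method of any paper listed here: the layer sizes (a ViT-B-like 12 layers with hidden size 768), the bottleneck width of 64, and the choice of one adapter per block are all assumptions. It also ignores embeddings, the classification head, and activation memory, which often dominates actual GPU usage.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def transformer_block_params(hidden, mlp_ratio=4):
    """Approximate parameter count of one standard Transformer block."""
    # Attention: fused QKV projection plus output projection.
    attn = linear_params(hidden, 3 * hidden) + linear_params(hidden, hidden)
    # MLP: expand by mlp_ratio, then project back.
    mlp = (linear_params(hidden, mlp_ratio * hidden)
           + linear_params(mlp_ratio * hidden, hidden))
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * hidden
    return attn + mlp + norms


def bottleneck_adapter_params(hidden, bottleneck):
    """Down-projection followed by up-projection (the nonlinearity has no parameters)."""
    return linear_params(hidden, bottleneck) + linear_params(bottleneck, hidden)


def summarize(hidden=768, layers=12, bottleneck=64):
    """Compare frozen backbone size with trainable adapter size."""
    backbone = layers * transformer_block_params(hidden)
    adapters = layers * bottleneck_adapter_params(hidden, bottleneck)  # one adapter per block (assumption)
    return {
        "backbone_params": backbone,
        "adapter_params": adapters,
        "trainable_fraction": adapters / (backbone + adapters),
    }


stats = summarize()
print(stats)
```

Because gradients and optimizer state are kept only for trainable parameters, a small trainable fraction reduces that part of the memory footprint; the forward and backward activations of the frozen backbone still have to fit on the GPU.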

Vision Foundation Model

1. [Microsoft] CVPR23 Tutorial

Talk1: Recent Advances in Vision Foundation Models (Zhe Gan, Senior Researcher, Apple)
- Summarizes the efforts made since CLIP to improve vision-language models (for image classification), organized by three perspectives: data, architecture, and objective.
- Interesting papers
  - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens. Apple, EMNLP, 2023. (model design perspective)
  - K-LITE: Learning transferable visual models with external knowledge. NeurIPS, 2022. (data perspective)
  - LiT: Zero-shot transfer with locked-image text tuning. CVPR, 2022. (model design/training perspective)
- Takeaway: many companies are working to build strong foundation models. However, this kind of work is hard to pursue at the student level.
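
For reference when reading the "objective" perspective above, here is a minimal sketch of a CLIP-style symmetric contrastive loss: a cross-entropy over the image-text similarity matrix, applied once per row (image to text) and once per column (text to image). The function names, the toy 2-d embeddings, and the `temperature` value are assumptions for illustration; the embeddings are assumed to be L2-normalized already.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    """
    n = len(image_embs)
    # Similarity matrix: logits[i][j] compares image i with text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same, computed over columns.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: two orthogonal unit vectors (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_style_loss(aligned, aligned)
loss_shuffled = clip_style_loss(aligned, aligned[::-1])
print(loss_matched, loss_shuffled)
```

Correctly paired embeddings give a loss near zero, while swapping the captions makes it large, which is the signal that pulls matching image-text pairs together during pre-training.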
Talk2: Towards Unified Vision Understanding Interface
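- For a computer vision researcher, this talk gives a good sense of the overall direction of the field. The speaker says that many open directions remain, and I need to think about what those might be.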
- Interesting paper
  - DETR → Mask2Former
  - [Language-driven] → CLIP, ELEVATER
  - [Task-specific] → GroupViT / MaskCLIP, ECCV22; MaskCLIP, CVPR23 / OpenSeg / Open-MaskCLIP
  - [Output Unification] → UniTab, Pix2Seqv2 / Unified-IO. ICLR, 2023.
  - [Functionality Unification] → Uni-Perceiver v2 / GLIP / X-decoder. MS, CVPR, 2023.
  - [Promptable interface] → SAM / SEEM. MS, NeurIPS, 2023. / FIND. MS, 2023. / LISA, submitted to ICLR24.
  - → What is the next?
Talk4: Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4
- The vision encoder (CLIP) and the language encoder are each pre-trained on separate tasks and then connected to each other. Representative papers are Flamingo and BLIP-2.
- Nobody outside knows how GPT-4 was built, but let's try to understand and guess step by step: 1) instruction tuning, self-instruct, 2) language-only to multimodal.
- For a multimodal chatbot, an instruction dataset can be composed of the following: 1) captions (detailed description) 2) box sequences 3) conversations generated from captions (?) 4) complex reasoning (the implications of the image, i.e. meaning that is not directly visible; the Science QA dataset exists for this).
- The architecture and training procedure are shown in the figure below.
- Further research directions: 1) More modalities 2) Multitask instruct with academic tasks 3) Multimodal in-context learning (OpenFlamingo) 4) Parameter-efficient training 5) Object hallucination, OCR, reading high-resolution images, domain-specific knowledge (medical, physics)
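
The instruction-dataset ingredients listed above (captions, boxes, conversation, reasoning) can be pictured as one record per image. The sketch below is a hypothetical schema, not the format of any released dataset: the class names, field names, the `<image>` placeholder token, and the example values are all assumptions. `symbolic_context` shows how captions and boxes could be rendered as plain text for a language-only model to read, assuming conversations are generated from that text as item 3 of the note suggests.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box:
    label: str
    xyxy: Tuple[float, float, float, float]  # normalized [0, 1] corner coordinates


@dataclass
class Turn:
    role: str  # "human" or "assistant"
    text: str


@dataclass
class InstructionSample:
    image_id: str
    kind: str  # "detailed_description", "conversation", or "complex_reasoning"
    captions: List[str] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)
    turns: List[Turn] = field(default_factory=list)


def symbolic_context(sample):
    """Render captions and boxes as text, one item per line."""
    lines = list(sample.captions)
    for box in sample.boxes:
        coords = ", ".join(f"{v:.2f}" for v in box.xyxy)
        lines.append(f"{box.label}: [{coords}]")
    return "\n".join(lines)


def to_training_text(sample, image_token="<image>"):
    """Flatten the turns into one string, marking where the image goes."""
    parts = []
    for i, turn in enumerate(sample.turns):
        # The image placeholder is attached to the first human turn only.
        prefix = f"{image_token}\n" if i == 0 and turn.role == "human" else ""
        parts.append(f"{turn.role}: {prefix}{turn.text}")
    return "\n".join(parts)


# Hypothetical example record.
sample = InstructionSample(
    image_id="example_0001",
    kind="conversation",
    captions=["A dog sits next to a red car."],
    boxes=[Box("dog", (0.10, 0.40, 0.45, 0.95))],
    turns=[
        Turn("human", "What animal is in the image?"),
        Turn("assistant", "A dog."),
    ],
)

print(symbolic_context(sample))
print(to_training_text(sample))
```

Keeping `kind` as a field lets the three response types share one schema, so they can be mixed or filtered when building a training set.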

Talk5: Chaining Multimodal Experts with LLMs

2. [DeepMind, Apple] Vision-Language Models

Ferret: Refer and Ground Anything Anywhere at Any Granularity. Apple, 2023.
VeCLIP: Improving CLIP training via visual-enriched captions. Apple, 2023.
Contrastive Feature Masking Open-Vocabulary Vision Transformer. DeepMind, ICCV, 2023.
RECLIP: Resource-Efficient CLIP by Training with Small Images. DeepMind, TMLR, 2023.
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers. DeepMind, CVPR, 2023.

6. Questions

So how do I download the datasets?
Where can I download the pre-trained models and foundation models?
Which datasets are available for tuning/adaptation?

[CVinW] Computer Vision in the Wild 1

Posted by Junha Song