๐ Analysis of VLMs / Understanding VLMs
◽️ What matters when building vision-language models? Hugging Face, arXiv, 2024.
- Intro: Examples of what VLMs are used for (reading PDFs, explaining charts, reading text in images, turning a webpage into code, etc.) → current VLMs are built by simply attaching backbones that were pretrained unimodally. Many design choices arise in doing so, but there is little consensus about them (e.g., the design differences between BLIP-2 and Flamingo). → They investigate (1) fusion design choices and (2) the multimodal training procedure. → They present several findings from the design perspective, e.g., that the BLIP-2-style architecture works better than the Flamingo-style one. → With this knowledge they trained and released the Idefics2-8B model, which performs on par with models 4x its size.
- Contributions: (1) Diverse ablation studies are conducted, especially regarding fusion design choices. (2) They introduce the Idefics2-8B model.
- Evaluation: Average results over 4 downstream benchmarks: VQAv2, TextVQA (OCR), OKVQA (external knowledge), and captioning.
- Observations
- For a fixed number of parameters, the quality of the language model backbone has a higher impact on the performance of the final VLM than the quality of the vision backbone.
- When training the unimodal backbones (with LoRA), the fully autoregressive architecture outperforms the cross-attention one.
- Reducing the number of visual tokens with learned pooling significantly improves compute efficiency (see the pooling sketch after this list).
- Splitting images into sub-images during training allows trading compute efficiency for more performance. The increase in performance is particularly noticeable in tasks involving reading text in an image.
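- Below is a minimal sketch of what "learned pooling" of visual tokens can look like: a perceiver-style resampler that cross-attends from a small set of learned queries to the patch tokens, so the LLM sees K tokens instead of N. This only illustrates the idea; the module name, sizes, and initialization are mine, not Idefics2's actual module.

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Compress N visual tokens into K learned queries via cross-attention."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim), e.g. N = 576 patch tokens from the vision backbone
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(pooled)  # (B, K, dim), with K << N tokens passed to the LLM

pooled = LearnedPooling()(torch.randn(2, 576, 1024))
print(pooled.shape)  # torch.Size([2, 64, 1024])
```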
◽️ Image Captioners Are Scalable Vision Learners Too. Google, NeurIPS 2023. No code.
- Tags: VL pretraining (like CLIP, but a different framework), two decoder types (autoregressive vs. parallel = single forward pass)
- The intro's review of prior work is as thorough as a survey. Recommended.
- The Cap model in the figure is not well suited to zero-shot evaluation, but with few-shot learning it still yields a visual encoder as good as CLIP's. In addition, the CapPa architecture, which replaces autoregressive decoding with a single parallel forward pass, yields an even better visual encoder (see the sketch after this entry). Adding a bag-of-words objective to Cap gives a further performance gain.
- Conclusion: "Our results show that pretraining a simple encoder-decoder architecture via image captioning alone can produce vision encoders competitive with CLIP."
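- A rough sketch of the parallel ("Pa") prediction idea as I understand it: instead of feeding shifted ground-truth tokens under a causal mask, the decoder receives only mask tokens and predicts every caption token in a single forward pass, conditioned on the image through cross-attention. The module names and sizes below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ParallelCaptionDecoder(nn.Module):
    """Single-forward-pass caption prediction: every decoder input is a [MASK] token."""
    def __init__(self, vocab_size: int, dim: int = 512, max_len: int = 32):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.randn(1, max_len, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image_feats: torch.Tensor, seq_len: int) -> torch.Tensor:
        # image_feats: (B, N, dim) from the vision encoder being trained
        B = image_feats.size(0)
        tgt = self.mask_emb.expand(B, seq_len, -1) + self.pos_emb[:, :seq_len]
        # no causal mask: all positions are predicted jointly, in parallel
        hidden = self.decoder(tgt, memory=image_feats)
        return self.head(hidden)  # (B, seq_len, vocab), trained with cross-entropy vs. the caption

logits = ParallelCaptionDecoder(vocab_size=32000)(torch.randn(2, 196, 512), seq_len=20)
```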
◽️ Long-CLIP: Unlocking the Long-Text Capability of CLIP. ECCV 2024.
- I only looked up the points I was curious about: they use the ShareGPT4 dataset as-is, without any special modification, keeping texts of roughly 150 words nearly unchanged. Interestingly, they do not use LoRA; they randomly select only 1M (long caption, image) pairs and train for exactly one epoch.
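- To accept captions longer than CLIP's 77-token limit, Long-CLIP stretches the text positional embeddings: the first few well-trained positions are kept and the rest are interpolated to a longer length. A minimal sketch of that idea follows; the exact split (20 kept positions) and target length (248) are from my recollection of the paper and should be double-checked.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor, keep: int = 20, target_len: int = 248):
    """Keep the first `keep` (well-trained) positions, linearly interpolate the rest."""
    kept = pos_emb[:keep]                                    # (keep, dim)
    rest = pos_emb[keep:].T.unsqueeze(0)                     # (1, dim, 77 - keep)
    stretched = F.interpolate(rest, size=target_len - keep,
                              mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).T], dim=0)  # (target_len, dim)

new_pos = stretch_positional_embedding(torch.randn(77, 512))
print(new_pos.shape)  # torch.Size([248, 512])
```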
◽️ S2-Wrapper: When Do We Not Need Larger Vision Models?
๐ In-context learning, In-the-loop, Teacher forcing, Training resolutions
◽️ Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023. 9. code-s22
- Tags: image selection and text assignment for in-context learning with VL models
- Explores in-context learning techniques for VL models (which differ from NLP) and provides insights; introduces and analyzes four image-selection and four text-assignment methods.
- Intro: in-context learning in LMs → Flamingo brings few-shot prompting to VLMs → in NLP, the selection and ordering of in-context samples has been studied extensively, but this has not been demonstrated for VL → among the various VLM tasks, they focus on image captioning (IC) first → they compare the 4x4 strategies below, distill several findings → and based on these, propose suitable strategies.
- image selection: (1) random sampling (2) similarity-based image-image retrieval (see the retrieval sketch after this entry) (3) similarity-based image-caption retrieval (4) diversity-based image-image retrieval
- caption assignment (for the images selected above): (1) ground-truth captions (2) model-generated captions (3) iteratively prompting (4) model-generated captions as anchors
- Findings: (1) Selecting similar images does not always guarantee better performance; the quality of the associated captions matters most, where caption quality relates to descriptiveness and language patterns. (2) Providing very similar images can induce short-cut inference and produce inappropriate captions.
- Open-Flamingo is used as the baseline model.
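- A minimal sketch of strategy (2), similarity-based image-image retrieval: given precomputed image features (e.g., from CLIP), pick the k support images most similar to the query as in-context examples; the caption-assignment step then decides what text accompanies them. The function below is my illustration and assumes the features are already extracted.

```python
import torch

def select_in_context_examples(query_feat: torch.Tensor,
                               support_feats: torch.Tensor,
                               support_captions: list[str],
                               k: int = 4) -> list[tuple[int, str]]:
    """Pick the k support images nearest to the query by cosine similarity."""
    q = query_feat / query_feat.norm()
    s = support_feats / support_feats.norm(dim=-1, keepdim=True)
    topk = (s @ q).topk(k).indices                 # (k,) indices of nearest neighbours
    return [(int(i), support_captions[int(i)]) for i in topk]

examples = select_in_context_examples(torch.randn(512), torch.randn(100, 512),
                                       [f"caption {i}" for i in range(100)])
```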
๐ Video understanding
- Most of these papers aim to build video datasets and to train video-language models with CLIP. The contrastive learning and MAE components used to train the video-language models are more interesting to me than the data-generation parts.
- InternVid, InternVideo: need to check the typical video length and caption length in these papers, and which evaluation metrics they use.
◽️ InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. ICLR 2024 spotlight.
- Tag: Video-text dataset / Using ImageCaptioner (BLIP2, Tag2Text) / Video-text contrastive learning (ViCLIP)
- The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
- Their core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset. (BLIP2, Tag2Text)
- ViCLIP (video-text representation learning based on ViT-L): (1) learned on InternVid via contrastive learning (a loss sketch follows at the end of this entry); (2) they hope to advance text-to-video generation research.
- Clearing up my misunderstandings:
- InternVid (ICLR 2024), InternVideo, and InternVideo2 (arXived 2024.05, model code) all use a video vision transformer.
- The frame-captioning part is used only for video-text data generation. On closer inspection, though, the frame captions do not seem to require long captions...
- Still need to check exactly how long the videos and captions are in this paper, and which evaluation metrics are used.
- As shown in the figure below (left and right taken from the VASTA and InternVideo2 papers, respectively), InternVideo2 and VASTA use CIDEr to evaluate video captions. (For CIDEr evaluation, the output is a single sentence and does not carry much information.)
- According to the "Video Multimodal Annotation" paragraph of InternVideo2, they still use a pipeline like the image above for data generation.
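- For reference, a minimal sketch of the CLIP-style symmetric contrastive objective that ViCLIP-like dual encoders train with: matched video-text pairs are positives and everything else in the batch is a negative. This is the generic InfoNCE formulation, not InternVid's exact implementation.

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of (video, text) pairs."""
    v = F.normalize(video_emb, dim=-1)           # (B, D) pooled video-clip embeddings
    t = F.normalize(text_emb, dim=-1)            # (B, D) caption embeddings
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```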
◽️ Simple LLM Framework for Long-Range Video Question-Answering. arXiv, Dec 2023.
- Tag: Video question-answering (LVQA) / Using ImageCaptioner (LaViLa, BLIP-2)
- A language-based long-range video question-answering (LVQA) framework.
- Previous work: FrozenBiLM (It fails to answer a question that requires reasoning about complex human activities in a long video.)
- Method: (1) a short-term visual captioner runs over 0.5-8s clips, and (2) an LLM aggregates the captions to answer a given question (see the pipeline sketch after this entry).
- Findings: (1) they propose a novel multi-round summarization prompt; (2) the choice of visual captioner and LLM is critical (LaViLa > BLIP-2 > EgoVLP).
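- A minimal sketch of the two-stage pipeline, assuming placeholder callables `caption_clip` (any short-term captioner such as LaViLa or BLIP-2) and `llm` (any instruction-following LLM). The fixed 8-second timestamps are illustrative (the paper captions 0.5-8s windows), and the prompts are mine, not the paper's multi-round prompt verbatim.

```python
def answer_long_video_question(video_clips, question, caption_clip, llm):
    """(1) Caption each short clip, (2) summarize the captions, (3) answer from the summary."""
    captions = [f"[{i * 8}s] {caption_clip(clip)}" for i, clip in enumerate(video_clips)]
    summary = llm("Summarize these clip captions into a coherent description:\n"
                  + "\n".join(captions))
    return llm(f"Video summary:\n{summary}\n\nQuestion: {question}\nAnswer:")
```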
◽️ LaViLa: Learning Video Representations from Large Language Models. Facebook, CVPR, 2023.
- Tag: Video-text dataset, CLIP pretraining with video data (ViCLIP)
- Automatically generate text pairing for such videos by leveraging Large Language Models (LLMs). → Take full advantage of the massive video data → stronger representations
- LaViLa (Language-model augmented Video-Language pre-training): uses GPT-2 as the backbone; input: video, output: automatically generated text. LaViLa becomes a "visually-conditioned narrator." (It is unclear to me how GPT-2 is trained here and how the visual input is encoded; look it up later if needed.)
- Advantages of LaViLa: (1) it can generate dense descriptions for long videos; (2) the generated text is well-aligned with the visual input; (3) it can produce assistive and augmented descriptions for egocentric-view videos; (4) the generated data can be used for video-text contrastive learning.
- In the figure on the right, the Narrator extracts descriptions of actions and the Rephraser augments the text. The data obtained this way is used to train the dual encoders (the backbones for representation learning).
- A per-GPU batch size of 32 over 32 GPUs for TimeSformer-B and a per-GPU batch size of 16 over 64 GPUs for TimeSformer-L / no COCO captioning evaluation.
◽️ EgoVLP: Egocentric Video-Language Pretraining. NeurIPS, 2022.
- Tag: Egocentric Video/Clip-text dataset. Video-text contrastive learning (EgoNCE)
- We exploit the recently released Ego4D dataset to pioneer Egocentric Video-Language Pretraining in three directions:
- We create EgoClip, with 3.8M clip-text pairs.
- We propose a novel pretraining objective, dubbed EgoNCE.
- We created a benchmark, Egocentric Multiple-Choices Question (EgoMCQ), which contains 39K questions created from Ego4D and focuses on evaluating video-text alignment.
◽️ Distilling Vision-Language Models on Millions of Videos. CVPR 2024.
- Tag: Video-text data generation model (video question-answering/captioning), Video-text CLIP model (video-text retrieval, video recognition)
- Enough human-curated video-text data is not available. So, we fine-tune a video-language model from a strong image-language baseline with synthesized instructional data. The finetuned video model is then used to auto-label millions of videos.
- Evaluation: MSR-VTT zero-shot text-to-video retrieval, open-ended NExT-QA
- Problem of InternVid: The resulting image captions are often biased towards static scenes and lose videos’ rich temporal information.
- Method: (1) first fine-tune the visual encoder, (2) fine-tune the language model on a small amount of instruction-following data, and (3) The resulting video-language model sees both dynamic input and motion-focused output.
- Dataset: (1) High alignment btw video and text. (2) temporal information in captions (3) textual descriptions with multiple granularities (4) more scalable than human labeling
- Model: PaLI-3 (SOTA VLM), ViT-G/14 (visual encoder), UL-2 (language model).
◽️ VideoCon: Robust Video-Language Alignment via Contrast Captions. Google, arXiv, Nov 2023. 2.
- Tags: contrastive dataset (ex, positive and negative pairs) with video-caption
- Overview: build a dataset for video-caption contrastive learning → also generate NLEs (natural language explanations) with an LLM → fine-tune a video-language model.
◽️ V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. Meta, submitted to ICLR 2024.
- Video-level MAE + SimSiam.
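- A rough sketch of the feature-prediction objective as I read it: a context encoder sees only the visible space-time patches, a predictor fills in features for the masked ones, and the regression target comes from a target encoder (EMA or stop-gradient) applied to the full clip, with no pixel reconstruction. `context_enc`, `target_enc`, and `predictor` are placeholder modules, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum: float = 0.998):
    """Keep the target encoder as an exponential moving average of the context encoder."""
    for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1 - momentum)

def vjepa_step(context_enc, target_enc, predictor, video_tokens, mask):
    """video_tokens: (B, N, D) tubelet embeddings; mask: (N,) bool, True = masked."""
    with torch.no_grad():
        target_feats = target_enc(video_tokens)[:, mask]   # features to be predicted
    context_feats = context_enc(video_tokens[:, ~mask])    # encode visible tokens only
    pred = predictor(context_feats, mask)                  # predict features at masked slots
    return F.l1_loss(pred, target_feats)                   # regress in feature space
```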
◽️ FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. arXiv, 2024.
- Tags: CLIP's poor interpretation of fine-grained attributes
- Overview: the CLIP encoder lacks the ability to interpret fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning → presumably because the captions themselves do not contain all the details, the visual encoder does not capture them either (it just acts as a bag of words) → to fix this, it suffices to adapt CLIP on a high-quality, comprehensive, and relatively small dataset → here they use VidSitu (a video situation recognition dataset), which provides verbs and rich semantic role labels (SRL) → hard negatives and hierarchical losses => Fine-Grained CLIP (FiGCLIP). (A rough loss sketch follows after this entry.)
- A single 12GB GPU / one RTX 2080 GPU (how is this possible?)
- ๐ This squarely addresses what I see as the weaknesses of the CLIP encoder. It is a pity that the particular video dataset they use is unfamiliar to me, but the implementation details are still worth referencing.
- ๐ Once the code is released at the end of this month, I want to check how LoRA is attached and trained.
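- This is how I imagine a hard-negative term fits into the contrastive loss: each image additionally competes against a perturbed version of its own caption (e.g., verbs or semantic roles swapped), forcing the encoder beyond bag-of-words matching. It is a sketch of the general technique, not the paper's exact formulation, which also uses hierarchical losses.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(img_emb, txt_emb, hard_neg_txt_emb, tau: float = 0.07):
    """CLIP-style loss where each image also sees a 'hard negative' caption."""
    img = F.normalize(img_emb, dim=-1)            # (B, D) image embeddings
    txt = F.normalize(txt_emb, dim=-1)            # (B, D) true captions
    neg = F.normalize(hard_neg_txt_emb, dim=-1)   # (B, D) perturbed captions (verbs/roles swapped)
    # (B, B) in-batch similarities, plus one extra column for each image's hard negative
    logits = torch.cat([img @ txt.T, (img * neg).sum(-1, keepdim=True)], dim=1) / tau
    labels = torch.arange(img.size(0), device=img.device)  # correct caption is column i
    return F.cross_entropy(logits, labels)
```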
๐ Distillations
◽️ MiniLLM: Knowledge Distillation of Large Language Models. ICLR, 2024. 71
- Motive: (1) White-box(=feature-level) KD for LLMs is yet to be explored. (2) The standard KD objectives are sub-optimal for LLMs that perform tasks.
- Findings: (1) For open-ended text generation tasks, which is usually the case for LLM applications, the output spaces are much more complex, and p(y|x) can contain many more modes than what q_θ(y|x) can express due to the limited model capacity. (The small student's capacity is not enough to match everything the LLM can do.) (2) Minimizing the typical (forward) KLD causes q_θ to assign unreasonably high probabilities to the void regions of p and produces very unlikely samples under p during free-run generation. (See the toy-experiment figure.)
- Methods: (1) they use the reverse KL divergence, whose minimization is known to cause mode-seeking behavior; plus (2) single-step decomposition, (3) teacher-mixed sampling, and (4) length normalization (see the paper for details of approaches 2-4, and the KL sketch below).
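- To make the forward/reverse distinction concrete, here is a token-level sketch; MiniLLM itself optimizes a sequence-level reverse KL with a policy-gradient-style method, which this toy snippet does not show.

```python
import torch
import torch.nn.functional as F

def forward_kl(p_logits, q_logits):
    """KL(p || q): mode-covering; q is pushed to put mass wherever the teacher p does."""
    p = F.softmax(p_logits, dim=-1)
    return (p * (F.log_softmax(p_logits, dim=-1) - F.log_softmax(q_logits, dim=-1))).sum(-1).mean()

def reverse_kl(p_logits, q_logits):
    """KL(q || p): mode-seeking; q is penalized for putting mass where the teacher p has
    little, so a small student concentrates on the teacher's major modes (MiniLLM's choice)."""
    q = F.softmax(q_logits, dim=-1)
    return (q * (F.log_softmax(q_logits, dim=-1) - F.log_softmax(p_logits, dim=-1))).sum(-1).mean()

# Example: teacher and student logits over a 32k-token vocabulary
teacher, student = torch.randn(4, 32000), torch.randn(4, 32000)
print(forward_kl(teacher, student).item(), reverse_kl(teacher, student).item())
```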
◽️ Sequence-Level Knowledge Distillation. EMNLP, 2016. 1026
๐ Efficiency
◽️ RECLIP: Resource-Efficient CLIP by Training with Small Images. TMLR, 2023.
- Tag: CLIP with small batch size / small-memory GPUs.
- Motive: Many image-text pairs (rich supervision) have a higher level of noise. To cope with this noise, training CLIP models requires ∼3k V100-GPU-days.
- Method:
- Inspired by the notion of coarse-to-fine (see Figure 1): (1) humans can effortlessly match text-image pairs even if the image is small; (2) pretraining absorbs high-level information from small images, and a short finetuning phase lets the model refocus its attention on the important details (a schedule sketch follows after this entry).
- Results:
- Only 16 tokens for the image encoding are sufficient for the main training phase (even though recent works tend to use long sequence lengths).
- We set the batch size to 16384. Our training runs on TPU-v3 infrastructure (16GB of memory per accelerator core).
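- A sketch of the coarse-to-fine schedule under stated assumptions: most updates run on heavily downsized images (so the encoder sees only a handful of tokens), followed by a short fine-tuning phase at full resolution. `model`, `loader`, and `train_step` are placeholders, and the step counts and resolutions are illustrative, not the paper's exact recipe.

```python
import torch.nn.functional as F

def coarse_to_fine_training(model, loader, train_step,
                            main_res: int = 64, finetune_res: int = 224):
    """Phase 1: many cheap steps on tiny images; phase 2: brief high-res finetuning.
    `loader` is assumed to be an (endless) iterable of (images, texts) batches."""
    for res, num_steps in [(main_res, 90_000), (finetune_res, 10_000)]:
        for _, (images, texts) in zip(range(num_steps), loader):
            resized = F.interpolate(images, size=(res, res),
                                    mode="bilinear", align_corners=False)
            train_step(model, resized, texts)  # standard CLIP contrastive update
```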
◽️ Data-Efficient Multimodal Fusion on a Single GPU. CVPR 24. Highlight
- The key idea is to pre-extract all the features from unimodal models such as DINOv2 and BGE and then train only the fusion adapters, so training is relatively fast. (Only a 1x512 feature needs to be stored per image, and no time is spent running the image encoder during training.)
- Findings: Our key insight is that off-the-shelf unimodal encoders that have been pre-trained on large amounts of unimodal data already encode rich semantics.
- Approach: FuseMix, motivated by Mixup (see the sketch at the end of this entry).
- Results: even though training uses only the features of roughly 3M-5M images (a low-data regime), performance is good enough. With a single GPU, even very large batches can be used.
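- A minimal sketch of a FuseMix-style step on precomputed unimodal features, under my reading of the paper: mixup is applied in the frozen encoders' latent spaces with a shared mixing coefficient for both modalities, and only the small fusion adapters receive gradients. The adapter modules are placeholders.

```python
import torch
import torch.nn.functional as F

def fusemix_step(img_adapter, txt_adapter, img_feats, txt_feats,
                 alpha: float = 1.0, tau: float = 0.07):
    """One training step on cached DINOv2/BGE features; only the adapters are trainable."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(img_feats.size(0))
    img_mix = lam * img_feats + (1 - lam) * img_feats[perm]   # same permutation and lambda
    txt_mix = lam * txt_feats + (1 - lam) * txt_feats[perm]   # for both modalities
    z_i = F.normalize(img_adapter(img_mix), dim=-1)
    z_t = F.normalize(txt_adapter(txt_mix), dim=-1)
    logits = z_i @ z_t.T / tau
    labels = torch.arange(z_i.size(0), device=z_i.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```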