๐ Analysis of VLMs / Understanding VLMs
◽️ What matters when building vision-language models? Hugging Face, arXiv, 2024.
- Intro: Examples of what VLMs are used for (reading PDFs, explaining charts, reading text in images, turning a webpage into code, etc.) → current VLMs are built by simply attaching backbones that were pretrained unimodally. Many design choices arise in doing so, but there is little consensus about them (e.g., the design differences between BLIP-2 and Flamingo). → They investigate (1) fusion design choices and (2) the multimodal training procedure. → They present several findings from the design perspective, e.g., that the BLIP-2-style architecture works better than the Flamingo-style one. → With this knowledge they trained and released the Idefics2-8B model, which performs on par with models 4x its size.
- Contributions: (1) Diverse ablation studies are conducted, especially regarding fusion design choices. (2) They introduce the Idefics2-8B model.
- Evaluation: Average results over 4 downstream benchmarks: VQAv2, TextVQA (OCR), OKVQA (external knowledge), and captioning.
- Observations
- For a fixed number of parameters, the quality of the language model backbone has a higher impact on the performance of the final VLM than the quality of the vision backbone.
- When training the unimodal backbones (with LoRA), the fully autoregressive architecture outperforms the cross-attention one.
- Reducing the number of visual tokens with learned pooling significantly improves compute efficiency (see the pooling sketch after this list).
- Splitting images into sub-images during training allows trading compute efficiency for more performance. The increase in performance is particularly noticeable in tasks involving reading text in an image.
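- Below is a minimal sketch of what "learned pooling" of visual tokens can look like: a perceiver-style resampler that cross-attends from a small set of learned queries to the patch tokens, so the LLM sees K tokens instead of N. This only illustrates the idea; the module name, sizes, and initialization are mine, not Idefics2's actual module.

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Compress N visual tokens into K learned queries via cross-attention."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim), e.g. N = 576 patch tokens from the vision backbone
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(pooled)  # (B, K, dim), with K << N tokens passed to the LLM

pooled = LearnedPooling()(torch.randn(2, 576, 1024))
print(pooled.shape)  # torch.Size([2, 64, 1024])
```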
◽️ Image Captioners Are Scalable Vision Learners Too. Google, NeurIPS 2023. No code.
- Tags: VL pretraining (like CLIP, but a different framework), two decoder types (autoregressive vs. parallel = single forward pass)
- The intro's review of prior work is as thorough as a survey. Recommended.
- The Cap model in the figure is not well suited to zero-shot evaluation, but with few-shot learning it still yields a visual encoder as good as CLIP's. In addition, the CapPa architecture, which replaces autoregressive decoding with a single parallel forward pass, yields an even better visual encoder (see the sketch after this entry). Adding a bag-of-words objective to Cap gives a further performance gain.
- Conclusion: "Our results show that pretraining a simple encoder-decoder architecture via image captioning alone can produce vision encoders competitive with CLIP."
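- A rough sketch of the parallel ("Pa") prediction idea as I understand it: instead of feeding shifted ground-truth tokens under a causal mask, the decoder receives only mask tokens and predicts every caption token in a single forward pass, conditioned on the image through cross-attention. The module names and sizes below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ParallelCaptionDecoder(nn.Module):
    """Single-forward-pass caption prediction: every decoder input is a [MASK] token."""
    def __init__(self, vocab_size: int, dim: int = 512, max_len: int = 32):
        super().__init__()
        self.mask_emb = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.randn(1, max_len, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image_feats: torch.Tensor, seq_len: int) -> torch.Tensor:
        # image_feats: (B, N, dim) from the vision encoder being trained
        B = image_feats.size(0)
        tgt = self.mask_emb.expand(B, seq_len, -1) + self.pos_emb[:, :seq_len]
        # no causal mask: all positions are predicted jointly, in parallel
        hidden = self.decoder(tgt, memory=image_feats)
        return self.head(hidden)  # (B, seq_len, vocab), trained with cross-entropy vs. the caption

logits = ParallelCaptionDecoder(vocab_size=32000)(torch.randn(2, 196, 512), seq_len=20)
```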
◽️ Long-CLIP: Unlocking the Long-Text Capability of CLIP. ECCV 2024.
- I only looked up the points I was curious about: they use the ShareGPT4 dataset as-is, without any special modification, keeping texts of roughly 150 words nearly unchanged. Interestingly, they do not use LoRA; they randomly select only 1M (long caption, image) pairs and train for exactly one epoch.
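- To accept captions longer than CLIP's 77-token limit, Long-CLIP stretches the text positional embeddings: the first few well-trained positions are kept and the rest are interpolated to a longer length. A minimal sketch of that idea follows; the exact split (20 kept positions) and target length (248) are from my recollection of the paper and should be double-checked.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor, keep: int = 20, target_len: int = 248):
    """Keep the first `keep` (well-trained) positions, linearly interpolate the rest."""
    kept = pos_emb[:keep]                                    # (keep, dim)
    rest = pos_emb[keep:].T.unsqueeze(0)                     # (1, dim, 77 - keep)
    stretched = F.interpolate(rest, size=target_len - keep,
                              mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).T], dim=0)  # (target_len, dim)

new_pos = stretch_positional_embedding(torch.randn(77, 512))
print(new_pos.shape)  # torch.Size([248, 512])
```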
◽️ S2-Wrapper: When Do We Not Need Larger Vision Models?
๐ In-context learning, In-the-loop, Teacher forcing, Training resolutions
◽️ Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023. 9. code-s22
- Tags: image selection and text assignment for in-context learning with VL models
- Explores in-context learning techniques for VL models (which differ from NLP) and provides insights; introduces and analyzes four image-selection and four text-assignment methods.
- Intro: in-context learning in LMs → Flamingo brings few-shot prompting to VLMs → in NLP, the selection and ordering of in-context samples has been studied extensively, but this has not been demonstrated for VL → among the various VLM tasks, they focus on image captioning (IC) first → they compare the 4x4 strategies below, distill several findings → and based on these, propose suitable strategies.
- image selection: (1) random sampling (2) similarity-based image-image retrieval (see the retrieval sketch after this entry) (3) similarity-based image-caption retrieval (4) diversity-based image-image retrieval
- caption assignment (for the images selected above): (1) ground-truth captions (2) model-generated captions (3) iteratively prompting (4) model-generated captions as anchors
- Findings: (1) Selecting similar images does not always guarantee better performance; the quality of the associated captions matters most, where caption quality relates to descriptiveness and language patterns. (2) Providing very similar images can induce short-cut inference and produce inappropriate captions.
- Open-Flamingo is used as the baseline model.
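- A minimal sketch of strategy (2), similarity-based image-image retrieval: given precomputed image features (e.g., from CLIP), pick the k support images most similar to the query as in-context examples; the caption-assignment step then decides what text accompanies them. The function below is my illustration and assumes the features are already extracted.

```python
import torch

def select_in_context_examples(query_feat: torch.Tensor,
                               support_feats: torch.Tensor,
                               support_captions: list[str],
                               k: int = 4) -> list[tuple[int, str]]:
    """Pick the k support images nearest to the query by cosine similarity."""
    q = query_feat / query_feat.norm()
    s = support_feats / support_feats.norm(dim=-1, keepdim=True)
    topk = (s @ q).topk(k).indices                 # (k,) indices of nearest neighbours
    return [(int(i), support_captions[int(i)]) for i in topk]

examples = select_in_context_examples(torch.randn(512), torch.randn(100, 512),
                                       [f"caption {i}" for i in range(100)])
```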
๐ Video understanding
- Most of these papers aim to build video datasets and to train video-language models with CLIP. The contrastive learning and MAE components used to train the video-language models are more interesting to me than the data-generation parts.
- InternVid, InternVideo: need to check the typical video length and caption length in these papers, and which evaluation metrics they use.
◽️ InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. ICLR 2024 spotlight.
- Tag: Video-text dataset / Using ImageCaptioner (BLIP2, Tag2Text) / Video-text contrastive learning (ViCLIP)
- The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
- Their core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset. (BLIP2, Tag2Text)
- ViCLIP (video-text representation learning based on ViT-L): (1) learned on InternVid via contrastive learning (a loss sketch follows at the end of this entry); (2) they hope to advance text-to-video generation research.
- Clearing up my misunderstandings:
- InternVid (ICLR 2024), InternVideo, and InternVideo2 (arXived 2024.05, model code) all use a video vision transformer.
- The frame-captioning part is used only for video-text data generation. On closer inspection, though, the frame captions do not seem to require long captions...
- Still need to check exactly how long the videos and captions are in this paper, and which evaluation metrics are used.
- As shown in the figure below (left and right taken from the VASTA and InternVideo2 papers, respectively), InternVideo2 and VASTA use CIDEr to evaluate video captions. (For CIDEr evaluation, the output is a single sentence and does not carry much information.)
- According to the "Video Multimodal Annotation" paragraph of InternVideo2, they still use a pipeline like the image above for data generation.
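- For reference, a minimal sketch of the CLIP-style symmetric contrastive objective that ViCLIP-like dual encoders train with: matched video-text pairs are positives and everything else in the batch is a negative. This is the generic InfoNCE formulation, not InternVid's exact implementation.

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of (video, text) pairs."""
    v = F.normalize(video_emb, dim=-1)           # (B, D) pooled video-clip embeddings
    t = F.normalize(text_emb, dim=-1)            # (B, D) caption embeddings
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```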
◽️ Simple LLM Framework for Long-Range Video Question-Answering. arXiv, Dec 2023.
- Tag: Video question-answering (LVQA) / Using ImageCaptioner (LaViLa, BLIP-2)
- A language-based long-range video question-answering (LVQA) framework.
- Previous work: FrozenBiLM (It fails to answer a question that requires reasoning about complex human activities in a long video.)
- Method: (1) a short-term visual captioner runs over 0.5-8s clips, and (2) an LLM aggregates the captions to answer a given question (see the pipeline sketch after this entry).
- Findings: (1) they propose a novel multi-round summarization prompt; (2) the choice of visual captioner and LLM is critical (LaViLa > BLIP-2 > EgoVLP).
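- A minimal sketch of the two-stage pipeline, assuming placeholder callables `caption_clip` (any short-term captioner such as LaViLa or BLIP-2) and `llm` (any instruction-following LLM). The fixed 8-second timestamps are illustrative (the paper captions 0.5-8s windows), and the prompts are mine, not the paper's multi-round prompt verbatim.

```python
def answer_long_video_question(video_clips, question, caption_clip, llm):
    """(1) Caption each short clip, (2) summarize the captions, (3) answer from the summary."""
    captions = [f"[{i * 8}s] {caption_clip(clip)}" for i, clip in enumerate(video_clips)]
    summary = llm("Summarize these clip captions into a coherent description:\n"
                  + "\n".join(captions))
    return llm(f"Video summary:\n{summary}\n\nQuestion: {question}\nAnswer:")
```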
◽️ LaViLa: Learning Video Representations from Large Language Models. Facebook, CVPR, 2023.
- Tag: Video-text dataset, CLIP pretraining with video data (ViCLIP)
- Automatically generate text pairing for such videos by leveraging Large Language Models (LLMs). → Take full advantage of the massive video data → stronger representations
- LaViLa (Language-model augmented Video-Language pre-training): uses GPT-2 as the backbone; input: video, output: automatically generated text. LaViLa becomes a "visually-conditioned narrator." (It is unclear to me how GPT-2 is trained here and how the visual input is encoded; look it up later if needed.)
- Advantages of LaViLa: (1) it can generate dense descriptions for long videos; (2) the generated text is well-aligned with the visual input; (3) it can produce assistive and augmented descriptions for egocentric-view videos; (4) the generated data can be used for video-text contrastive learning.
- In the figure on the right, the Narrator extracts descriptions of actions and the Rephraser augments the text. The data obtained this way is used to train the dual encoders (the backbones for representation learning).
- A per-GPU batch size of 32 over 32 GPUs for TimeSformer-B and a per-GPU batch size of 16 over 64 GPUs for TimeSformer-L / no COCO captioning evaluation.
◽️ EgoVLP: Egocentric Video-Language Pretraining. NeurIPS, 2022.
- Tag: Egocentric Video/Clip-text dataset. Video-text contrastive learning (EgoNCE)
- We exploit the recently released Ego4D dataset to pioneer Egocentric Video-Language Pretraining in three directions:
- We create EgoClip, with 3.8M clip-text pairs.
- We propose a novel pretraining objective, dubbed EgoNCE.
- We created a benchmark, Egocentric Multiple-Choices Question (EgoMCQ), which contains 39K questions created from Ego4D and focuses on evaluating video-text alignment.
◽️ Distilling Vision-Language Models on Millions of Videos. CVPR 2024.
- Tag: Video-text data generation model (video question-answering/captioning), Video-text CLIP model (video-text retrieval, video recognition)
- Enough human-curated video-text data is not available. So, we fine-tune a video-language model from a strong image-language baseline with synthesized instructional data. The finetuned video model is then used to auto-label millions of videos.
- Evaluation: MSR-VTT zero-shot text-to-video retrieval, open-ended NExT-QA
- Problem of InternVid: The resulting image captions are often biased towards static scenes and lose videos’ rich temporal information.
- Method: (1) first fine-tune the visual encoder, (2) fine-tune the language model on a small amount of instruction-following data, and (3) The resulting video-language model sees both dynamic input and motion-focused output.
- Dataset: (1) High alignment btw video and text. (2) temporal information in captions (3) textual descriptions with multiple granularities (4) more scalable than human labeling
- Model: PaLI-3 (SOTA VLM), ViT-G/14 (visual encoder), UL-2 (language model).
◽️ VideoCon: Robust Video-Language Alignment via Contrast Captions. Google, arXiv, Nov 2023. 2.
- Tags: contrastive dataset (ex, positive and negative pairs) with video-caption
- Overview: build a dataset for video-caption contrastive learning → also generate NLEs (natural language explanations) with an LLM → fine-tune a video-language model.
◽️ V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. Meta, submitted to ICLR 2024.
- Video-level MAE + SimSiam.
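- A rough sketch of the feature-prediction objective as I read it: a context encoder sees only the visible space-time patches, a predictor fills in features for the masked ones, and the regression target comes from a target encoder (EMA or stop-gradient) applied to the full clip, with no pixel reconstruction. `context_enc`, `target_enc`, and `predictor` are placeholder modules, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum: float = 0.998):
    """Keep the target encoder as an exponential moving average of the context encoder."""
    for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1 - momentum)

def vjepa_step(context_enc, target_enc, predictor, video_tokens, mask):
    """video_tokens: (B, N, D) tubelet embeddings; mask: (N,) bool, True = masked."""
    with torch.no_grad():
        target_feats = target_enc(video_tokens)[:, mask]   # features to be predicted
    context_feats = context_enc(video_tokens[:, ~mask])    # encode visible tokens only
    pred = predictor(context_feats, mask)                  # predict features at masked slots
    return F.l1_loss(pred, target_feats)                   # regress in feature space
```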
◽️ FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. arXiv, 2024.
- Tags: CLIP's poor interpretation of fine-grained attributes
- Overview: the CLIP encoder lacks the ability to interpret fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning → presumably because the captions themselves do not contain all the details, the visual encoder does not capture them either (it just acts as a bag of words) → to fix this, it suffices to adapt CLIP on a high-quality, comprehensive, and relatively small dataset → here they use VidSitu (a video situation recognition dataset), which provides verbs and rich semantic role labels (SRL) → hard negatives and hierarchical losses => Fine-Grained CLIP (FiGCLIP). (A rough loss sketch follows after this entry.)
- A single 12GB GPU / one RTX 2080 GPU (how is this possible?)
- ๐ This squarely addresses what I see as the weaknesses of the CLIP encoder. It is a pity that the particular video dataset they use is unfamiliar to me, but the implementation details are still worth referencing.
- ๐ Once the code is released at the end of this month, I want to check how LoRA is attached and trained.
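- This is how I imagine a hard-negative term fits into the contrastive loss: each image additionally competes against a perturbed version of its own caption (e.g., verbs or semantic roles swapped), forcing the encoder beyond bag-of-words matching. It is a sketch of the general technique, not the paper's exact formulation, which also uses hierarchical losses.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(img_emb, txt_emb, hard_neg_txt_emb, tau: float = 0.07):
    """CLIP-style loss where each image also sees a 'hard negative' caption."""
    img = F.normalize(img_emb, dim=-1)            # (B, D) image embeddings
    txt = F.normalize(txt_emb, dim=-1)            # (B, D) true captions
    neg = F.normalize(hard_neg_txt_emb, dim=-1)   # (B, D) perturbed captions (verbs/roles swapped)
    # (B, B) in-batch similarities, plus one extra column for each image's hard negative
    logits = torch.cat([img @ txt.T, (img * neg).sum(-1, keepdim=True)], dim=1) / tau
    labels = torch.arange(img.size(0), device=img.device)  # correct caption is column i
    return F.cross_entropy(logits, labels)
```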
๐ Distillations
◽️ MiniLLM: Knowledge Distillation of Large Language Models. ICLR, 2024. 71
- Motive: (1) White-box(=feature-level) KD for LLMs is yet to be explored. (2) The standard KD objectives are sub-optimal for LLMs that perform tasks.
- Findings: (1) For open-ended text generation tasks, which is usually the case for LLM applications, the output spaces are much more complex, and p(y|x) can contain many more modes than what q_θ(y|x) can express due to the limited model capacity. (The small student's capacity is not enough to match everything the LLM can do.) (2) Minimizing the typical (forward) KLD causes q_θ to assign unreasonably high probabilities to the void regions of p and produces very unlikely samples under p during free-run generation. (See the toy-experiment figure.)
- Methods: (1) they use the reverse KL divergence, whose minimization is known to cause mode-seeking behavior; plus (2) single-step decomposition, (3) teacher-mixed sampling, and (4) length normalization (see the paper for details of approaches 2-4, and the KL sketch below).
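- To make the forward/reverse distinction concrete, here is a token-level sketch; MiniLLM itself optimizes a sequence-level reverse KL with a policy-gradient-style method, which this toy snippet does not show.

```python
import torch
import torch.nn.functional as F

def forward_kl(p_logits, q_logits):
    """KL(p || q): mode-covering; q is pushed to put mass wherever the teacher p does."""
    p = F.softmax(p_logits, dim=-1)
    return (p * (F.log_softmax(p_logits, dim=-1) - F.log_softmax(q_logits, dim=-1))).sum(-1).mean()

def reverse_kl(p_logits, q_logits):
    """KL(q || p): mode-seeking; q is penalized for putting mass where the teacher p has
    little, so a small student concentrates on the teacher's major modes (MiniLLM's choice)."""
    q = F.softmax(q_logits, dim=-1)
    return (q * (F.log_softmax(q_logits, dim=-1) - F.log_softmax(p_logits, dim=-1))).sum(-1).mean()

# Example: teacher and student logits over a 32k-token vocabulary
teacher, student = torch.randn(4, 32000), torch.randn(4, 32000)
print(forward_kl(teacher, student).item(), reverse_kl(teacher, student).item())
```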
◽️ Sequence-Level Knowledge Distillation. EMNLP, 2016. 1026
๐ Efficiency
◽️ RECLIP: Resource-Efficient CLIP by Training with Small Images. TMLR, 2023.
- Tag: CLIP with small batch size / small-memory GPUs.
- Motive: Many image-text pairs (rich supervision) have a higher level of noise. To cope with this noise, training CLIP models requires ∼3k V100-GPU-days.
- Method:
- Inspired by the notion of coarse-to-fine (see Figure 1): (1) humans can effortlessly match text-image pairs even if the image is small; (2) pretraining absorbs high-level information from small images, and a short finetuning phase lets the model refocus its attention on the important details (a schedule sketch follows after this entry).
- Results:
- Only 16 tokens for the image encoding are sufficient for the main training phase (even though recent works tend to use long sequence lengths).
- We set the batch size to 16384. Our training runs on TPU-v3 infrastructure (16GB of memory per accelerator core).
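- A sketch of the coarse-to-fine schedule under stated assumptions: most updates run on heavily downsized images (so the encoder sees only a handful of tokens), followed by a short fine-tuning phase at full resolution. `model`, `loader`, and `train_step` are placeholders, and the step counts and resolutions are illustrative, not the paper's exact recipe.

```python
import torch.nn.functional as F

def coarse_to_fine_training(model, loader, train_step,
                            main_res: int = 64, finetune_res: int = 224):
    """Phase 1: many cheap steps on tiny images; phase 2: brief high-res finetuning.
    `loader` is assumed to be an (endless) iterable of (images, texts) batches."""
    for res, num_steps in [(main_res, 90_000), (finetune_res, 10_000)]:
        for _, (images, texts) in zip(range(num_steps), loader):
            resized = F.interpolate(images, size=(res, res),
                                    mode="bilinear", align_corners=False)
            train_step(model, resized, texts)  # standard CLIP contrastive update
```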
◽️ Data-Efficient Multimodal Fusion on a Single GPU. CVPR 24. Highlight
- The key idea is to pre-extract all the features from unimodal models such as DINOv2 and BGE and then train only the fusion adapters, so training is relatively fast. (Only a 1x512 feature needs to be stored per image, and no time is spent running the image encoder during training.)
- Findings: Our key insight is that off-the-shelf unimodal encoders that have been pre-trained on large amounts of unimodal data already encode rich semantics.
- Approach: FuseMix, motivated by Mixup (see the sketch at the end of this entry).
- Results: even though training uses only the features of roughly 3M-5M images (a low-data regime), performance is good enough. With a single GPU, even very large batches can be used.
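- A minimal sketch of a FuseMix-style step on precomputed unimodal features, under my reading of the paper: mixup is applied in the frozen encoders' latent spaces with a shared mixing coefficient for both modalities, and only the small fusion adapters receive gradients. The adapter modules are placeholders.

```python
import torch
import torch.nn.functional as F

def fusemix_step(img_adapter, txt_adapter, img_feats, txt_feats,
                 alpha: float = 1.0, tau: float = 0.07):
    """One training step on cached DINOv2/BGE features; only the adapters are trainable."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(img_feats.size(0))
    img_mix = lam * img_feats + (1 - lam) * img_feats[perm]   # same permutation and lambda
    txt_mix = lam * txt_feats + (1 - lam) * txt_feats[perm]   # for both modalities
    z_i = F.normalize(img_adapter(img_mix), dim=-1)
    z_t = F.normalize(txt_adapter(txt_mix), dim=-1)
    logits = z_i @ z_t.T / tau
    labels = torch.arange(z_i.size(0), device=z_i.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```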