💙 Analysis of VLMs / Understanding VLMs

◽️ What matters when building vision-language models? Hugging Face, arXiv, 2024.

  • Intro: example uses of VLMs (reading PDFs, explaining charts, reading text inside images, turning web pages into code, etc.) → current VLMs are built by simply gluing together models that were each pretrained unimodally. This opens up many design choices, but they have not been explored systematically (e.g., the different design decisions behind BLIP2 and Flamingo). → The paper investigates (1) fusion design choices and (2) the multimodal training procedure → and presents several findings, e.g., that from a design standpoint the BLIP2-style approach is better than the Flamingo-style one. → Using this knowledge they release the Idefics2-8B model, which performs on par with models 4x its size.
  • Contributions: (1) Diverse ablation studies are conducted, especially regarding fusion design choices. (2) They introduce the Idefics2-8B model.
  • Evaluation: Average results of 4 downstream benchmarks: VQAv2, TextVQA (OCR), OKVQA (external knowledge), Captioning.
  • Observations
    1. For a fixed number of parameters, the quality of the language model backbone has a higher impact on the performance of the final VLM than the quality of the vision backbone.
    2. When training the unimodal backbones (LoRA), the fully autoregressive architecture outperforms the cross-attention one.
    3. Reducing the number of visual tokens with learned pooling significantly improves compute efficiency (see the pooling sketch after this list).
    4. Splitting images into sub-images during training allows trading compute efficiency for more performance. The increase in performance is particularly noticeable in tasks involving reading text in an image.
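
A minimal sketch of the learned-pooling idea from observation 3: a small set of learned queries cross-attends over the full visual token sequence and hands a fixed, much shorter sequence to the LLM, in the spirit of a perceiver resampler. This is not Idefics2's exact module; the class name, sizes, and token counts below are illustrative assumptions.

```python
# Minimal sketch (not Idefics2's exact module) of learned pooling: a small set
# of learned queries cross-attends over the full visual token sequence and
# hands a fixed, shorter sequence to the LLM. Sizes are illustrative.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):                     # (B, N, D), e.g. N = 576
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(pooled)                          # (B, 64, D) fed to the LLM

pooler = LearnedPooling()
print(pooler(torch.randn(2, 576, 1024)).shape)            # torch.Size([2, 64, 1024])
```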

image-20240515234202753

 

◽️ Image Captioners Are Scalable Vision Learners Too. Google, NeurIPS 2023. No code.

  • Tags: VL pretraining (like CLIP, but a different framework), two decoder types (autoregressive vs. parallel = single inference)
  • The intro summarizes prior work as thoroughly as a survey. Recommended.
  • Cap in the figure is not well suited for zero-shot evaluation, but with few-shot learning it still produces a visual encoder as good as CLIP's. Moreover, the CapPa architecture, which can predict with a single parallel forward pass instead of autoregressive decoding, yields an even better visual encoder (sketched below). Adding a bag-of-words idea to Cap gives a further performance boost.
  • Conclusion: Our results show that pretraining a simple encoder-decoder architecture via image captioning alone can produce vision encoders competitive with CLIP.
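
A hedged sketch contrasting the two decoder training modes: (a) Cap-style autoregressive captioning with a causal mask, and (b) CapPa-style parallel prediction, where the decoder sees only mask tokens and must predict every caption token in one non-causal forward pass, so the supervision comes purely from the image. The "image encoder" output, shapes, and hyperparameters are placeholders, not the paper's setup.

```python
# Hedged sketch of the two decoder training modes: (a) Cap, autoregressive with
# a causal mask; (b) CapPa-style, all mask tokens and one non-causal forward
# pass. The "image encoder" output and all sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, L = 32000, 512, 20                        # vocab, width, caption length
embed = nn.Embedding(V, D)
mask_embed = nn.Parameter(torch.zeros(1, 1, D))
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(D, 8, batch_first=True), 6)
head = nn.Linear(D, V)

img_tokens = torch.randn(2, 196, D)             # stand-in for ViT encoder output
caption = torch.randint(0, V, (2, L))           # tokenized target caption

# (a) Cap: each position only sees earlier tokens; predict the next token.
causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
h = decoder(embed(caption), img_tokens, tgt_mask=causal)
loss_ar = F.cross_entropy(head(h[:, :-1]).flatten(0, 1), caption[:, 1:].flatten())

# (b) CapPa-style: feed only mask tokens, no causal mask, predict all tokens at
# once, so supervision comes purely from the image.
h = decoder(mask_embed.expand(2, L, -1), img_tokens)
loss_par = F.cross_entropy(head(h).flatten(0, 1), caption.flatten())
print(loss_ar.item(), loss_par.item())
```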

image-20240504203153309

 

◽️🔒 Long-CLIP: Unlocking the Long-Text Capability of CLIP. ECCV, 2024.

  • I only looked up the points I was curious about. They use the ShareGPT4 dataset as-is, with no special modification, keeping the ~150-word texts almost unchanged. It is quite interesting that they do not use LoRA and train for exactly 1 epoch on only 1M randomly selected (long caption, image) pairs.

 

◽️ S2-Wrapper: When Do We Not Need Larger Vision Models?

image-20241028153128981

 

 

 

 

💙 In-context learning, In-the-loop, Teacher forcing, Training resolutions

◽️ Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023. 9. code-s22

  • Tags: image selection and text assignment for in-context learning with VL models,
  • Explores in-context learning techniques for VL models, which behave differently from NLP, and provides insights; introduces and analyzes 4 image and caption selection methods.
  • Intro: in-context learning in LMs → Flamingo brings few-shot prompting to VLMs → selection and ordering of in-context samples is actively studied in NLP, but has not been validated for VL → among the various VLM tasks, the paper focuses on image captioning (IC) → it compares the 4x4 strategies below to distill several findings → and, based on them, recommends suitable strategies.
  • image selection: (1) random sampling (2) similarity-based image-image retrieval (3) similarity-based image-caption retrieval (4) diversity-based image-image retrieval (a sketch of (2) and (4) follows this list)
  • caption assignment: for the images selected above, (1) ground-truth captions (2) model-generated captions (3) iterative prompting (4) model-generated captions as anchors
  • Findings: (1) Picking similar images does not always guarantee better performance; the quality of the associated captions matters most, and caption quality relates to descriptiveness and language patterns. (2) Giving very similar images can induce short-cut inference and yield inappropriate captions.
  • Open-Flamingo is used as the baseline model.
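
A minimal sketch of image-selection strategies (2) and (4) above, operating on precomputed image embeddings (e.g., CLIP features): similarity-based retrieval via cosine similarity, and diversity-based retrieval via greedy farthest-point picking. The embeddings are random placeholders, not the paper's released code.

```python
# Sketch of selection strategies (2) and (4), assuming image embeddings (e.g.
# CLIP features) are already cached; the random tensors stand in for them.
import torch
import torch.nn.functional as F

def topk_similar(query, pool, k=4):
    """(2) Return indices of the k pool images most similar to the query image."""
    sims = F.normalize(query, dim=-1) @ F.normalize(pool, dim=-1).T
    return sims.topk(k).indices

def diverse_subset(pool, k=4):
    """(4) Greedily pick k mutually dissimilar pool images (farthest-point style)."""
    feats = F.normalize(pool, dim=-1)
    chosen = [0]
    for _ in range(k - 1):
        dist = 1 - feats @ feats[chosen].T       # (N, |chosen|) cosine distances
        chosen.append(int(dist.min(dim=1).values.argmax()))
    return torch.tensor(chosen)

pool = torch.randn(1000, 512)                    # cached demo-image embeddings
query = torch.randn(512)                         # embedding of the test image
print(topk_similar(query, pool), diverse_subset(pool))
```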

image-20240506124548524

 

 

 

 

💙 Video understanding

  • Most of these papers aim to build video datasets and to turn CLIP into a video-language model. The interesting parts are less the data generation and more the contrastive learning and MAE recipes used to train the video-language models.
  • InternVid, InternVideo: need to check roughly how long the videos and captions are in these papers and which evaluation metrics they use.

◽️ InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. ICLR 2024 spotlight.

  • Tag: Video-text dataset / Using ImageCaptioner (BLIP2, Tag2Text) / Video-text contrastive learning (ViCLIP)
  • The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
  • Their core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset. (BLIP2, Tag2Text)
  • ViCLIP (Video-text representation learning based on ViT-L): (1) Learned on InternVid via contrastive learning. (2) hope to advance text-to-video generation research.
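
A hedged sketch of the symmetric video-text contrastive (InfoNCE) objective a ViCLIP-style model is trained with. The frame features, naive mean temporal pooling, and fixed temperature are illustrative placeholders; the paper's actual encoders, masking, and schedule differ.

```python
# Hedged sketch of a symmetric video-text InfoNCE objective (ViCLIP-style).
# Frame features, naive mean pooling, and the fixed temperature are placeholders.
import torch
import torch.nn.functional as F

def video_text_nce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) pooled clip and caption embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B); matching pairs on the diagonal
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

frames = torch.randn(8, 16, 512)                   # B clips x 16 frames x D
video_emb = frames.mean(dim=1)                     # naive temporal pooling
text_emb = torch.randn(8, 512)                     # caption embeddings
print(video_text_nce(video_emb, text_emb))
```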

image-20240502112211777

  • Clearing up my misunderstandings
    • InternVid (ICLR24), InternVideo, and InternVideo2 (arXived 2024.05, model code available) all use a video vision transformer.
    • The frame-captioning part above is used only for video-text data generation. On closer inspection, though, the frame captioner does not seem to be asked for long captions....
    • Still need to check roughly how long the videos and captions are in this paper and which evaluation metrics it uses.
    • As shown in the figure below (left and right borrowed from the VASTA and InternVideo2 papers, respectively), InternVideo2 and VASTA also use CIDEr to evaluate video captions. (For CIDEr evaluation the output is a single sentence and does not carry much information.)
    • According to the Video Multimodal Annotation section of InternVideo2, they still use a pipeline like the image above for data generation.

image-20240716215706366

 

◽️ Simple LLM Framework for Long-Range Video Question-Answering. arXiv, Dec, 2023.

  • Tag: Video question-answering (LVQA) / Using ImageCaptioner (LaViLa, BLIP-2)
  • Language-based Long-range Video Question-Answering (LVQA) framework.
  • Previous work: FrozenBiLM (It fails to answer a question that requires reasoning about complex human activities in a long video.)
  • Method: (1) a short-term visual captioner captions 0.5-8s clips; (2) an LLM aggregates the captions to answer a given question (see the prompt sketch below).
  • Findings: (1) proposing a novel multi-round summarization prompt (2) the choice of the visual captioner and LLM is critical. (LaViLa > BLIP-2 > EgoVLP)
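
A sketch of that two-stage flow: short-clip captions are stitched into a text prompt, optionally condensed by a first summarization round (the multi-round prompt idea), and then handed to an LLM together with the question. The prompt wording, timestamp format, and the `llm` callable are hypothetical placeholders, not the paper's exact prompts.

```python
# Sketch of the caption-then-reason flow; the prompt wording, timestamp format,
# and the `llm` callable are hypothetical placeholders, not the paper's prompts.
def answer_long_video(clip_captions, question, llm, multi_round=True):
    context = "\n".join(f"[{i * 4}s-{(i + 1) * 4}s] {c}"
                        for i, c in enumerate(clip_captions))
    if multi_round:
        # Round 1: compress many short-clip captions into one summary.
        context = llm("Summarize what happens in this video:\n" + context)
    # Round 2 (or single round): answer the question from the textual context.
    return llm(f"Video description:\n{context}\n\nQuestion: {question}\nAnswer:")

# Usage with a stub LLM:
captions = ["a person opens a drawer", "the person takes out a knife"]
print(answer_long_video(captions, "What tool does the person use?",
                        llm=lambda prompt: f"<LLM reply to {len(prompt)} chars>"))
```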

image-20240502154033277

 

 

◽️ LaViLa: Learning Video Representations from Large Language Models. Facebook, CVPR, 2023.

  • Tag: Video-text dataset, CLIP pretraining with video data (ViCLIP)
  • Automatically generate text pairing for such videos by leveraging Large Language Models (LLMs). → Take full advantage of the massive video data → stronger representations
  • LaViLa (Language-model augmented Video-Language pre-training): GPT-2 as the backbone; input: video, output: automatically generated text. LaViLa becomes a "visually-conditioned narrator." (I don't yet understand how GPT-2 is trained here or how the image encoding works; look it up later if needed.)
  • Advantages of LaViLa: (1) it can generate dense descriptions for long videos, (2) the generated text is well-aligned with the visual input, (3) it can generate assistive and augmented descriptions for egocentric-view videos, and (4) the generated data can be used for video-text contrastive learning.
  • In the figure on the right, the Narrator extracts descriptions of the actions and the Rephraser augments the text. The data obtained this way is used to train the dual encoders (the backbones for representation learning); the overall data flow is sketched below.
  • A per-GPU batch size of 32 over 32 GPUs for TimeSformer-B and a per-GPU batch size of 16 over 64 GPUs for TimeSformer-L / no COCO captioning evaluation.
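
A very high-level sketch of that data flow under my reading of the paper: the Narrator densely pseudo-captions unlabeled clips, the Rephraser paraphrases existing narrations, and the union of real, pseudo, and augmented pairs trains the dual encoders (e.g., with a video-text InfoNCE loss like the one above). The `narrator`, `rephraser`, and `train_dual_encoders` callables are placeholders.

```python
# Very rough data-flow sketch (my reading, not the paper's code): the Narrator
# densely pseudo-captions unlabeled clips, the Rephraser paraphrases existing
# narrations, and real + pseudo + augmented pairs together train the dual
# encoders. `narrator`, `rephraser`, and `train_dual_encoders` are placeholders.
def build_lavila_corpus(unlabeled_clips, labeled_pairs, narrator, rephraser):
    pseudo = [(clip, narrator(clip)) for clip in unlabeled_clips]          # new dense pairs
    rephrased = [(clip, rephraser(text)) for clip, text in labeled_pairs]  # text augmentation
    return labeled_pairs + pseudo + rephrased

# corpus = build_lavila_corpus(clips, pairs, narrator, rephraser)
# train_dual_encoders(corpus)   # e.g. with a video-text InfoNCE loss as sketched earlier
```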

image-20240502210903645

 

 

◽️ EgoVLP: Egocentric Video-Language Pretraining. NeurIPS, 2022.

  • Tag: Egocentric Video/Clip-text dataset. Video-text contrastive learning (EgoNCE)
  • We exploit the recently released Ego4D dataset to pioneer Egocentric Video-Language Pretraining in three directions:
    1. We create EgoClip. 3.8M clip-text pairs.
    2. We propose a novel pretraining objective, dubbed EgoNCE (rough sketch after this list).
    3. We created a benchmark, Egocentric Multiple-Choices Question (EgoMCQ), which contains 39K questions created from Ego4D and focuses on evaluating video-text alignment.
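
A rough, hedged sketch of the action-aware positive idea in EgoNCE as I understand it: clips whose narrations share the same (verb, noun) labels are treated as extra positives, turning the InfoNCE target into a multi-positive mask. The label extraction and the scene-aware negative sampling are omitted, and all tensors are random placeholders.

```python
# Rough, hedged sketch of the action-aware positive idea (my reading of EgoNCE):
# clips whose narrations share a (verb, noun) label are extra positives, so the
# InfoNCE target becomes a multi-positive mask. Scene-aware negatives and the
# label extraction are omitted; all tensors are random placeholders.
import torch
import torch.nn.functional as F

def egonce_like(video_emb, text_emb, action_ids, temperature=0.05):
    """action_ids: (B,) id of the (verb, noun) pair of each clip's narration."""
    v, t = F.normalize(video_emb, dim=-1), F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                               # (B, B)
    pos = (action_ids[:, None] == action_ids[None, :]).float()   # multi-positive mask
    log_prob = logits.log_softmax(dim=1)
    return -(log_prob * pos / pos.sum(dim=1, keepdim=True)).sum(dim=1).mean()

B, D = 8, 512
print(egonce_like(torch.randn(B, D), torch.randn(B, D), torch.randint(0, 4, (B,))))
```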

image-20240502212438192

 

 

◽️ Distilling Vision-Language Models on Millions of Videos. CVPR 2024.

  • Tag: Video-text data generation model (video question-answering/captioning), Video-text CLIP model (video-text retrieval, video recognition)
  • Enough human-curated video-text data is not available. So, we fine-tune a video-language model from a strong image-language baseline with synthesized instructional data. The finetuned video model is then used to auto-label millions of videos.
  • Evaluation: MSR-VTT zero-shot text-to-video retrieval, open-ended NExT-QA
  • Problem of InternVid: The resulting image captions are often biased towards static scenes and lose videos’ rich temporal information.
  • Method: (1) first fine-tune the visual encoder, (2) fine-tune the language model on a small amount of instruction-following data, and (3) The resulting video-language model sees both dynamic input and motion-focused output.
  • Dataset: (1) High alignment btw video and text. (2) temporal information in captions (3) textual descriptions with multiple granularities (4) more scalable than human labeling
  • Model: PaLI-3 (SOTA VLM), ViT-G/14 (visual encoder), UL-2 (language model)

image-20240502223036133

 

◽️ VideoCon: Robust Video-Language Alignment via Contrast Captions. Google, arXiv, Nov 2023. 2.

  • Tags: contrastive dataset (i.e., positive and negative pairs) with video captions
  • Overview: build a dataset for video-caption contrastive learning → also pre-generate NLEs (natural language explanations) with an LLM → fine-tune a video-language model.

image-20240506130412998

 

 

◽️ V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. Meta, ICLR 2024 (submitted).

  • Video-level MAE + SimSiam.

image-20240520225801433

 

 

◽️ FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. arXiv, 2024.

  • Tags: CLIP's poor interpretation of fine-grained attributes
  • Overview: the CLIP encoder cannot interpret fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning → because the captions themselves do not contain all the details, the visual encoder does not capture them either (it just acts as a bag of words) → to fix this, it is enough to adapt CLIP on a high-quality, comprehensive, and relatively small dataset → here they use VidSitu (a video situation recognition dataset), which comes with verbs and rich semantic role labels (SRL) → plus hard negatives and hierarchical losses => Fine-Grained CLIP (FiGCLIP)
  • A single 12GB GPU / one RTX 2080 GPU (how is that possible?)
  • 👍 It raises exactly the problems I see in the CLIP encoder. It is a pity that the particular video dataset they use is unfamiliar to me, but the implementation details are still worth referencing.
  • 👍 Once the code is released at the end of this month, I want to check how LoRA is attached and trained; a generic sketch of what that usually looks like is below.
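
Since the note above asks how LoRA is attached: a generic, hedged sketch of wrapping a frozen linear layer (say, a CLIP attention projection) with a low-rank adapter. This is plain LoRA under my own assumptions about rank and scaling, not FiGCLIP's released code.

```python
# Generic LoRA sketch (standard recipe under assumed rank/scaling, not FiGCLIP's
# released code): wrap a frozen linear layer, e.g. a CLIP attention projection,
# and train only the low-rank A/B matrices.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 77, 768)).shape)       # torch.Size([2, 77, 768])
```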

image-20240506151858330

 

 

💙 Distillations

◽️ MiniLLM: Knowledge Distillation of Large Language Models. ICLR, 2024. 71

  • Motive: (1) White-box(=feature-level) KD for LLMs is yet to be explored. (2) The standard KD objectives are sub-optimal for LLMs that perform tasks.
  • Findings: (1) For open-ended text generation tasks, which is usually the case for LLM applications, the output spaces are much more complex, and p(y|x) can contain many more modes than what q_θ(y|x) can express due to the limited model capacity. (A small student simply lacks the capacity to match everything the LLM can do.) (2) Minimizing the typical KLD causes q_θ to assign unreasonably high probabilities to the void regions of p and produces very unlikely samples under p during free-run generation. (See the toy-experiment figure.)
  • Methods: (1) They leverage reverse KL divergence; minimizing reverse KLD has been shown to cause mode-seeking behavior (contrasted in the sketch below). (2) single-step decomposition, (3) teacher-mixed sampling, and (4) length normalization. (See the paper for the details of approaches 2-4.)
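
A hedged sketch contrasting forward and reverse KLD at the token level, just to make the mode-seeking point concrete. MiniLLM itself optimizes reverse KLD over sequences sampled from the student using policy-gradient machinery; the functions below only compute per-position divergences on random placeholder logits.

```python
# Token-level forward vs. reverse KLD, to make the mode-seeking point concrete.
# MiniLLM itself optimizes reverse KLD over student-sampled sequences with a
# policy-gradient method; these are only per-position divergences on random logits.
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):    # KL(p || q_theta): mode-covering
    p = teacher_logits.log_softmax(-1)
    q = student_logits.log_softmax(-1)
    return (p.exp() * (p - q)).sum(-1).mean()

def reverse_kl(teacher_logits, student_logits):    # KL(q_theta || p): mode-seeking
    p = teacher_logits.log_softmax(-1)
    q = student_logits.log_softmax(-1)
    return (q.exp() * (q - p)).sum(-1).mean()

teacher = torch.randn(4, 10, 32000)                 # (batch, seq_len, vocab)
student = torch.randn(4, 10, 32000)
print(forward_kl(teacher, student), reverse_kl(teacher, student))
```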

image-20240515235456991

 

 

◽️ Sequence-Level Knowledge Distillation. EMNLP, 2016. 1026

image-20240515234613094

 

 

 

 

💙 Efficiency

◽️ RECLIP: Resource-Efficient Clip by Training with Small Images. TMLR, 2023.

  • Tag: CLIP with small batch sizes and small-memory GPUs.
  • Motive: Image-text pairs (rich supervision) come with a high level of noise; to cope with it, training CLIP models requires ∼3k V100-GPU-days.
  • Method:
    • Inspired by the notion of coarse-to-fine; see Figure 1. (1) Humans can effortlessly match text-image pairs even if the image size is small. (2) Pretraining incorporates high-level information from small images, and finetuning lets the model refocus its attention on the important details.
  • Results:
    • Only 16 tokens for the image encoding are sufficient for the main training phase, even though recent works tend to use long sequence lengths (see the token-count arithmetic below).
    • We set the batch size to 16384. Our training is run on TPU-v3 infrastructure (16GB of memory per core).
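
The quick arithmetic behind that 16-token figure, assuming a standard ViT patch size of 16: the token count scales quadratically with resolution, so a 64x64 training image yields only 16 patches. The resolutions below are illustrative, not RECLIP's exact schedule.

```python
# Patch-count arithmetic behind the "16 image tokens" point, assuming a standard
# ViT patch size of 16; the resolutions are illustrative, not RECLIP's schedule.
def num_vit_tokens(resolution, patch_size=16):
    return (resolution // patch_size) ** 2

for res in (64, 112, 224, 336):
    print(f"{res}x{res} image -> {num_vit_tokens(res)} tokens")
# 64x64 -> 16, 112x112 -> 49, 224x224 -> 196, 336x336 -> 441
```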

image-20240502230858890

 

◽️ Data-Efficient Multimodal Fusion on a Single GPU. CVPR 24. Highlight

  • The key idea is to pre-extract all the features from unimodal models such as DINOv2 and BGE and then train only the Fusion Adapters, so training is overwhelmingly fast. (Only a 1x512 feature needs to be stored per image, and no time is spent running the image encoder during training.)
  • Findings: Our key insight is that off-the-shelf unimodal encoders that have been pre-trained on large amounts of unimodal data already encode rich semantics.
  • Approach: FuseMix motivated by Mixup.
  • Results: Even though training uses only the features of about 3M-5M images (a low-data regime), it reaches sufficiently good performance. Very large batches are perfectly trainable on a single GPU; a toy version of the training loop is sketched below.
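
A hedged toy version of the FuseMix recipe as summarized above: cached unimodal features, small fusion adapters, mixup applied directly in the cached latent space with one shared lambda for both modalities, and a contrastive loss over a large batch. Adapter sizes, the mixup schedule, and the exact objective are my assumptions, not the paper's settings.

```python
# Toy version of the FuseMix recipe: cached unimodal features, tiny fusion
# adapters, mixup in the cached latent space with one shared lambda for both
# modalities, and a contrastive loss over a large batch. Sizes, the mixup
# schedule, and the exact objective are assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_feats = torch.randn(10000, 768)        # cached image-encoder outputs (e.g. DINOv2)
txt_feats = torch.randn(10000, 1024)       # cached text-encoder outputs (e.g. BGE)
img_adapter = nn.Sequential(nn.Linear(768, 512), nn.GELU(), nn.Linear(512, 512))
txt_adapter = nn.Sequential(nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, 512))
opt = torch.optim.AdamW(list(img_adapter.parameters()) + list(txt_adapter.parameters()), lr=1e-3)

for step in range(100):                    # big batches are cheap: no encoder is ever run
    idx = torch.randperm(10000)[:4096]
    perm = torch.randperm(idx.size(0))
    lam = float(torch.distributions.Beta(1.0, 1.0).sample())
    x = lam * img_feats[idx] + (1 - lam) * img_feats[idx][perm]   # same lambda and pairing
    y = lam * txt_feats[idx] + (1 - lam) * txt_feats[idx][perm]   # for both modalities
    v = F.normalize(img_adapter(x), dim=-1)
    t = F.normalize(txt_adapter(y), dim=-1)
    logits = v @ t.T / 0.07
    labels = torch.arange(idx.size(0))
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    opt.zero_grad(); loss.backward(); opt.step()
```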

image-20240520225311682