Vision Language Model
◽️ Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks ECCV20
- Problem: Existing methods simply concatenate image region features and text features as input. However, the lack of explicit alignment information between the image regions and the text makes alignment modeling a weakly-supervised learning task.
- Method: Keys: add object tags detected by Faster R-CNN / Pretraining (masked token modeling, contrastive loss) + fine-tuning
- Pretraining dataset: COCO, CC, SBU Captions, Flickr30K, GQA (6.5M pairs)
- Finetuning: uses the same loss as pretraining. At inference time, the image + object tags go in as the initial input and the caption is generated autoregressively until the stop token appears; beam search with beam size = 5 (see the beam-search sketch after this entry).
- Architecture: OSCAR-B or OSCAR-L depending on whether BERT-B (110M) or BERT-L (340M) is used. The Faster R-CNN parameters should also be taken into account.
- For nocaps, only labels from Visual Genome and Open Images are fed as tags. No large-scale pretraining is used; they state the model is trained only on COCO. / nocaps contains 15,100 images from Open Images across 600 categories, of which about 400 are categories not found in COCO.
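- A minimal sketch of the beam-search generation described above, assuming a stand-in `score_fn` that returns next-token log-probabilities given the region features, the detected tags, and the partial caption (everything here is illustrative; the real model is BERT-based):

```python
import torch

def generate_caption(score_fn, region_feats, tag_ids, stop_id, beam_size=5, max_len=20):
    """Beam search as in OSCAR captioning inference: image regions + object tags
    form a fixed prefix, and caption tokens are appended until the stop token."""
    beams = [([], 0.0)]            # (token ids so far, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logp = score_fn(region_feats, tag_ids, tokens)      # (vocab,) log-probs
            top_logp, top_ids = logp.topk(beam_size)
            for lp, tid in zip(top_logp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [tid], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)       # keep the best beams
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == stop_id else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Toy usage with a random scorer, just to show the shapes involved.
dummy = lambda regions, tags, toks: torch.log_softmax(torch.randn(100), dim=-1)
print(generate_caption(dummy, torch.randn(10, 2054), [7, 42], stop_id=0))
```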
◽️ VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR21
- A paper that upgrades OSCAR (same authors)
- Problem: Previous VL works neglect to improve the object detection model.
- Solution: The detection model, ResNeXt-152-C4 (not FPN), is trained on COCO, OpenImages, Objects365, and Visual Genome (VG). The VG dataset has a rich set of annotations for both objects and attributes.
- Pretraining: 4 captioning datasets, 3 VQA datasets (8.85M pairs) / uses OSCAR's architecture and loss
◽️ LEMON: Scaling Up Vision-Language Pre-training for Image Captioning CVPR22
- A paper that upgrades VinVL (same authors)
- Ablation study on dataset size (up to 200M ALIGN-style pairs) and model size.
- Motive: neural scaling laws (these studies observe consistent benefits from increasing the model size to billions of parameters, given pre-training data on the order of billions of examples.)
- Method: same as VinVL. Several models with 13M to 674M parameters. Uses five differently sized subsets drawn from the noisy ALT200M data (COCO is not used during pre-training so that data quality does not confound the comparison).
- Since this is primarily an image captioning paper, evaluation is mainly done on nocaps, the COCO Karpathy test split, and the CC3M dev set.
◽️ SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. ICLR22
- Problem: using an object detection model and the MLM loss → poor zero-shot performance
- Approach: the encoder works like BERT (bidirectional), the decoder like GPT (autoregressive), and only text goes into the decoder. The PrefixLM loss embodies the philosophy of making the decoder focus solely on generating the text continuation (see the loss sketch after this entry).
- Architecture: Conv+ViT (CoAtNet), Transformer, tokenizer, no modality-type embedding / B, L, H variants depending on ViT size (roughly 86M, 307M, 632M)
- Dataset: ALIGN (1.8B) + C4 (text-only dataset)
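- A minimal PrefixLM-loss sketch with stand-in modules (SimVLM's real encoder is Conv+ViT and the decoder is a full transformer stack; only the loss logic below follows the paper): the image plus a text prefix are encoded with bidirectional attention, and cross-entropy is computed only on the remaining suffix tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
head = nn.Linear(dim, vocab)

def prefix_lm_loss(image_tokens, text_ids, prefix_len):
    prefix, suffix = text_ids[:, :prefix_len], text_ids[:, prefix_len:]
    memory = encoder(torch.cat([image_tokens, embed(prefix)], dim=1))  # bidirectional over image + prefix
    dec_in = embed(suffix[:, :-1])                                     # teacher forcing on the suffix
    causal = torch.triu(torch.full((dec_in.size(1), dec_in.size(1)), float("-inf")), diagonal=1)
    logits = head(decoder(dec_in, memory, tgt_mask=causal))
    return F.cross_entropy(logits.reshape(-1, vocab), suffix[:, 1:].reshape(-1))

loss = prefix_lm_loss(torch.randn(2, 16, dim), torch.randint(0, vocab, (2, 12)), prefix_len=4)
print(loss.item())
```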
◽️ X-VLM: Multi-Grained Vision Language Pre-Training ICML22
- Problem: methods that use a detector and feed object features as input can struggle to capture relationships between objects, while detector-free, coarse-grained methods have difficulty learning dense alignment.
- Goal: VLM to learn multi-grained alignment (= object + image level)
- Approach: (1) Re-formulate the data. (2) Architecture: an image encoder, a text encoder, and a cross-modal encoder.
- Losses: detection loss (bounding box prediction) / contrastive loss / matching prediction (sample hard negative texts within the mini-batch and force them to get a low matching probability; see the matching-loss sketch after this entry) / masked language modeling
- Dataset: 4M pairs or 16M pairs
- Arch: image: Vision Transformer (Swin Transformer-B) / text encoder (6 transformer layers) / cross-modal encoder (6 transformer layers). Total 215M / 8 A100 GPUs, about 3.5 days on the 4M-pair setting
- Comparing the numbers, VinVL and LEMON are clearly stronger at captioning relative to their dataset size; X-VLM is weak at captioning.
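- A sketch of the matching loss with in-batch hard negatives mentioned above (X-VLM samples negatives from the contrastive similarity distribution; the argmax and the linear `fuse` / `match_head` stand-ins here are simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
fuse = nn.Linear(dim * 2, dim)       # stand-in for the cross-modal encoder
match_head = nn.Linear(dim, 1)       # binary match / no-match classifier

def itm_loss_with_hard_negatives(img_emb, txt_emb):
    B = img_emb.size(0)
    with torch.no_grad():                                   # negatives are picked without gradients
        sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
        sim.fill_diagonal_(float("-inf"))                   # exclude the true pairs
        hard_idx = sim.argmax(dim=1)                        # hardest in-batch text per image
    pos = match_head(fuse(torch.cat([img_emb, txt_emb], dim=-1)))
    neg = match_head(fuse(torch.cat([img_emb, txt_emb[hard_idx]], dim=-1)))
    logits = torch.cat([pos, neg]).squeeze(-1)
    labels = torch.cat([torch.ones(B), torch.zeros(B)])     # negatives get a low matching probability
    return F.binary_cross_entropy_with_logits(logits, labels)

print(itm_loss_with_hard_negatives(torch.randn(8, dim), torch.randn(8, dim)).item())
```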
◽️ OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework ICML22
- Problem: to move to downstream tasks, previous works require extra learnable parts (e.g., adapters) and task-specific formulations (loss, finetuning framework).
- OFA (One For All): both pretraining and finetuning tasks are formulated in a unified sequence-to-sequence abstraction via handcrafted instructions (the many datasets in the figure's upper-left table are converted into the instructions shown at lower left, and then the whole model is trained; see the instruction-format sketch below). (1) No learnable task- or modality-specific components are added. (2) Information from different modalities is represented within a globally shared multimodal vocabulary across all tasks.
- Method: ResNet 3 conv blocks → Image quantization → MAE training + Detection training + BART training
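- A toy illustration of the unified instruction → target-text format (the templates below paraphrase the OFA idea; the exact wording and the `<bin_k>` location-token scheme are used here as assumptions):

```python
# Each task becomes (instruction text, target text); boxes are quantized into
# location tokens so they share the same vocabulary as ordinary words.
def make_example(task, **kw):
    if task == "caption":
        return ("What does the image describe?", kw["caption"])
    if task == "vqa":
        return (kw["question"], kw["answer"])
    if task == "grounding":
        box = " ".join(f"<bin_{int(v * 999)}>" for v in kw["box"])   # normalized xyxy coords
        return (f'Which region does the text "{kw["phrase"]}" describe?', box)
    raise ValueError(task)

print(make_example("caption", caption="two dogs play in the snow"))
print(make_example("grounding", phrase="a red car", box=[0.12, 0.30, 0.55, 0.80]))
```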
◽️ GIT: A Generative Image-to-text Transformer for Vision and Language arXiv22
- No particularly novel method, and the paper is a bit messy. The claim is that good performance comes from scaling up (1) the data size, (2) the model size (image and text), and (3) the image resolution.
- It seems better not to consider this paper for comparisons or as a code reference.
- Arch: similar to OFA, but the image encoder is initialized from CLIP/Florence weights, and the text decoder is randomly initialized instead of using BERT. All weights are trained during training.
- Method: uses only the LM (autoregressive) objective, not MLM; they report this works better empirically.
◽️ X-model: Beyond a Pre-Trained Object Detector for Image Captioning CVPR22
◽️ mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. EMNLP22
- As shown in the figure, the cross-modal skip-connected network reduces running time. Also, no detector is used.
- Arch: the vision encoder is CLIP ViT, but all weights are trained during training.
◽️ BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023. (code)
- Contributions:
- BLIP-2 effectively leverages both frozen pre-trained image models and frozen language models via a lightweight Q-Former.
- SOTA on various vision-language tasks, including visual question answering, image captioning, and image-text retrieval.
- The LLM's complex reasoning can be exploited (e.g., an Audi car in the image + the LLM's external knowledge of Audi's history).
- 54x fewer trainable parameters and 8.7% better performance on zero-shot VQAv2 than Flamingo.
- Training cost: due to the use of frozen models, pre-training is more computationally friendly than existing large-scale VLP methods. They pre-train for 250k steps in the first stage and 80k steps in the second stage, with a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and 1920/1520 for OPT/FlanT5 in the second stage.
- In the image below, the number of trainable parameters is 1B (Table 3 in the paper) because both the Q-Former and the image encoder are updated. The Q-Former alone appears to be around 188M (Table 1) / 108M (Table 2) (sketch below).
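- A small sketch of the frozen/trainable split and the parameter counting discussed above, with deliberately tiny stand-in modules (the real image encoder is ViT-L/ViT-g and the Q-Former is a BERT-base-sized transformer with learned query tokens; sizes here are illustrative):

```python
import torch.nn as nn

# stand-ins: a "ViT" encoder, a "Q-Former" that cross-attends to image features,
# and a projection into the frozen LLM's embedding space
vit = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), 4)
qformer = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True), 2)
llm_proj = nn.Linear(256, 1024)

for p in vit.parameters():            # frozen image encoder, as in pre-training
    p.requires_grad = False

def trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print("trainable:", trainable(qformer) + trainable(llm_proj))   # only Q-Former + projection
print("frozen   :", sum(p.numel() for p in vit.parameters()))
# Unfreezing `vit` as well is what pushes the trainable count toward ~1B in Table 3.
```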
◽️ LLaVA: Large Language and Vision Assistant. NeurIPS, 2023.
- Good references: youtube
- Contributions
- Multimodal Instruct Data. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
- We introduce LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
- Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities.
- Training
- Pretraining: CC3M filtered to 595K image-text pairs; only the adapter is trained
- Fine-tuning: the adapter and LLM are trained on 158K language-image instruction-following samples (see the two-stage sketch after this entry)
- LLaVA 1.0 GitHub. Unlike version 1.5, it mentions that memory can be reduced by changing the batch size and similar settings. LLaVA-Lightning is also mentioned and worth checking.
- Dataset
- Pretraining: Github, liuhaotian LLaVA-CC3M-Pretrain-595K
- Fine-tuning: LLaVA-Instruct-150K; the dataset is the yellow-highlighted part below, while the blue-highlighted part seems to have just been included for reference.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (released January 2024; training code and dataset not available)
- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild (released May 2024; training code and dataset not released)
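- A sketch of the two-stage schedule above with stand-in modules (the real pieces are a CLIP ViT, a linear/MLP projection, and Vicuna; only the freeze/unfreeze pattern follows the paper):

```python
import torch.nn as nn

vision_encoder = nn.Linear(1024, 1024)    # stand-in for CLIP ViT-L
projection     = nn.Linear(1024, 4096)    # the "adapter" between vision features and the LLM
llm            = nn.Linear(4096, 32000)   # stand-in for Vicuna

def set_stage(stage):
    for p in vision_encoder.parameters():
        p.requires_grad = False           # frozen in both stages
    for p in projection.parameters():
        p.requires_grad = True            # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)    # LLM is updated only during instruction tuning

set_stage(1)   # pretraining on the 595K pairs: adapter only
set_stage(2)   # fine-tuning on the 158K instruction data: adapter + LLM
```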
◽️ LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
- Dataset
- Pretraining: Github, LAION/CC/SBU BLIP-Caption Concept-balanced 558K
- Finetuning: llava_v1_5_mix665k.json, and download the images from constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script, we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
◽️ mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. CVPR24
Image captioning
◽️ Tag2Text: Guiding Vision-Language Model via Image Tagging. ICLR, 2024. code
- Tags: detector-free VLM, diverse attributes (= tags)
- Problem of detector-based VLM (VIVO, X-VLM): Heavy (frozen) Faster RCNN
- Problem of detector-free VLM: discarding of valuable tags (= the objects and attributes in the image; 3,429 categories).
- Method: Tagging head
- (1) supervised by annotation-free image tags (leverages large image-text pair datasets; grounding (bounding box) annotation data is not used).
- (2) Only a small network needs to be attached after the image encoder (see the sketch at the end of this entry).
- Results:
- Their tagging model outperforms CLIP, BLIP-2.
- Generation-based (image captioning): Text description generation is based on the image features and also assigned tags.
- 12-layer transformer from BERT_{B}, 2-layer transformer for the tag head, 8 A100 GPUs, batch size 960, 20 epochs
- Memo 🤔: Domain Adaptive Semantic Segmentation Using Weak Labels / isn't the training cost rather large? / is the performance really the best? They also compared against BLIP-2. I want to look at the captioning-related model architecture and code in detail.
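- A simplified tagging-head sketch (the paper's recognition head is a small transformer decoder with per-tag queries; the mean-pooled encoder layer below is a stand-in, and only the BCE-over-3,429-categories supervision follows the idea):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TAGS = 3429

class TagHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.cls = nn.Linear(dim, NUM_TAGS)

    def forward(self, image_tokens):               # (B, N, dim) patch features
        return self.cls(self.block(image_tokens).mean(dim=1))

head = TagHead()
feats = torch.randn(2, 197, 768)                   # dummy ViT patch features
targets = torch.zeros(2, NUM_TAGS)
targets[0, 5] = 1.0                                # multi-hot tags parsed from the paired caption, no boxes
loss = F.binary_cross_entropy_with_logits(head(feats), targets)
print(loss.item())
```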
◽️ SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. CVPR, 2023. (code-s73)
- Tags: lightweight training, retrieved captions from a datastore, training-free domain transfer
- As data and model sizes grow, the training cost grows. As an alternative to large models, they propose SmallCap, which uses related captions retrieved from a datastore. Trainable parameters: cross-attention layers (7M) between a frozen CLIP encoder and a GPT-2 decoder.
- Rather than storing information in a small number of parameters, applying retrieval raises performance further. (?) (Figure 6)
- Replacing only the datastore requires no retraining. The top image in Figure 1 seems to show the performance of existing models on OOD data versus SmallCap with only the datastore swapped.
- Eval dataset: COCO, nocaps (rarely-seen and unseen visual concepts), VizWiz (impaired data)
- Details: (1) Training takes up to 8 hours on a single NVIDIA A100 GPU using 16 GB / (2) batch size of 64 / (3) k = 4 captions retrieved from a datastore / (4) retrieval is based on CLIP ResNet-50x64 representations of the input images and of the captions in the datastore / (5) the latter being precomputed offline and indexed with FAISS for efficient nearest-neighbor search / (6) beam size = 3 (see the retrieval sketch after this entry).
- Method: (1) The number of trainable parameters is further controlled through the dimensionality of the projection matrices in the cross-attention layers (768 dim = 12 heads x 64 dim). (2) See the figure below.
- Memo 🤔: I want to look at the model architecture and code in detail.
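- A sketch of the retrieval step: datastore captions are embedded offline, indexed with FAISS, and the k = 4 nearest captions to the image embedding are pasted into the GPT-2 prompt (random vectors stand in for real CLIP features; the prompt template is paraphrased from the paper):

```python
import numpy as np
import faiss

captions = ["a dog runs on the beach", "two people ride bicycles",
            "a plate of pasta on a table", "a man surfs a large wave"]
cap_emb = np.random.randn(len(captions), 512).astype("float32")   # stand-in CLIP text features
faiss.normalize_L2(cap_emb)
index = faiss.IndexFlatIP(512)        # inner product == cosine similarity after normalization
index.add(cap_emb)

img_emb = np.random.randn(1, 512).astype("float32")               # stand-in CLIP image feature
faiss.normalize_L2(img_emb)
_, ids = index.search(img_emb, 4)                                 # k = 4 as in the paper

prompt = ("Similar images show " + ". ".join(captions[i] for i in ids[0]) +
          ". This image shows ")
print(prompt)   # fed to GPT-2, which also cross-attends to the CLIP image features
```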
◽️ CaMEL: Mean Teacher Learning for Image Captioning. ICPR, 2022. code-s26
- Tags: lightweight training, distillation into an EMA model (sketch below)
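- A minimal mean-teacher / EMA update sketch (the decay value and the linear stand-in model are illustrative; the point is that the teacher's weights track an exponential moving average of the student's):

```python
import copy
import torch

@torch.no_grad()
def ema_update(student, teacher, decay=0.999):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)   # teacher <- decay*teacher + (1-decay)*student

student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)                   # teacher starts as a copy, then tracks the EMA
ema_update(student, teacher)
```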
◽️ Retrieval-augmented image captioning. ACL 2023. code-s12
- Tags: lightweight training, retrieved captions from a datastore
- Related paper by the SmallCap authors: Retrieval-augmented transformer for image captioning. CBMI 22. 32.
◽️ ClipCap: CLIP Prefix for Image Captioning. arXiv 2021. code-s1.2k
- Tags: the first(?) paper to use CLIP for captioning / lightweight training, frozen vision encoder and language decoder
- Challenges of captioning: (1) semantic understanding (a man gives her a gift) (2) the large number of possible ways to describe an image (3) resource hungry (training time, parameters, massive data)
- Method: (1) use the frozen CLIP encoder + GPT-2 decoder (optionally fine-tuning GPT-2 via prefix prompt learning) (2) the method produces a fixed-size embedding sequence (3) training of the mapping network (Transformer or MLP layers) (see the mapping-network sketch below)
- Nvidia GTX1080 GPU for 80 hours
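- A minimal mapping-network sketch (the MLP variant; the prefix length of 10 and hidden size are common defaults, taken here as assumptions): a single CLIP image embedding becomes a fixed-length sequence of GPT-2 prefix embeddings.

```python
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                   # (B, clip_dim) image embedding
        prefix = self.mlp(clip_emb)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)   # prepended to GPT-2 token embeddings

print(MLPMapper()(torch.randn(4, 512)).shape)      # torch.Size([4, 10, 768])
```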
◽️ Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. ICCV, 2023. 6. code-s131
- Tags: text-only training, object hallucination in image captioning
- Motive: aims to remove the hallucination problem caused by the modality bias induced by LLMs (objects frequently seen during training). (Experiments in settings such as COCO → NoCaps / no comparison with recent models such as BLIP.)
- Method: entity-aware decoding (the figure below seems to refer to a CLIP-based classifier) to improve the transferability of zero-shot captioning.
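- A sketch of a CLIP-based entity classifier of the kind the figure appears to show (random vectors stand in for real CLIP embeddings; the entity list and prompt are assumptions): the image is scored against candidate entity names, and decoding is steered toward the top-scoring ones.

```python
import torch
import torch.nn.functional as F

entities = ["dog", "cat", "surfboard", "pizza"]
txt_emb = F.normalize(torch.randn(len(entities), 512), dim=-1)   # CLIP("a photo of a {entity}")
img_emb = F.normalize(torch.randn(1, 512), dim=-1)               # CLIP image embedding

scores = (img_emb @ txt_emb.t()).squeeze(0)                      # cosine similarities
top = scores.topk(2).indices.tolist()
print([entities[i] for i in top])    # entities the decoder is encouraged to mention
```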
◽️ With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023. 2.
- Tags: retrieval-augmented key-value in attention heads.
- FAISS to retrieve memories.
- Motivated by [10] Meshed-Memory Transformer for Image Captioning. CVPR 2020.
- Not worth reading in detail: the code is not released and it has few citations. The performance gap with CaMEL also does not look large (see the attention sketch below).
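- A sketch of attention augmented with retrieved key/value memories (single-head, no projections; shapes and the retrieval source are stand-ins for the paper's prototype memories):

```python
import torch

def attention_with_memory(q, k, v, mem_k, mem_v):
    """q, k, v: (B, N, d) from the current sequence; mem_k, mem_v: (B, M, d)
    retrieved from an external (e.g., FAISS-indexed) memory bank."""
    k = torch.cat([k, mem_k], dim=1)
    v = torch.cat([v, mem_v], dim=1)
    att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return att @ v

out = attention_with_memory(torch.randn(2, 5, 64), torch.randn(2, 5, 64), torch.randn(2, 5, 64),
                            torch.randn(2, 8, 64), torch.randn(2, 8, 64))
print(out.shape)   # torch.Size([2, 5, 64])
```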
◽️ LocCa: Visual Pretraining with Location-aware Captioners. Google. ECCV 2024 submitted.
- Tags: Image captioning with localization, VL model pretraining (summary in Twitter)
- To obtain fine-grained object locations, they use a publicly available OWL-ViT-CLIP-L/14 model (detection model).
- dataset? training cost? (24 + 12 transformer blocks for the encoder and decoder, respectively) cross attention? frozen model? (No — it is not a CLIP encoder; they use their own model. No cross-attention details.)
◽️ FlexCap: Generating Rich, Localized, and Flexible Captions in Images. DeepMind & CMU, ICLR, 2024 submitted.
- Tags: Region-specific descriptions of varying lengths (dense caption), New dataset, and task
- New dataset: (1) image region descriptions of varying length, see Figure 1. (2) The dataset contains images + bounding boxes + captions of several lengths.
- There is the novelty of a new task, which comes through even in the title, but the main reasons for rejection seem to be the lack of a new method, of an analysis of the problems with existing work, and of new insights or lessons (meta review - weaknesses). Indeed, papers that lay out observations and interesting insights seem popular these days. So, while experimenting, it seems very important to carefully organize and write up the impressions gained along the way, whether the results are good or bad.