Vision Language Model
◽️ Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks ECCV20
- Problem: Existing methods simply concatenate image region features and text features as input. However, the lack of explicit alignment information between the image regions and the text makes alignment modeling a weakly-supervised learning task.
- Method: Keys: add object tags detected by Faster R-CNN / Pretraining (masked token modeling, contrastive loss) + fine-tuning
- Pretraining dataset: COCO, CC, SBU Captions, Flickr30K, GQA (6.5M pairs)
- Finetuning: uses the same loss as pretraining. At inference time, the image + object tags go in as the initial input and the caption is generated autoregressively until the stop token appears; beam search with beam size = 5 (see the beam-search sketch after this entry).
- Architecture: OSCAR-B or OSCAR-L depending on whether BERT-B (110M) or BERT-L (340M) is used. The Faster R-CNN parameters should also be taken into account.
- For nocaps, only labels from Visual Genome and Open Images are fed as tags. No large-scale pretraining is used; they state the model is trained only on COCO. / nocaps contains 15,100 images from Open Images across 600 categories, of which about 400 are categories not found in COCO.
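- A minimal sketch of the beam-search generation described above, assuming a stand-in `score_fn` that returns next-token log-probabilities given the region features, the detected tags, and the partial caption (everything here is illustrative; the real model is BERT-based):

```python
import torch

def generate_caption(score_fn, region_feats, tag_ids, stop_id, beam_size=5, max_len=20):
    """Beam search as in OSCAR captioning inference: image regions + object tags
    form a fixed prefix, and caption tokens are appended until the stop token."""
    beams = [([], 0.0)]            # (token ids so far, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logp = score_fn(region_feats, tag_ids, tokens)      # (vocab,) log-probs
            top_logp, top_ids = logp.topk(beam_size)
            for lp, tid in zip(top_logp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [tid], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)       # keep the best beams
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == stop_id else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Toy usage with a random scorer, just to show the shapes involved.
dummy = lambda regions, tags, toks: torch.log_softmax(torch.randn(100), dim=-1)
print(generate_caption(dummy, torch.randn(10, 2054), [7, 42], stop_id=0))
```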
◽️ VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR21
- A paper that upgrades OSCAR (same authors)
- Problem: Previous VL works neglect to improve the object detection model.
- Solution: The detection model, ResNeXt-152-C4 (not FPN), is trained on COCO, OpenImages, Objects365, and Visual Genome (VG). The VG dataset has a rich set of annotations for both objects and attributes.
- Pretraining: 4 captioning datasets, 3 VQA datasets (8.85M pairs) / uses OSCAR's architecture and loss
◽️ LEMON: Scaling Up Vision-Language Pre-training for Image Captioning CVPR22
- A paper that upgrades VinVL (same authors)
- Ablation study on dataset size (up to 200M ALIGN-style pairs) and model size.
- Motive: neural scaling laws (these studies observe consistent benefits from increasing the model size to billions of parameters, given pre-training data on the order of billions of examples.)
- Method: same as VinVL. Several models with 13M to 674M parameters. Uses five differently sized subsets drawn from the noisy ALT200M data (COCO is not used during pre-training so that data quality does not confound the comparison).
- Since this is primarily an image captioning paper, evaluation is mainly done on nocaps, the COCO Karpathy test split, and the CC3M dev set.
◽️ SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. ICLR22
- Problem: using an object detection model and the MLM loss → poor zero-shot performance
- Approach: the encoder works like BERT (bidirectional), the decoder like GPT (autoregressive), and only text goes into the decoder. The PrefixLM loss embodies the philosophy of making the decoder focus solely on generating the text continuation (see the loss sketch after this entry).
- Architecture: Conv+ViT (CoAtNet), Transformer, tokenizer, no modality-type embedding / B, L, H variants depending on ViT size (roughly 86M, 307M, 632M)
- Dataset: ALIGN (1.8B) + C4 (text-only dataset)
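- A minimal PrefixLM-loss sketch with stand-in modules (SimVLM's real encoder is Conv+ViT and the decoder is a full transformer stack; only the loss logic below follows the paper): the image plus a text prefix are encoded with bidirectional attention, and cross-entropy is computed only on the remaining suffix tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
head = nn.Linear(dim, vocab)

def prefix_lm_loss(image_tokens, text_ids, prefix_len):
    prefix, suffix = text_ids[:, :prefix_len], text_ids[:, prefix_len:]
    memory = encoder(torch.cat([image_tokens, embed(prefix)], dim=1))  # bidirectional over image + prefix
    dec_in = embed(suffix[:, :-1])                                     # teacher forcing on the suffix
    causal = torch.triu(torch.full((dec_in.size(1), dec_in.size(1)), float("-inf")), diagonal=1)
    logits = head(decoder(dec_in, memory, tgt_mask=causal))
    return F.cross_entropy(logits.reshape(-1, vocab), suffix[:, 1:].reshape(-1))

loss = prefix_lm_loss(torch.randn(2, 16, dim), torch.randint(0, vocab, (2, 12)), prefix_len=4)
print(loss.item())
```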
◽️ X-VLM: Multi-Grained Vision Language Pre-Training ICML22
- Problem: methods that use a detector and feed object features as input can struggle to capture relationships between objects, while detector-free, coarse-grained methods have difficulty learning dense alignment.
- Goal: VLM to learn multi-grained alignment (= object + image level)
- Approach: (1) Re-formulate the data. (2) Architecture: an image encoder, a text encoder, and a cross-modal encoder.
- Losses: detection loss (bounding box prediction) / contrastive loss / matching prediction (sample hard negative texts within the mini-batch and force them to get a low matching probability; see the matching-loss sketch after this entry) / masked language modeling
- Dataset: 4M pairs or 16M pairs
- Arch: image: Vision Transformer (Swin Transformer-B) / text encoder (6 transformer layers) / cross-modal encoder (6 transformer layers). Total 215M / 8 A100 GPUs, about 3.5 days on the 4M-pair setting
- Comparing the numbers, VinVL and LEMON are clearly stronger at captioning relative to their dataset size; X-VLM is weak at captioning.
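- A sketch of the matching loss with in-batch hard negatives mentioned above (X-VLM samples negatives from the contrastive similarity distribution; the argmax and the linear `fuse` / `match_head` stand-ins here are simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
fuse = nn.Linear(dim * 2, dim)       # stand-in for the cross-modal encoder
match_head = nn.Linear(dim, 1)       # binary match / no-match classifier

def itm_loss_with_hard_negatives(img_emb, txt_emb):
    B = img_emb.size(0)
    with torch.no_grad():                                   # negatives are picked without gradients
        sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
        sim.fill_diagonal_(float("-inf"))                   # exclude the true pairs
        hard_idx = sim.argmax(dim=1)                        # hardest in-batch text per image
    pos = match_head(fuse(torch.cat([img_emb, txt_emb], dim=-1)))
    neg = match_head(fuse(torch.cat([img_emb, txt_emb[hard_idx]], dim=-1)))
    logits = torch.cat([pos, neg]).squeeze(-1)
    labels = torch.cat([torch.ones(B), torch.zeros(B)])     # negatives get a low matching probability
    return F.binary_cross_entropy_with_logits(logits, labels)

print(itm_loss_with_hard_negatives(torch.randn(8, dim), torch.randn(8, dim)).item())
```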
◽️ OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework ICML22
- Problem: to move to downstream tasks, previous works require extra learnable parts (e.g., adapters) and task-specific formulations (loss, finetuning framework).
- OFA (One For All): both pretraining and finetuning tasks are formulated in a unified sequence-to-sequence abstraction via handcrafted instructions (the many datasets in the figure's upper-left table are converted into the instructions shown at lower left, and then the whole model is trained; see the instruction-format sketch below). (1) No learnable task- or modality-specific components are added. (2) Information from different modalities is represented within a globally shared multimodal vocabulary across all tasks.
- Method: ResNet 3 conv blocks → Image quantization → MAE training + Detection training + BART training
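- A toy illustration of the unified instruction → target-text format (the templates below paraphrase the OFA idea; the exact wording and the `<bin_k>` location-token scheme are used here as assumptions):

```python
# Each task becomes (instruction text, target text); boxes are quantized into
# location tokens so they share the same vocabulary as ordinary words.
def make_example(task, **kw):
    if task == "caption":
        return ("What does the image describe?", kw["caption"])
    if task == "vqa":
        return (kw["question"], kw["answer"])
    if task == "grounding":
        box = " ".join(f"<bin_{int(v * 999)}>" for v in kw["box"])   # normalized xyxy coords
        return (f'Which region does the text "{kw["phrase"]}" describe?', box)
    raise ValueError(task)

print(make_example("caption", caption="two dogs play in the snow"))
print(make_example("grounding", phrase="a red car", box=[0.12, 0.30, 0.55, 0.80]))
```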
◽️ GIT: A Generative Image-to-text Transformer for Vision and Language arXiv22
- No particularly novel method, and the paper is a bit messy. The claim is that good performance comes from scaling up (1) the data size, (2) the model size (image and text), and (3) the image resolution.
- It seems better not to consider this paper for comparisons or as a code reference.
- Arch: similar to OFA, but the image encoder is initialized from CLIP/Florence weights, and the text decoder is randomly initialized instead of using BERT. All weights are trained during training.
- Method: uses only the LM (autoregressive) objective, not MLM; they report this works better empirically.
◽️ X-model: Beyond a Pre-Trained Object Detector for Image Captioning CVPR22
◽️ mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. EMNLP22
- As shown in the figure, the cross-modal skip-connected network reduces running time. Also, no detector is used.
- Arch: the vision encoder is CLIP ViT, but all weights are trained during training.
◽️ BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023. (code)
- Contributions:
- BLIP-2 effectively leverages both frozen pre-trained image models and frozen language models via a lightweight Q-Former.
- SOTA on various vision-language tasks, including visual question answering, image captioning, and image-text retrieval.
- The LLM's complex reasoning can be exploited (e.g., an Audi car in the image + the LLM's external knowledge of Audi's history).
- 54x fewer trainable parameters and 8.7% better performance on zero-shot VQAv2 than Flamingo.
- Training cost: due to the use of frozen models, pre-training is more computationally friendly than existing large-scale VLP methods. They pre-train for 250k steps in the first stage and 80k steps in the second stage, with a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and 1920/1520 for OPT/FlanT5 in the second stage.
- In the image below, the number of trainable parameters is 1B (Table 3 in the paper) because both the Q-Former and the image encoder are updated. The Q-Former alone appears to be around 188M (Table 1) / 108M (Table 2) (sketch below).
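- A small sketch of the frozen/trainable split and the parameter counting discussed above, with deliberately tiny stand-in modules (the real image encoder is ViT-L/ViT-g and the Q-Former is a BERT-base-sized transformer with learned query tokens; sizes here are illustrative):

```python
import torch.nn as nn

# stand-ins: a "ViT" encoder, a "Q-Former" that cross-attends to image features,
# and a projection into the frozen LLM's embedding space
vit = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), 4)
qformer = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True), 2)
llm_proj = nn.Linear(256, 1024)

for p in vit.parameters():            # frozen image encoder, as in pre-training
    p.requires_grad = False

def trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print("trainable:", trainable(qformer) + trainable(llm_proj))   # only Q-Former + projection
print("frozen   :", sum(p.numel() for p in vit.parameters()))
# Unfreezing `vit` as well is what pushes the trainable count toward ~1B in Table 3.
```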
◽️ LLaVA: Large Language and Vision Assistant. NeurIPS, 2023.
- Good references: youtube
- Contributions
- Multimodal Instruct Data. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
- We introduce LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
- Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities.
- Training
- Pretraining: CC3M filtered to 595K image-text pairs; only the adapter is trained
- Fine-tuning: the adapter and LLM are trained on 158K language-image instruction-following samples (see the two-stage sketch after this entry)
- LLaVA 1.0 GitHub. Unlike version 1.5, it mentions that memory can be reduced by changing the batch size and similar settings. LLaVA-Lightning is also mentioned and worth checking.
- Dataset
- Pretraining: Github, liuhaotian LLaVA-CC3M-Pretrain-595K
- Fine-tuning: LLaVA-Instruct-150K; the dataset is the yellow-highlighted part below, while the blue-highlighted part seems to have just been included for reference.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (released January 2024; training code and dataset not available)
- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild (released May 2024; training code and dataset not released)
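- A sketch of the two-stage schedule above with stand-in modules (the real pieces are a CLIP ViT, a linear/MLP projection, and Vicuna; only the freeze/unfreeze pattern follows the paper):

```python
import torch.nn as nn

vision_encoder = nn.Linear(1024, 1024)    # stand-in for CLIP ViT-L
projection     = nn.Linear(1024, 4096)    # the "adapter" between vision features and the LLM
llm            = nn.Linear(4096, 32000)   # stand-in for Vicuna

def set_stage(stage):
    for p in vision_encoder.parameters():
        p.requires_grad = False           # frozen in both stages
    for p in projection.parameters():
        p.requires_grad = True            # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)    # LLM is updated only during instruction tuning

set_stage(1)   # pretraining on the 595K pairs: adapter only
set_stage(2)   # fine-tuning on the 158K instruction data: adapter + LLM
```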
◽️ LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
- Dataset
- Pretraining: Github, LAION/CC/SBU BLIP-Caption Concept-balanced 558K
- Finetuning: llava_v1_5_mix665k.json, and download the images from constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script, we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
◽️ mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. CVPR24
Image captioning
◽️ Tag2Text: Guiding Vision-Language Model via Image Tagging. ICLR, 2024. code
- Tags: detector-free VLM, diverse attributes (= tags)
- Problem of detector-based VLM (VIVO, X-VLM): Heavy (frozen) Faster RCNN
- Problem of detector-free VLM: discarding of valuable tags (= the objects and attributes in the image; 3,429 categories).
- Method: Tagging head
- (1) supervised by annotation-free image tags (leverages large image-text pair datasets; grounding (bounding box) annotation data is not used).
- (2) Only a small network needs to be attached after the image encoder (see the sketch at the end of this entry).
- Results:
- Their tagging model outperforms CLIP, BLIP-2.
- Generation-based (image captioning): Text description generation is based on the image features and also assigned tags.
- 12-layer transformer from BERT_{B}, 2-layer transformer for the tag head, 8 A100 GPUs, batch size 960, 20 epochs
- Memo 🤔: Domain Adaptive Semantic Segmentation Using Weak Labels / isn't the training cost rather large? / is the performance really the best? They also compared against BLIP-2. I want to look at the captioning-related model architecture and code in detail.
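- A simplified tagging-head sketch (the paper's recognition head is a small transformer decoder with per-tag queries; the mean-pooled encoder layer below is a stand-in, and only the BCE-over-3,429-categories supervision follows the idea):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TAGS = 3429

class TagHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.cls = nn.Linear(dim, NUM_TAGS)

    def forward(self, image_tokens):               # (B, N, dim) patch features
        return self.cls(self.block(image_tokens).mean(dim=1))

head = TagHead()
feats = torch.randn(2, 197, 768)                   # dummy ViT patch features
targets = torch.zeros(2, NUM_TAGS)
targets[0, 5] = 1.0                                # multi-hot tags parsed from the paired caption, no boxes
loss = F.binary_cross_entropy_with_logits(head(feats), targets)
print(loss.item())
```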
◽️ SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. CVPR, 2023. (code-s73)
- Tags: lightweight training, retrieved captions from a datastore, training-free domain transfer
- As data and model sizes grow, the training cost grows. As an alternative to large models, they propose SmallCap, which uses related captions retrieved from a datastore. Trainable parameters: cross-attention layers (7M) between a frozen CLIP encoder and a GPT-2 decoder.
- Rather than storing information in a small number of parameters, applying retrieval raises performance further. (?) (Figure 6)
- Replacing only the datastore requires no retraining. The top image in Figure 1 seems to show the performance of existing models on OOD data versus SmallCap with only the datastore swapped.
- Eval dataset: COCO, nocaps (rarely-seen and unseen visual concepts), VizWiz (impaired data)
- Details: (1) Training takes up to 8 hours on a single NVIDIA A100 GPU using 16 GB / (2) batch size of 64 / (3) k = 4 captions retrieved from a datastore / (4) retrieval is based on CLIP ResNet-50x64 representations of the input images and of the captions in the datastore / (5) the latter being precomputed offline and indexed with FAISS for efficient nearest-neighbor search / (6) beam size = 3 (see the retrieval sketch after this entry).
- Method: (1) The number of trainable parameters is further controlled through the dimensionality of the projection matrices in the cross-attention layers (768 dim = 12 heads x 64 dim). (2) See the figure below.
- Memo 🤔: I want to look at the model architecture and code in detail.
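- A sketch of the retrieval step: datastore captions are embedded offline, indexed with FAISS, and the k = 4 nearest captions to the image embedding are pasted into the GPT-2 prompt (random vectors stand in for real CLIP features; the prompt template is paraphrased from the paper):

```python
import numpy as np
import faiss

captions = ["a dog runs on the beach", "two people ride bicycles",
            "a plate of pasta on a table", "a man surfs a large wave"]
cap_emb = np.random.randn(len(captions), 512).astype("float32")   # stand-in CLIP text features
faiss.normalize_L2(cap_emb)
index = faiss.IndexFlatIP(512)        # inner product == cosine similarity after normalization
index.add(cap_emb)

img_emb = np.random.randn(1, 512).astype("float32")               # stand-in CLIP image feature
faiss.normalize_L2(img_emb)
_, ids = index.search(img_emb, 4)                                 # k = 4 as in the paper

prompt = ("Similar images show " + ". ".join(captions[i] for i in ids[0]) +
          ". This image shows ")
print(prompt)   # fed to GPT-2, which also cross-attends to the CLIP image features
```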
◽️ CaMEL: Mean Teacher Learning for Image Captioning. ICPR, 2022. code-s26
- Tags: lightweight training, distillation into an EMA model (sketch below)
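- A minimal mean-teacher / EMA update sketch (the decay value and the linear stand-in model are illustrative; the point is that the teacher's weights track an exponential moving average of the student's):

```python
import copy
import torch

@torch.no_grad()
def ema_update(student, teacher, decay=0.999):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)   # teacher <- decay*teacher + (1-decay)*student

student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)                   # teacher starts as a copy, then tracks the EMA
ema_update(student, teacher)
```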
◽️ Retrieval-augmented image captioning. ACL 2023. code-s12
- Tags: lightweight training, retrieved captions from a datastore
- Related paper by the SmallCap authors: Retrieval-augmented transformer for image captioning. CBMI 22. 32.
◽️ ClipCap: CLIP Prefix for Image Captioning. arXiv 2021. code-s1.2k
- Tags: the first(?) paper to use CLIP for captioning / lightweight training, frozen vision encoder and language decoder
- Challenges of captioning: (1) semantic understanding (a man gives her a gift) (2) the large number of possible ways to describe an image (3) resource hungry (training time, parameters, massive data)
- Method: (1) use the frozen CLIP encoder + GPT-2 decoder (optionally fine-tuning GPT-2 via prefix prompt learning) (2) the method produces a fixed-size embedding sequence (3) training of the mapping network (Transformer or MLP layers) (see the mapping-network sketch below)
- Nvidia GTX1080 GPU for 80 hours
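- A minimal mapping-network sketch (the MLP variant; the prefix length of 10 and hidden size are common defaults, taken here as assumptions): a single CLIP image embedding becomes a fixed-length sequence of GPT-2 prefix embeddings.

```python
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                   # (B, clip_dim) image embedding
        prefix = self.mlp(clip_emb)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)   # prepended to GPT-2 token embeddings

print(MLPMapper()(torch.randn(4, 512)).shape)      # torch.Size([4, 10, 768])
```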
◽️ Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. ICCV, 2023. 6. code-s131
- Tags: text-only training, object hallucination in image captioning
- Motive: aims to remove the hallucination problem caused by the modality bias induced by LLMs (objects frequently seen during training). (Experiments in settings such as COCO → NoCaps / no comparison with recent models such as BLIP.)
- Method: entity-aware decoding (the figure below seems to refer to a CLIP-based classifier) to improve the transferability of zero-shot captioning.
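- A sketch of a CLIP-based entity classifier of the kind the figure appears to show (random vectors stand in for real CLIP embeddings; the entity list and prompt are assumptions): the image is scored against candidate entity names, and decoding is steered toward the top-scoring ones.

```python
import torch
import torch.nn.functional as F

entities = ["dog", "cat", "surfboard", "pizza"]
txt_emb = F.normalize(torch.randn(len(entities), 512), dim=-1)   # CLIP("a photo of a {entity}")
img_emb = F.normalize(torch.randn(1, 512), dim=-1)               # CLIP image embedding

scores = (img_emb @ txt_emb.t()).squeeze(0)                      # cosine similarities
top = scores.topk(2).indices.tolist()
print([entities[i] for i in top])    # entities the decoder is encouraged to mention
```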
◽️ With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023. 2.
- Tags: retrieval-augmented key-value in attention heads.
- FAISS to retrieve memories.
- Motivated by [10] Meshed-Memory Transformer for Image Captioning. CVPR 2020.
- Not worth reading in detail: the code is not released and it has few citations. The performance gap with CaMEL also does not look large (see the attention sketch below).
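- A sketch of attention augmented with retrieved key/value memories (single-head, no projections; shapes and the retrieval source are stand-ins for the paper's prototype memories):

```python
import torch

def attention_with_memory(q, k, v, mem_k, mem_v):
    """q, k, v: (B, N, d) from the current sequence; mem_k, mem_v: (B, M, d)
    retrieved from an external (e.g., FAISS-indexed) memory bank."""
    k = torch.cat([k, mem_k], dim=1)
    v = torch.cat([v, mem_v], dim=1)
    att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return att @ v

out = attention_with_memory(torch.randn(2, 5, 64), torch.randn(2, 5, 64), torch.randn(2, 5, 64),
                            torch.randn(2, 8, 64), torch.randn(2, 8, 64))
print(out.shape)   # torch.Size([2, 5, 64])
```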
◽️ LocCa: Visual Pretraining with Location-aware Captioners. Google. ECCV 2024 submitted.
- Tags: Image captioning with localization, VL model pretraining (summary in Twitter)
- To obtain fine-grained object locations, they use a publicly available OWL-ViT-CLIP-L/14 model (detection model).
- dataset? training cost? (24 + 12 transformer blocks for the encoder and decoder, respectively) cross attention? frozen model? (No — it is not a CLIP encoder; they use their own model. No cross-attention details.)
◽️ FlexCap: Generating Rich, Localized, and Flexible Captions in Images. DeepMind & CMU, ICLR, 2024 submitted.
- Tags: Region-specific descriptions of varying lengths (dense caption), New dataset, and task
- New dataset: (1) image region descriptions of varying length, see Figure 1. (2) The dataset contains images + bounding boxes + captions of several lengths.
- There is the novelty of a new task, which comes through even in the title, but the main reasons for rejection seem to be the lack of a new method, of an analysis of the problems with existing work, and of new insights or lessons (meta review - weaknesses). Indeed, papers that lay out observations and interesting insights seem popular these days. So, while experimenting, it seems very important to carefully organize and write up the impressions gained along the way, whether the results are good or bad.