๐Ÿ’™ Vision Language Model

◽️ Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks ECCV20

  • Problem: Existing methods simply concatenate image region features and text features as input. However, the lack of explicit alignment information between the image regions and text makes alignment modeling a weakly-supervised learning task.
  • Method: Keys: add object tags detected by Faster-RCNN / pretraining (masked token modeling, contrastive loss) + fine-tuning (see the input sketch after this list).
  • Pretraining dataset: COCO, CC, SBU captions, Flickr30K, GQA (6.5M pairs)
  • Finetuning: uses the same losses as pretraining. At inference, the image + object tags go in as the initial input and text is generated autoregressively; generation continues until the stop token appears, using beam search with beam size = 5.
  • Architecture: OSCAR-B and OSCAR-L, depending on whether BERT-B (110M) or BERT-L (340M) is used. The Faster-RCNN parameters also have to be counted.
  • Nocaps์—์„œ๋Š” Visual Genome, Open Images ์— ์žˆ๋Š” labels๋งŒ tags๋กœ ๋„ฃ์–ด์คŒ. pretraining ํ•˜์ง€ ์•Š๊ณ , coco์—์„œ pretraining๋งŒ ์ˆ˜ํ–‰ํ–ˆ๋‹ค๊ณ  ํ•จ. / Nocaps๋Š” Open Images์—์„œ์˜ 15100์ด๋ฏธ์ง€๋ฅผ ํฌํ•จํ•˜๋ฉฐ 600์นดํ…Œ๊ณ ๋ฆฌ๊ฐ€ ์žˆ๋Š”๋ฐ ๊ทธ ์ค‘ 400๊ฐœ๋Š” Coco์— ์—†๋Š” ์นดํ…Œ๊ณ ๋ฆฌ๋ผ๊ณ  ํ•จ.

image-20240525143525571

 

 

◽️ VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR21

  • A follow-up paper that upgrades OSCAR (same authors).
  • Problem: Previous VL works neglect to improve the object detection model.
  • Solution: The detection model, ResNeXt-152-C4 (not FPN), is trained on COCO, OpenImages, Objects365, and Visual Genome (VG). The VG dataset has a rich set of annotations for both objects and attributes.
  • Pretraining: 4 captioning datasets and 3 VQA datasets (8.85M pairs) / uses OSCAR's architecture and losses.

image-20240525152224277

 

 

◽️ LEMON: Scaling Up Vision-Language Pre-training for Image Captioning CVPR22

  • A follow-up paper that upgrades VinVL (same authors).
  • Ablation study on dataset size (up to the 200M-pair ALT200M alt-text corpus) and model size.
  • Motivation: neural scaling laws (these studies have observed consistent benefits from increasing model size to billions of parameters, given pre-training data on the order of billions of examples).
  • Method: same as VinVL. Several models ranging from 13M to 674M parameters. Five randomly sampled subsets of different sizes from the noisy ALT 200M data are used (COCO is excluded from pre-training so that data quality does not affect the comparison).
  • Since this is an image-captioning-focused paper, evaluation is done mainly on Nocaps, the COCO Karpathy test split, and the CC3M dev set.

image-20240525171236055

◽️ SimVLM Simple Visual Language Model Pretraining with Weak Supervision ICLR22

  • Problem: prior works rely on an object detection model and an MLM loss → poor zero-shot performance.
  • Approach: Encoder: BERT-style (bidirectional) / Decoder: GPT-style (autoregressive), with only text on the decoder side. The philosophy of the PrefixLM loss is to focus only on generating the text portion handled by the decoder (see the sketch after this list).
  • Architecture: Conv+ViT (CoAtNet), Transformer, tokenizer, no modality-type embedding / B, L, H variants depending on the ViT size (roughly 86M, 307M, 632M).
  • Dataset: ALIGN (1.8B image-text pairs) + C4 (text-only dataset)
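
A minimal sketch of the PrefixLM idea, assuming the decoder has already produced per-token logits for the text sequence (the `prefix_lm_loss` helper and all shapes are made up for illustration): cross-entropy is computed only on the tokens after the prefix, which is what "focusing on the generation part" means.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(logits, targets, prefix_len):
    """PrefixLM-style loss sketch: cross-entropy only on tokens after the prefix.

    logits:  (batch, seq_len, vocab) decoder predictions for the text sequence.
    targets: (batch, seq_len) ground-truth token ids.
    prefix_len: number of leading text tokens treated as the (bidirectionally encoded)
                prefix; they receive no generation loss.
    """
    suffix_logits = logits[:, prefix_len:, :]
    suffix_targets = targets[:, prefix_len:]
    return F.cross_entropy(
        suffix_logits.reshape(-1, suffix_logits.size(-1)),
        suffix_targets.reshape(-1),
    )

# Toy usage: batch of 2, 8 text tokens, vocab of 100, first 3 tokens are the prefix.
loss = prefix_lm_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)), prefix_len=3)
print(loss.item())
```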

 

image-20240525175243330

 

 

 

◽️ X-VLM: Multi-Grained Vision Language Pre-Training ICML22

  • Problem: methods that feed detector-extracted object features as input can have trouble capturing relationships between objects, while detector-free coarse-grained methods have trouble learning dense alignment.
  • Goal: VLM to learn multi-grained alignment (= object + image level)
  • Approach: (1) re-formulate the data (2) Arch: an image encoder, a text encoder, and a cross-modal encoder.
  • Loss: (a) detection loss (bounding box prediction) / (b) contrastive loss / (c) matching prediction (hard negative texts are sampled within the mini-batch and trained to receive low matching probability; see the sketch after this list) / (d) masked language modeling.
  • Dataset: 4M pairs or 16M pairs
  • Arch: image encoder: Vision Transformer (Swin Transformer-B) / text encoder (6 transformer layers) / cross-modal encoder (6 transformer layers). 215M parameters in total / 8 A100 GPUs, 4M pairs → 3.5 days.
  • Comparing the numbers, VinVL and LEMON are clearly strong at captioning relative to the amount of data they use; X-VLM is weak there.
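
A minimal sketch of the in-batch hard-negative sampling used for the matching loss (the `sample_hard_negatives` helper and the toy shapes are mine, only in the spirit of the recipe described above): negatives are drawn in proportion to their contrastive similarity, so harder negatives are picked more often.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim_i2t):
    """For each image, sample one hard negative text from the mini-batch.

    sim_i2t: (B, B) image-to-text similarity matrix from the contrastive heads.
    Diagonal entries are the positives, so they are zeroed before sampling;
    higher-similarity negatives are then sampled more often (hard negatives).
    """
    weights = F.softmax(sim_i2t, dim=1).clone()
    weights.fill_diagonal_(0)                      # never pick the positive pair
    return torch.multinomial(weights, num_samples=1).squeeze(1)   # (B,) negative text indices

# Toy usage: batch of 4. The matching head would then be trained to output a high
# probability for (image_i, text_i) and a low probability for (image_i, text_neg_idx[i]).
sim = torch.randn(4, 4)
neg_idx = sample_hard_negatives(sim)
print(neg_idx)
```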

image-20240525183743904

 

◽️ OFA Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework ICML22

  • Problem: to transfer to downstream tasks, previous works require extra learnable parts (e.g., adapters) and task-specific formulations (loss, finetuning framework).
  • OFA (One For All): We formulate both pretraining and finetuning tasks in a unified sequence-to-sequence abstraction via handcrafted instructions (the many datasets in the upper-left table are converted into the instructions in the lower left, and the whole model is then trained; see the sketch after this list). (1) No learnable task- or modality-specific components are added. (2) Information from different modalities can be represented within a globally shared multimodal vocabulary across all tasks.
  • Method: the first 3 conv blocks of a ResNet → image quantization → MAE-style training + detection training + BART-style training
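
To make the "unified seq2seq via handcrafted instructions" idea concrete, here is a minimal sketch with hypothetical templates (not OFA's exact wording) of how heterogeneous tasks collapse into plain (instruction, target) text pairs consumed by one encoder-decoder:

```python
# A minimal sketch (hypothetical templates and field names) of flattening
# different tasks into a single instruction -> target text format.
def to_seq2seq(task, sample):
    if task == "caption":
        return "What does the image describe?", sample["caption"]
    if task == "vqa":
        return sample["question"], sample["answer"]
    if task == "grounding":
        # Regions are expressed as discrete location tokens in the shared vocabulary.
        return f'Which region does the text "{sample["phrase"]}" describe?', sample["region_tokens"]
    if task == "detection":
        return "What are the objects in the image?", sample["objects"]
    raise ValueError(f"unknown task: {task}")

print(to_seq2seq("vqa", {"question": "What color is the car?", "answer": "red"}))
```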

image-20240525213743413

 

◽️ GIT A Generative Image-to-text Transformer for Vision and Language arXiv22

  • No particularly new methodology, and the paper is messy; it mainly shows that good performance comes from scaling up (1) data size, (2) model size (image and text), and (3) image resolution.
  • ๋น„๊ต ๋ฐ ์ฝ”๋“œ ์ฐธ๊ณ ์—์„œ ์ด ๋…ผ๋ฌธ์€ ๊ณ ๋ คํ•˜์ง€ ์•Š๋Š”๊ฒŒ ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.
  • Arch: similar to OFA, but the image encoder is initialized from CLIP / Florence weights, and the text decoder is randomly initialized rather than taken from BERT. All weights are trained.
  • Method: uses only the LM loss (no MLM); they report this was empirically better.

image-20240526132133738

 

◽️ X-model: Beyond a Pre-Trained Object Detector for Image Captioning CVPR22

image-20241028130541258

 

◽️ mPLUG Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections EMNLP22.

  • As shown in the figure, running time is reduced by using a cross-modal skip-connected network, and no detector is used.
  • Arch: the vision encoder is a CLIP ViT, but all weights are trained.

image-20240526134139411

 

 

◽️ BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023. (code)

  • Contributions:
    • BLIP-2 effectively leverages both frozen pre-trained image models and language models via a lightweight Q-Former (see the sketch after this list).
    • SOTA on various vision-language tasks, including visual question answering, image captioning, and image-text retrieval.
    • The LLM's complex reasoning becomes usable (e.g., an image of an Audi car + the LLM's external knowledge about Audi's history).
    • 54x fewer trainable parameters and 8.7% better performance on zero-shot VQAv2 than Flamingo.
  • Training cost: Due to the use of frozen models, our pre-training is more computationally friendly than existing large-scale VLP methods. We pre-train for 250k steps in the first stage and 80k steps in the second stage. We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage.
  • ์•„๋ž˜ ์ด๋ฏธ์ง€์—์„œ tranable parameters๊ฐ€ 1B(๋…ผ๋ฌธ์—์„œ table3)์ธ ์ด์œ ๋Š” Update Q-Former + the image encoder ํ•˜๊ธฐ ๋•Œ๋ฌธ์—. Q-Former๋งŒ ๋ณด๋ฉด 188M (table1), 108M(table2) ์ •๋„ ์ธ๋“ฏ ํ•˜๋‹ค.

image-20240502182741169

 

◽️ LLaVA: Large Language and Vision Assistant. NeurIPS, 2023.

  • Good references: youtube
  • Contributions
    • Multimodal Instruct Data. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
    • We introduce LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
    • Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities.
  • Training
    • Pretraining: CC3M filtered down to 595K image-text pairs; only the adapter is trained (see the sketch after this list).
    • Fine-tuning: the adapter and the LLM are trained on 158K language-image instruction-following samples.
    • LLaVA 1.0 GitHub: unlike version 1.5, it mentions that memory can be reduced by changing the batch size and other settings. It also mentions LLaVA-Lightning, which is worth checking.
  • Dataset
    • Pretraining: Github, liuhaotian LLaVA-CC3M-Pretrain-595K
    • Fine-tuning: LLaVa-Instruct-150K, the part of the dataset highlighted in yellow below. The part highlighted in blue seems to be included just for reference.
  • LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (released January 2024; training code and dataset not released)
  • LLaVA-NeXT: LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild (released May 2024; training code and dataset not released)
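
A minimal sketch of the two-stage recipe above, using stand-in `nn.Linear` modules for the CLIP vision tower, the projection adapter, and the LLM (all names and dimensions are made up): stage 1 trains only the adapter, stage 2 unfreezes the LLM as well, and the vision encoder stays frozen throughout.

```python
import torch
import torch.nn as nn

# Stand-ins only; the real model uses a CLIP ViT-L/14 vision tower and a Vicuna LLM.
vision_encoder = nn.Linear(1024, 1024)   # frozen CLIP vision tower (stand-in)
adapter = nn.Linear(1024, 4096)          # projection from vision features to LLM embeddings
llm = nn.Linear(4096, 4096)              # language model (stand-in)

def set_stage(stage):
    for p in vision_encoder.parameters():
        p.requires_grad = False                            # frozen in both stages
    for p in adapter.parameters():
        p.requires_grad = True                             # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == "instruction_tuning")  # only unfrozen in stage 2

set_stage("pretraining")          # stage 1: adapter only
set_stage("instruction_tuning")   # stage 2: adapter + LLM
visual_tokens = adapter(vision_encoder(torch.randn(1, 576, 1024)))  # prepended to text tokens
print(visual_tokens.shape)
```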

image-20240607215252932

 

 

◽️ LLaVA 1.5: Improved Baselines with Visual Instruction Tuning

 

◽️ mPLUG-Owl2 Revolutionizing Multi-modal Large Language Model with Modality Collaboration CVPR24

image-20241028132314012

 

 

 

 

 

๐Ÿ’™ Image captioning

◽️ Tag2Text: Guiding Vision-Language Model via Image Tagging. ICLR, 2024. code

  • Tag: Detector-free VLM, Diverse attributes (=tags)
  • Problem of detector-based VLM (VIVO, X-VLM): Heavy (frozen) Faster RCNN
  • Problem of detector-free VLM: valuable tags are discarded (= the objects and attributes in an image; 3,429 categories).
  • Method: Tagging head
    • (1) supervised by annotation-free image tags (large image-text pair datasets are used; no grounding (bounding box) annotation data is used).
    • (2) only a small network needs to be attached after the image encoder (a minimal sketch of such a head follows this list).
  • Results:
    • Their tagging model outperforms CLIP, BLIP-2.
    • Generation-based (image captioning): Text description generation is based on the image features and also assigned tags.
  • 12-layer transformer from BERT_{B}, 2-layer transformer for the tag head, 8 A100 GPUs, batch size 960, 20 epochs
  • Memo๐Ÿค”: Domain Adaptive Semantic Segmentation Using Weak Labels / why is the training cost so high? / Is the performance really the best? They did compare with BLIP-2. I want to look at the captioning-related model architecture and code in detail.
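
A minimal sketch of a tagging head in the spirit of the paper (the real head is a 2-layer transformer; the pooled linear classifier, shapes, and `TaggingHead` name here are made up): a small network on top of image features predicts the 3,429 tag categories as a multi-label problem, supervised by tags parsed from the paired text.

```python
import torch
import torch.nn as nn

class TaggingHead(nn.Module):
    """Multi-label tagging head sketch: predicts which of the 3,429 tag categories
    appear in the image, given image-encoder features."""

    def __init__(self, feat_dim=768, num_tags=3429):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_tags)

    def forward(self, image_feats):                    # (B, num_patches, feat_dim)
        pooled = image_feats.mean(dim=1)               # simple mean pooling over patches
        return self.classifier(pooled)                 # (B, num_tags) logits

head = TaggingHead()
logits = head(torch.randn(2, 196, 768))
# Tags parsed from the paired text serve as annotation-free multi-label targets (BCE loss).
targets = torch.zeros(2, 3429)
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())
```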

image-20240502181608853

 

 

◽️ SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. CVPR, 2023. (code-s73)

  • Tags: lightweight training, retrieved captions from a datastore, training-free domain transfer
  • Training cost grows as data and model size grow. As an alternative to large models, they propose SmallCap, which uses related captions retrieved from a datastore. Trainable parameters: cross-attention layers (7M) between a frozen CLIP encoder and a GPT-2 decoder.
  • ์ž‘์€ ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ •๋ณด๋ฅผ ์ €์žฅํ•˜๋Š”๋ฐ์‹ ์—, retrieval์„ ์ ์šฉํ•จ์œผ๋กœ์จ ์„ฑ๋Šฅ์„ ํ•œ์ธต ๋Œ์–ด์˜ฌ๋ฆฐ๋‹ค. (?) (Figure 6)
  • Replacing only the datastore requires no retraining. The small plot in Figure 1 appears to show existing models' OOD performance vs. SmallCap with only the datastore swapped.
  • Eval dataset: COCO, nocaps (rarely-seen and unseen visual concepts), VizWiz (images taken by visually impaired users)
  • Details: (1) Training takes up to 8 hours on a single NVIDIA A100 GPU using 16 GB / (2) a batch size of 64 / (3) k = 4 captions retrieved from a datastore / (4) retrieval is based on CLIP ResNet-50x64 representations of the input images and of the captions in the datastore (see the retrieval sketch after this list) / (5) the latter are precomputed offline and indexed with FAISS for efficient nearest-neighbor search / (6) beam size = 3.
  • Method: (1) the number of trainable parameters is further controlled through the dimensionality of the projection matrices in the cross-attention layers (768 dim = 12 heads x 64 dim) (2) see the figure below
  • Memo๐Ÿค”: ๋ชจ๋ธ ๊ตฌ์กฐ ๋ฐ ์ฝ”๋“œ๋ฅผ ๊ตฌ์ฒด์ ์œผ๋กœ ๋ณด๊ณ  ์‹ถ๋‹ค.

image-20240503213208938

 

 

◽️ CaMEL: Mean Teacher Learning for Image Captioning. ICPR, 2022. code-s26

  • Tags: lightweight training, distillation to an EMA (mean-teacher) model; a minimal sketch of the EMA update follows.
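
A minimal sketch of the mean-teacher part (stand-in `nn.Linear` models; in CaMEL both networks are full captioners and the online model is additionally distilled toward the EMA model's predictions): the teacher's weights are an exponential moving average of the student's.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher update: the teacher is an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy usage with stand-in models.
student = nn.Linear(8, 8)
teacher = nn.Linear(8, 8)
teacher.load_state_dict(student.state_dict())   # start from the same weights
ema_update(teacher, student)
```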

image-20240504173113422

 

 

◽️ Retrieval-augmented image captioning. ACL 2023. code-s12

  • Tags: lightweight training, retrieved captions from a datastore
  • Additional paper: Retrieval-augmented transformer for image captioning. CBMI 22. 32. (by the SmallCap authors)

image-20240504180128893

 

 

◽️ ClipCap: CLIP Prefix for Image Captioning. arXiv 2021. code-s1.2k

  • Tags: CLIP์„ ์‚ฌ์šฉํ•ด์„œ captioning์„ ์‚ฌ์šฉํ•œ ์ฒซ(?) ๋…ผ๋ฌธ / lightweight training, frozen vision encoder and language decode
  • Challenges of captioning: (1) semantic understanding (a man gives her a gift) (2) the large number of possible ways to describe an image (3) resource hungry (training time, parameters, massive data)
  • Method: (1) use a frozen CLIP encoder + GPT-2 decoder (optionally fine-tuning GPT-2 via prefix prompt learning) (2) the method produces a fixed-size embedding sequence (3) only the mapping network (Transformer or MLP layers) is trained (see the sketch after this list)
  • Nvidia GTX1080 GPU for 80 hours
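
A minimal sketch of the MLP variant of the mapping network (the `ClipCapMLP` name, hidden size, and prefix length are illustrative; the paper also has a Transformer mapper): a frozen CLIP image embedding is expanded into a fixed-length sequence of prefix embeddings in GPT-2's embedding space, which is prepended to the caption tokens.

```python
import torch
import torch.nn as nn

class ClipCapMLP(nn.Module):
    """Map a frozen CLIP image embedding to a fixed-length prefix in GPT-2's
    embedding space; only this mapper (and optionally GPT-2) is trained.
    Sizes follow the common CLIP ViT-B/32 (512-d) and GPT-2 (768-d) setup."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):                                  # (B, clip_dim)
        return self.mlp(clip_embed).view(-1, self.prefix_len, self.gpt_dim)

print(ClipCapMLP()(torch.randn(2, 512)).shape)                      # torch.Size([2, 10, 768])
```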

image-20240504011442639

 

 

◽️ Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. ICCV, 2023. 6. code-s131

  • Tags: text-only training, object hallucination in image captioning
  • Motivation: they raise the problem of hallucination caused by the modality bias induced by LLMs (objects frequently seen during training). (Experiments use a COCO → NoCaps-style setting / no comparison with recent models such as BLIP.)
  • Method: entity-aware decoding (apparently referring to the CLIP-based classifier in the figure below) to improve the transferability of zero-shot captioning (see the sketch below).
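
A minimal sketch of what a CLIP-based entity classifier could look like (pure NumPy with random stand-in embeddings; the `rank_entities` helper is mine, not the paper's code): candidate entity words are ranked by cosine similarity to the image embedding, so decoding can be steered toward entities that are actually visible, reducing object hallucination.

```python
import numpy as np

def rank_entities(image_emb, entity_names, entity_embs):
    """Rank candidate entity words by cosine similarity to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    entity_embs = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    scores = entity_embs @ image_emb
    order = np.argsort(-scores)
    return [(entity_names[i], float(scores[i])) for i in order]

# Toy usage with random stand-ins for CLIP image/text embeddings.
names = ["dog", "cat", "bicycle"]
print(rank_entities(np.random.randn(512), names, np.random.randn(3, 512)))
```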

image-20240505001852784

 

 

◽️ With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023. 2.

  • Tags: retrieval-augmented key-value in attention heads.
  • FAISS to retrieve memories.
  • Motivated by [10] Meshed-Memory Transformer for Image Captioning. CVPR 2020.
  • ๋””ํ…Œ์ผํ•˜๊ฒŒ ์ฝ๊ธฐ์—๋Š”.. ์ฝ”๋“œ ๊ณต๊ฐœ๋„ ์•ˆํ•˜๊ณ , citation๋„ ์ ๋‹ค. CaMEL๊ณผ ์„ฑ๋Šฅ์ฐจ์ด๋„ ํฌ์ง€ ์•Š์•„๋ณด์ธ๋‹ค.

image-20240505012153290

 

 

◽️ LocCa: Visual Pretraining with Location-aware Captioners. Google. ECCV 2024 submitted.

  • Tags: Image captioning with localization, VL model pretraining (summary in Twitter)
  • To obtain fine-grained object locations, they use a publicly available OWL-ViT-CLIP-L/14 model (detection model).
  • Dataset? Training cost? (24 + 12 transformer blocks for the encoder and decoder, respectively) Cross-attention? Frozen model? (No — not a CLIP encoder; they use their own model. No details on the cross-attention.)

image-20240505010112447

 

 

◽️ FlexCap: Generating Rich, Localized, and Flexible Captions in Images. DeepMind & CMU, ICLR, 2024 submitted.

  • Tags: Region-specific descriptions of varying lengths (dense caption), New dataset, and task
  • New dataset: (1) image region descriptions of varying length (see Figure 1); (2) the dataset contains images + bounding boxes + captions of several different lengths.
  • ๋…ผ๋ฌธ ์ œ๋ชฉ์—์„œ๋„ ๋А๊ปด์ง€๋Š” ์ƒˆ๋กœ์šด ํ…Œ์Šคํฌ์˜ ์‹ ๊ธฐํ•จ์ด ์žˆ๋‹ค, ํ•˜์ง€๋งŒ ์ƒˆ๋กœ์šด method, ๊ธฐ์กด ์›๋“ค์˜ ๋ฌธ์ œ์ ๋“ค, ์ƒˆ๋กœ์šด Insight ๋ฐ ๋ฐฐ์›€๋“ค์˜ ์ œ๊ณต์ด ๋ถ€์กฑํ–ˆ๋˜ ๊ฒƒ์ด ํƒˆ๋ฝ์˜ ๋งค์ธ ์ด์œ ์ธ ๊ฒƒ ๊ฐ™๋‹ค (meta review - weaknesses). ํ•˜๊ธด ์š”์ฆ˜ ๋ณด๋ฉด observation, interesting insights ์ž˜ ๋‚˜์—ดํ•˜๋Š” ๋…ผ๋ฌธ์ด ์ธ๊ธฐ๊ฐ€ ์žˆ์–ด๋ณด์ธ๋‹ค. ๋”ฐ๋ผ์„œ ์‹คํ—˜ํ•˜๋ฉด์„œ ์„ฑ๋Šฅ์ด ์ข‹๋˜ ์•ˆ ์ข‹๋˜ ๊ทธ ์‚ฌ์ด์‚ฌ์ด์—์„œ ๋А๋ผ๋Š” ๋А๋‚€์ ๋“ค์„ ์ž˜ ์ •๋ฆฌํ•˜๊ณ  ํ‘œํ˜„(writing)ํ•˜๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•œ ๊ฒƒ ๊ฐ™๋‹ค.

image-20240505004619691