💙 Vision-centric Improvement / Region-based VLMs / Hallucination

◽️ Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

  • Structured differently from standard papers like MAE or ResNet. I like that it simply tells the story the authors want to tell. Saining, as always.
  • Motive 1: CLIP's image encoder is poor at understanding images. Image features extracted with CLIP do not carry an understanding of the whole image. For example, CLIP does not know whether a butterfly has legs, or whether a car's door is open or closed. So the performance of VLMs that use CLIP is also poor.
  • Section 2: DINO is better than CLIP at "recognizing different images as different." Conversely, CLIP "embeds an image of a car with a closed door and one with an open door into nearly the same feature space." // Find pairs whose similarity is below 0.6 under DINOv2 but above 0.95 under CLIP (sketched below this list) → human-annotate 150 such pairs and build VQA questions → evaluate SOTA MLLMs (multimodal LLMs) → conclusion: current MLLMs struggle with visual details.
  • Section 3: Have GPT-4 find the visual patterns that MLLMs struggle to distinguish → evaluate CLIP-based models per visual pattern (evaluation method in Figure 5, results in Table 1) → whatever CLIP fails at, LLaVA and InstructBLIP also fail at.
  • Section 4: Run a VLM with DINO and CLIP together. But running it naively does not work. // 4.2 Additive MoF (second panel of Figure 7), performance in Table 2: it improves on the benchmark above but drops LLaVA's base performance. → Interleaving the visual tokens as in the third panel of Figure 7 improves performance (a sketch of the interleaving follows below the figure).
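
A minimal sketch of the Section 2 pair mining, assuming the image features were already extracted with both encoders and L2-normalized; the 0.95 / 0.6 thresholds follow the description above, and the function name is my own:

import torch

# clip_feats, dino_feats: (N, D) L2-normalized image features from CLIP and DINOv2.
def find_clip_blind_pairs(clip_feats, dino_feats, clip_thr=0.95, dino_thr=0.6):
    clip_sim = clip_feats @ clip_feats.T            # cosine similarity matrices
    dino_sim = dino_feats @ dino_feats.T
    # pairs that CLIP sees as near-identical but DINOv2 clearly separates
    mask = (clip_sim > clip_thr) & (dino_sim < dino_thr)
    mask = torch.triu(mask, diagonal=1)             # drop self-pairs and duplicates
    return mask.nonzero()                           # (num_pairs, 2) image-index pairs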

image-20240119161111082
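
And a rough sketch of what the interleaved MoF (third panel of Figure 7) amounts to, under my reading, assuming both token streams were already projected to the LLM embedding width by their own adapters:

import torch

# clip_tokens, dino_tokens: (B, N, D) visual tokens from the two encoders,
# each already passed through its own projection adapter.
def interleave_mof(clip_tokens, dino_tokens):
    B, N, D = clip_tokens.shape
    mixed = clip_tokens.new_empty(B, 2 * N, D)
    mixed[:, 0::2] = clip_tokens      # even positions: CLIP tokens
    mixed[:, 1::2] = dino_tokens      # odd positions: DINOv2 tokens
    return mixed                      # replaces the CLIP-only tokens fed to the LLM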

 

 

◽️ GLaMM: Pixel Grounding Large Multimodal Model. arXiv. 2024.

  • Prior LLM papers have limitations: 1) they produce only text output, 2) they cannot do grounding (text-based masking), 3) they can ground only a single object (LISA), or 4) they cannot hold a conversation. GroundingLMM (GLaMM) is proposed as a more practical technique.
  • The task such a model can perform is grounded conversation generation, which Figure 1 illustrates.
  • Datasets for this task are introduced: 1) the Grounding-anything dataset, generated with an automated pipeline, and 2) existing CV datasets converted into conversations.
  • The method is shown in Figure 2. Check the full paper for the method details and dataset specifics when needed.
  • The full pretraining and finetuning reportedly used 8 NVIDIA A100-40GB GPUs.

 

💙 Dense, Long, Detailed caption / Caption evaluation

◽️ DCI (Densely Captioned Images): A Picture is Worth More Than 77 Text Tokens. Evaluating CLIP-Style Models on Dense Captions. Meta, CVPR 24

  • There is no trustworthy evaluation dataset, so the paper introduces one called DCI and explains how to evaluate VLMs with it. First, negative pair matching (wrong captions should be far away); second, subcrop-caption matching (matching performance across multiple regions of one image; a sketch of this test appears after this list).
  • DCI provides long human-annotated captions, LLM summaries (within 77 tokens), and LLM negatives. DAC (densely aligned captions) claimed that machine-generated dense captions give good performance; DCI shows that using human annotators is even better.
  • GitHub link. (1) For SAM, downloading a single tar directly is enough. (2) Download the GT as instructed; the values, including the summaries, are stored under complete. (3) Make full use of the DenseCaptionedDataset file that others have already built.
  • Data generation: (1) find points with Canny edges, (2) get sub-masks for those points from SAM, (3) pay human annotators to caption the full image and each sub-mask.
  • Building summaries: ask LLaMA-2-70B to summarize (code: gen_summaries.py). Since a machine did it, there can be noise, but the authors argue that because negative samples were also generated, it is fine for CLIP training. (A token-budget check is sketched after the image below.)
    • As in the fourth row of the image table below, summarized captions exist for the sub-masks. (But such a caption may also describe the sub-region's relation to the full image, in which case it may not be a good dataset.)
  • Performance when LoRA-finetuning CLIP on summary-DCI (8K) and evaluating on the summary-DCI test set: training CLIP together with a negative loss lifts performance a lot even with only 8,000 images. (However, using the 3M machine-generated captions from the DAC paper gives the best performance.)

image-20240726180921476
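
On the 77-token budget the summaries must respect (the actual LLaMA-2-70B call lives in gen_summaries.py): a hypothetical check using the Hugging Face CLIP tokenizer, where the 77 includes the BOS/EOS tokens:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def fits_clip_context(text, budget=77):
    # CLIP's text encoder truncates past 77 tokens, so a usable summary must fit.
    return len(tokenizer(text).input_ids) <= budget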

 

◽️ Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions. Apple, arXiv 24.

  • To improve CLIP performance, (1) filtering [17, 22, 53] and (2) caption regeneration [14, 16, 35, 45] have been proposed.
    • Improving CLIP Training with Language Rewrites. NeurIPS 2023: language-only rewriting.
    • VeCLIP: Improving CLIP Training via Visual-enriched Captions: asks LLaVA "Describe the image concisely, less than 20 words" and trains CLIP on the resulting captions.
  • Each image comes with multiple short captions (on average 30 words = 35 tokens per caption). These look useful; details are in the table below.
  • How the short captions were produced separately is nowhere to be found. Generating the detailed captions is well documented, down to the few-shot prompts... but there is no information about the short ones. (Maybe LLaVA-1.6 is better than 1.5?)
  • There is also a part proposing how to teach the model the composition (link order) and relations (link descriptions) within an image; look it up and read it later.

image-20240727163410356

from datasets import load_dataset

# Load the GBC1M dataset from the Hugging Face Hub.
ds = load_dataset("graph-based-captions/GBC1M", cache_dir=".")

# Word counts of the first 100 short captions.
lengths = [len(ds['train'][i]['short_caption'].split(' ')) for i in range(100)]
print(lengths)

# 1. Lengths are all over the place. It may be worth selecting/filtering a subset
#    by quality and length (a filtering sketch follows below).
# 2. There is exactly one main caption per image.
# 3. A vertex (region) may or may not have a 'short' description of a single
#    object; but when 'short' is missing, the 'detail' description is short enough.
for i in range(3):
    print(ds['train'][0]['vertices'][i]['descs'][0]['label'])      # -> 'detail'
for i in range(3):
    try:
        print(ds['train'][0]['vertices'][i]['descs'][1]['label'])  # -> 'short'
    except IndexError:
        print("no 'short' description for this vertex")
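
Following comment 1 in the block above, a small filtering sketch; the 6-to-40-word band is my own guess, not from the paper:

# Keep only samples whose short caption falls in a reasonable word-count band.
def keep(sample, lo=6, hi=40):
    n = len(sample['short_caption'].split())
    return lo <= n <= hi

filtered = ds['train'].filter(keep)   # datasets.Dataset.filter
print(len(filtered), "of", len(ds['train']), "samples kept")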

 

◽️ PixelProse: From Pixels to Prose, A Large Dataset of Dense Image Captions. arXiv 24

  • Provides a dataset of 12M images captioned with the Google Gemini 1.0 Pro Vision model.
  • As the image below shows, the captions average more than 100 words each. (Too long.)

image-20240727163906260

 

◽️ ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. ECCV 24

  • This is the dataset Long-CLIP used.
  • The captions are too long (almost 180 words per caption), and there are too many \n\n breaks inside them.
  • Since GPT-4 Vision was used, it looks the most accurate, and the results carry rich object information. Minimizing this well and using it as caption data looks best. (Of course, there is already plenty worth trying... this may require tedious work; a cleanup sketch follows below.)
  • Explanation of the process in the bottom-right figure:
    • Extract descriptions for 100K images with GPT-4 Vision, and use them to build an in-house model called ShareCaptioner.
    • Use ShareCaptioner to generate captions for 1.2M images.
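
The cleanup I have in mind (my own sketch, not something from the paper): collapse the \n\n breaks and cut to a word budget so the ~180-word captions become usable as CLIP caption data:

import re

def shorten_caption(caption, max_words=60):
    text = re.sub(r'\s*\n+\s*', ' ', caption).strip()   # remove the \n\n breaks
    return ' '.join(text.split()[:max_words])           # crude length cap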

image-20240727164337637

 

◽️ Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models. NeurIPS 2023 Spotlight

  • Limitation of existing CLIP models: they operate only as bags of nouns, hence lack compositional reasoning, i.e., poor understanding of non-object notions, object attributes, states, and relations.
  • Cause 1: web-crawled caption quality is garbage. Cause 2: many captions describe only part of the image, even though an image contains many objects and relations.
  • Remedies: (1) generate captions with BLIP-2. (2) LLM expander: "imagine what might be in an image with {caption}". (3) SAM expander: {mask-cropped image} → BLIP-2 to produce multiple captions. (4) Make full use of a negative loss (borrowing the negative-caption construction from the SVLC paper).
  • Methods (2) and (3) are admittedly absurd and will produce a lot of noise. Still, from the viewpoint that captions auto-generated for this image are closer to it than captions generated from other images, the paper proposes Loss_{multiple instance learning} (sketched below).
  • The remaining losses are loss_negative and loss_contrastive (as in CLIP).
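
A rough sketch of how I read Loss_{multiple instance learning}: each image carries a bag of K noisy auto-generated captions, and only the best-matching caption in the bag has to beat the other images' bags. The names and the max-over-bag choice are my interpretation, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def mil_loss(img_feats, bag_txt_feats, temperature=0.07):
    # img_feats: (B, D) normalized; bag_txt_feats: (B, K, D) normalized, K captions per image.
    sim = torch.einsum('bd,ckd->bck', img_feats, bag_txt_feats) / temperature
    best = sim.max(dim=-1).values               # (B, B): best caption in each image's bag
    labels = torch.arange(img_feats.size(0))
    return F.cross_entropy(best, labels)        # image b should pick its own bag b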

image-20240727012145377

 

 

◽️ ARO: When and why vision language models behave like bags-of-words, and what to do about it? ICLR 2023 Oral

  • Introduces the ARO benchmark: the Visual Genome dataset carries object, attribute, and relation annotations; COCO has many objects plus a list of which objects are present. The evaluation dataset is built from this metadata inside the existing datasets (VG, COCO) by permuting it.
  • Existing models drop a lot on the ARO benchmark; that is, CLIP and BLIP lack compositional understanding ("to the right of" vs. "behind").
  • Why has this fact been overlooked? Retrieval is the representative task, and there the model has no need for compositional understanding; it is a task you can solve with bags-of-words.
  • Moreover, the CLIP training procedure itself trains the model in a way that never requires compositional understanding.
  • To mitigate this, composition-aware hard negatives are introduced: (1) nearest neighboring images within the batch, and (2) negative captions (with the object, attribute, or relation information slightly altered; a toy example follows below).
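
A toy example of an order-perturbed negative caption: the bag of words is unchanged, so only a model that actually reads composition can prefer the original:

import random

def shuffled_negative(caption, seed=0):
    words = caption.split()
    random.Random(seed).shuffle(words)
    return ' '.join(words)

print(shuffled_negative("the horse is eating the grass"))
# e.g. 'grass the eating is horse the': same words, broken composition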

image-20240731204930613

◽️ VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. EMNLP 2022.

  • CLIP ๋ชจ๋ธ์„ classification๊ณผ ๊ฐ™์€ downstream task์—์„œ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ ์ข‹์€ ํ•ด์„์ด ์•„๋‹ˆ๋‹ค.
  • image-text matching ๋Šฅ๋ ฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, CLIP ๋ชจ๋ธ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ์ œ์•ˆํ•˜๋‹ค.
  • Nagative sampling generation ์ด ํฌ์ธํŠธ: Visual Genome ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” object, attribute, relation ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ด์„œ, embeding vector์˜ cos-similarity ๊ฐ€ 0.5 ์ด์ƒ์ธ ๋‹จ์–ด๋“ค๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋งŒ๋“ ๋‹ค.
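
A hypothetical sketch of that swap, assuming wv is a pretrained word-embedding lookup (e.g. gensim KeyedVectors); the 0.5 threshold is from the paper, everything else is mine:

def make_negative(caption, target_word, wv, thr=0.5):
    # Swap target_word for a different word whose embedding cosine similarity is
    # at least thr, so the negative caption stays plausible but wrong.
    for cand, sim in wv.most_similar(target_word, topn=50):
        if sim >= thr and cand.lower() != target_word.lower():
            return caption.replace(target_word, cand)
    return None   # no sufficiently similar word found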

image-20240801120246492

  • (According to DAC) Both the ARO and VL-CheckList evaluations come with positive and negative captions pre-built for images from datasets like Visual Genome; you check whether your current CLIP model can tell them apart. The negative captions are ones in which an object, attribute, or relation has been slightly changed.

 

◽️ (DSG) Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR 2024.

  • Uses gpt-3.5-turbo to generate the questions and GPT-4V to run the VQA and extract the score. All the code needed is here.
  • Paper summary
    • The prior method (TIFA) does not consider the dependencies between questions (it treats 'is there a motorcycle' and 'is the motorcycle blue' as completely independent questions).
    • To fix this, the paper proposes a method that accounts for the dependencies between questions (it all comes down to prompt-tuning an LLM to do it). The question-generation process is in Figure 4 below.
    • The VQA step... also just uses existing models.
    • The paper lists several analyses of whether the generated questions resemble human-written ones and how effective the VQA is. (Did not read that part; pass.)
  • What is the main logic? What do the roots starting from objects buy us?
    • If an entity is absent, all subsequent questions are marked false; no extra questions are asked. (See the scoring sketch after this list.)
  • What exactly is the LLM-based parsing method?
    • First, semantic categories are specified as in Figure 3 below.
    • PaLM-2-340B is used, and the details on the preamble engineering are in Appendix A.
  • How do they evaluate their own metric?
    • On 30 samples, they check precision and recall against human-written tuples and questions.
    • There is also something called dependencies valid, which measures whether the links between tuples are correct; its accuracy is reportedly 100%.
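
A minimal sketch of the dependency-aware scoring described above, assuming each question is a (qid, parent_ids, text) tuple given in topological order and vqa(image, text) -> bool wraps the VQA model; these names are my own:

def dsg_score(image, questions, vqa):
    answers = {}
    for qid, parents, text in questions:
        if all(answers[p] for p in parents):   # all parent entities confirmed
            answers[qid] = vqa(image, text)
        else:
            answers[qid] = False               # parent failed: auto-false, never asked
    return sum(answers.values()) / len(answers)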

image-20240801233723462

image-20240828142450857

 

◽️ Prometheus-Vision. arXiv. 24

  • VML์˜ output์„ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ ์–ด๋ ต๋‹ค. (1) instruction, question์— ์ž˜ ๋”ฐ๋ž๋Š”์ง€๋„ ํ‰๊ฐ€ํ•ด์•ผํ•˜๊ณ , (2) ์ด๋ฏธ์ง€๋ž‘ ์ž˜ ์—ฐ๊ด€๋œ ๋‹ต๋ณ€์„ ํ–ˆ๋Š”์ง€๋„ ํ‰๊ฐ€ํ•ด์•ผํ•œ๋‹ค.
  • ํ•˜์ง€๋งŒ ๊ธฐ์กด SPICE, METEOR ์™€ ๊ฐ™์€ ์ง€ํ‘œ๋“ค์€ ๊ธด output์„ ํ‰๊ฐ€ํ•˜๋Š”๋ฐ ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.
  • ๊ธฐ์กด Open-source VLM์„ ๊ทธ๋Œ€๋กœ assessing์„ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ธฐ์—”, human, GPT-4๊ณผ ๋น„๊ตํ•ด ๋Šฅ๋ ฅ์ด ๋งŽ์ด ๋ถ€์กฑํ•˜๋‹ค.
  • ๋”ฐ๋ผ์„œ LLaMA-1.5๋ฅผ Finetuningํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์†Œ๊ฐœํ•˜๊ณ , ์ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ธ prometheus-vision ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๋‹ค.

image-20240802012314977

 

◽️ Semantic parsing

  1. Image Retrieval using Scene Graphs. CVPR 15
    • First to propose the scene graph, separated into objects, attributes, and relationships.
    • Performs retrieval based on the scene graph / the user has to supply the scene graph.
    • Releases 5,000 [scene graph - image] pairs.
  2. Stanford-scene-graph-parser: Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. EMNLP 2015
    • Proposes methods for generating scene graphs automatically (rule-based / classifier-based scene graph parsing).
    • Considers only one sentence at a time.
    • Parsing is hard; for example, pronouns: "a bed with a pillow on it." / plural nouns: "three men are wearing jeans", "three men are carrying a piano".
    • Rule-based parsing: nine dependency patterns that capture the constructions and phenomena. / Classifier-based parsing: using scene graph datasets, train a model that can extract all candidate objects, attributes, and relations.
  3. SPICE: Semantic Propositional Image Caption Evaluation. ECCV 16
    • Uses scene graphs to check caption quality.
    • First parses the dependencies between words using the paper above, then draws a tree using the full dependency information.
    • Uses the F1-score between the reference (GT) and candidate (generated) captions (see the sketch after this list).
    • In the code, everything runs in Java, and only the precision/recall information comes over to Python (e.g., how many tuples (objects, attributes, relations) overlap).
  4. Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. CVPR 19
    • Uses scene graph information to perform negative mining for CLIP-style training.
    • For semantic parsing, they wrote and released code following the rule-based parsing of paper 2 above. The two codebases do not play exactly the same role; the differences are as follows.
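
Returning to SPICE (item 3 above): a toy sketch of the tuple-matching F1, with exact matching only (the real implementation also soft-matches synonyms via WordNet):

def spice_f1(ref_tuples, cand_tuples):
    # ref_tuples / cand_tuples: sets of parsed (object, attribute, relation, ...) tuples.
    ref, cand = set(ref_tuples), set(cand_tuples)
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)          # tuples present in both scene graphs
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)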