💙 Vision-centric Improvement / Region-based VLMs / Hallucination

◽️ Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

  • Structured differently from standard papers like MAE or ResNet. I like that it simply tells the story the authors want to tell. Saining, as always.
  • Motive 1: CLIP's image encoder is poor at understanding images. Image features extracted with CLIP do not carry an understanding of the whole image. For example, CLIP does not know whether a butterfly has legs, or whether a car's door is open or closed. So the performance of VLMs that use CLIP is also poor.
  • Section 2: DINO is better than CLIP at "recognizing different images as different." Conversely, CLIP "embeds an image of a car with a closed door and one with an open door into nearly the same feature space." // Find pairs whose similarity is below 0.6 under DINOv2 but above 0.95 under CLIP (sketched below this list) → human-annotate 150 such pairs and build VQA questions → evaluate SOTA MLLMs (multimodal LLMs) → conclusion: current MLLMs struggle with visual details.
  • Section 3: Have GPT-4 find the visual patterns that MLLMs struggle to distinguish → evaluate CLIP-based models per visual pattern (evaluation method in Figure 5, results in Table 1) → whatever CLIP fails at, LLaVA and InstructBLIP also fail at.
  • Section 4: Run a VLM with DINO and CLIP together. But running it naively does not work. // 4.2 Additive MoF (second panel of Figure 7), performance in Table 2: it improves on the benchmark above but drops LLaVA's base performance. → Interleaving the visual tokens as in the third panel of Figure 7 improves performance (a sketch of the interleaving follows below the figure).
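
A minimal sketch of the Section 2 pair mining, assuming the image features were already extracted with both encoders and L2-normalized; the 0.95 / 0.6 thresholds follow the description above, and the function name is my own:

import torch

# clip_feats, dino_feats: (N, D) L2-normalized image features from CLIP and DINOv2.
def find_clip_blind_pairs(clip_feats, dino_feats, clip_thr=0.95, dino_thr=0.6):
    clip_sim = clip_feats @ clip_feats.T            # cosine similarity matrices
    dino_sim = dino_feats @ dino_feats.T
    # pairs that CLIP sees as near-identical but DINOv2 clearly separates
    mask = (clip_sim > clip_thr) & (dino_sim < dino_thr)
    mask = torch.triu(mask, diagonal=1)             # drop self-pairs and duplicates
    return mask.nonzero()                           # (num_pairs, 2) image-index pairs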

image-20240119161111082
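
And a rough sketch of what the interleaved MoF (third panel of Figure 7) amounts to, under my reading, assuming both token streams were already projected to the LLM embedding width by their own adapters:

import torch

# clip_tokens, dino_tokens: (B, N, D) visual tokens from the two encoders,
# each already passed through its own projection adapter.
def interleave_mof(clip_tokens, dino_tokens):
    B, N, D = clip_tokens.shape
    mixed = clip_tokens.new_empty(B, 2 * N, D)
    mixed[:, 0::2] = clip_tokens      # even positions: CLIP tokens
    mixed[:, 1::2] = dino_tokens      # odd positions: DINOv2 tokens
    return mixed                      # replaces the CLIP-only tokens fed to the LLM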

 

 

◽️ GLaMM: Pixel Grounding Large Multimodal Model. arXiv. 2024.

  • Prior LLM papers have limitations: 1) they produce only text output, 2) they cannot do grounding (text-based masking), 3) they can ground only a single object (LISA), or 4) they cannot hold a conversation. GroundingLMM (GLaMM) is proposed as a more practical technique.
  • The task such a model can perform is grounded conversation generation, which Figure 1 illustrates.
  • Datasets for this task are introduced: 1) the Grounding-anything dataset, generated with an automated pipeline, and 2) existing CV datasets converted into conversations.
  • The method is shown in Figure 2. Check the full paper for the method details and dataset specifics when needed.
  • The full pretraining and finetuning reportedly used 8 NVIDIA A100-40GB GPUs.

 

💙 Dense, Long, Detailed caption / Caption evaluation

◽️ DCI (Densely Captioned Images): A Picture is Worth More Than 77 Text Tokens. Evaluating CLIP-Style Models on Dense Captions. Meta, CVPR 24

  • There is no trustworthy evaluation dataset, so the paper introduces one called DCI and explains how to evaluate VLMs with it. First, negative pair matching (wrong captions should be far away); second, subcrop-caption matching (matching performance across multiple regions of one image; a sketch of this test appears after this list).
  • DCI provides long human-annotated captions, LLM summaries (within 77 tokens), and LLM negatives. DAC (densely aligned captions) claimed that machine-generated dense captions give good performance; DCI shows that using human annotators is even better.
  • GitHub link. (1) For SAM, downloading a single tar directly is enough. (2) Download the GT as instructed; the values, including the summaries, are stored under complete. (3) Make full use of the DenseCaptionedDataset file that others have already built.
  • Data generation: (1) find points with Canny edges, (2) get sub-masks for those points from SAM, (3) pay human annotators to caption the full image and each sub-mask.
  • Building summaries: ask LLaMA-2-70B to summarize (code: gen_summaries.py). Since a machine did it, there can be noise, but the authors argue that because negative samples were also generated, it is fine for CLIP training. (A token-budget check is sketched after the image below.)
    • As in the fourth row of the image table below, summarized captions exist for the sub-masks. (But such a caption may also describe the sub-region's relation to the full image, in which case it may not be a good dataset.)
  • Performance when LoRA-finetuning CLIP on summary-DCI (8K) and evaluating on the summary-DCI test set: training CLIP together with a negative loss lifts performance a lot even with only 8,000 images. (However, using the 3M machine-generated captions from the DAC paper gives the best performance.)

image-20240726180921476
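
On the 77-token budget the summaries must respect (the actual LLaMA-2-70B call lives in gen_summaries.py): a hypothetical check using the Hugging Face CLIP tokenizer, where the 77 includes the BOS/EOS tokens:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def fits_clip_context(text, budget=77):
    # CLIP's text encoder truncates past 77 tokens, so a usable summary must fit.
    return len(tokenizer(text).input_ids) <= budget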

 

◽️ Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions. Apple, arXiv 24.

  • To improve CLIP performance, (1) filtering [17, 22, 53] and (2) caption regeneration [14, 16, 35, 45] have been proposed.
    • Improving CLIP Training with Language Rewrites. NeurIPS 2023: language-only rewriting.
    • VeCLIP: Improving CLIP Training via Visual-enriched Captions: asks LLaVA "Describe the image concisely, less than 20 words" and trains CLIP on the resulting captions.
  • Each image comes with multiple short captions (on average 30 words = 35 tokens per caption). These look useful; details are in the table below.
  • How the short captions were produced separately is nowhere to be found. Generating the detailed captions is well documented, down to the few-shot prompts... but there is no information about the short ones. (Maybe LLaVA-1.6 is better than 1.5?)
  • There is also a part proposing how to teach the model the composition (link order) and relations (link descriptions) within an image; look it up and read it later.

image-20240727163410356

from datasets import load_dataset

# Load the GBC1M dataset from the Hugging Face Hub.
ds = load_dataset("graph-based-captions/GBC1M", cache_dir=".")

# Word counts of the first 100 short captions.
lengths = [len(ds['train'][i]['short_caption'].split(' ')) for i in range(100)]
print(lengths)

# 1. Lengths are all over the place. It may be worth selecting/filtering a subset
#    by quality and length (a filtering sketch follows below).
# 2. There is exactly one main caption per image.
# 3. A vertex (region) may or may not have a 'short' description of a single
#    object; but when 'short' is missing, the 'detail' description is short enough.
for i in range(3):
    print(ds['train'][0]['vertices'][i]['descs'][0]['label'])      # -> 'detail'
for i in range(3):
    try:
        print(ds['train'][0]['vertices'][i]['descs'][1]['label'])  # -> 'short'
    except IndexError:
        print("no 'short' description for this vertex")
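
Following comment 1 in the block above, a small filtering sketch; the 6-to-40-word band is my own guess, not from the paper:

# Keep only samples whose short caption falls in a reasonable word-count band.
def keep(sample, lo=6, hi=40):
    n = len(sample['short_caption'].split())
    return lo <= n <= hi

filtered = ds['train'].filter(keep)   # datasets.Dataset.filter
print(len(filtered), "of", len(ds['train']), "samples kept")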

 

◽️ PixelProse: From Pixels to Prose, A Large Dataset of Dense Image Captions. arXiv 24

  • Provides a dataset of 12M images captioned with the Google Gemini 1.0 Pro Vision model.
  • As the image below shows, the captions average more than 100 words each. (Too long.)

image-20240727163906260

 

◽️ ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. ECCV 24

  • This is the dataset Long-CLIP used.
  • The captions are too long (almost 180 words per caption), and there are too many \n\n breaks inside them.
  • Since GPT-4 Vision was used, it looks the most accurate, and the results carry rich object information. Minimizing this well and using it as caption data looks best. (Of course, there is already plenty worth trying... this may require tedious work; a cleanup sketch follows below.)
  • Explanation of the process in the bottom-right figure:
    • Extract descriptions for 100K images with GPT-4 Vision, and use them to build an in-house model called ShareCaptioner.
    • Use ShareCaptioner to generate captions for 1.2M images.
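
The cleanup I have in mind (my own sketch, not something from the paper): collapse the \n\n breaks and cut to a word budget so the ~180-word captions become usable as CLIP caption data:

import re

def shorten_caption(caption, max_words=60):
    text = re.sub(r'\s*\n+\s*', ' ', caption).strip()   # remove the \n\n breaks
    return ' '.join(text.split()[:max_words])           # crude length cap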

image-20240727164337637

 

◽️ Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models. NeurIPS 2023 Spotlight

  • Limitation of existing CLIP models: they operate only as bags of nouns, hence lack compositional reasoning, i.e., poor understanding of non-object notions, object attributes, states, and relations.
  • Cause 1: web-crawled caption quality is garbage. Cause 2: many captions describe only part of the image, even though an image contains many objects and relations.
  • Remedies: (1) generate captions with BLIP-2. (2) LLM expander: "imagine what might be in an image with {caption}". (3) SAM expander: {mask-cropped image} → BLIP-2 to produce multiple captions. (4) Make full use of a negative loss (borrowing the negative-caption construction from the SVLC paper).
  • Methods (2) and (3) are admittedly absurd and will produce a lot of noise. Still, from the viewpoint that captions auto-generated for this image are closer to it than captions generated from other images, the paper proposes Loss_{multiple instance learning} (sketched below).
  • The remaining losses are loss_negative and loss_contrastive (as in CLIP).
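
A rough sketch of how I read Loss_{multiple instance learning}: each image carries a bag of K noisy auto-generated captions, and only the best-matching caption in the bag has to beat the other images' bags. The names and the max-over-bag choice are my interpretation, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def mil_loss(img_feats, bag_txt_feats, temperature=0.07):
    # img_feats: (B, D) normalized; bag_txt_feats: (B, K, D) normalized, K captions per image.
    sim = torch.einsum('bd,ckd->bck', img_feats, bag_txt_feats) / temperature
    best = sim.max(dim=-1).values               # (B, B): best caption in each image's bag
    labels = torch.arange(img_feats.size(0))
    return F.cross_entropy(best, labels)        # image b should pick its own bag b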

image-20240727012145377

 

 

◽️ ARO: When and why vision language models behave like bags-of-words, and what to do about it? ICLR 2023 Oral

  • Introduces the ARO benchmark: the Visual Genome dataset carries object, attribute, and relation annotations; COCO has many objects plus a list of which objects are present. The evaluation dataset is built from this metadata inside the existing datasets (VG, COCO) by permuting it.
  • Existing models drop a lot on the ARO benchmark; that is, CLIP and BLIP lack compositional understanding ("to the right of" vs. "behind").
  • Why has this fact been overlooked? Retrieval is the representative task, and there the model has no need for compositional understanding; it is a task you can solve with bags-of-words.
  • Moreover, the CLIP training procedure itself trains the model in a way that never requires compositional understanding.
  • To mitigate this, composition-aware hard negatives are introduced: (1) nearest neighboring images within the batch, and (2) negative captions (with the object, attribute, or relation information slightly altered; a toy example follows below).
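
A toy example of an order-perturbed negative caption: the bag of words is unchanged, so only a model that actually reads composition can prefer the original:

import random

def shuffled_negative(caption, seed=0):
    words = caption.split()
    random.Random(seed).shuffle(words)
    return ' '.join(words)

print(shuffled_negative("the horse is eating the grass"))
# e.g. 'grass the eating is horse the': same words, broken composition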

image-20240731204930613

◽️ VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. EMNLP 2022.

  • CLIP ๋ชจ๋ธ์„ classification๊ณผ ๊ฐ™์€ downstream task์—์„œ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ ์ข‹์€ ํ•ด์„์ด ์•„๋‹ˆ๋‹ค.
  • image-text matching ๋Šฅ๋ ฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, CLIP ๋ชจ๋ธ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ์ œ์•ˆํ•˜๋‹ค.
  • Nagative sampling generation ์ด ํฌ์ธํŠธ: Visual Genome ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” object, attribute, relation ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ด์„œ, embeding vector์˜ cos-similarity ๊ฐ€ 0.5 ์ด์ƒ์ธ ๋‹จ์–ด๋“ค๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋งŒ๋“ ๋‹ค.
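
A hypothetical sketch of that swap, assuming wv is a pretrained word-embedding lookup (e.g. gensim KeyedVectors); the 0.5 threshold is from the paper, everything else is mine:

def make_negative(caption, target_word, wv, thr=0.5):
    # Swap target_word for a different word whose embedding cosine similarity is
    # at least thr, so the negative caption stays plausible but wrong.
    for cand, sim in wv.most_similar(target_word, topn=50):
        if sim >= thr and cand.lower() != target_word.lower():
            return caption.replace(target_word, cand)
    return None   # no sufficiently similar word found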

image-20240801120246492

  • (According to DAC) Both the ARO and VL-CheckList evaluations come with positive and negative captions pre-built for images from datasets like Visual Genome; you check whether your current CLIP model can tell them apart. The negative captions are ones in which an object, attribute, or relation has been slightly changed.

 

◽️ (DSG) Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR 2024.

  • Uses gpt-3.5-turbo to generate the questions and GPT-4V to run the VQA and extract the score. All the code needed is here.
  • Paper summary
    • The prior method (TIFA) does not consider the dependencies between questions (it treats 'is there a motorcycle' and 'is the motorcycle blue' as completely independent questions).
    • To fix this, the paper proposes a method that accounts for the dependencies between questions (it all comes down to prompt-tuning an LLM to do it). The question-generation process is in Figure 4 below.
    • The VQA step... also just uses existing models.
    • The paper lists several analyses of whether the generated questions resemble human-written ones and how effective the VQA is. (Did not read that part; pass.)
  • What is the main logic? What do the roots starting from objects buy us?
    • If an entity is absent, all subsequent questions are marked false; no extra questions are asked. (See the scoring sketch after this list.)
  • What exactly is the LLM-based parsing method?
    • First, semantic categories are specified as in Figure 3 below.
    • PaLM-2-340B is used, and the details on the preamble engineering are in Appendix A.
  • How do they evaluate their own metric?
    • On 30 samples, they check precision and recall against human-written tuples and questions.
    • There is also something called dependencies valid, which measures whether the links between tuples are correct; its accuracy is reportedly 100%.
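
A minimal sketch of the dependency-aware scoring described above, assuming each question is a (qid, parent_ids, text) tuple given in topological order and vqa(image, text) -> bool wraps the VQA model; these names are my own:

def dsg_score(image, questions, vqa):
    answers = {}
    for qid, parents, text in questions:
        if all(answers[p] for p in parents):   # all parent entities confirmed
            answers[qid] = vqa(image, text)
        else:
            answers[qid] = False               # parent failed: auto-false, never asked
    return sum(answers.values()) / len(answers)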

image-20240801233723462

image-20240828142450857

 

◽️ Prometheus-Vision. arXiv. 24

  • VML์˜ output์„ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ ์–ด๋ ต๋‹ค. (1) instruction, question์— ์ž˜ ๋”ฐ๋ž๋Š”์ง€๋„ ํ‰๊ฐ€ํ•ด์•ผํ•˜๊ณ , (2) ์ด๋ฏธ์ง€๋ž‘ ์ž˜ ์—ฐ๊ด€๋œ ๋‹ต๋ณ€์„ ํ–ˆ๋Š”์ง€๋„ ํ‰๊ฐ€ํ•ด์•ผํ•œ๋‹ค.
  • ํ•˜์ง€๋งŒ ๊ธฐ์กด SPICE, METEOR ์™€ ๊ฐ™์€ ์ง€ํ‘œ๋“ค์€ ๊ธด output์„ ํ‰๊ฐ€ํ•˜๋Š”๋ฐ ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.
  • ๊ธฐ์กด Open-source VLM์„ ๊ทธ๋Œ€๋กœ assessing์„ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ธฐ์—”, human, GPT-4๊ณผ ๋น„๊ตํ•ด ๋Šฅ๋ ฅ์ด ๋งŽ์ด ๋ถ€์กฑํ•˜๋‹ค.
  • ๋”ฐ๋ผ์„œ LLaMA-1.5๋ฅผ Finetuningํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์†Œ๊ฐœํ•˜๊ณ , ์ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ธ prometheus-vision ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๋‹ค.

image-20240802012314977

 

◽️ Semantic parsing

  1. Image Retrieval using Scene Graphs. CVPR 15
    • First to propose the scene graph, separated into objects, attributes, and relationships.
    • Performs retrieval based on the scene graph / the user has to supply the scene graph.
    • Releases 5,000 [scene graph - image] pairs.
  2. Stanford-scene-graph-parser: Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. EMNLP 2015
    • Proposes methods for generating scene graphs automatically (rule-based / classifier-based scene graph parsing).
    • Considers only one sentence at a time.
    • Parsing is hard; for example, pronouns: "a bed with a pillow on it." / plural nouns: "three men are wearing jeans", "three men are carrying a piano".
    • Rule-based parsing: nine dependency patterns that capture the constructions and phenomena. / Classifier-based parsing: using scene graph datasets, train a model that can extract all candidate objects, attributes, and relations.
  3. SPICE: Semantic Propositional Image Caption Evaluation. ECCV 16
    • Uses scene graphs to check caption quality.
    • First parses the dependencies between words using the paper above, then draws a tree using the full dependency information.
    • Uses the F1-score between the reference (GT) and candidate (generated) captions (see the sketch after this list).
    • In the code, everything runs in Java, and only the precision/recall information comes over to Python (e.g., how many tuples (objects, attributes, relations) overlap).
  4. Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. CVPR 19
    • Uses scene graph information to perform negative mining for CLIP-style training.
    • For semantic parsing, they wrote and released code following the rule-based parsing of paper 2 above. The two codebases do not play exactly the same role; the differences are as follows.
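
Returning to SPICE (item 3 above): a toy sketch of the tuple-matching F1, with exact matching only (the real implementation also soft-matches synonyms via WordNet):

def spice_f1(ref_tuples, cand_tuples):
    # ref_tuples / cand_tuples: sets of parsed (object, attribute, relation, ...) tuples.
    ref, cand = set(ref_tuples), set(cand_tuples)
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)          # tuples present in both scene graphs
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)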