Paper List
Adaptation with Foundation Model
- Measuring CLIP capability
- C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion. ICLR, 2024.
- DSG: Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR, 2024.
- Decomposed CLIPScore: Improving Text-to-Image Consistency via Automatic Prompt Optimization. Meta, 2024.
- CLIP Finetuning
- Fine-tuned CLIP Models are Efficient Video Learners. CVPR, 2023. 55.
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing. EACL, 2024. 0.
- Improving CLIP Fine-tuning Performance. ICCV, 2023. 2.
- 🌟 CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet. arXiv, 2022. 17.
- 🌟 ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. NeurIPS, 2022. 92.
- 🌟 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. CVPR, 2024.
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification. ECCV, 2022. 139.
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters. IJCV, 2024. 480.
- 🌟 A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models. CVPR, 2024. 2.
- Feature Adaptation with CLIP for Few-shot Classification. ACM, 2023. 0.
- Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. CVPR, 2023. 53.
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition. CVPR, 2023. 6.
- Not all features matter: Enhancing few-shot clip with adaptive prior refinement. ICCV, 2023.
- 🌟 A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation. ICLR, 2024. 0.
- Task Residual for Tuning Vision-Language Models. CVPR, 2023. 32.
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models. NeurIPS_W 2023. 3.
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation. CVPR, 2022. 36. (contrastive learning with noisy data)
- CoOp: Learning to Prompt for Vision-Language Models. IJCV. 2022. 1316.
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification. CVPR 2024.
- CLIP pretraining & Analyzing
- Long-CLIP: Unlocking the Long-Text Capability of CLIP. arXiv, Mar 2024.
- DreamLIP: Language-Image Pre-training with Long Captions (project). arXiv, 2024.
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies. arXiv, Apr 2024.
- Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends. 122. (survey from MS)
- Interpreting CLIP's Image Representation via Text-Based Decomposition. ICLR 2024.
- SigLIP
- MetaCLIP
- CLIP adaptation
- Domain Adaptation via Prompt Learning. arXiv 2022.
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval. ICCV, 2023. 5.
- AD-CLIP: Adapting Domains in Prompt Space Using CLIP. ICCV workshop, 2023. 13.
- AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation. CVPR, 2023. 6.
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization. ICCV, 2023. 18.
- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models. ICML 2023. 18. (SFDA)
- SuS-X: Training-free name-only transfer of vision-language models. ICCV, 2023. 28. (training-free)
- Improving zero-shot generalization and robustness of multi-modal models. CVPR, 2023. 15. (training-free)
- 🌟 TPT: Test-time prompt tuning for zero-shot generalization in vision language models. NeurIPS, 2022. 141.
- Align your prompt
- Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. Advances in Space Research. 2022. 34.
- 🌟 SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models. NeurIPS, 2023. 3.
- 🌟 DiffTPT: Diverse data augmentation with diffusions for effective test-time prompt tuning. ICCV, 2023. 15.
- BaFTA: Backprop-Free Test-Time Adaptation for Zero-shot Vision Language Models. ICLR 2024 rejected (but good scores)
- 🌟 Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models. WACV, 2024. 1.
- 🌟 TDA: Efficient Test-Time Adaptation of Vision-Language Models. CVPR, 2024.
- 🌟 Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. CVPR, 2024.
- 🌟 ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation. WACV oral, 2024.
- Retrieval-augmented methods in computer vision.
- linkedin post1, linkedin post2
- Retrieval augmented classification for long-tail visual recognition. CVPR, 2022. 67.
- 🌟 Improving Image Recognition by Retrieving from Web-Scale Image-Text Data. CVPR, 2023. 10.
- 🌟 REACT: Learning Customized Visual Models with Retrieval-Augmented Knowledge. CVPR, 2023. 10.
- 🌟 Retrieval-Augmented Multimodal Language Modeling. ICML, 2023. 44.
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. MS. 2023. 234.
- K-LITE: Learning transferable visual models with external knowledge. NeurIPS, 2022. 64.
- SAM + Domain adaptation
- 🌟 SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation. arXiv, 2024. 7. (github)
- 🌟 SAM4UDASS: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles. arXiv, 2024.
- SAM-guided Unsupervised Domain Adaptation for 3D Segmentation. arXiv(ICLR2024 submitted), 2024.
- Utilizing text-image alignment.
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs. NeurIPS 2023 spotlight. 14.
- Using Language to Extend to Unseen Domains. ICLR, 2023. 20.
- StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. ACM, 2021. 471.
- Diagnosing and Rectifying Vision Models using Language. ICLR, 2023. 27.
- TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation. ICCV, 2023. 2.
- Embedding Arithmetic of Multimodal Queries for Image Retrieval. CVPRW, 2022. 17.
- Distillation
- NVIDIA-AI-IOT/CLIP-distillation (github)
- CLIP-KD: An Empirical Study of Distilling CLIP Models. CVPR 2024. 3.
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. ICCV 2023. 10.
- CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR 2024. 18.
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. MS, 2022. (CLIP-TD: CLIP Targeted Distillation). 5.
- EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. CVPR 2024. 15.
- MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation. CVPR, 2023. 141.
- Generalize/Distill and Adapt
- Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation. ICCV 2021. 102.
- DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation. CVPR 2023. 8.
- Image captioning
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
- Simple LLM Framework for Long-Range Video Question-Answering
- RECLIP: Resource-Efficient CLIP by Training with Small Images. TMLR, 2023.
- Distilling Vision-Language Models on Millions of Videos. CVPR 2024.
- Also look at the papers a colleague shared.
- Topic from the same colleague: the go-to general-purpose model is BLIP-2, and smaller models also exist. If a captioner is used for video understanding, problems can arise, because a plain image captioner can be weak at motion verbs. BLIP-2 is weak at grounding, and using an LLM clearly introduces hallucination.
- There are also image-captioning models that use small Transformers such as BERT or GPT-2.
Summary
◽️ A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models. CVPR, 2024.
- Setup: robust fine-tuning; few-shot learning on a specific target domain; studying how to train the model without hurting its generalization capabilities; small-parameter fine-tuning (few-shot adaptation) on the target task.
- Related work: CLIP-Adapter, TIP-Adapter, Multimodality helps unimodality (notably, this paper criticizes the two works above for their heavy reliance on the test set).
- Motive: there is no guarantee that CLIP fine-tuned on a given target dataset will, once deployed, work well on most domains. Experiments with the related works show that after training on the target domain, performance on other domains drops sharply. Adding a model-selection step that accounts for all of this is practically impossible.
- Method: a linear probe initialized with the zero-shot weights is, to begin with, the most stable and performs well; they propose CLAP, an ALM-based technique, to improve it further (see the sketch after this block).
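- Below is a minimal sketch (my own illustration, not the authors' code) of the zero-shot-initialized linear probe: the classifier weights start from the class text embeddings, so training starts from, and should only improve on, the zero-shot solution. Names and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def zs_init_linear_probe(text_embeds, img_feats, labels, epochs=100, lr=1e-3):
    """Linear probe whose weights are initialized from CLIP's class text
    embeddings (the zero-shot classifier), then tuned on few-shot features.
    Hypothetical illustration, not the CLAP implementation."""
    W = text_embeds.clone()                    # (C, d): the zero-shot weights
    W.requires_grad_(True)
    opt = torch.optim.SGD([W], lr=lr)
    for _ in range(epochs):
        logits = 100.0 * img_feats @ F.normalize(W, dim=-1).T  # CLIP-style scale
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach()

# toy usage with random stand-ins for CLIP features
d, C, n = 512, 10, 160
text_embeds = F.normalize(torch.randn(C, d), dim=-1)
img_feats = F.normalize(torch.randn(n, d), dim=-1)
labels = torch.randint(0, C, (n,))
W = zs_init_linear_probe(text_embeds, img_feats, labels)
```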
◽️ A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation. ICLR, 2024
- Setup: few-shot learning on a specific target domain. Being training-free, it is robust to imbalanced setups.
- Motive: even without actually training CLIP, we can obtain sufficiently strong few-shot performance.
- Method: using the feature distribution of the few-shot data, derive the weight and bias of a linear layer via Gaussian Discriminant Analysis, then ensemble the result with CLIP's zero-shot predictions (a sketch follows this block).
- It is interesting that, despite being training-free, it matches the performance of adaptation-based methods.
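- A sketch of the training-free GDA idea under my reading (uniform class priors assumed; not the paper's exact code): class means and a shared covariance give a closed-form linear classifier whose logits are ensembled with CLIP's zero-shot logits.

```python
import torch

def gda_classifier(feats, labels, num_classes, eps=1e-4):
    """Closed-form linear head from Gaussian Discriminant Analysis with a
    shared covariance; a sketch of the idea, not the official code."""
    d = feats.shape[1]
    mus = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    centered = feats - mus[labels]                      # remove class means
    cov = centered.T @ centered / len(feats) + eps * torch.eye(d)
    prec = torch.linalg.inv(cov)
    W = mus @ prec                                      # (C, d) weights
    b = -0.5 * (mus @ prec * mus).sum(-1)               # bias; uniform priors
    return W, b

def ensemble_logits(img_feats, text_embeds, W, b, alpha=1.0):
    """Combine CLIP zero-shot logits with the GDA logits; alpha is a
    hypothetical balancing weight."""
    zero_shot = 100.0 * img_feats @ text_embeds.T
    return zero_shot + alpha * (img_feats @ W.T + b)
```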
◽️ Efficient Test-Time Adaptation of Vision-Language Models. CVPR, 2024
- Setup: Training-free test-time adaptation of VLMs
- Related work(Few-shot learning): CoOp, CoCoOp, CLIP-Adapter, Tip-Adapter
- Related work(TTAofVLM): TPT, DiffTPT
- Keys of method: dynamic adapter, Caches storing positive and negative pseudo labels
- Each cache stores a dynamic queue of few-shot test features as keys and the corresponding pseudo labels as values.
- Two caches: a positive cache (stores key-value pairs for low-entropy samples) and a negative cache (stores noisy samples; it signals that samples close to these are likely NOT of class XX). A simplified sketch of the positive cache appears after this block.
- As TTA proceeds on the target domain, the caches accumulate. Keeping them frozen, it would be worth checking whether CLIP's capability 1) is no longer limited to "a photo of" prompts, and 2) now better supports LMM / CLIP-score judgments. To gauge CLIP's own capability, we could measure the loss or accuracy on an eval set drawn from the training data.
- TPT and DiffTPT share the same setup and motive, so a detailed read of those papers is deferred.
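- A simplified sketch of the positive cache (my reading of the idea; the official TDA implementation also keeps a negative cache and extra bookkeeping): each class keeps a small queue of its lowest-entropy test features, and cache affinities are added to the zero-shot logits.

```python
import torch

class PositiveCache:
    """Per-class queue of low-entropy test features used as keys, with the
    pseudo label as the value; simplified sketch, not the official code."""
    def __init__(self, num_classes, capacity=3):
        self.items = {c: [] for c in range(num_classes)}   # (entropy, feat)
        self.capacity = capacity

    def add(self, feat, pseudo_label, entropy):
        q = self.items[pseudo_label]
        q.append((entropy, feat))
        q.sort(key=lambda x: x[0])            # keep only the most confident
        del q[self.capacity:]

    def logits(self, feat, beta=5.0):
        """Affinity of a unit-norm test feature to each class's cached keys."""
        out = torch.zeros(len(self.items))
        for c, q in self.items.items():
            if q:
                keys = torch.stack([f for _, f in q])      # (k, d)
                out[c] = torch.exp(-beta * (1 - feat @ keys.T)).sum()
        return out

# at test time: logits = 100.0 * feat @ text_embeds.T + cache.logits(feat)
```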
◽️ SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models. NeurIPS, 2023
- Setup: Test-time adaptation by updating prompts
- Method: use not only the Prompt_{online} updated as in TPT, but also Prompt_{EMA}.
- Two prompts: the target prompt and the online prompt. The online prompt is optimized, while the target prompt is gradually updated through a slow-moving average process, which incorporates past information to increase stability and effectiveness.
- In SwAV, the model is EMA-updated and the predictions of augmented images from the two models (EMA, student) are pushed to be similar. Analogously, here the prompt is EMA-updated, and the predictions of augmented images from the two prompts (EMA, online) are pushed to be similar (sketch below).
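- A minimal sketch of the EMA-prompt idea described above (names, shapes, and the momentum value are illustrative, not from the paper's code):

```python
import torch

@torch.no_grad()
def ema_update(target_prompt, online_prompt, momentum=0.99):
    """Slow-moving average: the target prompt accumulates past online prompts."""
    target_prompt.mul_(momentum).add_(online_prompt, alpha=1.0 - momentum)

def swap_loss(logits_online_v1, probs_target_v2):
    """Push the online prediction of one augmented view toward the EMA-prompt
    (soft) prediction of another view: cross-entropy on soft labels."""
    log_p = torch.log_softmax(logits_online_v1, dim=-1)
    return -(probs_target_v2 * log_p).sum(-1).mean()

# toy usage: 4 learnable context tokens of width 512
online = torch.randn(4, 512)
target = online.clone()
ema_update(target, online + 0.1 * torch.randn_like(online))
```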
◽️ DaPL: Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models. WACV, 2024
- Intro1: architecture-wise, ViT beats ResNet anyway; also, let's use CLIP pretrained weights for UDA.
- Intro2: using CLIP pretrained weights for UDA raises two possible challenges: (1) they have billions of parameters that require heavy computational resources to tune, and (2) CLIP, learned from 400M pairs, can deteriorate through standard fine-tuning.
- Method: (1) text prompt tuning (PTT): insert a linear layer after the text encoder and tune it; (2) visual feature refinement (VFR): add a set of learnable parameters to the image-encoder features; (3) domain-aware pseudo labeling (DaPL): "a [DOMAIN] photo of a [CLASS]" (see the prompt-construction sketch below).
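- A tiny sketch of the domain-aware prompt construction for pseudo-labeling; the template follows the note above, everything else (function name, class list) is illustrative:

```python
def build_domain_prompts(classnames, domain):
    """Instantiate 'a [DOMAIN] photo of a [CLASS]' for every class."""
    return [f"a {domain} photo of a {c}" for c in classnames]

print(build_domain_prompts(["car", "bus", "truck"], "sketch"))
# ['a sketch photo of a car', 'a sketch photo of a bus', 'a sketch photo of a truck']
```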
◽️ Multimodal Adaptation of CLIP for Few-Shot Action Recognition. CVPR, 2023
- Motive: a CLIP model built by training on massive data can be a good ingredient for downstream tasks, reducing the need for large amounts of extra data and training.
- Method1: parameter-efficient fine-tuning, and which parameters are best to train: (1) reduce the number of trainable parameters, (2) adapt the model to handle video input.
- Method2: Text-guided Prototype Construction Module.
- The table that separates pretrained weights into ImageNet vs. CLIP is clean. In fact, the paper above (domain-aware pseudo labeling) also uses CLIP pretrained weights.
◽️ SAM + Domain Adaptation
- These papers perform domain adaptation by naively consuming SAM's outputs. It would be better to make the general approaches developed for classification usable with SAM; let's avoid this naive use of outputs for now. For an interesting paper, the right direction seems to be finding a way to extract SAM's inherent clustering ability internally.
◽️ ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation. Amazon. WACV oral, 2024.
- Setup: domain adaptation with unlabeled target (downstream) domain, where a source model is the CLIP pre-trained model.
- Motive: the CLIP model has significant misalignments between visual and text embeddings.
- Method: (1) learn a projection subspace that removes redundant dimensions and class-agnostic information (a hedged sketch follows); (2) cross-modality self-training.
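- A hedged sketch of the projection idea as I understand it (assumption-level, not ReCLIP's exact procedure): remove the shared mean of the class text embeddings (the class-agnostic component) and keep only the subspace they span.

```python
import torch

def text_projection(text_embeds, k=None):
    """Projector onto the subspace spanned by the mean-centered class text
    embeddings; k optionally truncates to the top singular directions."""
    mu = text_embeds.mean(0, keepdim=True)
    _, _, Vh = torch.linalg.svd(text_embeds - mu, full_matrices=False)
    V = Vh.T if k is None else Vh[:k].T          # (d, r) basis
    return V @ V.T                               # (d, d) projection matrix

# apply the same projector to image and text features before computing similarity
```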
◽️ ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. NeurIPS 2022.
- Task: Developing transferable visual models that perform well on a wide range of downstream applications. Finetuning.
- A good paper, but not needed right now; refer back to it later.
◽️ Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. CVPR, 2024.
- Setup: How can you utilize the knowledge of large multimodal model (e.g., CLIP) for domain adaptation?
- Fine-tuning/adapting the VL model itself to the target domain
- Adapting the source model to the target domain while utilizing the VL models as external knowledge. ✅
- Problem: Direct application of the VL model proves unsatisfactory, lacking specialization for specific tasks.
- Method
- Step1: Task-specific customization of a VL model through task-specific prompt learning.
- Step2: Target model adaptation with two regularizations.
/Users/junha/Library/CloudStorage/OneDrive-개인/Davian/Study/Robustness/adaptation_240318.pptx
◽️ CLIPArTT: Light-weight Adaptation of CLIP to New Domains at Test Time. arXiv, 2024.
- Motive1 (why TTA + CLIP): CLIP has been employed in fields as diverse as video, audio, and medical imaging. The challenge is to adapt the model to new domains in real time while maintaining its attractive zero-shot capabilities.
- Method: use a prompt generated from the top-3 predictions (e.g., "A photo of a car, bus and truck."); see the sketch below.
- Motive2 (why top-3 predictions): on CIFAR-100, the correct class is within the top-3 predictions 80.92% of the time (vs. 61.78% within the top-1).
- No comparison against prior work (e.g., TPT), but the top-3 prompt idea is interesting.
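- A tiny sketch of the top-k prompt construction (my illustration of the idea above; names are placeholders):

```python
import torch

def topk_prompt(logits, classnames, k=3):
    """Build a multi-class prompt from the current top-k predictions,
    e.g. 'A photo of a car, bus and truck.'"""
    idx = logits.topk(k).indices.tolist()
    names = [classnames[i] for i in idx]
    return "A photo of a " + ", ".join(names[:-1]) + " and " + names[-1] + "."

print(topk_prompt(torch.tensor([0.7, 0.2, 0.06, 0.04]),
                  ["car", "bus", "truck", "plane"]))
# A photo of a car, bus and truck.
```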