Paper List
Adaptation with Foundation Model
- Measuring CLIP capability
- C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion. ICLR, 2024.
- DSG: Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR, 2024.
- Decomposed CLIPScore: Improving Text-to-Image Consistency via Automatic Prompt Optimization. Meta, 2024.
- CLIP Finetuning
- Fine-tuned CLIP Models are Efficient Video Learners. CVPR, 2023. 55.
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing. EACL, 2024. 0.
- Improving CLIP Fine-tuning Performance. ICCV, 2023. 2.
- 🌟 CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet. arXiv, 2022. 17.
- 🌟 ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. NeurIPS, 2022. 92.
- 🌟 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. CVPR, 2024.
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification. ECCV, 2022. 139.
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters. IJCV, 2024. 480.
- 🌟 A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models. CVPR, 2024. 2.
- Feature Adaptation with CLIP for Few-shot Classification. ACM, 2023. 0.
- Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. CVPR, 2023. 53.
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition. CVPR, 2023. 6.
- Not all features matter: Enhancing few-shot clip with adaptive prior refinement. ICCV, 2023.
- 🌟 A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation. ICLR, 2024. 0.
- Task Residual for Tuning Vision-Language Models. CVPR, 2023. 32.
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models. NeurIPS_W 2023. 3.
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation. CVPR, 2022. 36. (contrastive learning with noisy data)
- CoOp: Learning to Prompt for Vision-Language Models. IJCV. 2022. 1316.
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification. CVPR 2024.
- CLIP pretraining & Analyzing
- Long-CLIP: Unlocking the Long-Text Capability of CLIP. arXiv, Mar 2024.
- DreamLIP: Language-Image Pre-training with Long Captions (project). arXiv, 2024.
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies. arXiv, Apr 2024.
- Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends. 122. (survey from MS)
- Interpreting CLIP's Image Representation via Text-Based Decomposition. ICLR 2024.
- SigLIP
- MetaCLIP
- CLIP adaptation
- Domain Adaptation via Prompt Learning. arXiv 2022.
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval. ICCV, 2023. 5.
- AD-CLIP: Adapting Domains in Prompt Space Using CLIP. ICCV workshop, 2023. 13.
- AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation. CVPR, 2023. 6.
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization. ICCV, 2023. 18.
- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models. ICML 2023. 18. (SFDA)
- SuS-X: Training-free name-only transfer of vision-language models. ICCV, 2023. 28. (training-free)
- Improving zero-shot generalization and robustness of multi-modal models. CVPR, 2023. 15. (training-free)
- 🌟 TPT: Test-time prompt tuning for zero-shot generalization in vision language models. NeurIPS, 2022. 141.
- Align your prompt
- Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. Advances in Space Research. 2022. 34.
- 🌟 SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models. NeurIPS, 2023. 3.
- 🌟 DiffTPT: Diverse data augmentation with diffusions for effective test-time prompt tuning. ICCV, 2023. 15.
- BaFTA: Backprop-Free Test-Time Adaptation for Zero-shot Vision Language Models. ICLR 2024 rejected (but good scores)
- 🌟 Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models. WACV, 2024. 1.
- 🌟 TDA: Efficient Test-Time Adaptation of Vision-Language Models. CVPR, 2024.
- 🌟 Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. CVPR, 2024.
- 🌟 ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation. WACV oral, 2024.
- Retrieval-augmented methods in computer vision.
- linkedin post1, linkedin post2
- Retrieval augmented classification for long-tail visual recognition. CVPR, 2022. 67.
- 🌟 Improving Image Recognition by Retrieving from Web-Scale Image-Text Data. CVPR, 2023. 10.
- 🌟 REACT: Learning Customized Visual Models with Retrieval-Augmented Knowledge. CVPR, 2023. 10.
- 🌟 Retrieval-Augmented Multimodal Language Modeling. ICML, 2023. 44.
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. MS. 2023. 234.
- K-LITE: Learning transferable visual models with external knowledge. NeurIPS, 2022. 64.
- SAM + Domain adaptation
- 🌟 SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation. arXiv, 2024. 7. (github)
- 🌟 SAM4UDASS: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles. arXiv, 2024.
- SAM-guided Unsupervised Domain Adaptation for 3D Segmentation. arXiv(ICLR2024 submitted), 2024.
- Utilizing text-image alignment.
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs. NeurIPS 2023 spotlight. 14.
- Using Language to Extend to Unseen Domains. ICLR, 2023. 20.
- StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. ACM, 2021. 471.
- Diagnosing and Rectifying Vision Models using Language. ICLR, 2023. 27.
- TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation. ICCV, 2023. 2.
- Embedding Arithmetic of Multimodal Queries for Image Retrieval. CVPRW, 2022. 17.
- Distillation
- NVIDIA-AI-IOT/CLIP-distillation (github)
- CLIP-KD: An Empirical Study of Distilling CLIP Models. CVPR 2024. 3.
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. ICCV 2023. 10.
- CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR 2024. 18.
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. MS, 2022. (CLIP-TD: CLIP Targeted Distillation). 5.
- EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. CVPR 2024. 15.
- MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation. CVPR, 2023. 141.
- Generalize/Distill and Adapt
- Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation. ICCV 2021. 102.
- DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation. CVPR 2023. 8.
- Image captioning
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
- Simple LLM Framework for Long-Range Video Question-Answering
- RECLIP: Resource-Efficient CLIP by Training with Small Images. TMLR, 2023.
- Distilling Vision-Language Models on Millions of Videos. CVPR 2024.
- Also look at the papers a colleague shared.
- Topic from the same colleague: the go-to general-purpose model is BLIP-2, and smaller models also exist. If a captioner is used for video understanding, problems can arise, because a plain image captioner can be weak at motion verbs. BLIP-2 is weak at grounding, and using an LLM clearly introduces hallucination.
- There are also image-captioning models that use small Transformers such as BERT or GPT-2.
Summary
◽️ A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models. CVPR, 2024.
- Setup: robust fine-tuning; few-shot learning on a specific target domain; studying how to train the model without hurting its generalization capabilities; small-parameter fine-tuning (few-shot adaptation) on the target task.
- Related work: CLIP-Adapter, TIP-Adapter, Multimodality helps unimodality (notably, this paper criticizes the two works above for their heavy reliance on the test set).
- Motive: there is no guarantee that CLIP fine-tuned on a given target dataset will, once deployed, work well on most domains. Experiments with the related works show that after training on the target domain, performance on other domains drops sharply. Adding a model-selection step that accounts for all of this is practically impossible.
- Method: a linear probe initialized with the zero-shot weights is, to begin with, the most stable and performs well; they propose CLAP, an ALM-based technique, to improve it further (see the sketch after this block).
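- Below is a minimal sketch (my own illustration, not the authors' code) of the zero-shot-initialized linear probe: the classifier weights start from the class text embeddings, so training starts from, and should only improve on, the zero-shot solution. Names and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def zs_init_linear_probe(text_embeds, img_feats, labels, epochs=100, lr=1e-3):
    """Linear probe whose weights are initialized from CLIP's class text
    embeddings (the zero-shot classifier), then tuned on few-shot features.
    Hypothetical illustration, not the CLAP implementation."""
    W = text_embeds.clone()                    # (C, d): the zero-shot weights
    W.requires_grad_(True)
    opt = torch.optim.SGD([W], lr=lr)
    for _ in range(epochs):
        logits = 100.0 * img_feats @ F.normalize(W, dim=-1).T  # CLIP-style scale
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach()

# toy usage with random stand-ins for CLIP features
d, C, n = 512, 10, 160
text_embeds = F.normalize(torch.randn(C, d), dim=-1)
img_feats = F.normalize(torch.randn(n, d), dim=-1)
labels = torch.randint(0, C, (n,))
W = zs_init_linear_probe(text_embeds, img_feats, labels)
```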
◽️ A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation. ICLR, 2024
- Setup: few-shot learning on a specific target domain. Being training-free, it is robust to imbalanced setups.
- Motive: even without actually training CLIP, we can obtain sufficiently strong few-shot performance.
- Method: using the feature distribution of the few-shot data, derive the weight and bias of a linear layer via Gaussian Discriminant Analysis, then ensemble the result with CLIP's zero-shot predictions (a sketch follows this block).
- It is interesting that, despite being training-free, it matches the performance of adaptation-based methods.
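- A sketch of the training-free GDA idea under my reading (uniform class priors assumed; not the paper's exact code): class means and a shared covariance give a closed-form linear classifier whose logits are ensembled with CLIP's zero-shot logits.

```python
import torch

def gda_classifier(feats, labels, num_classes, eps=1e-4):
    """Closed-form linear head from Gaussian Discriminant Analysis with a
    shared covariance; a sketch of the idea, not the official code."""
    d = feats.shape[1]
    mus = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    centered = feats - mus[labels]                      # remove class means
    cov = centered.T @ centered / len(feats) + eps * torch.eye(d)
    prec = torch.linalg.inv(cov)
    W = mus @ prec                                      # (C, d) weights
    b = -0.5 * (mus @ prec * mus).sum(-1)               # bias; uniform priors
    return W, b

def ensemble_logits(img_feats, text_embeds, W, b, alpha=1.0):
    """Combine CLIP zero-shot logits with the GDA logits; alpha is a
    hypothetical balancing weight."""
    zero_shot = 100.0 * img_feats @ text_embeds.T
    return zero_shot + alpha * (img_feats @ W.T + b)
```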
◽️ Efficient Test-Time Adaptation of Vision-Language Models. CVPR, 2024
- Setup: Training-free test-time adaptation of VLMs
- Related work(Few-shot learning): CoOp, CoCoOp, CLIP-Adapter, Tip-Adapter
- Related work(TTAofVLM): TPT, DiffTPT
- Keys of method: dynamic adapter, Caches storing positive and negative pseudo labels
- Each cache stores a dynamic queue of few-shot test features as keys and the corresponding pseudo labels as values.
- Two caches: a positive cache (stores key-value pairs for low-entropy samples) and a negative cache (stores noisy samples; it signals that samples close to these are likely NOT of class XX). A simplified sketch of the positive cache appears after this block.
- As TTA proceeds on the target domain, the caches accumulate. Keeping them frozen, it would be worth checking whether CLIP's capability 1) is no longer limited to "a photo of" prompts, and 2) now better supports LMM / CLIP-score judgments. To gauge CLIP's own capability, we could measure the loss or accuracy on an eval set drawn from the training data.
- TPT and DiffTPT share the same setup and motive, so a detailed read of those papers is deferred.
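- A simplified sketch of the positive cache (my reading of the idea; the official TDA implementation also keeps a negative cache and extra bookkeeping): each class keeps a small queue of its lowest-entropy test features, and cache affinities are added to the zero-shot logits.

```python
import torch

class PositiveCache:
    """Per-class queue of low-entropy test features used as keys, with the
    pseudo label as the value; simplified sketch, not the official code."""
    def __init__(self, num_classes, capacity=3):
        self.items = {c: [] for c in range(num_classes)}   # (entropy, feat)
        self.capacity = capacity

    def add(self, feat, pseudo_label, entropy):
        q = self.items[pseudo_label]
        q.append((entropy, feat))
        q.sort(key=lambda x: x[0])            # keep only the most confident
        del q[self.capacity:]

    def logits(self, feat, beta=5.0):
        """Affinity of a unit-norm test feature to each class's cached keys."""
        out = torch.zeros(len(self.items))
        for c, q in self.items.items():
            if q:
                keys = torch.stack([f for _, f in q])      # (k, d)
                out[c] = torch.exp(-beta * (1 - feat @ keys.T)).sum()
        return out

# at test time: logits = 100.0 * feat @ text_embeds.T + cache.logits(feat)
```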
◽️ SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models. NeurIPS, 2023
- Setup: Test-time adaptation by updating prompts
- Method: use not only the Prompt_{online} updated as in TPT, but also Prompt_{EMA}.
- Two prompts: the target prompt and the online prompt. The online prompt is optimized, while the target prompt is gradually updated through a slow-moving average process, which incorporates past information to increase stability and effectiveness.
- In SwAV, the model is EMA-updated and the predictions of augmented images from the two models (EMA, student) are pushed to be similar. Analogously, here the prompt is EMA-updated, and the predictions of augmented images from the two prompts (EMA, online) are pushed to be similar (sketch below).
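- A minimal sketch of the EMA-prompt idea described above (names, shapes, and the momentum value are illustrative, not from the paper's code):

```python
import torch

@torch.no_grad()
def ema_update(target_prompt, online_prompt, momentum=0.99):
    """Slow-moving average: the target prompt accumulates past online prompts."""
    target_prompt.mul_(momentum).add_(online_prompt, alpha=1.0 - momentum)

def swap_loss(logits_online_v1, probs_target_v2):
    """Push the online prediction of one augmented view toward the EMA-prompt
    (soft) prediction of another view: cross-entropy on soft labels."""
    log_p = torch.log_softmax(logits_online_v1, dim=-1)
    return -(probs_target_v2 * log_p).sum(-1).mean()

# toy usage: 4 learnable context tokens of width 512
online = torch.randn(4, 512)
target = online.clone()
ema_update(target, online + 0.1 * torch.randn_like(online))
```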
◽️ DaPL: Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models. WACV, 2024
- Intro1: architecture-wise, ViT beats ResNet anyway; also, let's use CLIP pretrained weights for UDA.
- Intro2: using CLIP pretrained weights for UDA raises two possible challenges: (1) they have billions of parameters that require heavy computational resources to tune, and (2) CLIP, learned from 400M pairs, can deteriorate through standard fine-tuning.
- Method: (1) text prompt tuning (PTT): insert a linear layer after the text encoder and tune it; (2) visual feature refinement (VFR): add a set of learnable parameters to the image-encoder features; (3) domain-aware pseudo labeling (DaPL): "a [DOMAIN] photo of a [CLASS]" (see the prompt-construction sketch below).
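- A tiny sketch of the domain-aware prompt construction for pseudo-labeling; the template follows the note above, everything else (function name, class list) is illustrative:

```python
def build_domain_prompts(classnames, domain):
    """Instantiate 'a [DOMAIN] photo of a [CLASS]' for every class."""
    return [f"a {domain} photo of a {c}" for c in classnames]

print(build_domain_prompts(["car", "bus", "truck"], "sketch"))
# ['a sketch photo of a car', 'a sketch photo of a bus', 'a sketch photo of a truck']
```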
◽️ Multimodal Adaptation of CLIP for Few-Shot Action Recognition. CVPR, 2023
- Motive: a CLIP model built by training on massive data can be a good ingredient for downstream tasks, reducing the need for large amounts of extra data and training.
- Method1: parameter-efficient fine-tuning, and which parameters are best to train: (1) reduce the number of trainable parameters, (2) adapt the model to handle video input.
- Method2: Text-guided Prototype Construction Module.
- The table that separates pretrained weights into ImageNet vs. CLIP is clean. In fact, the paper above (domain-aware pseudo labeling) also uses CLIP pretrained weights.
◽️ SAM + Domain Adaptation
- These papers perform domain adaptation by naively consuming SAM's outputs. It would be better to make the general approaches developed for classification usable with SAM; let's avoid this naive use of outputs for now. For an interesting paper, the right direction seems to be finding a way to extract SAM's inherent clustering ability internally.
◽️ ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation. Amazon. WACV oral, 2024.
- Setup: domain adaptation with unlabeled target (downstream) domain, where a source model is the CLIP pre-trained model.
- Motive: the CLIP model has significant misalignments between visual and text embeddings.
- Method: (1) learn a projection subspace that removes redundant dimensions and class-agnostic information (a hedged sketch follows); (2) cross-modality self-training.
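- A hedged sketch of the projection idea as I understand it (assumption-level, not ReCLIP's exact procedure): remove the shared mean of the class text embeddings (the class-agnostic component) and keep only the subspace they span.

```python
import torch

def text_projection(text_embeds, k=None):
    """Projector onto the subspace spanned by the mean-centered class text
    embeddings; k optionally truncates to the top singular directions."""
    mu = text_embeds.mean(0, keepdim=True)
    _, _, Vh = torch.linalg.svd(text_embeds - mu, full_matrices=False)
    V = Vh.T if k is None else Vh[:k].T          # (d, r) basis
    return V @ V.T                               # (d, d) projection matrix

# apply the same projector to image and text features before computing similarity
```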
◽️ ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. NeurIPS 2022.
- Task: Developing transferable visual models that perform well on a wide range of downstream applications. Finetuning.
- A good paper, but not needed right now; refer back to it later.
◽️ Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. CVPR, 2024.
- Setup: How can you utilize the knowledge of large multimodal model (e.g., CLIP) for domain adaptation?
- Fine-tuning/adapting the VL model itself to the target domain
- Adapting the source model to the target domain while utilizing the VL models as external knowledge. ✅
- Problem: Direct application of the VL model proves unsatisfactory, lacking specialization for specific tasks.
- Method
- Step1: Task-specific customization of a VL model through task-specific prompt learning.
- Step2: Target model adaptation with two regularizations.
/Users/junha/Library/CloudStorage/OneDrive-개인/Davian/Study/Robustness/adaptation_240318.pptx
◽️ CLIPArTT: Light-weight Adaptation of CLIP to New Domains at Test Time. arXiv, 2024.
- Motive1 (why TTA + CLIP): CLIP has been employed in fields as diverse as video, audio, and medical imaging. The challenge is to adapt the model to new domains in real time while maintaining its attractive zero-shot capabilities.
- Method: use a prompt generated from the top-3 predictions (e.g., "A photo of a car, bus and truck."); see the sketch below.
- Motive2 (why top-3 predictions): on CIFAR-100, the correct class is within the top-3 predictions 80.92% of the time (vs. 61.78% within the top-1).
- No comparison against prior work (e.g., TPT), but the top-3 prompt idea is interesting.
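- A tiny sketch of the top-k prompt construction (my illustration of the idea above; names are placeholders):

```python
import torch

def topk_prompt(logits, classnames, k=3):
    """Build a multi-class prompt from the current top-k predictions,
    e.g. 'A photo of a car, bus and truck.'"""
    idx = logits.topk(k).indices.tolist()
    names = [classnames[i] for i in idx]
    return "A photo of a " + ", ".join(names[:-1]) + " and " + names[-1] + "."

print(topk_prompt(torch.tensor([0.7, 0.2, 0.06, 0.04]),
                  ["car", "bus", "truck", "plane"]))
# A photo of a car, bus and truck.
```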