General VLM

  1. Rough ranking: BLIP2 > BLIP = OSCAR (detector-based) = SimVLM = LEMON (Scaling Up Vision-Language Pre-training for Image Captioning) = UnifiedVL (Unifying Vision-and-Language Tasks via Text Generation) = VIVO = X-VLM (detector-based) > Flamingo
  2. Perceiver IO: A General Architecture for Structured Inputs & Outputs. ICLR 2022
  3. Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 (uses a Perceiver-style resampler)
  4. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks ECCV20
  5. VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR21
  6. LEMON: Scaling Up Vision-Language Pre-training for Image Captioning CVPR22
  7. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. ICLR22
  8. X-VLM: Multi-Grained Vision Language Pre-Training. ICML22
  9. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. ICML22
  10. GIT: A Generative Image-to-text Transformer for Vision and Language. arXiv22
  11. X-model: Beyond a Pre-Trained Object Detector for Image Captioning. CVPR22
  12. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. EMNLP22.
  13. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023 (see the Q-Former sketch after this list)
  14. LLaVA: Large Language and Vision Assistant. NeurIPS, 2023
  15. LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
  16. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. CVPR24
  17. 🔒ShareGPT4V. ECCV 2024
  18. 🔒🚨TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv24
  19. 🔒🚨PaliGemma: A Versatile 3B VLM for Transfer. Google, arXiv24
  20. 🔒InternVL: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites CVPR24
  21. 🔒Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond arXiv24
  22. 🔒 Phi-3.5-vision-instruct (link)
  23. 🔒🚨AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
  24. 🔒🚨Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  25. 🔒 mDPO: Conditional Preference Optimization for Multimodal Large Language Models. Jun 24.
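
Most of the recent systems above (BLIP-2, LLaVA, mPLUG-Owl2, Qwen-VL) share one recipe: a frozen vision encoder, a large (often frozen) LLM, and a small trained bridge in between. Below is a minimal single-block sketch of a BLIP-2-style Q-Former bridge; the class name, dimensions, and one-block depth are illustrative assumptions, not the paper's exact architecture (the real Q-Former is a multi-layer, BERT-initialized transformer).

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """A single cross-attention block standing in for BLIP-2's Q-Former:
    learned queries attend to frozen image features and are projected
    into the (frozen) LLM's embedding space as soft prompt tokens."""
    def __init__(self, n_queries=32, d=768, llm_d=2048):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, d) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.to_llm = nn.Linear(d, llm_d)

    def forward(self, img_feats):                     # [B, n_patches, d], frozen
        q = self.queries.expand(img_feats.size(0), -1, -1)
        q = q + self.cross_attn(q, img_feats, img_feats)[0]
        q = q + self.ffn(q)
        return self.to_llm(q)                         # [B, n_queries, llm_d] soft prompt
```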

Image Captioning

  1. Tag2Text: Guiding Vision-Language Model via Image Tagging. ICLR, 2024
  2. SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. CVPR, 2023
  3. CaMEL: Mean Teacher Learning for Image Captioning. ICPR, 2022.
  4. Retrieval-augmented image captioning. ACL 2023
  5. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021 (see the prefix-mapper sketch after this list)
  6. 🔒I-tuning: Tuning language models with image for caption generation. ICASSP 2023
  7. Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. ICCV, 2023.
  8. With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023. 2
  9. FlexCap: Generating Rich, Localized, and Flexible Captions in Images. DeepMind & CMU, ICLR, 2024 submitted
  10. LocCa: Visual Pretraining with Location-aware Captioners. Google. NeurIPS 2024
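
ClipCap (item 5) is the simplest of these pipelines to write down: a small mapping network turns one CLIP image embedding into a prefix of GPT-2 input embeddings, and the frozen (or lightly tuned) GPT-2 decodes the caption. A sketch of the paper's MLP mapping variant, with widths and prefix length as assumptions:

```python
import torch
import torch.nn as nn

class ClipCapPrefixMapper(nn.Module):
    """Map a CLIP image embedding to `prefix_len` GPT-2 input embeddings
    (ClipCap's MLP mapping network; dimensions illustrative)."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.gpt_dim, self.prefix_len = gpt_dim, prefix_len
        hidden = (clip_dim + gpt_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                    # [B, clip_dim]
        prefix = self.mlp(clip_emb)                 # [B, prefix_len * gpt_dim]
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

# Prepend the returned prefix to GPT-2's token embeddings and train only the
# mapper (optionally fine-tuning GPT-2) with the usual LM loss.
```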

Vision-centric Improvement (Saining Xie)

  1. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. CVPR. 2024
  2. 🔒🚨Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. NeurIPS. 2024
  3. 🔒🚨Locality Alignment Improves Vision-Language Models ICLR 2025 submitted
  4. 🔒F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. ICLR2023
  5. 🔒Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation. ICCV, 23.
  6. 🔒Scaling language-image pretraining via masking. Kaiming, CVPR 2023. 264
  7. 🔒Visual In-Context Prompting. MS. CVPR24.

Region-based VLMs

  1. Graph-Based Captioning (GBC): Enhancing Visual Descriptions by Interconnecting Region Captions. arXiv 24
  2. GLaMM: Pixel Grounding Large Multimodal Model. CVPR 24 => region encoder (see the ROI-pooling sketch after this list)
  3. 🔒ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts. CVPR2024.
  4. 🔒ControlCap: Controllable Region-level Captioning. ECCV2024
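
A shared ingredient of these region-level models is pooling backbone features over a box before handing them to the language model. A minimal sketch with torchvision's roi_align; the shapes, the assumed 1024-px input, and the "region token" framing are illustrative, not any single paper's pipeline:

```python
import torch
from torchvision.ops import roi_align

# Feature map from a vision backbone: [B, C, H, W].
feats = torch.randn(1, 256, 64, 64)

# One region per row: (batch_index, x1, y1, x2, y2) in input-image pixels.
boxes = torch.tensor([[0.0, 16.0, 16.0, 192.0, 128.0]])

# spatial_scale maps pixel coords onto the 64x64 feature grid
# (assuming a 1024-px input image here).
region = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=64 / 1024)
print(region.shape)  # [1, 256, 7, 7]; flatten/project these as "region tokens"
```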

Hallucination

  1. 🔒Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding, ACL2024

Long / grounded / dense CLIP

  1. DCI (Densely Captioned Images): A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. Meta, CVPR 24 (see the token-limit snippet after this list)
  2. GBC: Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions. Apple, arXiv24.
  3. PixelProse: From Pixels to Prose: A Large Dataset of Dense Image Captions. arXiv24
  4. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. ECCV24
  5. Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models. NeurIPS 2023 Spotlight
  6. 🔒The Pyramid of Captions arxiv24
  7. 🔒🚨ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference. ECCV, 2024
  8. 🔒🚨Modeling Caption Diversity in Contrastive Vision-Language Pretraining. ICML 2024.
  9. 🔒🚨VeCLIP: Improving CLIP Training via Visual-enriched Captions. ECCV, 2024.
  10. 🔒CLIP with Quality Captions: A Strong Pretraining for Vision Tasks.
  11. 🔒EVA-CLIP: Improved Training Techniques for CLIP at Scale.
  12. 🔒SigLIP: Sigmoid Loss for Language Image Pre-training.
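
The "77 text tokens" in DCI's title is CLIP's hard text-context limit, the bottleneck this dense/long-caption line (and Long-CLIP in the next section) works around. Quick to verify, assuming OpenAI's reference clip package:

```python
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# CLIP's text tower reads at most 77 tokens; longer dense captions are
# truncated, which is why paragraph-level supervision needs new recipes.
long_caption = "a dense, paragraph-length description of every region ..."
tokens = clip.tokenize([long_caption], truncate=True)
print(tokens.shape)  # torch.Size([1, 77])
```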

Caption evaluation metrics

  1. ARO: When and why vision-language models behave like bags-of-words, and what to do about it? ICLR 2023 Oral
  2. VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. EMNLP 2022
  3. DSG evaluation: Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR 2024.
  4. Prometheus-Vision. arXiv. 24
  5. Semantic parsing
    1. Image Retrieval using Scene Graphs CVPR15
    2. Stanford scene graph parser. EMNLP 2015
    3. SPICE: Semantic Propositional Image Caption Evaluation. ECCV 16 (see the F1 sketch after this list)
    4. Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. [python code] CVPR19
  6. 🔒🚨GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation. CMU. CVPR2024 → VQAScore
  7. 🔒🚨Benchmarking and Improving Detail Image Caption. arXiv, 2024
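
The semantic-parsing line (items 5.1-5.4) feeds directly into SPICE: parse captions into scene-graph tuples, then score tuple overlap. A toy version assuming pre-extracted tuples and exact matching (real SPICE parses captions itself and matches synonyms via WordNet):

```python
def spice_like_f1(cand_tuples, ref_tuples):
    """Toy SPICE-style score: F1 over scene-graph tuples such as
    ("girl",), ("girl", "young"), or ("girl", "ride", "horse")."""
    cand, ref = set(cand_tuples), set(ref_tuples)
    tp = len(cand & ref)
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```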

Analysis of VLMs / Understanding VLMs

  1. What matters when building vision-language models? Hugging face. arXiv, 2024.
  2. Image Captioners Are Scalable Vision Learners Too. Google, NeurIPS 2023.
  3. S2-Wrapper: When Do We Not Need Larger Vision Models?
  4. 🔒Long-CLIP Unlocking the Long-Text Capability of CLIP ECCV24
  5. 🔒🔥Mm1: Methods, analysis & insights from multimodal llm pre-training. Apple, ECCV, 2024.
  6. 🔒🔥MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. Apple, arXiv, 2024.
  7. 🔒Prismatic vlms: Investigating the design space of visually-conditioned language models. ICML 2024.
  8. 🔒Vila: On pre-training for visual language models. CVPR 2024. 122
  9. 🔒Vision Transformers Need Registers. Meta ICLR 2024 Oral (see the register-token sketch after this list)
  10. 🔒Demystifying CLIP Data. Meta. ICLR 2024 Spotlight
  11. 🔒Bridging Vision and Language Spaces with Assignment Prediction. Naver, ICLR, 2024.
  12. 🔒🔥An Introduction to Vision-Language Modeling. Meta, arXiv, 2024.
  13. 🔒MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models. arXiv, 2024.
  14. 🔒Natural Language Inference Improves Compositionality in Vision-Language Models. arXiv, 2024.
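
The registers fix (item 9) is small enough to sketch: append a few learned tokens to the patch sequence so high-norm "artifact" activations have somewhere to live, then discard them at the output. Token count and width below are assumptions:

```python
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    """Sketch of the 'registers' fix: concatenate a few learned, throwaway
    tokens to the ViT patch sequence before the encoder; drop them after."""
    def __init__(self, n_registers=4, d=768):
        super().__init__()
        self.reg = nn.Parameter(torch.zeros(1, n_registers, d))

    def forward(self, patch_tokens):                  # [B, N, d]
        reg = self.reg.expand(patch_tokens.size(0), -1, -1)
        return torch.cat([patch_tokens, reg], dim=1)  # discard the last n_registers
                                                      # tokens at the encoder output
```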

In-context learning, In-the-loop, Teacher forcing, Training resolutions

  1. Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023. 9. (see the prompt-layout sketch after this list)
  2. 🔒Compositional Chain-of-Thought Prompting for Large Multimodal Models CVPR24
  3. 🔒Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. NAACL 2024.
  4. 🔒Improve Vision Language Model Chain-of-thought Reasoning. CMU. arXiv. 2024
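
For item 1, the design space is literally how (image, caption) demonstrations are laid out in the prompt. A hypothetical helper showing the interleaved layout Flamingo-style models consume; the placeholder token and formatting are assumptions:

```python
def build_icl_caption_prompt(shots, image_token="<image>"):
    """Interleaved few-shot captioning prompt (hypothetical layout; real
    models splice image embeddings in at the placeholder positions).
    shots: list of (image_placeholder, caption) demonstration pairs."""
    lines = [f"{img} Caption: {cap}" for img, cap in shots]
    lines.append(f"{image_token} Caption:")  # query image, caption left open
    return "\n".join(lines)

# Example: two demonstrations, then the query image.
prompt = build_icl_caption_prompt([
    ("<image>", "a dog chasing a frisbee in a park"),
    ("<image>", "two children reading under a tree"),
])
```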

Video

  1. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. ICLR 2024 spotlight
  2. Simple LLM Framework for Long-Range Video Question-Answering. arXiv, Dec, 2023
  3. LaViLa: Learning Video Representations from Large Language Models. Facebook, CVPR, 2023
  4. EgoVLP: Egocentric Video-Language Pretraining. NeurIPS, 2022
  5. Distilling Vision-Language Models on Millions of Videos. CVPR 2024
  6. VideoCon: Robust Video-Language Alignment via Contrast Captions. Google, CVPR 2024. 5
  7. V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. Meta. TMLR 2024
  8. FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. arXiv, 2024.
  9. 🔒FrozenBiLM. NeurIPS 22. 144. (cited by LVQA.)
  10. 🔒Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval. CVPR, 2024.
  11. 🔒PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning. arXiv, Apr 2024. (SOTA! See the pooling sketch after this list.)
  12. 🔒InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
  13. 🔒UMT (Unmasked Teacher): Towards Training-Efficient Video Foundation Models. ICCV 2023 Oral
  14. 🔒Video-LLaVA. EMNLP, 2024
  15. 🔒ShareGPT4Video. NeurIPS 2024
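
PLLaVA's "parameter-free" extension (item 11) is essentially careful pooling: encode frames with the image tower, then adaptive-average-pool the token grid down to a budget the LLM context can afford. A rough sketch; shapes and the 3-D pooling choice are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def pool_video_tokens(frame_tokens, out_t=8, out_hw=12):
    """Parameter-free pooling over per-frame vision tokens: adaptive-average-
    pool the [T, H, W, d] token grid to a fixed budget before the LLM."""
    t, h, w, d = frame_tokens.shape
    x = frame_tokens.permute(3, 0, 1, 2).unsqueeze(0)        # [1, d, T, H, W]
    x = F.adaptive_avg_pool3d(x, (out_t, out_hw, out_hw))    # [1, d, t', h', w']
    return x.squeeze(0).permute(1, 2, 3, 0).reshape(-1, d)   # [t'*h'*w', d]
```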

LLM distillations

  1. awesome github / survey / survey on LLM specialization
  2. MiniLLM: Knowledge Distillation of Large Language Models. ICLR, 2024. 71.
  3. Sequence-Level Knowledge Distillation. EMNLP 2016
  4. 🔒Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. ACL, 2023. 174.
  5. 🔒Beyond one-model-fits-all: A survey of domain specialization for large language models. 23.
  6. 🔒Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks. NeurIPS 2023. 15.
  7. 🔒DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS_W 2019
  8. 🔒LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ACL24
  9. 🔒Contrastive Decoding: Open-ended Text Generation as Optimization. ACL 2023. 200 (see the decoding sketch after this list)
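
The contrastive-decoding rule (item 9, also the basis of the instruction contrastive decoding paper in the Hallucination section) fits in a few lines: prefer tokens where the expert out-scores the amateur, restricted to tokens the expert itself finds plausible. A greedy-step sketch:

```python
import math
import torch

def contrastive_decoding_step(expert_logits, amateur_logits, alpha=0.1):
    """One greedy step of Contrastive Decoding: score tokens by
    log p_expert - log p_amateur, restricted to the expert's plausible set
    {x : p_expert(x) >= alpha * max_x' p_expert(x')}. Inputs: next-token
    logits of shape [vocab]."""
    log_p_exp = expert_logits.log_softmax(-1)
    log_p_ama = amateur_logits.log_softmax(-1)
    plausible = log_p_exp >= log_p_exp.max() + math.log(alpha)
    scores = (log_p_exp - log_p_ama).masked_fill(~plausible, float("-inf"))
    return scores.argmax()  # greedy pick; the paper also explores beam search
```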

VLM distillations

  1. 🔒🚨EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning. ACL-Finding, 2023
  2. 🔒🚨PromptKD: Unsupervised Prompt Distillation for Vision-Language Models. CVPR, 2024
  3. 🔒🚨CLIPPING Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval CVPR24
  4. 🔒🚨Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models. ECCV24
  5. 🔒🚨Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models. ECCV24
  6. 🔒🚨SILC: Improving Vision Language Pretraining with Self-Distillation. ECCV2024
  7. 🔒Compositional Chain-of-Thought Prompting for Large Multimodal Models CVPR24
  8. 🔒Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models CVPR24
  9. 🔒Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS, 2021. (see the momentum-distillation sketch after this list)
  10. 🔒Compressing Visual-linguistic Model via Knowledge Distillation. ICCV, 2021.
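
ALBEF's momentum distillation (item 9) is a self-distillation trick: an EMA copy of the student provides soft targets that absorb noise in web-scraped image-text pairs. A sketch of the two pieces; the mixing weight and loss form follow the paper's spirit, not its exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.995):
    # Momentum teacher: an exponential moving average of the student.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)

def momentum_distill_loss(student_logits, teacher_logits, targets, alpha=0.4):
    # Mix the hard (one-hot) loss with soft cross-entropy against the
    # momentum teacher's distribution.
    hard = F.cross_entropy(student_logits, targets)
    soft = -(teacher_logits.softmax(-1) * student_logits.log_softmax(-1)).sum(-1).mean()
    return (1.0 - alpha) * hard + alpha * soft
```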

Attention-Score Distillation

  1. 🔒Knowledge distillation method for better vision-language models. AMAZON, article
  2. 🔒TinyBERT: Distilling BERT for Natural Language Understanding. EMNLP 2020. 1876. (see the loss sketch after this list)
  3. 🔒Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation. arXiv 2024.
  4. 🔒Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching. AAAI 2021. 130.
  5. 🔒Frequency Attention for Knowledge Distillation. WACV 2024.
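
The common core of this section, in one function: match the student's attention matrices to a (layer-mapped) teacher's. TinyBERT (item 2) uses MSE on unnormalized scores; row-wise KL on post-softmax maps is a frequent variant.

```python
import torch.nn.functional as F

def attn_distill_loss(student_attn, teacher_attn):
    """TinyBERT-style attention transfer: MSE between student and
    (layer-mapped) teacher attention tensors, [batch, heads, seq, seq]."""
    return F.mse_loss(student_attn, teacher_attn)
```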

Image distillations

  1. 🔒🚨Distilling Internet-Scale Vision-Language Models into Embodied Agents. DeepMind, ICML 2023. 19.
  2. 🔒Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. arXiv, 2022. 136.

Efficiency of CLIP (presented in RECLIP):

  1. RECLIP: Resource-Efficient Clip by Training with Small Images. TMLR, 2023.
  2. Data-Efficient Multimodal Fusion on a Single GPU. CVPR, 2024. Highlight
  3. 🔒LiT: Zero-shot transfer with locked-image text tuning. CVPR, 2022. (precompute image features, pretrained classifiers)
  4. 🔒FILIP: Fine-grained interactive language-image pre-training. ICLR, 2022. (masked images, multi-view supervisions)
  5. 🔒Sigmoid loss for language image pre-training. ICCV oral, 2023. (sigmoid loss; see the sketch after this list)
  6. 🔒Maskclip: Masked self-distillation advances contrastive language-image pretraining. CVPR 2023.
  7. 🔒Supervision exists everywhere: A data-efficient contrastive language-image pre-training paradigm. ICLR 2022. (multi-view supervisions)
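
Item 5's sigmoid loss replaces the softmax contrastive loss with an independent binary decision per image-text pair, removing the global batch-wise normalization. A minimal sketch of the published loss (variable names are mine):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss from SigLIP. img_emb, txt_emb: L2-normalized
    [n, d] embeddings; t, b: learned temperature and bias scalars."""
    n = img_emb.size(0)
    logits = img_emb @ txt_emb.t() * t + b                    # [n, n] pair logits
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0   # +1 matched, -1 not
    return -F.logsigmoid(labels * logits).sum() / n
```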

World Models

  1. 🔒Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. NeurIPS 2024

Language Model

  1. 🔒How Much are LLMs Contaminated? arXiv. 2024
  2. 🔒ORPO: Monolithic Preference Optimization without Reference Model. arXiv, 2024
  3. 🔒The Curious Case of Neural Text Degeneration. ICLR 2020
  4. 🔒Fast Inference from Transformers via Speculative Decoding. ICLR 2023 Oral (A Hitchhiker’s Guide to Speculative Decoding; see the accept/reject sketch below)
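
Item 4's accept/reject rule is compact enough to write out: a draft model proposes k tokens, the target verifies them in one forward pass, and the first rejected position is resampled from the residual so the output is distributed exactly as the target model alone. A sketch assuming the per-position distributions are already computed:

```python
import torch

def accept_or_resample(p, q, draft_tokens):
    """Verification step of speculative decoding. p, q: target/draft
    next-token distributions at each drafted position, [k, vocab];
    draft_tokens: [k] tokens sampled from q. Accept token i with prob
    min(1, p[i, x] / q[i, x]); on first rejection, resample from the
    renormalized residual max(0, p[i] - q[i])."""
    for i, x in enumerate(draft_tokens.tolist()):
        if torch.rand(()) < p[i, x] / q[i, x]:        # accept (prob capped at 1)
            continue
        residual = (p[i] - q[i]).clamp(min=0)
        x_new = torch.multinomial(residual / residual.sum(), 1).item()
        return draft_tokens[:i].tolist() + [x_new]    # accepted prefix + fix-up
    # All k accepted; the full algorithm also samples one bonus token from p here.
    return draft_tokens.tolist()
```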