General VLM
- Rough ranking: BLIP-2 > BLIP ≈ OSCAR (detector-based) ≈ SimVLM ≈ LEMON (Scaling Up Vision-Language Pre-training for Image Captioning) ≈ UnifiedVL (Unifying Vision-and-Language Tasks via Text Generation) ≈ VIVO ≈ X-VLM (detector-based) > Flamingo
- Perceiver IO: A General Architecture for Structured Inputs & Outputs. ICLR 2022
- Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 (Perceiver-Resampler-based)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks ECCV20
- VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR21
- LEMON: Scaling Up Vision-Language Pre-training for Image Captioning CVPR22
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. ICLR22
- X-VLM: Multi-Grained Vision Language Pre-Training. ICML22
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. ICML22
- GIT: A Generative Image-to-text Transformer for Vision and Language. arXiv22
- X-model: Beyond a Pre-Trained Object Detector for Image Captioning. CVPR22
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. EMNLP22.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023
- LLaVA: Large Language and Vision Assistant. NeurIPS, 2023. (connector design sketched after this list)
- LLaVA 1.5: Improved Baselines with Visual Instruction Tuning. CVPR, 2024
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. CVPR24
- 🔒ShareGPT4V. ECCV 2024
- 🔒🚨TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv24
- 🔒🚨PaliGemma: A Versatile 3B VLM for Transfer. Google, arXiv24
- 🔒InternVL: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites CVPR24
- 🔒Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond arXiv24
- 🔒 Phi-3.5-vision-instruct (link)
- 🔒🚨AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
- 🔒🚨Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
- 🔒 MDPO: Conditional Preference Optimization for Multimodal Large Language Models. Jun, 24.
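Most of the recent entries above (BLIP-2, LLaVA, mPLUG-Owl2, Qwen-VL) share the same skeleton: a frozen vision encoder, a small connector, and an LLM. Below is a minimal sketch of a LLaVA-style connector; the class name and dimensions are illustrative assumptions, not taken from any one paper.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP mapping vision patch features into the LLM's token
    embedding space (LLaVA-1.5 uses an MLP; LLaVA-1 used a single linear)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):        # (B, num_patches, vision_dim)
        return self.proj(patch_feats)      # (B, num_patches, llm_dim)

# The projected "visual tokens" are prepended to the text embeddings and the
# whole sequence is fed to the LLM; typically only the projector (and
# optionally the LLM) is trained while the vision encoder stays frozen.
projector = VisionToLLMProjector()
vis_tokens = projector(torch.randn(2, 576, 1024))  # 576 patches ≈ ViT-L/14 @ 336px
```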
Image Captioning
- Tag2Text: Guiding Vision-Language Model via Image Tagging. ICLR, 2024
- SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. CVPR, 2023
- CaMEL: Mean Teacher Learning for Image Captioning. ICPR, 2022.
- Retrieval-augmented image captioning. ACL 2023
- ClipCap: CLIP Prefix for Image Captioning. arXiv 2021. (prefix mapping sketched after this list)
- 🔒I-tuning: Tuning language models with image for caption generation. ICASSP 2023
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. ICCV, 2023.
- With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023. 2
- FlexCap: Generating Rich, Localized, and Flexible Captions in Images. DeepMind & CMU, ICLR, 2024 submitted
- LocCa: Visual Pretraining with Location-aware Captioners. Google. NeurIPS 2024
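A minimal sketch of the ClipCap idea referenced above: a frozen CLIP image embedding is mapped to a short sequence of prefix embeddings for a frozen GPT-2. Layer sizes, prefix length, and the class name are illustrative assumptions, not the paper's exact mapping network.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Map a single CLIP image embedding to k prefix embeddings for GPT-2."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb):                      # (B, clip_dim)
        prefix = self.mlp(clip_emb)                   # (B, k * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

# The prefix embeddings are prepended to the caption token embeddings; only
# the mapper (and optionally GPT-2) is trained with the usual LM loss.
mapper = PrefixMapper()
prefix = mapper(torch.randn(4, 512))                  # (4, 10, 768)
```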
Vision-centric Improvement (Saining Xie)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. CVPR. 2024
- 🔒🚨Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. NeurIPS. 2024
- 🔒🚨Locality Alignment Improves Vision-Language Models ICLR 2025 submitted
- 🔒F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. ICLR2023
- 🔒Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation. ICCV, 23.
- 🔒Scaling language-image pretraining via masking. Kaiming, CVPR 2023. 264
- 🔒Visual In-Context Prompting. MS. CVPR24.
Region-based VLMs
- Graph-Based Captioning (GBC): Enhancing Visual Descriptions by Interconnecting Region Captions. arXiv 24
- GLaMM: Pixel Grounding Large Multimodal Model. CVPR 24
Region encoder
- 🔒ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts. CVPR2024.
- 🔒ControlCap: Controllable Region-level Captioning. ECCV2024
Hallucination
- 🔒Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding, ACL2024
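A toy sketch of the contrastive-decoding rule this line of work builds on: next-token logits conditioned on the real input are contrasted against logits from a perturbed input (a distorted image, or in the ACL 2024 paper above, a disturbed instruction), down-weighting tokens the model would emit regardless of the evidence. The scalar `alpha` and the choice of perturbation are assumptions; see the paper for its exact formulation.

```python
import torch

def contrastive_logits(logits_full, logits_perturbed, alpha=1.0):
    """logits_full / logits_perturbed: (vocab,) next-token logits from the
    same model on the true vs. perturbed input; returns adjusted logits."""
    return (1 + alpha) * logits_full - alpha * logits_perturbed
```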
Long / grounded / dense CLIP
- DCI (Densely Captioned Images): A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. Meta, CVPR 24
- GBC: Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions. Apple, arXiv24.
- PixelProse: From Pixels to Prose: A Large Dataset of Dense Image Captions. arXiv24
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. ECCV24
- Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models. NeurIPS 2023 Spotlight
- 🔒The Pyramid of Captions arxiv24
- 🔒🚨ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference. ECCV, 2024
- 🔒🚨Modeling Caption Diversity in Contrastive Vision-Language Pretraining. ICML 2024.
- 🔒🚨VeCLIP: Improving CLIP Training via Visual-enriched Captions. ECCV, 2024.
- 🔒CLIP with Quality Captions: A Strong Pretraining for Vision Tasks.
- 🔒EVA-CLIP: Improved Training Techniques for CLIP at Scale.
- 🔒SigLIP: Sigmoid Loss for Language Image Pre-training. ICCV 2023.
Caption evaluation metrics
- ARO: When and why vision-language models behave like bags-of-words, and what to do about it? ICLR 2023 Oral
- Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations EMNLP 2022
- DSG evaluation: Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. ICLR 2024.
- Prometheus-Vision. arXiv. 24
Semantic parsing
- Image Retrieval using Scene Graphs CVPR15
- Stanford Scene Graph Parser. EMNLP 2015
- SPICE: Semantic Propositional Image Caption Evaluation. ECCV 16. (tuple matching sketched after this list)
- Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. [python code] CVPR19
- 🔒🚨GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation. CMU. CVPR2024 → VQAScore
- 🔒🚨Benchmarking and Improving Detail Image Caption. arXiv, 2024
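For intuition on the SPICE entry above: captions are parsed into semantic tuples (objects, attributes, relations) and the candidate is scored by tuple F1 against the references. A toy sketch with exact-match tuples; real SPICE matches tuples produced by a scene-graph parser and allows WordNet synonyms.

```python
# Toy SPICE-style scoring; exact tuple matching stands in for the
# parser + synonym matching of the real metric.
def tuple_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    p = matched / len(cand) if cand else 0.0
    r = matched / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

ref = [("girl",), ("girl", "young"), ("girl", "standing-on", "field")]
cand = [("girl",), ("girl", "standing-on", "field")]
print(tuple_f1(cand, ref))  # 0.8
```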
Analysis of VLMs / Understanding VLMs
- What matters when building vision-language models? Hugging face. arXiv, 2024.
- Image Captioners Are Scalable Vision Learners Too. Google, NeurIPS 2023.
- S2-Wrapper: When Do We Not Need Larger Vision Models?
- 🔒Long-CLIP Unlocking the Long-Text Capability of CLIP ECCV24
- 🔒🔥MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Apple, ECCV, 2024.
- 🔒🔥MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. Apple, arXiv, 2024.
- 🔒Prismatic vlms: Investigating the design space of visually-conditioned language models. ICML 2024.
- 🔒VILA: On Pre-training for Visual Language Models. CVPR 2024. 122
- 🔒Vision Transformers Need Registers. Meta ICLR 2024 Oral
- 🔒Demystifying CLIP Data. Meta. ICLR 2024 Spotlight
- 🔒Bridging Vision and Language Spaces with Assignment Prediction. Naver, ICLR, 2024.
- 🔒🔥An Introduction to Vision-Language Modeling. Meta, arXiv, 2024.
- 🔒MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models. arXiv, 2024.
- 🔒Natural Language Inference Improves Compositionality in Vision-Language Models. arXiv, 2024.
In-context learning, In-the-loop, Teacher forcing, Training resolutions
- Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023. 9.
- 🔒Compositional Chain-of-Thought Prompting for Large Multimodal Models. CVPR24. (two-step recipe sketched after this list)
- 🔒Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. NAACL 2024.
- 🔒Improve Vision Language Model Chain-of-thought Reasoning. CMU. arXiv. 2024
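A toy sketch of the two-step compositional chain-of-thought recipe from the CVPR 2024 entry above: first elicit a scene graph from the VLM, then condition the final answer on it. `vlm_generate` is a hypothetical stand-in for any chat-style VLM API, and the prompt wording is my own.

```python
def compositional_cot(image, question, vlm_generate):
    # Step 1: zero-shot scene-graph generation from the image.
    sg_prompt = ("List the objects in the image, their attributes, and "
                 "their pairwise relations as a scene graph.")
    scene_graph = vlm_generate(image, sg_prompt)
    # Step 2: answer the question conditioned on the scene graph.
    answer_prompt = (f"Scene graph:\n{scene_graph}\n"
                     f"Using the scene graph and the image, answer: {question}")
    return vlm_generate(image, answer_prompt)
```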
Video
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. ICLR 2024 spotlight
- Simple LLM Framework for Long-Range Video Question-Answering. arXiv, Dec, 2023
- LaViLa: Learning Video Representations from Large Language Models. Facebook, CVPR, 2023
- EgoVLP: Egocentric Video-Language Pretraining. NeurIPS, 2022
- Distilling Vision-Language Models on Millions of Videos. CVPR 2024
- VideoCon: Robust Video-Language Alignment via Contrast Captions. Google, CVPR 2024. 5
- V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video. Meta. TMLR 2024
- FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. arXiv, 2024.
- 🔒FrozenBiLM. NeurIPS 22. 144. (cited by LVQA.)
- 🔒Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval. CVPR, 2024.
- 🔒PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning. arXiv, Apr 2024. (SOTA!)
- 🔒InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
- 🔒UMT (Unmasked Teacher): Towards Training-Efficient Video Foundation Models. ICCV 2023 Oral
- 🔒Video-LLaVA. EMNLP, 2024
- 🔒ShareGPT4Video. NeurIPS 2024
LLM distillations
- awesome github / survey / survey on LLM specialization
- MiniLLM: Knowledge Distillation of Large Language Models. ICLR, 2024. 71.
- Sequence-Level Knowledge Distillation. EMNLP 2016. (sketched after this list)
- 🔒Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. ACL, 2023. 174.
- 🔒Beyond one-model-fits-all: A survey of domain specialization for large language models. 23.
- 🔒Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks. NeurIPS 2023. 15.
- 🔒DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS_W 2019
- 🔒LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ACL24
- 🔒Contrastive Decoding: Open-ended Text Generation as Optimization. ACL 2023. 200
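A minimal sketch of sequence-level KD (Kim & Rush, EMNLP 2016, above): instead of matching per-token soft logits, the student is trained with ordinary cross-entropy on sequences decoded by the teacher. The HuggingFace-style `generate`/`labels` interface is an assumption, and for brevity the loss here also covers the prompt tokens, which the real recipe would mask out.

```python
import torch

def seq_kd_step(teacher, student, tokenizer, prompts, optimizer):
    # 1) Teacher decodes pseudo-targets (beam or greedy in the original).
    enc = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        pseudo = teacher.generate(**enc)
    # 2) Student is fit to the teacher's outputs with standard CE loss.
    out = student(input_ids=pseudo, labels=pseudo)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```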
VLM distillations
- 🔒🚨EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning. ACL-Finding, 2023
- 🔒🚨PromptKD: Unsupervised Prompt Distillation for Vision-Language Models. CVPR, 2024
- 🔒🚨CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR24
- 🔒🚨Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models. ECCV24
- 🔒🚨Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models. ECCV24
- 🔒🚨SILC: Improving Vision Language Pretraining with Self-Distillation. ECCV2024
- 🔒Compositional Chain-of-Thought Prompting for Large Multimodal Models CVPR24
- 🔒Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models CVPR24
- 🔒Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS, 2021.
- 🔒Compressing Visual-linguistic Model via Knowledge Distillation. ICCV, 2021.
Attention-Score Distillation
- 🔒Knowledge distillation method for better vision-language models. AMAZON, article
- 🔒TinyBERT: Distilling BERT for Natural Language Understanding. EMNLP 2020. 1876. (attention-map loss sketched after this list)
- 🔒Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation. arXiv 2024.
- 🔒Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching. AAAI 2021. 130.
- 🔒Frequency Attention for Knowledge Distillation. WACV 2024.
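A minimal sketch of attention-score distillation in the TinyBERT style (see the entry above): student attention matrices are regressed onto the teacher's with MSE under a fixed student-to-teacher layer mapping. It assumes matching head counts and sequence lengths; the layer map shown is illustrative.

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attns, teacher_attns, layer_map):
    """student_attns / teacher_attns: lists of (B, heads, T, T) tensors."""
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_attns[s_idx], teacher_attns[t_idx])
    return loss / len(layer_map)

# e.g. a 4-layer student mimicking every 3rd layer of a 12-layer teacher
layer_map = {0: 2, 1: 5, 2: 8, 3: 11}
```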
Image distillations
- 🔒🚨Distilling Internet-Scale Vision-Language Models into Embodied Agents. DeepMind, ICML 2023. 19.
- 🔒Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. arXiv, 2022. 136.
Efficiency of CLIP (presented in RECLIP):
- RECLIP: Resource-Efficient Clip by Training with Small Images. TMLR, 2023.
- Data-Efficient Multimodal Fusion on a Single GPU. CVPR, 2024. Highlight
- 🔒LiT: Zero-shot transfer with locked-image text tuning. CVPR, 2022. (precompute image features, pretrained classifiers)
- 🔒FILIP: Fine-grained interactive language-image pre-training. ICLR, 2022. (masked images, multi-view supervisions)
- 🔒Sigmoid loss for language image pre-training (SigLIP). ICCV oral, 2023. (sigmoid loss; sketched after this list)
- 🔒Maskclip: Masked self-distillation advances contrastive language-image pretraining. CVPR 2023.
- 🔒Supervision exists everywhere: A data-efficient contrastive language-image pre-training paradigm. ICLR 2022. (multi-view supervisions)
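Since the SigLIP entry above hinges on it, here is a minimal sketch of the pairwise sigmoid loss: every image-text pair becomes an independent binary classification, so no batch-wide softmax (and hence no all-gather of the full similarity matrix) is needed. `t` is the learnable log-temperature and `b` the learnable bias, both scalars in the paper.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (B, D), L2-normalized; t: log-temperature, b: bias."""
    logits = img_emb @ txt_emb.T * t.exp() + b                       # (B, B)
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```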
World Models
- 🔒Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. NeurIPS 2024
Language Model
- 🔒How Much are LLMs Contaminated? arXiv. 2024
- 🔒ORPO: Monolithic Preference Optimization without Reference Model. arXiv, 2024
- 🔒The Curious Case of Neural Text Degeneration. ICLR 2020
- 🔒Fast Inference from Transformers via Speculative Decoding. ICLR 2023 Oral (A Hitchhiker’s Guide to Speculative Decoding). (accept/reject loop sketched after this list)
- 🔒🚨 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
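A toy sketch of one verification round of speculative decoding (ICLR 2023 Oral, above): a small draft model proposes `gamma` tokens, the target model scores all of them in a single forward pass, each token is accepted with probability min(1, p_target/p_draft), and the first rejection is resampled from the clipped residual distribution. Tensor shapes and names are assumptions; the full algorithm also samples one bonus token from the target when every draft token is accepted.

```python
import torch

def accept_or_resample(draft_tokens, p_draft, p_target):
    """draft_tokens: (gamma,) int tensor; p_draft, p_target: (gamma, vocab)
    next-token probability distributions at each drafted position."""
    for i, tok in enumerate(draft_tokens):
        ratio = (p_target[i, tok] / p_draft[i, tok]).clamp(max=1.0)
        if torch.rand(()) < ratio:
            continue                                   # accept drafted token i
        residual = (p_target[i] - p_draft[i]).clamp(min=0.0)
        residual = residual / residual.sum()           # renormalized residual
        fix = torch.multinomial(residual, 1).item()    # resample position i
        return draft_tokens[:i].tolist() + [fix]
    return draft_tokens.tolist()                       # all gamma accepted
```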