OpenGVLab

74 repositories

InternVideo
Public
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
benchmark action-recognition video-understanding video-data self-supervised multimodal video-dataset open-set-recognition video-retrieval video-question-answering
Python • Apache License 2.0 • 97 • 1.6k • 99 • 4 • Updated Jan 22, 2025
VisionLLM
Public
VisionLLM Series
object-detection large-language-models generalist-model
Python • Apache License 2.0 • 33 • 981 • 15 • 0 • Updated Jan 22, 2025
InternImage
Public
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
backbone semantic-segmentation deformable-convolution foundation-model object-detection
Python • MIT License • 241 • 2.6k • 186 • 5 • Updated Jan 20, 2025
Ask-Anything
Public
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
chat video gradio big-model video-understanding captioning-videos video-question-answering foundation-models large-model large-language-models
Python • MIT License • 257 • 3.1k • 68 • 5 • Updated Jan 18, 2025
STM-Evaluation
Public
Python • MIT License • 6 • 70 • 1 • 0 • Updated Jan 18, 2025
VideoChat-Flash
Public
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Python • MIT License • 3 • 270 • 1 • 0 • Updated Jan 17, 2025
PIIP
Public
[NeurIPS 2024 Spotlight ⭐️] Parameter-Inverted Image Pyramid Networks (PIIP)
computer-vision image-classification object-detection semantic-segmentation instance-segmentation vision-transformer multimodal-large-language-models vision-language-models
Python • MIT License • 2 • 81 • 0 • 0 • Updated Jan 15, 2025
vinci
Public
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Python • 2 • 37 • 2 • 1 • Updated Jan 13, 2025
TimeSuite
Public
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
temporal-grounding long-video-understanding
MIT License • 0 • 8 • 0 • 0 • Updated Jan 5, 2025
TPO
Public
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Python • 1 • 39 • 1 • 0 • Updated Jan 2, 2025
InternVL
Public
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
image-classification gpt multi-modal semantic-segmentation video-classification image-text-retrieval llm vision-language-model gpt-4v vit-6b
Python • MIT License • 524 • 6.9k • 147 • 3 • Updated Dec 25, 2024
PVC
Public
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Python • MIT License • 0 • 22 • 2 • 0 • Updated Dec 18, 2024
V2PE
Public
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Python • MIT License • 1 • 26 • 0 • 0 • Updated Dec 13, 2024
VLMEvalKit_InternVL2_5
Public
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
Python • Apache License 2.0 • 245 • 0 • 0 • 0 • Updated Dec 9, 2024
Hulk
Public
An official implementation of "Hulk: A Universal Knowledge Translator for Human-Centric Tasks"
Python • MIT License • 4 • 116 • 14 • 0 • Updated Dec 4, 2024
MM-NIAH
Public
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
benchmark long-context vision-language-model multimodal-large-language-models
Python • 6 • 109 • 1 • 0 • Updated Nov 25, 2024
OmniCorpus
Public
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Python • 7 • 305 • 0 • 0 • Updated Nov 17, 2024
GUI-Odyssey
Public
GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos.
Python • 4 • 82 • 2 • 0 • Updated Nov 12, 2024
Vision-RWKV
Public
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Python • Apache License 2.0 • 16 • 398 • 19 • 0 • Updated Oct 31, 2024
.github
Public
1 • 0 • 0 • 0 • Updated Oct 30, 2024
OV-OAD
Public
This repo takes an initial step towards leveraging text learning for online action detection without explicit human supervision.
1 • 1 • 0 • 0 • Updated Oct 28, 2024
InternVL-MMDetSeg
Public
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
object-detection semantic-segmentation vision-foundation
Jupyter Notebook • 6 • 74 • 1 • 0 • Updated Oct 25, 2024
PhyGenBench
Public
Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"
Python • 1 • 80 • 3 • 0 • Updated Oct 25, 2024
VideoMAEv2
Public
[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
video-understanding action-detection self-supervised-learning temporal-action-detection foundation-model cvpr2023 action-recognition
Python • MIT License • 64 • 566 • 16 • 0 • Updated Oct 8, 2024
EfficientQAT
Public
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Python • 18 • 240 • 5 • 0 • Updated Oct 8, 2024
OmniQuant
Public
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
quantization large-language-models llm
Python • MIT License • 59 • 755 • 25 • 1 • Updated Oct 8, 2024
MMIU
Public
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Python • 2 • 58 • 3 • 0 • Updated Sep 14, 2024
ChartAst
Public
[ACL 2024] ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning.
Python • Other • 9 • 110 • 7 • 0 • Updated Sep 7, 2024
EgoExoLearn
Public
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
Python • MIT License • 0 • 51 • 2 • 0 • Updated Sep 3, 2024
InternGPT
Public
InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
sam click vqa image-captioning llama gpt gradio husky multimodal video-generation
Python • Apache License 2.0 • 232 • 3.2k • 19 • 1 • Updated Aug 20, 2024