This project explores using vision-language models (VLMs) as decision-making agents for playing Super Mario Bros through reinforcement learning. We implement two distinct approaches: a PPO-trained vision-language model and a traditional Double Deep Q-Network (DDQN) for comparison.
Our project implements two complementary approaches. The first is the VLM agent:
- Model: Qwen2.5-VL-3B-Instruct vision-language model
- Fine-tuning: LoRA (Low-Rank Adaptation) with 4-bit quantization
- Training: Proximal Policy Optimization (PPO) with custom policy/value heads
- Input: Game screenshots (base64 encoded) + game state information
- Output: Action selection with value estimation
The VLM agent processes both visual game screenshots and textual game state information to make informed decisions. The model is fine-tuned using LoRA adapters and trained with PPO to learn optimal Mario gameplay policies.
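
To make the "custom policy/value heads" concrete, here is a minimal sketch of how action logits and a value estimate can be produced from the VLM's hidden states. The class name `PolicyValueHead`, the mean-pooling step, and the module sizes are illustrative assumptions, not the project's exact implementation.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    """Hypothetical policy/value heads over pooled VLM hidden states."""

    def __init__(self, hidden_size: int, num_actions: int):
        super().__init__()
        self.policy_head = nn.Linear(hidden_size, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_size, 1)              # state-value estimate

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the VLM's last layer.
        # Mean-pool over the sequence to get one feature vector per observation.
        features = hidden_states.mean(dim=1)
        logits = self.policy_head(features)
        value = self.value_head(features).squeeze(-1)
        return logits, value
```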
The second is a DDQN baseline:
- Architecture: Convolutional neural network with experience replay
- Algorithm: Double Deep Q-Network (DDQN)
- Input: Preprocessed game frames
- Purpose: Baseline comparison for the VLM approach (a minimal sketch follows below)
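
For reference, a minimal PyTorch sketch of the baseline's core pieces. The network name `MarioQNetwork`, the assumed 4x84x84 frame stack, and the layer widths are illustrative; the `ddqn_target` function shows the defining Double DQN idea, where the online network selects the next action and the target network evaluates it.

```python
import torch
import torch.nn as nn

class MarioQNetwork(nn.Module):
    """Small CNN over stacked, preprocessed frames (assumed shape 4 x 84 x 84)."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)  # Q-values, shape (batch, num_actions)

def ddqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Double DQN target: online net picks the action, target net evaluates it."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done.float()) * next_q
```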
To set up the project, install the required dependencies:

```bash
pip install -r requirements.txt
```

Then ensure CUDA is available for GPU acceleration (required for the VLM approach).
To run the VLM agent:

```bash
cd llm_mario
python main.py --train --display --wandb  # Training mode with visualization and logging
python main.py --display                  # Evaluation mode
```

To run the DDQN baseline:

```bash
cd ddqn
python main.py  # Run DDQN training
```

Key features:
- Multi-modal input: Combines visual screenshots with structured game state
- Memory efficient: 4-bit quantization with gradient checkpointing
- Advanced training: PPO with Generalized Advantage Estimation (GAE); a minimal GAE sketch appears after this list
- Monitoring: Weights & Biases integration for experiment tracking
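
The sketch below illustrates GAE over one collected rollout. The function name `compute_gae`, the buffer layout (plain Python lists with one extra bootstrap value), and the gamma/lambda values are assumptions for illustration, not the project's exact hyperparameters.

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages and discounted returns for one rollout.

    rewards, dones: lists of length T; values: list of length T + 1
    (the extra entry is the bootstrap value of the final state).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = [adv + v for adv, v in zip(advantages, values[:-1])]
    return advantages, returns
```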
Technical details:
- Base Model: Qwen2.5-VL-3B-Instruct (3 billion parameters)
- Adaptation: LoRA fine-tuning (rank 16, alpha 32)
- Policy Network: Custom policy/value heads on top of VLM features
- Environment: Super Mario Bros 1-1 with action space simplification (configuration sketched below)
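
A hedged sketch of how these pieces might be wired together. The quantization settings, LoRA dropout and target modules, and the use of `SIMPLE_MOVEMENT` (a common 7-action simplified joypad set) are assumptions, not the project's exact configuration; the rank/alpha values match the ones listed above.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# 4-bit quantization config used when loading the base VLM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter config matching the rank/alpha noted above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                    # assumed value
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Super Mario Bros 1-1 with a simplified action space.
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)
```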
Team:
- Ishan Khare
- Gabriel Seir
- Anthony Zhan
