A robotic navigation framework that integrates Vision Language Models (VLMs) for high-level reasoning, object detection, and path planning. The system enables a Raspberry Pi-based robot (SunFounder PiCar-X) to understand its environment visually and execute natural language instructions (e.g., "Find the yellow rubber duck") using models like Google Gemini or local open-weight models via Ollama.
- Brain-Client Architecture:
- Client (PiCar-X): Lightweight Python agent handling motor control, camera streaming, and sensor readings.
- Server (Brain): High-performance ZeroMQ server running state-of-the-art VLMs for cognitive decision making.
- Unified VLM Backend:
- Supports Google Gemini 1.5/3 Pro (Cloud) for high-speed, accurate reasoning.
- Supports Ollama (Local) for offline, privacy-focused inference (e.g., LLaVA).
- Uses an OpenAI-compatible API format for easy model swapping.
- Rich Interaction:
- Movement: Navigate with obstacle avoidance and dynamic steering.
- Vision: Scan the room by moving the camera head (independent of the chassis).
- Speech: Speak to users or ask questions using Text-to-Speech (TTS).
- Robot: SunFounder PiCar-X (Raspberry Pi 4 Model B)
- Host Machine: Any PC/Laptop (Windows/Linux/Mac) capable of running Python 3.8+
- Camera: Standard Pi Camera Module or USB Webcam
VLMxRobot/
├── client-pi/ # Code running on the Robot (Raspberry Pi)
│ ├── actions.py # Motor & Servo control logic
│ ├── camera.py # Camera streaming
│ ├── client.py # Main ZMQ client loop
│ └── ...
├── server-VLM/ # Code running on the Host (Brain)
│ ├── server.py # Main ZMQ server & VLM handler
│ ├── config.py # System configurations and prompt loading
│ ├── core/ # Core modules (network, memory, models, actions)
│ └── ...
└── README.md # This file
Run this on your host machine (PC/Laptop).
cd server-VLM
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Configure API Keys
cp .env-example .env
# Edit .env to add your VLM_API_KEY (e.g., Google API Key) or configure Ollama
# Start the Server
python server.pyRun this on the Raspberry Pi.
cd client-pi
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Configure Connection
cp .env-example .env
# Edit .env and set SERVER_IP to your host machine's IP address
# Start the Robot
python client.py- Capture: The robot captures an image of its current view.
- Transmit: The image is sent over Wi-Fi via ZeroMQ to the server.
- Analyze: The VLM server analyzes the image against the user's instruction.
- Plan: The VLM generates a structured JSON command (e.g.,
{ "action": "look_left", "angle": 30 }). - Execute: The robot receives the JSON, parses it, and executes the physical movement.
- Loop: The cycle repeats until the task is successfully completed.
For more detailed information, check the specific module readmes: