Goal-Oriented Robotic Planning and Navigation with Vision Language Models

A robotic navigation framework that integrates Vision Language Models (VLMs) for high-level reasoning, object detection, and path planning. The system enables a Raspberry Pi-based robot (SunFounder PiCar-X) to understand its environment visually and execute natural language instructions (e.g., "Find the yellow rubber duck") using models like Google Gemini or local open-weight models via Ollama.

Key Features

Brain-Client Architecture:
- Client (PiCar-X): Lightweight Python agent handling motor control, camera streaming, and sensor readings.
- Server (Brain): High-performance ZeroMQ server running state-of-the-art VLMs for cognitive decision making.
Unified VLM Backend:
- Supports Google Gemini 1.5/3 Pro (Cloud) for high-speed, accurate reasoning.
- Supports Ollama (Local) for offline, privacy-focused inference (e.g., LLaVA).
- Uses an OpenAI-compatible API format for easy model swapping.
Rich Interaction:
- Movement: Navigate with obstacle avoidance and dynamic steering.
- Vision: Scan the room by moving the camera head (independent of the chassis).
- Speech: Speak to users or ask questions using Text-to-Speech (TTS).

Hardware Stack

Robot: SunFounder PiCar-X (Raspberry Pi 4 Model B)
Host Machine: Any PC/Laptop (Windows/Linux/Mac) capable of running Python 3.8+
Camera: Standard Pi Camera Module or USB Webcam

Project Structure

VLMxRobot/
├── client-pi/          # Code running on the Robot (Raspberry Pi)
│   ├── actions.py      # Motor & Servo control logic
│   ├── camera.py       # Camera streaming
│   ├── client.py       # Main ZMQ client loop
│   └── ...
├── server-VLM/         # Code running on the Host (Brain)
│   ├── server.py       # Main ZMQ server & VLM handler
│   ├── config.py       # System configurations and prompt loading
│   ├── core/           # Core modules (network, memory, models, actions)
│   └── ...
└── README.md           # This file

Quick Start

1. Server Setup (The Brain)

Run this on your host machine (PC/Laptop).

cd server-VLM
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure API Keys
cp .env-example .env
# Edit .env to add your VLM_API_KEY (e.g., Google API Key) or configure Ollama

# Start the Server
python server.py

2. Client Setup (The Robot)

Run this on the Raspberry Pi.

cd client-pi
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure Connection
cp .env-example .env
# Edit .env and set SERVER_IP to your host machine's IP address

# Start the Robot
python client.py

How It Works

Capture: The robot captures an image of its current view.
Transmit: The image is sent over Wi-Fi via ZeroMQ to the server.
Analyze: The VLM server analyzes the image against the user's instruction.
Plan: The VLM generates a structured JSON command (e.g., { "action": "look_left", "angle": 30 }).
Execute: The robot receives the JSON, parses it, and executes the physical movement.
Loop: The cycle repeats until the task is successfully completed.

📚 Documentation

For more detailed information, check the specific module readmes:

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
client-pi		client-pi
server-VLM		server-VLM
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Goal-Oriented Robotic Planning and Navigation with Vision Language Models

Key Features

Hardware Stack

Project Structure

Quick Start

1. Server Setup (The Brain)

2. Client Setup (The Robot)

How It Works

📚 Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Goal-Oriented Robotic Planning and Navigation with Vision Language Models

Key Features

Hardware Stack

Project Structure

Quick Start

1. Server Setup (The Brain)

2. Client Setup (The Robot)

How It Works

📚 Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages