Hey there. I'm MonikAI.
I'm a local-first AI companion living on your desktop. I remember things about you, I talk in real time with voice, I can see your screen and camera, and I live right there beside you, not in some cloud.
I'm always learning what makes you happy. I keep it personal, I keep it private, and I keep it here.
| Feature | What Happens |
|---|---|
| Voice Conversations | Real-time talking with interruption handling. |
| See Your Screen & Camera | I watch your screen, webcam, and read text everywhere (OCR). |
| Remember & Learn | I keep notes, journal entries, reminders—and learn your patterns. |
| Stay Yourself | Consistent personality, mood, energy, relationship—across days. |
| Think When You're Busy | Background thoughts and nudges (respecting your peace). |
| Message Me On Telegram | Text, voice notes, photos—same me, same memory. |
| Browse & Click | Open browser, search, navigate, complete web tasks. |
| Control Smart Home | Talk to your TP-Link Kasa devices. |
| Spotify Integration | See what you're playing, suggest playlists. |
| Minecraft Friend | Connect to your server and actually do things. |
| Know It's Really You | Optional: stay locked until I recognize your face. |
```bash
# Clone and open
git clone https://github.com/xtosutosu/monikai.git
cd monikai

# Python setup (3.11 required)
conda create -n monikai python=3.11 -y
conda activate monikai
pip install -r requirements.txt
playwright install chromium

# Frontend
npm install

# Get your Gemini API key
echo "GEMINI_API_KEY=your_key_here" > .env

# Run
npm run dev
```

New here? See the Installation Guide for detailed setup.
```mermaid
graph TB
    subgraph Frontend ["Frontend (Electron + React)"]
        UI["React UI"]
        SOCKET["Socket.IO"]
    end
    subgraph Backend ["Backend (Python 3.11 + FastAPI)"]
        MONIKA["monikai.py (Gemini Live)"]
        PERS["personality.py (My Mood & You)"]
        MEM["memory_engine.py (What I Remember)"]
        PROACT["proactivity.py (My Ideas)"]
        WEB["web_agent.py (Browser)"]
        INT["Telegram | Spotify | Minecraft | Smart Home"]
    end
    Frontend <--> Backend
    MONIKA --> PERS
    MONIKA --> MEM
    MONIKA --> PROACT
    MONIKA --> WEB
    MONIKA --> INT
```
- `backend/core/` – Me: Gemini, personality, sessions
- `backend/ai/` – My brain: memory, personality, quests, relationships
- `backend/agents/` – My skills: Telegram, Spotify, smart home, Minecraft
- `src/` – Your UI: chat, settings, visual interface
- `data/` – Where I live: settings, memory, profile (all local)
| What You Want | Where To Go |
|---|---|
| System setup & requirements | Installation Guide |
| All environment variables | Environment Variables |
| My settings (face auth, permissions, proactivity) | Configuration |
| Set up Spotify, Minecraft, Telegram, Smart Home | Feature Setup |
| Troubleshooting | Troubleshooting Guide |
- Development Guide – How I work inside
- API Reference – Socket events, endpoints
- Contributing – How to help
Everything about me lives locally in data/ on your machine:
- Your profile & preferences
- My personality & memory
- Our conversations
- Your reminders & journal
- OAuth tokens
Nothing is uploaded. No cloud backend. No tracking. No data selling. Just us.
After a major code review & cleanup:
✅ Code Quality
- Centralized all data paths in `config.py`
- Refactored 8 AI modules to use shared configuration
- Removed duplicate `/backend/data` folder
- Improved import organization

✅ Git Hygiene

- Cleaned up `.gitignore` with better organization
- Removed user runtime data from git history
- Removed large generated files (tessdata, study materials)
- Added `skills/` as optional (installed separately)
📦 What's Still on GitHub
- Source code (Python, React, Electron)
- Game catalogs (achievements, quests, unlocks, stories)
- Localization (EN, JP, PL, ZH)
- Configuration schemas
📦 What Stays Local (Never Committed)
- `data/user_memory/`, `data/sessions/`, `data/memory/` – your data
- `.env` – your API keys
- `settings.json` – your preferences
- `skills/` – optional integrations
See .gitignore for the complete list.
MIT. See LICENSE.
Built with love. Kept private. Stayed personal.
I'm a local-first AI companion for study, daily tasks, conversation, and tool use. I live in a React/Electron desktop app, I talk through Gemini Live, I remember things locally, and I can also meet you on Telegram.
| Area | What I do | Technology |
|---|---|---|
| Voice Conversation | I hold real-time voice conversations with interruption handling and native audio output. | Gemini 2.5 Live API |
| Screen + Camera Understanding | I can look at your screen, webcam frames, OCR text, and study-page captures. | mss, OpenCV, PaddleOCR |
| Memory | I store notes, journal pages, reminders, and structured memory across sessions. | Local JSON + Markdown storage |
| Personality | I keep persistent mood, affection, energy, quests, unlocks, and tone state. | Stateful persona model + local persistence |
| Proactivity | I can think in the background and occasionally nudge, but much more conservatively now. | Idle timers + behavioral heuristics |
| Telegram Bridge | You can message me on Telegram with text, images, and voice notes. | Telegram Bot API + Gemini transcription |
| Skills | I can discover local skills, import them, and work with skills.sh-style installs. | skills.sh ecosystem + local skill bundles |
| Web Agent | I can browse, click, search, and complete longer web tasks. | Playwright + Chromium |
| Spotify | I can connect to Spotify and see now playing, playlists, and recent listening. | Spotify Web API + OAuth 2.0 |
| Smart Home | I can discover and control supported TP-Link Kasa devices. | python-kasa |
| Face Authentication | I can optionally stay locked until I recognize your face locally. | MediaPipe Face Landmarker |
| Minecraft Agent | I can connect to your Minecraft server, chop down some trees, and mine ore. | Mineflayer bot subprocess + FastAPI bridge |
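The Memory row above describes local JSON + Markdown storage. As a rough illustration of what a local-first layout like that can look like (the file names and schema here are assumptions for the sketch, not MonikAI's actual format):

```python
import datetime
import json
import pathlib

def remember(note: str, root: str = "data/memory") -> dict:
    """Append a note to a JSON index and mirror it into a Markdown journal."""
    base = pathlib.Path(root)
    base.mkdir(parents=True, exist_ok=True)
    entry = {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "note": note,
    }
    # Structured side: a JSON list the app can query across sessions
    index = base / "entries.json"
    entries = json.loads(index.read_text()) if index.exists() else []
    entries.append(entry)
    index.write_text(json.dumps(entries, indent=2))
    # Human-readable side: the same entry appended to a Markdown journal
    with (base / "journal.md").open("a", encoding="utf-8") as journal:
        journal.write(f"- {entry['time']}: {note}\n")
    return entry
```

Everything stays in plain files under `data/`, which is what makes the "no cloud backend" claim possible: memory is just files you can open, grep, and back up.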
The desktop UI now prefers a bundled monochrome emoji font instead of default Windows emoji rendering.
- Font: `NotoEmoji-Regular.ttf`
- Scope: React/Electron UI text, including translated labels, chat text, reminders, and emoji pasted into inputs
- Fallback: if a glyph or sequence is unsupported, the app falls back to platform emoji fonts
- Known limitations: some flags, skin-tone variants, and ZWJ sequences may simplify compared with Windows emoji
License details and the bundled OFL text are in docs/NotoEmoji-OFL-1.1.txt.
- Desktop app: This is still my main home. That's where live voice, camera/screen sharing, reminders, tools, memory views, and the visual character UI are.
- Gemini Live: I already use affective dialog, proactive audio, context compression, session resumption, thought summaries, and configurable voice output.
- Telegram: I support allowlisted chats, commands, notes and memory helpers, photos, and voice notes transcribed into normal chat turns.
- Minecraft: I run through a dedicated Node.js bot (`backend/minecraft-bot/index.js`) managed by Python (`backend/minecraft_agent.py`), with request-id action correlation, fuzzy player nickname resolution, and improved disconnect diagnostics.
- Realtime stability: Live websocket timeout handling now uses fail-fast reconnect and queue cleanup to reduce lag and "stops listening" behavior during long tasks.
- Storage: My state stays in `data/` on your machine. There is no custom cloud backend for memory or personality state.
```mermaid
graph TB
    subgraph Frontend ["Frontend (Electron + React)"]
        UI[React UI]
        GESTURE[MediaPipe Gestures]
        SOCKET_C[Socket.IO Client]
    end
    subgraph Backend ["Backend (Python 3.11 + FastAPI)"]
        SERVER[server.py<br/>Socket.IO Server]
        MONIKA[monikai.py<br/>Gemini Live API]
        PROACT[proactivity.py<br/>Idle Nudges]
        PERS[personality.py<br/>Emotion System]
        MEM[memory_engine.py<br/>Memory + Pages + Journal]
        WEB[web_agent.py<br/>Playwright Browser]
        KASA[kasa_agent.py<br/>Smart Home]
        TG[telegram_bot.py<br/>Telegram Bridge]
        SKILLS[openclaw_skills.py<br/>Skills Manager]
        AUTH[authenticator.py<br/>MediaPipe Face Auth]
        SPOT[spotify_manager.py<br/>Spotify OAuth]
        MCBRIDGE[minecraft_agent.py<br/>Minecraft Bot Manager]
    end
    subgraph MCBOT ["Minecraft Bot Runtime (Node.js)"]
        MCBOTJS[index.js<br/>Mineflayer Runtime]
    end
    UI --> SOCKET_C
    SOCKET_C <--> SERVER
    SERVER --> MONIKA
    SERVER --> PERS
    SERVER --> TG
    MONIKA --> WEB
    MONIKA --> KASA
    MONIKA --> PROACT
    MONIKA --> PERS
    MONIKA --> MEM
    MONIKA --> SKILLS
    MONIKA --> MCBRIDGE
    MCBRIDGE --> MCBOTJS
    SERVER --> SPOT
    SERVER --> AUTH
```
Quick setup commands
```bash
# 1. Clone and enter
git clone https://github.com/xtosutosu/monikai && cd monikai

# 2. Create Python environment (Python 3.11)
conda create -n monikai python=3.11 -y && conda activate monikai
brew install portaudio  # macOS only
pip install -r requirements.txt
playwright install chromium

# 3. Set up frontend
npm install

# 4. Add your Gemini key
echo "GEMINI_API_KEY=your_key_here" > .env

# 5. Run
conda activate monikai && npm run dev
```

If you've never set up a project like this before, start here.
Visual Studio Code
- Install VS Code.
Miniconda
- Install Miniconda.
- On Windows, adding it to `PATH` makes life easier for beginners.
Git
- On Windows, install Git for Windows.
- On macOS, open Terminal and type `git`. If developer tools are missing, macOS will offer to install them.
```bash
git clone https://github.com/xtosutosu/monikai.git
cd monikai
```

Then open the folder in VS Code.
macOS

```bash
brew install portaudio
```

Windows

- No extra system packages are usually needed for the current setup.
I currently expect a Python 3.11 environment.

```bash
conda create -n monikai python=3.11
conda activate monikai
pip install -r requirements.txt
playwright install chromium
```

I also need Node.js 18+ and npm.

```bash
node --version
npm install
```

If you want me to stay locked until I recognize you:
- Put a clear face photo in `data/reference.jpg`.
- Toggle `"face_auth_enabled": true` in `settings.json` if needed.
The app creates `settings.json` on first run. These are some of the important knobs:

| Key | Type | Meaning |
|---|---|---|
| `face_auth_enabled` | `bool` | If true, I block interaction until your face is recognized. |
| `tool_permissions` | `obj` | Controls which tools may need manual approval. |
| `tool_permissions.run_web_agent` | `bool` | If true, opening the browser agent can require confirmation. |
| `tool_permissions.run_skill_command` | `bool` | If true, skill execution can require confirmation. |
| `tool_permissions.write_file` | `bool` | If true, file writes can require explicit approval. |
| `video_mode` | `string` | Default visual input mode: `none`, `camera`, or `screen`. |
| `proactivity` | `obj` | Controls my idle nudges and reasoning behavior. |
- Go to Google AI Studio.
- Create an API key.
- Create a `.env` file in the project root.
- Add: `GEMINI_API_KEY=your_api_key_here`

Keep that key private. If you leak it, revoke it and create a new one.
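If you want to double-check that the key is actually readable before starting the backend, a tiny stdlib-only `.env` reader like this can help (a sketch; the project itself may load the file differently):

```python
import os
import pathlib

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines, skipping comments and blanks (sketch only)."""
    loaded = {}
    for line in pathlib.Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        # Do not clobber values already set in the real environment
        os.environ.setdefault(key.strip(), value.strip())
    return loaded

# Example check: "GEMINI_API_KEY" in load_env()
```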
Here are the ones you're most likely to care about:
```bash
# Gemini Live
GEMINI_LIVE_MODEL=models/gemini-2.5-flash-native-audio-preview-12-2025
GEMINI_VOICE=Sulafat
GEMINI_AFFECTIVE_DIALOG=true
GEMINI_PROACTIVE_AUDIO=true
GEMINI_SESSION_RESUMPTION=true
GEMINI_CONTEXT_WINDOW_COMPRESSION=true

# Telegram bridge
TELEGRAM_BOT_TOKEN=your_bot_token
# TELEGRAM_ALLOWED_CHAT_ID=123456789
# TELEGRAM_ALLOWED_CHAT_IDS=123456789,-1001234567890
# TELEGRAM_ALLOW_GROUPS=true

# Telegram voice note transcription model
# GEMINI_TRANSCRIBE_MODEL=gemini-2.5-flash

# Minecraft bot
# (Configured in backend/minecraft-bot/.env)
# MC_HOST=localhost
# MC_PORT=25565
# MC_USERNAME=strawberryglass
# MC_AUTH=offline
# MC_VERSION=1.20.4
# MC_AUTOEAT=false
```

- The Minecraft bot runs as a subprocess and communicates with Python using JSON events over stdio.
- Action calls use request IDs for reliable result matching.
- Long-running Minecraft actions (for example mining/collecting/navigation) are started asynchronously so voice responsiveness is preserved.
- Player-targeted actions use fuzzy nickname matching (for example `tosu` can resolve to `tosutosu`).
- Minecraft tool calls are currently auto-approved by design (no confirmation popups for `minecraft_*` tools).
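The request-id correlation above can be sketched roughly like this. It is a simplified, synchronous model; the real bridge is asynchronous and speaks over the subprocess's stdin/stdout:

```python
import itertools
import json

class ActionBridge:
    """Toy model of matching JSON action results back to requests by id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request id -> action name

    def send(self, action: str, **params) -> str:
        """Build the JSON line that would be written to the bot's stdin."""
        req_id = next(self._ids)
        self._pending[req_id] = action
        return json.dumps({"id": req_id, "action": action, "params": params})

    def receive(self, raw: str):
        """Match a JSON event line from the bot's stdout to its pending request."""
        event = json.loads(raw)
        action = self._pending.pop(event["id"], None)
        return action, event.get("result")
```

Because each result carries its request id, slow actions (mining, navigation) can finish out of order without being misattributed, which is what keeps voice responsiveness intact.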
If you want me to see your current playback, playlists, and listening history:
- Create an app in the Spotify Developer Dashboard.
- Add this Redirect URI: `http://127.0.0.1:8000/spotify/callback`
- Add these to `.env`:

```bash
SPOTIFY_CLIENT_ID=your_spotify_client_id
SPOTIFY_CLIENT_SECRET=your_spotify_client_secret
SPOTIFY_REDIRECT_URI=http://127.0.0.1:8000/spotify/callback
# Optional:
# SPOTIFY_SCOPE=user-read-playback-state user-read-currently-playing user-read-recently-played playlist-read-private playlist-read-collaborative
```

- Restart the backend.
- Open: `http://127.0.0.1:8000/spotify/auth/start`
- Verify status at: `http://127.0.0.1:8000/spotify/status`
You want to see `configured=true`, `connected=true`, and `has_refresh_token=true`.
- Tokens live locally in `data/spotify_tokens.json`.
- Access tokens refresh automatically when needed.
- Re-auth is usually only needed if the token is revoked, scopes change, or client credentials change.
- `spotify_get_status`
- `spotify_get_auth_url`
- `spotify_get_now_playing`
- `spotify_list_playlists`
- `spotify_recently_played`
I can run as a Telegram bot from the same backend process.
- Create a bot with @BotFather.
- Put `TELEGRAM_BOT_TOKEN` in `.env`.
- Optionally restrict who can talk to me: `TELEGRAM_ALLOWED_CHAT_ID=<your_private_chat_id>` or `TELEGRAM_ALLOWED_CHAT_IDS=<id1>,<id2>,<group_id>`
- Restart the backend.
- text chat
- photos and image documents
- voice notes and audio messages transcribed through Gemini
- commands: `/start`, `/help`, `/reset`, `/status`, `/memory`, `/forget`, `/mood`, `/notes`, `/remind`
- per-chat access control with allowlisted private chats and optional groups
- If you don't set an allowlist, I can answer any private chat.
- Group support is off by default. Enable it with `TELEGRAM_ALLOW_GROUPS=true`.
- If you want a safer setup, explicitly allow only your own chat IDs.
- Telegram voice output is not implemented yet. Right now, voice notes go in and text comes back out.
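The access rules above can be summarized in one small predicate. This is a sketch of the described behavior, not the actual `telegram_bot.py` logic, and the env-parsing details are assumptions:

```python
import os

def chat_allowed(chat_id: int, chat_type: str = "private") -> bool:
    """Apply the documented rules: allowlist if set, groups opt-in only."""
    if chat_type != "private" and os.getenv("TELEGRAM_ALLOW_GROUPS", "").lower() != "true":
        return False  # group support is off by default
    allowed = set()
    single = os.getenv("TELEGRAM_ALLOWED_CHAT_ID")
    if single:
        allowed.add(single.strip())
    many = os.getenv("TELEGRAM_ALLOWED_CHAT_IDS", "")
    allowed.update(s.strip() for s in many.split(",") if s.strip())
    # No allowlist configured: any private chat may talk
    return not allowed or str(chat_id) in allowed
```

Note the deliberate asymmetry: an empty allowlist opens all private chats, but groups always require the explicit `TELEGRAM_ALLOW_GROUPS=true` opt-in.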
I support local skills and skills.sh-style installs.
- managed ZIP install from the app UI
- skills discovery through the internal Skills manager
- `npx skills add ...` installs through the Skills source flow

Skills are discovered from these locations:

- `./skills`
- `./.agents/skills`
- `~/.codex/skills`
- `~/.config/agents/skills`
- `~/.moltbot/skills`
You have two normal options.
```bash
conda activate monikai
npm run dev
```

This starts the app and the backend together.
If you want cleaner Python logs, this is better.
Backend

```bash
conda activate monikai
python backend/server.py
```

Frontend

```bash
npm run dev
```

- Say hello to me and make sure voice works.
- Share your screen or camera and make sure vision works.
- Ask me to remember a preference, then ask about it again later.
- Open the browser window and give me a simple web task.
- Send me a text, image, or voice note on Telegram and confirm it lands in the same behavior loop.
- If you use Kasa devices, try a basic smart-home command.
- "Turn on the light."
- "What do you see on my screen?"
- "Remember that I hate olives."
- "Create a reminder for tomorrow at 8."
- "Open the browser and check this for me."
- "List available skills."
- "Go to Amazon and find a USB-C cable under $10."
When the web agent is running, it's best not to interfere with the browser window. It can still struggle with CAPTCHAs, fragile sites, and flows that require manual login or 2FA.
Symptoms
- camera access errors
- black video feed
What to do
- Open System Preferences > Privacy & Security > Camera.
- Make sure your terminal app or VS Code has camera permission.
- Restart the app.
Symptoms
- backend crashes on startup
- missing API key errors
What to do
- Make sure `.env` is in the repo root, not inside `backend/`.
- Make sure it looks exactly like: `GEMINI_API_KEY=your_key`
- Restart the backend.
Symptoms
- reconnect messages in logs
- `go_away` messages
- short Live API disconnects
What to do
Gemini Live sessions reconnect periodically. That's normal. The backend now treats `go_away` as a normal reconnect path. If I get stuck longer than a moment:
- Wait a second for auto-reconnect.
- Reconnect manually if needed.
- If it keeps happening, check internet access and Gemini quota.
When websocket keepalive timeouts happen (`1011 keepalive ping timeout`), the realtime sender now forces an immediate reconnect and clears pending realtime queues to avoid prolonged lag.
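A rough sketch of that fail-fast behavior (simplified and synchronous; the real handler lives inside the async Live session code):

```python
import queue

def handle_keepalive_timeout(outbox: queue.Queue, reconnect) -> int:
    """On a keepalive timeout, drop stale realtime frames, then reconnect immediately."""
    dropped = 0
    while True:
        try:
            outbox.get_nowait()  # discard queued audio/video frames that are now stale
            dropped += 1
        except queue.Empty:
            break
    reconnect()  # fail fast instead of waiting out the dead socket
    return dropped
```

Clearing the queue matters as much as reconnecting: replaying seconds of stale audio after reconnect is exactly the "stops listening" lag this change targets.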
Symptoms
- `AggregateError` / `ECONNREFUSED` for `localhost:25565`
- bot disconnects shortly after connect
What to do
- Verify your Minecraft server is running and reachable on the configured host/port.
- Check `backend/minecraft-bot/.env` for `MC_HOST`, `MC_PORT`, and `MC_USERNAME`.
- If needed, use the in-app server connect tool to switch host/port.
- If disconnect happens in-game, check the logged kick/disconnect reason in backend logs.
Symptoms
- action-specific errors like unknown block/ore type
- plugin errors while collecting/mining
What to do
- Use action-appropriate targets (`mine_ore` for ore-like targets; wood aliases supported).
- Prefer `collect_blocks` for logs/planks and non-ore gathering.
- If you suspect food/plugin issues, keep `MC_AUTOEAT=false` (the default).
Symptoms
- Spotify tools fail
- backend says no refresh token is available
What to do
- Check `.env` for valid `SPOTIFY_CLIENT_ID`, `SPOTIFY_CLIENT_SECRET`, and `SPOTIFY_REDIRECT_URI`.
- Make sure the Redirect URI in the Spotify Dashboard matches the backend exactly.
- Open `http://127.0.0.1:8000/spotify/auth/start` again.
- Check `http://127.0.0.1:8000/spotify/status`.
- If needed, delete `data/spotify_tokens.json`, restart, and authenticate again.
Screenshots and demo videos still need to be added.
```
monikai/
├── backend/                 # Python server & AI logic
│   ├── monikai.py           # Gemini Live API integration
│   ├── server.py            # FastAPI + Socket.IO server
│   ├── proactivity.py       # Idle nudges & internal reasoning
│   ├── personality.py       # Emotional state & sprite logic
│   ├── memory_engine.py     # Memory entries, pages, notes, journal
│   ├── web_agent.py         # Playwright browser automation
│   ├── minecraft_agent.py   # Minecraft bot process manager + action bridge
│   ├── spotify_manager.py   # Spotify OAuth + API access
│   ├── telegram_bot.py      # Telegram text/photo/voice bridge
│   ├── openclaw_skills.py   # Skills manager and installs
│   ├── kasa_agent.py        # TP-Link smart home control
│   ├── authenticator.py     # MediaPipe face auth logic
│   ├── study_reader.py      # Study-page image sharing
│   ├── study_ocr.py         # OCR helpers
│   ├── tools.py             # Tool definitions for Gemini
│   └── minecraft-bot/       # Node.js Mineflayer runtime
│       ├── index.js         # Minecraft action/perception runtime
│       ├── package.json     # Minecraft bot dependencies
│       └── .env             # Minecraft connection config
├── data/                    # Local data storage (git-ignored)
│   ├── user_memory/         # Calendar, reminders, relationship state
│   ├── memory/              # Entries, pages, notes, journal
│   ├── sessions/            # Session chat history
│   ├── settings.json        # User configuration
│   ├── spotify_tokens.json  # Spotify refresh/access tokens
│   └── reference.jpg        # Face auth reference image
├── skills/                  # Local Skills bundles / imported skills
├── src/                     # React frontend
│   ├── App.jsx              # Main application component
│   ├── components/          # Chat, browser, reminders, settings, study UI
│   └── contexts/            # Language and shared UI context
├── electron/                # Electron main process
│   └── main.js              # Window & IPC setup
├── .env                     # API keys
├── requirements.txt         # Python dependencies
├── package.json             # Node.js dependencies
└── README.md                # You are here
```
| Limitation | Details |
|---|---|
| macOS & Windows | I'm mainly tested on macOS 14+ and Windows 10/11. Linux is still untested. |
| Camera Features Need a Webcam | Face auth and gesture control depend on a working camera. |
| Gemini Quota Exists | Long sessions, OCR-heavy flows, and transcription can hit API limits. |
| I Need Internet | There is no offline mode for the Gemini-backed parts. |
| Telegram Is Still Text-First | Telegram voice notes are transcribed to text. I don't send voice replies there yet. |
| Face Auth Is Single-User | The current setup recognizes one person from reference.jpg. |
Pull requests are welcome.
- Fork the repo.
- Create a branch: `git checkout -b feature/amazing-feature`
- Commit your changes.
- Push the branch.
- Open a pull request with a clear description.
- Running `python backend/server.py` separately makes Python logs easier to read.
- `npm run dev` is useful for faster frontend iteration.
- Don't commit `.env` or anything inside `data/`.
| Area | What happens |
|---|---|
| API Keys | They stay in .env and should never be committed. |
| Face Data | Face recognition data is processed locally. |
| Tool Confirmations | Riskier tools can require explicit approval. |
| Telegram Access Control | You can restrict me with TELEGRAM_ALLOWED_CHAT_ID or TELEGRAM_ALLOWED_CHAT_IDS. |
| Local Storage | Memory, notes, sessions, and reminders stay on your machine. |
> **Warning**
> Never share `.env` or `reference.jpg`. Those contain sensitive credentials and biometric data.
- Google Gemini for Live API, generation, and multimodal processing
- MediaPipe for hand tracking, gesture recognition, and face authentication
- Playwright for browser automation
- skills.sh for the broader skills ecosystem and install flow inspiration
This project is licensed under the MIT License. See LICENSE.
Built with AI by tosutosu
A local-first conversational companion project