beancookie/xiaoclaw
XiaoClaw: AI Voice Assistant with Local Agent Brain

ESP32-S3 AI Voice Assistant – Voice I/O + Local LLM Agent



Introduction

XiaoClaw is a unified ESP32-S3 firmware that combines voice interaction with a local AI agent brain. It integrates:

  • xiaozhi-esp32 – Voice I/O layer: audio recording, playback, wake word detection, display, and network communication
  • mimiclaw – Agent brain: LLM-powered reasoning, tool calling, memory management, and autonomous task execution

All running on a single ESP32-S3 chip with 32MB Flash and 8MB PSRAM.

```mermaid
graph TB
    subgraph Firmware["<b>🏗️ XiaoClaw Firmware</b>"]

        subgraph VoiceIO["<b>🎤 Voice I/O Layer</b><br/><sub>xiaozhi</sub>"]
            direction TB
            A["👂 Wake Word"]
            B["📝 ASR Server"]
            C["🔊 TTS Playback"]
            D["📺 Display"]
            E["📡 WiFi"]
            A --> B --> C
            B -.-> D
            B -.-> E
        end

        subgraph Bridge["<b>🌉 Bridge Layer</b>"]
            direction TB
            BR["📥 Input"] --> BC["⚙️ Route"] --> BG["📤 Output"]
        end

        subgraph Agent["<b>🧠 Agent Brain</b><br/><sub>mimiclaw</sub>"]
            direction TB
            F["🤖 LLM API"]
            G["🔧 Tool Calling"]
            H["💾 Memory"]
            I["📋 Session"]
            J["⏰ Cron"]
            K["🌐 Search"]
            F --> G
            F --> H
            F --> I
            F --> J
            F --> K
        end
    end

    VoiceIO -->|"Text"| Bridge -->|"Command"| Agent
    Agent -.->|"Response"| Bridge

    style Firmware fill:#f8f9fa,stroke:#495057,stroke-width:4px,radius:20px
    style VoiceIO fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,radius:15px
    style Bridge fill:#fff8e1,stroke:#f57c00,stroke-width:4px,radius:15px
    style Agent fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,radius:15px
    style A fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff
    style C fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff
    style D fill:#42a5f5,stroke:#1565c0,stroke-width:2px,color:#fff
    style E fill:#42a5f5,stroke:#1565c0,stroke-width:2px,color:#fff
    style F fill:#7b1fa2,stroke:#4a148c,stroke-width:2px,color:#fff
    style G fill:#9c27b0,stroke:#6a1b9a,stroke-width:2px,color:#fff
    style H fill:#ab47bc,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style I fill:#ba68c8,stroke:#8e24aa,stroke-width:2px,color:#fff
    style J fill:#7b1fa2,stroke:#4a148c,stroke-width:2px,color:#fff
    style K fill:#9575cd,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style BR fill:#ff9800,stroke:#f57c00,stroke-width:2px,color:#fff
    style BC fill:#ffa726,stroke:#fb8c00,stroke-width:2px,color:#fff
    style BG fill:#ff9800,stroke:#f57c00,stroke-width:2px,color:#fff
```

Features

Voice I/O Layer (xiaozhi)

  • Offline wake word detection (ESP-SR)
  • Streaming ASR + TTS via server connection
  • OPUS audio codec
  • OLED / LCD display with emoji support
  • Battery and power management
  • Multi-language support (Chinese, English, Japanese)
  • WebSocket / MQTT protocol support

Agent Brain Layer (mimiclaw)

  • LLM API integration (Anthropic Claude / OpenAI GPT)
  • Modular ReAct agent loop with AgentRunner execution engine
  • Hook system for iteration/tool callbacks (before_iteration, after_iteration, on_tool_result, before_tool_execute)
  • Checkpoint system for crash recovery
  • Context Builder with modular system prompt construction
  • Session consolidation with automatic history compression
  • Long-term memory (SPIFFS-based)
  • Session management with cursor-based history tracking
  • Cron scheduler for autonomous tasks
  • Web search capability (Tavily / Brave)

Hardware Requirements

  • ESP32-S3 development board
  • 32MB Flash (minimum 16MB)
  • 8MB PSRAM (Octal PSRAM recommended)
  • Audio codec with microphone and speaker
  • Optional: LCD/OLED display

Supported Boards

XiaoClaw inherits board support from xiaozhi-esp32, including:

  • ESP32-S3-BOX3
  • M5Stack CoreS3 / AtomS3R
  • LiChuang ESP32-S3 Development Board
  • LILYGO T-Circle-S3
  • And 70+ more boards...

Quick Start

Prerequisites

  • ESP-IDF v5.5 or later
  • Python 3.10+
  • CMake 3.16+

Build

```bash
# Clone the repository
git clone https://github.com/your-repo/xiaoclaw.git
cd xiaoclaw

# Set target
idf.py set-target esp32s3

# Configure (optional)
idf.py menuconfig

# Build
idf.py build
```

Flash

```bash
# Flash and monitor
idf.py -p PORT flash monitor

# Flash app only (skip SPIFFS to preserve data)
esptool.py -p PORT write_flash 0x20000 ./build/xiaozhi.bin
```

Configuration

Configure via `idf.py menuconfig` under Xiaozhi Assistant → Secret Configuration:

| Option | Description |
| --- | --- |
| `CONFIG_MIMI_SECRET_WIFI_SSID` | WiFi network name |
| `CONFIG_MIMI_SECRET_WIFI_PASS` | WiFi password |
| `CONFIG_MIMI_SECRET_API_KEY` | LLM API key |
| `CONFIG_MIMI_SECRET_MODEL_PROVIDER` | Model provider: `anthropic` or `openai` |
| `CONFIG_MIMI_SECRET_MODEL` | Model name (e.g., `MiniMax-M2.5`, `claude-opus-4-5`) |
| `CONFIG_MIMI_SECRET_OPENAI_API_URL` | OpenAI-compatible API URL |
| `CONFIG_MIMI_SECRET_ANTHROPIC_API_URL` | Anthropic API URL (optional) |

Example: Alibaba Cloud Coding+ (Tongyi Lingma):

```
CONFIG_MIMI_SECRET_MODEL_PROVIDER="openai"
CONFIG_MIMI_SECRET_MODEL="MiniMax-M2.5"
CONFIG_MIMI_SECRET_OPENAI_API_URL="https://coding.dashscope.aliyuncs.com/v1/chat/completions"
CONFIG_MIMI_SECRET_API_KEY="your-api-key"
```

Architecture

Bridge Layer

The bridge layer connects the voice I/O layer with the agent brain:

```mermaid
flowchart TB
    subgraph Voice["<b>🔊 Voice Input Layer</b>"]
        A["🎤 User Voice"] --> B["👂 Wake Word"]
        B --> C["📝 ASR Server"]
        C --> D["📄 Text Output"]
    end

    subgraph Bridge["<b>🌉 Bridge Layer</b>"]
        E["📥 Receive"] --> F["⚙️ Route"] --> G["📤 Send"]
    end

    subgraph Agent["<b>🤖 Agent Brain</b>"]
        H["🧠 LLM Inference"]
        I["🔧 Tool Calling"]
        J["📋 Response"]
        K["💾 Memory"]
        H --> I
        H --> K
        I --> J
    end

    subgraph TTS["<b>🔊 Voice Output Layer</b>"]
        L["📝 TTS Synth"] --> M["🔊 Playback"] --> N["🎵 Speaker"]
    end

    D -->|"Text"| E
    G -->|"Command"| H
    J -->|"Text"| G
    G -->|"Text"| L

    style Voice fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,radius:15px
    style Bridge fill:#fff8e1,stroke:#f57c00,stroke-width:4px,radius:15px
    style Agent fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,radius:15px
    style TTS fill:#e8f5e9,stroke:#388e3c,stroke-width:3px,radius:15px
    style A fill:#1976d2,stroke:#0d47a1,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,color:#fff
    style C fill:#1976d2,stroke:#0d47a1,color:#fff
    style D fill:#42a5f5,stroke:#1565c0,color:#fff
    style E fill:#f57c00,stroke:#e65100,color:#fff
    style F fill:#ff9800,stroke:#f57c00,color:#fff
    style G fill:#f57c00,stroke:#e65100,color:#fff
    style H fill:#7b1fa2,stroke:#4a148c,color:#fff
    style I fill:#9c27b0,stroke:#6a1b9a,color:#fff
    style J fill:#ab47bc,stroke:#7b1fa2,color:#fff
    style K fill:#ba68c8,stroke:#8e24aa,color:#fff
    style L fill:#388e3c,stroke:#1b5e20,color:#fff
    style M fill:#43a047,stroke:#2e7d32,color:#fff
    style N fill:#66bb6a,stroke:#388e3c,color:#fff
```

Memory Layout

| Partition | Size | Purpose |
| --- | --- | --- |
| nvs | 32KB | Non-volatile storage |
| otadata | 8KB | OTA data |
| phy_init | 4KB | Physical init data |
| ota_0 | 5MB | Main firmware |
| ota_1 | 5MB | OTA backup |
| assets | 5MB | Model assets (wake word, etc.) |
| model | 5MB | AI model storage |
| spiffs | ~12MB | Memory, sessions, skills |

Task Layout

| Task | Core | Priority | Function |
| --- | --- | --- | --- |
| audio_* | 0 | 8 | Audio I/O |
| main_loop | 0 | 5 | Application main |
| bridge | 0 | 5 | Bridge communication |
| agent_loop | 1 | 6 | LLM processing |

Tools

The agent can use various tools:

| Tool | Description |
| --- | --- |
| `web_search` | Search the web for current information |
| `get_current_time` | Get current date/time |
| `gpio_write` | Control GPIO pins |
| `gpio_read` | Read GPIO state |
| `gpio_read_all` | Read all allowed GPIO pins |
| `lua_eval` | Execute a Lua code string directly |
| `lua_run` | Execute a Lua script from SPIFFS |
| `mcp_connect` | Connect to an MCP server |
| `mcp_disconnect` | Disconnect from MCP server |
| `cron_add` | Schedule a task |
| `cron_list` | List scheduled tasks |
| `cron_remove` | Remove a scheduled task |
| `read_file` | Read file from SPIFFS |
| `write_file` | Write file to SPIFFS |
| `edit_file` | Edit file (find-and-replace) |
| `list_dir` | List files in directory |

Note: GPIO tools respect board-specific policies defined in gpio_policy.h.
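To make the tool-call contract concrete, here is a hypothetical host-side sketch in Python of how a registry might dispatch a named tool with JSON input. The actual firmware implements this in C; the names here (`TOOLS`, `dispatch`) are illustrative only.

```python
# Hypothetical sketch of the tool-call contract: the agent emits a tool name
# plus JSON arguments, the registry dispatches and returns a JSON result.
# This is NOT the firmware's actual C API.
import json
import time

TOOLS = {}

def tool(name):
    """Register a handler under a tool name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("get_current_time")
def get_current_time(args):
    return time.strftime("%Y-%m-%d %H:%M:%S")

@tool("gpio_write")
def gpio_write(args):
    # A real board would also check the gpio_policy allow-list here.
    return f"GPIO {args['pin']} set to {args['level']}"

def dispatch(call_json):
    call = json.loads(call_json)
    handler = TOOLS.get(call["tool"])
    if handler is None:
        return json.dumps({"error": f"unknown tool: {call['tool']}"})
    return json.dumps({"result": handler(call.get("input", {}))})

print(dispatch('{"tool": "gpio_write", "input": {"pin": 4, "level": 1}}'))
# → {"result": "GPIO 4 set to 1"}
```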

MCP Client (Dynamic Remote Tools)

XiaoClaw supports connecting to remote MCP servers to dynamically discover and call tools. Server configurations are stored in the mcp-servers.md skill file.

Configuration file: /spiffs/skills/mcp-servers.md

```markdown
# MCP Servers

## my_server

- host: 192.168.1.100
- port: 8000
- endpoint: mcp
```
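As a rough illustration of this file format (the firmware's own parser is written in C, so this Python sketch only approximates its behavior), each `##` heading names a server and the `- key: value` bullets supply its connection parameters:

```python
# Illustrative parser for the mcp-servers.md format: "## name" opens a server
# entry, "- key: value" bullets fill in its fields. Approximation only.
def parse_mcp_servers(text):
    servers, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            servers[current] = {}
        elif line.startswith("- ") and current and ":" in line:
            key, _, value = line[2:].partition(":")
            servers[current][key.strip()] = value.strip()
    return servers

config = """# MCP Servers

## my_server

- host: 192.168.1.100
- port: 8000
- endpoint: mcp
"""
print(parse_mcp_servers(config))
# → {'my_server': {'host': '192.168.1.100', 'port': '8000', 'endpoint': 'mcp'}}
```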

Available tools:

| Tool | Description |
| --- | --- |
| `mcp_connect` | Connect to an MCP server by name |
| `mcp_disconnect` | Disconnect from current server |

Python MCP Server Example: scripts/mcp_server.py

```bash
pip install "mcp[cli]"
python scripts/mcp_server.py --port 8000
```

Remote tools are registered with the {server_name}. prefix (e.g., my_server.get_device_status), distinguishing them from local tools.
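The namespacing rule can be sketched as follows; `register_remote` is a hypothetical name used here only to show how prefixing keeps remote and local tool names disjoint:

```python
# Illustration of the naming rule: remote MCP tools are exposed to the agent
# under "{server_name}.{tool_name}", so they cannot collide with local tools.
def register_remote(registry, server_name, remote_tools):
    for name, fn in remote_tools.items():
        registry[f"{server_name}.{name}"] = fn
    return registry

registry = {"gpio_read": lambda args: "local"}
register_remote(registry, "my_server", {"get_device_status": lambda args: "ok"})
print(sorted(registry))  # → ['gpio_read', 'my_server.get_device_status']
```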

Lua Scripting

XiaoClaw supports Lua scripting for custom logic and HTTP requests. Scripts are stored in /spiffs/lua/ directory.

Built-in functions:

| Function | Description |
| --- | --- |
| `print(...)` | Print output to log |
| `http_get(url)` | HTTP GET request; returns `response, status` |
| `http_post(url, body, content_type)` | HTTP POST request |
| `http_put(url, body, content_type)` | HTTP PUT request |
| `http_delete(url)` | HTTP DELETE request |

Example script: /spiffs/lua/hello.lua

```lua
local greeting = "Hello from Lua!"
local timestamp = os.time()
return string.format("%s (timestamp: %d)", greeting, timestamp)
```

Example HTTP script: /spiffs/lua/http_example.lua

```lua
local response, status = http_get("https://example.com")
print("Status:", status)
print("Response:", response)
```

Scripts can return values, which are serialized to JSON and passed back to the agent.

Memory System

XiaoClaw stores data in plain text files on SPIFFS with session consolidation support:

| Path | Purpose |
| --- | --- |
| `/spiffs/config/SOUL.md` | AI personality definition |
| `/spiffs/config/USER.md` | User information and preferences |
| `/spiffs/memory/MEMORY.md` | Long-term memory |
| `/spiffs/HEARTBEAT.md` | Autonomous task list (runtime) |
| `/spiffs/cron.json` | Scheduled jobs (runtime) |
| `/spiffs/sessions/tg_*.jsonl` | Conversation history (JSONL format) |
| `/spiffs/sessions/tg_*.meta` | Session metadata (cursor, consolidated count) |
| `/spiffs/archive/tg_*.archive` | Archived old messages |

Session Management

  • Cursor-based tracking: Each session tracks read position via cursor for efficient history traversal
  • Consolidation: When a session exceeds max_history (default: 50) messages, the oldest consolidate_batch (default: 20) messages are archived to /spiffs/archive/
  • LRU cache: Active sessions cached in memory (max 8 sessions) for fast access
  • Checkpoint recovery: Agent can resume from last checkpoint on crash
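Under the stated defaults, the consolidation arithmetic works out as in this toy Python model (the real implementation lives in session_manager.c and consolidator.c; this only mirrors the counting):

```python
# Toy model of consolidation with the documented defaults: once a session's
# history exceeds max_history (50), the oldest consolidate_batch (20)
# messages move to the archive.
MAX_HISTORY = 50
CONSOLIDATE_BATCH = 20

def append_message(session, archive, msg):
    session.append(msg)
    if len(session) > MAX_HISTORY:
        archive.extend(session[:CONSOLIDATE_BATCH])
        del session[:CONSOLIDATE_BATCH]
    return session, archive

session, archive = [], []
for i in range(51):  # the 51st message triggers consolidation
    append_message(session, archive, f"msg-{i}")
print(len(session), len(archive))  # → 31 20
```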

Skills System

Skills are loaded from /spiffs/skills/ directory with YAML frontmatter support. Each skill is a directory containing a SKILL.md file:

```
/spiffs/skills/
├── weather/
│   └── SKILL.md
├── get-time/
│   └── SKILL.md
└── lua-scripts/
    └── SKILL.md
```

Frontmatter format:

```markdown
---
name: weather
description: Get current weather and forecasts
always: false
---
# Weather Skill
...
```
  • name: Skill identifier used by the agent
  • description: Brief description of what the skill does
  • always: If true, the skill content is always injected into the system prompt
  • requires.bins: CLI tools required by the skill (optional)
  • requires.env: Environment variables needed (optional)
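A minimal sketch of how such frontmatter could be split from the body (the firmware's loader is C code with its own parser; this Python version assumes only the simple `key: value` header shown above):

```python
# Illustrative frontmatter splitter for a SKILL.md: a leading "---" block is
# parsed as flat key/value pairs, the rest is the skill body. Sketch only.
def parse_skill(text):
    meta, body = {}, text
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")

skill = """---
name: weather
description: Get current weather and forecasts
always: false
---
# Weather Skill
...
"""
meta, body = parse_skill(skill)
print(meta["name"], meta["always"])  # → weather false
```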

Skill file format:

  • SKILL.md - Contains skill description, usage instructions, and examples
  • Tool definitions in the format: Tool: tool_name\nInput: {json}
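The `Tool: tool_name\nInput: {json}` convention can be illustrated with a small extraction sketch; the actual firmware parsing may differ, and `extract_tool_calls` is a hypothetical name:

```python
# Hypothetical sketch of extracting "Tool: name / Input: {json}" pairs from
# skill text, assuming the JSON input sits on one line.
import json
import re

def extract_tool_calls(text):
    calls = []
    for m in re.finditer(r"Tool:\s*(\S+)\nInput:\s*(\{.*?\})", text):
        calls.append((m.group(1), json.loads(m.group(2))))
    return calls

example = 'Tool: gpio_write\nInput: {"pin": 4, "level": 1}'
print(extract_tool_calls(example))  # → [('gpio_write', {'pin': 4, 'level': 1})]
```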

Development

Project Structure

```
xiaoclaw/
├── main/
│   ├── mimi/             # Agent brain (from mimiclaw)
│   │   ├── agent/        # Agent loop, runner, hooks, checkpoint
│   │   │   ├── agent_loop.c      # Main agent task loop
│   │   │   ├── runner.c          # ReAct execution engine
│   │   │   ├── context_builder.c # System prompt construction
│   │   │   ├── hook.c            # Agent hooks implementation
│   │   │   └── checkpoint.c      # Crash recovery checkpoint
│   │   ├── bus/          # Message bus
│   │   ├── channels/     # Telegram, Feishu bot integrations
│   │   ├── cron/         # Cron scheduler service
│   │   ├── gateway/      # WebSocket server
│   │   ├── heartbeat/    # Autonomous task heartbeat
│   │   ├── llm/          # LLM proxy
│   │   ├── memory/       # Memory store, session manager, consolidator
│   │   │   ├── memory_store.c    # Long-term memory
│   │   │   ├── session_manager.c # Session with cursor/consolidation
│   │   │   └── consolidator.c    # Automatic history compression
│   │   ├── ota/          # OTA updates
│   │   ├── proxy/        # HTTP proxy
│   │   ├── skills/       # Skill loader with frontmatter
│   │   └── tools/        # Tool registry with concurrency support
│   ├── audio/            # Voice I/O (from xiaozhi)
│   ├── bridge/           # Bridge layer
│   ├── display/
│   ├── protocols/
│   ├── boards/
│   ├── led/              # LED control
│   ├── lua/              # Lua script support
│   ├── memory/           # Memory management
│   ├── skills/           # Skills system
│   ├── assets.cc/h       # Assets management
│   ├── application.cc/h  # Main application
│   ├── device_state.h    # Device state
│   ├── device_state_machine.cc/h # State machine
│   ├── idf_component.yml # Component manifest
│   ├── main.cc           # Entry point
│   ├── mcp_server.cc/h   # MCP server
│   ├── ota.cc/h          # OTA updates
│   ├── settings.cc/h     # Settings management
│   └── system_info.cc/h  # System info
├── spiffs_data/          # SPIFFS content (flashed to /spiffs partition)
│   ├── config/           # SOUL.md, USER.md
│   ├── lua/              # Lua scripts (hello.lua, http_example.lua)
│   ├── memory/           # MEMORY.md
│   └── skills/           # get-time/, lua-scripts/, mcp-servers/, skill-creator/, weather/
├── CMakeLists.txt
└── sdkconfig.defaults.esp32s3
```

Related Projects

XiaoClaw is built upon these excellent projects:

  • xiaozhi-esp32 – the voice interaction framework
  • mimiclaw – the embedded AI agent architecture

License

MIT License

Acknowledgments

  • xiaozhi-esp32 team for the voice interaction framework
  • mimiclaw team for the embedded AI agent architecture
  • Espressif for ESP-IDF and ESP-SR
