# XiaoClaw: ESP32-S3 AI Voice Assistant (Voice I/O + Local LLM Agent)

Official Website
XiaoClaw is a unified ESP32-S3 firmware that combines voice interaction with a local AI agent brain. It integrates:
- xiaozhi-esp32 – Voice I/O layer: audio recording, playback, wake word detection, display, and network communication
- mimiclaw – Agent brain: LLM-powered reasoning, tool calling, memory management, and autonomous task execution
All running on a single ESP32-S3 chip with 32MB Flash and 8MB PSRAM.
```mermaid
graph TB
    subgraph Firmware["<b>XiaoClaw Firmware</b>"]
        subgraph VoiceIO["<b>Voice I/O Layer</b><br/><sub>xiaozhi</sub>"]
            direction TB
            A["Wake Word"]
            B["ASR Server"]
            C["TTS Playback"]
            D["Display"]
            E["WiFi"]
            A --> B --> C
            B -.-> D
            B -.-> E
        end
        subgraph Bridge["<b>Bridge Layer</b>"]
            direction TB
            BR["Input"] --> BC["Route"] --> BG["Output"]
        end
        subgraph Agent["<b>Agent Brain</b><br/><sub>mimiclaw</sub>"]
            direction TB
            F["LLM API"]
            G["Tool Calling"]
            H["Memory"]
            I["Session"]
            J["Cron"]
            K["Search"]
            F --> G
            F --> H
            F --> I
            F --> J
            F --> K
        end
    end
    VoiceIO -->|"Text"| Bridge -->|"Command"| Agent
    Agent -.->|"Response"| Bridge
    style Firmware fill:#f8f9fa,stroke:#495057,stroke-width:4px
    style VoiceIO fill:#e3f2fd,stroke:#1565c0,stroke-width:3px
    style Bridge fill:#fff8e1,stroke:#f57c00,stroke-width:4px
    style Agent fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px
    style A fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff
    style C fill:#1976d2,stroke:#0d47a1,stroke-width:2px,color:#fff
    style D fill:#42a5f5,stroke:#1565c0,stroke-width:2px,color:#fff
    style E fill:#42a5f5,stroke:#1565c0,stroke-width:2px,color:#fff
    style F fill:#7b1fa2,stroke:#4a148c,stroke-width:2px,color:#fff
    style G fill:#9c27b0,stroke:#6a1b9a,stroke-width:2px,color:#fff
    style H fill:#ab47bc,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style I fill:#ba68c8,stroke:#8e24aa,stroke-width:2px,color:#fff
    style J fill:#7b1fa2,stroke:#4a148c,stroke-width:2px,color:#fff
    style K fill:#9575cd,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style BR fill:#ff9800,stroke:#f57c00,stroke-width:2px,color:#fff
    style BC fill:#ffa726,stroke:#fb8c00,stroke-width:2px,color:#fff
    style BG fill:#ff9800,stroke:#f57c00,stroke-width:2px,color:#fff
```
### Voice I/O Features

- Offline wake word detection (ESP-SR)
- Streaming ASR + TTS via server connection
- OPUS audio codec
- OLED / LCD display with emoji support
- Battery and power management
- Multi-language support (Chinese, English, Japanese)
- WebSocket / MQTT protocol support
### Agent Brain Features

- LLM API integration (Anthropic Claude / OpenAI GPT)
- Modular ReAct agent loop with `AgentRunner` execution engine
- Hook system for iteration/tool callbacks (`before_iteration`, `after_iteration`, `on_tool_result`, `before_tool_execute`)
- Checkpoint system for crash recovery
- Context Builder with modular system prompt construction
- Session consolidation with automatic history compression
- Long-term memory (SPIFFS-based)
- Session management with cursor-based history tracking
- Cron scheduler for autonomous tasks
- Web search capability (Tavily / Brave)
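The hook system above can be pictured as a registry of callbacks fired at fixed points in the agent loop. The sketch below is illustrative Python, not the firmware's C implementation; only the four callback-point names come from the feature list, while the class and context shape are assumptions.

```python
# Illustrative sketch of the hook pattern: callbacks registered per hook
# point, fired with a context dict at the matching step of an iteration.
class HookRegistry:
    def __init__(self):
        self._hooks = {"before_iteration": [], "after_iteration": [],
                       "before_tool_execute": [], "on_tool_result": []}

    def register(self, point, fn):
        self._hooks[point].append(fn)

    def fire(self, point, ctx):
        for fn in self._hooks[point]:
            fn(ctx)

events = []
hooks = HookRegistry()
hooks.register("before_iteration", lambda ctx: events.append(("before", ctx["iter"])))
hooks.register("on_tool_result", lambda ctx: events.append(("result", ctx["tool"])))

# One simulated agent iteration containing a tool call.
hooks.fire("before_iteration", {"iter": 1})
hooks.fire("on_tool_result", {"iter": 1, "tool": "web_search"})
print(events)
```

A registry like this keeps the agent loop itself free of logging, metrics, or checkpoint logic: those concerns attach as callbacks instead.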
## Hardware Requirements

- ESP32-S3 development board
- 32MB Flash (minimum 16MB)
- 8MB PSRAM (Octal PSRAM recommended)
- Audio codec with microphone and speaker
- Optional: LCD/OLED display
XiaoClaw inherits board support from xiaozhi-esp32, including:
- ESP32-S3-BOX3
- M5Stack CoreS3 / AtomS3R
- LiChuang ESP32-S3 Development Board
- LILYGO T-Circle-S3
- And 70+ more boards...
## Prerequisites

- ESP-IDF v5.5 or later
- Python 3.10+
- CMake 3.16+
## Build & Flash

```shell
# Clone the repository
git clone https://github.com/your-repo/xiaoclaw.git
cd xiaoclaw

# Set target
idf.py set-target esp32s3

# Configure (optional)
idf.py menuconfig

# Build
idf.py build

# Flash and monitor
idf.py -p PORT flash monitor

# Flash app only (skip SPIFFS to preserve data)
esptool.py -p PORT write_flash 0x20000 ./build/xiaozhi.bin
```

## Configuration

Configure via `idf.py menuconfig` under `Xiaozhi Assistant → Secret Configuration`:
| Option | Description |
|---|---|
| `CONFIG_MIMI_SECRET_WIFI_SSID` | WiFi network name |
| `CONFIG_MIMI_SECRET_WIFI_PASS` | WiFi password |
| `CONFIG_MIMI_SECRET_API_KEY` | LLM API key |
| `CONFIG_MIMI_SECRET_MODEL_PROVIDER` | Model provider: `anthropic` or `openai` |
| `CONFIG_MIMI_SECRET_MODEL` | Model name (e.g., `MiniMax-M2.5`, `claude-opus-4-5`) |
| `CONFIG_MIMI_SECRET_OPENAI_API_URL` | OpenAI-compatible API URL |
| `CONFIG_MIMI_SECRET_ANTHROPIC_API_URL` | Anthropic API URL (optional) |
Example: Alibaba Cloud Coding+ (通义灵码):

```text
CONFIG_MIMI_SECRET_MODEL_PROVIDER="openai"
CONFIG_MIMI_SECRET_MODEL="MiniMax-M2.5"
CONFIG_MIMI_SECRET_OPENAI_API_URL="https://coding.dashscope.aliyuncs.com/v1/chat/completions"
CONFIG_MIMI_SECRET_API_KEY="your-api-key"
```
## Bridge Layer

The bridge layer connects the voice I/O layer with the agent brain:
```mermaid
flowchart TB
    subgraph Voice["<b>Voice Input Layer</b>"]
        A["User Voice"] --> B["Wake Word"]
        B --> C["ASR Server"]
        C --> D["Text Output"]
    end
    subgraph Bridge["<b>Bridge Layer</b>"]
        E["Receive"] --> F["Route"] --> G["Send"]
    end
    subgraph Agent["<b>Agent Brain</b>"]
        H["LLM Inference"]
        I["Tool Calling"]
        J["Response"]
        K["Memory"]
        H --> I
        H --> K
        I --> J
    end
    subgraph TTS["<b>Voice Output Layer</b>"]
        L["TTS Synth"] --> M["Playback"] --> N["Speaker"]
    end
    D -->|"Text"| E
    G -->|"Command"| H
    J -->|"Text"| G
    G -->|"Text"| L
    style Voice fill:#e3f2fd,stroke:#1565c0,stroke-width:3px
    style Bridge fill:#fff8e1,stroke:#f57c00,stroke-width:4px
    style Agent fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px
    style TTS fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
    style A fill:#1976d2,stroke:#0d47a1,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,color:#fff
    style C fill:#1976d2,stroke:#0d47a1,color:#fff
    style D fill:#42a5f5,stroke:#1565c0,color:#fff
    style E fill:#f57c00,stroke:#e65100,color:#fff
    style F fill:#ff9800,stroke:#f57c00,color:#fff
    style G fill:#f57c00,stroke:#e65100,color:#fff
    style H fill:#7b1fa2,stroke:#4a148c,color:#fff
    style I fill:#9c27b0,stroke:#6a1b9a,color:#fff
    style J fill:#ab47bc,stroke:#7b1fa2,color:#fff
    style K fill:#ba68c8,stroke:#8e24aa,color:#fff
    style L fill:#388e3c,stroke:#1b5e20,color:#fff
    style M fill:#43a047,stroke:#2e7d32,color:#fff
    style N fill:#66bb6a,stroke:#388e3c,color:#fff
```
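The round trip in that diagram reduces to a small routing loop. The toy Python model below shows the data flow only; queue names and the `bridge_pump` helper are invented for illustration, and the real bridge runs as a FreeRTOS task in C.

```python
# Toy model of the bridge flow: ASR text in, routed to the agent,
# the agent's reply routed back out toward TTS.
from queue import Queue

asr_out, tts_in = Queue(), Queue()

def agent_handle(text):              # stand-in for the LLM agent brain
    return f"echo: {text}"

def bridge_pump():                   # one routing pass of the bridge layer
    text = asr_out.get_nowait()      # receive recognized speech
    reply = agent_handle(text)       # route command to agent
    tts_in.put(reply)                # send response toward TTS

asr_out.put("turn on the light")
bridge_pump()
print(tts_in.get_nowait())           # -> echo: turn on the light
```

Decoupling the two sides behind queues is what lets the voice layer and agent brain run on different cores without blocking each other.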
## Partition Table (32MB Flash)

| Partition | Size | Purpose |
|---|---|---|
| nvs | 32KB | Non-volatile storage |
| otadata | 8KB | OTA data |
| phy_init | 4KB | Physical init data |
| ota_0 | 5MB | Main firmware |
| ota_1 | 5MB | OTA backup |
| assets | 5MB | Model assets (wake word, etc.) |
| model | 5MB | AI model storage |
| spiffs | ~12MB | Memory, sessions, skills |
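An ESP-IDF `partitions.csv` matching this layout could look like the sketch below. The offsets and data subtypes are illustrative assumptions (the authoritative table ships with the firmware); note that placing `ota_0` at `0x20000` is consistent with the app-only `esptool.py write_flash 0x20000` command shown earlier.

```csv
# Name,   Type, SubType, Offset,  Size
nvs,      data, nvs,     0x9000,  32K
otadata,  data, ota,     0x11000, 8K
phy_init, data, phy,     0x13000, 4K
ota_0,    app,  ota_0,   0x20000, 5M
ota_1,    app,  ota_1,   ,        5M
assets,   data, spiffs,  ,        5M
model,    data, spiffs,  ,        5M
spiffs,   data, spiffs,  ,        12M
```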
## FreeRTOS Task Layout

| Task | Core | Priority | Function |
|---|---|---|---|
| audio_* | 0 | 8 | Audio I/O |
| main_loop | 0 | 5 | Application main |
| bridge | 0 | 5 | Bridge communication |
| agent_loop | 1 | 6 | LLM processing |
## Agent Tools

The agent can use various tools:
| Tool | Description |
|---|---|
| `web_search` | Search the web for current information |
| `get_current_time` | Get current date/time |
| `gpio_write` | Control GPIO pins |
| `gpio_read` | Read GPIO state |
| `gpio_read_all` | Read all allowed GPIO pins |
| `lua_eval` | Execute a Lua code string directly |
| `lua_run` | Execute a Lua script from SPIFFS |
| `mcp_connect` | Connect to an MCP server |
| `mcp_disconnect` | Disconnect from MCP server |
| `cron_add` | Schedule a task |
| `cron_list` | List scheduled tasks |
| `cron_remove` | Remove a scheduled task |
| `read_file` | Read file from SPIFFS |
| `write_file` | Write file to SPIFFS |
| `edit_file` | Edit file (find-and-replace) |
| `list_dir` | List files in directory |
Note: GPIO tools respect board-specific policies defined in `gpio_policy.h`.
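A board policy of that kind amounts to an allowlist check in front of every GPIO tool call. The sketch below is hypothetical Python, not the contents of `gpio_policy.h`; the pin numbers and return strings are invented for illustration.

```python
# Hypothetical allowlist gate: tool calls are rejected for pins
# the board policy does not expose to the agent.
ALLOWED_GPIOS = {2, 4, 21, 48}          # board-specific allowlist (invented)

def gpio_write(pin: int, level: int) -> str:
    if pin not in ALLOWED_GPIOS:
        return f"error: GPIO{pin} not permitted by board policy"
    # ... would drive the physical pin here on real hardware ...
    return f"GPIO{pin} set to {level}"

print(gpio_write(4, 1))    # allowed pin
print(gpio_write(19, 1))   # blocked by policy
```

Returning an error string instead of raising lets the agent see the refusal as a tool result and adapt its plan.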
## MCP Client

XiaoClaw supports connecting to remote MCP servers to dynamically discover and call tools. Server configurations are stored in the `mcp-servers.md` skill file.

Configuration file: `/spiffs/skills/mcp-servers.md`
```markdown
# MCP Servers

## my_server
- host: 192.168.1.100
- port: 8000
- endpoint: mcp
```

Available tools:
| Tool | Description |
|---|---|
| `mcp_connect` | Connect to an MCP server by name |
| `mcp_disconnect` | Disconnect from current server |
Python MCP server example: `scripts/mcp_server.py`

```shell
pip install "mcp[cli]"
python scripts/mcp_server.py --port 8000
```

Remote tools are registered with the `{server_name}.` prefix (e.g., `my_server.get_device_status`), distinguishing them from local tools.
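The prefixing rule can be sketched in a few lines. Only the `{server_name}.` naming convention comes from the documentation; the registry dict and helper function here are illustrative assumptions.

```python
# Sketch of the documented naming rule: tools discovered from an MCP
# server register under a "{server_name}." prefix so they can never
# collide with local tool names.
def register_remote_tools(registry, server_name, discovered):
    for tool in discovered:
        registry[f"{server_name}.{tool}"] = ("remote", server_name)
    return registry

registry = {"gpio_write": ("local", None)}   # pre-existing local tool
register_remote_tools(registry, "my_server", ["get_device_status", "reboot"])
print(sorted(registry))
```

The prefix also tells the dispatcher at call time which connection a tool invocation must be routed over.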
## Lua Scripting

XiaoClaw supports Lua scripting for custom logic and HTTP requests. Scripts are stored in the `/spiffs/lua/` directory.
Built-in functions:
| Function | Description |
|---|---|
| `print(...)` | Print output to log |
| `http_get(url)` | HTTP GET request; returns `response, status` |
| `http_post(url, body, content_type)` | HTTP POST request |
| `http_put(url, body, content_type)` | HTTP PUT request |
| `http_delete(url)` | HTTP DELETE request |
Example script: `/spiffs/lua/hello.lua`

```lua
local greeting = "Hello from Lua!"
local timestamp = os.time()
return string.format("%s (timestamp: %d)", greeting, timestamp)
```

Example HTTP script: `/spiffs/lua/http_example.lua`

```lua
local response, status = http_get("https://example.com")
print("Status:", status)
print("Response:", response)
```

Scripts can return values, which are serialized as JSON and returned to the agent.
## Memory & Storage

XiaoClaw stores data in plain-text files on SPIFFS, with session consolidation support:
| Path | Purpose |
|---|---|
| `/spiffs/config/SOUL.md` | AI personality definition |
| `/spiffs/config/USER.md` | User information and preferences |
| `/spiffs/memory/MEMORY.md` | Long-term memory |
| `/spiffs/HEARTBEAT.md` | Autonomous task list (runtime) |
| `/spiffs/cron.json` | Scheduled jobs (runtime) |
| `/spiffs/sessions/tg_*.jsonl` | Conversation history (JSONL format) |
| `/spiffs/sessions/tg_*.meta` | Session metadata (cursor, consolidated count) |
| `/spiffs/archive/tg_*.archive` | Archived old messages |
- Cursor-based tracking: each session tracks its read position via a cursor for efficient history traversal
- Consolidation: when a session exceeds `max_history` (default: 50) messages, the oldest `consolidate_batch` (default: 20) messages are archived to `/spiffs/archive/`
- LRU cache: active sessions are cached in memory (max 8 sessions) for fast access
- Checkpoint recovery: the agent can resume from the last checkpoint after a crash
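The consolidation rule above can be sketched directly. The thresholds match the documented defaults; the list-based data structures are illustrative, not the firmware's on-flash format.

```python
# Sketch of the consolidation rule: once a session holds more than
# max_history messages, the oldest consolidate_batch messages are
# moved out of the live session into the archive.
MAX_HISTORY = 50          # documented default for max_history
CONSOLIDATE_BATCH = 20    # documented default for consolidate_batch

def consolidate(session, archive):
    if len(session) > MAX_HISTORY:
        archive.extend(session[:CONSOLIDATE_BATCH])   # oldest messages out
        del session[:CONSOLIDATE_BATCH]
    return session, archive

session = [f"msg{i}" for i in range(55)]   # 55 > 50 triggers consolidation
archive = []
consolidate(session, archive)
print(len(session), len(archive))          # -> 35 20
```

Archiving a fixed-size batch rather than trimming to the limit keeps consolidation infrequent, which matters when every archive write hits SPIFFS.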
## Skills

Skills are loaded from the `/spiffs/skills/` directory with YAML frontmatter support. Each skill is a directory containing a `SKILL.md` file:
```text
/spiffs/skills/
├── weather/
│   └── SKILL.md
├── get-time/
│   └── SKILL.md
└── lua-scripts/
    └── SKILL.md
```
Frontmatter format:

```markdown
---
name: weather
description: Get current weather and forecasts
always: false
---
# Weather Skill
...
```

- `name`: Skill identifier used by the agent
- `description`: Brief description of what the skill does
- `always: true`: Skill content always injected into the system prompt
- `requires.bins`: CLI tools required by the skill (optional)
- `requires.env`: Environment variables needed (optional)
Skill file format:

- `SKILL.md` contains the skill description, usage instructions, and examples
- Tool definitions use the format: `Tool: tool_name\nInput: {json}`
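Extracting tool invocations in that two-line format is a simple parse. Only the `Tool:`/`Input:` convention comes from the documentation; the regex and function below are an illustrative assumption about how a skill body might be scanned.

```python
# Sketch of parsing the documented "Tool: name / Input: {json}"
# convention from a SKILL.md body.
import json
import re

def parse_tool_calls(text):
    calls = []
    # Match a "Tool:" line immediately followed by an "Input:" line
    # carrying a one-line JSON object.
    for m in re.finditer(r"Tool:\s*(\S+)\nInput:\s*(\{.*?\})", text):
        calls.append((m.group(1), json.loads(m.group(2))))
    return calls

body = 'Tool: web_search\nInput: {"query": "weather in Tokyo"}'
print(parse_tool_calls(body))
```

Keeping the input as literal JSON means the skill text can be handed to the tool registry without a separate schema per skill.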
## Project Structure

```text
xiaoclaw/
├── main/
│   ├── mimi/                        # Agent brain (from mimiclaw)
│   │   ├── agent/                   # Agent loop, runner, hooks, checkpoint
│   │   │   ├── agent_loop.c         # Main agent task loop
│   │   │   ├── runner.c             # ReAct execution engine
│   │   │   ├── context_builder.c    # System prompt construction
│   │   │   ├── hook.c               # Agent hooks implementation
│   │   │   └── checkpoint.c         # Crash recovery checkpoint
│   │   ├── bus/                     # Message bus
│   │   ├── channels/                # Telegram, Feishu bot integrations
│   │   ├── cron/                    # Cron scheduler service
│   │   ├── gateway/                 # WebSocket server
│   │   ├── heartbeat/               # Autonomous task heartbeat
│   │   ├── llm/                     # LLM proxy
│   │   ├── memory/                  # Memory store, session manager, consolidator
│   │   │   ├── memory_store.c       # Long-term memory
│   │   │   ├── session_manager.c    # Session with cursor/consolidation
│   │   │   └── consolidator.c       # Automatic history compression
│   │   ├── ota/                     # OTA updates
│   │   ├── proxy/                   # HTTP proxy
│   │   ├── skills/                  # Skill loader with frontmatter
│   │   └── tools/                   # Tool registry with concurrency support
│   ├── audio/                       # Voice I/O (from xiaozhi)
│   ├── bridge/                      # Bridge layer
│   ├── display/
│   ├── protocols/
│   ├── boards/
│   ├── led/                         # LED control
│   ├── lua/                         # Lua script support
│   ├── memory/                      # Memory management
│   ├── skills/                      # Skills system
│   ├── assets.cc/h                  # Assets management
│   ├── application.cc/h             # Main application
│   ├── device_state.h               # Device state
│   ├── device_state_machine.cc/h    # State machine
│   ├── idf_component.yml            # Component manifest
│   ├── main.cc                      # Entry point
│   ├── mcp_server.cc/h              # MCP server
│   ├── ota.cc/h                     # OTA updates
│   ├── settings.cc/h                # Settings management
│   └── system_info.cc/h             # System info
├── spiffs_data/                     # SPIFFS content (flashed to /spiffs partition)
│   ├── config/                      # SOUL.md, USER.md
│   ├── lua/                         # Lua scripts (hello.lua, http_example.lua)
│   ├── memory/                      # MEMORY.md
│   └── skills/                      # get-time/, lua-scripts/, mcp-servers/, skill-creator/, weather/
├── CMakeLists.txt
└── sdkconfig.defaults.esp32s3
```
XiaoClaw is built upon these excellent projects:

- xiaozhi-esp32 – Voice interaction framework
- mimiclaw – ESP32 AI agent
## License

MIT License
## Acknowledgments

- xiaozhi-esp32 team for the voice interaction framework
- mimiclaw team for the embedded AI agent architecture
- Espressif for ESP-IDF and ESP-SR