A browser-based humanoid robot control interface where the user speaks (or types, or shares a webcam frame) and an LLM picks the right tool calls to move the robot. The robot runs as a MuJoCo WASM physics simulation rendered with Three.js — no hardware required.
voice / text / video → LLM (tool calls) → typed control API → MuJoCo (physics) → Three.js (render)
- MuJoCo WASM (`@mujoco/mujoco` 3.8) running the canonical DeepMind humanoid (`humanoid.xml`) directly in the browser.
- Three.js scene that mirrors the sim each frame from `data.geom_xpos`/`geom_xmat`, with OrbitControls for camera, shadow-casting lights, and a Z-up→Y-up parent so MuJoCo's world reads naturally.
- Typed `HumanoidControlAPI` with PD joint targets that ride the simulator's step hook: `raiseArm`, `lowerArm`, `bendElbow`, `stand` (with optional kinematic root pin), `releaseAll`. `turnHead`/`lookAt` exist but throw `UnsupportedControlError` because the stock humanoid model has no head joint.
- Two agent providers behind one interface:
  - `MockAgent` — regex-based offline agent so the chat surface works without an API key.
  - `ClaudeAgent` — direct `fetch` to the Anthropic Messages API (model: `claude-haiku-4-5-20251001`), with a tool-use loop and ephemeral prompt caching on the system prompt.
- Multimodal input:
  - 🎙 Mic button — Web Speech API (`(webkit)SpeechRecognition`) auto-submits the final transcript.
  - 📷 Webcam button — `getUserMedia` + a canvas-rasterized JPEG attached as an image content block to multimodal turns.
- Sliders + commands panel — per-actuator sliders bound directly to `data.ctrl[]`, plus buttons that exercise the typed control API (Raise L/R arm, Bend L/R elbow, Stand (PD), Stand (pinned), Release).
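The offline provider's regex approach can be sketched roughly like this. This is a hypothetical `parseCommand` helper, not the actual patterns in `src/agent/MockAgent.ts`; the tool names and argument shapes follow the tool table below.

```typescript
// Hypothetical sketch of a regex-based offline agent in the spirit of MockAgent.
// The real patterns in src/agent/MockAgent.ts may differ.
type ToolCall =
  | { tool: "raise_arm"; side: "left" | "right"; angle_deg: number }
  | { tool: "bend_elbow"; side: "left" | "right"; angle_deg: number }
  | { tool: "stand"; pin_root?: boolean };

function parseCommand(text: string): ToolCall | null {
  const t = text.toLowerCase();
  const side: "left" | "right" = t.includes("left") ? "left" : "right";
  // First number in the utterance, defaulting to 90° when none is given.
  const angle = Number(/(\d+(?:\.\d+)?)/.exec(t)?.[1] ?? 90);
  if (/raise.*arm/.test(t)) return { tool: "raise_arm", side, angle_deg: angle };
  if (/bend.*elbow/.test(t)) return { tool: "bend_elbow", side, angle_deg: angle };
  if (/stand/.test(t)) return { tool: "stand", pin_root: t.includes("pin") };
  return null; // unrecognized → the chat surface can answer with a help message
}
```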
```sh
npm install
npm run dev      # http://localhost:5173
```

Production build:

```sh
npm run build
npm run preview
```

Type-check only:

```sh
npx tsc -b --noEmit
```

- Click ⚙ in the chat panel (bottom-left).
- Switch provider to Claude Haiku 4.5 (API).
- Paste an Anthropic API key (stored in `localStorage` — see warnings below).
- Save.
- Type or speak a command: "raise your right arm to 90 degrees and bend the elbow to about 100".
The browser hits the Anthropic API directly using the `anthropic-dangerous-direct-browser-access: true` header. No backend is involved.
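The shape of that direct-browser request looks roughly like this. The endpoint, version header, and `cache_control` field follow the public Anthropic Messages API; the `max_tokens` value and system prompt are illustrative, and the project's real `ClaudeAgent` also attaches tool definitions and runs a tool-use loop.

```typescript
// Sketch of the direct-browser request ClaudeAgent sends. Field values other
// than the model name and the browser-access header are illustrative.
function buildMessagesRequest(apiKey: string, systemPrompt: string, userText: string) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      // Opts in to CORS-exposed browser calls (see the security note below).
      "anthropic-dangerous-direct-browser-access": "true",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 1024,
      // Ephemeral prompt caching on the (stable) system prompt.
      system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
      messages: [{ role: "user", content: userText }],
    }),
  };
}
```

Sending it is then a plain `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })`.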
Security note. Direct browser calls expose the API key to anyone with access to your browser/devtools. For production, proxy through a server you control. This project ships the direct-browser path for ergonomics — it's appropriate for local development and demos, not for sharing a URL with a baked-in key.
```
src/
  sim/                  MuJoCo WASM wrapper (typed state accessors, step hook, name lookup)
    MujocoSim.ts
    types.ts
  render/               Three.js scene mirroring sim state
    Scene.ts
  control/              Typed kinematic control API (PD targets, stand, arm/elbow)
    HumanoidControl.ts
  agent/                LLM-facing surface
    tools.ts            JSON-schema tools + executeTool() dispatcher
    AgentClient.ts      Shared interface + runToolCalls() helper
    MockAgent.ts        Offline regex provider
    ClaudeAgent.ts      Anthropic Messages API + tool-use loop
    SpeechRecognizer.ts (webkit)SpeechRecognition wrapper
    WebcamCapture.ts    getUserMedia + frame capture
  ui/                   React panels
    ControlsPanel.tsx   Actuator sliders + command buttons + pause/reset
    ChatPanel.tsx       Transcript + mic + camera + provider settings
  App.tsx               Boots sim → scene → control → renders panels
  App.css               Layout + theming
  main.tsx              React root
public/
  assets/humanoid.xml   Vendored DeepMind humanoid MJCF
```
All tools are intent-shaped so the model doesn't need to reason about MuJoCo internals.
| Tool | Args | Effect |
|---|---|---|
| `raise_arm` | `side: "left" \| "right"`, `angle_deg: number` | PD-targets both shoulder hinges of `side` to `angle_deg`. |
| `lower_arm` | `side` | Shoulder targets back to 0°. |
| `bend_elbow` | `side`, `angle_deg` | PD-targets the elbow hinge. |
| `stand` | `pin_root?: boolean` | All actuated joints PD-target their default angles. `pin_root` clamps the torso's free joint kinematically each step (won't fall but won't react to base forces). |
| `release_all` | — | Drops every PD target and unpins the root. |
Dispatched via `src/agent/tools.ts::executeTool`. Results (`{ok, message}`) are streamed back as tool results so the model sees its own success/failure.
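A minimal dispatcher in the spirit of `executeTool` might look like this. The `ControlLike` interface is a stand-in for the real `HumanoidControlAPI`, and only three of the five tools are shown; the key point is that thrown errors become `{ok: false}` results the model can read.

```typescript
type ToolResult = { ok: boolean; message: string };

// Stand-in for the subset of HumanoidControlAPI the tools need (illustrative).
interface ControlLike {
  raiseArm(side: "left" | "right", angleDeg: number): void;
  stand(pinRoot: boolean): void;
  releaseAll(): void;
}

function executeTool(control: ControlLike, name: string, args: any): ToolResult {
  try {
    switch (name) {
      case "raise_arm":
        control.raiseArm(args.side, args.angle_deg);
        return { ok: true, message: `${args.side} arm → ${args.angle_deg}°` };
      case "stand":
        control.stand(Boolean(args.pin_root));
        return { ok: true, message: "standing" };
      case "release_all":
        control.releaseAll();
        return { ok: true, message: "released all targets" };
      default:
        return { ok: false, message: `unknown tool: ${name}` };
    }
  } catch (e) {
    // e.g. UnsupportedControlError flows back to the model as a failure.
    return { ok: false, message: String(e) };
  }
}
```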
- Z-up vs Y-up. MuJoCo's world is Z-up. The rendered scene applies a single `rotateX(-π/2)` on the root `Group` so the parent transform handles the conversion and the per-geom matrix math stays a clean 1:1 copy of `geom_xmat`.
- Three vs MuJoCo geom axes. MuJoCo capsules and cylinders extend along local Z; Three.js capsule/cylinder geometries extend along local Y. The geom factory rotates those geometries 90° about X at build time so per-frame `geom_xmat` writes need no special-casing.
- WebAssembly resolution under Vite. `@mujoco/mujoco`'s ESM module resolves its `.wasm` via `new URL('mujoco.wasm', import.meta.url)`. Vite would relocate the JS during dep optimization and break that relative URL, so the package is in `optimizeDeps.exclude`. `resolve.dedupe: ['three']` keeps the OrbitControls addon and the main `three` import on the same instance.
- Step hook. Higher-level controllers (the PD loop, a future balance controller) register via `MujocoSim.setStepHook(fn)`. The sim calls every registered hook immediately before `mj_step`, so writes to `data.ctrl`/`data.qpos` take effect on the next integration step.
- Memory. Embind-wrapped MuJoCo handles are not GC'd. Anything created in JS (`MjModel`, `MjData`, accessor returns) must be `.delete()`-ed. `MujocoSim.dispose()` and the various wrappers do that.
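The PD-target mechanism those notes describe can be sketched as a step hook over plain arrays. The gains and the target map keyed by actuator index are made up for the sketch; the real loop lives in `HumanoidControl.ts` and runs against the WASM-backed `data` views.

```typescript
// Illustrative PD step hook: before each mj_step, drive ctrl toward joint
// targets. kp/kd and the actuator↔joint index mapping are sketch values.
type SimState = { qpos: number[]; qvel: number[]; ctrl: number[] };

function makePdHook(targets: Map<number, number>, kp = 20, kd = 2) {
  return (s: SimState) => {
    for (const [i, target] of targets) {
      // Position error scaled by kp, minus velocity damping; written to ctrl
      // so the next integration step applies the torque.
      s.ctrl[i] = kp * (target - s.qpos[i]) - kd * s.qvel[i];
    }
  };
}
```

Registered once via something like `sim.setStepHook(makePdHook(targets))`, it runs every frame until the targets are dropped (`release_all`).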
- No head joint on the stock humanoid model. `turnHead`/`lookAt` deliberately throw `UnsupportedControlError` instead of silently no-oping. To enable them, swap in a Menagerie model with a neck (e.g. Unitree G1) and add the relevant actuator names to the control API.
- No real walking. `walk()` isn't implemented yet. The plan (per the project context doc) is a kinematic root translation while physics handles the limbs — a deliberate cheat that's fine for demos. A real bipedal gait would need an MPC controller or a pretrained policy ported via `onnxruntime-web`.
- MuJoCo WASM is officially WIP. Develop on macOS or Linux with Chrome — Windows builds of MuJoCo WASM are flaky as of 3.8.
- Single-threaded physics. This project uses the single-threaded MuJoCo build, so physics runs on the main thread. Fine for one humanoid. For multi-agent scenarios, switch to `@mujoco/mujoco/mt` and serve the right COOP/COEP headers.
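The Vite settings mentioned in the implementation notes would look roughly like this in `vite.config.ts` (a sketch; plugins and the rest of the config are elided):

```typescript
// vite.config.ts — sketch of the dep-optimization and dedupe settings
// described in the implementation notes.
import { defineConfig } from "vite";

export default defineConfig({
  optimizeDeps: {
    // Keep @mujoco/mujoco out of dep optimization so its
    // new URL('mujoco.wasm', import.meta.url) keeps resolving.
    exclude: ["@mujoco/mujoco"],
  },
  resolve: {
    // One `three` instance shared by the app and the OrbitControls addon.
    dedupe: ["three"],
  },
});
```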
The history is structured: one commit per logical layer (`feat(sim)`, `feat(render)`, `feat(control)`, `feat(agent)`, `feat(ui)`, `chore(vite)`). Use `git log --oneline` to walk the steps from scaffold to multimodal chat.
MIT for the project code. The vendored `humanoid.xml` is © DeepMind under Apache 2.0 — see the header inside the file.