diff --git a/bootstraps/optional/idle-shutdown/ARCHITECTURE.md b/bootstraps/optional/idle-shutdown/ARCHITECTURE.md
new file mode 100644
index 0000000..43ebaf5
--- /dev/null
+++ b/bootstraps/optional/idle-shutdown/ARCHITECTURE.md
@@ -0,0 +1,280 @@
+# Idle Shutdown & Wake — Architecture & Design
+
+This document explains the clever wake-on-text design: when your OpenClaw instance sleeps, you just send a message to wake it up. No links, no tokens, completely transparent.
+
+## Problem Statement
+
+Running an EC2 agent 24/7 in a sandbox account is powerful but expensive. We want:
+
+- **Automatic cost savings** — shutdown after user inactivity
+- **Zero manual waking** — just send a message, instance wakes up
+- **Transparent** — no clicking links, no entering tokens
+- **Reliable** — messages don't get lost while instance is sleeping
+- **Safe** — only authorized users can wake the instance
+
+## Architecture Overview
+
+```
+┌─ EC2 Instance (running) ───────────────────────────────────┐
+│                                                            │
+│  OpenClaw (normal operation)                               │
+│  ├─ systemd timer every 5 min                              │
+│  ├─ idle-check.py scans session activity                   │
+│  ├─ if idle > 1 hour: emit stop signal                     │
+│  └─ ec2 stop-instances                                     │
+│                                                            │
+└────────────────────────────────────────────────────────────┘
+                              │
+                       (instance stops)
+                              │
+                              ▼
+┌─ AWS EventBridge (runs while EC2 off) ─────────────────────┐
+│                                                            │
+│  Listener: EC2 state = stopped                             │
+│  └─ Invoke: notify-lambda                                  │
+│                                                            │
+│  notify-lambda:                                            │
+│  ├─ Send random sleep message: "Going to sleep 😴"         │
+│  ├─ Call Telegram setWebhook:                              │
+│  │  • URL → webhook-lambda (wake-on-text)                  │
+│  │  • Secret token (validated)                             │
+│  │  • Allowed updates: message, edited_message             │
+│  └─ From now on, Telegram sends messages here (not bot)    │
+│                                                            │
+└────────────────────────────────────────────────────────────┘
+                              │
+                  (instance is now sleeping)
+                 (webhook active on Telegram)
+                              │
+                              ▼
+                   ┌─ You (via Telegram) ─┐
+                   │                      │
+                   │  Type: "hello" 👋    │
+                   │  (any message works) │
+                   │                      │
+                   └──────────────────────┘
+                              │
+                              ▼ (Telegram
sends to webhook)
+┌─ webhook-lambda (wake-on-text) ────────────────────────────┐
+│                                                            │
+│  Receives: Telegram message                                │
+│  ├─ Validate secret token                                  │
+│  ├─ Check if sender is authorized                          │
+│  ├─ Check EC2 state (stopped?)                             │
+│  ├─ if yes: ec2.start_instances                            │
+│  ├─ Dedup check: prevent duplicate wakes                   │
+│  ├─ Reply: "☕ Waking up... 60 seconds"                    │
+│  └─ Return 503 to Telegram (retry, so messages queue)      │
+│                                                            │
+└────────────────────────────────────────────────────────────┘
+                              │
+                     (EC2 starts booting)
+                              │
+                              ▼
+            ┌─ EC2 boots (5-60 sec) ─────────────┐
+            │ OpenClaw starts                    │
+            │ ├─ Fetch queued messages from TG   │
+            │ ├─ Process messages (normal flow)  │
+            │ └─ restore webhook to normal bot   │
+            └────────────────────────────────────┘
+                              │
+                              ▼
+            ┌─ EventBridge fires (state=running) ┐
+            │ └─ notify-lambda:                  │
+            │    ├─ Fetch public IP              │
+            │    └─ Send: "🟢 up + IP + ssh cmd" │
+            └────────────────────────────────────┘
+```
+
+## Two Lambda Functions
+
+### 1. **notify-lambda** (EventBridge → Telegram)
+
+Runs when EC2 state changes (stopped/running).
+
+#### On Stop
+
+1. **Validate state:** Check EC2 is actually stopped (stale-event guard)
+2. **Dedup:** Skip if we've already processed this event
+3. **Set webhook:** Tell Telegram to send *all* messages to webhook-lambda
+   - URL: API Gateway → webhook-lambda
+   - Secret: Telegram sends it in the `X-Telegram-Bot-API-Secret-Token` header, which webhook-lambda validates
+   - Only messages/edits allowed (no other updates)
+4. **Send sleep message:** Random message from list:
+   - "Going to sleep 😴"
+   - "Taking a nap 🥱"
+   - "Powering down... zzzz 💤"
+   - etc. (20 messages)
+
+#### On Start
+
+1. **Fetch public IP:** Retry 3x (EC2 eventual consistency)
+2. **Send startup message:** "🟢 Machine is up\n\nIP: 1.2.3.4\n\nssh ec2-user@1.2.3.4"
+
+### 2. **webhook-lambda** (Telegram → EC2 Wake)
+
+Runs when Telegram sends a message (via webhook, while EC2 is stopped).
+
+**Process:**
+1. **Validate webhook secret:** Confirm it's from Telegram
+2. 
**Authorize sender:** Check `user_id` is in `ALLOWED_USERS`
+3. **Check state:** Is instance actually stopped? (avoid waking running instance)
+4. **Dedup:** Compare `message_id` with the last one that triggered a wake; skip the start if Telegram is retrying the same message
+5. **Start instance:** `ec2.start_instances()`, then record the `message_id`
+6. **Reply:** "☕ Waking up... give me about 60 seconds."
+7. **Return 503:** Tell Telegram to retry later (keeps messages queued until OpenClaw processes them on boot)
+
+## Key Design Decisions
+
+### Wake-on-Text (Not Links)
+
+**Why this is better:**
+- No tokens to share/copy/forget
+- No "tap link" friction
+- Works from any device (phone, web, CLI)
+- Natural: just send a message like normal
+- No credentials in URLs (safer than token links)
+
+**Cost:**
+- `setWebhook` call: 1 per stop (no charge)
+- Webhook Lambda: on-demand (free tier covers thousands)
+- Total: **~$0/month**
+
+### Return 503 from Webhook
+
+When webhook-lambda returns HTTP `503`, Telegram treats the delivery as failed and keeps the message queued for retry instead of marking it delivered. This ensures:
+
+- While EC2 is sleeping: messages accumulate in Telegram's queue
+- On boot: OpenClaw's normal bot handler processes all queued messages
+- No message loss, no duplicate processing
+
+### Webhook Secret Validation
+
+Telegram sends the `X-Telegram-Bot-API-Secret-Token` header with every webhook call. We validate it against SSM before processing:
+
+```python
+if secret_token != ssm.get_parameter(Name=WEBHOOK_SECRET_PARAM, WithDecryption=True)['Parameter']['Value']:
+    return {'statusCode': 403, 'body': 'forbidden'}  # Secret mismatch
+```
+
+This prevents:
+- Random internet traffic from triggering wakes
+- Forged requests from anyone who discovers the endpoint URL
+- Accidental invocations
+
+### Dedup via message_id
+
+Telegram may retry webhook delivery if we don't respond quickly. 
Track `message_id` in SSM to skip duplicate wakes:
+
+```python
+last_id = ssm.get_parameter(Name=LAST_WAKE_ID_PARAM)['Parameter']['Value']
+if str(message_id) != last_id:
+    ec2.start_instances(InstanceIds=[INSTANCE_ID])  # New message, wake it
+    ssm.put_parameter(Name=LAST_WAKE_ID_PARAM, Value=str(message_id), Type='String', Overwrite=True)
+```
+
+### Stale-Event Guard
+
+EventBridge may deliver events out-of-order. Before setting the webhook, verify the instance is *actually* stopped:
+
+```python
+actual_state = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]['Instances'][0]['State']['Name']
+if actual_state not in ('stopped', 'stopping'):
+    return  # Don't overwrite webhook (instance is still running)
+```
+
+### User Authorization
+
+Only allow specific Telegram users to wake the instance:
+
+```bash
+--allowed-users 123456789,987654321
+```
+
+Environment variable: `ALLOWED_USERS` (comma-separated user IDs).
+
+## Failure Modes & Recovery
+
+| Failure | Symptom | Recovery |
+|---------|---------|----------|
+| Webhook not set | Message sent but nothing happens | Manually wake from AWS console, check logs |
+| Invalid secret | 403 from webhook | Regenerate `webhook-secret` SSM param |
+| User not authorized | Message received, no wake | Add user ID to `ALLOWED_USERS` |
+| EC2 describe fails | No wake response | AWS issue, check instance health |
+| Message_id dedup stale | Duplicate wake on retry | Harmless (instance already starting) |
+| Webhook returns 2xx instead of 503 | Messages sent during sleep are lost | Telegram dequeues confirmed messages, so OpenClaw won't see them on boot; restore the 503 response |
+
+## Installation
+
+The installer (install-idle.sh) creates:
+
+- **notify-lambda:** EventBridge → Telegram (on stop, send sleep message + set webhook)
+- **webhook-lambda:** Telegram → EC2 wake (on message, start instance if stopped)
+- **EventBridge rule:** Listens for EC2 state changes (stopped/running)
+- **API Gateway:** HTTP endpoint for webhook
+- **SSM parameters:**
+  - Bot token
+  - Webhook secret
+  - Webhook URL
+  - Allowed users
+
+## Operational Thresholds
+
+Tunable in idle-check.sh:
+
+| Variable | Default | Meaning |
+|----------|---------|---------|
+| `IDLE_THRESHOLD_HOURS` | `1.0` | Stop after 1 hour idle |
+| `MIN_UPTIME_HOURS` | `0.25` | Skip checks if booted < 15 min ago |
+| `MAX_NO_ACTIVITY_HOURS` | `1.0` | If no messages ever, stop after 1 hour uptime |
+
+## Cost Analysis
+
+- **EventBridge:** Free (included in CloudWatch Events)
+- **Lambda (notify):** 2 invocations per stop/start cycle → free tier
+- **Lambda (webhook):** On-demand, typically < 100/month → **~$0.00**
+- **API Gateway:** $1/million requests → one request per webhook delivery, effectively $0 at this volume
+- **SSM:** 5 parameters → free tier
+- **EC2 start/stop:** Free operation
+
+**Total: ~$0/month** for this feature (you're already paying for the instance).
+
+## Future Enhancements
+
+- **Smart wake:** Wake on priority messages only (e.g., `/wake` command)
+- **Scheduled wake:** "Wake me at 9am Monday"
+- **Wake on CI/CD:** GitHub webhook triggers wake
+- **Multiple instances:** One webhook, multiple EC2 instances
+- **Rate limiting:** Prevent spam wakes
+- **Admin commands:** `/status`, `/force-sleep`, etc. via Telegram
+
+## Testing
+
+### Dry-run the idle check
+
+```bash
+~/.openclaw/workspace/idle-check.sh --dry-run
+tail ~/.openclaw/logs/idle-check.log
+```
+
+### Trigger a stop manually
+
+```bash
+aws ec2 stop-instances --instance-ids i-xxxxx
+```
+
+### Test webhook delivery
+
+```bash
+# Send a message via Telegram bot (@your_bot) while instance is stopped
+# Check Lambda logs in CloudWatch
+aws logs tail /aws/lambda/openclaw-telegram-wake-webhook --follow
+```
+
+### Verify webhook is active
+
+```bash
+curl -s https://api.telegram.org/bot{TOKEN}/getWebhookInfo | jq .
+```
+
+Should show the webhook `url` and `allowed_updates`; Telegram does not echo the secret token back.
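Review note: the webhook-lambda checks described in ARCHITECTURE.md (secret validation, sender authorization, state check, dedup, 503 response) can be sketched as one pure function that is easy to unit test without AWS or Telegram. This is a minimal sketch under stated assumptions: `decide_wake` and all of its parameters are hypothetical names that do not appear in the PR, and the use of `hmac.compare_digest` is a suggested hardening, whereas the handler in this PR compares with `!=`.

```python
import hmac


def decide_wake(secret, expected_secret, user_id, allowed_users,
                instance_state, message_id, last_wake_id):
    """Return (http_status, should_start) for one Telegram webhook delivery.

    Hypothetical helper: mirrors the documented order of checks only.
    """
    # 1. Reject deliveries whose secret header does not match (constant-time compare).
    if not hmac.compare_digest(str(secret).encode(), str(expected_secret).encode()):
        return 403, False
    # 2. Acknowledge but ignore senders who may not wake the instance.
    if str(user_id) not in allowed_users:
        return 200, False
    # 3. Start only a stopped instance, and only for a message that has not
    #    already triggered a wake (Telegram retries the same message_id).
    should_start = (instance_state == 'stopped'
                    and str(message_id) != str(last_wake_id))
    # 4. Answer 503 so Telegram keeps the update queued for retry.
    return 503, should_start
```

Keeping the decision separate from the boto3 and HTTP calls would let the dedup and authorization rules be tested with plain assertions, for example `decide_wake('s', 's', '1', {'1'}, 'stopped', 10, 9)`.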
diff --git a/bootstraps/optional/idle-shutdown/notify-lambda/handler.py b/bootstraps/optional/idle-shutdown/notify-lambda/handler.py
index 9f38d5c..012ec15 100644
--- a/bootstraps/optional/idle-shutdown/notify-lambda/handler.py
+++ b/bootstraps/optional/idle-shutdown/notify-lambda/handler.py
@@ -1,18 +1,47 @@
 import json
 import os
+import random
 import time
-import uuid
 import urllib.request
+import urllib.parse
 
 import boto3
 
-# Timeout for external HTTP calls (seconds)
+# Random sleep messages sent when instance shuts down
+SLEEP_MESSAGES = [
+    "Going to sleep 😴",
+    "Taking a nap 🥱",
+    "Shutting my eyes for a bit 😪",
+    "BRB, hibernating 🐻",
+    "Powering down... zzzz 💤",
+    "Offline. Don't miss me too much 🌙",
+    "Gone to the land of nod 🛌",
+    "See you on the flip side 😴",
+    "Hitting the bed 🛌",
+    "Out cold 🥶",
+    "Clocking out ⏰",
+    "Lights out 💡",
+    "Dreaming of electric sheep 🐑",
+    "Do not disturb 🚫",
+    "Saving state and suspending 💾",
+    "Gone fishing 🎣",
+    "In maintenance mode 🔧",
+    "Recharging 🔋",
+    "Away from keyboard... and everything else 👋",
+    "CPU at rest 🧠",
+]
+
 HTTP_TIMEOUT = 10
 
 def handler(event, context):
+    """EventBridge handler for EC2 state changes.
+
+    On stop: Send random sleeping message + set Telegram webhook for wake-on-text.
+    On start: Send "woke up" message with public IP.
+    """
     instance_id = event['detail']['instance-id']
     state = event['detail']['state']
-    event_id = event.get('id', '')  # EventBridge event ID for logging
+    event_id = event.get('id', '')
 
     target_instance = os.environ.get('INSTANCE_ID')
     if target_instance and instance_id != target_instance:
@@ -20,7 +49,6 @@ def handler(event, context):
 
     ssm = boto3.client('ssm')
 
-    # Load Telegram credentials — fail fast if unavailable
     try:
         telegram_token = ssm.get_parameter(
             Name='/openclaw/wake-config/telegram-bot-token', WithDecryption=True
@@ -30,104 +58,96 @@ def handler(event, context):
         raise
 
     chat_id = os.environ['TELEGRAM_CHAT_ID']
-    wake_url = os.environ.get('WAKE_URL', '')
-
-    if state == 'running':
-        ec2 = boto3.client('ec2')
-
-        # Retry for public IP — can arrive after running event
-        public_ip = None
-        for attempt in range(3):
-            try:
-                response = ec2.describe_instances(InstanceIds=[instance_id])
-                public_ip = response['Reservations'][0]['Instances'][0].get('PublicIpAddress')
-            except Exception as e:
-                print(f"WARNING: DescribeInstances attempt {attempt+1} failed: {e}")
-            if public_ip:
-                break
-            if attempt < 2:
-                time.sleep(5)
-
-        if not public_ip:
-            # Send fallback notification — don't leave user without any signal
-            try:
-                fallback_body = json.dumps({
-                    'chat_id': chat_id,
-                    'text': '🟡 Machine is running but public IP not available yet. 
Check again in a moment.',
-                    'disable_web_page_preview': True
-                }).encode()
-                fallback_req = urllib.request.Request(
-                    f"https://api.telegram.org/bot{telegram_token}/sendMessage",
-                    data=fallback_body,
-                    headers={'Content-Type': 'application/json'}
-                )
-                urllib.request.urlopen(fallback_req, timeout=HTTP_TIMEOUT)
-            except Exception as e:
-                print(f"WARNING: Fallback Telegram notification failed: {e}")
-            print(f"event_id={event_id} — no public IP after 3 attempts, sent fallback")
-            return
-
-        message = (
-            f"🟢 Machine is up and running\n\n"
-            f"Public IP: {public_ip}\n\n"
-            f"ssh ec2-user@{public_ip}"
-        )
-    elif state == 'stopped':
-        token_param = '/openclaw/wake-token'
-
-        # Guard: verify instance is actually stopped before rotating token.
-        # A delayed/out-of-order stopped event can arrive after a successful wake.
+    if state == 'stopped':
+        """Instance stopped. Send sleep message + enable wake-on-Telegram-text."""
+
+        # Verify instance is actually stopped (stale-event guard)
         try:
             ec2_check = boto3.client('ec2')
             check_resp = ec2_check.describe_instances(InstanceIds=[instance_id])
             actual_state = check_resp['Reservations'][0]['Instances'][0]['State']['Name']
             if actual_state not in ('stopped', 'stopping'):
-                print(f"event_id={event_id} — stale stopped event, instance is actually {actual_state}")
+                print(f"event_id={event_id} — stale stopped event, instance is {actual_state}")
                 return
         except Exception as e:
-            # FAIL CLOSED: can't confirm instance is stopped — don't rotate token
-            print(f"WARNING: EC2 state check failed — not rotating token (fail closed): {e}")
+            print(f"WARNING: EC2 state check failed: {e}")
             return
 
-        # Deduplicate stop events — EventBridge is at-least-once.
+        # Dedup: don't send multiple sleep messages for the same event
         dedup_param = '/openclaw/wake-config/last-stop-event-id'
         if event_id:
             try:
                 last_event = ssm.get_parameter(Name=dedup_param)['Parameter']['Value']
                 if last_event == event_id:
-                    print(f"event_id={event_id} — duplicate stopped event, skipping token rotation")
+                    print(f"event_id={event_id} — duplicate, skipping")
                     return
             except ssm.exceptions.ParameterNotFound:
                 pass
             except Exception as e:
-                print(f"WARNING: Dedup check failed (continuing): {e}")
-
-        # Always generate a fresh token on stop — prevents stale token reuse
-        token = str(uuid.uuid4())
-        ssm.put_parameter(Name=token_param, Value=token,
-                          Type='String', Overwrite=True)
-        token_status = "✅ Fresh wake token generated."
+                print(f"WARNING: Dedup check failed: {e}")
 
-        # Record this event ID for dedup
+        # Record this event for future dedup
         if event_id:
             try:
-                ssm.put_parameter(Name=dedup_param, Value=event_id,
-                                  Type='String', Overwrite=True)
+                ssm.put_parameter(Name=dedup_param, Value=event_id, Type='String', Overwrite=True)
             except Exception as e:
                 print(f"WARNING: Dedup marker write failed: {e}")
 
-        wake_link = f"{wake_url}?token={token}"
-        message = (
-            f"🔴 Machine is shut down.\n\n"
-            f"{token_status}\n\n"
-            f"👉 Tap to wake me up:\n{wake_link}"
-        )
+        # === Set Telegram webhook for wake-on-text ===
+        # Now that the instance is stopped, set the webhook so ANY Telegram message
+        # triggers the wake Lambda instead of the normal message handler.
+        try:
+            webhook_url = ssm.get_parameter(
+                Name='/openclaw/wake-config/telegram-webhook-url'
+            )['Parameter']['Value']
+            webhook_secret = ssm.get_parameter(
+                Name='/openclaw/wake-config/webhook-secret', WithDecryption=True
+            )['Parameter']['Value']
+            webhook_data = urllib.parse.urlencode({
+                'url': webhook_url,
+                'secret_token': webhook_secret,
+                'allowed_updates': '["message","edited_message"]'
+            }).encode()
+            webhook_req = urllib.request.Request(
+                f"https://api.telegram.org/bot{telegram_token}/setWebhook",
+                data=webhook_data,
+                headers={'Content-Type': 'application/x-www-form-urlencoded'}
+            )
+            webhook_resp = urllib.request.urlopen(webhook_req, timeout=HTTP_TIMEOUT)
+            print(f"event_id={event_id} — webhook set: {webhook_resp.read().decode()}")
+        except Exception as e:
+            print(f"WARNING: setWebhook failed (wake-on-text unavailable): {e}")
+
+        # Send random sleep message
+        message = random.choice(SLEEP_MESSAGES)
+
+    elif state == 'running':
+        """Instance started. Send message with public IP + SSH command."""
+        ec2 = boto3.client('ec2')
+
+        # Fetch public IP (with retries)
+        public_ip = None
+        for attempt in range(3):
+            try:
+                response = ec2.describe_instances(InstanceIds=[instance_id])
+                public_ip = response['Reservations'][0]['Instances'][0].get('PublicIpAddress')
+            except Exception as e:
+                print(f"WARNING: DescribeInstances attempt {attempt+1} failed: {e}")
+            if public_ip:
+                break
+            if attempt < 2:
+                time.sleep(5)
+
+        if not public_ip:
+            message = '🟡 Machine is running but public IP not available yet.'
+        else:
+            message = f"🟢 Machine is up and running\n\nPublic IP: {public_ip}\n\nssh ec2-user@{public_ip}"
     else:
         return
 
-    # Telegram notification — best effort
+    # Send message to Telegram
     try:
         tg_body = json.dumps({
             'chat_id': chat_id,
@@ -143,5 +163,4 @@ def handler(event, context):
     except Exception as e:
         print(f"WARNING: Telegram notification failed: {e}")
 
-    print(f"event_id={event_id} state={state} — notification sent")
-    return {'state': state, 'event_id': event_id}
+    print(f"event_id={event_id} state={state} — message sent")
diff --git a/bootstraps/optional/idle-shutdown/wake-lambda/webhook.py b/bootstraps/optional/idle-shutdown/wake-lambda/webhook.py
new file mode 100644
index 0000000..62c474b
--- /dev/null
+++ b/bootstraps/optional/idle-shutdown/wake-lambda/webhook.py
@@ -0,0 +1,134 @@
+import json
+import os
+import boto3
+import urllib.request
+
+"""Telegram webhook Lambda for wake-on-text.
+
+Intercepted by Telegram while EC2 is stopped.
+When user sends any message, this Lambda wakes the instance.
+
+Dedup protection: tracks last message_id to avoid duplicate wake requests
+(Telegram may retry webhook delivery).
+"""
+
+ssm = boto3.client('ssm', region_name=os.environ.get('REGION', 'us-east-1'))
+ec2 = boto3.client('ec2', region_name=os.environ.get('REGION', 'us-east-1'))
+
+INSTANCE_ID = os.environ['INSTANCE_ID']
+ALLOWED_USERS = {u.strip() for u in os.environ.get('ALLOWED_USERS', '').split(',') if u.strip()}  # drop empty entries
+BOT_TOKEN_PARAM = os.environ.get('SSM_BOT_TOKEN_PARAM', '/openclaw/wake-config/telegram-bot-token')
+WEBHOOK_SECRET_PARAM = os.environ.get('SSM_WEBHOOK_SECRET_PARAM', '/openclaw/wake-config/webhook-secret')
+LAST_WAKE_ID_PARAM = '/openclaw/wake-config/last-wake-update-id'
+
+
+def get_param(name, decrypt=False):
+    """Fetch SSM parameter."""
+    return ssm.get_parameter(Name=name, WithDecryption=decrypt)['Parameter']['Value']
+
+
+def put_param(name, value):
+    """Store SSM parameter."""
+    try:
+        ssm.put_parameter(Name=name, Value=str(value), Type='String', Overwrite=True)
+    except Exception as e:
+        print(f"SSM put error: {e}")
+
+
+def send_telegram(bot_token, chat_id, text, reply_to_message_id=None):
+    """Send Telegram message (best-effort)."""
+    body = {
+        "chat_id": chat_id,
+        "text": text
+    }
+    if reply_to_message_id:
+        body["reply_parameters"] = {"message_id": reply_to_message_id}
+
+    try:
+        req = urllib.request.Request(
+            f"https://api.telegram.org/bot{bot_token}/sendMessage",
+            json.dumps(body).encode(),
+            {"Content-Type": "application/json"}
+        )
+        urllib.request.urlopen(req, timeout=8)
+    except Exception as e:
+        print(f"Telegram send error: {e}")
+
+
+def handler(event, context):
+    """Main webhook handler.
+
+    Called by Telegram when:
+    - User sends a message
+    - User edits a message
+
+    Only accepts messages from allowed users.
+    Only wakes if instance is stopped (prevents duplicate wakes).
+    Dedup-protected via message_id tracking.
+    """
+
+    # === Validate webhook secret ===
+    headers = {k.lower(): v for k, v in (event.get('headers') or {}).items()}
+    secret = headers.get('x-telegram-bot-api-secret-token', '')
+
+    try:
+        expected_secret = get_param(WEBHOOK_SECRET_PARAM, decrypt=True)
+    except Exception:
+        return {'statusCode': 403, 'body': 'forbidden'}
+
+    if secret != expected_secret:
+        return {'statusCode': 403, 'body': 'forbidden'}
+
+    # === Parse Telegram update ===
+    try:
+        update = json.loads(event.get('body') or '{}')
+    except Exception:
+        return {'statusCode': 200, 'body': 'ok'}
+
+    # Handle both new messages and edited messages
+    message = update.get('message') or update.get('edited_message') or {}
+    chat_id = message.get('chat', {}).get('id')
+    user_id = str(message.get('from', {}).get('id', ''))
+    message_id = message.get('message_id')
+
+    if not chat_id or user_id not in ALLOWED_USERS:
+        return {'statusCode': 200, 'body': 'ok'}
+
+    # === Check instance state ===
+    try:
+        status = ec2.describe_instance_status(
+            InstanceIds=[INSTANCE_ID],
+            IncludeAllInstances=True
+        )['InstanceStatuses'][0]['InstanceState']['Name']
+    except Exception as e:
+        print(f"EC2 describe error: {e}")
+        return {'statusCode': 200, 'body': 'ok'}
+
+    # Load bot token for response
+    try:
+        bot_token = get_param(BOT_TOKEN_PARAM, decrypt=True)
+    except Exception as e:
+        print(f"Failed to load bot token: {e}")
+        return {'statusCode': 200, 'body': 'ok'}
+
+    # === Wake if stopped ===
+    if status == 'stopped':
+        # Dedup check: don't wake twice for the same message
+        try:
+            last_message_id = get_param(LAST_WAKE_ID_PARAM)
+        except Exception:
+            last_message_id = None
+
+        if str(message_id) != str(last_message_id):
+            # This is a new message, wake the instance
+            try:
+                ec2.start_instances(InstanceIds=[INSTANCE_ID])
+                send_telegram(bot_token, chat_id, "☕ Waking up... 
give me about 60 seconds.", message_id)
+                put_param(LAST_WAKE_ID_PARAM, message_id)
+            except Exception as e:
+                print(f"Start error: {e}")
+                send_telegram(bot_token, chat_id, f"⚠️ Failed to wake: {e}", message_id)
+
+    # Always return 503 to Telegram so messages stay queued until we process them on boot
+    # This ensures no messages are lost while instance is sleeping
+    return {'statusCode': 503, 'body': 'retry'}
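
Review note: the sender check in `webhook.py` depends on how the comma-separated `ALLOWED_USERS` value is parsed. In Python, `''.split(',')` returns `['']`, so a naive `set(value.split(','))` over an unset variable contains the empty string, and an update without a sender ID would then match it. A minimal sketch of a stricter parser follows; `parse_allowed_users` is a hypothetical helper name, not something defined in this PR.

```python
def parse_allowed_users(raw):
    """Parse a comma-separated list of Telegram user IDs into a set of strings.

    Hypothetical helper: whitespace is trimmed and empty entries are dropped,
    so an unset or empty variable authorizes nobody.
    """
    return {part.strip() for part in (raw or '').split(',') if part.strip()}


# The naive parse keeps an empty-string entry for an empty variable.
naive = set(''.split(','))
strict = parse_allowed_users('')
```

With this shape, `parse_allowed_users('123456789, 987654321,')` yields the two IDs only, and the authorization test `user_id not in ALLOWED_USERS` fails closed when no users are configured.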