High-performance, hot-swappable distributed actor mesh for the modern era.
Features • Architecture • Installation • Usage • Distributed Mesh
Iris is a distributed actor-model runtime built in Rust with deep Python and Node.js integration. It is designed for systems that require the extreme concurrency of Erlang, the raw speed of Rust, and the flexibility of high-level scripting languages.
Unlike standard message queues, Iris implements a cooperative reduction-based scheduler. This allows the runtime to manage millions of actors with microsecond latency, providing built-in fault tolerance, hot code swapping, and location-transparent messaging across a global cluster.
Iris supports two distinct actor patterns:
- Push Actors (Green Threads): Extremely lightweight. Rust "pushes" messages to a callback only when data arrives.
- Pull Actors (OS Threads): Specialized blocking actors that "pull" messages from a Mailbox.
  - Python: Runs in a dedicated OS thread, blocking on `recv()` (releasing the GIL).
Inspired by the BEAM (Erlang VM), Iris uses a Cooperative Reduction Scheduler to ensure system fairness.
- Fairness: Every actor is assigned a "reduction budget". The message processing loop tracks this budget and explicitly yields control back to the Tokio runtime via `yield_now()` once the limit is reached.
- No Starvation: This prevents high-throughput actors from monopolizing a CPU core, ensuring all actors in the mesh get a chance to process their mailboxes.
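The budget-and-yield loop can be sketched in plain Python, with asyncio standing in for Tokio. This is a minimal illustration of the scheduling idea only: the queue-based mailbox and the `budget` parameter here are stand-ins, not the Iris API.

```python
import asyncio

async def actor_loop(name, mailbox, budget, log):
    # Drain the mailbox, yielding to the scheduler every `budget` messages
    reductions = 0
    while not mailbox.empty():
        log.append((name, mailbox.get_nowait()))
        reductions += 1
        if reductions >= budget:
            reductions = 0
            await asyncio.sleep(0)  # cooperative yield, like Tokio's yield_now()

async def main():
    log = []
    mailboxes = {}
    for name in ("a", "b"):
        q = asyncio.Queue()
        for i in range(4):
            q.put_nowait(i)
        mailboxes[name] = q
    # Both actors share one scheduler thread; the budget forces interleaving
    await asyncio.gather(*(actor_loop(n, q, 2, log) for n, q in mailboxes.items()))
    return log

log = asyncio.run(main())
# Neither actor drains its whole mailbox before the other gets to run
print([name for name, _ in log])
```

Without the yield, whichever actor ran first would process all of its messages before the other ever executed; the budget caps how long any one actor can hold the core.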
Update your application logic while it is running.
- Zero Downtime: Swap out a Python handler or a Node.js function in memory without stopping the actor or losing its mailbox state.
- Safe Transition: Future messages are processed by the new logic immediately; messages already in flight finish processing under the old logic.
Actors are first-class citizens of the network.
- Name Registry: Register actors with human-readable strings (e.g., `"auth-provider"`) using `register`/`unregister`, and look them up via `whereis`.
- Async Discovery: Resolve remote service PIDs using native `await` (Python) or Promises (Node.js) without blocking the runtime.
- Location Transparency: Send messages to actors whether they reside on the local CPU or a server across the globe.
Built-in fault tolerance modeled after the "Let it Crash" philosophy.
- Heartbeat Monitoring: The mesh automatically sends `PING`/`PONG` signals (`0x02`/`0x03`) to detect silent failures (e.g., GIL freezes or half-open TCP connections).
- Structured System Messages: Actor exits, hot-swaps, and heartbeats are delivered as System Messages, providing rich context for supervisor logic.
- Self-Healing Factories: Define closures that automatically re-resolve and restart connections when a remote node comes back online.
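The self-healing pattern can be sketched as a closure that caches a remote PID and re-resolves it on failure. This is an illustration only: `resolve` and `send` below are hypothetical stand-ins for the binding's resolution and remote-send calls (which are async in the real API), and the retry/backoff policy is an assumption, not Iris behavior.

```python
import time

def make_self_healing_sender(resolve, send, addr, name, retries=3, backoff=0.1):
    # `resolve(addr, name)` and `send(addr, pid, payload)` are hypothetical
    # stand-ins for the runtime's remote-resolution and remote-send calls.
    cached = {"pid": None}

    def sender(payload: bytes) -> bool:
        for attempt in range(retries):
            if cached["pid"] is None:
                cached["pid"] = resolve(addr, name)  # re-resolve after a failure
            if cached["pid"] is not None and send(addr, cached["pid"], payload):
                return True
            cached["pid"] = None  # stale PID: drop it and retry
            time.sleep(backoff * (attempt + 1))
        return False

    return sender
```

When the remote node comes back online under a new PID, the next failed send discards the stale cache and resolution happens transparently on the following attempt.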
Iris actors are extremely lightweight, but the implementation differs by type:
- Push Actors: Purely state-machine driven. They consume ~2KB of RAM and exist only as futures in the Tokio runtime.
- Pull Actors (Python): Allocated a dedicated stack and OS thread (via Tokio's blocking pool). They are heavier but allow for straightforward, blocking, synchronous logic without "colored functions" (async/await).
Iris uses a custom length-prefixed binary protocol over TCP for inter-node communication.
| Packet Type | Function | Payload Structure |
|---|---|---|
| `0x00` | User Message | `[PID: u64][LEN: u32][DATA: Bytes]` |
| `0x01` | Resolve Request | `[LEN: u32][NAME: String]` → Returns `[PID: u64]` |
| `0x02` | Heartbeat (Ping) | `[Empty]` — Probe remote node health |
| `0x03` | Heartbeat (Pong) | `[Empty]` — Acknowledge health |
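As a concrete illustration of the `0x00` frame layout, here is a sketch in Python. The table does not specify byte order or whether the packet-type byte is part of the frame, so the big-endian encoding and the leading type byte below are assumptions:

```python
import struct

TYPE_USER_MESSAGE = 0x00

def encode_user_message(pid: int, data: bytes) -> bytes:
    # [TYPE: u8][PID: u64][LEN: u32][DATA: Bytes], big-endian (assumed)
    return struct.pack(">BQI", TYPE_USER_MESSAGE, pid, len(data)) + data

def decode_user_message(frame: bytes):
    ptype, pid, length = struct.unpack_from(">BQI", frame)
    if ptype != TYPE_USER_MESSAGE:
        raise ValueError(f"unexpected packet type: {ptype:#04x}")
    header = struct.calcsize(">BQI")  # 1 + 8 + 4 = 13 bytes
    return pid, frame[header:header + length]
```

The length prefix lets a receiver read the fixed-size header first, then read exactly `LEN` payload bytes off the socket without any delimiter scanning.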
Iris bridges the gap between Rust’s memory safety and dynamic languages using PyO3 (Python) and N-API (Node.js):
- Membrane Hardening: The runtime uses `block_in_place` and `ThreadSafeFunction` queues to safely handle synchronous calls from within asynchronous Rust contexts.
- GIL Management: Python `recv()` calls release the GIL, allowing other threads to run in parallel while the actor waits for messages.
- Atomic RwLocks: Actor behaviors are protected by atomic pointer swaps behind an `RwLock`, making hot-swapping thread-safe.
- Rust 1.70+
- Python 3.8+ OR Node.js 14+
- Maturin (for Python) / NAPI-RS (for Node)
```bash
# Clone the repository
git clone https://github.com/SSL-ACTX/iris.git
cd iris

# Build and install the Python extension
maturin develop --release
```
```bash
# Clone the repository
git clone https://github.com/SSL-ACTX/iris.git
cd iris

# Build the N-API binding
npm install
npm run build
```
Iris provides a unified API across both supported languages.
Use spawn for maximum throughput (100k+ actors). Rust owns the scheduling and only invokes the guest language when a message arrives.
```python
import iris

rt = iris.Runtime()

def fast_worker(msg):
    print(f"Processed: {msg}")

# Spawn 1000 workers instantly (Green Threads)
for _ in range(1000):
    rt.spawn(fast_worker, budget=50)
```

```javascript
const { NodeRuntime } = require('./index.js');
const rt = new NodeRuntime();

const fastWorker = (msg) => {
  // msg is a Buffer
  console.log(`Processed: ${msg.toString()}`);
};

for (let i = 0; i < 1000; i++) {
  rt.spawn(fastWorker, 50);
}
```

Use `spawn_with_mailbox` for complex logic where you want to block and wait for specific messages. No async/await is required in Python.
```python
# Runs in a dedicated OS thread. Blocking is safe.
def saga_coordinator(mailbox):
    # Blocks thread, releases GIL
    msg = mailbox.recv()
    print("Starting Saga...")

    # Wait up to 5 seconds for next message
    confirm = mailbox.recv(timeout=5.0)
    if confirm:
        print("Confirmed")
    else:
        print("Timed out")

rt.spawn_with_mailbox(saga_coordinator, budget=100)
```

```javascript
const sagaCoordinator = async (mailbox) => {
  const msg = await mailbox.recv();
  console.log("Starting Saga...");

  // Wait up to 5 seconds
  const confirm = await mailbox.recv(5.0);
  if (confirm) {
    console.log("Confirmed");
  } else {
    console.log("Timed out");
  }
};

rt.spawnWithMailbox(sagaCoordinator, 100);
```

Actors can now be spawned with a parent PID. When the parent exits (normally or via a crash), all of its direct children are automatically stopped as well. This mirrors the behaviour of many functional runtimes and makes it easy to manage lifetimes for short-lived helper tasks.
```rust
let rt = Runtime::new();
let parent = rt.spawn_actor(|mut rx| async move { /* ... */ });
let child = rt.spawn_child(parent, |mut rx| async move { /* will die with parent */ });

rt.send(parent, Message::User(Bytes::from("quit"))).unwrap();
// After the parent exits, the child mailbox is closed as well
assert!(!rt.is_alive(child));
```

```python
rt = iris.Runtime()
parent = rt.spawn(lambda msg: print("parent got", msg))
child = rt.spawn_child(parent, lambda msg: print("child got", msg))

rt.send(parent, b"quit")
import time; time.sleep(0.1)
assert not rt.is_alive(child)
```

There are three variants of the API:
- `spawn_child(parent, handler)` – mailbox-based actor.
- `spawn_child_with_budget(parent, handler, budget)` – same, but with a reduction budget.
- `spawn_child_handler_with_budget(parent, handler, budget)` – message-style handler (used by the Python/Node wrappers).
Python and Node bindings expose matching helpers (`spawn_child`, `spawn_child_with_mailbox`, etc.) which accept the same arguments as their non-child counterparts, plus the parent PID.
Note
Network hardening: the underlying TCP protocol now imposes a 1 MiB payload ceiling, per-operation timeouts, and detailed logging. Malformed or oversized messages are dropped rather than crashing the node, and remote resolution/send operations fail fast instead of hanging indefinitely.
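A minimal sketch of the drop-rather-than-crash policy (the constant matches the 1 MiB ceiling above; the function name is hypothetical, not part of the Iris API):

```python
MAX_PAYLOAD = 1024 * 1024  # 1 MiB payload ceiling

def accept_frame(declared_len: int) -> bool:
    # Reject the frame before allocating a buffer for it, so a malformed
    # length prefix cannot exhaust memory or crash the node.
    return 0 <= declared_len <= MAX_PAYLOAD
```

The key design point is ordering: validate the declared length against the ceiling before trusting it for an allocation or a blocking read.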
```python
# 1. Register a local actor
pid = rt.spawn(my_handler)
rt.register("auth_worker", pid)

# 2. Look it up later (Local)
target = rt.whereis("auth_worker")

# 3. Look it up remotely (Network)
async def find_remote():
    addr = "192.168.1.5:9000"
    # Non-blocking resolution
    remote_pid = await rt.resolve_remote_py(addr, "auth_worker")
    if remote_pid:
        rt.send_remote(addr, remote_pid, b"login")
```

```javascript
// 1. Register
const pid = rt.spawn(myHandler);
rt.register("auth_worker", pid);

// 2. Resolve Remote
async function findAndQuery() {
  const addr = "192.168.1.5:9000";
  const targetPid = await rt.resolveRemote(addr, "auth_worker");
  if (targetPid) {
    rt.sendRemote(addr, targetPid, Buffer.from("login"));
  }
}
```

Iris supports hierarchical path registrations (e.g. `/svc/payment/processor`) and allows you to create a supervisor that is scoped to a path prefix. This is useful for grouping related actors and applying supervision policies per-service or per-tenant.
Key Python APIs:
- `rt.create_path_supervisor(path)` — create a per-path supervisor instance.
- `rt.path_supervisor_watch(path, pid)` — register an actor PID with the path supervisor.
- `rt.path_supervisor_children(path)` — list PIDs currently supervised for the path.
- `rt.remove_path_supervisor(path)` — remove the path supervisor.
- `rt.spawn_with_path_observed(budget, path)` — spawn and register an observed actor under `path` (useful for testing/monitoring).
Python example:
```python
rt = iris.Runtime()

# spawn an observed actor and register it under a hierarchical path
pid = rt.spawn_with_path_observed(10, "/svc/test/one")

# create a supervisor for the '/svc/test' prefix and register the pid
rt.create_path_supervisor("/svc/test")
rt.path_supervisor_watch("/svc/test", pid)

# inspect supervised children
children = rt.path_supervisor_children("/svc/test")
print(children)  # [pid]

# remove supervisor when done
rt.remove_path_supervisor("/svc/test")
```

This mechanism makes it easy to apply restart strategies or monitoring rules to logical groups of actors without affecting the global supervisor.
```python
messages = rt.get_messages(observer_pid)
for msg in messages:
    if isinstance(msg, iris.PySystemMessage):
        if msg.type_name == "EXIT":
            print(f"Actor {msg.target_pid} has crashed!")
```

```javascript
// Node wrappers return wrapped objects { data: Buffer, system: Object }
const messages = rt.getMessages(observerPid);
messages.forEach(msg => {
  if (msg.system) {
    if (msg.system.typeName === "EXIT") {
      console.log(`Actor ${msg.system.targetPid} has crashed!`);
    }
  } else {
    console.log(`User Data: ${msg.data.toString()}`);
  }
});
```

```python
def behavior_a(msg): print("Logic A")
def behavior_b(msg): print("Logic B (Upgraded!)")

pid = rt.spawn(behavior_a, budget=10)
rt.send(pid, b"test")  # Prints "Logic A"

rt.hot_swap(pid, behavior_b)
rt.send(pid, b"test")  # Prints "Logic B (Upgraded!)"
```

Iris exposes lightweight mailbox introspection and actor-local timers so guest languages can inspect queue sizes and schedule timed messages.
- Rust: `Runtime::mailbox_size(pid: u64) -> Option<usize>` returns the number of user messages queued for `pid` (excludes system messages).
- Python: `rt.mailbox_size(pid)` mirrors the Rust API and returns `None` if the PID is unknown.
Python example:
```python
size = rt.mailbox_size(pid)
print(f"Mailbox size for {pid}: {size}")
```

You can schedule one-shot or repeating messages to an actor's mailbox. Timers are cancellable via an id returned at creation.
- Rust APIs: `send_after(pid, delay_ms, payload)`, `send_interval(pid, interval_ms, payload)`, `cancel_timer(timer_id)`
- Python: `rt.send_after(pid, ms, b'data')`, `rt.send_interval(pid, ms, b'data')`, `rt.cancel_timer(timer_id)`
Python example (one-shot):
```python
timer_id = rt.send_after(pid, 200, b'tick')  # send 'tick' after 200ms

# cancel if needed
rt.cancel_timer(timer_id)
```

Python example (repeating):

```python
timer_id = rt.send_interval(pid, 1000, b'heartbeat')  # every 1s

# stop later
rt.cancel_timer(timer_id)
```
Common ExitReason variants:
- `Normal` — actor finished cleanly.
- `Killed` — requested shutdown.
- `Panic` — runtime detected a panic; `ExitInfo` may include panic metadata.
- `Crash` — user code returned an unrecoverable error.
Python example receiving exit info:
```python
for msg in rt.get_messages(supervisor_pid):
    if isinstance(msg, iris.PySystemMessage) and msg.type_name == 'EXIT':
        print('from:', msg.from_pid)
        print('target:', msg.target_pid)
        print('reason:', msg.reason)      # e.g. 'Normal' | 'Panic' | 'Killed'
        print('metadata:', msg.metadata)  # optional dict/bytes with extra info
```

Node.js receives system objects with the same fields: `fromPid`, `targetPid`, `reason`, and optional `metadata`.
In addition to the environment variables documented above, the Python runtime exposes programmatic setters:
- `rt.set_release_gil_limits(max_threads: int, gil_pool_size: int)` — set the per-process cap and fallback pool size at runtime.
- `rt.set_release_gil_strict(strict: bool)` — when `true`, a `spawn(..., release_gil=True)` will return an error if the dedicated-thread cap is reached instead of falling back to the shared pool.
By default, all Iris mailboxes are unbounded. For some high-throughput workloads you may want a fixed capacity with a drop-new policy — useful for rate-limiting fast producers. Use `Runtime.spawn_py_handler_bounded` in Python or `runtime.spawn_bounded` in Node and specify the mailbox capacity (in messages).
Python example:
```python
from iris import Runtime

rt = Runtime()

# handler simply prints messages
pid = rt.spawn_py_handler_bounded(lambda m: print('got', m), budget=100, capacity=2)

# first two sends succeed
assert rt.send(pid, b'one')
assert rt.send(pid, b'two')

# third message is dropped and send returns False
assert not rt.send(pid, b'three')
```

You can control whether push-based Python actors run their callbacks on a blocking thread that acquires the GIL (avoiding holding the GIL on the async worker) using the `release_gil` flag on `Runtime.spawn`:
```python
from iris import Runtime
import time

rt = Runtime()

def handler_no(msg):
    print('no release', __import__('threading').get_ident())

def handler_yes(msg):
    print('release', __import__('threading').get_ident())

pid_no = rt.spawn(handler_no, budget=10, release_gil=False)
pid_yes = rt.spawn(handler_yes, budget=10, release_gil=True)

rt.send(pid_no, b'ping')
rt.send(pid_yes, b'ping')
time.sleep(0.2)
```

When `release_gil=True` the handler runs inside a `spawn_blocking` worker that acquires the GIL; `release_gil=False` keeps the previous behavior (the callback runs directly while holding the GIL on the async worker).
Iris supports hierarchical path registrations (for example /system/service/one) so you can group, query and supervise actors by logical paths.
Python APIs (examples): `rt.register_path(path, pid)`, `rt.unregister_path(path)`, `rt.whereis_path(path)`, `rt.list_children(prefix)`, `rt.list_children_direct(prefix)`, `rt.watch_path(prefix)`, `rt.spawn_with_path_observed(budget, path)`, `rt.child_pids()`, `rt.children_count()`.
Python example:
```python
from iris import Runtime

rt = Runtime()

# Spawn and register
pid = rt.spawn(lambda m: None, 10)
rt.register_path("/system/service/one", pid)

print(rt.whereis_path("/system/service/one"))
print(rt.list_children("/system/service"))
print(rt.list_children_direct("/system"))

# Shallow watch: registers current direct children with the supervisor
rt.watch_path("/system/service")
print(rt.child_pids(), rt.children_count())
```

Node.js example (conceptual):

```javascript
const { NodeRuntime } = require('./index.js');
const rt = new NodeRuntime();

const pid = rt.spawn(myHandler, 10);
rt.registerPath('/system/service/one', pid);

const children = rt.listChildren('/system/service');
```

Notes:
- `list_children` returns all descendant registrations under a prefix.
- `list_children_direct` returns only immediate children one level below the prefix.
- `watch_path` performs a shallow registration of direct children with the supervisor — path-scoped supervisors are planned as a next step.
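The difference between the two listing calls can be illustrated with a plain dict standing in for the runtime's registry (the `/`-separated path convention follows the examples above; this is a sketch of the semantics, not the implementation):

```python
# A plain dict standing in for the runtime's path registry
registry = {
    "/system/service/one": 1,
    "/system/service/two": 2,
    "/system/other": 3,
}

def list_children(prefix):
    # All descendant registrations under the prefix, at any depth
    return {p: pid for p, pid in registry.items() if p.startswith(prefix + "/")}

def list_children_direct(prefix):
    # Only entries exactly one level below the prefix
    return {p: pid for p, pid in registry.items()
            if p.startswith(prefix + "/") and "/" not in p[len(prefix) + 1:]}
```

Under `/system`, `list_children` sees all three registrations, while `list_children_direct` sees only `/system/other`, because the `service` entries sit two levels down.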
- Linux / macOS / Android (Termux): Fully supported. High performance via the multi-threaded Tokio runtime. Note: Android builds require the NDK or a local clang configuration.
- Windows: Supported. Ensure the latest Microsoft C++ Build Tools are installed for PyO3/N-API compilation.

Important
Production Status: Iris is currently in Alpha.
- Performance Metrics (v0.3.0):
  - Push Actors: Validated to scale to 100k+ concurrent actors with message throughput exceeding ~1.2M msgs/sec on modern cloud vCPUs and ~409k msgs/sec on single-core legacy hardware.
  - Pull Actors: High-performance threaded actors supporting 100k+ concurrent instances with throughput reaching ~1.5M msgs/sec, demonstrating scaling well beyond traditional thread-pool limitations.
  - Hot-Swapping: Logic upgrades validated at ~136k swaps/sec while maintaining active message processing.
- Notice: The binary protocol is subject to change.
- Stability: Always use the `Supervisor` for critical actor lifecycles to ensure automatic recovery and location transparency.
Author: Seuriin (SSL-ACTX)
v0.3.0