HEIST: Hierarchical Environment for Interdependent Sequential Tasks

HEIST is a PettingZoo-compliant Multi-Agent Reinforcement Learning (MARL) benchmark engineered to stress-test cooperative credit assignment. Agents navigate the physical space in parallel, yet their objectives remain strictly gated by sequential causal dependencies and dynamic partial observability. This combination isolates and exposes the failure modes of standard cooperative MARL algorithms, both value-decomposition methods such as QMIX and centralized-critic policy-gradient methods such as MAPPO.

Causal Credit Dilution

Standard cooperative MARL algorithms (such as QMIX and MAPPO) train on a shared reward, which implicitly assumes that agents contribute to the team outcome simultaneously. HEIST breaks that assumption and produces an effect this project calls Causal Credit Dilution. Agents navigate the environment in parallel, but their interactions with objectives are bound by strict sequential dependencies (e.g., the Extractor cannot secure the loot until the Hacker disables the terminal). Consequently, if a downstream agent fails late in the episode, the shared negative reward propagates backward and dilutes the credit that upstream agents earned by executing their prerequisites flawlessly.
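The dilution effect can be illustrated with a small arithmetic sketch. The two reward constants come from this README; the step numbers, the discount factor, and the helper name `discounted_return` are assumptions chosen only for illustration, and the sketch is not taken from the HEIST source code.

```python
# Reward constants stated in this README.
STEP_PENALTY = -0.01   # shared time penalty per step
ALARM_PENALTY = -10.0  # shared reward when the alarm triggers

# Assumed episode timeline (hypothetical numbers).
HACK_STEP = 20    # the Hacker disables the terminal correctly here
ALARM_STEP = 60   # the Extractor is caught by a guard here


def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence, accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Shared reward stream from the moment of the hack until the alarm.
shared_rewards = [STEP_PENALTY] * (ALARM_STEP - HACK_STEP) + [ALARM_PENALTY]

# Every agent is trained on the same shared stream, so the return attached to
# the Hacker's correct action equals the return attached to the Extractor's
# failed trajectory.
returns = {
    agent: discounted_return(shared_rewards)
    for agent in ("hacker", "extractor")
}

for agent, value in returns.items():
    print(f"{agent}: return from step {HACK_STEP} = {value:.3f}")
```

Under these assumptions the Hacker's flawless prerequisite is labeled with the same strongly negative return as the Extractor's failure, which is the signal a shared-reward learner has to untangle.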

Environment Architecture

To preserve the Markov Property under strict causal gating, HEIST abandons standard flat observation arrays in favor of a multi-tensor Dict space for each agent:

  1. observation (5x5 matrix): The agent's local physical view, restricted by a dynamic Bresenham-raycast Fog of War.
  2. action_mask (6-element vector): Encodes the sequential causal gates. Downstream actions (e.g., extracting loot) are masked out until their upstream dependencies are resolved, so the policy network does not waste gradient updates on impossible actions.
  3. global_state (4-element vector): A global broadcast of the heist's phase (Terminal status, Loot status, Alarm status, Step count). Without it, downstream agents would see their action masks change for no observable reason, which would make the environment appear non-stationary and violate the Markov Property.
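The three entries above can be sketched as a plain dictionary. The key names and vector sizes follow this README; the action layout (`INTERACT` at index 5), the role strings, the `at_objective` flag, and the function names are hypothetical, since the README does not specify them.

```python
# Hypothetical action layout; the README only states that the mask has 6 elements.
NOOP, UP, DOWN, LEFT, RIGHT, INTERACT = range(6)


def action_mask(role, terminal_disabled, loot_secured, at_objective):
    """Return a 6-element 0/1 mask; movement is always legal in this sketch."""
    mask = [1] * 6
    if role == "hacker":
        # Upstream action: available until the terminal is disabled.
        can_interact = at_objective and not terminal_disabled
    elif role == "extractor":
        # Downstream action: gated on the Hacker's prerequisite.
        can_interact = at_objective and terminal_disabled and not loot_secured
    else:
        can_interact = False
    mask[INTERACT] = int(can_interact)
    return mask


def build_observation(local_view, role, terminal_disabled, loot_secured,
                      alarm, step, at_objective):
    """Assemble the per-agent Dict observation described above."""
    return {
        "observation": local_view,  # 5x5 local view after fog of war
        "action_mask": action_mask(role, terminal_disabled,
                                   loot_secured, at_objective),
        # Terminal status, Loot status, Alarm status, Step count.
        "global_state": [int(terminal_disabled), int(loot_secured),
                         int(alarm), step],
    }


empty_view = [[0] * 5 for _ in range(5)]
obs = build_observation(empty_view, "extractor", terminal_disabled=False,
                        loot_secured=False, alarm=False, step=10,
                        at_objective=True)
print(obs["action_mask"], obs["global_state"])
```

In this sketch the Extractor's `INTERACT` entry flips from 0 to 1 exactly when the first element of `global_state` flips, so the cause of the mask change is visible in the observation itself.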

Adversary and Loss Conditions

To prevent the environment from degrading into a trivial pathfinding task, HEIST introduces two opposing adversarial pressures. First, Rule-Based Adversaries (Guards) execute dynamic random-walk patrols. If a guard's Manhattan distance to any agent closes to $\le 1$, a global alarm triggers, terminating the episode with a catastrophic -10.0 shared reward. Second, a constant time penalty (-0.01 per step) acts as a baseline bleed. This dual-pressure system prevents policy collapse: agents cannot safely sprint blindly to objectives, nor can they exploit a "hide-and-wait" policy to avoid the guards entirely.
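A minimal sketch of the two loss pressures follows. The distance threshold and both reward values are taken from the paragraph above; the function names, the `(row, col)` position format, and the choice to return only the alarm penalty on the terminating step are assumptions, not the repository's implementation.

```python
STEP_PENALTY = -0.01   # constant time penalty per step
ALARM_PENALTY = -10.0  # shared reward when the alarm triggers
ALARM_RADIUS = 1       # Manhattan distance at which a guard spots an agent


def manhattan(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def alarm_triggered(guards, agents, radius=ALARM_RADIUS):
    """True if any guard is within `radius` of any agent."""
    return any(manhattan(g, a) <= radius for g in guards for a in agents)


def shared_step_reward(guards, agents):
    """Return (shared_reward, terminated) for one environment step."""
    if alarm_triggered(guards, agents):
        # Assumption: the alarm penalty replaces the time penalty on this step.
        return ALARM_PENALTY, True
    return STEP_PENALTY, False


guards = [(2, 2)]
print(shared_step_reward(guards, agents=[(5, 5), (0, 0)]))  # no guard nearby
print(shared_step_reward(guards, agents=[(2, 3), (0, 0)]))  # adjacent to a guard
```

Because the time penalty accrues on every step, an agent that hides indefinitely still loses reward, while an agent that rushes risks the much larger alarm penalty.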

Installation and Developer Tooling

uv must be installed before running this repository.

# clone
git clone https://github.com/saejin-moon/heist.git
cd heist
# create environment
uv venv .venv
# activate environment
source .venv/bin/activate
# install
uv pip install numpy pettingzoo gymnasium pygame
# run
uv run src/manual_control.py

Note: Press 1, 2, 3, or 4 in the manual loop to hot-swap which agent you are driving. The terminal will print the active agent's observation matrix and reward delta on every step.
