Add teleop replay agent for non-interactive CI runs#5507
Add teleop replay agent for non-interactive CI runs#5507rwiltz wants to merge 6 commits intoisaac-sim:developfrom
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR adds a dedicated CI script (teleop_replay_agent.py) for replaying captured teleoperation sessions, along with an internal XCR replay driver module. The goal is to replace runtime patches that CI previously applied to the interactive teleop_se3_agent.py. The implementation is structurally sound but has several issues around error handling, missing module initialization, and potential race conditions.
Architecture Impact
- New public script:
scripts/environments/teleoperation/teleop_replay_agent.pybecomes a CI entry point - New internal module:
isaaclab_teleop.automation.xcr_replay- but theautomationsubpackage appears to lack an__init__.py, making imports fail - Dependencies: Relies on
isaaclab.devices.openxr.remove_camera_configsandisaaclab.devices.teleop_device_factory.create_teleop_device- need to verify these exist in the codebase - Cross-module: Uses
isaaclab_tasks.utils.parse_env_cfgwhich is a stable public API
Implementation Verdict
Significant concerns — Missing __init__.py will cause import failures, and several edge cases in the simulation loop need attention.
Test Coverage
The PR description claims tests were added ("I have added tests that prove my fix is effective or that my feature works") but no test files are included in the changeset. For a CI automation script, integration tests or at minimum a smoke test would be valuable. The XCR replay driver has complex async behavior that's difficult to test in isolation.
CI Status
No CI checks available yet — cannot verify the code actually runs in the target environment.
Findings
🔴 Critical: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py — Missing __init__.py for automation subpackage
The new automation subpackage has no __init__.py. The import at teleop_replay_agent.py:108:
from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replaywill raise ModuleNotFoundError: No module named 'isaaclab_teleop.automation'. You need to create source/isaaclab_teleop/isaaclab_teleop/automation/__init__.py that exports these symbols.
🔴 Critical: scripts/environments/teleoperation/teleop_replay_agent.py:141-145 — Unhandled exception silently breaks loop without cleanup
except Exception as e:
logger.error(f"Error during simulation step: {e}")
breakThis catches ALL exceptions including KeyboardInterrupt (which inherits from BaseException, so actually won't be caught) and programming errors. The break exits the loop but:
- The async replay driver keeps running and will eventually call
post_quit()on an already-closed env - The exception traceback is lost (only message logged, not the full trace)
- No re-raise means CI will exit 0 on simulation errors
Consider: logger.exception("...") to preserve traceback, and potentially raise after cleanup or set a non-zero exit code.
🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:137-140 — action.repeat() may fail on 1D tensors
action = teleop_interface.advance()
...
actions = action.repeat(env.num_envs, 1)If teleop_interface.advance() returns a 1D tensor of shape (action_dim,), then .repeat(env.num_envs, 1) produces shape (action_dim,) not (num_envs, action_dim). The correct pattern for replicating across envs is:
actions = action.unsqueeze(0).repeat(env.num_envs, 1)
# or
actions = action.expand(env.num_envs, -1)Verify the expected shape from advance() — if it's already 2D with batch dim 1, the current code is fine.
🟡 Warning: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:105-107 — Polling internal module state is fragile
while xcr_player._xcr_playback_subscription is not None:
logger.debug("XCR replay: waiting for playback subscription to clear")
await asyncio.sleep(5)Relying on the underscore-prefixed _xcr_playback_subscription internal variable is brittle — it could be renamed or removed in future Kit versions. The comment acknowledges this ("public-ish signal") but there's no timeout. If the subscription never clears (bug in xcr_player, crash, etc.), this loops forever. Add a timeout:
timeout = 3600 # 1 hour max
elapsed = 0
while xcr_player._xcr_playback_subscription is not None and elapsed < timeout:
await asyncio.sleep(5)
elapsed += 5
if elapsed >= timeout:
logger.warning("XCR replay: timed out waiting for playback to complete")🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:108-109 — asyncio.ensure_future without a running loop
asyncio.ensure_future(start_xcr_replay(XcrReplayConfig(...)))This requires an asyncio event loop to already be running. Kit's main loop typically provides this, but if called before Kit's async infrastructure is fully initialized, this will raise RuntimeError: no running event loop. The script structure suggests this is called after AppLauncher setup, so it's likely fine, but adding defensive handling or a comment explaining the assumption would help maintainability.
🔵 Improvement: scripts/environments/teleoperation/teleop_replay_agent.py:134-135 — Missing render call when action is None may cause frame drops
if action is None:
env.sim.render()
continueWhen action is None, you render but skip env.step(). This is correct for not advancing physics, but the simulation clock doesn't advance either. Over a 120-second warmup period, this could lead to thousands of render-only iterations. Consider adding a small sleep to avoid busy-spinning:
if action is None:
env.sim.render()
simulation_app.update() # May already be implicit
continue🔵 Improvement: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:91 — Instantiating XCRReplayAPI() without storing reference
XCRReplayAPI()This instantiates an object solely for its side effects (registering the replay backend). While this works, it's unclear to future readers. Add a comment or store in a variable with _ prefix to signal intentional discard:
_ = XCRReplayAPI() # Side effect: registers replay backend
Greptile SummaryThis PR adds a permanent CI entry-point (
Confidence Score: 5/5This PR is safe to merge; the new replay agent correctly handles env cleanup, future error propagation, and zero-pose suppression during startup. All substantive issues raised in prior review rounds have been addressed: the automation/init.py is present, the Future is retained with a done-callback, env cleanup is in try/finally, XCRReplayAPI is stored in a local, and the poll loop is bounded. No new defects were identified in this pass. No files require special attention; xcr_replay.py relies on a private Kit attribute for completion detection, which is acknowledged in both code comments and the changelog. Important Files Changed
Sequence DiagramsequenceDiagram
participant Main as teleop_replay_agent.py
participant Env as gym.Env
participant Device as OpenXRDevice (handtracking)
participant Replay as start_xcr_replay (asyncio)
participant Kit as omni.kit / XCR backend
Main->>Env: "gym.make(task, cfg=env_cfg)"
Main->>Device: create_teleop_device(handtracking, callbacks)
Main->>Env: env.reset() / teleop_interface.reset()
Main->>Replay: asyncio.ensure_future(start_xcr_replay(cfg))
Note over Replay: Future stored in _PENDING_REPLAY_TASKS
loop simulation_app.is_running()
Main->>Device: teleop_interface.advance()
alt "teleop_active == True"
Main->>Env: env.step(actions)
else
Main->>Env: env.sim.render()
end
end
Replay->>Kit: configure XCR settings via carb.settings
Replay->>Kit: XCRReplayAPI() stored in _replay_api
Note over Replay: asyncio.sleep(start_delay_s)
Replay->>Kit: start_replay_if_enabled()
Replay->>Kit: EnabledXRProfile(profile_name)
Kit-->>Device: OpenXR start event
Device-->>Main: "teleop_active = True"
loop poll every 5s up to max_replay_duration_s
Replay->>Kit: check xcr_player._xcr_playback_subscription
end
Replay->>Kit: omni.kit.app.post_quit()
Kit-->>Main: "simulation_app.is_running() = False"
Main->>Env: env.close() via finally
Main->>Kit: simulation_app.close()
Reviews (2): Last reviewed commit: "Merge branch 'develop' into rwiltz/integ..." | Re-trigger Greptile |
|
@greptile |
22725cc to
4949cca
Compare
Description
Adds a permanent, decoupled CI entry-point for replaying captured teleop
sessions against an Isaac Lab environment. Replaces the runtime patch the
teleop-cicdpipeline currently applies toscripts/environments/teleoperation/teleop_se3_agent.pyso theuser-journey script is no longer mutated at CI time.
Fixes # (issue)
Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there