Skip to content

Add teleop replay agent for non-interactive CI runs#5507

Open
rwiltz wants to merge 6 commits intoisaac-sim:developfrom
rwiltz:rwiltz/integrate-teleop-cicd
Open

Add teleop replay agent for non-interactive CI runs#5507
rwiltz wants to merge 6 commits intoisaac-sim:developfrom
rwiltz:rwiltz/integrate-teleop-cicd

Conversation

@rwiltz
Copy link
Copy Markdown
Contributor

@rwiltz rwiltz commented May 5, 2026

Description

Adds a permanent, decoupled CI entry-point for replaying captured teleop
sessions against an Isaac Lab environment. Replaces the runtime patch the
teleop-cicd pipeline currently applies to
scripts/environments/teleoperation/teleop_se3_agent.py so the
user-journey script is no longer mutated at CI time.

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@rwiltz rwiltz requested a review from ooctipus as a code owner May 5, 2026 22:22
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 5, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds a dedicated CI script (teleop_replay_agent.py) for replaying captured teleoperation sessions, along with an internal XCR replay driver module. The goal is to replace runtime patches that CI previously applied to the interactive teleop_se3_agent.py. The implementation is structurally sound but has several issues around error handling, missing module initialization, and potential race conditions.

Architecture Impact

  • New public script: scripts/environments/teleoperation/teleop_replay_agent.py becomes a CI entry point
  • New internal module: isaaclab_teleop.automation.xcr_replay - but the automation subpackage appears to lack an __init__.py, making imports fail
  • Dependencies: Relies on isaaclab.devices.openxr.remove_camera_configs and isaaclab.devices.teleop_device_factory.create_teleop_device - need to verify these exist in the codebase
  • Cross-module: Uses isaaclab_tasks.utils.parse_env_cfg which is a stable public API

Implementation Verdict

Significant concerns — Missing __init__.py will cause import failures, and several edge cases in the simulation loop need attention.

Test Coverage

The PR description claims tests were added ("I have added tests that prove my fix is effective or that my feature works") but no test files are included in the changeset. For a CI automation script, integration tests or at minimum a smoke test would be valuable. The XCR replay driver has complex async behavior that's difficult to test in isolation.

CI Status

No CI checks available yet — cannot verify the code actually runs in the target environment.

Findings

🔴 Critical: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py — Missing __init__.py for automation subpackage
The new automation subpackage has no __init__.py. The import at teleop_replay_agent.py:108:

from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replay

will raise ModuleNotFoundError: No module named 'isaaclab_teleop.automation'. You need to create source/isaaclab_teleop/isaaclab_teleop/automation/__init__.py that exports these symbols.

🔴 Critical: scripts/environments/teleoperation/teleop_replay_agent.py:141-145 — Unhandled exception silently breaks loop without cleanup

except Exception as e:
    logger.error(f"Error during simulation step: {e}")
    break

This catches ALL exceptions including KeyboardInterrupt (which inherits from BaseException, so actually won't be caught) and programming errors. The break exits the loop but:

  1. The async replay driver keeps running and will eventually call post_quit() on an already-closed env
  2. The exception traceback is lost (only message logged, not the full trace)
  3. No re-raise means CI will exit 0 on simulation errors

Consider: logger.exception("...") to preserve traceback, and potentially raise after cleanup or set a non-zero exit code.

🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:137-140 — action.repeat() may fail on 1D tensors

action = teleop_interface.advance()
...
actions = action.repeat(env.num_envs, 1)

If teleop_interface.advance() returns a 1D tensor of shape (action_dim,), then .repeat(env.num_envs, 1) produces shape (action_dim,) not (num_envs, action_dim). The correct pattern for replicating across envs is:

actions = action.unsqueeze(0).repeat(env.num_envs, 1)
# or
actions = action.expand(env.num_envs, -1)

Verify the expected shape from advance() — if it's already 2D with batch dim 1, the current code is fine.

🟡 Warning: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:105-107 — Polling internal module state is fragile

while xcr_player._xcr_playback_subscription is not None:
    logger.debug("XCR replay: waiting for playback subscription to clear")
    await asyncio.sleep(5)

Relying on the underscore-prefixed _xcr_playback_subscription internal variable is brittle — it could be renamed or removed in future Kit versions. The comment acknowledges this ("public-ish signal") but there's no timeout. If the subscription never clears (bug in xcr_player, crash, etc.), this loops forever. Add a timeout:

timeout = 3600  # 1 hour max
elapsed = 0
while xcr_player._xcr_playback_subscription is not None and elapsed < timeout:
    await asyncio.sleep(5)
    elapsed += 5
if elapsed >= timeout:
    logger.warning("XCR replay: timed out waiting for playback to complete")

🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:108-109 — asyncio.ensure_future without a running loop

asyncio.ensure_future(start_xcr_replay(XcrReplayConfig(...)))

This requires an asyncio event loop to already be running. Kit's main loop typically provides this, but if called before Kit's async infrastructure is fully initialized, this will raise RuntimeError: no running event loop. The script structure suggests this is called after AppLauncher setup, so it's likely fine, but adding defensive handling or a comment explaining the assumption would help maintainability.

🔵 Improvement: scripts/environments/teleoperation/teleop_replay_agent.py:134-135 — Missing render call when action is None may cause frame drops

if action is None:
    env.sim.render()
    continue

When action is None, you render but skip env.step(). This is correct for not advancing physics, but the simulation clock doesn't advance either. Over a 120-second warmup period, this could lead to thousands of render-only iterations. Consider adding a small sleep to avoid busy-spinning:

if action is None:
    env.sim.render()
    simulation_app.update()  # May already be implicit
    continue

🔵 Improvement: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:91 — Instantiating XCRReplayAPI() without storing reference

XCRReplayAPI()

This instantiates an object solely for its side effects (registering the replay backend). While this works, it's unclear to future readers. Add a comment or store in a variable with _ prefix to signal intentional discard:

_ = XCRReplayAPI()  # Side effect: registers replay backend

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR adds a permanent CI entry-point (teleop_replay_agent.py) for replaying captured teleop sessions against an Isaac Lab environment, replacing the runtime patch previously applied by the teleop-cicd pipeline to teleop_se3_agent.py. It also introduces a small isaaclab_teleop.automation subpackage backing the replay driver.

  • teleop_replay_agent.py: Builds the teleop env, attaches the legacy handtracking device, schedules the XCR replay coroutine via asyncio.ensure_future (retained in a strong-ref set with a done-callback to surface failures), wraps env lifecycle in try/finally, and gates env.step() behind START/STOP callbacks to avoid feeding zero-pose targets during the start-delay window.
  • xcr_replay.py: Drives Kit's OpenXR XCR backend; stores the XCRReplayAPI instance to prevent premature GC, uses a bounded poll loop (max_replay_duration_s) to avoid an indefinite hang if xcr_player._xcr_playback_subscription never clears, and calls post_quit() on completion or timeout.
  • automation/__init__.py: Adds the explicit subpackage __init__.py to make from isaaclab_teleop.automation import ... reliable.

Confidence Score: 5/5

This PR is safe to merge; the new replay agent correctly handles env cleanup, future error propagation, and zero-pose suppression during startup.

All substantive issues raised in prior review rounds have been addressed: the automation/init.py is present, the Future is retained with a done-callback, env cleanup is in try/finally, XCRReplayAPI is stored in a local, and the poll loop is bounded. No new defects were identified in this pass.

No files require special attention; xcr_replay.py relies on a private Kit attribute for completion detection, which is acknowledged in both code comments and the changelog.

Important Files Changed

Filename Overview
scripts/environments/teleoperation/teleop_replay_agent.py New CI entry-point for non-interactive teleop replay; correctly wraps env lifecycle in try/finally, retains Future in a strong-ref set with a done-callback, and guards the step loop behind START/STOP callbacks.
source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py XCR replay driver coroutine; addresses previously flagged issues (stored XCRReplayAPI reference, bounded poll timeout), but the poll loop's initial-state assumption on _xcr_playback_subscription is worth watching.
source/isaaclab_teleop/isaaclab_teleop/automation/init.py Adds the automation subpackage __init__.py, exporting XcrReplayConfig and start_xcr_replay; clean and minimal.
source/isaaclab_teleop/changelog.d/rwiltz-xcr-replay-agent.minor.rst Changelog fragment accurately describes the new script, the start-delay zero-pose fix, the Future hang fix, and the bounded-poll addition.

Sequence Diagram

sequenceDiagram
    participant Main as teleop_replay_agent.py
    participant Env as gym.Env
    participant Device as OpenXRDevice (handtracking)
    participant Replay as start_xcr_replay (asyncio)
    participant Kit as omni.kit / XCR backend

    Main->>Env: "gym.make(task, cfg=env_cfg)"
    Main->>Device: create_teleop_device(handtracking, callbacks)
    Main->>Env: env.reset() / teleop_interface.reset()
    Main->>Replay: asyncio.ensure_future(start_xcr_replay(cfg))
    Note over Replay: Future stored in _PENDING_REPLAY_TASKS

    loop simulation_app.is_running()
        Main->>Device: teleop_interface.advance()
        alt "teleop_active == True"
            Main->>Env: env.step(actions)
        else
            Main->>Env: env.sim.render()
        end
    end

    Replay->>Kit: configure XCR settings via carb.settings
    Replay->>Kit: XCRReplayAPI() stored in _replay_api
    Note over Replay: asyncio.sleep(start_delay_s)
    Replay->>Kit: start_replay_if_enabled()
    Replay->>Kit: EnabledXRProfile(profile_name)
    Kit-->>Device: OpenXR start event
    Device-->>Main: "teleop_active = True"

    loop poll every 5s up to max_replay_duration_s
        Replay->>Kit: check xcr_player._xcr_playback_subscription
    end

    Replay->>Kit: omni.kit.app.post_quit()
    Kit-->>Main: "simulation_app.is_running() = False"
    Main->>Env: env.close() via finally
    Main->>Kit: simulation_app.close()
Loading

Reviews (2): Last reviewed commit: "Merge branch 'develop' into rwiltz/integ..." | Re-trigger Greptile

Comment thread source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py
Comment thread scripts/environments/teleoperation/teleop_replay_agent.py
Comment thread scripts/environments/teleoperation/teleop_replay_agent.py
Comment thread source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py
Comment thread source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py
@rwiltz
Copy link
Copy Markdown
Contributor Author

rwiltz commented May 8, 2026

@greptile

@rwiltz rwiltz force-pushed the rwiltz/integrate-teleop-cicd branch from 22725cc to 4949cca Compare May 8, 2026 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants