refactor: replace iptables with nftables for sandbox and VM networking #1335

@russellb

Problem Statement

OpenShell uses legacy iptables for network rule management in two subsystems: sandbox bypass detection (inside the sandbox network namespace) and VM driver host-side NAT/forwarding. iptables has been superseded by nftables since kernel 3.13 (2014), and most modern distros now ship nftables as the default backend. Migrating to nftables directly simplifies the codebase (eliminates the xt_extensions probe and iptables-legacy fallback), enables atomic ruleset installation (all rules applied in a single transaction vs. 7+ separate process spawns), and aligns with the direction of the Linux networking ecosystem.

Technical Context

iptables rules are managed by shelling out to the iptables/ip6tables CLI via std::process::Command. There is no Rust iptables library in use. The sandbox supervisor installs OUTPUT chain rules inside the sandbox network namespace for bypass detection and logging. The VM driver installs NAT, FORWARD, and INPUT rules on the host for TAP-based VM networking. Both sites use the same execution pattern (explicit argument arrays, no shell interpolation) but have independent code paths with no shared abstraction.

The custom VM kernel config already enables nftables kernel modules (NF_TABLES, NFT_CT, NFT_NAT, NFT_MASQ, etc.) — these were added for kube-proxy nftables mode.

Affected Components

| Component | Key Files | Role |
|---|---|---|
| Sandbox supervisor | crates/openshell-sandbox/src/sandbox/linux/netns.rs | Installs iptables OUTPUT chain rules for bypass detection inside the sandbox netns |
| Bypass monitor | crates/openshell-sandbox/src/bypass_monitor.rs | Parses iptables LOG entries from /dev/kmsg to detect bypass attempts |
| Sandbox orchestration | crates/openshell-sandbox/src/lib.rs | Calls install_bypass_rules() and spawns bypass monitor |
| VM driver | crates/openshell-driver-vm/src/runtime.rs | Host-side NAT MASQUERADE, FORWARD, and INPUT rules for TAP networking |
| VM kernel config | crates/openshell-driver-vm/runtime/kernel/openshell.kconfig | Kernel module config for iptables and nftables |
| OCSF events | crates/openshell-ocsf/src/events/network_activity.rs, objects/firewall_rule.rs, format/shorthand.rs | References "iptables" as firewall engine name |
| BYOC example | examples/bring-your-own-container/Dockerfile | Installs iptables package |
| Documentation | crates/openshell-driver-podman/NETWORKING.md, docs/security/best-practices.mdx | iptables references |

Technical Investigation

Architecture Overview

                        Host Network
                            |
                    ┌───────┴──────────┐
                    │                  │
         VM Driver (TAP)      Container Drivers
         iptables on HOST     (Podman/Docker/K8s)
         - MASQUERADE NAT     - Delegated to
         - FORWARD rules        container runtime
         - INPUT port-scope    - No direct iptables
                    │                  │
                    └───────┬──────────┘
                            │
                   Supervisor Process
                            │
                    ┌───────┴───────┐
                    │               │
              Proxy Listener    Inner Sandbox Netns
              (host-side veth)  (veth 10.200.0.2)
                                    │
                              iptables OUTPUT chain:
                              ACCEPT proxy, lo, conntrack
                              LOG + REJECT everything else
                                    │
                              Agent/User Code

Two independent iptables usage sites at different layers:

  1. Sandbox netns rules (inner): Security enforcement and diagnostics inside the sandbox network namespace. Rules are installed via nsenter --net=/var/run/netns/<name> -- /usr/sbin/iptables <args>. No cleanup needed — rules are destroyed with the namespace on Drop.

  2. VM TAP rules (host): Infrastructure NAT/routing on the host to connect VMs to the network. Uses delete-then-add for idempotency. Explicit teardown on TapGuard::drop() plus stale interface cleanup.

Code References

| Location | Description |
|---|---|
| crates/openshell-sandbox/src/sandbox/linux/netns.rs:245-628 | All bypass detection iptables rules: ACCEPT proxy/lo/conntrack, LOG bypass attempts (rate-limited), REJECT remaining TCP/UDP |
| crates/openshell-sandbox/src/sandbox/linux/netns.rs:770-886 | Binary discovery (find_iptables), xt_extensions probe, iptables-legacy fallback |
| crates/openshell-sandbox/src/bypass_monitor.rs:1-595 | Parses LOG entries from /dev/kmsg, extracting DST=, DPT=, SPT=, PROTO=, UID= fields |
| crates/openshell-sandbox/src/lib.rs:514-634 | Orchestration: calls install_bypass_rules(), spawns bypass monitor |
| crates/openshell-driver-vm/src/runtime.rs:447-529 | setup_tap_networking(): NAT MASQUERADE, FORWARD ACCEPT, INPUT port-scope |
| crates/openshell-driver-vm/src/runtime.rs:532-583 | teardown_tap_networking(): explicit rule deletion |
| crates/openshell-driver-vm/src/runtime.rs:382-406 | cleanup_stale_tap_interfaces(): scans /sys/class/net for leftover vmtap-* |
| crates/openshell-driver-vm/runtime/kernel/openshell.kconfig:60-66 | iptables kernel modules |
| crates/openshell-driver-vm/runtime/kernel/openshell.kconfig:68-83 | nftables kernel modules (already enabled) |
| crates/openshell-ocsf/src/events/network_activity.rs:14 | Comment: "iptables-level bypass detection" |
| crates/openshell-sandbox/src/bypass_monitor.rs:224 | OCSF event: .firewall_rule("bypass-detect", "iptables") |

Current Behavior

Sandbox bypass detection (netns.rs):

  • find_iptables() probes hardcoded paths (/usr/sbin/iptables, /sbin/iptables, /usr/bin/iptables)
  • If found, xt_extensions_unavailable() tests whether the xtables comment module works by creating a temp chain
  • If the xtables extension probe fails, falls back to the iptables-legacy binary
  • install_bypass_rules() runs 7+ separate nsenter ... iptables invocations sequentially (IPv4 + IPv6)
  • Each rule is a separate process spawn — there is a brief window where partial rules are in effect
  • If iptables is not found or rules fail to install, bypass detection is skipped with an OCSF event (graceful degradation)

Bypass monitor (bypass_monitor.rs):

  • Spawns nsenter ... dmesg --follow
  • Parses lines matching the openshell:bypass:<ns-id>: prefix
  • Extracts packet metadata (DST, DPT, SPT, PROTO, UID) from the kernel log format
  • These fields come from the kernel's nf_log_packet infrastructure, shared between iptables and nftables
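
For reference while verifying the nftables log format, the field extraction described above could look roughly like the sketch below. This is not the code in bypass_monitor.rs: the `BypassEvent` struct and `parse_bypass_line` function are hypothetical names, and the sample line used to exercise it follows the familiar iptables LOG layout rather than captured output.

```rust
use std::collections::HashMap;

/// Fields the bypass monitor extracts, per the list above (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct BypassEvent {
    pub dst: String,
    pub proto: String,
    pub spt: Option<u16>,
    pub dpt: Option<u16>,
    pub uid: Option<u32>,
}

/// Hypothetical helper: parse one kernel log line that carries `prefix`.
/// Returns None when the prefix or the mandatory DST/PROTO fields are missing.
pub fn parse_bypass_line(line: &str, prefix: &str) -> Option<BypassEvent> {
    // Everything after the prefix is a whitespace-separated KEY=VALUE payload.
    let idx = line.find(prefix)?;
    let payload = &line[idx + prefix.len()..];
    let fields: HashMap<&str, &str> = payload
        .split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .collect();
    Some(BypassEvent {
        dst: fields.get("DST")?.to_string(),
        proto: fields.get("PROTO")?.to_string(),
        // Ports are absent for some protocols; UID only appears when UID logging is on.
        spt: fields.get("SPT").and_then(|v| v.parse().ok()),
        dpt: fields.get("DPT").and_then(|v| v.parse().ok()),
        uid: fields.get("UID").and_then(|v| v.parse().ok()),
    })
}
```

Because the parser keys on field names rather than positions, it would tolerate reordered or additional fields. A missing UID= field would show up as `uid: None` rather than a parse failure.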

VM driver (runtime.rs):

  • setup_tap_networking() runs 4 iptables commands with delete-then-add for idempotency
  • teardown_tap_networking() runs matching -D commands
  • Stale cleanup scans for orphaned TAP interfaces and removes their rules

What Would Need to Change

Sandbox bypass detection:

  • Replace find_iptables() with find_nft() — probe /usr/sbin/nft, /sbin/nft, /usr/bin/nft
  • Remove xt_extensions_unavailable() probe and iptables-legacy fallback (no longer needed with nftables)
  • Replace 7+ sequential iptables invocations with a single nft -f - command that atomically loads the full ruleset
  • IPv4 and IPv6 rules can be unified using inet family tables in nftables (vs. separate iptables/ip6tables calls today)
  • Update the OCSF firewall engine string from "iptables" to "nftables"
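
A minimal sketch of the single-transaction load described above, assuming a fresh namespace. The table and chain names, the proxy address and port parameters, and the 5/second log rate are all placeholders; the real values would come from the existing rules in netns.rs. `build_bypass_ruleset` and `load_ruleset` are hypothetical function names.

```rust
use std::io::Write;
use std::net::Ipv4Addr;
use std::process::{Command, Stdio};

/// Build the whole ruleset as one nft script.
/// Table name, chain name, and rate limit are illustrative assumptions.
pub fn build_bypass_ruleset(proxy_ip: Ipv4Addr, proxy_port: u16, ns_id: &str) -> String {
    format!(
        r#"table inet openshell_bypass {{
    chain output {{
        type filter hook output priority 0; policy accept;
        oifname "lo" accept
        ct state established,related accept
        ip daddr {proxy_ip} tcp dport {proxy_port} accept
        limit rate 5/second log prefix "openshell:bypass:{ns_id}: " flags skuid
        meta l4proto {{ tcp, udp }} reject
    }}
}}
"#
    )
}

/// Feed the script to `nft -f -` inside the namespace via nsenter.
/// Returns Ok(true) when nft exits successfully.
pub fn load_ruleset(netns_path: &str, nft_path: &str, ruleset: &str) -> std::io::Result<bool> {
    let mut child = Command::new("nsenter")
        .arg(format!("--net={netns_path}"))
        .arg("--")
        .arg(nft_path)
        .args(["-f", "-"])
        .stdin(Stdio::piped())
        .spawn()?;
    {
        // Dropping the handle at the end of this scope closes stdin so nft sees EOF.
        let mut stdin = child.stdin.take().expect("stdin was piped");
        stdin.write_all(ruleset.as_bytes())?;
    }
    Ok(child.wait()?.success())
}
```

The inet family covers IPv4 and IPv6 in one table, so the separate ip6tables pass goes away. Packets over the log rate fall through to the reject rule, so they are still rejected but not logged.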

Bypass monitor:

  • Verify that nftables log prefix produces identical /dev/kmsg field format (DST=, DPT=, etc.)
  • The kernel's nf_log_packet infrastructure is shared, so format is expected to be identical, but needs empirical verification
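  • Confirm the nftables equivalent of --log-uid: the nft log statement documents a flags skuid option for this, whereas meta skuid is a match expression and does not by itself add a UID= field to the log line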

VM driver:

  • Replace iptables commands with nft equivalents
  • Use named tables (openshell_vm_<tap_device>) for clean per-sandbox isolation
  • Teardown becomes nft delete table ip openshell_vm_<tap_device> (single atomic operation vs. 4+ deletes)
  • Decide on fallback strategy if nft is not available on the host
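
As a rough illustration of the per-TAP table idea, the sketch below generates a script that could be fed to nft -f - plus the arguments for teardown. The guest CIDR, the port, the chain layout, and the function names are assumptions, not a translation of the current rules in runtime.rs. The table name is derived from the TAP device by replacing characters outside [A-Za-z0-9_] with '_', a conservative choice made for this sketch.

```rust
/// Derive a per-TAP table name; characters outside [A-Za-z0-9_] become '_'.
pub fn vm_table_name(tap: &str) -> String {
    let safe: String = tap
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    format!("openshell_vm_{safe}")
}

/// Build an idempotent script: create-if-missing, delete, then define the table.
/// All three steps are applied as one transaction when loaded with `nft -f`.
pub fn build_vm_ruleset(tap: &str, guest_cidr: &str, host_port: u16) -> String {
    let table = vm_table_name(tap);
    format!(
        r#"add table ip {table}
delete table ip {table}
table ip {table} {{
    chain postrouting {{
        type nat hook postrouting priority 100; policy accept;
        ip saddr {guest_cidr} oifname != "{tap}" masquerade
    }}
    chain forward {{
        type filter hook forward priority 0; policy accept;
        iifname "{tap}" accept
        oifname "{tap}" ct state established,related accept
    }}
    chain input {{
        type filter hook input priority 0; policy accept;
        iifname "{tap}" tcp dport {host_port} accept
    }}
}}
"#
    )
}

/// Arguments for the single-command teardown.
pub fn teardown_args(tap: &str) -> Vec<String> {
    vec!["delete".into(), "table".into(), "ip".into(), vm_table_name(tap)]
}
```

One caveat to weigh when designing this: an accept verdict in a dedicated table does not override a drop in another table hooked at the same point. On hosts whose own firewall drops forwarded or inbound traffic by default, a separate table may not be enough to admit VM traffic.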

Container images and docs:

  • Update examples/bring-your-own-container/Dockerfile to install nftables instead of iptables
  • Update NETWORKING.md, README.md, best-practices.mdx

Alternative Approaches Considered

  1. Rust nftables library (rustables crate): Pure Rust, direct netlink to kernel, no CLI dependency. However, it has "rough edges" per its maintainers, and the current codebase shells out to CLI tools throughout. Switching to a library changes the execution model unnecessarily. Better suited as a follow-up optimization.

  2. Dual support (try nft, fall back to iptables): Adds complexity with two code paths. iptables on modern distros is usually an nftables shim (iptables-nft), so kernel nf_tables support is almost always present; the nft CLI is packaged separately, but the practical risk of it being absent is low. The sandbox already degrades gracefully if no usable binary is found. Dual support is not recommended.

  3. nftables for sandbox only, keep iptables for VM driver: Reduces blast radius. The VM driver touches the host (higher risk), while the sandbox rules are self-contained in a namespace. This is the recommended phasing — see Proposed Approach.

Patterns to Follow

  • Command execution: Use std::process::Command with explicit argument arrays (same pattern as existing iptables calls)
  • Binary discovery: Probe hardcoded paths with find_trusted_binary() style function
  • Graceful degradation: If nft is not found, skip bypass detection and emit an OCSF event (existing pattern in netns.rs)
  • Atomic operations: nftables supports nft -f - for loading a full ruleset from stdin — use this instead of sequential rule additions
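
The binary discovery and graceful degradation patterns could be sketched as follows. This is a simplification: the existing find_trusted_binary() may perform checks beyond existence (the source does not say), and `first_existing` and `find_nft` here are hypothetical stand-ins rather than the real signatures.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations for the nft binary, in probe order (from the list above).
pub const NFT_CANDIDATES: [&str; 3] = ["/usr/sbin/nft", "/sbin/nft", "/usr/bin/nft"];

/// Return the first candidate that exists as a regular file.
/// Hardcoded absolute paths avoid any dependence on $PATH.
pub fn first_existing(candidates: &[&str]) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|c| Path::new(*c))
        .find(|p| p.is_file())
        .map(Path::to_path_buf)
}

/// Hypothetical replacement for find_iptables().
pub fn find_nft() -> Option<PathBuf> {
    first_existing(&NFT_CANDIDATES)
}
```

A caller would match on the result: `Some(path)` proceeds to load the ruleset, while `None` skips bypass detection and emits the OCSF event, mirroring the existing degradation path.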

Proposed Approach

A phased migration: Phase 1 replaces the sandbox bypass detection rules (self-contained, lower risk, eliminates the xt_extensions/legacy fallback complexity). Phase 2 replaces the VM driver host-side rules (higher risk, host dependency). Phase 3 updates documentation. The sandbox phase can merge independently. The key architectural improvement is atomic ruleset loading — all rules installed in a single transaction instead of 7+ sequential process spawns.

Scope Assessment

  • Complexity: Medium
  • Confidence: Medium — the sandbox replacement is well-understood, but the bypass monitor log format and VM driver host availability need empirical verification
  • Estimated files to change: ~10
  • Issue type: refactor

Risks & Open Questions

  • Bypass monitor log format compatibility: nftables LOG and iptables LOG both use the kernel's nf_log_packet and should produce identical field formats (DST=, DPT=, PROTO=, etc.), but this must be verified empirically on actual hardware. If the format differs, the parse_kmsg_line() parser in bypass_monitor.rs will need adjustment.
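  • --log-uid equivalent in nftables: iptables LOG supports --log-uid to include the socket owner UID. The nftables log statement documents a flags skuid option for the same purpose (meta skuid is a match expression, not a log option). Verify that the resulting UID= field matches what the parser expects.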
  • Host nft availability for VM driver: The VM driver runs rules on the HOST. Switching to nft means the host must have the nft CLI installed. This is a stronger requirement than the current approach (which works with any iptables backend). Decision needed: require nft on host, or implement a fallback?
  • Container image updates: The BYOC example Dockerfile and potentially the sandbox rootfs install iptables. These need updating to nftables. Confirm that the nftables package is available in the base images used.
  • Podman/Docker driver interaction: These drivers delegate networking to the container runtime and don't use iptables directly. However, Podman's Netavark may have opinions about coexisting nftables tables. Low risk but worth a spot-check.

Test Considerations

  • Bypass monitor parsing: Existing unit tests in bypass_monitor.rs cover parse_kmsg_line(). These tests use hardcoded log line strings — they'll need updating to reflect nftables log format (likely identical, but must be verified)
  • Integration testing: The sandbox e2e tests should be run to verify bypass detection still works end-to-end after the migration
  • VM driver testing: Manual testing on a host with nft installed, verifying NAT, forwarding, and teardown work correctly
  • Graceful degradation: Test that the fallback path (nft not found) still works — the sandbox should start without bypass detection, emitting the appropriate OCSF event
  • Atomic ruleset loading: Verify that a partial failure in nft -f - rolls back cleanly (nftables provides this guarantee, but worth confirming)
  • Root-required tests: The existing #[ignore] tests in netns.rs for find_trusted_binary() and namespace operations will need updating for nft binary discovery

Created by spike investigation. Use build-from-issue to plan and implement.

Labels

area:sandbox (Sandbox runtime and isolation work), area:supervisor (Proxy and routing-path work)
