Skip to content

ALIEN-59: K8s agent self-update — chart guardrails, /v1/rejoin recovery, persistence default#69

Open
yonatan-alien wants to merge 7 commits into
mainfrom
yonatan/alien-59-self-update-for-the-alien-agent-across-all-pull-mode-2
Open

ALIEN-59: K8s agent self-update — chart guardrails, /v1/rejoin recovery, persistence default#69
yonatan-alien wants to merge 7 commits into
mainfrom
yonatan/alien-59-self-update-for-the-alien-agent-across-all-pull-mode-2

Conversation

@yonatan-alien

Copy link
Copy Markdown

Summary

Closes ALIEN-59. Manager-driven agent self-update on Kubernetes, plus the two production-hardening fixes that came out of live e2e.

Feature

  • New agent_upgrade loop in alien-agent consumes agent_target.helm on /v1/sync and spawns a Helm-runner Job that runs helm upgrade --reuse-values --set runtime.image.tag=<target> under an alien-agent-upgrader ServiceAccount.
  • alien-helm chart guardrails — required management.{url,name,token,deploymentId}, /livez+/readyz probes, Recreate strategy, namespace-scoped upgrader Role/Binding, upgrader.enabled opt-out.
  • Sync schema gains agent_version / agent_os / agent_arch / regime / agent_image_repository; K8s worker transport set to local; K8s ServiceAccount controller is skipped (chart owns it).
  • alien-manager exposes PUT /v1/deployments/:id/target-agent-version; build_agent_target emits agent_target.helm when target ≠ reported and regime = kubernetes.

Production hardening (would silently break otherwise; both found via live e2e)

  • fix(helm): enable agent runtime-data persistence by default — without a PVC, the self-update's pod roll wipes the agent's deployment_id/sync token, the new pod hits 409 from initialize, CrashLoopBackOffs, and helm --atomic rolls back. Documented opt-out.
  • feat(agent,manager): /v1/rejoin endpoint — for the case where persistence is genuinely off (or the state directory is wiped for any other reason), the agent's initialize-409 falls through to a new POST /v1/rejoin that re-acquires a deployment-scoped sync token over the existing row. New RouterOptions::include_rejoin + skip_rejoin() builder lets multi-tenant embedders override, same pattern as /v1/initialize.
  • fix(agent): preserve 409 from initialize so rejoin fall-through fires — the agent's .context(ConfigurationError) was clobbering the SDK's real 409 with a hardcoded 500. New DeploymentNameAlreadyExists variant (409) lets the conflict surface so the rejoin branch actually triggers.

Paired platform-side PR: https://github.com/alienplatform/platform/pull/new/yonatan/alien-59-self-update-for-the-alien-agent-across-all-pull-mode-2

Test plan

  • cargo test -p alien-agent -p alien-core -p alien-deployment -p alien-helm -p alien-manager --lib
  • Live e2e against local managerx + orbstack K8s: pin a target version in the dashboard → manager emits agent_target → helm-upgrade Job rolls the agent pod → PVC keeps deployment_id → new agent reports the new version
  • Live e2e for rejoin: kubectl exec rm -rf /var/lib/alien-agent/* + bounce pod → agent logs Name already exists — assuming local state was wiped, rejoining… → recovers the original deployment_id and stays running
  • Reviewer to confirm: client-sdks/platform/rust/openapi*.json kept ours during the main merge to preserve the rejoin/target-agent-version schemas; main's added ManagerDomainBinding* + PackageDomain schemas in those files were dropped (they were generated from a private platform branch). Next platform-side openapi regen will bring them back.

Yonatan Schkolnik added 7 commits June 2, 2026 15:43
…IEN-59)

Squashed replay of the K8s self-update feature onto current main:

- `alien-agent`: new `agent_upgrade` loop that consumes `agent_target.helm`
  from /v1/sync and spawns a Helm-upgrade Job using the upgrader SA; sync
  loop persists deployment-scoped token across restarts.
- `alien-helm`: chart guardrails — `required management.{url,name,token,
  deploymentId}`, /livez + /readyz probes, Recreate strategy, upgrader SA
  + namespace-scoped Role, `upgrader.enabled` values block.
- `alien-core`: sync schema gains `agent_version`, `agent_os`, `agent_arch`,
  `regime`, `agent_image_repository`; K8s worker transport set to `local`.
- `alien-manager`: PUT /v1/deployments/:id/target-agent-version; sync emits
  `agent_target` when `target_agent_version` differs from the agent's
  reported version; sqlite store + migrations + ReconcileData carry the
  new inventory; sync-token persistence reuses across restarts.
- `alien-infra`: K8s ServiceAccount controller is skipped — the chart owns
  it, controllers reference it by name.
- `alien-deployment`: Kubernetes platform routes through local env-info
  collection (k8s-on-cloud picks the base cloud via existing context).
Copy the regenerated `apps/api/openapi.json` from the platform repo so
the Rust SDK (`alien-platform-api`, generated via `progenitor` at build
time) picks up the new `PUT /v1/deployments/:id/target-agent-version`
endpoint, the extended SyncReconcileRequest fields, and the
`basePlatform` property on `PersistImportedDeploymentRequest`.
Without a PVC backing `/var/lib/alien-agent`, the agent's `deployment_id`
+ deployment-scoped sync token live in an `emptyDir` and are wiped on any
pod restart. The self-update flow (manager emits `agent_target.helm` →
agent spawns helm-upgrade Job → Job rolls the agent Deployment) triggers
exactly such a restart, and the new pod re-runs `/v1/initialize`, hits a
DEPLOYMENT_NAME_ALREADY_EXISTS 409, CrashLoopBackOffs, helm `--atomic`
times out, and rolls back. The feature is silently unusable.

Default `runtime.data.persistence.enabled` to `true` so self-update
survives the pod roll on any cluster with a default StorageClass.
Operators on clusters without one can still set `storageClassName`,
point at an `existingClaim`, or explicitly opt out and accept that
self-update won't survive a pod roll.

Helm snapshot tests rebaselined.
The agent's `data_dir` (`/var/lib/alien-agent`) holds its persistent
`deployment_id` + deployment-scoped sync token. When the chart leaves
`runtime.data.persistence` off (or anything else wipes that directory —
pod eviction, manual reset, etc.) the agent restarts with no stored
state and falls into the "first startup" branch, hits `/v1/initialize`,
and the platform rejects with `DEPLOYMENT_NAME_ALREADY_EXISTS 409`
because the row is still there. Result: CrashLoopBackOff, and if the
restart was driven by self-update's helm-upgrade Job, helm `--atomic`
times out and rolls back.

Add `POST /v1/rejoin` so the agent can explicitly re-attach to an
existing deployment by name. Same dg bearer the chart mounts, no body
beyond the name; response is `{ deploymentId, token }` with a freshly
minted deployment-scoped token. The agent code-path catches the 409
from initialize and falls through to rejoin instead of erroring.

- `alien-manager`: new `RejoinRequest`/`RejoinResponse`, `rejoin`
  handler that mirrors the existing initialize idempotency branch
  (look up by `(dg, name)`, mint a fresh token), wired through a new
  `rejoin_router()` so multi-tenant embedders can override it the same
  way they override initialize. `RouterOptions::include_rejoin` + a
  `skip_rejoin()` builder method follow the existing pattern.
- `alien-agent`: on 409 from initialize, resolve the deployment name
  the same way initialize did (CLI arg / env / hostname) and call the
  new `rejoin_with_manager` helper. The recovered `(deployment_id,
  token)` are written back to `data_dir` like a normal init, so
  subsequent restarts go through the stored-state branch.
- `alien-manager-api` SDK regenerated; OSS `/v1/rejoin` is exposed in
  the openapi schema for downstream tooling.
The agent's initialize → rejoin fall-through keyed off
`http_status_code == Some(409)`, but `initialize_with_manager` wrapped
every SDK error in `ConfigurationError` whose hardcoded
`http_status_code = 500` clobbered the real status. Result: agent
received "Unexpected response: 409 Conflict" from the manager, hit the
`.context(ConfigurationError)` wrap, the outer status became 500, and
the agent crashed instead of falling through to `/v1/rejoin`.

- New `DeploymentNameAlreadyExists` variant on `ErrorData` (409,
  non-retryable, non-internal) so the 409 path has a distinct outer
  error.
- `initialize_with_manager` rebuilt: 409 from the SDK is re-emitted as
  `DeploymentNameAlreadyExists`; everything else still goes through
  `ConfigurationError` so existing error paths are unchanged.

Confirmed live: after wiping `/var/lib/alien-agent/*` and bouncing the
pod, the agent now logs `Name already exists — assuming local state was
wiped, rejoining…`, calls `/v1/rejoin`, recovers the same deployment_id,
and stays running.
…-update-for-the-alien-agent-across-all-pull-mode-2

# Conflicts:
#	client-sdks/platform/openapi.json
#	client-sdks/platform/rust/openapi-3.0.json
#	client-sdks/platform/rust/openapi.json
#	crates/alien-helm/src/generator.rs
#	crates/alien-helm/tests/generator/boot_paths_tests.rs
#	crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__data_layer.snap
#	crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__manager_only_pure_worker.snap
@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Copy link
Copy Markdown

Greptile Summary

This PR lands manager-driven K8s agent self-update (Helm-runner Job spawned from the agent on /v1/sync), the /v1/rejoin recovery flow for wiped persistent state, and the chart guardrails (Recreate strategy, PVC-on-by-default, required value guards, /livez+/readyz probes) that make the upgrade cycle safe.

  • K8s self-update loop (agent_upgrade.rs): agent reads agent_target.helm from the sync response, resolves chart ref + runner image from env vars injected by the chart, and POSTs a one-shot Helm-runner Job to the in-cluster API using the alien-agent-upgrader SA.
  • /v1/rejoin endpoint (manager + agent cli.rs): catches the 409 from /v1/initialize (name already exists after a state wipe), re-acquires a fresh deployment-scoped sync token over the existing row, and persists it to the local SQLite state so pod restarts don't fall back to the chart-mounted DG token.
  • Chart hardening (generator.rs): PVC enabled by default, required template guards on management.*, Recreate rollout strategy, namespace-scoped upgrader Role/ClusterRoleBinding, and auto-generated encryption key via lookup.

Confidence Score: 3/5

The self-update happy path is solid, but a failed helm-runner Job silently blocks all retry attempts for up to 10 minutes with no operator warning.

The core upgrade loop, rejoin recovery, chart guardrails, and persistence defaults are well-reasoned and tested. The 409-on-failed-Job case in create_job quietly absorbs what should be a retryable error, and min_supported_version being pinned to the target version would block all upgrades once the agent enforces that field. The upgrade path is a new critical flow touching the agent self-replacement mechanism, so these gaps carry more weight than they would in a peripheral feature.

crates/alien-agent/src/loops/agent_upgrade.rs (failed-Job 409 handling) and crates/alien-manager/src/routes/sync.rs (build_agent_target min_supported_version) need the most attention before this ships to production.

Important Files Changed

Filename Overview
crates/alien-agent/src/loops/agent_upgrade.rs New Helm-runner Job spawner; 409 from a failed (but not yet TTL'd) Job is silently treated as idempotent success, stalling upgrades for up to 10 min with no operator warning.
crates/alien-manager/src/routes/sync.rs New agent_sync inventory persistence and build_agent_target; min_supported_version is set equal to the target version, which would block all agents from upgrading once that field is enforced.
crates/alien-manager/src/routes/deployments.rs New PUT /v1/deployments/:id/target-agent-version endpoint; auth, authz, and semver validation are present; is_plausible_semver rejects valid versions with both pre-release and build metadata.
crates/alien-helm/src/generator.rs Chart guardrails (required fields, Recreate strategy, persistence default, upgrader SA/Role/ClusterRole), encryption key auto-generation, and /livez+/readyz probe defaults all look correct.
crates/alien-agent/src/cli.rs Adds rejoin fall-through on 409 from initialize and sync-token persistence; logic and error routing look correct.
crates/alien-manager/src/stores/sqlite/deployment.rs update_agent_metadata and set_target_agent_version are correctly scoped; neither updates the updated_at column, leaving deployment timestamps stale after inventory syncs.
crates/alien-core/src/sync.rs SyncRequest/SyncResponse extended with new optional fields; AgentTarget, AgentHelmTarget, AgentBinaryTarget structs and back-compat tests look correct.
crates/alien-infra/src/core/executor.rs Kubernetes ServiceAccounts are correctly skipped at controller-lookup time and auto-marked Running as platform-provided resources; logic mirrors external-binding treatment.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Manager
    participant K8sAPI
    participant HelmRunnerJob

    Note over Agent,Manager: Normal sync loop
    Agent->>Manager: "POST /v1/sync (agent_version, regime=kubernetes)"
    Manager->>Manager: update_agent_metadata()
    Manager->>Manager: build_agent_target() if target differs from reported
    Manager-->>Agent: SyncResponse with agent_target helm payload

    Note over Agent,K8sAPI: Self-update actuator
    Agent->>K8sAPI: POST /apis/batch/v1/.../jobs (helm-runner Job)
    K8sAPI-->>Agent: 201 Created (or 409 if exists)
    K8sAPI->>HelmRunnerJob: schedule pod
    HelmRunnerJob->>HelmRunnerJob: helm upgrade --reuse-values --atomic --wait

    Note over Agent,Manager: Pod rolls (Recreate strategy)
    HelmRunnerJob->>K8sAPI: Deployment updated with new image tag
    Agent->>Agent: /readyz returns 503 until first sync
    Agent->>Manager: POST /v1/sync with new agent_version
    Agent->>Agent: "first_sync_completed=true, /readyz returns 200"

    Note over Agent,Manager: Rejoin after state wipe
    Agent->>Manager: POST /v1/initialize returns 409
    Agent->>Manager: POST /v1/rejoin with name
    Manager->>Manager: get_deployment_by_name() and mint new token
    Manager-->>Agent: deployment_id and fresh token
    Agent->>Agent: db.set_sync_token() persists for next restart
Loading

Comments Outside Diff (1)

  1. crates/alien-manager/src/stores/sqlite/deployment.rs, line 506-550 (link)

    P2 updated_at is not bumped during agent metadata writes

    update_agent_metadata (and similarly set_target_agent_version) never touches the updated_at column. Any API consumer that uses updated_at to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting updated_at = datetime('now') inside the same UPDATE, or at least doing so for set_target_agent_version which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/alien-manager/src/stores/sqlite/deployment.rs
    Line: 506-550
    
    Comment:
    **`updated_at` is not bumped during agent metadata writes**
    
    `update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
crates/alien-agent/src/loops/agent_upgrade.rs:296-305
**Silently swallows a failed Job's 409 — upgrade stalls with no warning**

When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.

To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires.

### Issue 2 of 4
crates/alien-manager/src/routes/sync.rs:1107-1112
`min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade.

```suggestion
    Some(alien_core::sync::AgentTarget {
        version: target_version.to_string(),
        // Allow all existing agents to receive the upgrade instruction.
        // Set to a meaningful floor (e.g. the first version that supports
        // self-update) once the agent enforces this field.
        min_supported_version: "1.0.0".to_string(),
        binary: None,
        helm,
    })
```

### Issue 3 of 4
crates/alien-manager/src/routes/deployments.rs:876-892
`splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.

```suggestion
fn is_plausible_semver(v: &str) -> bool {
    // Strip optional +build metadata first so the pre-release check can
    // use a simple split on '-'.
    let without_build = v.splitn(2, '+').next().unwrap_or(v);
    let mut parts = without_build.splitn(2, '-');
    let core = parts.next().unwrap_or("");
    let core_ok = {
        let segs: Vec<&str> = core.split('.').collect();
        segs.len() == 3
            && segs
                .iter()
                .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
    };
    let rest_ok = parts.next().map_or(true, |r| {
        !r.is_empty()
            && r.chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
    });
    core_ok && rest_ok
}
```

### Issue 4 of 4
crates/alien-manager/src/stores/sqlite/deployment.rs:506-550
**`updated_at` is not bumped during agent metadata writes**

`update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Comment on lines +296 to +305
}
out
}

#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;

fn inputs() -> JobInputs {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silently swallows a failed Job's 409 — upgrade stalls with no warning

When the Helm-runner Job exists but is in a Failed state (exhausted its backoffLimit), the Kubernetes API still returns 409 on a new create. The code logs at info level and returns Ok(()), so apply_agent_target sees success. The manager keeps emitting agent_target on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to ttlSecondsAfterFinished seconds (10 minutes). During that window, there is no warn log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.

To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a warn when the existing Job is in a Failed condition, so operators can investigate before the TTL window expires.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-agent/src/loops/agent_upgrade.rs
Line: 296-305

Comment:
**Silently swallows a failed Job's 409 — upgrade stalls with no warning**

When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.

To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1107 to +1112
Some(alien_core::sync::AgentTarget {
version: target_version.to_string(),
min_supported_version: target_version.to_string(),
binary: None,
helm,
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 min_supported_version is set equal to version, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the reported_version == Some(target_version) guard returns None first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like "1.0.0", or by explicitly recording the minimum agent version that is safe to self-upgrade.

Suggested change
Some(alien_core::sync::AgentTarget {
version: target_version.to_string(),
min_supported_version: target_version.to_string(),
binary: None,
helm,
})
Some(alien_core::sync::AgentTarget {
version: target_version.to_string(),
// Allow all existing agents to receive the upgrade instruction.
// Set to a meaningful floor (e.g. the first version that supports
// self-update) once the agent enforces this field.
min_supported_version: "1.0.0".to_string(),
binary: None,
helm,
})
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-manager/src/routes/sync.rs
Line: 1107-1112

Comment:
`min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade.

```suggestion
    Some(alien_core::sync::AgentTarget {
        version: target_version.to_string(),
        // Allow all existing agents to receive the upgrade instruction.
        // Set to a meaningful floor (e.g. the first version that supports
        // self-update) once the agent enforces this field.
        min_supported_version: "1.0.0".to_string(),
        binary: None,
        helm,
    })
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +876 to +892
fn is_plausible_semver(v: &str) -> bool {
let mut parts = v.splitn(2, |c| c == '-' || c == '+');
let core = parts.next().unwrap_or("");
let core_ok = {
let segs: Vec<&str> = core.split('.').collect();
segs.len() == 3
&& segs
.iter()
.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
};
let rest_ok = parts.next().map_or(true, |r| {
!r.is_empty()
&& r.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
});
core_ok && rest_ok
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 splitn(2, |c| c == '-' || c == '+') splits on the first occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. 1.4.0-rc.1+build.123) fail: the remainder becomes "rc.1+build.123" and the + isn't in the allowed character set [alphanumeric, '.', '-']. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.

Suggested change
fn is_plausible_semver(v: &str) -> bool {
let mut parts = v.splitn(2, |c| c == '-' || c == '+');
let core = parts.next().unwrap_or("");
let core_ok = {
let segs: Vec<&str> = core.split('.').collect();
segs.len() == 3
&& segs
.iter()
.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
};
let rest_ok = parts.next().map_or(true, |r| {
!r.is_empty()
&& r.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
});
core_ok && rest_ok
}
fn is_plausible_semver(v: &str) -> bool {
// Strip optional +build metadata first so the pre-release check can
// use a simple split on '-'.
let without_build = v.splitn(2, '+').next().unwrap_or(v);
let mut parts = without_build.splitn(2, '-');
let core = parts.next().unwrap_or("");
let core_ok = {
let segs: Vec<&str> = core.split('.').collect();
segs.len() == 3
&& segs
.iter()
.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
};
let rest_ok = parts.next().map_or(true, |r| {
!r.is_empty()
&& r.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
});
core_ok && rest_ok
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-manager/src/routes/deployments.rs
Line: 876-892

Comment:
`splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.

```suggestion
fn is_plausible_semver(v: &str) -> bool {
    // Strip optional +build metadata first so the pre-release check can
    // use a simple split on '-'.
    let without_build = v.splitn(2, '+').next().unwrap_or(v);
    let mut parts = without_build.splitn(2, '-');
    let core = parts.next().unwrap_or("");
    let core_ok = {
        let segs: Vec<&str> = core.split('.').collect();
        segs.len() == 3
            && segs
                .iter()
                .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
    };
    let rest_ok = parts.next().map_or(true, |r| {
        !r.is_empty()
            && r.chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
    });
    core_ok && rest_ok
}
```

How can I resolve this? If you propose a fix, please make it concise.

/// POST a Job to the in-pod Kubernetes API. Returns Ok on 2xx, AlienError
/// otherwise. 409 (already exists) is treated as success — another sync
/// already kicked the Job and idempotency handles the retry.
async fn create_job(namespace: &str, body: &serde_json::Value) -> Result<()> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these APIs should be in alien-k8s-clients

pub agent_version: Option<String>,
/// `linux` / `macos` / `windows`. From `std::env::consts::OS`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub agent_os: Option<String>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an existing enum for this somewhere in alien-core

pub agent_os: Option<String>,
/// `x86_64` / `aarch64`. From `std::env::consts::ARCH`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub agent_arch: Option<String>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have an existing enum for this somewhere in alien-core

/// How the agent is supervised — `os-service` (launcher) or `kubernetes`
/// (Helm). Detected at runtime from `KUBERNETES_SERVICE_HOST`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub regime: Option<AgentRegime>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regime is a really weird name haha

exec helm upgrade \"$RELEASE\" \"$CHART_REF\"{version_flag}{extra_args_flag} \
--namespace \"$NAMESPACE\" \
--reuse-values \
--atomic --wait \

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document why did we choose --atomic?

--values /tmp/values.json"
);

let body = serde_json::json!({

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if creating this job fails? Or pulling the job's image fails? Or the command inside it fails? Do we report back those failures in any way?

How many retries do we support (I think the answer should be unlimited)? Is there exponential backoff or constant interval between retries?

"name": job_name,
"namespace": inputs.namespace,
"labels": {
"app.kubernetes.io/managed-by": "alien-agent",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently improved our white-labeling support, need to make sure it fits there. See other places in alien-helm

@@ -0,0 +1,453 @@
//! Agent self-update actuator. When `/v1/sync` returns

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this will only work for Kubernetes deployments, right? No support for local service swap etc? What'll happen in local?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants