SafeDeploy is a reliability governance control plane that evaluates system health using SLOs and error budgets, manages incidents as stateful workflows, and enforces safe operational actions through GitOps.
It is not a CI/CD tool or deployment system.
It is a system that decides whether deployments should happen at all.
Modern systems change faster than teams can safely understand their impact.
As a result:
- Deployments introduce outages
- Monitoring reflects infrastructure, not user experience
- Incidents are reactive and inconsistent
- Teams repeat the same failures
- Automation is either unsafe or underutilized
SafeDeploy addresses this by treating reliability as a governed system, not an afterthought.
SafeDeploy is a control plane that:
- Defines what "healthy" means using SLOs
- Evaluates real-time system behavior via Prometheus
- Converts signals into error budgets and burn rates
- Manages incident lifecycle state
- Generates proposed operational actions (rollback, promotion control)
- Enforces policy and governance rules
- Executes changes only through Git (never directly)
SafeDeploy enforces a strict separation of concerns:
The control plane is responsible for reasoning, governance, and decision-making:
- SLO & error budget evaluation
- Incident lifecycle management
- Policy enforcement (SLOs, ownership, rollout strategy)
- Proposal generation (rollback / promotion control)
- Audit & evidence tracking
The control plane never mutates production systems directly.
The execution plane is responsible only for applying approved changes:
- ArgoCD (GitOps synchronization)
- Argo Rollouts (progressive delivery)
- Kubernetes (runtime execution)
All actions flow through:
Control Plane → Git (PR/commit) → Execution Plane
There are no direct API calls or kubectl actions from SafeDeploy.
- Prometheus provides all runtime signals
- SafeDeploy is strictly read-only toward the systems it observes
- All decisions are derived from observable evidence
SafeDeploy never invents metrics or assumptions.
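As a minimal sketch of what "derived from observable evidence" can look like, the snippet below reads one sample from a Prometheus instant-query response (the documented `/api/v1/query` vector shape). The helper name `readSingleSample` and its fail-loudly behavior are assumptions for illustration, not SafeDeploy's actual code.

```typescript
// Shape of a Prometheus instant-query response whose resultType is "vector".
interface PromVectorResponse {
  status: "success" | "error";
  data?: {
    resultType: string;
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
  error?: string;
}

// Hypothetical helper: return the single sample value or throw.
// It never substitutes a default, so a missing metric cannot look healthy.
function readSingleSample(response: PromVectorResponse): number {
  if (response.status !== "success" || response.data === undefined) {
    throw new Error(`Prometheus query failed: ${response.error ?? "unknown error"}`);
  }
  if (response.data.resultType !== "vector") {
    throw new Error(`Expected a vector result, got ${response.data.resultType}`);
  }
  const first = response.data.result[0];
  if (first === undefined || response.data.result.length !== 1) {
    throw new Error("Expected exactly one series in the result");
  }
  // Prometheus encodes sample values as strings, including "NaN" and "+Inf".
  const value = Number(first.value[1]);
  if (!Number.isFinite(value)) {
    throw new Error(`Non-finite sample value: ${first.value[1]}`);
  }
  return value;
}
```

Rejecting empty or non-finite results is one way to keep a broken query from being mistaken for a healthy service.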
The system continuously evaluates service health:
- Fetch metrics from Prometheus
- Compute SLI → SLO → error budget
- Calculate burn rate severity: `normal`, `slow-burn`, `fast-burn`, `exhausted`
- If unhealthy:
  - Create or update incident
  - Generate rollback proposal (idempotent)
  - Apply freeze window if needed
- Evaluate promotion eligibility via policy engine
- Append audit evidence
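The SLI, error budget, and burn rate steps above can be sketched as follows. This is an illustration under stated assumptions: the field names, the request-based SLI, and both thresholds (1 and 14.4) are placeholders, not SafeDeploy's actual configuration.

```typescript
type Severity = "normal" | "slow-burn" | "fast-burn" | "exhausted";

interface BudgetInput {
  sloTarget: number;       // e.g. 0.99 means 99% of requests should succeed
  totalRequests: number;   // requests observed in the SLO period so far
  failedRequests: number;  // failed requests in the SLO period so far
  recentErrorRate: number; // error ratio (0..1) over a short lookback window
}

interface BudgetStatus {
  sli: number;
  budgetRemaining: number; // fraction of the budget left; negative when overspent
  burnRate: number;        // 1 means spending the budget exactly at the allowed pace
  severity: Severity;
}

// Assumed thresholds for illustration only.
const SLOW_BURN_THRESHOLD = 1;
const FAST_BURN_THRESHOLD = 14.4;

function evaluateBudget(input: BudgetInput): BudgetStatus {
  const allowedErrorRate = 1 - input.sloTarget;
  if (!(allowedErrorRate > 0 && allowedErrorRate < 1)) {
    throw new Error("sloTarget must be strictly between 0 and 1");
  }
  const { totalRequests, failedRequests } = input;
  const sli = totalRequests === 0 ? 1 : (totalRequests - failedRequests) / totalRequests;
  const allowedFailures = allowedErrorRate * totalRequests;
  const budgetRemaining = allowedFailures === 0 ? 1 : 1 - failedRequests / allowedFailures;
  const burnRate = input.recentErrorRate / allowedErrorRate;

  let severity: Severity = "normal";
  if (budgetRemaining <= 0) severity = "exhausted";
  else if (burnRate >= FAST_BURN_THRESHOLD) severity = "fast-burn";
  else if (burnRate >= SLOW_BURN_THRESHOLD) severity = "slow-burn";

  return { sli, budgetRemaining, burnRate, severity };
}
```

For example, with a 99% target, 10,000 requests, and 20 failures, roughly 80% of the budget remains; a recent error rate of 2% is about twice the allowed pace, which this sketch classifies as `slow-burn`.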
Incidents are treated as stateful workflows, not alerts:
detected → investigating → mitigated → resolved → postmortem-complete
Key properties:
- One active incident per service
- No duplication
- Explicit, timestamped transitions
- Fully persisted and replayable
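The lifecycle above can be modeled as a small state machine. The sketch below assumes transitions are strictly linear and that an incident counts as active until it reaches `postmortem-complete`; neither detail is confirmed by this document, and the type and function names are hypothetical.

```typescript
type IncidentState =
  | "detected"
  | "investigating"
  | "mitigated"
  | "resolved"
  | "postmortem-complete";

interface Transition {
  from: IncidentState;
  to: IncidentState;
  at: string; // ISO-8601 timestamp
}

interface Incident {
  id: string;
  service: string;
  state: IncidentState;
  transitions: Transition[];
}

// Assumption: each state has exactly one legal successor.
const NEXT_STATE: Record<IncidentState, IncidentState | null> = {
  detected: "investigating",
  investigating: "mitigated",
  mitigated: "resolved",
  resolved: "postmortem-complete",
  "postmortem-complete": null,
};

// Returns the existing active incident for a service instead of creating a duplicate.
function openIncident(active: Map<string, Incident>, service: string, now: Date): Incident {
  const existing = active.get(service);
  if (existing !== undefined && existing.state !== "postmortem-complete") return existing;
  const incident: Incident = {
    id: `incident-${service}-${now.getTime()}`, // hypothetical ID scheme
    service,
    state: "detected",
    transitions: [],
  };
  active.set(service, incident);
  return incident;
}

// Returns a new incident value; the caller is responsible for persisting it.
function transitionIncident(incident: Incident, to: IncidentState, now: Date): Incident {
  if (NEXT_STATE[incident.state] !== to) {
    throw new Error(`Illegal transition: ${incident.state} -> ${to}`);
  }
  const step: Transition = { from: incident.state, to, at: now.toISOString() };
  return { ...incident, state: to, transitions: [...incident.transitions, step] };
}

// Rebuilds the current state from the recorded transitions alone.
function replayTransitions(transitions: Transition[]): IncidentState {
  return transitions.reduce<IncidentState>((state, step) => {
    if (step.from !== state) throw new Error("Transition log is not contiguous");
    return step.to;
  }, "detected");
}
```

Because every transition is recorded with its timestamp, the current state can always be recomputed from the log, which is what makes the workflow replayable.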
SafeDeploy does not act directly.
Instead, it generates proposals:
- `rollback-rollout`
- `promote-canary`
- `block-promotion`
Properties:
- Proposals are inert
- Require human approval
- Idempotent (no duplicates per incident)
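One way to get these properties is to derive the proposal ID from the incident and the proposal kind, so that re-running the evaluation loop returns the existing proposal rather than creating a new one. The sketch below is an assumption about how this could work; the `pending` and `rejected` statuses, the ID format, and the function names are hypothetical.

```typescript
type ProposalKind = "rollback-rollout" | "promote-canary" | "block-promotion";
type ProposalStatus = "pending" | "approved" | "rejected";

interface Proposal {
  id: string;
  incidentId: string;
  kind: ProposalKind;
  status: ProposalStatus;
  createdAt: string;
  approvedBy?: string;
}

// Idempotent: the same (incident, kind) pair always maps to the same proposal.
function propose(
  store: Map<string, Proposal>,
  incidentId: string,
  kind: ProposalKind,
  now: Date,
): Proposal {
  const id = `${incidentId}--${kind}`; // hypothetical deterministic ID
  const existing = store.get(id);
  if (existing !== undefined) return existing;
  const proposal: Proposal = {
    id,
    incidentId,
    kind,
    status: "pending", // inert until a human approves it
    createdAt: now.toISOString(),
  };
  store.set(id, proposal);
  return proposal;
}

// Approval requires a named approver and only applies to pending proposals.
function approve(store: Map<string, Proposal>, id: string, approver: string): Proposal {
  const proposal = store.get(id);
  if (proposal === undefined) throw new Error(`Unknown proposal: ${id}`);
  if (approver.trim() === "") throw new Error("A named human approver is required");
  if (proposal.status !== "pending") throw new Error(`Proposal is already ${proposal.status}`);
  const approved: Proposal = { ...proposal, status: "approved", approvedBy: approver };
  store.set(id, approved);
  return approved;
}
```

Note that nothing in this sketch touches a cluster or a repository: a proposal is only data until it is approved.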
When a proposal is approved:
- Proposal marked `approved`
- Rollback commit prepared in environment repo
- New Git branch created
- Pull request opened
- ArgoCD reconciles change
- Incident transitions to `mitigated`
Git is the only actuator in the system.
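To illustrate Git as the only actuator, the sketch below turns an approved rollback proposal into a description of a Git change and nothing else. Every name here (the `ApprovedRollback` fields, the branch prefix, the manifest path) is hypothetical; the document does not specify how branches or files are named.

```typescript
interface ApprovedRollback {
  proposalId: string;
  incidentId: string;
  service: string;
  targetRevision: string; // revision to roll back to (assumed field)
  status: "pending" | "approved";
}

interface GitChangePlan {
  branch: string;
  filePath: string;
  commitMessage: string;
  pullRequestTitle: string;
}

// Produces a plan for a branch, commit, and pull request.
// It deliberately has no way to call Kubernetes or ArgoCD directly.
function planRollbackChange(proposal: ApprovedRollback): GitChangePlan {
  if (proposal.status !== "approved") {
    throw new Error("Only approved proposals may produce Git changes");
  }
  return {
    branch: `safedeploy/rollback-${proposal.incidentId}`,
    filePath: `environments/production/${proposal.service}/rollout.yaml`,
    commitMessage: `rollback(${proposal.service}): revert to ${proposal.targetRevision}`,
    pullRequestTitle: `Rollback ${proposal.service} for ${proposal.incidentId}`,
  };
}
```

Once such a pull request is merged, reconciliation is left to ArgoCD, which matches the flow described above.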
SafeDeploy enforces both structural and runtime policies.
Every service must define:
- Owner
- SLOs
- Rollback strategy
- Deployment strategy (canary required)
Deployment is blocked if:
- Error budget is exhausted
- Burn rate exceeds safe threshold
- Remaining budget is below safety margin
- Freeze window is active
Violations result in:
- Policy incidents
- Promotion-block proposals
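The structural and runtime rules above can be combined into a single gate that returns every violation rather than stopping at the first. This is a sketch under assumptions: the field names, violation codes, and limits are illustrative, not SafeDeploy's actual policy schema.

```typescript
interface ServiceDefinition {
  name: string;
  owner?: string;
  slos: string[];
  rollbackStrategy?: string;
  deploymentStrategy?: string;
}

interface RuntimeState {
  budgetRemaining: number; // fraction of the error budget left
  burnRate: number;
  freezeActive: boolean;
}

interface PolicyLimits {
  maxBurnRate: number;        // assumed threshold
  minBudgetRemaining: number; // assumed safety margin
}

interface PolicyDecision {
  allowed: boolean;
  violations: string[];
}

function evaluatePromotion(
  service: ServiceDefinition,
  runtime: RuntimeState,
  limits: PolicyLimits,
): PolicyDecision {
  const violations: string[] = [];

  // Structural policy: what every service must define.
  if (!service.owner) violations.push("missing-owner");
  if (service.slos.length === 0) violations.push("missing-slos");
  if (!service.rollbackStrategy) violations.push("missing-rollback-strategy");
  if (service.deploymentStrategy !== "canary") violations.push("canary-required");

  // Runtime policy: conditions that block deployment.
  if (runtime.budgetRemaining <= 0) violations.push("error-budget-exhausted");
  else if (runtime.budgetRemaining < limits.minBudgetRemaining) {
    violations.push("budget-below-safety-margin");
  }
  if (runtime.burnRate > limits.maxBurnRate) violations.push("burn-rate-too-high");
  if (runtime.freezeActive) violations.push("freeze-window-active");

  return { allowed: violations.length === 0, violations };
}
```

Collecting all violations makes the resulting decision easier to explain in the evidence trail than a bare allow or deny.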
When error budget is exhausted:
- Service enters a temporary freeze period
- All promotions are blocked
- Stability is enforced before further changes
This prevents rapid redeploy cycles during instability.
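A freeze window can be represented as a simple time interval that the policy gate consults. The sketch below assumes a fixed duration supplied by the caller; the document does not state how long a freeze lasts or how it is lifted, so treat the names and the duration parameter as illustrative.

```typescript
interface FreezeWindow {
  service: string;
  startsAt: number; // epoch milliseconds
  endsAt: number;   // epoch milliseconds, exclusive
  reason: string;
}

// Opens a freeze window starting now and lasting durationMinutes.
function startFreeze(service: string, now: Date, durationMinutes: number): FreezeWindow {
  if (!(durationMinutes > 0)) throw new Error("durationMinutes must be positive");
  const startsAt = now.getTime();
  return {
    service,
    startsAt,
    endsAt: startsAt + durationMinutes * 60_000,
    reason: "error-budget-exhausted",
  };
}

// True while the window is in effect; a missing window means no freeze.
function isFrozen(window: FreezeWindow | undefined, now: Date): boolean {
  if (window === undefined) return false;
  const t = now.getTime();
  return t >= window.startsAt && t < window.endsAt;
}
```

Passing the current time in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to replay.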
Every decision produces structured evidence:
evidence/
  incident-XXXX/
    decision.json
    slo.json
    budget.json
    proposal.json
    approval.json
    resolution.json
Additionally:
- Append-only audit logs
- Stream-separated domains (`metrics`, `budget`, `incidents`, `governance`)
Guarantees:
- Full traceability
- Explainable decisions
- Replayable system behavior
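An append-only, stream-separated log can be sketched in memory as one list of JSON lines per stream. This is an illustration only: the real system writes log files, and the entry fields and class name here are assumptions.

```typescript
type Stream = "metrics" | "budget" | "incidents" | "governance";

interface AuditEntry {
  seq: number; // position within its stream, starting at 1
  stream: Stream;
  at: string;  // ISO-8601 timestamp
  event: string;
  payload: Record<string, unknown>;
}

// Exposes append and replay only; there is no update or delete.
class AuditLog {
  private readonly streams = new Map<Stream, string[]>();

  append(stream: Stream, event: string, payload: Record<string, unknown>, now: Date): AuditEntry {
    const lines = this.streams.get(stream) ?? [];
    const entry: AuditEntry = {
      seq: lines.length + 1,
      stream,
      at: now.toISOString(),
      event,
      payload,
    };
    lines.push(JSON.stringify(entry)); // one JSON document per line
    this.streams.set(stream, lines);
    return entry;
  }

  // Re-parses the stored lines, so callers cannot mutate the log through the result.
  replay(stream: Stream): AuditEntry[] {
    return (this.streams.get(stream) ?? []).map((line) => JSON.parse(line) as AuditEntry);
  }
}
```

Storing serialized lines rather than live objects mirrors a log file: what is replayed is exactly what was written.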
All state is stored locally as artifacts:
- `incidents/*.json`
- `action-proposals/*.json`
- `control-plane/state/*`
- `control-plane/audit/*.log`
This design prioritizes:
- Simplicity
- Debuggability
- Deterministic behavior
SafeDeploy can generate a full system snapshot including:
- Service metadata
- Error budget state
- Active incidents
- Active proposals
- Governance status
- TypeScript (Node.js)
- Zod (runtime validation)
- Vitest (testing)
- Prometheus (metrics & query API)
- Kubernetes
- ArgoCD (GitOps)
- Argo Rollouts (progressive delivery / canary)
- Docker (image build)
- pnpm (package management)
- ESLint + Prettier (code quality)
- Husky + lint-staged (pre-commit enforcement)
- GitOps
- Progressive Delivery (Canary)
- SLO / Error Budget-driven decisions
- Event-driven incident lifecycle
- Strict TypeScript (`strict`, `noUncheckedIndexedAccess`)
- Zod runtime validation (env + Prometheus responses)
- ~98% test coverage (Vitest)
- ESLint + security rules
- Pre-commit enforcement (Husky)
- Deterministic package management (pnpm)
- Single-service prototype (`demo-app`)
- File-based persistence (no DB yet)
- CLI-based approval workflow
These limitations are intentional: they prioritize correctness, traceability, and architectural clarity.
SafeDeploy demonstrates how to:
- Treat deployments as risk-managed decisions
- Use SLOs as the source of truth
- Convert metrics into governed actions
- Enforce reliability at the platform level
- Keep humans in control while enabling automation
- Git is the only path to production
- No direct production access
- Metrics before automation
- Humans retain authority
- Learning is mandatory and persistent
SafeDeploy is not a deployment tool.
It is a reliability governance system that decides:
Should this system change, and is it safe to do so right now?
ISC