ci(infra): apply per-environment tfvars via -var-file (fixes prod private topology) #429

Merged: james-tn merged 1 commit into main from fix/pipeline-tfvars-private-topology on Jun 9, 2026
Conversation

@james-tn james-tn commented Jun 9, 2026

Summary

Fixes a latent bug in the Terraform deploy step: it passed only individual -var flags and never referenced prod.tfvars / dev.tfvars, so enable_networking and enable_private_endpoint silently fell back to their defaults (false). The pipeline therefore deployed a public topology even though both tfvars declare the intended private topology (VNet-integrated Container Apps env + Cosmos private endpoint, public access disabled).

Why this matters (root cause of the prod /chat 500s)

Production ended up in a broken half-state:

  • A manual terraform apply -var-file=prod.tfvars had set Cosmos public_network_access = Disabled, but
  • the Container Apps environment was created without VNet integration (which is immutable post-creation), so the backend egresses over the public internet and Cosmos's firewall rejects it.

Result: every /chat returns 500 (Forbidden ... blocked by your Cosmos DB account firewall settings), which is what's failing integration-tests on PRs to main (e.g. #426) — even though those PRs don't touch the deployed backend.

The change

Select the per-environment var file and pass it as -var-file (listed before the explicit -var flags so env/subscription/location/ACR/images/iteration still override the file):

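# Pick the tfvars file for the target environment; anything that is not production uses dev.
case "<inputs.environment>" in
  production|prod) VAR_FILE="prod.tfvars" ;;
  *)               VAR_FILE="dev.tfvars" ;;
esac
# -var-file comes first so the explicit -var flags that follow take precedence.
TF_VARS=( "-var-file=${VAR_FILE}" -var project_name=... )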
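
The ordering matters because Terraform evaluates command-line -var and -var-file options in the order given, with later values overriding earlier ones. A minimal sketch of the selection logic as reusable functions follows; the function names, the TF_ARGS array, and the `environment` variable name are illustrative assumptions, not the names used in the actual workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an environment name to its tfvars file,
# mirroring the case statement in the PR.
select_var_file() {
  case "$1" in
    production|prod) echo "prod.tfvars" ;;
    *)               echo "dev.tfvars" ;;
  esac
}

# Hypothetical helper: build the argument array with -var-file first so the
# explicit -var flags that follow it win on conflict.
# The variable name "environment" is an assumption for illustration.
build_tf_args() {
  local environment="$1"
  TF_ARGS=(
    "-var-file=$(select_var_file "$environment")"
    -var "environment=${environment}"
  )
}

build_tf_args "prod"
# Print one argument per line to make the ordering visible.
printf '%s\n' "${TF_ARGS[@]}"
```

The array would then be expanded as `terraform apply "${TF_ARGS[@]}"`, which keeps each argument intact even if a value contains spaces.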

⚠️ Deployment impact (read before merging/deploying)

Applying this to an environment whose Container Apps env was created without a VNet will force-replace that environment (VNet integration is immutable), which recreates the contained Container Apps → brief backend/MCP downtime. This also creates the VNet/subnets and the Cosmos private endpoint. Plan and redeploy during a maintenance window.

Both dev.tfvars and prod.tfvars are internally consistent (enable_networking=true + enable_private_endpoint=true), so per-developer integration-* environments will likewise deploy the private topology on their next run.
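
One way to see the force-replace coming before it happens is to scan the plan output for replacements and stop if any are found. The sketch below is not part of this PR; it assumes the human-readable plan output marks replaced resources with a "# <address> must be replaced" line, and the function name and the plan.txt file are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard: reads human-readable `terraform plan` output on stdin.
# Prints every "must be replaced" line and returns 0 if at least one exists,
# 1 otherwise (grep's own exit status).
has_replacements() {
  grep -E '^[[:space:]]*# .* must be replaced$'
}

# Possible usage in a deploy step (left as comments so the sketch runs
# without Terraform installed):
#   terraform plan -no-color -var-file=prod.tfvars > plan.txt
#   if has_replacements < plan.txt; then
#     echo "Plan force-replaces resources; schedule a maintenance window." >&2
#     exit 1
#   fi
```

A text match like this is brittle across Terraform versions; parsing the output of `terraform show -json` on a saved plan is the sturdier option where a JSON tool is available.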

Related

The Terraform deploy step passed only individual -var flags and never
referenced prod.tfvars / dev.tfvars, so enable_networking and
enable_private_endpoint silently fell back to their defaults (false). As a
result the pipeline deployed a PUBLIC topology even though both tfvars declare
the intended PRIVATE topology (VNet-integrated Container Apps env + Cosmos
private endpoint, public access disabled).

This left production in a broken half-state: a manual 'apply -var-file=prod.tfvars'
had disabled Cosmos public access, but the Container Apps environment was
created without VNet integration (which is immutable post-creation), so the
backend egresses over the public internet and Cosmos's firewall rejects it —
every /chat returns 500.

Fix: select the per-environment var file (production → prod.tfvars, else
dev.tfvars) and pass it as -var-file, listed before the explicit -var flags so
those (env, subscription, location, ACR, images, iteration) still override the
file. Networking/model config now comes from the tfvars as intended.

NOTE: applying this to an environment whose Container Apps env was created
without a VNet will FORCE-REPLACE that environment (VNet integration is
immutable), recreating the contained Container Apps. Plan/redeploy during a
maintenance window.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@james-tn james-tn merged commit c026e2f into main Jun 9, 2026
18 checks passed
james-tn added a commit that referenced this pull request Jun 9, 2026
…aform apply (#430)

PRODUCTION RECOVERY.

Two bugs combined to take down the production Container Apps environment during
the private-topology redeploy (#429):

1. The container_apps subnet had no delegation, so creating the VNet-integrated
   Container Apps environment failed with:
     ManagedEnvironmentSubnetDelegationError: The subnet of the environment must
     be delegated to the service 'Microsoft.App/environments'.
   Add the required delegation (Microsoft.App/environments +
   Microsoft.Network/virtualNetworks/subnets/join/action).

2. The deploy step ran 'terraform apply ... | tee' inside an if-condition with no
   pipefail, so the pipeline reported tee's (success) exit status and MASKED the
   failed apply. The old environment had already been destroyed, so the job went
   green while production was left with no environment and no container apps.
   Capture terraform's real exit code via PIPESTATUS so a failed apply fails the
   job instead of silently reporting success.

With the delegation in place the environment can be (re)created, and the
PIPESTATUS fix ensures any future apply failure surfaces instead of corrupting
the environment.
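
The masking described in item 2 can be reproduced without Terraform. In the sketch below, `false` stands in for a failing apply; the run_logged helper and its log-file argument are hypothetical names, not the ones in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, tee its output to a log file, and
# return the command's own exit code instead of tee's.
run_logged() {
  local log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS[0] is the exit status of the first command in the pipeline above.
  return "${PIPESTATUS[0]}"
}

# Without pipefail, the pipeline's status is tee's, so this branch is taken
# even though the first command failed.
if false | tee /dev/null; then
  echo "masked: pipeline reported success"
fi

# With the wrapper, the failure is visible to the if-condition.
if ! run_logged /dev/null false; then
  echo "surfaced: command failed"
fi
```

Setting `set -o pipefail` at the top of the step is an alternative with a similar effect; reading PIPESTATUS has the advantage of reporting which stage failed and with what code.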

Co-authored-by: James N. <james.nguyen@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>