ci(infra): apply per-environment tfvars via -var-file (fixes prod private topology) by james-tn · Pull Request #429 · microsoft/OpenAIWorkshop

james-tn · 2026-06-09T23:11:50Z

Summary

Fixes a latent bug in the Terraform deploy step: it passed only individual -var flags and never referenced prod.tfvars / dev.tfvars, so enable_networking and enable_private_endpoint silently fell back to their defaults (false). The pipeline therefore deployed a public topology even though both tfvars declare the intended private topology (VNet-integrated Container Apps env + Cosmos private endpoint, public access disabled).

Why this matters (root cause of the prod `/chat` 500s)

Production ended up in a broken half-state:

A manual terraform apply -var-file=prod.tfvars had set Cosmos public_network_access = Disabled, but
the Container Apps environment was created without VNet integration (which is immutable post-creation), so the backend egresses over the public internet and Cosmos's firewall rejects it.

Result: every /chat returns 500 (Forbidden ... blocked by your Cosmos DB account firewall settings), which is what's failing integration-tests on PRs to main (e.g. #426) — even though those PRs don't touch the deployed backend.

The change

Select the per-environment var file and pass it as -var-file (listed before the explicit -var flags so env/subscription/location/ACR/images/iteration still override the file):

case "<inputs.environment>" in
  production|prod) VAR_FILE="prod.tfvars" ;;
  *)               VAR_FILE="dev.tfvars" ;;
esac
TF_VARS=( -var-file=${VAR_FILE} -var project_name=... )
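
For reviewers who want to check the selection and ordering logic in isolation, here is a minimal sketch of the same idea as standalone bash. The function names `select_var_file` and `build_tf_vars` and the `environment` variable name are hypothetical, chosen for illustration; they are not taken from the workflow. It relies on Terraform giving later command-line variable arguments precedence over earlier ones, which is why the var file comes first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an environment name to its tfvars file.
select_var_file() {
  case "$1" in
    production|prod) echo "prod.tfvars" ;;
    *)               echo "dev.tfvars" ;;
  esac
}

# Hypothetical helper: build the argument array with the var file first and
# the explicit -var flags after it, so the later flags override file values.
build_tf_vars() {
  local environment="$1"
  local var_file
  var_file="$(select_var_file "$environment")"
  # "environment" is an assumed variable name, used only for illustration.
  TF_VARS=( "-var-file=${var_file}" -var "environment=${environment}" )
}

build_tf_vars "production"
printf '%s\n' "${TF_VARS[@]}"
```

Anything that is not `production` or `prod`, including the per-developer integration-* environments, falls through to dev.tfvars.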

⚠️ Deployment impact (read before merging/deploying)

Applying this to an environment whose Container Apps env was created without a VNet will force-replace that environment (VNet integration is immutable), which recreates the contained Container Apps → brief backend/MCP downtime. This also creates the VNet/subnets and the Cosmos private endpoint. Plan and redeploy during a maintenance window.
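
One way to see the force-replace coming before it happens is to scan the plan output for replacement markers. The sketch below is a heuristic, not part of this PR: `list_replacements` is a hypothetical helper that greps the human-readable plan text for "must be replaced" lines, and the sample plan text and resource addresses are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read human-readable plan text on stdin and print the
# addresses of resources whose header line ends in "must be replaced".
list_replacements() {
  grep -E '^[[:space:]]*# .+ must be replaced$' \
    | sed -E 's/^[[:space:]]*# (.+) must be replaced$/\1/'
}

# Illustrative sample only; these resource addresses are assumptions.
sample_plan='  # azurerm_container_app_environment.example must be replaced
-/+ resource "azurerm_container_app_environment" "example" {
  # azurerm_subnet.example will be created'

replaced="$(printf '%s\n' "$sample_plan" | list_replacements)"
if [ -n "$replaced" ]; then
  echo "Plan replaces: $replaced"
fi
```

In a pipeline this could be fed from `terraform plan -no-color`. Matching on plan text is fragile across Terraform versions; inspecting the JSON form of a saved plan (`terraform show -json`) is likely the sturdier option if this check is ever made a gate.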

Both dev.tfvars and prod.tfvars are internally consistent (enable_networking=true + enable_private_endpoint=true), so per-developer integration-* environments will likewise deploy the private topology on their next run.

The Terraform deploy step passed only individual -var flags and never referenced prod.tfvars / dev.tfvars, so enable_networking and enable_private_endpoint silently fell back to their defaults (false). As a result the pipeline deployed a PUBLIC topology even though both tfvars declare the intended PRIVATE topology (VNet-integrated Container Apps env + Cosmos private endpoint, public access disabled).

This left production in a broken half-state: a manual 'apply -var-file=prod.tfvars' had disabled Cosmos public access, but the Container Apps environment was created without VNet integration (which is immutable post-creation), so the backend egresses over the public internet and Cosmos's firewall rejects it. Every /chat returns 500.

Fix: select the per-environment var file (production → prod.tfvars, else dev.tfvars) and pass it as -var-file, listed before the explicit -var flags so those (env, subscription, location, ACR, images, iteration) still override the file. Networking/model config now comes from the tfvars as intended.

NOTE: applying this to an environment whose Container Apps env was created without a VNet will FORCE-REPLACE that environment (VNet integration is immutable), recreating the contained Container Apps. Plan/redeploy during a maintenance window.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…aform apply (#430)

PRODUCTION RECOVERY. Two bugs combined to take down the production Container Apps environment during the private-topology redeploy (#429):

1. The container_apps subnet had no delegation, so creating the VNet-integrated Container Apps environment failed with: ManagedEnvironmentSubnetDelegationError: The subnet of the environment must be delegated to the service 'Microsoft.App/environments'. Add the required delegation (Microsoft.App/environments + Microsoft.Network/virtualNetworks/subnets/join/action).

2. The deploy step ran 'terraform apply ... | tee' inside an if-condition with no pipefail, so the pipeline reported tee's (success) exit status and MASKED the failed apply. The old environment had already been destroyed, so the job went green while production was left with no environment and no container apps. Capture terraform's real exit code via PIPESTATUS so a failed apply fails the job instead of silently reporting success.

With the delegation in place the environment can be (re)created, and the PIPESTATUS fix ensures any future apply failure surfaces instead of corrupting the environment.

Co-authored-by: James N. <james.nguyen@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
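
The exit-status masking described in item 2 is easy to reproduce without Terraform. The following is a minimal bash sketch, not the actual workflow step: `fake_apply` is a hypothetical stand-in for a failing `terraform apply`, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Make sure pipefail is off so the masking behaviour is reproduced.
set +o pipefail

# Hypothetical stand-in for a failing `terraform apply`; always exits 1.
fake_apply() {
  echo "Error: ManagedEnvironmentSubnetDelegationError"
  return 1
}

# Masked: the if-condition sees the pipeline's status, which without
# pipefail is the status of the last command (tee), so failure looks fine.
if fake_apply | tee /dev/null >/dev/null; then
  masked_result="success"
else
  masked_result="failure"
fi

# Fixed: run the pipeline, then read the first command's real exit status
# from PIPESTATUS immediately afterwards.
fake_apply | tee /dev/null >/dev/null
apply_status=${PIPESTATUS[0]}
if [ "$apply_status" -ne 0 ]; then
  fixed_result="failure"
else
  fixed_result="success"
fi

echo "masked=${masked_result} fixed=${fixed_result}"
```

`PIPESTATUS` has to be read right after the pipeline, because the next command overwrites it. Enabling `set -o pipefail` for the step would be an alternative way to surface the failure; the commit message describes the PIPESTATUS approach.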

james-tn temporarily deployed to production June 9, 2026 23:12 — with GitHub Actions Inactive

james-tn merged commit c026e2f into main Jun 9, 2026
18 checks passed

james-tn mentioned this pull request Jun 9, 2026

fix(infra): delegate Container Apps subnet + stop masking failed terraform apply (PROD RECOVERY) #430

Merged

james-tn merged 1 commit into main from fix/pipeline-tfvars-private-topology
