
Switch Heroku build to container runtime; pin python:3.12-slim-bookworm #3588

Open
jstvz wants to merge 10 commits into main from restart/phase-0-spike

Conversation

jstvz (Contributor) commented May 7, 2026

Summary

Switch the Heroku build to the container runtime, pin the Python base image to
python:3.12-slim-bookworm, and enable per-PR review apps on the metadeploy
pipeline so subsequent changes get an auto-built review environment.

Changes

  • heroku.yml — declares the container build for the existing Dockerfile
    and a release step that runs ./.heroku/release.sh (mirrors the previous
    Procfile's release: line). Per-process run: blocks are declared
    explicitly for web, devworker, worker, and worker-short.
  • Dockerfile:
    • Pins the base image from python:3.12 to python:3.12-slim-bookworm.
      Slim removes the compilers and -dev headers needed to build the C
      extensions a few of our dependencies still build from source
      (cryptography, lxml, psycopg2, multidict, …), so two small shims are
      added in-line:
      • apt install build-essential libxml2-dev libxslt-dev libpq-dev libffi-dev gettext redis-tools curl
      • pip install "setuptools<81" so cumulusci's
        pkg_resources.declare_namespace("cumulusci") import keeps working
        under modern pip.
      The -bookworm suffix is pinned explicitly: the unpinned python:3.12-slim
      tag now resolves to Debian trixie (gcc 14), whose stricter default
      warnings break multidict 6.0.4's pre-3.12-CPython C source.
    • Re-declares ARG BUILD_ENV / PROD_ASSETS / OMNIOUT_TOKEN inside the
      second stage so the yarn prod conditional actually sees them. ARGs
      declared above the first FROM are out of scope for RUN steps.
    • Switches CMD from /app/start-server.sh (dev-mode yarn serve under
      config.settings.local) to the Procfile's web command
      (daphne --bind 0.0.0.0 --port $PORT metadeploy.asgi:application).
      Same behavior on Common Runtime, where heroku.yml run wins; correct
      behavior on Private Spaces, where heroku.yml run is ignored and the
      in-image CMD is what runs. `docker-compose.yml` has its own `command:`
      override, so local dev is unaffected.
  • app.json (a minimal sketch follows this list):
    • "stack": "container" at the top level; review apps inherit on creation.
    • formation.{web,devworker,worker,worker-short}.size flipped from the
      dead "free" to "basic" (Heroku removed free dynos on 2022-11-28).
    • environments.review.scripts.postdeploy fixed from the non-existent
      ./manage.py populate_db to the actual command name populate_data
      (metadeploy/api/management/commands/populate_data.py).
    • Removed the buildpacks block — dead config under stack: container,
      contradicts the stack declaration and is ignored.
    • Removed the environments.test block — Heroku CI doesn't support
      container builds, and CI already moved to GitHub Actions in 2022
      (.github/workflows/test.yml).
  • docs/heroku-container-runtime.md (new) — operator-facing doc covering
    the build/release path (Heroku-built preferred, local container:push
    fallback), the Heroku Private Spaces `CMD`-vs-`heroku.yml run` quirk, the
    `heroku container:release` does-not-run-`release.command` quirk, and the
    manual CVE rebuild cadence (monthly + on Critical CVE) used until automated
    rebuild plumbing lands.
  • Review apps enabled on the `metadeploy` pipeline (autodeploy +
    autodestroy, 5-day stale).
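
A minimal sketch of the resulting app.json shape (formation quantities are illustrative, and the real file carries additional keys such as addons and env; the postdeploy invocation reflects the populate_data fix described above):

```json
{
  "stack": "container",
  "formation": {
    "web": { "size": "basic", "quantity": 1 },
    "devworker": { "size": "basic", "quantity": 1 },
    "worker": { "size": "basic", "quantity": 0 },
    "worker-short": { "size": "basic", "quantity": 0 }
  },
  "environments": {
    "review": {
      "scripts": {
        "postdeploy": "./manage.py populate_data"
      }
    }
  }
}
```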

Why slim-bookworm vs. `python:3.12`

  • ~510 MB smaller image
  • 66 fewer high/critical CVEs at the base layer

Verification (live, against the auto-built review app; a smoke-check sketch follows this list)

  • `/` → 200 (SPA renders, site context loads, `<title>MetaDeploy`)
  • `/api/products/` → 200 (35 products, 41 plans, 55 steps after `populate_data`)
  • `/api/plans/` → 200 (41 plans)
  • `/api/versions/` → 200 (36 versions)
  • `/products/eda/` → 200 (8.8 KB SPA shell renders for the EDA product page)
  • Container boot: daphne starts cleanly, migrations apply via release.sh
    (when run by Heroku's builder).
  • cumulusci Robot suite (`uv run cci task run robot --org enterprise -o vars "BASE_URL:https://metadeploy-pr-3588.herokuapp.com,PRODUCT:eda,PLAN:install"`)
    drove the SPA end-to-end on a freshly created scratch org: home → EDA
    product → Install plan → Log In → Use Custom Domain → entered scratch
    instance_url → Continue. Stops at the Salesforce OAuth boundary with
    `redirect_uri_mismatch` because the per-app Connected App's callback
    allowlist doesn't include `https://metadeploy-pr-3588.herokuapp.com/accounts/salesforce/login/callback`.
    Container-runtime, daphne, ASGI, frontend assets, websockets handshake,
    and Django session/CSRF middleware are all validated end-to-end; what's
    blocked is the SF OAuth handoff, which is outside Phase 0 scope.
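
For reference, the HTTP smoke portion of the verification above boils down to something like this (a minimal sketch; the hostname is simply the one this PR's review app happened to get):

```bash
#!/usr/bin/env bash
# Minimal HTTP smoke check against the auto-built review app.
BASE_URL="https://metadeploy-pr-3588.herokuapp.com"
for path in / /api/products/ /api/plans/ /api/versions/ /products/eda/; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}${path}")
  echo "${path} -> ${code}"   # expect 200 for every route listed above
done
```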

Known follow-ons (not in this PR)

  • SF Connected App callback URL is the gating issue for any Robot/UI
    verification on review apps.
    Each review app gets a different URL,
    but the per-app Connected App's allowlist is static. Three resolution
    paths: (a) accept Robot stops at Log-In on review apps (HTTP smoke is
    the integration boundary); (b) per-PR Connected App via Salesforce
    Metadata API in `app.json` postdeploy; (c) fixed wildcard subdomain
    (Heroku-side). (a) is the cheapest and what we're doing today.
  • `heroku container:push` rebuilds the image without forwarding
    `--build-arg`.
    It runs its own `docker build` and ignores any
    locally-built and -tagged image. Result: every `heroku container:push`
    produces a `BUILD_ENV=development` image with empty `dist/prod/` even
    if the local registry has the right image. Workaround: build locally
    with `docker buildx build --no-cache --platform linux/amd64
    --build-arg BUILD_ENV=production -t registry.heroku.com//web --load .`
    then `docker push` directly (NOT `heroku container:push`), then
    `heroku container:release web -a `. Permanent fix candidates:
    add `build.config.BUILD_ENV: production` to `heroku.yml` (see the
    build.config sketch after this list); or restructure the Dockerfile
    to make asset compilation always-on.
  • populate_data sample-data limitations. The EDA plan slug is
    `install` (not `full-install`); no plan exposes a scratch-org install
    path because `supported_orgs` is unset on every populated plan. The
    Robot suite's `Tasks.Scratch Org` cannot reach the "Create Scratch Org"
    button. Either bias `populate_data` to a slug-pair Robot expects, or
    document the working slug pairs.
  • CCI 3.93.0 + `sf` CLI 2.131.7 incompatibility. CCI 3.93.0 hardcodes
    the removed `sfdx force:org:create` call. `cci flow run dev_org` fails
    with that. Workaround: `sf org create scratch -f orgs/.json
    --target-dev-hub --duration-days 1 -a ` then
    `cci org import `. Phase 7a (cumulusci v4.x harmonization)
    needs to confirm v4.x doesn't carry this; if it does, file upstream.
  • Frontend "Offline mode" banner under headless Chrome. Selenium-driven
    Chrome cannot establish the API websocket for live status updates. SPA
    navigation still works; anything dependent on the WS channel (preflight
    progress, install progress) won't update in real time during a Robot run.
  • Private Space prod cutover. `metadeploy-stg` runs in Private Space
    `metadeploy-staging`, where `heroku.yml run` is ignored and the in-image
    `CMD` is what dynos execute. The CMD change above makes the `web` process
    prod-correct, but `devworker`, `worker`, and `worker-short` still rely on
    `heroku.yml run` and would all launch daphne in Private Spaces. The proper
    fix is per-process Dockerfiles (`Dockerfile.web`, `Dockerfile.devworker`,
    `Dockerfile.worker`, `Dockerfile.worker-short`) pushed via
    `heroku container:push --recursive`.
  • `worker` dyno's Chrome path is buildpack-flavored.
    `.heroku/start_metadeploy_worker.sh` symlinks
    `/app/.apt/usr/bin/google-chrome`, a path created by the legacy heroku-apt
    buildpack. Under the container runtime this path doesn't exist. Doesn't
    block review-app verification because `worker` and `worker-short` are
    scaled to 0 for review apps.
  • `heroku ps:exec` not wired. No `.profile.d/heroku-exec.sh` and no
    `bash` symlink in the image, so `heroku ps:exec` shell debugging is
    unavailable. Cheap to add later.
  • Application-stack CVEs (~128 H/C) remain. The slim cutover removes
    base-layer CVEs but the application stack (`sfdx-cli`, deprecated npm
    packages, `cumulusci`/setuptools deprecation) is untouched here.
  • Heroku CLI 11.3.0 `reviewapps:enable` quirk. The CLI sends a
    malformed `deploy_target` and 404s; the platform API call works directly.
    If you hit this, use the API.
  • `OMNIOUT_TOKEN` not wired through `build.config`. The Dockerfile
    declares `ARG OMNIOUT_TOKEN` and `metadeploy-stg` has it set as a config
    var (review apps inherit), but it is not currently forwarded to the Docker
    build args via `heroku.yml`'s `build.config`. Fine because
    `yarn install --ignore-optional` skips the package that needs the token;
    add a `build.config.OMNIOUT_TOKEN: OMNIOUT_TOKEN` block later (see the
    build.config sketch after this list) if the optional
    `@omnistudio/omniscript-lwc-compiler` install becomes required.
  • Heroku Container Registry 30-day retention. Images outside the 20
    most recent releases are deleted after 30 days. Only relevant if you ever
    need to re-release an old image directly (rebuilds from source are
    always fine).
  • Investigate why a fresh review-app web dyno needed `ps:restart` after
    `release.sh` migrations before it could see the new tables. Likely
    Postgres connection-level metadata caching across the brief window where
    daphne opened connections before the release dyno finished migrating;
    needs a proper repro and a release-phase ordering audit.
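
For the two build-arg follow-ons above (BUILD_ENV forwarding and OMNIOUT_TOKEN), the candidate `heroku.yml` addition would look roughly like this. A sketch only, not part of this PR; whether `build.config` can pass an app config-var value through for `OMNIOUT_TOKEN`, rather than a literal, still needs confirming.

```yaml
# heroku.yml: candidate follow-up, not in this PR
build:
  docker:
    web: Dockerfile
  config:
    BUILD_ENV: production   # forwarded to the Dockerfile's ARG BUILD_ENV at build time
    # OMNIOUT_TOKEN would need equivalent wiring if the optional
    # @omnistudio/omniscript-lwc-compiler install ever becomes required;
    # literal vs. config-var passthrough is still an open question.
```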

jstvz and others added 3 commits May 6, 2026 22:39
Switch the Dockerfile FROM python:3.12 to python:3.12-slim-bookworm to
shrink the image by ~510 MB and drop 66 high/critical base-image CVEs
without touching requirements/*.txt or Phase 2 work. Required slim shims
are added in-line: build-essential plus -dev headers for source-built
wheels (multidict, etc.), and a setuptools<81 pin so cumulusci's
pkg_resources.declare_namespace import keeps working under modern pip.

Pin the slim base to -bookworm explicitly: the unpinned python:3.12-slim
tag now resolves to debian trixie (gcc 14), whose stricter default
warnings break multidict 6.0.4's pre-3.12-CPython C source.

Co-authored-by: Cursor <cursoragent@cursor.com>
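
The slim shims amount to roughly the following Dockerfile fragment (a sketch of the shape, not the verbatim diff):

```dockerfile
FROM python:3.12-slim-bookworm

# Slim drops compilers and -dev headers, so restore just enough toolchain
# for the dependencies that still build C extensions from source
# (cryptography, lxml, psycopg2, multidict, ...).
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        build-essential libxml2-dev libxslt-dev libpq-dev libffi-dev \
        gettext redis-tools curl \
    && rm -rf /var/lib/apt/lists/*

# Keep setuptools below 81 so cumulusci's
# pkg_resources.declare_namespace("cumulusci") import still works.
RUN pip install --no-cache-dir "setuptools<81"
```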
M1: caveat the recommendation paragraph to mention whole-image CVEs
    (not just base-image CVEs) so a stop-after-recommendation reader
    doesn't leave with an inflated impression of the slim win.
M3: split 'Concerns to surface for sub-task 0.3' into '0.3 hand-offs'
    (items 1, 2) and 'Deferred / out-of-scope follow-ups' (items 3, 4, 5).
    The original heading misrepresented its own contents.
M5: add a Dockerfile cross-reference comment explaining why setuptools<81
    is pinned in two RUN lines (the second pip-install layer would
    otherwise re-resolve setuptools to >=81 via --upgrade pip-tools).

M2 (percentage) and M4 (alphabetize apt packages) skipped per reviewer
('skippable').

Co-authored-by: Cursor <cursoragent@cursor.com>
@jstvz jstvz requested a review from a team as a code owner May 7, 2026 06:13
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 06:15 Inactive
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 06:23 Inactive
Two consecutive review-app builds timed out 'waiting to start' because
heroku.yml only declared build.docker.web while app.json's formation
declares four process types (web, devworker, worker, worker-short).
Heroku's container build dispatcher couldn't reconcile that.

Also: the Dockerfile's CMD is start-server.sh (dev-mode: runs migrate +
populate_data + 'yarn serve' under config.settings.local), not the
production daphne entrypoint. Without explicit run.web in heroku.yml,
Heroku would have run start-server.sh on the production review app —
wrong command, wrong settings module.

run.* mirrors the existing Procfile commands exactly:
  web         -> daphne ASGI server (production)
  devworker   -> honcho dev-worker bundle
  worker      -> Selenium browser worker (note: chrome path is
                 buildpack-flavored; tracked as known follow-on; this
                 dyno stays quantity:0 in app.json review formation)
  worker-short -> honcho short-job worker bundle

Co-authored-by: Cursor <cursoragent@cursor.com>
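
For reference, the resulting heroku.yml has roughly this shape (the web command is verbatim from the Procfile; the other run commands are elided here rather than reproduced from memory):

```yaml
# heroku.yml (shape only; non-web commands elided)
build:
  docker:
    web: Dockerfile
release:
  image: web
  command:
    - ./.heroku/release.sh
run:
  web: daphne --bind 0.0.0.0 --port $PORT metadeploy.asgi:application
  devworker: <honcho dev-worker bundle, mirrored from the Procfile>
  worker: <Selenium browser worker, mirrored from the Procfile>
  worker-short: <honcho short-job worker bundle, mirrored from the Procfile>
```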
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 06:30 Inactive
@jstvz jstvz force-pushed the restart/phase-0-spike branch from 34aeff5 to c6547e5 May 7, 2026
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 06:52 Inactive
@jstvz jstvz changed the title from "Phase 0: container-runtime spike + base image decision" to "Switch Heroku build to container runtime; pin python:3.12-slim-bookworm" May 7, 2026
…prod

Without redeclaring ARG BUILD_ENV / PROD_ASSETS / OMNIOUT_TOKEN inside
the python:3.12-slim-bookworm stage, the values declared above the first
FROM are out of scope for RUN instructions. The yarn-prod conditional
`[ "${BUILD_ENV}" = "production" ]` evaluated empty-string against
"production", fell through to the else branch (`mkdir -p dist/prod`),
and shipped an empty dist/prod. The Django index.html template loader
then 500'd on `/` with TemplateDoesNotExist (the SPA bundle is built
into dist/prod/index.html).

Surfaced during the first end-to-end smoke of the container build on
metadeploy-pr-3588.
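
The scoping rule in play, illustrated (a simplified Dockerfile, not the project's actual one):

```dockerfile
# ARGs declared before the first FROM are only visible to FROM lines;
# every stage that needs them in RUN steps must re-declare them.
ARG BUILD_ENV

FROM python:3.12-slim-bookworm AS app
# Without this re-declaration, ${BUILD_ENV} expands to "" below and the
# conditional silently falls through to the else branch.
ARG BUILD_ENV
ARG PROD_ASSETS
ARG OMNIOUT_TOKEN
RUN if [ "${BUILD_ENV}" = "production" ]; then \
        yarn prod; \
    else \
        mkdir -p dist/prod; \
    fi
```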
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 07:37 Inactive
… blocks

CMD: switch from /app/start-server.sh (dev-mode yarn serve under
config.settings.local) to the Procfile's web command
(daphne --bind 0.0.0.0 --port $PORT metadeploy.asgi:application).
This is the same command heroku.yml's run.web declares, so behavior
is unchanged on Common Runtime where heroku.yml run wins. It is
materially different on Private Spaces, where heroku.yml run is
ignored and the in-image CMD is what runs; the previous CMD would
have launched the dev server in production. docker-compose.yml has
its own command: override invoking start-server.sh, so local dev is
unaffected.

curl: add to the apt install list. Heroku release-phase log
streaming relies on curl in the image; without it, release output
silently degrades to app-logs only. Also useful for in-dyno debug.

app.json buildpacks block: dead config under stack: container.
heroku.yml is the source of truth for container builds; the
buildpacks declaration contradicts the stack and is ignored.

app.json environments.test block: Heroku CI does not support
container builds, and the metadeploy pipeline has zero recorded
test-runs. CI moved to GitHub Actions in 2022 (.github/workflows/
test.yml). The block was a silent no-op.
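
Local development is insulated from the CMD change because docker-compose.yml carries its own command: override, roughly as follows (service name is an assumption):

```yaml
# docker-compose.yml (illustrative; service name assumed)
services:
  web:
    build: .
    command: /app/start-server.sh   # dev mode: migrate + populate_data + yarn serve
```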
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 08:28 Inactive
Documents the container-runtime build/release path (Heroku-built
preferred, local container:push fallback), the Heroku Private
Spaces CMD-vs-heroku.yml-run quirk, and a manual CVE rebuild
cadence (monthly + on Critical CVE) to use until automated
rebuild plumbing lands. Replaces the buildpacks-shaped portions
of running_heroku.md (which still need a separate rewrite).

This page is the public-facing landing for operators. It
explains why the Dockerfile CMD must stay aligned with the
heroku.yml web run command, and why container:release skips
release.command (so manual release.sh runs are required after
a local container:push round).
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 08:53 Inactive
GitHub deprecated v3 of actions/upload-artifact, and active workflow
runs now hard-fail at job-prep with the message:
"This request has been automatically failed because it uses a
deprecated version of actions/upload-artifact: v3."

Bump three call sites: test.yml (Frontend coverage, Backend coverage)
and smoke_test.yml (Robot results on failure).

The change is unrelated to the Phase 0 container-runtime work but
is needed to get this PR's CI green. Without the bump, Build and
Lint pass while Frontend and Backend are blocked at start.
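
The bump itself is a one-line change at each of the three call sites, along these lines:

```diff
 # three call sites across test.yml and smoke_test.yml
-      - uses: actions/upload-artifact@v3
+      - uses: actions/upload-artifact@v4
```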
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 15:50 Inactive
@jstvz jstvz temporarily deployed to metadeploy-pr-3588 May 7, 2026 15:57 Inactive
@jstvz jstvz had a problem deploying to metadeploy-pr-3588 May 7, 2026 16:03 Failure