Allow on-the-fly updates to adaptors, schemas, and icons #4473

Open
taylordowns2000 wants to merge 15 commits into main from christmas-for-devops

Conversation

@taylordowns2000
Member

@taylordowns2000 taylordowns2000 commented Feb 26, 2026

Description

Enables on-the-fly updates to the adaptor registry, credential schemas, and adaptor icons without requiring an application restart. Superusers can trigger refreshes from a new Settings > Maintenance admin page, and a configurable Oban cron job (ADAPTOR_REFRESH_INTERVAL_HOURS) keeps them in sync automatically across clustered nodes.

Closes #3114
Closes #2209
Closes #325 (wow what a golden oldie!)
Closes #1996

Changes

  • New modules: AdaptorIcons, CredentialSchemas, and AdaptorRefreshWorker — extract runtime refresh logic out of mix tasks into callable modules
  • AdaptorRegistry — refactored GenServer state from a bare list to a map (%{adaptors, cache_path, local_mode}); added refresh/1, refresh_sync/1, and PubSub-based cross-node sync via adaptor:refresh topic
  • MaintenanceLive — new LiveView under Settings with action buttons for each refresh operation, gated to superusers
  • Mix tasks (install_adaptor_icons, install_schemas) — slimmed down to thin wrappers around the new modules
  • Oban cron — AdaptorRefreshWorker runs on a configurable interval; skips in local adaptors mode; broadcasts to peer nodes on success
  • Minor .env.example typo fixes
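
The cross-node sync in the AdaptorRegistry bullet could be sketched roughly as below. The state map shape and the adaptor:refresh topic come from this description; the module internals (function bodies, cast/info plumbing) are illustrative assumptions, not the actual code:

```elixir
defmodule Lightning.AdaptorRegistry do
  use GenServer
  # Sketch of the PubSub-based cross-node sync described above.
  # Only the state shape and topic name are taken from the PR text;
  # the rest is a hypothetical reconstruction.

  @topic "adaptor:refresh"

  def start_link(opts),
    do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    Phoenix.PubSub.subscribe(Lightning.PubSub, @topic)
    {:ok, %{adaptors: [], cache_path: opts[:cache_path], local_mode: false}}
  end

  # Refresh locally, then tell every other node to do the same.
  def refresh do
    GenServer.cast(__MODULE__, :refresh)
    Phoenix.PubSub.broadcast_from(Lightning.PubSub, self(), @topic, :refresh)
  end

  @impl true
  def handle_cast(:refresh, state), do: {:noreply, do_refresh(state)}

  # Peer nodes receive the broadcast here and refresh themselves.
  @impl true
  def handle_info(:refresh, state), do: {:noreply, do_refresh(state)}

  defp do_refresh(state), do: %{state | adaptors: fetch_adaptors()}
  defp fetch_adaptors, do: []  # placeholder for the NPM registry fetch
end
```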

Validation steps

  1. Clear your adaptor registry and credential schemas
  2. Log in as a superuser
  3. Note that you can't find adaptors
  4. Note that you can't find credential schemas
  5. Go to the admin dashboard
  6. Go to "Maintenance"
  7. Click the buttons and check that the things come back!

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review
    with Claude Code)
  • I have implemented and tested all related authorization policies.
    (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation bot moved this to New Issues in Core Feb 26, 2026
@taylordowns2000 taylordowns2000 marked this pull request as ready for review February 26, 2026 23:35
@codecov

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 60.88710% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.07%. Comparing base (e7c8d51) to head (9003c09).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
lib/lightning/maintenance.ex 0.00% 61 Missing ⚠️
lib/lightning_web/live/maintenance_live/index.ex 57.50% 17 Missing ⚠️
lib/lightning/adaptor_registry.ex 71.42% 8 Missing ⚠️
lib/lightning/adaptor_refresh_worker.ex 83.33% 4 Missing ⚠️
lib/lightning/config/bootstrap.ex 57.14% 3 Missing ⚠️
lib/lightning/application.ex 66.66% 2 Missing ⚠️
lib/lightning/credential_schemas.ex 97.43% 1 Missing ⚠️
lib/mix/tasks/install_adaptor_icons.ex 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4473      +/-   ##
==========================================
- Coverage   89.47%   89.07%   -0.41%     
==========================================
  Files         425      430       +5     
  Lines       20212    20388     +176     
==========================================
+ Hits        18085    18160      +75     
- Misses       2127     2228     +101     

☔ View full report in Codecov by Sentry.


@josephjclark
Collaborator

When I try and run this locally I get:

[notice] Application lightning exited: exited in: Lightning.Application.start(:normal, [:timex])
    ** (EXIT) an exception was raised:
        ** (RuntimeError) No Oban instance named `Oban` is running and config isn't available.

            (oban 2.20.1) lib/oban/registry.ex:37: Oban.Registry.config/1
            (oban 2.20.1) lib/oban.ex:609: Oban.insert/3
            (lightning 2.15.15-pre) lib/lightning/application.ex:155: Lightning.Application.start/2
            (kernel 10.2.6) application_master.erl:349: :application_master.start_supervisor/3
            (kernel 10.2.6) application_master.erl:331: :application_master.start_the_app/5
            (kernel 10.2.6) application_master.erl:317: :application_master.start_it_new/7

Main is fine

I haven't looked closely at this, but Claude says:

The error is happening in schedule_adaptor_refresh/0 at lib/lightning/application.ex:160-166.

The function calls Oban.insert/3 immediately after the supervisor starts, but Oban hasn't fully initialized yet by that point in the startup sequence. Oban needs a moment after its supervisor child starts before it can accept job insertions.

The root cause is a race condition in the startup sequence:

  • Supervisor.start_link(children, opts) starts all children (including Oban) and returns {:ok, pid}
  • But Oban.insert/3 is called synchronously right after, before Oban's internal registry/config is fully ready
  • The unless condition also won't protect against this -- both local_adaptors_enabled?() and the :test env check pass, so it tries to insert the job.

How to fix it: You'd typically schedule this via a small delay (schedule_in: 5 or similar) or use a Task that waits, or restructure so the Oban insert happens after a confirmed ready signal. The simplest fix is wrapping the insert in a Task.start/1 so it runs asynchronously after the supervisor has returned:


defp schedule_adaptor_refresh do
  unless Lightning.AdaptorRegistry.local_adaptors_enabled?() or
           Lightning.Config.env() == :test do
    Task.start(fn ->
      Lightning.AdaptorRefreshWorker.new(%{}, schedule_in: 0)
      |> Oban.insert()
    end)
  end
end

Or just use schedule_in: 5 (seconds) to give Oban time to initialize before the job is inserted.

not sure if this is coincidence or a problem on the branch?

@taylordowns2000
Member Author

Unfortunately that's just what happens when you start the app, @josephjclark!

I reported it first on September 10th, 2025. @stuartc reported it again more recently on January 21st.

Collaborator

@josephjclark josephjclark left a comment


Tested this locally and it seems to work! The same mix commands still work as on main (so we don't need to change anything in deployments), and the buttons on the admin page work great.

I get Oban errors on this branch (but not others). Claude made some suggestions which made it go away. I honestly have no idea - an adult should take a look at it.

I'm a little discomforted that the logic of the schema install (and presumably others) has changed so much. Error handling looks different. I'm sure it's fine!

Member

@stuartc stuartc left a comment


Hey @taylordowns2000, thanks for tackling this -- you know it's something I've
wanted to address for a long time. I have a few architectural concerns specific
to clustered/ephemeral deployments.

Rolling restarts: late nodes miss the refresh

The Oban worker uses unique: [period: 3600], which means only the first node's
startup job gets accepted during a rolling deploy. The job can run on any node
-- including one that's about to be terminated. The PubSub broadcast only
reaches nodes that are alive and subscribed at that moment. Nodes that start
later get their Oban job deduped and never receive a broadcast, so they're stuck
with build-time data until the next cron run.

Rough timeline for a 3-node rolling deploy:

T=0   Old A, Old B, Old C running
T=1   New A starts -> Oban job accepted
T=3   Job runs on Old B (about to die)
T=8   Job finishes -> broadcasts -> New A receives it
T=10  Old B terminates (refreshed data lost)
T=12  New B starts -> Oban job DEDUPED -> no broadcast -> stuck with build-time data
T=17  New C starts -> same -> stuck

New A ends up with fresh icons/schemas, but New B and New C are stuck until the
next scheduled cron run (which could be hours away).
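
For reference, the dedupe behaviour described above comes from Oban's uniqueness option; a minimal sketch using the quoted period (the perform body here is a placeholder, not the real worker):

```elixir
defmodule Lightning.AdaptorRefreshWorker do
  # Any job inserted within 3600s of a matching one is discarded by
  # Oban's unique check, which is why later nodes' startup jobs never
  # run and never trigger a broadcast.
  use Oban.Worker, unique: [period: 3600]

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # fetch registry / schemas / icons, then broadcast to peer nodes
    :ok
  end
end
```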

Icon path is different in releases

Plug.Static serves from :lightning's release priv dir (e.g.
/app/lib/lightning-x.y.z/priv/static/), but adaptor_icons_path in prod is
"priv/static/images/adaptors" -- a CWD-relative path that resolves to
/app/priv/static/images/adaptors/. These are different directories. Icons
written at runtime go to the cwd relative path but Phoenix serves from the
release priv dir, so refreshed icons get written to disk but are never actually
served to users.

The build-time mix task works because it runs before mix release, so the files
end up inside the release artifact. At runtime, adaptor_icons_path would need
to resolve via :code.priv_dir(:lightning) instead.
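
For illustration, resolving the icons directory against the release's priv dir instead of the CWD might look like the following (how this gets wired into config is an assumption):

```elixir
# Resolve relative to the release's priv dir rather than the CWD.
# Note :code.priv_dir/1 returns a charlist (or {:error, :bad_name});
# Path.join/2 accepts chardata and returns a binary.
icons_dir =
  :lightning
  |> :code.priv_dir()
  |> Path.join("static/images/adaptors")

# In a release this yields something like
# /app/lib/lightning-x.y.z/priv/static/images/adaptors,
# i.e. the same tree Plug.Static actually serves from.
```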

Also bear in mind that if we put files inside priv/static without
fingerprinting and updating the manifest file, we lose cache control (i.e.
cache-busting filenames) and gzip encoding (not that big a deal, I guess,
with compressed images). So Plug.Static DOES do etags and 304s, but if you
have fingerprinted paths then you can tell the browser to cache forever
(1 year) and you never get another request from the same client. Not that
big of a deal with adaptor icons, though.

Ephemeral storage and the "go fetch yourself" pattern

On ephemeral containers, all runtime files are lost on restart. Each node
independently fetches from NPM, GitHub, and jsDelivr after receiving a broadcast
-- the broadcast says "go refresh yourself" rather than "data is available."
With 2-3 nodes this is ok, but it means each deploy triggers N independent
fetches from external services, and the data is immediately lost when that
container cycles.

A possible alternative direction

When I was working on this, my thinking was to invert this for
clustered/ephemeral setups, something like:

  • PubSub signals "invalidate your cache" rather than "go fetch your own
    copy."
  • DB as single source of truth for registry data and schemas (they're just
    JSON). One Oban worker fetches and writes to Postgres, all nodes read from DB
    with an ETS cache. No stale window on startup -- new nodes read from DB
    immediately. (My old unfinished branch used the filesystem as well, but I
    think it may be time to use the db)
  • Lazy caching proxy for icons -- serve through a dynamic route (not
    Plug.Static) that fetches from github or wherever on first request with a
    TTL. This sidesteps the path divergence and ephemeral storage issues entirely,
    and avoids downloading the full adaptors tarball for icons that may never be
    requested. Or a mix of both.
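
The DB-plus-ETS direction above could be sketched like this. The module name Lightning.AdaptorData.Cache is the only thing taken from later in this thread; the table name and function shapes are assumptions:

```elixir
defmodule Lightning.AdaptorData.Cache do
  # Illustrative read-through cache: ETS first, Postgres on a miss.

  @table :adaptor_data_cache

  # Called once at startup (e.g. from the supervision tree).
  def create_table do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  def get(kind, key) do
    case :ets.lookup(@table, {kind, key}) do
      [{_entry_key, value}] ->
        value

      [] ->
        # Miss: read the canonical copy from Postgres and backfill ETS,
        # so a freshly started node serves current data on first read.
        case Lightning.AdaptorData.get(kind, key) do
          nil ->
            nil

          value ->
            :ets.insert(@table, {{kind, key}, value})
            value
        end
    end
  end

  # "Invalidate your cache" rather than "go fetch your own copy":
  # drop the entries and let the next read repopulate from the DB.
  def invalidate(kind), do: :ets.match_delete(@table, {{kind, :_}, :_})
end
```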

This kind of stuff needs to be tested in an environment much closer to
production -- the priv directory thing is super easy to miss in dev mode.
It's not the first time I've wanted something like this, and being able to
build a proper Elixir release image (or no image) easily and repeatably
would be super valuable here.

Happy to chat about any of this -- I know this is tricky territory and if we
don't get it right it's gonna be a real pain, and equally awesome if we do.
Really glad to see this moving forward!

@github-project-automation github-project-automation bot moved this from New Issues to In review in Core Mar 5, 2026
@taylordowns2000
Member Author

Awesome feedback, @stuartc. I think I've understood and implemented it, and my basic click-testing seems to work. Claude's description of the changes is below:

Overview

Replace the current filesystem + GenServer-state storage for adaptor registry, credential schemas, and adaptor icons with Postgres as single source of truth and ETS as a read-through cache. This eliminates three production issues:

  1. rolling-restart stale data,
  2. icon path divergence in releases, and
  3. ephemeral storage data loss.

Files Created (9 new)

┌──────────────────────────────────────────────────────────────────────┬─────────────────────────────────────┐
│                                 File                                 │               Purpose               │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ priv/repo/migrations/20260308204728_create_adaptor_cache_entries.exs │ Migration for adaptor_cache_entries │
│                                                                      │  table                              │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ lib/lightning/adaptor_data/cache_entry.ex                            │ Ecto schema                         │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ lib/lightning/adaptor_data.ex                                        │ DB context (put, get, put_many,     │
│                                                                      │ delete)                             │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ lib/lightning/adaptor_data/cache.ex                                  │ ETS read-through cache              │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ lib/lightning/adaptor_data/listener.ex                               │ PubSub invalidation listener        │
│                                                                      │ GenServer                           │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ lib/lightning_web/controllers/adaptor_icon_controller.ex             │ Lazy icon proxy (serves PNGs +      │
│                                                                      │ manifest)                           │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│                                                                      │ Endpoint plug to intercept          │
│ lib/lightning_web/plugs/adaptor_icons.ex                             │ /images/adaptors/* before           │
│                                                                      │ Plug.Static                         │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ test/lightning/adaptor_data_test.exs                                 │ Context CRUD tests                  │
├──────────────────────────────────────────────────────────────────────┼─────────────────────────────────────┤
│ test/lightning/adaptor_data/cache_test.exs + listener_test.exs       │ Cache + listener tests              │
└──────────────────────────────────────────────────────────────────────┴─────────────────────────────────────┘

Files Modified (key changes)

  ┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
  │                                File                                 │                Change                │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │                                                                     │ Reads from ETS/DB cache instead of   │
  │ lib/lightning/adaptor_registry.ex                                   │ GenServer state; removed PubSub      │
  │                                                                     │ subscription to old topic;           │
  │                                                                     │ simplified startup                   │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │                                                                     │ Writes to DB, broadcasts             │
  │ lib/lightning/adaptor_refresh_worker.ex                             │ {:invalidate_cache, kinds},          │
  │                                                                     │ uniqueness 3600→60s, no icon refresh │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning/credential_schemas.ex                                 │ Added fetch_and_store/0 for DB       │
  │                                                                     │ writes                               │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │                                                                     │ Rewritten: manifest from registry +  │
  │ lib/lightning/adaptor_icons.ex                                      │ lazy icon fetch from GitHub (no      │
  │                                                                     │ tarball)                             │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning/credentials.ex                                        │ get_schema/1 reads from cache with   │
  │                                                                     │ filesystem fallback                  │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning_web/live/credential_live/credential_form_component.ex │ get_type_options reads from cache    │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning_web/live/maintenance_live/index.ex                    │ Uses new DB-backed refresh functions │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning_web/endpoint.ex                                       │ Added AdaptorIcons plug before       │
  │                                                                     │ Plug.Static                          │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ lib/lightning/application.ex                                        │ ETS table init + Listener GenServer  │
  │                                                                     │ in supervision tree                  │
  ├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
  │ test/support/{conn_case,channel_case,data_case}.ex                  │ Sandbox allow for AdaptorRegistry +  │
  │                                                                     │ registry seeding                     │
  └─────────────────────────────────────────────────────────────────────┴──────────────────────────────────────┘

Architecture

  External sources (NPM, GitHub, jsDelivr)
          ↓ (Oban worker fetches once)
      PostgreSQL (adaptor_cache_entries)
          ↓ (read-through)
      ETS cache (Lightning.AdaptorData.Cache)
          ↓ (PubSub invalidation across cluster)
      All nodes read from ETS → DB fallback
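
The invalidation hop in the diagram could be wired up roughly as follows. The {:invalidate_cache, kinds} message shape and the Listener/Cache module names come from the tables above; the topic name and everything else is an assumption:

```elixir
defmodule Lightning.AdaptorData.Listener do
  use GenServer
  # Hypothetical sketch of the PubSub invalidation listener.

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(state) do
    Phoenix.PubSub.subscribe(Lightning.PubSub, "adaptor_data")
    {:ok, state}
  end

  # One Oban worker writes to Postgres, then broadcasts; every node
  # clears its ETS entries so the next read falls through to the DB.
  @impl true
  def handle_info({:invalidate_cache, kinds}, state) do
    Enum.each(kinds, &Lightning.AdaptorData.Cache.invalidate/1)
    {:noreply, state}
  end
end
```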

@taylordowns2000 taylordowns2000 requested a review from stuartc March 8, 2026 21:33

Projects

Status: In review

3 participants