Allow on-the-fly updates to adaptors, schemas, and icons #4473
taylordowns2000 wants to merge 15 commits into main from
Conversation
Codecov Report

❌ Patch coverage is

@@           Coverage Diff           @@
##             main    #4473      +/-   ##
==========================================
- Coverage   89.47%   89.07%    -0.41%
==========================================
  Files         425      430        +5
  Lines       20212    20388      +176
==========================================
+ Hits        18085    18160       +75
- Misses       2127     2228      +101

☔ View full report in Codecov by Sentry.
When I try to run this locally I get:

Main is fine. I haven't looked closely at this, but Claude says:

Not sure if this is a coincidence or a problem on the branch?
Unfortunately that's just what happens when you start the app, @josephjclark! I reported it first on September 10th, 2025. @stuartc reported it again more recently on January 21st.
josephjclark
left a comment
Tested this locally and it seems to work! The same mix commands still work as on main (so we don't need to change anything in deployments), and the buttons on the admin page work great.
I get Oban errors on this branch (but not others). Claude made some suggestions which made it go away. I honestly have no idea - an adult should take a look at it.
I'm a little discomforted that the logic of the schema install (and presumably others) has changed so much. Error handling looks different. I'm sure it's fine!
stuartc
left a comment
Hey @taylordowns2000, thanks for tackling this -- you know it's something I've
wanted to address for a long time. I have a few architectural concerns specific
to clustered/ephemeral deployments.
Rolling restarts: late nodes miss the refresh
The Oban worker uses unique: [period: 3600], which means only the first node's
startup job gets accepted during a rolling deploy. The job can run on any node
-- including one that's about to be terminated. The PubSub broadcast only
reaches nodes that are alive and subscribed at that moment. Nodes that start
later get their Oban job deduped and never receive a broadcast, so they're stuck
with build-time data until the next cron run.
Rough timeline for a 3-node rolling deploy:
T=0 Old A, Old B, Old C running
T=1 New A starts -> Oban job accepted
T=3 Job runs on Old B (about to die)
T=8 Job finishes -> broadcasts -> New A receives it
T=10 Old B terminates (refreshed data lost)
T=12 New B starts -> Oban job DEDUPED -> no broadcast -> stuck with build-time data
T=17 New C starts -> same -> stuck
New A ends up with fresh icons/schemas, but New B and New C are stuck until the
next scheduled cron run (which could be hours away).
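To make the dedup mechanics concrete, here is a minimal sketch of the pattern being described. The module and queue names are hypothetical, not taken from this PR; only the `unique: [period: 3600]` option is from the discussion above.

```elixir
# Hypothetical refresh worker illustrating the dedup window.
defmodule Lightning.AdaptorRegistry.RefreshWorker do
  # `unique: [period: 3600]` tells Oban to reject any job inserted
  # within an hour of a matching one. During a rolling deploy, only
  # the FIRST new node's startup insert is accepted; later nodes'
  # inserts are deduped, so they never trigger (or receive) a refresh.
  use Oban.Worker, queue: :background, unique: [period: 3600]

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # fetch registry/schemas/icons, then broadcast --
    # but only to nodes alive and subscribed right now
    :ok
  end
end
```

One consequence of this design is that uniqueness is enforced cluster-wide in Postgres, while the PubSub broadcast is delivery-at-that-moment only -- which is exactly the mismatch the timeline above illustrates.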
Icon path is different in releases
Plug.Static serves from :lightning's release priv dir (e.g.
/app/lib/lightning-x.y.z/priv/static/), but adaptor_icons_path in prod is
"priv/static/images/adaptors" -- a CWD-relative path that resolves to
/app/priv/static/images/adaptors/. These are different directories. Icons
written at runtime go to the CWD-relative path, but Phoenix serves from the
release priv dir, so refreshed icons get written to disk but are never actually
served to users.
The build-time mix task works because it runs before mix release, so the files
end up inside the release artifact. At runtime, adaptor_icons_path would need
to resolve via :code.priv_dir(:lightning) instead.
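A sketch of that fix, assuming a `runtime.exs`-style config file; the `:adaptor_icons_path` config key is taken from the discussion above, but the exact shape is illustrative:

```elixir
# Resolve the icons directory from the release's priv dir at runtime,
# instead of the CWD-relative "priv/static/images/adaptors".
# :code.priv_dir/1 returns a charlist, hence the to_string/1.
icons_dir =
  :lightning
  |> :code.priv_dir()
  |> to_string()
  |> Path.join("static/images/adaptors")

config :lightning, :adaptor_icons_path, icons_dir
```

With this, runtime-written icons land in the same directory tree (`/app/lib/lightning-x.y.z/priv/static/`) that Plug.Static actually serves from in a release.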
Also bear in mind that if we put files inside priv/static without fingerprinting
and updating the manifest file, we lose cache control (i.e. cache-busting
filenames) and gzip encoding (not that big a deal, I guess, with compressed
images). So Plug.Static DOES do ETags and 304s, but if you have fingerprinted
paths then you can tell the browser to cache forever (1 year) and you never get
another request from the same client. Not that big of a deal with adaptor
icons though.
Ephemeral storage and the "go fetch yourself" pattern
On ephemeral containers, all runtime files are lost on restart. Each node
independently fetches from NPM, GitHub, and jsDelivr after receiving a broadcast
-- the broadcast says "go refresh yourself" rather than "data is available."
With 2-3 nodes this is ok, but it means each deploy triggers N independent
fetches from external services, and the data is immediately lost when that
container cycles.
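The two broadcast semantics being contrasted here can be sketched with Phoenix.PubSub; the topic and message atoms are illustrative, not from this PR:

```elixir
# Current pattern: every subscribed node independently re-fetches
# from NPM, GitHub, and jsDelivr when it hears this.
Phoenix.PubSub.broadcast(Lightning.PubSub, "adaptor_registry", :refresh)

# Inverted pattern: one worker writes fresh data to Postgres first,
# then nodes only drop their local cache and lazily re-read from the DB.
Phoenix.PubSub.broadcast(Lightning.PubSub, "adaptor_registry", :invalidate_cache)
```

The difference matters because the second message is safe to miss: a node that never hears it still reads correct data from the DB on its next cache miss.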
A possible alternative direction
When I was working on this, my thinking was to invert this for
clustered/ephemeral setups, something like:
- PubSub signals "invalidate your cache" rather than "go fetch your own copy."
- DB as single source of truth for registry data and schemas (they're just
  JSON). One Oban worker fetches and writes to Postgres; all nodes read from the
  DB with an ETS cache. No stale window on startup -- new nodes read from the DB
  immediately. (My old unfinished branch used the filesystem as well, but I
  think it may be time to use the db.)
- Lazy caching proxy for icons -- serve through a dynamic route (not
  Plug.Static) that fetches from GitHub or wherever on first request, with a
  TTL. This sidesteps the path divergence and ephemeral storage issues entirely,
  and avoids downloading the full adaptors tarball for icons that may never be
  requested.

Or a mix of both.
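The "DB as source of truth, ETS as read-through cache" idea above can be sketched in a few lines; the module and table names are made up for illustration, and a real loader would query Postgres:

```elixir
# Minimal read-through cache sketch over a named ETS table.
defmodule RegistryCache do
  @table :adaptor_registry_cache

  def start do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  # Read from ETS; on a miss, run the loader (e.g. a Postgres query)
  # and cache the result for subsequent reads.
  def fetch(key, loader) when is_function(loader, 0) do
    case :ets.lookup(@table, key) do
      [{^key, value}] ->
        value

      [] ->
        value = loader.()
        :ets.insert(@table, {key, value})
        value
    end
  end

  # The PubSub "invalidate" handler just clears the table;
  # the next read transparently reloads from the DB.
  def invalidate, do: :ets.delete_all_objects(@table)
end
```

Because the cache is lazily filled, a freshly started node has no stale window: its first read goes straight to the DB, and a missed invalidation broadcast only delays freshness until the next TTL or restart rather than pinning the node to build-time data.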
This kind of stuff needs to be tested in an environment much closer to
production -- the priv directory thing is super easy to miss in dev mode. It's
not the first time I've wanted something like this, but being able to build a
proper Elixir release image (or no image) easily and repeatably would be super
valuable here.
Happy to chat about any of this -- I know this is tricky territory and if we
don't get it right it's gonna be a real pain, and equally awesome if we do.
Really glad to see this moving forward!
Awesome feedback, @stuartc. I think I've understood and implemented it, and my basic click-testing seems to work. Claude's description of the changes below:

Overview

Replace the current filesystem + GenServer-state storage for adaptor registry, credential schemas, and adaptor icons with Postgres as single source of truth and ETS as a read-through cache. This eliminates three production issues:

Files Created (9 new)

Files Modified (key changes)

Architecture
Description
Enables on-the-fly updates to the adaptor registry, credential schemas, and adaptor icons without requiring an application restart. Superusers can trigger refreshes from a new Settings > Maintenance admin page, and a configurable Oban cron job (ADAPTOR_REFRESH_INTERVAL_HOURS) keeps them in sync automatically across clustered nodes.
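As a sketch, the configurable cron interval might be wired into Oban's cron plugin roughly like this. The worker module name and the fallback default are illustrative, not taken from this PR -- only the `ADAPTOR_REFRESH_INTERVAL_HOURS` variable name is from the description above:

```elixir
# runtime.exs-style sketch: read the interval from the environment
# and schedule the refresh worker every N hours.
interval_hours =
  "ADAPTOR_REFRESH_INTERVAL_HOURS"
  |> System.get_env("6")
  |> String.to_integer()

config :lightning, Oban,
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       {"0 */#{interval_hours} * * *", Lightning.AdaptorRegistry.RefreshWorker}
     ]}
  ]
```

Since Oban's cron plugin runs on the node holding leadership, this naturally gives one refresh per interval across the cluster rather than one per node.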
Closes #3114
Closes #2209
Closes #325 (wow what a golden oldie!)
Closes #1996
Changes
Validation steps
AI Usage
Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):
You can read more details in our
Responsible AI Policy
Pre-submission checklist
(e.g., /review with Claude Code)
(e.g., :owner, :admin, :editor, :viewer)