
fix: cache version probe result to prevent reconciliation deadlock #175

Merged: GrigoryPervakov merged 1 commit into ClickHouse:main from ashishch432:version-probe-revision on Apr 28, 2026


ashishch432 (Contributor) commented Apr 26, 2026

Why

When a completed version-probe Job's Pod is garbage-collected before the operator reads the termination message, readVersionFromJob returns a hard error on every reconcile. Since the Job is complete (not recreated) and the Pod is gone (can't be read), the reconciliation loop blocks indefinitely.
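The stuck state described above can be sketched as a small decision function. This is only an illustration of the failure mode, not the operator's code: `reconcileProbe`, `jobState`, and `errNoPod` are hypothetical names, and the real `readVersionFromJob` is assumed to behave like the `jobComplete` branch.

```go
package main

import (
	"errors"
	"fmt"
)

// jobState is a hypothetical simplification of the probe Job's lifecycle.
type jobState int

const (
	jobMissing jobState = iota
	jobRunning
	jobComplete
)

// errNoPod is a hypothetical error for a completed Job whose Pod is gone.
var errNoPod = errors.New("version probe pod not found")

// reconcileProbe models one reconcile pass before the fix (assumed flow).
// A complete Job is never recreated, so a missing Pod yields the same
// error on every pass and nothing in the state ever changes.
func reconcileProbe(job jobState, podExists bool) (string, error) {
	switch job {
	case jobMissing:
		return "create job", nil
	case jobRunning:
		return "requeue", nil
	default: // jobComplete
		if !podExists {
			return "", errNoPod
		}
		return "read version", nil
	}
}

func main() {
	// Job complete, Pod garbage-collected: every pass fails identically.
	for i := 1; i <= 3; i++ {
		_, err := reconcileProbe(jobComplete, false)
		fmt.Printf("reconcile %d: %v\n", i, err)
	}
}
```

Because neither input changes between passes, retrying alone cannot make progress, which is why the loop blocks until someone intervenes.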

What

  • Cache the detected version and image hash (VersionProbeRevision) in CR Status. On subsequent reconciles, skip the probe entirely when the image hasn't changed.
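A minimal sketch of the caching idea, under stated assumptions: `ProbeStatus`, `probeRevision`, and `resolveVersion` are hypothetical names, the key is assumed to be a hash of the image string and pull policy (as discussed later in this thread), and the version string is a made-up value. The actual field layout in the operator may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ProbeStatus is a hypothetical stand-in for the fields cached in CR Status.
type ProbeStatus struct {
	Version              string // version detected by the last successful probe
	VersionProbeRevision string // hash of the inputs that probe ran against
}

// probeRevision hashes the probe inputs (assumed: image string + pull policy).
func probeRevision(image, pullPolicy string) string {
	sum := sha256.Sum256([]byte(image + "\x00" + pullPolicy))
	return hex.EncodeToString(sum[:8])
}

// resolveVersion returns the cached version when the revision matches and
// only falls back to probe (the Job-based path) when the inputs changed.
func resolveVersion(st *ProbeStatus, image, pullPolicy string, probe func() (string, error)) (string, error) {
	rev := probeRevision(image, pullPolicy)
	if st.Version != "" && st.VersionProbeRevision == rev {
		return st.Version, nil // cache hit: the Job and its Pod are never consulted
	}
	v, err := probe()
	if err != nil {
		return "", err
	}
	st.Version, st.VersionProbeRevision = v, rev
	return v, nil
}

func main() {
	calls := 0
	probe := func() (string, error) { calls++; return "25.3.1", nil } // made-up version

	st := &ProbeStatus{}
	for i := 0; i < 3; i++ {
		v, _ := resolveVersion(st, "clickhouse/clickhouse-server:25.3", "IfNotPresent", probe)
		fmt.Println(v)
	}
	fmt.Println("probe calls:", calls) // the probe runs once; later passes hit the cache
}
```

Once a version has been stored, a garbage-collected probe Pod no longer matters, because the cached path never touches the Job.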

Related Issues

Fixes #170

ashishch432 force-pushed the version-probe-revision branch from 5ffb3b1 to f057f92 on April 26, 2026 19:38
@GrigoryPervakov GrigoryPervakov self-assigned this Apr 27, 2026
ashishch432 (Contributor, Author) commented:

@GrigoryPervakov, two points; let me know your thoughts.

  1. For mutable image tags, the cached key (image string + pull policy) can go stale. Realistically, I would not expect this to be a problem for production use cases, but handling it would require some logic to invalidate the cache. The ideal cache key would be the image SHA, but obtaining it requires either an already-running Pod or brittle registry-querying logic.
  2. With respect to the original issue, caching should solve it in practice. The original scenario remains an edge case, though: the Job is complete but reading the version from it fails. I considered automatically re-creating the completed Job, but I'm not sure whether there are failure scenarios other than a missing Pod. We could leave this as is for now, i.e. manual remediation.
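To make the first point concrete, here is a toy illustration of how a key built from the image string and pull policy can go stale. Everything in it is hypothetical: `registry` is an in-memory map standing in for a real image registry, and the image names and digests are made up.

```go
package main

import "fmt"

// registry is a toy stand-in for an image registry: image reference -> digest.
type registry map[string]string

// cacheKey mirrors the key discussed above: image string plus pull policy.
func cacheKey(image, pullPolicy string) string {
	return image + "|" + pullPolicy
}

// staleHit reports a cache hit whose underlying image content has changed
// since the key was stored.
func staleHit(storedKey, storedDigest string, reg registry, image, pullPolicy string) bool {
	hit := cacheKey(image, pullPolicy) == storedKey
	return hit && reg[image] != storedDigest
}

func main() {
	const policy = "Always"
	reg := registry{
		"example/app:latest": "sha256:aaaa", // made-up digests
		"example/app:1.2.3":  "sha256:cccc",
	}
	latestKey, latestDigest := cacheKey("example/app:latest", policy), reg["example/app:latest"]
	pinnedKey, pinnedDigest := cacheKey("example/app:1.2.3", policy), reg["example/app:1.2.3"]

	// The mutable tag is re-pushed; the image string itself does not change.
	reg["example/app:latest"] = "sha256:bbbb"

	fmt.Println("latest stale:", staleHit(latestKey, latestDigest, reg, "example/app:latest", policy))
	fmt.Println("pinned stale:", staleHit(pinnedKey, pinnedDigest, reg, "example/app:1.2.3", policy))
}
```

The key for the `latest` tag still matches after the re-push, so the cached version would be reused even though the content changed. A full version tag that is never re-pushed does not have this problem.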

ashishch432 marked this pull request as ready for review on April 27, 2026 21:34
GrigoryPervakov merged commit cee6692 into ClickHouse:main on Apr 28, 2026
16 checks passed
GrigoryPervakov (Member) commented:

> 1. For mutable image tags, the cached key (image string + pull policy) can go stale. Realistically, I would not expect this to be a problem for production use cases, but handling it would require some logic to invalidate the cache. The ideal cache key would be the image SHA, but obtaining it requires either an already-running Pod or brittle registry-querying logic.

Let's keep it as it works now. For a production environment, it makes sense to use a full version tag.
I don't see a good way to use the image SHA without running a Pod, so I'd prefer to wait until there is demand for fixing it.

> 2. With respect to the original issue, caching should solve it in practice. The original scenario remains an edge case, though: the Job is complete but reading the version from it fails. I considered automatically re-creating the completed Job, but I'm not sure whether there are failure scenarios other than a missing Pod. We could leave this as is for now, i.e. manual remediation.

I thought about such scenarios and couldn't find any realistic case where a successful Job produces invalid output, and this is not a persistent issue.

Linked issue (may be closed by merging this pull request): Reconciliation blocked when version-probe pod is cleaned up before version is read