fix(server): auto-recover session after Cassandra restart #2997

Merged
imbajin merged 4 commits into apache:master from dpol1:fix/2740-cassandra-reconnect on Apr 25, 2026

Conversation

@dpol1
Contributor

@dpol1 dpol1 commented Apr 18, 2026

Purpose of the PR

closes #2740

HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.

Root cause: CassandraSessionPool builds the Datastax Cluster without a
ReconnectionPolicy, CassandraSession.execute(...) calls the driver once
with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException /
OperationTimedOutException errors surface to the user and the pool stays
dead even after Cassandra comes back online.

Main Changes

  • Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the
    Cluster builder so the Datastax driver keeps retrying downed nodes in
    the background.

  • Wrap every Session.execute(...) in executeWithRetry(Statement) with
    exponential backoff on transient connectivity failures.

  • Implement reconnectIfNeeded() / reset() on CassandraSession so the
    pool reopens closed sessions and issues a lightweight health-check
    (SELECT now() FROM system.local) before subsequent queries.

  • Add four tunables in CassandraOptions (defaults preserve previous
    behavior for healthy clusters):

    | Option | Default | Meaning |
    |---|---|---|
    | cassandra.reconnect_base_delay | 1000 ms | Initial backoff for driver reconnection policy |
    | cassandra.reconnect_max_delay | 10000 ms | Cap for reconnection backoff |
    | cassandra.query_retry_max_attempts | 3 | Per-query retries on transient errors (0 disables) |
    | cassandra.query_retry_interval | 1000 ms | Base interval for per-query exponential backoff |

  • Add unit tests covering defaults, overrides, disabling retries and option keys.
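
The per-query retry described above can be sketched roughly as follows. This is a minimal, driver-free illustration and not the code merged in this PR: `TransientFailure` is a hypothetical stand-in for driver errors such as NoHostAvailableException, and `RetrySketch`, its method signatures, and the uncapped doubling of the interval are assumptions made for the example.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a transient driver error (e.g. NoHostAvailableException). */
class TransientFailure extends RuntimeException {
    TransientFailure(String message) { super(message); }
}

class RetrySketch {

    /** Delay before retry number `retry` (0-based): intervalMs * 2^retry. */
    static long backoffDelay(long intervalMs, int retry) {
        // Bound the shift so a large retry count cannot overflow typical intervals.
        return intervalMs << Math.min(retry, 20);
    }

    /**
     * Runs the query and retries up to maxRetries times on TransientFailure.
     * maxRetries == 0 means a single attempt, i.e. retries are disabled.
     */
    static <T> T executeWithRetry(Supplier<T> query, int maxRetries, long intervalMs) {
        int retry = 0;
        while (true) {
            try {
                return query.get();
            } catch (TransientFailure e) {
                if (retry >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                sleep(backoffDelay(intervalMs, retry));
                retry++;
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds; a short interval keeps the demo fast.
        String result = executeWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new TransientFailure("no host available");
            }
            return "ok";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With the defaults from the table (3 retries, 1000 ms base interval), this sketch would wait 1000 ms, 2000 ms and 4000 ms before giving up; whether the real implementation caps or jitters those delays is not stated here.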

Verifying these changes

  • Need tests and can be verified as follows:
    • mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest — 13/13 pass

Does this PR potentially affect the following parts?

  • Modify configurations

Documentation Status

  • Doc - TODO

@dosubot dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.), bug (Something isn't working), and store (Store module) labels Apr 18, 2026
  - Register ExponentialReconnectionPolicy on the Cluster builder so the
    Datastax driver keeps retrying downed nodes in the background.
  - Wrap every Session.execute() in executeWithRetry() with exponential
    backoff on transient connectivity failures.
  - Implement reconnectIfNeeded()/reset() so the pool reopens closed
    sessions and issues a lightweight health-check (SELECT now() FROM
    system.local) before subsequent queries.
  - Add tunable options: cassandra.reconnect_base_delay,
    cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
    cassandra.reconnect_interval.
  - Add unit tests covering defaults, overrides, disabling retries and
    option keys.

  Fixes apache#2740
@dpol1 dpol1 force-pushed the fix/2740-cassandra-reconnect branch from 97de8e9 to fc3d291 Compare April 18, 2026 17:37
@imbajin
Member

imbajin commented Apr 18, 2026

⚠️ commitAsync() bypasses retry — still calls this.session.executeAsync(s) directly

The PR wraps execute() and commit() with executeWithRetry, but commitAsync() (line 177 in the base file) still calls this.session.executeAsync(s) directly. If a Cassandra restart happens during an async batch commit, the same connectivity failure will surface without any retry.

Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).

@dpol1
Contributor Author

dpol1 commented Apr 20, 2026

Thanks @imbajin for the feedback, changed!

@dpol1 dpol1 requested a review from imbajin April 20, 2026 14:49
- Reset driver session after each transient failure in executeWithRetry()
  so retries reopen cleanly via lazy open()
- Remove redundant finally block in reconnectIfNeeded(); null session
  directly on DriverException
- Store retryBaseDelay as field, reuse in open() (removes double-read)
- One-time LOG.warn via AtomicBoolean for commitAsync() retry gap
- Tighten defaults: max_delay 60s→10s, max_retries 10→3, interval 5s→1s
- Wire retry config via HugeConfig in tests; add cross-validator tests
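
The "one-time LOG.warn via AtomicBoolean" item above refers to a common pattern that can be sketched as below. The class name `OneTimeWarning` and the list used in place of a logger are hypothetical; the actual field and log message in CassandraSessionPool are not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class OneTimeWarning {

    // Flips from false to true exactly once, even under concurrent callers.
    private final AtomicBoolean warned = new AtomicBoolean(false);
    // Stand-in for a logger so the sketch stays dependency-free.
    private final List<String> sink;

    OneTimeWarning(List<String> sink) {
        this.sink = sink;
    }

    /** Emits the message on the first call only; returns whether it was emitted. */
    boolean warnOnce(String message) {
        if (this.warned.compareAndSet(false, true)) {
            this.sink.add(message);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        OneTimeWarning warning = new OneTimeWarning(log);
        warning.warnOnce("async commits are not retried");
        warning.warnOnce("async commits are not retried");
        System.out.println("messages logged: " + log.size());
    }
}
```

`compareAndSet(false, true)` succeeds for only one caller, so the warning is not repeated on every async commit.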
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses HugeGraphServer becoming unresponsive after a Cassandra restart by adding automatic session recovery and configurable reconnection/retry behavior in the Cassandra backend.

Changes:

  • Add new Cassandra reconnect/retry configuration options and validate their relationships.
  • Configure the Datastax driver with an exponential reconnection policy and add per-query retry logic with backoff.
  • Add unit tests covering option defaults/overrides and basic retry behavior.
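
As a rough illustration of the option validation and the exponential reconnection schedule summarized above, consider the sketch below. The summary does not say which relationships are validated, so the checks here (base delay positive and not larger than max delay) are an assumption, as are the class name `ReconnectOptionsSketch` and the doubling-with-cap formula.

```java
class ReconnectOptionsSketch {

    final long baseDelayMs;
    final long maxDelayMs;

    ReconnectOptionsSketch(long baseDelayMs, long maxDelayMs) {
        // Assumed cross-validation: the cap must not be below the initial delay.
        if (baseDelayMs <= 0) {
            throw new IllegalArgumentException("base delay must be positive");
        }
        if (maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("max delay must be >= base delay");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay for the given 0-based attempt: base * 2^attempt, capped at max. */
    long delayForAttempt(int attempt) {
        long delay = this.baseDelayMs;
        for (int i = 0; i < attempt; i++) {
            if (delay > this.maxDelayMs / 2) {
                return this.maxDelayMs; // doubling would exceed the cap
            }
            delay *= 2;
        }
        return delay;
    }

    public static void main(String[] args) {
        // Defaults listed in the PR description: 1000 ms base, 10000 ms cap.
        ReconnectOptionsSketch options = new ReconnectOptionsSketch(1000L, 10000L);
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + ": "
                               + options.delayForAttempt(attempt) + " ms");
        }
    }
}
```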

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java | Adds driver reconnection policy, per-query retry wrapper, and session liveness probing/reset logic |
| hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraOptions.java | Introduces new reconnect/retry tunables with defaults and validators |
| hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/unit/cassandra/CassandraTest.java | Adds tests for reconnect option behavior and execute-with-retry flow |

@dpol1
Contributor Author

dpol1 commented Apr 25, 2026

Going through the feedback: NPE in commitAsync() when session is unavailable, TODO pointing to wrong type (CompletableFuture → Datastax ResultSetFuture), reset() not actually called on health-check failure, catch too broad, reset() on every transient failure, idempotency check missing for OperationTimedOutException retries. Unit tests for all of it. Pushing soon.
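
To illustrate the idempotency point: a timed-out statement may already have been applied on the server, so retrying it is only safe when the statement is idempotent. A minimal sketch of such a decision follows; the enum, class and method names are hypothetical, and treating a no-host failure as always retryable is an assumption rather than what the final patch does.

```java
class RetryDecisionSketch {

    /** Hypothetical classification of a failed query. */
    enum Failure { NO_HOST_AVAILABLE, OPERATION_TIMED_OUT, OTHER }

    /**
     * Decides whether a failed statement may be re-executed.
     * idempotent: whether running the statement twice has the same effect as once.
     */
    static boolean shouldRetry(Failure failure, boolean idempotent) {
        switch (failure) {
            case NO_HOST_AVAILABLE:
                // Assumed: no node accepted the request, so nothing was applied.
                return true;
            case OPERATION_TIMED_OUT:
                // The request may have been applied; retry only if harmless.
                return idempotent;
            default:
                // Non-transient errors are not retried.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(Failure.OPERATION_TIMED_OUT, false));
        System.out.println(shouldRetry(Failure.OPERATION_TIMED_OUT, true));
    }
}
```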

Quick question before I do: worth renaming cassandra.reconnect_max_retries / cassandra.reconnect_interval to cassandra.query_retry_max_attempts / cassandra.query_retry_interval? The current names bleed into ExponentialReconnectionPolicy semantics. Keys are new in this PR so cost is low — but happy to leave it if you want to keep the diff minimal.

@imbajin
Member

imbajin commented Apr 25, 2026

> Quick question before I do: worth renaming cassandra.reconnect_max_retries / cassandra.reconnect_interval to cassandra.query_retry_max_attempts / cassandra.query_retry_interval? The current names bleed into ExponentialReconnectionPolicy semantics. Keys are new in this PR so cost is low — but happy to leave it if you want to keep the diff minimal.

+1 on the rename. query_retry_* is much clearer — avoids confusion with the driver-level ReconnectionPolicy. Since these keys are new in this PR, no compatibility cost. Go for it.

@dosubot dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) label and removed the size:L (This PR changes 100-499 lines, ignoring generated files.) label Apr 25, 2026
@dpol1 dpol1 requested a review from imbajin April 25, 2026 18:40
Member

@imbajin imbajin left a comment

LGTM, THX

BTW, C*/*SQL will not be maintained after 1.5.0; the community is focusing on RocksDB (standalone & cluster) instead

@github-project-automation github-project-automation Bot moved this from In progress to In review in HugeGraph PD-Store Tasks Apr 25, 2026
@dosubot dosubot Bot added the lgtm (This PR has been approved by a maintainer) label Apr 25, 2026
@imbajin imbajin changed the title from "fix(cassandra): auto-recover session after Cassandra restart" to "fix(server): auto-recover session after Cassandra restart" Apr 25, 2026
@imbajin imbajin merged commit 836b348 into apache:master Apr 25, 2026
13 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in HugeGraph PD-Store Tasks Apr 25, 2026
@dpol1 dpol1 deleted the fix/2740-cassandra-reconnect branch April 25, 2026 20:16
Labels

  • bug: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • size:XL: This PR changes 500-999 lines, ignoring generated files.
  • store: Store module

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug] Hugegraph isn't responding after Cassandra restarted.

3 participants