fix(server): auto-recover session after Cassandra restart by dpol1 · Pull Request #2997 · apache/hugegraph

dpol1 · 2026-04-18T15:38:23Z

Purpose of the PR

HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.

Root cause: CassandraSessionPool builds the Datastax Cluster without a
ReconnectionPolicy, CassandraSession.execute(...) calls the driver once
with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException /
OperationTimedOutException errors surface to the user and the pool stays
dead even after Cassandra comes back online.

Main Changes

- Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the
  Cluster builder so the Datastax driver keeps retrying downed nodes in
  the background.
- Wrap every Session.execute(...) in executeWithRetry(Statement) with
  exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded() / reset() on CassandraSession so the
  pool reopens closed sessions and issues a lightweight health-check
  (SELECT now() FROM system.local) before subsequent queries.
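
The per-query retry described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: `TransientFailure` and `RetrySketch` are hypothetical names, the driver's `NoHostAvailableException` / `OperationTimedOutException` are replaced by a stand-in exception so the snippet runs without the Datastax dependency, and the doubling of the interval on each attempt is an assumption about what "exponential backoff" means here.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the driver's transient connectivity errors.
class TransientFailure extends RuntimeException {
    TransientFailure(String message) {
        super(message);
    }
}

final class RetrySketch {

    // Delay before retry number `attempt` (0-based): intervalMs * 2^attempt.
    // Assumption: the interval doubles on every retry.
    static long backoffDelay(long intervalMs, int attempt) {
        return intervalMs << attempt;
    }

    // Runs `query`; on TransientFailure it sleeps and retries, up to
    // `maxAttempts` retries. maxAttempts == 0 means a single try.
    static <T> T executeWithRetry(Supplier<T> query, int maxAttempts,
                                  long intervalMs) {
        int attempt = 0;
        while (true) {
            try {
                return query.get();
            } catch (TransientFailure e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, surface the error
                }
                sleep(backoffDelay(intervalMs, attempt));
                attempt++;
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during backoff", e);
        }
    }
}
```

Under these assumptions, 3 retries with a 1000 ms interval would wait 1000, 2000 and 4000 ms before the last error is rethrown.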

Add four tunables in CassandraOptions (defaults preserve previous
behavior for healthy clusters):

Option	Default	Meaning
`cassandra.reconnect_base_delay`	`1000` ms	Initial backoff for driver reconnection policy
`cassandra.reconnect_max_delay`	`10000` ms	Cap for reconnection backoff
`cassandra.query_retry_max_attempts`	`3`	Per-query retries on transient errors (`0` disables)
`cassandra.query_retry_interval`	`1000` ms	Base interval for per-query exponential backoff

Add unit tests covering defaults, overrides, disabling retries and option keys.
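
To make the reconnection defaults concrete, the sketch below lists the delays a doubling-with-cap schedule would produce for `cassandra.reconnect_base_delay` = 1000 ms and `cassandra.reconnect_max_delay` = 10000 ms. It assumes the driver policy starts at the base delay and doubles up to the cap; `BackoffSchedule` is a hypothetical helper written for illustration, not part of the PR or the driver.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffSchedule {

    // Returns the first `count` delays: base, 2*base, 4*base, ...
    // with every value capped at maxMs.
    static List<Long> delays(long baseMs, long maxMs, int count) {
        List<Long> result = new ArrayList<>();
        long delay = baseMs;
        for (int i = 0; i < count; i++) {
            result.add(Math.min(delay, maxMs));
            // Stop doubling once the cap is passed to avoid overflow.
            if (delay < maxMs) {
                delay *= 2;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults from the options table: base 1000 ms, cap 10000 ms.
        System.out.println(delays(1000L, 10000L, 6));
    }
}
```

With the defaults, the first six delays are 1000, 2000, 4000, 8000, 10000 and 10000 ms, so a recovered node should be retried at least every 10 seconds.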

Verifying these changes

Need tests and can be verified as follows:
- mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest — 13/13 pass

Does this PR potentially affect the following parts?

Modify configurations

Documentation Status

Doc - TODO

- Register ExponentialReconnectionPolicy on the Cluster builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed sessions and issues a lightweight health-check (SELECT now() FROM system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay, cassandra.reconnect_max_delay, cassandra.reconnect_max_retries, cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and option keys.

Fixes apache#2740

imbajin · 2026-04-18T19:47:32Z

⚠️ commitAsync() bypasses retry — still calls this.session.executeAsync(s) directly

The PR wraps execute() and commit() with executeWithRetry, but commitAsync() (line 177 in the base file) still calls this.session.executeAsync(s) directly. If a Cassandra restart happens during an async batch commit, the same connectivity failure will surface without any retry.

Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).

dpol1 · 2026-04-20T14:38:04Z

Thanks @imbajin for the feedback, changed!

- Reset driver session after each transient failure in executeWithRetry() so retries reopen cleanly via lazy open()
- Remove redundant finally block in reconnectIfNeeded(); null session directly on DriverException
- Store retryBaseDelay as field, reuse in open() (removes double-read)
- One-time LOG.warn via AtomicBoolean for commitAsync() retry gap
- Tighten defaults: max_delay 60s→10s, max_retries 10→3, interval 5s→1s
- Wire retry config via HugeConfig in tests; add cross-validator tests
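
The "one-time LOG.warn via AtomicBoolean" item can be illustrated with a small sketch. It is not the PR's code: `OneTimeWarning` is a hypothetical class, and a `Consumer<String>` stands in for the logger so the snippet is self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class OneTimeWarning {

    private final AtomicBoolean warned = new AtomicBoolean(false);
    private final Consumer<String> sink; // stand-in for LOG::warn

    OneTimeWarning(Consumer<String> sink) {
        this.sink = sink;
    }

    // Emits the message only on the first call, even when several
    // threads race; returns true for the caller that emitted it.
    boolean warnOnce(String message) {
        if (this.warned.compareAndSet(false, true)) {
            this.sink.accept(message);
            return true;
        }
        return false;
    }
}
```

`compareAndSet(false, true)` succeeds for exactly one caller, so frequent async commits would not flood the log with the same notice.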

Copilot

Pull request overview

This PR addresses HugeGraphServer becoming unresponsive after a Cassandra restart by adding automatic session recovery and configurable reconnection/retry behavior in the Cassandra backend.

Changes:

- Add new Cassandra reconnect/retry configuration options and validate their relationships.
- Configure the Datastax driver with an exponential reconnection policy and add per-query retry logic with backoff.
- Add unit tests covering option defaults/overrides and basic retry behavior.
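
As a rough illustration of what validating the relationships between these options could look like, the sketch below checks four values together. The concrete rules (positive delays, max delay not below base delay, non-negative attempts) are assumptions made for this example, and `RetryOptionsCheck` is a hypothetical helper rather than the validator in CassandraOptions.

```java
final class RetryOptionsCheck {

    // Throws IllegalArgumentException when the values are inconsistent.
    // The rules below are assumed for illustration only.
    static void validate(long baseDelayMs, long maxDelayMs,
                         int maxAttempts, long retryIntervalMs) {
        require(baseDelayMs > 0,
                "reconnect_base_delay must be > 0");
        require(maxDelayMs >= baseDelayMs,
                "reconnect_max_delay must be >= reconnect_base_delay");
        require(maxAttempts >= 0,
                "query_retry_max_attempts must be >= 0 (0 disables)");
        require(retryIntervalMs > 0,
                "query_retry_interval must be > 0");
    }

    private static void require(boolean ok, String message) {
        if (!ok) {
            throw new IllegalArgumentException(message);
        }
    }
}
```

Checking the pair of delays together catches a configuration such as a 5000 ms base with a 1000 ms cap, which a per-option range check would accept.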

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java	Adds driver reconnection policy, per-query retry wrapper, and session liveness probing/reset logic
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraOptions.java	Introduces new reconnect/retry tunables with defaults and validators
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/unit/cassandra/CassandraTest.java	Adds tests for reconnect option behavior and execute-with-retry flow

dpol1 · 2026-04-25T17:47:08Z

Going through the feedback: NPE in commitAsync() when session is unavailable, TODO pointing to wrong type (CompletableFuture → Datastax ResultSetFuture), reset() not actually called on health-check failure, catch too broad, reset() on every transient failure, idempotency check missing for OperationTimedOutException retries. Unit tests for all of it. Pushing soon.
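
The idempotency point can be sketched as a small decision function. This is an illustration under stated assumptions, not the code being pushed: `RetryDecision` and its `Failure` enum are hypothetical, and the rule that a "no host available" error is always safe to retry assumes the request never reached a coordinator.

```java
final class RetryDecision {

    // Hypothetical classification of the driver errors named in the PR.
    enum Failure { NO_HOST_AVAILABLE, OPERATION_TIMED_OUT, OTHER }

    // A timed-out statement may already have been applied on the
    // server, so it is retried only when it is idempotent.
    static boolean shouldRetry(Failure failure, boolean idempotent) {
        switch (failure) {
            case NO_HOST_AVAILABLE:
                return true;       // assumed: request was never sent
            case OPERATION_TIMED_OUT:
                return idempotent; // outcome unknown, replay only if safe
            default:
                return false;      // not a transient connectivity error
        }
    }
}
```

The distinction matters for writes such as counter updates or list appends, where replaying a statement that actually succeeded would apply it twice.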

Quick question before I do: worth renaming cassandra.reconnect_max_retries / cassandra.reconnect_interval to cassandra.query_retry_max_attempts / cassandra.query_retry_interval? The current names bleed into ExponentialReconnectionPolicy semantics. Keys are new in this PR so cost is low — but happy to leave it if you want to keep the diff minimal.

imbajin · 2026-04-25T18:30:50Z

Quick question before I do: worth renaming cassandra.reconnect_max_retries / cassandra.reconnect_interval to cassandra.query_retry_max_attempts / cassandra.query_retry_interval? The current names bleed into ExponentialReconnectionPolicy semantics. Keys are new in this PR so cost is low — but happy to leave it if you want to keep the diff minimal.

+1 on the rename. query_retry_* is much clearer — avoids confusion with the driver-level ReconnectionPolicy. Since these keys are new in this PR, no compatibility cost. Go for it.

imbajin

LGTM, THX

BTW, C*/*SQL will not be maintained after 1.5.0; the community is focusing on RocksDB (standalone & cluster) instead.

dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files), bug (Something isn't working), and store (Store module) labels Apr 18, 2026

github-project-automation Bot added this to HugeGraph PD-Store Tasks Apr 18, 2026

github-project-automation Bot moved this to In progress in HugeGraph PD-Store Tasks Apr 18, 2026

dpol1 force-pushed the fix/2740-cassandra-reconnect branch from 97de8e9 to fc3d291 Compare April 18, 2026 17:37

imbajin reviewed Apr 18, 2026

Comment thread ...ssandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java

imbajin reviewed Apr 18, 2026

Comment thread ...ssandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java

imbajin reviewed Apr 18, 2026

Comment thread ...ssandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java

fix: Address reviewer feedback

5ac3990

dpol1 requested a review from imbajin April 20, 2026 14:49