fix(server): auto-recover session after Cassandra restart #2997
imbajin merged 4 commits into apache:master
- Register ExponentialReconnectionPolicy on the Cluster builder so the
Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential
backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed
sessions and issues a lightweight health-check (SELECT now() FROM
system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay,
cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and
option keys.
Fixes apache#2740
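The per-query retry described in this commit message could look roughly like the sketch below. It is not the merged code: `RetrySketch`, `TransientException`, and the parameter names are hypothetical, and it assumes the delay doubles on each attempt starting from a base value and is capped at a maximum. The Datastax driver is deliberately left out so the example runs with the JDK alone.

```java
import java.util.function.Supplier;

/** Illustrative sketch only; not the code merged in this PR. */
final class RetrySketch {

    /** Hypothetical stand-in for the driver's transient connectivity errors. */
    static final class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /** Delay before retry number attempt (0-based): base * 2^attempt, capped at maxMs. */
    static long backoffDelay(long baseMs, long maxMs, int attempt) {
        long delay = baseMs;
        // Double step by step and stop at the cap, which also avoids overflow.
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    /** Runs the query, retrying up to maxRetries times on TransientException. */
    static <T> T executeWithRetry(Supplier<T> query, int maxRetries,
                                  long baseMs, long maxMs) {
        int attempt = 0;
        while (true) {
            try {
                return query.get();
            } catch (TransientException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted (or disabled when maxRetries == 0)
                }
                sleep(backoffDelay(baseMs, maxMs, attempt));
                attempt++;
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while backing off", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice with a simulated outage, then succeeds on the third call.
        String result = executeWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new TransientException("no host available (simulated)");
            }
            return "ok";
        }, 3, 1L, 4L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Only the hypothetical transient exception is retried here; any other exception propagates immediately, which keeps non-connectivity errors (for example, a malformed query) from being retried pointlessly.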
97de8e9 to fc3d291
The PR wraps Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
Thanks @imbajin for the feedback, changed!
- Reset driver session after each transient failure in executeWithRetry() so retries reopen cleanly via lazy open()
- Remove redundant finally block in reconnectIfNeeded(); null session directly on DriverException
- Store retryBaseDelay as field, reuse in open() (removes double-read)
- One-time LOG.warn via AtomicBoolean for commitAsync() retry gap
- Tighten defaults: max_delay 60s→10s, max_retries 10→3, interval 5s→1s
- Wire retry config via HugeConfig in tests; add cross-validator tests
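The one-time warning mentioned in this commit can be illustrated with `AtomicBoolean.compareAndSet`, which returns true for exactly one caller. The sketch below is an assumption about the general shape, not the PR's code: `WarnOnceSketch`, `firstTime()`, and the message text are hypothetical, and `System.err` stands in for the project's logger.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; names and message text are hypothetical. */
final class WarnOnceSketch {

    private final AtomicBoolean asyncRetryWarned = new AtomicBoolean(false);

    /** Returns true only for the first caller, even when called concurrently. */
    boolean firstTime() {
        // Atomically flips false -> true; every later call sees true and fails the CAS.
        return asyncRetryWarned.compareAndSet(false, true);
    }

    /** Stand-in for an async commit path that is not covered by the retry wrapper. */
    void commitAsync() {
        if (firstTime()) {
            System.err.println("WARN: async commits are not retried on transient failures");
        }
        // The asynchronous batch submission itself is omitted from this sketch.
    }

    public static void main(String[] args) {
        WarnOnceSketch session = new WarnOnceSketch();
        session.commitAsync(); // prints the warning
        session.commitAsync(); // silent
    }
}
```

Compared with a plain boolean field, the compare-and-set avoids duplicate warnings when several threads hit the async path at the same moment.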
Pull request overview
This PR addresses HugeGraphServer becoming unresponsive after a Cassandra restart by adding automatic session recovery and configurable reconnection/retry behavior in the Cassandra backend.
Changes:
- Add new Cassandra reconnect/retry configuration options and validate their relationships.
- Configure the Datastax driver with an exponential reconnection policy and add per-query retry logic with backoff.
- Add unit tests covering option defaults/overrides and basic retry behavior.
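As a rough illustration of validating the relationships between such options, the sketch below checks each value and one cross-field rule at construction time. The exact rules enforced in CassandraOptions are not shown on this page, so the constraints here (positive delays, max delay not below base delay, non-negative retry count) are assumptions, and `ReconnectConfigSketch` is a hypothetical class. The defaults are the ones listed in the PR description.

```java
/** Illustrative sketch only; the real validators live in CassandraOptions. */
final class ReconnectConfigSketch {

    final long baseDelayMs;
    final long maxDelayMs;
    final int maxRetries;
    final long retryIntervalMs;

    ReconnectConfigSketch(long baseDelayMs, long maxDelayMs,
                          int maxRetries, long retryIntervalMs) {
        if (baseDelayMs <= 0) {
            throw new IllegalArgumentException("base delay must be positive");
        }
        // Assumed cross-field rule: the cap cannot be lower than the starting delay.
        if (maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("max delay must be >= base delay");
        }
        if (maxRetries < 0) {
            throw new IllegalArgumentException("max retries must be >= 0 (0 disables)");
        }
        if (retryIntervalMs <= 0) {
            throw new IllegalArgumentException("retry interval must be positive");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    /** Defaults as listed in the PR description: 1000ms, 10000ms, 3, 1000ms. */
    static ReconnectConfigSketch defaults() {
        return new ReconnectConfigSketch(1000L, 10000L, 3, 1000L);
    }

    public static void main(String[] args) {
        ReconnectConfigSketch cfg = defaults();
        System.out.println(cfg.baseDelayMs + " " + cfg.maxDelayMs + " "
                           + cfg.maxRetries + " " + cfg.retryIntervalMs);
    }
}
```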
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java | Adds driver reconnection policy, per-query retry wrapper, and session liveness probing/reset logic |
| hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraOptions.java | Introduces new reconnect/retry tunables with defaults and validators |
| hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/unit/cassandra/CassandraTest.java | Adds tests for reconnect option behavior and execute-with-retry flow |
Going through the feedback - NPE in Quick question before I do: worth renaming
+1 on the rename.
Purpose of the PR
closes #2740
HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.
Root cause:
CassandraSessionPool builds the Datastax Cluster without a ReconnectionPolicy, CassandraSession.execute(...) calls the driver once with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException/OperationTimedOutException errors surface to the user and the pool stays dead even after Cassandra comes back online.
Main Changes
- Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the Cluster builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute(...) in executeWithRetry(Statement) with exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() on CassandraSession so the pool reopens closed sessions and issues a lightweight health-check (SELECT now() FROM system.local) before subsequent queries.
- Add four tunables in CassandraOptions (defaults preserve previous behavior for healthy clusters):

| Option | Default |
|---|---|
| cassandra.reconnect_base_delay | 1000ms |
| cassandra.reconnect_max_delay | 10000ms |
| cassandra.query_retry_max_attempts | 3 (0 disables) |
| cassandra.query_retry_interval | 1000ms |

- Add unit tests covering defaults, overrides, disabling retries and option keys.
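The reopen-and-probe behavior of reconnectIfNeeded()/reset() can be sketched without the driver as follows. This is a simplified model under stated assumptions, not the CassandraSession implementation: `ReconnectSketch`, the `Conn` interface, and `FakeConn` are hypothetical, and `ping()` stands in for the lightweight health-check query.

```java
import java.util.function.Supplier;

/** Illustrative sketch only; not the CassandraSession implementation. */
final class ReconnectSketch {

    /** Hypothetical minimal view of a driver session; not the Datastax API. */
    interface Conn {
        boolean isClosed();
        void ping();   // stands in for a cheap probe such as SELECT now() FROM system.local
        void close();
    }

    /** In-memory fake used only to exercise the sketch. */
    static final class FakeConn implements Conn {
        boolean closed = false;
        boolean healthy = true;

        @Override public boolean isClosed() { return closed; }

        @Override public void ping() {
            if (!healthy) {
                throw new IllegalStateException("probe failed (simulated)");
            }
        }

        @Override public void close() { closed = true; }
    }

    private final Supplier<Conn> opener;
    private Conn conn;

    ReconnectSketch(Supplier<Conn> opener) {
        this.opener = opener;
    }

    /** Drops the current connection so the next call reopens lazily. */
    void reset() {
        Conn old = conn;
        conn = null;
        if (old != null && !old.isClosed()) {
            old.close();
        }
    }

    /** Reopens a missing or closed connection, then probes it before handing it out. */
    Conn reconnectIfNeeded() {
        if (conn == null || conn.isClosed()) {
            conn = opener.get();
        }
        try {
            conn.ping();
        } catch (RuntimeException e) {
            reset(); // a failed probe discards the connection; the caller may retry
            throw e;
        }
        return conn;
    }

    public static void main(String[] args) {
        ReconnectSketch pool = new ReconnectSketch(FakeConn::new);
        FakeConn first = (FakeConn) pool.reconnectIfNeeded();
        first.closed = true; // simulate the backend going away
        System.out.println(pool.reconnectIfNeeded() != first);
    }
}
```

Keeping the opener behind a `Supplier` means a reset never has to reconnect eagerly; the cost of reopening is paid only by the next query that actually needs a session.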
Verifying these changes
mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest (13/13 pass)
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO