RetryableMultiRegion: try_join_all drops JoinHandles on error, detaching spawned tasks that panic on runtime shutdown #534

@pingyu

Description

Bug Report

Problem

When using tikv-client in a context where the Tokio runtime may be dropped (e.g., a short-lived runtime created inside a worker thread), the program may panic with:

A Tokio 1.x context was found, but it is being shutdown.

Backtrace

The panic originates from TimerEntry::poll_elapsed in tokio-1.26.0/src/runtime/time/entry.rs:550-551:

TimerEntry::poll_elapsed                          ← PANIC: runtime timer driver is shut down
  ← tokio::time::sleep::Sleep::poll               ← sleep() called inside task
    ← tonic::transport::grpc_timeout::ResponseFuture::poll  ← gRPC timeout
      ← tonic AddOrigin service call
        ← tower buffer future
          ← tikv_client::ScanRequest::dispatch
            ← tikv_client::KvRpcClient::dispatch
              ← Dispatch<Req>::execute
                ← ResolveLock<P,PdC>::execute
                  ← RetryableMultiRegion::single_shard_handler  ← spawned via tokio::spawn
                    ← tokio multi_thread worker (async task)

Scenario

The user intends to:

  1. Create a short-lived Tokio runtime.
  2. Create a TransactionClient on that runtime.
  3. Perform scan operations via snapshot.scan() within runtime.block_on().
  4. Eventually drop the runtime when the worker thread exits.

During scan operations, tikv-client internally spawns tasks on the current runtime to shard the request across multiple regions. If one of these tasks fails and the user's runtime is dropped before the remaining spawned tasks terminate, the process panics. A sketch of the scenario follows.
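
For illustration, a minimal sketch of this scenario (worker_scan is a hypothetical function; the tikv-client calls are abbreviated and their exact signatures vary across client versions; error handling is elided). The panic itself is timing-dependent: it fires only when a detached shard task is still waiting on a timer as the runtime shuts down.

use tikv_client::{TransactionClient, TransactionOptions};

// Runs on a worker thread; the runtime dies when this function returns.
fn worker_scan() {
    // 1. Short-lived runtime owned by the worker thread.
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // 2. Client created on this runtime.
        let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await.unwrap();
        let ts = client.current_timestamp().await.unwrap();
        let mut snapshot = client.snapshot(ts, TransactionOptions::default());
        // 3. Internally shards the scan and spawns one task per region.
        let _ = snapshot.scan("a".to_owned().."z".to_owned(), 1024).await;
    });
    // 4. Runtime dropped here. Any shard task detached by try_join_all that
    //    is still inside a retry backoff or gRPC timeout timer hits the
    //    shut-down timer driver and panics.
}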

Root Cause

In request/plan.rs, RetryableMultiRegion::single_plan_handler spawns concurrent per-shard handler tasks via tokio::spawn and waits for them using futures::future::try_join_all:

// request/plan.rs:122-136 (abridged)
let mut handles = Vec::new();
for shard in shards {
    let handle = tokio::spawn(Self::single_shard_handler(...));
    handles.push(handle);
}
// Short-circuits on the first error and drops the remaining handles.
let results = try_join_all(handles).await?;

try_join_all (from futures-util 0.3.28) cancels remaining futures immediately when any one future returns an error (source: try_join_all.rs lines 165-168 break on first error, line 182 drops remaining futures). This means:

  1. When one single_shard_handler task fails (e.g., gRPC timeout, region error, leader-not-found), try_join_all drops the remaining JoinHandles.
  2. Tokio's JoinHandle::Drop does not cancel the spawned task; it detaches the task, which continues running on the runtime (see the sketch after this list).
  3. If the runtime is dropped while detached tasks are still alive, the tasks' subsequent tokio::time::sleep calls (via tonic's gRPC timeout wrapper or tikv-client's retry backoff) panic because the timer driver is shut down.
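
Both behaviors are easy to observe outside tikv-client. A minimal sketch with hypothetical tasks standing in for the real shard handlers:

use std::time::Duration;
use futures::future::try_join_all;

#[tokio::main]
async fn main() {
    let mut handles = Vec::new();
    // A long-running task: keeps running even after its handle is dropped.
    handles.push(tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("detached task still ran to completion");
        Ok::<(), String>(())
    }));
    // A task that fails immediately, triggering the short-circuit.
    handles.push(tokio::spawn(async { Err::<(), String>("shard error".into()) }));

    let result = try_join_all(handles.into_iter().map(|h| async move {
        // Surface both JoinError (panic) and the task's own error.
        h.await.map_err(|e| e.to_string())?
    }))
    .await;
    assert!(result.is_err()); // Short-circuited on the failing task.

    // try_join_all dropped the other JoinHandle, but the task it owned was
    // only detached, not cancelled; wait long enough to see it finish.
    tokio::time::sleep(Duration::from_millis(100)).await;
}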

The same pattern exists in RetryableAllStores::single_store_handler (plan.rs:463).

Proposed Fix

Instead of try_join_all, use futures::future::join_all to await all spawned tasks to completion before propagating errors:

let results = join_all(handles).await;
let mut first_err = None;
let mut outputs = Vec::with_capacity(results.len());
for r in results {
    // A JoinHandle yields Err(JoinError) only if the task panicked.
    match r.expect("shard handler task panicked") {
        Ok(ok) => outputs.push(Ok(ok)),
        Err(e) if first_err.is_none() => first_err = Some(e),
        Err(_) => {}
    }
}
if let Some(e) = first_err {
    return Err(e);
}
Ok(outputs)

Alternatively, use tokio::task::JoinSet, which tracks every spawned task, can be drained to completion like join_all, and, unlike bare JoinHandles, aborts any still-running tasks when it is dropped, so no shard task can outlive the caller.
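
A sketch of the JoinSet variant. The Shard type and single_shard_handler below are hypothetical stand-ins for the real definitions in request/plan.rs:

use tokio::task::JoinSet;

// Hypothetical stand-ins for the real shard types in plan.rs.
struct Shard(u32);

async fn single_shard_handler(shard: Shard) -> Result<u32, String> {
    Ok(shard.0)
}

async fn handle_shards(shards: Vec<Shard>) -> Result<Vec<u32>, String> {
    let mut set = JoinSet::new();
    for shard in shards {
        set.spawn(single_shard_handler(shard));
    }
    let mut outputs = Vec::new();
    let mut first_err = None;
    // Drain every task to completion before propagating an error; if this
    // future is dropped instead, the JoinSet aborts the remaining tasks.
    while let Some(res) = set.join_next().await {
        match res.expect("shard task panicked") {
            Ok(ok) => outputs.push(ok),
            Err(e) if first_err.is_none() => first_err = Some(e),
            Err(_) => {}
        }
    }
    match first_err {
        Some(e) => Err(e),
        None => Ok(outputs),
    }
}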
