[GLUTEN-11895][VL] Fix SIGSEGV on IOThreadPool threads during HDFS scan#11896

Open
guowangy wants to merge 1 commit into apache:main from guowangy:velox-io-thread-fix

Conversation

@guowangy

@guowangy guowangy commented Apr 9, 2026

What changes are proposed in this pull request?

Fix SIGSEGV on CPUThreadPool threads during HDFS scan caused by DetachCurrentThread poisoning libhdfs.so's TLS-cached JNIEnv*.

Fixes #11895

Root cause

libhdfs.so caches JNIEnv* per thread in a two-level TLS structure:

  • Fast path: static __thread ThreadLocalState *quickTlsEnv (ELF linker-initialized, zero-cost read, no re-validation)
  • Slow path: pthread_getspecific(gTlsKey) (mutex-protected, only on first call per thread)

After the first AttachCurrentThread on a CPUThreadPool thread, libhdfs caches the JNIEnv* in quickTlsEnv. The fast path returns this pointer on all subsequent calls without checking validity.

Two Gluten destructors called vm_->DetachCurrentThread() unconditionally after JNI cleanup:

  • JniColumnarBatchIterator::~JniColumnarBatchIterator() (cpp/core/jni/JniCommon.cc)
  • JavaInputStreamAdaptor::Close() (cpp/core/jni/JniWrapper.cc)

Both used attachCurrentThreadAsDaemonOrThrow() followed by unconditional DetachCurrentThread(). This was not a proper attach/detach pair: attachCurrentThreadAsDaemonOrThrow only attaches if the thread is not already attached, but DetachCurrentThread ran regardless — detaching threads that libhdfs had attached. This freed the JVM's JavaThread object while quickTlsEnv still held the stale pointer.

The crash sequence:

  1. CPUThreadPool21 runs a preload task → libhdfs calls AttachCurrentThread, caches JNIEnv* in quickTlsEnv
  2. Object cleanup on the same thread → Gluten destructor calls DetachCurrentThread → JavaThread freed, but quickTlsEnv still holds stale pointer
  3. CPUThreadPool21 runs next preload task → hdfsGetPathInfo() → libhdfs fast path returns stale env → jni_NewStringUTF(stale_env) → SIGSEGV

CPUThreadPool threads live for the entire executor JVM lifetime (created once in VeloxBackend::initConnector, destroyed only in VeloxBackend::tearDown), so the stale pointer persists across all subsequent queries.

Confirmed by core dump analysis (core.CPUThreadPool21.1770392 from TPC-DS on YARN):

  • RDI = 0x0 (NULL JavaThread*, set by block_if_vm_exited)
  • R12 = 0x7f3003a52200 (stale JNIEnv* from libhdfs TLS cache)
  • Memory at stale env shows JVM method resolution data (reused memory), not a valid JNI function table

Fix

Remove DetachCurrentThread() from both destructors with the following considerations:

  1. Broken ownership: the original code detached threads it didn't attach — attachCurrentThreadAsDaemonOrThrow is conditional (no-op if already attached) but DetachCurrentThread was unconditional
  2. Daemon threads (AttachCurrentThreadAsDaemon) do not block JVM shutdown
  3. folly::ThreadLocal<HdfsFile> destructors call hdfsCloseFile() (JNI) during thread exit — detaching mid-lifetime would crash these too
  4. libhdfs pthread_key destructor (hdfsThreadDestructor) calls DetachCurrentThread when the thread actually exits — proper cleanup still happens at thread exit when libhdfs is in use
  5. Non-HDFS paths (S3, GCS, local): libhdfs is absent, no pthread_key is registered, daemon-attached threads are safely reclaimed when the executor JVM exits

How was this patch tested?

  • JniThreadDetachTest.testIteratorDestructorDoesNotDetachThread: on a native std::thread (simulating CPUThreadPool), saves JNIEnv* (simulating libhdfs TLS cache), creates and destroys a real JniColumnarBatchIterator, then reuses the saved env for FindClass (simulating libhdfs's next hdfsGetPathInfo). With the bug: SIGSEGV crashes the JVM. With the fix: succeeds normally.
  • Existing surefire tests pass (mvn surefire:test -pl backends-velox)
  • Full TPC-DS benchmark on HDFS — no more CPUThreadPool SIGSEGV

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Opus 4.6

@github-actions github-actions bot added the VELOX label Apr 9, 2026
@guowangy guowangy force-pushed the velox-io-thread-fix branch from a962aba to 368f446 Compare April 9, 2026 04:57
@guowangy guowangy changed the title [GLUTEN-11895] [VL] Fix SIGSEGV on IOThreadPool threads during HDFS scan [GLUTEN-11895][VL] Fix SIGSEGV on IOThreadPool threads during HDFS scan Apr 9, 2026
@FelixYBW
Contributor

FelixYBW commented Apr 9, 2026

Hold on to merge until we fix #11452 (comment)

Member

@zhztheplayer zhztheplayer left a comment


Thanks @guowangy. I wonder whether simply removing these detach statements will cause a Java Thread object leak.

You can give it a try to see whether the heap usage keeps going up with this change. Or have a look at #11895 (comment).

@guowangy
Author

guowangy commented Apr 15, 2026

@zhztheplayer

Thanks @guowangy. I wonder whether simply removing these detach statements will cause a Java Thread object leak.

In my opinion, no leak occurs.

For HDFS threads: libhdfs has its own cleanup — it calls DetachCurrentThread automatically when each thread exits, via hdfsThreadDestructor. The old code was calling DetachCurrentThread prematurely at object destruction time — before thread exit — which corrupted libhdfs's cached JNIEnv* and caused the crash.

For non-HDFS native threads: attachCurrentThreadAsDaemonOrThrow attaches them as daemon on first use. They are joined by ioExecutor_.reset() inside the JVM shutdown hook before the JVM exits. Daemon threads do not block JVM shutdown. The attached-thread count is bounded by pool size, not by query or operation count.

For Spark task threads and shutdown threads: these are JVM-managed and already attached — attachCurrentThreadAsDaemonOrThrow is a no-op (GetEnv returns JNI_OK). No new JavaThread is created.

You can give it a try to see whether the heap usage keeps going up with this change. Or have a look at #11895 (comment).

I didn't see heap usage keep going up in an 8-hour test.

@zhztheplayer
Member

Thanks for the experiment @guowangy.

I revisited the code, and ioExecutor_ is global, so it should not cause a leak, as you said. It's worth checking spillExecutor_: do the spill threads have any callbacks into Gluten JNI code that would cause them to be attached? The feature is turned on by setting spark.gluten.sql.columnar.backend.velox.spillThreadNum.

I didn't see heap usage keep going up in an 8-hour test.

Perhaps using jstack to view all Java threads is a simpler way to spot such leaks.

@FelixYBW
Contributor

  1. CPUThreadPool21 runs a preload task → libhdfs calls AttachCurrentThread, caches JNIEnv* in quickTlsEnv
  2. Object cleanup on the same thread → Gluten destructor calls DetachCurrentThread → JavaThread freed, but quickTlsEnv still holds stale pointer
  3. CPUThreadPool21 runs next preload task → hdfsGetPathInfo() → libhdfs fast path returns stale env → jni_NewStringUTF(stale_env) → SIGSEGV

Should we clean up the stale pointer at step 2 and re-attach at step 3?

@FelixYBW
Contributor

It's a libhdfs-only issue, right? Does the PR work on S3, ABFS, GCS?

@guowangy
Author

@FelixYBW

  1. CPUThreadPool21 runs a preload task → libhdfs calls AttachCurrentThread, caches JNIEnv* in quickTlsEnv
  2. Object cleanup on the same thread → Gluten destructor calls DetachCurrentThread → JavaThread freed, but quickTlsEnv still holds stale pointer
  3. CPUThreadPool21 runs next preload task → hdfsGetPathInfo() → libhdfs fast path returns stale env → jni_NewStringUTF(stale_env) → SIGSEGV

Should we clean up the stale pointer at step 2 and re-attach at step 3?

It requires clearing quickTlsEnv after DetachCurrentThread. That's impossible from outside libhdfs because quickTlsEnv is a static __thread variable declared inside the body of threadLocalStorageGet() — it has no external symbol, is not exported from libhdfs.so, and no public API exists to zero it. The only code that can write it is libhdfs itself.

Without being able to clear quickTlsEnv, any HDFS call after DetachCurrentThread hits the fast path, sees the non-null but stale pointer, and crashes — the re-attach in step 3 never gets a chance to run.

It's a libhdfs only issue, right? Does the PR work on S3, abfs, gcs?

Yes, HDFS only. Other filesystems don't have this problem.

@guowangy
Author

I revisited the code, and ioExecutor_ is global, so it should not cause a leak, as you said. It's worth checking spillExecutor_: do the spill threads have any callbacks into Gluten JNI code that would cause them to be attached? The feature is turned on by setting spark.gluten.sql.columnar.backend.velox.spillThreadNum.

@zhztheplayer Good catch. spillExecutor_ is the real issue — it's created per task, and its threads get attached to the JVM via SparkAllocationListener but never detached, so JavaThread objects accumulate over time.

Proposed fix: add a thin folly::ThreadFactory wrapper for spillExecutor_ that attaches threads at pool creation and calls DetachCurrentThread inside the thread body after all work completes. Spill threads never call libhdfs, so this is safe.

Does this approach sound reasonable to you?

@zhztheplayer
Member

@guowangy

Proposal Fix: Add a thin folly::ThreadFactory wrapper for spillExecutor_ that attaches threads at pool creation and calls DetachCurrentThread inside the thread body after all work completes. Spill threads never call libhdfs, so this is safe.

This sounds reasonable to me.



Successfully merging this pull request may close these issues.

[VL] SIGSEGV in IOThreadPool during HDFS scan
