Skip to content

[VL] Ignore null values when copy velox string buffer to arrow#11944

Open
wankunde wants to merge 2 commits intoapache:mainfrom
wankunde:velox_to_arrow
Open

[VL] Ignore null values when copy velox string buffer to arrow#11944
wankunde wants to merge 2 commits intoapache:mainfrom
wankunde:velox_to_arrow

Conversation

@wankunde
Copy link
Copy Markdown
Contributor

@wankunde wankunde commented Apr 15, 2026

What changes are proposed in this pull request?

Bug fix for VeloxHashShuffleWriter.
Description:

  • VeloxHashShuffleWriter::write() method will try to convert the input velox flat vectors into Arrow buffer.
  • For StringView vectors, VeloxHashShuffleWriter calculates the total size of all input StringViews, and initializes the Arrow buffer accordingly.
  • For null values, we do not actually need to copy the dirty values into Arrow buffers, thereby saving execution memory.

Test logs in our productions:

I20260415 17:50:28.520978  5474 VeloxHashShuffleWriter.cc:156] collectFlatVectorBufferStringView for VARCHAR
I20260415 17:50:28.521070  5474 VeloxHashShuffleWriter.cc:110] DEBUG, lengthBufferSize = 40, flatVector->size() = 10
I20260415 17:50:28.521078  5474 VeloxHashShuffleWriter.cc:114] DEBUG flatVector = [FLAT VARCHAR: 10 elements, 9 nulls] data: 0: null
1: null
2: null
3: 114535255
4: null
5: null
6: null
7: null
8: null
I20260415 17:50:28.521092  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 0, isNull = true, rawValues[i].size() = 1957213616, length = 0,  offset = 0
I20260415 17:50:28.521100  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 1, isNull = true, rawValues[i].size() = 2588080704, length = 0,  offset = 0
I20260415 17:50:28.521106  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 2, isNull = true, rawValues[i].size() = 2706762760, length = 0,  offset = 0
I20260415 17:50:28.521111  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 3, isNull = false, rawValues[i].size() = 9, length = 9,  offset = 9
I20260415 17:50:28.521117  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 4, isNull = true, rawValues[i].size() = 0, length = 0,  offset = 9
I20260415 17:50:28.521122  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 5, isNull = true, rawValues[i].size() = 1, length = 0,  offset = 9
I20260415 17:50:28.521127  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 6, isNull = true, rawValues[i].size() = 2832791872, length = 0,  offset = 9
I20260415 17:50:28.521132  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 7, isNull = true, rawValues[i].size() = 2588222376, length = 0,  offset = 9
I20260415 17:50:28.521138  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 8, isNull = true, rawValues[i].size() = 0, length = 0,  offset = 9
I20260415 17:50:28.521143  5474 VeloxHashShuffleWriter.cc:120] DEBUG flatVector dump, i = 9, isNull = true, rawValues[i].size() = 0, length = 0,  offset = 9

How was this patch tested?

Tested in out production environment.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the VELOX label Apr 15, 2026
@wankunde
Copy link
Copy Markdown
Contributor Author

Retest this please

Copy link
Copy Markdown
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jinchengchenghh jinchengchenghh changed the title [CPP] Ignore null values when copy velox buffer to arrow [VL] Ignore null values when copy velox buffer to arrow Apr 16, 2026
@jinchengchenghh jinchengchenghh changed the title [VL] Ignore null values when copy velox buffer to arrow [VL] Ignore null values when copy velox string buffer to arrow Apr 16, 2026
@zhouyuan
Copy link
Copy Markdown
Member

@wankunde thanks for the fix, would you please also add a unit test to guard on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants