
fix: correct accounting in DictEncoder::estimated_memory_size, Interner::estimated_memory_size#9720

Open
mzabaluev wants to merge 7 commits into apache:main from mzabaluev:fix-estimated-memory-size-on-dict-encoder

Conversation

@mzabaluev
Contributor

@mzabaluev mzabaluev commented Apr 14, 2026

Which issue does this PR close?

Rationale for this change

The returned value should estimate the actual memory usage, but instead it evaluates the encoded size of the dictionary data and bypasses the hash table memory usage added by the Interner member. The implementation of Storage::estimated_memory_size for the unique key storage was also incorrect, but it was unused.

What changes are included in this PR?

Correct both problems: make KeyStorage's implementation of estimated_memory_size return the size of the allocated uniques vector plus the values' heap sizes where applicable, and make DictEncoder::estimated_memory_size delegate to the interner, which calls the KeyStorage method and adds accounting for its own data structure. A minimal sketch of the resulting shape is below.
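
A minimal sketch of that shape, using stand-in struct and field names (`uniques`, `dedup`) rather than the parquet crate's actual internals:

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Illustration only: these types are stand-ins, not the crate's internals.
struct KeyStorage<K> {
    uniques: Vec<K>,
}

impl<K> KeyStorage<K> {
    fn estimated_memory_size(&self) -> usize {
        // The allocation behind a Vec is sized by its capacity, not its length.
        self.uniques.capacity() * size_of::<K>()
    }
}

struct Interner<K> {
    dedup: HashMap<u64, usize>, // stand-in for the real hash table type
    storage: KeyStorage<K>,
}

impl<K> Interner<K> {
    fn estimated_memory_size(&self) -> usize {
        // Hash table accounting plus the storage's own estimate.
        self.dedup.capacity() * size_of::<(u64, usize)>()
            + self.storage.estimated_memory_size()
    }
}
```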

Are these changes tested?

Added tests verifying that at least the expected amounts are accounted for when values are added. Overreporting is hard to disprove, because it depends on allocation behavior internal to other libraries.
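
The lower-bound strategy can be illustrated with a toy stand-in (not the actual test code from this PR):

```rust
use std::mem::size_of;

// Toy stand-in for the encoder, used only to show the assertion style.
struct ToyDict {
    uniques: Vec<i32>,
}

impl ToyDict {
    fn estimated_memory_size(&self) -> usize {
        self.uniques.capacity() * size_of::<i32>()
    }
}

#[test]
fn memory_estimate_respects_lower_bound() {
    let mut dict = ToyDict { uniques: Vec::new() };
    let baseline = dict.estimated_memory_size(); // 0: nothing allocated yet
    dict.uniques.extend(0..1000);
    // Capacity is at least length, so the estimate must cover at least the
    // payload actually stored; asserting exact equality would overfit
    // Vec's growth strategy.
    assert!(dict.estimated_memory_size() >= baseline + 1000 * size_of::<i32>());
}
```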

Are there any user-facing changes?

No.

The returned value should estimate the actual memory usage, but
instead it used the evaluation of the encoded size of the dictionary
data, and bypassed the hash table memory usage added by the Interner.
The implementation of Storage::estimated_memory_size for the
unique key storage was also incorrect, but it was unused.
Correct both problems.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 14, 2026
@alamb
Contributor

alamb commented Apr 16, 2026

Is there some way to add tests for this change?

Contributor

@alamb alamb left a comment


Thanks for this @mzabaluev

Comment on lines +185 to +186

     fn estimated_memory_size(&self) -> usize {
-        self.interner.storage().size_in_bytes + self.indices.len() * std::mem::size_of::<usize>()
+        self.interner.estimated_memory_size() + self.indices.len() * std::mem::size_of::<usize>()

I think this is the right direction (in the sense of accounting for the size in the structures that hold the memory, rather than in the wrapper).

However, it is not clear to me that KeyStorage includes the heap bytes for types like BYTE_ARRAY

I think we need to add some tests for this code to make sure we have it right

Contributor Author


Testing upper bounds would require intimate knowledge of reallocation behavior of Vec and hashbrown, but I'll try to get some confirmation.
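
For example (an illustration, not code from the PR), the capacity a Vec ends up with is a growth-strategy detail:

```rust
fn main() {
    let mut v: Vec<u64> = Vec::new();
    for i in 0..100u64 {
        v.push(i);
    }
    // len() is exactly 100, but capacity() depends on Vec's internal
    // growth strategy (currently amortized doubling, but unspecified),
    // so asserting an exact upper bound would encode library internals.
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}
```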

Contributor Author


Oh, you are correct about the byte arrays; I need to account for variable- and fixed-length array values.

Contributor Author

@mzabaluev mzabaluev Apr 16, 2026


I have accounted for the byte arrays' allocations and added some tests.
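
One plausible shape for that accounting, with a stand-in type rather than parquet's actual ByteArray:

```rust
use std::mem::size_of;

// Stand-in for a variable-length value; not parquet's ByteArray type.
struct OwnedBytes {
    data: Vec<u8>,
}

fn estimated_memory_size(uniques: &Vec<OwnedBytes>) -> usize {
    // Fixed-size slots in the vector's own allocation...
    let slots = uniques.capacity() * size_of::<OwnedBytes>();
    // ...plus the heap buffers the stored values point to.
    let heap: usize = uniques.iter().map(|b| b.data.capacity()).sum();
    slots + heap
}
```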

@mzabaluev
Contributor Author

Nobody seems to have really tested these methods, because this Interner code is also wrong:
https://github.com/mzabaluev/arrow-rs/blob/19002abc3522af60b6eaa99c0d5d4999b70cd681/parquet/src/util/interner.rs#L82

It should multiply the table capacity by the key size, not add them.

Should multiply hash table capacity by the key size, not add them.
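
The arithmetic difference is stark even at modest table sizes (a standalone illustration, not code from the crate):

```rust
use std::mem::size_of;

fn main() {
    let capacity = 1024usize;
    let key_size = size_of::<u64>(); // stand-in for the Interner's Key type
    // Addition reports 1032 bytes regardless of how the table grows...
    println!("added:      {}", capacity + key_size);
    // ...while multiplication scales with the allocation: 8192 bytes.
    println!("multiplied: {}", capacity * key_size);
}
```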
@alamb
Contributor

alamb commented Apr 16, 2026

Nobody seems to have really tested these methods, because this Interner code is also wrong: https://github.com/mzabaluev/arrow-rs/blob/19002abc3522af60b6eaa99c0d5d4999b70cd681/parquet/src/util/interner.rs#L82

It should multiply the table capacity by the key size, not add them.

The fact that there are no tests and the code is wrong is probably related.

@mzabaluev mzabaluev marked this pull request as draft April 16, 2026 16:44
To estimate allocation size, the vector capacity is the right thing to
use, not the length.
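
A quick demonstration of why (not from the PR):

```rust
fn main() {
    // A freshly reserved vector has a real allocation even at len == 0,
    // so length-based accounting would report zero bytes for it.
    let v: Vec<u64> = Vec::with_capacity(100);
    assert_eq!(v.len(), 0);
    assert!(v.capacity() >= 100); // at least 800 bytes already allocated
}
```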
As we cannot test the exact memory allocation behavior, or even
give an upper bound, without delving into implementation details of
Vec and hashbrown, the only reliable test is to confirm that the lower
bounds are respected: increases in memory use from the empty state
(where all vectors are empty and therefore have added no allocations)
add at least the expected amount of memory.
@mzabaluev mzabaluev marked this pull request as ready for review April 16, 2026 17:28
@mzabaluev mzabaluev changed the title fix: correct accounting in DictEncoder::estimated_memory_size fix: correct accounting in DictEncoder::estimated_memory_size, Interner::estimated_memory_size Apr 16, 2026

Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Incorrect accounting in DictEncoder::estimated_memory_size
