fix: correct accounting in DictEncoder::estimated_memory_size, Interner::estimated_memory_size#9720
Conversation
The returned value should estimate the actual memory usage, but instead it used the encoded size of the dictionary data, and bypassed the hash table memory usage added by the Interner. The implementation of Storage::estimated_memory_size for the unique key storage was also incorrect, but it was unused. Correct both problems.
Is there some way to add tests for this change?
alamb
left a comment
Thanks for this @mzabaluev
```diff
 fn estimated_memory_size(&self) -> usize {
-    self.interner.storage().size_in_bytes + self.indices.len() * std::mem::size_of::<usize>()
+    self.interner.estimated_memory_size() + self.indices.len() * std::mem::size_of::<usize>()
```
I think this is the right direction (in the sense of accounting for the size in the structures that hold the memory, rather than in the wrapper).
However, it is not clear to me that KeyStorage includes the heap bytes for types like BYTE_ARRAY.
I think we need to add some tests for this code to make sure we have it right.
Testing upper bounds would require intimate knowledge of reallocation behavior of Vec and hashbrown, but I'll try to get some confirmation.
Oh, you are correct about the byte arrays, I need to account for variable- and fixed-length array values.
I have accounted for the byte arrays' allocations and added some tests.
It should multiply the hash table capacity by the key size, not add them.

Nobody seems to have really tested these methods; the fact that there are no tests and the code is wrong is probably related.

To estimate allocation size, the vector capacity is the right thing to use, not the length.
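The capacity-times-entry-size accounting can be sketched as follows. This is an illustrative simplification, not the actual parquet code: the `Interner` fields and the `(u64, usize)` per-entry type are assumptions standing in for the real hash table internals.

```rust
use std::mem::size_of;

// Hypothetical simplified interner; field names and the per-entry
// type are illustrative assumptions, not the real parquet internals.
struct Interner {
    table_capacity: usize, // capacity of the dedup hash table
    storage_size: usize,   // bytes accounted by the key storage
}

impl Interner {
    fn estimated_memory_size(&self) -> usize {
        // Multiply the table capacity by the per-entry size instead of
        // adding them: a table with capacity N allocates N entry slots.
        self.storage_size + self.table_capacity * size_of::<(u64, usize)>()
    }
}
```

Adding capacity and entry size, as the old code effectively did, would report only one entry's worth of table memory regardless of how large the table grows.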
As we cannot test exact memory allocation behavior, or even give an upper bound, without delving into implementation details of Vec and hashbrown, the only reliable test is to confirm that lower bounds are respected: increases in memory use from the empty state (where all vectors are empty and therefore hold no allocations) must add at least the expected amount of memory.
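The lower-bound strategy can be sketched with a plain `Vec<u8>` standing in for the real encoder (a hypothetical stand-in, not the parquet API): exact capacities depend on Vec's growth policy, but after adding 100 bytes the estimate must have grown by at least 100 from the empty state.

```rust
// Stand-in for an estimated_memory_size implementation: account for
// the allocation the vector actually owns.
fn estimated_memory_size(v: &Vec<u8>) -> usize {
    // Capacity, not length: the allocation covers the full capacity.
    v.capacity() * std::mem::size_of::<u8>()
}
```

A test then compares the estimate before and after adding data and asserts only the lower bound, leaving the reallocation slack unconstrained.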
Which issue does this PR close?
Incorrect accounting in DictEncoder::estimated_memory_size #9719, Incorrect accounting in Interner::estimated_memory_size #9744.

Rationale for this change
The returned value should estimate the actual memory usage, but instead it uses the encoded size of the dictionary data, and bypasses the hash table memory usage added by the Interner member. The Storage::estimated_memory_size implementation for the unique key storage was also incorrect, but it was unused.

What changes are included in this PR?
Correct both problems by making the KeyStorage implementation of estimated_memory_size return the size of the allocated uniques vector plus the values' sizes where applicable, and making DictEncoder::estimated_memory_size delegate to the interner, which calls the KeyStorage method and adds accounting for its own data structure.

Are these changes tested?
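The byte-array accounting described above can be sketched like this. It is an illustrative simplification, not the actual parquet code: `ByteArrayStorage` and its `uniques` field are hypothetical, with `Vec<u8>` standing in for the real byte-array type.

```rust
// Illustrative sketch: accounting for variable-length values must
// include both the `uniques` vector's own allocation and the heap
// bytes owned by each stored value.
struct ByteArrayStorage {
    uniques: Vec<Vec<u8>>, // Vec<u8> stands in for the real byte-array type
}

impl ByteArrayStorage {
    fn estimated_memory_size(&self) -> usize {
        // Capacity, not length, for the vector's own allocation...
        self.uniques.capacity() * std::mem::size_of::<Vec<u8>>()
            // ...plus the heap bytes each unique value owns.
            + self.uniques.iter().map(|v| v.capacity()).sum::<usize>()
    }
}
```

Using only the first term would miss the heap payloads entirely, which is the BYTE_ARRAY gap raised in the review.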
Added tests verifying that the estimate grows by at least the expected amount when values are added. Overreporting is hard to disprove due to allocation behavior internal to other libraries.
Are there any user-facing changes?
No.