
feat(parquet): dictionary fallback heuristic based on compression efficiency#9700

Open
mzabaluev wants to merge 19 commits into apache:main from mzabaluev:parquet-dict-fallback-heuristic

Conversation


@mzabaluev mzabaluev commented Apr 13, 2026

Which issue does this PR close?

What changes are included in this PR?

Added a ColumnProperties option, dictionary_fallback, which takes a DictionaryFallback enum value.
Two behavior variants are provided (the enum is initially non-exhaustive so that more can be added later if necessary):

  • OnPageSizeLimit - the prior behavior and the default, triggers fallback on exceeding the dictionary page size limit.
  • OnUnfavorableAfter - a new behavior, includes the page size limit check and adds a check for encoded size not exceeding the plain data size.

Implemented the new optional behavior in the encoder.

Are these changes tested?

Added new tests exercising the OnUnfavorableAfter behavior.
The existing tests exercise OnPageSizeLimit.

Are there any user-facing changes?

Added API in parquet:

  • The DictionaryFallback enum
  • ColumnProperties::dictionary_fallback, ColumnProperties::set_dictionary_fallback
  • WriterPropertiesBuilder::set_dictionary_fallback, WriterPropertiesBuilder::set_column_dictionary_fallback
  • WriterProperties::dictionary_fallback
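
As a rough illustration of what the new heuristic decides (a self-contained sketch; the function name and thresholds here are illustrative, not the crate's internals), the OnUnfavorableAfter check boils down to a size comparison once a minimum sample has been seen:

```rust
// Hedged sketch of the fallback rule: after at least `sample_len` values,
// fall back to PLAIN when dictionary encoding is no longer smaller than
// plain encoding would have been.
fn should_fall_back(
    num_values: usize,
    sample_len: usize,
    plain_size: usize,        // cumulative size if the values were PLAIN encoded
    dict_encoded_size: usize, // cumulative size of the dictionary-encoded indices
    dict_page_size: usize,    // current size of the dictionary page itself
) -> bool {
    num_values >= sample_len && dict_encoded_size + dict_page_size >= plain_size
}

fn main() {
    // High-cardinality data: dictionary overhead dominates -> fall back.
    assert!(should_fall_back(10_000, 8_192, 80_000, 20_000, 70_000));
    // Low-cardinality data: dictionary clearly wins -> keep it.
    assert!(!should_fall_back(10_000, 8_192, 80_000, 20_000, 1_000));
    // Sample threshold not reached yet -> no decision.
    assert!(!should_fall_back(4_000, 8_192, 80_000, 20_000, 70_000));
}
```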

In parquet, add `dictionary_fallback` option to `ColumnProperties`.
Its value type is defined as the `DictionaryFallback` enum with
the variants `OnDataPageSize` (the previous behavior, the default)
and `OnUnfavorableCompression`. The latter replicates the behavior of
parquet-java, which falls back to non-dictionary encoding when the
estimated compressed size of the data page with dictionary encoding is
not smaller than the estimated size of the data page with
plain encoding.
The method as it is used now does not need to return the tuple of
base size and number of elements; instead, just compute the encoded
size as appropriate for the type.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 13, 2026
It's not really necessary because the counter is only used while
dictionary encoding is enabled, but it's good to safeguard against
future refactoring.

etseidl commented Apr 13, 2026

Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated.

I looked at the Java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so applying this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to limit pages to 20,000 rows by default (this crate adopted the 20k default quite some time later). IMO, 20,000 values is too small a sample to decide whether a dictionary is having a beneficial effect. Say one has a relatively low cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k-row page, I think the heuristic here will almost certainly choose plain over dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far.

I like that this is opt-in, but then wonder if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question.
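
The 32k-cardinality arithmetic above can be sketched numerically (back-of-envelope figures only, ignoring page headers and any RLE run compression of the indices):

```rust
const PAGE_ROWS: u64 = 20_000;
const DICT_PAGE: u64 = 32_768 * 8; // 32k-entry dictionary, PLAIN-encoded i64s
const BITS_PER_INDEX: u64 = 15;    // ceil(log2(32_768)) bits per dictionary index

fn plain_size(pages: u64) -> u64 {
    pages * PAGE_ROWS * 8 // 8 bytes per PLAIN i64
}

fn dict_size(pages: u64) -> u64 {
    DICT_PAGE + pages * PAGE_ROWS * BITS_PER_INDEX / 8
}

fn main() {
    // After a single 20k-row page the dictionary looks like a loss:
    assert!(dict_size(1) > plain_size(1)); // ~300 KB vs 160 KB
    // ...but over ten pages dictionary encoding wins by more than 2x:
    assert!(2 * dict_size(10) < plain_size(10)); // ~637 KB vs 1.6 MB
}
```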

Comment thread parquet/src/data_type.rs
(std::mem::size_of::<u32>(), self.len())
fn dict_encoding_size(&self) -> usize {
4 + self.len()
}
Contributor:

Use size_of perhaps?

Contributor Author:

It's a u32, so it will never be different from 4.

Contributor:

yes, so you should use size_of, which is clear why^

Contributor:

To be fair, there's magic all over this module. But I agree with @EmilyMatt there's no need to add more.

Comment thread parquet/src/data_type.rs Outdated
}

fn dict_encoding_size(&self) -> usize {
12
Contributor:

Magic?

Contributor Author:

This is a type that encodes to 96 bits. The compound Rust type is furthermore not declared with a repr that nails down its in-memory size (the compiler might decide to align such small arrays to 16 bytes one day and jack up the size accordingly, for example) so I'd argue using size_of would not be squeaky-clean here.

Contributor:
@EmilyMatt EmilyMatt Apr 14, 2026

I understand, so a const would be enough, just not a magic number.

Contributor Author:

OK, the array alignment is nailed down in the spec, but Int96 wrapping the array is a struct with a Rust repr, so... 🤷 Could be a case for declaring the struct with repr(packed), but I'd rather argue we use explicit numbers here because the encoding is not a straight memory copy.

Contributor Author:

Good suggestion about the const, I've added Int96::SIZE_IN_BYTES for this purpose.
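
A minimal sketch of the named-constant approach (this `Int96` is a stand-in, not the crate's type): the encoded width is fixed by the Parquet format at 96 bits, independent of the wrapper struct's in-memory layout.

```rust
// Stand-in for the crate's Int96; the field is unused in this sketch.
#[allow(dead_code)]
struct Int96([u32; 3]);

impl Int96 {
    /// Encoded width of an INT96 value: three little-endian u32 words.
    /// Deliberately derived from the encoding, not from size_of::<Int96>().
    const SIZE_IN_BYTES: usize = 3 * std::mem::size_of::<u32>();
}

fn main() {
    assert_eq!(Int96::SIZE_IN_BYTES, 12);
}
```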

Comment thread parquet/src/data_type.rs Outdated
Ok(values_read)
}

fn dict_encoding_size(&self) -> usize {
Contributor:

Magic?

Contributor Author:

Hmm, it's not only that using size_of would be dubious, the encoding actually uses one bit per value, so this method is a leaky abstraction. I'll see about reworking it.

Contributor Author:

The dictionary encoding is never used on bool types, so this just needs some clarifying comments.

Contributor Author:

... or, better, a panic because this method should never be called for BoolType: the column encoder does a check to fall back to plain/RLE in this case.

Contributor:

Yeah, this probably needs an unreachable!().
Though I believe a main principle is that this project should sit really low in the dependency graph, which argues against panics. Maybe a const here would be fine as well, or 0, but I fear that may cause a never-ending block/row-group if this assumption ever changes (for example, when encoding an all-true group or something).
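
A sketch of the panic backstop under discussion (a hypothetical free function, not the trait method itself): since the column encoder falls back to PLAIN/RLE for booleans before dictionary encoding is ever attempted, any returned size would violate the method's contract.

```rust
// Hypothetical backstop for the BOOLEAN case: reaching this is a bug.
fn bool_dict_encoding_size() -> usize {
    unreachable!("dictionary encoding is never used for BOOLEAN columns")
}

fn main() {
    // Calling it panics rather than returning a bogus size.
    let result = std::panic::catch_unwind(bool_dict_encoding_size);
    assert!(result.is_err());
}
```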


mzabaluev commented Apr 14, 2026

Good points @etseidl. Our motivation for adding this is that in some cases, e.g. with high cardinality, the Rust parquet writer produces much larger encoded Parquet output than the Spark workloads we're aiming to replace. So a default option enabling a heuristic akin to the one hardcoded into parquet-java would get us on par (or maybe better, because this implementation may choose to fall back at any page in the chunk).


etseidl commented Apr 14, 2026

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

In dict_encoded_size, use size_of and symbolic constants instead of
hardcoded values. For the boolean type, backstop with a panic
since any returned value would be bogus for this method's contract
(as plain encoding for BOOLEAN is bit-packed).
The dictionary encoding is never used for boolean values.
@mzabaluev (Contributor Author):

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

Can an educated programmer achieve this with other existing properties? There may be adverse effects to tweaking other defaults in tandem with this one. If you mean data_page_row_count_limit, that is a writer property, so I'd be hesitant to auto-adjust it based on any column properties.

I'd rather have the option to choose between:

  1. The default simple behavior, i.e. OnPageSizeLimit;
  2. The Java-like heuristic to closely follow Spark and other systems using parquet-java (however, a bit-for-bit workalike would require a larger rework of the logic and is perhaps not desirable);
  3. Better adaptive behaviors, if and when implemented.

For 3, I have purposefully left the DictionaryFallback enum open-ended.
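
For illustration, an open-ended enum along the lines the PR describes might look like this (the variant names follow the PR text; the payload type and derives are assumptions, not the crate's exact definition):

```rust
// Sketch of an open-ended fallback policy enum.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DictionaryFallback {
    /// Prior behavior: fall back when the dictionary page size limit is hit.
    OnPageSizeLimit,
    /// Java-like heuristic: also fall back when, after at least this many
    /// values, dictionary encoding is no smaller than plain encoding.
    OnUnfavorableAfter(usize),
}

fn main() {
    let fallback = DictionaryFallback::OnUnfavorableAfter(8_192);
    // `#[non_exhaustive]` lets new adaptive variants be added later without
    // breaking downstream matches (they must carry a wildcard arm).
    match fallback {
        DictionaryFallback::OnPageSizeLimit => unreachable!(),
        DictionaryFallback::OnUnfavorableAfter(n) => assert_eq!(n, 8_192),
        #[allow(unreachable_patterns)]
        _ => {}
    }
}
```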

Contributor:
@etseidl etseidl left a comment

I sympathize with wanting to be a bit smarter about when to give up on dictionary encoding. I would, however, like to see something a bit more defensible before proceeding with this change. For example, I'd like to see examples where this heuristic outperforms the current defaults by more than 10%, say, and also outperforms disabling dictionary encoding altogether (something which is already an opt-in option, as this new heuristic would be).

Comment thread parquet/src/column/writer/encoder.rs Outdated
}

#[test]
fn test_dict_page_size_decided_by_compression_fallback() {
Contributor:
@etseidl etseidl Apr 14, 2026

As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.

I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller.

I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.

Edit: I did a test of spark with the latter file (32k cardinality). By default, it opts to fallback for all pages, so the file is even larger. If I modify the global parquet.page.row.count.limit to 132000, it then opts for dictionary encoding as it should.

Contributor Author:

I have modified the test in 1b6dd37 to demonstrate a case where even an early fallback decision yields about 12% better compression. But I generally agree with your assessment, so more work is needed.

Another quirk is seen in this test: a dictionary page is still flushed to encode the first data page, even though there is no benefit. Parquet-java takes care to hand over the accumulated values to the plain encoder to be re-encoded.

/// that the writer should fall back to the plain encoding when at a certain point,
/// e.g. after encoding the first batch, the total size of unencoded data
/// is calculated as smaller than `(encodedSize + dictionarySize)`.
pub struct PlainDataSizeCounter {
Contributor:

To my comment about sample size, perhaps this can be adapted to keep track of both the plain and dictionary encoded sizes, and then only make a decision after some critical number of rows or bytes have been processed.

Contributor Author:
@mzabaluev mzabaluev Apr 14, 2026

I can add a value governing this (as the minimum number of rows) directly into the OnUnfavorableCompression variant. Or perhaps, add another property alongside for future extensibility, since it's not clear if everyone would want it to be rows and there'd be a preference for the byte size threshold later, and I don't want to introduce a nested open-ended property struct.

Contributor:

Perhaps just values entered to cover lists as well. I think as long as there are a sufficient number of samples with which to make a decision, this could be quite nice in the end.

Contributor Author:

Done in da73778.

Contributor:

Thanks @mzabaluev, I'll give it a look later today


To plain_encoded_data_size as suggested in review.
Modify the compression fallback test to illustrate the benefit in
an admittedly differently contrived case. The heuristic borrowed from
parquet-java is still not ideal for all cases, so we'll need more
configurability.
Change OnUnfavorableCompression to the variant carrying a value
specifying the minimal sample length, to give the user more control
over when to fall back to the plain encoding.
@mzabaluev-flarion force-pushed the parquet-dict-fallback-heuristic branch from 98025cc to da73778 on April 16, 2026 04:56

alamb commented Apr 16, 2026

I agree that benchmarks / explanations of the theoretical foundation for this change are important, rather than just blindly doing what the Java implementation does.

Contributor:
@etseidl etseidl left a comment

Flushing a batch of comments. I like the direction this is moving, but it still needs some work.

How do you feel about changing this to draft while we iterate?

Comment thread parquet/src/data_type.rs Outdated
(std::mem::size_of::<Self>(), 1)
}
/// Return the size in bytes for the value encoded in the dictionary.
fn dict_encoding_size(&self) -> usize;
Contributor:

I never really paid attention to this before, but it's a curious name for this function. Since the dictionary page is PLAIN encoded, this is really the plain encoded size (and you actually use it as such in the counter). No need to change the name, but perhaps the docstring could explain this.

Comment thread parquet/src/data_type.rs Outdated
Comment thread parquet/src/encodings/encoding/plain_counter.rs Outdated
Comment thread parquet/src/column/writer/encoder.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/byte_array.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
// Set dictionary fallback to trigger fallback to PLAIN encoding on unfavorable compression
let props = WriterProperties::builder()
.set_dictionary_fallback(DictionaryFallback::OnUnfavorableAfter(1))
.set_data_page_size_limit(1)
Contributor:

Suggested change
.set_data_page_size_limit(1)
.set_data_page_row_count_limit(2)

There's an issue here due to the fact that with these settings the page is flushed before the check for dict fallback is called. Keeping the batch size at 1 but the row count at 2 should allow the check to actually force fallback, resulting in one RLE_DICTIONARY encoded data page and 5 PLAIN encoded.

Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
"Expected a dictionary page"
);

assert!(reader.metadata().offset_index().is_some());
Contributor:

The following is not testing the encoding, merely counting the number of data pages. Rather than this you should be examining the page encoding stats.

        let options = ReadOptionsBuilder::new()
            .with_encoding_stats_as_mask(false)
            .build();
...
        // check page encoding stats, should be one dict page, one dict encoded page, and 9
        // plain encoded pages
        let stats = column[0].page_encoding_stats().unwrap();
        println!("pes: {stats:?}");
        assert!(
            stats
                .iter()
                .any(|s| s.page_type == PageType::DICTIONARY_PAGE)
        );
        let num_dict_encoded: i32 = stats
            .iter()
            .filter(|s| {
                s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::RLE_DICTIONARY
            })
            .map(|s| s.count)
            .sum();
        assert_eq!(num_dict_encoded, 1);
        let num_plain_encoded: i32 = stats
            .iter()
            .filter(|s| {
                s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::PLAIN
            })
            .map(|s| s.count)
            .sum();
        assert_eq!(num_plain_encoded, 9);

Coded this way, the test fails with

thread 'arrow::arrow_writer::tests::arrow_writer_dictionary_fallback_on_unfavorable_compression' (10294973) panicked at parquet/src/arrow/arrow_writer/mod.rs:2649:9:
assertion `left == right` failed
  left: 10
 right: 1

indicating that all pages are dict encoded and fallback did not occur

Contributor:
@etseidl etseidl left a comment

Added a few more comments.

Another thing to think through is that PLAIN isn't the only fallback encoding. If V2 page headers are enabled, I believe we fall back to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Comment thread parquet/src/column/writer/mod.rs Outdated
// Second check, if enabled: the compression heuristic.
// For similar logic in parquet-java,
// see DictionaryValuesWriter.isCompressionSatisfying
if self.encoder.num_values() >= self.dict_fallback_sample_len {
Contributor:

encoder.num_values() will only be how many values are currently buffered, and will be reset when the page is flushed. I'm afraid a large value for the sample len will result in this never evaluating true.

I think instead much of the data should live within the plain counter. It should know the cumulative size of plain encoded data, as it currently does, but it should also maintain the count of values added, and a cumulative value for the number of bytes of RLE encoded data (perhaps updated at page flush time). Then feed in the current size of the dictionary to make the determination.
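
A sketch of the counter shape described here (hypothetical type and method names, not the crate's actual implementation): cumulative plain size, a value count that survives page flushes, and cumulative encoded bytes, with the dictionary size fed in only at decision time.

```rust
// Hypothetical counter for the dictionary-fallback decision.
struct DictFallbackCounter {
    plain_bytes: usize,   // cumulative size if values were PLAIN encoded
    num_values: usize,    // total values seen; NOT reset at page flush
    encoded_bytes: usize, // cumulative dictionary-encoded bytes, added at flush
    sample_len: usize,    // minimum sample before a decision is made
}

impl DictFallbackCounter {
    fn new(sample_len: usize) -> Self {
        Self { plain_bytes: 0, num_values: 0, encoded_bytes: 0, sample_len }
    }

    fn record_values(&mut self, count: usize, plain_bytes: usize) {
        self.num_values += count;
        self.plain_bytes += plain_bytes;
    }

    fn record_page_flush(&mut self, encoded_bytes: usize) {
        self.encoded_bytes += encoded_bytes;
    }

    /// `None` means "sample too small, no decision yet".
    fn should_fall_back(&self, dict_page_size: usize) -> Option<bool> {
        (self.num_values >= self.sample_len)
            .then(|| self.encoded_bytes + dict_page_size >= self.plain_bytes)
    }
}

fn main() {
    let mut c = DictFallbackCounter::new(8_192);
    c.record_values(4_096, 32_768);
    assert_eq!(c.should_fall_back(100_000), None); // sample too small
    c.record_values(4_096, 32_768);
    c.record_page_flush(12_000);
    // 12_000 + 100_000 >= 65_536 -> dictionary is unfavorable
    assert_eq!(c.should_fall_back(100_000), Some(true));
}
```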

Some(dict_encoder) => {
dict_encoder.encode(values, indices);
if let Some(counter) = encoder.plain_data_size_counter.as_mut() {
for idx in indices {
Contributor:

I'm a little worried about performance here. It would be nice if after we've collected enough samples and decided on dict vs fallback, we stop gathering these statistics.

Contributor Author:
@mzabaluev mzabaluev Apr 16, 2026

I had a mind to keep comparing after every encoded data page, for cases when the configured minimal sample is still not indicative of the overall value distribution and the efficiency degrades somewhere farther down the page chunk. But I understand the concern. Since this behavior is tunable per column through the writer API, I think it's OK to cut counting. For consistency, this should be also done in the generic encoder, I assume?

Contributor:

Yes, I didn't want to flag it in both places. 😄

Contributor:

I had a mind to keep comparing after every encoded data page, for cases when the configured minimal sample is still not indicative of the overall value distribution and the efficiency degrades somewhere farther down the page chunk.

Fair...but then perhaps the size limit will catch it. In any event, we should stop collecting after we have actually fallen back 😉

Contributor Author:

we should stop collecting after we have actually fallen back 😉

That's already the case, with the plain_data_size_counter member set to None in both flush_dict_page implementations, and the collecting is also not happening in the put methods in case there is no dictionary. Though if I implement a fix for #9739, this may need to be refactored.

Contributor Author:

The counting shuts down after reaching the sample size threshold in 3ff12c8.

@mzabaluev mzabaluev marked this pull request as draft April 16, 2026 21:13
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
.set_dictionary_page_size_limit(1024 * 1024)
.set_column_dictionary_fallback(
ColumnPath::from("col0"),
DictionaryFallback::OnUnfavorableAfter(8192),
Contributor:

If this is set to a value larger than a page (30000 say), then fallback occurs only after the dictionary gets too large.

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@mzabaluev (Contributor Author):

If V2 page headers are enabled, I believe we fallback to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Since this is only a heuristic, and the wrong decision is not fatal, I thought that the estimation does not have to be perfect. The plain encoded size is easy and quick to compute – no need to even read the values for fixed-length types – and it gives a good approximation of the worst case (all the other encodings were invented to improve over the plain one, after all). I'll think of further developing this by giving a cheaply computed upper size bound for the actually used fallback encoding, but I don't want to make it too precise at the cost of extra computation and memory reads.


etseidl commented Apr 16, 2026

If V2 page headers are enabled, I believe we fallback to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Since this is only a heuristic, and the wrong decision is not fatal, I thought that the estimation does not have to be perfect. The plain encoded size is easy and quick to compute – no need to even read the values for fixed-length types – and it gives a good approximation of the worst case (all the other encodings were invented to improve over the plain one, after all). I'll think of further developing this by giving a cheaply computed upper size bound for the actually used fallback encoding, but I don't want to make it too precise at the cost of extra computation and memory reads.

I think that's fine for now, and probably always ok for string columns (well, if they fallback to DELTA_LENGTH_BYTE_ARRAY at least). And as you say, the worst case here is sticking with dictionary when perhaps DELTA_BINARY_PACKED might be superior. Then again, these are just defaults, and power users should know their data and pick encodings appropriate to their use cases. (Or use something like https://github.com/XiangpengHao/parquet-linter)

@mzabaluev (Contributor Author):

For delta-based encodings there is this language in the specification:

Writers must not use more bits when bit packing the miniblock data than would be required to PLAIN encode the physical type (e.g. INT32 data must not use more than 32 bits).

So the estimate for the plain encoding should work as a pessimistic estimate for DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY.
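
A quick numeric check of that bound for INT32 (headers aside; the spec caps the miniblock bit width at the physical type's width, so 32 bits here):

```rust
// Bytes needed to bit-pack `num_values` values at `bit_width` bits each.
fn packed_bytes(num_values: usize, bit_width: usize) -> usize {
    (num_values * bit_width + 7) / 8
}

fn main() {
    let num_values = 20_000;
    let plain_bound = num_values * 4; // PLAIN: 4 bytes per INT32 value
    // For every legal bit width, the packed values never exceed PLAIN size.
    for bit_width in 0..=32 {
        assert!(packed_bytes(num_values, bit_width) <= plain_bound);
    }
}
```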

The web-edited suggestions did not include the import.
Rename the ParquetValueType::dict_encoding_size trait method
to plain_encoded_size, to better reflect its usage since it's also
used to calculate the comparative size of the plain encoding for the
dictionary fallback heuristic.
Fix the counting logic that mistakenly relied on values that are reset
with every flushed page. Move the fallback decision logic to the
DictFallbackCounter implementation which supersedes
PlainDataSizeCounter.
Once the sample size reaches the configured minimum, shut down
the fallback counter even if the dictionary encoding is still favorable,
to avoid the accounting overhead.
@mzabaluev mzabaluev marked this pull request as ready for review April 17, 2026 22:56

Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Parquet dictionary fallback heuristics

4 participants