
feat(parquet): dictionary fallback heuristic based on compression efficiency#9700

Open
mzabaluev wants to merge 19 commits into apache:main from mzabaluev:parquet-dict-fallback-heuristic

Conversation


@mzabaluev mzabaluev commented Apr 13, 2026

Which issue does this PR close?

What changes are included in this PR?

Added a ColumnProperties option, dictionary_fallback, which takes a DictionaryFallback enum value.
Two behavior variants are provided (the enum is initially non-exhaustive so that more can be added later if necessary):

  • OnPageSizeLimit - the prior behavior and the default, triggers fallback on exceeding the dictionary page size limit.
  • OnUnfavorableAfter - a new behavior, includes the page size limit check and adds a check for encoded size not exceeding the plain data size.

Implemented the new optional behavior in the encoder.

Are these changes tested?

Added new tests exercising the OnUnfavorableAfter behavior.
The existing tests exercise OnPageSizeLimit.

Are there any user-facing changes?

Added API in parquet:

  • The DictionaryFallback enum
  • ColumnProperties::dictionary_fallback, ColumnProperties::set_dictionary_fallback
  • WriterPropertiesBuilder::set_dictionary_fallback, WriterPropertiesBuilder::set_column_dictionary_fallback
  • WriterProperties::dictionary_fallback
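
As a rough illustration of what the new heuristic decides (a self-contained sketch; the function name and thresholds here are illustrative, not the crate's internals), the OnUnfavorableAfter check boils down to a size comparison once a minimum sample has been seen:

```rust
// Hedged sketch of the fallback rule: after at least `sample_len` values,
// fall back to PLAIN when dictionary encoding is no longer smaller than
// plain encoding would have been.
fn should_fall_back(
    num_values: usize,
    sample_len: usize,
    plain_size: usize,        // cumulative size if the values were PLAIN encoded
    dict_encoded_size: usize, // cumulative size of the dictionary-encoded indices
    dict_page_size: usize,    // current size of the dictionary page itself
) -> bool {
    num_values >= sample_len && dict_encoded_size + dict_page_size >= plain_size
}

fn main() {
    // High-cardinality data: dictionary overhead dominates -> fall back.
    assert!(should_fall_back(10_000, 8_192, 80_000, 20_000, 70_000));
    // Low-cardinality data: dictionary clearly wins -> keep it.
    assert!(!should_fall_back(10_000, 8_192, 80_000, 20_000, 1_000));
    // Sample threshold not reached yet -> no decision.
    assert!(!should_fall_back(4_000, 8_192, 80_000, 20_000, 70_000));
}
```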

In parquet, add `dictionary_fallback` option to `ColumnProperties`.
Its value type is defined as the `DictionaryFallback` enum with
the variants `OnDataPageSize` (the previous behavior, the default)
and `OnUnfavorableCompression`. The latter replicates the behavior of
parquet-java, which falls back to non-dictionary encoding when the
estimated compressed size of the data page with dictionary encoding is
not smaller than the estimated size of the data page with
plain encoding.
The method as it is used now does not need to return the tuple of
base size and number of elements; instead, just compute the encoded
size as appropriate for the type.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 13, 2026
It's not really necessary because the counter is only used while
dictionary encoding is enabled, but it's good to safeguard against
future refactoring.

etseidl commented Apr 13, 2026

Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated.

I looked at the Java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so applying this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to limit pages to 20,000 rows by default (this crate adopted the 20k default quite some time later). IMO, 20,000 values is too small a sample to decide whether a dictionary is having a beneficial effect. Say one has a relatively low cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k-row page, I think the heuristic here will almost certainly choose plain over dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far.

I like that this is opt-in, but then wonder if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question.
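
The 32k-cardinality arithmetic above can be sketched numerically (back-of-envelope figures only, ignoring page headers and any RLE run compression of the indices):

```rust
const PAGE_ROWS: u64 = 20_000;
const DICT_PAGE: u64 = 32_768 * 8; // 32k-entry dictionary, PLAIN-encoded i64s
const BITS_PER_INDEX: u64 = 15;    // ceil(log2(32_768)) bits per dictionary index

fn plain_size(pages: u64) -> u64 {
    pages * PAGE_ROWS * 8 // 8 bytes per PLAIN i64
}

fn dict_size(pages: u64) -> u64 {
    DICT_PAGE + pages * PAGE_ROWS * BITS_PER_INDEX / 8
}

fn main() {
    // After a single 20k-row page the dictionary looks like a loss:
    assert!(dict_size(1) > plain_size(1)); // ~300 KB vs 160 KB
    // ...but over ten pages dictionary encoding wins by more than 2x:
    assert!(2 * dict_size(10) < plain_size(10)); // ~637 KB vs 1.6 MB
}
```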

Comment thread parquet/src/data_type.rs
(std::mem::size_of::<u32>(), self.len())
fn dict_encoding_size(&self) -> usize {
4 + self.len()
}
Contributor:

Use size_of perhaps?

Contributor Author:

It's a u32, so it will never be different from 4.

Contributor:

yes, so you should use size_of, which is clear why^

Contributor:

To be fair, there's magic all over this module. But I agree with @EmilyMatt there's no need to add more.

Comment thread parquet/src/data_type.rs Outdated
}

fn dict_encoding_size(&self) -> usize {
12
Contributor:

Magic?

Contributor Author:

This is a type that encodes to 96 bits. The compound Rust type is furthermore not declared with a repr that nails down its in-memory size (the compiler might decide to align such small arrays to 16 bytes one day and jack up the size accordingly, for example) so I'd argue using size_of would not be squeaky-clean here.

Contributor:
@EmilyMatt EmilyMatt Apr 14, 2026

I understand, so a const would be enough, just not a magic number.

Contributor Author:

OK, the array alignment is nailed down in the spec, but Int96 wrapping the array is a struct with a Rust repr, so... 🤷 Could be a case for declaring the struct with repr(packed), but I'd rather argue we use explicit numbers here because the encoding is not a straight memory copy.

Contributor Author:

Good suggestion about the const, I've added Int96::SIZE_IN_BYTES for this purpose.
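
A minimal sketch of the named-constant approach (this `Int96` is a stand-in, not the crate's type): the encoded width is fixed by the Parquet format at 96 bits, independent of the wrapper struct's in-memory layout.

```rust
// Stand-in for the crate's Int96; the field is unused in this sketch.
#[allow(dead_code)]
struct Int96([u32; 3]);

impl Int96 {
    /// Encoded width of an INT96 value: three little-endian u32 words.
    /// Deliberately derived from the encoding, not from size_of::<Int96>().
    const SIZE_IN_BYTES: usize = 3 * std::mem::size_of::<u32>();
}

fn main() {
    assert_eq!(Int96::SIZE_IN_BYTES, 12);
}
```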

Comment thread parquet/src/data_type.rs Outdated
Ok(values_read)
}

fn dict_encoding_size(&self) -> usize {
Contributor:

Magic?

Contributor Author:

Hmm, it's not only that using size_of would be dubious, the encoding actually uses one bit per value, so this method is a leaky abstraction. I'll see about reworking it.

Contributor Author:

The dictionary encoding is never used on bool types, so this just needs some clarifying comments.

Contributor Author:

... or, better, a panic because this method should never be called for BoolType: the column encoder does a check to fall back to plain/RLE in this case.

Contributor:

Yeah, this probably needs an unreachable!().
Though I believe a main principle is that this project should sit really low in the dependency graph, which argues against panics. Maybe a const here would be fine as well, or 0, but I fear that may cause a never-ending block/row-group if this assumption ever changes (for example, when encoding an all-true group or something).
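
A sketch of the panic backstop under discussion (a hypothetical free function, not the trait method itself): since the column encoder falls back to PLAIN/RLE for booleans before dictionary encoding is ever attempted, any returned size would violate the method's contract.

```rust
// Hypothetical backstop for the BOOLEAN case: reaching this is a bug.
fn bool_dict_encoding_size() -> usize {
    unreachable!("dictionary encoding is never used for BOOLEAN columns")
}

fn main() {
    // Calling it panics rather than returning a bogus size.
    let result = std::panic::catch_unwind(bool_dict_encoding_size);
    assert!(result.is_err());
}
```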


mzabaluev commented Apr 14, 2026

Good points @etseidl. Our motivation for adding this is that in some cases, e.g. with high cardinality, the Rust parquet writer produces much larger encoded Parquet output than the Spark workloads we're aiming to replace. So a default option enabling a heuristic akin to the one hardcoded into parquet-java would get us on par (or maybe better, because this implementation may choose to fall back at any page in the chunk).


etseidl commented Apr 14, 2026

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

In dict_encoded_size, use size_of and symbolic constants instead of
hardcoded values. For the boolean type, backstop with a panic
since any returned value would be bogus for this method's contract
(as plain encoding for BOOLEAN is bit-packed).
The dictionary encoding is never used for boolean values.
@mzabaluev (Contributor Author):

I wonder if this could be adapted to take a larger sample before deciding. Maybe raise the number of values per page if enabled.

Can an educated programmer achieve this with other existing properties? There may be adverse effects to tweaking other defaults in tandem with this one. If you mean data_page_row_count_limit, that is a writer property, so I'd be hesitant to auto-adjust it based on any column properties.

I'd rather have the option to choose between:

  1. The default simple behavior, i.e. OnPageSizeLimit;
  2. The Java-like heuristic to closely follow Spark and other systems using parquet-java (however, a bit-for-bit workalike would require a larger rework of the logic and is perhaps not desirable);
  3. Better adaptive behaviors, if and when implemented.

For 3, I have purposefully left the DictionaryFallback enum open-ended.
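
For illustration, an open-ended enum along the lines the PR describes might look like this (the variant names follow the PR text; the payload type and derives are assumptions, not the crate's exact definition):

```rust
// Sketch of an open-ended fallback policy enum.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DictionaryFallback {
    /// Prior behavior: fall back when the dictionary page size limit is hit.
    OnPageSizeLimit,
    /// Java-like heuristic: also fall back when, after at least this many
    /// values, dictionary encoding is no smaller than plain encoding.
    OnUnfavorableAfter(usize),
}

fn main() {
    let fallback = DictionaryFallback::OnUnfavorableAfter(8_192);
    // `#[non_exhaustive]` lets new adaptive variants be added later without
    // breaking downstream matches (they must carry a wildcard arm).
    match fallback {
        DictionaryFallback::OnPageSizeLimit => unreachable!(),
        DictionaryFallback::OnUnfavorableAfter(n) => assert_eq!(n, 8_192),
        #[allow(unreachable_patterns)]
        _ => {}
    }
}
```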

Contributor:
@etseidl etseidl left a comment

I sympathize with wanting to be a bit smarter about when to give up on dictionary encoding. I would, however, like to see something a bit more defensible before proceeding with this change. For example, I'd like to see examples where this heuristic outperforms the current defaults by more than 10%, say, and also outperforms disabling dictionary encoding altogether (something which is already an opt-in option, as this new heuristic would be).

Comment thread parquet/src/column/writer/encoder.rs Outdated
}

#[test]
fn test_dict_page_size_decided_by_compression_fallback() {
Contributor:
@etseidl etseidl Apr 14, 2026

As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.

I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller.

I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.

Edit: I did a test of spark with the latter file (32k cardinality). By default, it opts to fallback for all pages, so the file is even larger. If I modify the global parquet.page.row.count.limit to 132000, it then opts for dictionary encoding as it should.

Contributor Author:

I have modified the test in 1b6dd37 to demonstrate a case where even an early fallback decision yields about 12% better compression. But I generally agree with your assessment, so more work is needed.

Another quirk is seen in this test: a dictionary page is still flushed to encode the first data page, even though there is no benefit. Parquet-java takes care to hand over the accumulated values to the plain encoder to be re-encoded.

/// that the writer should fall back to the plain encoding when at a certain point,
/// e.g. after encoding the first batch, the total size of unencoded data
/// is calculated as smaller than `(encodedSize + dictionarySize)`.
pub struct PlainDataSizeCounter {
Contributor:

To my comment about sample size, perhaps this can be adapted to keep track of both the plain and dictionary encoded sizes, and then only make a decision after some critical number of rows or bytes have been processed.

Contributor Author:
@mzabaluev mzabaluev Apr 14, 2026

I can add a value governing this (as the minimum number of rows) directly into the OnUnfavorableCompression variant. Or perhaps, add another property alongside for future extensibility, since it's not clear if everyone would want it to be rows and there'd be a preference for the byte size threshold later, and I don't want to introduce a nested open-ended property struct.

Contributor:

Perhaps just values entered to cover lists as well. I think as long as there are a sufficient number of samples with which to make a decision, this could be quite nice in the end.

Contributor Author:

Done in da73778.

Contributor:

Thanks @mzabaluev, I'll give it a look later today


To plain_encoded_data_size as suggested in review.
Modify the compression fallback test to illustrate the benefit in
an admittedly differently contrived case. The heuristic borrowed from
parquet-java is still not ideal for all cases, so we'll need more
configurability.
Change OnUnfavorableCompression to the variant carrying a value
specifying the minimal sample length, to give the user more control
over when to fall back to the plain encoding.
@mzabaluev-flarion force-pushed the parquet-dict-fallback-heuristic branch from 98025cc to da73778 on April 16, 2026 04:56

alamb commented Apr 16, 2026

I agree that benchmarks / explanations of the theoretical foundation for this change are important, rather than just blindly doing what the Java implementation does.

Contributor:
@etseidl etseidl left a comment

Flushing a batch of comments. I like the direction this is moving, but it still needs some work.

How do you feel about changing this to draft while we iterate?

Comment thread parquet/src/data_type.rs Outdated
(std::mem::size_of::<Self>(), 1)
}
/// Return the size in bytes for the value encoded in the dictionary.
fn dict_encoding_size(&self) -> usize;
Contributor:

I never really paid attention to this before, but it's a curious name for this function. Since the dictionary page is PLAIN encoded, this is really the plain encoded size (and you actually use it as such in the counter). No need to change the name, but perhaps the docstring could explain this.

Comment thread parquet/src/data_type.rs Outdated
Comment thread parquet/src/encodings/encoding/plain_counter.rs Outdated
Comment thread parquet/src/column/writer/encoder.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/byte_array.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
// Set dictionary fallback to trigger fallback to PLAIN encoding on unfavorable compression
let props = WriterProperties::builder()
.set_dictionary_fallback(DictionaryFallback::OnUnfavorableAfter(1))
.set_data_page_size_limit(1)
Contributor:

Suggested change
.set_data_page_size_limit(1)
.set_data_page_row_count_limit(2)

There's an issue here due to the fact that with these settings the page is flushed before the check for dict fallback is called. Keeping the batch size at 1 but the row count at 2 should allow the check to actually force fallback, resulting in one RLE_DICTIONARY encoded data page and 5 PLAIN encoded.

Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
"Expected a dictionary page"
);

assert!(reader.metadata().offset_index().is_some());
Contributor:

The following is not testing the encoding, merely counting the number of data pages. Rather than this you should be examining the page encoding stats.

        let options = ReadOptionsBuilder::new()
            .with_encoding_stats_as_mask(false)
            .build();
...
        // check page encoding stats, should be one dict page, one dict encoded page, and 9
        // plain encoded pages
        let stats = column[0].page_encoding_stats().unwrap();
        println!("pes: {stats:?}");
        assert!(
            stats
                .iter()
                .any(|s| s.page_type == PageType::DICTIONARY_PAGE)
        );
        let num_dict_encoded: i32 = stats
            .iter()
            .filter(|s| {
                s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::RLE_DICTIONARY
            })
            .map(|s| s.count)
            .sum();
        assert_eq!(num_dict_encoded, 1);
        let num_plain_encoded: i32 = stats
            .iter()
            .filter(|s| {
                s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::PLAIN
            })
            .map(|s| s.count)
            .sum();
        assert_eq!(num_plain_encoded, 9);

Coded this way, the test fails with

thread 'arrow::arrow_writer::tests::arrow_writer_dictionary_fallback_on_unfavorable_compression' (10294973) panicked at parquet/src/arrow/arrow_writer/mod.rs:2649:9:
assertion `left == right` failed
  left: 10
 right: 1

indicating that all pages are dict encoded and fallback did not occur

Contributor:
@etseidl etseidl left a comment

Added a few more comments.

Another thing to think through is that PLAIN isn't the only fallback encoding. If V2 page headers are enabled, I believe we fall back to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Comment thread parquet/src/column/writer/mod.rs Outdated
// Second check, if enabled: the compression heuristic.
// For similar logic in parquet-java,
// see DictionaryValuesWriter.isCompressionSatisfying
if self.encoder.num_values() >= self.dict_fallback_sample_len {
Contributor:

encoder.num_values() will only be how many values are currently buffered, and will be reset when the page is flushed. I'm afraid a large value for the sample len will result in this never evaluating true.

I think instead much of the data should live within the plain counter. It should know the cumulative size of plain encoded data, as it currently does, but it should also maintain the count of values added, and a cumulative value for the number of bytes of RLE encoded data (perhaps updated at page flush time). Then feed in the current size of the dictionary to make the determination.
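
A sketch of the counter shape described here (hypothetical type and method names, not the crate's actual implementation): cumulative plain size, a value count that survives page flushes, and cumulative encoded bytes, with the dictionary size fed in only at decision time.

```rust
// Hypothetical counter for the dictionary-fallback decision.
struct DictFallbackCounter {
    plain_bytes: usize,   // cumulative size if values were PLAIN encoded
    num_values: usize,    // total values seen; NOT reset at page flush
    encoded_bytes: usize, // cumulative dictionary-encoded bytes, added at flush
    sample_len: usize,    // minimum sample before a decision is made
}

impl DictFallbackCounter {
    fn new(sample_len: usize) -> Self {
        Self { plain_bytes: 0, num_values: 0, encoded_bytes: 0, sample_len }
    }

    fn record_values(&mut self, count: usize, plain_bytes: usize) {
        self.num_values += count;
        self.plain_bytes += plain_bytes;
    }

    fn record_page_flush(&mut self, encoded_bytes: usize) {
        self.encoded_bytes += encoded_bytes;
    }

    /// `None` means "sample too small, no decision yet".
    fn should_fall_back(&self, dict_page_size: usize) -> Option<bool> {
        (self.num_values >= self.sample_len)
            .then(|| self.encoded_bytes + dict_page_size >= self.plain_bytes)
    }
}

fn main() {
    let mut c = DictFallbackCounter::new(8_192);
    c.record_values(4_096, 32_768);
    assert_eq!(c.should_fall_back(100_000), None); // sample too small
    c.record_values(4_096, 32_768);
    c.record_page_flush(12_000);
    // 12_000 + 100_000 >= 65_536 -> dictionary is unfavorable
    assert_eq!(c.should_fall_back(100_000), Some(true));
}
```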

Some(dict_encoder) => {
dict_encoder.encode(values, indices);
if let Some(counter) = encoder.plain_data_size_counter.as_mut() {
for idx in indices {
Contributor:

I'm a little worried about performance here. It would be nice if after we've collected enough samples and decided on dict vs fallback, we stop gathering these statistics.

Contributor Author:
@mzabaluev mzabaluev Apr 16, 2026

I had a mind to keep comparing after every encoded data page, for cases when the configured minimal sample is still not indicative of the overall value distribution and the efficiency degrades somewhere farther down the page chunk. But I understand the concern. Since this behavior is tunable per column through the writer API, I think it's OK to cut counting. For consistency, this should be also done in the generic encoder, I assume?

Contributor:

Yes, I didn't want to flag it in both places. 😄

Contributor:

I had a mind to keep comparing after every encoded data page, for cases when the configured minimal sample is still not indicative of the overall value distribution and the efficiency degrades somewhere farther down the page chunk.

Fair...but then perhaps the size limit will catch it. In any event, we should stop collecting after we have actually fallen back 😉

Contributor Author:

we should stop collecting after we have actually fallen back 😉

That's already the case, with the plain_data_size_counter member set to None in both flush_dict_page implementations, and the collecting is also not happening in the put methods in case there is no dictionary. Though if I implement a fix for #9739, this may need to be refactored.

Contributor Author:

The counting shuts down after reaching the sample size threshold in 3ff12c8.

@mzabaluev mzabaluev marked this pull request as draft April 16, 2026 21:13
Comment thread parquet/src/arrow/arrow_writer/mod.rs Outdated
.set_dictionary_page_size_limit(1024 * 1024)
.set_column_dictionary_fallback(
ColumnPath::from("col0"),
DictionaryFallback::OnUnfavorableAfter(8192),
Contributor:

If this is set to a value larger than a page (30000 say), then fallback occurs only after the dictionary gets too large.

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@mzabaluev (Contributor Author):

If V2 page headers are enabled, I believe we fallback to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Since this is only a heuristic, and the wrong decision is not fatal, I thought that the estimation does not have to be perfect. The plain encoded size is easy and quick to compute – no need to even read the values for fixed-length types – and it gives a good approximation of the worst case (all the other encodings were invented to improve over the plain one, after all). I'll think of further developing this by giving a cheaply computed upper size bound for the actually used fallback encoding, but I don't want to make it too precise at the cost of extra computation and memory reads.


etseidl commented Apr 16, 2026

If V2 page headers are enabled, I believe we fallback to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.

Since this is only a heuristic, and the wrong decision is not fatal, I thought that the estimation does not have to be perfect. The plain encoded size is easy and quick to compute – no need to even read the values for fixed-length types – and it gives a good approximation of the worst case (all the other encodings were invented to improve over the plain one, after all). I'll think of further developing this by giving a cheaply computed upper size bound for the actually used fallback encoding, but I don't want to make it too precise at the cost of extra computation and memory reads.

I think that's fine for now, and probably always ok for string columns (well, if they fallback to DELTA_LENGTH_BYTE_ARRAY at least). And as you say, the worst case here is sticking with dictionary when perhaps DELTA_BINARY_PACKED might be superior. Then again, these are just defaults, and power users should know their data and pick encodings appropriate to their use cases. (Or use something like https://github.com/XiangpengHao/parquet-linter)

@mzabaluev (Contributor Author):

For delta-based encodings there is this language in the specification:

Writers must not use more bits when bit packing the miniblock data than would be required to PLAIN encode the physical type (e.g. INT32 data must not use more than 32 bits).

So the estimate for the plain encoding should work as a pessimistic estimate for DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY.
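
A quick numeric check of that bound for INT32 (headers aside; the spec caps the miniblock bit width at the physical type's width, so 32 bits here):

```rust
// Bytes needed to bit-pack `num_values` values at `bit_width` bits each.
fn packed_bytes(num_values: usize, bit_width: usize) -> usize {
    (num_values * bit_width + 7) / 8
}

fn main() {
    let num_values = 20_000;
    let plain_bound = num_values * 4; // PLAIN: 4 bytes per INT32 value
    // For every legal bit width, the packed values never exceed PLAIN size.
    for bit_width in 0..=32 {
        assert!(packed_bytes(num_values, bit_width) <= plain_bound);
    }
}
```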

The web-edited suggestions did not include the import.
Rename the ParquetValueType::dict_encoding_size trait method
to plain_encoded_size, to better reflect its usage since it's also
used to calculate the comparative size of the plain encoding for the
dictionary fallback heuristic.
Fix the counting logic that mistakenly relied on values that are reset
with every flushed page. Move the fallback decision logic to the
DictFallbackCounter implementation which supersedes
PlainDataSizeCounter.
Once the sample size reaches the configured minimum, shut down
the fallback counter even if the dictionary encoding is still favorable,
to avoid the accounting overhead.
@mzabaluev mzabaluev marked this pull request as ready for review April 17, 2026 22:56

Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Parquet dictionary fallback heuristics

4 participants