
Parquet Incremental Sync#768

Open
sapienza88 wants to merge 129 commits into apache:main from sapienza88:parquet_incr_sync

Conversation

@sapienza88
Contributor

@sapienza88 sapienza88 commented Dec 10, 2025

What is the purpose of the pull request

Adds incremental syncing ability to the ParquetSource

Brief change log

  • Adds a new class, ParquetDataManager.java, that handles fetching data files for the Parquet source
  • Updates the IT to include an incremental source

Verify this pull request

  • new tests added to ITParquetConversionSource

@sapienza88 sapienza88 changed the title from "Parquet Incremental Sync: Given a parquet file return data from a certain modification time" to "Parquet Incremental Sync" on Dec 10, 2025
@rahil-c
Contributor

rahil-c commented Dec 15, 2025

I can do the first review for this @the-other-tim-brown @vinishjail97

@vinishjail97 vinishjail97 self-requested a review December 16, 2025 08:31
Comment thread xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java Outdated
Comment thread xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java Outdated
Comment thread xtable-core/src/main/java/org/apache/xtable/parquet/ParquetFileConfig.java Outdated
Comment thread xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java Outdated
Comment on lines +245 to +259
try (ParquetWriter<Group> writer =
    new ParquetWriter<Group>(
        outputFile,
        new GroupWriteSupport(),
        parquetFileConfig.getCodec(),
        (int) parquetFileConfig.getRowGroupSize(),
        pageSize,
        pageSize, // dictionaryPageSize
        true, // enableDictionary
        false, // enableValidation
        ParquetWriter.DEFAULT_WRITER_VERSION,
        conf)) {
  Group currentGroup = null;
  while ((currentGroup = (Group) reader.read()) != null) {
    writer.write(currentGroup);
Contributor

Why are we writing new parquet files again like this through the writer? I think there's some misunderstanding with the parquet incremental sync feature here.

Parquet Incremental Sync Requirements.

  1. You have a target table where parquet files [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet] have been synced to Hudi, Iceberg, and Delta, for example.
  2. In the source, some changes have been made: a new file in partition p1 was added and p2's file was deleted. The incremental sync should now sync these changes incrementally.

@sapienza88 It's better to align on the approach first here before we push PRs. Can you add the approach for parquet incremental sync to the PR description or a Google doc if possible?
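The two requirements above reduce to a pair of set differences between the already-synced file listing and the current one. A minimal stdlib-only sketch (not the PR's code; class and variable names are illustrative, and the paths come from the example above):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class IncrementalDiffSketch {
    public static void main(String[] args) {
        // Files already synced to the target table formats.
        Set<String> synced = new HashSet<>(Arrays.asList(
                "p1/f1.parquet", "p1/f2.parquet", "p2/f1.parquet"));
        // Current source listing: p1/f3.parquet was added, p2/f1.parquet was deleted.
        Set<String> current = new HashSet<>(Arrays.asList(
                "p1/f1.parquet", "p1/f2.parquet", "p1/f3.parquet"));

        // Files needing metadata generated = current minus synced.
        Set<String> added = new TreeSet<>(current);
        added.removeAll(synced);
        // Files whose metadata must be retired = synced minus current.
        Set<String> removed = new TreeSet<>(synced);
        removed.removeAll(current);

        System.out.println(added);   // [p1/f3.parquet]
        System.out.println(removed); // [p2/f1.parquet]
    }
}
```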

Contributor

@vinishjail97 vinishjail97 Dec 22, 2025

@sapienza88 XTable shouldn't be writing any new data or parquet files it operates at a metadata level. Can you see this comment for reference?
#550 (comment)
Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files based on their creationTime is the simplest way, but it's expensive.

Contributor Author

@sapienza88 sapienza88 Dec 23, 2025

@vinishjail97 thanks for the suggestion, but that isn't helping me yet. Could you elaborate on that idea and on how to manage metadata only for the task of retrieving data from a particular modification time? At the very least, the current ConversionSource wasn't coded with that in mind.

@sapienza88
Contributor Author

@vinishjail97 I added some comments on the functions so that the approach is clearer. All of the above suggestions were also addressed in my last commit.

@vinishjail97
Contributor

XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference? I had written a few approaches on how to do incremental parquet sync.
#550 (comment)

@vinishjail97
Contributor

@sapienza88 I'm adding a more detailed design and a class level structure to unblock this PR.

Design Principle
XTable operates at a metadata level only. The current PR approach of writing new Parquet files with filtered data is incorrect. XTable should:

  • Discover existing Parquet files from storage
  • Generate table format metadata (Hudi, Iceberg, Delta) for those files
  • NEVER write new Parquet files or transform data.

Architecture

  ┌────────────────────────────────────────────────────────────┐
  │                  ParquetConversionSource                   │
  │  - Uses ParquetFileDiscovery to find files                 │
  │  - Converts file metadata to InternalDataFile              │
  │  - Returns snapshots and table changes                     │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │              ParquetFileDiscovery (new class)              │
  │  - Lists all .parquet files from filesystem                │
  │  - Filters files by modification time                      │
  │  - Returns lightweight file metadata                       │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │            FileSystem (HDFS/S3/GCS/Azure)                  │
  │  - fs.listFiles(basePath, recursive=true)                  │
  └────────────────────────────────────────────────────────────┘

Use the file modification time as the commit identifier; with it you can identify which files have been synced and which haven't. The files that haven't been synced need metadata generated for them. Future functionality, like optimizing the listing and handling parquet files deleted from storage, can be added incrementally; I'm hoping to keep the scope low for this PR.
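The ParquetFileDiscovery box above boils down to "list recursively, filter by modification time". A minimal sketch of that filtering, using java.nio.file as a local stand-in for Hadoop's FileSystem.listFiles (the real class would work on FileStatus objects from HDFS/S3/GCS/Azure; all names here are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileDiscoverySketch {
    /** Returns .parquet files under basePath modified strictly after lastSyncInstant (epoch millis). */
    static List<Path> filesModifiedAfter(Path basePath, long lastSyncInstant) throws IOException {
        try (Stream<Path> walk = Files.walk(basePath)) { // recursive listing
            return walk.filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".parquet"))
                    .filter(p -> {
                        try {
                            return Files.getLastModifiedTime(p).toMillis() > lastSyncInstant;
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("parquet-table");
        Path p1 = Files.createDirectories(base.resolve("p1"));
        Path oldFile = Files.createFile(p1.resolve("f1.parquet"));
        Path newFile = Files.createFile(p1.resolve("f2.parquet"));
        // Simulate an already-synced file by backdating its modification time.
        Files.setLastModifiedTime(oldFile, FileTime.fromMillis(1000L));
        Files.setLastModifiedTime(newFile, FileTime.fromMillis(5000L));

        List<Path> pending = filesModifiedAfter(base, 1000L); // last sync at t=1000
        System.out.println(pending.size()); // 1 — only f2.parquet is newer
    }
}
```

Only the files returned here would have table-format metadata generated; no data is read or rewritten.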

}
}

@Test
Contributor

Line 184 needs to be updated to include INCREMENTAL as well

Contributor Author

done!

Comment thread xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java Outdated
* parquet files and filtering the files based on the modification times.
*/
@Log4j2
@RequiredArgsConstructor
Contributor

Nit — exposing both the Lombok-generated 3-arg ctor (@RequiredArgsConstructor) and an explicit 2-arg ctor creates ambiguity about the public API. Production code calls the 2-arg form; tests call the 3-arg form. Consider dropping @RequiredArgsConstructor and either annotating the 3-arg ctor @VisibleForTesting or using a package-private static factory method for tests.
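Something like the following shape, sketched with stdlib types only (field names and the Clock dependency are placeholders, not the actual ParquetDataManager fields):

```java
import java.time.Clock;

public class DataManagerSketch {
    private final String basePath;
    private final long pageSize;
    private final Clock clock; // test-only dependency, hidden from the public API

    // Production entry point: the only public constructor.
    public DataManagerSketch(String basePath, long pageSize) {
        this(basePath, pageSize, Clock.systemUTC());
    }

    // Package-private factory for tests, replacing the Lombok-generated 3-arg ctor.
    static DataManagerSketch forTesting(String basePath, long pageSize, Clock clock) {
        return new DataManagerSketch(basePath, pageSize, clock);
    }

    private DataManagerSketch(String basePath, long pageSize, Clock clock) {
        this.basePath = basePath;
        this.pageSize = pageSize;
        this.clock = clock;
    }

    String describe() {
        return basePath + ":" + pageSize;
    }

    public static void main(String[] args) {
        DataManagerSketch prod = new DataManagerSketch("/data/table", 1048576L);
        System.out.println(prod.describe()); // /data/table:1048576
    }
}
```

This keeps one obvious construction path for callers while tests still get full control over the injected dependency.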

Contributor Author

I'll leave the rest of your comments here for somebody else to address.

RowFactory.create(103, "BA", 2027, 11));

Dataset<Row> dfInit = sparkSession.createDataFrame(data, schema);
Path fixedPath = Paths.get("target", "fixed-parquet-data", "parquet_table_test_2");
Contributor

Relative path Paths.get("target", "fixed-parquet-data", "parquet_table_test_2") pollutes the workspace across test runs, isn't cleaned up, and makes the test order-dependent when re-run without ./gradlew clean. This class already uses the @TempDir pattern — please use it here too. Also drop the commented-out // String outputPath = fixedPath.toString(); on line 457.
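For reference, a stdlib-only sketch of the isolation the temp-dir pattern gives (JUnit's @TempDir injects and cleans the directory for you; this just shows the idea, with illustrative names):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirSketch {
    public static void main(String[] args) throws IOException {
        // Unique per run, so re-running without a clean build leaves no stale state.
        Path tempDir = Files.createTempDirectory("parquet-table-test");
        Path outputPath = tempDir.resolve("parquet_table_test_2");
        Files.createDirectories(outputPath);
        System.out.println(Files.isDirectory(outputPath)); // true
    }
}
```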

assertNotNull(result);
List<ParquetFileInfo> fileList = result.collect(Collectors.toList());
assertEquals(3, fileList.size());
assertEquals(1000L, fileList.get(0).getModificationTime());
Contributor

Nit — asserting positional ordering here relies on RemoteIterator + Collectors.toList() preserving the mock insertion order. In production, FS listing order is platform-dependent and not guaranteed. Either sort inside getCurrentFilesInfo() (and document the ordering contract) or switch these assertions to Set<Long> comparisons. Same pattern applies to testGetParquetFilesMetadataAfterTime_someMatch and _exactTimeMatch.
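Concretely, the order-independent variant looks like this (values and names are illustrative, not the test's actual data):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderIndependentAssertSketch {
    public static void main(String[] args) {
        // Listing order from the filesystem is platform-dependent.
        List<Long> modTimes = Arrays.asList(3000L, 1000L, 2000L);

        // Comparing as sets makes the assertion hold regardless of ordering.
        Set<Long> actual = new HashSet<>(modTimes);
        Set<Long> expected = new HashSet<>(Arrays.asList(1000L, 2000L, 3000L));
        System.out.println(actual.equals(expected)); // true
    }
}
```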


4 participants