Skip to content

feat(patch_set): support git diff application#59

Draft
weihanglo wants to merge 10 commits intobmwill:masterfrom
weihanglo:gitdiff
Draft

feat(patch_set): support git diff application#59
weihanglo wants to merge 10 commits intobmwill:masterfrom
weihanglo:gitdiff

Conversation

@weihanglo
Copy link
Copy Markdown
Contributor

@weihanglo weihanglo commented Apr 12, 2026

Add git diff output parsing and application support. Some hightlights:

  • The first few refactor commits add parse options for Patch, making skipping preamble and rejecting hunks configurable. Optional preamble skipping is required for git diff parsing because git diff patch may have only git headers without any unidiff portion, followed by the next diff --git header. If we still skip preamble, the next diff --git will be swallowed.
  • Binary patch is not yet supported, though we have skip and fail options which user can configure to either skip binary patch, or fail on encountering a binary patch. I plan to add a Keep enum variant when we add the support the binary patch parsing/application. (See https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/patches/parse.rs#L172-L197)

Comment thread src/patch_set/parse.rs
}
};

// FIXME: error spans point at `diff --git` line, not the specific offending line
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave it for future, I am lazy again to write new code 🙇🏾‍♂️.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if there are explicit limitation or things that we know will be addressed in the future as long as we document the issue i'm fine with avoiding scope bloat on individual PRs

@weihanglo weihanglo force-pushed the gitdiff branch 2 times, most recently from ffb8730 to 2515dd3 Compare April 12, 2026 21:08
Comment thread tests/replay.rs
Comment on lines +333 to +349
// Can't use `--numstat` for GitDiff: it shows `-\t-\t` for both
// actual binary diffs AND pure binary renames (100% similarity).
// Parser correctly handles pure renames (rename headers, no binary content).
//
// Use `--raw` for total count, subtract actual binary markers from diff.
//
// TODO: once `--binary` is passed above, count ALL `--raw`
// entries — every file will have patch data (delta, literal, or text).
let raw = git(repo, &["diff", "--raw", parent, child]);
let total = raw.lines().filter(|l| !l.is_empty()).count();
let type_changes = count_type_changes(&raw);
let binary = diff_output
.lines()
.filter(|l| l.starts_with("Binary files ") || l.starts_with("GIT binary patch"))
.count();
skipped += binary;
total - binary + type_changes
Copy link
Copy Markdown
Contributor Author

@weihanglo weihanglo Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial port of https://github.com/weihanglo/diffy/blob/e50f9286b95671b0f131fb5ac4f5725933e78232/tests/replay.rs#L292-L304. We will remove the binary filter eventually to avoid duplicate the parsing logic in test when we add binary patch support.

@weihanglo weihanglo force-pushed the gitdiff branch 2 times, most recently from eaa47ff to 49376b2 Compare April 13, 2026 19:24
`.lines()` strips line endings, so callers tracking byte offsets
need to re-add the `\r\n` or `\n` length manually.
Extract the repeated inline pattern into a reusable helper.
* Parse `diff --git` extended headers
* split multi-file git diffs at `diff --git` boundaries
Compat test for also `git apply`.
Unlike unidiff,
gitdiff produces patches for empty file creations/deletions
(`0\t0` in numstat)
because they carry `diff --git` + extended headers even without hunks.

Binary files (`-\t-\t`) are skipped in gitdiff mode for now.
Comment thread src/patch_set/mod.rs
Comment on lines +116 to +120
/// Skip binary diffs silently.
pub fn skip_binary(mut self) -> Self {
self.binary = Binary::Skip;
self
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not good actually. We may want a binary marker like patch so people explicitly know they are skipping something

Comment thread src/patch_set/parse.rs
Comment on lines +491 to +494
/// * `a/<path> b/<path>` (default prefix)
/// * `<path> <path>` (`git diff --no-prefix`)
/// * `<src-prefix><path> <dst-prefix><path>` (custom prefix)
/// * `"<prefix><path>" "<prefix><path>"` (quoted, with escapes)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add more compat tests around this logic.

Comment thread src/patch_set/parse.rs
Comment on lines +388 to +396
return Ok(FileOperation::Rename {
from: Cow::Borrowed(from),
to: Cow::Borrowed(to),
});
}
if let (Some(from), Some(to)) = (header.copy_from, header.copy_to) {
return Ok(FileOperation::Copy {
from: Cow::Borrowed(from),
to: Cow::Borrowed(to),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might return the quoted filename. This need to be fixed.

Comment thread src/patch_set/parse.rs
fn extract_file_op_gitdiff<'a>(
header: &GitHeader<'a>,
patch: &Patch<'a, str>,
) -> Result<FileOperation<'a>, PatchSetParseError> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing issue: This FileOperation preserves the raw path with prefix unstripped, e.g., a/src/lib.rs and b/src/lib.rs. I personally think this is the right chose on syntactic level and consumer should know their patch better than us. However, the API doc should call this out more explicitly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that we have yet supported non-UTF8 path even in #64.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that we have yet supported non-UTF8 path

This was one other thing i was going to mention. How do you want to handle that? punt on that for a follow up PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also cherry pick whatever in #64, if that is preferred.

Comment thread src/patch_set/parse.rs
Comment on lines +92 to +94
Binary::Skip => {
return self.next_gitdiff_patch();
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is recursive and some inputs could trivially overflow the stack. This should hopefully be fairly easy to restructure into a looping construct instead which can avoid the possibility of blowing the stack.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better, instead of silent skip, we always emit a binary marker patch.

#59 (comment)

Comment thread src/patch_set/parse.rs
}
// Select split with longest common path suffix (matches Git behavior)
if let Some(path) = longest_common_path_suffix(left, right) {
if path.len() > longest_path.len() {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple splits with the same length what does git do in those situations? as-is this will prefer the first one we encounter.

Comment thread src/patch_set/parse.rs
/// Path component boundary means:
///
/// * At `/` character (e.g., `foo/bar.rs` vs `fooo/bar.rs` → `bar.rs`)
/// * Or the entire string is identical
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "entire string is identical" case takes care of when a and b have no/s correct?

Comment thread src/patch/parse.rs
Comment thread src/patch/parse.rs
Comment thread src/patch_set/parse.rs
Comment on lines 61 to 62
// Strip email preamble once at construction
let input = strip_email_preamble(input);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not new in this change but I was just wondering how we would properly handle mbox streams (a concatenation of a bunch of email patchsets). Maybe worth adding a comment to comeback and address in a followup?

Copy link
Copy Markdown
Contributor Author

@weihanglo weihanglo Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remembered last time I tried, GNU patch and git apply both failed on this case. We now have compat test infra we can verify with some tests.

Comment thread src/patch_set/parse.rs
}

/// See [`parse_diff_git_path`].
fn parse_quoted_diff_git_path(line: &str) -> Option<(Cow<'_, str>, Cow<'_, str>)> {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for quote-within-quote eg diff --git "a/with\"quote"

Comment thread src/patch_set/parse.rs
let end = loop {
match bytes.get(i)? {
b'"' => break i + 1,
b'\\' => i += 2,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to be concerned about full octal awareness here? If we don't then maybe we should add a comment indicating why its ok?

Comment thread src/utils.rs
weihanglo added a commit to weihanglo/diffy that referenced this pull request Apr 15, 2026
Octal escape \377 decodes to 0xFF, which is not valid UTF-8.
When parsing into `Patch<'_, str>`, `convert_cow_to_str` panics
via `unwrap()` instead of returning a parse error.

This documents the pre-existing bug that the reviewer flagged:
bmwill#59 (comment)
weihanglo added a commit to weihanglo/diffy that referenced this pull request Apr 15, 2026
Octal escape \377 decodes to 0xFF, which is not valid UTF-8.
When parsing into `Patch<'_, str>`,
`convert_cow_to_str` panics via `unwrap()`
instead of returning a parse error.

See bmwill#59 (comment)
@weihanglo weihanglo marked this pull request as draft April 15, 2026 23:11
@weihanglo
Copy link
Copy Markdown
Contributor Author

Just put this on hold and created #65 and #66 for refactoring/improving the existing stuff.

bmwill pushed a commit that referenced this pull request Apr 16, 2026
Octal escape \377 decodes to 0xFF, which is not valid UTF-8.
When parsing into `Patch<'_, str>`,
`convert_cow_to_str` panics via `unwrap()`
instead of returning a parse error.

See #59 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants