Add VerificationResult.rowLevelResultsAsDataFrame support #262
Open
billpratt wants to merge 3 commits into
Wrap deequ's VerificationResult.rowLevelResultsAsDataFrame as a classmethod on pydeequ's VerificationResult. This returns the original DataFrame with additional Boolean columns indicating which rows passed or failed each Check.

- Add rowLevelResultsAsDataFrame classmethod to VerificationResult
- Add tests covering completeness, containedIn, ANDed constraints, aggregate-only checks, column preservation, and pandas output
- Update README with usage example

Closes awslabs#261
Address review feedback: Spark DataFrames have no guaranteed row order, so add explicit orderBy() before collect() in all tests that assert row-level values.
Verify that rowLevelResultsAsDataFrame preserves the same number of rows as the original DataFrame.
No issues found. Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: d21e43dc) and may not be fully accurate. Reply if this doesn't help.
Summary
Adds a Python wrapper for deequ's VerificationResult.rowLevelResultsAsDataFrame, enabling users to get per-row pass/fail results for data quality checks directly from pydeequ.
This is critical for workflows that need to quarantine rows with data quality issues rather than just getting aggregate check results.
Closes #261
What it does
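A rough plain-Python model of the row-level semantics, for illustration only. It does not call pydeequ or Spark; the helper row_level_results, the sample rows, and the lambda stand-ins for constraints are all hypothetical. Each input row is copied, and one Boolean column per check is added that is True only when every constraint in that check passes for the row.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Constraint = Callable[[Row], bool]


def row_level_results(rows: List[Row], checks: Dict[str, List[Constraint]]) -> List[Row]:
    """Return a copy of each row plus one Boolean column per check.

    Hypothetical helper: a check's column is True only if every
    constraint in that check passes for the row (constraints are ANDed).
    """
    results = []
    for row in rows:
        out = dict(row)  # copy so the original columns are preserved untouched
        for check_name, constraints in checks.items():
            out[check_name] = all(constraint(row) for constraint in constraints)
        results.append(out)
    return results


# Sample data (made up for this sketch).
rows = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": None},
    {"id": 3, "status": "deleted"},
]

# One check named "quality_check" with two constraints.
checks = {
    "quality_check": [
        lambda r: r["status"] is not None,                # loosely models a completeness constraint
        lambda r: r["status"] in {"active", "inactive"},  # loosely models a contained-in constraint
    ]
}

results = row_level_results(rows, checks)

# Quarantine workflow: split rows on the Boolean column.
passed = [r for r in results if r["quality_check"]]
quarantined = [r for r in results if not r["quality_check"]]
```

The row count is unchanged, and the split on the Boolean column is the kind of quarantine step the summary refers to. With a real Spark DataFrame the same split would be a filter on that column.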
Output includes all original columns plus a quality_check Boolean column. Multiple constraints within a single Check are ANDed together.
Changes
- pydeequ/verification.py: Added rowLevelResultsAsDataFrame classmethod to VerificationResult
- tests/test_verification.py: 6 new tests (completeness, containedIn, ANDed constraints, aggregate-only, column preservation, pandas output)
- README.md: Added usage example
Supported constraint types
- isComplete / hasCompleteness
- hasPattern
- isContainedIn / satisfies
- hasMinLength / hasMaxLength
- hasMin / hasMax
- isUnique / isPrimaryKey
- hasSize, hasEntropy, etc.
Known limitation
As noted in #234, checks using Python lambda assertions (e.g., hasMin("b", lambda x: x == 0)) can cause serialization errors with rowLevelResultsAsDataFrame. This is a pre-existing ScalaFunction1 proxy issue, not introduced by this PR.
Testing
All 6 new tests pass. Full existing suite (152 tests) passes with no regressions.