Add VerificationResult.rowLevelResultsAsDataFrame support#262

Open
billpratt wants to merge 3 commits into awslabs:master from billpratt:row-level-results

@billpratt

Summary

Adds a Python wrapper for deequ's VerificationResult.rowLevelResultsAsDataFrame, enabling users to get per-row pass/fail results for data quality checks directly from pydeequ.

This is critical for workflows that need to quarantine rows with data quality issues rather than just getting aggregate check results.

Closes #261

What it does

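from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# spark is an active SparkSession; df is the DataFrame to validate
check = Check(spark, CheckLevel.Error, "quality_check")
check = check.isComplete("email").isContainedIn("status", ["active", "inactive"])

result = VerificationSuite(spark).onData(df).addCheck(check).run()
# Returns df plus a Boolean column named after the check
row_level_df = VerificationResult.rowLevelResultsAsDataFrame(spark, result, df)
row_level_df.show()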

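The output includes all original columns plus a Boolean column named after the check (here, quality_check). Multiple constraints within a single Check are ANDed together, so a row passes only if it satisfies every constraint in that Check.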
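To make the ANDing and the quarantine use case concrete, here is a minimal pure-Python model of the semantics described above. It is only a sketch and not how pydeequ or deequ implement the feature: the helpers `is_complete`, `is_contained_in`, `row_level_results`, and `split_quarantine` are hypothetical names, rows are plain dicts rather than Spark rows, and the handling of null values in `is_contained_in` is an assumption of the sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Constraint = Callable[[Row], bool]


def is_complete(column: str) -> Constraint:
    """Hypothetical stand-in for isComplete: passes when the value is not None."""
    return lambda row: row.get(column) is not None


def is_contained_in(column: str, allowed: List[str]) -> Constraint:
    """Hypothetical stand-in for isContainedIn.

    Assumption of this sketch: a None value counts as not contained.
    """
    allowed_set = set(allowed)
    return lambda row: row.get(column) in allowed_set


def row_level_results(
    rows: List[Row], check_name: str, constraints: List[Constraint]
) -> List[Row]:
    """Keep every original column and add one Boolean column for the check.

    The Boolean is the AND of all constraints in the check.
    """
    return [
        {**row, check_name: all(constraint(row) for constraint in constraints)}
        for row in rows
    ]


def split_quarantine(rows: List[Row], check_name: str) -> Tuple[List[Row], List[Row]]:
    """Split rows into (passed, quarantined) using the Boolean check column."""
    passed = [row for row in rows if row[check_name]]
    quarantined = [row for row in rows if not row[check_name]]
    return passed, quarantined


rows = [
    {"id": 1, "email": "a@example.com", "status": "active"},
    {"id": 2, "email": None, "status": "active"},            # fails isComplete
    {"id": 3, "email": "c@example.com", "status": "deleted"},  # fails isContainedIn
]
constraints = [
    is_complete("email"),
    is_contained_in("status", ["active", "inactive"]),
]

results = row_level_results(rows, "quality_check", constraints)
passed, quarantined = split_quarantine(results, "quality_check")
```

On the Spark DataFrame returned by the new method, the same split would presumably be a filter on the Boolean column, for example `row_level_df.filter(~F.col("quality_check"))` for the quarantined rows, with `pyspark.sql.functions` imported as `F`.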

Changes

  • pydeequ/verification.py — Added rowLevelResultsAsDataFrame classmethod to VerificationResult
  • tests/test_verification.py — 6 new tests (completeness, containedIn, ANDed constraints, aggregate-only, column preservation, pandas output)
  • README.md — Added usage example

Supported constraint types

Constraint                      Row-level output
isComplete / hasCompleteness    ✅
hasPattern                      ✅
isContainedIn / satisfies       ✅
hasMinLength / hasMaxLength     ✅
hasMin / hasMax                 ✅
isUnique / isPrimaryKey         ✅
hasSize, hasEntropy, etc.       ❌ (aggregate-only, silently skipped)

Known limitation

As noted in #234, checks using Python lambda assertions (e.g., hasMin("b", lambda x: x == 0)) can cause serialization errors with rowLevelResultsAsDataFrame. This is a pre-existing ScalaFunction1 proxy issue, not introduced by this PR.

Testing

All 6 new tests pass. Full existing suite (152 tests) passes with no regressions.

Wrap deequ's VerificationResult.rowLevelResultsAsDataFrame as a
classmethod on pydeequ's VerificationResult. This returns the original
DataFrame with additional Boolean columns indicating which rows passed
or failed each Check.

- Add rowLevelResultsAsDataFrame classmethod to VerificationResult
- Add tests covering completeness, containedIn, ANDed constraints,
  aggregate-only checks, column preservation, and pandas output
- Update README with usage example

Closes awslabs#261

@github-actions github-actions Bot left a comment


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: d21e43dc) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/verification.py
Comment thread tests/test_verification.py
Comment thread tests/test_verification.py
Comment thread tests/test_verification.py
Comment thread tests/test_verification.py
Address review feedback: Spark DataFrames have no guaranteed row
order, so add explicit orderBy() before collect() in all tests that
assert row-level values.

@github-actions github-actions Bot left a comment



Comment thread pydeequ/verification.py
Comment thread tests/test_verification.py
Comment thread tests/test_verification.py
Comment thread tests/test_verification.py
@billpratt billpratt marked this pull request as ready for review May 12, 2026 22:41
Verify that rowLevelResultsAsDataFrame preserves the same number of
rows as the original DataFrame.
@github-actions

No issues found.




Development

Successfully merging this pull request may close these issues.

Plans to expose deequ's VerificationResult.rowLevelResultsAsDataFrame?
