feat: add Presidio integration for PII detection and anonymization #3075

Open
SyedShahmeerAli12 wants to merge 13 commits into deepset-ai:main from SyedShahmeerAli12:feat/presidio-integration

@SyedShahmeerAli12 SyedShahmeerAli12 commented Apr 1, 2026

Closes #3063

What this adds

Three new Haystack components using Microsoft Presidio:

  • PresidioDocumentCleaner — anonymizes PII in list[Document], returns new Documents without mutating inputs
  • PresidioTextCleaner — anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
  • PresidioEntityExtractor — detects PII entities and stores them in Document metadata under the "entities" key

Usage example

from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
from haystack import Document

cleaner = PresidioDocumentCleaner()

result = cleaner.run(
    documents=[Document(content="My name is shahhmeer, email: ashahmeer73@gmail.com")]
)

# → "My name is <PERSON>, email: <EMAIL_ADDRESS>"
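
For readers unfamiliar with placeholder-style anonymization, here is a minimal standard-library sketch of the substitution step that produces output like the line above. It does not call Presidio: the `Span` class and `replace_spans` function are hypothetical names, and the spans are supplied by hand, where a real pipeline would get them from an analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one detected entity (not Presidio's result class)."""
    entity_type: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def replace_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a <ENTITY_TYPE> placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "My name is Jane, email: jane@example.com"
name_start = text.index("Jane")
email_start = text.index("jane@example.com")
spans = [
    Span("PERSON", name_start, name_start + len("Jane")),
    Span("EMAIL_ADDRESS", email_start, len(text)),
]
print(replace_spans(text, spans))
# → My name is <PERSON>, email: <EMAIL_ADDRESS>
```

Replacing from right to left avoids recomputing offsets after each substitution changes the string length.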

SyedShahmeerAli12 requested a review from a team as a code owner April 1, 2026 07:53
SyedShahmeerAli12 requested review from sjrl and removed request for a team April 1, 2026 07:53
github-actions bot added the topic:CI and type:documentation labels Apr 1, 2026

sjrl commented Apr 1, 2026

@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution.

Also make sure to run the linter and the tests.

sjrl self-assigned this Apr 1, 2026
SyedShahmeerAli12 commented

Hello @sjrl, I've implemented all three components proposed in this issue:

PresidioDocumentCleaner : anonymizes PII in list[Document]
PresidioTextCleaner : anonymizes PII in list[str] for query sanitization
PresidioEntityExtractor : detects PII entities and stores them in Document metadata

All CI checks are passing. Happy to make any changes based on your feedback.

sjrl removed their assignment Apr 2, 2026
Comment thread integrations/presidio/README.md
Comment thread .github/workflows/presidio.yml Outdated
Comment thread integrations/presidio/pyproject.toml

sjrl commented Apr 2, 2026

@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:

SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 081b9d3 to 82de46b Compare April 2, 2026 14:40
Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata
SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 82de46b to 80c8c1d Compare April 2, 2026 14:45
Comment thread .github/workflows/CI_coverage_comment.yml Outdated
Comment thread README.md Outdated
SyedShahmeerAli12 commented

Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place.

…, type hints, dataclasses.replace, doc links

- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests
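
Two of the changes listed above, keyword-only `__init__` arguments and `dataclasses.replace()`, can be illustrated with a small sketch. `Doc` is a hypothetical stand-in for a Haystack `Document`, `EntityExtractorSketch` is not the PR's class, and the parameter defaults are illustrative assumptions only.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


class EntityExtractorSketch:
    """Illustrative only; not the component added in this PR."""

    def __init__(self, *, language: str = "en", score_threshold: float = 0.5) -> None:
        # The bare * makes every parameter keyword-only.
        # Both default values are assumptions chosen for the example.
        self.language = language
        self.score_threshold = score_threshold

    def annotate(self, doc: Doc, entities: list[dict[str, Any]]) -> Doc:
        # replace() copies every field and overrides only meta, so the
        # input document is left untouched.
        return replace(doc, meta={**doc.meta, "entities": entities})


original = Doc(content="Jane lives here", meta={"source": "demo"})
extractor = EntityExtractorSketch(language="en")
updated = extractor.annotate(original, [{"entity_type": "PERSON", "start": 0, "end": 4}])
print(original.meta)  # the original metadata has no "entities" key
print(updated.meta)   # the copy carries both "source" and "entities"
```

Building a fresh `meta` dict rather than mutating `doc.meta` matters here: `replace()` makes a shallow copy, so an in-place update would leak into the input document.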
SyedShahmeerAli12 (Contributor Author)

- Add `_is_warmed_up` guard to `warm_up()` so repeated calls are idempotent
- Auto-warm on first `run()` call instead of raising RuntimeError
- Update component docstrings to reflect lazy loading behavior
- Fix broken Presidio doc link (supported_languages → analyzer/languages)
- Add `_make_*_with_mocks()` helper in each test class to centralize mock
  setup and prevent auto-warm from overwriting injected mocks
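
The guard and auto-warm behaviour described in this commit can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `LazyComponentSketch` and its `loader` argument are hypothetical, and the loader stands in for the expensive engine construction.

```python
from typing import Callable


class LazyComponentSketch:
    """Illustrative component with an idempotent warm_up() and lazy loading."""

    def __init__(self, *, loader: Callable[[], Callable[[str], str]]) -> None:
        self._loader = loader
        self._engine = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        # Guard: repeated calls return early instead of reloading the engine.
        if self._is_warmed_up:
            return
        self._engine = self._loader()
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Auto-warm on first use rather than raising RuntimeError.
        if not self._is_warmed_up:
            self.warm_up()
        return {"texts": [self._engine(t) for t in texts]}


load_calls = []


def fake_loader() -> Callable[[str], str]:
    load_calls.append(1)  # record each (expensive) load
    return str.upper      # trivial stand-in for an anonymization engine


component = LazyComponentSketch(loader=fake_loader)
print(component.run(["abc"]))  # → {'texts': ['ABC']}
component.warm_up()
component.warm_up()
print(len(load_calls))         # → 1
```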
SyedShahmeerAli12 commented

Thanks @sjrl. Addressed all four points, plus a related test change:

  • Broken doc link — Fixed (supported_languages → analyzer/languages/)
  • warm_up() guard — Added _is_warmed_up flag; repeated calls return early
  • Auto-warm on run() — Replaced RuntimeError with lazy warm_up() call on first run()
  • Docstrings — Updated to reflect lazy loading behaviour
  • Tests — Added _make_*_with_mocks() helper to prevent auto-warm from overwriting injected mocks
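
The test helper mentioned in the last bullet can be sketched like this. The names `CleanerSketch` and `make_cleaner_with_mocks` are hypothetical; the idea is to set the warmed-up flag after injecting mocks so the first `run()` call does not replace them.

```python
from unittest.mock import MagicMock


class CleanerSketch:
    """Illustrative component that auto-warms on first run()."""

    def __init__(self) -> None:
        self._analyzer = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        if self._is_warmed_up:
            return
        self._analyzer = object()  # stands in for an expensive model load
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        if not self._is_warmed_up:
            self.warm_up()
        return {"texts": [self._analyzer.analyze(t) for t in texts]}


def make_cleaner_with_mocks() -> tuple[CleanerSketch, MagicMock]:
    """Build a cleaner whose analyzer is a mock that survives run()."""
    cleaner = CleanerSketch()
    analyzer = MagicMock()
    analyzer.analyze.return_value = "<REDACTED>"
    cleaner._analyzer = analyzer
    # Without this flag, run() would call warm_up() and overwrite the mock.
    cleaner._is_warmed_up = True
    return cleaner, analyzer


cleaner, analyzer = make_cleaner_with_mocks()
print(cleaner.run(["some text"]))  # → {'texts': ['<REDACTED>']}
```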

Comment thread .github/workflows/presidio.yml
Comment on lines +110 to +115
logger.warning(
    "Could not anonymize document {doc_id}. Skipping it. Error: {error}",
    doc_id=doc.id,
    error=e,
)
cleaned.append(doc)

This is misleading: we say "Skipping it", which I read as skipping the Document entirely, but what is meant is that we are skipping anonymization.

Given that we probably don't want non-anonymized documents to make it through, I'd say we should drop the cleaned.append(doc) line and leave the warning message as is.

Comment on lines +118 to +124
except Exception as e:
    logger.warning(
        "Could not extract entities from document {doc_id}. Skipping it. Error: {error}",
        doc_id=doc.id,
        error=e,
    )
    result_docs.append(doc)

Here I would update the warning message to say we are skipping extraction but keeping the document.

Comment on lines +103 to +108
except Exception as e:
    logger.warning(
        "Could not anonymize text. Skipping it. Error: {error}",
        error=e,
    )
    cleaned.append(text)

Same here as for the document cleaner: I would keep the warning message as-is and remove the cleaned.append(text) line.

- Regenerate presidio.yml workflow from template (compute-test-matrix job,
  pinned action versions, push trigger, coverage steps)
- Add integration-cov-append-retry script to pyproject.toml
- Drop un-anonymized documents on error in PresidioDocumentCleaner instead
  of passing them through unanonymized
- Clarify warning message in PresidioEntityExtractor to say extraction is
  skipped but document is kept
- Update test to assert failed docs are dropped in PresidioDocumentCleaner
SyedShahmeerAli12 commented

Hi @sjrl, addressed all three comments:

  • presidio.yml — Regenerated from the template (compute-test-matrix job, pinned action versions, push trigger, coverage steps, integration-cov-append-retry). Also added integration-cov-append-retry script to pyproject.toml.
  • presidio_document_cleaner.py — Removed cleaned.append(doc) so documents that fail anonymization are dropped entirely rather than passed through unanonymized.
  • presidio_entity_extractor.py — Updated warning message to "Skipping extraction, keeping document" to accurately reflect the behaviour.
  • Test — Updated test_run_skips_on_error in the cleaner tests to assert len(result["documents"]) == 0.
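
The two error-handling policies summarized above can be contrasted in a small sketch: the cleaner drops a document it could not anonymize, while the extractor keeps the document and only skips extraction. Everything here is a hypothetical stand-in (`Doc`, `clean_documents`, `extract_entities`, `always_fails`), and standard-library logging replaces the logger used in the PR.

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any, Callable

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def clean_documents(docs: list[Doc], anonymize: Callable[[str], str]) -> dict[str, list[Doc]]:
    cleaned = []
    for doc in docs:
        try:
            cleaned.append(replace(doc, content=anonymize(doc.content)))
        except Exception as e:
            # Fail closed: a document that could not be anonymized is dropped,
            # so raw PII never reaches downstream components.
            logger.warning("Could not anonymize document. Skipping it. Error: %s", e)
    return {"documents": cleaned}


def extract_entities(docs: list[Doc], detect: Callable[[str], list]) -> dict[str, list[Doc]]:
    result_docs = []
    for doc in docs:
        try:
            entities = detect(doc.content)
            result_docs.append(replace(doc, meta={**doc.meta, "entities": entities}))
        except Exception as e:
            # Extraction only adds metadata, so the unchanged document is kept.
            logger.warning("Skipping extraction, keeping document. Error: %s", e)
            result_docs.append(doc)
    return {"documents": result_docs}


def always_fails(text: str):
    raise RuntimeError("engine unavailable")


docs = [Doc(content="call me at 555-0100")]
print(len(clean_documents(docs, always_fails)["documents"]))   # → 0
print(len(extract_entities(docs, always_fails)["documents"]))  # → 1
```

The asymmetry follows from what each component promises: a cleaner's output is expected to be free of PII, whereas an extractor's output is the same content with optional annotations.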


sjrl commented Apr 20, 2026

Thanks @SyedShahmeerAli12! Almost there. Could you look at the failing CI check here?

SyedShahmeerAli12 commented

@sjrl all checks are passing now. Ready for another review when you get a chance!

Labels

integration:presidio topic:CI type:documentation

Development

Successfully merging this pull request may close these issues.

Add Presidio integration

2 participants