feat: add Presidio integration for PII detection and anonymization by SyedShahmeerAli12 · Pull Request #3075 · deepset-ai/haystack-core-integrations

SyedShahmeerAli12 · 2026-04-01T07:53:47Z

What this adds

Three new Haystack components using Microsoft Presidio:

PresidioDocumentCleaner — anonymizes PII in list[Document], returns new Documents without mutating inputs
PresidioTextCleaner — anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
PresidioEntityExtractor — detects PII entities and stores them in Document metadata under the "entities" key

Usage example

from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
from haystack import Document

cleaner = PresidioDocumentCleaner()

result = cleaner.run(
    documents=[Document(content="My name is shahhmeer, email: ashahmeer73@gmail.com")]
)

# → "My name is <PERSON>, email: <EMAIL_ADDRESS>"
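
For intuition about the output format above: Presidio's default anonymization replaces each detected span with its entity type in angle brackets. The sketch below imitates only that substitution step with the standard library. The `Span` type and `replace_spans` helper are hypothetical stand-ins, not part of this PR or of Presidio, and the spans are located by hand rather than by an analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one detection: a character range plus a label."""

    start: int
    end: int
    entity_type: str


def replace_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>.

    Spans are processed right to left so that the offsets of earlier spans
    stay valid while the string changes length.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[: span.start] + f"<{span.entity_type}>" + text[span.end :]
    return text


text = "My name is shahhmeer, email: ashahmeer73@gmail.com"
name, email = "shahhmeer", "ashahmeer73@gmail.com"

# Offsets are computed from the known substrings instead of being detected.
spans = [
    Span(text.index(name), text.index(name) + len(name), "PERSON"),
    Span(text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS"),
]
anonymized = replace_spans(text, spans)
```

Working from the last span backwards avoids having to recompute offsets after each replacement.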

sjrl · 2026-04-01T07:59:59Z

@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution.

Also make sure to run the linter and the tests.

SyedShahmeerAli12 · 2026-04-01T08:14:16Z

Hello @sjrl, I've implemented all three components proposed in this issue:

PresidioDocumentCleaner : anonymizes PII in list[Document]
PresidioTextCleaner : anonymizes PII in list[str] for query sanitization
PresidioEntityExtractor : detects PII entities and stores them in Document metadata

All CI checks are passing. Happy to make any changes based on your feedback.

sjrl · 2026-04-02T07:58:38Z

@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:

Please add a new entry in .github/labeler.yml
We recently added code coverage support, so please add a new entry in .github/workflows/CI_coverage_comment.yml
Add the new integration to the table in the root README.md https://github.com/deepset-ai/haystack-core-integrations/blob/main/README.md

Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata

SyedShahmeerAli12 · 2026-04-09T12:18:20Z

Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place.

…, type hints, dataclasses.replace, doc links
- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests

SyedShahmeerAli12 · 2026-04-09T14:00:09Z

warm_up() — AnalyzerEngine/AnonymizerEngine do load spaCy ML models, so moved engine init into warm_up() across all three components. Updated integration tests accordingly.
Keyword-only args — Added *, to all three __init__ methods.
Return types — Replaced dict[str, Any] with proper typed returns, removed unused Any import.
dataclasses.replace() — Used in PresidioEntityExtractor instead of constructing a new Document.
Doc links — Added https://microsoft.github.io/presidio/supported_languages/, https://microsoft.github.io/presidio/supported_entities/, and https://microsoft.github.io/presidio/analyzer/ links to docstrings in all three components.
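
As a rough illustration of two of the points above (keyword-only arguments and `dataclasses.replace()`), here is a minimal standard-library sketch. `Doc` and `ExtractorSketch` are hypothetical stand-ins for `haystack.Document` and the real extractor, and the default values for `language` and `score_threshold` are placeholders, not the values used in this PR.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for haystack.Document (also a dataclass)."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


class ExtractorSketch:
    # The bare `*` makes every parameter after it keyword-only, so callers
    # must write ExtractorSketch(language="en"), not ExtractorSketch("en").
    def __init__(self, *, language: str = "en", score_threshold: float = 0.35) -> None:
        self.language = language
        self.score_threshold = score_threshold

    def annotate(self, doc: Doc, entities: list[dict[str, Any]]) -> Doc:
        # replace() copies every field and overrides only `meta`, so fields
        # that are not mentioned here carry over and the input is not mutated.
        return replace(doc, meta={**doc.meta, "entities": entities})


extractor = ExtractorSketch(language="en")
original = Doc(content="Call Alice", meta={"source": "chat"})
updated = extractor.annotate(
    original, [{"entity_type": "PERSON", "start": 5, "end": 10}]
)
```

Compared with constructing a new document field by field, `replace()` does not silently drop fields that a later version of the dataclass might add.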

- Add `_is_warmed_up` guard to `warm_up()` so repeated calls are idempotent
- Auto-warm on first `run()` call instead of raising RuntimeError
- Update component docstrings to reflect lazy loading behavior
- Fix broken Presidio doc link (supported_languages → analyzer/languages)
- Add `_make_*_with_mocks()` helper in each test class to centralize mock setup and prevent auto-warm from overwriting injected mocks

SyedShahmeerAli12 · 2026-04-13T08:35:14Z

Thanks @sjrl! Addressed all four points:

Broken doc link — Fixed supported_languages/ → analyzer/languages/
warm_up() guard — Added _is_warmed_up flag; repeated calls return early
Auto-warm on run() — Replaced RuntimeError with lazy warm_up() call on first run()
Docstrings — Updated to reflect lazy loading behaviour
Tests — Added _make_*_with_mocks() helper to prevent auto-warm from overwriting injected mocks
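
The guard and auto-warm behaviour listed above can be sketched as follows. This is a simplified illustration, not the code from the PR: `FakeEngine` is a hypothetical stand-in for an expensive engine such as Presidio's `AnalyzerEngine`, and the `"texts"` output key is an assumption.

```python
from typing import Optional


class FakeEngine:
    """Hypothetical stand-in for an engine that is expensive to construct."""

    instances = 0  # counts constructions so the laziness is observable

    def __init__(self) -> None:
        FakeEngine.instances += 1

    def clean(self, text: str) -> str:
        return text.upper()


class LazyCleanerSketch:
    def __init__(self) -> None:
        # Nothing expensive happens at construction time.
        self._engine: Optional[FakeEngine] = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        # Guard: a second call returns early instead of rebuilding the engine.
        if self._is_warmed_up:
            return
        self._engine = FakeEngine()
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Auto-warm on first use instead of raising RuntimeError.
        self.warm_up()
        assert self._engine is not None
        return {"texts": [self._engine.clean(t) for t in texts]}
```

With this shape, callers that forget `warm_up()` still get a working component, while pipelines that call it explicitly pay the loading cost once.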

sjrl · 2026-04-20T09:54:17Z

+                logger.warning(
+                    "Could not anonymize document {doc_id}. Skipping it. Error: {error}",
+                    doc_id=doc.id,
+                    error=e,
+                )
+                cleaned.append(doc)


This is misleading: we say "Skipping it", which I took to mean that we skip the Document entirely, when what is meant is that we skip anonymization.

Given that we probably don't want non-anonymized documents to make it through, I'd say we should drop the cleaned.append(doc) line and leave the warning message as is.

sjrl · 2026-04-20T09:55:32Z

+            except Exception as e:
+                logger.warning(
+                    "Could not extract entities from document {doc_id}. Skipping it. Error: {error}",
+                    doc_id=doc.id,
+                    error=e,
+                )
+                result_docs.append(doc)


Here I would update the warning message to say we are skipping extraction but keeping the document.
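
A minimal sketch of the "skip extraction, keep the document" behaviour suggested here, assuming a hypothetical `Doc` dataclass in place of `haystack.Document` and a `detect` callable in place of the Presidio analyzer. It uses the standard `logging` module with `%s` formatting, whereas the snippet under review uses keyword-style placeholders.

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any, Callable

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for haystack.Document."""

    id: str
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def extract_all(
    docs: list[Doc], detect: Callable[[str], list[dict[str, Any]]]
) -> list[Doc]:
    result_docs: list[Doc] = []
    for doc in docs:
        try:
            entities = detect(doc.content)
            result_docs.append(replace(doc, meta={**doc.meta, "entities": entities}))
        except Exception as e:
            logger.warning(
                "Could not extract entities from document %s. "
                "Skipping extraction, keeping document. Error: %s",
                doc.id,
                e,
            )
            # Extraction only adds metadata, so passing the original document
            # through unchanged does not leak anything new.
            result_docs.append(doc)
    return result_docs


def flaky_detect(text: str) -> list[dict[str, Any]]:
    """Hypothetical detector that fails on one input to exercise the except path."""
    if "boom" in text:
        raise ValueError("simulated failure")
    return [{"entity_type": "PERSON", "start": 0, "end": len(text)}]
```

The output list keeps the same length and order as the input, which is the difference from the cleaner components discussed in the neighbouring comments.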

sjrl · 2026-04-20T09:57:14Z

+            except Exception as e:
+                logger.warning(
+                    "Could not anonymize text. Skipping it. Error: {error}",
+                    error=e,
+                )
+                cleaned.append(text)


Same here as for the document cleaner. I would keep the warning message as-is and remove the cleaned.append(text) line.
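
For comparison, this is roughly what the suggestion would look like for a text cleaner: inputs that fail anonymization are logged and dropped. The `clean_texts` and `flaky_anonymize` functions are hypothetical, and the `anonymize` callable stands in for the Presidio analyze-then-anonymize step.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def clean_texts(texts: list[str], anonymize: Callable[[str], str]) -> list[str]:
    cleaned: list[str] = []
    for text in texts:
        try:
            cleaned.append(anonymize(text))
        except Exception as e:
            # No append here: text that could not be anonymized must not
            # reach downstream components in its raw form.
            logger.warning("Could not anonymize text. Skipping it. Error: %s", e)
    return cleaned


def flaky_anonymize(text: str) -> str:
    """Hypothetical anonymizer that fails on one input to exercise the except path."""
    if "boom" in text:
        raise ValueError("simulated failure")
    return text.replace("Alice", "<PERSON>")
```

One consequence of dropping items is that the output no longer lines up index by index with the input, so callers cannot assume equal lengths.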

- Regenerate presidio.yml workflow from template (compute-test-matrix job, pinned action versions, push trigger, coverage steps)
- Add integration-cov-append-retry script to pyproject.toml
- Drop un-anonymized documents on error in PresidioDocumentCleaner instead of passing them through unanonymized
- Clarify warning message in PresidioEntityExtractor to say extraction is skipped but document is kept
- Update test to assert failed docs are dropped in PresidioDocumentCleaner

SyedShahmeerAli12 · 2026-04-20T13:44:36Z

Hi @sjrl, addressed all three comments:

presidio.yml — Regenerated from the template (compute-test-matrix job, pinned action versions, push trigger, coverage steps, integration-cov-append-retry). Also added integration-cov-append-retry script to pyproject.toml.
presidio_document_cleaner.py — Removed cleaned.append(doc) so documents that fail anonymization are dropped entirely rather than passed through unanonymized.
presidio_entity_extractor.py — Updated warning message to "Skipping extraction, keeping document" to accurately reflect the behaviour.
Test — Updated test_run_skips_on_error in the cleaner tests to assert len(result["documents"]) == 0.
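
The updated test could look roughly like the sketch below. It is an illustration under assumptions, not the test from the PR: `CleanerSketch`, `_RealEngine`, and `make_cleaner_with_mocks` are hypothetical, and setting `_is_warmed_up` in the helper is one guessed way of keeping auto-warm from overwriting the injected mock.

```python
from unittest.mock import MagicMock


class _RealEngine:
    """Hypothetical engine that the component would normally build in warm_up()."""

    def analyze(self, text: str) -> str:
        return text


class CleanerSketch:
    def __init__(self) -> None:
        self._analyzer = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        if self._is_warmed_up:
            return
        self._analyzer = _RealEngine()
        self._is_warmed_up = True

    def run(self, documents: list[str]) -> dict[str, list[str]]:
        self.warm_up()
        cleaned = []
        for doc in documents:
            try:
                cleaned.append(self._analyzer.analyze(doc))
            except Exception:
                continue  # failed documents are dropped
        return {"documents": cleaned}


def make_cleaner_with_mocks() -> CleanerSketch:
    cleaner = CleanerSketch()
    cleaner._analyzer = MagicMock()
    # Mark as warmed up so run() does not replace the mock with a real engine.
    cleaner._is_warmed_up = True
    return cleaner


def test_run_skips_on_error() -> None:
    cleaner = make_cleaner_with_mocks()
    cleaner._analyzer.analyze.side_effect = RuntimeError("simulated failure")
    result = cleaner.run(documents=["some text"])
    assert len(result["documents"]) == 0
    cleaner._analyzer.analyze.assert_called_once()
```

Without the flag set in the helper, the first `run()` would rebuild the engine and the mocked failure would never be triggered.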

sjrl · 2026-04-20T13:52:35Z

Thanks @SyedShahmeerAli12 ! Almost there, could you look at the failing CI check here

SyedShahmeerAli12 · 2026-04-20T14:25:31Z

@sjrl all checks are passing now. Ready for another review when you get a chance!

SyedShahmeerAli12 requested a review from a team as a code owner April 1, 2026 07:53

SyedShahmeerAli12 requested review from sjrl and removed request for a team April 1, 2026 07:53

github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 1, 2026

sjrl self-assigned this Apr 1, 2026

sjrl removed their assignment Apr 2, 2026

sjrl reviewed Apr 2, 2026

Comment thread integrations/presidio/README.md

sjrl reviewed Apr 2, 2026

Comment thread .github/workflows/presidio.yml Outdated

sjrl reviewed Apr 2, 2026

Comment thread integrations/presidio/pyproject.toml

SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 081b9d3 to 82de46b Compare April 2, 2026 14:40

SyedShahmeerAli12 added 7 commits April 2, 2026 19:44

fix(presidio): add missing README.md required by hatchling build

b7f0359

fix(presidio): fix lint errors and add missing pydoc config

e162949

fix(presidio): apply ruff format to test files

cc518d1

fix(presidio): add py.typed marker for mypy type checking

a363f89

fix(presidio): suppress mypy arg-type error for presidio cross-package type mismatch

a8b2004

Address PR review: update README, add Python 3.14 support, labeler and coverage entries

80c8c1d

SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 82de46b to 80c8c1d Compare April 2, 2026 14:45