feat: add Presidio integration for PII detection and anonymization #3075
SyedShahmeerAli12 wants to merge 13 commits into deepset-ai:main from
Conversation
@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution.
Hello @sjrl, I've implemented all three components proposed in this issue: PresidioDocumentCleaner: anonymizes PII in list[Document]. All CI checks are passing. Happy to make any changes based on your feedback.
@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:
081b9d3 to 82de46b
Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata
…d coverage entries
82de46b to 80c8c1d
Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place.
…, type hints, dataclasses.replace, doc links
- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests
- Add `_is_warmed_up` guard to `warm_up()` so repeated calls are idempotent
- Auto-warm on first `run()` call instead of raising RuntimeError
- Update component docstrings to reflect lazy loading behavior
- Fix broken Presidio doc link (supported_languages → analyzer/languages)
- Add `_make_*_with_mocks()` helper in each test class to centralize mock setup and prevent auto-warm from overwriting injected mocks
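The lazy-loading behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `LazyCleaner`, `engine_factory`, and `fake_engine_factory` are hypothetical stand-ins for a component and the expensive Presidio engines, and only the `_is_warmed_up` flag, `warm_up()`, and `run()` names come from the commit message.

```python
from typing import Callable, Optional


class LazyCleaner:
    """Hypothetical stand-in for a component that owns an expensive engine."""

    def __init__(self, *, engine_factory: Callable[[], Callable[[str], str]]) -> None:
        # Keyword-only argument, mirroring the `*,` convention from the earlier commit.
        self._engine_factory = engine_factory
        self._engine: Optional[Callable[[str], str]] = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        # Idempotent: a repeated call returns early and does not rebuild the engine.
        if self._is_warmed_up:
            return
        self._engine = self._engine_factory()
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Auto-warm on first use instead of raising RuntimeError.
        self.warm_up()
        assert self._engine is not None
        return {"texts": [self._engine(t) for t in texts]}


load_count = 0


def fake_engine_factory() -> Callable[[str], str]:
    # Assumption: stands in for loading the real models; counts how often it runs.
    global load_count
    load_count += 1
    return lambda text: text.replace("Alice", "<PERSON>")


cleaner = LazyCleaner(engine_factory=fake_engine_factory)
first = cleaner.run(["Hello Alice"])   # triggers the one and only load
cleaner.warm_up()                      # no-op, already warmed up
second = cleaner.run(["Alice again"])  # reuses the loaded engine
```

Tests that inject mocks into such a component would need to set the warmed-up flag as well, otherwise the first `run()` replaces the mock with a real engine, which is presumably what the `_make_*_with_mocks()` helpers guard against.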
Thanks @sjrl! Addressed all four points:
02ea61c to 10d90f5
    logger.warning(
        "Could not anonymize document {doc_id}. Skipping it. Error: {error}",
        doc_id=doc.id,
        error=e,
    )
    cleaned.append(doc)
This is misleading: we say "Skipping it", which I took to mean that we skip the Document entirely, when what is meant is that we skip anonymization.
Given that we probably don't want non-anonymized documents to make it through, I'd say we should drop the cleaned.append(doc) line and leave the warning message as is.
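A minimal sketch of the suggested behavior, where a document that fails anonymization is logged and dropped from the output. `Doc`, `fake_anonymize`, and `clean_documents` are hypothetical stand-ins for the Haystack `Document` and the Presidio call, and the sketch uses the standard-library logger with `%s` formatting rather than the keyword-style logger shown in the diff.

```python
import logging
from dataclasses import dataclass, replace

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    id: str
    content: str


def fake_anonymize(text: str) -> str:
    # Assumption: stands in for the real anonymizer and fails on some inputs.
    if "boom" in text:
        raise ValueError("anonymizer failed")
    return text.replace("Alice", "<PERSON>")


def clean_documents(documents: list[Doc]) -> list[Doc]:
    cleaned: list[Doc] = []
    for doc in documents:
        try:
            # Build a new object so the input document is never mutated.
            cleaned.append(replace(doc, content=fake_anonymize(doc.content)))
        except Exception as e:
            logger.warning(
                "Could not anonymize document %s. Skipping it. Error: %s", doc.id, e
            )
            # Deliberately no cleaned.append(doc): raw PII must not pass through.
    return cleaned


docs = [Doc("1", "Hi Alice"), Doc("2", "boom Alice")]
result = clean_documents(docs)
```

With this shape, "Skipping it" in the warning matches what actually happens: the failed document is absent from the result.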
    except Exception as e:
        logger.warning(
            "Could not extract entities from document {doc_id}. Skipping it. Error: {error}",
            doc_id=doc.id,
            error=e,
        )
        result_docs.append(doc)
Here I would update the warning message to say we are skipping extraction but keeping the document.
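For contrast with the cleaner, a rough sketch of the extractor behavior described here: extraction failures are logged, but the original document is still returned. `Doc`, `fake_detect`, `extract_entities`, and the shape of the entity dictionaries are all assumptions made for illustration; only the `"entities"` metadata key and the use of `dataclasses.replace()` come from the PR text.

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    id: str
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def fake_detect(text: str) -> list[dict[str, Any]]:
    # Assumption: stands in for the real analyzer; the entity dict shape is illustrative.
    if not text:
        raise ValueError("empty content")
    start = text.find("Alice")
    if start == -1:
        return []
    return [{"entity_type": "PERSON", "start": start, "end": start + len("Alice")}]


def extract_entities(documents: list[Doc]) -> list[Doc]:
    result_docs: list[Doc] = []
    for doc in documents:
        try:
            entities = fake_detect(doc.content)
            # Copy the document with extended metadata instead of mutating the input.
            result_docs.append(replace(doc, meta={**doc.meta, "entities": entities}))
        except Exception as e:
            logger.warning(
                "Could not extract entities from document %s. "
                "Skipping extraction but keeping the document. Error: %s",
                doc.id,
                e,
            )
            # Extraction adds information only, so the untouched document is safe to keep.
            result_docs.append(doc)
    return result_docs


docs = [Doc("1", "Hi Alice"), Doc("2", "")]
out = extract_entities(docs)
```

The asymmetry with the cleaners is intentional in this sketch: a failed extraction loses metadata, whereas a failed anonymization would leak PII.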
    except Exception as e:
        logger.warning(
            "Could not anonymize text. Skipping it. Error: {error}",
            error=e,
        )
        cleaned.append(text)
Same here as for the document cleaner. I would keep the warning message as-is and remove the cleaned.append(text) line.
- Regenerate presidio.yml workflow from template (compute-test-matrix job, pinned action versions, push trigger, coverage steps)
- Add integration-cov-append-retry script to pyproject.toml
- Drop un-anonymized documents on error in PresidioDocumentCleaner instead of passing them through unanonymized
- Clarify warning message in PresidioEntityExtractor to say extraction is skipped but document is kept
- Update test to assert failed docs are dropped in PresidioDocumentCleaner
Hi @sjrl, addressed all three comments:
Thanks @SyedShahmeerAli12! Almost there, could you look at the failing CI check here
|
@sjrl all checks are passing now. Ready for another review when you get a chance!
Closes #3063
What this adds
Three new Haystack components using Microsoft Presidio:
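- PresidioDocumentCleaner: anonymizes PII in list[Document], returns new Documents without mutating inputs
- PresidioTextCleaner: anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata under the "entities" key
Usage example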