Multilingual PHI detection — support Spanish, French, German patient data #5

@sarvanithin

Overview

The current RegexPHIEngine is English-only. Clinical AI systems in Europe and Latin America handle patient data in Spanish, French, German, and Portuguese. PHI patterns differ by locale (e.g., Spanish DNI vs US SSN, European date formats DD/MM/YYYY).

What to build

  • medguard/guardrails/phi_i18n.py — locale-specific PHI patterns
    • Spanish: DNI \d{8}[A-Z], NIE [XYZ]\d{7}[A-Z], date \d{2}/\d{2}/\d{4}
    • French: INSEE (SS) number, RPPS doctor ID
    • German: Krankenversichertennummer (KVNR)
  • LocalePHIEngine(locale="es") — wraps RegexPHIEngine with locale patterns
  • Add locale: str = "en" to PHIConfig
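A rough sketch of how the pattern table and engine could fit together. Everything here is illustrative: `LOCALE_PATTERNS`, the `detect`/`redact` method names, and the `[LABEL]` replacement format are hypothetical, and the class below is a standalone stand-in because the `RegexPHIEngine` interface is not shown in this issue. The Spanish patterns are the ones listed above. The INSEE and KVNR patterns are assumptions (simplified formats, no checksum validation), and RPPS is left out because no pattern has been specified yet.

```python
import re

# Hypothetical contents of medguard/guardrails/phi_i18n.py.
LOCALE_PATTERNS = {
    "es": {
        "DNI": re.compile(r"\b\d{8}[A-Z]\b"),
        "NIE": re.compile(r"\b[XYZ]\d{7}[A-Z]\b"),
        "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    },
    "fr": {
        # Assumption: 15 digits grouped 1-2-2-2-3-3-2 with optional single
        # spaces. Corsican department codes (2A/2B) are not handled.
        "INSEE": re.compile(
            r"\b[12] ?\d{2} ?\d{2} ?\d{2} ?\d{3} ?\d{3} ?\d{2}\b"
        ),
    },
    "de": {
        # Assumption: one uppercase letter followed by nine digits.
        "KVNR": re.compile(r"\b[A-Z]\d{9}\b"),
    },
}


class LocalePHIEngine:
    """Standalone stand-in; the real class would wrap RegexPHIEngine."""

    def __init__(self, locale="en"):
        self.locale = locale
        # Unknown locales get no extra patterns in this sketch.
        self.patterns = LOCALE_PATTERNS.get(locale, {})

    def detect(self, text):
        """Return (label, matched_text) pairs for every locale pattern hit."""
        return [
            (label, match.group(0))
            for label, pattern in self.patterns.items()
            for match in pattern.finditer(text)
        ]

    def redact(self, text):
        """Replace each match with a bracketed label such as [DNI]."""
        for label, pattern in self.patterns.items():
            text = pattern.sub(f"[{label}]", text)
        return text
```

With this sketch, `LocalePHIEngine("es").redact("DNI 12345678Z")` returns `"DNI [DNI]"`. Keeping the patterns in a plain dict keyed by locale would let new locales be added without touching the engine class.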

Files to create/modify

  • medguard/guardrails/phi_i18n.py — locale pattern definitions
  • medguard/guardrails/phi.py — LocalePHIEngine class
  • medguard/config.py — add locale to PHIConfig
  • tests/test_phi.py — parametrized tests for each locale

Acceptance criteria

  • Spanish DNI 12345678Z detected and redacted
  • French INSEE 1 84 12 76 451 099 52 detected
  • No false positives on clean text in each locale
  • PHIConfig(locale="es") selects Spanish patterns
  • All existing English tests still pass
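The criteria above could be expressed as table-driven checks along these lines. This is only a sketch: `PATTERNS`, `redact`, and the `PHIConfig` dataclass below are trimmed-down hypothetical stand-ins so the snippet runs on its own, and the INSEE pattern is an assumed simplification. In tests/test_phi.py the `CASES` list would presumably feed `pytest.mark.parametrize` and call the real engine instead.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-ins; the real patterns would live in phi_i18n.py.
PATTERNS = {
    "en": {},  # English patterns belong to RegexPHIEngine, not shown here
    "es": {"DNI": re.compile(r"\b\d{8}[A-Z]\b")},
    "fr": {
        "INSEE": re.compile(
            r"\b[12] ?\d{2} ?\d{2} ?\d{2} ?\d{3} ?\d{3} ?\d{2}\b"
        )
    },
}


@dataclass
class PHIConfig:
    locale: str = "en"  # the new field proposed in this issue


def redact(text, config):
    """Redact using the patterns selected by config.locale."""
    for label, pattern in PATTERNS[config.locale].items():
        text = pattern.sub(f"[{label}]", text)
    return text


# (locale, input text, expected output)
CASES = [
    ("es", "DNI 12345678Z", "DNI [DNI]"),
    ("fr", "NIR 1 84 12 76 451 099 52", "NIR [INSEE]"),
    # Clean text must pass through unchanged (no false positives).
    ("es", "El paciente refiere dolor leve.", "El paciente refiere dolor leve."),
    ("fr", "Le patient se porte bien.", "Le patient se porte bien."),
]


def run_cases():
    for locale, text, expected in CASES:
        assert redact(text, PHIConfig(locale=locale)) == expected, (locale, text)


run_cases()
```

The clean-text rows are worth extending with realistic clinical notes per locale, since short digit runs (phone extensions, lab values) are the likeliest source of false positives.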
