Core Concepts
Worka PII uses a small set of composable primitives that keep the pipeline explicit and testable.
Entity types
An entity type describes what kind of sensitive value was detected. Built-in types include Email, Phone, IpAddress, CreditCard, and Person. Custom entity types can be added when you need domain-specific identifiers such as EmployeeId or TicketNumber.
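As a rough illustration of how built-in and custom entity types can sit side by side, here is a minimal Python sketch. The built-in names come from the text above; `EntityType`, `make_registry`, and `register_custom` are hypothetical names invented for this example and are not Worka PII's API.

```python
from dataclasses import dataclass

# Built-in names as listed in the text; everything else here is illustrative.
BUILT_IN = ("Email", "Phone", "IpAddress", "CreditCard", "Person")


@dataclass(frozen=True)
class EntityType:
    """A label for one kind of sensitive value (hypothetical model)."""
    name: str
    custom: bool = False


def make_registry():
    """Return a fresh registry holding only the built-in entity types."""
    return {name: EntityType(name) for name in BUILT_IN}


def register_custom(registry, name):
    """Add a domain-specific entity type, refusing duplicate names."""
    if name in registry:
        raise ValueError(f"entity type already registered: {name}")
    registry[name] = EntityType(name, custom=True)
    return registry[name]


registry = make_registry()
employee_id = register_custom(registry, "EmployeeId")
```

Rejecting duplicate names keeps a custom type from silently shadowing a built-in one, which is one plausible design choice rather than documented behavior.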
Recognizers
Recognizers emit candidate detections. They come in several forms:
- Regex recognizers for structured patterns.
- Validator recognizers that confirm a candidate's checksum or format.
- Dictionary recognizers for curated terms.
- NER recognizers backed by an NLP engine.
Each recognizer produces candidates with a score, offsets, and an entity type.
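The sketch below shows two of these forms in Python: a regex recognizer for email addresses and a regex recognizer whose matches are then confirmed by a Luhn checksum validator. The `Candidate` fields mirror the score, offsets, and entity type described above; the patterns, the score values, and all function names are assumptions made for illustration, not the library's own.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    start: int    # offset of the first character of the match
    end: int      # offset one past the last character
    score: float  # illustrative confidence value


# Deliberately simple patterns; real recognizers would be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def recognize_emails(text):
    """Regex recognizer: every pattern match becomes a candidate."""
    return [Candidate("Email", m.start(), m.end(), 0.9)
            for m in EMAIL_RE.finditer(text)]


def recognize_cards(text):
    """Regex plus validator: keep only matches that pass the checksum."""
    return [Candidate("CreditCard", m.start(), m.end(), 0.95)
            for m in CARD_RE.finditer(text)
            if luhn_valid(m.group())]
```

Calling both functions on the same text yields independent candidate lists, which a later stage can merge.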
NLP artifacts
The NLP engine produces artifacts such as tokens, offsets, lemmas, part-of-speech (POS) tags, and optional NER spans. Recognizers operate on these artifacts, so the pipeline is deterministic and reproducible.
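To make the shape of these artifacts concrete, here is a minimal Python sketch of an immutable artifact container and a recognizer-style lookup that reads from it. The `Token` and `NlpArtifacts` classes, the toy tokenizer, and the use of lowercasing as a stand-in for real lemmatization are all assumptions for illustration; an actual NLP engine would supply real lemmas, POS tags, and NER spans.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    text: str
    start: int                 # character offset where the token begins
    end: int                   # character offset one past the token
    lemma: str                 # placeholder: lowercased surface form
    pos: Optional[str] = None  # left empty by this toy engine


@dataclass(frozen=True)
class NlpArtifacts:
    tokens: Tuple[Token, ...]
    # Optional NER spans as (start, end, label); empty in this sketch.
    ner_spans: Tuple[Tuple[int, int, str], ...] = ()


# Words, or single non-space punctuation characters.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def analyze(text):
    """Toy engine: the same text always produces the same artifacts."""
    tokens = tuple(
        Token(m.group(), m.start(), m.end(), m.group().lower())
        for m in TOKEN_RE.finditer(text)
    )
    return NlpArtifacts(tokens=tokens)


def find_term(artifacts, term):
    """Return (start, end) offsets of tokens whose lemma equals `term`."""
    wanted = term.lower()
    return [(t.start, t.end) for t in artifacts.tokens if t.lemma == wanted]
```

Because the artifacts are frozen values, several recognizers can read them without one changing what another sees.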
Candidates and detections
The decision engine merges candidates and resolves them into final detections. Overlaps are resolved using stable scoring rules, so the same input yields the same detections in the same order.
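One way to get this kind of stable behavior is a greedy pass over candidates ranked by a total ordering. The Python sketch below assumes a specific rule set (higher score first, then longer span, then earlier start, then entity type name as a final tie-breaker); the text does not specify the actual rules, so treat both the ordering and the names `Candidate` and `resolve` as illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    start: int
    end: int
    score: float


def overlaps(a, b):
    """True if the half-open spans [start, end) share any character."""
    return a.start < b.end and b.start < a.end


def resolve(candidates):
    """Pick non-overlapping winners under an assumed, deterministic ranking."""
    ranked = sorted(
        candidates,
        key=lambda c: (-c.score, -(c.end - c.start), c.start, c.entity_type),
    )
    kept = []
    for cand in ranked:
        # A candidate survives only if no better-ranked winner overlaps it.
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    # Emit detections in document order so output order is stable too.
    return sorted(kept, key=lambda c: (c.start, c.end, c.entity_type))
```

Because the sort key distinguishes any two candidates that differ in any field, the result does not depend on the order in which recognizers happened to emit their candidates.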
Policies and anonymization
Policies define what happens to each entity type. Anonymization operators include redact, mask, replace, and hash. The anonymizer uses the detected spans and policies to produce anonymized output and an audit log of replacements.
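The following Python sketch shows how a policy table, the four operators, and an audit log might fit together. The operator names come from the text; their exact semantics here are assumptions (redact removes the value, mask substitutes asterisks, replace inserts a type placeholder, hash inserts a truncated SHA-256 digest), as are the `Detection` class, the `anonymize` function, and the audit entry fields. The sketch also assumes the detections do not overlap.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int


def apply_operator(operator, value, entity_type):
    """Return the replacement text for one detected value (assumed semantics)."""
    if operator == "redact":
        return ""
    if operator == "mask":
        return "*" * len(value)
    if operator == "replace":
        return f"<{entity_type}>"
    if operator == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    raise ValueError(f"unknown operator: {operator}")


def anonymize(text, detections, policies, default="redact"):
    """Apply per-type policies; return (output text, audit log)."""
    pieces, audit, cursor = [], [], 0
    for det in sorted(detections, key=lambda d: d.start):
        operator = policies.get(det.entity_type, default)
        replacement = apply_operator(
            operator, text[det.start:det.end], det.entity_type
        )
        pieces.append(text[cursor:det.start])  # untouched text before the span
        pieces.append(replacement)
        audit.append({
            "entity_type": det.entity_type,
            "start": det.start,
            "end": det.end,
            "operator": operator,
            "replacement": replacement,
        })
        cursor = det.end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces), audit
```

The audit entries record offsets into the original text rather than the original values, so the log itself does not reintroduce the sensitive data.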