Core Concepts

Worka PII uses a small set of composable primitives that keep the pipeline explicit and testable.

Entity types

An entity type describes what kind of sensitive value was detected. Built-in types include Email, Phone, IpAddress, CreditCard, and Person. Custom entity types can be added when you need domain-specific identifiers such as EmployeeId or TicketNumber.
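As a rough illustration of how built-in and custom entity types might sit side by side, here is a minimal Python sketch. The names `EntityType`, `EntityRegistry`, and `register` are hypothetical and are not taken from the Worka PII API; only the entity type names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityType:
    """Hypothetical stand-in for an entity type (not the Worka PII API)."""
    name: str
    builtin: bool = True


# Built-in type names listed in the text above.
BUILTIN_NAMES = ("Email", "Phone", "IpAddress", "CreditCard", "Person")


class EntityRegistry:
    """Assumed registry that holds built-in and custom entity types."""

    def __init__(self):
        self._types = {name: EntityType(name) for name in BUILTIN_NAMES}

    def register(self, name):
        # Custom types are rejected if the name is already taken.
        if name in self._types:
            raise ValueError(f"entity type already registered: {name}")
        entity_type = EntityType(name, builtin=False)
        self._types[name] = entity_type
        return entity_type

    def get(self, name):
        return self._types[name]

    def names(self):
        # Sorted so that listing is stable across runs.
        return sorted(self._types)


registry = EntityRegistry()
employee_id = registry.register("EmployeeId")
ticket_number = registry.register("TicketNumber")
```

Keeping custom types in the same registry as the built-ins means recognizers and policies can refer to both by name in the same way.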

Recognizers

Recognizers emit candidate detections. They come in several forms:

Regex recognizers for structured patterns.
Validator recognizers that confirm candidate checksums or formats.
Dictionary recognizers for curated terms.
NER recognizers backed by an NLP engine.

Each recognizer produces candidates with a score, offsets, and an entity type.
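The sketch below shows one way a regex recognizer and a validator recognizer could cooperate, with each candidate carrying a score, offsets, and an entity type. It is written in Python for illustration only: `Candidate`, `regex_recognizer`, `credit_card_recognizer`, the patterns, and the score values are all assumptions rather than Worka PII's actual API or defaults. The validator uses the standard Luhn checksum.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record: type, offsets, score, matched text."""
    entity_type: str
    start: int
    end: int
    score: float
    text: str


# Illustrative patterns only; real recognizers would be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def regex_recognizer(text, pattern, entity_type, score):
    """Emit one candidate per regex match, with a fixed (assumed) score."""
    return [
        Candidate(entity_type, m.start(), m.end(), score, m.group())
        for m in pattern.finditer(text)
    ]


def credit_card_recognizer(text):
    """Regex finds card-like digit runs; the Luhn validator confirms them."""
    confirmed = []
    for c in regex_recognizer(text, CARD_RE, "CreditCard", 0.5):
        if luhn_valid(c.text):
            # Raise the score once the checksum has been confirmed.
            confirmed.append(Candidate(c.entity_type, c.start, c.end, 0.95, c.text))
    return confirmed


sample = "Mail ana@example.com, card 4111 1111 1111 1111, ref 1234 5678 9012 3456."
emails = regex_recognizer(sample, EMAIL_RE, "Email", 0.85)
cards = credit_card_recognizer(sample)
```

In this sketch the second digit run is dropped because it fails the checksum, which is the kind of false positive a validator recognizer is meant to filter out.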

NLP artifacts

The NLP engine produces artifacts such as tokens, offsets, lemmas, part-of-speech (POS) tags, and optional NER spans. Recognizers operate on these artifacts so the pipeline is deterministic and reproducible.
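To make the idea of artifacts concrete, the following Python sketch computes them once as immutable records and lets a dictionary recognizer read them. Everything here is hypothetical: `Token`, `NlpArtifacts`, `simple_engine`, and `dictionary_recognizer` are illustrative names, and the toy engine (word tokenizer, lowercase "lemma", no POS or NER) merely stands in for a real NLP engine.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token artifact with offsets, lemma, and optional POS."""
    text: str
    start: int
    end: int
    lemma: str
    pos: Optional[str] = None


@dataclass(frozen=True)
class NlpArtifacts:
    """Immutable bundle of everything the (assumed) engine produced."""
    text: str
    tokens: Tuple[Token, ...]
    ner_spans: Tuple[Tuple[int, int, str], ...] = ()


def simple_engine(text):
    """Toy engine: word tokens, lowercase form as lemma, no POS or NER."""
    tokens = tuple(
        Token(m.group(), m.start(), m.end(), m.group().lower())
        for m in re.finditer(r"\w+", text)
    )
    return NlpArtifacts(text, tokens)


def dictionary_recognizer(artifacts, terms, entity_type):
    """Match curated terms against token lemmas, not against raw text."""
    return [
        (entity_type, tok.start, tok.end)
        for tok in artifacts.tokens
        if tok.lemma in terms
    ]


artifacts = simple_engine("Ticket for Alice was closed by alice")
hits = dictionary_recognizer(artifacts, {"alice"}, "Person")
```

Because the artifacts are frozen and computed from the text alone, running the engine twice on the same input yields equal artifacts, and any recognizer reading them sees the same tokens and offsets.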

Candidates and detections

The decision engine merges candidates and resolves them into final detections. Overlaps are resolved using stable scoring rules so the same input yields the same detections in the same order.
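One possible set of stable scoring rules is sketched below in Python. The tie-break order (higher score, then longer span, then earlier start, then entity type name) is an assumption chosen for illustration, not the documented behaviour of the decision engine; `Candidate` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate: entity type, half-open offsets, score."""
    entity_type: str
    start: int
    end: int
    score: float


def resolve(candidates):
    """Greedy overlap resolution with a total, input-order-independent ranking."""
    ranked = sorted(
        set(candidates),  # drop exact duplicates before ranking
        key=lambda c: (-c.score, -(c.end - c.start), c.start, c.entity_type),
    )
    kept = []
    for c in ranked:
        # Keep a candidate only if it overlaps nothing already accepted.
        if all(c.end <= k.start or c.start >= k.end for k in kept):
            kept.append(c)
    # Emit detections in document order.
    return sorted(kept, key=lambda c: (c.start, c.end))


candidates = [
    Candidate("Phone", 10, 22, 0.6),
    Candidate("CreditCard", 10, 29, 0.95),
    Candidate("Email", 40, 55, 0.85),
    Candidate("Person", 50, 58, 0.85),
]
detections = resolve(candidates)
```

Since the sort key orders every distinct candidate unambiguously, shuffling the input list does not change which candidates win or the order in which detections are returned.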

Policies and anonymization

Policies define what happens to each entity type. Anonymization operators include redact, mask, replace, and hash. The anonymizer uses the detected spans and policies to produce anonymized output and an audit log of replacements.
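The Python sketch below shows how a per-type policy could drive the four operators and produce an audit log. The operator semantics are assumptions made for the example (redact removes the value, mask keeps the last four characters, replace substitutes a type placeholder, hash emits a truncated SHA-256 digest), and `Detection`, `apply_operator`, and `anonymize` are hypothetical names rather than Worka PII's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical resolved detection with half-open character offsets."""
    entity_type: str
    start: int
    end: int


def apply_operator(op, value, entity_type):
    """Apply one operator; the exact semantics here are assumed."""
    if op == "redact":
        return ""
    if op == "mask":
        keep = 4 if len(value) > 4 else 0
        return "*" * (len(value) - keep) + value[len(value) - keep:]
    if op == "replace":
        return f"<{entity_type}>"
    if op == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    raise ValueError(f"unknown operator: {op}")


def anonymize(text, detections, policies, default="redact"):
    """Rewrite non-overlapping detections left to right and log each change."""
    pieces, audit, cursor = [], [], 0
    for d in sorted(detections, key=lambda d: d.start):
        op = policies.get(d.entity_type, default)
        replacement = apply_operator(op, text[d.start:d.end], d.entity_type)
        pieces.append(text[cursor:d.start])
        pieces.append(replacement)
        # The log records what was done, not the original sensitive value.
        audit.append({
            "entity_type": d.entity_type,
            "start": d.start,
            "end": d.end,
            "operator": op,
            "replacement": replacement,
        })
        cursor = d.end
    pieces.append(text[cursor:])
    return "".join(pieces), audit


text = "Contact ana@example.com or 555-0100."
detections = [Detection("Email", 8, 23), Detection("Phone", 27, 35)]
policies = {"Email": "replace", "Phone": "mask"}
output, audit = anonymize(text, detections, policies)
```

With these assumed policies, `output` is `Contact <Email> or ****0100.` and `audit` holds one entry per replacement, keyed to the original offsets.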
