Re-identification risk accumulates over time. Most de-identification pipelines ignore that. This system tracks cumulative exposure across modalities and adjusts masking strength dynamically.
Standard de-identification pipelines treat every record as isolated. Detect PHI, remove it, move on. That works for single documents. It breaks down in streaming systems where the same patient appears across hundreds of events over time: clinical notes, ASR transcripts, imaging metadata, waveform headers.
Conventional pipelines: per-document masking. No memory of prior events. No model of cumulative risk. Risk accumulates invisibly across the stream.
This system: stateful exposure tracking. Rolling risk computation across modalities and time. Masking strength proportional to actual accumulated risk.
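To make the stateful idea concrete, here is a minimal sketch of rolling, cross-modality risk tracking that maps accumulated risk to a masking policy. It is illustrative only and is not the phi-exposure-guard API: `ExposureTracker`, `MODALITY_WEIGHT`, `half_life_s`, the exponential decay, and the thresholds in `choose_policy` are all assumptions. Only the policy names (`weak`, `pseudo`, `redact`) come from the results table.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical per-modality weights (assumed, not taken from the project).
MODALITY_WEIGHT = {"note": 1.0, "asr": 0.8, "image_meta": 0.6, "waveform": 0.4}


@dataclass
class ExposureTracker:
    """Keeps per-patient event history and computes a decayed risk score."""

    half_life_s: float = 3600.0  # assumed: exposure halves in weight every hour
    events: dict = field(default_factory=lambda: defaultdict(list))

    def add_event(self, patient_id: str, modality: str, phi_units: float, t: float) -> None:
        # phi_units is an assumed scalar measure of how much PHI the event exposed.
        self.events[patient_id].append((t, modality, phi_units))

    def risk(self, patient_id: str, now: float) -> float:
        total = 0.0
        for t, modality, phi_units in self.events[patient_id]:
            age = max(0.0, now - t)
            decay = 0.5 ** (age / self.half_life_s)
            total += MODALITY_WEIGHT.get(modality, 1.0) * phi_units * decay
        # Squash the weighted sum into [0, 1) so thresholds are easy to reason about.
        return 1.0 - math.exp(-total)


def choose_policy(risk: float) -> str:
    """Map accumulated risk to a masking policy (thresholds are assumptions)."""
    if risk < 0.3:
        return "weak"
    if risk < 0.7:
        return "pseudo"
    return "redact"


if __name__ == "__main__":
    tracker = ExposureTracker()
    stream = [("note", 0.2), ("asr", 1.0), ("image_meta", 2.0)]
    for modality, phi_units in stream:
        tracker.add_event("patient-1", modality, phi_units, t=0.0)
        r = tracker.risk("patient-1", now=0.0)
        print(modality, round(r, 3), choose_policy(r))
```

The point of the sketch is that the policy depends on the patient's history, not on the current record alone: the same event can be masked lightly early in a stream and heavily later on.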
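The adaptive controller keeps the utility proxy at 1.0 while holding total leakage at 0.56, close to the redact floor of 0.51, and avoids the utility collapse (0.51) that full redaction causes. The cost is latency: its mean is 1.16 ms, versus roughly 0.12 to 0.16 ms for the static policies. All results were generated from fully synthetic data.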
| Policy | Leak Total | Utility Proxy | Mean Latency (ms) | P90 Latency (ms) |
|---|---|---|---|---|
| raw | 3.03 | 1.0 | 0.123 | 0.154 |
| weak | 2.0 | 0.51 | 0.141 | 0.18 |
| pseudo | 0.51 | 1.0 | 0.159 | 0.188 |
| redact | 0.51 | 0.51 | 0.157 | 0.192 |
| adaptive ★ | 0.56 | 1.0 | 1.16 | 1.237 |
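The trade-offs in the table are easier to compare as ratios against the `raw` baseline. The sketch below copies the table values and derives leakage reduction and latency overhead per policy. The `summarize` helper and its field names are hypothetical and exist only for this illustration; they are not part of the benchmark code.

```python
# Values copied from the results table: (leak_total, utility_proxy, mean_latency_ms).
ROWS = {
    "raw": (3.03, 1.0, 0.123),
    "weak": (2.0, 0.51, 0.141),
    "pseudo": (0.51, 1.0, 0.159),
    "redact": (0.51, 0.51, 0.157),
    "adaptive": (0.56, 1.0, 1.16),
}


def summarize(rows: dict, baseline: str = "raw") -> dict:
    """Express each policy's leakage and latency relative to the baseline row."""
    base_leak, _, base_latency = rows[baseline]
    summary = {}
    for name, (leak, utility, latency) in rows.items():
        summary[name] = {
            "leak_reduction": (base_leak - leak) / base_leak,
            "utility": utility,
            "latency_ratio": latency / base_latency,
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(ROWS).items():
        print(
            f"{name:9s} leak reduction {stats['leak_reduction']:.1%}  "
            f"utility {stats['utility']:.2f}  latency x{stats['latency_ratio']:.1f}"
        )
```

Read this way, the adaptive policy removes roughly 82% of the raw leakage at about 9.4 times the raw mean latency, which is still close to one millisecond per event.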
Each module handles a distinct layer of the exposure-aware masking pipeline.
Add events for a patient across modalities and watch risk accumulate in real time. Try adding the same patient ID with different modalities to trigger cross-modal detection.
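The cross-modal trigger described above can be approximated in a few lines. This is a rough sketch under stated assumptions, not the demo's actual logic: `CrossModalDetector` and `min_modalities` are hypothetical names, and the rule used here (flag a patient once events arrive from at least two distinct modalities) is an assumed simplification.

```python
from collections import defaultdict


class CrossModalDetector:
    """Flags a patient once their events span several distinct modalities."""

    def __init__(self, min_modalities: int = 2):
        # Assumed threshold: two distinct modalities count as cross-modal exposure.
        self.min_modalities = min_modalities
        self.seen = defaultdict(set)

    def observe(self, patient_id: str, modality: str) -> bool:
        """Record one event; return True if the patient is now cross-modal."""
        self.seen[patient_id].add(modality)
        return len(self.seen[patient_id]) >= self.min_modalities


if __name__ == "__main__":
    detector = CrossModalDetector()
    for patient_id, modality in [
        ("patient-1", "note"),
        ("patient-1", "note"),  # same modality again: no new linkage
        ("patient-1", "asr"),   # second modality for the same patient
        ("patient-2", "asr"),   # different patient: tracked separately
    ]:
        print(patient_id, modality, detector.observe(patient_id, modality))
```

Repeating a modality does not trip the detector; only a new modality for the same patient ID does, which mirrors the suggestion to reuse one patient ID across modalities.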
Run the full benchmark locally. Results are written to the results/ directory.
```bash
# Install
pip install phi-exposure-guard

# Run the benchmark
python -m amphi_rl_dpgraph.run_demo
```
Or open the Colab notebook (no setup required):
▶ Open in Colab

If you use this system or dataset in academic or technical work, please cite via the CITATION.cff file in the GitHub repository.