Automated de-identification and unstructured textual electronic medical record data in Manitoba

McDermott, Katelin
Introduction: Unstructured textual electronic medical record (EMR) data contain valuable patient details that can benefit health research. Personal health information (PHI) must be de-identified before EMR data can be used for secondary purposes. A considerable amount of de-identification research has been conducted using existing synthetic, de-identified, and annotated data sets. To date, little is known about how the existing de-identification literature applies to unstructured EMR data in Manitoba.
Objectives: The research objectives were to: 1) categorize the types and frequency of PHI in Manitoba EMR data, 2) assess the applicability of the de-identification literature to Manitoba EMR data, and 3) test how NLM-Scrubber, an existing, previously validated de-identification tool, redacts PHI in Manitoba EMR data.
Methods: The Manitoba data set comprised 750 unstructured textual EMR encounter notes from 2003 to 2017 from the Manitoba Primary Care Research Network. In-scope PHI included name, personal health information number, address, phone number, and date (excluding year). Two annotators tagged PHI in the Manitoba data using the Visual Tagging Tool. The Manitoba data were compared with the 2014 i2b2 corpus in terms of note compilation and PHI prevalence. NLM-Scrubber’s de-identification of the Manitoba data was assessed using performance measures and tested against the null hypothesis that NLM-Scrubber would recall ≥87% of PHI in the Manitoba data.
Results: The Manitoba EMR data contained 3,314 PHI instances, corresponding to a PHI prevalence of 1.6%. All in-scope PHI types were present. The Manitoba data offered more independent notes and a broader variety of note types than the i2b2 corpus. Names made up nearly twice the share of PHI instances in the Manitoba EMR data as in the i2b2 corpus (62% and 32%, respectively), while dates made up a smaller share (31% and 55%, respectively). NLM-Scrubber’s PHI recall was 75.4% (95% CI, 72.9-77.8%), leading to rejection of the null hypothesis.
Conclusion: Direct and indirect PHI represent a small proportion of Manitoba EMR data. De-identification literature may have limited applicability to Manitoba EMR data. NLM-Scrubber may not be acceptable for use on Manitoba EMR data due to its low recall performance. Attention should be directed to trained machine learning solutions that enable customization, adjustment of rule-based methods, and pseudo PHI to protect patient privacy.
electronic medical records, de-identification, unstructured data, personal health information, machine learning