Automated de-identification and unstructured textual electronic medical record data in Manitoba

McDermott, Katelin

dc.contributor.author	McDermott, Katelin
dc.contributor.examiningcommittee	Lix, Lisa (Community Health Sciences)	en_US
dc.contributor.examiningcommittee	Singer, Alexander (Family Medicine)	en_US
dc.contributor.supervisor	Katz, Alan
dc.date.accessioned	2022-08-29T13:26:16Z
dc.date.available	2022-08-29T13:26:16Z
dc.date.copyright	2022-08-22
dc.date.issued	2022-08-17
dc.date.submitted	2022-08-22T22:53:16Z	en_US
dc.degree.discipline	Community Health Sciences	en_US
dc.degree.level	Master of Science (M.Sc.)	en_US
dc.description.abstract	Introduction: Unstructured textual electronic medical record (EMR) data contain valuable patient details that can benefit health research. Personal health information (PHI) must be de-identified before EMR data can be used for secondary purposes. A considerable amount of de-identification research has been conducted using existing synthetic, de-identified, and annotated data sets. To date, little is known about how the existing de-identification literature applies to unstructured EMR data in Manitoba. Objectives: The research objectives were to: 1) categorize the types and frequency of PHI in Manitoba EMR data, 2) assess the applicability of the de-identification literature to Manitoba EMR data, and 3) test how NLM-Scrubber, an existing and previously validated de-identification tool, redacts PHI in Manitoba EMR data. Methods: The Manitoba data set comprised 750 unstructured textual EMR encounter notes from 2003 to 2017 from the Manitoba Primary Care Research Network. In-scope PHI included name, personal health information number, address, phone number, and date (excluding year). Two annotators tagged PHI in the Manitoba data using the Visual Tagging Tool. A comparison of the Manitoba data and the 2014 i2b2 corpus examined note compilation and PHI prevalence. NLM-Scrubber’s de-identification of the Manitoba data was assessed using performance measures and tested against the null hypothesis that NLM-Scrubber will recall ≥87% of PHI in Manitoba data. Results: The Manitoba EMR data contained 3,314 PHI instances, demonstrating 1.6% PHI prevalence. All in-scope PHI types were present. The Manitoba data offered more independent notes and a broader variety of note types than the i2b2 corpus. Names made up nearly twice the share of PHI instances in the Manitoba EMR data as in the i2b2 corpus (62% and 32%, respectively), while dates made up a smaller share (31% and 55%, respectively). NLM-Scrubber’s PHI recall was 75.4% (95% CI, 72.9-77.8%), leading to rejection of the null hypothesis. Conclusion: Direct and indirect PHI represent a small proportion of Manitoba EMR data. De-identification literature may have limited applicability to Manitoba EMR data. NLM-Scrubber may not be acceptable for use on Manitoba EMR data due to its low recall performance. Attention should be directed to trained machine learning solutions that enable customization, adjustment of rule-based methods, and pseudo PHI to protect patient privacy.	en_US
dc.description.note	October 2022	en_US
dc.identifier.uri	http://hdl.handle.net/1993/36794
dc.language.iso	eng	en_US
dc.rights	open access	en_US
dc.subject	electronic medical records	en_US
dc.subject	de-identification	en_US
dc.subject	unstructured data	en_US
dc.subject	personal health information	en_US
dc.subject	machine learning	en_US
dc.title	Automated de-identification and unstructured textual electronic medical record data in Manitoba	en_US
dc.type	master thesis	en_US
local.subject.manitoba	yes	en_US
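
Note on the recall measure reported in the abstract: recall is the share of annotated PHI instances that the tool redacted, and the abstract compares its 95% confidence interval with an 87% threshold. The sketch below illustrates one way such a figure could be computed. It is not the thesis's method: the record does not state how the interval was calculated, so the Wilson score interval is an assumption, and the counts (754 redacted, 246 missed) are hypothetical values chosen only to give a recall near 75.4%.

```python
import math


def recall_with_wilson_ci(true_positives, false_negatives, z=1.96):
    """Return (recall, lower, upper) using a Wilson score interval.

    true_positives: PHI instances the tool redacted.
    false_negatives: PHI instances the tool missed.
    z: normal quantile; 1.96 corresponds to a two-sided 95% interval.
    """
    n = true_positives + false_negatives
    if n == 0:
        raise ValueError("at least one annotated PHI instance is required")
    p = true_positives / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, centre - half_width, centre + half_width


# Hypothetical counts, NOT taken from the thesis.
recall, lower, upper = recall_with_wilson_ci(true_positives=754, false_negatives=246)

# Assumed decision rule: treat "recall >= 87%" as rejected when the
# whole interval lies below the threshold.
THRESHOLD = 0.87
rejects_null = upper < THRESHOLD

print(f"recall={recall:.3f}, 95% CI=({lower:.3f}, {upper:.3f}), reject={rejects_null}")
```

With these illustrative counts the whole interval falls below 0.87, which mirrors the direction of the reported result; the thesis's actual counts and test procedure may differ.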

Files

Original bundle

Name: mcdermott_katelin.pdf
Size: 846.8 KB
Format: Adobe Portable Document Format
Description: Thesis

Collections

FGS - Electronic Theses and Practica
Manitoba Heritage Theses