Automated de-identification and unstructured textual electronic medical record data in Manitoba

dc.contributor.author: McDermott, Katelin
dc.contributor.examiningcommittee: Lix, Lisa (Community Health Sciences) [en_US]
dc.contributor.examiningcommittee: Singer, Alexander (Family Medicine) [en_US]
dc.contributor.supervisor: Katz, Alan
dc.date.accessioned: 2022-08-29T13:26:16Z
dc.date.available: 2022-08-29T13:26:16Z
dc.date.copyright: 2022-08-22
dc.date.issued: 2022-08-17
dc.date.submitted: 2022-08-22T22:53:16Z [en_US]
dc.degree.discipline: Community Health Sciences [en_US]
dc.degree.level: Master of Science (M.Sc.) [en_US]
dc.description.abstract: Introduction: Unstructured textual electronic medical record (EMR) data contain valuable patient details that can benefit health research. Personal health information (PHI) must be de-identified before EMR data can be used for secondary purposes. A considerable amount of de-identification research has been conducted using existing synthetic, de-identified, and annotated data sets. To date, little is known about how the existing de-identification literature applies to unstructured EMR data in Manitoba.
Objectives: The research objectives were to: 1) categorize the types and frequency of PHI in Manitoba EMR data, 2) assess the applicability of the de-identification literature to Manitoba EMR data, and 3) test how NLM-Scrubber, an existing, previously validated de-identification tool, redacts PHI in Manitoba EMR data.
Methods: The Manitoba data set comprised 750 unstructured textual EMR encounter notes, dated 2003 to 2017, from the Manitoba Primary Care Research Network. In-scope PHI included name, personal health information number, address, phone number, and date (excluding year). Two annotators tagged PHI in the Manitoba data using the Visual Tagging Tool. The Manitoba data were compared with the 2014 i2b2 corpus in terms of note compilation and PHI prevalence. NLM-Scrubber's de-identification of the Manitoba data was assessed using performance measures and tested against the null hypothesis that NLM-Scrubber would recall ≥87% of PHI in the Manitoba data.
Results: The Manitoba EMR data contained 3,314 PHI instances, a PHI prevalence of 1.6%. All in-scope PHI types were present. The Manitoba data offered more independent notes and a broader variety of note types than the i2b2 corpus. Proportionally, the Manitoba EMR data contained nearly twice as many name PHI instances as the i2b2 corpus (62% vs. 32%) but fewer date instances (31% vs. 55%). NLM-Scrubber's PHI recall was 75.4% (95% CI, 72.9-77.8%), leading to rejection of the null hypothesis.
Conclusion: Direct and indirect PHI represent a small proportion of Manitoba EMR data. The de-identification literature may have limited applicability to Manitoba EMR data. NLM-Scrubber may not be acceptable for use on Manitoba EMR data because of its low recall. Attention should be directed to trained machine learning solutions that enable customization, adjustment of rule-based methods, and pseudo PHI to protect patient privacy. [en_US]
dc.description.note: October 2022 [en_US]
dc.identifier.uri: http://hdl.handle.net/1993/36794
dc.language.iso: eng [en_US]
dc.rights: open access [en_US]
dc.subject: electronic medical records [en_US]
dc.subject: de-identification [en_US]
dc.subject: unstructured data [en_US]
dc.subject: personal health information [en_US]
dc.subject: machine learning [en_US]
dc.title: Automated de-identification and unstructured textual electronic medical record data in Manitoba [en_US]
dc.type: master thesis [en_US]
local.subject.manitoba: yes [en_US]
Files
Original bundle
Name: mcdermott_katelin.pdf
Size: 846.8 KB
Format: Adobe Portable Document Format
Description: Thesis
License bundle
Name: license.txt
Size: 2.2 KB
Format: Item-specific license agreed to upon submission