Measuring the quality of unstructured text in routinely collected electronic health data: a review and application

dc.contributor.author: Nesca, Marcello
dc.contributor.examiningcommittee: Katz, Alan (Community Health Sciences)
dc.contributor.examiningcommittee: Leung, Carson (Computer Science)
dc.contributor.supervisor: Lix, Lisa (Community Health Sciences)
dc.date.accessioned: 2022-01-11T14:37:29Z
dc.date.available: 2022-01-11T14:37:29Z
dc.date.copyright: 2022-01-07
dc.date.issued: 2021-11-29
dc.date.submitted: 2022-01-07T15:25:54Z
dc.degree.discipline: Community Health Sciences
dc.degree.level: Master of Science (M.Sc.)
dc.description.abstract:
Introduction: Routinely collected electronic health data (RCEHD) can comprise structured, semi-structured, or unstructured information. Electronic medical records (EMRs), one type of RCEHD, often contain unstructured text data (UTD), which are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. At present, there are few studies about the specific types of NLP methods used to preprocess UTD to address data quality issues prior to analysis or modelling.
Purpose & Objectives: The purpose was to examine preprocessing methods for UTD and evaluate the quality of UTD in EMRs. The objectives were to: 1) systematically document current research and practices for preprocessing UTD to describe or improve its quality, and 2) apply data quality indicators identified from current research and practices to UTD in EMRs from the Manitoba Primary Care Research Network and describe the quality of these data.
Methods: Objective 1 involved a scoping review. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature on current research and practices to prepare UTD for analysis, up to and including 2021. For objective 2, a case study was undertaken in which data quality indicators and preprocessing methods identified in the scoping review were applied to UTD from EMRs.
Results: In total, 41 articles were included in the scoping review for objective 1; over 50% were published between 2016 and 2021, and over 90% were empirical research articles. Data quality indicator topics for UTD in EMRs included misspelled words, security, word variability, sources of noise, quality of annotations, ambiguous abbreviations, and manual annotations. For objective 2, we selected 193,206 clinical encounter notes from EMRs between 1985 and 2020. Overall, the clinical encounter notes contained an average (standard deviation [SD]) of 27.3 (27.0) stop words, 25.7 (27.8) punctuation symbols, 12.1 (11.1) spelling errors, and 2.9 (2.6) special characters. The average (SD) length of a clinical encounter note was 555.8 (551.1) characters and 71.5 (59.7) words. Lexical diversity had a mean (SD) of 86.2 (11.9).
Conclusion: This study identified multiple data quality indicators that have been used to preprocess UTD in published literature and demonstrated their application to real-world data.
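The per-note indicators reported in the abstract (stop-word counts, punctuation and special-character counts, note length in characters and words, and lexical diversity) can be computed with basic text processing. The Python sketch below illustrates one possible implementation; the abbreviated stop-word list, the definition of "special characters," and the use of a type-token ratio for lexical diversity are illustrative assumptions, not the thesis's actual definitions (which this record does not give), and spell-checking is omitted because it requires an external dictionary.

    import re
    import string
    from dataclasses import dataclass

    # Abbreviated stop-word list for illustration only; the thesis's actual
    # list (e.g., NLTK's English stop words) is not specified in this record.
    STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
                  "in", "is", "it", "of", "on", "that", "the", "to", "was", "with"}
    PUNCTUATION = set(string.punctuation)

    @dataclass
    class NoteProfile:
        n_chars: int
        n_words: int
        n_stop_words: int
        n_punctuation: int
        n_special_chars: int
        lexical_diversity: float  # type-token ratio, as a percentage

    def profile_note(text: str) -> NoteProfile:
        """Compute simple quality indicators for one clinical encounter note."""
        tokens = re.findall(r"[a-z']+", text.lower())
        n_punct = sum(ch in PUNCTUATION for ch in text)
        # Assumed definition of "special characters": anything that is not
        # alphanumeric, whitespace, or standard ASCII punctuation.
        n_special = sum(
            not (ch.isalnum() or ch.isspace() or ch in PUNCTUATION) for ch in text
        )
        # Type-token ratio as a percentage; the thesis's lexical diversity
        # measure may differ (e.g., MTLD).
        ttr = 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0
        return NoteProfile(
            n_chars=len(text),
            n_words=len(tokens),
            n_stop_words=sum(t in STOP_WORDS for t in tokens),
            n_punctuation=n_punct,
            n_special_chars=n_special,
            lexical_diversity=round(ttr, 1),
        )

    if __name__ == "__main__":
        note = "Pt seen today. C/O mild chest pain; BP 120/80, temp 37.2°C. F/U in 2 wks."
        print(profile_note(note))

Running profile_note over a corpus of notes and averaging the resulting fields would yield per-note means and SDs of the kind reported in the Results above.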
dc.description.note: February 2022
dc.identifier.uri: http://hdl.handle.net/1993/36163
dc.language.iso: eng
dc.rights: open access
dc.subject: Data quality
dc.subject: Natural language processing
dc.subject: Pre-processing unstructured text data
dc.subject: Electronic medical records
dc.subject: Health research
dc.title: Measuring the quality of unstructured text in routinely collected electronic health data: a review and application
dc.type: master thesis
local.subject.manitoba: yes
Files
Original bundle
Name: Nesca_Marcello.pdf
Size: 2.43 MB
Format: Adobe Portable Document Format
Description: main article
License bundle
Name: license.txt
Size: 2.2 KB
Format: Item-specific license agreed to upon submission