Measuring the quality of unstructured text in routinely collected electronic health data: a review and application

Nesca, Marcello
Introduction: Routinely collected electronic health data (RCEHD) can comprise structured, semi-structured, or unstructured information. Electronic medical records (EMRs), one type of RCEHD, often contain unstructured text data (UTD), which are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Few studies have examined which NLP methods are used to preprocess UTD to address data quality issues before analysis or modelling. Purpose & Objectives: The purpose was to examine preprocessing methods for UTD and evaluate the quality of UTD in EMRs. The objectives were to: 1) systematically document current research and practices for preprocessing UTD to describe or improve its quality, and 2) apply data quality indicators identified from current research and practices to UTD in EMRs from the Manitoba Primary Care Research Network and describe the quality of these data. Methods: Objective 1 involved a scoping review. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature on current research and practices to prepare UTD for analysis, up to and including 2021. For objective 2, a case study was undertaken in which data quality indicators and preprocessing methods identified in the scoping review were applied to UTD from EMRs. Results: Forty-one articles were included in the scoping review for objective 1; over 50% were published between 2016 and 2021 and over 90% were empirical research articles. Data quality indicator topics for UTD in EMRs included misspelled words, security, word variability, sources of noise, quality of annotations, ambiguous abbreviations, and manual annotations. For objective 2, we selected 193,206 clinical encounter notes from EMRs between 1985 and 2020.
Overall, the clinical encounter notes contained an average (standard deviation [SD]) of 27.3 (27.0) stop words, 25.7 (27.8) punctuation symbols, 12.1 (11.1) spelling errors, and 2.9 (2.6) special characters. The average (SD) length of a clinical encounter note was 555.8 (551.1) characters and 71.5 (59.7) words. Lexical diversity had a mean (SD) of 86.2 (11.9). Conclusion: This study identified multiple data quality indicators that have been used to preprocess UTD in published literature and demonstrated their application to real-world data.
Data quality, Natural language processing, pre-processing unstructured text data, Electronic Medical Records, Health research
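
The per-note indicators reported in the results (stop words, punctuation symbols, spelling errors, special characters, note length, and lexical diversity) could be computed along the following lines. This is only a minimal sketch, not the method used in the study: the function name `note_quality_indicators`, the tiny stop-word list, the vocabulary-lookup approach to spelling errors, the regular-expression tokenizer, and the definition of lexical diversity as the percentage of unique words among all words are all assumptions made for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "was", "for", "with", "on"})

# Assumed tokenizer: lowercase alphabetic words, optionally with one internal apostrophe.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def note_quality_indicators(text, stop_words=STOP_WORDS, vocabulary=None):
    """Return simple data quality indicators for one clinical note.

    vocabulary: optional set of known lowercase words. Tokens outside it are
    counted as spelling errors (an assumed stand-in for a real spell checker).
    """
    tokens = WORD_PATTERN.findall(text.lower())

    n_punctuation = sum(1 for ch in text if ch in string.punctuation)
    # Assumed definition: anything that is not a letter, digit, whitespace,
    # or ASCII punctuation counts as a special character.
    n_special = sum(
        1
        for ch in text
        if not ch.isalnum() and not ch.isspace() and ch not in string.punctuation
    )

    if vocabulary is None:
        n_spelling_errors = None  # cannot be estimated without a reference vocabulary
    else:
        n_spelling_errors = sum(1 for tok in tokens if tok not in vocabulary)

    # Assumed definition: type-token ratio expressed as a percentage.
    lexical_diversity = 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0

    return {
        "n_chars": len(text),
        "n_words": len(tokens),
        "n_stop_words": sum(1 for tok in tokens if tok in stop_words),
        "n_punctuation": n_punctuation,
        "n_special_characters": n_special,
        "n_spelling_errors": n_spelling_errors,
        "lexical_diversity": lexical_diversity,
    }


if __name__ == "__main__":
    # Made-up note and vocabulary, for illustration only.
    note = "Patient seen for cough and fever. Temp 38° today; patient stabel."
    known_words = {"patient", "seen", "for", "cough", "and", "fever", "temp", "today", "stable"}
    for name, value in note_quality_indicators(note, vocabulary=known_words).items():
        print(f"{name}: {value}")
```

Applying such a function to every note and then taking the mean and standard deviation of each indicator would give summary statistics of the kind reported above, although the exact values depend heavily on the tokenizer, stop-word list, and spelling reference chosen.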