Assessing the effect of preprocessing of clinical notes on classification tasks and similarity measures

Date
2024-01-23
Authors
Moni, Md Moniruzzaman
Abstract
Background: Unstructured text data (UTD) in electronic medical records (EMRs) can be challenging to use because of noise such as spelling errors, abbreviations, and punctuation symbols. Preprocessing is expected to improve the performance of statistical and machine learning models.

Objectives: The research objectives were to assess the effect of the number and order of preprocessing methods (1) on the detection of health conditions from UTD, (2) on clinical and demographic cohort selection criteria in UTD, (3) on the similarity of information contained in pairs of EMR notes for the same patient, and (4) on accurate de-identification of UTD.

Method: Study data were from the Informatics for Integrating Biology and the Bedside (i2b2) challenges; the 2008, 2014, and 2018 i2b2 datasets were used for the different objectives. Preprocessing methods included tokenization, removal of punctuation symbols, spelling correction, abbreviation expansion, word stemming, and lemmatization. A nested experimental design was adopted, in which the order of methods was nested within the number of methods. Balanced random forest, support vector machine, and bidirectional long short-term memory-conditional random field (BiLSTM-CRF) models were used for Objectives 1, 2, and 4, respectively, and model performance was evaluated by accuracy, sensitivity, specificity, F1 score, and precision. For Objective 3, cosine similarity was used to measure the similarity between pairs of notes. Analysis of variance (ANOVA) and descriptive statistics were used to test the research hypotheses.

Results: For Objectives 1 and 2, mean sensitivity, specificity, F1 score, accuracy, and precision were similar across the orders of methods and the numbers of methods. For Objective 3, cosine similarity scores were similar across the orders of methods. For Objective 4, the deep learning models could not be trained on the preprocessed data. The ANOVA F tests showed no significant effect of the order of methods at any number of methods, and identical mean values of the outcome variables implied no difference among the numbers of methods.

Conclusion: The order and number of preprocessing methods had no effect on model performance across a variety of tasks applied to text data. Future research could investigate the effect of the source of spelling-correction libraries, medical dictionaries, and abbreviation lists on model performance.
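As a rough illustration only (not code from the thesis), the sketch below shows how an ordered preprocessing pipeline and a term-frequency cosine similarity, two of the components described above, might be composed in Python. The tokenizer, abbreviation map, and example notes are invented for demonstration; the actual study also applied spelling correction, stemming, and lemmatization, and evaluated trained classifiers on the i2b2 datasets. Varying both the list of steps and their sequence mirrors the nested order-within-number design.

import string
from collections import Counter
from math import sqrt

# Hypothetical abbreviation map; the thesis used clinical abbreviation
# lists whose exact sources are not given in the abstract.
ABBREVIATIONS = {"pt": "patient", "hx": "history", "dx": "diagnosis"}

def tokenize(text):
    # Lowercase and split on whitespace: a minimal stand-in for a
    # clinical tokenizer.
    return text.lower().split()

def remove_punctuation(tokens):
    # Strip punctuation symbols from each token; drop empty results.
    table = str.maketrans("", "", string.punctuation)
    return [t for t in (tok.translate(table) for tok in tokens) if t]

def expand_abbreviations(tokens):
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def preprocess(text, steps):
    # Apply the chosen preprocessing steps in the given order, so both
    # the number of steps and their sequence can be varied.
    tokens = tokenize(text)
    for step in steps:
        tokens = step(tokens)
    return tokens

def cosine_similarity(tokens_a, tokens_b):
    # Cosine similarity between term-frequency vectors of two notes.
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

note1 = "Pt has hx of type 2 diabetes; denies chest pain."
note2 = "Patient history: type 2 diabetes, no chest pain reported."

# Two orderings of the same two steps applied to the same note.
order_a = preprocess(note1, [remove_punctuation, expand_abbreviations])
order_b = preprocess(note1, [expand_abbreviations, remove_punctuation])
print(cosine_similarity(order_a, preprocess(note2, [remove_punctuation, expand_abbreviations])))

Because each step is a pure function on token lists, reordering the pipeline is just reordering the list passed to preprocess, which makes order effects straightforward to test systematically.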
Keywords
Natural Language Processing (NLP), Clinical notes, Medical records, Medical text data, Preprocessing