Assessing the effect of preprocessing of clinical notes on classification tasks and similarity measures

dc.contributor.author: Moni, Md Moniruzzaman
dc.contributor.examiningcommittee: Katz, Alan (Community Health Sciences)
dc.contributor.examiningcommittee: Noman, Mohammed (Computer Science)
dc.contributor.supervisor: Lix, Lisa
dc.date.accessioned: 2024-03-26T21:07:23Z
dc.date.available: 2024-03-26T21:07:23Z
dc.date.issued: 2024-01-23
dc.date.submitted: 2024-03-17T21:20:30Z
dc.degree.discipline: Community Health Sciences
dc.degree.level: Master of Science (M.Sc.)
dc.description.abstract:
Background: Unstructured text data (UTD) in electronic medical records (EMRs) may be challenging to use because of noise, including spelling errors, abbreviations, and punctuation symbols. Preprocessing is expected to improve the performance of statistical or machine learning models.
Objectives: The research objectives were to assess the effect of the number and order of preprocessing methods (1) on the detection of health conditions from UTD, (2) on clinical and demographic cohort selection criteria in UTD, (3) on the similarity of information contained in pairs of EMR notes for the same patient, and (4) on accurate de-identification of UTD.
Methods: Study data were from the Informatics for Integrating Biology and the Bedside (i2b2) challenges; the 2008, 2014, and 2018 i2b2 datasets were used for different objectives. Preprocessing methods included tokenization, removal of punctuation symbols, correction of spelling errors, expansion of abbreviations, word stemming, and lemmatization. A nested experimental design was adopted, in which the order of methods was nested within the number of methods. Balanced random forest, support vector machine, and bidirectional long short-term memory-conditional random field (BiLSTM-CRF) models were used for Objectives 1, 2, and 4, respectively, and model performance was evaluated by accuracy, sensitivity, specificity, F1 score, and precision. For Objective 3, cosine similarity was used to measure the similarity between pairs of notes. Analysis of variance (ANOVA) and descriptive statistics were used to test the research hypotheses.
Results: Mean sensitivity, specificity, F1 score, accuracy, and precision were similar across the orders and numbers of methods for Objectives 1 and 2. Cosine similarity scores were similar across orders of methods for Objective 3. The deep learning models for Objective 4 were not trainable with the preprocessed data. The ANOVA F tests showed no significant effect of the order of methods at any number of methods, and identical mean values of the outcome variables implied no difference among the numbers of methods.
Conclusion: The order and number of preprocessing methods had no effect on model performance across a variety of tasks applied to text data. Future research could investigate the effect of the source of spelling-correction libraries, medical dictionaries, and abbreviation lists on model performance.
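The sketch below illustrates the general shape of the workflow the abstract describes: a small preprocessing pipeline (tokenization, punctuation removal, stemming or lemmatization) followed by a cosine similarity comparison of two notes, as in Objective 3. The library choices (NLTK, scikit-learn), the sample notes, and the ordering of steps are assumptions for illustration only, not the thesis's actual implementation.

```python
# Minimal sketch, assuming NLTK and scikit-learn; the step order and
# sample notes are hypothetical, not the thesis's actual pipeline.
# Requires one-time downloads: nltk.download("punkt"); nltk.download("wordnet")
import string

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(note: str, use_stemming: bool = True) -> str:
    """Tokenize a note, drop punctuation tokens, then stem or lemmatize."""
    tokens = word_tokenize(note.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    if use_stemming:
        tokens = [stemmer.stem(t) for t in tokens]
    else:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)

# Objective 3 style comparison: cosine similarity between two notes
# for the same (hypothetical) patient, after preprocessing.
note_a = "Pt presents with chest pain; BP elevated."
note_b = "Patient presenting with chest pains, blood pressure elevated."

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([preprocess(note_a), preprocess(note_b)])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Cosine similarity: {score:.3f}")
```

In the nested design the abstract describes, variants of such a pipeline (with steps dropped or reordered) would be applied before model training or similarity scoring, and the resulting performance metrics compared across conditions with ANOVA.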
dc.description.note: May 2024
dc.identifier.uri: http://hdl.handle.net/1993/38084
dc.language.iso: eng
dc.rights: open access
dc.subject: Natural Language Processing (NLP)
dc.subject: Clinical notes
dc.subject: Medical records
dc.subject: Medical text data
dc.subject: Preprocessing
dc.title: Assessing the effect of preprocessing of clinical notes on classification tasks and similarity measures
dc.type: master thesis
local.subject.manitoba: no
Files
Original bundle
Name: Moni_Thesis Paper final.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 770 B
Description: Item-specific license agreed to upon submission