Privacy-preserving data publishing using deep learning techniques

dc.contributor.author: Ahmed, Tanbir
dc.contributor.examiningcommittee: Leung, Carson K. (Computer Science)
dc.contributor.examiningcommittee: Wang, Yang (Computer Science)
dc.contributor.supervisor: Mohammed, Noman (Computer Science)
dc.date.accessioned: 2021-07-08T20:34:49Z
dc.date.available: 2021-07-08T20:34:49Z
dc.date.copyright: 2021-07-08
dc.date.issued: 2021-05-10
dc.date.submitted: 2021-07-07T18:43:36Z
dc.date.submitted: 2021-07-08T20:03:52Z
dc.degree.discipline: Computer Science
dc.degree.level: Master of Science (M.Sc.)
dc.description.abstract: According to a recent study, around 99 percent of hospitals across the United States now use electronic health record (EHR) systems. One of the most common types of EHR data is unstructured textual data, and unlocking hidden details from this data is critical for improving current medical practices and research endeavors. However, these textual data contain sensitive information, which could compromise our privacy. Therefore, medical textual data cannot be released publicly without any privacy protection. De-identification is the process of detecting and removing all sensitive information present in EHRs, and it is a necessary step towards privacy-preserving EHR data sharing. Since 2016, we have seen several deep learning-based approaches to de-identification, which achieved over 98% accuracy. However, these models are trained with sensitive information and can unwittingly memorize some of their training data, and a careful analysis of these models can reveal patients' data. This thesis presents two contributions. First, we introduce new methods to de-identify textual data based on a self-attention mechanism and stacked recurrent neural networks. Experimental results on three different datasets show that our model performs better than all state-of-the-art mechanisms irrespective of the dataset. Additionally, our proposed method is significantly faster than existing techniques. We also introduce three utility metrics to judge the quality of the de-identified data. Second, we propose a differentially private ensemble framework for de-identification, allowing medical researchers to collaborate by publicly publishing the de-identification models. Experiments on three different datasets showed competitive results compared to the state-of-the-art methods with guaranteed differential privacy.
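The de-identification task described in the abstract can be illustrated with a minimal sketch: detect spans of protected health information (PHI) in clinical text and replace them with category tags. The regex patterns and tag names below are hypothetical illustrations only; the thesis's actual approach learns PHI spans with deep learning models rather than hand-written rules.

```python
import re

# Toy PHI patterns (hypothetical): a real system, like the models in this
# thesis, learns such spans from annotated EHR data instead of regexes.
PHI_PATTERNS = {
    "DATE": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
    "MRN": r"\bMRN:\s*\d+\b",
}

def deidentify(text: str) -> str:
    """Replace each matched PHI span with its category tag."""
    for tag, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"[{tag}]", text)
    return text

note = "Patient MRN: 12345 seen on 3/14/2021, callback 204-555-0147."
print(deidentify(note))
# → Patient [MRN] seen on [DATE], callback [PHONE].
```

The output keeps the clinical narrative intact while removing identifiers, which is why the thesis also proposes utility metrics: de-identified text is only useful if the non-sensitive content survives.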
dc.description.note: October 2021
dc.identifier.uri: http://hdl.handle.net/1993/35736
dc.language.iso: eng
dc.rights: open access
dc.subject: Data Privacy
dc.subject: Natural Language Processing
dc.subject: Electronic Health Record
dc.subject: Deep Learning
dc.title: Privacy-preserving data publishing using deep learning techniques
dc.type: master thesis
Files

Original bundle
Name: ahmed_tanbir.pdf
Size: 829.97 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.2 KB
Format: Item-specific license agreed to upon submission