Privacy-preserving data analysis techniques for biomedical data
Anjum, Md. Monowar
Privacy is a fundamental aspect of modern distributed systems. The data collection mechanism and the subsequent analysis often reveal private information about individuals. This is especially true when designing contact tracing systems to combat a pandemic. Contact tracing systems collect vital information about individuals, such as their social interaction graph, their frequently visited places, and other sensitive data. The majority of proposed systems use a centralized architecture and population-wide deployment. Such a macro-level design perspective is prone to privacy and scalability issues. In the first part of this thesis, we address the problems in recently proposed contact tracing systems. Instead of a macro-level design, we propose a micro-level system design: a system that can be implemented at the organizational level and scaled without steep infrastructure costs. Privacy considerations are baked into the system design. The system stores only strictly necessary information about each user, and the data never leaves the organization's premises. Our proposed system can be scaled up rapidly without requiring population-wide adoption. Subsequent analysis of the aggregate statistics computed from the raw data collected by our system is performed in a privacy-preserving manner.

In epidemiology and clinical modeling, summaries of raw biomedical data are used to fit or train disease-specific specialized models. The generalized linear mixed model (GLMM) is one such widely used model. Training such models on sensitive data in a collaborative setting often entails privacy risks. Standard privacy-preserving mechanisms such as differential privacy can mitigate the privacy risk during model training. However, experimental evidence suggests that adding differential privacy to model training can cause significant utility loss, rendering the model impractical for real-world use.
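The privacy–utility trade-off mentioned above can be made concrete with a minimal sketch of the Laplace mechanism, a standard differential-privacy primitive (this is a generic illustration, not code or parameters from this thesis): tightening the privacy budget epsilon inflates the noise added even to a simple count query.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponential samples is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1 (one individual changes the
    # count by at most 1), so the Laplace noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
true_count = 1000

def mean_abs_error(epsilon, trials=2000):
    # Average absolute error over repeated private releases.
    return sum(abs(dp_count(true_count, epsilon, rng) - true_count)
               for _ in range(trials)) / trials

loose = mean_abs_error(1.0)   # weak privacy: error around 1
strict = mean_abs_error(0.1)  # strong privacy: roughly 10x the error
```

Shrinking epsilon from 1.0 to 0.1 multiplies the expected error tenfold, which is the utility loss that motivates the alternative approach taken in the second part of the thesis.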
It therefore becomes clear that generalized linear mixed models, which lose their usability under differential privacy, require a different approach to privacy-preserving training. In the second part of this thesis, we propose a value-blind training method for generalized linear mixed models in a federated setting. In our method, the central server optimizes model parameters without ever accessing the raw training data or intermediate computation values. The intermediate values that the collaborating parties share with the central server are encrypted using homomorphic encryption. We formally prove the security of our proposed method. Experiments on multiple datasets suggest that a model trained with our method achieves a very low error rate while preserving privacy. To the best of our knowledge, this is the first work to perform a systematic privacy analysis of generalized linear mixed model training in a federated setting.
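The kind of value-blind aggregation described above can be illustrated with a toy additively homomorphic (Paillier-style) cryptosystem: parties encrypt their local statistics, and multiplying ciphertexts adds the underlying plaintexts, so an aggregator never sees any raw value. This is a generic sketch under assumed parameters, not the thesis's actual protocol, and the tiny primes provide no real security (deployments need keys of 2048 bits or more).

```python
import math
import secrets

# Toy Paillier keypair. The primes are tiny and purely illustrative.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # modular inverse mod n

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:       # randomness must be a unit mod n
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Each party encrypts its local statistic; multiplying the ciphertexts
# adds the plaintexts, so the server only ever handles encrypted values.
local_stats = [17, 25, 8]
encrypted_sum = 1
for v in local_stats:
    encrypted_sum = (encrypted_sum * encrypt(v)) % n2

assert decrypt(encrypted_sum) == sum(local_stats)
```

The additive homomorphism is what lets a central server combine encrypted intermediate computations from collaborating parties without access to the underlying data.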
data security, machine learning, federated learning, biomedical data privacy, homomorphic encryption