Natural language guided image and video understanding

dc.contributor.author: Ye, Linwei
dc.contributor.examiningcommittee: Hu, Pingzhao (Computer Science / Biochem&Med); Ashraf, Ahmed (Electrical and Computer Engineering)
dc.contributor.guestmembers: Guo, Yuhong (Carleton University)
dc.contributor.supervisor: Wang, Yang (Computer Science)
dc.date.accessioned: 2020-09-25T14:45:58Z
dc.date.available: 2020-09-25T14:45:58Z
dc.date.copyright: 2020-09-18
dc.date.issued: 2020-09
dc.date.submitted: 2020-09-18T21:18:07Z
dc.degree.discipline: Computer Science
dc.degree.level: Doctor of Philosophy (Ph.D.)
dc.description.abstract: Vision and language are two important media of communication used by humans. In this thesis, we focus on the problem of natural language guided image and video understanding and study several computer vision tasks at the intersection of vision and language. We begin with the challenging task of referring image segmentation. A dual convolutional LSTM network consisting of an encoder network and a decoder network is proposed to capture spatial and sequential information. The proposed network is able to focus on the more informative words for effective multimodal interaction and integrates multi-level features to produce a precise segmentation mask. The attention mechanism has been widely used in various vision and language tasks to capture the importance of image regions or words. We extend the attention mechanism over vision and language simultaneously: a cross-modal self-attention module is introduced to exploit fine details of individual words and of the input image or video, which effectively captures the long-range dependencies between linguistic and visual features. In addition, we introduce a cross-frame self-attention module to effectively integrate temporal information across consecutive frames, which enables our method to handle referring segmentation in videos. The third contribution of this thesis is to localize a moment in a video according to a query sentence. Existing methods require segment proposals and exploit only limited local context, which severely hinders their use in real-world applications. To overcome this problem, we propose a novel PointerNet with local and global contexts to directly determine the start and end positions of the moment. Lastly, to take full advantage of the language input for a vision task, we present a self-supervised auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation so that the multimodal representation can be adapted to a specific query. We investigate this framework on the task of actor and action video segmentation from natural language and achieve better performance than existing approaches.
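The cross-modal self-attention idea summarized in the abstract can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical simplification, not the exact module from the thesis: visual features and word embeddings are projected into a shared space and attended over jointly, so every spatial position can attend to every word and vice versa. The dimensions, projection layers, and the single multi-head attention call are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Minimal sketch of joint self-attention over visual and linguistic
    features (illustrative simplification, not the thesis architecture)."""
    def __init__(self, vis_dim=512, lang_dim=300, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)    # project visual features
        self.lang_proj = nn.Linear(lang_dim, joint_dim)  # project word embeddings
        self.attn = nn.MultiheadAttention(joint_dim, num_heads=4, batch_first=True)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, H*W, vis_dim) flattened spatial feature map
        # lang_feat: (B, T, lang_dim) word embeddings of the query sentence
        v = self.vis_proj(vis_feat)
        l = self.lang_proj(lang_feat)
        joint = torch.cat([v, l], dim=1)          # (B, H*W + T, joint_dim)
        out, _ = self.attn(joint, joint, joint)   # self-attention over both modalities
        return out[:, :v.size(1)]                 # keep the refined visual positions

# usage sketch with toy shapes
vis = torch.randn(2, 26 * 26, 512)   # e.g. a 26x26 feature map
words = torch.randn(2, 12, 300)      # 12 word embeddings
refined = CrossModalSelfAttention()(vis, words)
print(refined.shape)                 # torch.Size([2, 676, 256])
```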
dc.description.note: February 2021
dc.identifier.citation: L. Ye, M. Rochan, Z. Liu, and Y. Wang. Cross-modal self-attention network for referring image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10502–10511, 2019.
dc.identifier.citation: L. Ye, Z. Liu, and Y. Wang. Dual convolutional LSTM network for referring image segmentation. IEEE Transactions on Multimedia, 2020.
dc.identifier.uri: http://hdl.handle.net/1993/35091
dc.language.iso: eng
dc.rights: open access
dc.subject: Computer science, Vision and language, Deep learning
dc.title: Natural language guided image and video understanding
dc.type: doctoral thesis
Files
Original bundle: Ye_Linwei.pdf (43.44 MB, Adobe Portable Document Format)
License bundle: license.txt (2.2 KB, item-specific license agreed to upon submission)