Natural language guided image and video understanding

dc.contributor.author: Ye, Linwei
dc.contributor.examiningcommittee: Hu, Pingzhao (Computer Science / Biochem&Med); Ashraf, Ahmed (Electrical and Computer Engineering)
dc.contributor.guestmembers: Guo, Yuhong (Carleton University)
dc.contributor.supervisor: Wang, Yang (Computer Science)
dc.date.accessioned: 2020-09-25T14:45:58Z
dc.date.available: 2020-09-25T14:45:58Z
dc.date.copyright: 2020-09-18
dc.date.issued: 2020-09
dc.date.submitted: 2020-09-18T21:18:07Z
dc.degree.discipline: Computer Science
dc.degree.level: Doctor of Philosophy (Ph.D.)
dc.description.abstract: Vision and language are two important media of communication used by humans. In this thesis, we focus on the problem of natural language guided image and video understanding and study several computer vision tasks at the intersection of vision and language. We begin with the challenging task of referring image segmentation. A dual convolutional LSTM network consisting of an encoder network and a decoder network is proposed to capture spatial and sequential information. The proposed network is able to focus on the more informative words for effective multimodal interaction and integrates multi-level features to produce a precise segmentation mask. The attention mechanism has been widely used in various vision and language tasks to capture the importance of image regions or words. We extend the attention mechanism over vision and language simultaneously: a cross-modal self-attention module is introduced to exploit fine details of individual words and of the input image or video, which effectively captures the long-range dependencies between linguistic and visual features. In addition, we introduce a cross-frame self-attention module to effectively integrate temporal information across consecutive frames, which enables our method to handle referring segmentation in videos. The third contribution of this thesis is to localize a moment in a video according to a query sentence. Existing methods require segment proposals and exploit only limited local context, which severely hinders their use in real-world applications. To overcome this problem, we propose a novel PointerNet with local and global contexts to directly determine the start and end positions of the moment. Lastly, to take full advantage of the language input for a vision task, we present a self-supervised auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation so that the multimodal representation can be adapted to a specific query. We investigate this framework on the task of actor and action video segmentation from natural language and achieve better performance than existing approaches.
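The cross-modal self-attention idea summarized in the abstract can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical simplification, not the exact module from the thesis: visual features and word embeddings are projected into a shared space and attended over jointly, so every spatial position can attend to every word and vice versa. The dimensions, projection layers, and the single multi-head attention call are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Minimal sketch of joint self-attention over visual and linguistic
    features (illustrative simplification, not the thesis architecture)."""
    def __init__(self, vis_dim=512, lang_dim=300, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)    # project visual features
        self.lang_proj = nn.Linear(lang_dim, joint_dim)  # project word embeddings
        self.attn = nn.MultiheadAttention(joint_dim, num_heads=4, batch_first=True)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, H*W, vis_dim) flattened spatial feature map
        # lang_feat: (B, T, lang_dim) word embeddings of the query sentence
        v = self.vis_proj(vis_feat)
        l = self.lang_proj(lang_feat)
        joint = torch.cat([v, l], dim=1)          # (B, H*W + T, joint_dim)
        out, _ = self.attn(joint, joint, joint)   # self-attention over both modalities
        return out[:, :v.size(1)]                 # keep the refined visual positions

# usage sketch with toy shapes
vis = torch.randn(2, 26 * 26, 512)   # e.g. a 26x26 feature map
words = torch.randn(2, 12, 300)      # 12 word embeddings
refined = CrossModalSelfAttention()(vis, words)
print(refined.shape)                 # torch.Size([2, 676, 256])
```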
dc.description.note: February 2021
dc.identifier.citation: L. Ye, M. Rochan, Z. Liu, and Y. Wang. Cross-modal self-attention network for referring image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10502–10511, 2019.
dc.identifier.citation: L. Ye, Z. Liu, and Y. Wang. Dual convolutional LSTM network for referring image segmentation. IEEE Transactions on Multimedia, 2020.
dc.identifier.uri: http://hdl.handle.net/1993/35091
dc.language.iso: eng
dc.rights: open access
dc.subject: Computer science, Vision and language, Deep learning
dc.title: Natural language guided image and video understanding
dc.type: doctoral thesis
Files
Original bundle: Ye_Linwei.pdf (43.44 MB, Adobe Portable Document Format)
License bundle: license.txt (2.2 KB, item-specific license agreed to upon submission)