Natural language guided image and video understanding

Ye, Linwei
Vision and language are two important media of human communication. In this thesis, we focus on natural language guided image and video understanding and study several computer vision tasks at the intersection of vision and language. We begin with the challenging task of referring image segmentation. We propose a dual convolutional LSTM network, consisting of an encoder network and a decoder network, to capture spatial and sequential information. The network focuses on the more informative words for effective multimodal interaction and integrates multi-level features to produce a precise segmentation mask. Second, attention mechanisms have been widely used in vision and language tasks to capture the importance of image regions or words; we extend attention to operate over vision and language simultaneously. We introduce a cross-modal self-attention module that exploits fine details of individual words and of the input image or video, and that effectively captures long-range dependencies between linguistic and visual features. We further introduce a cross-frame self-attention module that integrates temporal information across consecutive frames, which enables our method to handle referring segmentation in videos. The third contribution of this thesis is to segment a moment in a video according to a query sentence. Existing methods require segment proposals and exploit only limited local context. These limitations severely hinder the use of such models in real-world applications. To overcome this problem, we propose a novel PointerNet with local and global contexts that directly determines the start and end positions of the moment. Last, to take full advantage of the language input in a vision task, we present a self-supervised auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation so that the multimodal representation adapts to the specific query. We evaluate this framework on the task of actor and action video segmentation from natural language and achieve better performance than existing approaches.
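To make the idea of attending over vision and language jointly more concrete, the following is a minimal illustrative sketch, not the thesis implementation. It assumes that each image location and each word is already represented by a small feature vector, forms one joint feature per (location, word) pair by concatenation, and applies plain scaled dot-product self-attention over all pairs. The function name `cross_modal_self_attention`, the use of identity projections for queries, keys, and values, and the toy inputs are assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_self_attention(visual, words):
    """Attend over every (image location, word) pair.

    visual: list of per-location feature vectors (lists of floats)
    words:  list of word embedding vectors (lists of floats)
    Returns one attended vector per (location, word) pair.
    """
    # Joint multimodal features: list concatenation of a location's
    # feature vector with a word's embedding, for every pair.
    joint = [v + w for v in visual for w in words]
    scale = math.sqrt(len(joint[0]))

    attended = []
    for query in joint:
        # Assumption: identity projections, so queries, keys, and values
        # are the joint features themselves (a real model would learn them).
        weights = softmax([dot(query, key) / scale for key in joint])
        # Weighted sum of all joint features, so each pair can draw on
        # every other location and word regardless of distance.
        attended.append(
            [sum(w * key[i] for w, key in zip(weights, joint))
             for i in range(len(query))]
        )
    return attended


# Toy input: 3 image locations with 2-d features, 2 words with 3-d embeddings.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
out = cross_modal_self_attention(visual, words)
print(len(out), len(out[0]))  # 6 pairs, each a 5-dimensional vector
```

Because every pair attends to every other pair, the cost grows quadratically with the number of locations times the number of words; the sketch ignores this and any positional encoding for the sake of brevity.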
Computer science, Vision and language, Deep learning
L. Ye, M. Rochan, Z. Liu, and Y. Wang. Cross-modal self-attention network for referring image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10502–10511, 2019.
L. Ye, Z. Liu, and Y. Wang. Dual convolutional LSTM network for referring image segmentation. IEEE Transactions on Multimedia, 2020.