Connecting vision and language via image retrieval and captioning

dc.contributor.author: Hosseinzadeh, Mehrdad
dc.contributor.examiningcommittee: Leung, Carson (Computer Science); Ho, Carl (Electrical and Computer Engineering); Taylor, Graham (Engineering, University of Guelph)
dc.contributor.supervisor: Wang, Yang (Computer Science)
dc.date.accessioned: 2021-11-17T22:01:49Z
dc.date.available: 2021-11-17T22:01:49Z
dc.date.copyright: 2021-10-25
dc.date.issued: 2021-10
dc.date.submitted: 2021-10-25T21:37:11Z
dc.degree.discipline: Computer Science
dc.degree.level: Doctor of Philosophy (Ph.D.)
dc.description.abstract: Many real-world problems involve jointly understanding vision and language, e.g. image/video captioning, multi-modal image retrieval, and visual question answering. In this thesis, we consider several problems in cross-modal learning from vision and language. First, we study the problem of composed query image retrieval, where the objective is to retrieve, from a pool of images, those most relevant to a given query. The query consists of a reference image and a modification text describing the desired changes to the reference image. Next, we visit the problem of video captioning for a future event: the goal is to take what has happened in a video stream so far and describe what is most likely to happen next. Moving forward, we focus on the problem of image captioning in two different scenarios. In the first scenario, which we call "image change captioning", the task is to take two very similar images as input and describe the (subtle) difference between them. In the second scenario, we study personalized image captioning in a few-shot setting. Unlike traditional image captioning, where a generic sentence is generated regardless of the user, in our setting the personality of the user is taken into account for caption generation. However, since collecting data for such a task is non-trivial, we study the few-shot setting, in which for each new user (and hence new personality trait) only a few image-caption pairs are available for quick adaptation. We explore different ways to establish interactions between the vision and language modalities and propose new methods to solve the aforementioned problems. Our proposed methods are evaluated on benchmark datasets for each problem and compared with other state-of-the-art and/or baseline methods.
dc.description.note: February 2022
dc.identifier.uri: http://hdl.handle.net/1993/36119
dc.language.iso: eng
dc.rights: open access
dc.subject: Computer science
dc.subject: Computer vision
dc.subject: Multi-modal learning
dc.title: Connecting vision and language via image retrieval and captioning
dc.type: doctoral thesis
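
The composed-query retrieval setup described in the abstract can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the method proposed in the thesis: the additive fusion rule, the embedding dimensionality, and the random placeholder embeddings merely stand in for real encoder outputs and the richer vision-language interactions the thesis studies.

```python
# Minimal sketch (assumptions only, not the thesis's method): composed-query
# image retrieval scored by cosine similarity between a fused query embedding
# and candidate image embeddings. Embeddings are random placeholders; in
# practice they would come from pretrained image/text encoders.
import numpy as np

rng = np.random.default_rng(0)
dim = 512  # assumed embedding dimensionality

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_query(ref_img_emb: np.ndarray, mod_text_emb: np.ndarray) -> np.ndarray:
    """Fuse reference-image and modification-text embeddings (simple addition,
    chosen purely for illustration)."""
    return l2_normalize(ref_img_emb + mod_text_emb)

# Placeholder embeddings standing in for encoder outputs.
reference_image = l2_normalize(rng.normal(size=dim))
modification_text = l2_normalize(rng.normal(size=dim))
candidate_pool = l2_normalize(rng.normal(size=(1000, dim)))  # gallery of 1000 images

query = compose_query(reference_image, modification_text)
scores = candidate_pool @ query          # cosine similarity per candidate
top10 = np.argsort(-scores)[:10]         # indices of the 10 best matches
print("top-10 candidate indices:", top10)
```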
Files
Original bundle
Name: hosseinzadeh_mehrdad.pdf
Size: 8.46 MB
Format: Adobe Portable Document Format
Description: Main Thesis
License bundle
Name: license.txt
Size: 2.2 KB
Description: Item-specific license agreed to upon submission