Connecting vision and language via image retrieval and captioning

dc.contributor.author: Hosseinzadeh, Mehrdad
dc.contributor.examiningcommittee: Leung, Carson (Computer Science); Ho, Carl (Electrical and Computer Engineering); Taylor, Graham (Engineering, University of Guelph)
dc.contributor.supervisor: Wang, Yang (Computer Science)
dc.date.accessioned: 2021-11-17T22:01:49Z
dc.date.available: 2021-11-17T22:01:49Z
dc.date.copyright: 2021-10-25
dc.date.issued: 2021-10
dc.date.submitted: 2021-10-25T21:37:11Z
dc.degree.discipline: Computer Science
dc.degree.level: Doctor of Philosophy (Ph.D.)
dc.description.abstract: Many real-world problems involve jointly understanding vision and language, e.g. image/video captioning, multi-modal image retrieval, and visual question answering. In this thesis, we consider several problems in cross-modal learning from vision and language. First, we study the problem of composed query image retrieval, where the objective is to retrieve, from a pool of images, those most relevant to a given query. The query consists of a reference image and a modification text describing the desired changes to the reference image. Next, we visit the problem of video captioning for a future event: the goal is to take what has happened in a video stream so far and describe what is most likely to happen next. Moving forward, we focus on the problem of image captioning in two different scenarios. In the first scenario, which we call "image change captioning", the task is to take two very similar images as input and describe the (subtle) difference between them. In the second scenario, we study personalized image captioning in a few-shot setting. Unlike traditional image captioning, where a generic sentence is generated regardless of the user, in our setting the personality of the user is taken into account for caption generation. However, since collecting data for such a task is non-trivial, we study the few-shot setting, in which for each new user (and hence new personality trait) only a few image-caption pairs are available for quick adaptation. We explore different ways to establish interactions between the vision and language modalities and propose new methods to solve the aforementioned problems. Our proposed methods are evaluated on benchmark datasets for each problem and compared with other state-of-the-art and/or baseline methods.
dc.description.note: February 2022
dc.identifier.uri: http://hdl.handle.net/1993/36119
dc.language.iso: eng
dc.rights: open access
dc.subject: Computer science
dc.subject: Computer vision
dc.subject: Multi-modal learning
dc.title: Connecting vision and language via image retrieval and captioning
dc.type: doctoral thesis
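
The composed-query retrieval setup described in the abstract can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the method proposed in the thesis: the additive fusion rule, the embedding dimensionality, and the random placeholder embeddings merely stand in for real encoder outputs and the richer vision-language interactions the thesis studies.

```python
# Minimal sketch (assumptions only, not the thesis's method): composed-query
# image retrieval scored by cosine similarity between a fused query embedding
# and candidate image embeddings. Embeddings are random placeholders; in
# practice they would come from pretrained image/text encoders.
import numpy as np

rng = np.random.default_rng(0)
dim = 512  # assumed embedding dimensionality

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_query(ref_img_emb: np.ndarray, mod_text_emb: np.ndarray) -> np.ndarray:
    """Fuse reference-image and modification-text embeddings (simple addition,
    chosen purely for illustration)."""
    return l2_normalize(ref_img_emb + mod_text_emb)

# Placeholder embeddings standing in for encoder outputs.
reference_image = l2_normalize(rng.normal(size=dim))
modification_text = l2_normalize(rng.normal(size=dim))
candidate_pool = l2_normalize(rng.normal(size=(1000, dim)))  # gallery of 1000 images

query = compose_query(reference_image, modification_text)
scores = candidate_pool @ query          # cosine similarity per candidate
top10 = np.argsort(-scores)[:10]         # indices of the 10 best matches
print("top-10 candidate indices:", top10)
```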
Files
Original bundle
Name: hosseinzadeh_mehrdad.pdf
Size: 8.46 MB
Format: Adobe Portable Document Format
Description: Main Thesis
License bundle
Name: license.txt
Size: 2.2 KB
Description: Item-specific license agreed to upon submission