
    Connecting vision and language via image retrieval and captioning

    View/Open
    Main Thesis (8.463 MB)
    Approval form (183.7 KB)
    Date
    2021-10
    Author
    Hosseinzadeh, Mehrdad
    Abstract
    Many real-world problems involve jointly understanding vision and language, e.g., image/video captioning, multi-modal image retrieval, and visual question answering. In this thesis, we consider several problems in cross-modal learning from vision and language. First, we study the problem of composed query image retrieval, in which the objective is to pick, from a pool of images, those most relevant to a given query. The query consists of a reference image and a modification text describing the desired changes to the reference image. Next, we visit the problem of video captioning for a future event, where the goal is to take what has happened in a video stream so far and describe what is most likely to happen next. Moving forward, we focus on the problem of image captioning in two different scenarios. In the first scenario, which we call "image change captioning", the task consists of taking two very similar images as input and describing the (subtle) difference between them. In the second scenario, the problem of personalized image captioning is studied in a few-shot setting. Unlike traditional image captioning, where a generic sentence is generated regardless of the user, in our setting the personality of the user is taken into account for caption generation. However, since collecting data for such a task is non-trivial, we study the few-shot setting, in which only a few image-caption pairs are available for quick adaptation to each new user (and hence each new personality trait). We explore different ways to establish interactions between the vision and language modalities and propose new methods to solve the aforementioned problems. Our proposed methods are evaluated on benchmark datasets for each problem and compared against state-of-the-art and/or baseline methods.
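
    As a rough illustration of the composed-query retrieval setting described in the abstract, the minimal sketch below ranks a candidate image pool by cosine similarity to a fused query embedding. This is not the method proposed in the thesis; the embedding dimensionality, random placeholder features, and function names are assumptions made purely for illustration.

    ```python
    # Minimal sketch (not the thesis method): scoring candidates in composed-query image retrieval.
    # Assumes a fused embedding of (reference image + modification text) and precomputed
    # embeddings for the candidate image pool; all names and sizes here are hypothetical.
    import numpy as np

    def rank_candidates(query_embedding: np.ndarray, candidate_embeddings: np.ndarray) -> np.ndarray:
        """Return candidate indices ordered from most to least similar to the composed query."""
        # Cosine similarity between the composed query and each candidate image embedding.
        q = query_embedding / np.linalg.norm(query_embedding)
        c = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
        scores = c @ q
        return np.argsort(-scores)

    # Toy usage with random vectors standing in for real image/text features.
    rng = np.random.default_rng(0)
    query = rng.standard_normal(128)          # fused (reference image + modification text) embedding
    pool = rng.standard_normal((1000, 128))   # embeddings of the candidate image pool
    top_indices = rank_candidates(query, pool)[:10]
    ```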
    URI
    http://hdl.handle.net/1993/36119
    Collections
    • FGS - Electronic Theses and Practica [25529]
