Visual relationship understanding
Hung, Zih-Siou
Permalink
https://hdl.handle.net/2142/108027
Description
- Title
- Visual relationship understanding
- Author(s)
- Hung, Zih-Siou
- Issue Date
- 2020-05-12
- Director of Research (if dissertation) or Advisor (if thesis)
- Lazebnik, Svetlana
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Visual Relationship Detection
- Action Recognition
- Abstract
- This thesis addresses two visual understanding tasks: visual relationship detection (VRD) and video action recognition. The majority of the thesis is focused on VRD, which is our main contribution. Relations among entities play a central role in image and video understanding. In the first three chapters, we discuss visual relationship detection, whose goal is to recognize all (subject, predicate, object) tuples in a given image. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate = union(subject, object) - subject - object (an illustrative sketch of this constraint is given after the metadata list below). In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation. In the final part of the thesis, we consider action understanding in videos. In many scenarios, we observe moving objects instead of still images. Thus, it is also important to capture motion information and recognize the action being performed. Recent work either applies 3D convolution operators to extract motion implicitly or adds an additional optical flow path to leverage temporal features. In our work, we propose a novel correlation operator to establish a matching between consecutive frames. This matching encodes the movement of objects through time. Combined with the classical appearance stream, the proposed method thus learns appearance and motion representations in parallel. On the challenging Something-Something dataset [2], we empirically demonstrate that our network achieves performance comparable to the state-of-the-art method.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/108027
- Copyright and License Information
- Copyright 2020 Zih-Siou Hung
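To make the translation-embedding constraint described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of scoring predicates by the vector implied by predicate = union(subject, object) - subject - object. The feature dimension, projection layers, number of predicate classes, and dot-product scoring are assumptions made for illustration only and are not claimed to match the thesis's actual architecture.

```python
# Illustrative sketch only: shapes, layer choices, and the scoring rule are
# assumptions, not the thesis's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextTranslationEmbedding(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, num_predicates=70):
        super().__init__()
        # Separate projections for subject, object, and union-box features.
        self.proj_subj = nn.Linear(feat_dim, embed_dim)
        self.proj_obj = nn.Linear(feat_dim, embed_dim)
        self.proj_union = nn.Linear(feat_dim, embed_dim)
        # One learned embedding per predicate class.
        self.predicate_embed = nn.Embedding(num_predicates, embed_dim)

    def forward(self, subj_feat, obj_feat, union_feat):
        s = self.proj_subj(subj_feat)
        o = self.proj_obj(obj_feat)
        u = self.proj_union(union_feat)
        # Translation vector implied by the constraint:
        #   predicate = union(subject, object) - subject - object
        pred_vec = u - s - o
        # Score each predicate class by similarity to the implied vector.
        logits = pred_vec @ self.predicate_embed.weight.t()
        return logits

# Usage sketch: in practice the features would come from a detector's
# pooled subject, object, and union bounding-box regions.
model = ContextTranslationEmbedding()
subj = torch.randn(4, 2048)    # subject box features
obj = torch.randn(4, 2048)     # object box features
union = torch.randn(4, 2048)   # union box features
scores = model(subj, obj, union)                      # (4, num_predicates)
loss = F.cross_entropy(scores, torch.randint(0, 70, (4,)))
```

Because the predicate embedding is shared across all (subject, object) pairs, a vector learned from common relations can, in principle, score previously unseen triplets as well, which is the motivation for the translation-based formulation.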
Owning Collections
- Graduate Dissertations and Theses at Illinois (PRIMARY)
- Dissertations and Theses - Computer Science