CURATING MULTIVIEW HAND-OBJECT INTERACTIONS FROM IN-THE-WILD DATA
Jin, Matthew
Permalink
https://hdl.handle.net/2142/124842
Description
- Title
- CURATING MULTIVIEW HAND-OBJECT INTERACTIONS FROM IN-THE-WILD DATA
- Author(s)
- Jin, Matthew
- Issue Date
- 2023-05-01
- Keyword(s)
- hand-object interactions; in-the-wild video; data curation; supervised learning
- Abstract
- Understanding how human hands manipulate different objects in in-the-wild environments is important for building embodied agents that can interact with the real world. One form of in-the-wild hand-object interaction understanding is the reconstruction of both the object and hand in 3D. This hand-object reconstruction often requires multiple views of the interaction to be reliable. Recently, large-scale egocentric in-the-wild video datasets have been collected to study several tasks, e.g., action recognition and forecasting, showing a rich set of multiview hand-object interactions for a variety of objects in diverse indoor and outdoor settings. Most of the associated tasks and annotations focus on capturing a semantic understanding of the videos. However, these annotations and data in their raw form may not be suitable for tasks that require geometric understanding, e.g., 3D hand-object reconstruction, due to challenges like blur, occlusion, and clipping. If we could extract specific sequences of video frames that depict the hand-held object from multiple viewpoints without these challenges, we could quickly generate possible sources of supervision for hand-held object reconstruction in the wild. Therefore, we aim to create an automatic curating tool to efficiently find these sequences in in-the-wild videos. We first manually curate these multi-frame sequences from the EPIC-KITCHENS VISOR dataset as a proof of concept. We then leverage the curated sequences as training data to train supervised deep learning models that can identify similar sequences. We explore two different model architectures: 1) an early fusion approach, in which we fine-tune a model that takes a sequence of frames in concatenated form as input, and 2) a late fusion approach, in which we first extract features from individual frames and then concatenate those features as input to a multilayer perceptron.
Our experiments on held-out sequences from VISOR, together with a qualitative analysis of generalization to Ego4D, show that our model is effective at curating these in-the-wild sequences.
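The late fusion approach described in the abstract can be sketched as follows. This is a minimal illustration only: the feature extractor (a fixed random projection standing in for a pretrained backbone), the feature dimension, sequence length, and MLP layer sizes are all assumptions, not the thesis's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 128   # per-frame feature size (assumed)
SEQ_LEN = 4      # frames per candidate sequence (assumed)
HIDDEN = 64      # MLP hidden width (assumed)

# Fixed random projection standing in for a pretrained per-frame backbone
# (an assumption; the abstract does not name the feature extractor).
_proj = rng.standard_normal((FEAT_DIM, 32 * 32 * 3))

def extract_frame_features(frame: np.ndarray) -> np.ndarray:
    """Map one 32x32x3 frame to a FEAT_DIM feature vector."""
    return _proj @ frame.reshape(-1)

# Late fusion: concatenate the per-frame features, then score the
# whole sequence with a small multilayer perceptron.
W1 = rng.standard_normal((HIDDEN, FEAT_DIM * SEQ_LEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((1, HIDDEN)) * 0.01
b2 = np.zeros(1)

def score_sequence(frames: list) -> float:
    """Return a curation score in (0, 1) for a sequence of SEQ_LEN frames."""
    fused = np.concatenate([extract_frame_features(f) for f in frames])
    h = np.maximum(0.0, W1 @ fused + b1)       # ReLU hidden layer
    logit = (W2 @ h + b2).item()
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid score

frames = [rng.random((32, 32, 3)) for _ in range(SEQ_LEN)]
print(round(score_sequence(frames), 3))
```

In the early fusion variant, by contrast, the frames themselves would be concatenated before any feature extraction and fed jointly to a single fine-tuned model.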
- Type of Resource
- text
- Language
- eng
Owning Collections
Senior Theses - Electrical and Computer Engineering
The best of ECE undergraduate research