Unifying multiple descriptions to determine the details of an everyday event can be a challenging task for humans. Though incorporating other modalities like images or videos can help humans unify such descriptions, this remains a challenging task for computational systems. We define entity-based scene understanding as the task of identifying the entities in a visual scene from multiple descriptions. This task subsumes coreference resolution, bridging resolution, and grounding to produce mutually consistent relations between entity mentions and groundings between mentions and image regions. Using neural classifiers and integer linear program inference, we show that grounding improves when it is forced to conform to relation predictions. We introduce the Flickr30k Entities v2 dataset and show how our methods can be used to automatically generate similarly rich annotations for the MSCOCO dataset.