Compositional visual generation with energy-based modeling
Liu, Nan
Permalink
https://hdl.handle.net/2142/121419
Description
- Title
- Compositional visual generation with energy-based modeling
- Author(s)
- Liu, Nan
- Issue Date
- 2023-06-21
- Director of Research (if dissertation) or Advisor (if thesis)
- Lazebnik, Svetlana
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- energy-based models
- diffusion models
- Abstract
- Our understanding of the visual world around us is highly compositional in nature: humans can rapidly understand individual concepts in a scene and compose them to describe the world states we encounter. However, machines struggle with complex compositions of concepts, often confusing the attributes of different objects or the relations between them. While a large body of work has explored inferring and understanding objects in a scene, less work has been done on building a composable system that enables the “infinite use of finite means”, i.e., the ability to repeatedly reuse and recombine acquired concepts. This thesis endeavors to construct machine learning systems with such compositional capabilities, particularly in the context of generative modeling. First, existing works primarily compose relations by utilizing a holistic encoder that encodes inputs, in the form of text or graphs, into fixed-size vectors. We instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes with multiple sets of relations more faithfully. Second, we extend this work to the composition of various concepts, including objects, relations, and text descriptions. Our alternative structured approach for compositional generation interprets diffusion models as energy-based models, which allows us to explicitly combine data distributions defined by energy functions. The proposed method can generate scenes at test time that are substantially more complex than those seen during training, composing sentence descriptions, object relations, and human facial attributes, and even generalizing to new combinations that are rarely seen.
Third, we consider the inverse problem: given a collection of different images, can we discover the underlying generative concepts that represent each image? We present an approach that decomposes images into a set of different concepts, disentangling art styles in paintings, separating objects and lighting in kitchen scenes, and discovering image classes from ImageNet images. We illustrate how such discovered concepts accurately represent the underlying content of images and how they may further be composed with other concepts to construct new artistic and hybrid images. In summary, the methods proposed in this thesis showcase the potential of compositional modeling to enhance machine learning systems’ ability to generate complex and realistic scenes by intelligently combining learned generative concepts.
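The factorized composition described in the abstract can be made concrete with a toy sketch. This is not code from the thesis: all function names and parameters below are illustrative assumptions. The key idea it demonstrates is that summing the energies of two energy-based models corresponds to taking the product of their unnormalized densities, p(x) ∝ exp(−E₁(x) − E₂(x)), and that samples from the composed distribution can be drawn with Langevin dynamics:

```python
import numpy as np

# Toy 1-D illustration (not from the thesis): two energy functions,
# each defining an unnormalized density exp(-E(x)).

def energy_a(x):
    # energy pulling samples toward x = -1
    return 0.5 * (x + 1.0) ** 2

def energy_b(x):
    # energy pulling samples toward x = +1
    return 0.5 * (x - 1.0) ** 2

def composed_energy(x):
    # factorized composition: summing energies multiplies densities
    return energy_a(x) + energy_b(x)

def grad(f, x, eps=1e-4):
    # central-difference gradient for the Langevin update
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def langevin_sample(f, steps=2000, step_size=0.01, seed=0):
    # unadjusted Langevin dynamics on the density exp(-f(x))
    rng = np.random.default_rng(seed)
    x = rng.normal()
    for _ in range(steps):
        x = x - step_size * grad(f, x) \
            + np.sqrt(2.0 * step_size) * rng.normal()
    return x

samples = [langevin_sample(composed_energy, seed=s) for s in range(200)]
# The composed density here is proportional to exp(-x**2), a Gaussian
# centered at 0, so samples concentrate around x = 0 even though each
# individual energy prefers -1 or +1.
```

The same summation-of-energies view is what licenses composing separately trained relation models at test time: each relation contributes one energy term, and sampling is run on their sum.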
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Nan Liu
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)