Methods for generating visual programs with optimizable vision models
Levine, Joshua
Permalink
https://hdl.handle.net/2142/124357
Description
- Title
- Methods for generating visual programs with optimizable vision models
- Author(s)
- Levine, Joshua
- Issue Date
- 2024-04-29
- Director of Research (if dissertation) or Advisor (if thesis)
- Hoiem, Derek
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- visual programming
- visual question answering
- large language models
- computer vision
- program generation
- Abstract
- End-to-end vision-language models often fail on compositional tasks, which motivates alternative approaches to more complex problem-solving. Building on the visual programming paradigm, we propose a method for composing foundation vision models through program generation to tackle compositional tasks. We investigate prompting and execution strategies that let trainable large language models (LLMs) synthesize fine-tunable code, with the aim of improving how well the generated programs solve vision-language tasks. Capitalizing on the strong compositional reasoning capabilities of LLMs, we use pre-trained LLMs to write programs built from a catalog of predefined atomic functions. These atomic functions, implemented with pre-trained vision models, serve as the building blocks of the visual programs our system generates. Our method supports programs in several formats while preserving the ability to fine-tune both the constituent vision models and the LLM code generator. This study focuses on image-based question answering, a task in which interpreting and answering complex visual queries demands advanced compositional reasoning. We evaluate both the executability and the correctness of the generated programs to assess the effectiveness of our approach. This work lays the groundwork for a subsequent investigation into jointly training the LLMs and the atomic functions, with the goal of advancing program generation and compositional reasoning in computer vision.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Joshua Levine
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois