Exploring screen summarization with large language and multimodal models
Adrakatti, Vivek
Permalink
https://hdl.handle.net/2142/124410
Description
- Title
- Exploring screen summarization with large language and multimodal models
- Author(s)
- Adrakatti, Vivek
- Issue Date
- 2024-04-29
- Director of Research (if dissertation) or Advisor (if thesis)
- Kumar, Ranjitha
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Screen Understanding
- Mobile UI Understanding
- Screen Summarization
- Mobile UI Summarization
- Artificial Intelligence
- Machine Learning
- Human-Computer Interaction
- Large Language Models
- Vision Language Models
- Large Multimodal Models
- Abstract
- Mobile UIs are inherently multimodal: they can be represented by both visual representations (screenshots) and structural metadata (view hierarchies). The image modality is rich and can contain information such as images, colors, and positional information, while the view hierarchies represent a tree of UI elements with data such as element types, element text, and bounding boxes. Recently, pre-trained large language models (LLMs) and large multimodal models (LMMs) have been shown to have strong performance on various downstream tasks such as sentiment analysis and text classification. Further, LLMs have been shown to achieve competitive performance when applied to screen summarization tasks with effective prompting strategies. First, we seek to understand whether LLMs benefit from the hierarchical structure of a view hierarchy, exploring different heuristics for representing the view hierarchy in text and observing whether these affect performance in screen summarization. We study the impact of in-context learning and note a visible gain in performance. We also explore the use of LMMs to effectively leverage the multimodal nature of mobile UIs. To this end, we evaluate the performance of state-of-the-art LMMs, LLaVA and GPT-4V, when provided with both screenshots and view hierarchies. We investigate whether LMMs are able to achieve good summarization capability with a single modality, or whether multimodality can help create more complete summaries. Finally, we conduct a deep analysis to study the characteristics of LLM- and LMM-generated screen summaries. We show that our multimodal approach not only yields competitive performance but also offers generalizability and user-friendly application. We achieve a CLAIR score of 0.79 and show qualitative examples to highlight that both modalities hold information that is useful for creating informative and complete screen summaries. We achieve this performance without requiring the creation of dedicated datasets or expensive pre-training or fine-tuning. This study contributes to the broader understanding of how mobile UIs can be represented textually and how their multimodal nature can be effectively leveraged in practical applications.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Vivek Adrakatti
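The abstract above mentions representing a mobile view hierarchy as text so an LLM can summarize the screen. The sketch below is a minimal, hypothetical illustration of that general idea; the node schema (class_name, text, bounds, children) and the indentation heuristic are assumptions for illustration only, not the representation or prompting strategy used in the thesis.

```python
# Hypothetical sketch: flatten a view hierarchy into indented text for an LLM prompt.
# The node fields and formatting are illustrative assumptions, not the thesis's method.

def node_to_text(node: dict, depth: int = 0) -> list:
    """Render one view-hierarchy node (and its children) as indented lines."""
    parts = [node.get("class_name", "View")]
    if node.get("text"):
        parts.append('text="{}"'.format(node["text"]))
    if node.get("bounds"):
        parts.append("bounds={}".format(node["bounds"]))
    lines = ["  " * depth + " ".join(parts)]
    for child in node.get("children", []):
        lines.extend(node_to_text(child, depth + 1))
    return lines

if __name__ == "__main__":
    # Toy hierarchy standing in for a parsed mobile UI tree.
    screen = {
        "class_name": "FrameLayout",
        "bounds": [0, 0, 1080, 1920],
        "children": [
            {"class_name": "TextView", "text": "Sign in", "bounds": [40, 120, 1040, 220]},
            {"class_name": "Button", "text": "Continue", "bounds": [40, 300, 1040, 420]},
        ],
    }
    hierarchy_text = "\n".join(node_to_text(screen))
    prompt = (
        "Summarize the purpose of this mobile screen in one sentence.\n\n"
        "View hierarchy:\n" + hierarchy_text
    )
    print(prompt)
```

A serialized tree like this could be placed in a prompt on its own or alongside the screenshot for a multimodal model; which elements and attributes to keep is exactly the kind of heuristic choice the abstract describes exploring.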
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)