Mining, Indexing and Similarity Search in Large Graph Data Sets

Yan, Xifeng

Mining, Indexing and Similarity Search in Large Graph Data Sets

Yan, Xifeng

Permalink

https://hdl.handle.net/2142/11256

Description

Title

Mining, Indexing and Similarity Search in Large Graph Data Sets

Author(s)

Yan, Xifeng

Issue Date

2006-12

Keyword(s)

graph mining
data mining
database
database management

Abstract

Scalable graph mining and graph database management tools become increasingly crucial to applications with complex data in domains ranging from software engineering to computational biology. Due to their high complexity, it is often difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs. In this dissertation, we investigate two fundamental problems in large scale graph data mining and graph database management: Given a graph data set, what are the hidden structural patterns and how can we find them? and how can we index graphs and perform similarity search in large graph data sets? Graph mining and graph data management themselves are expensive computational problems since subgraph isomorphism is NP-complete. Existing Apriori-based mining solutions face inevitable overheads when they join two existing graph patterns to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that the join operation can be avoided. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a pattern-based index, which is built on frequent and discriminative graph patterns discovered through a mining process. This mining-based indexing methodology leads to the development of a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches. Besides graph mining and search, this dissertation provides thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, which could leverage the mining of graph patterns. It also explores several critical applications in bioinformatics, computer systems, and software engineering, including gene relevance network analysis for functional annotation, and program flow analysis for automated software bug isolation. The developed concepts, theories, and systems hence increase our understanding of data mining principles in structural pattern discovery, interpretation and search. The formulation of a general graph information system through this study could provide fundamental supports to graph-intensive applications.

Type of Resource

text

Permalink

http://hdl.handle.net/2142/11256

Copyright and License Information

You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).

Mining, Indexing and Similarity Search in Large Graph Data Sets

Yan, Xifeng

Permalink

Description

Owning Collections

Research and Tech Reports - Computer Science PRIMARY

Mining, Indexing and Similarity Search in Large Graph Data Sets

Yan, Xifeng

Permalink

Description

Owning Collections

Research and Tech Reports - Computer Science PRIMARY

Log In