Examination of machine learning methods for multi-label classification of intellectual property documents

Hall, John William

Permalink

https://hdl.handle.net/2142/97430

Description

Title

Examination of machine learning methods for multi-label classification of intellectual property documents

Author(s)

Hall, John William

Issue Date

2017-04-24

Director of Research (if dissertation) or Advisor (if thesis)

Shih, Chilin

Department of Study

Linguistics

Discipline

Linguistics

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.A.

Degree Level

Thesis

Keyword(s)

Machine learning
Multi-label classification
Multi-class classification
Patent classification
Document classification

Abstract

This thesis evaluates several machine learning techniques for multi-label document classification on a corpus of United States patent grants. The rapid rise in patent applications over the past several decades has created a growing need for better automatic patent processing tools, and automated document classification in particular has become an important research target. The development of adequate tools has been limited in part by the esoteric writing style particular to intellectual property and by the overlapping categories of the branched, hierarchical Cooperative Patent Classification (CPC) system. A patent document corpus offers a large, publicly available training set consisting of both structured and unstructured data, and applying machine learning techniques to such a corpus may help relieve the increasing need for highly trained human classifiers. The contributions of the present work are twofold. First, it constructs a patent document corpus by gathering 4500 patent documents from 2014 and 2015 and compiling the structured and textual data relevant to an automated classification task. Second, it examines five machine learning techniques as automated classifiers of patent documents by section. Supervised classifiers were trained in test trials under different preprocessing conditions that used principal component analysis and word selection. Principal component analysis of the patent documents without further feature selection yielded the greatest performance for all machine learning models. This approach also revealed an effect of dataset size: increasing the size of the training set increased the overall performance of the Decision Tree, Support Vector Machine, Logistic Regression, and Neural Net models. By contrast, some classifiers trained on data not subject to principal component analysis showed decreasing performance metrics as the data size increased.
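
The preprocessing and training setup summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the use of scikit-learn, the TF-IDF weighting, the one-vs-rest Logistic Regression, the number of components, and the toy documents with their section labels are all assumptions made for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer

# Hypothetical toy documents standing in for patent text (not thesis data).
docs = [
    "surgical instrument handle with an ergonomic grip",
    "pharmaceutical composition comprising a polymer compound",
    "catalyst for polymer synthesis in a chemical reactor",
    "optical sensor measuring distance with a laser beam",
    "semiconductor memory circuit storing sensor measurements",
    "battery electrode supplying voltage to a circuit",
]
# Each document may carry several CPC section letters (multi-label targets).
sections = [["A"], ["A", "C"], ["C"], ["G"], ["G", "H"], ["H"]]

# Turn the label lists into a binary indicator matrix, one column per section.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(sections)

# PCA is applied to a dense matrix here, so the sparse TF-IDF output is
# densified first; this is only practical for small vocabularies.
to_dense = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)

model = make_pipeline(
    TfidfVectorizer(),            # assumed text representation
    to_dense,
    PCA(n_components=3),          # assumed number of components
    OneVsRestClassifier(LogisticRegression()),  # one binary model per section
)
model.fit(docs, Y)

# Predict on the training documents only to show the output format.
pred = model.predict(docs)
predicted_sections = binarizer.inverse_transform(pred)
for text, labels in zip(docs, predicted_sections):
    print(labels, "<-", text)
```

Swapping `LogisticRegression()` for another estimator, such as a decision tree or a support vector machine, leaves the rest of the pipeline unchanged, which makes it straightforward to compare models under the same preprocessing condition.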

Graduation Semester

2017-05

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Linguistics