Term Weights for 235k Language and Literature Texts
Organisciak, Peter
Permalink
https://hdl.handle.net/2142/89691
Description
Title
Term Weights for 235k Language and Literature Texts
Author(s)
Organisciak, Peter
Issue Date
2016-03
Keyword(s)
data, text analysis, digital library
Abstract
A popular form of term weighting in texts is TF*IDF, which takes a text's term frequencies and weights them by Inverse Document Frequency (IDF), a measure derived from document frequency. This dataset provides IDF weights for terms in 235k books from the HathiTrust that are classified as Language and Literature (i.e., class P in the Library of Congress Classification, LCC). For each term seen in these books, inverse book frequency and inverse page frequency are provided. Book frequency is the number of books in which the term occurs; page frequency is the number of pages on which it occurs. The data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
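As a rough illustration of how inverse-frequency weights of this kind are typically applied, the sketch below computes TF*IDF for one text. It assumes the plain logarithmic form log(N / df); the record does not state the exact formula used for the dataset, so the formula, the function names, and all counts here are hypothetical and do not come from the data.

```python
import math


def inverse_frequency(total_units: int, units_with_term: int) -> float:
    """Plain logarithmic IDF, log(N / df).

    Assumed formula for illustration; the dataset may use a different variant.
    'Units' can be books (inverse book frequency) or pages (inverse page frequency).
    """
    if units_with_term <= 0 or units_with_term > total_units:
        raise ValueError("units_with_term must be in the range 1..total_units")
    return math.log(total_units / units_with_term)


def tf_idf(term_counts: dict[str, int], idf: dict[str, float]) -> dict[str, float]:
    """Weight raw term counts by IDF, skipping terms that have no IDF entry."""
    return {
        term: count * idf[term]
        for term, count in term_counts.items()
        if term in idf
    }


# Hypothetical book frequencies, invented for this example.
TOTAL_BOOKS = 235_000
book_frequency = {"the": 235_000, "whale": 2_350}

idf = {
    term: inverse_frequency(TOTAL_BOOKS, count)
    for term, count in book_frequency.items()
}

# Term counts for one hypothetical text; "zzz" has no IDF entry and is dropped.
weights = tf_idf({"the": 500, "whale": 30, "zzz": 1}, idf)
```

Under this assumed formula, a term that appears in every book gets a weight of zero however often it occurs in the text, while a rarer term keeps a weight proportional to its count. Substituting page counts for book counts gives the inverse page frequency variant.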