Term Frequencies for 235k Language and Literature Texts
Organisciak, Peter
Permalink
https://hdl.handle.net/2142/89515
Description
Title
Term Frequencies for 235k Language and Literature Texts
Author(s)
Organisciak, Peter
Issue Date
2016-03
Keyword(s)
data
text analysis
digital library
Abstract
Corpus-level term statistics are valuable for numerous text analysis activities, such as term weighting or probability distribution smoothing. When the corpus at hand is too small to calculate such statistics reliably, falling back on a general corpus of similar texts is useful.
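As an illustration of the smoothing use case, the sketch below mixes a small document's term distribution with a background distribution taken from a larger corpus. It uses linear interpolation (Jelinek-Mercer smoothing), which is one common choice and not necessarily the method the author had in mind. The function name `smoothed_prob`, the weight `lam`, and the toy counts are all assumptions made for this example.

```python
from collections import Counter

def smoothed_prob(term, doc_counts, background_counts, lam=0.5):
    """Linearly interpolate a document model with a background corpus model.

    lam is the weight given to the background corpus (hypothetical default).
    """
    doc_total = sum(doc_counts.values())
    bg_total = sum(background_counts.values())
    # Counter returns 0 for unseen terms, so missing keys are safe here.
    p_doc = doc_counts[term] / doc_total if doc_total else 0.0
    p_bg = background_counts[term] / bg_total if bg_total else 0.0
    return (1 - lam) * p_doc + lam * p_bg

# Toy data: a short document and made-up background term frequencies.
doc = Counter(["whale", "sea", "whale", "ship"])
background = Counter({"whale": 2, "sea": 6, "ship": 2, "harbor": 10})

# "harbor" never occurs in the document, but still gets non-zero probability.
p_harbor = smoothed_prob("harbor", doc, background)
p_whale = smoothed_prob("whale", doc, background)
```

Here the background counts play the role that a general corpus of similar texts would play when the document collection itself is too small.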
This dataset provides statistics for a collection of 235k books from the HathiTrust that are classified as Language and Literature (i.e. class P in the Library of Congress Classification, LCC). For each term seen in these books, book frequency, page frequency, and term frequency are provided. Book frequency is the number of books in which the term appears, page frequency is the number of pages on which the term appears, and term frequency is the total number of occurrences of the term across the collection. This data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
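To make the three statistics concrete, here is a minimal sketch that computes them for a toy collection. It assumes each book is a list of pages and each page is a list of already-tokenized terms; the function name `corpus_term_stats` and the toy data are hypothetical, and this is not the pipeline used to build the dataset.

```python
from collections import Counter

def corpus_term_stats(books):
    """Return (book_freq, page_freq, term_freq) Counters for a collection.

    books: iterable of books; each book is a list of pages;
    each page is a list of tokens.
    """
    book_freq, page_freq, term_freq = Counter(), Counter(), Counter()
    for pages in books:
        seen_in_book = set()
        for tokens in pages:
            term_freq.update(tokens)       # every occurrence counts
            page_terms = set(tokens)
            page_freq.update(page_terms)   # at most once per page
            seen_in_book |= page_terms
        book_freq.update(seen_in_book)     # at most once per book
    return book_freq, page_freq, term_freq

# Toy collection: two books, three pages in total.
toy_books = [
    [["the", "whale", "the"], ["sea", "whale"]],
    [["the", "sea"]],
]
bf, pf, tf = corpus_term_stats(toy_books)
```

In this toy collection "whale" occurs twice, on two pages, but in only one book, so its three counts differ.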