CONTENT-BASED CHARACTERIZATION OF THE END OF TERM WEB ARCHIVE
Phillips, Mark E.; Phillips, Kristy K.; Alam, Sawood
Permalink
https://hdl.handle.net/2142/121091
Description
- Title
- CONTENT-BASED CHARACTERIZATION OF THE END OF TERM WEB ARCHIVE
- Author(s)
- Phillips, Mark E.
- Phillips, Kristy K.
- Alam, Sawood
- Issue Date
- 2023
- Keyword(s)
- web archives
- End of Term Web Archive
- WARC Metadata Sidecar
- Abstract
- Since 2008, the End of Term Web Archive has been gathering snapshots of the federal web, consisting of the publicly accessible .gov and .mil websites. In 2022, the End of Term team began to package these crawls into a public dataset, which they released as part of the Amazon Open Data Partnership program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. From the original WARC content, derivative datasets were created that address common use cases for web archives. These derivatives include WAT, WET, CDX, and a format called a WARC Metadata Sidecar. This WARC Metadata Sidecar includes content-based characterizations of files held in the archive, including character set, language, file format identifier, and soft 404 detection. This paper describes the decisions made in the creation of these derivatives and the technologies used, and introduces the WARC Metadata Sidecar, which presents a useful approach for creating and storing auxiliary metadata for web archives.
- Series/Report Name or Number
- iPRES 2023
- Type of Resource
- text
- Language
- en
- Copyright and License Information
- Copyright held by the author(s). The text of this paper is published under a CC BY-SA license (https://creativecommons.org/licenses/by/4.0/).
Owning Collections
Long Papers PRIMARY