Withdraw
Loading…
Maintaining Highly Accurate Frequency Counts over Data Streams: A Practitioner's Perspective
Lu, Ying; Han, Jiawei
Loading…
Permalink
https://hdl.handle.net/2142/11068
Description
- Title
- Maintaining Highly Accurate Frequency Counts over Data Streams: A Practitioner's Perspective
- Author(s)
- Lu, Ying
- Han, Jiawei
- Issue Date
- 2005-07
- Keyword(s)
- data streams
- computer science
- Abstract
- Maintaining frequency counts for data streams has attracted much interest among the research community recently since it provides the base for many stream mining applications. Most existing work followed the same paradigm: Given an error requirement, find an algorithm that maintains approximate frequency counts satisfying the error requirement within a theoretical memory bound. While actual algorithms are different, the following common weakness has been observed. First, most algorithms are satisfied with having the maintained counts within a certain error bound and are not concerned with maintaining the counts of individual items as accurate as possible. Second, most work is more theoretical in the sense that they focus on finding the theoretical memory bounds with less consideration for estimating the memory requirement for a given application. The bound provided can hardly be used as an accurate gauge of the true memory requirement that often varies drastically depending on the data characteristics and arrival orders. In this paper, we study the problem of maintaining frequency counts over data streams of infinite length from a practitioner's perspective. Given a fixed memory size, maintain frequency counts for individual items as accurately as possible. We introduce a novel one-pass deterministic algorithm, {\em Progressive Lazy Pruning} (PLP). Given a fixed memory size, PLP employs a {\em progressive pruning} technique that can make full use of available memory to maintain highly accurate approximate counts for items in data streams. The estimation error is not only bounded but also independent of the total number of arrivals. If an error requirement $\epsilon$ is specified, it can also maintain $\epsilon$-approximate count as most existing work does. Our performance study indicates that, in comparison with one of the best existing deterministic algorithms on frequency counting, PLP achieves higher accuracy using the same amount of memory over various kind of data distribution.
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/11068
- Copyright and License Information
- You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Owning Collections
Manage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…