Distributed interactive analytics engines (Druid, Redshift, Pinot)
need to achieve low query latency while using the least storage
space. This paper presents a solution to the problem of replication
of data blocks and routing of queries. Our techniques decide
the replication level of individual data blocks (based on popularity,
access counts), as well as output optimal placement patterns for
such data blocks. For the static version of the problem (given a set
of queries accessing some segments), our techniques are provably
optimal in both storage and query latency. For the dynamic version
of the problem, we build a system called Getafix that dynamically
tracks data block popularity, adjusts replication levels, dynamically
routes queries, and garbage-collects less useful data blocks. We implemented
Getafix in Druid, the most popular open-source interactive
analytics engine. Our experiments use both synthetic traces
and traces from Yahoo! Inc.’s production Druid cluster.
Compared to existing techniques, Getafix either reduces the storage
space used by up to 3.5x while achieving comparable query
latency, or improves query latency by up to 60% while using comparable
storage.