Satisfying service level objectives in stream processing systems

Kalim, Faria

Satisfying service level objectives in stream processing systems

Kalim, Faria

Permalink

https://hdl.handle.net/2142/108600

Description

Title

Satisfying service level objectives in stream processing systems

Author(s)

Kalim, Faria

Issue Date

2020-07-14

Director of Research (if dissertation) or Advisor (if thesis)

Gupta, Indranil

Doctoral Committee Chair(s)

Gupta, Indranil

Committee Member(s)

Nahrstedt, Klara
Xu, Tianyin
Ananthanarayanan, Ganesh

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2020-10-07T22:44:32Z

Keyword(s)

Stream Processing
Service Level Objectives
Bottlenecks
Distributed Systems

Abstract

An increasing number of real-world applications today consume massive amounts of data in real-time to produce up to date results. These applications include social media sites that show top trends and recent comments, streaming video analytics that identify traffic patterns and movement, and jobs that process ad pipelines. This has led to the proliferation of stream processing systems that process such data to produce real-time results. As these applications must produce results quickly, users often wish to impose performance requirements on the stream processing jobs, in the form of service level objectives (SLOs) that include producing results within a specified deadline or producing results at a certain throughput. For example, an application that identifies traffic accidents can have tight latency SLOs as paramedics may need to be informed, where given a video sequence, results should be produced within a second. A social media site could have a throughput SLO where top trends should be updated with all received input per minute. Satisfying job SLOs is a hard problem that requires tuning various deployment parameters of these jobs. This problem is made more complex by challenges such as 1) job input rates that are highly variable across time e.g., more traffic can be expected during the day than at night, 2) transparent components in the jobs' deployed structure that the job developer is unaware of, as they only understand the application-level business logic of the job, and 3) different deployment environments per job e.g., on a cloud infrastructure vs. on a local cluster. In order to handle such challenges and ensure that SLOs are always met, developers often over-allocate resources to jobs, thus wasting resources. In this thesis, we show that SLO satisfaction can be achieved by resolving (i.e., preventing or mitigating) bottlenecks in key components of a job's deployed structure. Bottlenecks occur when tasks in a job do not have sufficient allocation of resources (CPU, memory or disk), or when the job tasks are assigned to machines in a way that does not preserve locality and causes unnecessary message passing over the network, or when there are an insufficient number of tasks to process job input. We have built three systems that tackle the challenges of satisfying SLOs of stream processing jobs that face a combination of these bottlenecks in various environments. We have developed Henge, a system that achieves SLO satisfaction of stream processing jobs deployed on a multi-tenant cluster of resources. As the input rates of jobs change dynamically, Henge makes cluster-level resource allocation decisions to continually meet jobs' SLOs in spite of limited cluster resources. Second, we have developed Meezan, a system that aims to remove the burden of finding the ideal resource allocation of jobs deployed on commercial cloud platforms, in terms of performance and cost, for new users of stream processing. When a user submits their job to Meezan, it provides them with a spectrum of throughput SLOs for their jobs, where the most performant choice is associated with the highest resource usage and consequently cost, and vice versa. Finally, we have built Caladrius in collaboration with Twitter that enables users to model and predict how input rates of jobs may change in the future. This allows Caladrius to preemptively scale a job out when it anticipates high workloads to prevent SLO misses. Henge is built atop Apache Storm, while Meezan and Caladrius are integrated with Apache Heron.

Graduation Semester

2020-08

Type of Resource

Thesis

Permalink

http://hdl.handle.net/2142/108600

Copyright and License Information

2020 Faria Kalim

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science

Satisfying service level objectives in stream processing systems

Kalim, Faria

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Computer Science

Log In