Scheduling Solutions for Data Stream Processing Applications on Cloud-Edge Infrastructure

Thesis defense

Date

Thursday, 10 December 2020

Time

14:00

Venue(s)

Site Monod

Speaker(s)

Thesis defense of Mr. Felipe DE SOUZA, supervised by Mr. Eddy CARON

Organizer(s)

Laboratoire de l'Informatique du Parallélisme (LIP)

Language(s) of the presentations

French

Technology has evolved to a point where applications and devices are highly connected and produce ever-increasing amounts of data, which organizations and individuals use to make daily decisions. For this data to become information that supports decision making, it must be processed.
The speed at which information is extracted from generated data affects how fast organizations and individuals react to environmental changes. One way to process data with short delays is through Data Stream Processing (DSP) applications. DSP applications can be structured as graphs, where the vertices are data sources, operators, and data sinks, and the edges represent the data streams flowing between them. A data source is the part of the application responsible for data ingestion. Operators are the application components that receive a data stream, apply some transformation to it, and produce a new data stream. The resulting streams eventually reach a data sink, where the data is stored, visualized, or provided to another application or individual.
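As an illustration of this graph structure, here is a minimal Python sketch. It is not taken from the thesis: the vertex names (`sensor`, `to_fahrenheit`, `hot_only`, `store`), the sample readings, and the `run` helper are all hypothetical, and real DSP engines process unbounded streams rather than a short list.

```python
from collections import defaultdict, deque

# Vertices: name -> (kind, function). All names and functions are hypothetical.
# A source returns the items it ingests; an operator maps one input item
# to a list of output items (possibly empty); a sink only collects items.
vertices = {
    "sensor": ("source", lambda: [20.0, 35.0, 40.0]),             # Celsius readings
    "to_fahrenheit": ("operator", lambda c: [c * 9 / 5 + 32]),     # transformation
    "hot_only": ("operator", lambda f: [f] if f > 100 else []),    # filter
    "store": ("sink", None),
}

# Edges: the data streams flowing from one vertex to the next.
edges = [
    ("sensor", "to_fahrenheit"),
    ("to_fahrenheit", "hot_only"),
    ("hot_only", "store"),
]


def run(vertices, edges):
    """Push every item emitted by the sources through the graph."""
    downstream = defaultdict(list)
    for upstream, target in edges:
        downstream[upstream].append(target)

    collected = defaultdict(list)
    queue = deque()

    # Data ingestion: each source emits its items on its outgoing edges.
    for name, (kind, fn) in vertices.items():
        if kind == "source":
            for item in fn():
                for target in downstream[name]:
                    queue.append((target, item))

    # Operators transform items; sinks store whatever reaches them.
    while queue:
        name, item = queue.popleft()
        kind, fn = vertices[name]
        if kind == "sink":
            collected[name].append(item)
            continue
        for out in fn(item):
            for target in downstream[name]:
                queue.append((target, out))

    return dict(collected)


result = run(vertices, edges)
```

In this toy graph only the 40.0 °C reading exceeds the 100 °F threshold after conversion, so it is the only item that reaches the sink.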
Usually, DSP applications are designed to run on a cloud or a homogeneous cluster of computing resources, because such infrastructures provide a large pool of resources and impose few network constraints. In scenarios where the data consumed by the DSP application is produced in the cloud, deploying the application in the cloud is a good approach. However, advances in the Internet of Things (IoT) are creating scenarios where DSP applications consume streams generated at the edges of the network by numerous geographically distributed resources. In such scenarios, sending these streams through the Internet to a cloud far from the edges of the network increases network traffic and introduces high latency.
A recent trend is to combine the cloud with computational resources at the edge of the network to deploy DSP applications. The idea is that resources at the edges of the network can provide low latency for part of the computation and reduce the amount of data sent to the cloud, whereas the cloud can serve as a central place to receive data from geographically distributed streams. The downside of using resources at the network edges is that they are constrained with respect to CPU, memory, storage, and even power availability. In addition to solving the operator placement problem, which consists of finding a set of resources to host the operators of a DSP application, a DSP scheduling solution needs to consider the computing constraints of edge resources, and hence decide which parts of the application to offload to the edge and which parts to keep in the cloud. Moreover, the solution must exploit the resources at the edges of the network by adapting the application parallelism, splitting the application load among the numerous edge resources.
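To make the offloading decision concrete, the following toy Python sketch enumerates every assignment of a three-operator pipeline to one edge site and one cloud site, discards assignments that exceed a site's CPU capacity, and keeps the one that sends the least data across the edge-cloud link. Everything here is an assumption for illustration: the operator names, CPU demands, data rates, and capacities are invented, the objective (traffic between sites) is only a stand-in for the latency and cost objectives mentioned in the abstract, and exhaustive search is not the method proposed in the thesis.

```python
from itertools import product

# Hypothetical sites and their CPU capacity (arbitrary units).
capacity = {"edge": 3, "cloud": 100}

# Hypothetical pipeline, in order: (name, CPU demand, output data rate).
operators = [
    ("filter", 1, 50),
    ("aggregate", 2, 5),
    ("train", 50, 5),
]

SOURCE_RATE = 100                      # data rate emitted by the source
SOURCE_AT, SINK_AT = "edge", "cloud"   # data is produced at the edge


def feasible(placement):
    """Check that no site hosts more CPU demand than its capacity."""
    used = dict.fromkeys(capacity, 0)
    for (_, cpu, _), site in zip(operators, placement):
        used[site] += cpu
    return all(used[site] <= capacity[site] for site in capacity)


def wan_traffic(placement):
    """Sum the data rates of all streams that cross between sites."""
    sites = [SOURCE_AT, *placement, SINK_AT]
    rates = [SOURCE_RATE] + [out for _, _, out in operators]
    return sum(rate for rate, a, b in zip(rates, sites, sites[1:]) if a != b)


def best_placement():
    """Exhaustively search all placements (exponential, toy sizes only)."""
    candidates = [
        p for p in product(capacity, repeat=len(operators)) if feasible(p)
    ]
    return min(candidates, key=wan_traffic)


placement = best_placement()
```

With these made-up numbers, the two light operators fit on the edge and shrink the stream before it leaves, while the heavy operator stays in the cloud. The search space grows exponentially with the number of operators and resources, so exhaustive enumeration is only workable at this toy scale.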
In this work, we propose a model for the operator placement problem that accounts for resources at the edges of the network as well as in the cloud, and for their heterogeneity. The model also addresses operator parallelism, creating lightweight replicas to exploit a large number of computationally constrained devices at the edges. Along with the model, we propose an optimal solution based on a linear formulation to reduce the end-to-end latency and deployment costs. Since an optimal solution based on linear formulations suffers from scalability issues, and an edge-cloud environment is composed of a large number of resources, we propose a heuristic to reduce the search space without compromising the solution’s performance. Results from a discrete-event simulation show that the proposed solution can achieve an end-to-end latency at least ≃ 80% better and monetary costs at least ≃ 30% better than a traditional cloud deployment. When combined with the search space reduction heuristic, it can find placements 94% faster, with a 12% quality reduction, while still outperforming state-of-the-art approaches.

Free of charge

Keywords

Disciplines

Computer science
