The increasing availability of sensors, mobile phones, and other devices has led to an explosion in the volume, variety, and velocity of generated data that requires some form of analysis. As society becomes more interconnected, vast amounts of data are produced by instrumented business processes, monitoring of user activity [0, 1], wearable assistance, sensors, finance, and large-scale scientific experiments, among other sources. This data deluge is often termed big data due to the challenges it poses to existing infrastructure regarding data transfer, storage, and processing.
A large part of this big data is most valuable when analysed quickly, as it is generated. Under several emerging application scenarios, such as smart cities, operational monitoring of large infrastructure, and the Internet of Things (IoT), continuous data streams must be processed under very short delays. In multiple domains, there is a need to process data streams to detect patterns, identify failures, and gain insights. Such data is often gathered and analysed by Data Stream Processing Engines (DSPEs).
A DSPE commonly structures an application as a directed graph or dataflow. A dataflow has one or multiple sources (i.e., sensors, gateways, or actuators); operators that perform transformations on the data (e.g., filtering); and sinks (i.e., queries that consume or store the data). More complex operator transformations – hereafter called stateful operators – store information about previously received data as new data is streamed in. A dataflow may also contain stateless operators, which consider only the current data. Traditionally, Data Stream Processing (DSP) applications were conceived to run on clusters of homogeneous resources or on the cloud. In a cloud deployment, the whole application is placed on a single cloud provider to benefit from virtually unlimited resources. This approach allows for elastic DSP applications with the ability to allocate additional resources or release idle capacity on demand during runtime to match the application requirements.
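The dataflow structure described above can be sketched as a small directed graph. The following is a minimal illustration only; the class and operator names are hypothetical and do not correspond to any particular DSPE's API.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    """A vertex in the dataflow graph (names are illustrative)."""
    name: str
    kind: str                 # "source", "transform", or "sink"
    stateful: bool = False
    state: dict = field(default_factory=dict)  # used only by stateful operators

def build_dataflow():
    """Build a toy dataflow: sensor -> filter -> windowed average -> sink."""
    src = Operator("sensor", "source")
    filt = Operator("filter", "transform", stateful=False)           # stateless: one tuple at a time
    window_avg = Operator("window_avg", "transform", stateful=True)  # stateful: remembers past tuples
    sink = Operator("query_sink", "sink")
    # Edges are the streams connecting upstream to downstream operators.
    edges = [(src, filt), (filt, window_avg), (window_avg, sink)]
    return [src, filt, window_avg, sink], edges

ops, edges = build_dataflow()
```

The stateful/stateless distinction matters for the reconfiguration problem discussed later: migrating `window_avg` requires moving its accumulated state, whereas `filt` can simply be restarted elsewhere.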
In many scenarios, cloud elasticity alone is not enough to meet DSP time requirements due to the location of the data sources and the application life-cycle. In IoT scenarios in particular, data sources are located at the edge of the Internet, and data is transferred to the cloud via long-distance links that increase the end-to-end application latency (i.e., the time elapsed from when data is generated until it reaches the sinks). The communication overhead incurred by transferring data through high-latency Internet links makes it impossible to achieve near real-time processing on clouds alone. A common cloud-edge infrastructure for IoT scenarios – termed here highly distributed infrastructure – is one where data is continuously produced by multiple sensors and controllers, forwarded to gateways, switches, or hubs at the edge over low-latency networks, and eventually to the cloud for processing. The edge often comprises devices with low, but non-negligible, memory and CPU capabilities, grouped according to their location or network latency. One edge group can transfer data to another group or to the cloud, and the channel used for communication is often the Internet.
More recently, software frameworks [7, 8] and architectures have been proposed for carrying out DSP on cloud-edge infrastructure to improve scalability and achieve short end-to-end latencies. Edge devices can be leveraged to complement the computing capabilities of the cloud and to reduce the overall end-to-end application latency, bandwidth requirements, and other performance metrics. Exploiting such infrastructure allows for employing multiple performance models and for minimising the overhead and costs of performing data stream processing. Using cloud-edge infrastructure, however, introduces additional challenges regarding application scheduling, resource elasticity, and programming models. The task of scheduling DSP operators on highly distributed infrastructure with heterogeneous resources is generally referred to as operator placement (i.e., configuration) and has proven to be NP-hard. Determining how to deploy operators, or how to move them from the cloud to edge devices, is also challenging because of device limitations (i.e., memory, CPU, and network bandwidth) and the network connecting them (i.e., the Internet).
As DSP applications are long-running, load and infrastructure conditions can change during their execution. After the initial operator placement, operators may need to be reassigned due to variable workload or device failures. The process of reorganising or migrating DSP operators across compute resources is called reconfiguration. Reconfiguring a DSP application and deciding which operators to reassign is also NP-hard. Application reconfiguration is guided by strategies such as pause-and-resume or parallel track. The former shuts down or pauses the original operator before its state and code are migrated and execution is restored on the new resource. A drawback is the increase in end-to-end latency caused by pausing the operator, which buffers incoming data until the migration finishes and then synchronises it. The latter creates a new instance of the operator that runs concurrently with the old one until the two instances synchronise their states. Although faster than pause-and-resume, it requires enhanced mechanisms for state movement.
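The pause-and-resume strategy can be sketched in a few lines. This is a simplified illustration under stated assumptions – a dictionary stands in for serialised operator state, and a queue stands in for the input stream; real DSPEs involve network transfer and checkpointing protocols.

```python
import queue

def pause_and_resume_migrate(old_op_state, incoming, process):
    """Sketch of pause-and-resume migration (illustrative, not a real DSPE API).

    1. Pause the operator; buffer tuples that arrive during migration.
    2. Copy the operator state to the new resource.
    3. Resume on the new resource and replay the buffered tuples.
    """
    buffered = []
    # Phase 1: operator paused -- drain arriving tuples into a buffer.
    while not incoming.empty():
        buffered.append(incoming.get())
    # Phase 2: state transfer (a plain copy stands in for serialisation + network).
    new_op_state = dict(old_op_state)
    # Phase 3: resume -- replay buffered tuples on the migrated operator.
    for t in buffered:
        process(new_op_state, t)
    return new_op_state

# Example: a counting operator migrated mid-stream.
def count(state, t):
    state["count"] = state.get("count", 0) + 1

q = queue.Queue()
for t in ("a", "b", "c"):
    q.put(t)
state = {"count": 5}
new_state = pause_and_resume_migrate(state, q, count)
# new_state["count"] == 8: 5 tuples processed before the pause, plus 3 buffered
```

The buffering in phase 1 is exactly the source of the latency penalty mentioned above: every tuple held in `buffered` is delayed for the full duration of the state transfer.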
The solution search space for determining the DSP operator placement or the application reconfiguration can be enormous, depending on the number of operators, streams, resources, and network links. The problem becomes even more complex when considering multiple Quality of Service (QoS) metrics, for instance, end-to-end latency, the traffic volume that traverses WAN links, the monetary cost incurred by the application deployment, and the overhead of saving the application state and moving operators to new resources. As cloud-edge infrastructure and applications grow larger, devising a (re)configuration plan while optimising multiple objectives must cope with an ever larger search space.
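The size of this search space is easy to quantify for the unconstrained case: with n operators and m candidate resources, each operator can be mapped to any resource, giving m^n possible placements. A short worked example:

```python
def placement_search_space(num_operators: int, num_resources: int) -> int:
    """Number of unconstrained operator-to-resource mappings.

    Each of the n operators can be assigned to any of the m resources,
    so there are m**n candidate placements before any constraint pruning.
    """
    return num_resources ** num_operators

# Even a modest application explodes combinatorially:
# 10 operators on 20 resources -> 20**10 = 10,240,000,000,000 placements.
print(placement_search_space(10, 20))
```

Resource capacity and network constraints prune many of these candidates, but the exponential growth in the number of operators remains, which is why exhaustive enumeration is infeasible and heuristic or learning-based strategies are pursued in this thesis.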
We introduce a set of strategies to place operators onto cloud and edge resources while considering the characteristics of resources and meeting application requirements. In particular, we first decompose the application graph by identifying behaviours such as forks and joins, and then dynamically split the dataflow graph across edge and cloud. Comprehensive simulations and a real testbed covering multiple application settings demonstrate that our approach can improve the end-to-end latency by over 50%, along with other QoS metrics.
The solution search space for operator reassignment can be enormous depending on the number of operators, streams, resources, and network links. Moreover, it is important to minimise the cost of migration while improving latency. Reinforcement Learning (RL) and Monte-Carlo Tree Search (MCTS) have been used to tackle problems with large search spaces and state spaces [12, 13], performing at human level or better in games such as Go. We model the application reconfiguration problem as a Markov Decision Process (MDP) and investigate the use of RL and MCTS algorithms to devise reconfiguration plans that improve QoS metrics.
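The MDP framing can be illustrated with a toy model. The state, action, and cost definitions below are illustrative assumptions, not the thesis's actual formulation, and a greedy one-step lookahead stands in for the RL and MCTS algorithms, purely to show the MDP structure.

```python
# State: a tuple assigning each operator index to a resource id.
# Action: (operator, new_resource) -- move exactly one operator.
# Reward: negative sum of estimated latency and a fixed migration cost.

def actions(state, num_resources):
    """All single-operator moves available from the current assignment."""
    return [(op, r) for op in range(len(state))
            for r in range(num_resources) if r != state[op]]

def step(state, action, latency_of, migration_cost=1.0):
    """Apply one move and return (next_state, reward)."""
    op, r = action
    next_state = tuple(r if i == op else s for i, s in enumerate(state))
    reward = -(latency_of(next_state) + migration_cost)
    return next_state, reward

def greedy_reconfigure(state, num_resources, latency_of, steps=10):
    """One-step lookahead: repeatedly take the highest-reward move,
    stopping when no move improves latency."""
    for _ in range(steps):
        best = max(actions(state, num_resources),
                   key=lambda a: step(state, a, latency_of)[1])
        nxt, _ = step(state, best, latency_of)
        if latency_of(nxt) >= latency_of(state):
            break  # no improving move left
        state = nxt
    return state

# Toy latency model: resource 0 is the distant cloud, resources 1 and 2 are edge.
latency = lambda s: sum(10.0 if r == 0 else 1.0 for r in s)
print(greedy_reconfigure((0, 0, 0), 3, latency))  # -> (1, 1, 1)
```

Starting with all three operators in the cloud, the greedy policy moves each one to an edge resource, since every such move trades a one-off migration cost for a lasting latency reduction; RL and MCTS explore the same state and action space but can look further ahead than one step.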
The key contributions of this thesis are listed below:
• A survey of DSP application elasticity and DSP application operator (re)assignments in heterogeneous infrastructure;
• A mathematical model that describes the computation and the communication services focusing on the end-to-end latency, and the application and computational resource constraints;
• Strategies for configuring DSP applications considering single- and multi-objective optimisation;
• An MDP framework, used with RL algorithms, that considers single- and multi-objective optimisation to reconfigure DSP applications.