An Optimization Approach for Denormalized Data Models in Distributed Environments Based on a Multidimensional Cost Model [Directed PhD thesis]

By Jihane Mali, my former PhD student, co-advised with Faten Atigui, Ahmed Azough and Shohreh Ahvar at ESILV & ISEP. [theses.fr, manuscript]

One of the best PhD students I have had. A real pleasure to direct Jihane during those years. Lots of open perspectives!

The growth of digital technology has led to an increase in data, often referred to as “Big Data”. The latter results from vast volumes of data coming from various sources such as social media, sensors, business transactions, and more. Big Data is characterized by its volume, velocity, variety, and veracity, commonly known as the “4 Vs”. The vast scale and complexity of this data present both challenges and opportunities for Information Systems (IS). In fact, traditional data processing tools and techniques are often inadequate for handling Big Data due to its size and heterogeneity, leading to the emergence of NoSQL (Not only SQL) systems.
Moreover, this new complexity of database systems compels IS to continuously refine their data models and to carefully select the storage and management options that best align with their requirements. This continuous refinement is crucial due to the evolving nature of data and the varying needs of IS. While existing solutions focus on transforming data models, none of them offer guidance on selecting the most suitable data model(s) for a given IS use case. This lack of guidance can lead to suboptimal data models that impact performance, cost and scalability in the long term.
To address this issue, we propose in this thesis ModelDrivenGuide, a global automated approach designed to lead the data model selection process. ModelDrivenGuide takes as input a conceptual model and a use case that includes queries, settings and infrastructure constraints. It then generates all data models relevant to the given use case. In this first phase, refinement rules are applied recursively, generating one logical model after another. Our approach then relies on a heuristic to reduce the search space by avoiding the cycles and redundancies that could result from these recursive transformations. This generation process provides a set of optimal data model candidates rather than settling for a single dedicated solution, which may not guarantee an optimal outcome.
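To give a concrete (but purely illustrative) idea of this first phase, here is a minimal Java sketch of a recursive generation loop where a set of canonical model signatures plays the role of the cycle/redundancy heuristic. The types DataModel, RefinementRule and the method signature() are hypothetical placeholders, not the classes of the actual prototype.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the generation phase: refinement rules are applied
// recursively, and a set of canonical signatures prunes cycles and duplicates.
public class DataModelGenerator {

    interface DataModel {
        // Canonical signature used to detect models that were already generated.
        String signature();
    }

    interface RefinementRule {
        // Returns the models obtained by applying this rule once, if applicable.
        List<DataModel> apply(DataModel model);
    }

    private final List<RefinementRule> rules;

    public DataModelGenerator(List<RefinementRule> rules) {
        this.rules = rules;
    }

    /** Generates all candidate models reachable from the initial (conceptual) model. */
    public Set<DataModel> generate(DataModel initialModel) {
        Set<DataModel> candidates = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();              // heuristic: avoid cycles/redundancies
        Deque<DataModel> toExpand = new ArrayDeque<>();

        seen.add(initialModel.signature());
        toExpand.push(initialModel);

        while (!toExpand.isEmpty()) {
            DataModel current = toExpand.pop();
            candidates.add(current);
            for (RefinementRule rule : rules) {
                for (DataModel next : rule.apply(current)) {
                    if (seen.add(next.signature())) {    // skip models already produced
                        toExpand.push(next);
                    }
                }
            }
        }
        return candidates;
    }
}
```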
Given the diversity of the generated potential data models, the next step is to facilitate the selection of the most suitable model(s) for the specific use case. To achieve this, we propose a multidimensional cost calculation phase, designed to evaluate the logical cost of each data model so that different models can be compared without the expense of physically implementing them. The cost model incorporates the time, environmental and financial impacts of each data model and integrates the costs of both data models and queries based on the users’ inputs.
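As an illustration only (the names and the aggregation formula are assumptions of mine, not the thesis’ definitions), the multidimensional cost could be represented in Java as a small value object over the three dimensions, with user-provided weights for comparison:

```java
// Hypothetical sketch of a multidimensional cost: each dimension (time,
// environmental, financial) can aggregate a model-storage part and a query part.
public final class MultidimensionalCost {

    public final double time;          // e.g. estimated execution time units
    public final double environmental; // e.g. energy consumption proxy
    public final double financial;     // e.g. storage and compute price

    public MultidimensionalCost(double time, double environmental, double financial) {
        this.time = time;
        this.environmental = environmental;
        this.financial = financial;
    }

    /** Adds another cost component (e.g. the cost of one query) to this one. */
    public MultidimensionalCost plus(MultidimensionalCost other) {
        return new MultidimensionalCost(
                time + other.time,
                environmental + other.environmental,
                financial + other.financial);
    }

    /** Illustrative weighted aggregation; the weights stand for user preferences. */
    public double weighted(double wTime, double wEnv, double wFin) {
        return wTime * time + wEnv * environmental + wFin * financial;
    }
}
```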
Furthermore, in order to guide the choice of optimal data models for a specific use case, we propose a data model selection phase. The latter ranks the generated data models by employing optimization strategies that consider their associated costs as well as variations of the settings. These strategies can be characterized as short-term and long-term optimizations. The first strategy ranks data models by optimizing costs in a single setting (i.e., data volume and number of servers). The second strategy is better suited to cases where settings undergo rapid changes, as the cost associated with a data model may change swiftly. It enhances the stability of data models by favoring those with lower density and average cost, thereby ensuring both stability and a degree of cost-effectiveness. This approach accounts for the dynamic nature of data and usage patterns, ensuring that the selected data model(s) remain optimal even as conditions change.
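Again as a hedged illustration (the class, the ranking criteria and the “spread” measure below are my own simplifications, not the thesis’ formulas), the two strategies could be sketched in Java as follows:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two ranking strategies. A "setting" stands for a
// (data volume, number of servers) pair; the costs are the aggregated values
// produced by the cost-calculation phase.
public class DataModelSelector {

    /** Strategy 1 (short term): rank models by their cost in one fixed setting. */
    public static <M> List<M> rankSingleSetting(Map<M, Double> costInSetting) {
        return costInSetting.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .toList();
    }

    /** Strategy 2 (long term): rank models across several settings, favoring those
     *  whose cost stays both tightly clustered and low on average. */
    public static <M> List<M> rankForStability(Map<M, List<Double>> costsPerSetting) {
        Comparator<M> byStabilityThenAverage = Comparator
                .comparingDouble((M m) -> spread(costsPerSetting.get(m)))   // lower spread first
                .thenComparingDouble(m -> average(costsPerSetting.get(m))); // then lower average
        return costsPerSetting.keySet().stream()
                .sorted(byStabilityThenAverage)
                .toList();
    }

    private static double average(List<Double> costs) {
        return costs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Crude stand-in for the density criterion: mean absolute deviation of the costs.
    private static double spread(List<Double> costs) {
        double avg = average(costs);
        return costs.stream().mapToDouble(c -> Math.abs(c - avg)).average().orElse(0.0);
    }
}
```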
Finally, to demonstrate our approach, we developed a prototype that implements its three key phases: (a) Data Model Generation (DMG), (b) Multidimensional Cost Calculation (MCC) and (c) Data Model Selection (DMS). Our simulation is based on the TPC-C benchmark. Initially, we explore how the structure of the conceptual model and the number and type (e.g., filters, joins, etc.) of the use case queries affect the number of potential data models generated. Subsequent results highlight the impact of query types and settings variations on the costs of these data models. Then, the optimization strategies are applied to rank the data models based on their costs. We also provide a visualization tool that displays all the generated data models with respect to their costs. This tool simplifies decision-making for Information Systems, addressing the absence of such a solution in existing tools. The prototype and the visualization tool were both developed in Java.
