Resource Utilization for Raw Data Query Processing: Optimizing Required Resources & Maximizing Utilization of Existing Resources
Abstract
Scientific experiments and modern applications generate large amounts of data every day. Many such applications initially store the data in raw format, as the schema is not known. A traditional database management system (DBMS) requires the entire dataset to be loaded before it can be queried. Data loading requires a significant amount of time and resources, which increases application latency and running costs. In-situ engines eliminate the data loading requirement, thereby reducing upfront resource utilization. However, they suffer from high query execution time (QET) and reparsing. It has been observed that state-of-the-art in-situ engines and DBMSs do not utilize available resources efficiently. This thesis proposes the Resource Availability and Workload aware Hybrid Framework (RAW-HF) to tackle underutilization of resources. It optimizes required resources (ORR) and maximizes utilization of existing resources (MUER) for resource-efficient processing of raw datasets. It is a hybrid system consisting of an in-situ engine and a DBMS. The in-situ engine reduces data-to-query time, while the DBMS moderates raw data reparsing. A hybrid framework for raw data query processing and resource monitoring was developed during the initial phase. Analysis of the resource monitoring data indicated substantial underutilization of resources. The optimization of required resources is done using the Query Complexity Aware (QCA) and Workload and Storage Aware Cost-based (WSAC) algorithms. QCA and WSAC also improved workload execution time (WET). Resource utilization is further improved by the Maximizing Utilization of Available Resources (MUAR) algorithm. RAW-HF is demonstrated using scientific experiment datasets such as the Sloan Digital Sky Survey (SDSS) and Linked Observation Data (LOD). RAW-HF query and resource performance is compared with state-of-the-art techniques.
State-of-the-art techniques that allocate resources accurately based on historical resource consumption data do not address ad-hoc queries and multi-format joins. RAW-HF, on the other hand, addresses ad-hoc queries and also supports multi-format joins. The ORR phase of RAW-HF reduced the WET by 26% compared to the state-of-the-art Partial Loading technique. The MUAR component of RAW-HF is capable of estimating, with 15-20% error, the work memory value required to achieve the best query performance, using data from only a single query run. A comparison of MUAR with machine learning based techniques such as PCC and AutoToken is also presented. The overall CPU, RAM, and IO resource utilization has been improved by 61-91% over traditional database management systems. Although the Partial Loading technique requires 33% less RAM than RAW-HF, it needs 24% more IO. The improvement in dataset size processing capacity is also estimated for the SDSS dataset. The estimation suggests that the RAW-HF framework can be used to process large application datasets efficiently using existing resources.
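
The hybrid idea summarized above, an in-situ engine for immediate access to raw data plus a DBMS that avoids reparsing frequently used data, can be illustrated with a small sketch. This is not the RAW-HF implementation: the class `HybridRouter`, its `storage_budget` parameter (counted in columns), the frequency-based loading rule, and the column names `ra`, `dec`, and `z` are all assumptions made for illustration, and the actual QCA, WSAC, and MUAR algorithms are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class HybridRouter:
    """Toy router between an in-situ engine and a DBMS (hypothetical, illustrative only)."""

    storage_budget: int  # assumed unit: maximum number of columns the DBMS may hold
    loaded: set = field(default_factory=set)       # columns currently loaded into the DBMS
    usage: Counter = field(default_factory=Counter)  # how often each column was queried

    def route(self, columns):
        """Record column usage, then return 'dbms' if every column is loaded, else 'in-situ'."""
        cols = set(columns)
        self.usage.update(cols)
        return "dbms" if cols <= self.loaded else "in-situ"

    def refresh_loaded(self):
        """Assumed policy: keep the most frequently queried columns, up to the budget."""
        # Sort by descending usage count, breaking ties by column name for determinism.
        ranked = sorted(self.usage, key=lambda c: (-self.usage[c], c))
        self.loaded = set(ranked[: self.storage_budget])
        return self.loaded


router = HybridRouter(storage_budget=2)
first = router.route(["ra", "dec"])        # nothing loaded yet, so raw files are parsed
second = router.route(["ra", "dec", "z"])  # still served in-situ
loaded = router.refresh_loaded()           # load the two most used columns
third = router.route(["ra", "dec"])        # now answerable from the DBMS
fourth = router.route(["ra", "z"])         # 'z' is not loaded, so back to in-situ
```

In this sketch, a query is served by the DBMS only when all of its columns have been loaded; anything else falls back to parsing the raw files, which mirrors the trade-off between data-to-query time and reparsing described in the abstract.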