Inexpensive data analytics using cold storage devices
The past few years have seen unprecedented growth in the popularity of cloud-hosted database and data-warehousing services. An ever-increasing customer base, new big-data applications, regulatory-compliance requirements, and a multitude of other factors have caused the data footprint of these services to grow into petabytes, if not exabytes.
Databases in private and public clouds (and enterprise databases in particular) use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier while it is “hot” (frequently accessed), to a disk-based capacity tier, and finally to a tape-based archival tier once it is “cold” (rarely accessed). Cold data has been identified as the fastest-growing storage segment, with a 60% compound annual growth rate.
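The waterfall placement just described can be pictured as a simple policy that assigns data to a tier by access frequency. The sketch below is illustrative only: the tier names, prices, and thresholds are assumptions made for the example, not figures from the project.

```python
# Illustrative monthly storage prices in $/GB; all numbers are hypothetical
# assumptions chosen only to show the shape of a tiering policy.
TIER_COST = {
    "ssd_performance": 0.10,
    "hdd_capacity": 0.03,
    "tape_archive": 0.005,
}

def pick_tier(accesses_per_month: int) -> str:
    """Waterfall placement: data stays on SSD while hot, moves to HDD when
    warm, and lands on tape once cold. Thresholds are made up."""
    if accesses_per_month >= 100:
        return "ssd_performance"
    if accesses_per_month >= 1:
        return "hdd_capacity"
    return "tape_archive"

def monthly_cost(size_gb: float, accesses_per_month: int) -> float:
    """Cost of keeping `size_gb` at the tier the policy selects."""
    return size_gb * TIER_COST[pick_tier(accesses_per_month)]
```

Under such a policy, rarely accessed data is cheap to keep but slow to reach, which is exactly the trade-off that motivates a dedicated cold-storage tier.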
To address the unprecedented growth of cold data, hardware vendors have introduced new devices, named Cold Storage Devices (CSD), explicitly targeted at cold-data workloads. With access latencies in the tens of seconds and a cost as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (milliseconds), high-cost, HDD-based capacity tier and the high-latency (minutes to hours), low-cost, tape-based archival tier, adding a further tier to the storage hierarchy.
In this project, we examine the economic and performance aspects of database management systems (DBMS) built on top of this storage hierarchy. Driven by economics, we propose to flatten the four-tier hierarchy into just two tiers: a performance tier based on SSD and a capacity tier based on cold storage. Besides simplifying data management, such flattening halves the storage cost of the data. However, current database systems suffer a severe performance drop when CSD are used as a replacement for HDD, owing to the mismatch between the design assumptions of a DBMS query execution engine and the actual storage characteristics of CSD. We therefore explore a novel query processing paradigm that overturns the common wisdom of accessing data in a predetermined order and instead lets the storage module decide the access order, thereby achieving both cost savings and fast query performance.
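One way to picture this paradigm: rather than requesting pages in a fixed order, the executor registers interest in a set of pages and processes each page whenever the storage layer chooses to deliver it (a real CSD would, for instance, batch pages that share a disk group to amortize spin-up delays). The toy sketch below simulates the device-chosen order with a shuffle; all class and function names are hypothetical illustrations, not the project's actual implementation.

```python
import random

class OpportunisticStorage:
    """Toy cold-storage module in which the device, not the query,
    decides the order in which pages are delivered."""
    def __init__(self, pages):
        self._pending = list(pages)

    def serve(self):
        # Stand-in for the device's preferred order; a real CSD would
        # group deliveries by physical placement, not shuffle randomly.
        random.shuffle(self._pending)
        yield from self._pending

def scan_sum(storage):
    """Order-agnostic scan: aggregate pages in whatever order they arrive."""
    return sum(page["value"] for page in storage.serve())

pages = [{"id": i, "value": i} for i in range(10)]
total = scan_sum(OpportunisticStorage(pages))  # same result for any order
```

The key property is that the operator (here, a sum) is commutative over pages, so correctness is preserved no matter which access order the storage module picks.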
This project is done in collaboration with EPFL, Switzerland.
Leader: Renata Borovica-Gajic
Collaborators: EPFL, Switzerland
Computing and Information Systems
Networks and data in society
cloud computing; data structures; database systems