Title Authors Synthesis Publisher Keywords
Neural Packet Classification Eric Liang, Ion Stoica This paper proposes using RL to construct a decision tree for packet classifiers, and it shows how to formulate the MDP for constructing the decision tree. SIGCOMM 2019 Reinforcement Learning, Decision Tree
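To make the MDP formulation concrete, here is a toy sketch (names, the 1-D rules, and the reward shaping are my own, not the paper's code): a state is the set of rules reaching a tree node, an action cuts a dimension into pieces, and the reward penalizes tree growth.

```python
# Toy sketch of tree construction as an MDP: a state is the set of rule
# intervals reaching a node, an action cuts one dimension into equal pieces,
# and the reward penalizes each cut (a crude proxy for tree size/depth).
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:              # 1-D toy rule: an interval [lo, hi)
    lo: float
    hi: float

LEAF_THRESHOLD = 2       # stop cutting once few rules remain (assumption)

def step(rules, num_cuts, lo=0.0, hi=1.0):
    """Cut [lo, hi) into num_cuts equal pieces; each child keeps the rules
    that overlap its sub-range. Returns (children, reward, done)."""
    width = (hi - lo) / num_cuts
    children = []
    for i in range(num_cuts):
        a, b = lo + i * width, lo + (i + 1) * width
        children.append([r for r in rules if r.lo < b and r.hi > a])
    reward = -1.0                       # penalize each cut
    done = all(len(c) <= LEAF_THRESHOLD for c in children)
    return children, reward, done

# Example: one RL "step" on three overlapping rules.
rules = [Rule(0.0, 0.5), Rule(0.4, 0.9), Rule(0.8, 1.0)]
children, reward, done = step(rules, num_cuts=2)
print([len(c) for c in children], reward, done)   # [2, 2] -1.0 True
```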
Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data Meghdad Kurmanji, Peter Triantafillou This paper presents a learning framework called DDUp for detecting OOD data and updating learned database components. A statistical test is used to test the hypothesis that the new data follow the same distribution as the data the model was trained on. Knowledge distillation is used to transfer knowledge from the old model to the new model if the hypothesis is rejected. SIGMOD 2023 Knowledge Distillation, Transfer Learning
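A minimal sketch of the two ingredients as I understand them; the KS test and squared-error losses here are illustrative stand-ins, not the paper's concrete statistic or objective.

```python
# Sketch of the two DDUp ingredients, under simplifying assumptions:
# (1) an OOD test comparing the model's per-sample loss on new data against
#     its loss on held-out old data, and (2) a knowledge-distillation loss
#     pulling the new model toward the old one when the test rejects.
import numpy as np
from scipy.stats import ks_2samp

def is_ood(loss_old, loss_new, alpha=0.05):
    """Reject 'same distribution' if per-sample losses differ significantly."""
    return ks_2samp(loss_old, loss_new).pvalue < alpha

def distill_loss(student_pred, teacher_pred, target, lam=0.5):
    """Blend the ground-truth loss with a term keeping the student (new
    model) close to the teacher (old model)."""
    return (1 - lam) * np.mean((student_pred - target) ** 2) \
         + lam * np.mean((student_pred - teacher_pred) ** 2)
```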
From Large Language Models to Databases and Back - A discussion on research and education Sihem Amer-Yahia et al. A panel discussion on LLMs and database research. The panelists discuss the potential impact of LLMs on databases, and the advantages and limitations of incorporating LLMs into databases. DASFAA 2023 LLM, Database, ChatGPT
Fine-grained Partitioning for Aggressive Data Skipping Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin This paper presents a feature-based partitioning framework. A number of features are selected to calculate a feature vector for each tuple. The union (bitwise OR) of all feature vectors in a block is used for data skipping, and data are partitioned to maximize the data-skipping gain. The contributions are (1) a workload analyzer, which generates a set of features from a query log, (2) a partitioner, which computes a blocking scheme by solving an optimization problem, and (3) a feature-based block-skipping framework used in query execution. SIGMOD 2014 Partitioning, NP-Hard
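An illustrative sketch of the block-skipping mechanics (the features, helper names, and data are mine): a block whose OR-ed bit for a feature is 0 provably contains no matching tuple and can be skipped by any query subsumed by that feature.

```python
# Feature-based block skipping: each tuple gets a bit vector over
# workload-derived features, each block stores the OR of its tuples'
# vectors, and a query skips blocks whose relevant bit is 0.
FEATURES = [
    lambda t: t["country"] == "US",      # feature 0, mined from the query log
    lambda t: t["price"] > 100,          # feature 1
]

def tuple_vector(t):
    return [int(f(t)) for f in FEATURES]

def block_vector(block):
    vec = [0] * len(FEATURES)
    for t in block:
        vec = [a | b for a, b in zip(vec, tuple_vector(t))]
    return vec

def scan(blocks, feature_id):
    """Evaluate a query subsumed by FEATURES[feature_id], skipping blocks."""
    for block, vec in blocks:
        if vec[feature_id] == 0:
            continue                      # data skipping: no tuple can match
        yield from (t for t in block if FEATURES[feature_id](t))

blocks = []
for raw in ([{"country": "US", "price": 50}], [{"country": "DE", "price": 200}]):
    blocks.append((raw, block_vector(raw)))
print(list(scan(blocks, 0)))   # only the first block is scanned
```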
Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing Guido Moerkotte This paper shows how to speed up data warehouses with summaries called Small Materialized Aggregates. These SMAs are lightweight and easy to generate. They are similar to views or data cubes, but only compact information is kept for each data bucket. VLDB 1998 SMA, Data Cube, Data Warehouse
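A minimal sketch of the idea, assuming min/max/count as the per-bucket aggregates: the SMA either prunes a bucket, answers it outright, or falls back to scanning it.

```python
# Small Materialized Aggregates in miniature: keep cheap per-bucket
# aggregates and use them to answer or prune range queries.
def build_smas(buckets, key):
    return [{"min": min(r[key] for r in b),
             "max": max(r[key] for r in b),
             "count": len(b)} for b in buckets]

def count_in_range(buckets, smas, key, lo, hi):
    total = 0
    for b, s in zip(buckets, smas):
        if s["max"] < lo or s["min"] > hi:
            continue                          # SMA proves no row qualifies
        if lo <= s["min"] and s["max"] <= hi:
            total += s["count"]               # SMA alone answers the bucket
        else:
            total += sum(lo <= r[key] <= hi for r in b)  # partial: scan it
    return total

buckets = [[{"d": 1}, {"d": 3}], [{"d": 7}, {"d": 9}]]
smas = build_smas(buckets, "d")
print(count_in_range(buckets, smas, "d", 0, 4))   # 2; second bucket skipped
```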
Different Cube Computation Approaches: Survey Paper Dhanshri S. Lad, Rasika P. Saste This paper surveys different algorithms for computing data cubes. The authors also propose using MapReduce to speed up data cube computation. IJCSIT 2014 Data Cube, MapReduce
High-Dimensional OLAP: A Minimal Cubing Approach Xiaolei Li, Jiawei Han, Hector Gonzalez This paper proposes to decompose data cube computation by precomputing small dimension groups called shell fragments together with a value-list inverted index. All dimensions are divided into groups of three or four dimensions called fragments. For each fragment, all data cubes are computed as lists of tuple IDs using the inverted index. The paper also shows how to serve point queries and subcube queries with these fragments. VLDB 2004 Data Cube, Shell Fragment, OLAP
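A toy sketch of shell fragments (the data and fragment split are mine): two tiny fragments are cubed into inverted lists, and a point query is answered by intersecting TID lists across fragments.

```python
# Shell fragments in miniature: within each fragment, every group-by cell
# stores an inverted list of tuple ids; queries intersect lists across
# fragments instead of materializing the full high-dimensional cube.
from itertools import combinations

def build_fragment(rows, dims):
    """Inverted index for every cuboid of this fragment's dimensions."""
    index = {}
    for subset in (c for k in range(1, len(dims) + 1)
                     for c in combinations(dims, k)):
        cells = index.setdefault(subset, {})
        for tid, row in enumerate(rows):
            key = tuple(row[d] for d in subset)
            cells.setdefault(key, set()).add(tid)
    return index

rows = [{"a": 1, "b": "x", "c": 9}, {"a": 1, "b": "y", "c": 9}]
frag_ab = build_fragment(rows, ("a", "b"))   # fragment over dims a, b
frag_c = build_fragment(rows, ("c",))        # fragment over dim c
# Point query a=1 AND c=9: intersect TID lists from the two fragments.
print(frag_ab[("a",)][(1,)] & frag_c[("c",)][(9,)])   # {0, 1}
```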
Data Cube: A Relational Aggregation Operator - Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh This visionary paper explains what data cubes are and why we need them in analytics. It also shows how to generate data cubes with SQL GROUP BY and proposes the enhancements we need to make to GROUP BY. It categorizes aggregate functions and shows how to compute data cubes with distributive and algebraic aggregates. Data Mining and Knowledge Discovery 1997 Data Cube, Group-By, Cross-Tab, Roll-Up, Drill-Down
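A sketch of what the CUBE operator computes, using Python in place of SQL: the union of GROUP BYs over every subset of the dimensions, with an ALL placeholder for rolled-up columns.

```python
# CUBE as the union of group-bys over all dimension subsets. SUM is
# distributive, so coarser cells could also be derived from finer ones
# instead of recomputed from the raw rows, as the paper discusses.
from itertools import combinations
from collections import defaultdict

def cube(rows, dims, measure):
    out = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            groups = defaultdict(float)
            for r in rows:
                key = tuple(r[d] if d in subset else "ALL" for d in dims)
                groups[key] += r[measure]
            out.update(groups)
    return out

rows = [{"model": "chevy", "year": 1994, "sales": 5},
        {"model": "ford", "year": 1994, "sales": 7}]
for cell, total in sorted(cube(rows, ("model", "year"), "sales").items(),
                          key=str):
    print(cell, total)
# ('ALL', 'ALL') 12.0 appears alongside the finer group-by cells.
```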
Self-Organizing Data Containers Samuel Madden, Jialin Ding, Tim Kraska et al. The authors envision a class of systems called Self-Organizing Data Containers which employ a disaggregated and open system architecture. They name a few characteristics of these systems: supporting efficient indexing, supporting concurrent accesses, and supporting evolving data. They implement a prototype and compare it with Delta Lake. The authors also point out directions for future research: using replication to optimize the physical layout, incremental changes, and auto-optimization. CIDR 2022 Self-Organizing Data Container, Cloud Storage, Amazon S3
Instance-Optimized Data Layouts for Cloud Analytics Workloads Jialin Ding, Umar Farooq Minhas, Tim Kraska et al. This paper proposes a method called MTO which optimizes the QD-Tree through sideways information passing: it adds join-induced predicates into the QD-Tree. Effectively it should perform better than single-table optimization, but join-induced predicates also pose a challenge when new data are added or old data are updated, because they need to be refreshed and adjusted. SIGMOD 2021 QD-Tree, Sideways Information Passing, Instance-Optimized Data Layout
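An illustrative sketch of a join-induced predicate (table and column names are made up): a filter on one table is pushed sideways through the join key, yielding a derived predicate the layout can partition and skip on.

```python
# A filter on `customers` induces, via the join key, a predicate on
# `orders`; an orders partition whose cust values all miss eu_keys can be
# skipped, which is what MTO-style layouts exploit.
orders = [{"o_id": 1, "cust": 10}, {"o_id": 2, "cust": 20}]
customers = [{"cust": 10, "region": "EU"}, {"cust": 20, "region": "US"}]

# Original predicate touches only `customers`...
eu_keys = {c["cust"] for c in customers if c["region"] == "EU"}

# ...but induces a membership predicate on `orders` through the join key.
induced = [o for o in orders if o["cust"] in eu_keys]
print(induced)   # [{'o_id': 1, 'cust': 10}]
```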
Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads Jialin Ding, Umar Farooq Minhas, Tim Kraska et al. This paper proposes a learned multi-dimensional index called Tsunami, an improved successor to Flood, another learned multi-dimensional index. The authors observe that query skew and data correlation pose challenges to both traditional multi-dimensional indexes and Flood. The paper introduces two structures, the Grid Tree and the Augmented Grid; the construction of each is formulated as an optimization problem, and the authors spell out the optimization goals. PVLDB Vol 14, No. 2, 2020 Learned Index, Multi-dimensional Index, Skewed Workload
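A toy sketch of the grid layout such indexes are built on; the boundaries here are fixed by hand, whereas Tsunami chooses them by solving the optimization problems mentioned above.

```python
# Flood/Tsunami-style grids: each dimension gets partition boundaries and a
# point maps to a flat cell id; the learned part is choosing the boundaries
# to fit the data distribution and query workload.
import bisect

bounds_x = [10, 20, 30]        # 4 columns; chosen by the optimizer in Tsunami
bounds_y = [100, 200]          # 3 rows

def cell_id(x, y):
    cx = bisect.bisect_right(bounds_x, x)
    cy = bisect.bisect_right(bounds_y, y)
    return cy * (len(bounds_x) + 1) + cx   # flatten (row, col) to one id

print(cell_id(15, 150))   # column 1, row 1 -> cell 5
```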
SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis Teng Zhang, Jianling Sun et al. This paper presents SA-LSM, which uses survival analysis to predict data access events per record. The system employs a proactive compaction strategy to move cold data to slow media. The authors claim that SA-LSM can reduce tail latency by up to 78.9%. PVLDB Vol 15 Issue 10, 2022 LSM-Tree, Survival Analysis, Random Forest
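A simplified sketch of the prediction step; a random forest regressor stands in here (the keywords suggest a random forest is involved), the features, labels, and threshold are all my assumptions, and the censoring that real survival analysis handles is ignored for brevity.

```python
# Predict per-record time-to-next-access, then mark records predicted to
# stay cold as candidates for proactive compaction onto slow media.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical per-record features: age, accesses in last hour, last day.
X = rng.random((1000, 3))
y = 10 + 100 * X[:, 0] - 50 * X[:, 1] + rng.normal(0, 5, 1000)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

COLD_THRESHOLD = 60.0   # an assumed tuning knob
predicted = model.predict(X[:5])
to_slow_media = predicted > COLD_THRESHOLD   # proactive-compaction candidates
print(predicted.round(1), to_slow_media)
```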
Tiresias: Enabling Predictive Autonomous Storage and Indexing Michael Abebe, Horatiu Lazu, Khuzaima Daudjee This paper presents the Tiresias method, which combines workload prediction with autonomous storage and indexing. The authors introduce how to predict future workloads and estimate plan cost under a specific storage layout. They also give a heuristic method to calculate the benefit of employing a storage change. Though the paper doesn't say how storage changes are proposed, I guess they select from a small fixed set of storage choices. SIGMOD 2021 Autonomous Storage, Indexing, Workload Prediction
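A sketch of that benefit heuristic as I read it (the cost model, layouts, and numbers are stand-ins): estimate the predicted workload's cost under the current and candidate layouts, subtract the cost of making the change, and apply it only if the net gain is positive.

```python
# Net benefit of a storage change = cost saved on the predicted workload
# minus the one-off cost of performing the change.
def benefit(predicted_queries, cost_model, current, candidate, change_cost):
    saved = sum(cost_model(q, current) - cost_model(q, candidate)
                for q in predicted_queries)
    return saved - change_cost

# Toy cost model: a query is cheap if the layout has an index on its column.
def cost_model(query, layout):
    return 1.0 if query["col"] in layout["indexes"] else 10.0

queries = [{"col": "price"}] * 8 + [{"col": "name"}] * 2
current = {"indexes": set()}
candidate = {"indexes": {"price"}}
print(benefit(queries, cost_model, current, candidate, change_cost=30.0))  # 42.0
```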
Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores Felix Halim, Stratos Idreos, Panagiotis Karras, Roland H. C. Yap This paper extends the original idea of database cracking by introducing stochastic cracks. The original database cracking cracks a column exactly at the bounds of query predicates. This paper shows that, unlike under random workloads, the original method brings little benefit under sequential workloads. To address this problem the paper proposes two algorithms, DDC and DDR. The difference is that DDC always tries to cut at the center, which incurs the cost of finding the median, while DDR cuts at a random position. The authors also devise other variants with lighter initialization cost based on these two algorithms. PVLDB 2012 Database Cracking, Column Store, Adaptive Indexing
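A toy sketch of the core operation, cracking a piece of a column in two around a pivot: DDR picks the pivot at random from the piece, while DDC would pick the median at extra cost; either way large unindexed pieces get split even under sequential workloads.

```python
# Crack-in-two: partition col[lo:hi] in place so values < pivot precede the
# rest; the returned split point becomes a new "crack" in the index.
import random

def crack_in_two(col, lo, hi, pivot):
    i, j = lo, hi - 1
    while i <= j:
        if col[i] < pivot:
            i += 1
        else:
            col[i], col[j] = col[j], col[i]
            j -= 1
    return i

col = [7, 2, 9, 4, 1, 8, 3]
random.seed(1)
pivot = col[random.randrange(len(col))]   # DDR: random pivot from the piece
split = crack_in_two(col, 0, len(col), pivot)
print(col, "split at", split, "pivot", pivot)
```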
Annotating Columns with Pre-trained Language Models Yoshihiko Suhara, Jinfeng Li, Cagatay Demiralp, Chen Chen, Wang-Chiew Tan This paper shows how to use language models as representation learners to predict column types and column relations in a table. The authors show how to encode a whole table as a single input sequence instead of a per-column one, and how to combine the two tasks in one model. But they don't answer why the order of row and column values seems to matter little in their tests. SIGMOD 2022 Large Language Model, Column Annotation, Column Clustering
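A sketch of the table-wise encoding idea; the serialization format below is my guess at the shape of such an encoding, not necessarily the paper's exact one.

```python
# All columns flattened into a single token sequence with separators, so one
# LM pass can annotate every column, rather than encoding columns one by one.
def serialize_table(columns):
    """columns: {placeholder_name: [values]}; column names are what we want
    to predict, so only the values are serialized."""
    parts = ["[SEP] " + " ".join(map(str, values))
             for values in columns.values()]
    return "[CLS] " + " ".join(parts)

table = {"col0": ["Tokyo", "Paris"], "col1": [13960000, 2165000]}
print(serialize_table(table))
# '[CLS] [SEP] Tokyo Paris [SEP] 13960000 2165000' is fed to the LM, with a
# prediction head reading each column's separator (or pooled) representation.
```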
Learning a Partitioning Advisor for Cloud Databases Benjamin Hilprecht, Carsten Binnig, Uwe Rohm This paper shows how to use DRL to find an optimized horizontal partitioning for a cloud database. The authors first train the DRL agent offline against a cost model, then refine the training online. To further reduce training time they use a sampled dataset instead of the full dataset, together with runtime caching and maximum-runtime capping. SIGMOD 2020 Deep Reinforcement Learning, Partitioning, Q-Learning
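A minimal tabular Q-learning sketch of the offline phase (the paper uses deep RL with richer state features; tables, schemes, and the cost model here are invented): the point is that rewards come from a cost model rather than real query executions, which is what makes offline bootstrapping cheap before online refinement.

```python
# Offline Q-learning against a cost model: actions assign a partitioning
# scheme to a table, and the reward is the modeled cost reduction.
import random

TABLES = ["orders", "lineitem"]
SCHEMES = ["replicate", "hash_key"]
ACTIONS = [(t, s) for t in TABLES for s in SCHEMES]

def cost_model(state):                 # stand-in for the offline cost model
    cost = 100.0
    cost -= 40.0 if state.get("lineitem") == "hash_key" else 0.0
    cost -= 20.0 if state.get("orders") == "replicate" else 0.0
    return cost

Q, eps, alpha = {}, 0.2, 0.1
for _ in range(2000):                  # offline episodes against the model
    state = {}
    for _ in range(len(TABLES)):       # one decision per table
        key = tuple(sorted(state.items()))
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q.get((key, a), 0.0))
        nxt = {**state, action[0]: action[1]}
        reward = cost_model(state) - cost_model(nxt)
        q = Q.get((key, action), 0.0)
        Q[(key, action)] = q + alpha * (reward - q)   # 1-step, no discount
        state = nxt

# Greedy rollout with the learned Q-table:
state = {}
for _ in range(len(TABLES)):
    key = tuple(sorted(state.items()))
    t, s = max(ACTIONS, key=lambda a: Q.get((key, a), 0.0))
    state[t] = s
print(state)   # e.g. {'lineitem': 'hash_key', 'orders': 'replicate'}
```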