Title Authors Synthesis Publisher Keywords
Neural Packet Classification Eric Liang, Ion Stoica This paper proposes using RL to construct a decision tree for packet classifiers, and it shows how to formulate the MDP for constructing the decision tree. SIGCOMM 2019 Reinforcement Learning, Decision Tree
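To make the MDP formulation concrete, here is a toy sketch (names, the 1-D rules, and the reward shaping are my own, not the paper's code): a state is the set of rules reaching a tree node, an action cuts a dimension into pieces, and the reward penalizes tree growth.

```python
# Toy sketch of tree construction as an MDP: a state is the set of rule
# intervals reaching a node, an action cuts one dimension into equal pieces,
# and the reward penalizes each cut (a crude proxy for tree size/depth).
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:              # 1-D toy rule: an interval [lo, hi)
    lo: float
    hi: float

LEAF_THRESHOLD = 2       # stop cutting once few rules remain (assumption)

def step(rules, num_cuts, lo=0.0, hi=1.0):
    """Cut [lo, hi) into num_cuts equal pieces; each child keeps the rules
    that overlap its sub-range. Returns (children, reward, done)."""
    width = (hi - lo) / num_cuts
    children = []
    for i in range(num_cuts):
        a, b = lo + i * width, lo + (i + 1) * width
        children.append([r for r in rules if r.lo < b and r.hi > a])
    reward = -1.0                       # penalize each cut
    done = all(len(c) <= LEAF_THRESHOLD for c in children)
    return children, reward, done

# Example: one RL "step" on three overlapping rules.
rules = [Rule(0.0, 0.5), Rule(0.4, 0.9), Rule(0.8, 1.0)]
children, reward, done = step(rules, num_cuts=2)
print([len(c) for c in children], reward, done)   # [2, 2] -1.0 True
```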
Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data Meghdad Kurmanji, Peter Triantafillou This paper presents a learning framework called DDUp for detecting OOD data and updating learned database components. A statistical test is used to test the hypothesis that the new data follow the same distribution as the data the model was trained on. Knowledge distillation is used to transfer knowledge from the old model to the new model if the hypothesis is rejected. SIGMOD 2023 Knowledge Distillation, Transfer Learning
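A minimal sketch of the two ingredients as I understand them; the KS test and squared-error losses here are illustrative stand-ins, not the paper's concrete statistic or objective.

```python
# Sketch of the two DDUp ingredients, under simplifying assumptions:
# (1) an OOD test comparing the model's per-sample loss on new data against
#     its loss on held-out old data, and (2) a knowledge-distillation loss
#     pulling the new model toward the old one when the test rejects.
import numpy as np
from scipy.stats import ks_2samp

def is_ood(loss_old, loss_new, alpha=0.05):
    """Reject 'same distribution' if per-sample losses differ significantly."""
    return ks_2samp(loss_old, loss_new).pvalue < alpha

def distill_loss(student_pred, teacher_pred, target, lam=0.5):
    """Blend the ground-truth loss with a term keeping the student (new
    model) close to the teacher (old model)."""
    return (1 - lam) * np.mean((student_pred - target) ** 2) \
         + lam * np.mean((student_pred - teacher_pred) ** 2)
```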
From Large Language Models to Databases and Back - A discussion on research and education Sihem Amer-Yahia et al. A panel discussion on LLMs and database research. The panelists discuss the potential impact of LLMs on databases, and the advantages and limitations of incorporating LLMs into databases. DASFAA 2023 LLM, Database, ChatGPT
Fine-grained Partitioning for Aggressive Data Skipping Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin This paper presents a feature-based partitioning framework. A number of features are selected to calculate a feature vector for each tuple. The union (bitwise OR) of all feature vectors in a block is used for data skipping, and data are partitioned to maximize the data-skipping gain. The contributions are (1) a workload analyzer, which generates a set of features from a query log, (2) a partitioner, which computes a blocking scheme by solving an optimization problem, and (3) a feature-based block-skipping framework used in query execution. SIGMOD 2014 Partitioning, NP-Hard
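An illustrative sketch of the block-skipping mechanics (the features, helper names, and data are mine): a block whose OR-ed bit for a feature is 0 provably contains no matching tuple and can be skipped by any query subsumed by that feature.

```python
# Feature-based block skipping: each tuple gets a bit vector over
# workload-derived features, each block stores the OR of its tuples'
# vectors, and a query skips blocks whose relevant bit is 0.
FEATURES = [
    lambda t: t["country"] == "US",      # feature 0, mined from the query log
    lambda t: t["price"] > 100,          # feature 1
]

def tuple_vector(t):
    return [int(f(t)) for f in FEATURES]

def block_vector(block):
    vec = [0] * len(FEATURES)
    for t in block:
        vec = [a | b for a, b in zip(vec, tuple_vector(t))]
    return vec

def scan(blocks, feature_id):
    """Evaluate a query subsumed by FEATURES[feature_id], skipping blocks."""
    for block, vec in blocks:
        if vec[feature_id] == 0:
            continue                      # data skipping: no tuple can match
        yield from (t for t in block if FEATURES[feature_id](t))

blocks = []
for raw in ([{"country": "US", "price": 50}], [{"country": "DE", "price": 200}]):
    blocks.append((raw, block_vector(raw)))
print(list(scan(blocks, 0)))   # only the first block is scanned
```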
Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing Guido Moerkotte This paper shows how to speed up data warehouses with summaries called Small Materialized Aggregates. These SMAs are lightweight and easy to generate. They are similar to views or data cubes, but only compact information is kept for each data bucket. VLDB 1998 SMA, Data Cube, Data Warehouse
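A minimal sketch of the idea, assuming min/max/count as the per-bucket aggregates: the SMA either prunes a bucket, answers it outright, or falls back to scanning it.

```python
# Small Materialized Aggregates in miniature: keep cheap per-bucket
# aggregates and use them to answer or prune range queries.
def build_smas(buckets, key):
    return [{"min": min(r[key] for r in b),
             "max": max(r[key] for r in b),
             "count": len(b)} for b in buckets]

def count_in_range(buckets, smas, key, lo, hi):
    total = 0
    for b, s in zip(buckets, smas):
        if s["max"] < lo or s["min"] > hi:
            continue                          # SMA proves no row qualifies
        if lo <= s["min"] and s["max"] <= hi:
            total += s["count"]               # SMA alone answers the bucket
        else:
            total += sum(lo <= r[key] <= hi for r in b)  # partial: scan it
    return total

buckets = [[{"d": 1}, {"d": 3}], [{"d": 7}, {"d": 9}]]
smas = build_smas(buckets, "d")
print(count_in_range(buckets, smas, "d", 0, 4))   # 2; second bucket skipped
```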
Different Cube Computation Approaches: Survey Paper Dhanshri S. Lad, Rasika P. Saste This paper surveys different algorithms for computing data cubes. The authors also propose using MapReduce to speed up data cube computation. IJCSIT 2014 Data Cube, MapReduce
High-Dimensional OLAP: A Minimal Cubing Approach Xiaolei Li, Jiawei Han, Hector Gonzalez This paper proposes to decompose data cube computation by precomputing small dimension groups called shell fragments together with a value-list inverted index. All dimensions are divided into groups of three or four dimensions called fragments. For each fragment, all data cubes are computed as lists of tuple IDs using the inverted index. The paper also shows how to serve point queries and subcube queries with these fragments. VLDB 2004 Data Cube, Shell Fragment, OLAP
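A toy sketch of shell fragments (the data and fragment split are mine): two tiny fragments are cubed into inverted lists, and a point query is answered by intersecting TID lists across fragments.

```python
# Shell fragments in miniature: within each fragment, every group-by cell
# stores an inverted list of tuple ids; queries intersect lists across
# fragments instead of materializing the full high-dimensional cube.
from itertools import combinations

def build_fragment(rows, dims):
    """Inverted index for every cuboid of this fragment's dimensions."""
    index = {}
    for subset in (c for k in range(1, len(dims) + 1)
                     for c in combinations(dims, k)):
        cells = index.setdefault(subset, {})
        for tid, row in enumerate(rows):
            key = tuple(row[d] for d in subset)
            cells.setdefault(key, set()).add(tid)
    return index

rows = [{"a": 1, "b": "x", "c": 9}, {"a": 1, "b": "y", "c": 9}]
frag_ab = build_fragment(rows, ("a", "b"))   # fragment over dims a, b
frag_c = build_fragment(rows, ("c",))        # fragment over dim c
# Point query a=1 AND c=9: intersect TID lists from the two fragments.
print(frag_ab[("a",)][(1,)] & frag_c[("c",)][(9,)])   # {0, 1}
```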
Data Cube: A Relational Aggregation Operator - Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray, Adam Bosworth, Andrew Layman, Hamid Pirahesh This visionary paper explains what data cubes are and why we need them in analytics. It also shows how to generate data cubes with SQL GROUP BY and proposes the enhancements we need to make to GROUP BY. It categorizes aggregate functions and shows how to compute data cubes with distributive and algebraic aggregates. Data Mining and Knowledge Discovery 1997 Data Cube, Group-By, Cross-Tab, Roll-Up, Drill-Down
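A sketch of what the CUBE operator computes, using Python in place of SQL: the union of GROUP BYs over every subset of the dimensions, with an ALL placeholder for rolled-up columns.

```python
# CUBE as the union of group-bys over all dimension subsets. SUM is
# distributive, so coarser cells could also be derived from finer ones
# instead of recomputed from the raw rows, as the paper discusses.
from itertools import combinations
from collections import defaultdict

def cube(rows, dims, measure):
    out = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            groups = defaultdict(float)
            for r in rows:
                key = tuple(r[d] if d in subset else "ALL" for d in dims)
                groups[key] += r[measure]
            out.update(groups)
    return out

rows = [{"model": "chevy", "year": 1994, "sales": 5},
        {"model": "ford", "year": 1994, "sales": 7}]
for cell, total in sorted(cube(rows, ("model", "year"), "sales").items(),
                          key=str):
    print(cell, total)
# ('ALL', 'ALL') 12.0 appears alongside the finer group-by cells.
```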
Self-Organizing Data Containers Samuel Madden, Jialin Ding, Tim Kraska et al. The authors envision a class of systems called Self-Organizing Data Containers which employ a disaggregated and open system architecture. They name a few characteristics of these systems: supporting efficient indexing, supporting concurrent accesses, and supporting evolving data. They implement a prototype and compare it with Delta Lake. The authors also point out directions for future research: using replication to optimize the physical layout, incremental changes, and auto-optimization. CIDR 2022 Self-Organizing Data Container, Cloud Storage, Amazon S3
Instance-Optimized Data Layouts for Cloud Analytics Workloads Jialin Ding, Umar Farooq Minhas, Tim Kraska et al. This paper proposes a method called MTO which optimizes the QD-Tree through sideways information passing: it adds join-induced predicates into the QD-Tree. Effectively it should perform better than single-table optimization, but join-induced predicates also pose a challenge when new data are added or old data are updated, because they need to be refreshed and adjusted. SIGMOD 2021 QD-Tree, Sideways Information Passing, Instance-Optimized Data Layout
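An illustrative sketch of a join-induced predicate (table and column names are made up): a filter on one table is pushed sideways through the join key, yielding a derived predicate the layout can partition and skip on.

```python
# A filter on `customers` induces, via the join key, a predicate on
# `orders`; an orders partition whose cust values all miss eu_keys can be
# skipped, which is what MTO-style layouts exploit.
orders = [{"o_id": 1, "cust": 10}, {"o_id": 2, "cust": 20}]
customers = [{"cust": 10, "region": "EU"}, {"cust": 20, "region": "US"}]

# Original predicate touches only `customers`...
eu_keys = {c["cust"] for c in customers if c["region"] == "EU"}

# ...but induces a membership predicate on `orders` through the join key.
induced = [o for o in orders if o["cust"] in eu_keys]
print(induced)   # [{'o_id': 1, 'cust': 10}]
```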
Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads Jialin Ding, Umar Farooq Minhas, Tim Kraska et al. This paper proposes a learned multi-dimensional index called Tsunami, an improved successor to Flood, another learned multi-dimensional index. The authors observe that query skew and data correlation pose challenges to both traditional multi-dimensional indexes and Flood. The paper introduces two structures, the Grid Tree and the Augmented Grid; the construction of each is formulated as an optimization problem, and the authors spell out the optimization goals. PVLDB Vol 14, No. 2, 2020 Learned Index, Multi-dimensional Index, Skewed Workload
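A toy sketch of the grid layout such indexes are built on; the boundaries here are fixed by hand, whereas Tsunami chooses them by solving the optimization problems mentioned above.

```python
# Flood/Tsunami-style grids: each dimension gets partition boundaries and a
# point maps to a flat cell id; the learned part is choosing the boundaries
# to fit the data distribution and query workload.
import bisect

bounds_x = [10, 20, 30]        # 4 columns; chosen by the optimizer in Tsunami
bounds_y = [100, 200]          # 3 rows

def cell_id(x, y):
    cx = bisect.bisect_right(bounds_x, x)
    cy = bisect.bisect_right(bounds_y, y)
    return cy * (len(bounds_x) + 1) + cx   # flatten (row, col) to one id

print(cell_id(15, 150))   # column 1, row 1 -> cell 5
```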
SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis Teng Zhang, Jianling Sun et al. This paper presents SA-LSM, which uses survival analysis to predict data access events per record. The system employs a proactive compaction strategy to move cold data to slow media. The authors claim that SA-LSM can reduce tail latency by up to 78.9%. PVLDB Vol 15 Issue 10, 2022 LSM-Tree, Survival Analysis, Random Forest
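A simplified sketch of the prediction step; a random forest regressor stands in here (the keywords suggest a random forest is involved), the features, labels, and threshold are all my assumptions, and the censoring that real survival analysis handles is ignored for brevity.

```python
# Predict per-record time-to-next-access, then mark records predicted to
# stay cold as candidates for proactive compaction onto slow media.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical per-record features: age, accesses in last hour, last day.
X = rng.random((1000, 3))
y = 10 + 100 * X[:, 0] - 50 * X[:, 1] + rng.normal(0, 5, 1000)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

COLD_THRESHOLD = 60.0   # an assumed tuning knob
predicted = model.predict(X[:5])
to_slow_media = predicted > COLD_THRESHOLD   # proactive-compaction candidates
print(predicted.round(1), to_slow_media)
```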
Tiresias: Enabling Predictive Autonomous Storage and Indexing Michael Abebe, Horatiu Lazu, Khuzaima Daudjee This paper presents the Tiresias method, which combines workload prediction with autonomous storage and indexing. The authors introduce how to predict future workloads and estimate plan cost under a specific storage layout. They also give a heuristic method to calculate the benefit of employing a storage change. Though the paper doesn't say how storage changes are proposed, I guess they select from a small fixed set of storage choices. SIGMOD 2021 Autonomous Storage, Indexing, Workload Prediction
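A sketch of that benefit heuristic as I read it (the cost model, layouts, and numbers are stand-ins): estimate the predicted workload's cost under the current and candidate layouts, subtract the cost of making the change, and apply it only if the net gain is positive.

```python
# Net benefit of a storage change = cost saved on the predicted workload
# minus the one-off cost of performing the change.
def benefit(predicted_queries, cost_model, current, candidate, change_cost):
    saved = sum(cost_model(q, current) - cost_model(q, candidate)
                for q in predicted_queries)
    return saved - change_cost

# Toy cost model: a query is cheap if the layout has an index on its column.
def cost_model(query, layout):
    return 1.0 if query["col"] in layout["indexes"] else 10.0

queries = [{"col": "price"}] * 8 + [{"col": "name"}] * 2
current = {"indexes": set()}
candidate = {"indexes": {"price"}}
print(benefit(queries, cost_model, current, candidate, change_cost=30.0))  # 42.0
```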
Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores Felix Halim, Stratos Idreos, Panagiotis Karras, Roland H. C. Yap This paper extends the original idea of database cracking by introducing stochastic cracks. The original database cracking cracks a column exactly at the bounds of query predicates. This paper shows that, unlike under random workloads, the original method brings little benefit under sequential workloads. To address this problem the paper proposes two algorithms, DDC and DDR. The difference is that DDC always tries to cut at the center, which incurs the cost of finding the median, while DDR cuts at a random position. The authors also devise other variants with lighter initialization cost based on these two algorithms. PVLDB 2012 Database Cracking, Column Store, Adaptive Indexing
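A toy sketch of the core operation, cracking a piece of a column in two around a pivot: DDR picks the pivot at random from the piece, while DDC would pick the median at extra cost; either way large unindexed pieces get split even under sequential workloads.

```python
# Crack-in-two: partition col[lo:hi] in place so values < pivot precede the
# rest; the returned split point becomes a new "crack" in the index.
import random

def crack_in_two(col, lo, hi, pivot):
    i, j = lo, hi - 1
    while i <= j:
        if col[i] < pivot:
            i += 1
        else:
            col[i], col[j] = col[j], col[i]
            j -= 1
    return i

col = [7, 2, 9, 4, 1, 8, 3]
random.seed(1)
pivot = col[random.randrange(len(col))]   # DDR: random pivot from the piece
split = crack_in_two(col, 0, len(col), pivot)
print(col, "split at", split, "pivot", pivot)
```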
Annotating Columns with Pre-trained Language Models Yoshihiko Suhara, Jinfeng Li, Cagatay Demiralp, Chen Chen, Wang-Chiew Tan This paper shows how to use language models as representation learners to predict column types and column relations in a table. The authors show how to encode a whole table as a single input sequence instead of a per-column one, and how to combine the two tasks in one model. But they don't answer why the order of row and column values seems to matter little in their tests. SIGMOD 2022 Large Language Model, Column Annotation, Column Clustering
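A sketch of the table-wise encoding idea; the serialization format below is my guess at the shape of such an encoding, not necessarily the paper's exact one.

```python
# All columns flattened into a single token sequence with separators, so one
# LM pass can annotate every column, rather than encoding columns one by one.
def serialize_table(columns):
    """columns: {placeholder_name: [values]}; column names are what we want
    to predict, so only the values are serialized."""
    parts = ["[SEP] " + " ".join(map(str, values))
             for values in columns.values()]
    return "[CLS] " + " ".join(parts)

table = {"col0": ["Tokyo", "Paris"], "col1": [13960000, 2165000]}
print(serialize_table(table))
# '[CLS] [SEP] Tokyo Paris [SEP] 13960000 2165000' is fed to the LM, with a
# prediction head reading each column's separator (or pooled) representation.
```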
Learning a Partitioning Advisor for Cloud Databases Benjamin Hilprecht, Carsten Binnig, Uwe Rohm This paper shows how to use DRL to find an optimized horizontal partitioning for a cloud database. The authors first train the DRL agent offline against a cost model, then refine the training online. To further reduce training time they use a sampled dataset instead of the full dataset, together with runtime caching and maximum-runtime capping. SIGMOD 2020 Deep Reinforcement Learning, Partitioning, Q-Learning
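A minimal tabular Q-learning sketch of the offline phase (the paper uses deep RL with richer state features; tables, schemes, and the cost model here are invented): the point is that rewards come from a cost model rather than real query executions, which is what makes offline bootstrapping cheap before online refinement.

```python
# Offline Q-learning against a cost model: actions assign a partitioning
# scheme to a table, and the reward is the modeled cost reduction.
import random

TABLES = ["orders", "lineitem"]
SCHEMES = ["replicate", "hash_key"]
ACTIONS = [(t, s) for t in TABLES for s in SCHEMES]

def cost_model(state):                 # stand-in for the offline cost model
    cost = 100.0
    cost -= 40.0 if state.get("lineitem") == "hash_key" else 0.0
    cost -= 20.0 if state.get("orders") == "replicate" else 0.0
    return cost

Q, eps, alpha = {}, 0.2, 0.1
for _ in range(2000):                  # offline episodes against the model
    state = {}
    for _ in range(len(TABLES)):       # one decision per table
        key = tuple(sorted(state.items()))
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q.get((key, a), 0.0))
        nxt = {**state, action[0]: action[1]}
        reward = cost_model(state) - cost_model(nxt)
        q = Q.get((key, action), 0.0)
        Q[(key, action)] = q + alpha * (reward - q)   # 1-step, no discount
        state = nxt

# Greedy rollout with the learned Q-table:
state = {}
for _ in range(len(TABLES)):
    key = tuple(sorted(state.items()))
    t, s = max(ACTIONS, key=lambda a: Q.get((key, a), 0.0))
    state[t] = s
print(state)   # e.g. {'lineitem': 'hash_key', 'orders': 'replicate'}
```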