Title: An End-to-End Learning-based Cost Estimator
Authors: Ji Sun, Guoliang Li
Synthesis: This paper presents a cardinality and cost estimation method that takes query plan trees as input and estimates both query cost and cardinality. The authors show how to encode query operators, predicates, metadata, and sample bitmaps into vectors; these vectors pass through an embedding layer and then through an estimation network that produces the estimates. An ablation study demonstrates the effectiveness of the string embedding choices, the tree-structured model, and bitmap sampling.
Publisher: PVLDB Vol 13, November 2019
Keywords: Representation Learning, Long Short-Term Memory, String Embedding, Query Cost Estimation, Tree Structure Model

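As a rough illustration of the encoding step, here is a minimal sketch that concatenates a one-hot operator type, a predicate selectivity, and a sample bitmap into a single node vector. The operator list, the feature layout, and `encode_node` are toy assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

# Hypothetical operator vocabulary; the paper encodes richer features
# (predicates, metadata, string embeddings) than this toy layout.
OPERATORS = ["SeqScan", "IndexScan", "HashJoin", "NestedLoop"]

def encode_node(op: str, selectivity: float, sample_bitmap: list) -> np.ndarray:
    one_hot = np.zeros(len(OPERATORS))
    one_hot[OPERATORS.index(op)] = 1.0
    # concatenate: operator one-hot + selectivity scalar + sample bitmap
    return np.concatenate([one_hot, [selectivity],
                           np.asarray(sample_bitmap, dtype=float)])

vec = encode_node("SeqScan", 0.12, [1, 0, 1, 1, 0, 0, 1, 0])
print(vec.shape)  # (13,) = 4 (operator) + 1 (selectivity) + 8 (bitmap)
```
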
Title: Adaptive Partitioning and Indexing for In-Situ Query Processing
Authors: Matthaios Olma, Anastasia Ailamaki, et al.
Synthesis: This paper presents an online partitioning and indexing framework for in-situ query processing. The framework consists of a partition manager, an index manager, and a statistics store. The partition manager generates logical partitions on the fly, and the index manager computes whether it would be beneficial to build an appropriate index for a partition.
Publisher: VLDB Journal 2020
Keywords: Online Partitioning, Online Indexing, In-Situ Query Processing

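As a back-of-the-envelope illustration of that decision, the sketch below builds an index only if the scan cost saved over the expected remaining workload exceeds the one-time build cost. The function and its parameters are my own simplification, not the paper's actual cost model.

```python
# Toy online decision: is building an index on a partition worth it?
# All costs and the workload forecast are hypothetical inputs.
def should_build_index(expected_future_queries: int,
                       scan_cost: float,         # cost of one full partition scan
                       index_probe_cost: float,  # cost of answering via the index
                       build_cost: float) -> bool:
    saved = expected_future_queries * (scan_cost - index_probe_cost)
    return saved > build_cost

# 50 more queries expected: saving 9.5 units each easily repays a 120-unit build.
print(should_build_index(50, scan_cost=10.0, index_probe_cost=0.5,
                         build_cost=120.0))  # True
```
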
Title: Plan-Structured Deep Neural Network Models for Query Performance Prediction
Authors: Ryan Marcus, Olga Papaemmanouil
Synthesis: This paper is the first to propose a plan-structured DNN model for query performance prediction, and it explains the architecture in detail. Every operator has a corresponding DNN unit that takes input from its children and produces a performance estimate; units for the same operator type share network structure and weights.
Publisher: PVLDB 2019
Keywords: Query Performance Estimation, Plan-Structured DNN

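The sketch below imitates the core structural idea with toy linear units: one shared weight matrix per operator type, applied recursively over the plan tree. The dimensions, the `(op, features, children)` node layout, and the tanh units are assumptions; the paper uses real multi-layer networks per operator type.

```python
import numpy as np

rng = np.random.default_rng(0)
HID = 8  # assumed hidden size

# One weight matrix per operator type: units of the same type share weights.
shared_units = {op: rng.normal(size=(HID, HID)) * 0.1 for op in ("Scan", "Join")}

def evaluate(node) -> np.ndarray:
    """Recursively evaluate a plan tree; node = (op_type, feature_vec, children)."""
    op, features, children = node
    child_sum = sum((evaluate(c) for c in children), np.zeros(HID))
    return np.tanh(shared_units[op] @ (features + child_sum))

scan = ("Scan", rng.normal(size=HID), [])
plan = ("Join", rng.normal(size=HID), [scan, scan])
root_state = evaluate(plan)  # a final head would map this to a latency estimate
print(root_state.shape)      # (8,)
```
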
Title: DB-GPT: Large Language Model Meets Database
Authors: Xuanhe Zhou, Zhaoyan Sun, Guoliang Li
Synthesis: This paper demonstrates how to apply LLMs to database tasks. It lists the challenges related to LLM prompting and fine-tuning, and it also shows some results of applying LLMs to query rewriting and index recommendation.
Publisher: Data Science and Engineering 2023
Keywords: Large Language Model, AI4DB, Fine-Tuning, Prompt Engineering

Title: CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex
Authors: Immanuel Trummer
Synthesis: This paper demonstrates an early-stage experiment with Codex to generate query-processing code (Python) from natural language instructions. The author reports success ratios as a function of the number of retries and the style of instructions, shows that the approach achieves results comparable to traditional text-to-SQL methods, and lays out future research plans.
Publisher: PVLDB 2022
Keywords: Codex, Large Language Model, Code Generation

Title: How Large Language Models Will Disrupt Data Management
Authors: Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin
Synthesis: This paper discusses a series of interesting questions about how large language models will disrupt data management. The authors give examples of near-term changes that LLMs could bring to data management, and they also raise some unsolved issues related to data sharing and data governance.
Publisher: VLDB 2023
Keywords: Large Language Model, Data Management, Data Integration

Title: Language Models Enable Simple Systems For Generating Structured Views Of Heterogeneous Data Lakes
Authors: Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Re
Synthesis: This paper presents EVAPORATE, a framework that uses LLMs to extract and organize data from raw documents. The authors compare three methods: direct, code, and code+. The direct method prompts the LLM to extract and organize data straight from the documents; the code method prompts the LLM to generate extraction code; code+ also runs generated code, but applies weak supervision to vote among the outputs of many candidate functions. Code+ gives the best results and strikes a balance between cost (tokens consumed) and quality.
Publisher: PVLDB Vol 17, October 2023
Keywords: Large Language Model, Data Extraction, Function Generation

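A toy stand-in for the code+ idea is sketched below: several LLM-generated candidate extractors run over a document, and a vote picks the answer. The three extractor functions are fabricated examples, and EVAPORATE weights functions by estimated quality rather than using the plain majority vote shown here.

```python
from collections import Counter

# Pretend these three came back from "generate me an extractor" prompts.
def extract_a(doc: str) -> str: return doc.split(":")[1].strip()
def extract_b(doc: str) -> str: return doc.partition(":")[2].strip()
def extract_c(doc: str) -> str: return doc[:4]  # a deliberately bad candidate

def vote(doc: str, candidates) -> str:
    outputs = []
    for fn in candidates:
        try:
            outputs.append(fn(doc))
        except Exception:
            pass  # broken generated code simply abstains
    return Counter(outputs).most_common(1)[0][0]

print(vote("year: 2023", [extract_a, extract_b, extract_c]))  # "2023"
```
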
Title: How Good Are Query Optimizers, Really?
Authors: Viktor Leis, Thomas Neumann, et al.
Synthesis: This paper investigates how query optimizers perform on cardinality estimation, cost estimation, and plan enumeration, evaluated on a more realistic dataset. Based on the experiments, the authors find that cost model errors are dwarfed by cardinality estimation errors, and that estimation errors grow with the number of joined relations. Dynamically re-optimizing the query plan during execution can effectively counteract bad plans. The paper quantifies the impact of different design decisions and suggests worthwhile research directions.
Publisher: PVLDB Vol 9, 2015
Keywords: Query Optimizer, Cardinality Estimation

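The error metric behind findings like these is the q-error: the factor by which an estimate deviates from the true cardinality, symmetric in both directions. A minimal definition:

```python
# q-error: how far off an estimate is, as a symmetric multiplicative factor.
def q_error(estimate: float, truth: float) -> float:
    assert estimate > 0 and truth > 0
    return max(estimate / truth, truth / estimate)

print(q_error(100, 1000))   # 10.0 -- a 10x underestimate
print(q_error(1000, 100))   # 10.0 -- a 10x overestimate scores the same
print(q_error(950, 1000))   # ~1.05 -- close to the ideal value of 1
```
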
Title: Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Synthesis: This is the original paper proposing the Transformer, the model that forms the network core of many large language models. To improve parallelism, the authors use the attention mechanism to learn the relations between sequence positions in parallel; since there is no recurrent structure, positional encodings are used to inject positional information. The Transformer outperforms many state-of-the-art RNN and convolutional models.
Publisher: NIPS 2017
Keywords: Transformer, RNN, Attention

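The central formula is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. A minimal single-head NumPy version, with no masking or multi-head projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 positions, d_k = 4
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```
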
Title: Cardinality Estimation: An Experimental Survey
Authors: Hazar Harmouch, Felix Naumann
Synthesis: This paper investigates 12 cardinality estimation algorithms: FM, PCSA, AMS, BJKST, LogLog, SuperLogLog, HyperLogLog, HyperLogLog++, MinCount, AKMV, LC, and BF. The authors divide them into four categories (counting trailing 1s, counting leading 0s, kth minimum value, and linear synopses) and compare their accuracy and resource requirements. Judged by accuracy, FM, BJKST, AKMV, and BF are the best in their respective classes.
Publisher: VLDB 2017
Keywords: Cardinality Estimation, LogLog, HyperLogLog, MinCount, Bloom Filter

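To make the "counting leading 0s" category concrete, here is a crude single-observable sketch in the FM/LogLog spirit. A real HyperLogLog splits items into many buckets and averages with bias correction, which this toy omits.

```python
import hashlib

def leading_zeros(h: int, bits: int = 64) -> int:
    return bits - h.bit_length()

def estimate_distinct(items) -> float:
    max_lz = 0
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        max_lz = max(max_lz, leading_zeros(h))
    # the longest run of leading zero bits grows like log2(#distinct items)
    return 2.0 ** (max_lz + 1)

# Rough and high-variance -- hence the bucketing in real sketches.
print(estimate_distinct(f"user{i}" for i in range(10_000)))
```
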
Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Synthesis: This paper improves model performance by breaking the left-to-right restriction of the Transformer architecture and learning from the whole text context. To prevent the model from trivially copying the token it must predict, the authors randomly mask 15% of the input tokens and train the model to predict the masked tokens. In this way they train a language model that learns representations from the full bidirectional context.
Publisher: arXiv 2018
Keywords: BERT, Language Model, GPT

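A minimal sketch of that masked-token corruption is below; the actual procedure also replaces some selected tokens with random tokens or keeps them unchanged (the 80/10/10 rule), which this toy omits.

```python
import random

def mask_tokens(tokens, mask_prob: float = 0.15, seed: int = 0):
    """Replace ~15% of tokens with [MASK]; return inputs and prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok        # the model is trained to recover these
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
```
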
Title: Eddies: Continuously Adaptive Query Processing
Authors: Ron Avnur, Joseph M. Hellerstein
Synthesis: This paper argues that instead of trying to find one optimal query plan up front, the system should reorder the join order along the execution pipeline. The eddy module routes tuples to operators based on operator availability, aiming to reduce processing time while maintaining correctness.
Publisher: SIGMOD 2000
Keywords: Adaptive Query Processing

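Here is a toy eddy in the lottery-scheduling spirit: it learns each operator's pass rate online and routes tuples to the most selective operator first, so later operators see fewer tuples. The filters and their pass rates are fabricated, and a serial simulation cannot show the back-pressure signal the real eddy also reacts to.

```python
import random

random.seed(0)
TRUE_PASS = {"filter_a": 0.5, "filter_b": 0.1}   # hidden ground truth
seen = {op: 1 for op in TRUE_PASS}               # tuples routed to op so far
passed = {op: 1 for op in TRUE_PASS}             # tuples that survived op

def route(tup):
    remaining = set(TRUE_PASS)
    while remaining:
        # prefer the operator with the lowest observed pass rate
        op = min(remaining, key=lambda o: passed[o] / seen[o])
        seen[op] += 1
        if random.random() >= TRUE_PASS[op]:
            return None                          # tuple eliminated
        passed[op] += 1
        remaining.remove(op)
    return tup

survivors = sum(1 for t in range(10_000) if route(t) is not None)
print(survivors)  # about 10_000 * 0.5 * 0.1 = 500
```
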
Title: Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
Authors: Jinyang Li, Binyuan Hui, Xuanhe Zhou, Kevin C.C. Chang, Reynold Cheng, Yongbin Li, Guoliang Li
Synthesis: This paper presents BIRD, a text-to-SQL benchmark for LLM-based methods. BIRD collects tables from many different domains, which tests the generalization ability of the evaluated models, and it also accounts for the irregular value types found in real database contents. The new benchmark shows that LLM-based text-to-SQL methods still fall short of human performance, leaving room for further research efforts.
Publisher: NeurIPS 2023
Keywords: Large Language Model, BIRD, Text-to-SQL

Title: Kepler: Robust Learning for Faster Parametric Query Optimization
Authors: Lyric Doshi, Vincent Zhuang, Gaurav Jain, Ryan Marcus, et al.
Synthesis: This paper introduces Kepler, a method for parametric query optimization. Kepler uses Row Count Evolution, which perturbs cardinality estimates to generate a set of candidate plans from a base optimizer. It then trains one neural network per query template to classify the best query plan; the network also produces a confidence value, and the system falls back to the original plan when confidence is low.
Publisher: SIGMOD 2023
Keywords: Row Count Evolution, Plan Optimization, Parametric Query Optimization

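A toy sketch of Row Count Evolution: multiply each subplan cardinality estimate by factors from a spaced grid, re-optimize under each perturbation, and collect the distinct plans produced. The `optimize` stand-in and the perturbation grid are assumptions; Kepler injects perturbed row counts into a real optimizer.

```python
import itertools

PERTURBATIONS = [0.1, 0.5, 1.0, 2.0, 10.0]  # assumed multiplicative grid

def optimize(cardinalities: dict) -> str:
    # stand-in planner: picks a join order by which input looks smaller
    return "A-first" if cardinalities["A"] <= cardinalities["B"] else "B-first"

base = {"A": 1000, "B": 5000}
candidate_plans = set()
for fa, fb in itertools.product(PERTURBATIONS, repeat=2):
    perturbed = {"A": base["A"] * fa, "B": base["B"] * fb}
    candidate_plans.add(optimize(perturbed))

print(candidate_plans)  # {'A-first', 'B-first'} -- the candidate set to learn over
```
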
Title: Parametric Query Optimization
Authors: Yannis E. Ioannidis, Raymond T. Ng, Kyuseok Shim, Timos K. Sellis
Synthesis: This paper introduces how to optimize query plans with respect to a given set of parameter values and how to use randomized algorithms to find the optimal plans for those values. It also introduces a technique called sideways information passing, which can optimize queries for a large number of buffer sizes in the same time the conventional method needs for a single buffer size.
Publisher: VLDB 1992
Keywords: Parametric Query Optimization

Title: Bao: Making Learned Query Optimization Practical
Authors: Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska
Synthesis: This paper introduces Bao (the Bandit Optimizer), a learned query optimizer built on top of a traditional one. Bao uses optimizer hints to steer the underlying optimizer toward better query plans. To select the optimal hint set, it models the selection problem as a multi-armed bandit and uses Thompson sampling to train its evaluation network (the same architecture as in Neo). Experiments show that Bao overcomes many of the obstacles that kept previous learned optimizers from practical deployment, and it even surpasses the original optimizer in tail performance.
Publisher: SIGMOD 2021
Keywords: Multi-armed Bandit Problem, Thompson Sampling

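A heavily simplified Thompson-sampling loop over hint sets is sketched below: each arm keeps a Beta posterior over the chance its hint set beats the default plan. Bao's real implementation samples a neural value network over plan trees rather than Beta arms, and the hint sets and win rates here are fabricated.

```python
import random

random.seed(0)
HINT_SETS = ["default", "no_nestloop", "no_hashjoin"]
TRUE_WIN_RATE = {"default": 0.5, "no_nestloop": 0.7, "no_hashjoin": 0.3}  # hidden
wins = {h: 1 for h in HINT_SETS}    # Beta(1, 1) priors
losses = {h: 1 for h in HINT_SETS}

for _ in range(2000):
    # sample a plausible win rate from each posterior, play the best-looking arm
    arm = max(HINT_SETS, key=lambda h: random.betavariate(wins[h], losses[h]))
    if random.random() < TRUE_WIN_RATE[arm]:   # observe a (simulated) outcome
        wins[arm] += 1
    else:
        losses[arm] += 1

best = max(HINT_SETS, key=lambda h: wins[h] / (wins[h] + losses[h]))
print(best, {h: wins[h] + losses[h] - 2 for h in HINT_SETS})  # pulls favor the winner
```
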
Title: Balsa: Learning a Query Optimizer Without Expert Demonstrations
Authors: Zongheng Yang, Wei-Lin Chiang, Sifei Luan, Gautam Mittal, Michael Luo, Ion Stoica
Synthesis: This paper introduces Balsa, a learned query optimizer that does not need expert demonstrations. Balsa bootstraps itself from a simple cost estimator that uses the PostgreSQL cardinality estimator, then uses an on-policy reinforcement learning process to learn from real execution latencies. The method can be seen as an improvement on Neo. With a few novel techniques such as diversified experiences and multi-agent training, Balsa can explore distinct plans; combining these methods, it generates query plans that outperform expert-built optimizers and state-of-the-art methods such as Bao.
Publisher: SIGMOD 2022
Keywords: Learned Query Optimization, Machine Learning for Systems