Title | Authors | Synthesis | Publisher | Keywords |
--- | --- | --- | --- | --- |
An End-to-End Learning-based Cost Estimator | Ji Sun, Guoliang Li | This paper presents a cardinality and cost estimation method that takes query plan trees as input and produces estimates of query cost and cardinality. The authors show how to encode query operators, predicates, metadata, and sample bitmaps into vectors. These vectors pass through an embedding layer and then through an estimation network that outputs the cost estimate. An ablation study demonstrates the effectiveness of the string embedding choices, the tree-structured model, and bitmap sampling. | PVLDB Vol 13, November 2019 | Representation Learning, Long Short-Term Memory, String Embedding, Query Cost Estimation, Tree Structure Model |
Adaptive Partitioning and Indexing for In-Situ Query Processing | Matthaios Olma, Anastasia Ailamaki, et al. | This paper presents an online partitioning and indexing framework for in-situ query processing. The framework consists of a partition manager, an index manager, and a statistics store. The partition manager generates logical partitions on the fly, and the index manager calculates whether it is beneficial to build an appropriate index for a partition. | VLDB Journal 2020 | Online Partitioning, Online Indexing, In-Situ Query Processing |
Plan-Structured Deep Neural Network Models for Query Performance Prediction | Ryan Marcus, Olga Papaemmanouil | This paper is the first to propose a plan-structured DNN model for query performance prediction, and it explains the architecture of the plan-structured DNN. Every operator has a corresponding DNN unit that takes input from its children and produces a performance estimate; units for the same operator type share network structure and weights. | PVLDB 2019 | Query Performance Estimation, Plan-Structured DNN |
DB-GPT: Large Language Model Meets Database | Xuanhe Zhou, Zhaoyan Sun, Guoliang Li | This paper demonstrates how to apply LLMs to database tasks. It lists the challenges related to LLM prompting and fine-tuning, and it also shows results of applying LLMs to query rewriting and index recommendation. | Data Science and Engineering 2023 | Large Language Model, AI4DB, Fine-Tuning, Prompt Engineering |
CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex | Immanuel Trummer | This paper demonstrates an early-stage experiment using Codex to generate query code (Python) from natural language instructions. The author reports success ratios as a function of the number of retries and the instruction style, shows that the approach achieves results comparable to traditional text-to-SQL methods, and outlines future research plans. | PVLDB 2022 | Codex, Large Language Model, Code Generation |
How Large Language Models Will Disrupt Data Management | Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin | This paper discusses a series of interesting questions about how large language models will disrupt data management. The authors give examples of near-term changes that LLMs can bring to data management, and they also raise unresolved issues related to data sharing and data governance. | PVLDB 2023 | Large Language Model, Data Management, Data Integration |
Language Models Enable Simple Systems For Generating Structured Views Of Heterogeneous Data Lakes | Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Re | This paper presents EVAPORATE, a framework that uses an LLM to extract and organize data from raw documents. The authors compare three methods: direct, code, and code+. The direct method prompts the LLM to extract and organize data directly from the documents. The code method prompts the LLM to generate code that extracts and organizes the data. The code+ method also uses generated code, but it applies weak supervision to vote over the results of many candidate functions. Code+ gives the best results and strikes a balance between cost (tokens consumed) and quality. | PVLDB Vol 17, October 2023 | Large Language Model, Data Extraction, Function Generation |
How Good Are Query Optimizers, Really? | Viktor Leis, Thomas Neumann, et al. | This paper investigates how query optimizers perform on cardinality estimation, cost estimation, and plan enumeration, and it evaluates these optimizers on a more realistic dataset. Based on the experiments, the authors find that cost model errors are dwarfed by cardinality estimation errors and that estimation errors grow with the number of joined relations. Dynamic query plan optimization during query execution can effectively counteract bad query plans. The paper quantifies the impact of different design decisions and suggests worthwhile research directions. | PVLDB Vol 9, 2015 | Query Optimizer, Cardinality Estimation |
Attention Is All You Need | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin | This is the original paper proposing the Transformer, the model that forms the network core of many large language models. To improve parallelization, the authors use the attention mechanism to learn relations between sequence positions in parallel; since there is no recurrent structure, positional encodings are used to inject positional information. The Transformer outperforms many state-of-the-art RNN and convolutional models. | NIPS 2017 | Transformer, RNN, Attention |
Cardinality Estimation: An Experimental Survey | Hazar Harmouch, Felix Naumann | This paper investigates 12 cardinality estimation algorithms: FM, PCSA, AMS, BJKST, LogLog, SuperLogLog, HyperLogLog, HyperLogLog++, MinCount, AKMV, LC, and BF. The authors divide them into four categories (counting trailing 1s, counting leading 0s, kth minimum value, and linear synopses) and compare their accuracy and resource requirements. FM, BJKST, AKMV, and BF are the most accurate in their respective categories. | PVLDB 2017 | Cardinality Estimation, LogLog, HyperLogLog, MinCount, Bloom Filter |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | This paper improves model performance by breaking the left-to-right limitation of Transformer language models and learning from the whole text context. To prevent the model from trivially copying the tokens it must predict, the authors randomly mask 15% of the input tokens and train the model to predict the masked tokens (a minimal masking sketch appears after this table). In this way they train a language model that learns representations from the full bidirectional context. | arXiv 2018 | BERT, Language Model, GPT |
Eddies: Continuously Adaptive Query Processing | Ron Avnur, Joseph M. Hellerstein | This paper argues that instead of trying to find an optimal query plan up front, the system should reorder the join operators during pipeline execution. The eddy module routes tuples to operators based on operator availability, aiming to reduce processing time while maintaining correctness. | SIGMOD 2000 | Adaptive Query Processing |
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs | Jinyang Li, Binyuan Hui, Xuanhe Zhou, Kevin C.C. Chang, Reynold Cheng, Yongbin Li, Guoliang Li | This paper presents BIRD, a text-to-SQL benchmark for LLM-based methods. BIRD collects tables from many different domains, which tests the generalization ability of the evaluated models, and it also accounts for the irregular value types found in real databases. The new benchmark shows that LLM-based text-to-SQL methods are still inferior to humans, which leaves room for further research. | NeurIPS 2023 | Large Language Model, BIRD, Text-to-SQL |
Kepler: Robust Learning for Faster Parametric Query Optimization | Lyric Doshi, Vincent Zhuang, Gaurav Jain, Ryan Marcus, et al. | This paper introduces Kepler, a method for parametric query optimization. Kepler uses Row Count Evolution, which perturbs cardinality estimates to generate a set of candidate plans from a base plan optimizer. It then trains a neural network per query template to select the best query plan. The network also produces a confidence value, and the model falls back to the original plan when confidence is low. | SIGMOD 2023 | Row Count Evolution, Plan Optimization, Parametric Query Optimization |
Parametric Query Optimization | Yannis E. Ioannidis, Raymond T. Ng, Kyuseok Shim, Timos K. Sellis | This paper shows how to optimize query plans with respect to a given set of parameter values and how to use randomized algorithms to find the optimal plans for those values. It also introduces a technique called sideways information passing, which can optimize queries for a large number of buffer sizes in the same time the conventional method needs for a single buffer size. | VLDB 1992 | Parametric Query Optimization |
Bao: Making Learned Query Optimization Practical | Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska | This paper introduces Bao (the Bandit Optimizer), a query optimizer built on top of a traditional query optimizer. Bao uses optimizer hints to steer the traditional optimizer toward better query plans. To select the optimal hints, it models the selection problem as a multi-armed bandit and uses Thompson Sampling to train an evaluation network of the same form as in Neo (a simplified Thompson Sampling sketch appears after this table). The experiments show that Bao overcomes many of the practical limitations of previous learned optimizers and even surpasses the underlying optimizer in tail performance. | SIGMOD 2021 | Multi-armed Bandit Problem, Thompson Sampling |
Balsa: Learning a Query Optimizer Without Expert Demonstrations | Zongheng Yang, Wei-Lin Chiang, Sifei Luan, Gautam Mittal, Michael Luo, Ion Stoica | This paper introduces Balsa, a learned query optimizer that does not need expert demonstrations. Balsa bootstraps itself from a simple cost estimator that uses the PostgreSQL cardinality estimator, then uses an on-policy reinforcement learning process to learn from real execution latencies. The method can be seen as an improvement over Neo. With a few novel techniques such as diversified experiences and multi-agent training, Balsa explores distinct plans and, combining these techniques, generates query plans that perform better than expert-built optimizers and state-of-the-art learned methods such as Bao. | SIGMOD 2022 | Learned Query Optimization, Machine Learning for Systems |
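The BERT entry above describes the masked-token training objective. The following is a minimal Python sketch of that idea, assuming a toy whitespace tokenizer and a plain `[MASK]` symbol; it only illustrates how roughly 15% of tokens are hidden and recorded as prediction targets, not the full BERT recipe (which also mixes in random and unchanged replacements among the selected tokens).

```python
import random

MASK = "[MASK]"
MASK_PROB = 0.15  # fraction of tokens hidden per sequence, as in the BERT entry


def mask_tokens(tokens, mask_prob=MASK_PROB, seed=None):
    """Randomly mask a fraction of tokens; return the corrupted sequence
    and the position -> original-token targets the model must predict."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model is trained to recover this token
            corrupted.append(MASK)
        else:
            corrupted.append(tok)
    return corrupted, targets


# Toy usage: a (hypothetical) training loop would feed `corrupted` to the
# bidirectional encoder and compute the loss only on positions in `targets`.
tokens = "the query optimizer picks a join order".split()
corrupted, targets = mask_tokens(tokens, seed=0)
print(corrupted)
print(targets)
```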
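The Bao entry frames optimizer-hint selection as a multi-armed bandit solved with Thompson Sampling. The sketch below shows the bandit mechanics only, using a Beta-Bernoulli reward model over a hypothetical list of hint sets; Bao itself learns a tree-convolution value model over plan trees rather than per-arm Beta posteriors, so this is an illustration of the general technique, not Bao's implementation.

```python
import random

# Hypothetical hint sets an optimizer might toggle (illustrative only).
HINT_SETS = ["default", "disable_nestloop", "disable_hashjoin", "disable_mergejoin"]


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling: each arm keeps a Beta(successes + 1,
    failures + 1) posterior over the probability that it beats the default plan."""

    def __init__(self, arms):
        self.stats = {arm: [1, 1] for arm in arms}  # [alpha, beta] per arm

    def choose(self):
        # Sample a plausible win rate for each arm from its posterior,
        # then pick the arm with the highest sample (explores and exploits).
        samples = {a: random.betavariate(*ab) for a, ab in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        # reward = 1 if the chosen hint set produced a faster plan, else 0.
        if reward:
            self.stats[arm][0] += 1
        else:
            self.stats[arm][1] += 1


# Toy loop: a real system would execute the plan under the chosen hints and
# compare its latency to the baseline; here the outcome is a placeholder coin flip.
sampler = ThompsonSampler(HINT_SETS)
for _ in range(100):
    arm = sampler.choose()
    reward = random.random() < 0.5  # placeholder for a real latency comparison
    sampler.update(arm, reward)
print(sampler.stats)
```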