Estimating Query Timings in Elasticsearch
DOI:
https://doi.org/10.14738/tnc.92.9887
Keywords:
Elasticsearch, Elasticsearch Query, Query Cost Model, Document Frequency, Total Term Frequency, Term Vectors.
Abstract
In a shared Elasticsearch environment, it can be useful to know how long a particular query will take to execute. This information can be used to enforce rate limiting or to distribute requests equitably among multiple clusters. Elasticsearch uses multiple Lucene instances on multiple hosts as its underlying search engine implementation, but this abstraction makes it difficult to predict execution time with previously known predictors such as the number of postings. This research investigates how accurately different pre-retrieval statistics, available through Elasticsearch, predict the execution time of queries on a typical Elasticsearch cluster. The number of terms in a query and the Total Term Frequency (TTF) reported by Elasticsearch's API are found to significantly predict execution time. Regression models are then built and compared to find the most accurate method for predicting query time.
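As a rough illustration of the idea, the sketch below derives the two predictors named above (number of query terms and summed TTF) from a dictionary shaped like a Term Vectors API response with term statistics enabled, then fits a plain least-squares model to them. It is a minimal sketch under stated assumptions, not the models built in this research: ordinary least squares is swapped in for simplicity, the function names `query_features`, `fit_ols`, and `predict` are hypothetical, and the timings are made-up numbers rather than measurements.

```python
from typing import List, Sequence, Tuple


def query_features(response: dict, field: str,
                   query_terms: Sequence[str]) -> Tuple[int, int]:
    """Return (number of query terms, summed TTF) for one query.

    `response` is assumed to look like a term-vectors response that
    includes term statistics:
    {"term_vectors": {field: {"terms": {term: {"ttf": ...}}}}}
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    # Assumption of this sketch: terms missing from the response count as 0.
    total_ttf = sum(terms.get(t, {}).get("ttf", 0) for t in query_terms)
    return len(query_terms), total_ttf


def fit_ols(rows: Sequence[Tuple[int, int]],
            times: Sequence[float]) -> List[float]:
    """Fit time ~ b0 + b1 * n_terms + b2 * total_ttf by least squares."""
    X = [[1.0, float(n), float(ttf)] for n, ttf in rows]
    k = 3
    # Augmented normal equations [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, times))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs: Sequence[float], n_terms: int, total_ttf: int) -> float:
    """Predicted execution time for a query with the given features."""
    return coefs[0] + coefs[1] * n_terms + coefs[2] * total_ttf


if __name__ == "__main__":
    # Made-up training data: (n_terms, total_ttf) -> time in milliseconds.
    rows = [(1, 1000), (2, 5000), (3, 2000), (4, 8000), (2, 1000)]
    times = [8.0, 14.0, 13.0, 21.0, 10.0]
    coefs = fit_ols(rows, times)
    print(predict(coefs, 3, 4000))
```

A predicted time of this kind could then be compared against a per-tenant budget before a request is admitted or routed to a cluster. A robust alternative to least squares may be preferable when the timing data contain outliers.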