Load testing an Elasticsearch cluster before a migration, upgrade, or similar change is a good way to reduce bad surprises afterwards. At pludoni GmbH, we have used Elasticsearch since 2014 as the job search backend for all of our community websites (Empfehlungsbund), as well as for blog article search and keyword optimization.
Recently, I prepared such a migration and needed a way to verify that the new cluster holds up after the switch or, better yet, improves performance relative to its cost. After checking GitHub et al. for existing code, I was not satisfied with what I found and built a small script around the awesome siege tool.
Preparation - Gather good queries for test
To get a realistic performance test, I suggest grabbing the original payloads of queries that your production ES runs. Going one step further, I only took a handful of the slowest queries, so that a passing test gives me more confidence in the end.
To do that, first enable the Slow Log in the ES settings, either in your UI (cerebro, kopf, or whatever frontend you use) or via curl:
curl 'http://localhost:9200/index_settings/update' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json;charset=utf-8' --data \
'{"index":"CLUSTERNAME","settings":{"index.search.slowlog.threshold.query.warn":"1s","index.search.slowlog.threshold.query.info":"0.5s"},"host":"http://localhost:9200"}'
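The same thresholds can also be set through Elasticsearch's own index settings endpoint (PUT <index>/_settings). Here is a minimal Ruby sketch of that call. The environment variable names ES_URL and ES_INDEX are made up for this example, and 500ms is just another spelling of the 0.5s used above.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build a "PUT /<index>/_settings" request carrying the slow log thresholds.
# es_url and index are placeholders; adjust them to your cluster.
def slowlog_request(es_url, index, warn: '1s', info: '500ms')
  uri = URI.join(es_url, "/#{index}/_settings")
  req = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  req.body = JSON.generate(
    'index.search.slowlog.threshold.query.warn' => warn,
    'index.search.slowlog.threshold.query.info' => info
  )
  [uri, req]
end

# Only talk to a cluster when ES_URL is set (hypothetical variable name).
if __FILE__ == $PROGRAM_NAME && ENV['ES_URL']
  uri, req = slowlog_request(ENV['ES_URL'], ENV.fetch('ES_INDEX', 'ebsearch_production'))
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req)
  end
  puts "#{res.code} #{res.body}"
end
```

Run it as e.g. `ES_URL=http://localhost:9200 ruby slowlog.rb`; the file name is arbitrary.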
Then wait a while, or provoke slow log entries by running queries yourself. The file /var/log/elasticsearch/*_index_search_slowlog.log will fill up over time.
Extract the query payloads by copying all the JSON between the brackets of source[...]:
[2019-12-11 04:30:29,725][INFO ][index.search.slowlog.query] [es01.localhost] [ebsearch_production][2] took[1.7s],
took_millis[1774], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[<<COPY ALL BETWEEN BRACKETS>>], extra_source[],
- Put each payload into its own file in a common folder, e.g. payloads/1, payloads/2, etc.
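Copying by hand gets tedious with more than a few entries. The following is a rough sketch of how the extraction could be scripted. It assumes each slow log entry sits on a single line in the format shown above; the helper names (extract_payload, write_payloads) are my own and not part of any tool.

```ruby
require 'fileutils'
require 'json'

# Return the JSON inside "source[...]" of one slow log line, or nil.
# Assumes the "..., source[<json>], extra_source[..." layout shown above.
def extract_payload(line)
  match = line.match(/, source\[(.*)\], extra_source\[/)
  match && match[1]
end

def valid_json?(text)
  JSON.parse(text)
  true
rescue JSON::ParserError
  false
end

# Write every distinct, parseable payload to <dir>/1, <dir>/2, ...
# Returns the number of files written.
def write_payloads(log_lines, dir: 'payloads')
  FileUtils.mkdir_p(dir)
  payloads = log_lines.map { |line| extract_payload(line) }.compact.uniq
  payloads = payloads.select { |payload| valid_json?(payload) }
  payloads.each_with_index do |payload, i|
    File.write(File.join(dir, (i + 1).to_s), payload)
  end
  payloads.size
end

# Usage: ruby extract_payloads.rb /path/to/slowlog.log
if __FILE__ == $PROGRAM_NAME && ARGV.any?
  count = write_payloads(File.readlines(ARGV[0]))
  puts "wrote #{count} payloads"
end
```

Duplicates are dropped on purpose so that one frequent slow query does not dominate the test set; remove the uniq call if you prefer the production mix.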
Build new test cluster
One hint, if you are not doing this yet: use a snapshot repository (e.g. on S3) and snapshots to quickly seed a new cluster with production-grade data.
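As a rough illustration of that hint, the sketch below builds the two REST calls involved: registering an S3 repository on the new cluster and restoring a snapshot from it. The repository, bucket, and snapshot names are placeholders, and the s3 repository type requires the matching S3 plugin on the cluster.

```ruby
require 'json'
require 'net/http'
require 'uri'

JSON_HEADER = { 'Content-Type' => 'application/json' }.freeze

# PUT /_snapshot/<repo>: register an S3 bucket as snapshot repository.
def register_repo_request(es_url, repo, bucket)
  req = Net::HTTP::Put.new(URI.join(es_url, "/_snapshot/#{repo}"), JSON_HEADER)
  req.body = JSON.generate(type: 's3', settings: { bucket: bucket })
  req
end

# POST /_snapshot/<repo>/<snapshot>/_restore: restore the given indices.
def restore_request(es_url, repo, snapshot, indices)
  uri = URI.join(es_url, "/_snapshot/#{repo}/#{snapshot}/_restore")
  req = Net::HTTP::Post.new(uri, JSON_HEADER)
  req.body = JSON.generate(indices: indices, include_global_state: false)
  req
end

def send_request(req)
  uri = req.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req)
  end
end

# Only runs against a cluster when ES_URL is set (hypothetical variable name).
if __FILE__ == $PROGRAM_NAME && ENV['ES_URL']
  es = ENV['ES_URL']
  puts send_request(register_repo_request(es, 'my_backups', 'my-bucket')).body
  puts send_request(restore_request(es, 'my_backups', 'snapshot_1', 'ebsearch_production')).body
end
```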
Test the (old/new) cluster
The test is run by the battle-tested tool Siege, which should be easy to install from your OS package repository (Apt, Brew, etc.). Siege accepts a file with the URLs to test as an input parameter. Later, we will use Siege like this:
# Concurrency: 3, for 1 minute
siege -b --log=./siege.log -H 'Content-Type: application/json' --internet --delay=15 -c 3 -t 1M --file=urls.txt
The urls.txt has the format:
http://server/index/_search POST {".....search payload"}
To generate urls.txt easily from all of the payloads, I’ve created a Rakefile, because Ruby is awesome. Also, we are living in 2019(+), so the Ruby that ships with your distro should be just fine; no rvm/rbenv needed.
SERVER = '10.10.10.100:9200'.freeze
DURATION = '1M'.freeze # 1 minute each test
CONCURRENCY_TESTS = (1..10) # or [1, 5, 10, 20, 100] etc.
INDEX_NAME = 'ebsearch_production'.freeze

desc 'create urls.txt file with all payloads in payloads/*'
task :urls do
  # One line per payload: "<url> POST <json>". Strip trailing newlines so
  # every request stays on a single line.
  out = Dir['payloads/*'].map do |pl|
    "http://#{SERVER}/#{INDEX_NAME}/_search POST #{File.read(pl).strip}"
  end
  puts "recreating urls.txt for #{SERVER} with #{out.count} requests"
  File.write('urls.txt', out.join("\n"))
end

desc 'run series!'
task :run do
  # rm_f does not fail when there is no log from a previous run
  rm_f 'siege.log'
  CONCURRENCY_TESTS.each do |c|
    puts "==== #{c} Concurrent ==== "
    sh %[siege -b -m "#{SERVER}-C#{c}" --log=./siege.log -H 'Content-Type: application/json' --internet --delay=15 -c #{c} -t #{DURATION} --file=urls.txt]
  end
end

desc 'show csv as tsv for copy paste into google spreadsheets'
task :csv do
  lines = File.read('siege.log').lines
  # Drop the "****" mark lines, turn the comma-separated columns into tabs
  # and use decimal commas (for spreadsheets with a German locale).
  csv = lines.reject { |i| i.include?("****") }.map do |line|
    line.split(',').map(&:strip).join("\t").tr('.', ',')
  end
  puts csv.join("\n")
end

task default: [:urls, :run]
- Modify the params in the header of the file
- Run it!
rake
- After it has finished (it takes the number of concurrency levels in CONCURRENCY_TESTS times DURATION), you can output the data:
rake csv
and copy the output into e.g. Google Spreadsheets to easily generate charts.
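Since a full series takes a while, it can be worth checking urls.txt before starting it. A payload that spans several lines, for example, breaks the one-request-per-line format. Here is a small sketch of such a check; check_urls_file is a name I made up for it.

```ruby
require 'json'

# Return a list of problems found in the text of a Siege URL file whose
# lines are expected to look like "<url> POST <json payload>".
def check_urls_file(text)
  problems = []
  text.each_line.with_index(1) do |line, no|
    url, verb, payload = line.chomp.split(' ', 3)
    if url.nil? || verb != 'POST' || payload.nil?
      problems << "line #{no}: expected '<url> POST <json>'"
      next
    end
    unless url.start_with?('http://', 'https://')
      problems << "line #{no}: URL should start with http:// or https://"
    end
    begin
      JSON.parse(payload)
    rescue JSON::ParserError
      problems << "line #{no}: payload is not valid JSON"
    end
  end
  problems
end

if __FILE__ == $PROGRAM_NAME && File.exist?('urls.txt')
  problems = check_urls_file(File.read('urls.txt'))
  puts problems.empty? ? 'urls.txt looks fine' : problems
end
```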
Bonus points: quick chart with ascii_charts
Install Bundler, if not done yet: gem install bundler
Append to Rakefile:
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'ascii_charts'
end

desc 'chart'
task :chart do
  require 'ascii_charts'
  # column names for the fields of a siege.log line, in order
  columns = %w[date transactions duration transfer response_time requests_s mbs conc success failed]
  lines = File.read('siege.log').lines
  # skip the "****" mark lines and the column header
  csv = lines.reject { |i| i.include?("****") || i.include?('Elap Time') }
  data = csv.map { |i| columns.zip(i.split(',').map(&:strip)).to_h }
  puts "======= Response Time / Concurrency ========"
  items = data.map { |d| [d['conc'].to_f.round, d['response_time'].to_f] }
  puts AsciiCharts::Cartesian.new(items).draw
  puts "======= Requests/s / Concurrency ========"
  items = data.map { |d| [d['conc'].to_f.round, d['requests_s'].to_f] }
  puts AsciiCharts::Cartesian.new(items).draw
end
Bonus: results for our smallish cluster
Our search uses custom search plugins that are quite CPU intensive, especially with long queries. Overall, we do not have that many concurrent users, so a 3-4 node cluster is generally enough.
The deployment target is the very cost-efficient Hetzner Cloud (HCloud). Here is a quick overview of the cloud instance types Hetzner offers at this point (2019):
- CX11 (1 Core, 2GB, 3 EUR)
- CX21 (2 Core, 4GB, 6 EUR)
- CX31 (2 Core, 8GB, 11 EUR) (not included in the test, because it brings no CPU improvement and the RAM is not utilized)
- CX41 (4 Core, 16GB, 19 EUR)
- CX51 (8 Core, 32GB, 36 EUR)
I’ve tried the following combinations in the Hetzner Cloud. If the rq/s look ridiculously low, please keep in mind that those 10 concurrent users are only searching with the worst queries that I could find.
nodes | EUR/month | rq/s @ 10 ccu | response time (s) @ 10 ccu | rq/s per EUR |
---|---|---|---|---|
1x Coordinator (CX11) + 2x Data CX21 | 15 EUR | 5.90 | 1.67 | 0.39 |
1x Coordinator (CX11) + 3x Data CX21 | 21 EUR | 4.84 | 2.03 | 0.23 |
1x Coordinator (CX11) + 2x Data CX41 | 41 EUR | 11.24 | 0.80 | 0.27 |
1x Coordinator (CX11) + 3x Data CX41 | 60 EUR | 17.19 | 0.58 | 0.28 |
1x Coordinator (CX11) + 4x Data CX41 | 79 EUR | 16.56 | 0.54 | 0.20 |
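For transparency, this is how the cost and the last column can be derived from the instance prices listed above. It is a small sketch, not the script I used for the article; the values in the table appear to be cut off after two decimals rather than rounded, so the sketch does the same with floor(2).

```ruby
# Monthly prices in EUR, taken from the instance list above.
PRICES = { 'CX11' => 3, 'CX21' => 6, 'CX41' => 19 }.freeze

# One CX11 coordinator plus N data nodes; rps = rq/s at 10 concurrent users.
RUNS = [
  { data_type: 'CX21', data_nodes: 2, rps: 5.90 },
  { data_type: 'CX21', data_nodes: 3, rps: 4.84 },
  { data_type: 'CX41', data_nodes: 2, rps: 11.24 },
  { data_type: 'CX41', data_nodes: 3, rps: 17.19 },
  { data_type: 'CX41', data_nodes: 4, rps: 16.56 }
].freeze

def monthly_cost(run, coordinator: 'CX11')
  PRICES.fetch(coordinator) + run[:data_nodes] * PRICES.fetch(run[:data_type])
end

# Requests per second bought per EUR and month, cut off after two decimals.
def rps_per_eur(run)
  (run[:rps] / monthly_cost(run)).floor(2)
end

RUNS.each do |run|
  puts format('%dx %s: %d EUR/month, %.2f rq/s per EUR',
              run[:data_nodes], run[:data_type], monthly_cost(run), rps_per_eur(run))
end
```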
My findings:
- Even the smallest instance seemed to be fine for the Coordinator node; its CPU usage never reached a noteworthy level.
- Going from 2 CX21 to 3 CX21 did not improve the core metrics (rq/s, response time) but made them worse. My conclusion is that the CX21 simply has too little CPU power.
- Likewise, going to 4 CX41 seems to be worse than 3 CX41: throughput dropped slightly (17.19 to 16.56 rq/s) for 19 EUR more per month.
- 2 or 3 CX41 offer the best performance for the price.
- CX51 untested
- PLEASE NOTE: These findings could be entirely specific to our type of querying, which includes a custom search algorithm written in Groovy. Also, I am not an Elasticsearch expert, so there might be tuning parameters or sharding settings that could be adjusted.