Recently, amidst an infrastructure upgrade of an old (kinda ancient, really) Elasticsearch setup, we posed the question: “How would this setup fare in an AWS-based system?”
AWS advertises its Elasticsearch service as “fully managed, scalable, and reliable”, and we had long wanted to play with it and explore its potential benefits, especially its pay-for-what-you-use model.
Having gotten permission from our client to do this research on their behalf as a prospective alternative to their current setup, we opened the AWS ES homepage on our browsers and went on an adventure.
Note: this post aims to show how a service like AWS ES could potentially fit our way of working, specifically within a particular project. There will be no step-by-step instructions to reproduce any single aspect of the various configurations we experimented with; if that’s what you’re looking for, you won’t find it here. In here, there be monsters!
AWS already gets some stuff out of the way, such as hardware and software installation, monitoring, backups, and a bunch of other tasks not directly related to the ES service itself. That’s a definite plus when you want to focus on developing your app or platform or whatever and don’t want to spend days “getting around to it”. So, with that in mind, we set out to answer three key questions:

1. How easy is it to spin up a fully configured instance from scratch?
2. Can it handle our indexing and search workload?
3. How does it compare to our current setup, in performance and in cost?
Our basis of comparison was the current setup: a cluster of three ES VMs at DigitalOcean, one of which is elected master, each node holding 2 primary shards and 2 replica shards, totalling 12 active shards.
The web server was running on last decade’s Ubuntu, most user-facing scripts were powered by PHP 5, and the ES service was built on ES 1.5 running on an unmentionable version of Java.
It all worked quite well, in fact: pages and search results were served near-instantaneously and reliably. But most of that infrastructure was nearing its end of life, so updates to the code were needed to keep it sustainable for the foreseeable future. That was the main goal we were tasked with, and we got it out of the way with only minor hiccups; we’re talking deprecated methods, new APIs, some refactoring… The usual for this kind of task.
During the code upgrades, we actually used a Searchly instance for remote testing, for no particular reason other than we already knew it was simple to set up. Once our ES code was up and running, we started an AWS ES instance, added the proper access keys, and that was it. During the first few iterations, we stuck to AWS’s free tier, just until we found our footing amongst the myriad of control panels and services that AWS has to offer.
And look at that, we’ve already answered our first question, “How easy is it to spin up a fully configured instance from scratch?”. It really was that easy, our code worked on AWS just as well as it worked on localhost and on Searchly.
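In practice, that portability came down to the client code needing nothing more than a different endpoint (plus credentials). A minimal sketch of the idea, assuming a hypothetical `ES_ENDPOINT` environment variable; the names here are illustrative, not our actual configuration code:

```python
import os

def es_endpoint() -> str:
    """Resolve the Elasticsearch endpoint from the environment,
    falling back to a local development instance."""
    return os.environ.get("ES_ENDPOINT", "http://localhost:9200")

# The same client code then targets localhost, Searchly, or AWS ES
# depending only on configuration, e.g.:
#   ES_ENDPOINT=https://search-mydomain.eu-west-1.es.amazonaws.com
print(es_endpoint())
```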
The next question we sought to answer was a challenge. The peculiarity of this project was that we were indexing hundreds of thousands of documents, totalling a few GB in size, in order to perform fast searches across them all. Besides the documents’ content, there were some extra fields to process, such as tagging, and while they didn’t carry any complex functionality (yet), they were still extra fields to index and search through.
The search function was accessible to end users through a portal with millions of monthly visits; it could not be anything but blazing fast.
These documents could be updated several times a day, and new ones would be added as well, so the indexing process also needed to be performant enough not to block any user-facing functionality.
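We won’t reproduce our indexing pipeline here, but the general shape is standard: Elasticsearch’s `_bulk` API takes newline-delimited JSON, pairing an action line with each document. A sketch of batching documents into bulk request bodies (the index name, batch size, and document shape are illustrative assumptions):

```python
import json
from typing import Iterable, Iterator

def bulk_bodies(docs: Iterable[dict], index: str, batch_size: int = 500) -> Iterator[str]:
    """Yield newline-delimited JSON bodies for Elasticsearch's _bulk API,
    pairing each document with its 'index' action line."""
    batch: list[str] = []
    for doc in docs:
        batch.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        batch.append(json.dumps(doc))
        if len(batch) >= 2 * batch_size:
            yield "\n".join(batch) + "\n"  # bulk bodies must end with a newline
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"

# Each yielded body would be POSTed to the cluster's /_bulk endpoint;
# keeping batches modest lets indexing proceed without starving searches.
```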
In short, processing power was our number one parameter to consider. Well, the AWS ES free tier does not handle something like that. At all. It gives you a single vCPU with 1 GiB of memory, and its 750 processing hours per month are not even enough for that level of regular indexing, let alone near-constant searches.
So we dove into the many on-demand plans AWS has available. Each time we upgraded the setup we tried different configurations; luckily, AWS has many (useful!) knobs and switches to play with: number of nodes, number of replicas, dedicated master nodes, allocated space, plus quite a few extras. It’s more than enough to fit the needs of the vast majority of use cases. Each iteration in this process brought a visible performance improvement, so we were confident that, with some effort in choosing the right pricing plan together with the right ES configuration, we could reach capabilities comparable to the setup currently in production.
As with pretty much any indexing system, we needed an admin entry point with write access to index the documents, and a read-only public entry point for end users to fetch search results. That in itself is not too complicated, but we also wanted to keep the built-in Kibana, as it’s extremely useful for managing and debugging ES instances.
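The write/read split we were after looks roughly like this as a resource-based access policy on the domain. The account ID, role name, region, domain name, and IP range below are all placeholders, and the setup we actually wrestled with also involved IAM roles and per-service settings beyond this one document:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/indexer"},
      "Action": ["es:ESHttpPost", "es:ESHttpPut"],
      "Resource": "arn:aws:es:eu-west-1:123456789012:domain/my-domain/*"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "*"},
      "Action": "es:ESHttpGet",
      "Resource": "arn:aws:es:eu-west-1:123456789012:domain/my-domain/*",
      "Condition": {"IpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}}
    }
  ]
}
```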
Between IP whitelisting, user groups and roles, permissions, and quite a few settings spread out across just as many control panels, setting that up proved to be much more troublesome than advertised. Perhaps it would have been more straightforward for those more familiar with AWS’s IAM role management, but at this point we had spent so much effort on this sub-task alone that, even though it wasn’t directly related to the ES optimization work, we felt it was equally important to share, especially since it made us rethink our conclusion on the first question: how easy it is to configure.
But back to the main point and onto the most important question of all…
Let’s put it this way: we went up to 7 nodes of t2.medium instances (2 vCPUs and 4 GiB each), with the index configured with 4 primary shards and 2 replicas, totalling 12 active shards (each shard guaranteed its own thread, excluding extra fail-safe threads). This setup was comparable to what was in production, and when added together with an EC2 instance and a static IP, its cost was also comparable to that of the production server. We had reached our cost ceiling.
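The shard arithmetic above is simply primaries times copies; a trivial sketch:

```python
def total_shards(primaries: int, replicas: int) -> int:
    """Total active shards: each primary plus its replica copies."""
    return primaries * (1 + replicas)

# 4 primary shards with 2 replicas each, as in our final AWS setup:
print(total_shards(4, 2))  # -> 12
```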
Its responsiveness, however, was not comparable; in fact, it was still far below it. The worst-case 250 ms total request times in production could still take up to a minute on our AWS ES instance under stress.
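To make that kind of gap visible, we looked at latency distributions rather than single requests. A sketch of the sort of summary we used, with purely illustrative sample numbers (this is not our actual benchmark code):

```python
import statistics

def latency_report(samples_ms: list[float]) -> dict:
    """Summarise request latencies: median, 95th percentile, and worst case."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],  # 95th percentile cut point
        "worst_ms": max(samples_ms),
    }

# Illustrative only: production stayed under 250 ms in the worst case,
# while the AWS setup's tail stretched toward tens of seconds under stress.
prod = latency_report([120, 140, 150, 180, 200, 220, 250] * 10)
print(prod["worst_ms"])
```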
We tried every configuration tweak and code optimization we could think of, we fiddled with every setting there was in the control panels, we squeezed every last drop of performance out of our code, we scoured the internet for best practices and tips, we sacrificed a goat to almighty Ra, all to no avail. We could only conclude that the way to increase responsiveness would be to upgrade the pricing plan and throw more hardware at it.
We found AWS ES to be a highly configurable tool. It seems especially useful if you have a specific need for something else in the AWS infrastructure, with which it integrates almost seamlessly.
However, that extra infrastructure behind AWS ES comes at a higher cost, both in money and in processing demands, which translates to apparently lower performance (compared to our reference point, at least).
In the end, we could not recommend AWS ES to our client. We ended up spinning up new, updated instances at DigitalOcean with the same base hardware and configuration, and things ran exactly as expected at the exact same cost as before. This was a good learning exercise, though. As newcomers to the world of ES, applied in such a particular way, we learned a lot about best practices in its configuration and optimization. We put AWS ES to the test, and even though it didn’t fit this particular project, we can definitely see ourselves experimenting with it again in the future.
As an added bonus that we were not looking for, we discovered that DigitalOcean is also a service we will keep in mind, should the need arise in upcoming projects.