The Dev world is waking up to the idea of microservices architecture and the benefits it brings; this short article is a cautionary tale from the coal face.
I am part of a DevOps team that develops several applications following the now widely recognised microservices pattern: multiple services running on a single server, then scaled out across multiple data centres. Some of our apps run within Docker containers; some do not. Those running outside of containers are all co-located on the same servers, and most of these non-container apps are Java-based implementations.
Our aim is to utilise server resources as far as we consider safely possible, running as many services on a single server as keeps us within a safe but cost-effective boundary of CPU, memory and disk-space usage.
The overall up-time of our applications has been great. Recently, however, we noticed a distinct slowdown in some of our services that was impacting user performance, along with very heavy CPU load on some of our servers. We weren't sure whether we had a bug or were simply running the servers too hot during peak demand. All services on the box were competing for the same limited CPU resource, and each of them ended up crawling along.
After much investigation, we tracked the issue down to a bug in one of the services (not the one we were expecting). The bug caused a JVM bloom, which greedily used up as much CPU as it could get its hands on!
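For anyone chasing a similar problem, the usual way to pin down a greedy JVM is to drill from process to thread to stack trace. A sketch of the approach (the PID 4242 here is purely illustrative):

```shell
# 1. Find which process is eating the CPU (we assume PID 4242 is our Java service).
top -b -n 1 -o %CPU | head -n 15

# 2. List that process's threads by CPU usage and note the hottest thread ID (TID).
top -H -b -n 1 -p 4242 | head -n 15

# 3. jstack reports thread IDs in hex ("nid"), so convert the decimal TID...
printf 'nid=0x%x\n' 4242

# 4. ...then find that nid in a thread dump to see what the thread is actually doing.
jstack 4242 | grep -A 20 'nid=0x1092'
```

This only tells you *which* code is hot, of course; whether that is a bug or legitimate load is still a judgement call.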
Fixing this bug highlighted one of the perils of the microservices architecture: all services must learn to play nicely with each other and share the limited resources they have. Although the JVM is a wondrous machine, it's very difficult to stop it from being greedy when things go wrong. It can gobble up both CPU and memory if it really wants to! Development and testing should have the target state in mind, i.e. the resources available on the operational servers, and aim to keep each app's resource footprint to a minimum.
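One practical mitigation is to launch each JVM with explicit caps rather than trusting defaults, which are sized for a machine the service has to itself. A minimal sketch of such a launch command, where the sizes and the app.jar name are placeholders to be tuned per service:

```shell
# -Xms/-Xmx                 fix the heap so it cannot balloon under load
# -XX:MaxMetaspaceSize      bound class-metadata growth as well
# -XX:ActiveProcessorCount  (JDK 10+) how many CPUs the JVM should assume it has
# -XX:ParallelGCThreads     stop GC work fanning out across every core on the box
java -Xms256m -Xmx512m \
     -XX:MaxMetaspaceSize=128m \
     -XX:ActiveProcessorCount=2 \
     -XX:ParallelGCThreads=2 \
     -jar app.jar
```

Note that a heap cap does not cap CPU: a bug can still spin the cores it is given, so flags like these limit the blast radius rather than prevent the bloom.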
As mentioned previously, we also have applications backed by services running in Docker containers. We haven't witnessed resource blooms like this in those services. However, we have tried to pack our servers too tightly on more than one occasion and had to back off our bin-packing policy.
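This is one of the things containers buy you: the kernel, not good behaviour, enforces the limits. A hedged example of hard-capping a container at run time (the image name and limit values are illustrative, not ours):

```shell
# Cap the container at 1.5 CPUs and 512 MB of memory, with no extra swap,
# so a bloom in this service cannot starve its neighbours on the same host.
docker run -d \
  --name orders-service \
  --cpus="1.5" \
  --memory="512m" \
  --memory-swap="512m" \
  my-registry/orders-service:latest
```

With limits like these in place, a misbehaving service gets throttled or OOM-killed on its own rather than dragging the whole box down.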
Stress testing can help to highlight issues, but with this type of architecture it can be very difficult to diagnose whether a slowdown is the result of a single greedy service or simply a lack of resources to cope with demand.
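A crude first check we find useful during an incident: if CPU is dominated by one process, suspect a greedy service; if it is spread roughly evenly while the load average is high, suspect plain over-subscription. On a Linux server:

```shell
# Top CPU consumers: one outlier suggests a greedy service,
# an even spread suggests the box is simply over-packed.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 10

# Compare the load average against the core count for context.
uptime
```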
Let us know in the comments if you’ve had a similar experience and how you uncovered the cause!