On Monday 12th June, a couple of us from the Naimuri Cloud team headed down to London to attend HashiDays. For those who aren't familiar, HashiDays is a conference run by HashiCorp – the company that develops the awesome open-source DevOps tools we've become very excited about! At the conference they gave updates on their products and showcased some of the features and benefits that are perhaps lesser known. They also invited a few guest speakers to talk about how they use these products in real environments and how they've benefited from them.
The event itself, held at 'The Brewery', was extremely slick, from the registration process and breakfast to the top-grade sound and lighting. I have never been to one of the Apple developer conferences, but this is how I imagine it would be. The food and drink throughout the day were also delicious, so we would have come away fairly satisfied even if the content of the talks had not been up to scratch.
But they were.
Intro
Kicking off the day was the main man – founder and CTO – Mitchell Hashimoto. Mitchell started by stating that one of HashiCorp's main aims is to have products that enable you to "run on any platform, cloud and non-cloud". He then continued with a summary review of each product – Consul, Vault, Nomad, Terraform and Packer. One of our friends at the event later described it as "similar to a changelog – but a lot less dull!" Talks later in the day would expand on some of these points, but for now we learnt that:
- Consul has a new autopilot feature that helps manage Consul servers, including the automatic cleanup of dead servers from the cluster.
- Vault has a new replication feature (Enterprise only) that allows it to be run across multiple datacentres.
- Nomad was introduced as being different from other HashiCorp products in terms of its adoption. Mitchell said that HashiCorp products are typically picked up by large numbers of small companies and users, hence there is a lot written about them on the Internet. Nomad is the opposite: it is used by a small number of (very) large companies, which hints at its stability and scalability.
- Terraform development is focusing on operational safety. A big architectural change is upcoming whereby the core code will be split from the providers. This will allow individual providers to change more rapidly without having to wait for a new release of Terraform; the core Terraform binary will download provider-specific binaries as required. It also allows versions of each provider to be specified in addition to the version of the core (see the sketch after this list).
- Packer is 5 years old and has reached the 1.0 release!
- Little or no mention of Vagrant.
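To illustrate the Terraform point: once providers ship separately, you pin a provider version alongside the core version. A minimal sketch, assuming the AWS provider and with illustrative version numbers (the syntax has since evolved into a required_providers block):

```hcl
# Sketch: constraining Terraform core and a provider independently.
# Version numbers are illustrative; the provider binary is downloaded
# by `terraform init` rather than shipping inside the core release.
terraform {
  required_version = ">= 0.10"
}

provider "aws" {
  version = "~> 1.0" # provider releases now move independently of core
  region  = "eu-west-1"
}
```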
Vault
Next up on stage was Jeff Mitchell – lead developer for the Vault project. We'd previously only thought of Vault as a key/value store for secure assets, and it seems we weren't the only ones to make that assumption. Jeff described Vault as a Swiss Army knife of security functions – "jack of all trades, master of some" – with lots of features you may not know about, including using Vault to provide TOTP and key management as a service. He described a use case of a web application with a database backend that used Vault features NOT including the key/value store.
User connections to the web app would be encrypted via mutual TLS, using Vault as the Certificate Authority (CA) to manage certificates between client and web server. In turn, the web server would use Vault to encrypt the data it wrote to the database server itself. The main theme of the talk was that Vault could and should be deployed into each datacentre in order to provide consistent APIs for any secure interactions.
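As a rough sketch of the building blocks (ours, not Jeff's slides): the CA role is Vault's PKI secrets engine, and the "encrypt before writing to the database" piece maps naturally onto the transit engine. Both can be enabled through the Terraform Vault provider; the mount paths here are arbitrary:

```hcl
# Sketch only: enabling Vault's PKI secrets engine (to act as the
# internal CA) and the transit engine (encryption as a service)
# via the Terraform Vault provider. Mount paths are arbitrary.
resource "vault_mount" "pki" {
  path = "pki"
  type = "pki"
}

resource "vault_mount" "transit" {
  path = "transit"
  type = "transit"
}
```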
Terraform / Nomad Demo
Following Jeff we met Paddy Foran – Software Engineer at HashiCorp – who performed a live demo using Terraform to create a Nomad cluster across multiple cloud providers.
Nomad is the HashiCorp product that runs your applications. It does this by forming a cluster of compute resources and then scheduling jobs (applications) to run on them. These can either be long-lived jobs, such as microservices, or jobs that run and then exit, such as data-processing tasks.
I was excited when I saw the diagram showing the nodes running across AWS (Amazon Web Services) and GCP (Google Cloud Platform). Not only did it have beautiful symmetry (!) but I could imagine the potential scaling and resilience possibilities that this architecture would provide. Paddy started by running the code and then spent the time while it ran describing the details of how it worked. One of the most impressive parts was the simple way in which an IPsec VPN was created between the two cloud providers. IPsec is typically a tricky thing to configure, but Paddy showed it to be trivial using Terraform. He didn't have long to describe it though: within a few minutes the run was complete and he had built a compute cluster of 18 worker nodes ready to accept instructions for what to do next. A few Redis Docker containers were kicked off and the demo was complete. Great job!
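To give a flavour of what the cluster was then asked to run, a Nomad job for a few Redis containers looks something like the sketch below; the job name, count and resource figures are our own inventions rather than Paddy's actual demo code:

```hcl
# Minimal sketch of a Nomad job that runs Redis in Docker.
# Name, count and resource sizes are illustrative only.
job "redis-demo" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    count = 3

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```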
Consul at Scale
A coffee break ensued and a double mocha was quickly dispatched (helpful after a 5am start to the day) before we were introduced to James Phillips – lead engineer for Consul – described by compere Seth Vargo as "the smartest guy he knew". James gave some insight into running Consul at scale, covering two scenarios. The first was a single Consul cluster running lots of nodes. James explained how Consul uses the SWIM protocol to provide a reliable 'distributed failure detector' for determining failed nodes. It combines information from multiple nodes, over both TCP and UDP, to reach reliable decisions and avoid incorrectly declaring a node dead because of network splits or other factors. Consul uses a gossip protocol to communicate this information, but the gossip network load is capped: apparently, even with a very high number of nodes and a lot of churn, it stays no higher than around 300KB/s.
The second scenario was running lots of clusters across multiple datacentres. We learnt that in this case the only communication between clusters is via server nodes; non-server nodes only gossip within the local cluster. If information about another cluster is required, the local servers forward the request to that cluster's servers and return the response. One of the main benefits of this is that network load is even lower (vs. a single cluster) for the same number of nodes.
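For illustration (our own sketch, with a placeholder hostname), that federation model is just configuration on the server agents: each datacentre's servers join the other datacentres' servers over the WAN gossip pool, while clients never leave their local pool. In Consul's HCL configuration format it looks roughly like this:

```hcl
# Sketch of a Consul server agent configuration for WAN federation.
# Only servers join the WAN pool; clients gossip only within their
# local datacentre. The hostname below is a placeholder.
datacenter       = "dc1"
server           = true
bootstrap_expect = 3

retry_join_wan = ["consul-server.dc2.example.internal"]
```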
Nomad
We then went back to Nomad with HashiCorp co-founder Armon Dadgar. Seth introduced him as "actually the smartest guy he knew, even smarter than James". Living up to this introduction, Armon's talk was immaculate as he described the architecture and design of Nomad, which uses a similar architecture to Consul to gossip information between nodes and to federate over multiple clusters. The missing element when designing Nomad was the scheduling code, and Armon referenced the Omega and Borg projects from Google Research, as well as Sparrow and Mesos from AMPLab, as inspirations for how they built the Nomad scheduler.
The result is a system that scales very, very well. He described a 'ridiculous' test case that they devised: 5,000 compute nodes, registered with Nomad, used to launch 1 million Docker containers running Redis. He referenced the classic quote attributed to Bill Gates – "640KB is enough for anybody" – to make the point that we can't predict how big any of this stuff is actually going to get. The test was successful and a 1-million-container Redis cluster was created within 5 minutes (this was not a live demo on the day!). HashiCorp published the results and the next day received a call from the second largest hedge fund in the world (Citadel) saying that they needed to run a cluster of 4 million containers – proving the point about predicting scale.
The rest of the afternoon was dedicated to talks from HashiCorp partners and other companies that use the products.
OpenCredo
First we had Nicki Watt from OpenCredo, who talked about the typical way in which users of Terraform evolve their usage: from a simple implementation for a single user, through to a code structure that can be maintained by large teams – storing the Terraform state file remotely in S3 and reducing code duplication (keeping it DRY – 'don't repeat yourself') by creating reusable modules. I was fairly smug during this talk as it's the exact journey we've been through at Naimuri in order to develop our best-practice pattern for using Terraform. It was reassuring to see that others have also done it and reached the same conclusion.
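For anyone who hasn't made that journey yet, the mature end state she described boils down to something like the following sketch; the bucket, key and module names are invented for illustration:

```hcl
# Sketch: remote state in S3 plus a reusable module.
# Bucket, key and module path are hypothetical.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "prod/networking.tfstate"
    region = "eu-west-1"
  }
}

module "vpc" {
  source     = "../modules/vpc"
  cidr_block = "10.0.0.0/16"
}
```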
Elsevier
After a day of individual talks we then had our first duet. James Rasell and Iain Gray from Elsevier talked about how they’ve achieved ‘Operational Maturity’ using HashiCorp products. Specifically they’ve been focused on improving Infrastructure, Release Deployment and Security Governance of their system. They also referenced their Terraform setup, acknowledging the previous talk by Nicki, saying that they’d also followed the same approach!
James expanded on this slightly, though, by saying that they are obsessive about using Terraform to provision everything in their (many) AWS accounts. This includes the initial bootstrapping of the AWS account itself in order to make it usable by Terraform. Chicken and egg? Elsevier also have strict standards around how they write Terraform modules – ensuring that each module has a changelog and uses semantic versioning for new releases. They also go beyond a README for each module, creating comprehensive documentation of its usage.
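In practice, semantic versioning of modules tends to mean consumers pinning to a tagged release of the module's repository; the repository URL and tag below are made up for illustration:

```hcl
# Sketch: consuming a semantically versioned Terraform module by
# pinning to a git tag. Repository and tag are hypothetical.
module "hardened_vpc" {
  source = "git::https://github.com/example-org/terraform-aws-vpc.git?ref=v1.2.0"

  cidr_block = "10.10.0.0/16"
}
```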
Mention was also made of using Packer as part of their 'AMI bakery' (inspired by Netflix?) to create hardened AMIs, and of using Serverspec to test them once built. These secure, tested images are then ready to be deployed across their environments.
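Packer templates were JSON at the time, but translated into Packer's newer HCL template format, an AMI-bakery build of the sort they described might look roughly like this; the region, source AMI and hardening script are placeholders of our own:

```hcl
# Sketch of an AMI bake in Packer's HCL template format (templates
# were JSON in 2017). All names and IDs here are placeholders, and a
# real bake would append a timestamp or version to the AMI name.
source "amazon-ebs" "hardened_base" {
  region        = "eu-west-1"
  instance_type = "t2.micro"
  source_ami    = "ami-0123456789abcdef0"
  ssh_username  = "ec2-user"
  ami_name      = "hardened-base-example"
}

build {
  sources = ["source.amazon-ebs.hardened_base"]

  provisioner "shell" {
    script = "scripts/harden.sh"
  }
}
```

The Serverspec tests would then run against an instance launched from the resulting AMI before it is promoted across their environments.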
Monzo
The final talk of the day was given by Simon Vans-Colina and Matt Heath, who are DevOps engineers working for the new breed of disruptive bank – Monzo. Interestingly, I'd never heard of them until very recently, when a friend recommended that I get one of their cards for a family holiday. I was really impressed with the speed of interaction using the application: topping up with cash from another account and then immediately using the Monzo card to withdraw from a cash machine. The application responds with notifications instantly, which hinted at the progressiveness of the underlying tech…
The Monzo guys needed to provision a physical network connection to the banking Faster Payments Service (FPS), which would typically require an expensive, traditional piece of network equipment. Because they needed to route connections from their AWS-hosted application to FPS, they found it easier to provision this service using a Linux appliance, which they were able to build and test successfully using Vagrant and Packer. They also use Terraform heavily, but for running and managing their Docker containers they run Kubernetes on CoreOS.
Tiki Bar
That concluded the talks. All that remained was the consumption of cold beer(s) and canapés, and analysis of the day in the Tiki bar. It was great to see and hear in person the people who write the products that we spend our working days obsessing over. We previously only knew them via blog posts and GitHub issues, so to see the people behind the code was awesome.
We love working with HashiCorp products and we hope to share this passion and experience with our customers in delivering their projects.
Roll on HashiDays 2018!