Just over 8 months ago I wrote about running the complete HashiCorp stack on top of Alpine Linux. Since then, the entire production workload at my job has moved over to this cluster, and through a handful of upgrades we’ve learned a lot about how it works and how to maintain it. This article is a follow-up to the original; if you haven’t read it yet, take a break and do so now. The original article is here.
Some things have changed dramatically in my cluster architecture since then. The most notable is that I now run unbound on every machine in the fleet to provide better resiliency in the case of a failure of any one DNS server. Another major change was to stop using Packer for building a lot of our Docker containers and to use it only for VM builds. It’s better suited to this, and it allowed us to use more specialized tooling in both places for cleaner results.
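For reference, the per-host unbound setup is conceptually very small. The sketch below assumes the usual pattern of handing the consul domain to the local Consul agent’s DNS port and forwarding everything else to a pool of upstream resolvers; the addresses shown are placeholders, not our real ones.

    server:
        interface: 127.0.0.1
        # Allow queries to the local Consul agent on 127.0.0.1.
        do-not-query-localhost: no

    # Hand the .consul domain to the local Consul agent's DNS interface.
    stub-zone:
        name: "consul."
        stub-addr: 127.0.0.1@8600

    # Everything else goes to a pool of upstream resolvers, so losing any
    # single DNS server doesn't take a machine's resolution with it.
    forward-zone:
        name: "."
        forward-addr: 10.0.0.2
        forward-addr: 10.0.0.3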
Some things have not changed in the last 8 months. We still run with a dedicated pool of colocated servers. We also take the hard stance that anything running will run in Nomad, not on the metal. Most importantly, we still run Alpine Linux as our preferred operating system.
Upgrades, People, Upgrades!
The biggest challenge my team and I faced this year was learning the overhead of upgrades and how to perform them across the cluster without downtime. We learned this the hard way.
HashiCorp releases new builds regularly enough that we had to spend some serious time polishing this process and ensuring it was as smooth as possible. Part of this meant using actual Alpine APKs for Nomad, Consul, and Vault, as this made distribution much easier. It also makes upgrades on our long-lived master servers easier. Since HashiCorp doesn’t ship fully static binaries, we also invested time in setting up more advanced tooling to build static musl binaries internally. This tooling made upgrades much easier, but still not foolproof.
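To give a flavor of what that build tooling does (the real version is wrapped up in an APKBUILD and a CI pipeline), the core of a static musl build on an Alpine builder looks roughly like this; the version variable and output path are placeholders.

    # Build a statically linked Nomad against musl (illustrative sketch).
    apk add --no-cache go git gcc musl-dev linux-headers

    git clone https://github.com/hashicorp/nomad.git
    cd nomad
    git checkout "v${NOMAD_VERSION}"

    # Linking statically means the binary has no runtime libc dependency,
    # so it runs unmodified on any Alpine host in the fleet.
    CGO_ENABLED=1 go build \
        -ldflags '-linkmode external -extldflags "-static"' \
        -o nomad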
To make upgrades safer, we needed to pool machines and scale them more cleanly and with less human oversight (as it turns out, humans like to “fix” problems during rollouts).
Using AWS ASGs to contain the actual Nomad clients makes it easier to scale them and dynamically replace machines, but also makes it harder to reason about the state of machines that may have been spun up dynamically to replace lost capacity. The biggest draw of using an ASG for upgrade safety is the ability to scale it up, deploy a new version of all the workers, and then scale it back down after removing the old machines. This lets an upgrade happen at a slower pace, set by the rate at which applications in the cluster are upgraded, rather than causing a lot of churn during a machine replacement. We still need a drain pass over terminating machines to cleanly remove system jobs, but this is largely unavoidable.
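In practice the upgrade dance looks something like the sketch below; the group name, counts, and node IDs are placeholders.

    # 1. Grow the worker pool so there is spare capacity on the new image.
    aws autoscaling set-desired-capacity \
        --auto-scaling-group-name nomad-workers \
        --desired-capacity 12

    # 2. Mark an old node ineligible so nothing new lands on it, then
    #    drain it; this is the pass that cleanly removes system jobs.
    nomad node eligibility -disable "$OLD_NODE_ID"
    nomad node drain -enable -deadline 1h -yes "$OLD_NODE_ID"

    # 3. Terminate the drained instance and let the group shrink back
    #    down once all of the old workers have been cycled out.
    aws autoscaling terminate-instance-in-auto-scaling-group \
        --instance-id "$OLD_INSTANCE_ID" \
        --should-decrement-desired-capacity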
Pull, Don’t Push
For most machines we now pull the configuration rather than pushing it. Since we still use Ansible, this lets us sidestep scaling limits: new machines just copy down a tarball of all the base configuration they need during startup.
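The startup pull itself is small. A first-boot script along the lines of the sketch below does the job; the bucket name and the exact list of services are placeholders standing in for our real ones.

    #!/bin/sh
    # Pull the base configuration tarball at boot instead of waiting for
    # an Ansible push (bucket and paths are placeholders).
    set -eu

    aws s3 cp "s3://example-infra-config/base-config.tar.gz" /tmp/base-config.tar.gz
    tar -xzf /tmp/base-config.tar.gz -C /

    # Restart the local daemons so they pick up the extracted config.
    rc-service unbound restart
    rc-service consul restart
    rc-service nomad restart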
Pulling config also moves us closer to the world of immutable infrastructure, where we don’t manage tons of machines and instead replace and reload appliances as needed. We do, however, still maintain the capability to push by using the Ansible EC2 dynamic inventory plugin.
Growing Pains
One of the first things to go was the set of statically defined machines in Terraform. Statically defined pools like this don’t work very well in the long term and mean that special care has to be taken not to accidentally introduce changes to a single machine. Using launch templates and ASGs helped make sure that machines came up clean every time.
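Stripped down to its essentials, the worker pool now looks like the Terraform below; the AMI variable, instance sizes, and subnets are placeholders rather than our production values.

    resource "aws_launch_template" "nomad_worker" {
      name_prefix   = "nomad-worker-"
      image_id      = var.worker_ami    # the baked Alpine image
      instance_type = "m5.large"
      user_data     = filebase64("${path.module}/files/worker-boot.sh")
    }

    resource "aws_autoscaling_group" "nomad_workers" {
      name                = "nomad-workers"
      min_size            = 3
      max_size            = 20
      desired_capacity    = 6
      vpc_zone_identifier = var.private_subnet_ids

      launch_template {
        id      = aws_launch_template.nomad_worker.id
        version = "$Latest"
      }
    }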
Another growing pain we’re still dealing with is releasing the remaining static addresses. These are a holdover from migrating a workload that was accustomed to a physical datacenter; static addressing was one of the things we allowed into the new cluster rather than trying to fix every issue during the migration.
The Future
Some ideas that existed back in April still exist now as goals for the future, and some new goals have shown up as well.
I’m still unhappy with Ansible as a provisioner, as it requires Python on every machine, something which is otherwise not required for the machines in the cluster. Initially, I’d hoped to replace it with some other technology written in Go, but unfortunately the two most promising options are not really usable (mgmt is not production ready, and converge has been discontinued). Right now my thought is to use a set of dedicated tools to pull binaries, supervise daemons, and pull config files with a very limited template scope. This could then be baked into a machine image and run without needing to have Python on every machine.
Another idea that I’ve tossed around is to replace Alpine altogether. Using a tool like distri would enable packing an entire OS without needing to modify packages on boot at all. The idea is that if all the rest of your infrastructure is checked into git and updated via automation, why not the OS itself?
The final major idea I have had over the last few months is to build a proper set of modules for building base infrastructure in AWS. Right now my HashiCorp cluster Terraform is very tightly coupled to my product infrastructure Terraform. This isn’t a good place to be, and means that small changes in one can cause huge problems in the other. I think the best path forward is to split out the HashiCorp bits into an external, Open Source project and to become a downstream consumer of it. That way I can also maintain a test cluster, and hopefully entice better musl support from HashiCorp by providing an easy onboarding path for running a cluster that isn’t as fragile as other cluster schedulers.
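If that split happens, consuming the cluster from the product infrastructure would boil down to something like the module block below; the source and inputs are hypothetical, since the project doesn’t exist yet.

    module "hashistack" {
      # Hypothetical upstream module; this project does not exist yet.
      source = "github.com/example/terraform-aws-hashistack"

      cluster_name    = "prod"
      server_count    = 3
      client_asg_size = 6
      vpc_id          = module.network.vpc_id
    }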
This level of automation and magic is close to what I think the ill-fated HashiCorp Otto project was striving for: the ability to type make prod and have a datacenter spring forth from the ether. I think that Otto failed because it tried to do too much without clear focus, whereas this idea has the clear focus of “run Nomad somewhere”. That “somewhere” is important, since it might be AWS, it might be GCP, or it might be on a shiny new rack of Oxide computers. The goal is more that it’s quick and easy to spin up a reference Nomad cluster that can be arbitrarily large.
Hopefully you enjoyed this article. If you did, feel free to send me an email at maldridge[@]michaelwashere.net, or send me a ping on freenode (I idle as maldridge).