---
layout: post
title: If you can describe it, you can automate it
date: 2021-04-16
summary: The truly Agile team never has enough time to make any product the best it can be. Yet there is a difference between "it will have to do" and "this is good enough _for now_". This is the story of automation, of continuous iterative improvement. A story of a horrendous manual process that was rarely used and never properly understood, and the work to turn it into a better, more automated process. Along the way, you will meet a Core design pattern, a lot of Bash, a cunning Chef hack, a dramatic MySQL oversight, and a surprising lesson in agility.
category: Devops
tags: bash,chef,hashicorp,percona,mysql,backup,automation
author: andrei_sandu
---
## The design pattern

When you hear stories involving big, monolithic code repositories, they tend to end in, or at least centre around, a lament against legacy code and unmaintainability, with an optional ridiculous story about a single 1,000-line SQL statement. The story of core_mysql is somewhat odd in that category - while absolutely a big, seemingly monolithic structure, it has a different, more contented ending than the usual anecdote of this kind. Put simply, core_mysql is Core Customer Tribe's solution to the problem of persistent storage, and it has been iterated upon whilst in service for a long time. It has been two years since we first put it live, in which time more than 60 new versions have been released, expanding well beyond its initial functionality. Initially built for a single use case, with a specific if not especially tight deadline, the Chef-based solution has grown to cover multiple use cases, handle encryption at rest and in transit, multiple databases and users, custom configurable user permissions, countless monitoring items, backup, disaster recovery and other fancy-sounding features. At the time of writing, the cookbook turned design pattern has helped deploy five highly available database clusters handling diverse data, from customer Safer Gambling data to Payments, Document Verification data, and static assets (don't ask!). And, for what it's worth, we are actually really proud of it!

Without going into too much detail, core_mysql is a design pattern that uses [Percona XtraDB Cluster](https://www.percona.com/software/mysql-database/percona-xtradb-cluster) to build and deploy a highly available synchronous cluster. It also deploys [ProxySQL](https://proxysql.com/documentation/) for service discovery, configures Table Encryption, uses [XtraBackup](https://www.percona.com/software/mysql-database/percona-xtrabackup) for data backup, provides monitoring and alerting via Nagios/NRPE and collectd/Graphite, and can deploy and configure cross-site asynchronous replication for Disaster Recovery purposes. It is a complex solution, designed to accommodate as many use cases and provide as many useful features as possible. At the centre of it all sits the sbg_core_mysql Chef cookbook, which installs and configures all the components needed for this to work. Clocking in at fourteen interlinked recipes and seven custom resources, the cookbook has been built upon non-stop since it was first deployed.

Most components of core_mysql are now automated to some degree - this was the ultimate intention behind it, to make persistent storage available, repeatable, and easy to use across the tribe. However, this was not always the case. As so often happens in this industry, deadlines meant that we had no shortage of elements left to be automated at a later date, "when we had the time". I like to think we did a great job of delivering on those promises in time. Sometimes it was out of necessity, when new requirements arose from the tribe's roadmaps; other times it was simply ad hoc, when there was time, out of passion, or out of the frustration of some engineer having to deal with a manual process. This post will examine one such arduous manual process I recently automated, the challenges it presented and the adventure it took me on. The process in question is the backup restore process.

## The backup restore process

While the backup process for core_mysql has always been a question of running a backup script - roughly speaking a push-button job - the backup restore process
has been far less thought out. The process has always been manual, and to put it mildly, it is a very complex operation. Recently, we have better documented the
process as part of a new compliance piece (as well as wanting better on-call documentation in general), but that does not solve the main issue with it:
it's MASSIVE! It is a seven-step, six-page behemoth of a runbook - and it needs to be this big - and has been affectionately called the longest runbook
known to man. Since we rarely had to actually restore from a backup, we have been content with the manual process for a long time. That said, it was a cause of much toil and frustration whenever the process was needed. There is only so much that knowledge-sharing sessions can hope to accomplish; the ideal would be a process that is straightforward in the first place.
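
For context, the "push-button" backup boils down to something like the following - a minimal, hypothetical sketch rather than our actual script, with the directory layout, user and password handling made up for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a "push-button" backup job - not our production script.
# The directory layout, user and password handling are illustrative assumptions.
set -euo pipefail

BACKUP_DIR="/var/backups/mysql/$(date +%F_%H-%M)"
mkdir -p "${BACKUP_DIR}"

# Take a physical backup of the running node with Percona XtraBackup.
# Tables encrypted at rest come along as-is, which is why restoring them
# later requires the matching keyring (encryption key) to be in place.
xtrabackup --backup \
  --user=backup_user \
  --password="${MYSQL_BACKUP_PASSWORD}" \
  --target-dir="${BACKUP_DIR}"

echo "Backup written to ${BACKUP_DIR}"
```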

SBG often runs what we call a DR Dry Run. We pretend that our live environment has been compromised, and act out our plan for Disaster Recovery, failing over to a secondary site. In the context of core_mysql, that means we have to transfer data from our cross-site replication servers into the database clusters built in the DR site. It was this use case that got the most use out of the backup restore process, it being the main cleanup step following the DR Dry Run. I took part in a DR Dry Run in October and finally finished at 7pm, after running through the painful manual process that I had created. It was obvious to me then what a tremendous pain point this process could become in the wrong scenario. The fact that we don't usually need to restore from backup would, in my mind, only amplify the pain when we do need to do so.

I was taught that if you can describe a process, you CAN automate it. There was already a step-by-step runbook describing how the process is performed.
So I decided to make a straightforward, automated process for it. I knew it was possible, it had to be - and I hoped I would learn something from it.

## Challenges

There is a reason why the manual runbook was so complex. Aside from the obvious considerations for a backup restore scenario (there will be an outage,
the process is destructive, etc.), there is an important consideration for Table Encryption. A requirement from the very beginning, our solution for
Table Encryption has always been complicated in subtle ways. Without getting into too much detail, we used a readily available Table Encryption solution for
encryption at rest, which relies on a keyring file for encryption and decryption (essentially a master encryption key, although with a few twists that
fortunately are not relevant for the problem at hand). This posed a problem regarding the storage and integrity of the encryption key, as the backups we
take are encrypted (so long as Table Encryption is used). Restoring the backup, therefore, does not guarantee we get the data back, unless we also have the
right encryption key to unlock that data. Known as "Step 5 of the longest runbook known to man", this problem constituted the most esoteric and confusing
obstacle to automation, and it involved interacting with Hashicorp Vault and understanding the custom (and convoluted) way we used it to preserve the
integrity and history of the encryption keys we use. The cookbook deals with key rotation by storing and archiving all encryption keys in Hashicorp
Vault - the company-approved choice for all our secret-keeping needs. This way, both the integrity and the history of our Encryption Keys are preserved via Chef, courtesy of a big custom-designed resource. We had Chef recognise when the key was missing on disk so that it could restore it from Vault; similarly, we had Chef recognise when the key had rotated, so that it could keep the archived key available if needed. Most relevant to the problem at hand, if we were to restore from a backup, we would have to make sure that the encryption key on disk was the same as the one present at the time we had taken said backup.
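
To illustrate the shape of that logic (our real implementation is a custom Chef resource, and the Vault path, secret field and keyring location below are entirely made up), the "restore the key if it is missing" check amounts to something like this:

```bash
#!/usr/bin/env bash
# Illustrative sketch only - the real logic lives in a custom Chef resource.
# The Vault path, secret field and keyring location are hypothetical, and we
# assume the keyring is stored base64-encoded in Vault.
set -euo pipefail

KEYRING_FILE="/var/lib/mysql-keyring/keyring"
VAULT_PATH="secret/core_mysql/$(hostname)/keyring"

if [[ ! -f "${KEYRING_FILE}" ]]; then
  echo "Keyring missing on disk - restoring the current key from Vault"
  vault kv get -field=keyring_b64 "${VAULT_PATH}" | base64 --decode > "${KEYRING_FILE}"
  chown mysql:mysql "${KEYRING_FILE}"
  chmod 600 "${KEYRING_FILE}"
fi
```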

## The solution

Early on, a naive solution in the shape of a Jenkins pipeline quickly presented itself. We are dealing with multi-server clusters, after all, and Jenkins is designed for just such requirements. It is a well-understood tool, synonymous with push-button jobs, and (in theory) should be able to achieve what I sought. The only obstacle to this approach was the problem of Encryption restore, and the fact that I would need to make Jenkins interact with our Hashicorp Vault API. I was initially quite confident that I would learn what I needed to make it work quickly, in an epic feat of Learning and Development. However, after a brief read-through of everything I could find about the Hashicorp Vault API, I eventually had to accept I would not be able to find the odd, specific solution I needed. In truth, I failed to find a way to interact directly with the Vault API through Jenkins altogether. If I were brutally honest with myself, there probably is a way to do so; I just never found the right resource for it. All the same, I had grown impatient, and was longing for progress.

I realised I needed a different approach if I were to deliver anything. It occurred to me that I had solved all these issues before, and that I did not need a brand new, shiny solution. Instead, I decided to stay faithful to the existing design pattern, and harness the techniques my squad had perfected in dealing with all things core_mysql (and many other situations as well). I decided to code what I knew, and rely on what I already had. If there is one technical skill I excel at, it is Chef and Ruby! I had worked on core_mysql for three years, and there is a reason why Chef is the heart of it all. I know Chef! I have used it before to interface with Hashicorp Vault! I could make it work in the context of a backup restore! Likewise, I have solved automation problems before: I have automated Replication Setup using Bash scripts. I had written the manual runbook; I should be able to translate it into Bash. This would strip away the glamour of a Jenkins pipeline, with a literal push button, but on the other hand it would allow me to make progress faster (as well as allowing me to establish a solid baseline to expand upon).

The decision was made. A couple of coding sessions and a handful of testing sessions later, the solution was proven and production-ready. A Chef-controlled Bash script would be installed on the MySQL servers, which takes the backup file paths as parameters and performs the backup restore steps. The brunt of the necessary work was in truth quite uninteresting - the sequence of steps was already there, in the manual runbook, simply waiting to be translated into Bash. I made sure the script fails safely wherever possible, and tested the different failure states on one of the test clusters. I endeavoured to make the script as verbose as possible, and included a confirmation step as an extra safety measure. But I would like to focus attention on two aspects that I found surprisingly insightful, and interesting to deal with, respectively.
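
To give a flavour of the script's shape, here is a heavily simplified, hypothetical sketch - the argument handling, paths and steps are illustrative, and it deliberately leaves out the cluster-level concerns mentioned below:

```bash
#!/usr/bin/env bash
# Heavily simplified, hypothetical sketch of the Chef-deployed restore script.
# Real paths, options and safety checks differ; this only shows the shape.
set -euo pipefail   # fail fast on errors, unset variables and pipe failures

BACKUP_DIR="${1:?Usage: restore_backup.sh <path-to-backup>}"
DATADIR="/var/lib/mysql"

echo "About to restore ${BACKUP_DIR} into ${DATADIR} on $(hostname)."
echo "This is DESTRUCTIVE and will take this node out of service."
read -r -p "Type 'yes' to continue: " answer
[[ "${answer}" == "yes" ]] || { echo "Aborting."; exit 1; }

systemctl stop mysql                      # this node must be down to restore
rm -rf "${DATADIR:?}"/*                   # clear out the old data directory
xtrabackup --copy-back --target-dir="${BACKUP_DIR}" --datadir="${DATADIR}"
chown -R mysql:mysql "${DATADIR}"         # restore ownership before startup

echo "Restore complete - bring the node and cluster back up per the runbook."
```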

First, I want to highlight that while production-ready and deployed, the present solution is still very much bare-bones. Aside from the fact that it is
essentially just a Bash script, rather than an intuitive Jenkins Pipeline, it does not deal with taking the cluster down, and it assumes that the backup
originated from the same server. Yet, I see this as a strength, rather than a weakness. The solution is good enough for now, which has always been the underlying theme with core_mysql. We have always shipped solutions that are "good enough", then continuously built upon them every step of the way. This is how this huge design pattern has come to grow this big, and this is why it is successful. The present backup restore solution has achieved what we set out to achieve - the runbook is a third the size of the old manual one, it is far less prone to error or failure, and it will save a lot of time, especially in DR exercises. Everything else will come later, as soon as we have the time to improve it. And we will, eventually, because we care about it. I know, for example, that given the present baseline, I can integrate it into a Jenkins pipeline at some point.

Secondly, the focus of this whole endeavour has been the problem of the Encryption Keys, and interacting with Hashicorp Vault. As mentioned previously,
we have used Chef resources to interact with Vault's API in the past. It made sense to exploit that capability further, so I created a similar custom Chef
resource, aimed solely at restoring the correct Encryption key to disk. But putting it in Chef meant it would be evaluated every time Chef ran, which,
given we have daemonised chef-client on pretty much all our servers, would be quite often. In order to ensure it only executed when a backup restore was ongoing, I needed to tie this Chef resource to the Bash script somehow. The way I achieved this is simple, but effective. I had the Bash script place a
file on disk as part of the backup restore process, containing the necessary information for Chef to know which Encryption key to restore. Then, I had
the new Chef resource listen for that file on disk, and only execute if it is present. Lastly, I had Chef delete that file whenever it finds it. That way,
the only time Chef tries to restore a specific Encryption key is when it runs right at the end of a backup restore. I find it almost poetic how Chef
controls the Bash script, but in turn, that Bash script controls Chef's execution whenever it is triggered.
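
From the script's side, the handshake is nothing more than dropping a small trigger file for Chef to find. A hypothetical sketch (the file name, location and contents are made up; the Chef resource is guarded on the file's presence and removes it after restoring the key):

```bash
# Hypothetical sketch of the trigger-file handshake, seen from the Bash script.
# The file name, location and contents are illustrative assumptions.

TRIGGER_FILE="/etc/mysql/.restore_encryption_key"

# BACKUP_KEY_ID is assumed to have been worked out earlier in the script;
# it tells Chef exactly which (possibly archived) key to pull back from Vault.
echo "key_id=${BACKUP_KEY_ID}" > "${TRIGGER_FILE}"

# The custom Chef resource only runs when this file exists: it restores the
# named key from Vault and then deletes the file, so the next daemonised
# chef-client run goes back to doing nothing special.
```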

## The Belly of the Beast

It was a sunny early-spring Friday when it became clear that something was certainly wrong. The evening before, I had tried and failed to perform a now-routine test of the backup restore process on one of our test clusters. It was not the first time I had seen this kind of error, but this time there was no doubt: the process did not work. I do not mean that the automated script had a bug, but that the process itself was flawed - I had tried every backup from the last week, both the ones scheduled each night and backups taken manually. I had tried the manual runbook, switching the order of steps - nothing was working. I was off the next week, and the first DR Dry Run that would use the shiny new automated backup restore process was scheduled for the Thursday.

I did not know it yet, but the "longest runbook known to man" was missing a step. It transpires that xtrabackup (the backup tool we use) has a --prepare command, designed to ensure the backup is ready to be restored. This option deals with encryption when present, and is important in order to ensure the data does not get corrupted after it gets restored. However, it appears this step is not always a deal breaker, as we have successfully restored backups without using it in the past. But not reliably - in fact, at the second demo we held within the tribe, the one we recorded on video, the backup restore failed. At that time, it was only that backup that failed to restore, so we wrote it off as one corrupt backup - an exception to the rule.

This terrifying Friday got me thinking about the importance of testing our processes. As previously stated, backup restore was seldom used until recently.
It is this infrequency that allowed a potentially devastating flaw to slip into our runbook - and by extension, into my script. Conversely, it is the dawn of routine testing for this process that uncovered it, and prevented it from surfacing when we need the process to work perfectly, in an actual incident. It also got me to appreciate my team's culture of covering for one another. Early in the day, two of my colleagues offered to look
into the issue with me, and continue investigations while I was off. They discovered the missing step, and pointed me to the best way to fix the automated
script upon my return! The DR Dry Run completed without too much trouble thanks to them.

Fixing this bug required me to rethink the way I dealt with Encryption restore, as the --prepare command requires the encryption key to be present before the data is restored. I ended up refactoring the Chef resources, in order to separate the Encryption Restore logic from the rest of the cookbook. I had the script run chef-client with a custom run-list, before the --prepare command, using only the recipe necessary for Encryption Restore.
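
Put together, the corrected ordering inside the script looks roughly like this - again a hedged sketch, with the recipe name, paths and keyring flag as illustrative assumptions:

```bash
# Hypothetical sketch of the fixed ordering inside the restore script.
# The recipe name, paths and keyring option are illustrative assumptions.

TRIGGER_FILE="/etc/mysql/.restore_encryption_key"
BACKUP_DIR="$1"

# 1. Tell Chef which encryption key the backup was taken with.
echo "key_id=${BACKUP_KEY_ID}" > "${TRIGGER_FILE}"

# 2. Run chef-client once with an override run-list containing only the
#    Encryption Restore recipe, so the right keyring is back on disk first.
chef-client --once --override-runlist "recipe[sbg_core_mysql::encryption_restore]"

# 3. Only now prepare the backup - with table encryption in play, --prepare
#    needs the matching keyring to be available (exact flags depend on the
#    keyring setup).
xtrabackup --prepare --target-dir="${BACKUP_DIR}" \
  --keyring-file-data=/var/lib/mysql-keyring/keyring

# 4. The copy-back and the rest of the restore then proceed as before.
```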

## Conclusions

First and foremost, this project has been a success. We already saved a lot of time when we conducted our first backup restore test, a couple of weeks after the project was done and not long before the time of writing, and that saving will only grow as time passes. A less tangible boon,
yet I would argue a more important one, is that we have reduced the chances of human error associated with this process. Now, this is a very hard one
to measure, because we rarely ever used the process at all, and when we did, I was usually the one to perform the feat (even so, I remember screwing
up at times, thankfully not in Production). But there is no denying this is the case, and it is a lot more valuable than the time the new script
will save. We would ideally never need to do a restore from backup in Production, but if we ever do, I am now far less terrified of the thought, and
this is thanks to the shortening of the "longest runbook known to man".

Furthermore, I take pride in the circumstances by which this script has come to be. We identified a point of pain that was extremely rarely felt, yet
we saw its potentially disastrous impact. In other words, we were proactive! This project was not planned out, it was not put on any roadmaps, and it was neither prioritised nor refined - at least not to begin with. It was incepted as a Learning and Development project, and yet it was delivered. What's more, my team is now looking at formalising this kind of project born out of L&D. We saw the value in it, and I was empowered to do it. More than that, I was
given the whole team's support when it went wrong.

For as large a codebase as core_mysql is, it never ceases to amaze me how much more it always seems to need to grow. This is a loved product, one in which
I personally believe, and one which I want to see succeed. It will never be perfect, but it will always be better than it used to be. This little project,
the automation of backup restore, is but the latest example of a set of engineering practices that has guided my team and me, ones that I found myself contemplating every step of the way. I see it as nothing but a good example of those principles. I started by exploring shiny solutions, but realised the value of coding what I know, and building with the tools closest to me. In the beginning, I had imagined a big, single-button, cluster-level solution that solves all the problems and covers all the possible edge cases, but then I had to accept that, if I were to deliver anything, I had to stick to the bare-bones solution which is good enough for now. I had to remind myself of the difference between "That's enough, it will do" and "This is good enough for now" - that is, a difference in mentality that springs from caring about my code and vowing to improve it continuously and incrementally. This design pattern is not just the kind of monolith that engineers tell horror stories about; it is a supported, ever-evolving thing, which builds upon its past.