---
title: "Backups and Updates and Dependencies and Resiliency"
date: 2024-02-18T16:00:00-08:00
tags: ["homelab", "SDLC"]
---

This post is going to be a bit of a meander. It starts with the description of a bug (and appropriate fix, in the hopes of helping a fellow unfortunate), continues on through a re-consideration of software engineering practice, and ends with a bit of pretentious terminological philosophy. Strap in, let's go!

## The bug

I had a powercut at home recently, which wreaked a bit of havoc on my homelab - a good reminder that I need to buy a UPS! Among other fun issues added to my Disaster Recovery backlog, I noticed that the Sonarr container in my Ombi pod was failing to start up, with logs that looked a little like[^1]:

```
[Fatal] ConsoleApp: EPIC FAIL!

[v4.0.0.615] NzbDrone.Common.Exceptions.SonarrStartupException: Sonarr failed to start: Error creating main database --->
  System.Exception: constraint failed NOT NULL constraint failed: Commands_temp.QueuedAt While Processing: "INSERT INTO "Commands_temp" ("Id", "Name", "Body", "Priority", "Status", "QueuedAt", "StartedAt", "EndedAt", "Duration", "Exception", "Trigger", "Result") SELECT "Id", "Name", "Body", "Priority", "Status", "QueuedAt", "StartedAt", "EndedAt", "Duration", "Exception", "Trigger", "Result" FROM "Commands"" --->
  code = Constraint (19), message = System.Data.SQLite.SQLiteException (0x800027AF): constraint failed NOT NULL
...
```

I could parse enough of this to know that something was wrong with the database, but not how to fix it.

After trying the standard approach of "overwriting the database with a backup[^2]" - no dice - I went a-googling. It seems that a buggy migration was introduced in v4.0.0.614 of Sonarr, rendering startup impossible if there are any Tasks on the backlog in the database. Since my configuration previously declared the image tag as simply `latest`[^3], the pod restart triggered by the power outage pulled in the latest version, which included that buggy migration. Once I knew that, it was the work of several-but-not-too-many moments to:

- `k scale deploy/ombi --replicas=0` to bring down the existing deployment (since I didn't want Sonarr itself messing with the database while I was editing it).
- Spin up a basic ops pod with the PVC attached - frustratingly, there's still no option to do so directly from `k run`, so I had to hand-craft a small Kubernetes manifest and apply it (a sketch of roughly what that looked like is below).
- Install `sqlite3` and blow away the Tasks table.
- Tear down my ops pod, scale the Ombi deployment back up, and confirm everything was working as expected.
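For any fellow unfortunate following along at home, the whole dance looked roughly like the below. This is a reconstruction rather than a transcript - the pod name, PVC name, database path, and even the table to clear (I'm going off the `Commands` table named in the startup error above) are assumptions to check against your own setup:

```bash
# Scale the existing deployment down so nothing else is touching the SQLite file
kubectl scale deploy/ombi --replicas=0

# A throwaway ops pod with the existing config PVC mounted.
# Pod name, PVC name, and mount path are all illustrative - substitute your own.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ops-debug
spec:
  containers:
    - name: shell
      image: alpine:3.19
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: config
          mountPath: /config
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: ombi-config
EOF
kubectl wait --for=condition=Ready pod/ops-debug --timeout=60s

# Install the sqlite3 CLI and clear out the queued commands that the broken
# migration chokes on. Check the table name against your own startup error,
# and the .db path against your own volume layout, before running this!
kubectl exec -it ops-debug -- sh -c \
  'apk add --no-cache sqlite && sqlite3 /config/sonarr.db "DELETE FROM Commands;"'

# Tear the ops pod down and bring everything back up
kubectl delete pod ops-debug
kubectl scale deploy/ombi --replicas=1
```

If your cluster can't pull packages from the internet for the `apk add`, any image that already ships `sqlite3` works just as well.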

## The first realization - automatic dependency updates

This experience prompted me to re-evaluate how I think about updating dependencies[^4]. Since my only professional Software Engineering experience has been at Amazon, a lot of my perspectives are naturally biased towards the Amazonian ways of doing things, and it's been eye-opening to broaden that experience, contrast Amazon's processes with others', and see which I prefer[^5].

I'd always been a bit surprised to hear the advice to pin the exact versions of your dependencies, and to only ever update them deliberately, never automatically. This, to me, seemed wasteful - if you trust your dependencies to follow SemVer, you can safely (if naïvely) pull in any non-major update, and know that you are:

- depending on the latest-and-greatest version of your dependency (complete with any efficiency gains, security patches, added functionality, etc.)
- never going to pull in anything that will break your system (because that, by definition, would be a Major SemVer change)

The key part of the preceding paragraph is "if you trust your dependencies". At Amazon, I did - any library I depended on was either explicitly written by a named team (whose office hours I could attend, whose Slack I could post in, whose Oncall I could pester), or was an external library deliberately ingested and maintained by the Third-Party Software Team. In both cases, I knew the folks responsible for ensuring the quality of the software available to me, and I knew that they knew that they were accountable for it. I knew them to be held to (roughly!) the same standards that I was.

Moreover, the sheer scale of the company meant that any issue in a library would likely be found, reported, investigated, and mitigated even before my system did its regular daily scan for updates. That is - the possible downside to me of automatically pulling in non-major changes was practically zero, so the benefit-to-risk ratio was nearly infinite. I can count on one hand the number of times that automatically pulling in updates caused any problems for me or my teams, and only one of those wasn't resolved by immediately taking an explicit dependency on the appropriate patch-version. Consequently, my services were set up to depend only on a specific Major Version of a library, and to automatically build against the most-recent Minor Version thereof.

But that's not the daily experience of most developers, who mostly take dependencies on external libraries, without the benefit of a 3P team vetting them for correctness, or of an accountable owning team committed to fixing reported issues immediately. In these situations - where there is a non-negligible risk that a breaking change might be incorrectly published in a minor version update, or indeed that bugs might go unreported or unfixed for long periods of time - it is prudent to pin an explicit version of each of your dependencies, and to only make changes when there is a functionality, security, or other reason to update.
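To make that concrete in the terms that bit me here - container image tags rather than library version ranges - the three postures look roughly like this. This is a sketch with a made-up registry and tags, not a recommendation of any particular image:

```yaml
# Illustrative only: the registry, image name, and available tags are hypothetical.
containers:
  - name: sonarr
    # Full trust: always pull whatever was published most recently
    # (this is how I got burned).
    image: registry.example.com/sonarr:latest

    # Trust SemVer: float within a major version, never across one.
    # image: registry.example.com/sonarr:4

    # Trust nobody: pin an exact version (or, better, a digest), and only
    # bump it deliberately when there's a concrete reason to.
    # image: registry.example.com/sonarr:4.0.0.613
    # image: registry.example.com/sonarr@sha256:<digest>
```

The same spectrum exists for libraries - a caret or tilde range versus an exact pin in a lockfile - the mechanism just varies by ecosystem.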

## The second realization - resiliency as inefficiency

Two phenomena described here -

- Having to buy a UPS, because PG&E can't be trusted to deliver reliable energy.
- Having to pin your dependency versions to not-the-latest-and-greatest minor-version, because their developers can't be trusted to deliver bug-free and correctly-SemVer'd updates.

...are examples of a broader phenomenon I've been noticing and seeking to name for some time - "having to take proactive remedial or protective action because another party can't be trusted to measure up to reasonable expectations". This is something that bugs me every time I notice it[^6], because it is inefficient, especially if the service-provider is providing a(n unreliable) service to many customers. At what point does the cost of thousands of UPSes outweigh the cost of, y'know, just providing reliable electricity[^7]?

In a showerthought this morning, I realized - this is just resiliency engineering in real life. In fact, I remembered reading a quote from, I think, the much-fêted book "How Infrastructure Works", to the effect that any resiliency measure "looks like" inefficiency when judged solely on how well the system carries out its function in the happy case - because the objective of resiliency is not to improve the behaviour of the happy case, but to make it more common by steering away from failure cases. Hopefully this change of perspective will allow me to meet these incidents with a little more equanimity in the future.

...and if you have any recommendations for a good UPS (ideally, but not necessarily, rack-mountable), please let me know!


[^1]: I didn't think to grab actual logs at the time - it was only in the shower a day or two later that I realized this provided the jumping-off point for this blog post. These logs are taken from this Reddit post, which I found invaluable in fixing the issue.

[^2]: Handily, Sonarr seems to automatically create a `sonarr.db.BACKUP` file - at least, it was present and I didn't remember making it! 😝 But even if that hadn't been the case, I [took my own advice]({{< ref "posts/check-your-backups" >}}) and set up backups with BackBlaze, which should have provided another avenue. That reminds me...the backup mechanism is overdue for a test...

[^3]: I know, I know...installing Watchtower is on my list, I swear!

[^4]: In this section I'm using "dependencies" to refer to "software libraries used by the services that I, as a professional software engineer, own-and-operate", but most of the same thinking applies to "image tags of services that I deploy alongside my application, that are owned and developed by people other than me or my team".

[^5]: I will die on the hill that Amazon's internal CI/CD system is dramatically superior to any Open Source offering I've found, in ways that don't seem that hard to replicate (primarily, though not solely, image specifications based on build metadata rather than hard-coded infra-repo updates), and I'm frankly baffled as to why no-one's implementing their functionality?[^8]

[^6]: Though, having finally gotten around to blogging about it, I now can't bring to mind any of the examples that I'd noted.

[^7]: I'm glossing over a lot of complexity here, and deliberately hand-waving away the fact that "every problem looks easy from the outside". It's perfectly possible that going from 5 9's of electrical uptime to 100% is impractical - that "the optimal amount of powercuts is non-zero" - or that occasional powercuts aren't as impactful to the average consumer as they are to homelab aficionados. Frankly, I doubt both points, given what I've heard about PG&E's business practices - but, nonetheless, the fact remains that every marginal improvement to a service-provider's service has a leveraged impact across all of its consumers. The break-even point might fall at different places, depending on the diminishing returns of improvement and on the number of customers - but the magnifying effect remains.

[^8]: Yes, this is a deliberate invocation of Cunningham's Law. Please do prove me wrong!