---
title: "Pre-Pipeline Verification, and the Push-And-Pray Problem"
date: 2023-11-17T19:49:06-08:00
draft: true
tags: ["CI/CD", "SDLC"]
---

It's fairly uncontroversial that, for a good service-deployment pipeline, there should be:

  • at least one pre-production stage
  • automated tests running on that stage
  • a promotion blocker if those tests fail

The purpose of this testing is clear: it asserts ("verifies") certain correctness properties of the service version being deployed, such that any version which lacks those properties - which "is incorrect" - should not be deployed to customers. This allows promotion to be automated, reducing human toil and allowing developers to focus their efforts on development of new features rather than on confirmation of the correctness of new deployments.
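To make the gating concrete, here's a minimal sketch of what such a promotion blocker might boil down to. It assumes a pytest suite that reads the target stage's URL from an environment variable; the stage URL, test path, and environment-variable name are all invented for illustration, and a real pipeline would usually express this step in its own configuration rather than a script:

```python
# A minimal, illustrative promotion gate: run the automated tests against the
# pre-production stage, and refuse to promote the candidate version unless they
# all pass. The URL, test path, and env var below are hypothetical.
import os
import subprocess
import sys

PRE_PROD_URL = "https://pre-prod.example.internal"  # hypothetical pre-prod endpoint

def deployed_tests_pass(stage_url: str) -> bool:
    # Run the test suite, pointing it at the pre-production stage.
    result = subprocess.run(
        ["pytest", "tests/deployed"],
        env={**os.environ, "TARGET_STAGE_URL": stage_url},
        check=False,
    )
    return result.returncode == 0

def main() -> None:
    candidate_version = sys.argv[1]  # the version the pipeline wants to promote
    if not deployed_tests_pass(PRE_PROD_URL):
        # The promotion blocker: a failing test means this version must not
        # reach customers, so exit non-zero and let the pipeline halt here.
        sys.exit(f"Blocking promotion of {candidate_version}: pre-prod tests failed")
    print(f"{candidate_version} verified on pre-prod; safe to promote")
    # ...the pipeline's actual promotion mechanism would be invoked here.

if __name__ == "__main__":
    main()
```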

There's plenty of interesting nuance in the design of in-pipeline testing stages - and although this article isn't specifically about pipeline design, before moving on to the main premise I need to establish one related concept: that of Prod Fidelity.

## Definition of Prod Fidelity

We can think of the deployments[^1] of an app as being characterized by selecting a set of values for some configuration variables - variables like "fleet size", "which stages of dependency services are used", and, most impactfully, "what Load Balancer fronts this deployment (and is it one that serves traffic from paying, production customers)?"[^2]. A deployment which perfectly mimics production is production (and so is unsuitable for pre-production testing[^3]); but, the more a deployment differs from production, the more likely that it will give misleading testing results. Some illustrative examples:

  • Consider an overall system change C which is implemented by change A in service Alpha and by change B in service Beta, where Alpha depends on Beta. Assume that B is deployed to Beta's pre-prod stage, but not to Beta's prod stage. Consider a test (for behaviour implemented by C) which executes against a deployment of Alpha which a) has A deployed, and which b) hits Beta's pre-prod stage. This test will pass (the Alpha deployment has A, and the dependency deployment has B), but it would be incorrect to conclude from that passing test that "it is safe to promote this version of Alpha to production" - because Alpha's prod depends on Beta's prod, and the test made no assertion about whether B was deployed to Beta's prod. Thus, in general, the testing stage which makes the final "is this version safe to promote to production?" verification should depend on the production deployments of its dependencies (see the sketch after this list).

Diagram of Promotion Testing

  • Non-prod deployments which are solely intended for testing might disable or loosen authentication, load-shedding/throttling, or other "non-functional" aspects of the service. While this can be sensible and justified if it leads to simpler operations, it can lead to blind-spots in testing around those very same aspects.
  • Load Testing results must be interpreted with caution where the configuration of the deployment and that of its dependencies does not match the configuration of prod. Even assuming that a service can handle traffic that scales linearly in the compute-size of the service (a justifiable though often-incorrect assumption), scaling your prod by a factor of N compared with your load-testing deployment does not guarantee you can handle N-times the traffic if your dependencies are not similarly scaled!
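Returning to the first example above: a hedged sketch of how Alpha's stages might be configured (service names, stage names, and endpoints are all invented) makes the conclusion easier to see - earlier stages can point at Beta's pre-prod, but the stage whose verdict gates promotion points at Beta's prod:

```python
# Hypothetical per-stage configuration for service Alpha, showing which
# deployment of its dependency Beta each of Alpha's stages calls.
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    beta_endpoint: str          # which deployment of Beta this stage depends on
    gates_prod_promotion: bool  # does a pass here authorize promotion to prod?

ALPHA_PIPELINE = [
    # Early testing stages can use Beta's pre-prod: cheaper, and good enough for
    # catching logical errors in change A.
    StageConfig("alpha-integration", "https://beta.pre-prod.example.internal", False),
    # The stage that makes the final "safe to promote?" verification depends on
    # Beta's prod - because that is what Alpha's prod will depend on, and a pass
    # against Beta's pre-prod says nothing about whether change B has reached
    # Beta's prod.
    StageConfig("alpha-promotion-gate", "https://beta.prod.example.internal", True),
    StageConfig("alpha-prod", "https://beta.prod.example.internal", False),
]
```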

You've probably already guessed, but, to be explicit - I define Prod Fidelity to mean "the degree to which a deployment matches Prod's configuration". This is not a universally objectively quantifiable value - I cannot tell you whether "using the same AMIs as prod" is more or less impactful to Prod Fidelity for your service than "having DEBUG-level logging enabled" - but, I suspect that you have a decent idea of the relative importance of the particular variables for your service.
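As an illustration of that relative, per-service framing - the configuration variables below, and the judgement of how much each mismatch matters, are entirely invented, and every service would substitute its own:

```python
# Illustrative only: Prod Fidelity as a per-service, relative comparison. The
# variables and impact weights are invented; the absolute numbers mean nothing,
# only comparisons between deployments of the same service do.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DeploymentConfig:
    fleet_size: int
    dependency_stage: str   # "pre-prod" or "prod"
    load_balancer: str      # which Load Balancer fronts this deployment
    debug_logging: bool

PROD = DeploymentConfig(100, "prod", "customer-facing", False)

# A per-service judgement call: how impactful is a mismatch in each variable?
IMPACT = {"fleet_size": 1, "debug_logging": 1, "dependency_stage": 5, "load_balancer": 10}

def fidelity_gap(deployment: DeploymentConfig) -> int:
    """Sum the impact of every variable on which this deployment differs from prod."""
    return sum(
        IMPACT[f.name]
        for f in fields(DeploymentConfig)
        if getattr(deployment, f.name) != getattr(PROD, f.name)
    )

def has_higher_prod_fidelity(a: DeploymentConfig, b: DeploymentConfig) -> bool:
    # The only question this article needs answered: is one deployment closer
    # to prod's configuration than another?
    return fidelity_gap(a) < fidelity_gap(b)
```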

For the purposes of this article, it's not important to be able to give a number to Prod Fidelity - just to be able to compare it, to state that a given deployment has higher or lower Prod Fidelity than another. Generally speaking, as a software version progresses through the SDLC, it will be executed on deployments of increasing Prod Fidelity:

  • Detecting logical errors (rather than errors in configuration, deployment, or infrastructure) usually doesn't require high Prod Fidelity. High Prod Fidelity is generally more expensive - either in literal financial expense (running a deployment with an equal volume of equally-powerful compute hardware to Prod is more expensive than running a small set of "good-enough to run tests on" hardware), or in operational complexity (a deployment which closely mimics Production in terms of functionality will require all the same functionality maintenance - authentication providers, certificate management, and so on). Ceteris Paribus, it's preferable if an error can be detected before the change reaches the more expensive, higher-fidelity deployments - though some errors can only be detected at high Prod Fidelity

TK....hmmm. Maybe I a) need to reconsider this point (is there really value in a pipeline beyond Alpha/Beta/Gamma/Load-Test/One-Box/Prod), and b) should just cut this out entirely. But preserve it - it's an interesting idea (and good writing, and especially a good diagram!), but maybe not necessary to this post.

(consider load testing results, or tests which rely on incompletely-deployed behaviour in dependencies when the testing stages don't hit production dependencies). Given that tension, how closely should your testing stages mimic production? For stages which closely mimic production and which "talk to" production downstreams and datastores, how do you mark test traffic such that it doesn't distort those datasets or generate real financial transactions while still providing a high-fidelity test?

TK Prod Fidelity increases

## Definition of Deployed Testing

Categories of test are a fuzzy taxonomy - different developers will inevitably have different ideas of what differentiates a Component Test from an Integration Test, or an Acceptance Test from a Smoke Test, for instance - so, in the interests of clarity, I'm here using (coining?) the term "Deployed Test" to denote a test which can only be meaningfully carried out when the service is deployed to hardware and an environment that resemble those on/in which it runs in production. These typically fall into two categories:

  • Tests whose logic exercises the interaction of the service with other services - testing AuthN/AuthZ, network connectivity, API contracts, and so on.
  • Tests that focus on aspects of the deployed environment - service startup configuration, Dependency Injection, the provision of environment variables, nuances of the execution environment (e.g. Lambda's Cold Start behaviour), and so on.

Note that these tests don't have to solely, specifically, or intentionally test characteristics of a prod-like environment to be Deployed Tests! Any test which relies on them is a Deployed Test, even if that reliance is indirect. For instance, all Customer Journey Tests - which interact with a service as a customer would, making a sequence of "real" calls to confirm that the end result is as-expected - are Deployed Tests (assuming they interact with an external database), even though the test author is thinking on a higher logical level than confirming database connectivity. The category of Deployed Tests is probably best understood by its negation - any test which uses mocked downstreams, and/or which can simply be executed from an IDE on a developer's workstation without any deployment framework, is most likely not a Deployed Test.

Note also that, by virtue of requiring a "full" deployment, Deployed Tests typically involve invoking the service via its externally-available API, rather than by directly invoking functions or methods as in Unit Tests.
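For concreteness, here's a hedged sketch of what a Deployed Test might look like - the /orders endpoint, the payload shape, and the TARGET_STAGE_URL variable are all invented for illustration:

```python
# Sketch of a Deployed Test: it only makes sense against a real deployment,
# because it drives the service through its externally-available API and so
# transitively exercises authentication, routing, startup configuration, and
# the real downstream datastore. Endpoints and payloads are hypothetical.
import json
import os
import urllib.request

STAGE_URL = os.environ["TARGET_STAGE_URL"]  # e.g. the pre-prod stage under test

def test_create_then_fetch_order():
    # A small Customer-Journey-style sequence of "real" calls.
    payload = json.dumps({"item": "widget", "quantity": 1}).encode()
    create_req = urllib.request.Request(
        f"{STAGE_URL}/orders",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(create_req) as resp:
        order = json.loads(resp.read())
    with urllib.request.urlopen(f"{STAGE_URL}/orders/{order['id']}") as resp:
        assert json.loads(resp.read())["item"] == "widget"
```

By contrast, a test that constructed the order-handling object directly with a mocked datastore could run straight from an IDE, and would not be a Deployed Test in this sense.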

Typically, a change which proceeds through the SDLC will undergo testing of progressively higher Prod Fidelity as it advances through the pipeline.

On the spectrum of Prod Fidelity (see the footnote[^multiple-footnote-link] linked from the second paragraph), Deployed Testing falls more towards the high-fidelity end.

TK differentiate from Ephemeral Environments for acceptance


[^1]: I'm here using the definition from the Twelve-Factor App, that a Deploy(ment) is "a running instance of the app [...] typically a production site, and one or more staging sites.". Personally I don't love this definition - the intuitive meaning of "Deployment" for me is "the act of updating the executable binaries on a particular fleet/subset of execution hardware, to a newer version of those binaries", and I'm generally loath to use a term whose term-of-art meaning significantly differs from (i.e. is not a sub/super-set of) the intuitive meaning unless there's clear value to doing so. In particular, I'm not aware of an alternative term for the process of "updating the binaries", leading to the confusing possible statement "I'm making a deployment of version 3.2 to the pre-prod deployment". However, the Twelve-Factor definition appears to be widely-used, and my best alternative "stage" only really applies within a pipeline, so I'll attempt to use it in an unambiguous way[^4].

[^2]: I remain convinced that "what image is deployed to this deployment?" is not a configuration variable defining a deployment, but rather is an emergent runtime property of the deployment pipeline regarded as an operating software system; it should be considered as State, not Structure. See my [previous article]({{< ref "/posts/ci-cd-cd, oh my" >}}) for more exploration of this - though, since it's been over a year since I wrote that, and I've now had experience of using k8s/Argo professionally, I'm long-overdue for a follow-up (spoiler alert - I think I was right the first time ;) ).

[^3]: Another interesting topic that this post doesn't touch on - should you test on Production? (TL;DR - yes, but carefully, and not solely :P )

  4. "Environment" - perhaps the most overloaded term in all software engineering, even worse than "Map" - is not even in the running as an alternative.