Client after client of mine encounters the same impediments to agile delivery. As Northwest Cadence deepens our expertise in the Microsoft cloud, we learn more and more about technologies and techniques that can help.
In the old days, production was special. It was the place where all our limited resources were spent first: better CPU, more memory, bigger disk, load-balanced, greater redundancy, disaster recovery, tighter security. We spent whatever it took (or whatever we had) to make production robust, which didn’t leave much for any of our other machines or environments. Meanwhile, if we wanted to hire and keep the best talent, we beefed up developers’ workstations in their own particular ways—but the demands of our fancy dev tools didn’t always match the needs of production either.
Virtual machines gave us greater flexibility than before, but we still had to host the physical boxes the VMs ran on, and we still had to maintain the virtual environments and keep them configured and up-to-date.
When my colleague Steve talks about the age of “scarcity”, this is what comes to mind for me. We couldn’t possibly buy or maintain all the machines or VMs we needed to replicate production through our entire lifecycle. Something had to give, and whatever it was, it took our quality and flow with it.
But why “scarcity”? Haven’t we been saying “disk is cheap” for years? When I wanted bigger RAM sticks for my laptop, I popped up my favorite online retailer and had them for peanuts + free shipping. I once pointed out to a client that they could buy a complete desktop tower that would easily meet all the needs of their team’s QA, at the retail shop we could see from the window of their office, for less than the cost of the dinner I’d seen them charge to their company credit card the night before. (Yes, I may have been jumping up and down and waving my arms when I told them this.) Technology is abundant!
For many organizations, it took the cloud to make all this abundance accessible—secure, easily configured, no hassle with licensing or updates.
Constraints or contention in QA environments
I still find large enterprise clients where it seems like their QA teams have to beg for scraps. That scarcity mentality, where production and then perhaps developers get environments provisioned first, and leftover or obsolete boxes are pressed into service as intermediate testing environments, is surprisingly common.
Even more shocking, to me, is that I’ve encountered more than one organization where multiple teams had to share a single QA instance—literally waiting their turn for the team ahead of them to finish testing on it, then waiting for IT to wipe it down, reconfigure it, and completely reinstall a different subset of their software for testing. The only part that didn’t surprise me was watching that system fall apart when they tried to move it to a two-week sprint cadence!
The agile symptom I most often see here is multiple teams (with or without interdependencies) who run their sprints like little waterfalls, developing code in one sprint and testing it in a later one. Sometimes they even decompose their product backlog items (PBIs) along dev/test lines, making it look like they run a healthy velocity! Once we start measuring their per-feature cycle time, the truth comes out.
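To make the gap between velocity and per-feature cycle time concrete, here is a minimal sketch. The backlog, dates, and point values are invented for illustration, and the helper names are my own rather than anything from a particular tracking tool. It assumes two-week sprints where each feature was split into a "dev" PBI and a "test" PBI.

```python
from datetime import date

# Hypothetical backlog: (feature, kind, started, finished, story points).
# Every PBI fits neatly inside one two-week sprint.
pbis = [
    ("search", "dev",  date(2024, 1, 1),  date(2024, 1, 12), 5),
    ("search", "test", date(2024, 1, 15), date(2024, 1, 26), 3),
    ("export", "dev",  date(2024, 1, 15), date(2024, 1, 26), 5),
    ("export", "test", date(2024, 1, 29), date(2024, 2, 9),  3),
]

def points_completed(items):
    """Velocity-style total: every PBI counts, so the number looks healthy."""
    return sum(points for *_rest, points in items)

def feature_cycle_times(items):
    """Days from a feature's first PBI starting to its last PBI finishing."""
    spans = {}
    for feature, _kind, started, finished, _points in items:
        first, last = spans.get(feature, (started, finished))
        spans[feature] = (min(first, started), max(last, finished))
    return {f: (last - first).days for f, (first, last) in spans.items()}

velocity = points_completed(pbis)          # all 16 points are "done"
cycle_times = feature_cycle_times(pbis)    # each feature took 25 days
```

In this made-up data every PBI closes within its sprint, yet each feature takes 25 calendar days to become shippable, nearly two full sprints.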
To realize the business benefits of an agile transformation, teams need to be able to develop and test a small feature within the same sprint, and PBIs need to result in potentially shippable code. Without that, the organization doesn’t get the flow of value it was promised, and support for agile practices can break down.
The cloud can help here, by making rich, fully resourced QA environments easy to create, easy to replicate, and cost-effective: during the (unlikely?) times a QA environment isn’t needed, just spin it down and stop paying for compute until you need to spin it up again. And if every team happens to need their biggest, baddest environment at the same time? No big deal. The cloud can accommodate as many instances as you need with no new machines to purchase or provision. No more waiting in line to deploy for testing! Now teams can manage their own lifecycle and optimize their own flow. It’s abundance that respects your budget and enables your agility.
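A quick back-of-the-envelope sketch shows why spinning environments down matters. The hourly rate and usage hours are assumptions I picked for round numbers, not real cloud pricing, and the sketch counts compute only (storage for a stopped environment is typically billed separately).

```python
def monthly_compute_cost(hourly_rate, hours_running):
    """Compute-only cost for one environment over a month."""
    return hourly_rate * hours_running

HOURLY_RATE = 0.50            # assumed price per hour, for illustration only
ALWAYS_ON_HOURS = 24 * 30     # environment left running all month
ON_DEMAND_HOURS = 8 * 20      # running 8 hours a day, 20 working days

always_on = monthly_compute_cost(HOURLY_RATE, ALWAYS_ON_HOURS)
on_demand = monthly_compute_cost(HOURLY_RATE, ON_DEMAND_HOURS)
savings_fraction = 1 - on_demand / always_on

print(f"always on: {always_on:.2f}, on demand: {on_demand:.2f}, "
      f"saved: {savings_fraction:.0%}")
```

Under these assumptions, the on-demand environment costs well under a third of the always-on one, which is the budget room that lets every team have its own.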
Limited or no prod-like environments/data for testing
In the past, it didn’t always make economic sense to maintain multiple beefy prod-like environments, or to maintain and secure large collections of production (or prod-like) data, when the majority of functional tests didn’t seem to need all that muscle. This limits teams’ ability to get real, representative test results, which can be a problem anywhere along the pipeline. Some of my clients have a production mirror, but only in their final staging/UAT environment. It’s better than nothing, but testing against realistic conditions only at the end of the pipeline still dramatically increases their time to feedback. They’ll often tell me that test automation has limited usefulness for them, because though it catches compile-time errors early in the lifecycle, it’s the runtime errors, involving complex user experiences and I/O that isn’t well controlled, that really bite them.
The manual testing to find these late-breaking defects is costly in itself, both in terms of humans’ time and the calendar time it takes to conduct. These QA folks exist to catch showstopping bugs before users do, but their findings are rarely received as good news. When issues are found so late in the release cycle, long after the sprint in which they were introduced, the fixes become more difficult and more costly for everyone.
The agile symptom in this situation is often an entire sprint dedicated to “hardening” or “stabilization” just before a release. Once again, what we’re exposing is that teams don’t have the resources they need to produce shippable code within a sprint. Once again, what the organization experiences is a lack of flexibility, and regularly occurring iterations where teams have to do a lot of expensive remediation work instead of producing new business value.
Abundance to the rescue! Just as the cloud makes it easy and cost-effective to maintain as many QA environments as you need, and pay only for what you use, the cloud also makes it a trivial matter to spin up ridiculously powerful machines or store and refresh insane amounts of data for testing at a low cost. If your compile-time automated tests and your basic functional manual tests don’t need all those resources all the time, fine! You can manage your spend by using smaller and simpler environments for the bulk of your testing. What changes with the cloud is that you can spin up your production mirror any time, when you need it, early in the process or late. You can run meaningful performance and load tests with simulated peaks and outrageous stressors, and get actionable results when it’s still early enough to fix issues inexpensively.
Unpredictable, disruptive production incidents
The inevitable result of constrained and delayed testing is more defects escaping through to production. When an issue is finally discovered, often found for the first time by a paying customer, in most organizations it’s a fire drill. Teams under high pressure to deliver new valuable functionality find themselves struggling with constant disruption, shifting priorities, and context-switching, not to mention spoken or unspoken blame for “causing” or “allowing” the defects in the first place. Business stakeholders are frustrated that constant maintenance slows down their plans for growth: it sucks up resources and time without returning any new revenue.
That doesn’t feel like the promise of agile at all!
In this case, the agile symptom is one of the most obvious: unplanned work in the sprint. Teams may build in a capacity buffer to accommodate the interruptions they know are likely to occur. I like the buffer as a short-term technique, but in the long term we also want to address the root causes, and make more lasting improvements to the team’s reliability. It’s especially urgent when I hear managers talk about getting their teams to “get it right the first time” or their developers to “write quality code”—that’s blame, and their teams know it, and it isn’t fair.
Here, the abundance of the cloud can help in a few different ways.
First, of course, everything we’ve already discussed about earlier and more representative testing will help ensure that disruptive bugs don’t escape to production in the first place. Our key agile metric here is mean time to detect (MTTD). When we discover issues just-in-time, very shortly after they were introduced, they’re far simpler to find and faster to fix, because we know where to look, we remember what we just did, and we haven’t had time to write twenty more complex things on top of the original problem. And that’s the key to the blaming problem—finding issues early, before they’ve moved along our pipeline, before others have had a chance to notice them, looks exactly the same as “getting it right the first time”!
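For teams that haven’t tracked MTTD before, a minimal sketch of the calculation might look like this. The timestamps are invented, and it assumes you can record (or reasonably estimate) when each defect was introduced and when it was detected.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(defects):
    """Average gap between a defect being introduced and being detected.

    `defects` is a list of (introduced_at, detected_at) datetime pairs.
    """
    gaps = [detected - introduced for introduced, detected in defects]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical defects, each caught the same day it was introduced.
same_day = [
    (datetime(2024, 3, 4, 9, 0),  datetime(2024, 3, 4, 11, 0)),   # 2 hours
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 5, 14, 0)),   # 4 hours
    (datetime(2024, 3, 6, 8, 0),  datetime(2024, 3, 6, 14, 0)),   # 6 hours
]

mttd = mean_time_to_detect(same_day)  # timedelta of 4 hours
```

The trend matters more than any single value: as testing moves earlier and environments become more representative, this number should shrink from weeks toward hours.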
Second, with the analytics and monitoring tools available in the cloud—easy to create and inexpensive to run—we can add automation to production itself to detect issues proactively, before they’ve had significant end-user impact. We can catch them before they have a chance to make us look bad. This is especially great for IT and it’s the heart of DevOps. Running and monitoring our applications in the cloud allows us to gradually roll out a new configuration, test it with a tiny percentage of our users, and roll it back instantly if anything goes wrong. The cloud allows us to kick out a misbehaving machine or resource at any time, seamlessly redirecting traffic to known good configurations. The cloud gives us zero-downtime deployments and makes our release day the same as any other day.
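The gradual rollout idea can be sketched in a few lines. This is an illustration of the logic only, not any cloud provider’s API: the function names, the 1% error threshold, and the 10-point step are all assumptions, and in practice a platform’s traffic-routing and monitoring features would do this work for you.

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so routing is repeatable."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def route(user_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` of users to the new configuration."""
    return "canary" if bucket(user_id) < canary_percent else "stable"

def next_canary_percent(current: int, errors: int, requests: int,
                        max_error_rate: float = 0.01, step: int = 10) -> int:
    """Roll back to 0% if the canary misbehaves; otherwise widen the rollout."""
    if requests and errors / requests > max_error_rate:
        return 0  # instant rollback: all traffic returns to the stable version
    return min(100, current + step)

# Start with 5% of users; the monitoring numbers below are made up.
percent = 5
percent = next_canary_percent(percent, errors=2, requests=1000)   # healthy: widen
percent = next_canary_percent(percent, errors=40, requests=1000)  # unhealthy: roll back
```

Hashing the user ID keeps each user on the same side of the split between requests, so nobody bounces between old and new behavior mid-session.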
Conclusion… for now
Using the power of Azure to manage our development environments, our DevTest pipelines, our DevOps deployments, and our production health gives our applications an unprecedented resilience and our teams the power to solve some of their toughest agile problems.
But that’s just the infrastructure and delivery side: the “-fall”, if you will, in “water-Scrum-fall”. What about other challenges to agile in other parts of the lifecycle? As I was researching this article, I discovered ways the cloud can help with the “water-”, too! (Insert your favorite condensation pun here.) In a future article, we’ll investigate cloud capabilities around decision support, feedback, metrics, and architecture to tackle the next generation of agile transformation.