Many in the development world have not yet heard of ITIL, but I believe it will come to matter soon enough. The ability to optimize organizational performance, not simply development performance, is a competitive edge and the path toward lower cost and increased growth in the long term. If you are delivering critical web applications, you really want to get this. I have been in a handful of organizations now that have learned to manage better across development and operations; that is, these areas were not separated by a wall, and Development did not simply toss the application software over and wish Operations good luck. They learned to work as one organization – and it was the difference between predictable sanity and much heartache.
It is the considered opinion of some that the misalignment of Operations and Development is too difficult a problem to crack, or that most organizations have no interest in solving it. That said, I have seen it done, and the effort was worth it. Maybe your organization is not ready, but I am hopeful that this post provokes consideration. A key thought in the process – don’t get balled up in ITIL, CMMI, Agile, or anyone else’s canned magic. This is always important, but larger-scale improvements will hit organizational barriers harder than anything else. Solve the problems you have, leverage your existing strengths, and keep the dogma at bay.
What does it look like – to have things working well? Consider an organization running managed services on limited hardware and a limited budget in general. Let's say about 150 applications on a small set of common infrastructure. The organization does real systems engineering, and they have an architecture committee (Ops & Dev working together!) that does solid analysis of the alternatives, considering network architecture, servers, costs, capacity, potential updates to machines, open source libraries, SAN storage, performance, maintenance considerations, and more. Every 6 months they potentially do a serious technology refresh, and they are regularly driving new functionality in all technical areas. They run highly available and meet their SLAs, their delivery dates, and their budget… because they actually work together.
Deployments are a collaborative effort of network, security, development, open system, systems engineering, QA teams, along with business owners and release management personnel. In the real organizations where I have seen this capability, it emerged over time – but not a ridiculous amount of time. Without naming names – we will look at one experience…
First, it was necessary to stabilize the patient. Having QA testing on the weekend just to make the Saturday midnight deployment, then deploying the wrong code, having to build new code in the maintenance window… all of this pretty much without a reasonable net… Desperate times indeed. Sometimes you must get this low to get serious, it seems.
So the end-to-end cycle needed stability, because things were piling in chaotically at the end. Needing developers in the maintenance window was ridiculous… so part of the “sale” was that developers would no longer be pager-bound on the weekend. Of course, QA needed relief also; most of this disarray was because development was allowed to get code changes in late in the week, leaving no time for proper testing. They were allowed to deliver late – so they did. As it happened, we had a decent toolset, but we needed some rules.
Does this sound familiar? What’s important to note here is that we have not mentioned code reviews or technical documentation or requirements. We needed first to get an organizational understanding of how this was supposed to work. Once we had flow, then we could focus on the tasks inside. Because so much variation was allowed, the actors in the process could not really depend on much. Delivery was prayer and heroics.
The key gatekeeper here was QA; if they could not have reasonable time with the code (and not on a Saturday!), it was hard to drive quality. But another critical element was needed… one I consider the most fundamental element of working software development: the build needed to be stable… and, in this case, the best answer was to take it away from those who had no time to get it right. The build should be script-driven and done on a dedicated non-developer machine. Developers should be able to replicate the build results prior to check-in, but the build machine is the arbiter of truth. [That is – we really do not care if it worked on your machine! 🙂 ] Luckily, the QA Manager had the strength to say “Enough!” and Operations was tired of deployment windows going awry.
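To make the "arbiter of truth" idea concrete, here is a minimal sketch in Python. It is not the tooling we used; it assumes the build is reproducible byte-for-byte, and the function names are made up for illustration. The point is only that a developer's build can be compared to the official one, and when they differ, the build machine's artifact wins.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of a build artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


def reproduces_official_build(local_artifact: bytes, official_artifact: bytes) -> bool:
    """True when the developer's build matches the build machine's output."""
    return artifact_digest(local_artifact) == artifact_digest(official_artifact)


def artifact_to_ship(local_artifact: bytes, official_artifact: bytes) -> bytes:
    """The build machine's artifact is always the one that ships.

    The local artifact is accepted as an argument only to make the rule
    explicit: it is never returned, whether or not it matches.
    """
    return official_artifact
```

In practice the bytes would come from reading the packaged artifact on each machine, and a mismatch would be a prompt to fix the developer's environment, not to argue with the build machine.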
In this organization, we essentially had 1 or 2 maintenance windows per week. Getting your customer into a (sane) minimal but regular deployment schedule is something to drive for. Managing and leveling customer demand is part of a smart approach.
Based on the once-a-week model, we identified two checkpoints in the process: (1) Your code had to be in for test by noon on Wednesday, with all your ducks in a row [build complete, deployed into the test environment with no issues, all related records in the right state, and deployment date approved; noon was when the report was generated for the 1pm deployment meeting], and (2) Your code had to be Ready for Production by Friday 10am (signed off as tested in the QA environment, promoted to pre-production and verified, all tickets in the right state; a deployment report was generated for the 11am Final Deployment meeting). This was the standard process. With changes flowing through on this schedule, QA had time to do their job, deployment mechanisms were verified, and time existed for contemplating risk and organizing for a solid deployment window. The Friday meeting included review of all the scheduled deployments and planning out the sequence of activities, the players involved, and so on… The Wednesday meeting had included preliminary discussions on these.
Of course, not everything will obey. So we had mechanisms for the exceptions. What is important, of course, is that you make sure exceptions stay rare.
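The two checkpoints can be sketched as a small readiness report. This is a hedged illustration in Python, not our actual tooling: the `Change` record, its field names, and the report wording are all hypothetical, and the cutoff is simply passed in as a datetime.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Change:
    """One change headed for the weekly window (field names are illustrative)."""
    name: str
    submitted_at: datetime
    build_complete: bool = False
    deployed_to_test: bool = False
    records_in_right_state: bool = False
    deployment_date_approved: bool = False
    qa_signed_off: bool = False
    verified_in_preprod: bool = False


# Items each checkpoint looks for, taken from the description above.
IN_FOR_TEST = ("build_complete", "deployed_to_test",
               "records_in_right_state", "deployment_date_approved")
READY_FOR_PRODUCTION = IN_FOR_TEST + ("qa_signed_off", "verified_in_preprod")


def missing_items(change: Change, required: tuple) -> list:
    """Return the names of required items the change has not satisfied."""
    return [item for item in required if not getattr(change, item)]


def checkpoint_report(changes: list, required: tuple, cutoff: datetime) -> dict:
    """Map each change name to 'ok' or the reason it misses the checkpoint."""
    report = {}
    for change in changes:
        gaps = missing_items(change, required)
        if change.submitted_at > cutoff:
            report[change.name] = "missed cutoff"
        elif gaps:
            report[change.name] = "missing: " + ", ".join(gaps)
        else:
            report[change.name] = "ok"
    return report
```

Running it once with `IN_FOR_TEST` at Wednesday noon and once with `READY_FOR_PRODUCTION` at Friday 10am would produce the two reports that fed the meetings. Anything not marked "ok" waits for the next window or goes through the exception path.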
Escalations and emergency deployments were also possible – but the signature requirements were nontrivial and they usually included both business owner and senior management. Development might want code delivered… they might be real proud of what they have accomplished; but a compressed QA cycle or rushed planning effort is real risk. Operations owns the stability of the production environment and the company pays a price for outages. Process needs to enforce and make evident such realities.
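One way to make that reality evident is to encode the rule itself. The sketch below, in Python, is an assumption-laden illustration: the role names are invented, and it treats the business owner and senior management sign-offs as strictly required, although in practice they were only usually both required.

```python
# Assumed role names; the real sign-off list varied by situation.
EMERGENCY_SIGNOFFS = frozenset({"business_owner", "senior_management"})


def missing_signoffs(signoffs: set) -> list:
    """Return the required emergency sign-offs not yet collected, sorted."""
    return sorted(EMERGENCY_SIGNOFFS - set(signoffs))


def may_deploy(passed_checkpoints: bool, signoffs: set) -> bool:
    """Standard changes pass on the checkpoints alone.

    Anything that skipped a checkpoint is an exception and needs every
    required sign-off before it goes near production.
    """
    if passed_checkpoints:
        return True
    return not missing_signoffs(signoffs)
```

The design choice is that the exception path costs more than the standard path, which is what keeps exceptions rare.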
We continued to address production emergencies as priority interrupts, and identified a fix-first, document-after mechanism for anything that could be addressed quickly and without a code change. (Not everything is code.) Sometimes being in the environment negated access to the tooling. Note: We had monitoring in place for security and compliance, so many production changes were noticed anyway – but we reinforced and followed up, so that production fixes (and other updates) had proper discussion recorded and audit trails. Outages and impairments also demanded follow-up root-cause analysis.
What is important about this process stabilization is what it enabled. Getting everyone on the same page was only a beginning. Once we had a rhythm, the metrics in the tooling started to have meaning. We could see bottlenecks and quality issues and manpower challenges. And individual groups could see better how to address local challenges. Developers really knew how to code, it seemed… the system before had just kept them too on edge to do their best or to improve anything. QA finally started having their weekends back and could also now execute solid test plans. Operations found their groove, because things were working.
The simplest change of all – getting on the same page with a contained and sane process flow – enabled continuing improvements going forward. Real improvement: reduced costs generally, respectable growth, and stable builds with increasing quality checks (e.g. unit tests, static analysis). By laying a stable foundation, we were able to put more focus on the special cases that always come up – infrastructure updates and migrations, technology refreshes, architectural improvements, periods of long-running performance tests and penetration tests in pre-production, and so on. Simple application deployments actually became simple. We got better at production monitoring, closed long-open defects, and more – we felt a sense of pride and accomplishment.
What’s important to note here: We did this without being led by ITIL or CMMI or any notion of agile. We did it without high-dollar ITSM tooling. You should try it!