There’s a lot of misunderstanding around the brilliant and unfortunate term DevOps. This article tells the story of how one company built a DevOps culture from the ground up.
First, there were the silos
I remember the days when companies had some silos that were part of a software delivery chain. Some key participants in this story:
- Developers: The guys building the software, complaining it’s too slow to get it into production
- Jenkins masters: The guys configuring Jenkins, complaining about inconsistent structures
- Functional Testers: The guys testing the software, complaining that it does not work and about red builds at 6pm
- Performance Testers: The guys stressing the software, complaining it is too slow
- Sysadmins: The guys deploying the software without really knowing what it is about, complaining about brainless work
If we look at the above scenario, it is broken in many ways. The proof is that everyone is complaining about something, and nobody really understands the others’ complaints. There’s one key participant missing from the list above, actually the most important one:
- Customer: The guy actually paying every other participant, complaining about not being able to use the software.
We see where this goes, right?
Then the company is about to shut down, and everyone blames everyone else. Everyone except the customer, who neither knows nor cares about the company structure.
The customer blames the company as a whole, and the reason is that the company didn’t know how to work as a whole. The real problem was the company’s mindset around the delivery process. Ironic, isn’t it?
My own DevOps story
Fortunately, I had the honour of working for a company that really nailed this problem, and we witnessed the rise of the DevOps culture first-hand. It took a while, but the effort we invested really paid off.
A special thanks to everyone who made this possible, for keeping an open mind so that this improvement could happen.
Timeline: 0 (The realization)
Downtime is money!
- Just joined the company as a developer.
- Development teams perform commits and deploy to development environments by running the Chef Client command on each box.
- The infrastructure team performs deployments to the live environment using Func or Ansible commands.
- The SRE team runs performance tests in a separate environment.
- Integration tests take 10 hours to complete, and some fail randomly; nobody really knows why. Before each deployment, the failing tests are rerun manually until they eventually pass, which is taken as proof that everything is fine.
- The whole deployment process with tests takes about 2 days.
- Some bad downtime happens, and downtime means the company is not making a profit.
- Me thinking: Not that bad. Put into perspective, at least we have configuration management. Not great either.
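Rerunning a failing test until it passes hides the flakiness instead of measuring it. As a minimal sketch (not what we actually ran at the time), the idea of separating flaky tests from genuinely broken ones could look like this. The function name `classify_tests` and the plain-callable test format are assumptions made for illustration only.

```python
from typing import Callable, Dict


def classify_tests(tests: Dict[str, Callable[[], bool]], runs: int = 5) -> Dict[str, str]:
    """Run every test `runs` times and label it "pass", "fail" or "flaky".

    Hypothetical helper: each test is a callable returning True on success.
    """
    labels: Dict[str, str] = {}
    for name, test in tests.items():
        # Count how many of the repeated runs succeeded.
        passes = sum(1 for _ in range(runs) if test())
        if passes == runs:
            labels[name] = "pass"
        elif passes == 0:
            labels[name] = "fail"
        else:
            # Mixed results on unchanged code: the test itself is unreliable.
            labels[name] = "flaky"
    return labels
```

A test labelled "flaky" is a candidate for a fix, not for another manual rerun.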
Timeline: 3 months (Breaking the wall between development and operations)
- I asked someone on the infrastructure team: Can I help you with the live deployment?
- I got the reply: I can certainly teach you how to do it, but this first time you’ll just watch.
- I was lucky that the infrastructure team was mature enough to understand I was trying to improve the process, not steal their jobs.
- OK, we have some top-notch operations fellows who understand the value of coaching and continuous improvement.
Timeline: 4 months (Build the fence of trust)
Trust takes a lot of time to build, and no time to destroy.
- Being the Goat!
- I’m bridging the communication between my developer co-workers and the infrastructure team.
- I got the whole process documented in Confluence.
- We now just have to copy-paste some Ansible commands from there to deploy the most critical and most legacy system in the company (old-school manual canary release).
- I now have access to perform literally anything in the live environment.
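For readers unfamiliar with the term, a canary release deploys to a small subset of hosts first and only continues if those hosts look healthy. The sketch below illustrates that control flow only; it is not our actual procedure, and `canary_rollout`, `deploy` and `healthy` are hypothetical names standing in for the real Ansible commands and health checks.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    canary_count: int = 1,
) -> List[str]:
    """Deploy to the canary hosts first, then to the rest only if the canaries are healthy.

    Returns the hosts that received the new version.
    """
    canaries = list(hosts[:canary_count])
    rest = list(hosts[canary_count:])
    deployed: List[str] = []

    for host in canaries:
        deploy(host)
        deployed.append(host)

    # Stop here if any canary is unhealthy; the remaining hosts stay on the old version.
    if not all(healthy(host) for host in canaries):
        return deployed

    for host in rest:
        deploy(host)
        deployed.append(host)
    return deployed
```

In the manual version, a person played the role of `healthy` by watching the canary box before pasting the next command.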
Timeline: 6 months (Increase the fence of trust)
- I was finally able to get more people on board with this mission of improving the deployment process, and to get them the privileges needed to make software go live.
- Everyone in that group is able to copy-paste some Ansible commands and prove they work.
- Another team member looked into the randomly failing automated tests and created a task force to fix them.
- Another team member engaged with the SRE team to help improve the performance testing process.
Timeline: 12 months (Gathering feedback)
- The integration tests are putting up a good fight… you know… infamous legacy code no one really knows about.
- The deployment process is still a copy-paste of commands, which is far better than where we started, but is beginning to look far from ideal.
- Performance is now tested by the development team against the live environment, with the guidance of the SRE team, which provided the tools and guidelines that make this possible.
- There are now better performance dashboards that reflect the true meaning of the service reporting them. This telemetry is helping everyone better understand the behaviour of the system and the impact of each deployment.
- The infrastructure team is able to tell the development team what can be made better in that process because they are now thinking outside the box (pun intended).
Timeline: 18 months (Improving the process, continuously learning)
- Someone else from the development team picked up the documented Ansible instructions and worked with the operations team to build an Ansible playbook and attach it to a Jenkins job.
- This is now being extended to other services in the ecosystem. A lot of services now have fully automated deployments.
- Everyone in the development team is able to perform releases without blinking.
- The infrastructure team can now focus on developing more tools to make this process even better based on the feedback.
- We can make deployments with the click of a button. Hurray!
- We can use the existing feedback to make this process even better.
The reality today
There’s always something to improve. The trick is to remove the communication friction, get continuous feedback on the process, and improve it.
This only worked because both sides of the chain (Dev and Ops) were willing to work together back and forth to improve the process of delivering software fast and with quality.
- Infrastructure as code.
- Automated blue-green deployments.
- Amazed customer.
- Better quality software, faster website and mobile apps.
- People working every day to improve the process even more.
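In a blue-green deployment, two identical environments exist: one serves live traffic while the new version is installed and checked on the idle one, and then traffic is switched over. The following is a minimal in-memory sketch of that switch, not our real setup; the class `BlueGreenRouter` and the `smoke_test` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Tracks which of two environments ("blue" or "green") receives live traffic."""

    def __init__(self, live: str = "blue") -> None:
        self.live = live
        self.versions: Dict[str, Optional[str]] = {"blue": None, "green": None}

    @property
    def idle(self) -> str:
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Install `version` on the idle environment and switch traffic if it passes."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(target):
            # Live traffic was never touched, so there is nothing to roll back.
            return False
        self.live = target
        return True
```

The previous version stays installed on the other environment, so rolling back is just switching traffic again.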
What DevOps is not
DevOps is not a team
Changing the name of the Sysadmin team to the DevOps team (a mistake that many companies unfortunately make) won’t make things magically start working better.
Actually, by doing this, we are defiling the term, misleading every single person in the company about the true meaning behind the DevOps mindset.
DevOps is not a person
DevOps is not a Dev that knows Ops or an Ops that knows Dev.
Saying “I’m DevOps at my company” or having a fancy “DevOps Engineer” title on LinkedIn just puts the company’s name in a bad light.
People who understand DevOps will just think that the company completely misunderstands the term and is only using the buzzword because… well… it’s a buzzword and it’s cool to have it.
DevOps is not (only) automation
Although automation is a huge deal in the whole DevOps mindset, DevOps is not solving a technology problem. At the end of the day it is all about people interactions. Of course, some tools are needed to make this work in a fast-paced manner.
I can have a fully automated software delivery pipeline, but if I commit my code and go home without worrying about FRs and NFRs (functional and non-functional requirements), I’m not following the DevOps practices.
The three ways of DevOps
DevOps is a customer-driven mindset that promotes interaction between development and operations. It breaks down the walls in the pipeline and makes everyone work together to accomplish the true goal for all of them as a company: a happy customer.
Once implemented, it helps the teams to be more responsive to customer needs and deliver software faster, staying ahead of the competition.
The first way is flow. Developers and Operations are the production chain between the company and the customer. As we move through the delivery value stream from left to right, we want to remove any friction that exists and make everyone work together to make that flow perfect.
The second way is feedback. Working together creates a feedback flow back through the delivery value stream. This way we transform what used to be a pipeline into a conversation that generates understanding of what is required to improve the process (testing, telemetry, etc.).
The third way is continuous learning. Take the lessons learned in the feedback loop to improve the process and optimize it further. This results in things like tools, new dashboards, new tests, and new ways of working (e.g. Netflix came up with Chaos Monkey in this respect).
With the DevOps principles in mind, we can describe architectural patterns, delivery process and guidelines that are the real reason why high-performance companies are high-performance companies (and not because they use technology X or Y).
I have described a real-world story that happened in a high-performance company that became even more performant. They went from good to great with a simple change of mindset.
To finish, a book that I really recommend to anyone in this industry is The Phoenix Project. It depicts the reality of many companies and how DevOps helps them win.