Grig Gheorghiu on the Murky World of DevOps
By Boundary on 03 18 2013
Grig Gheorghiu has been involved in architecting and deploying large-scale system infrastructures for more than a decade. He doesn’t buy all the hype of DevOps, but believes there’s something there if you know how to find it. Grig currently works as VP of Technical Operations at Nasty Gal, an e-commerce startup in the fashion/apparel space in downtown Los Angeles. You can follow Grig on Twitter at @griggheo.
It seems there is healthy disagreement on what exactly DevOps is: a practice, a methodology, an IT organizational change, a development technique, etc. What’s your take?
DevOps, to me, is based on a simple premise: keep the lines of communication open between the Dev team and the Ops team, to the extent that both Dev and Ops are on pager duty, responding to alerts coming from the application and from the infrastructure. As simple as this premise is, there is a lot of work done behind the scenes to actually pull it off. Here are some aspects of that work:
- The famous traditional wall between the Dev team and the Ops team needs to disappear. On one hand, Dev needs to keep Ops in the loop with any technologies they are planning to use. It’s all good and easy when playing with a new tool or technology on your laptop, but it’s a different challenge to deploy it to production and to make sure it stays up.
- On the other hand, Ops needs to create a flexible infrastructure that allows Dev to experiment and deploy freely, multiple times a day. I always give the example of Etsy, where everybody is able to deploy to production 25+ times a day.
- A prerequisite for being able to deploy multiple times a day is to measure, graph and alert on as many metrics as possible, both within the application and across the infrastructure. Again, the DevOps teams at Etsy have mastered this art. This is where a tool such as Boundary becomes invaluable. It gives insight into a layer of the infrastructure, the network, which is usually neglected: few organizations can afford to dedicate resources to the instrumentation, monitoring and graphing of the network layer.
- As I mentioned in my DevOps definition, both Dev and Ops need to carry the pager where the critical alerts from the application and the infrastructure are directed. This forces Dev to learn the Ops infrastructure (at least to make sense of an alert and potentially escalate it to an Ops specialist) and it forces Ops to learn the applications deployed on the infrastructure.
- There can be no frequent deployments and no comprehensive instrumentation of the application/infrastructure without automation. Many people define DevOps as “automated configuration management” and they’re not far from the truth. The advent of cloud computing was essential in this area, since it gave Dev and Ops teams both the tools (cloud APIs) and the requirements (scale) that left them no choice but to automate. It’s no coincidence that automated configuration management tools such as Chef and Puppet have become so popular, and even indispensable, since cloud computing took off.
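To make the “measure, graph and alert” point above a little more concrete, here is a minimal Python sketch of alerting on a single metric. It is only an illustration: the `MetricAlert` class, the metric name and the threshold are hypothetical, and a real setup would rely on a monitoring system rather than hand-rolled code.

```python
from collections import deque
from statistics import mean


class MetricAlert:
    """Tracks recent samples of one metric and flags sustained threshold breaches.

    Hypothetical helper for illustration only; not part of any monitoring tool.
    """

    def __init__(self, name, threshold, window=5):
        self.name = name
        self.threshold = threshold
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, value):
        """Store a sample; return True once the window is full and its
        average exceeds the threshold."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold


# Assumed metric name and values, chosen only for the example.
alert = MetricAlert("checkout.response_ms", threshold=500, window=3)
results = [alert.record(v) for v in (120, 900, 950, 980)]
```

Averaging over a small window, rather than alerting on every single sample, is one simple way to keep a brief spike from paging whoever is on call.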
How to build a DevOps team: Can you grow your own?
The ideal mix in a DevOps team is obviously Dev engineers who are interested in operational aspects of the applications they are working on (scalability, performance, reliability) and Ops engineers who are interested in creating infrastructures that support frequent deployments of the applications, as well as instrumentation, monitoring and graphing. On top of these technical skills, DevOps professionals need to have people skills in order to have meaningful and fruitful interactions with each other. Hiring these people is easier said than done.
When hiring an Ops engineer with a DevOps slant, I look at their experience deploying and managing large-scale infrastructures, ideally with automated configuration management tools. I also look at how involved they are in various Open Source communities; ideally, they would have their own projects up on GitHub. An involvement in professional communities outside of their day job is a good sign that they will interact well within teams.
I also like to see passion for the profession, a desire to stay on top of their game, and most of all curiosity. These are characteristics you can’t really teach. DevOps is not supposed to be a job description, but in reality there are a lot of people who list that as a job title in their LinkedIn profile. You can find some decent candidates if you mine LinkedIn that way, although the percentage of qualified candidates is pretty small.
Tools and technologies that matter most?
Okay, here’s a starting list:
- If you want to call yourself a DevOps pro, you need to master at least one scripting language. My favorite is Python, but Ruby or even (gasp) Perl would do the trick, although most modern tools are based on either Python or Ruby anyway. Knowledge of SQL is a big plus as well.
- In terms of tools and technologies, it all starts with automated configuration management tools such as Chef, Puppet and CFEngine. Automated deployment tools such as Fabric and Capistrano are useful especially for pushing smaller discrete updates such as application configuration files.
- A Continuous Integration system such as Jenkins is a must for testing and deploying new code.
- Monitoring, logging and graphing tools are essential. I have used Sensu (a new Open Source kid on the block when it comes to replacing the venerable Nagios), Server Density, Boundary, New Relic, Pingdom and PagerDuty.
- For logging we’ve had good success with Papertrail (especially for network device logs, which become easy to search once they are sent there).
- For graphing I think the consensus these days is to use Graphite. I’ve also used homegrown tools based on the Google Visualization API.
- If you are dealing with multiple cloud providers, I recommend using one of the Open Source Multi-cloud libraries available, such as jclouds (Java), libcloud (Python) or deltacloud (Ruby).
- In the context of clouds and virtualization, I also recommend Vagrant for fast local deployments of development environments that closely resemble production.
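As a small illustration of how the scripting and graphing items in this list fit together, the sketch below builds a metric line in Graphite’s plaintext format (`<path> <value> <timestamp>`) and sends it over TCP. The metric path and the host `graphite.example.com` are placeholders made up for the example, and 2003 is assumed to be the plaintext port of your Carbon listener (it is the default).

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext line: path, value and Unix timestamp,
    separated by spaces and terminated by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="graphite.example.com", port=2003):
    """Send one metric line over TCP.

    The host is a placeholder; the port assumes Carbon's default
    plaintext listener.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric: count one deployment of a "web" application.
line = format_metric("deploys.web.count", 1, timestamp=1363564800)
# send_metric(line)  # uncomment once host and port point at a real Carbon listener
```

Recording an event such as a deployment as a metric makes it possible to overlay deployments on application and infrastructure graphs.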
Organizational or cultural barriers in the way?
The one barrier to rule them all is the wall between Dev and Ops. I’ve heard horror stories, mostly from large enterprise environments, where developers have to wait for weeks before getting access to a VM for development purposes. To counteract this wall, you need a culture of collaboration and sharing. That’s why an involvement in Open Source communities is so important. To me, DevOps implies a culture of service. Ops is in the service of Dev, and Dev is in the service of the business. If at any point either Dev or Ops becomes a bottleneck in the flow of tasks or ideas, there’s a problem. There’s also a problem if knowledge is hoarded for job security purposes. You need to promote values such as transparency, openness and going the extra mile to help your fellow workers.
Any tips that have worked for you?
Frequent communication between Dev and Ops is a must. This requires striking a delicate balance between holding too many and too few meetings. Since most engineering organizations follow some sort of Agile methodology these days, Dev teams already have a daily standup meeting, so Ops teams should follow that example. There should also be a common Dev + Ops meeting at least once per sprint, to get everybody on the same page regarding the initiatives and tasks that will be worked on during the sprint.
Good documentation is critical. The proverbial internal wiki is as good a place as any for this purpose. It’s essential to have documents detailing the infrastructure, how to access it, common issues and procedures to be followed when something goes wrong, including clear escalation procedures. Read John Allspaw’s blog posts offering great advice about all these things.
Finally, postmortems are very useful in determining what went wrong and what could go better next time. Of course, you need to apply a ‘no blaming or punishment’ policy for this tool to be effective.