Juan Paul

July 20, 2010

Bringing Dev and Ops closer together through metrics and dashboards

Filed under: DevOps — juan paul @ 4:30 pm

I recently had a great visit from Damon Edwards and John Willis of DevOps Cafe. They stopped by Shopzilla’s Los Angeles HQ to chat about how we use metrics, monitoring and dashboards not only to bridge gaps between our application engineering and operations orgs, but also to build measurable applications that any customer within our shop can put to practical use, product managers and business VPs included. To save you some time, here are the highlights of the 57-minute video cast, available in their operations open mic series:

  • Shopzilla’s shop of roughly 150 engineers produces nearly 100 measurable services, all of which are candidates to have statistics from their entire life cycle broadcast to every employee in the company.
  • The software statistics we measure include unit test coverage, documentation coverage, complexity, performance test results, production SLAs, requests per second, exceptions (both distinct lists and aggregate counts) and uptime (measured internally and externally).
  • These statistics come in at a combined rate of about 85,000 per second, so measurement speed and scalable presentation are key.
  • Initially, application engineers were reluctant to put their workings ‘on display’. Once the predictive value of these metrics was demonstrated, however, broadcasting all this information became a fundamental part of considering a new project complete. The request flow changed from ‘Please expose some metrics for us to measure.’ to ‘Please measure these metrics for us.’
  • Measure everything, and once this data gets too cumbersome for one display, let users customize their own view of normalized data from multiple systems. For example, Shopzilla creates operational ‘mashboards’ that let a user choose which revenue graph, which logging output and which system health checks they want to see. Once these displays can be customized, you basically have a user-driven CMS application creating personal views of whatever data each user deems important. If they still don’t use it, find out what data they do need, grab that, and present it in your normalized view as well.
  • If you don’t measure from the beginning (requirements gathering, early in the dev process), then how will you know what you’re missing when you get to production? You wouldn’t install a new version of Oracle/Red Hat without reading the release notes, so don’t release internal software without doing the same.
  • Tools used in our Shopzilla model: Sonar (for Java unit test and documentation coverage), Hudson (for continuous builds, deployments and pushing instrumentation to Sonar), Splunk (for log aggregation), Keynote (for external site monitoring through synthetic transactions), Clearstone (for Oracle Coherence grid stats), Cassandra (someone has got to put those 85k numbers into a meaningful place) and Graphite (make it pretty). Mix in a myriad of Perl and Python code for presentation, and you can abstract any additional tools at your leisure.
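
For anyone wondering what ‘measure everything’ looks like at the smallest scale, here is a minimal Python sketch of pushing a couple of service statistics to Graphite over its plaintext protocol (one ‘path value timestamp’ line per data point, sent to the carbon listener, which defaults to TCP port 2003). This is not our actual pipeline; the metric names and the host below are made up for illustration, and the video does not go into how our numbers travel between Cassandra and Graphite.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext-protocol line: path, value, epoch seconds, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.internal", port=2003):
    """Send pre-formatted lines to a carbon listener over TCP.

    The host is a hypothetical placeholder; 2003 is carbon's default
    plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    now = int(time.time())
    # Hypothetical metric paths, named after the kinds of stats listed above.
    lines = [
        format_metric("services.search.requests_per_second", 1250, now),
        format_metric("services.search.exceptions.count", 3, now),
    ]
    print("".join(lines), end="")
    # send_metrics(lines)  # enable once host/port point at a real carbon listener
```

Keeping the formatting separate from the sending makes it easy to batch many data points into one connection, which matters once the rate climbs toward tens of thousands of statistics per second.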

If video is more your thing and those bullets are not quite enough, you can view the entire hour-long video over at DevOps Cafe. Enjoy, and please feel free to comment or disagree, either here or with the DevOps folks.

