Juan Paul

July 20, 2010

Bringing Dev and Ops closer together through metrics and dashboards

Filed under: DevOps — juan paul @ 4:30 pm

I recently had a great visit from Damon Edwards and John Willis over at DevOps Cafe. They stopped by Shopzilla’s Los Angeles HQ to chat about how we use metrics, monitoring and dashboards not only to bridge gaps between our application engineering and operations orgs, but also to create measurable applications for practical use by any customer within our shop – product managers and business VPs included. I’ll do you a bit of a favor and highlight the 57-minute video cast available in their operations open mic series:

  • Shopzilla’s ~150-engineer shop produces nearly 100 measurable services, all of which are candidates to have their entire life-cycle’s statistics broadcast to all employees in the company.
  • The software statistics measured range from unit test coverage, documentation coverage and complexity to performance test results, production SLAs, requests per second, exceptions (both distinct lists and aggregate counts) and uptime (measured both internally and externally).
  • These statistics come in at an all-told rate of about 85,000 per second, so measurement speed and scalability in presentation are key.
  • Initially, application engineers were reluctant to put their workings ‘on display’; however, once the predictive value of these metrics was demonstrated, broadcasting all this information became a fundamental part of considering a new project complete. The request flow changed from ‘Please expose some metrics for us to measure.’ to ‘Please measure these metrics for us.’
  • Measure everything, and once this data gets too cumbersome for one display, allow the user to customize their own view of normalized data from multiple systems. For example, Shopzilla creates operational ‘mashboards’ that let a user define which revenue graph, which logging output and which system health checks they want to see (a toy sketch follows this list). Once these displays can be customized, you basically have a user-driven CMS application creating personal views of whatever data users deem important. If they still don’t use it, then what data do they need? Grab that, and present it in your normalized view as well.
  • If you don’t measure from the beginning (requirements gathering, early in the dev process), then how will you know what you’re missing when you get to production? You wouldn’t install a new version of Oracle/Red Hat without reading the release notes, so don’t release internal software without doing the same.
  • Tools used in our Shopzilla model: Sonar (for Java unit test and documentation coverage), Hudson (for continuous builds, deployments and pushing instrumentation to Sonar), Splunk (for log aggregation), Keynote (for external site monitoring through synthetic transactions), Clearstone (for Oracle Coherence grid stats), Cassandra (someone has got to put those 85k numbers into a meaningful place) and Graphite (make it pretty). Mix in a myriad of Perl and Python code for presentation, and you can abstract any additional tools at your leisure. A minimal sketch of feeding a single statistic into Graphite follows this list.
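
For the curious, here is a minimal sketch of what pushing a single statistic toward Graphite can look like, using Carbon’s plaintext protocol. The host name and metric path are hypothetical placeholders rather than our actual configuration, and a real producer would batch and buffer samples instead of opening a socket per write.

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    /** Minimal sketch: push one sample to a Graphite/Carbon plaintext listener. */
    public class CarbonSender {
        public static void main(String[] args) throws Exception {
            String host = "graphite.example.com";               // hypothetical Carbon host
            int port = 2003;                                     // Carbon's default plaintext port
            String metric = "shop.search.requests_per_second";   // hypothetical metric path
            double value = 1234.0;
            long epochSeconds = System.currentTimeMillis() / 1000L;

            try (Socket socket = new Socket(host, port);
                 Writer out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8")) {
                // Carbon's plaintext protocol is one line per sample:
                // "<metric.path> <value> <timestamp>\n"
                out.write(metric + " " + value + " " + epochSeconds + "\n");
                out.flush();
            }
        }
    }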
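
And here is the toy ‘mashboard’ sketch promised above: a user picks a handful of panels, each of which is just a Graphite render URL, and we stitch them into one page. The panel titles, host and metric names are made up for illustration; an actual mashboard would also pull from other sources and persist each user’s choices.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Toy 'mashboard': render a user's chosen panels as one static HTML page. */
    public class Mashboard {
        public static void main(String[] args) {
            // Each entry: panel title -> a Graphite /render URL that returns a PNG graph.
            // Hosts and metric paths are hypothetical.
            Map<String, String> panels = new LinkedHashMap<String, String>();
            panels.put("Revenue per minute",
                    "http://graphite.example.com/render?target=shop.revenue.per_minute&from=-4h");
            panels.put("Search requests/sec",
                    "http://graphite.example.com/render?target=shop.search.requests_per_second&from=-4h");
            panels.put("Search exceptions",
                    "http://graphite.example.com/render?target=shop.search.exceptions.count&from=-4h");

            StringBuilder html = new StringBuilder("<html><body><h1>My mashboard</h1>\n");
            for (Map.Entry<String, String> panel : panels.entrySet()) {
                html.append("<h2>").append(panel.getKey()).append("</h2>\n")
                    .append("<img src=\"")
                    .append(panel.getValue().replace("&", "&amp;"))   // escape for HTML attributes
                    .append("\"/>\n");
            }
            html.append("</body></html>");
            System.out.println(html);
        }
    }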

If video is more your thing and those bullets are not quite enough, you can still view the entire hour-long video over at DevOps Cafe. Enjoy, and please feel free to comment and/or disagree here or with the DevOps folks.


4 Comments »

  1. Juan Paul – enjoyed the video and this post. Can you provide some insight on how you measure documentation coverage?

    Comment by Burke Autrey — July 22, 2010 @ 12:50 pm

    • Sure thing: let’s start by defining our application as a Maven-built Java ball of code laced with Javadoc notes within a series of methods and conditional branches. We then ensure that our Hudson continuous build servers have the Sonar plugin installed, and that our Maven builds define the Sonar goal each and every time. Much in the same way your JUnit code can be instrumented for test coverage by, say, Cobertura, Sonar can walk through your exact source to detect the presence of commented headers per function/method/class. (The Sonar Codehaus metrics page provides a good description of the metrics gathered and the formulas for deriving them. Sonar in a Nutshell provides even more detailed screenshots and examples of ‘walking down’ your instrumented code.)
      Again, similar to Cobertura and unit test coverage, the Javadoc instrumentation does not necessarily tell you whether the documentation written is accurate or helpful, but it at least gives you a very good starting point for detecting what has not been documented at all.
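      To make that concrete, here is a hypothetical class roughly as Sonar would see it; the class and method names are made up purely for illustration:

        /** Pricing helpers exposed to other teams; hypothetical example class. */
        public class PriceFormatter {

            /**
             * Formats a price in cents for display, e.g. 1099 becomes "$10.99".
             * Sonar counts this method as documented public API.
             */
            public String format(long cents) {
                return String.format("$%d.%02d", cents / 100, cents % 100);
            }

            // No Javadoc here, so Sonar counts this method as undocumented public API,
            // which drags down the class's documentation coverage figure.
            public String formatWithCurrency(long cents, String currencyCode) {
                return currencyCode + " " + format(cents);
            }
        }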
      A compelling add-on to CPAN’s Devel::Cover (think Cobertura for Perl) would allow you to perform the same measurements on your perldoc-decorated code as well.
      Glad you enjoyed the video and as always, drop me a line with any additional questions or comments.

      Comment by juan paul — July 24, 2010 @ 2:02 pm

      • Excellent response. Thank you.

        Comment by Burke Autrey — July 26, 2010 @ 9:12 am

  2. Great post, Juan. Can you go into more detail on the home-grown application that integrates modules containing graphs/data from 3rd-party tools? I assume you would extend the API of each of the tools, like Splunk and Keynote.

    Comment by Rick — March 3, 2011 @ 10:45 am

