Comments:"Explaining Babbage"
URL:http://thecomputersarewinning.com/post/Explaining-Babbage/
ReadyForZero recently released a library called Babbage, and I thought I'd take a few minutes to describe the problem it solves and how it solves it.
We use Clojure at ReadyForZero, and one of the great things about it is the ability to explore data at the REPL (I talk about one reason why here). At ReadyForZero we collect a lot of data about how people use the site and track how it's working, and for whom. Clojure has powerful sequence-manipulation operators, and the tools for computing statistics by accumulating over sequences are right at hand. For example, suppose we are curious about visitors to our website; we might execute something like the sketch below in our REPL. But to break down average spend and compare it among different groups, we'd have to write essentially the same form again for each set:
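(The data below is made up for illustration; the real records have more fields.)

```clojure
;; Illustrative visitor data.
(def visitors [{:browser :chrome  :spend 10.0}
               {:browser :firefox :spend 20.0}
               {:browser :chrome  :spend 30.0}])

;; Mean spend across all visitors:
(/ (reduce + (map :spend visitors)) (count visitors))
;;=> 20.0

;; ...and again, by hand, for each group we care about:
(let [chrome (filter #(= :chrome (:browser %)) visitors)]
  (/ (reduce + (map :spend chrome)) (count chrome)))

(let [firefox (filter #(= :firefox (:browser %)) visitors)]
  (/ (reduce + (map :spend firefox)) (count firefox)))
```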
Having to write out all of that code for each set sucks! A few of the shortcomings:
- Unscalable - as you add more and more groups that you want to look at, it becomes unwieldy to write the same form over and over
- Inefficient - each seq is being processed repeatedly
- Verbose - it's hard to distinguish different lines from each other because there is so much boilerplate
We thought about how to formalize this process, and like all good reductions, ended up breaking it down into three steps.
The Babbage Model
Babbage seeks to abstract the process of collecting and comparing statistics into 3 steps:
1. Create a list of records (maps) that contain the relevant inputs to your accumulators and set predicates.
2. Partition the records into subsets that you want to consider. These can overlap, so one record can be in multiple sets.
3. Aggregate fields of interest: both what raw values to extract from each record and what aggregations you want (e.g. mean, sum, histogram).

Let's take these in turn with the example above.
Creating the input
Clojure is really well suited to working with maps and sequences, so it's a good idea to start any "flow" or manipulation with a sequence of flat (as opposed to deeply structured) maps. Building up the required sequence can often take several function calls. For this, Babbage provides a mechanism to declare the dependencies among functions [1]. Here is a simple example:
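(This sketch uses defgraphfn and run-graph from babbage.graph; the visitor fields are made up for illustration.)

```clojure
(require '[babbage.graph :refer [defgraphfn run-graph]])

;; Each graphfn names its result and declares its inputs via its
;; parameter names; run-graph wires the dependencies together.
(defgraphfn spends [raw-visitors]
  (map :spend raw-visitors))

(defgraphfn browser [raw-visitors]
  (map :browser raw-visitors))

(defgraphfn visitors [spends browser]
  (map (fn [s b] {:spend s :browser b}) spends browser))

(run-graph {:raw-visitors [{:spend 10.0 :browser :chrome}
                           {:spend 20.0 :browser :firefox}]}
           spends browser visitors)
;; => a map with keys :raw-visitors, :spends, :browser, and :visitors
```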
This case is a bit pedantic, since it would be more easily done in a single pass through raw-visitors, but defining these dependencies as graphfns has several advantages over regular functions:
- Parallelism - two functions that don't depend on each other can be executed in parallel. In this case, spends and browser can be executed in parallel.
- Laziness - optionally, run-graph can be run in a mode where nothing is actually done until one of the keys in the resulting map is dereferenced. Here's an example of that (see the sketch after this list):
- Composability - you can write smaller functions that can be composed by run-graph, avoiding computation when you don't need it.
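And the laziness example promised above. run-graph-lazily here is a hypothetical stand-in for the real lazy entry point, which is documented in the README:

```clojure
;; HYPOTHETICAL: run-graph-lazily stands in for the library's lazy
;; mode; see the README for the actual invocation.
(def result (run-graph-lazily {:raw-visitors raw-visitors}
                              spends browser visitors))

;; Nothing has run yet. Dereferencing a key forces just that value
;; and its dependencies:
@(:spends result)   ; computes spends; browser and visitors stay untouched
```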
Structuring the input computation as a graph helps you create the input, and this sets up nicely for the next step: computing aggregates over different groupings of this sequence.
Partitioning the records into subsets
It's common to want to compare statistics over different subsets, and as we saw above in our "spending per browser" example, computing these by traversing a sequence each time is unscalable, inefficient, and verbose.
With Babbage, you can define the subsets you're interested in declaratively, by writing predicate functions that indicate membership. Continuing our example, we can take the output of the previous section and compute the "spend" statistics for different subsets:
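(A sketch using calculate, sets, stats, and mean from babbage.core; the result shape shown is approximate, so check the README for the exact output.)

```clojure
(require '[babbage.core :refer [calculate sets stats mean]])

;; visitors is the sequence produced by the previous step.
;; Each predicate declares membership in a named set; a record can
;; belong to several sets at once.
(def subsets (sets {:chrome  #(= :chrome (:browser %))
                    :firefox #(= :firefox (:browser %))}))

;; A single pass over visitors computes mean spend for every subset,
;; plus the :all set.
(calculate subsets {:spend (stats :spend mean)} visitors)
;; => roughly {:all     {:spend {:mean 20.0, ...}}
;;             :chrome  {:spend {:mean 20.0, ...}}
;;             :firefox {:spend {:mean 20.0, ...}}}
```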
In addition to defining sets directly with predicate functions, you can build up more complicated set compositions declaratively using the standard set operators like union and intersection. There are plenty of examples in the README.
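For instance, a hedged sketch of what composition looks like; treat the intersections combinator and its exact signature as assumptions to verify against the README:

```clojure
;; ASSUMPTION: an `intersections` combinator along the lines described
;; in the README; verify the name and arity there.
(require '[babbage.core :refer [sets intersections]])

(def subsets
  (-> (sets {:chrome      #(= :chrome (:browser %))
             :big-spender #(> (:spend %) 25)})
      ;; derive a set of records in both :chrome and :big-spender
      (intersections :chrome :big-spender)))
```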
This makes defining different subsets efficient, since each predicate is computed just once per record, and the aggregations for the different subsets happen in the same pass over the sequence. More importantly, the definition of the partitions is succinct.
Aggregating field values
We've seen above how aggregations are used; for example, the "mean" above is computed on the "spend" field. You can specify multiple aggregation functions per field, and use any function you want to extract the value from a record. The README has more.
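For instance, here's a sketch with two aggregations on one field and a function (rather than a keyword) as the extractor; I'm assuming sum is available alongside mean in babbage.core, and that calculate accepts fields without a sets argument:

```clojure
(require '[babbage.core :refer [calculate stats mean sum]])

(calculate {:spend       (stats :spend mean sum)
            ;; the extractor can be any function of the record:
            :spend-cents (stats #(long (* 100 (:spend %))) sum)}
           visitors)
```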
Babbage defines these stats using monoids, a simple formalism that lends itself to parallelizing the reduction of these statistics. If there's interest, let me know in the comments and I can write about how that works and how it interacts with the upcoming Clojure reducers library and distributed aggregation.
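To give a flavor of the idea, here's a standalone sketch of a monoidal mean (not Babbage's internal representation): the accumulator is a {:sum :count} pair with an associative merge and an identity element, so chunks of the input can be reduced independently and combined afterwards:

```clojure
;; Standalone sketch of a monoidal mean; not Babbage's internals.
(defn mean-m
  ([] {:sum 0 :count 0})          ; identity element
  ([a b] (merge-with + a b)))     ; associative combine

(defn observe [x] {:sum x :count 1})

(let [chunks   (partition-all 2 [10 20 30 40])
      ;; each chunk reduces independently (here, in parallel via pmap)
      partials (pmap #(reduce mean-m (mean-m) (map observe %)) chunks)
      combined (reduce mean-m (mean-m) partials)]
  (double (/ (:sum combined) (:count combined))))
;;=> 25.0
```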
Advantages of this model
I've tried to demonstrate how Babbage breaks the process of accumulating statistics into three distinct pieces that are composable and orthogonal. This makes analyses faster to develop, more efficient to run, simpler to reason about, and easier to change.
At ReadyForZero, we've found a 3-4x reduction in development time from thinking about aggregations in this way. If you're doing the kind of thing described at the top of this post, give Babbage a shot and let us know how it works for you or what we can do to improve it.
See the extensive README for more examples.