Explaining Babbage

URL:http://thecomputersarewinning.com/post/Explaining-Babbage/

ReadyForZero recently released a library called Babbage and I thought I'd take a few minutes to describe the problem that it's solving and how it does it.

We use Clojure at ReadyForZero, and one of the great things about it is the ability to explore data at the REPL (I talk about one idea why here). At ReadyForZero we collect a lot of data about how people use the site and track how it's working, and for whom. Clojure has powerful sequence manipulation operators, and the tools for computing statistics and accumulating over sequences are right at hand.

For example, suppose that we are curious about visitors to our website. We might compute their average spend with a single form at the REPL.

But now, if you want to break down average spend and compare it among different groups, such as visitors using different browsers, you need to write nearly the same form again for each set. Having to write out all the code for each set sucks! A few of the shortcomings:

Unscalable - as you add more and more different groups that you want to look at, it becomes unwieldy to write the same form over and over
Inefficient - each seq is being processed repeatedly
Verbose - it's hard to distinguish different lines from each other because there is so much boilerplate
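
The repetition described above can be sketched in plain Clojure. The visitors data, its :spend and :browser keys, and the mean helper are hypothetical stand-ins for illustration, not the original REPL session.

```clojure
;; Hypothetical sample data: each visitor is a flat map.
(def visitors
  [{:browser :chrome  :spend 10}
   {:browser :firefox :spend 20}
   {:browser :chrome  :spend 30}
   {:browser :safari  :spend 40}])

;; Helper assumed for illustration: arithmetic mean of a seq of numbers.
(defn mean [xs]
  (when (seq xs)
    (/ (reduce + xs) (count xs))))

;; Average spend over all visitors.
(def overall-mean
  (mean (map :spend visitors)))

;; The same form again, once per group. Each one walks the whole seq.
(def chrome-mean
  (mean (map :spend (filter #(= :chrome (:browser %)) visitors))))

(def firefox-mean
  (mean (map :spend (filter #(= :firefox (:browser %)) visitors))))
```

Every additional group means another near-identical form and another full pass over visitors.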

We thought about how to formalize this process, and like all good reductions, ended up breaking it down into three steps.

The Babbage Model

Babbage seeks to abstract the process of collecting and comparing statistics into 3 steps:

Create - build a list of records (maps) that contain the relevant inputs to your accumulators and set predicates.
Partition - split the records into the subsets that you want to consider. These can overlap, so one record can be in multiple sets.
Aggregate - choose the fields of interest: both what raw numbers to extract from each record and what aggregations you want (e.g. mean, sum, histogram).

Let's take these in turn with the example above.

Creating the input

Clojure is really well suited to working with maps and sequences, so it's a good idea to start any "flow" or manipulation with a sequence of flat (as opposed to deeply structured) maps. Building up the required sequence can often require several function calls. For this, Babbage provides a mechanism to declare dependencies of functions [1]. In a simple case like our visitors example this is a bit contrived, since the records could more easily be built in a single pass through raw-visitors, but defining these dependencies as graphfns has several advantages over regular functions:

Parallelism - two functions that don't depend on each other can be executed in parallel. In this case, spends and browser can be executed in parallel.
Laziness - optionally, run-graph can be run in a mode where nothing is actually done until one of the keys in the resulting map is dereferenced.
Composability - you can write smaller functions that can be composed by run-graph, avoiding computation when you don't need it.
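
The laziness idea can be illustrated without Babbage by using Clojure's built-in delay. This is only a sketch of the concept, not Babbage's graphfn implementation or API: build-graph, the calls counter, and the shape of raw-visitors are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical raw input.
(def raw-visitors
  [{:ua "Chrome/23"  :purchases [5 5]}
   {:ua "Firefox/17" :purchases [20]}])

;; Counts how many times each step actually runs.
(def calls (atom {:spends 0 :browsers 0}))

(defn build-graph
  "Returns a map of delays. Each step runs at most once, and only when
  it (or something that depends on it) is dereferenced."
  [raw]
  (let [spends   (delay (swap! calls update :spends inc)
                        (mapv #(reduce + (:purchases %)) raw))
        browsers (delay (swap! calls update :browsers inc)
                        (mapv #(first (str/split (:ua %) #"/")) raw))
        ;; Depends on both steps above.
        visitors (delay (mapv (fn [s b] {:spend s :browser b})
                              @spends @browsers))]
    {:spends spends :browsers browsers :visitors visitors}))

(def graph (build-graph raw-visitors))

;; Nothing has been computed yet. Forcing :spends runs only that step.
(def spends-only @(:spends graph))
```

Dereferencing (:visitors graph) afterwards would run the browsers step and reuse the already computed spends. Since spends and browsers do not depend on each other, a scheduler would also be free to run them in parallel.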

Structuring the input computation as a graph helps you create the input, and this sets up nicely for the next step: computing aggregates over different groupings of this sequence.

Partitioning the records into subsets

It's common to want to compare statistics over different subsets, and as we saw above in our "spending per browser" example, computing these by traversing a sequence each time is unscalable, inefficient, and verbose.

With Babbage, you can define the subsets you're interested in declaratively, by defining predicate functions that indicate membership. Continuing our example, we can take the output of the previous section and compute the mean "spend" for several different subsets in a single calculation.

In addition to just partitioning them and defining sets with predicate functions, you can build up more complicated set compositions declaratively by using the standard set composition operators like union and intersection. There are plenty of examples in the README.

This makes defining different subsets efficient, since each predicate is computed just once per record and the aggregations for different subsets happen in the same pass over the sequence. More importantly, the definition of partitions is succinct.
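
The single-pass behavior described here can be sketched in plain Clojure. This is not Babbage's API: summarize, subset-mean, the subset names, and the sample records are hypothetical, and only a count and a total are accumulated.

```clojure
;; Hypothetical records and subset predicates. Subsets may overlap.
(def visitors
  [{:browser :chrome  :spend 10 :returning? true}
   {:browser :firefox :spend 20 :returning? false}
   {:browser :chrome  :spend 30 :returning? false}])

(def subsets
  {:all       (constantly true)
   :chrome    #(= :chrome (:browser %))
   :returning :returning?})

(defn summarize
  "Walks records once. For each record, every predicate is evaluated once,
  and the record's value is added to each subset it belongs to."
  [subsets field records]
  (reduce
   (fn [acc record]
     (reduce-kv
      (fn [acc set-name member?]
        (if (member? record)
          (-> acc
              (update-in [set-name :n] (fnil inc 0))
              (update-in [set-name :total] (fnil + 0) (field record)))
          acc))
      acc
      subsets))
   {}
   records))

;; Mean for one subset's accumulated {:n ... :total ...} map.
(defn subset-mean [{:keys [n total]}]
  (/ total n))

(def summary (summarize subsets :spend visitors))
```

Adding another subset here is one more entry in the subsets map rather than another traversal of visitors.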

Aggregating field values

We've seen above how aggregations are used: for example, the "mean" is computed on the "spend" field. You can specify multiple aggregation functions per field, and use any function you want to extract the value from a record. The README has more.

Babbage defines these stats using monoids, a simple formalism that lends itself to parallelizing the reduction of these statistics. If there's interest, let me know in the comments and I can write about it and how it interacts with the upcoming Clojure reducers library and distributed aggregation.
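
To illustrate why a monoid suits this, here is a plain-Clojure sketch of a mean accumulator. It shows the general idea and is not Babbage's internal representation; mean-zero, mean-of, mean-combine, mean-value, and accumulate are hypothetical names.

```clojure
;; A mean accumulator is a running total plus a count.
(def mean-zero {:total 0 :n 0})            ; identity element

(defn mean-of [x] {:total x :n 1})         ; lift a single value

(defn mean-combine
  "Associative combine: the grouping of partial results does not matter."
  [a b]
  {:total (+ (:total a) (:total b))
   :n     (+ (:n a) (:n b))})

(defn mean-value [{:keys [total n]}]
  (when (pos? n) (/ total n)))

(defn accumulate [xs]
  (reduce mean-combine mean-zero (map mean-of xs)))

;; Two chunks reduced independently (possibly on different threads or
;; machines) combine into the same result as one sequential pass.
(def whole    (accumulate [10 20 30 40]))
(def combined (mean-combine (accumulate [10 20]) (accumulate [30 40])))
```

Because mean-combine is associative and mean-zero is its identity, the input can be split at any point and the partial accumulators merged afterwards, which is what makes parallel or distributed reduction possible.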

Advantages of this model

I've tried to demonstrate how using Babbage breaks down the process of accumulating statistics into 3 distinct pieces, which are completely composable and orthogonal. This makes it faster to develop, more efficient to run, simpler to reason about, and easier to change.

At ReadyForZero, we've found a 3-4X development time reduction from thinking about aggregations in this way. If you're doing the kind of stuff at the top of this post, give Babbage a shot and let us know how it works for you or what we can do to improve it.

See the extensive README for more examples.

React to this post on Hacker News.
