Comments:" Introducing Drake, a kind of ‘make for data’ - Factual Blog"
URL:http://blog.factual.com/introducing-drake-a-kind-of-make-for-data
Processing data can be a real mess!
Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:
- a multitude of steps, with complicated dependencies
- code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
- inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability
Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations of Make, for example the assumption that all data is local.
We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.
Introducing ‘Drake’, a “Make for Data”
We call this tool Drake, and today we are excited to share it with the world as an open source project. It is written in Clojure.
Drake is a text-based, command-line data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs, and Drake automatically resolves the dependencies between them, providing a rich set of options for controlling the workflow. It supports multiple inputs and outputs and has HDFS support built in.
We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflow. Some core benefits we’ve seen:
- Non-programmers can run Drake and fully manage a workflow
- Encourages repeatability of the overall data building process
- Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
- Precise control over steps (for more effective testing, debugging, etc.)
- Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)
Examples
Here’s a simple example of a Drake workflow file with three steps:
;
; Grabs us some data from the Internets
;
contracts.csv <-
  curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt > $OUTPUT

;
; Filters out all but the evergreen contracts
;
evergreens.csv <- contracts.csv
  grep Evergreen $INPUT > $OUTPUT

;
; Saves a super fancy report
;
report.txt <- evergreens.csv [python]
  linecount = len(file("$[INPUT]").readlines())
  with open("$[OUTPUT]", "w") as f:
    f.write("File $[INPUT] has {0} lines.\n".format(linecount))
Items to the left of an arrow ( <- ) are output files, and items to the right are input files. Under the line specifying inputs and outputs is the body of the step, holding one or more commands. The command(s) of a step are expected to handle the input(s) and produce the expected output(s). By default, Drake steps are written as bash commands.
Assuming we called this file workflow.d (the name Drake expects by default), we’d kick off the entire workflow by simply running Drake in that directory.
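In a shell, and assuming the drake executable is on your path, that’s simply:

$ drake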
Drake will give us a preview and ask us to confirm we know what’s going on:
The following steps will be run, in order:
  1: contracts.csv <- [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
  3: report.txt <- evergreens.csv [projected timestamped]
Confirm? [y/n]
By default, Drake will run all steps required to build all output files that are not up to date. But imagine we wanted to run our workflow only up to producing evergreens.csv, and no further. Easy.
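We name the output we want right on the command line (the spec covers the full target-selection syntax, but it should look roughly like this):

$ drake evergreens.csv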
The preview:
The following steps will be run, in order:
  1: contracts.csv <- [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
Confirm? [y/n]
That’s a very simple example. To see a workflow that’s a bit more interesting, take a look at the “human-resources” workflow in Drake’s demos. There you’ll see a workflow that uses HDFS, contains inline Ruby, Python, and Clojure code, and deals with steps that have multiple inputs and produce multiple outputs. Diagrammed, it looks like this:
As our workflows grow more complicated, Drake’s value becomes more apparent. Take target selection, for example. Imagine we’ve run the full workflow shown above and everything’s up to date. Then we hear that the skills database has been updated. We’d like to force a rebuild of skills and all affected dependent outputs. Drake knows how to force a build (+), and it knows about the concept of downtree (^), so we can combine the two.
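Something along these lines should do it (check the spec for the exact target syntax):

$ drake +^skills    # force (+) skills and everything downtree (^) of it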
Drake will prompt us with a preview…
The following steps will be run, in order:
  1: skills <- [forced]
  2: people.skills <- skills, people [forced]
  3: people.json <- people.skills [forced]
  4: last_gt_first.txt, first_gt_last.txt <- people.json [forced]
  5: for_HR.csv <- people.json [forced]
Confirm? [y/n]
… and we’re off and running.
But wait, there’s more!
Drake offers a ton more stuff to help you bring sanity to your otherwise chaotic data workflow, including:
- rich target selection options
- support for inline Ruby, Python, and Clojure code
- tags
- ability to “branch” your input and output files
- HDFS integration
- variables (sketched briefly after this list)
- includes
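To give a flavor of a couple of these, here’s a minimal sketch showing variables and inline Python together in one workflow file. It’s illustrative only: the variable syntax here (a NAME=value line, referenced as $[NAME]) is meant to convey the idea, so see the spec for the exact form.

; Illustrative sketch only -- see the spec for the exact variable syntax
BASE=/data/myproject

; Shell step: filenames are built from the BASE variable
$[BASE]/evergreens.csv <- $[BASE]/contracts.csv
  grep Evergreen $INPUT > $OUTPUT

; Inline Python step, as in the earlier example
$[BASE]/report.txt <- $[BASE]/evergreens.csv [python]
  linecount = len(file("$[INPUT]").readlines())
  with open("$[OUTPUT]", "w") as f:
    f.write("Counted {0} evergreen contracts.\n".format(linecount))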
Drake’s designer gives you a screencast
Here’s a video of Artem Boytsov, primary designer of Drake, giving a detailed walk-through:
Drake integrates with Factual
Just in case you were wondering! Drake includes convenient support for Factual’s public API, so you can easily integrate your workflows with Factual data. If that interests you, and you’re not afraid to sling a bit of Clojure, please see the wiki docs for the Clojure-based protocol called c4.
Drake has a full specification and user manual
A lot of work went into designing and specifying Drake. To prove it, here’s the 60-page specification document. The specification can be downloaded as a PDF and treated like a user manual.
We’ve also started wiki-based documentation for Drake.
Build Drake for yourself
To get your hands on Drake, you can build it from the GitHub repo.
All feedback welcome!
Go make some great workflows!
Sincerely,
Aaron Crow
Software Engineer at Factual