Quantcast
Channel: Hacker News 50
Viewing all articles
Browse latest Browse all 9433

CitusDB: Scalable Analytics Database - Run SQL on JSON files -- without any data loads

$
0
0

Comments:"CitusDB: Scalable Analytics Database - Run SQL on JSON files -- without any data loads"

URL:http://citusdata.com/blog/65-run-sql-on-json-files-without-any-data-loads


Created on May 30, 2013 Written by Hadi Moshayedi

We keep our log data in JSON files. Can we run analytic queries on these files without having to load them into a database?

We hear this question frequently from customers, and are finally happy to say yes. Without loading any data into a database, you can now quickly analyze your semi-structured data, and even make use of standard SQL data visualization tools. The key component that makes this possible is foreign data wrappers in Postgres. These wrappers enable running SQL on an external data source, given that the wrapper knows how to convert external data from its original form into PostgreSQL's tuple format.

We wanted to enable this functionality for JSON files, and implemented a wrapper named json_fdw. We made sure json_fdw handled nested fields, array types, and heterogeneous documents, and then open sourced the project. The source code and build instructions for json_fdw are available through GitHub; and the wrapper depends on PostgreSQL 9.2 for its SQL capabilities.

Now, let's see how this all fits together with an example. For this, we start by downloading customer reviews data from Amazon for the year 1998 in nested JSON format.

$ wget http://examples.citusdata.com/customer_reviews_nested_1998.json.gz

We have about 600K documents in this dataset. Let's take a look at an example document in this file:

$ gunzip -c customer_reviews_nested_1998.json.gz | head -n 1
{ "customer_id": "AE22YDHSBFYIP", "review": { "date": "1970-12-30", "rating": 5,
 "votes": 10, "helpful_votes": 0 }, "product": { "id": "1551803542", "title":
 "Start and Run a Coffee Bar (Start & Run)", "sales_rank": 11611, "group": "Book",
 "category": "Business & Investing", "subcategory": "General", "similar_ids":
 ["0471136174","0910627312","047112138X","0786883561","0201570483"] } }

Now, let's connect to the PostgreSQL database, and create a foreign table for this JSON file.

$ /opt/citusdb/2.0/bin/psql -h localhost -d postgres
postgres=# CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
 customer_id TEXT,
 "review.date" DATE,
 "review.rating" INTEGER,
 "product.id" CHAR(10),
 "product.group" TEXT,
 "product.title" TEXT,
 "product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');

The first two commands define json_fdw in the database. The last command declares a foreign table schema that gets interpreted at read-time. Some other points to highlight here:

  • json_fdw can directly read from gzipped files. This way, you don't even need to decompress your files.
  • Nested fields in JSON documents are referenced using dot separators. For example, a field defined as "review": { "rating" : 5 } in the JSON document is declared as "review.rating" in the table schema.
  • The foreign table schema is defined at read-time. If you have an additional field that you'd like to query, such as "review.votes", you can simply add the column name and start querying for data. You can even create multiple table schemas for the same underlying file.
  • json_fdw assumes the underlying data can be heterogeneous. If you are querying for a column, and this field doesn't exist in a document or the field's type doesn't match, json_fdw considers the field to be null.

Now that we have the foreign table created, we can run SQL queries on it. In the following, we show a rather simple query, but complex queries that involve sub-selects, SQL window functions, and even joins against your regular Postgres tables are possible.

-- Look for a correlation between book titles and review ratings in our JSON file
SELECT
 width_bucket(length("product.title"), 1, 50, 5) title_length_bucket,
 round(avg("review.rating"), 2) AS review_average,
 count(*)
FROM
 customer_reviews
WHERE
 "product.group" = 'Book'
GROUP BY
 title_length_bucket
ORDER BY
 title_length_bucket;

This particular SQL query completes in about 6 seconds on a 3.1 GHz CPU core. This includes reading documents from the compressed JSON file, parsing fields from those documents, and handing them to Postgres for query execution.

We ran similar queries on the entire customer reviews dataset using a 3.1 GHz CPU core. Regular JSON files were 1.8 GB in size and fit into memory. Average document size was 300 bytes, and json_fdw processed 122K documents/second. When compressed, the file size dropped by 80% down to 325 MB. In this case, json_fdw's processing speed slightly slowed down to 108K documents/second.

We know that these numbers are just approximates; and we're sharing them here to give a rough idea about performance. Typically with analytic queries, a lot depends on the query itself and the underlying documents. For example, we found that as JSON documents get bigger in size, processing speeds decrease.

Of course, if query speeds are important to you, you could use CitusDB binaries, and "stage" data into a distributed foreign table. With this approach, you can also have the database automatically collect statistics about the underlying data, and apply query optimizations such as partition pruning.

For more details on query distribution, or any type of feedback you have on json_fdw, please get in touch with us at engage @ citusdata.com and tell us what you think!

Got comments? Join the discussion on Hacker News.

  • Prev
    Running PostgreSQL on Compression-enabled ZFS

Viewing all articles
Browse latest Browse all 9433

Trending Articles