URL:http://blog.paralleluniverse.co/post/44146699200/spaceships
A New Approach to Databases and Data Processing — Simulating 10Ks of Spaceships on My Laptop
In this post I’d like to discuss a new approach to database-aided data processing, and how we were able to use it to simulate tens of thousands of fighting spaceships on my laptop. Obviously, “a new approach to data processing” is a grand claim that will turn out to be little more than a repurposing of a rather old approach, and such a broad topic has many ins and outs and specifics that I only have room to hint at here and that would require deeper discussion; nevertheless, I hope this bird’s-eye view of a serious problem and its suggested solution will be enough to spark your curiosity. But first, here’s what 10,000 spaceships in battle look like (best viewed in high quality):
Could we please have the old Moore’s Law back?
CPUs aren’t getting any faster. They haven’t been getting faster for a few years, and they won’t be getting faster ever again[1].
While Gordon Moore’s 1965 prediction about the rate of growth of CPU transistor density still holds, more or less, to this day, all it does now is give us more cores.
Up until five or six years ago, if growing requirements demanded that a piece of software handle double the amount of data or double the number of users, all developers had to do was wait for new hardware. Now the story is different: scaling experts are hired; databases are sharded; consistency guarantees are relaxed.
Unfortunately, we can’t get the “old Moore’s law” back. If a program has but two instructions, one depending on the result of the other, no number of cores could make executing those two dependent instructions faster, as they must run consecutively. This is, in essence, Amdahl’s law. Heck, even a program consisting of a single instruction would run twice as fast on a double-speed CPU, but would not be accelerated in the slightest by the addition of cores.
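For the record, Amdahl’s law puts a number on this: if a fraction p of a program’s work can be parallelized, then the best speedup N cores can ever give you is bounded by

    speedup <= 1 / ((1 - p) + p / N)

so even with 95% of the work parallelizable, no number of cores will buy you more than a 20x speedup.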
This, of course, isn’t news. All the talk of scaling in the past few years is a direct consequence of “old Moore’s” decline.
But Amdahl’s law isn’t quite a death sentence for software scaling. After all, software doesn’t need to do the same amount of work faster and faster; it needs to do more work in the same time that less work used to take. More work usually means processing more data, and more data brings with it more opportunities for parallelism, thus canceling out Amdahl. But — and this is a big but, indeed — achieving this comes at a hefty price. Exploiting multiple cores well is really hard. It takes a lot of work. Doing it both correctly and efficiently is damn-near impossible. So what do we do? We shard.
Why sharding just won’t cut it
The simplest way to handle more data using more cores (whether on a single machine or in a cluster) is to partition it into disjoint subsets, and work on each subset in isolation. In the database world this is called sharding, and it makes scaling relatively simple; the only downside is that it makes writing compellingly complex applications hard.
Sharding only works well when certain assumptions hold. In particular, it requires little or no interaction between data elements in separate shards. In fact, we can look at it like this: having N shards means imposing N “non-interaction” constraints on the application (or writing jury-rigged code to handle such interactions as corner cases).
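For concreteness, here is a minimal sketch in Java of what key-based sharding boils down to. ShardedStore and its hash-modulo partitioning are my own illustration, not any particular database’s scheme:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A toy sharded key-value store: each shard owns a disjoint subset of the keys
// and is worked on in isolation by its own thread.
class ShardedStore<V> {
    private final List<ConcurrentHashMap<String, V>> shards = new ArrayList<>();
    private final ExecutorService pool;

    ShardedStore(int numShards) {
        for (int i = 0; i < numShards; i++)
            shards.add(new ConcurrentHashMap<>());
        pool = Executors.newFixedThreadPool(numShards); // one worker per shard
    }

    private ConcurrentHashMap<String, V> shardFor(String key) {
        return shards.get(Math.floorMod(key.hashCode(), shards.size()));
    }

    // Fine as long as every operation touches a single shard; anything that
    // needs to read or write keys living in two shards is where the trouble begins.
    void put(String key, V value) {
        shardFor(key).put(key, value);
    }

    Future<V> get(String key) {
        return pool.submit(() -> shardFor(key).get(key));
    }
}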
But if the new Moore’s law gives us exponential core-number growth, then to exploit those cores we need to grow the number of shards exponentially as well, meaning we’re placing an exponential number of constraints on an application that probably only needs to accommodate a linearly increasing number of users, and that’s bad.
We could, of course, use fewer cores and fewer shards, but then data-processing performance within each shard will remain constant (because CPU clock speed has plateaued), and we won’t be able to do cooler stuff with the data.
Sharding also limits other useful abilities like secondary indices: sometimes we’d like to access our data according to different criteria that don’t align neatly with the chosen partitioning.
But, since sharding is relatively simple, that is the approach commonly taken when facing a scaling challenge. The result is usually software that’s severely hindered in its complex, useful abilities, or one that requires an inordinate amount of work to fight an uphill battle against the limitations its architecture imposes on the application, rather than having the architecture magically expand the application’s abilities.
The way to go
If we want to continue writing compellingly complex applications at an ever-increasing scale we must come to terms with the new Moore’s law and build our software on top of solid infrastructure designed specifically for this new reality; sharding just won’t cut it.
First, this means using a programming language built for concurrency. At present, there are two such languages in wide(ish) use: Erlang[2] and Clojure. Other languages might make some concurrent functionality easier than before, but are not designed from the ground up for concurrency (for example, they do not prevent races or enforce correct data visibility)[3].
Second, we need the right concurrent data structures. Every application needs a large data structure (call it a database if you like) to use as its main data-store. This data structure must be concurrent, and it must be mutable to allow for concurrent writes. Erlang and Clojure don’t readily allow for writing such data structures because, for the sake of efficiency, these data structures must internally circumvent the concurrency protections put in place by the language. Erlang provides such a data structure in the form of ETS, and Clojure programs may rely on Java’s concurrent data-structures.
For simplicity’s sake, let us call such a data structure the database, even though it is not necessarily durable (though it must be able to store the application’s main data set).
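To make “the database” a little less abstract, here is a minimal sketch in Java of such a concurrent, mutable main data-store. CounterStore is a made-up name; the example simply leans on ConcurrentHashMap’s per-key atomic compute():

import java.util.concurrent.ConcurrentHashMap;

// A toy in-memory "database": a concurrent, mutable map used as the
// application's main data-store. It is not durable.
class CounterStore {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    // compute() executes atomically per key, so many threads can update
    // different keys in parallel, and updates to the same key never race.
    void increment(String key, long delta) {
        counts.compute(key, (k, v) -> (v == null ? 0L : v) + delta);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }
}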
What the database should do
Like map-reduce frameworks and SQL databases as they’ve been traditionally used — and unlike many NoSQL solutions — we believe that it is the database’s role not only to store and retrieve data on behalf of the application, but also to assist the application in processing it. This is because the database is in a unique position to know which transactions may contend for the same data items, and how to schedule them with respect to one another for the best possible performance. The database can and should be smart. A dumb, sharded database imposes a priori constraints to avoid contention that a smart one could have handled intelligently at runtime, without placing any limitations on the application.
As I’ve said, this isn’t new. Traditional RDBMSs have been doing this for a long time by supporting complex SQL queries and stored procedures, and scheduling them just right. They, too, were designed with the intention that the brains of the application reside with the database, on its central server, while terminals provided little more than a simple UI.
Only, RDBMSs have had trouble scaling beyond a single server, though not due to a fault in this fundamental approach, as some NewSQL DBs have shown. And if you take this approach and retrofit it to modern-day requirements and practices, you might just find that the database can become the application’s scaling engine rather than its bottleneck.
It works like this: you issue a query, but instead of waiting for a response, you hand the database a callback. The database takes care of scheduling the query, and then passes the result to your callback. The callback processes the data, and issues further queries and transactions.
You might think this is quite like the asynchronous DB APIs popularized recently by Node.js, but it’s not. The main difference is this: your callback runs on a thread managed by the database. This means that if you issue two queries that result in two transactions, and those queries or transactions do not interfere with each other, the database will figure out that it can run your data-processing logic in parallel, on two separate cores.
You may think of it as writing your entire application as stored procedures; the database is not embedded in the application — the application becomes embedded in the database.
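Here is a minimal sketch in Java of the shape such a callback-driven API might take. This is emphatically not SpaceBase’s actual interface; AsyncStore, its methods, and the Ship logic are placeholders I’m using for illustration:

import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// The shape of the callback-driven style described above.
interface AsyncStore<T> {
    // The store schedules the query and invokes the callback on one of its own
    // threads; callbacks over non-interfering data may run on separate cores.
    void query(Predicate<T> criteria, Consumer<List<T>> callback);
    void update(T item);
}

class Ship {
    double x, y, vx, vy;

    double distanceTo(Ship other) {
        return Math.hypot(x - other.x, y - other.y);
    }

    // Instead of blocking on a result set, the ship hands its logic to the store.
    void radarSweep(AsyncStore<Ship> store, double radius) {
        store.query(s -> s != this && distanceTo(s) < radius, neighbors -> {
            for (Ship n : neighbors) {   // push away from close neighbors
                vx += (x - n.x) * 0.01;
                vy += (y - n.y) * 0.01;
            }
            x += vx;
            y += vy;
            store.update(this);          // all of this runs on a store-managed thread
        });
    }
}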
Domain-specific data structures are key to achieving this performance magic. Too general, and the database won’t be able to exploit fortunate scheduling opportunities. But if you use just the right one, you can sit back and let your database worry about all the gruesome concurrency stuff and parallelize your business logic for you.
Simulating spaceships
So, now we get to the fun part; the part where we simulate lots and lots of spaceships blowing each other up.
To do this, we’ll use SpaceBase, our very own real-time spatial database, and let it parallelize our spaceships’ logic. The Java[4] code is up on GitHub.
The spaceships aren’t particles in a particle system. Every spaceship constantly queries its surroundings to navigate and chase after targets, and interacts with its neighboring ships by shooting at them and propelling them away when it blows up. Each and every spaceship initiates queries and transactions, and all of them do it concurrently. The main loop merely triggers the spaceships’ actions, but they are all asynchronously run on SpaceBase’s threads. Here’s how the whole thing is implemented, more or less, in pseudocode:
main():
    loop
        foreach(s in spaceships)
            s.run()

spaceship.run():
    // radar: push away from nearby ships, then move
    spacebase.query(range_query(RADAR_RADIUS), visitor(results) {
        foreach(s in results)
            apply_rejection_force(s)
        compute_new_position()
        spacebase.update(this)
    })
    // weapons: look for a target inside the firing beam
    spacebase.query(beam_query(heading, range), visitor(results) {
        t = choose_target(results)
        if t != null and hit(t)
            t.hit()
    })

spaceship.hit():
    take_damage()
    if destroyed
        // explosion: push every ship within the blast radius away
        spacebase.query_for_update(range_query(BLAST_RADIUS), visitor(results) {
            foreach(s in results)
                s.apply_blast_force()
                spacebase.update(s)
        })
Now, this isn’t the fastest way to simulate fighting spaceships. This isn’t even the fastest way of doing it with SpaceBase[5], but our intention here was to simulate dealing with a large number of requests coming through the network, initiating transactions.
Here’s how fast each simulation cycle runs on one of our development machines, a dual-core 2.3 GHz i5 MacBook Pro:
10,000 ships: avg 190.33ms, std 53.48ms
20,000 ships: avg 435.38ms, std 146.07ms
SpaceBase’s performance will continue to improve, but that is not the important thing. The important thing is scalability. Here’s how fast the thing runs on another of our development machines, a quad-core 2.3 GHz i7 MacBook Pro:
10,000 ships: avg 96.38ms, std 15.56ms
20,000 ships: avg 198.53ms, std 31.42ms
Voila! We’ve got linear scalability[6] in the number of cores, with no sharding or zoning. The ships, namely our domain objects, are free to interact with one another any which way they like. It doesn’t even matter if all the spaceships decide to concentrate in, say, one quadrant of the space. The database will handle that[7].
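To put numbers on “linear”: moving from the dual-core machine to the quad-core one, the 10,000-ship cycle drops from 190.33ms to 96.38ms, a 1.97x speedup, and the 20,000-ship cycle drops from 435.38ms to 198.53ms, a 2.19x speedup, i.e. roughly in proportion to the doubling of cores.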
And what about 50,000 spaceships? A cycle for 50K spaceships takes about 550ms on my i7 MacBook.
If you’d like to play some more with SpaceBase, you can download the Lite version that supports up to 20,000 objects here. If you do, and would like to ask questions about SpaceBase, you can do that here, in our new Google Group.
What’s next
Now, SpaceBase is a real-time spatial database, and it has applications in augmented reality, defense, gaming and location-based services, but we’re going to use the same approach for other kinds of databases, employing different data structures for completely different applications.
Also, a missing piece from our discussion of a scalable architecture is the distribution fabric responsible for scaling beyond the capabilities of a single machine. This component distributes and scales the database, but it must keep the data and the application code co-located (i.e., each machine must host a portion of the data alongside the code responsible for processing it), otherwise networking bottlenecks will kill off all the performance we’ve tried so hard to squeeze[8].
The distribution fabric must scale without resorting to sharding for the reasons outlined above, and it must fit well with the application’s concurrency mechanisms. But we will not discuss this component now, as I’ve done so elsewhere and will do so again in the future, when we demonstrate how our approach can achieve cluster-wide scaling without weighing down the application with burdensome incidental limitations.