Go Database! Docs
URL: http://zond.github.com/god/architecture.html
Overview
god is a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format.
Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy and transparent failover behaviour.
This is a general architectural overview aimed at somewhat technically inclined readers interested in how and why god does what it does.
To try it out right now, go get github.com/zond/god/god_server, then run god_server and browse to http://localhost:9192/.
For API documentation, go to http://go.pkgdoc.org/github.com/zond/god.
For the source, go to https://github.com/zond/god.
Namespace
Chord circle
The namespace of god, as in the Chord project, is a conceptual circle where each actor, be it a database node or a data item, has a position. The circle wraps around at the maximum value of the hash function selected to spread keys over the namespace.
Murmur
The hash function used in god is MurmurHash3, a fairly new and extremely fast hash function. Unlike many other hash functions it makes no cryptographic claims, but focuses solely on high speed and an even, low-collision distribution.
god contains a slightly modified (only to let it compile in a C compiler) version of the 128 bit, x64 MurmurHash3 wrapped in a thin layer of Go.
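To make the ring concrete, here is a minimal, self-contained sketch of placing keys and nodes on the circle and of the wrap-around interval test a node can use to decide whether a position falls between two others. The 128 bit FNV hash from the standard library stands in for MurmurHash3 so the example runs on its own, and the half-open interval follows the usual Chord convention; god's own wrapper and helpers may differ in detail.

package main

import (
    "bytes"
    "fmt"
    "hash/fnv"
)

// position hashes a key onto the circle. god uses the 128 bit x64
// MurmurHash3 for this; FNV-1a is only a stand-in to keep the sketch
// self-contained.
func position(key []byte) []byte {
    h := fnv.New128a()
    h.Write(key)
    return h.Sum(nil)
}

// between reports whether pos lies in the half-open interval (from, to],
// taking the wrap-around at the top of the namespace into account.
func between(from, to, pos []byte) bool {
    if bytes.Compare(from, to) < 0 {
        return bytes.Compare(from, pos) < 0 && bytes.Compare(pos, to) <= 0
    }
    // The interval wraps past the maximum value of the hash.
    return bytes.Compare(from, pos) < 0 || bytes.Compare(pos, to) <= 0
}

func main() {
    pred := position([]byte("node-a"))
    self := position([]byte("node-b"))
    key := position([]byte("user:42"))
    // A node is responsible for the keys between its predecessor and itself.
    fmt.Println(between(pred, self, key))
}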
Hashed vs non-hashed keys
The main reason to hash keys before inserting them into the namespace is to avoid concentrating keys in one part of the circle, so that all nodes hold as equal a share of the data as possible.
However, since it could be very useful for users of a database to store ordered data, or to wilfully concentrate certain data on certain parts of the cluster, god does not force the user to hash the keys.
Instead, the nodes in god are able to migrate along the circle to hand off data to one another when an imbalance is detected.
Structure
Radix trees
To map keys to values, a mapping structure is needed. For infrastructural reasons (synchronization and cleaning) as well as for functionality of different kinds, we need a sorted mapping, and it has to be deterministically structured.
Radix trees have all these attributes, and are somewhat frugal when it comes to memory consumption, since the common parts of keys are only stored once. god uses radix trees for all data structures; each node has a tree containing a segment of the keyspace.
In the illustration to the right, nodes only contain the part of their key differing from the key of their parent, and the nodes marked red contain no data of their own, only the common parts of their children's keys. Note that the root node contains the empty key, and in this case it actually has a value as well.
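As a rough sketch of what such a node might look like (god's actual tree type carries more per node, such as timestamps and the Merkle hashes described later, so this is illustrative only):

package main

import (
    "bytes"
    "fmt"
)

// node stores only the key bytes that differ from its parent; some inner
// nodes exist purely to hold a prefix shared by their children.
type node struct {
    segment  []byte  // part of the key not shared with the parent
    value    []byte  // unset on purely structural nodes
    hasValue bool    // the root's empty key may still carry a value
    children []*node
}

// get walks down the tree, consuming one matching segment per level.
func (n *node) get(key []byte) ([]byte, bool) {
    if len(key) == 0 {
        return n.value, n.hasValue
    }
    for _, child := range n.children {
        if bytes.HasPrefix(key, child.segment) {
            return child.get(key[len(child.segment):])
        }
    }
    return nil, false
}

func main() {
    // "apa" and "ape" share the prefix "ap", which lives in a structural node.
    root := &node{children: []*node{{
        segment: []byte("ap"),
        children: []*node{
            {segment: []byte("a"), value: []byte("v1"), hasValue: true},
            {segment: []byte("e"), value: []byte("v2"), hasValue: true},
        },
    }}}
    v, ok := root.get([]byte("ape"))
    fmt.Println(string(v), ok) // v2 true
}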
Byte values
All nodes in the main tree can contain byte values, which can be encoded to contain strings, integers, floats or whatever kind of data one wants to store.
Sub trees
In addition, since the need for structured data (in the sense that one can sort, paginate, slice and dice it) is one of the things that makes a vanilla key/value store hard to use in many common use cases, each key in a Tree can also contain a separate sub tree.
This allows us to, for example, store user metadata in the top level byte values, while storing the set of friends in a sub tree. Then we can easily fetch, slice and store friends without having to reserialize the user data (or even worse, the entire set of friends).
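A hypothetical usage sketch of that pattern with the native Go client. The function and method names (client.MustConn, Put, SubPut) and the address are assumptions here rather than something this page documents, so check the API documentation linked above for the exact signatures.

package main

// Assumes the native client package; see the API docs for the real interface.
import "github.com/zond/god/client"

func main() {
    // Address and constructor name are assumptions for illustration only.
    conn := client.MustConn("localhost:9191")

    // Top level byte value: the serialized user record.
    conn.Put([]byte("user:42"), []byte(`{"name":"alice"}`))

    // Sub tree under the same key: one entry per friend. Adding or removing
    // a friend never re-serializes the user record itself.
    conn.SubPut([]byte("user:42"), []byte("user:117"), nil)
}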
Mirror trees
To allow us to sort, paginate, slice and dice the data depending on not only key, but also value, each tree can also be 'mirrored'. This, in essence, means that the tree will contain another tree that exactly mirrors the state of the first tree, except that the keys of the outer tree are the values of the inner tree and vice versa.
All this goodness comes at a cost, of course. Mutating a mirrored tree takes about three times as many resources as mutating a regular tree, since each update has to first remove the old key from the mirror tree, then update the main tree and lastly insert the new key/value combination into the mirror.
To avoid colliding keys in the mirror trees, when several keys in the outer tree have the same values, each key in the mirror tree is appended with the corresponding value. Those appendages are removed, however, when key/value pairs are fetched or sliced from the mirror tree.
Configuration
To control how trees behave, for example (and currently only) to control whether they are mirrored or not, there is a configuration map for each sub tree. Setting the key mirrored to yes will make the tree keep a mirror.
Set operations
Operations
The sub trees in god can be used for various kinds of set operations: union, intersection, subtraction and xor. Note that xor does not currently follow the definition at http://en.wikipedia.org/wiki/Exclusive_or, but instead matches only keys present in exactly one of the input sets.
god includes a simple s-expression parser that parses these kinds of expressions. The code (I (U set1 set2 set3) (U set4 set5 set6)), for example, will calculate the intersection of the unions of set1-3 and set4-6.
Complexity
Since all sets in god are both sorted and possible to skip through in logarithmic time, the set operations are usually efficient, taking time that depends on the logarithm of the size of the source sets.
This means that, instead of iterating through one set (in the intersection example) and skipping all keys not present in all other sets, we skip to the next value equal to or greater than the greatest value seen in the last iteration. This lets us leap over large stretches of keyspace where one or more sets have no matches, while still guaranteeing that we find commonalities if there are any.
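A self-contained sketch of that skipping strategy over plain sorted string slices, with binary search standing in for the logarithmic skip through a radix tree:

package main

import (
    "fmt"
    "sort"
)

// intersect keeps a candidate key; every set seeks to its first key >= the
// candidate, and when they all land on the same key it is emitted. Otherwise
// the largest key seen becomes the next candidate, which is exactly the
// "skip to the next value equal to or greater" step described above.
func intersect(sets ...[]string) []string {
    var out []string
    if len(sets) == 0 {
        return out
    }
    candidate := ""
    for {
        matched := true
        for _, set := range sets {
            i := sort.SearchStrings(set, candidate) // first key >= candidate
            if i == len(set) {
                return out // one set is exhausted, no more matches possible
            }
            if set[i] != candidate {
                candidate = set[i] // skip everyone forward to this key
                matched = false
                break
            }
        }
        if matched {
            out = append(out, candidate)
            candidate += "\x00" // smallest key strictly greater than candidate
        }
    }
}

func main() {
    fmt.Println(intersect(
        []string{"a", "c", "f", "x"},
        []string{"b", "c", "x", "z"},
        []string{"c", "d", "x"},
    )) // [c x]
}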
Mergers
Since many set operations (such as intersections and unions) may produce values from a combination of source sets, there is a concept of merge functions. A merge function takes a list of lists of input values, and produces a list of output values.
These merge operations range from the trivial, like ConCat, which simply concatenates the bytes of all input values and puts the result in a single element list, and Append, which simply creates one output list out of all values in the input lists, to the more specialized mergers described under Encodings below.
To use a merger other than Append (which is the default), simply add :MERGERNAME to the function symbol in your s-expression. For example, (I:ConCat set1 (U:ConCat set2 set3)) will concatenate the values in the union of set2-3, and then concatenate those results with the values from set1 when taking the intersection.
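Conceptually, a merger looks something like the following; the exact signature in god may differ, and ConCat and Append are reimplemented here only to illustrate the two trivial cases mentioned above:

package main

import "fmt"

// merger mirrors the shape described above: a list of value lists in, one
// value list out.
type merger func(inputs [][][]byte) [][]byte

// conCat concatenates the bytes of every input value into a single element.
func conCat(inputs [][][]byte) [][]byte {
    var all []byte
    for _, list := range inputs {
        for _, value := range list {
            all = append(all, value...)
        }
    }
    return [][]byte{all}
}

// appendMerger keeps every input value as its own element in the output.
func appendMerger(inputs [][][]byte) [][]byte {
    var out [][]byte
    for _, list := range inputs {
        out = append(out, list...)
    }
    return out
}

// mergers is the kind of lookup a :MERGERNAME suffix would be resolved in.
var mergers = map[string]merger{"ConCat": conCat, "Append": appendMerger}

func main() {
    in := [][][]byte{{[]byte("a"), []byte("b")}, {[]byte("c")}}
    fmt.Printf("%q\n", mergers["ConCat"](in)) // ["abc"]
    fmt.Printf("%q\n", mergers["Append"](in)) // ["a" "b" "c"]
}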
Encodings
To enable the merge operations to do more interesting things, and to make it simpler for users of god to store more interesting things than raw byte slices in the database, there are predefined encodings for storing integers, floats and bigints as byte slices.
If such an encoding is used, it is also possible to use more clever mergers, such as IntegerSum, FloatMul or BigIntAnd.
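As an illustration of why an encoding makes such mergers possible, here is a sketch of an IntegerSum style merger over a simple big endian encoding; god's own encoding helpers may use a different format:

package main

import (
    "encoding/binary"
    "fmt"
)

// encodeInt64/decodeInt64 sketch an integer encoding; the actual wire format
// used by god's helpers may differ.
func encodeInt64(i int64) []byte {
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, uint64(i))
    return b
}

func decodeInt64(b []byte) int64 {
    return int64(binary.BigEndian.Uint64(b))
}

// integerSum decodes every input value, sums them and returns the encoded
// sum as a single element list, in the spirit of the IntegerSum merger.
func integerSum(inputs [][][]byte) [][]byte {
    var sum int64
    for _, list := range inputs {
        for _, value := range list {
            sum += decodeInt64(value)
        }
    }
    return [][]byte{encodeInt64(sum)}
}

func main() {
    in := [][][]byte{{encodeInt64(3)}, {encodeInt64(4), encodeInt64(5)}}
    fmt.Println(decodeInt64(integerSum(in)[0])) // 12
}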
Parameters
Set expressions can be given start and stop conditions, such as the minimum key to start with, the maximum key to look at, or the maximum number of items to return. This way even the set expressions can be paginated and treated as slices.
If a set expression also has a destination provided, the results of the set expression will not be returned to the caller, but put inside the sub tree defined by the destination key.
Persistence
Log files
Individual god nodes survive restarts without losing data, since they constantly synchronize with their sibling nodes. But when the entire cluster needs to be restarted, some kind of disk based persistence is needed.
For that purpose a log file based persistence engine logs all mutating operations (put, delete, clear, configure) performed by each node and streams them to disk.
Snapshots
Since the same keys will often be changed many times over during the lifetime of a database, log files alone would quickly prove inefficient, since a single log file would contain multiple puts, deletes and new puts of the same key.
To alleviate this, snapshot files can be created that contain the merged sum of all operations in a set of log files. Only the latest operation on each key is recorded in the snapshot, getting rid of the redundant operations.
god does this by reading the latest snapshot and all log files following it into a set of nested maps, overwriting and deleting the map data as new operations stream in from the log files. Then the map set is streamed into a snapshot file.
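In outline, the compaction looks something like this sketch, which replays an operation stream into a map in memory and keeps only the surviving state (disk I/O, sub trees and timestamps are left out for brevity):

package main

import "fmt"

type opKind int

const (
    put opKind = iota
    del
)

// op is one logged mutation; the real log entries also carry timestamps,
// sub tree information and configuration changes.
type op struct {
    kind  opKind
    key   string
    value string
}

// compact replays operations in log order; later operations on a key
// overwrite earlier ones and deletes drop the key, so the result is exactly
// what a fresh snapshot should contain.
func compact(ops []op) map[string]string {
    state := map[string]string{}
    for _, o := range ops {
        switch o.kind {
        case put:
            state[o.key] = o.value
        case del:
            delete(state, o.key)
        }
    }
    return state
}

func main() {
    logged := []op{
        {put, "a", "1"},
        {put, "a", "2"},
        {put, "b", "3"},
        {del, "a", ""},
    }
    fmt.Println(compact(logged)) // map[b:3]
}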
Automatic snapshotting
To avoid cluttering the disk with huge log files containing redundant data, god automatically creates a new snapshot whenever the last log file grows too large. This is done in a separate goroutine and thus does not have to take away from the performance of the node, but it does require extra memory to build the map set.
Routing
Notify
Each node in the cluster regularly notifies its successor of its presence. The successor, in its turn, will add the notifier to its routing table and respond with the node it considers to be its predecessor.
The notifier will compare the returned predecessor to itself, and add it to its own routing table if different.
Ping
Each node also takes responsibility for regularly pinging its successor with a packet containing itself and a hash of its routing table.
The predecessor will add the successor to its routing table and compare the hash with that of its own routing table. If the hashes differ, it will also request the complete routing table from the successor.
The predecessor will then clean its routing table by removing all nodes between its confirmed (being a notifier) predecessor and itself, so that nodes having been removed after a failed ping will not get reintroduced by a successor.
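The hash comparison itself can be sketched like this, with sha1 standing in for whatever digest god actually uses over the routing table entries; only when the hashes disagree does a full table need to be transferred:

package main

import (
    "bytes"
    "crypto/sha1"
    "fmt"
)

// remote is a routing table entry: a position on the ring and an address.
type remote struct {
    position []byte
    address  string
}

// tableHash digests the whole table so two nodes can detect any difference
// with a single comparison.
func tableHash(table []remote) []byte {
    h := sha1.New()
    for _, node := range table {
        h.Write(node.position)
        h.Write([]byte(node.address))
    }
    return h.Sum(nil)
}

// reconcile is what the comparing side of a ping can do: keep its own table
// when the hashes agree, and only otherwise fetch the peer's complete table
// (which the real node then merges and cleans rather than adopting as is).
func reconcile(own []remote, peerHash []byte, fetchPeerTable func() []remote) []remote {
    if bytes.Equal(tableHash(own), peerHash) {
        return own
    }
    return fetchPeerTable()
}

func main() {
    a := []remote{{[]byte{1}, "node-a:9191"}, {[]byte{7}, "node-b:9191"}}
    b := append(append([]remote{}, a...), remote{[]byte{12}, "node-c:9191"})
    merged := reconcile(a, tableHash(b), func() []remote { return b })
    fmt.Println(len(merged)) // 3: the tables differed, so the full table was fetched
}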
Route changes
If a node notices that a neighbour is missing, the missing node will be removed from its routing table and the new table will propagate through the ring in a counter-clockwise direction, from successor to predecessor.
Since the responsibility for keys rests with the successor of the key, and the successor will be one of the first nodes to notice a failure, this helps preserve the integrity of responsibility in the cluster.
Connection
net/rpc client
To make sure that the native go client fails as seldom as possible when the cluster configuration changes, the client has an entire routing table cached. Whenever an error occurs when doing an operation against a node, that node is removed from the routing table and a new complete table is fetched from a randomly selected node of those still in the table.
As even this scenario is troublesome, since it delays the current operation, client.Conn#Start triggers a periodic comparison of the routing table against randomly chosen nodes in the cluster, where any difference will cause the table to be downloaded again.
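The failover strategy can be sketched like this, with placeholder types instead of the real client and net/rpc plumbing:

package main

import (
    "errors"
    "fmt"
    "math/rand"
)

// table is a stand-in for the cached routing table: just node addresses.
type table []string

// call is a placeholder for an actual net/rpc call against addr.
func call(addr, op string) error {
    if addr == "dead:9191" {
        return errors.New("connection refused")
    }
    fmt.Println("ran", op, "against", addr)
    return nil
}

// fetchTable is a placeholder for downloading the full routing table.
func fetchTable(addr string) table {
    return table{"alive:9191"}
}

// doWithFailover retries the operation, dropping the failing node from the
// cached table and refreshing the table from a random survivor on each error.
func doWithFailover(cached table, pick func(table) string, op string) error {
    for len(cached) > 0 {
        addr := pick(cached)
        if err := call(addr, op); err == nil {
            return nil
        }
        var rest table
        for _, a := range cached {
            if a != addr {
                rest = append(rest, a)
            }
        }
        if len(rest) == 0 {
            break
        }
        cached = fetchTable(rest[rand.Intn(len(rest))])
    }
    return errors.New("no reachable nodes")
}

func main() {
    pick := func(t table) string { return t[0] }
    fmt.Println(doWithFailover(table{"dead:9191", "alive:9191"}, pick, "Put"))
}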
JSON/HTTP API
The JSON API is unfortunately not this clever, and any failover or error handling must be handled by the client code.
Responsibility
Successors
Like in Chord/DHash, each node is responsible for all items between its predecessor and itself. Since the predecessor is the one node each node keeps pinging, this works out well since we can trust each node to know if a key is between its predecessor and itself.
Redundancy
To allow for transparent failover, where the new owner of a set of data already has it available, god lets each node keep a backup of the data of a number of its predecessors. To ensure this, any node receiving a modifying command will forward it to its N successors.
If a node disappears, this will allow the cluster to stay in a valid state, since the new owner already has the backup (now master) copy of the data in its local storage.
Migrations
To enable efficient separation of responsibility without forcing all keys to be hashed, god makes nodes migrate along the namespace to balance their responsibilities between themselves.
All nodes in the cluster periodically ask their successors about the amount of data they are responsible for, and if the successor is responsible for too much data compared to the asking node, the asking node will move backwards in the namespace to hand off responsibility for a set of keys to the successor.
Not forcing hashed keys would allow, for example, data for a geographical region to be prefixed with a region code, making data with that prefix end up on designated servers placed in the same geographical region. The cluster could also be configured so that all backups for data positioned in one region are positioned in other regions, for greater safety in the face of disasters of different kinds.
Currently there is no option to either limit node migrations (to force nodes in one datacenter to stay in one part of the namespace), or to force backups to live on nodes in other datacenters, but it would be conceptually fairly easy to add in a future version if the need arose.
Timestamps
Latest write wins
Since god is intended to work with hundreds or even thousands of nodes, it is likely that some nodes will experience temporary outages or net splits. Since god has neither eventual consistency nor versioned entries, a situation where the cluster is split into isolated parts that are still accessed by clients will lead to the rejoined cluster containing a mix of the latest writes from all rejoined parts.
However, as long as the disconnected parts of the cluster are not used for writing (being crashed, completely disconnected or just turned off), timestamps do a sufficient job of avoiding conflicts.
Undone changes
When a node is somehow disconnected and reconnected without being emptied of all data, it will potentially contain stale values for all its contained keys.
To avoid this stale data being redistributed across the cluster, each value has a timestamp and newer timestamps always override older timestamps.
Zombies
If a node gets disconnected and reconnected without being cleared in between, it could conceivably contain data that was removed in its absence.
To avoid these values being reanimated and haunting you as zombie data, each delete or clear (except the kill command, ironically) will result in tombstones replacing the data you remove. These tombstones are timestamped just like any other piece of data, and will disperse through the cluster just like a put.
A potential problem with tombstones to protect against zombies is removing them when the zombie fright is over. god will keep tombstones for at least 24 hours to give users time to fix net splits or ensure that crashed servers are cleaned before being put online again.
To avoid a big performance hit every 24 hours after a spike in deletions/second, god currently only removes tombstones in a lazy fashion. This means that whenever a tree is updated, every node touched by the update will remove siblings that are tombstones older than 24 hours. Sub trees marked as cleared more than 24 hours ago will also be removed completely.
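A sketch of the lazy expiry check; in god the cleanup is interleaved with the normal update path rather than run as a sweep:

package main

import (
    "fmt"
    "time"
)

// entry is a simplified tree entry: either a live value or a tombstone.
type entry struct {
    value     []byte
    tombstone bool
    timestamp time.Time
}

// tombstoneAge matches the 24 hour window described above.
const tombstoneAge = 24 * time.Hour

// expired reports whether an entry is a tombstone old enough to collect.
func expired(e entry, now time.Time) bool {
    return e.tombstone && now.Sub(e.timestamp) > tombstoneAge
}

func main() {
    now := time.Now()
    entries := map[string]entry{
        "live":  {value: []byte("x"), timestamp: now},
        "fresh": {tombstone: true, timestamp: now.Add(-1 * time.Hour)},
        "stale": {tombstone: true, timestamp: now.Add(-48 * time.Hour)},
    }
    // In god this cleanup happens as a side effect of unrelated updates
    // touching these siblings, not as a periodic sweep.
    for key, e := range entries {
        if expired(e, now) {
            delete(entries, key)
        }
    }
    fmt.Println(len(entries)) // 2: the live value and the fresh tombstone remain
}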
This will, with great likelihood, keep collecting tombstones from a database that is continuously modified with the same access pattern, but will likely leave many tombstones behind if the access patterns change a lot over time, so that the same area of the database is never touched again.
Time network
To make sure that the timestamps as far as possible reflect the actual sequence of events in the cluster, and avoid putting additional burdens on the operations part of things, god contains its own time network.
The time network in god is an extremely simple one, though, and makes no claim to knowing the real time at any point. It simply synchronizes clocks randomly across the cluster, and for each node updates an offset to the local cpu clock so that they converge on as small a difference as possible.
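One way such a convergence can work is sketched below: after asking a peer for its notion of the time, a node moves its own offset halfway toward the peer's. Whether god uses exactly this adjustment is an assumption; the point is that repeated random pairwise nudges make the offsets converge without any absolute time reference.

package main

import (
    "fmt"
    "time"
)

// clock is a local cpu clock plus the offset adjusted during time
// synchronization.
type clock struct {
    offset time.Duration
}

func (c *clock) now() time.Time {
    return time.Now().Add(c.offset)
}

// sync nudges this clock halfway toward a peer's reported time.
func (c *clock) sync(peerNow time.Time) {
    c.offset += peerNow.Sub(c.now()) / 2
}

func main() {
    a := &clock{}
    b := &clock{offset: 3 * time.Second}
    for i := 0; i < 5; i++ {
        a.sync(b.now())
        b.sync(a.now())
    }
    fmt.Println(b.now().Sub(a.now())) // close to zero
}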
Synchronization
Ensuring backups
To make sure that the data in god is always distributed the way it should be, with backups of each key segment on the nodes succeeding the first successor of the segment, each node regularly synchronizes the data it is responsible for with its successors.
This synchronization will compare contents and timestamps, and make sure that all nodes supposed to hold a segment hold the same data, and that this data is the data with the latest timestamps.
Merkle trees
To allow nodes to compare any segment of their database to each other without having to iterate through entire key sets, the radix trees in god also contain hashes of their sub trees, in essence turning them into Merkle trees.
This makes comparing two equal trees a single byte slice comparison, and makes finding the differences between two unequal trees a matter of recursing down only those child nodes whose hashes differ.
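A sketch of that recursion over a simplified Merkle node; in god the subtree hash is stored in the radix tree node and kept up to date on writes, whereas here it is recomputed on every call to keep the example short:

package main

import (
    "bytes"
    "crypto/sha1"
    "fmt"
    "sort"
)

type mnode struct {
    value    string
    children map[string]*mnode
}

// hash digests the node's own value together with its children's hashes, so
// the root hash covers the entire tree.
func (n *mnode) hash() []byte {
    h := sha1.New()
    h.Write([]byte(n.value))
    keys := make([]string, 0, len(n.children))
    for k := range n.children {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        h.Write([]byte(k))
        h.Write(n.children[k].hash())
    }
    return h.Sum(nil)
}

// diff prints the paths where two trees disagree, never descending into
// subtrees whose hashes already match.
func diff(path string, a, b *mnode) {
    if a != nil && b != nil && bytes.Equal(a.hash(), b.hash()) {
        return
    }
    if a == nil || b == nil || a.value != b.value {
        fmt.Println("differs at", path)
    }
    seen := map[string]bool{}
    if a != nil {
        for k, ac := range a.children {
            seen[k] = true
            var bc *mnode
            if b != nil {
                bc = b.children[k]
            }
            diff(path+k, ac, bc)
        }
    }
    if b != nil {
        for k, bc := range b.children {
            if !seen[k] {
                diff(path+k, nil, bc)
            }
        }
    }
}

func main() {
    a := &mnode{children: map[string]*mnode{"x": {value: "1"}, "y": {value: "2"}}}
    b := &mnode{children: map[string]*mnode{"x": {value: "1"}, "y": {value: "3"}}}
    diff("", a, b) // differs at y
}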
Cleaning
Having the data it should have is just half the work. To get rid of the data it shouldn't have, each god node will regularly look for the first key in its data succeeding its own position.
If said key is not between the predecessor of the node and the node itself, the node will start a synchronization of the segment between the misplaced key and its first successor. This synchronization will push this data to all nodes in the cluster that ought to have it (its immediate successor and its backup nodes).
The last synchronization in this cleanup operation will also remove, without leaving tombstones, the data from the cleaning node.
Consequences
Data is written
When data is written from the native Go client, the client will look up the successor of the key in its routing table. Then it will contact the successor and ask it to write the data via a Put command.
If data is written from the HTTP/JSON API, the receiving node will first check its routing table to see if it is the successor of the key. If it is not, it will synchronously forward the command to the proper successor.
When the right node receives a modifying command, it will forward the command to its N successors. If the command is marked to be synchronous, this forward will wait until all successors have responded. If not, the node will simply modify its own tree and respond to the client.
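A sketch of that forwarding step with placeholder types; the redundancy constant plays the role of N, and the synchronous flag decides whether the call waits for the backups:

package main

import (
    "fmt"
    "sync"
)

// redundancy plays the role of N: the number of successors that receive a
// copy of every modifying command.
const redundancy = 3

type put struct {
    key, value     []byte
    waitForBackups bool // a synchronous put waits for the successors
}

type dbnode struct {
    name       string
    data       map[string][]byte
    successors []*dbnode
}

func (n *dbnode) apply(p put) {
    n.data[string(p.key)] = p.value
}

// handlePut is what the node owning the key does: apply locally, forward to
// its N successors, and wait for them only if the command is synchronous.
func (n *dbnode) handlePut(p put) {
    n.apply(p)
    var wg sync.WaitGroup
    for i := 0; i < redundancy && i < len(n.successors); i++ {
        succ := n.successors[i]
        wg.Add(1)
        go func() {
            defer wg.Done()
            succ.apply(p) // stands in for the forwarded command
        }()
    }
    if p.waitForBackups {
        wg.Wait()
    }
}

func main() {
    mk := func(name string) *dbnode { return &dbnode{name: name, data: map[string][]byte{}} }
    a, b, c, d := mk("a"), mk("b"), mk("c"), mk("d")
    a.successors = []*dbnode{b, c, d}
    a.handlePut(put{key: []byte("k"), value: []byte("v"), waitForBackups: true})
    fmt.Println(len(a.data), len(b.data), len(c.data), len(d.data)) // 1 1 1 1
}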
Data is read
When data is read from the HTTP/JSON API, the receiving node will first check its routing table to see if it is the successor of the key. If it is not, it will synchronously forward the command to the proper successor and return with the response.
If the native Go client reads data, it will (for most commands) contact all nodes responsible for having the data in their trees, and return the value with the newest timestamp. This is somewhat more costly than just returning the first value, but more resilient to node churn or failures.
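The read side can be sketched like this, again with placeholder types: ask every node that should hold the key and keep the answer with the newest timestamp.

package main

import "fmt"

type reply struct {
    value     []byte
    timestamp int64
}

// newest picks the most recently written value among the replies, which is
// what makes the read resilient to nodes that are lagging behind.
func newest(replies []reply) []byte {
    var best reply
    for _, r := range replies {
        if r.timestamp >= best.timestamp {
            best = r
        }
    }
    return best.value
}

func main() {
    replies := []reply{
        {[]byte("old"), 10},
        {[]byte("new"), 42}, // e.g. a backup that already saw the latest put
        {[]byte("old"), 10},
    }
    fmt.Printf("%s\n", newest(replies)) // new
}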
Node joins
When a node joins the network, it will immediately start notifying its successor of its presence. The successor in its turn will tell its previous predecessor (when it notifies the successor) of the new node, which will cause the old predecessor to start notifying the new node instead. The predecessor node will now deliver a new hash of the routing table to its predecessor, which will request a new routing table, and in its turn deliver it backwards.
The synchronization protocols will force the N predecessors of the new node to start comparing the parts of their trees that they own with the same parts of the tree of the new node, and to start pushing data to it so that it gets a complete set of all the data it should back up.
At the same time, the new node will start comparing the part of its tree that it owns with the same part of the trees of its N successors, which will make it pull all the data it is responsible for from them. The last node that used to be responsible for backing up the data of the new node will also notice that it now holds data it should not, and start pushing it to the new node, cleaning it from its own tree when done.
Note that the illustration does not show all these operations, since it would make the image completely unintelligible.
Leaving nodes
When a node leaves the network, its predecessor and successor will notice when it stops responding to pings and notifications. This will cause them to remove the node from their routing tables, and start propagating the new tables backwards through the ring.
The successor of the lost node will now be responsible for the namespace the lost node had, and since it already has the backup of this namespace clients will be able to continue accessing the data without interruption. It will also start synchronizing said namespace with its N successors, to make sure there are enough backups.
Returning nodes
When a former member of the cluster returns to the system, it may have old keys (either deleted or changed in the current cluster), and it will try to synchronize them with the rest of the cluster. But since the timestamps of these changed keys will be older than those in the live cluster, they will be overwritten by more current data.
Since the trees in god lazily remove tombstones after 24 hours, operators should take care never to let a node that has been disconnected for more than 24 hours reconnect again.
Other than these time synchronization issues, returning nodes are treated the same as joining nodes.