Comments:"Fast Non-Standard Data Structures for Python"
URL:http://kmike.ru/python-data-structures/
Python provides great built-in types like dict, list, tuple and set; there are also array, collections, heapq modules in the standard library; this article is an overview of external lesser known packages with fast C/C++ based data structures usable from Python.
Note
Disclaimer: I created datrie, marisa-trie, hat-trie and DAWG Python wrappers.
Bloom Filters
Bloom Filter (wiki) is an extremely memory-efficient probabilistic data structure which is used to test whether an element is a member of a set; there may be false positive retrieval results, but false negatives are not possible ("item not in set" query result is always correct).
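To make the "false positives but no false negatives" property concrete, here is a minimal pure-Python sketch of the idea. The `BloomFilter` class, its parameters and the salted SHA-256 hashing are assumptions made for illustration; this is not the API of any of the packages, and a C-based implementation would be far faster.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash functions (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash with the hash index.
        data = item.encode("utf8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for word in ["авиация", "самолёт", "небо"]:
    bf.add(word)

print("авиация" in bf)  # True: added items are always reported as present
```

The whole set lives in 128 bytes here regardless of how long the words are; the price is that a word that was never added may occasionally be reported as present.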
Implementations available for Python:
Arrays
Numpy & Pandas & Scipy
The king of numeric arrays is numpy (home page). It provides several data structures (ndarray, structured arrays) for single- and multi-dimensional numeric data. SciPy provides support for sparse arrays (scipy.sparse) and much more. Pandas (home page) provides extra goodies. There is a lot of information about numpy, pandas and scipy on the Internet; they deserve more than one paragraph, but let's move on.
carray
carray package provides a chunked+compressed data structure for numerical data. It uses less memory than traditional ndarray and provides efficient shrinks and appends (copies of the whole array are not needed).
BList
blist package provides several data structures (blist, sortedlist, weaksortedlist, etc.) that may act like general-purpose containers replacing standard list. Blist uses a hybrid array/tree structure that makes inserts and removals from the middle fast (these operations require moving big memory chunks with standard list).
There is a rejected PEP-3128 about the inclusion of blist into the standard library.
BitArray
bitarray package provides a data structure which represents an array of booleans efficiently (in a bit vector). It is also useful for dealing with bit-level data and data compressed with variable bit length encoding.
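The memory saving comes from packing 8 booleans into each byte. A rough pure-Python sketch of that layout is shown below; the `BitVector` class is hypothetical and only illustrates the idea, it is not the bitarray API.

```python
class BitVector:
    """Toy fixed-size bit vector backed by a bytearray (8 flags per byte)."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray((size + 7) // 8)

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return bool(self.data[index >> 3] & (1 << (index & 7)))

    def __setitem__(self, index, value):
        if not 0 <= index < self.size:
            raise IndexError(index)
        if value:
            self.data[index >> 3] |= 1 << (index & 7)
        else:
            self.data[index >> 3] &= ~(1 << (index & 7)) & 0xFF

    def count(self):
        # Number of bits set to True.
        return sum(bin(byte).count("1") for byte in self.data)


flags = BitVector(1000)
flags[3] = True
flags[999] = True
flags[3] = False
print(flags[999], flags[3], flags.count(), len(flags.data))  # True False 1 125
```

A thousand flags fit in 125 bytes, whereas a list of 1000 Python booleans needs a pointer per element.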
Linked Lists
python-llist package provides a classical linked list extension for Python. There is also a linked list implementation in roly.
Tries
There is no tree/trie/graph structure in Python standard library and pure-Python implementations suffer from extensive memory usage; using a C++/C-based trie (wiki) implementation is a good idea.
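To make the terminology concrete, here is a minimal pure-Python sketch of a trie built from nested dicts. The `DictTrie` class and its methods are made up for illustration and are not any package's API. It shows what exact and prefix lookups mean, and it is exactly the kind of memory-hungry implementation that the C/C++ packages are meant to replace.

```python
class DictTrie:
    """Toy trie built from nested dicts; one dict per character."""

    _END = object()  # sentinel key marking "a word ends here"

    def __init__(self):
        self.root = {}

    def __setitem__(self, key, value):
        node = self.root
        for char in key:
            node = node.setdefault(char, {})
        node[self._END] = value

    def __getitem__(self, key):
        node = self._find(key)
        if node is None or self._END not in node:
            raise KeyError(key)
        return node[self._END]

    def _find(self, prefix):
        node = self.root
        for char in prefix:
            node = node.get(char)
            if node is None:
                return None
        return node

    def keys(self, prefix=""):
        """Return all stored keys that start with `prefix`, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(prefix, start)]
        while stack:
            path, node = stack.pop()
            for char, child in node.items():
                if char is self._END:
                    found.append(path)
                else:
                    stack.append((path + char, child))
        return sorted(found)


trie = DictTrie()
for word in ["АВИА", "АВИАЦИЯ", "АВТОР", "БАНК"]:
    trie[word] = len(word)

print(trie["АВИАЦИЯ"])   # 7
print(trie.keys("АВИ"))  # ['АВИА', 'АВИАЦИЯ']
```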
Note
In the following tests the memory usage was measured for 3 million unique unicode Russian words; "simple lookup" was a lookup for the word "АВИАЦИЯ".
If I understood the code properly, Biopython's trie module (Bio.trie) is a pointer-based implementation of a PAT-trie (aka Patricia-Trie and Radix-Trie, wiki) and may use a lot of memory because of that. It doesn't work under Python 3.x and doesn't directly support unicode.
All trie operations (exact lookups, prefix lookups, inserts & updates) are fast & efficient.
Example:
>>> from Bio import trie
>>> tr = trie.trie()
>>> for word in words:
...     tr[word.encode('utf8')] = len(word)
>>> tr['АВИАЦИЯ']
7
Judy Arrays
Judy Arrays (wiki) are known to be a very fast but obscure data structure, heavily optimized for 32bit systems. Unfortunately I was not able to install any of the PyJudy, py-judy or py-judy2 Python wrappers, so I have nothing more to say about Judy Arrays :)
HAT-Trie
HAT-Trie (pdf) is a Trie-HashMap hybrid. It is claimed to be the state-of-the-art Trie-like structure with the fastest lookups.
I've started a hat-trie Python wrapper for the very nice C HAT-Trie implementation by Daniel Jones, but never finished it. The wrapper is not polished and needs more love but the basics (trie building and exact lookups) are implemented.
Benchmarks show this trie is indeed fast (the wrapper bottleneck is Python unicode<->bytes conversion, not the trie itself). It is not very memory efficient and some operations taken for granted for tries (like prefix search) may be slow-ish and/or hard to implement for HAT-tries.
Example:
>>> import hat_trie
>>> trie = hat_trie.Trie()
>>> for word in words:
...     trie[word] = len(word)
>>> trie[u'АВИАЦИЯ']
7
Python-CharTrie
As far as I can tell, python-chartrie provides a pointer-based implementation of the classic Trie data structure; it is very fast but not memory efficient; unicode is not directly supported.
Example:
>>> import chartrie
>>> trie = chartrie.CharTrie()
>>> for word in words:
...     trie[word.encode('utf8')] = len(word)
>>> trie['АВИАЦИЯ']
7
DATrie
datrie is a Python wrapper for the Double-Array Trie C implementation (home page) by Theppitak Karoonboonyanan. The library has a rich API (including advanced iteration and walking), is quite fast, works under Python 2.x and 3.x and supports unicode.
The limitation of this library is that inserting items into the trie may be slow, especially if insertions are unsorted and the trie is big. Another limitation is that the alphabet for the keys must be defined by the developer at trie creation time.
The Python wrapper uses the utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is hope datrie will become faster with future Pythons.
Example:
>>> import datrie
>>> ALPHABET = u'-АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'
>>> trie = datrie.BaseTrie(ALPHABET)
>>> for word in words:
...     trie[word] = len(word)
>>> trie[u'АВИАЦИЯ']
7
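Because the alphabet has to be fixed up front, one option when all keys are known in advance is to derive it from the keys instead of typing it by hand. A small sketch follows; `build_alphabet` is a hypothetical helper, not part of datrie, and the datrie call is left as a comment so the snippet runs without the package installed.

```python
def build_alphabet(words):
    """Collect every distinct character used by the keys (hypothetical helper)."""
    return u"".join(sorted(set(u"".join(words))))


words = [u"АВИАЦИЯ", u"КТО-ТО"]
ALPHABET = build_alphabet(words)
print(ALPHABET)  # -АВИКОТЦЯ

# With datrie installed, the string would be passed as in the example above:
# trie = datrie.BaseTrie(ALPHABET)
```

Keys added later that use characters outside this alphabet would not be covered, so this only helps when the key set is known at creation time.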
MARISA-Trie
MARISA-trie is a very memory-efficient recursive LOUDS-trie-based data structure by Susumu Yata implemented as a C++ library (repo). The library supports memory-mapped IO so it is possible to have an on-disk trie and reduce the memory usage even further.
This library has 2 Python wrappers: official SWIG-based (included in C++ library distribution) and unofficial Cython-based (which is faster, can be installed via pip and has a different API). The benchmark data above is for the RecordTrie from the unofficial wrapper.
The unofficial Python wrapper, which is also named marisa-trie, allows not only the keys to be represented densely, but the values as well; this could be a big win for some applications in terms of memory usage.
It is possible to store several values for the same key and return these values in sorted order with marisa_trie.RecordTrie.
The limitation of marisa-trie is that it is a static data structure: it is possible to "build" a trie, but it is not possible to change it after building.
Example:
>>> import marisa_trie
>>> data_format = '<B'
>>> data = zip(words, ((len(w),) for w in words))
>>> trie = marisa_trie.RecordTrie(data_format, data)
>>> trie[u'АВИАЦИЯ']
[(7,)]
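The '<B' string in the example looks like a format string for Python's standard struct module, and as far as I know that is how RecordTrie interprets it (treat this as an assumption). The sketch below uses only struct, not marisa_trie, to show why each value comes back as a 1-tuple and how compact one record is.

```python
import struct

data_format = "<B"  # little-endian, one unsigned byte per record

packed = struct.pack(data_format, 7)
print(packed)                              # b'\x07'
print(struct.unpack(data_format, packed))  # (7,)
print(struct.calcsize(data_format))        # 1

# "<B" only holds 0..255; a larger value needs a wider format such as "<H".
try:
    struct.pack(data_format, 300)
except struct.error as exc:
    print("does not fit:", exc)
```

Each word length costs a single byte here, which is where the dense value storage comes from.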
Graphs & Specialized Automata
Directed Acyclic Word Graphs
I'm aware of 2 DAWG C/C++ based libraries for Python:
Unfortunately it seems the setup.py for pyDAWG doesn't work; I tried to contact the author but he didn't respond for some reason.
DAWG is a Python wrapper for dawgdic C++ DAWG implementation by Susumu Yata with an interface similar to marisa-trie.
Example:
>>> import dawg
>>> data = zip(words, (len(w) for w in words))
>>> d = dawg.IntDAWG(data)
>>> d[u'АВИАЦИЯ']
7
General Purpose Graph Libraries
igraph and graph-tool packages are both mature and well supported.
There is also the NetworkX package, which uses numpy for number crunching.
Aho-Corasick Automaton
Aho-Corasick automaton is a data structure that can quickly do a multiple-keyword search across text.
There is a C-based Python extension module called ahocorasick and a Cython-powered acora extension, both providing an Aho-Corasick automaton.
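For readers who have not met the structure before, here is a compact pure-Python sketch of how such an automaton finds all keywords in a single pass over the text. The function names `build_automaton` and `find_all` are invented for this example; they are not the API of ahocorasick or acora, and the extensions will be much faster.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto/fail/output tables for a toy Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for char in word:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        out[state].append(word)

    # Breadth-first pass: a state's failure link points at the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (start_index, keyword) for every match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    for index, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for word in out[state]:
            yield index - len(word) + 1, word


automaton = build_automaton(["he", "she", "his", "hers"])
print(list(find_all("ushers", automaton)))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The text is scanned once no matter how many keywords there are, which is the point of the structure compared with searching for each keyword separately.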
Trees
Note
I have tried none of these tree packages.
- There is a persistent Balanced Tree (wiki) implementation in ZODB (BTrees);
- Binary- (wiki), RedBlack- (wiki) and AVL (wiki) trees are provided by bintrees module;
- there is a package named patricia-tree that doesn't provide a Patricia-Tree [sic] but has a C-based implementation of Ternary Search Tree (wiki);
- pytst C++ library for Ternary Search Tree has Python bindings;
- rbtree is a fast Red-Black Tree implementation for Python.
Locality Sensitive Hashtables
A Locality Sensitive Hashtable (see wiki) maps high-dimensional data to lower-dimensional data; it differs from a classical hashtable in its ability to hash similar (in some sense) items to the same bucket.
LSHash package (by Kay Zhu) provides a locality sensitive hashtable based on numpy; it supports several distance functions (like cosine similarity) out of the box, and it is possible to define your own similarity functions.
It may be useful e.g. for detecting near-duplicates.
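One common way to get this "similar items share a bucket" behaviour for cosine similarity is random hyperplane hashing. The sketch below is a pure-Python illustration of that idea only; the `RandomHyperplaneLSH` class and its parameters are assumptions, and it is not the LSHash API nor necessarily the scheme LSHash uses internally.

```python
import random


class RandomHyperplaneLSH:
    """Toy LSH index for cosine similarity using random hyperplanes."""

    def __init__(self, dimensions, num_bits=8, seed=0):
        rng = random.Random(seed)
        # Each hyperplane is a random normal vector; it contributes one bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dimensions)]
                       for _ in range(num_bits)]
        self.buckets = {}

    def signature(self, vector):
        # Bit i records on which side of hyperplane i the vector lies.
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0
            for plane in self.planes
        )

    def index(self, vector, label):
        self.buckets.setdefault(self.signature(vector), []).append(label)

    def query(self, vector):
        # Candidates sharing the bucket; a real index would rank them next.
        return self.buckets.get(self.signature(vector), [])


lsh = RandomHyperplaneLSH(dimensions=3)
lsh.index([1.0, 2.0, 3.0], "a")
print(lsh.query([2.0, 4.0, 6.0]))  # ['a']: same direction, same bucket
```

Vectors pointing in nearly the same direction fall on the same side of most hyperplanes, so they are likely (not guaranteed) to share a bucket; more bits make the buckets finer.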
Pure-Python data structures
There is an even greater number of pure-Python implementations for different data structures; they are not covered in this overview article intentionally.
This overview is biased towards tries, sorry for that, that's just a personal bias :)
If you know a C/C++/Cython-based Python extension that is not mentioned in this overview please let me know; comments and corrections are very welcome!