
Fast Non-Standard Data Structures for Python


URL:http://kmike.ru/python-data-structures/


Sat 01 September 2012, by Mikhail Korobov

Python provides great built-in types like dict, list, tuple and set; there are also the array, collections and heapq modules in the standard library. This article is an overview of lesser-known external packages that provide fast C/C++-based data structures usable from Python.

Note

Disclaimer: I created datrie, marisa-trie, hat-trie and DAWG Python wrappers.

Bloom Filters

Bloom Filter (wiki) is an extremely memory-efficient probabilistic data structure which is used to test whether an element is a member of a set; there may be false positive retrieval results, but false negatives are not possible ("item not in set" query result is always correct).
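To make the "false positives but no false negatives" property concrete, here is a minimal pure-Python sketch of a Bloom filter (Python 3). It is not taken from any of the packages in this overview; the class name, the default sizes and the choice of salted SHA-256 as the hash family are all assumptions made for illustration, and a C-based implementation would be far faster.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical): k hash positions per item in an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with an index.
        data = item.encode("utf8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for word in ["авиация", "trie", "dawg"]:
    bf.add(word)
```

Every added item is reported as present because all of its bits were set; an item that was never added is reported as absent unless all of its positions happen to collide with bits set by other items.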

Implementations available for Python:

Arrays

Numpy & Pandas & Scipy

The king of numeric arrays is numpy (home page). It provides several data structures (ndarray, structured arrays) for single- and multi-dimensional numeric data. SciPy provides support for sparse arrays (scipy.sparse) and much more. Pandas (home page) provides extra goodies. There is a lot of information about numpy, pandas and scipy on the Internet; they deserve more than one paragraph, but let's move on.

carray

The carray package provides a chunked, compressed data structure for numerical data. It uses less memory than a traditional ndarray and supports efficient shrinking and appending (copies of the whole array are not needed).

BList

The blist package provides several data structures (blist, sortedlist, weaksortedlist, etc.) that can act as general-purpose containers replacing the standard list. Blist uses a hybrid array/tree structure that makes inserts into and removals from the middle fast (with a standard list these operations require moving big chunks of memory).

There is a rejected PEP-3128 about the inclusion of blist into the standard library.

BitArray

The bitarray package provides a data structure which represents an array of booleans efficiently (as a bit vector). It is also useful for dealing with bit-level data and with data compressed using variable-bit-length encodings.
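The idea of a bit vector can be sketched in a few lines on top of the built-in bytearray. This is only an illustration of the storage layout (8 booleans per byte); the BitVector class below is hypothetical and is not the bitarray API, which is implemented in C and offers much more.

```python
class BitVector:
    """Hypothetical fixed-size array of booleans packed 8 per byte."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray((size + 7) // 8)  # all bits start as False

    def _check(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        self._check(index)
        return bool(self.data[index >> 3] & (1 << (index & 7)))

    def __setitem__(self, index, value):
        self._check(index)
        if value:
            self.data[index >> 3] |= 1 << (index & 7)
        else:
            self.data[index >> 3] &= ~(1 << (index & 7)) & 0xFF


flags = BitVector(20)
flags[3] = True
flags[17] = True
```

Twenty booleans fit into three bytes here, whereas a list of twenty bool objects needs at least one pointer per element.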

Linked Lists

The python-llist package provides a classical linked list extension for Python. There is also a linked list implementation in roly.

Tries

There is no tree/trie/graph structure in the Python standard library, and pure-Python implementations suffer from excessive memory usage, so using a C/C++-based trie (wiki) implementation is a good idea.
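For readers who have not used a trie before, here is a minimal pure-Python sketch of the operations discussed in this section (insert, exact lookup, prefix lookup). It is exactly the kind of dict-of-dicts implementation that is memory-hungry; the DictTrie class and its method names are hypothetical and do not match the API of any package below.

```python
_VALUE = object()  # sentinel key: "a stored key ends at this node"


class DictTrie:
    """Hypothetical nested-dict trie: one dict per node, one entry per character."""

    def __init__(self):
        self.root = {}

    def __setitem__(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_VALUE] = value

    def _find(self, prefix):
        # Walk down the trie; return the node for `prefix` or None.
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __getitem__(self, key):
        node = self._find(key)
        if node is None or _VALUE not in node:
            raise KeyError(key)
        return node[_VALUE]

    def keys(self, prefix=""):
        """Return all stored keys that start with `prefix`, sorted."""
        found = []
        start = self._find(prefix)
        stack = [(prefix, start)] if start is not None else []
        while stack:
            key, node = stack.pop()
            for ch, child in node.items():
                if ch is _VALUE:
                    found.append(key)
                else:
                    stack.append((key + ch, child))
        return sorted(found)


trie = DictTrie()
for word in ["авиация", "авиатор", "автор"]:
    trie[word] = len(word)
```

With this data, trie["авиация"] returns 7 and trie.keys("ави") returns the two words sharing that prefix. Each node is a full Python dict, which is where the memory goes.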

Note

In the following tests the memory usage was measured for 3 million unique unicode Russian words; "simple lookup" was a lookup for the word "АВИАЦИЯ".

Biopython Trie

  • License: Biopython License (it is extremely liberal)
  • Memory usage: 242M
  • Simple lookup: 333 ns (1004 ns with encoding)
  • Unicode: no
  • Python: 2.x

If I understood the code properly, this is a pointer-based implementation of a PAT-trie (aka Patricia-Trie and Radix-Trie, wiki) and may use a lot of memory because of that. It doesn't work under Python 3.x and doesn't directly support unicode.

All trie operations (exact lookups, prefix lookups, inserts & updates) are fast & efficient.

Example:

>>> from Bio import trie
>>> tr = trie.trie()
>>> for word in words:
...     tr[word.encode('utf8')] = len(word)
>>> tr['АВИАЦИЯ']
7

Judy Arrays

Judy Arrays (wiki) are known to be a very fast but obscure data structure heavily optimized for 32-bit systems. Unfortunately I was not able to install any of the PyJudy, py-judy or py-judy2 Python wrappers, so I have nothing more to say about Judy Arrays :)

HAT-Trie

  • License: MIT
  • Memory usage: 125M
  • Simple lookup: 195 ns
  • Unicode: yes
  • Python: 2.x and 3.x

HAT-Trie (pdf) is a trie-hashmap hybrid. It is claimed to be the state-of-the-art trie-like structure with the fastest lookups.

I've started a hat-trie Python wrapper for the very nice C HAT-Trie implementation by Daniel Jones, but never finished it. The wrapper is not polished and needs more love but the basics (trie building and exact lookups) are implemented.

Benchmarks show this trie is indeed fast (the wrapper bottleneck is Python unicode<->bytes conversion, not the trie itself). It is not very memory efficient and some operations taken for granted for tries (like prefix search) may be slow-ish and/or hard to implement for HAT-tries.

Example:

>>> import hat_trie
>>> trie = hat_trie.Trie()
>>> for word in words:
...     trie[word] = len(word)
>>> trie[u'АВИАЦИЯ']
7

Python-CharTrie

  • License: BSD
  • Memory usage: 194M
  • Simple lookup: 175 ns (840 ns with encoding)
  • Unicode: no
  • Python: 2.x and 3.x

As far as I can tell, python-chartrie provides a pointer-based implementation of the classic Trie data structure; it is very fast but not memory efficient; unicode is not directly supported.

Example:

>>> import chartrie
>>> trie = chartrie.CharTrie()
>>> for word in words:
...     trie[word.encode('utf8')] = len(word)
>>> trie['АВИАЦИЯ']
7

DATrie

  • License: LGPL v2.1
  • Memory usage: 101M
  • Simple lookup: 281 ns
  • Unicode: yes
  • Python: 2.x and 3.x

datrie is a Python wrapper for the Double-Array Trie C implementation (home page) by Theppitak Karoonboonyanan. The library has a rich API (including advanced iteration and walking), is quite fast, works under Python 2.x and 3.x and supports unicode.

The limitation of this library is that inserting items into the trie may be slow, especially if insertions are unsorted and the trie is big. Another limitation is that the alphabet for the keys must be defined by the developer at trie creation time.

The Python wrapper uses the utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is hope that datrie will become faster with future Pythons.
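To see what this per-key conversion involves, the snippet below (Python 3, standard library only) encodes the benchmark word with the two codecs mentioned in this article. It only illustrates the encodings themselves; how exactly the wrappers call the codecs is not shown here.

```python
word = "АВИАЦИЯ"

# utf_32_le: a fixed 4 bytes per code point, little-endian, no byte order mark.
as_utf32 = word.encode("utf_32_le")

# utf8: variable width; each of these Cyrillic letters takes 2 bytes.
as_utf8 = word.encode("utf8")

sizes = (len(word), len(as_utf32), len(as_utf8))  # (7, 28, 14)

# Both encodings round-trip back to the same text.
round_trip_ok = (as_utf32.decode("utf_32_le") == word
                 and as_utf8.decode("utf8") == word)
```

Every lookup with a unicode key pays for one such conversion, which is why the codec speed shows up in the benchmark numbers.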

Example:

>>> import datrie
>>> ALPHABET = u'-АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'
>>> trie = datrie.BaseTrie(ALPHABET)
>>> for word in words:
...     trie[word] = len(word)
>>> trie[u'АВИАЦИЯ']
7

MARISA-Trie

  • License: MIT/BSD
  • Memory usage: 11M
  • Simple lookup: 2010 ns
  • Unicode: yes
  • Python: 2.x and 3.x

MARISA-trie is a very memory-efficient recursive LOUDS-trie-based data structure by Susumu Yata, implemented as a C++ library (repo). The library supports memory-mapped IO, so it is possible to have an on-disk trie and reduce the memory usage even further.

This library has 2 Python wrappers: the official SWIG-based one (included in the C++ library distribution) and an unofficial Cython-based one (which is faster, can be installed via pip and has a different API). The benchmark data above is for the RecordTrie from the unofficial wrapper.

The unofficial Python wrapper, which is also named marisa-trie, allows not only the keys but also the values to be represented densely; this could be a big win for some applications in terms of memory usage.

It is possible to store several values for the same key and return these values in sorted order with marisa_trie.RecordTrie.

The limitation of marisa-trie is that it is a static data structure: it is possible to "build" a trie, but it is not possible to change it after building.

Example:

>>> import marisa_trie
>>> data_format = '<B'
>>> data = zip(words, ((len(w),) for w in words))
>>> trie = marisa_trie.RecordTrie(data_format, data)
>>> trie[u'АВИАЦИЯ']
[(7,)]
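The data_format string in the example above looks like a format string for the standard struct module, and as far as I can tell that is how the wrapper interprets it. Assuming so, the snippet below shows what '<B' means and why values are returned as tuples; the two-field format at the end is a made-up example, not something from the benchmark.

```python
import struct

data_format = "<B"  # little-endian, one unsigned byte (values 0..255)

packed = struct.pack(data_format, 7)          # the word length from the example
record_size = struct.calcsize(data_format)    # bytes needed per value
unpacked = struct.unpack(data_format, packed) # always a tuple, hence (7,)

# Hypothetical wider record: two unsigned 16-bit integers per value.
wide_size = struct.calcsize("<HH")
```

One byte per value is enough for word lengths, which helps explain the very low memory usage reported above.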

Graphs & Specialized Automata

Directed Acyclic Word Graphs

I'm aware of 2 C/C++-based DAWG libraries for Python: pyDAWG and DAWG.

Unfortunately it seems the setup.py for pyDAWG doesn't work; I tried to contact the author but he didn't respond for some reason.

DAWG is a Python wrapper for the dawgdic C++ DAWG implementation by Susumu Yata, with an interface similar to marisa-trie.

  • License: MIT/BSD
  • Memory usage: 2.8M
  • Simple lookup: 249 ns
  • Unicode: yes
  • Python: 2.x and 3.x

Example:

>>> import dawg
>>> data = zip(words, (len(w) for w in words))
>>> d = dawg.IntDAWG(data)
>>> d[u'АВИАЦИЯ']
7

General Purpose Graph Libraries

The igraph and graph-tool packages are both mature and well supported.

There is also the NetworkX package, which uses numpy for number crunching.

Aho-Corasick Automaton

Aho-Corasick automaton is a data structure that can quickly do a multiple-keyword search across text.

There is a C-based Python extension module called ahocorasick and a Cython-powered acora extension, both providing an Aho-Corasick automaton.
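To show what "multiple-keyword search" means in practice, here is a compact pure-Python sketch of the classic construction (a keyword trie plus failure links computed breadth-first). The function names and the table layout are assumptions made for this illustration; this is not the API of ahocorasick or acora, and it will be much slower than either.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto/fail/output tables for the given keywords (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass computing failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, keyword) for every match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
```

The text is scanned once regardless of the number of keywords; overlapping matches such as "she" and "he" inside "ushers" are both reported.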

Trees

Note

I have tried none of these tree packages.

  • There is a persistent Balanced Tree (wiki) implementation in ZODB (BTrees);
  • Binary- (wiki), RedBlack- (wiki) and AVL (wiki) trees are provided by bintrees module;
  • there is a package named patricia-tree that doesn't provide a Patricia-Tree [sic] but has a C-based implementation of Ternary Search Tree (wiki);
  • pytst C++ library for Ternary Search Tree has Python bindings;
  • rbtree is a fast Red-Black Tree implementation for Python.

Ropes

A rope (wiki) is a binary tree-based data structure for efficiently storing and manipulating a very long string. There is a Python extension for ropes called pyropes (thanks to Austin Cory Bart for the pointer).

Blist (mentioned earlier) can be seen as a rope generalized for non-string usage.
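The structure of a rope can be sketched as follows: leaves hold string pieces and inner nodes only remember subtree lengths, so concatenation does not copy any characters. The Rope class below is a hypothetical illustration (no balancing, no splitting) and is not the pyropes API.

```python
class Rope:
    """Hypothetical minimal rope: a leaf holds text, an inner node holds two children."""

    def __init__(self, text="", left=None, right=None):
        self.text = text
        self.left = left
        self.right = right
        # Total number of characters stored below this node.
        if left is None:
            self.length = len(text)
        else:
            self.length = left.length + right.length

    def __add__(self, other):
        # Concatenation creates one new node; no characters are copied.
        return Rope(left=self, right=other)

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        node = self
        while node.left is not None:
            if index < node.left.length:
                node = node.left
            else:
                index -= node.left.length
                node = node.right
        return node.text[index]

    def __str__(self):
        # Iterative left-to-right traversal collecting the leaf pieces.
        parts, stack = [], [self]
        while stack:
            node = stack.pop()
            if node.left is None:
                parts.append(node.text)
            else:
                stack.append(node.right)
                stack.append(node.left)
        return "".join(parts)


rope = Rope("Hello, ") + Rope("rope ") + Rope("world")
```

Indexing walks down the tree using the stored lengths, so it costs time proportional to the tree depth rather than to the string length; a real implementation also keeps the tree balanced.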

Locality Sensitive Hashtables

Locality Sensitive Hashtable (see wiki) maps high-dimensional data to lower-dimensional data; it differs from a classical hashtable in its ability to hash similar (in some sense) items to the same bucket.

The LSHash package (by Kay Zhu) provides a locality sensitive hashtable based on numpy; it supports several distance functions (like cosine similarity) out of the box, and it is possible to define your own similarity functions.

It may be useful e.g. for detecting near-duplicates.
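One common way to get this "similar items share a bucket" behaviour for cosine similarity is random hyperplane hashing: each bit of the hash records on which side of a random hyperplane the vector lies. The sketch below uses only the standard library; make_hasher, the bucket dict and the parameter values are assumptions for illustration and say nothing about how the package above is implemented.

```python
import random


def make_hasher(dim, num_bits, seed=0):
    """Return a function mapping a dim-dimensional vector to a bit-string signature."""
    rng = random.Random(seed)
    # One random hyperplane (given by its normal vector) per signature bit.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

    def signature(vector):
        bits = []
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vector))
            bits.append("1" if dot >= 0 else "0")
        return "".join(bits)

    return signature


signature = make_hasher(dim=3, num_bits=8)

# Buckets keyed by signature: vectors pointing the same way land together.
buckets = {}
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],   # same direction as "a", different length
    "b": [-3.0, 0.5, -1.0],
}
for name, vec in vectors.items():
    buckets.setdefault(signature(vec), []).append(name)
```

Vectors with the same direction always get the same signature, and vectors separated by a small angle get the same signature with high probability; using more bits makes the buckets finer.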

Pure-Python data structures

There is an even greater number of pure-Python implementations for different data structures; they are not covered in this overview article intentionally.

This overview is biased towards tries; sorry for that, it's just a personal bias :)

If you know a C/C++/Cython-based Python extension that is not mentioned in this overview please let me know; comments and corrections are very welcome!

  • UPD1: pyropes module (see Ropes);
  • UPD2: acora extension (see Aho-Corasick Automaton);
  • UPD3: DAWG has a pypi release now;
  • UPD4: some memory issues with DAWG were fixed;
  • UPD5: rbtree (Red-Black Tree implementation);
  • UPD6: scipy.sparse; NetworkX; some notes about BList;
  • UPD7: LSHash package.
