Introduction
Welcome. This post is part of a series of Continuum
Analytics Open Notebooks showcasing our projects,
products, and services.
In this Continuum Open Notebook, you’ll learn more about how Numba works and
how it reduces your programming effort, and see that it achieves comparable
performance to C and Cython over a range of benchmarks.
If you are reading the blog form of this notebook, you can run the code yourself
on our cloud-based Python-in-the-browser app, Wakari.
Wakari gives you a full Scientific Python stack, right from your browser, and
allows you to write and share your own IPython Notebooks. Sign up for free
here.
How Does Numba Work?
Numba is a Continuum Analytics-sponsored open source project. Numba’s job
is to make Python + NumPy code as fast as its C and Fortran
equivalents without sacrificing any of the power and flexibility of Python.
Python can be slower than C and Fortran because it features a generic, dynamic
object system. If you were to look at the C source code of the Python
interpreter, you would see that every object, even a simple integer constant,
lives in a large, generic PyObject structure. The Python interpreter has to
unwind several layers of abstraction each time it operates on a generic
object. Let’s consider a simple statement to demonstrate this concept:
```python
c = a + b
```

We’ll assume that `a` and `b` are both floating-point numbers. Adding them
together is a single instruction on any modern CPU. This statement in C or
Fortran will usually generate just this single floating-point add instruction at
compile-time. At run-time, dispatching this instruction will likely only
require a single CPU cycle, and it will complete in less than five cycles.
The same statement in Python will generate dozens of instructions. Because `a`
and `b` are dynamically typed, the interpreter must first determine the types
of `a` and `b`, which requires memory lookups on both objects. Then the
interpreter has to determine whether the type provides an add method. A new
object, `c`, may need to be created, and the creation of `c` requires a memory
allocation on the heap. Finally, the floating-point add operation is called,
and the result is stored in `c`. The many additional function calls are
responsible for the first order of magnitude of difference in performance
between Python and compiled languages such as C and Fortran. But it is the
memory allocations and dereferencing that are responsible for the next several
orders of magnitude of performance difference. Python does not feature a native
just-in-time compiler, so every time it sees this statement again (such as in a
for loop), it has to repeat all the work it just did.
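You can see this overhead for yourself with Python’s built-in `dis` module (a
quick illustration, not part of the original benchmarks):

```python
import dis

def add(a, b):
    c = a + b
    return c

# Each bytecode op below hides generic-object machinery: BINARY_ADD alone
# performs the type lookups, the add-method dispatch, and the heap
# allocation of the result object described above.
dis.dis(add)
```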
Numba is our bridge between Python and performance. Numba takes over from the
Python interpreter on decorated functions and classes, and intelligently adds
type information to as many objects as possible in an expression. When Numba
can’t figure out what type an object is, it falls back to the same expensive
type queries the Python interpreter uses. Numba then compiles the Python and
NumPy functions and classes into performant code. Numba can compile
just-in-time with the `autojit` decorator, or ahead of time with the `jit`
decorator.
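For example, a minimal sketch against the Numba 0.x decorator API this post
describes (`numba.double` is the double-precision type object):

```python
from numba import autojit, jit, double

@autojit                      # types are inferred at call time (just-in-time)
def ajit_add(a, b):
    return a + b

@jit(double(double, double))  # types are declared up front (ahead of time)
def jit_add(a, b):
    return a + b
```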
Benchmarking
This notebook provides a benchmark comparison between Python, C interfaced
through ctypes, Cython, and Numba - all from an IPython notebook in the cloud
that you can run yourself!
The notebook is self-validating, with integrated
tests checking the correctness of each kernel function before timing it. We
encourage you to experiment with the code, try out new ideas, or even improve
the code performance or the benchmarks themselves. Feel free to reuse any of
this code for your own work.
We will start with two variants of a vector sum benchmark, then look at the GrowCut benchmark introduced in the last notebook.
Setup
We start by importing the libraries we need and defining a plotting function.
We also install an IPython extension, cmagic, for compiling C code using the
same compiler and flags that were used to build Python. By default,
we have hidden some of the longer code snippets. Click on the title to
view them inline.
```python
%pylab inline

# for compiling Cython inline
%load_ext cythonmagic

# for compiling C code inline (you only need to install this once)
%install_ext https://gist.github.com/ahmadia/5600293/raw/f1bb89de5dfa1f2cc67201039b253949529e96a0/cmagic.py
%load_ext cmagic

import matplotlib.pyplot as plt

# Adjust the SIZEX and SIZEY variables here if you would
# like larger or smaller images.
SIZEX, SIZEY = 10, 10
plt.figsize(SIZEX, SIZEY)

from timeit import timeit
from numba import autojit
import math
import ctypes
import numpy as np
import numpy.testing as npt
from matplotlib import ticker

def time_and_plot(algo, Ns, dtype, func_names, time_func,
                  xlabel, ylabel, yscale="linear", yscaling=1):
    """Plot Timing Comparison

    Timing Parameters
    -----------------
    algo : str
        Name of the algorithm being timed.
        Example: "Matrix-Matrix Multiply"
    Ns : indexable, int
        Size of arguments to be passed to timed functions.
    dtype : numpy.dtype
        Type of arguments to pass to timed functions.
    func_names : str
        Name of the functions in the __main__ namespace to be timed.
    time_func : function
        Timing function.  See time_func.

    Plot Parameters
    ---------------
    xlabel : str
        See plt.xlabel.
    ylabel : str
        See plt.ylabel.
    yscale : str, optional, defaults to "linear"
        See plt.set_yscale.
    yscaling : integer, optional, defaults to 1
        Ratio to multiply y data values by.
    """

    data = np.empty((len(func_names), len(Ns)), dtype=np.float)
    for k in xrange(len(func_names)):
        for i in xrange(len(Ns)):
            data[k, i] = time_func(func_names[k], Ns[i], dtype)

    plt.clf()
    fig1, ax1 = plt.subplots()
    w, h = fig1.get_size_inches()
    fig1.set_size_inches(w*1.5, h)
    ax1.set_xscale("log")
    ax1.get_xaxis().set_major_formatter(ticker.FormatStrFormatter("%d"))
    ax1.get_xaxis().set_minor_locator(ticker.NullLocator())
    ax1.set_yscale(yscale)
    plt.setp(ax1.get_xticklabels(), fontsize=14)
    plt.setp(ax1.get_yticklabels(), fontsize=14)

    ax1.grid(color="lightgrey", linestyle="--", linewidth=1, alpha=0.5)

    plt.xlabel(xlabel, fontsize=24)
    plt.ylabel(ylabel, fontsize=24)
    plt.xlim(Ns[0]*.9, Ns[-1]*1.1)
    plt.suptitle("%s Performance" % (algo), fontsize=24)

    for k in xrange(len(func_names)):
        plt.plot(Ns, data[k, :]*yscaling, "o-", linewidth=2,
                 markersize=5, label=func_names[k])
    plt.legend(loc="upper right", fontsize=18)
```
For loop benchmark (Integer)
Our first benchmark is a simple loop calculating a vector sum over $N$ values.
This is a native NumPy function, so we’ll define that first.
```python
def numpy_sum(y):
    return y.sum()
```
We have to write the same loop explicitly in Python.
```python
def python_sum(y):
    N = len(y)
    x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```
Note that the Python code does not require us to specify what `y` is, beyond
that it must be indexable. Although the Python dynamic types are flexible,
they are not performant.
Next, we define the Numba code.
```python
numba_sum = autojit()(python_sum)
numba_sum.func_name = "numba_sum"
```
With a single line of code, we create a high-performance but equally flexible
version of `python_sum`. When `numba_sum` is called with a NumPy `ndarray`
object, `numba_sum` will execute at the same speed as C or Cython. Don’t
believe me? Let’s time it!

Note: We set the `func_name` attribute to `numba_sum` to distinguish it from
`python_sum`, the `func_name` inherited by default.
Here’s the C code we will compare against. See the notebook for
details on how it is interfaced using magic functions.
```c
int _c_int_sum(int N, int *y) {
    int i;
    int x = y[0];
    for (i = 1; i < N; i++) {
        x += y[i];
    }
    return x;
}
```
Notice that because C is statically typed, we have to state ahead of time what
the contents of `y` are.
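The `%%c` magic generates the glue code for us, but conceptually the wrapper
looks something like the sketch below. The library name and wrapper body here
are illustrative assumptions, not the notebook’s actual generated code:

```python
import ctypes
import numpy as np

# Hypothetical: load the shared library produced by compiling _c_int_sum.
_lib = ctypes.CDLL("./_c_int_sum.so")
_lib._c_int_sum.restype = ctypes.c_int
_lib._c_int_sum.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_int)]

def c_int_sum(y):
    # Pass the length and a pointer to the array's underlying buffer.
    ptr = y.ctypes.data_as(ctypes.POINTER(ctypes.c_int))
    return _lib._c_int_sum(len(y), ptr)
```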
Next up is Cython.
```cython
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_int_sum(int[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef int x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```
Cython is an optimizing static compiler for Python. This
Cython code will generate a Python extension
module. Note that the
Cython language is neither C nor Python, but a creole constructed from the two
languages. Again, see the notebook for details on how this code is
compiled and run using magic functions.
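Before timing anything, the notebook validates each variant against NumPy. The
actual test cells are hidden in this post; here is a representative sketch
using the `npt` alias imported in the setup:

```python
y = np.arange(100, dtype=np.int32)
expected = numpy_sum(y)

# Every implementation must agree with NumPy before we trust its timings.
for func in (python_sum, numba_sum, cython_int_sum, c_int_sum):
    npt.assert_equal(func(y), expected)
```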
Correctly Measuring Performance
We will use the `timeit` module to handle our performance comparison. `timeit`
doesn’t have access to any of the variables in our namespace by default, so we
attach and retrieve them from the `__main__` module.
```python
def time_sum_func(func_name, N=10000, dtype=np.int32, trials=100):
    """Timeit Helper Function (Simple One-D Functions)

    Parameters
    ----------
    func_name : str
        Name of the function in the __main__ namespace to be timed.
    N : int, optional, defaults to 10000
        Size of np.ones array to construct and pass to function.
    dtype : np.dtype
        Type of array to construct and pass to function.
    trials : int, optional, defaults to 100
        This parameter is passed to timeit.

    Returns
    -------
    func_time : float
        Average execution time over all trials
    """
    import __main__
    __main__.y = np.ones((N), dtype=dtype)
    return (timeit(stmt="func(__main__.y)",
                   setup="import __main__; from __main__ import %s as func" % func_name,
                   number=trials) / trials)
```
Here are the results of our first call to the timer for three of the benchmarks:
```python
print "Numba (first call) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
```

```
Numba (first call) (ms): 0.944879
Cython (ms): 0.00568867
C (ms): 0.0138402
```
The first time we run the Numba code, we notice that it is much slower than C or
Cython. This is a feature. Remember that Numba is a just-in-time compiler;
the code is not compiled until the very last moment before execution
(hence the “just in time”). If we re-run the Numba code a second time:
```python
print "Numba (subsequent) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
```

```
Numba (subsequent) (ms): 0.0144315
Cython (ms): 0.00579834
C (ms): 0.0180197
```
We see that execution time is now consistently much faster. Numba only pays the
cost of compilation once for a given set of argument types. Numba caches the
results of compilation between function calls and recognizes
that `numba_sum` has been called with an integer array previously,
saving a recompile!
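In practice, this means you warm Numba up once per argument type. A sketch of
the behavior:

```python
y_int = np.ones(10000, dtype=np.int32)
y_double = np.ones(10000, dtype=np.float64)

numba_sum(y_int)     # first call: compiles a specialization for int32 arrays
numba_sum(y_int)     # cached: runs the compiled code directly
numba_sum(y_double)  # new argument type: triggers one more compile, then caches
```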
Performance Comparison
Let’s see how NumPy, Python, Cython, Numba, and C actually stack up!
```python
Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names,
              time_sum_func, "Array Size", "Time: ms", yscale="log", yscaling=1e3)
```
Whoa! We’re not going to have time to run Python on big arrays, so let’s drop it
from the rest of the comparison.
```python
Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names,
              time_sum_func, "Array Size", "Time: s", yscale="log", yscaling=1)
```
Cython and Numba both do very well for small arrays, although Numba eventually
loses some performance for very large arrays. Numba is not quite as fast as C
or Cython for very large problems in this case; this will be addressed in an
upcoming release.
For loop benchmark (Floating-Point)
This next benchmark demonstrates the true flexibility of Numba. We don’t need
to modify the Numba code at all; we simply pass an array of doubles this
time instead of integers. In both the C and the Cython code, we have to write
new functions with a different type for `y`.
Whoa! Whoa! Easy with the pitchforks and torches! I have delicate skin!
Yes, we know that this problem could be solved by using typedefs and macros in
C, or templates in C++. But we would still need multiple functions, one
for each possible case, and the count would quickly explode combinatorially
when multiple options are combined. Besides, the whole point of this exercise
is to get performance while keeping the developer’s job as simple as possible.
One of Python’s greatest attributes is its support for clean, generic functions.
Numba really shines in supporting generic functions while providing performance
through `autojit`.
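Here is that flexibility in miniature (a sketch; the statically typed C and
Cython variants follow below):

```python
y_int = np.ones(8, dtype=np.int32)
y_double = np.ones(8, dtype=np.float64)

# One generic function handles both element types; Numba specializes each
# case behind the scenes, while C and Cython need separate functions.
print numba_sum(y_int)
print numba_sum(y_double)
```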
```c
double _c_double_sum(int N, double *y) {
    int i;
    double x = y[0];
    for (i = 1; i < N; i++) {
        x += y[i];
    }
    return x;
}
```
```cython
%%cython
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_double_sum(double[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef double x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```
Performance Comparison
We measure the performance over a range of vector sizes.
```python
Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: ms", yscale="log", yscaling=1e3)
```
```python
Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: (s) Log-Scale", yscale="log", yscaling=1)
```
From a performance perspective, the different versions behave almost
identically for large vectors. Both Cython and Numba really shine for smaller
array sizes, though, even beating NumPy!
GrowCut benchmark
Artificial benchmarks always leave us with a minor sense of dissatisfaction,
similar to the feeling we’re left with after eating hot dogs made out of that
unidentifiable bright red meat. Let’s go back to a real application kernel and
consider the impact of using Numba there.
We revisit the GrowCut kernel introduced in the prior entry in this series.
First we do a pure Python implementation.
```python
def window_floor(idx, radius):
    if radius > idx:
        return 0
    else:
        return idx - radius

def window_ceil(idx, ceil, radius):
    if idx + radius > ceil:
        return ceil
    else:
        return idx + radius

def python_kernel(image, state, state_next, window_radius):
    changes = 0
    sqrt_3 = math.sqrt(3.0)

    height = image.shape[0]
    width = image.shape[1]

    for j in xrange(width):
        for i in xrange(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(window_floor(j, window_radius),
                             window_ceil(j+1, width, window_radius)):
                for ii in xrange(window_floor(i, window_radius),
                                 window_ceil(i+1, height, window_radius)):
                    if (ii == i and jj == j):
                        continue

                    d = image[i, j, 0] - image[ii, jj, 0]
                    s = d * d
                    for k in range(1, 3):
                        d = image[i, j, k] - image[ii, jj, k]
                        s += d * d
                    gval = 1.0 - math.sqrt(s)/sqrt_3

                    attack_strength = gval * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes
```
Next, we autojit the pure Python kernels to create accelerated Numba variants.
If we want maximum performance, we also need to autojit the two helper
functions called inside the loop. We could have removed the functions
themselves, but they improve the readability of the code. Currently, Numba does
not support inlining (Pull Requests
welcome!), which makes it more challenging to use function calls in innermost
loops performantly.
```python
window_ceil = autojit()(window_ceil)
window_floor = autojit()(window_floor)
numba_kernel = autojit()(python_kernel)
numba_kernel.func_name = "numba_kernel"
```
```cython
%%cython
# Note: This code is not in the public domain. This implementation of GrowCut
# is used under the terms of the BSD license, and is available from
# https://github.com/stefanv/growcut_py

#cython: cdivision=True
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

from __future__ import division

import numpy as np
cimport cython
cimport numpy as cnp

cdef extern from "math.h" nogil:
    double sqrt(double)

cdef inline double distance(double[:, :, ::1] image,
                            Py_ssize_t r0, Py_ssize_t c0,
                            Py_ssize_t r1, Py_ssize_t c1) nogil:
    cdef:
        double s = 0, d
        int i

    for i in range(3):
        d = image[r0, c0, i] - image[r1, c1, i]
        s += d * d

    return sqrt(s)

cdef double s3 = sqrt(3)

cdef inline double g(double d) nogil:
    return 1 - (d / s3)

def cython_kernel(double[:, :, ::1] image,
                  double[:, :, ::1] state,
                  double[:, :, ::1] state_next,
                  Py_ssize_t window_radius):

    cdef:
        Py_ssize_t i, j, ii, jj, width, height
        double gc, attack_strength, defense_strength, winning_colony
        int changes

    height, width = image.shape[0], image.shape[1]

    changes = 0

    for j in range(width):
        for i in range(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(max(0, j - window_radius),
                             min(j + window_radius + 1, width)):
                for ii in xrange(max(0, i - window_radius),
                                 min(i + window_radius + 1, height)):
                    if ii == i and jj == j:
                        continue

                    # p -> current cell, (i, j)
                    # q -> attacker, (ii, jj)

                    gc = g(distance(image, i, j, ii, jj))

                    attack_strength = gc * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes
```
Finally, here’s a pure C version of the GrowCut kernel, ready
to interface into Python.
```c
%%c _c_kernel
#include "Python.h"
#include "math.h"

#define image(row, column, channel) (_image[row*width*3 + column*3 + channel])
#define state(row, column, idx) (_state[row*width*2 + column*2 + idx])
#define state_next(row, column, idx) (_state_next[row*width*2 + column*2 + idx])

double distance(double*, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t);
double g(double);
double s3;

Py_ssize_t _window_floor(Py_ssize_t, Py_ssize_t);
Py_ssize_t _window_ceil(Py_ssize_t, Py_ssize_t, Py_ssize_t);

int _c_kernel(double *_image, double *_state, double *_state_next,
              Py_ssize_t window_radius, Py_ssize_t width, Py_ssize_t height)
{
    Py_ssize_t i, j, ii, jj;
    double gc, attack_strength, defense_strength, winning_colony;
    int changes;

    s3 = sqrt(3.0);
    changes = 0;

    for (j = 0; j < width; j++) {
        for (i = 0; i < height; i++) {

            winning_colony = state(i, j, 0);
            defense_strength = state(i, j, 1);

            for (jj = _window_floor(j, window_radius);
                 jj < _window_ceil(j+1, width, window_radius); jj++) {
                for (ii = _window_floor(i, window_radius);
                     ii < _window_ceil(i+1, height, window_radius); ii++) {
                    if (ii == i && jj == j) {
                        continue;
                    }

                    /* p -> current cell, (i, j) */
                    /* q -> attacker, (ii, jj)  */

                    gc = g(distance(_image, i, j, ii, jj, width));

                    attack_strength = gc * state(ii, jj, 1);

                    if (attack_strength > defense_strength) {
                        defense_strength = attack_strength;
                        winning_colony = state(ii, jj, 0);
                        changes += 1;
                    }
                    state_next(i, j, 0) = winning_colony;
                    state_next(i, j, 1) = defense_strength;
                }
            }
        }
    }
    return changes;
}

Py_ssize_t _window_floor(Py_ssize_t idx, Py_ssize_t radius)
{
    if (radius > idx) {
        return 0;
    }
    else {
        return idx - radius;
    }
}

Py_ssize_t _window_ceil(Py_ssize_t idx, Py_ssize_t ceil, Py_ssize_t radius)
{
    if (idx + radius > ceil) {
        return ceil;
    }
    else {
        return idx + radius;
    }
}

double distance(double *_image,
                Py_ssize_t r0, Py_ssize_t c0,
                Py_ssize_t r1, Py_ssize_t c1,
                Py_ssize_t width)
{
    double s = 0, d;
    int i;

    for (i = 0; i < 3; i++) {
        d = image(r0, c0, i) - image(r1, c1, i);
        s += d * d;
    }
    return sqrt(s);
}

double g(double d)
{
    return 1 - (d / s3);
}
```
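As with the sum benchmarks, the notebook checks each kernel against the pure
Python reference before timing it. The actual validation cells are hidden
here; this is a representative sketch:

```python
N = 8
image = np.zeros((N, N, 3), dtype=np.double)
state = np.zeros((N, N, 2), dtype=np.double)
state[0, 0, :] = 1  # colony 1, strength 1, at position (0, 0)

expected = np.empty_like(state)
python_kernel(image, state, expected, 3)

# Each accelerated kernel must produce the same next state.
for kernel in (numba_kernel, cython_kernel, c_kernel):
    state_next = np.empty_like(state)
    kernel(image, state, state_next, 3)
    npt.assert_allclose(state_next, expected)
```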
Performance Comparison
We time the verified kernels over a range of square image sizes. Again, we run
very small images in Python, then larger images in Cython, Numba, and C.
```python
def time_kernel(k_name, N, dtype):
    """Timeit Helper Function (GrowCut Functions)

    Parameters
    ----------
    k_name : str
        Name of the GrowCut kernel in the __main__ namespace to be timed.
    N : int
        Size of image arrays to construct and pass to function.
    dtype : np.dtype
        Type of arrays to construct and pass to function.

    Returns
    -------
    func_time : float
        Average execution time over 3 trials
    """
    import __main__
    image = np.zeros((N, N, 3), dtype=dtype)
    state = np.zeros((N, N, 2), dtype=dtype)
    state_next = np.empty_like(state)

    # colony 1 is strength 1 at position 0,0
    # colony 0 is strength 0 at all other positions
    state[0, 0, 0] = 1
    state[0, 0, 1] = 1

    __main__.image = image
    __main__.state = state
    __main__.state_next = state_next

    trials = 3

    return timeit(stmt="kernel(__main__.image, __main__.state, __main__.state_next, 10)",
                  setup="from __main__ import %s as kernel; import __main__" % k_name,
                  number=trials)/trials
```
```python
Ns = np.logspace(2, 6, num=5, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel', 'python_kernel'
time_and_plot('GrowCut', Ns, np.double, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)
```
```python
Ns = np.logspace(3, 10, num=8, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel'
time_and_plot('GrowCut', Ns, np.float64, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)
```
We don’t observe a significant performance difference between the C, Cython,
and Numba kernels. Of course, only one of the three kernels is written in
clean, dynamic Python. :)
Conclusion
In this open notebook, we compared Numba against Cython and C. First, we
explored some simple benchmarks. Then, we returned to the GrowCut example. Our
experiments reveal that Numba performs as well as Cython and native C interfaced
directly into Python. At the same time, the Numba code is clearly the easiest
to understand and write from a Python programmer’s perspective.
We still have much more ground to cover, including how the professional version
of Numba, NumbaPro, can accelerate code on
GPUs. NumbaPro is available as part of
our Anaconda Accelerate product.
Update:
At the request of several commenters, here is a test script and benchmarks that we ran on PyPy and Anaconda Python (with Numba). The results are not tuned (I am not a PyPy expert!) so we did not post them in the blog. We’d be happy to look deeper into this with the PyPy developers. While PyPy is not currently installed on Wakari, we are looking at a number of ways we can install and support the PyPy community.
Acknowledgements
This notebook features the GrowCut algorithm, which was introduced in
“GrowCut - Interactive Multi-Label N-D Image Segmentation By Cellular
Automata” by V. Vezhnevets and V.
Konouchine. Credit for the original native Python code and the accelerated
Cython code is due to Dr. Nathan Faggian
and Dr. Stefan van der Walt. The Cython implementation is used under the terms
of the BSD license.
Tags:
Numba
IPython Notebook
Python
ImageProcessing