
Accelerating Python Libraries with Numba (Part 2)


URL:http://continuum.io/blog/numba_performance


Introduction

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll learn more about how Numba works and how it reduces your programming effort, and see that it achieves comparable performance to C and Cython over a range of benchmarks.

If you are reading the blog form of this notebook, you can run the code yourself on our cloud-based Python-in-the-browser app, Wakari. Wakari gives you a full Scientific Python stack, right from your browser, and allows you to write and share your own IPython Notebooks. Sign up for free here.

How Does Numba Work?

Numba is a Continuum Analytics-sponsored open source project. Numba’s job is to make Python + NumPy code as fast as its C and Fortran equivalents without sacrificing any of the power and flexibility of Python. Python can be slower than C and Fortran because it features a generic, dynamic object system. If you were to look at the Python interpreter’s C source code, you would see that every object, even a simple integer constant, lives in a large, generic PyObject structure. The Python interpreter has to unwind several layers of abstraction each time it operates on a generic object. Let’s consider a simple statement to demonstrate this concept:

```python
c = a + b
```

We’ll assume that a and b are both floating-point numbers. Adding them together is a single instruction on any modern CPU. This statement in C or Fortran will usually generate just this single floating-point add instruction at compile-time. At run-time, dispatching this instruction will likely only require a single CPU cycle, and it will complete in less than five cycles.

The same statement in Python will generate dozens of instructions. Because a and b are dynamically typed, the interpreter must first determine the type of a and b, which requires memory lookups for each object’s type. Then the interpreter has to determine whether that type possesses the add method. A new object, c, may need to be created. The creation of c requires a memory allocation on the heap. Finally, the floating-point add operation is called, and the result is stored in c. The many additional function calls are responsible for the first order of magnitude of difference in performance between Python and compiled languages such as C and Fortran, but it is the memory allocations and dereferencing that are responsible for the next several orders of magnitude. Python does not feature a native just-in-time compiler, so every time it sees this statement again (such as in a for loop), it has to repeat all the work it just did.
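You can see some of this overhead yourself (this snippet is our illustration, not part of the original notebook) with the standard-library dis module, which shows the generic bytecode the interpreter dispatches, and sys.getsizeof, which shows how much larger a boxed Python float is than a raw C double:

```python
import dis
import sys

def add(a, b):
    # A single source-level addition...
    c = a + b
    return c

# ...still compiles to several generic bytecode instructions
# (LOAD_FAST, BINARY_ADD / BINARY_OP, STORE_FAST), each dispatched
# dynamically at run-time.
dis.dis(add)

# Every Python float is a full heap-allocated PyObject, considerably
# larger than the 8 bytes a C double occupies.
print(sys.getsizeof(1.0))
```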

Numba is our bridge between Python and performance. Numba takes over for the Python interpreter on decorated functions and classes, and intelligently adds type information to as many objects as possible in an expression. When Numba can’t figure out what type an object is, it falls back to the same expensive type queries the Python interpreter uses. Numba then compiles the Python and NumPy functions and classes into performant code. Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator.
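The compile-once-per-type-signature behavior described above can be sketched with a toy pure-Python analogy. This is only an illustration of the caching idea, not Numba’s actual implementation:

```python
def specialize(func):
    """Toy stand-in for autojit: keep one 'specialization' per
    argument-type signature, 'compiling' only on a cache miss."""
    cache = {}

    def wrapper(*args):
        signature = tuple(type(a) for a in args)
        if signature not in cache:
            # In real Numba, type inference and code generation
            # would happen here -- once per signature.
            cache[signature] = func  # pretend this is compiled code
        return cache[signature](*args)

    wrapper.cache = cache
    return wrapper

@specialize
def add(a, b):
    return a + b

add(1, 2)      # first int call: "compiles"
add(3, 4)      # cache hit, no recompile
add(1.0, 2.0)  # new float signature: "compiles" again
print(len(add.cache))  # 2 signatures cached
```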

Benchmarking

This notebook provides a benchmark comparison between Python, C interfaced through ctypes, Cython, and Numba - all from an IPython notebook in the cloud that you can run yourself! The notebook is self-validating, with integrated tests checking the correctness of each kernel function before timing it. We encourage you to experiment with the code, try out new ideas, or even improve the code performance or the benchmarks themselves. Feel free to reuse any of this code for your own work.

We will start with two variants of a vector sum benchmark, then look at the GrowCut benchmark introduced in the last notebook.

Setup

We start by importing the libraries we need and defining a plotting function. We also install an IPython extension, cmagic, for compiling C code using the same compiler and flags that were used to build Python. By default, we have hidden some of the longer code snippets. Click on the title to view them inline.

```python
%pylab inline

# for compiling Cython inline
%load_ext cythonmagic
# for compiling C code inline (you only need to install this once)
%install_ext https://gist.github.com/ahmadia/5600293/raw/f1bb89de5dfa1f2cc67201039b253949529e96a0/cmagic.py
%load_ext cmagic

import matplotlib.pyplot as plt

# Adjust the SIZEX and SIZEY variables here if you would
# like larger or smaller images.
SIZEX, SIZEY = 10, 10
plt.figsize(SIZEX, SIZEY)

from timeit import timeit
from numba import autojit
import math
import ctypes
import numpy as np
import numpy.testing as npt
from matplotlib import ticker

def time_and_plot(algo, Ns, dtype, func_names, time_func,
                  xlabel, ylabel, yscale="linear", yscaling=1):
    """Plot Timing Comparison

    Timing Parameters
    -----------------
    algo : str
        Name of the algorithm being timed. Example: "Matrix-Matrix Multiply"
    Ns : indexable, int
        Size of arguments to be passed to timed functions.
    dtype : numpy.dtype
        Type of arguments to pass to timed functions.
    func_names : str
        Name of the functions in the __main__ namespace to be timed.
    time_func : function
        Timing function. See time_func.

    Plot Parameters
    ---------------
    xlabel : str
        See plt.xlabel.
    ylabel : str
        See plt.ylabel.
    yscale : str, optional, defaults to "linear"
        See plt.set_yscale.
    yscaling : integer, optional, defaults to 1
        Ratio to multiply y data values by.
    """
    data = np.empty((len(func_names), len(Ns)), dtype=np.float)
    for k in xrange(len(func_names)):
        for i in xrange(len(Ns)):
            data[k, i] = time_func(func_names[k], Ns[i], dtype)

    plt.clf()
    fig1, ax1 = plt.subplots()
    w, h = fig1.get_size_inches()
    fig1.set_size_inches(w*1.5, h)
    ax1.set_xscale("log")
    ax1.get_xaxis().set_major_formatter(ticker.FormatStrFormatter("%d"))
    ax1.get_xaxis().set_minor_locator(ticker.NullLocator())
    ax1.set_yscale(yscale)
    plt.setp(ax1.get_xticklabels(), fontsize=14)
    plt.setp(ax1.get_yticklabels(), fontsize=14)

    ax1.grid(color="lightgrey", linestyle="--", linewidth=1, alpha=0.5)

    plt.xlabel(xlabel, fontsize=24)
    plt.ylabel(ylabel, fontsize=24)
    plt.xlim(Ns[0]*.9, Ns[-1]*1.1)
    plt.suptitle("%s Performance" % (algo), fontsize=24)

    for k in xrange(len(func_names)):
        plt.plot(Ns, data[k, :]*yscaling, "o-", linewidth=2,
                 markersize=5, label=func_names[k])
    plt.legend(loc="upper right", fontsize=18)
```

For loop benchmark (Integer)

Our first benchmark is a simple loop calculating a vector sum over $N$ values. This is a native NumPy function, so we’ll define that first.

```python
def numpy_sum(y):
    return y.sum()
```

We have to write the same loop explicitly in Python.

```python
def python_sum(y):
    N = len(y)
    x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```

Note that the Python code does not require us to specify what y is, beyond that it must be indexable. Although the Python dynamic types are flexible, they are not performant.

Next, we define the Numba code.

```python
numba_sum = autojit()(python_sum)
numba_sum.func_name = "numba_sum"
```

With a single line of code, we create a high-performance but equally flexible version of python_sum. When numba_sum is called with a numpy ndarray object, numba_sum will execute at the same speed as C or Cython. Don’t believe me? Let’s time it!

Note: We set the func_name attribute to numba_sum to distinguish it from python_sum, the func_name inherited by default.

Here’s the C code we will compare against. See the notebook for details on how it is interfaced using magic functions.

```c
int _c_int_sum(int N, int *y) {
    int i;
    int x = y[0];
    for (i=1; i<N; i++) {
        x += y[i];
    }
    return x;
}
```

Notice that because C is statically typed, we have to state ahead of time what the contents of y are.

Next up is Cython.

```cython
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_int_sum(int[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef int x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```

Cython is an optimising static compiler for Python. This Cython code will generate a Python extension module. Note that the Cython language is neither C nor Python, but a creole constructed from the two languages. Again, see the notebook for details on how this code is compiled and run using magic functions.

Correctly Measuring Performance

We will use the timeit module to handle our performance comparison. timeit doesn’t have access to any of the variables in our namespace by default, so we attach and retrieve them from the __main__ module.
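As a minimal illustration of this pattern (a simplified stand-in for the helper defined next, with hypothetical names), we attach data to the __main__ module and reference it from the timed statement:

```python
from timeit import timeit
import __main__

# timeit runs stmt in a fresh namespace, so the setup string must
# import everything it needs -- including data we attach to __main__.
__main__.data = list(range(1000))

elapsed = timeit(
    stmt="sum(__main__.data)",
    setup="import __main__",
    number=100,
) / 100  # average seconds per call
print("%.3g s per call" % elapsed)
```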

```python
def time_sum_func(func_name, N=10000, dtype=np.int32, trials=100):
    """Timeit Helper Function (Simple One-D Functions)

    Parameters
    ----------
    func_name : str
        Name of the function in the __main__ namespace to be timed.
    N : int, optional, defaults to 10000
        Size of np.ones array to construct and pass to function.
    dtype : np.dtype
        Type of array to construct and pass to function.
    trials : int, optional, defaults to 100
        This parameter is passed to timeit.

    Returns
    -------
    func_time : float
        Average execution time over all trials
    """
    import __main__
    __main__.y = np.ones((N), dtype=dtype)
    return (timeit(stmt="func(__main__.y)",
                   setup="import __main__; from __main__ import %s as func" % func_name,
                   number=trials)/trials)
```

Here are the results of our first call to the timer for three of the benchmarks:

```python
print "Numba (first call) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
```

```
Numba (first call) (ms): 0.944879
Cython (ms): 0.00568867
C (ms): 0.0138402
```

The first time we run the Numba code, we notice that it is much slower than C or Cython. This is a feature. Remember that Numba is a just-in-time compiler; this means the code is not compiled until the very last moment before execution (thus the just in time). If we re-run the Numba code a second time:

```python
print "Numba (subsequent) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
```

```
Numba (subsequent) (ms): 0.0144315
Cython (ms): 0.00579834
C (ms): 0.0180197
```

We see that execution time is now consistently much faster. Numba only pays the compilation cost once for a given argument type signature: it caches the results of compilation between function calls and recognizes that numba_sum has previously been called with an integer array, saving a recompile!

Performance Comparison

Let’s see how NumPy, Python, Cython, Numba, and C actually stack up!

```python
Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names,
              time_sum_func, "Array Size", "Time: ms", yscale="log", yscaling=1e3)
```

Whoa! We’re not going to have time to run Python on big arrays; let’s drop it from the rest of the comparison.

```python
Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names,
              time_sum_func, "Array Size", "Time: s", yscale="log", yscaling=1)
```

Cython and Numba both do very well for small arrays, although Numba eventually loses some performance for very large arrays. Numba is not quite as fast as C or Cython for very large problems in this case; this will be addressed in an upcoming release.

For loop benchmark (Floating-Point)

This next benchmark demonstrates the true flexibility of Numba. We don’t need to modify the Numba code at all; we simply pass an array of doubles this time instead of integers. In both the C and the Cython code, we have to write new functions with a different type for y.

Whoa! Whoa! Easy with the pitchforks and torches! I have delicate skin!

Yes, we know that this problem could be solved by using typedefs and macros, or templates in C++. But we would still need to have multiple functions, one for each possible case, and this would quickly explode combinatorially for combinations of multiple options. Besides, the whole point of this exercise is to get performance while keeping the developer’s job as simple as possible.

One of Python’s greatest attributes is its support for clean, generic functions. Numba really shines in supporting generic functions while providing performance through autojit.
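As a small pure-Python illustration of this genericity (our sketch, not code from the original notebook), one loop handles every numeric type, where C would need a separate function per type:

```python
def generic_sum(y):
    """One generic Python loop; works for any indexable numeric sequence."""
    x = y[0]
    for i in range(1, len(y)):
        x += y[i]
    return x

# The same function handles integers, floats, and even complex numbers;
# in C we would need a separate function (or macro expansion) per type.
print(generic_sum([1, 2, 3]))         # 6
print(generic_sum([1.5, 2.5]))        # 4.0
print(generic_sum([1 + 2j, 3 - 1j]))  # (4+1j)
```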

```c
double _c_double_sum(int N, double *y) {
    int i;
    double x = y[0];
    for (i=1; i<N; i++) {
        x += y[i];
    }
    return x;
}
```
```cython
%%cython
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_double_sum(double[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef double x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x
```

Performance Comparison

We measure the performance over a range of vector sizes.

```python
Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: ms", yscale="log", yscaling=1e3)
```

```python
Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: (s) Log-Scale", yscale="log", yscaling=1)
```

From a performance perspective, the different versions behave almost identically for large vectors. Both Cython and Numba really shine for smaller array sizes, though, even over NumPy!

GrowCut benchmark

Artificial benchmarks always leave us with a minor sense of dissatisfaction, similar to the feeling we’re left with after eating hot dogs made out of that unidentifiable bright red meat. Let’s go back to a real application kernel and consider the impact of using Numba there.

We revisit the GrowCut kernel introduced in the prior entry in this series.

First we do a pure Python implementation.

```python
def window_floor(idx, radius):
    if radius > idx:
        return 0
    else:
        return idx - radius

def window_ceil(idx, ceil, radius):
    if idx + radius > ceil:
        return ceil
    else:
        return idx + radius

def python_kernel(image, state, state_next, window_radius):
    changes = 0
    sqrt_3 = math.sqrt(3.0)

    height = image.shape[0]
    width = image.shape[1]

    for j in xrange(width):
        for i in xrange(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(window_floor(j, window_radius),
                             window_ceil(j+1, width, window_radius)):
                for ii in xrange(window_floor(i, window_radius),
                                 window_ceil(i+1, height, window_radius)):
                    if (ii == i and jj == j):
                        continue

                    d = image[i, j, 0] - image[ii, jj, 0]
                    s = d * d
                    for k in range(1, 3):
                        d = image[i, j, k] - image[ii, jj, k]
                        s += d * d
                    gval = 1.0 - math.sqrt(s)/sqrt_3

                    attack_strength = gval * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes
```

Next, we autojit the pure Python kernels to create accelerated Numba variants.

If we want maximum performance, we need to autojit the two functions used in iterating over the loop. We could have removed the functions themselves, but they help improve the readability of the code. Currently, Numba does not support inlining (Pull Requests welcome!), which makes it more challenging to put function calls in innermost loops performantly.

```python
window_ceil = autojit()(window_ceil)
window_floor = autojit()(window_floor)
numba_kernel = autojit()(python_kernel)
numba_kernel.func_name = "numba_kernel"
```
```cython
%%cython
# Note: This code is not in the public domain. This implementation of GrowCut
# is used under the terms of the BSD license, and is available from
# https://github.com/stefanv/growcut_py

#cython: cdivision=True
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

from __future__ import division

import numpy as np
cimport cython
cimport numpy as cnp

cdef extern from "math.h" nogil:
    double sqrt(double)

cdef inline double distance(double[:, :, ::1] image,
                            Py_ssize_t r0, Py_ssize_t c0,
                            Py_ssize_t r1, Py_ssize_t c1) nogil:
    cdef:
        double s = 0, d
        int i

    for i in range(3):
        d = image[r0, c0, i] - image[r1, c1, i]
        s += d * d

    return sqrt(s)

cdef double s3 = sqrt(3)

cdef inline double g(double d) nogil:
    return 1 - (d / s3)

def cython_kernel(double[:, :, ::1] image, double[:, :, ::1] state,
                  double[:, :, ::1] state_next, Py_ssize_t window_radius):
    cdef:
        Py_ssize_t i, j, ii, jj, width, height
        double gc, attack_strength, defense_strength, winning_colony
        int changes

    height, width = image.shape[0], image.shape[1]
    changes = 0

    for j in range(width):
        for i in range(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(max(0, j - window_radius),
                             min(j + window_radius + 1, width)):
                for ii in xrange(max(0, i - window_radius),
                                 min(i + window_radius + 1, height)):
                    if ii == i and jj == j:
                        continue

                    # p -> current cell, (i, j)
                    # q -> attacker, (ii, jj)

                    gc = g(distance(image, i, j, ii, jj))

                    attack_strength = gc * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes
```

Finally, here’s a pure C version of the GrowCut kernel, ready to interface into Python.

```c
%%c _c_kernel
#include "Python.h"
#include "math.h"

#define image(row, column, channel) (_image[row*width*3 + column*3 + channel])
#define state(row, column, idx) (_state[row*width*2 + column*2 + idx])
#define state_next(row, column, idx) (_state_next[row*width*2 + column*2 + idx])

double distance(double*, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t);
double g(double);
double s3;

Py_ssize_t _window_floor(Py_ssize_t, Py_ssize_t);
Py_ssize_t _window_ceil(Py_ssize_t, Py_ssize_t, Py_ssize_t);

int _c_kernel(double *_image, double *_state, double *_state_next,
              Py_ssize_t window_radius, Py_ssize_t width, Py_ssize_t height)
{
    Py_ssize_t i, j, ii, jj;
    double gc, attack_strength, defense_strength, winning_colony;
    int changes;

    s3 = sqrt(3.0);
    changes = 0;

    for (j = 0; j < width; j++) {
        for (i = 0; i < height; i++) {

            winning_colony = state(i, j, 0);
            defense_strength = state(i, j, 1);

            for (jj = _window_floor(j, window_radius);
                 jj < _window_ceil(j+1, width, window_radius); jj++) {
                for (ii = _window_floor(i, window_radius);
                     ii < _window_ceil(i+1, height, window_radius); ii++) {
                    if (ii == i && jj == j) {
                        continue;
                    }

                    /* p -> current cell, (i, j) */
                    /* q -> attacker, (ii, jj)   */

                    gc = g(distance(_image, i, j, ii, jj, width));

                    attack_strength = gc * state(ii, jj, 1);

                    if (attack_strength > defense_strength) {
                        defense_strength = attack_strength;
                        winning_colony = state(ii, jj, 0);
                        changes += 1;
                    }
                }
            }
            state_next(i, j, 0) = winning_colony;
            state_next(i, j, 1) = defense_strength;
        }
    }
    return changes;
}

Py_ssize_t _window_floor(Py_ssize_t idx, Py_ssize_t radius)
{
    if (radius > idx) {
        return 0;
    }
    else {
        return idx - radius;
    }
}

Py_ssize_t _window_ceil(Py_ssize_t idx, Py_ssize_t ceil, Py_ssize_t radius)
{
    if (idx + radius > ceil) {
        return ceil;
    }
    else {
        return idx + radius;
    }
}

double distance(double *_image,
                Py_ssize_t r0, Py_ssize_t c0,
                Py_ssize_t r1, Py_ssize_t c1,
                Py_ssize_t width)
{
    double s = 0, d;
    int i;

    for (i = 0; i < 3; i++) {
        d = image(r0, c0, i) - image(r1, c1, i);
        s += d * d;
    }
    return sqrt(s);
}

double g(double d)
{
    return 1 - (d / s3);
}
```

Performance Comparison

We time the verified kernels over a range of square image sizes. Again, we use very small images for Python, then larger images for Cython, Numba, and C.

```python
def time_kernel(k_name, N, dtype):
    """Timeit Helper Function (GrowCut Functions)

    Parameters
    ----------
    k_name : str
        Name of the GrowCut kernel in the __main__ namespace to be timed.
    N : int
        Size of image arrays to construct and pass to function.
    dtype : np.dtype
        Type of arrays to construct and pass to function.

    Returns
    -------
    func_time : float
        Average execution time over 3 trials
    """
    import __main__
    image = np.zeros((N, N, 3), dtype=dtype)
    state = np.zeros((N, N, 2), dtype=dtype)
    state_next = np.empty_like(state)

    # colony 1 is strength 1 at position 0,0
    # colony 0 is strength 0 at all other positions
    state[0, 0, 0] = 1
    state[0, 0, 1] = 1

    __main__.image = image
    __main__.state = state
    __main__.state_next = state_next

    trials = 3

    return timeit(stmt="kernel(__main__.image, __main__.state, __main__.state_next, 10)",
                  setup="from __main__ import %s as kernel; import __main__" % k_name,
                  number=trials)/trials
```
```python
Ns = np.logspace(2, 6, num=5, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel', 'python_kernel'
time_and_plot('GrowCut', Ns, np.double, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)
```

```python
Ns = np.logspace(3, 10, num=8, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel'
time_and_plot('GrowCut', Ns, np.float64, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)
```

We don’t observe a significant performance difference between the C, Cython, or Numba kernels. Of course, only one of the three kernels is written in clean, dynamic, Python :)

Conclusion

In this open notebook, we compared Numba against Cython and C. First, we explored some simple benchmarks. Then, we returned to the GrowCut example. Our experiments reveal that Numba performs as well as Cython and native C interfaced directly into Python. At the same time, the Numba code is clearly the easiest to understand and write from a Python programmer’s perspective.

We still have much more ground to cover, including how the professional version of Numba, NumbaPro, can accelerate code on GPUs. NumbaPro is available as part of our Anaconda Accelerate product.

Update:

At the request of several commenters, here is a test script and benchmarks that we ran on PyPy and Anaconda Python (with Numba). The results are not tuned (I am not a PyPy expert!) so we did not post them in the blog. We’d be happy to look deeper into this with the PyPy developers. While PyPy is not currently installed on Wakari, we are looking at a number of ways we can install and support the PyPy community.

Acknowledgements

This notebook features the GrowCut algorithm, which was introduced in “GrowCut - Interactive Multi-Label N-D Image Segmentation By Cellular Automata” by V. Vezhnevets and V. Konouchine. Credit for the original native Python code and the accelerated Cython code is due to Dr. Nathan Faggian and Dr. Stefan van der Walt. The Cython implementation is used under the terms of the BSD license.

Tags: Numba, IPython Notebook, Python, Image Processing
