Comments:"Paul's Pontifications: When Haskell is faster than C"
URL:http://paulspontifications.blogspot.com/2013/01/when-haskell-is-faster-than-c.html
TL;DR: Conventional wisdom is wrong. Nothing can beat highly micro-optimised C, but real everyday C code is not like that, and is often several times slower than the micro-optimised version would be. Meanwhile the high level of Haskell means that the compiler has lots of scope for doing micro-optimisation of its own. As a result it is quite common for everyday Haskell code to run faster than everyday C. Not always, of course, but enough to make the speed difference moot unless you actually plan on doing lots of micro-optimisation.
This view seems to be borne out by the Computer Language Benchmarks Game (aka the Shootout), which collects implementations of various programs written in different languages and measures their execution speed and memory usage. The winningest programs are always written in highly optimised C. The Haskell versions run anything from slightly to several times slower than the C versions. Furthermore, if you read the Haskell it is also highly optimised, to the point where it no longer looks like Haskell. Instead it reads like C translated into Haskell. Based on this, you would be justified in concluding that everyday Haskell (that is, Haskell that plays to the strengths of the language in brevity, correctness and composability) must be irredeemably slow.
But then I read this presentation
by Tony Albrecht, in which he talks about how some seemingly innocent everyday
C++ code is actually very inefficient. Two things in particular caught
my eye:
- A fetch from the main memory when there is a cache miss costs 400 cycles. Pointer indirection in particular tends to fill up cache lines, and hence increase the number of cache misses overall.
- A wrong branch prediction costs over 20 cycles. A branch that skips a simple calculation can actually run slower than the calculation.
But most code isn't written like that. Tony Albrecht is a games programmer, an expert at squeezing cycles out of the rendering loop. Most developers do not live in that world. For them the objective is to produce code that meets the requirements, which includes being fast enough. This is not laziness or waste, but practicality. First design the optimal algorithm, then implement it in idiomatic code. Only if that does not run fast enough should you start detailed profiling and micro-optimisation. Not only is the micro-optimisation process itself expensive, but it makes the code hard to understand for future maintenance.
So I wondered: the high level of Haskell gives the compiler many more opportunities for micro-optimisation than C. Rather than comparing micro-optimised programs therefore, it seemed sensible to compare everyday programs of the sort that might be produced when readability is more important than raw speed. I wanted to compare programs that solved a problem large enough to have a useful mix of work, but small enough that I could write Haskell and C versions fairly quickly. After poking around the Shootout website I settled on the reverse-complement problem.
A potential issue was that one of my programs might inadvertently use something highly non-optimal, so I decided I would profile the code and remove anything that turned out to be pointlessly slow, but not change the structure of the program or add complexity merely for the sake of speed. With that in mind I wrote Haskell and C versions. I also downloaded the Shootout winner to get some feel for how my programs compared. You can see my code at the bottom of this post.
The first version of the Haskell took 30 seconds (compared with the Shootout time of about 0.6 seconds). As I had feared, profiling did indeed reveal something pointlessly slow in it. In order to filter out carriage returns from the input I used "isLetter", but unlike the C char type the Haskell Char covers the whole of Unicode, and determining if one of those is a letter is not trivial. I put the filter after the complement operation and compared the result with zero, which in addition to being faster is also the Right Thing if the input contains invalid characters. Once I had this fixed it dropped down to a much more respectable 4.3 seconds. Interestingly, profiling suggests that about half the time is being spent writing out the 60 character lines; merely printing out the result with no line breaks cut execution down to around 2 seconds.
The C version, meanwhile, took 8.2 seconds. Profiling did not directly reveal the cause, but it seemed to imply that the processor was spending most of its time somewhere other than my code. I strongly suspect that this time is being spent in putc(3) and getc(3). The obvious optimisation would use fread(3) and fwrite(3) to read and write characters in blocks instead of one at a time, but that would require significant changes to the code; extra bookkeeping to deal with the start of the next sequence (signalled by a ">" character) when it is found half way through a block, and to insert newlines similarly. Unlike the replacement of isLetter in Haskell this would require new variables and control structures driven by the cost of the solution rather than a simple switch to a less expensive expression
It might be argued that I have tilted the playing field against C by not making these changes, and that any halfway competent C programmer would do so when faced with code that runs too slowly. If isLetter is pointlessly slow, isn't the same thing true of putc(3) and getc(3)? But I think there is a clear difference. Both programs are written in a character-oriented way because the problem is described in terms of characters. I wrote the inner loop of the C to operate on a linked list of blocks because it looked like a faster and simpler choice than copying the whole string into a new buffer twice the size every time it overflowed (on average this algorithm copies each character once or twice; see Knuth for details). I might have considered reading or writing the characters in reverse order rather than doing the in-memory reverse in a separate function, but profiling didn't show that as a significant time sink. Overall, getting decent performance out of the C is going to take about the same amount of work as writing the code in the first place.
On the other hand the Haskell has decent performance out of the box because the compiler automates a lot of the micro-optimisation that C forces you to do manually. It may not do as good a job as a human with plenty of time might do, but it does it automatically and reasonably well.
This isn't principally about code size, but for the record the Haskell has 42 SLOC, of which 21 are executable. The C has 115 SLOC, of which 63 are executable. The Shootout C winner has 70 SLOC, of which 46 are executable (counting comma separated statements as one line per statement).
So here is the bottom line. If you really need your code to run as fast as possible, and you are planning to spend half your development time doing profiling and optimisation, then go ahead and write it in C because no other language (except possibly Assembler) will give you the level of control you need. But if you aren't planning to do a significant amount of micro-optimisation then you don't need C. Haskell will give you the same performance, or better, and cut your development and maintenance costs by 75 to 90 percent as well.
For what its worth, here is the code for the two programs.
First the Haskell:
module HaskellText (main) where
import Control.Applicative
import Control.Monad
import Data.ByteString.Lazy.Char8 (ByteString)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char
import Data.Int
import Data.Vector.Unboxed (Vector, (!), (//))
import qualified Data.Vector.Unboxed as V
import Data.Word
import System.IO
-- | A block requiring the reverse complement. The first element is the
-- header string. The second element is the actual data.
data Block = Block ByteString ByteString
main :: IO ()
main = getBlocks >>= mapM_ (writeBlock . reverseComplement)
-- | Write a block to stdio with line breaks.
writeBlock :: Block -> IO ()
writeBlock (Block name txt) = do
putChar '>'
B.putStrLn name
mapM_ B.putStrLn $ splitEvery 60 txt
-- | A utility function that ought to be part of the bytestring library.
splitEvery :: Int64 -> ByteString -> [ByteString]
splitEvery n t = go t
where
go t = if (not $ B.null t2) then t1 : go t2 else [t1]
where (t1, t2) = B.splitAt n t
-- | Read a series of blocks, each of which starts with a '>' character.
getBlocks :: IO [Block]
getBlocks = map mkBlock . drop 1 . B.split '>' <$> B.getContents
where
mkBlock bs1 = Block h t
where (h, t) = B.break (== '\n') bs1
reverseComplement :: Block -> Block
reverseComplement (Block name codes) =
Block name
$ B.filter (/= '\0')
$ B.map (V.unsafeIndex tbl . ord)
$ B.reverse codes
where
tbl = V.replicate 256 '\0' //
map (first ord) (pairs ++ map (first toLower) pairs)
pairs = [('A', 'T'),('C', 'G'),('G', 'C'),('T', 'A'),
('M', 'K'),('R', 'Y'),('W', 'W'),('S', 'S'),
('Y', 'R'),('K', 'M'),('V', 'B'),('H', 'D'),
('D', 'H'),('B', 'V'),('N', 'N')]
first f (x,y) = (f x, y)
Now the C. Sorry about the missing brockets around the #include files; this blog host won't render them, even when I try writing the HTML directly.
#include stdio.h
#include stdlib.h
#include strings.h
#define BLOCKLEN 1024
#define TITLELEN 132
#define LINELEN 60
typedef struct block_struct {
int len;
char text [BLOCKLEN];
struct block_struct *next;
} Block;
/* Here to get them profiled */
int my_getc (FILE *stream) {
return (getc(stream));
}
int my_putc (int c, FILE *stream) {
return (putc(c,stream));
}
/* Initialised by calls to setComplement. Holds the complement codes. */
static char complement[256];
void setComplement(char c1, char c2) {
complement [c1] = c2;
complement [c2] = c1;
complement [c1 | 0x20] = c2;
complement [c2 | 0x20] = c1;
}
/* Reverse-complement block in-place */
void reverseComplementBlock (Block *block) {
char *ptr1;
char *ptr2;
char c;
ptr1 = &(block->text[0]);
ptr2 = &(block->text[block->len - 1]);
while (ptr1 <= ptr2) {
c = complement [*ptr1];
*ptr1 = complement [*ptr2];
*ptr2 = c;
ptr1++;
ptr2--;
}
}
/*
Read blocks up to either the next ">" or EOF. Create a chain of
blocks in reverse order of reading.
*/
Block *readBlocks () {
char c;
int i;
Block *result;
Block *old;
old = NULL;
result = NULL;
c = ' ';
while (c != '>' && !feof (stdin)) {
result = malloc (sizeof (Block));
i = 0;
while (i < BLOCKLEN && !feof (stdin)) {
c = (char)my_getc(stdin);
if (c == '>') {
ungetc (c, stdin);
break;
}
if (c >= ' ') { // Drop line breaks.
result->text[i++] = c;
}
}
result->len = i;
result->next = old;
old = result;
}
return result;
}
int process_sequence () {
char title[TITLELEN];
Block *ptr;
Block *old_ptr;
int i;
int cpos;
if (!feof(stdin)) {
fgets (title, TITLELEN, stdin);
} else {
return 0;
}
if (index (title, '\n') == NULL) {
while (my_getc (stdin) != '\n') {}
}
ptr = readBlocks ();
printf("%s", title);
cpos = 0;
while (ptr != NULL) {
reverseComplementBlock (ptr);
for (i = 0; i < ptr->len; i++) {
if (cpos++ == 60) {
my_putc ('\n',stdout);
cpos = 1;
}
my_putc (ptr->text[i], stdout);
}
old_ptr = ptr;
ptr = ptr->next;
free (old_ptr);
/* Beware; the obvious "for" loop would free the block and then
dereference it to move to the next one. */
}
my_putc ('\n',stdout);
return 1;
}
void main () {
int i;
char title[TITLELEN];
for (i = 0; i < 256; i++) {
complement [i] = 0;
}
setComplement ('A', 'T');
setComplement ('C', 'G');
setComplement ('M', 'K');
setComplement ('R', 'Y');
setComplement ('W', 'W');
setComplement ('S', 'S');
setComplement ('V', 'B');
setComplement ('H', 'D');
setComplement ('N', 'N');
while (process_sequence ()) {}
}