Comments:"When Haskell Is Not faster than C - Jacques Mattheij"
URL:http://jacquesmattheij.com/when-haskell-is-not-faster-than-c
This article is in response to an earlier one comparing Haskell and C, which made the claim that Haskell beats out C when it comes to speed. On a Hacker News thread I made the bold claim that the C code is rife with problems and that it is tempting to fix it.
I’ve earned a good part of my life’s income to date taking other people’s C code, cleaning it up, making it work and then making it work better. In one sense the original program that this article is about is perfect: it takes the prescribed input, processes it and produces the desired output using the test set provided. That’s really all you can ask for in any program, under normal circumstances, provided of course that the program produces its results in an acceptable time. What counts as acceptable is up to the user of such a program; in this particular case the writer of that article (linked at the end) used acceptable to mean ‘slower than Haskell’, and as a long-time observer of the language wars this irks me.
For one, I think if you compare two languages side by side you should at least be as proficient in the one language as you are in the other, and you should be proficient in both. Second, if you are not proficient in one of the two you should limit yourself to highlighting the strong points of your favorite language rather than talking down the other language that you don’t know as much about. Third, a comparison on speed between Haskell and C is almost meaningless; speed is not Haskell’s strongest point and it would be folly to bench it against C for that particular reason.
The monetary gain from using a higher level language than C comes from the programmer spending less time writing, debugging and maintaining the software. As compilers get better and languages become better at using the raw speed of the hardware without a lot of programmer time invested, this trade-off will be made more frequently, to the disadvantage of languages lower on the totem pole such as C.
I have no clue about the specific strengths or weaknesses of Haskell. I hear a lot about it but there are so many languages and there is only so much time. My own language repertoire is probably outdated, limited and not ‘hip’ by any stretch of the imagination but I think I know my old tools well, a bit like an elderly carpenter, please keep your fingers away from the chisels.
C may be old, but beating C for raw performance is quite hard (if it is possible at all; even FORTRAN still has its uses), and there are applications where the sheer number of computers working on a single problem outweighs the programmer-cost advantage offered by an easy to use language by a considerable margin (I’ve never heard anybody describe Haskell as ‘easy’, but it may very well be easier than C, I just do not know). The example chosen is quite apt, because a lot of the problems in the field of computational genomics are exactly such problems. Other advantages of efficient code may translate into hard real-time guarantees, improved battery life, lower charges from your service provider or a lower power bill. And since the environmental impact of computing is more and more on the radar, optimization may find a new lease on life. That said, any optimization that you don’t know for sure that you need is likely to be premature, and it is important to be aware of the cost of optimization and to make the appropriate decision about when to employ it.
So this is an attempt to set the record straight and will give a level comparison between the ‘old’ version of the code as supplied by the original author and a re-written piece of code. I’ve purposefully not looked at other C entries in the language shoot-out section to keep things fair and to show what you can do to fix this code. A from scratch implementation would go for the end result in a much more direct way.
The time taken by the version as posted on http://paulspontifications.blogspot.nl/2013/01/when-haskell-is-faster-than-c.html on my machine (i7 920, 2.66GHz) when compiled with -O3 (a C compiler optimization directive) to process a 250M ‘fasta’ format file is about 7.4 seconds (average of 10 runs), so we’ll be using that as a baseline.
As we go the original program will be transformed into one that hopefully reads a bit easier and performs better as well.
The first re-factor will be a simple cleanup.
After turning on some compiler warnings (-Wall -Wextra) the first thing that jumps out is an unused variable ‘title’ in the ‘main’ function (happens to everybody when warnings are off) and the fact that characters (signed entities) are used as array indices. Changing the base type of the arrays and pointers to ‘unsigned char’ removes these warnings and takes care of a potential problem if the program were ever fed bytes > 0x7f, in which case it would start to refer to memory before the lookup table. You really don’t want to do that. C’s signed character type has tripped up many programmers, especially when used as an index. Main really has type ‘int’, so you should declare it as such and add a proper return value.
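To make that concrete, here is a minimal sketch of the pitfall; the names are mine, not the original program’s:

#include <stdio.h>

static int lookup[256];

int main(void) {
    char c = (char)0xE9;                 /* a byte > 0x7f arriving from the input          */
    /* if 'char' is signed (it usually is on x86), c is negative here and lookup[c]
       would read memory *before* the table: undefined behaviour */
    unsigned char u = (unsigned char)c;  /* the fix: index with an unsigned type           */
    printf("%d\n", lookup[u]);           /* always stays within lookup[0..255]             */
    return 0;
}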
As usual when dealing with programs that are undocumented other than a few brief comments here and there part of the fun in doing things like this is figuring out how it does what it does. This one is fairly simple but still, the principle holds, I like puzzles.
C guarantees that static global arrays start out as ‘0’, so you don’t need to clear them yourself and the for loop in the main function can go; you also don’t need to declare empty blocks, so the {} after the while can go as well.
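In other words, the declaration below (the same one the final listing uses) already gives you an all-zero table:

/* objects with static storage duration are zero-initialized by the language itself */
static char complement[256];   /* every entry starts out as 0, no clearing loop needed */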
First thing to hit is the loop that writes the output. This is a character-by-character loop, which is a bit inefficient the way it is written. The fasta format uses 60 characters to a line, so we can simply write out lines of 60 bytes in one go, adjusting the size for the last line to whatever is left. New runtime of the program is 5.2 seconds.
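Roughly what that output loop turns into; the names here are mine, and I’m assuming the sequence data sits in memory without linefeeds:

#include <stdio.h>

#define LINELEN 60

/* write 'len' bytes of sequence data as FASTA lines of at most 60 characters */
static void write_lines(const unsigned char *data, size_t len, FILE *out) {
    while (len > 0) {
        size_t n = len < LINELEN ? len : LINELEN;   /* last line may be shorter */
        fwrite(data, 1, n, out);
        fputc('\n', out);
        data += n;
        len -= n;
    }
}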
I don’t really like the way the feof / fgets bit at the beginning of the process function looks, so I’ve changed it a bit, using the return value of the fgets function as a guide to whether we’re at the end of the input or not. A bit more fiddling to the output loop to make it look nicer (obviously no speed gain, that’s just cosmetics). What is not cosmetics is that the title string is messed up if it is longer than 132 characters: the remaining characters would be dropped. This matters because the FASTA format does not allow lines > 120 characters, but if we receive input like this we have the tough choice of either keeping the input and performing the job or throwing an error. Silently discarding the excessive input seems wrong to me. If this were a contract job I’d ask for guidance at this point on what the company policy is in situations like these ;) Some more looking at that loop and you see that you can simply drop the title variable altogether and use the loop that scans for the ‘\n’ character to process the whole title. End-of-file handling in this program (and handling of bad input) is pretty weak, but those things are not specified so we’ll just leave them the way they are. For robust production code this would have to be substantially improved on.
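A sketch of what ‘use the scan for the newline’ amounts to (the helper name is mine): the title is echoed straight to the output, however long it is, so no fixed-size title buffer can truncate it.

#include <stdio.h>

/* copy the title line (including the trailing '\n') from input to output,
   character by character, so there is no title buffer to overflow or truncate */
static void copy_title(FILE *in, FILE *out) {
    int c;
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        if (c == '\n')
            break;
    }
}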
Next up, the read loop. The read loop is also character based; a minor tweak and instead of reading characters we’re reading lines using fgets. New runtime: 1.2 seconds. C has a nasty problem with looking ahead in the input stream: we need to know if the first character of the next line is a greater-than sign, because if it is, that’s the sign that we are looking at the beginning of a new sequence. This is done rather inelegantly using a getc / ungetc pair, which effectively allows you to peek one character ahead.
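That getc / ungetc pair boils down to something like this; the helper name is mine, the original simply inlines the two calls:

#include <stdio.h>

/* look at the next character in the stream without consuming it */
static int peek(FILE *f) {
    int c = getc(f);
    if (c != EOF)
        ungetc(c, f);
    return c;
}

/* usage: if (peek(stdin) == '>') the next line starts a new sequence */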
That 1024-byte buffer is a bit anemic; this is 2013, we’re processing 250M of data and we’re going to pull it all in anyway. Let’s use 256K buffers; we’ll still use exactly the same amount of memory plus, on average, half an empty 256K buffer. That’s a very small price to pay. Runtime now 0.97 seconds.
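In code that is a one-line change; the macro name below is only illustrative, the final listing uses BLOCKLEN and eventually grows it to 1M:

/* was something like: #define BUFLEN 1024 */
#define BUFLEN (256 * 1024)   /* 256K read buffers against 250M of input */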
So far we’ve been quite nice with our optimizations, no tricks, no gimmicks. The speed of the program is not fantastic but quite respectable.
But suppose your boss is at your desk breathing down your neck. It’s fast but really not yet fast enough, what can you do to make this code faster? ‘Use as much memory as you want’ he said when leaving…
Optimization is a funny game, the more you optimize the code the uglier it gets. Programs that have been optimized are - if you don’t stop where we’ve gotten so far - a lot harder to maintain and a lot harder to debug as well. So now we’re entering the twilight zone of optimizations, things that you really don’t want to do unless you’ve got an extremely good reason for doing them, and things that might subtly alter the way you can use your program, so you need to be vigilant against that. Since there is no clear specification here we have a lot of leeway but in a real world setting there would be all kinds of constraints.
A quick check of how fast the program runs when you disable ‘reverse_complement’ shows that this takes approximately 20% of the runtime. This means that even if we’d reduce that portion to ‘0’ there would still be 80% of the time used for other bits of work. Another quick check, cat’ing the file to /dev/null shows that this takes about 50 ms. Clearly raw IO speed is not what is holding us up.
That leaves the two loops that we’ve already optimized a bit as the main offenders.
One of the problems with those loops is that (unlike reverse_complement) we can’t do the work in parallel, so the usual ‘get out of jail free’ card does not help us here. If we were CPU bound for the larger part of the runtime the effect of multi-threading would be much more pronounced, but in this case we can never win more than 20% that way. Reading from an input stream is an inherently serial operation, and writing to an output stream is too. When you’re optimizing there are a bunch of trade-offs that you can make; the most common one is space versus speed. If memory is significantly larger than the largest possible input sequence you could read all of the sequence into one giant chunk of memory. That would mean a single read. The interesting possibility that approach offers (versus the block oriented approach used by the original program) is that the linefeeds would remain where they are; simply skipping them would be enough. (This skip requires two tests in the inner loop of the reverse_complement routine, which incurs a penalty, but that penalty is not as large as the one imposed by the way the linefeeds are processed now.) A single read and a single write could then do the whole job.
A very quick check, increasing the block length to something greater than the input file length, shows about 5% improvement, but we’re still going through those loops. I like the idea of a very fast scan; let’s see what we can do with that.
So, this is the first major departure from the original code. We start off with allocating buffers again, but this time we read huge chunks of data until all data has been read, and then we move all the data into one single buffer which we can then process in place.
To make reverse_complement a bit more universal we pass it two pointers to the data instead of the ‘block’ pointer.
Reading and writing the data now comes in at 0.22 seconds with 1M blocks. (reading from a file and writing to /dev/null).
Next up, the processing. We leave all the data in place, we don’t mess with the linefeeds (we just skip them) and we stuff it all into one giant buffer. Runtime: 0.44 seconds. 0.22 of that comes from the reading/writing, and the other half is the actual processing.
Note that this change means that the program will need an end-of-file at some point in order to function, and that end-of-file should arrive before we run out of memory. This is a serious drawback and we should really do something about that, because now the program is no longer up to spec. It may be fast, but if it contains a bug like that the speed is not relevant. Make it correct first, then make it fast!
The problem is (like with many text protocols) that we have absolutely no idea how much data to read to get the next working set into memory. The only thing we have to go on here is that single ‘>’. Scanning for that character is expensive! It requires a full pass through the data just to figure out where the one sequence ends and the next one starts. And since this is crucial information we can’t really skip this step. The good news is, we had that scan anyway so whatever the reader gains in complexity is lost by the processing step (which will now only process one sequence).
The final result is a runtime of about 0.75 seconds, quite acceptable, and this would put it #1 on the shootout if submitted. There are still two bits of low hanging fruit left to optimize this code further:
The first is very obvious. If we’re processing multiple sequences then (in the case of the original program) we can process blocks in parallel on multiple cores using threads. In the current incarnation of the program you’d apply that technique to a full sequence. As shown above this would yield at most a 20% improvement, my best guess is that with the input streams given the yield would be lower than that (about 7%, 1/3rd of 20% because there are 3 input sequences).
The second is a bit less obvious. The reader loop now reads in as much as it needs, or a bit more. Once detected, the bit more is saved for the next round of reading. Every time that scan takes place it re-scans from the beginning of the sequence found so far. That’s fast (because it is in memory), but still very wasteful (and because a sequence is larger than the CPU cache it is in fact very wasteful): you only need to scan the new part after you’ve done the initial scan. So you could improve on this by adding another scan in the first portion of the reader loop, just after it recycles the saved portion. For every iteration of the loop after that one you only scan the bit that you just read in.
Total memory consumption of this version is about twice as good as the best C version in the shootout, and elapsed time slightly better. Using the threading and scan tricks above memory consumption should be about the same and elapsed time should be significantly better.
In the words of Colin Wright: “You can’t make a computer do stuff faster, but you can make it do less work”, and that really is the essence of optimization.
The realloc is not as bad as it looks, since it is almost always the last allocated block growing it is nearly free.
C gives you a lot of leeway in how you want to express your code. Over the years I’ve settled on a style that is pretty defensive, this helps in avoiding some of the pitfalls in the C language. When you read the code below you might think ‘I can do this quicker using a pre-decrement’ or you might think ‘I can push these two separate statements onto a single line’. These things are all true, and in fact they might give you a small performance boost at times. But for me such optimizations are not worth it unless we’re talking about an inner loop where a substantial amount of time is spent. And even then I’d work hard to get rid of the number of times we iterate through the loop before resorting to such micro-optimizations trading readability for speed.
I hope you enjoyed this little exercise, I certainly did. The final code is below.
(Afterword: Joachim Schipper points out that memchr could be used to speed up the inner loop in the scanner for the ‘>’ character (see here); that’s important because it makes the inner loop both faster and more readable, and this is a good illustration of why you want code review.)
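For the curious, here is roughly what that could look like; this is my adaptation of the sequence_complete function from the listing below, not Joachim’s exact patch:

#include <stdlib.h>
#include <string.h>

/* same contract as sequence_complete below, but lets memchr do the scanning */
int sequence_complete_memchr(unsigned char *p, int n, unsigned char **saved, int *savedsize)
{
    unsigned char *next;

    if (n < 2)
        return 0;                       /* nothing beyond the current '>' yet */
    /* skip the leading '>' of the current title; we want the *next* one */
    next = memchr(p + 1, '>', (size_t)(n - 1));
    if (next == NULL)
        return 0;                       /* no complete sequence in the buffer yet */

    *savedsize = n - (int)(next - p);   /* excess belonging to the next sequence */
    *saved = malloc(*savedsize);
    memcpy(*saved, next, *savedsize);
    return 1;
}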
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCKLEN (1*1024*1024)
#define TRUE 1
#define FALSE 0
/* Initialised by calls to set_complement. Holds the complement codes. */
static char complement[256];
void set_complement(int c1, int c2) {
    complement [c1] = c2;
    complement [c2] = c1;
    complement [c1 | 0x20] = c2;
    complement [c2 | 0x20] = c1;
}
/*
 * this walks two pointers towards each other, every time one of the pointers
 * hits a linefeed the linefeed is skipped and the loop restarted. Whenever neither
 * pointer is on a linefeed (the most common case) the characters are complemented
 * and swapped.
 */
void reverse_complement(unsigned char * ptr1, unsigned char * ptr2) {
    unsigned char c;
    while (ptr1 <= ptr2) {
        if (*ptr1 == 10) {
            ptr1++;
            continue;
        }
        if (*ptr2 == 10) {
            ptr2--;
            continue;
        }
        c = complement[*ptr1];
        *ptr1++ = complement [*ptr2];
        *ptr2-- = c;
    }
}
/*
* Fasta files do not have any parser friendly features such as length fields. This means
* that if you want to know if a sequence is complete you have to scan for the next '>'
* character of the title following the current sequence. This is quite time consuming,
* and even if you find it you still have a problem because you'll have read *too much*
* data from the input stream. That excess is then saved in a separate buffer to be re-used
* at the beginning of the next round of reading.
*
* This loop is pretty slow, it takes up about 40% of the total runtime of the program.
* There is a clear way to improve it, which is to scan only the part that was just read
* in after the first pass through the loop in read_sequence. I'll leave that as an exercise
* for those that want to beat this version of the code, there is an easy #1 spot in the
* language shootout waiting for you if you add that feature ;) (and multithreading if
* you want to keep that #1 spot!)
*/
int sequence_complete(unsigned char * p, int n, unsigned char ** saved, int * savedsize)
{
    // skip the leading '>' for the title of the current block, we want the *next* block
    p++;
    n--;
    // now search the rest of the data for a sequence start
    while (n) {
        if (*p == '>') {
            // we've found a title for the next sequence, figure out how much excess we have
            // and save the remainder
            *saved = malloc(n);
            memcpy(*saved, p, n);
            *savedsize = n;
            return TRUE;
        }
        p++;
        n--;
    }
    return FALSE;
}
// read in a sequence, if we read in more than one whole one save the excess
unsigned char * read_sequence(int * s, unsigned char ** saved, int * nsaved) {
    int size;
    unsigned char * p;
    int read;
    int n;
    if (*saved == NULL) {
        // first round or a really lucky break
        size = BLOCKLEN;
        p = malloc(BLOCKLEN);
        read = 0;
    }
    else {
        // round which starts from excess data read during a previous round
        size = *nsaved;
        p = *saved;
        *saved = NULL;
        read = *nsaved;
        size <<= 1; // fix for bug spotted by Robert G. Jakabosky
    }
    while (!feof(stdin) || read) {
        p = realloc(p, size);
        n = fread(p + read, 1, size - read, stdin);
        read += n;
        // if the sequence is now complete save the excess for the next
        // iteration and process this sequence. Like that we don't need
        // to read all the way to end-of-file before we can start processing
        // if we have read more than one whole sequence this still works, it
        // just means that we will re-visit this on the next round through
        // the reader to fetch another whole sequence.
        if (sequence_complete(p, read, saved, nsaved)) {
            read -= *nsaved; // compensate for saved portion
            break;
        }
        // if we did not read any data we can return now
        if (n == 0) {
            break;
        }
        // every time we reach the end of the block we scale up
        if (read == size) {
            size <<= 1;
        }
    }
    *s = read;
    return p;
}
void process_sequence(unsigned char * p, int size) {
    unsigned char * start;
    // skip the block title
    start = (unsigned char*)strchr((char*)p, '\n') + 1;
    // figure out start and end of this sequence and do the complement
    reverse_complement(start, start + size - (start - p) - 1);
}
void setup() {
    set_complement ('A', 'T');
    set_complement ('C', 'G');
    set_complement ('M', 'K');
    set_complement ('R', 'Y');
    set_complement ('W', 'W');
    set_complement ('S', 'S');
    set_complement ('V', 'B');
    set_complement ('H', 'D');
    set_complement ('N', 'N');
}
/*
* process sequences one-by-one until we're out of sequences and end-of-file
* has been reached on the input and the readahead buffer is empty
*/
void process() {
    unsigned char * readahead = NULL;
    int nreadahead;
    unsigned char * p;
    int s;
    while (readahead != NULL || !feof(stdin)) {
        p = read_sequence(&s, &readahead, &nreadahead);
        process_sequence(p, s);
        fwrite(p, 1, s, stdout);
        free(p);
    }
}
int main() {
    setup();
    process();
    return 0;
}
Finally, this code is still terrible! Yes, really. It is fragile because it does not deal well with error conditions such as broken input, and many others besides. Because the original does not, and the specification doesn’t give any requirements, I left it that way, but I would never ever consider a piece of code like this finished. At a minimum it would require checks on all system / library call return values, unit tests for every function and so on. Even a ‘trivial’ little program like this is not trivial if you plan on putting it into a real world situation where people will run their business based on your code.
One thing that struck me about the FASTA format is that there is absolutely no attempt to detect/correct errors in the data; for a file format that stores information where mutations are not without effect, you’d hope they would at least put in some safeguards to make sure that mutations are of the biological variety rather than the digital one (bitrot).
Another thing that I noticed when I was finished and looked through the language shootout entries was OCaml #3, about 3 times as slow as the C version here, but only 12.8M of ram in use, I have absolutely no idea how it does that!