How to make this sed script faster?

My testing indicated that sed can become CPU bound pretty easily on something like this. If you have a multi-core machine you can try spawning off multiple sed processes with a script that looks something like this:

#!/bin/sh
INFILE=data.txt
OUTFILE=fixed.txt
SEDSCRIPT=script.sed
SPLITLIMIT=`wc -l $INFILE | awk '{print $1 / 20}'`

split -d -l $SPLITLIMT $INFILE x_

for chunk in ls x_??
do
  sed -f $SEDSCRIPT $chunk > $chunk.out &
done

wait 

cat x_??.out >> output.txt

rm -f x_??
rm -f x_??.out

Try changing the first two lines to:

s/[ \t]*|[ \t]*/|/g

The best I was able to do with sed, was this script:

s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.

However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.

In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.

I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).

Edit: The original version had a minor bug that caused it to possibly print the wrong thing at the end of the output (e.g. could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen() that gave it another slight performance boost.