In the post Rewriting Perl to plain C the runtimes of the serial runs were reported. As expected, the C program was a lot faster than the Perl script. Running the programs in parallel, however, showed two unexpected behaviours: (1) adding more parallel instances can degrade runtime, and (2) running unoptimized programs can be faster.
See also CPU Usage Time Is Dependant on Load.
In the following we use the C program siriusDynCall and the Perl script siriusDynUpro, both described in the above-mentioned post. The program and the script read roughly 3 GB of data. Before starting them, all this data had already been read into memory by using something like
1. AMD Processor. Running 8 parallel instances, s=size=8, p=partition=1(1)8:
for i in 1 2 3 4 5 6 7 8; do time siriusDynCall -p$i -s8 * > ../resultCp$i & done
real 50.85s user 50.01s sys 0
Merging the results with the sort command takes a negligible amount of time:
sort -m -t, -k3.1 resultCp* > resultCmerged
Best results are obtained when running just s=4 instances in parallel:
$ for i in 1 2 3 4 ; do /bin/time -p siriusDynCall -p$i -s4 * > ../dyn4413c1p$i & done
real 33.68
user 32.48
sys 1.18