Parallelization and CPU Cache Overflow

In the post Rewriting Perl to plain C the runtimes of the serial runs were reported. As expected, the C program was a lot faster than the Perl script. Running the programs in parallel, however, showed two unexpected behaviours: (1) more parallel instances can degrade runtime, and (2) running unoptimized programs can be faster.

See also CPU Usage Time Is Dependant on Load.

In the following we use the C program siriusDynCall and the Perl script siriusDynUpro, which were described in the above-mentioned post. The program or script reads roughly 3GB of data. Before starting the program or script, all this data has already been read into memory by using something like wc or grep.

1. AMD Processor. Running 8 parallel instances, s=size=8, p=partition=1(1)8:

for i in 1 2 3 4 5 6 7 8; do time siriusDynCall -p$i -s8 * > ../resultCp$i & done
real 50.85s
user 50.01s
sys 0

Merging the results with the sort command takes a negligible amount of time:

sort -m -t, -k3.1 resultCp* > resultCmerged
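
To illustrate what the merge step does, here is a minimal sketch on invented toy data. The file names part1 and part2 and their contents are assumptions for illustration only, not the real result files. Each input must already be sorted on the key, which is why sort -m only has to interleave lines instead of sorting again.

```shell
# Toy data standing in for the per-partition result files (contents are invented).
tmpdir=$(mktemp -d)
printf 'a,1,apple\nc,3,cherry\n' > "$tmpdir/part1"
printf 'b,2,banana\nd,4,date\n'  > "$tmpdir/part2"

# -m merges already sorted inputs, -t, sets comma as field separator,
# -k3.1 starts the sort key at the first character of field 3.
# LC_ALL=C pins the collation order so the result does not depend on the locale.
merged=$(LC_ALL=C sort -m -t, -k3.1 "$tmpdir/part1" "$tmpdir/part2")
printf '%s\n' "$merged"

rm -r "$tmpdir"
```

The four lines come out interleaved in order of their third field (apple, banana, cherry, date).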

Best results are obtained when running just s=4 instances in parallel:

$ for i in 1 2 3 4 ; do /bin/time -p siriusDynCall -p$i -s4 * > ../dyn4413c1p$i & done
real 33.68
user 32.48
sys 1.18
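
Both loops above time each background instance separately. If the wall-clock time of the whole batch is of interest, one possible approach is to wrap the loop in a function that waits for all instances. The following is only a sketch: the function name run_partitions and its argument order are hypothetical, and it assumes the program accepts -p and -s before the file arguments, as in the calls above.

```shell
# Hypothetical helper: run_partitions N PREFIX CMD [FILES...]
# Starts N instances of CMD in the background, one per partition,
# writes the output of partition i to PREFIXi, and returns once all have finished.
# The body is a subshell ( ... ), so the variables do not leak into the caller.
run_partitions() (
    n=$1
    prefix=$2
    cmd=$3
    shift 3
    i=1
    while [ "$i" -le "$n" ]; do
        "$cmd" -p"$i" -s"$n" "$@" > "${prefix}${i}" &
        i=$((i + 1))
    done
    wait    # block until every background instance has exited
)
```

In bash, where time is a shell keyword, something like time run_partitions 4 ../resultCp siriusDynCall * would then report one set of figures for the whole batch of four instances.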


Operation Costs Measured in CPU Clock Cycles

I have written about the merits of 64-bit vs 32-bit, and in AMD Bulldozer CPU Architecture Overview. An article on ithare.com gives a very good overview of the relative performance of various CPU operations. Below is the graphic:

Not all CPU operations are created equal

Also see the often quoted Agner Fog: Instruction Tables.

Running CPU/GPU Intensive Jobs on Titan Supercomputer

There is an INCITE program (HPC Call for Proposals) where one can apply for computer time for CPU/GPU-intensive jobs; the link is INCITE.

From the FAQ: The INCITE program is open to US- and non-US-based researchers and research organizations needing large allocations of computer time, supporting resources, and data storage to pursue transformational advances in science and engineering.

The machines in question: Mira and Titan.

Counting to Ten on Linux

A very good article on timing CPU-bound applications.

Random ASCII - tech blog of Bruce Dawson

I recently discovered a Linux shell script that was running slowly due to an inefficiently implemented loop. This innocent investigation ended up uncovering misleading information from time and a bad interaction between the Linux thread scheduler and its CPU power management algorithms.
