Vasily Volkov (UC Berkeley): Unrolling parallel loops

This post has moved to eklausmeier.goip.de/blog/2013/06-01-vasily-volkov-uc-berkeley-unrolling-parallel-loops.

Loop unrolling is not only good for sequential programming; it has similarly dramatic effects in highly parallel code. See Unrolling parallel loops (local copy) and #pragma unroll in the NVIDIA CUDA programming guide.
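To make the idea concrete, here is a minimal sketch of a kernel in which each thread processes several elements and the per-thread loop is unrolled with #pragma unroll. It is not code from Volkov's presentation: the kernel name saxpy_unrolled, the unroll factor UNROLL = 4 and the problem size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int UNROLL = 4;  // hypothetical unroll factor, chosen for illustration

// Each thread handles UNROLL elements instead of one. The elements are
// blockDim.x apart, so neighbouring threads still touch neighbouring
// addresses and the loads stay coalesced.
__global__ void saxpy_unrolled(int n, float a, const float* x, float* y)
{
    int base = blockIdx.x * blockDim.x * UNROLL + threadIdx.x;

    // The trip count is a compile-time constant, so the compiler can fully
    // unroll the loop. The iterations are independent of each other.
    #pragma unroll
    for (int k = 0; k < UNROLL; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
}

int main()
{
    const int n = 1 << 20;
    const float a = 2.0f;
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 3.0f; }

    // Fewer blocks than in the one-element-per-thread version:
    // every block now covers threads * UNROLL elements.
    const int threads = 256;
    const int per_block = threads * UNROLL;
    const int blocks = (n + per_block - 1) / per_block;
    saxpy_unrolled<<<blocks, threads>>>(n, a, x, y);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel failed\n");
        return EXIT_FAILURE;
    }

    // Every element should be 2 * 1 + 3 = 5.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (y[i] != 5.0f) ++errors;
    printf("%s\n", errors == 0 ? "ok" : "mismatch");

    cudaFree(x);
    cudaFree(y);
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Whether this is faster than the one-element-per-thread version depends on the kernel and the GPU. The price is more registers per thread, so it is worth measuring a few values of UNROLL.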

Some bullet points of the presentation:

• More resources consumed per thread

Note: each load costs 2 arithmetic instructions
• 32 banks vs 32 streaming processors
• But the banks run at half the clock rate
These 3 loads are 6x more expensive than 1 FMA (3 loads x 2 arithmetic instructions each)

Conclusion:
• Simple optimization technique
• Resembles loop unrolling
• Often results in 2x speedup

Vasily Volkov's homepage has more information on CUDA optimizations (the link is now dead).

Cédric Augonnet, Samuel Thibault and Raymond Namyst call Vasily Volkov a “CUDA-hero” in How to get portable performance on accelerator-based platforms without the agonizing pain.

In a similar vein, Dr. Mark Harris describes the beneficial effect of unrolling in parallel reduction.
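As a rough illustration of unrolling in a reduction, the sketch below sums an array with a shared-memory tree whose loops have compile-time trip counts, so #pragma unroll can remove the loop overhead. It is an assumption of this sketch, not necessarily what Harris's material does, that the last warp is reduced with __shfl_down_sync. The names BLOCK, warp_sum and block_sum are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block; assumed a power of two >= 64

// Sum the 32 values held by one full warp. The loop always runs 5 steps,
// so it can be unrolled completely. Lane 0 ends up with the warp total.
__device__ float warp_sum(float v)
{
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Must be launched with exactly BLOCK threads per block.
// *out must be zero before the launch.
__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * BLOCK * 2 + tid;

    // Each thread adds two input elements while loading.
    float v = 0.0f;
    if (i < n) v += in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    s[tid] = v;
    __syncthreads();

    // Shared-memory tree down to 32 partial sums. BLOCK is a constant,
    // so the trip count is known at compile time and the loop can unroll.
    #pragma unroll
    for (int stride = BLOCK / 2; stride >= 32; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    // The first warp finishes the job without further barriers.
    if (tid < 32) {
        float w = warp_sum(s[tid]);
        if (tid == 0) atomicAdd(out, w);
    }
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    const int blocks = (n + BLOCK * 2 - 1) / (BLOCK * 2);
    block_sum<<<blocks, BLOCK>>>(in, out, n);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel failed\n");
        return EXIT_FAILURE;
    }

    // All inputs are 1, so the sum is n; every partial sum is a small
    // integer and therefore exact in single precision.
    bool ok = (*out == static_cast<float>(n));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The final atomicAdd keeps the sketch to a single kernel launch. A second reduction pass over per-block results would be an equally valid design.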