Vasily Volkov (UC Berkeley): Unrolling parallel loops

Loop unrolling is not only good for sequential programming; it has similarly dramatic effects in highly parallel codes. See Unrolling parallel loops (local copy), and also #pragma unroll in the NVIDIA CUDA programming guide.

Some bullet points from the presentation:

• More resources consumed per thread

• Note: each load costs 2 arithmetic instructions
  • 32 banks vs 32 streaming processors
  • But the banks run at half clock rate
• These 3 loads are 6x more expensive than 1 FMA

• Simple optimization technique
• Resembles loop unrolling
• Often results in 2x speedup
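
The bullets above do not spell out the technique itself, so here is a minimal sketch of one common reading of "unrolling a parallel loop": each thread computes several outputs instead of one. It is an illustration under stated assumptions, not code from the presentation. The kernel name `saxpy_unrolled`, the helper `saxpy_unrolled_max_error`, the constant `ELEMS_PER_THREAD`, and the choice of SAXPY as the workload are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical unroll factor: number of elements each thread processes.
constexpr int ELEMS_PER_THREAD = 4;

// y[i] = a * x[i] + y[i], with each thread handling ELEMS_PER_THREAD elements.
__global__ void saxpy_unrolled(int n, float a, const float* x, float* y)
{
    // First element owned by this thread. In every unrolled step, consecutive
    // threads touch consecutive addresses, so accesses stay coalesced.
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;

    // The trip count is a compile-time constant, so the compiler can fully
    // unroll the loop and overlap the independent loads and FMAs.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }
}

// Host helper (hypothetical): runs the kernel on test data and returns the
// largest absolute difference from a CPU reference, or -1.0f on a CUDA error.
float saxpy_unrolled_max_error(int n)
{
    if (n <= 0) return 0.0f;

    std::vector<float> hx(n), hy(n), ref(n);
    for (int i = 0; i < n; ++i) {
        hx[i]  = 0.5f * i;
        hy[i]  = 1.0f;
        ref[i] = 2.0f * hx[i] + hy[i];
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dx = nullptr;
    float* dy = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1.0f; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Each block covers threads * ELEMS_PER_THREAD elements, so fewer blocks
    // are launched than in the one-element-per-thread version.
    const int threads   = 256;
    const int per_block = threads * ELEMS_PER_THREAD;
    const int blocks    = (n + per_block - 1) / per_block;
    saxpy_unrolled<<<blocks, threads>>>(n, 2.0f, dx, dy);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        max_err = std::fmax(max_err, std::fabs(hy[i] - ref[i]));
    }
    return max_err;
}
```

Calling `saxpy_unrolled_max_error(1 << 20)` from a `main` function checks the kernel against the CPU reference. Whether a factor of 4 helps depends on the GPU and on register pressure, which ties back to the first bullet: each thread consumes more resources, so the factor is worth tuning rather than assuming.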

Vasily Volkov's homepage used to offer more information on CUDA optimizations (the link is now dead).

Cédric Augonnet, Samuel Thibault and Raymond Namyst call Vasily Volkov a “CUDA-hero” in How to get portable performance on accelerator-based platforms without the agonizing pain.

In a similar vein, Dr. Mark Harris describes the beneficial effect of unrolling in parallel reduction.

