|Rank|Publication|Research area|
|---|---|---|
|3063rd/3113|Giorgino et al., J. Chem. Theory Comput. 2012|cancer|
|3389th/5798|Sadiq et al., PNAS 2012|hiv|
|1801st/1995|Venken et al., JCTC 2013|hiv|
|1886th/3349|Buch et al., JCIM 2013|cancer|
|3741st/4477|Pérez-Hernández et al., JCP 2013|methods|
|564th/2163|Bisignano et al., JCIM 2014|methods|
|369th/1283|Doerr et al., JCTC 2014|methods|
|573rd/2838|Stanley et al., Nat Commun 2014|cancer|
|1453rd/3611|Ferruz et al., JCIM 2015|methods|
|1313th/4128|Ferruz et al., Sci Rep 2016|brain|
|3083rd/4815|Stanley et al., Sci Rep 2016|cancer|
|4319th/4730|Noe et al., Nat Chem 2017|methods|
Parallel Programming and Applied Mathematics, PPAM for short, is a biennial conference started in 1994, with the proceedings published by Springer in the Lecture Notes in Computer Science series, see PPAM. It is sponsored by IBM, Intel, Springer, AMD, RogueWave, and HP. The last conference had a fee of 420 EUR.
It is held in conjunction with the 6th Workshop on Language-Based Parallel Programming.
Prominent speakers are:
Day 2 of the conference featured the talks below. Prof. Dr. Bernd Brügmann gave a short introduction. He pointed out that Jena ranks number 10 in Physics in Germany, has about 100,000 inhabitants, and 20,000 students.
- Dr. Karl Rupp, Vienna: Lessons Learned in Developing the Linear Algebra Library ViennaCL. Notes: C++ operator overloading normally creates temporaries, and special trickery is necessary to circumvent this; ViennaCL is not callable from Fortran due to C++ operator overloading; see eigen.tuxfamily.org and Karl Rupp's slides; with CUDA 5 and 6, OpenCL and CUDA are more or less on par.
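The temporaries issue can be circumvented with expression templates, the approach libraries like Eigen take. Below is a minimal sketch under my own naming (`Vec` and `VecAdd` are hypothetical illustration types, not ViennaCL's) showing how `x = a + b` can be evaluated in a single loop without an intermediate vector:

```cpp
#include <cstddef>
#include <vector>

// Expression-template sketch: a + b builds a lightweight proxy object
// instead of a temporary vector; the actual arithmetic happens
// element-wise only when the result is assigned.
struct Vec;

struct VecAdd {                       // proxy for "a + b", holds references only
    const Vec &a, &b;
    double operator[](std::size_t i) const;
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    Vec& operator=(const VecAdd& e) {       // single loop, no temporary Vec
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = e[i];
        return *this;
    }
};

inline double VecAdd::operator[](std::size_t i) const { return a[i] + b[i]; }
inline VecAdd operator+(const Vec& a, const Vec& b) { return VecAdd{a, b}; }

int main() {
    Vec a(1000), b(1000), x(1000);
    x = a + b;                        // no temporary vector is ever allocated
    return 0;
}
```

The key point is that `operator+` returns only a cheap proxy; the loop in the assignment operator does all the work in one pass.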
- Prof. Dr. Rainer Heintzmann, Jena: CudaMat – a toolbox for CUDA computations. Continue reading
As announced in Workshop Programming of Heterogeneous Systems in Physics, July 2014, I attended this two-day conference in Jena, Germany. Below are the speakers and pictures, together with my personal notes.
The last workshop from 2011 had two high-profile speakers regarding CUDA technology: Dr. Timo Stich and Vasily Volkov. The 2014 workshop features Dr. Stich, Dr. Karl Rupp, Hans Pabst (Intel), and others.
Addendum 06-Aug-2014: See my notes taken at the workshop in
- Day 1, Workshop Programming of Heterogeneous Systems in Physics
- Day 2, Workshop Programming of Heterogeneous Systems in Physics
This article “CUDA-Enabled GPUs” made me check my GPU again using the deviceQuery sample:
```text
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          6.0 / 5.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1023 MBytes (1072889856 bytes)
  ( 7) Multiprocessors x ( 48) CUDA Cores/MP:    336 CUDA Cores
```

Continue reading
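The same data can also be queried programmatically via the CUDA runtime API; a minimal sketch using `cudaGetDeviceCount()` and `cudaGetDeviceProperties()`, compiled with nvcc:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: print the core fields that deviceQuery reports.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Detected %d CUDA capable device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: \"%s\", compute capability %d.%d, %zu MBytes\n",
               i, p.name, p.major, p.minor, p.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```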
hgpu.org contains links to reviews, tutorials, research papers, and program packages concerning various aspects of graphics and non-graphics (general-purpose) computing on GPUs and related parallel architectures (FPGAs, Cell processors, etc.). The majority is on NVidia, see picture below.
Recent GPUGRID tasks, like 27x0-SANTI_RAP74wtCUBIC, really keep my NVidia GTX 560 hot, i.e., at 70 °C or higher.
```text
Fri Sep  6 17:16:38 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.325.15   Driver Version: 325.15         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 560     Off  | 0000:01:00.0     N/A |                  N/A |
| 92%   76C  N/A     N/A /  N/A |    866MB /  1023MB   |     N/A      Default |
+-------------------------------+----------------------+----------------------+
```

Continue reading
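nvidia-smi reads these values through NVML; for watching the temperature from one's own code, a minimal sketch (link with -lnvidia-ml) could look like this:

```cpp
#include <cstdio>
#include <nvml.h>

// Minimal sketch: read the GPU temperature via NVML, the library
// behind nvidia-smi. Link with -lnvidia-ml.
int main() {
    unsigned int temp = 0;
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
        printf("GPU 0 temperature: %u C\n", temp);
    nvmlShutdown();
    return 0;
}
```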
I ran the commands below under different loads on my Gigabyte GTX 560 graphics card.
```sh
export LD_LIBRARY_PATH=$CUDA_PATH/lib64
time /usr/local/cuda/samples/sdk/0_Simple/matrixMul/matrixMul
time /usr/local/cuda/samples/sdk/0_Simple/matrixMulCUBLAS/matrixMulCUBLAS
```
I was interested in the GFlop/s values.
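The GFlop/s figure follows directly from the matrix dimensions: an N×N matrix multiplication performs roughly 2·N³ floating-point operations. A sketch of how one could time cublasSgemm with CUDA events and derive GFlop/s (error checking omitted; the matrices are deliberately left uninitialized, since only the timing matters here):

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: time an N x N SGEMM and report GFlop/s (2*N^3 flops).
int main() {
    const int N = 1024;
    const float alpha = 1.0f, beta = 0.0f;
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));

    cublasHandle_t h;
    cublasCreate(&h);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // flops / (ms * 1e6) = GFlop/s
    printf("%.1f GFlop/s\n", 2.0 * N * N * N / (ms * 1e6));

    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```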
There is an INCITE program (HPC Call for Proposals) where one can apply for CPU/GPU-intensive jobs; the link is INCITE.
From the FAQ: The INCITE program is open to US- and non-US-based researchers and research organizations needing large allocations of computer time, supporting resources, and data storage to pursue transformational advances in science and engineering.
Loop unrolling is not only good for sequential programming; it has similarly dramatic effects in highly parallel code as well, see Unrolling parallel loops (local copy), and also #pragma unroll in the NVidia CUDA programming guide.
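A minimal CUDA sketch of the pragma: because the trip count is a compile-time constant, the compiler can fully unroll the loop and eliminate counter and branch overhead:

```cpp
// Sketch: #pragma unroll on a fixed-trip-count loop. Each thread
// accumulates 8 consecutive elements; the fully unrolled loop has
// no loop counter or branch in the generated code.
__global__ void sum8(const float* in, float* out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    if (i + 8 <= n) {
        float s = 0.0f;
        #pragma unroll
        for (int k = 0; k < 8; ++k)
            s += in[i + k];
        out[i / 8] = s;
    }
}
```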
Some bullet points of the presentation (a sketch of the technique follows the list):
- More resources consumed per thread
- Note: each load costs 2 arithmetic instructions
- 32 banks vs 32 streaming processors
- But run at half clock rate
- These 3 loads are 6x more expensive than 1 FMA
- Simple optimization technique
- Resembles loop unrolling
- Often results in 2x speedup
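My reading of these bullets, as a hedged sketch: let each thread compute several independent outputs, so several independent FMAs are in flight per thread and the expensive loads are amortized over more arithmetic, which is why the technique resembles loop unrolling:

```cpp
// Sketch of the slides' idea: each thread produces 4 outputs instead
// of 1, so 4 independent FMAs are in flight per thread and the cost
// of the loads is amortized over more arithmetic.
__global__ void saxpy4(int n, float a, const float* x, float* y) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 4 <= n) {
        float r0 = y[i],     r1 = y[i + 1],
              r2 = y[i + 2], r3 = y[i + 3];
        r0 += a * x[i];     r1 += a * x[i + 1];   // independent FMAs
        r2 += a * x[i + 2]; r3 += a * x[i + 3];
        y[i] = r0; y[i + 1] = r1; y[i + 2] = r2; y[i + 3] = r3;
    }
}
```

Whether the 2x speedup materializes depends on register pressure, which is exactly the "more resources consumed per thread" caveat above.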
On Vasily Volkov's homepage you can find more information on CUDA optimizations.
Cédric Augonnet, Samuel Thibault, and Raymond Namyst call Vasily Volkov a “CUDA-hero” in How to get portable performance on accelerator-based platforms without the …
Instead of starting X (and thereby loading the NVidia CUDA environment), one can simply add the following, e.g., to a boot script:
```sh
[ -c /dev/nvidia0   ] || mknod -m 666 /dev/nvidia0   c 195 0
[ -c /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
```