Filippo Mantovani: ARM for HPC

On 23-Oct-2017 Filippo Mantovani gave a talk in Darmstadt on “Mobile technology for production-ready high-performance computing systems: The path of the Mont-Blanc project”. Unfortunately I was unable to attend, but Mr. Mantovani sent me his Darmstadt Seminar slides. Because his slides and documents are of great interest to anyone using or intending to use ARM in HPC, I have copied them here so that they are easily available. I have also copied the report “MB3_D6.4 Report on application tuning and optimization on ARM platform”.

Slides on OpenMP by Christian Terboven & Dirk Schmidl

Christian Terboven and Dirk Schmidl from the IT Center of RWTH Aachen presented a set of slide decks on OpenMP:

  1. Introduction to OpenMP
  2. OpenMP Tasking In Depth
  3. OpenMP Recap
  4. OpenMP and Performance
  5. Advanced OpenMP Features

Some very striking slides are reproduced here.

openMP-ForkJoin

openMP-DataSharing

openMP-DataSharingAttrib

openMP-DataSharingAttrib2

openMP-Worksharing

openMP-Reduction

openMP-Tasks

Not directly related to OpenMP, but a good visual description of the latencies within a CPU:
openMP-Latency
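
As a rough illustration of what the worksharing, data-sharing and reduction slides describe, here is a minimal C sketch. It is not taken from the slides, and the function name parallel_sum is made up for this example.

```c
#include <stddef.h>

/* Sum the elements x[0..n-1].
 * Worksharing: the iterations of the loop are divided among the threads.
 * Data sharing: x and n are shared, the loop index i is private per thread.
 * Reduction: each thread adds into its own private copy of sum, and the
 * private copies are combined with + when the loop ends. */
double parallel_sum(const double *x, long n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i];
    }

    return sum;
}
```

With GCC this needs -fopenmp (icc uses -qopenmp). Without the flag the pragma is ignored and the loop runs serially, which gives the same sum up to floating-point rounding.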

Output of lstopo from hwloc

This is the output of

lstopo --of png > ~/tmp/lstopo.png

for a machine with an eight-core AMD FX-8120 (Bulldozer architecture); see AMD Bulldozer CPU Architecture Overview.

lstopo

Typing just lstopo shows the same diagram in a separate window. lstopo is part of the hwloc package, both in Arch and in Ubuntu.

Below is the output for an Intel NUC with a 4th-generation (Haswell) Core i5-4250U:
lstopoNUC

See Vol 1 datasheet, Vol 2 datasheet.

Added 06-Jan-2018: Below is the output for a Skylake i7-6600U in an HP EliteBook notebook:

Georg Hager’s Blog: Intel vs. GCC for the OpenMP vector triad: Barrier shootout!

Georg Hager’s Blog has an illustrative article on icc versus g++ performance with respect to OpenMP. Dr. Georg Hager is one of the authors of Introduction to High Performance Computing for Scientists and Engineers.

Measurement of

double precision, dimension(N) :: a,b,c,d
! initialization etc. omitted
s = walltime()
!$omp parallel private(R,i)
do R=1,NITER
!$omp do
  do i=1,N
    a(i) = b(i) + c(i) * d(i)
  enddo
!$omp end do
enddo
!$omp end parallel
e=walltime()
! R is private inside the parallel region, so use NITER for the repetition count
MFlops = NITER*N/(e-s)/1.e6

gives

icc versus g++
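
For readers who prefer C, the measurement loop above could be sketched roughly as follows. This is a transcription made for illustration, not code from Georg Hager's article. The function name triad_rate and the C11-based walltime helper are assumptions, and like the Fortran formula it counts one unit per inner-loop iteration.

```c
#include <time.h>

/* Wall-clock time in seconds via C11 timespec_get.
 * Stand-in for the walltime() routine, which the fragment does not show. */
static double walltime(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Repeat the vector triad a[i] = b[i] + c[i] * d[i] niter times over
 * n elements and return millions of inner-loop iterations per second. */
double triad_rate(double *a, const double *b, const double *c,
                  const double *d, long n, long niter)
{
    double s = walltime();

    #pragma omp parallel
    {
        /* r is declared inside the region, so every thread has its own. */
        for (long r = 0; r < niter; r++) {
            /* Worksharing loop; it ends with an implicit barrier. */
            #pragma omp for
            for (long i = 0; i < n; i++) {
                a[i] = b[i] + c[i] * d[i];
            }
        }
    }

    double e = walltime();
    return (double)niter * (double)n / (e - s) / 1.0e6;
}
```

Every pass through the inner loop ends in an implicit barrier, so for small n the measured time reflects barrier cost more than arithmetic. That is presumably what the "barrier shootout" in the title refers to. For very short runs the clock resolution may make the result meaningless, so n and niter should be chosen large enough.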