This post has moved to eklausmeier.goip.de/blog/2015/01-17-setting-up-slurm-on-multiple-machines.
SLURM (Simple Linux Utility for Resource Management) is a job scheduler for Linux. It is most useful on a cluster of computers, but it can be put to good use even if you have only two or three machines.
SLURM is used by Sunway TaihuLight, with 10,649,600 computing cores, and by Tianhe-2, with 3,120,000 Intel Xeon E5 2.2 GHz cores (not CPUs); Tianhe-2 is in effect a combination of 16,000 compute nodes. These two Chinese machines were the two most powerful supercomputers in the world up to November 2017, when they were surpassed by Oak Ridge National Laboratory's Summit and LLNL's Sierra. TaihuLight is located in Wuxi, Tianhe-2 in Guangzhou, China. Europe's fastest supercomputer, Piz Daint, uses SLURM, as does the fastest supercomputer in the Middle East, at KAUST. SLURM is also used by the Odyssey supercomputer at Harvard, with 2,140 nodes equipped with AMD Opteron 6376 CPUs, and by Frankfurt's supercomputer, with 862 compute nodes, each equipped with AMD Opteron 6172 processors. Munich's SuperMUC-NG uses SLURM, as does Barcelona's supercomputer. Many other HPC sites listed in the Top500 use SLURM.
What is the point of a job scheduler? While a single operating system, like Linux, can manage jobs on a single machine, a job scheduler like SLURM can distribute jobs across many machines, thereby evening out the load on all of them.
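To give a flavor of how such distributed jobs are described, here is a minimal sketch of a SLURM batch script. The job name, resource limits, and the choice of `hostname` as the workload are illustrative assumptions, not part of this post; `sbatch`, `srun`, and the `#SBATCH` directives are standard SLURM.

```shell
#!/bin/bash
# Minimal SLURM batch script (illustrative values).
#SBATCH --job-name=hello       # name shown in the queue (assumption)
#SBATCH --nodes=1              # run on a single node
#SBATCH --ntasks=1             # one task
#SBATCH --time=00:05:00        # wall-clock limit of five minutes

# srun launches the task under SLURM's control; here it just
# prints the name of the node the scheduler picked for us.
srun hostname
```

You would submit this with `sbatch hello.sh`, watch it in the queue with `squeue`, and find its output in a `slurm-<jobid>.out` file in the submission directory. The scheduler, not the user, decides which machine actually runs the job.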