Setting up SLURM on Multiple Machines

SLURM is a job-scheduler for Linux. It is most useful on a cluster of computers, but it can be put to good use even if you only have two or three machines. SLURM is an acronym for Simple Linux Utility for Resource Management.

SLURM is used by Tianhe-2, at the time of writing the most powerful supercomputer in the world, with 3,120,000 Intel Xeon E5 2.2 GHz cores (not CPUs). It is in effect a combination of 16,000 compute nodes. Tianhe-2 is located in Guangzhou, China. SLURM is also used by the Odyssey supercomputer at Harvard, which has 2,140 nodes equipped with AMD Opteron 6376 CPUs. Frankfurt’s supercomputer, with 862 compute nodes each equipped with AMD Opteron 6172 processors, also uses SLURM, as do many other HPC sites.

What is the point of a job-scheduler? While a single operating system, like Linux, can manage jobs on a single machine, the SLURM job-scheduler can distribute jobs across many machines, thereby evening out the load on all of them.

Installing the software is pretty simple as it is part of Ubuntu, Debian, and many other distros.

apt-get install munge
apt-get install slurm-llnl

These packages also add the users munge and slurm, respectively.

Configuring munge goes like this:

create-munge-key

This command creates /etc/munge/munge.key, which is readable by the user munge only. This file must be copied to the same place, i.e., /etc/munge/munge.key, on all machines. Once this cryptographic key is created for munge, one can start the munge daemon munged and test whether it works:

klm@chieftec:~$ echo Hello, world | munge | unmunge 
STATUS:           Success (0)
ENCODE_HOST:      chieftec (127.0.1.1)
ENCODE_TIME:      2014-11-07 22:37:30 +0100 (1415396250)
DECODE_TIME:      2014-11-07 22:37:30 +0100 (1415396250)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              klm (1000)
GID:              klm (1000)
LENGTH:           13

Hello, world

It is important to use one and the same /etc/munge/munge.key file on all machines, and not to run create-munge-key on each machine.
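Copying the key around could be sketched as follows. This is only an illustration: the function name distribute_munge_key is made up here, and it assumes root SSH access to the other nodes, which the setup above does not require as such.

```shell
#!/usr/bin/env bash
# Sketch: copy one munge.key to every other node.
# Assumptions: root SSH access to each node; the function name is hypothetical.

KEY=/etc/munge/munge.key

distribute_munge_key() {
    local host
    for host in "$@"; do
        # Copy the key to the same path on the remote machine.
        scp -p "$KEY" "root@${host}:${KEY}" || return 1
        # Make sure only the user munge can read the remote copy.
        ssh "root@${host}" "chown munge:munge ${KEY} && chmod 400 ${KEY}" || return 1
        echo "munge.key installed on ${host}"
    done
}

# Example, run as root on the machine where create-munge-key was executed:
# distribute_munge_key nuc
```

After copying, munged has to be restarted on the receiving machines so that it picks up the new key.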

The munged daemon is started by

/etc/init.d/munge start

Once munged is up and running one configures SLURM by editing /etc/slurm-llnl/slurm.conf.

ControlMachine=nuc
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/var/log/slurm-llnl/slurm_jobcomp.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=chieftec CPUs=8 State=UNKNOWN 
NodeName=nuc CPUs=4 State=UNKNOWN 
PartitionName=p1 Nodes=chieftec,nuc Default=YES MaxTime=INFINITE State=UP
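A typo in a node name is easy to make in this file. The following sketch checks that every node listed in a PartitionName line also has its own NodeName line. The function name check_partition_nodes is hypothetical, and the check only understands plain comma-separated names as used above, not range expressions such as node[01-04].

```shell
#!/usr/bin/env bash
# Sketch: verify that each partition node is defined by a NodeName line.
# Assumption: node lists are plain comma-separated names (no bracket ranges).

check_partition_nodes() {
    local conf=$1 defined node rc=0
    # Collect names from lines such as "NodeName=chieftec CPUs=8 State=UNKNOWN".
    defined=$(sed -n 's/^NodeName=\([^ ]*\).*/\1/p' "$conf")
    # Extract the Nodes=... value of each PartitionName line and split on commas.
    for node in $(sed -n 's/^PartitionName=.* Nodes=\([^ ]*\).*/\1/p' "$conf" | tr ',' ' '); do
        if ! printf '%s\n' "$defined" | grep -qxF -- "$node"; then
            echo "partition node '$node' has no NodeName line" >&2
            rc=1
        fi
    done
    return $rc
}

# Example:
# check_partition_nodes /etc/slurm-llnl/slurm.conf && echo "node names consistent"
```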

The SLURM daemons are started by

/etc/init.d/slurm-llnl start

As a test, run

srun who
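A slightly broader smoke test might look like the sketch below. It assumes the two-node partition p1 from the configuration above, and the wrapper function slurm_smoke_test is a made-up name; sinfo and srun are the standard SLURM commands.

```shell
#!/usr/bin/env bash
# Sketch: check that both nodes are visible and accept work.
# Assumption: a partition with two nodes, as in the slurm.conf above.

slurm_smoke_test() {
    # List partitions and node states as seen by the controller.
    sinfo || return 1
    # Run hostname on two nodes at once; each node should print its own name.
    srun -N 2 hostname || return 1
}

# Example:
# slurm_smoke_test
```

If srun hangs or a node is reported as down, the log files named in slurm.conf are the first place to look.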

Added 05-Apr-2017: Quote from Quick Start Administrator Guide:

1. Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.
2. Install MUNGE for authentication. Make sure that all nodes in your cluster have the same munge.key. Make sure the MUNGE daemon, munged is started before you start the Slurm daemons.
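Both points can be checked from the command line. The sketch below compares a user's UID between the local machine and a remote one, and decodes a locally created MUNGE credential on the remote machine. The function names are hypothetical, and SSH access between the nodes is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: check UID consistency and the shared munge.key across two machines.
# Assumptions: SSH access to the remote node; function names are hypothetical.

check_uid_sync() {
    local user=$1 host=$2 here there
    here=$(id -u "$user") || return 1
    there=$(ssh "$host" id -u "$user") || return 1
    if [ "$here" = "$there" ]; then
        echo "UID of $user matches on $host ($here)"
    else
        echo "UID of $user differs: local $here, $host $there" >&2
        return 1
    fi
}

check_munge_remote() {
    # Encode a credential locally and decode it on the remote node.
    # This fails if the keys differ or the clocks are too far apart.
    munge -n | ssh "$1" unmunge
}

# Example, run on chieftec:
# check_uid_sync slurm nuc && check_munge_remote nuc
```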
