## Ensure All Packages are Updated

- If using a Vultr instance, make sure a VPC 1.0 network is created for both the Controller Node and Compute Nodes to live in (the same region must be used).
- Update all packages.

```
dnf upgrade -y
```

## Install munge on the Controller Node
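munge issues credentials signed with a key shared by every node, which any host holding the same key can verify. As a rough illustration of the idea only (a hypothetical sketch using an HMAC via `openssl`, not munge's actual credential format):

```shell
# Shared secret standing in for /etc/munge/munge.key (hypothetical value)
key="example-shared-key"
payload="uid=1000 gid=1000 host=controller"

# "Encode": sign the payload with the shared key
sig=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$key" | awk '{print $NF}')

# "Decode" on another node: recompute the signature with the same key and compare
check=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$key" | awk '{print $NF}')
if [ "$sig" = "$check" ]; then echo "credential valid"; else echo "credential rejected"; fi
```

In the real cluster the only requirement is that `/etc/munge/munge.key` is byte-identical on every node, which the overlay steps later in this guide take care of.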
- Enable the `devel` repo to then install the `munge-devel` package.

```
dnf config-manager --set-enabled devel
```

- Install the `munge` packages.

```
dnf install -y munge munge-devel
```

- Check that the `munge` user was created successfully.

```
getent passwd munge
```

- Create the `munge` key.

```
create-munge-key
```

- Enable `munged` so it starts before the Slurm services.

```
systemctl enable munge
```

- Restart the `munge` service and check its status.

```
systemctl restart munge
systemctl status munge
```

## Set Permissions for munge on the Controller Node

- Set the permissions for `munge` on the Controller Node.

```
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
```

## Install Warewulf on the Controller Node
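Before continuing, it's worth confirming that modes like the ones set above actually stick, since `munged` refuses to start if its key or directories are too permissive; `stat -c '%a'` prints the octal mode. A sketch against a scratch directory (the paths are stand-ins for the real `/etc/munge`):

```shell
# Recreate the expected munge layout in a throwaway directory
scratch=$(mktemp -d)
mkdir -p "$scratch/etc/munge"
touch "$scratch/etc/munge/munge.key"

# Apply the same modes used on the Controller Node
chmod 0700 "$scratch/etc/munge"
chmod 0700 "$scratch/etc/munge/munge.key"

# Audit: print the octal mode for each path
stat -c '%a %n' "$scratch/etc/munge" "$scratch/etc/munge/munge.key"
```

The same `stat` invocation run against the real paths should report `700` for the key and directory set above.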
- Install the Warewulf RPM.

```
dnf install -y https://github.com/warewulf/warewulf/releases/download/v4.5.8/warewulf-4.5.8-1.el8.x86_64.rpm
```

- Configure `firewalld` to allow the appropriate services through.

```
systemctl restart firewalld
firewall-cmd --permanent --add-service=warewulf
firewall-cmd --permanent --add-service=dhcp
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=tftp
firewall-cmd --reload
```

- Apply the `warewulf.conf` file. Fill in the `ipaddr`, `netmask`, `network`, and DHCP `range start`/`range end` values for your VPC network (left blank here).

```
tee /etc/warewulf/warewulf.conf <<EOF
WW_INTERNAL: 45
ipaddr:
netmask:
network:
warewulf:
  port: 9873
  secure: false
  update interval: 60
  autobuild overlays: true
  host overlay: true
  syslog: false
  datastore: /usr/share
  grubboot: false
dhcp:
  enabled: true
  template: default
  range start:
  range end:
  systemd name: dhcpd
tftp:
  enabled: true
  tftproot: /var/lib/tftpboot
  systemd name: tftp
  ipxe:
    "00:00": undionly.kpxe
    "00:07": ipxe-snponly-x86_64.efi
    "00:09": ipxe-snponly-x86_64.efi
    "00:0B": arm64-efi/snponly.efi
nfs:
  enabled: true
  export paths:
    - path: /home
      export options: rw,sync
      mount options: defaults
      mount: true
    - path: /opt
      export options: ro,sync,no_root_squash
      mount options: defaults
      mount: false
  systemd name: nfs-server
container mounts:
  - source: /etc/resolv.conf
    dest: /etc/resolv.conf
    readonly: true
paths:
  bindir: /usr/bin
  sysconfdir: /etc
  localstatedir: /var/lib
  ipxesource: /usr/share/ipxe
  srvdir: /var/lib
  firewallddir: /usr/lib/firewalld/services
  systemddir: /usr/lib/systemd/system
  wwoverlaydir: /var/lib/warewulf/overlays
  wwchrootdir: /var/lib/warewulf/chroots
  wwprovisiondir: /var/lib/warewulf/provision
  wwclientdir: /warewulf
EOF
```
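The `tee ... <<EOF` pattern above overwrites the target file with everything up to the `EOF` marker. If you want to try the mechanics safely before touching `/etc/warewulf/warewulf.conf`, the same pattern against a scratch file (hypothetical content):

```shell
tmp=$(mktemp)

# tee writes stdin to the file (its echo to stdout is silenced here);
# <<EOF feeds it every line until the bare EOF marker
tee "$tmp" >/dev/null <<EOF
warewulf:
  port: 9873
EOF

grep 'port:' "$tmp"   # prints:   port: 9873
```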
- Enable the `warewulfd` service.

```
systemctl enable --now warewulfd
```

- Configure all services on the Controller Node.

```
wwctl configure --all
```

- Restore the appropriate SELinux context for `tftpboot`.

```
restorecon -Rv /var/lib/tftpboot/
```

- Import the `rockylinux-9` Docker container.

```
wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rockylinux-9 --build
```

- Set the `rockylinux-9` container as the default.

```
wwctl profile set default --container rockylinux-9
```

- Configure the subnet mask and gateway in the default Warewulf profile.

```
wwctl profile set -y default --netmask=255.255.240.0 --gateway=10.25.96.3
wwctl profile list
```
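With netmask `255.255.240.0` the nodes share a /20 network. A quick, dependency-free sanity check (using the addresses from this guide) that a node IP lands in the same network as the gateway:

```shell
# Bitwise-AND each octet of an address with the mask to get its network address
network_of() (
  IFS=.
  set -- $1 $2   # splits into 8 octets: 4 from the IP, 4 from the mask
  printf '%d.%d.%d.%d\n' $(($1 & $5)) $(($2 & $6)) $(($3 & $7)) $(($4 & $8))
)

mask=255.255.240.0
network_of 10.25.96.3 "$mask"   # gateway      -> prints 10.25.96.0
network_of 10.25.96.4 "$mask"   # compute node -> prints 10.25.96.0
```

Both addresses resolve to the same network, so DHCP and provisioning traffic can reach the node.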
- Add the Compute Node to the Warewulf Node List.

```
wwctl node add warewulf-compute-node-1-osaka --ipaddr=10.25.96.4 --discoverable=true
wwctl node list -a warewulf-compute-node-1-osaka
```

- Rebuild the Warewulf Overlay.

```
wwctl overlay build
```

## Install munge on the Compute Node Image
- As `root`, exec into the Rocky Linux 9 container.

```
wwctl container exec rockylinux-9 /bin/bash
```

- Install `munge` in the container.

```
dnf install -y munge
```

- Enable the `munge` service.

```
systemctl enable munge
```

- Run `exit` and the container will be rebuilt.
- Create the `munge` key overlay.

```
wwctl overlay import --parents wwinit /etc/munge/munge.key
```

- Set permissions for the `munge` files in the overlay destined for the Compute Node.

```
wwctl overlay chown wwinit /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chmod wwinit /etc/munge/munge.key 0400
wwctl overlay chown wwinit /etc/munge $(id -u munge) $(id -g munge)
wwctl overlay chmod wwinit /etc/munge 0700
```
- Rebuild the overlay.

```
wwctl overlay build
```

## Add a Compute Node to the Cluster
- Select the Upload ISO -> iPXE Custom Script option in Vultr for the node you want to set up as a Compute Node.
- Set up the node with the amount of CPU, RAM, and disk space you need.
- Start the Compute Node; it will boot and download the image from the Controller Node.
- Test `munge` from the Controller Node to make sure it works.

```
munge -n
munge -n | unmunge
munge -n | ssh root@<YOUR_COMPUTE_NODE> unmunge
remunge
```

## Time Synchronisation Between the Controller Node and the Compute Node
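Time synchronisation matters here because `munged` rejects credentials whose timestamps drift too far from its own clock (the tolerance is on the order of minutes), so the Controller Node and Compute Node must agree on the time. Illustratively, with two hypothetical epoch readings:

```shell
# Hypothetical clock readings (epoch seconds) taken from the two nodes
controller_time=1700000000
compute_time=1700000042

# Absolute difference between the two clocks
skew=$(( controller_time - compute_time ))
[ "$skew" -lt 0 ] && skew=$(( -skew ))

echo "clock skew: ${skew}s"   # prints: clock skew: 42s
```

A skew like this is harmless once both nodes run `chrony` against the same time source, as the steps below set up.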
- Set the timezone on the Controller Node and import it into the overlay.

```
sudo timedatectl set-timezone <Region>/<City>
wwctl overlay import wwinit /etc/localtime
```

- Exec into the `rockylinux-9` container.

```
wwctl container exec rockylinux-9 /bin/bash
```

- Install the `chrony` package in the `rockylinux-9` container.

```
dnf install -y chrony
```

- `exit` to rebuild the container and return to the Controller Node.
- Rebuild the overlay after that.

```
wwctl overlay build
```

## Install slurm on the Controller Node and Compute Node Image
- Enable the `powertools` repo.

```
dnf config-manager --set-enabled powertools
```

- Create the `slurm` user and group with a fixed uid/gid.

```
export SLURMUSER=900
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
getent passwd 900
getent group 900
```
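The fixed uid/gid matter because Slurm and munge identify users numerically, so the `slurm` user must resolve to the same ids (900 here) on the Controller Node and inside the image. The uid and gid are the third and fourth `:`-separated fields of a passwd entry; shown here for `root`, which exists on every system:

```shell
# Print just the uid and gid fields of a passwd entry
getent passwd root | awk -F: '{print $3, $4}'   # prints: 0 0
```

Running the same one-liner for `slurm` on both the Controller Node and in the container should print `900 900` once the sync step below has run.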
- Sync the `slurm` user and group into the `rockylinux-9` container.

```
wwctl container syncuser --write rockylinux-9 --build
```

- Set up the database server for Slurm.

```
dnf install -y mariadb-server mariadb-devel
systemctl enable --now mariadb

# Respond `Yes` to all of the questions, aside from the one which asks to reset the root password.
mysql_secure_installation
```

- Install further `slurm` prerequisites.

```
dnf install -y pam-devel readline-devel perl
```

- Further set up the `rockylinux-9` container for use on the Compute Node.

```
wwctl container exec rockylinux-9 /bin/bash
dnf config-manager --set-enabled crb
dnf install -y dnf-plugins-core
dnf install -y gcc gcc-c++ tar make python3 openssl openssl-devel pam-devel numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad libevent libevent-devel dbus-devel
```

- `exit` from the container.
- Download `slurm` from SchedMD on the Controller Node.

```
wget https://download.schedmd.com/slurm/slurm-23.11.5.tar.bz2
```

- Build the `slurm` RPMs.

```
rpmbuild -ta slurm-23.11.5.tar.bz2
```

- Change into the `rpmbuild/RPMS/x86_64` directory and install each package.

```
cd rpmbuild/RPMS/x86_64
sudo dnf localinstall -y slurm-23.11.5-1.el8.x86_64.rpm
sudo dnf localinstall -y slurm-slurmctld-23.11.5-1.el8.x86_64.rpm
sudo dnf localinstall -y slurm-perlapi-23.11.5-1.el8.x86_64.rpm
sudo dnf localinstall -y slurm-slurmdbd-23.11.5-1.el8.x86_64.rpm
sudo dnf localinstall -y slurm-pam_slurm-23.11.5-1.el8.x86_64.rpm
sudo dnf localinstall -y slurm-example-configs-23.11.5-1.el8.x86_64.rpm
```

- Install `slurm` in the Compute Node image.

```
wwctl container exec rockylinux-9 /bin/bash
wget https://download.schedmd.com/slurm/slurm-23.11.5.tar.bz2
dnf install -y mariadb-devel munge-devel pam-devel readline-devel perl
dnf install -y rpm-build
rpmbuild -ta slurm-23.11.5.tar.bz2
cd /root/rpmbuild/RPMS/x86_64/
dnf localinstall -y slurm-23.11.5-1.el9.x86_64.rpm slurm-slurmd-23.11.5-1.el9.x86_64.rpm
cd ~
rm -Rf ./rpmbuild
```
- Then `exit` to write the changes to the container.

## Create a Spool Directory for the Controller Node
- A spool is a place where data is written by a process to be used later, possibly by a different process.
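That writer/reader hand-off can be illustrated with a scratch directory standing in for the real spool (the job file name is hypothetical):

```shell
spool=$(mktemp -d)   # stands in for /var/spool/slurmctld

# One process records state...
echo "job-123: queued" > "$spool/job-123"

# ...and a later (or different) process picks it up
cat "$spool/job-123"   # prints: job-123: queued
```

On the Controller Node, `slurmctld` uses its spool the same way to persist job and node state across restarts, which is why the directory below must exist and be owned by the `slurm` user.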
- Create the spool directory and give it to the `slurm` user.

```
mkdir /var/spool/slurmctld
chown slurm:slurm /var/spool/slurmctld
```

## Configure Slurm on the Controller Node
- Copy the example configuration files into place.

```
cd /etc/slurm/
cp slurm.conf.example slurm.conf
cp slurmdbd.conf.example slurmdbd.conf
```

- Edit `slurm.conf` and change `ClusterName` and `SlurmctldHost` to the hostname of the Controller Node.
- Open the `firewalld` ports used by `slurmctld` (6817) and `slurmdbd` (6819).

```
firewall-cmd --permanent --zone=internal --add-port=6817/tcp
firewall-cmd --permanent --zone=internal --add-port=6819/tcp
firewall-cmd --reload
```

- Start the `slurmctld` daemon.

```
systemctl enable --now slurmctld
```

## Import slurm.conf into the Compute Node Image
```
wwctl overlay import --parents wwinit /etc/slurm/slurm.conf

# Check that it was imported correctly.
cat /var/lib/warewulf/overlays/wwinit/rootfs/etc/slurm/slurm.conf
```

## Get Compute Node Hardware Info
- `ssh` into the Compute Node and print its hardware configuration.

```
ssh <COMPUTE_NODE_IP>
slurmd -C
```

## Add the slurmd -C Output to the slurm.conf File on the Controller Node
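`slurmd -C` prints a ready-made `NodeName=` line describing the node's hardware, as a series of `key=value` fields. If you ever need just one field from that line, plain shell parsing is enough (the sample line uses the values from this guide's example node):

```shell
# A captured `slurmd -C` line (values from the example node in this guide)
line='NodeName=warewulf-compute-node-1-osaka CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7941'

# Pull out individual key=value fields by word-splitting the line
for kv in $line; do
  case $kv in
    CPUs=*)       echo "CPUs: ${kv#CPUs=}" ;;            # prints: CPUs: 4
    RealMemory=*) echo "Memory: ${kv#RealMemory=} MB" ;; # prints: Memory: 7941 MB
  esac
done
```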
- Edit `NodeName` in `/etc/slurm/slurm.conf` and add something similar to the example below.

```
NodeName=warewulf-compute-node-1-osaka CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7941 State=UNKNOWN
```

- Update `slurm.conf` in Warewulf on the Controller Node.

```
wwctl overlay del wwinit /etc/slurm/slurm.conf
wwctl overlay import wwinit /etc/slurm/slurm.conf
wwctl overlay build
```

## Create the Cgroup slurm Config and Add to the Overlay
```
cd /etc/slurm/
cp cgroup.conf.example cgroup.conf
cat cgroup.conf
wwctl overlay import wwinit /etc/slurm/cgroup.conf
```

## Setup the Database for Slurm on the Controller Node
- Open `slurmdbd.conf` for editing.

```
vim /etc/slurm/slurmdbd.conf
```

- Set the following parameters and then save the file.

```
AuthType=auth/munge
DbdAddr=<Address_of_Controller_Node>
DbdHost=localhost
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```

- Configure MariaDB. At the MariaDB prompt (it looks like `MariaDB [(none)]>`), create the database and grant the `slurm` user access, then press `Ctrl + d` to exit.

```
mysql -u root -p
create database slurm_acct_db;
grant all on slurm_acct_db.* TO 'slurm'@'localhost';
```

## Finally, Set the Following in slurm.conf on the Controller Node
- Edit `slurm.conf` on the Controller Node and set the following accounting parameters.

```
vim /etc/slurm/slurm.conf
```

```
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
```

## Bring Up the Compute Node
- To check the status of the Slurm nodes, run the following from the Controller Node.

```
sinfo
```

- Bring up the Compute Node.

```
scontrol update nodename=warewulf-compute-node-1-osaka state=idle
```

## Create a Simple Slurm Script
- Save the following as `test.slurm`.

```
#!/bin/bash
#SBATCH --job-name test

hostname
uptime
```

## Create a Batch Job for the Script
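Because `#SBATCH` directives are ordinary comments to bash, the script body can be smoke-tested locally before submitting it to the scheduler (the path below is a hypothetical scratch location):

```shell
# Recreate the job script in a scratch location
cat > /tmp/test.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name test
hostname
uptime
EOF

# Run it directly: bash ignores the #SBATCH lines as comments
bash /tmp/test.slurm >/dev/null && echo "script ok"   # prints: script ok
```

Under Slurm, the same script's `hostname` and `uptime` output lands in a `slurm-<jobid>.out` file in the submission directory instead.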
```
sbatch test.slurm
```

## Set up the Firewall so All Compute Node Addresses are Added to the Trusted Zone
```
sudo firewall-cmd --zone=trusted --add-source=10.25.96.0/24
```

## srun Then Becomes Available
```
srun --pty bash
```