[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

https://groups.google.com/g/slurm-users/c/RXCHB7OE_Kk

Williams, Jenny Avis, Jul 11, 2023, 11:47:53 PM, to slurm…@schedmd.com:

Progress on getting slurmd to start under cgroupv2

Issue: slurmd 22.05.6 will not start when using cgroupv2

Expected result: even after reboot slurmd will start up without needing to manually add lines to /sys/fs/cgroup files.

When started as service the error is:

systemctl status slurmd

  • slurmd.service - Slurm node daemon

    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)

    Drop-In: /etc/systemd/system/slurmd.service.d

         └─extendUnit.conf
    

    Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23 EDT; 2s ago

    Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Main PID: 11395 (code=exited, status=1/FAILURE)

Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node daemon.

Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd version 22.05.6 started

Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE

Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Failed with result 'exit-code'.

When started at the command line the output is:

slurmd -D -vvv 2>&1 | egrep error

slurmd: error: Controller cpuset is not enabled!

slurmd: error: Controller cpu is not enabled!

slurmd: error: Controller cpuset is not enabled!

slurmd: error: Controller cpu is not enabled!

slurmd: error: Controller cpuset is not enabled!

slurmd: error: Controller cpu is not enabled!

slurmd: error: Controller cpuset is not enabled!

slurmd: error: Controller cpu is not enabled!

slurmd: error: cpu cgroup controller is not available.

slurmd: error: There's an issue initialising memory or cpu controller

slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed

slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup

Steps to mitigate the issue:

While the following steps do not solve the issue, they do get the system into a state where slurmd will start, at least until the next reboot. The reinstall of slurm-slurmd is a one-time step to ensure that local service modifications are out of the picture. Currently, even after a reboot the cgroup echo steps are necessary at a minimum.

#!/bin/bash

/usr/bin/dnf -y reinstall slurm-slurmd

systemctl daemon-reload

/usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'

systemctl enable slurmd

systemctl stop dcismeng.service && \

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \

systemctl start slurmd && \

echo 'run this: systemctl start dcismeng'
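To avoid re-running the echo steps by hand after every reboot, one option is a systemd drop-in that either asks systemd to delegate the controllers to slurmd, or performs the same writes just before the daemon starts. This is a sketch, not a verified fix: the file name delegate.conf is hypothetical (the site already uses /etc/systemd/system/slurmd.service.d/ for extendUnit.conf), and whether Delegate=Yes alone suffices depends on the Slurm packaging.

```ini
# /etc/systemd/system/slurmd.service.d/delegate.conf  (hypothetical name)
[Service]
# Ask systemd to delegate cgroup v2 controllers to slurmd's own subtree.
Delegate=Yes
# Alternatively (or additionally), enable the controllers before ExecStart,
# mirroring the manual echo steps from the mitigation script above.
ExecStartPre=/bin/sh -c 'echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control'
ExecStartPre=/bin/sh -c 'echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control'
```

After adding the drop-in, run systemctl daemon-reload and restart slurmd.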

Environment:

scontrol show config

Configuration data as of 2023-07-11T10:39:48

AccountingStorageBackupHost = (null)

AccountingStorageEnforce = associations,limits,qos,safe

AccountingStorageHost = m1006

AccountingStorageExternalHost = (null)

AccountingStorageParameters = (null)

AccountingStoragePort = 6819

AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu

AccountingStorageType = accounting_storage/slurmdbd

AccountingStorageUser = N/A

AccountingStoreFlags = (null)

AcctGatherEnergyType = acct_gather_energy/none

AcctGatherFilesystemType = acct_gather_filesystem/none

AcctGatherInterconnectType = acct_gather_interconnect/none

AcctGatherNodeFreq = 0 sec

AcctGatherProfileType = acct_gather_profile/none

AllowSpecResourcesUsage = No

AuthAltTypes = (null)

AuthAltParameters = (null)

AuthInfo = (null)

AuthType = auth/munge

BatchStartTimeout = 10 sec

BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64

BcastParameters = (null)

BOOT_TIME = 2023-07-11T10:04:31

BurstBufferType = (null)

CliFilterPlugins = (null)

ClusterName = ASlurmCluster

CommunicationParameters = (null)

CompleteWait = 0 sec

CoreSpecPlugin = core_spec/none

CpuFreqDef = Unknown

CpuFreqGovernors = OnDemand,Performance,UserSpace

CredType = cred/munge

DebugFlags = (null)

DefMemPerNode = UNLIMITED

DependencyParameters = kill_invalid_depend

DisableRootJobs = No

EioTimeout = 60

EnforcePartLimits = ANY

Epilog = (null)

EpilogMsgTime = 2000 usec

EpilogSlurmctld = (null)

ExtSensorsType = ext_sensors/none

ExtSensorsFreq = 0 sec

FairShareDampeningFactor = 1

FederationParameters = (null)

FirstJobId = 1

GetEnvTimeout = 2 sec

GresTypes = gpu

GpuFreqDef = high,memory=high

GroupUpdateForce = 1

GroupUpdateTime = 600 sec

HASH_VAL = Match

HealthCheckInterval = 0 sec

HealthCheckNodeState = ANY

HealthCheckProgram = (null)

InactiveLimit = 65533 sec

InteractiveStepOptions = --interactive --preserve-env --pty $SHELL

JobAcctGatherFrequency = task=15

JobAcctGatherType = jobacct_gather/cgroup

JobAcctGatherParams = (null)

JobCompHost = localhost

JobCompLoc = /var/log/slurm_jobcomp.log

JobCompPort = 0

JobCompType = jobcomp/none

JobCompUser = root

JobContainerType = job_container/none

JobCredentialPrivateKey = (null)

JobCredentialPublicCertificate = (null)

JobDefaults = (null)

JobFileAppend = 0

JobRequeue = 1

JobSubmitPlugins = lua

KillOnBadExit = 0

KillWait = 30 sec

LaunchParameters = (null)

LaunchType = launch/slurm

Licenses = mplus:1,nonmem:32

LogTimeFormat = iso8601_ms

MailDomain = (null)

MailProg = /bin/mail

MaxArraySize = 90001

MaxDBDMsgs = 701360

MaxJobCount = 350000

MaxJobId = 67043328

MaxMemPerNode = UNLIMITED

MaxNodeCount = 340

MaxStepCount = 40000

MaxTasksPerNode = 512

MCSPlugin = mcs/none

MCSParameters = (null)

MessageTimeout = 60 sec

MinJobAge = 300 sec

MpiDefault = none

MpiParams = (null)

NEXT_JOB_ID = 12286313

NodeFeaturesPlugins = (null)

OverTimeLimit = 0 min

PluginDir = /usr/lib64/slurm

PlugStackConfig = (null)

PowerParameters = (null)

PowerPlugin =

PreemptMode = OFF

PreemptType = preempt/none

PreemptExemptTime = 00:00:00

PrEpParameters = (null)

PrEpPlugins = prep/script

PriorityParameters = (null)

PrioritySiteFactorParameters = (null)

PrioritySiteFactorPlugin = (null)

PriorityDecayHalfLife = 14-00:00:00

PriorityCalcPeriod = 00:05:00

PriorityFavorSmall = No

PriorityFlags = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES

PriorityMaxAge = 60-00:00:00

PriorityUsageResetPeriod = NONE

PriorityType = priority/multifactor

PriorityWeightAge = 10000

PriorityWeightAssoc = 0

PriorityWeightFairShare = 10000

PriorityWeightJobSize = 1000

PriorityWeightPartition = 1000

PriorityWeightQOS = 1000

PriorityWeightTRES = CPU=1000,Mem=4000,GRES/gpu=3000

PrivateData = none

ProctrackType = proctrack/cgroup

Prolog = (null)

PrologEpilogTimeout = 65534

PrologSlurmctld = (null)

PrologFlags = Alloc,Contain,X11

PropagatePrioProcess = 0

PropagateResourceLimits = ALL

PropagateResourceLimitsExcept = (null)

RebootProgram = /usr/sbin/reboot

ReconfigFlags = (null)

RequeueExit = (null)

RequeueExitHold = (null)

ResumeFailProgram = (null)

ResumeProgram = (null)

ResumeRate = 300 nodes/min

ResumeTimeout = 60 sec

ResvEpilog = (null)

ResvOverRun = 0 min

ResvProlog = (null)

ReturnToService = 2

RoutePlugin = route/default

SchedulerParameters = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80

SchedulerTimeSlice = 30 sec

SchedulerType = sched/backfill

ScronParameters = (null)

SelectType = select/cons_tres

SelectTypeParameters = CR_CPU_MEMORY

SlurmUser = slurm(47)

SlurmctldAddr = (null)

SlurmctldDebug = info

SlurmctldHost[0] = ASlurmCluster-sched(x.x.x.x)

SlurmctldLogFile = /data/slurm/slurmctld.log

SlurmctldPort = 6820-6824

SlurmctldSyslogDebug = (null)

SlurmctldPrimaryOffProg = (null)

SlurmctldPrimaryOnProg = (null)

SlurmctldTimeout = 6000 sec

SlurmctldParameters = (null)

SlurmdDebug = info

SlurmdLogFile = /var/log/slurm/slurmd.log

SlurmdParameters = (null)

SlurmdPidFile = /var/run/slurmd.pid

SlurmdPort = 6818

SlurmdSpoolDir = /var/spool/slurmd

SlurmdSyslogDebug = (null)

SlurmdTimeout = 600 sec

SlurmdUser = root(0)

SlurmSchedLogFile = (null)

SlurmSchedLogLevel = 0

SlurmctldPidFile = /var/run/slurmctld.pid

SlurmctldPlugstack = (null)

SLURM_CONF = /etc/slurm/slurm.conf

SLURM_VERSION = 22.05.6

SrunEpilog = (null)

SrunPortRange = 0-0

SrunProlog = (null)

StateSaveLocation = /data/slurm/slurmctld

SuspendExcNodes = (null)

SuspendExcParts = (null)

SuspendProgram = (null)

SuspendRate = 60 nodes/min

SuspendTime = INFINITE

SuspendTimeout = 30 sec

SwitchParameters = (null)

SwitchType = switch/none

TaskEpilog = (null)

TaskPlugin = cgroup,affinity

TaskPluginParam = (null type)

TaskProlog = (null)

TCPTimeout = 2 sec

TmpFS = /tmp

TopologyParam = (null)

TopologyPlugin = topology/none

TrackWCKey = No

TreeWidth = 50

UsePam = No

UnkillableStepProgram = (null)

UnkillableStepTimeout = 600 sec

VSizeFactor = 0 percent

WaitTime = 0 sec

X11Parameters = home_xauthority

Cgroup Support Configuration:

AllowedKmemSpace = (null)

AllowedRAMSpace = 100.0%

AllowedSwapSpace = 1.0%

CgroupAutomount = yes

CgroupMountpoint = /sys/fs/cgroup

CgroupPlugin = cgroup/v2

ConstrainCores = yes

ConstrainDevices = yes

ConstrainKmemSpace = no

ConstrainRAMSpace = yes

ConstrainSwapSpace = yes

IgnoreSystemd = no

IgnoreSystemdOnFailure = no

MaxKmemPercent = 100.0%

MaxRAMPercent = 100.0%

MaxSwapPercent = 100.0%

MemorySwappiness = (null)

MinKmemSpace = 30 MB

MinRAMSpace = 30 MB

Slurmctld(primary) at ASlurmCluster-sched is UP

Williams, Jenny Avis, Jul 12, 2023, 2:41:52 AM, to slurm…@schedmd.com:

Additional configuration information: /etc/slurm/cgroup.conf

CgroupAutomount=yes

ConstrainCores=yes

ConstrainRAMSpace=yes

CgroupPlugin=cgroup/v2

AllowedSwapSpace=1

ConstrainSwapSpace=yes

ConstrainDevices=yes

Hermann Schwärzler, Jul 12, 2023, 5:36:36 PM, to slurm…@lists.schedmd.com:

Hi Jenny,

I guess you have a system that has both cgroup/v1 and cgroup/v2 enabled.

Which Linux distribution are you using? And which kernel version? What is the output of "mount | grep cgroup"? What happens if you do not restrict the cgroup version Slurm can use to cgroup/v2, i.e. omit "CgroupPlugin=…" from your cgroup.conf?

Regards, Hermann


Williams, Jenny Avis, Jul 12, 2023, 11:51:02 PM, to Slurm User Community List:

The systems have only cgroup/v2 enabled:

mount | egrep cgroup

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

Distribution and kernel: RedHat 8.7, 4.18.0-348.2.1.el8_5.x86_64

Hermann Schwärzler, Jul 13, 2023, 7:45:46 PM, to slurm…@lists.schedmd.com:

Hi Jenny,

ok, I see. You are using the exact same Slurm version and a very similar OS version/distribution as we do.

You have to consider that cpuset support is not available in cgroup/v2 in kernel versions below 5.2 (see "Cgroups v2 controllers" in "man cgroups" on your system). So some of the warnings/errors you see - at least "Controller cpuset is not enabled" - are expected (and slurmd should start nevertheless). This, btw, is one of the reasons why we stick with cgroup/v1 for the time being.
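Which controllers the kernel actually offers on the unified hierarchy can be read from /sys/fs/cgroup/cgroup.controllers. A small sketch of the check (the has_controller helper is ours, not part of Slurm; it is shown against a sample controllers line typical of a RHEL 8 host rather than the live file):

```shell
#!/bin/sh
# has_controller "LIST" NAME: succeed if NAME appears in the space-separated
# controller list (the format of /sys/fs/cgroup/cgroup.controllers).
has_controller() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample line; on a real system use:
#   controllers=$(cat /sys/fs/cgroup/cgroup.controllers)
controllers="cpuset cpu io memory hugetlb pids rdma"

if has_controller "$controllers" cpuset; then
    echo "cpuset available"
else
    echo "cpuset not available on the unified hierarchy"
fi
```

If cpuset is missing from the root cgroup.controllers file, no amount of writing to subtree_control will enable it further down the tree.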

We did some tests with cgroup/v2 and in our case slurmd started with no problems (except the error/warning regarding the cpuset controller). But we have a slightly different configuration. You use:

JobAcctGatherType = jobacct_gather/cgroup
ProctrackType = proctrack/cgroup
TaskPlugin = cgroup,affinity
CgroupPlugin = cgroup/v2

We use for the respective settings:

JobAcctGatherType = jobacct_gather/linux
ProctrackType = proctrack/cgroup
TaskPlugin = task/affinity,task/cgroup
CgroupPlugin = (null) - i.e. we don't set that one in cgroup.conf

Maybe using the same settings as we do helps in your case? Please be aware that you should change JobAcctGatherType only when there are no running job steps!
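Written out as config fragments, Hermann's suggested combination would look roughly like the following (a sketch only; file paths follow the /etc/slurm layout already shown in the thread, and only the lines that differ from Jenny's setup are listed):

```ini
# /etc/slurm/slurm.conf  (relevant lines only)
JobAcctGatherType=jobacct_gather/linux
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

# /etc/slurm/cgroup.conf - note: no CgroupPlugin= line,
# so Slurm autodetects the cgroup version.
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=1
ConstrainDevices=yes
```

As Hermann notes, JobAcctGatherType should only be changed while no job steps are running.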

Regards, Hermann

Williams, Jenny Avis, Jul 15, 2023, 8:46:37 AM, to Slurm User Community List:

Thanks, Hermann, for the feedback.

My reason for posting was to request some inspection of the systemd file for slurmd such that this “nudging” would not be necessary.

I’d like to explore that a little more – it looks like cgroupsv2 cpusets are working for us in this configuration, except for having to “nudge” the daemon to start with the steps originally listed.

This document from RedHat explicitly describes enabling cpusets under cgroupsv2 under rhel 8 – this at least appears to be working in our configuration. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel

This document is where I got the steps to get the daemon working and cpusets enabled. I've checked the contents of job_*/cpuset.cpus under /s

Regards, Jenny
