[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start
https://groups.google.com/g/slurm-users/c/RXCHB7OE_Kk
Williams, Jenny Avis, Jul 11, 2023, 11:47:53 PM, to slurm…@schedmd.com
Progress on getting slurmd to start under cgroupv2
Issue: slurmd 22.05.6 will not start when using cgroupv2
Expected result: slurmd starts, even after a reboot, without needing lines manually added to files under /sys/fs/cgroup.
When started as a service the error is:
systemctl status slurmd
slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmd.service.d
`-extendUnit.conf
Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23 EDT; 2s ago
Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 11395 (code=exited, status=1/FAILURE)
Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node daemon.
Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd version 22.05.6 started
Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Failed with result 'exit-code'.
When started at the command line the output is:
slurmd -D -vvv 2>&1 |egrep error
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: cpu cgroup controller is not available.
slurmd: error: There's an issue initialising memory or cpu controller
slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
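For reference, the "Controller ... is not enabled" errors refer to the controllers listed in cgroup.subtree_control at each level of the v2 hierarchy. A quick sanity check, using the same paths the mitigation script below writes to:
# Controllers enabled for children at the cgroup v2 root and under system.slice;
# for this configuration slurmd needs cpu, cpuset and memory listed in both.
cat /sys/fs/cgroup/cgroup.subtree_control
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control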
Steps to mitigate the issue:
While the following steps do not solve the issue, they do get the system into a state where slurmd will start, at least until the next reboot. Reinstalling slurm-slurmd is a one-time step to ensure that local service modifications are out of the picture. Currently, even after reboot, the cgroup echo steps are necessary at a minimum.
#!/bin/bash
/usr/bin/dnf -y reinstall slurm-slurmd
systemctl daemon-reload
/usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'
systemctl enable slurmd
systemctl stop dcismeng.service && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \
systemctl start slurmd && \
echo 'run this: systemctl start dcismeng'
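One possible way to make the controller delegation persist across reboots, sketched on the assumption that this systemd version honours an explicit Delegate= list for slurmd (the drop-in file name here is made up), would be to let systemd enable the controllers instead of echoing into cgroup.subtree_control by hand:
#!/bin/bash
# Hypothetical drop-in; Delegate= asks systemd to enable these controllers in the
# parent slices so they are available to slurmd's cgroup subtree at every boot.
mkdir -p /etc/systemd/system/slurmd.service.d
cat > /etc/systemd/system/slurmd.service.d/delegate.conf <<'EOF'
[Service]
Delegate=cpu cpuset memory
EOF
systemctl daemon-reload
systemctl restart slurmd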
Environment:
scontrol show config
Configuration data as of 2023-07-11T10:39:48
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = m1006
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = (null)
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2023-07-11T10:04:31
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = ASlurmCluster
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DependencyParameters = kill_invalid_depend
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 65533 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = task=15
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Licenses = mplus:1,nonmem:32
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 90001
MaxDBDMsgs = 701360
MaxJobCount = 350000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxNodeCount = 340
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 12286313
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES
PriorityMaxAge = 60-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10000
PriorityWeightAssoc = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS = 1000
PriorityWeightTRES = CPU=1000,Mem=4000,GRES/gpu=3000
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc,Contain,X11
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /usr/sbin/reboot
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SchedulerParameters = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = slurm(47)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = ASlurmCluster-sched(x.x.x.x)
SlurmctldLogFile = /data/slurm/slurmctld.log
SlurmctldPort = 6820-6824
SlurmctldSyslogDebug = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 6000 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = (null)
SlurmdTimeout = 600 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 22.05.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /data/slurm/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = cgroup,affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = No
UnkillableStepProgram = (null)
UnkillableStepTimeout = 600 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = home_xauthority
Cgroup Support Configuration:
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 1.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
CgroupPlugin = cgroup/v2
ConstrainCores = yes
ConstrainDevices = yes
ConstrainKmemSpace = no
ConstrainRAMSpace = yes
ConstrainSwapSpace = yes
IgnoreSystemd = no
IgnoreSystemdOnFailure = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
Slurmctld(primary) at ASlurmCluster-sched is UP
Williams, Jenny Avis, Jul 12, 2023, 2:41:52 AM, to slurm…@schedmd.com
Additional configuration information - /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
CgroupPlugin=cgroup/v2
AllowedSwapSpace=1
ConstrainSwapSpace=yes
ConstrainDevices=yes
Hermann Schwärzler, Jul 12, 2023, 5:36:36 PM, to slurm…@lists.schedmd.com
Hi Jenny,
I guess you have a system that has both cgroup/v1 and cgroup/v2 enabled.
Which Linux distribution are you using? And which kernel version?
What is the output of: mount | grep cgroup
What if you do not restrict the cgroup version Slurm can use to cgroup/v2 but instead omit "CgroupPlugin=…" from your cgroup.conf?
Regards, Hermann
Williams, Jenny Avis, Jul 12, 2023, 11:51:02 PM, to Slurm User Community List
The systems have only cgroup/v2 enabled:
mount |egrep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Distribution and kernel: RedHat 8.7, 4.18.0-348.2.1.el8_5.x86_64
Hermann Schwärzler, Jul 13, 2023, 7:45:46 PM, to slurm…@lists.schedmd.com
Hi Jenny,
ok, I see. You are using the exact same Slurm version and a very similar OS version/distribution as we do.
You have to consider that cpuset support is not available in cgroup/v2 in kernel versions below 5.2 (see "Cgroups v2 controllers" in "man cgroups" on your system). So some of the warnings/errors you see - at least "Controller cpuset is not enabled" - are expected (and slurmd should start nevertheless). This, by the way, is one of the reasons why we stick with cgroup/v1 for the time being.
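One quick check for whether the running kernel exposes the cpuset controller under cgroup v2 at all:
# Lists every controller the kernel makes available at the cgroup v2 root;
# cpuset will simply be absent here if the kernel does not support it under v2.
cat /sys/fs/cgroup/cgroup.controllers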
We did some tests with cgroups/v2 and in our case slurmd started with no problems (except the error/warning regarding the cpuset controller). But we have a slightly different configuration. You use:
JobAcctGatherType = jobacct_gather/cgroup
ProctrackType = proctrack/cgroup
TaskPlugin = cgroup,affinity
CgroupPlugin = cgroup/v2
We use for the respective settings:
JobAcctGatherType = jobacct_gather/linux
ProctrackType = proctrack/cgroup
TaskPlugin = task/affinity,task/cgroup
CgroupPlugin = (null) - i.e. we don't set that one in cgroup.conf
Maybe using the same settings as we do helps in your case? Please be aware that you should change JobAcctGatherType only when there are no running job steps!
Regards, Hermann
Williams, Jenny Avis, Jul 15, 2023, 8:46:37 AM, to Slurm User Community List
Thanks, Hermann, for the feedback.
My reason for posting was to request some inspection of the systemd file for slurmd such that this "nudging" would not be necessary.
I'd like to explore that a little more - it looks like cgroupsv2 cpusets are working for us in this configuration, except for having to "nudge" the daemon to start with the steps originally listed.
This document from RedHat explicitly describes enabling cpusets under cgroups v2 on RHEL 8 - this at least appears to be working in our configuration. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel
This document is where I got the steps to get the daemon working and cpusets enabled. I've checked the contents of job_*/cpuset.cpus under /s
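For example, something along these lines can confirm that the job cgroups have their CPU sets populated (the exact location of the job_* directories below /sys/fs/cgroup is assumed here, not taken from the thread):
# Find every job cgroup's cpuset.cpus file under the v2 mount and print its contents;
# a non-empty CPU list for each running step indicates cpuset confinement is active.
find /sys/fs/cgroup -path '*job_*/cpuset.cpus' -exec grep -H . {} \;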
Regards, Jenny