race? Failed to retrieve cgroup stats: open …/user.slice/sha.scope/memory.stats and pids.current: ENOENT #23789
https://github.com/containers/podman/issues/23789
Closed as completed in #24400.

edsantiago (Member) opened on Aug 28, 2024:

New one-off (so far) in rawhide rootless parallel:
```
<+026ms> # $ podman container stats --no-stream --format {{"\n"}}
<+163ms> # time="2024-08-27T22:13:08-05:00" level=warning msg="Failed to retrieve cgroup stats: open /sys/fs/cgroup/user.slice/user-4839.slice/user@4839.service/user.slice/libpod-7328f381a1cde439c6e20df0aa54dfe87e731daf1716f97279072fe7786b7480.scope/memory.stat: no such file or directory"
         # time="2024-08-27T22:13:08-05:00" level=warning msg="Failed to retrieve cgroup stats: open /sys/fs/cgroup/user.slice/user-4839.slice/user@4839.service/user.slice/libpod-7328f381a1cde439c6e20df0aa54dfe87e731daf1716f97279072fe7786b7480.scope/pids.current: no such file or directory"
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
edsantiago added the flakes label (Flakes from Continuous Integration) on Aug 28, 2024

Luap99 (Member) commented on Aug 28, 2024:

On a quick look, we are locked during the stats call, so this should prevent another podman process from stopping/removing the container in parallel. We also only query the cgroup if the container state is running or paused. I am not too familiar with how the cgroup setup works, but the container process itself can exit at any time during our calls. AFAIK the cgroup isn't deleted when the process exits but rather needs an explicit removal via podman/the OCI runtime, which also needs to hold a lock, so it is not clear to me where the race would be.
github-actions bot commented on Sep 28, 2024:

A friendly reminder that this issue had no activity for 30 days.

github-actions added the stale-issue label on Sep 28, 2024
edsantiago removed the stale-issue label on Oct 28, 2024

edsantiago (Member, Author) commented on Oct 28, 2024:

Seen last week in rawhide rootless, slightly different symptom:
```
<+026ms> # $ podman stats --no-stream --format {{"\n"}}
<+219ms> # Error: unable to obtain cgroup stats: read /sys/fs/cgroup/user.slice/user-3860.slice/user@3860.service/user.slice/libpod-c472c9ad2475d01451cced72412f6221b819b050026324f5c7768c7ed66f6840.scope/memory.max: no such device
```

(`read` instead of `open`, and `memory.max`)
Luap99 (Member) commented on Oct 28, 2024:

@giuseppe PTAL. Is it expected that reading the cgroup files is racy? We do seem to hold the container locks correctly, so there should be no removal happening at the same time, I think: #23789 (comment)
giuseppe (Member) commented on Oct 28, 2024:

One reason for seeing ENODEV is that the cgroup is deleted after the open, but before the read syscall: containers/crun#537
Trying to reproduce locally
giuseppe (Member) commented on Oct 29, 2024:

I was able to reproduce quite easily with:
```diff
diff --git a/vendor/github.com/containers/common/pkg/cgroups/cgroups_linux.go b/vendor/github.com/containers/common/pkg/cgroups/cgroups_linux.go
index d2c610fb1..7e49c7ed3 100644
--- a/vendor/github.com/containers/common/pkg/cgroups/cgroups_linux.go
+++ b/vendor/github.com/containers/common/pkg/cgroups/cgroups_linux.go
@@ -8,6 +8,7 @@ import (
 	"context"
 	"errors"
 	"fmt"
+	"io"
 	"math"
 	"os"
 	"path/filepath"
@@ -258,10 +259,19 @@ func (c *CgroupControl) initialise() (err error) {
 }
 
 func readFileAsUint64(path string) (uint64, error) {
-	data, err := os.ReadFile(path)
+	file, err := os.Open(path)
 	if err != nil {
 		return 0, err
 	}
+	defer file.Close()
+
+	time.Sleep(time.Second / 5)
+
+	data, err := io.ReadAll(file)
+	if err != nil {
+		return 0, err
+	}
```

running

```
while true; do podman stats --no-trunc --no-stream --no-reset --format '\n'; done
```

in a loop while in another terminal I create containers with `podman run -d …`.
Not sure about the fix here, should we ignore any kind of error if the cgroup is gone?
Or should the test just stat a known container that it has control over instead of being affected by other runs?
Luap99 (Member) commented on Oct 29, 2024:

> Or should the test just stat a known container that it has control over instead of being affected by other runs?
That is not really possible and the point of these tests is to catch bugs like this.
> Not sure about the fix here, should we ignore any kind of error if the cgroup is gone?
But what deletes the cgroup? AFAIK we hold the container lock, so we should not call crun kill/delete or any other runtime command at the same time. I mean, making the listing robust against such races makes sense to me, but only if that race is to be expected.
giuseppe (Member) commented on Oct 29, 2024:

> But what deletes the cgroup? AFAIK we hold the container lock, so we should not call crun kill/delete or any other runtime command at the same time.

systemd deletes the cgroup when it detects there are no more processes running there.
giuseppe (Member) commented on Oct 29, 2024:

> > Or should the test just stat a known container that it has control over instead of being affected by other runs?
>
> That is not really possible and the point of these tests is to catch bugs like this.
If we want to support this case, we need to define when these errors can be ignored. One way could be for podman stats -a to validate, on each error, whether the container/cgroup is still running/existing. If the container exited, then the error should be ignored.
Luap99 (Member) commented on Oct 29, 2024:

Yes, I think we should catch ENOENT and ENODEV; then we could simply check whether the container pid is still alive with kill(pid, 0), and if not, ignore the error and continue with the next container.
giuseppe added 3 commits that reference this issue on Oct 29, 2024, all titled "libpod: report cgroups deleted during Stat() call": f1d437d, c93c6bb, 1f44d0f

giuseppe (Member) commented on Oct 29, 2024 (edited by giuseppe):

opened a PR: #24400
openshift-merge-bot closed this as completed in #24400 on Oct 29, 2024
stale-locking-app locked this as resolved and limited conversation to collaborators on Jan 28