bug regarding slow PCI initialization and BAR assignment times for Nvidia GPUs passed-through to VMs on our DGX H100


https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com/

From: Mitchell Augustin <mitchell.augustin@canonical.com>
To: linux-pci@vger.kernel.org, kvm@vger.kernel.org, Alex Williamson <alex.williamson@redhat.com>, Bjorn Helgaas <bhelgaas@google.com>, linux-kernel@vger.kernel.org
Subject: drivers/pci: (and/or KVM): Slow PCI initialisation during VM boot with passthrough of large BAR Nvidia GPUs on DGX H100
Date: Mon, 25 Nov 2024 16:46:29 -0600
Message-ID: <CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com>

Hello,

I’ve been working on a bug regarding slow PCI initialisation and BAR assignment times for Nvidia GPUs passed through to VMs on our DGX H100. I originally believed this to be an issue in OVMF, but upon further investigation, I now suspect that it may be an issue somewhere in the kernel. (Here is the original edk2 mailing list thread, with extra context: https://edk2.groups.io/g/devel/topic/109651206 [0])

When running the 6.12 kernel on a DGX H100 host with 4 GPUs passed through using CPU passthrough and this virt-install command[1], VMs using the latest OVMF version take around 2 minutes for the guest kernel to boot and initialise PCI devices/BARs for the GPUs. Originally, I was investigating this as an issue in OVMF, because GPU initialisation takes much less time when our host is running an OVMF version with this patch[2] removed (the patch only changes how the MMIO window size is calculated). Without that patch, the guest kernel does boot quickly, but we can only use the Nvidia GPUs within the guest if pci=nocrs pci=realloc are set in the guest (evidently because the MMIO windows that OVMF advertises to the kernel without this patch are incorrect). So, with the OVMF patch present, the MMIO windows are evidently correct and those kernel options are not needed, but the VM boot time is much slower.

In discussing this, another contributor reported slow PCIe/BAR initialisation times for large BAR Nvidia GPUs in Linux when using VMs with SeaBIOS as well. This, combined with the fact that I do not see any slowness when these GPUs are initialised on the host, and the fact that this slowness only happens when CPU passthrough is used, leads me to suspect that this may actually be a problem somewhere in the KVM or vfio-pci stack. I also tried manually setting different MMIO window sizes using the X-PciMmio64Mb OVMF/QEMU knob, and it seems that any MMIO window size large enough to accommodate all GPU memory regions results in this slower initialisation time (but also a valid mapping).

I did some profiling of the guest kernel during boot, and it seems that most of the time is spent in this pci_write_config_word() call in __pci_read_base() of drivers/pci/probe.c.[3] Each of those pci_write_config_word() calls takes about 2 seconds, and since __pci_read_base() is executed somewhere between 20 and 40 times in my VM boot, they add up to a significant chunk of the initialisation time.
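
As a rough illustration of the numbers above, here is a minimal Python sketch. It models the generic BAR sizing decode from the PCI specification (write all-ones to the BAR, read it back, derive the size from the address bits that stayed zero); it is not the kernel's __pci_read_base() code. The function names, the flag value 0xC (a 64-bit prefetchable memory BAR), and the 128 GiB example size are assumptions chosen for illustration. It then multiplies the reported per-call cost by the reported call count.

```python
# Illustrative model only; all names here are hypothetical, not kernel APIs.
MEM_FLAG_MASK = 0xF  # low 4 bits of a memory BAR carry type/prefetch flags
GIB = 1 << 30


def readback_for_size(size: int, flags: int = 0xC, width_bits: int = 64) -> int:
    """Value a device with a power-of-two BAR of `size` bytes would return
    after all-ones has been written to the BAR (assumed flags: 64-bit,
    prefetchable memory)."""
    full = (1 << width_bits) - 1
    return (~(size - 1) & full) | flags


def bar_size_from_readback(readback: int, width_bits: int = 64) -> int:
    """Decode the BAR size from the read-back value."""
    full = (1 << width_bits) - 1
    mask = readback & ~MEM_FLAG_MASK & full
    if mask == 0:
        return 0  # no writable address bits: BAR not implemented
    return (~mask & full) + 1


# Assumed example: one 128 GiB BAR.
size = bar_size_from_readback(readback_for_size(128 * GIB))

# Reported figures: about 2 s per call, 20 to 40 calls per boot.
SECONDS_PER_CALL = 2.0
low, high = 20 * SECONDS_PER_CALL, 40 * SECONDS_PER_CALL

print(size // GIB, "GiB BAR;", low, "to", high, "seconds in sizing calls")
```

Under these figures, the sizing calls alone would account for roughly 40 to 80 seconds of the approximately 2 minute boot.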

As a point of comparison, I measured the time it took to hot-unplug and re-plug these GPUs after the VM booted, and observed much more reasonable times (under 5s for each GPU to re-initialise its memory regions). I’ve also been trying to get this hotplugging working in VMs where the GPUs aren’t initially attached at boot, but in any such configuration, the memory regions for the PCI slots that the GPUs get attached to during hotplug are too small for the full 128GB these GPUs require. I have yet to figure out a way to fix that; more details are in [0].
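
For a sense of scale, the following Python sketch estimates a lower bound on the 64-bit MMIO window needed for the 4-GPU configuration, expressed in the megabyte units that the X-PciMmio64Mb knob name suggests. It assumes 128 GiB per GPU (from the figure above), ignores every other BAR and any alignment padding, and rounds up to a power of two; that rounding and the helper names are assumptions for illustration, not OVMF's actual calculation.

```python
# Hypothetical helpers; a lower-bound estimate, not OVMF's algorithm.
MIB_PER_GIB = 1024


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def min_mmio64_window_mib(num_gpus: int, bar_gib_per_gpu: int) -> int:
    """Lower bound on the 64-bit MMIO window, in MiB, counting only the
    large BAR of each GPU."""
    total_mib = num_gpus * bar_gib_per_gpu * MIB_PER_GIB
    return next_power_of_two(total_mib)


window_mib = min_mmio64_window_mib(num_gpus=4, bar_gib_per_gpu=128)
print(window_mib)  # 524288 MiB, i.e. 512 GiB
```

A real window would need to be larger than this bound, since the GPUs' smaller BARs and any other 64-bit devices must fit as well.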

I’m wondering if any other users of Nvidia GPUs or other large BAR GPUs in VMs with GPU and CPU passthrough have reported similar slowness during boot, and if anyone has any insight. If you also suspect this might be a kernel issue, and if there is anything I can provide to help identify the root causes in that case, please let me know.

[0] https://edk2.groups.io/g/devel/topic/109651206

[1] virt-install --name 4gpu-vm-2 \
    --vcpus vcpus=16,maxvcpus=16 \
    --memory 943616 \
    --numatune 0,mode=strict \
    --iothreads 1,iothreadids.iothread0.id=1 \
    --cputune emulatorpin.cpuset=55,167,iothreadpin0.iothread=1,iothreadpin0.cpuset=54,166,vcpupin0.vcpu=0,vcpupin0.cpuset=16,vcpupin1.vcpu=1,vcpupin1.cpuset=128,vcpupin2.vcpu=2,vcpupin2.cpuset=17,vcpupin3.vcpu=3,vcpupin3.cpuset=129,vcpupin4.vcpu=4,vcpupin4.cpuset=18,vcpupin5.vcpu=5,vcpupin5.cpuset=130,vcpupin6.vcpu=6,vcpupin6.cpuset=19,vcpupin7.vcpu=7,vcpupin7.cpuset=131,vcpupin8.vcpu=8,vcpupin8.cpuset=20,vcpupin9.vcpu=9,vcpupin9.cpuset=132,vcpupin10.vcpu=10,vcpupin10.cpuset=21,vcpupin11.vcpu=11,vcpupin11.cpuset=133,vcpupin12.vcpu=12,vcpupin12.cpuset=22,vcpupin13.vcpu=13,vcpupin13.cpuset=134,vcpupin14.vcpu=14,vcpupin14.cpuset=23,vcpupin15.vcpu=15,vcpupin15.cpuset=135 \
    --os-variant ubuntu22.04 \
    --graphics none \
    --noautoconsole \
    --boot loader=/usr/share/OVMF/OVMF_CODE_4M.fd,loader_ro=yes,loader_type=pflash \
    --console pty,target_type=serial \
    --network network:default \
    --network network:private-net \
    --import \
    --disk path=/var/lib/libvirt/images/4gpu-vm-2.qcow2,format=qcow2,driver.queues=16,driver.iothread=1 \
    --host-device 1b:00.0,address.type=pci \
    --host-device 61:00.0,address.type=pci \
    --host-device c3:00.0,address.type=pci \
    --host-device df:00.0,address.type=pci

[2] https://github.com/tianocore/edk2/commit/ecb778d0ac62560aa172786ba19521f27bc3f650

[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?h=v6.12#n251

Thanks,

Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering
