9.3. libvirt NUMA Tuning

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-numa_and_libvirt

Generally, best performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes. Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system. In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:

numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434

Run numad to align the guests’ CPUs and memory resources automatically. Then run numastat -c qemu-kvm again to view the results of running numad. The following output shows that resources have been aligned:

numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368
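The alignment shown in these tables can also be checked programmatically. The following is a minimal sketch (not part of the original guide) that parses `numastat -c` style output and reports, for each qemu-kvm process, the fraction of its memory resident on its dominant NUMA node; values near 1.0 indicate good alignment. The sample data is taken from the aligned output above.

```python
# Sketch: parse "numastat -c <pattern>" style output and score NUMA
# alignment per process. Assumes the compact table layout shown above;
# this is an illustration, not an official Red Hat tool.

def alignment_scores(numastat_text):
    """Return {pid_label: dominant-node fraction} for each process row."""
    scores = {}
    for line in numastat_text.splitlines():
        parts = line.split()
        # Process rows look like: PID (name) n0 n1 ... n7 total
        if len(parts) >= 4 and parts[0].isdigit() and parts[1].startswith("("):
            nums = [int(x) for x in parts[2:]]
            per_node, total = nums[:-1], nums[-1]
            if total:
                scores[f"{parts[0]} {parts[1]}"] = max(per_node) / total
    return scores

sample = """\
Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
"""

scores = alignment_scores(sample)
for proc, score in scores.items():
    print(f"{proc}: {score:.3f}")
```

Both sample processes score above 0.99, reflecting the near-complete single-node residency visible in the aligned table.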

Note

Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. See the numastat man page for more information.

9.3.1. Monitoring Memory per host NUMA Node

You can use the nodestats.py script to report the total memory and free memory for each NUMA node on a host. This script also reports how much memory is strictly bound to certain host nodes for each running domain. For example:

/usr/share/doc/libvirt-python-2.0.0/examples/nodestats.py

NUMA stats
NUMA nodes:     0       1       2       3
MemTotal:       3950    3967    3937    3943
MemFree:        66      56      42      41
Domain 'rhel7-0':
        Overall memory: 1536 MiB
Domain 'rhel7-1':
        Overall memory: 2048 MiB
Domain 'rhel6':
        Overall memory: 1024 MiB nodes 0-1
        Node 0: 1024 MiB nodes 0-1
Domain 'rhel7-2':
        Overall memory: 4096 MiB nodes 0-3
        Node 0: 1024 MiB nodes 0
        Node 1: 1024 MiB nodes 1
        Node 2: 1024 MiB nodes 2
        Node 3: 1024 MiB nodes 3

This example shows four host NUMA nodes, each containing approximately 4GB of RAM in total (MemTotal). Nearly all memory is consumed on each node (MemFree). There are four domains (virtual machines) running: domain ‘rhel7-0’ has 1.5GB of memory which is not pinned onto any specific host NUMA node. Domain ‘rhel7-2’, however, has 4GB of memory split across 4 guest NUMA nodes, which are pinned 1:1 to host nodes.

To print host NUMA node statistics, create a nodestats.py script for your environment. An example script can be found in the libvirt-python package files in /usr/share/doc/libvirt-python-version/examples/nodestats.py. The specific path to the script can be displayed by using the rpm -ql libvirt-python command.

9.3.2. NUMA vCPU Pinning

vCPU pinning provides similar advantages to task pinning on bare-metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads are running on the same physical socket and therefore share an L3 cache domain.

Note

In Red Hat Enterprise Linux versions 7.0 to 7.2, it is only possible to pin active vCPUs. However, with Red Hat Enterprise Linux 7.3, pinning inactive vCPUs is available as well.

Combining vCPU pinning with numatune can avoid NUMA misses. The performance impact of NUMA misses is significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.

If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.

Note

The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are binding to cores on the same physical socket. See the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.

Important

Pinning causes increased complexity where there are many more vCPUs than physical cores.

The following example XML configuration has a domain process pinned to physical CPUs 0-7. Each vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:

    <vcpu placement='static' cpuset='0-7'>8</vcpu>
    <cputune>
            <vcpupin vcpu='0' cpuset='0'/>
            <vcpupin vcpu='1' cpuset='1'/>
            <vcpupin vcpu='2' cpuset='2'/>
            <vcpupin vcpu='3' cpuset='3'/>
            <vcpupin vcpu='4' cpuset='4'/>
            <vcpupin vcpu='5' cpuset='5'/>
            <vcpupin vcpu='6' cpuset='6'/>
            <vcpupin vcpu='7' cpuset='7'/>
    </cputune>
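Pin lists like the one above are repetitive to write by hand. As a sketch (not part of the original guide), a 1:1 <cputune> block can be generated with Python's standard xml.etree.ElementTree, assuming the simple policy of pinning vCPU i to physical CPU i:

```python
# Sketch: generate a 1:1 <cputune> vCPU pin block like the example above.
# Purely illustrative; assumes vCPU i should be pinned to physical CPU i.
import xml.etree.ElementTree as ET

def one_to_one_cputune(n_vcpus):
    cputune = ET.Element("cputune")
    for i in range(n_vcpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(i), cpuset=str(i))
    return ET.tostring(cputune, encoding="unicode")

xml = one_to_one_cputune(8)
print(xml)
```

The resulting string can be pasted into (or merged with) a domain XML definition; for non-trivial topologies the cpuset values would come from the host topology rather than the identity mapping assumed here.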

There is a direct relationship between the vcpu and vcpupin tags. If a vcpupin option is not specified, the value is automatically determined and inherited from the parent vcpu tag. The following configuration omits the vcpupin entry for vCPU 5. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent vcpu tag:

    <vcpu placement='static' cpuset='0-7'>8</vcpu>
    <cputune>
            <vcpupin vcpu='0' cpuset='0'/>
            <vcpupin vcpu='1' cpuset='1'/>
            <vcpupin vcpu='2' cpuset='2'/>
            <vcpupin vcpu='3' cpuset='3'/>
            <vcpupin vcpu='4' cpuset='4'/>
            <vcpupin vcpu='6' cpuset='6'/>
            <vcpupin vcpu='7' cpuset='7'/>
    </cputune>
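The fallback rule for missing vcpupin entries can be modeled in a few lines. The following sketch (an illustration, not libvirt code) computes the effective cpuset for each vCPU, inheriting the parent vcpu cpuset when no explicit entry exists:

```python
# Sketch: model the vcpupin fallback described above. A vCPU without an
# explicit <vcpupin> entry inherits the cpuset of the parent <vcpu> tag.
def effective_pinning(n_vcpus, parent_cpuset, vcpupins):
    """vcpupins: {vcpu_id: cpuset_string}; returns effective cpuset per vCPU."""
    return {v: vcpupins.get(v, parent_cpuset) for v in range(n_vcpus)}

# The second example above: 8 vCPUs, parent cpuset '0-7', vCPU 5 omitted.
pins = {v: str(v) for v in range(8) if v != 5}
result = effective_pinning(8, "0-7", pins)
print(result[5])   # prints 0-7: vCPU 5 falls back to the parent cpuset
```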

Important

<vcpupin>, <numatune>, and <emulatorpin> should be configured together to achieve optimal, deterministic performance. For more information on the <numatune> tag, see Section 9.3.3, “Domain Processes”. For more information on the <emulatorpin> tag, see Section 9.3.6, “Using emulatorpin”.

9.3.3. Domain Processes

As provided in Red Hat Enterprise Linux, libvirt uses libnuma to set memory binding policies for domain processes. The nodeset for these policies can be configured either as static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <numatune> tag:

    <numatune>
            <memory mode='strict' placement='auto'/>
    </numatune>

    <numatune>
            <memory mode='strict' nodeset='0,2-3'/>
    </numatune>

libvirt uses sched_setaffinity(2) to set CPU binding policies for domain processes. The cpuset option can either be static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <vcpu> tag:

    <vcpu placement='auto'>8</vcpu>

    <vcpu placement='static' cpuset='0-7'>8</vcpu>

There are implicit inheritance rules between the placement mode you use for <vcpu> and <numatune>:

The placement mode for <numatune> defaults to the same placement mode of <vcpu>, or to static if a <nodeset> is specified.

Similarly, the placement mode for <vcpu> defaults to the same placement mode of <numatune>, or to static if a <cpuset> is specified.

This means that CPU tuning and memory tuning for domain processes can be specified and defined separately, but they can also be configured to be dependent on the other's placement mode.

It is also possible to configure your system with numad to boot a selected number of vCPUs without pinning all vCPUs at startup. For example, to enable only 8 vCPUs at boot on a system with 32 vCPUs, configure the XML similar to the following:

    <vcpu placement='auto' current='8'>32</vcpu>

Note

See the following URLs for more information on vcpu and numatune: http://libvirt.org/formatdomain.html#elementsCPUAllocation and http://libvirt.org/formatdomain.html#elementsNUMATuning

9.3.4. Domain vCPU Threads

In addition to tuning domain processes, libvirt also permits the setting of the pinning policy for each vcpu thread in the XML configuration.
Set the pinning policy for each vcpu thread inside the <cputune> tags:

    <cputune>
            <vcpupin vcpu='0' cpuset='0'/>
            <vcpupin vcpu='1' cpuset='1'/>
    </cputune>

In this tag, libvirt uses either cgroup or sched_setaffinity(2) to pin the vcpu thread to the specified cpuset.

Note

For more details on <cputune>, see the following URL: http://libvirt.org/formatdomain.html#elementsCPUTuning

In addition, if you need to set up a virtual machine with more vCPUs or memory than a single NUMA node can supply, configure the guest so that it detects a NUMA topology on the host. This allows for 1:1 mapping of CPUs, memory, and NUMA nodes. For example, this can be applied with a guest with 4 vCPUs and 6 GB memory, and a host with the following NUMA settings:

    4 available nodes (0-3)
    Node 0: CPUs 0 4, size 4000 MiB
    Node 1: CPUs 1 5, size 3999 MiB
    Node 2: CPUs 2 6, size 4001 MiB
    Node 3: CPUs 3 7, size 4005 MiB

In this scenario, use a Domain XML setting similar to the following, which pins the four vCPUs to host nodes 1 and 2 and splits the guest memory across two 3 GiB NUMA cells:

    <cputune>
            <vcpupin vcpu='0' cpuset='1'/>
            <vcpupin vcpu='1' cpuset='5'/>
            <vcpupin vcpu='2' cpuset='2'/>
            <vcpupin vcpu='3' cpuset='6'/>
    </cputune>
    <numatune>
            <memory mode='strict' nodeset='1-2'/>
    </numatune>
    <cpu>
            <numa>
                    <cell id='0' cpus='0-1' memory='3145728'/>
                    <cell id='1' cpus='2-3' memory='3145728'/>
            </numa>
    </cpu>

9.3.5. Using Cache Allocation Technology to Improve Performance

You can make use of Cache Allocation Technology (CAT) provided by the kernel on specific CPU models. This enables allocation of part of the host CPU's cache for vCPU threads, which improves real-time performance. See the following XML configuration for an example on how to configure vCPU cache allocation inside the cachetune tag:

    <cputune>
            <cachetune vcpus='0-1'>
                    <cache id='0' level='3' type='code' size='3' unit='MiB'/>
                    <cache id='0' level='3' type='data' size='3' unit='MiB'/>
            </cachetune>
    </cputune>

The XML file above configures the thread for vCPUs 0 and 1 to have 3 MiB from the first L3 cache (level='3' id='0') allocated, once for the L3CODE and once for L3DATA.

Note

A single virtual machine can have multiple <cachetune> elements. For more information see cachetune in the upstream libvirt documentation.

9.3.6. Using emulatorpin

Another way of tuning the domain process pinning policy is to use the <emulatorpin> tag inside of <cputune>. The <emulatorpin> tag specifies which host physical CPUs the emulator (a subset of a domain, not including vCPUs) will be pinned to. The <emulatorpin> tag provides a method of setting a precise affinity to emulator thread processes.
As a result, vhost threads run on the same subset of physical CPUs and memory, and therefore benefit from cache locality. For example:

    <cputune>
            <emulatorpin cpuset='1-3'/>
    </cputune>

Note

In Red Hat Enterprise Linux 7, automatic NUMA balancing is enabled by default. Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably. For more information about automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.

9.3.7. Tuning vCPU Pinning with virsh

Important

These are example commands only. You will need to substitute values according to your environment.

The following example virsh command pins vCPU thread 1 of the guest rhel7 to physical CPU 2:

% virsh vcpupin rhel7 1 2

You can also obtain the current vCPU pinning configuration with the virsh command. For example:

% virsh vcpupin rhel7

9.3.8. Tuning Domain Process CPU Pinning with virsh

Important

These are example commands only. You will need to substitute values according to your environment.

The emulatorpin option applies CPU affinity settings to threads that are associated with each domain process. For complete pinning, you must use both virsh vcpupin (as shown previously) and virsh emulatorpin for each guest. For example:

% virsh emulatorpin rhel7 3-4

9.3.9. Tuning Domain Process Memory Policy with virsh

Domain process memory can be dynamically tuned. See the following example command:

% virsh numatune rhel7 --nodeset 0-10

More examples of these commands can be found in the virsh man page.

9.3.10. Guest NUMA Topology

Guest NUMA topology can be specified using the <numa> tag inside the <cpu> tag in the guest virtual machine's XML. See the following example, and replace values accordingly:

    <cpu>
            ...
            <numa>
                    <cell cpus='0-3' memory='512000'/>
                    <cell cpus='4-7' memory='512000'/>
            </numa>
            ...
    </cpu>

Each <cell> element specifies a NUMA cell or a NUMA node. cpus specifies the CPU or range of CPUs that are part of the node, and memory specifies the node memory in kibibytes (blocks of 1024 bytes).
Each cell or node is assigned a cellid or nodeid in increasing order starting from 0.

Important

When modifying the NUMA topology of a guest virtual machine with a configured topology of CPU sockets, cores, and threads, make sure that cores and threads belonging to a single socket are assigned to the same NUMA node. If threads or cores from the same socket are assigned to different NUMA nodes, the guest may fail to boot.

Warning

Using guest NUMA topology simultaneously with huge pages is not supported on Red Hat Enterprise Linux 7 and is only available in layered products such as Red Hat Virtualization or Red Hat OpenStack Platform.

9.3.11. NUMA Node Locality for PCI Devices

When starting a new virtual machine, it is important to know both the host NUMA topology and the PCI device affiliation to NUMA nodes, so that when PCI passthrough is requested, the guest is pinned onto the correct NUMA nodes for optimal memory performance. For example, if a guest is pinned to NUMA nodes 0-1, but one of its PCI devices is affiliated with node 2, data transfer between nodes will take some time.

In Red Hat Enterprise Linux 7.1 and above, libvirt reports the NUMA node locality for PCI devices in the guest XML, enabling management applications to make better performance decisions. This information is visible in the sysfs files in /sys/devices/pci*/*/numa_node.
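The sysfs value can be read directly. Below is a minimal sketch (not part of the original guide) for querying a device's NUMA node; the demo uses a temporary stand-in directory instead of real sysfs so it runs anywhere, and a real /sys/devices/pci.../<device> path would be substituted in practice. A sysfs value of -1 means the kernel reports no node affinity for that device.

```python
# Sketch: read a PCI device's NUMA node affinity from sysfs.
# The demo directory below is a stand-in for a real
# /sys/devices/pci.../<device> path, so the example runs anywhere.
import tempfile
from pathlib import Path

def pci_numa_node(device_path):
    """Return the device's NUMA node, or None if the kernel reports -1."""
    node = int(Path(device_path, "numa_node").read_text().strip())
    return None if node < 0 else node

with tempfile.TemporaryDirectory() as fake_dev:
    Path(fake_dev, "numa_node").write_text("1\n")
    node = pci_numa_node(fake_dev)
print(node)   # prints 1
```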
One way to verify these settings is to use the lstopo tool to report sysfs data:

# lstopo-no-graphics
Machine (126GB)
  NUMANode L#0 (P#0 63GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "em1"
        PCI 8086:1521
          Net L#1 "em2"
        PCI 8086:1521
          Net L#2 "em3"
        PCI 8086:1521
          Net L#3 "em4"
      PCIBridge
        PCI 1000:005b
          Block L#4 "sda"
          Block L#5 "sdb"
          Block L#6 "sdc"
          Block L#7 "sdd"
      PCIBridge
        PCI 8086:154d
          Net L#8 "p3p1"
        PCI 8086:154d
          Net L#9 "p3p2"
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
                GPU L#10 "card0"
                GPU L#11 "controlD64"
      PCI 8086:1d02
  NUMANode L#1 (P#1 63GB)
    Socket L#1 + L3 L#1 (20MB)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#1)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#3)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#5)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#7)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#9)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#11)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#13)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
    HostBridge L#8
      PCIBridge
        PCI 1924:0903
          Net L#12 "p1p1"
        PCI 1924:0903
          Net L#13 "p1p2"
      PCIBridge
        PCI 15b3:1003
          Net L#14 "ib0"
          Net L#15 "ib1"
          OpenFabrics L#16 "mlx4_0"

This output shows:

NICs em* and disks sd* are connected to NUMA node 0 and cores 0,2,4,6,8,10,12,14.

NICs p1* and ib* are connected to NUMA node 1 and cores 1,3,5,7,9,11,13,15.
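The device-to-node mapping spelled out above can also be extracted mechanically. The following sketch (not part of the original guide; it assumes the plain-text indented layout produced by lstopo-no-graphics as shown) associates each Net device with the NUMANode whose subtree it appears in:

```python
# Sketch: map Net devices to NUMA nodes by walking lstopo-no-graphics
# output line by line. Relies on the indented plain-text layout above:
# each NUMANode line starts a subtree containing that node's devices.
import re

def nets_by_numa_node(lstopo_text):
    mapping = {}
    current_node = None
    for line in lstopo_text.splitlines():
        m = re.search(r"NUMANode L#(\d+)", line)
        if m:
            current_node = int(m.group(1))
            continue
        m = re.search(r'Net L#\d+ "([^"]+)"', line)
        if m and current_node is not None:
            mapping[m.group(1)] = current_node
    return mapping

sample = """\
Machine (126GB)
  NUMANode L#0 (P#0 63GB)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "em1"
  NUMANode L#1 (P#1 63GB)
    HostBridge L#8
      PCIBridge
        PCI 1924:0903
          Net L#12 "p1p1"
"""

print(nets_by_numa_node(sample))   # {'em1': 0, 'p1p1': 1}
```

With this mapping in hand, a management script could cross-check a guest's numatune nodeset against the node of any PCI device slated for passthrough.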
