What is the suggested I/O scheduler to improve disk performance when using Red Hat Enterprise Linux with virtualisation?

https://access.redhat.com/solutions/5427

Solution Verified - Updated August 7 2024 at 7:25 AM - English
Environment

- Red Hat Enterprise Linux (RHEL) 4, 5, 6, 7, 8, and 9
- Virtualisation, e.g. KVM, Xen, VMware or Microsoft Hyper-V
- Virtualisation guest or virtualisation host
- Virtual disk

Issue

What is the recommended I/O scheduler for Red Hat Enterprise Linux as a virtualisation host?

Resolution

There is no single “best” I/O scheduler recommendation that applies to all situations or to any given generic environment. Any change to the I/O scheduler should be made in conjunction with testing to determine which scheduler provides the most benefit for the specific I/O workload of the application suite in question.

The following are common starting recommendations for an I/O scheduler for a RHEL-based virtual guest, based upon kernel version and disk type.

RHEL 8, 9: mq-deadline is the default I/O scheduler unless otherwise changed [FN.1]

- Virtual disks: keep the current I/O scheduler setting (mq-deadline)
- Physical disks: keep the current I/O scheduler setting (mq-deadline, or none for NVMe)

RHEL 7.5+: deadline is the default I/O scheduler unless otherwise changed

- Virtual disks: keep the current I/O scheduler setting (deadline)
- Physical disks: keep the current I/O scheduler setting (deadline)

RHEL 4, 5, 6, (7.0-7.4): cfq is the default I/O scheduler unless otherwise changed

- Virtual disks: change to the noop scheduler [FN.2]
- Physical disks: keep the current I/O scheduler setting

Online configuration of the I/O scheduler on Red Hat Enterprise Linux

See “Can I change the I/O scheduler for a particular disk without the system rebooting?”

Determine which schedulers are available:


cat /sys/block/sda/queue/scheduler

[mq-deadline] kyber bfq none

Change the scheduler for a device and verify using the same command as above:


echo 'none' > /sys/block/sda/queue/scheduler

cat /sys/block/sda/queue/scheduler

mq-deadline kyber bfq [none]

Additional documentation references:

- RHEL 9: Chapter 11. Setting the disk scheduler
- RHEL 8: Chapter 12. Setting the disk scheduler in Monitoring and Managing System Status and Performance; Chapter 19. Setting the disk scheduler in Managing Storage Devices
- RHEL 7: Section 6.2.1. Configuring the I/O Scheduler for Red Hat Enterprise Linux 7
- RHEL 6: Section 6.4. Configuring the I/O Scheduler

Configuring the I/O scheduler on Red Hat Enterprise Linux 8 and 9

In RHEL 8 and newer, new I/O schedulers are available: mq-deadline, none, kyber, and bfq. Note that the noop scheduler is now called none. More information on the noop/none schedulers can be found in “How to use the Noop or None IO Schedulers”. In RHEL 8 and newer, the default scheduler is mq-deadline. The I/O scheduler can be set via tuned or via udev rules. The udev method is often preferred due to its robust configuration options.

Configuring the I/O scheduler via tuned and udev

Another way to change the default I/O scheduler is to use tuned. More information on creating a custom tuned profile can be found in Chapter 2. Customising TuneD profiles and “How do I create my own tuned profile on RHEL7?” The default scheduler in Red Hat Enterprise Linux 4, 5 and 6 is cfq. The available tuned profiles use the deadline elevator. See “How do I create my own tuned profile on RHEL6?” on creating a custom I/O scheduler configuration via tuned. See also “How to set a permanent I/O scheduler on one or more specific devices using udev”.

Footnotes

FN.1 | With mq-deadline as the default scheduler, there is no longer a compelling reason to change to the none I/O scheduler for virtual disks.

An exception would be if the virtual disks are backed by high-speed NVMe or NVDIMM technology within the hypervisor and the guest is performing a very large number of I/O operations per second with small I/O sizes (4 KB). In this corner case, switching to the none scheduler can provide a slight overall improvement in IOPS and therefore throughput, due to the slightly longer code execution path of mq-deadline versus none. Testing NVMe-backed virtual disks with a simple single-I/O-depth test showed roughly a 2% difference between none and mq-deadline at block I/O sizes of 32 KB or larger, and less than a 1% difference at a 512 KB I/O size. Under normal circumstances, with more realistic (more complex) I/O loads, this difference can be within statistical noise, as it depends on what other operations the hypervisor is busy with at any given time and on whether the NVMe device is utilised by more than one virtual disk and/or more than one guest. The mq-deadline scheduler is also the scheduler used by the tuned profile virtual-host and the default scheduler in RHEL 8 and 9 (changed to mq-deadline from deadline in 7.6+) for all but direct-attached SATA rotating-media disks.

FN.2 | On RHEL 7.5 and earlier: while the default cfq scheduler is a reasonable choice even for virtual disks within virtual guests, it does have drawbacks. The main one is that it is tuned to maximise I/O to a single rotating-media physical disk. Moreover, most hypervisors also perform their own I/O scheduling for the physical resources behind the virtual disks, and multiple virtual disks can use the same physical storage resource and be presented to one or more guests. Under these circumstances, switching to the noop I/O scheduler for virtual disks is recommended. Doing so reduces the code-path time associated with cfq and other schedulers (e.g. it removes cfq's slice idle time).
Using noop reduces the time an I/O spends in the scheduler layer of the Linux I/O stack and allows the hypervisor to better schedule the I/O against the physical resources it is managing.

If the hypervisor is known to do its own I/O scheduling – which is normally the case – then guests often see sufficient benefit from the noop I/O scheduler versus the cfq and bfq schedulers to make the switch to noop a recommended first step. Under heavy I/O loads from a guest, switching the guest to deadline can sometimes provide a small performance edge over noop – but it is highly dependent on the I/O load itself. For example, a database I/O load of synchronous reads and asynchronous writes can benefit from deadline by biasing dispatch of reads (which block processes) over writes (which are non-blocking I/O).

FN.3 | The storage technology in use can affect which I/O scheduler produces the best results for a given configuration and I/O workload.

If physical disks, iSCSI, or SR-IOV pass-through devices are provisioned to guests, then the none (noop) scheduler should not be used. Using none does not allow the Linux virtualisation host to optimise I/O requests in terms of type or order to the underlying physical device. Only the guest itself should perform I/O scheduling in such configurations, so choose mq-deadline (or deadline, depending on kernel version). If virtual disks are presented to guests, then for most I/O workloads mq-deadline is likely statistically close enough to none; given that mq-deadline is the default in later kernel versions, there is no compelling reason to change to none. If the hypervisor is known to do its own I/O scheduling – which is normally the case – then guests often benefit greatly from switching from the cfq or bfq schedulers to noop; there is a much less measurable change in performance when switching from the mq-deadline or deadline schedulers to none. Switching from cfq to noop allows the hypervisor to optimise the I/O requests and prioritise based on its view of all of the I/O from one or multiple guests. The hypervisor receives I/O from the guest in the order it was submitted within the guest. Within the Linux virtual guest, the noop scheduler can still combine small sequential requests into larger requests before submitting the I/O to the hypervisor. Switching from mq-deadline to none results in a slightly shorter code execution path, but in making the switch the Linux virtual guest loses the ability to prioritise dispatching blocking I/O over non-blocking I/O.

FN.4 | While there is a significant difference between the default cfq and noop in RHEL 4, 5, and 6, there is less performance difference in a virtual-disk environment between the default mq-deadline and none in RHEL 8 and 9.
However, if minimising I/O latency within the guest is more important than maximising I/O throughput for the guest's I/O workloads, then it may be beneficial to switch to none in RHEL 8 and 9 environments. Just be aware that nominal measured differences between the two schedulers for virtual disks are typically in the range of +/- 1-3%. Every I/O workload is different, so be sure to perform proper testing within your environment to determine how a scheduler change impacts your specific workload.
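The starting recommendations above can be condensed into a small decision helper. This is only a sketch of the table, not a Red Hat tool: the function name is made up, and RHEL 7 is treated here as 7.5+ (where deadline is the default); 7.0-7.4 follow the cfq row.

```shell
# Sketch of the starting-point recommendations above; not a Red Hat tool.
# Treats major version 7 as 7.5+ (deadline default); 7.0-7.4 follow the cfq row.
suggest_scheduler() {
    rhel_major=$1   # e.g. 6, 7, 8, 9
    disk_kind=$2    # "virtual" or "physical"
    if [ "$rhel_major" -ge 8 ]; then
        echo "mq-deadline"          # keep the default
    elif [ "$rhel_major" -eq 7 ]; then
        echo "deadline"             # keep the default
    elif [ "$disk_kind" = "virtual" ]; then
        echo "noop"                 # change from the cfq default [FN.2]
    else
        echo "cfq"                  # keep the default
    fi
}

suggest_scheduler 9 virtual   # prints mq-deadline
suggest_scheduler 6 virtual   # prints noop
```

As the footnotes stress, this is only a starting point; test against your own workload before committing to a change.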

Root Cause

Testing

NOTE: All scheduler tuning should be tested under normal operating conditions, as synthetic benchmarks typically do not accurately compare performance of systems using shared resources in virtual environments.

In this document, we refer to testing and comparing multiple schedulers. Some hints:

- Recommendations and defaults are only a place to start.
- Outside of some specific corner cases, the typical change in performance when comparing different schedulers is nominally in the +/- 5% range. It is very unusual, even in corner cases like all-sequential reads for video streaming, to see more than a 10-20% improvement in I/O performance from a scheduler change alone, so expecting a 5-10x improvement from finding the “right” scheduler is unrealistic.
- First be clear about the goal or goals to optimise for. Do I want as many I/O operations as possible to storage? Do I want to optimise an application to provide service in a certain way, for example “this Apache webserver should be able to hand out as many static files (fetched from storage) as possible”?
- With the goal clear, one can decide on the best tool to measure. Applications can then be started and measured. Without changing the conditions, several schedulers can be tried out and the measurements compared.
- Special attention should be paid to the mutual influence of components. A RHEL host might run 10 KVM guests, and each of the guests various applications. Benchmarking should consider this whole system.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
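When scripting such comparisons, the active scheduler can be read back from sysfs, where the bracketed entry marks the one in use (e.g. `[mq-deadline] kyber bfq none`). A minimal parsing sketch — the function name is illustrative, not part of any Red Hat tooling:

```shell
# Extract the active scheduler (the bracketed entry) from the one-line
# format used by /sys/block/<dev>/queue/scheduler.
active_scheduler() {
    # $1 is the scheduler line, e.g. "$(cat /sys/block/sda/queue/scheduler)"
    echo "$1" | tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'
}

active_scheduler "[mq-deadline] kyber bfq none"   # prints mq-deadline
```

This makes it easy to log which scheduler was active alongside each benchmark run.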


Active Contrib… 84 points Aug 16, 2011 7:23 AM G says: We’ve done lots of testing on this with RHEL 5 and found cfq to be best for the guest in an ESXi env.

Red Hat Community Member 22 points Nov 1, 2011 7:57 AM Christoph Doerbeck says: RHCSA If I’m not mistaken, there is a race condition when running RHEL 4 with elevator=NOOP in VMware. Use deadline or cfq.

Active Contrib… 121 points Mar 24, 2016 1:09 PM Frank says: Typo: one does not edit /etc/grub2.cfg. See /etc/sysconfig/grub.

Guru 472 points Dec 9, 2016 12:41 AM Dan says: /etc/grub2.cfg should not be edited; the correct way is to edit /etc/default/grub and then run grub2-mkconfig.


Community Member 22 points May 3, 2017 5:55 PM Tadej Janež says: On UEFI-based machines, one must execute grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg instead of grub2-mkconfig -o /boot/grub2/grub.cfg.

For more info, see: Customising the GRUB 2 Configuration File.


Guru 36580 points Mar 6, 2018 6:22 PM Christian Labisch says: Community Leader Another option would be to specify the scheduler depending on the disk being in use. Create this new rule -> sudo nano /etc/udev/rules.d/60-schedulers.rules

Put the following content into the empty file (save it - reboot the system afterwards) :

# set cfq scheduler for rotating disks

ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"

# set deadline scheduler for non-rotating disks

ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"


Red Hat Guru 4206 points Mar 7, 2018 1:00 AM Dwight Brown says: The rule does not differentiate between virtual disks and pass-through disks (for example, enterprise SAN storage). So my assumption is the above is for virtual disks only.


Guru 36580 points Mar 7, 2018 1:09 AM Christian Labisch says: Community Leader Hi Dwight, this method works for physical disks as well - I have tested it on various Linux distributions. When you are running a RHEL host with KVM virtualisation on an SSD, the deadline scheduler will be in use.


Red Hat Guru 4206 points Mar 7, 2018 4:49 AM Dwight Brown says: Pass-through SAN-based disks are typically marked rotational, but because they are nominally backed by multiple physical disks, deadline is a better scheduler than cfq for those disks. The above rules will set cfq for rotational SAN disks, which is often sub-optimal.


Guru 36580 points Mar 7, 2018 6:22 PM Christian Labisch says: Community Leader Hi Dwight, thanks for the information - I only wanted to provide an additional option to tweak the scheduler. If you want to use the deadline scheduler for rotational disks, you can of course do this by changing the rule: ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"

Expert 1269 points Sep 4, 2019 9:09 PM J Schmidt says: If the noop and deadline schedulers disable OS I/O scheduling completely, then the ionice command and the automatic I/O prioritisation in the OS no longer work. Since the VMware host has no way of prioritising I/O between individual processes in the guest OS, there must definitely be times when disabling OS scheduling is a bad idea - for example, when low-performing or very busy storage is used for the guest and you have both a low-priority backup process and a real-time-scheduled production application/process.


Red Hat Guru 4206 points Sep 5, 2019 5:56 AM Dwight Brown says: “If the schedulers noop and deadline disables the OS I/O-scheduling completely … the automatic I/O-prioritisation in the OS does not work anymore”

Clarification is needed for the above. noop and deadline do not disable I/O scheduling completely; they just don’t allow significant tuning of how the scheduling (queueing) is performed. For example, deadline has two queues, read and write, but each queue is accessed via two methods, location-sorted and time-sorted: so two inputs and four outputs. That is all it has, and it cannot be tuned further in terms of what is in the queues. It can be tuned to bias the draining of I/O out of the queues more towards reads or writes depending on how the tunables are set – that is, how many I/O are stripped from each queue and interleaved/sent to storage, or how long a “deadline” period is used before switching from location-ordered stripping of I/O from the queues to time-ordered draining.

cfq has some additional automatic sorting of I/O (not really “scheduling”) but allows ionice and cgroup manipulation of processes’ I/O into different queues; these queues come in sets of idle, best-effort, and real-time, and are then further broken down into queues of different priorities within those queue types.

“there must definitely be times when disabling OS scheduling is a bad idea.”

Yes, there can be - hence the wiggle room in the descriptions of each I/O scheduler:

- “…cfq scheduler is usually ideal” - which conversely means there will be circumstances where it is not ideal, but in general it is a reasonable starting point for tuning.
- “…guests often benefit greatly from the noop I/O scheduler” - which conversely means there will be circumstances where noop is not a benefit in the application environment.
- “…schedulers like deadline can be more advantageous” - not “shall be” but “can be”, for certain storage environments and application I/O loads, especially if those loads are read-latency sensitive and have a major component of buffered writes associated with the application.

This article is about the majority of cases and a reasonable starting point for further tuning. It does not cover cases like RT guests or environments with a significant need to tune/assign priorities for individual processes… that is just too specific to niche configurations. Also see the caveats above, specifically about not applying the same scheduler tuning to paravirt or pass-through devices as to virtual disks.

Tuning outside of the general recommendations tends to address corner cases where the physical resources of the hypervisor, spread across a number of guests, are well separated and/or well bounded in terms of the potential collective I/O load from all guests. With virtual disk I/O, a single guest schedules its I/O in a vacuum: it all trickles down into the hypervisor to be mixed with the I/O of other guests going to the same underlying disks, so setting up one guest to use RT queue scheduling means nothing to the hypervisor when it blends that I/O with other guests’ I/O. When tuning one guest you cannot know the definitive steady-state impact on performance, since you cannot know what I/O load the other guests are presenting at any given moment. Basically it is trial and error, with noop or deadline generally nominally best over a wide range of guests and application loads. If you know the underlying physical storage is used only by one guest’s virtual disks, then you can almost treat it as physical pass-through storage within the guest in terms of tuning… but most of the time you will not know that you have dedicated physical disks behind the virtual disks of the current guest. And if you did, it would be best to present them as pass-through resources to the guest in the first place.

“for example have a low priority backup-process and a real-time scheduled production application/process.”

You have to be very careful doing things like that, as you can end up with a hung or very poorly performing system due to priority inversion. For example, the backup process locks a directory on disk as it processes files so that things can’t change while it grabs them. While it holds that filesystem lock, higher-priority RT I/O to the same filesystem blocks the backup I/O from being processed, to the point that the RT process trips over the filesystem lock the low-priority backup process is holding. In truth, I have seen applications tuned with RT to the point that the filesystem’s metadata transactions could not be flushed, resulting in parts of the RT application environment locking up (while other parts continued to read/write data to other parts of the filesystem or to other partitions on the disk), eventually locking out the filesystem’s metadata transactions to the detriment of all other processes.

Tuning isn’t for the faint of heart.

Expert 1142 points Dec 6, 2019 12:50 AM Klaas Demter says: If the noop scheduler is usually the better choice for virtual guests, why isn’t it the RHEL default on those systems?


Red Hat Guru 3001 points Dec 6, 2019 9:50 AM Christian Horn says: RHCSARHCE I think our internal discussions did not end with a clear winner. The article currently mentions both noop and deadline. A view over the workloads/environments our customers run would be required, and then an investigation into whether noop or deadline would be better.

If you have hints on relevant statistics which help to point out that noop would be the better default, then that could be a basis for rethinking the defaults. That would work best as an RFE request via a customer centre case.

Expert 1142 points Dec 6, 2019 4:06 PM Klaas Demter says: This article reads like noop is the first choice for VMs and deadline is only better for certain workloads; that’s why I asked the question :) If your internal discussions do not support this, I would suggest making that clearer in this article.

Expert 1269 points Dec 7, 2019 12:31 AM J Schmidt says: +1

I totally agree with @Klaas: Saying one thing and doing another just creates unnecessary confusion and possibly extra work for nothing.


Red Hat Guru 3001 points Dec 9, 2019 2:33 PM Christian Horn says: RHCSARHCE Klaas, Jesper, I looked a bit deeper. RHEL 7.5 and later use deadline in guests by default; RHEL 7.4 and earlier do not explicitly set one, they have ‘none’.

The Virt tuning guide also touches on the topic, with quite similar wording to this kbase.

I see no clear recommendation towards noop or deadline in either source, but ‘noop’ is mentioned before ‘deadline’. The virt-tuning-guide also mentions that RHEL 7 defaults to deadline.

I did not see anything stated as clearly as “noop scheduler is usually the better choice”; such a clear recommendation would indeed be in contrast with the current defaults.

I think things could be clearer though. Let me start a mail thread with Dwight, and the authors of the virt-tuning-guide. Thank you for bringing this to our attention.

Expert 1142 points Dec 9, 2019 4:36 PM Klaas Demter says: From this article: “RHEL guests often benefit greatly from the noop I/O scheduler, which allows the host/hypervisor to optimise the I/O requests and prioritise based on incoming guest load. The noop scheduler can still combine small requests from the guest OS into larger requests before handing the I/O to the hypervisor, however noop follows the idea to spend as few CPU cycles as possible in the guest for I/O scheduling. The host/hypervisor will have an overview of the requests of all guests and have a separate strategy for handling I/O.” This does (at least to me) read as if noop is usually (to quote “often”) the better choice :)

Community Member 42 points Dec 10, 2019 3:31 PM Emil Golinelli says: I am reading this solution in the same manner as Klaas. Unless noop is actually the preferred choice, please change the wording on this page.


Red Hat Guru 3001 points Jan 7, 2020 5:05 PM Christian Horn says: RHCSARHCE Happy new year, all.

We had great threads on the topic internally, down to applications where the vendor would recommend schedulers, with comments regarding decision trees for deciding on the best scheduler for a given environment/workload. The common line is that the current default of deadline in guests dates from RHEL 7.5 GA, and was not deliberately implemented for virtual guests but was more a side effect of other changes. So all agree that testing a given environment/workload with both noop and deadline is the best recommendation. I hope that with the recent kbase modification this becomes clearer.

