Download Nice_sys Driver

TL;DR Modern disks are so fast that system performance bottleneck shifts to RAM access and CPU. With up to 64 cores, PCIe 4.0 and 8 memory channels, even a single-socket AMD ThreadRipper Pro workstation makes a hell of a powerful machine - if you do it right!

Sys Driver Download. I require a driver for the viahduaa.sys. I hope some1 cane help me. Below you can download viahduaa.sys driver for Windows. Download the latest SYS device drivers (Official and Certified). SYS drivers updated daily. In this post you can find nvefdxp.sys File is secure, passed Norton virus scan! Driver Info: File name: nvefdxp-sys.zip Version: 2.0.6 Size. Alcxwdm Sys Driver for Windows 7 32 bit, Windows 7 64 bit, Windows 10, 8, XP. Uploaded on 3/8/2019, downloaded 387 times, receiving a 97/100 rating by 254 users. $ mpstat -P 0 Linux 3.2.0-57-generic (USERNB01) x8664 (2 CPU) 03:54:00 PM CPU%usr%nice%sys%iowait%irq%soft%steal%guest%idle 03:54:00 PM 0 3.82 0.01 1.16 3.88 0.00 0.06 0.00 0.00 91.06 20. Vmstat is a tool that provides reporting virtual memory statistics. It covers the system’s memory, swap and processors.

Introduction

In this post I’ll explain how I configured my AMD ThreadRipper Pro workstation with 10 PCIe 4.0 SSDs to achieve 11M IOPS with 4kB random reads and 66 GiB/s throughput with larger IOs - and what bottlenecks & issues I fixed to get there. We’ll look into Linux block I/O internals and their interaction with modern hardware. We’ll use tools & techniques, old and new, for measuring bottlenecks - and other adventures in the kernel I/O stack.

When running fio on Linux with io_uring, the end result is this:

With 4kB blocksize random reads, I get over 11 million IOPS, at 42 GiB/ (~45 GB/s) throughput. I will go through technical details and larger blocksizes below, but I’ll start by answering the “why” question first.

Index

Why?

Why would you even need such IO throughput in a single machine? Shouldn’t I be building a 50-node cluster in the cloud “for scalability”? This is exactly the point of my experiment - do you really want to have all the complexity of clusters or performance implications of remote storage if you can run your I/O heavy workload on just one server with local NVMe storage? How many databases out there need to sustain even “only” 1M disk IOPS? Or if you really do need that sweet 1 TB/s data scanning speed, you could do this with 10-20 well-configured cluster nodes instead of 200. Modern hardware is powerful, if used right!

I’m well aware of various enterprise data management functional requirements, such as availability, remote replication, data sharing or just huge data volumes that may direct you away from using local NVMe SSDs as your primary storage and I am not arguing against that. Some of these requirements can be addressed in software, some not. The scope of this article is to show the raw performance even cheap commodity hardware gives you nowadays, so if your company is paying 10x more for 100x lower throughput, it’s good to be aware of other options.

My future plans for this article series include running I/O heavy performance tests with various database engines, other ideas are welcome too!

Hardware

Machine

  • Lenovo ThinkStation P620 workstation
  • Tech specs are here and more details below

CPU

  • Zen 2 microarchitecture (7nm)
  • Max number of CPU cores is 64 (128 threads), but I bought a 16-core version because I’m cheap
  • All 16 cores can run at 3.9 GHz sustained base frequency, with 4.3 GHz max “boost” frequency
  • This single CPU has 8 memory channels supporting 8 x DDR4 3200 ECC RDIMMs
  • This CPU supports 128 PCIe 4.0 lanes

So, the excellent throughput of this workstation does not come only from CPU processing speeds, but from the additional bandwidth 8 memory channels and 128 PCIe 4.0 lanes offer! A single PCIe 4.0 lane gives you about 1.969 GB/s bandwidth in each direction (PCIe is switched, point-to-point full duplex). In theory, 128 lanes should mean that the CPU can handle ~250 GB/s (2 terabit/s!) PCIe traffic in each direction, if all lanes were fully used.

Note that on modern CPUs, most PCIe lanes are connected directly to the CPU (as are memory channels) and do not go through some external “southbridge” controller or “front-side bus”.

Download nice_sys drivers

This also means that a single PCIe 4.0 x4 card can achieve close to 8 GB/s data transfer speed, assuming that the devices can handle it and you do not hit other bottlenecks first. This leads us to the next section, suitable SSDs.

Storage

The Samsung 980 Pro is a true PCIe 4.0 SSD, with specs claiming 7000 MB/s read and 5000 MB/s write throughput (with caveats). Its internal controller is capable of true PCIe 4.0 transfer speeds and is not an old PCIe 3.0 chip that just presents it as a PCIe 4.0 compatible one with low GT/s. I understand that some other “PCIe4” SSDs out there will max out at the PCIe3 speeds (~3.5 GB/s) as they don’t have new generation controllers in them.

Download Nice_sys Driver Windows 10

I ended up buying 8 x 1 TB SSDs, eventually used for data volumes and 2 x 500 GB ones for boot disks and software. I also have a 380 GB Intel Optane 905P SSD for low latency writes (like transaction logs), but more about that in a future post.

The NVMe disks show up like this (8x1 TB and 2x500 GB):

Samsung’s specs pages (1,2) state that the max sequential read & write speeds as 7000 MB/s & 5000 MB/s for the 1 TB drives using PCIe 4.0. The IO pattern doesn’t really have to be sequential, just with big enough I/O sizes. When these cards are plugged in to PCIe 3.0 slots, you’d get only ~3500 MB/s read and write speeds due to the PCIe 3.0 x4 throughput limitations.

Take the write speeds with a grain of salt, as TLC & QLC cards have slower multi-bit writes into the main NAND area, but may have some DIMM memory for buffering writes and/or a “TurboWrite buffer” (as Samsung calls it) that uses part of the SSDs NAND as faster SLC storage. It’s done by issuing single-bit “SLC-like” writes into TLC area. So, once you’ve filled up the “SLC” TurboWrite buffer at 5000 MB/s, you’ll be bottlenecked by the TLC “main area” at 2000 MB/s (on the 1 TB disks).

Apparently on the 980’s the TurboWrite buffer defaults to ~6 GB on the 1 TB SSD, but it’s dynamic and can grow up to 108 GB if there’s high write demand and enough unused NAND space. Samsung also has their own designed-in-house disk controller (Elpis) that is built for achieving PCIe 4.0 speeds. It can handle 128 I/O queues (with 64k command sets per queue!). This article is a good reference about this disk - and an overview of complexity (and potential bottlenecks) of modern SSDs!

As I want to focus on just the max raw block I/O performance for this article, I will run my tests directly against the NVMe block devices (without a filesystem or LVM on the I/O path). NVMe block devices have per-CPU multi-queues (MQ) enabled by default and device interrupts are “striped” across all CPUs. I’ll present some LVM, multiqueue and file system tests in future articles.

You can subscribe to email updates or follow me here.

Single-disk test

I used fio 3.25 on Ubuntu 20.10 with Ubuntu-provided Linux kernel 5.8.0-29-generic for these tests. I briefly tested the (currently) latest Linux kernel 5.11-rc4 too, it has some io_uring enhancements, but got lower throughput out of it. There are some big AMD ThreadRipper (power-aware scheduler) updates in it, apparently with some unsolved performance regressions.

I first ran a single-disk test to avoid hitting any system-wide throughput bottlenecks when accessing all 10 SSDs simultaneously. This test aimed to give me a theoretical max throughput of a single disk. I used fio with --io_uringasynchronous I/O option. As expected, it was more efficient than libaio.

I ended up using the following fio command for my narrowly focused synthetic benchmark. I’ll explain the reasoning for some of the command line options later in this post.

Here’s the onessd.sh script that I used for single-disk testing:

Ok, let’s first run fio with 3 concurrent workers doing 4kB reads against a single disk only (with queue_depth=32 per worker), to see the maximum theoretical throughput of such a disk in my machine:

Looks like we got even more IOPS out of the disk than Samsung promised - their specs said “only” 1,000,000 IOPS per disk. If you wonder why I’m running the single-disk test with 3 concurrent worker processes - it turns out that a single process maxes out a single CPU at about 450k IOPS on my setup. So, a single process is physically incapable of submitting (and reaping) more than 450k I/Os per second of CPU time (being 100% on CPU). Assuming that its CPU core was running at around 3.9 GHz (and didn’t have much else to do), it translates to about 3,900,000,000 / 450,000 = 8,666 CPU cycles used for submitting + issuing + completing + reaping each I/O.

If you look into the metrics below, you see that my 3 worker processes used about 9-10% of CPU capacity of my machine with 32 (logical) processors:

Sidenote: It looks like dstat on Ubuntu 20.10 has some CPU utilization rounding & reporting errors. I used other tools to verify that my CPU numbers in this article are similar.

How about large I/Os? Let’s try 1 MB sized reads, something that a database engine would be using for scanning large tables:

We are scanning data at over 6 GiB/s just using one disk!

Seems like we couldn’t actually do 1 MB sized reads, as we are issuing 13.8k read operations for ~6800 MB below. Either fio wasn’t able to issue 1 MB-sized IOs or the larger requests got split into ~512kB chunks somewhere in the kernel:

Thanks to the big I/O sizes, every request takes a lot of time as the disk’s DMA controller copies the request’s data to relevant memory location in RAM. PCIe 4.0 x4 max theoretical transfer rate is about 7.877 GB/s, even the PCIe transfer from the disk controller memory to CPU will take over 120 us per MB, in addition to the flash reading and SSD controller’s latency.

So, as you see from the CPU usage numbers above, my CPUs were 99% idle, despite running 3 concurrent fio workers and top confirmed that. What were the worker processes doing then? They were mostly sleeping, waiting for events in io_uring completion queue, with WCHAN io_cqring_wait:

You can download the open source psn tool from 0x.tools site.

I/O configuration - direct I/O

In the following two sections I show how I tried different OS level I/O configuration options (direct vs cached I/O and using an I/O scheduler). These sections go pretty deep into Linux kernel troubleshooting topics, if you want to skip this and read about the hardware configuration challenges, jump to the Multi-disk test section.

I used --direct=1 option that forces files to be opened with O_DIRECT flag - bypassing the OS pagecache. When running millions of IOPS, you want to minimize the CPU overhead of every operation and copying, searching & replacing pages in the OS pagecache will radically increase your CPU usage and memory traffic. Most mature database engines have a built-in cache anyway, so why duplicate work (and memory usage).

But for fun, I ran the same test with --direct=0 anyway and the results are below:

Wait, what? My fio test with 3 workers somehow keeps all the CPUs 100% busy in kernel mode? The runnable threads (run column) alternates between 38 and 75? The read throughput has dropped from over 6 GiB/s to 3 GiB/s. But why do we have so much CPU activity? Is that just how it is?

Let’s not guess or give up, but measure! As the kernel CPU usage is very high, we have a couple of options for drilling down.

pSnapper can show which threads (and in which state) are most active:

So, it’s not my 3 fio processes that magically eat all the CPU time, but there’s a lot of io_wge_worker-N kernel threads that are doing something! It starting to look like “it’s not me, it’s you - Linux kernel”. When you scroll the above output right, you see that the top entry reports these threads waiting in thread state D - uninterruptible (usually) disk I/O sleep and the kernel function (wchan) that requested the sleep is wait_on_page_bit_common. That’s the common WHAN that shows up whenever you’re waiting in cached I/O (pagecache) codepath. We still don’t know whether this is just a “waiting for slow I/O via pagecache to complete” scenario or some kernel issue. Remember that there’s a significant amount of these kernel worker threads also burning CPU in kernel mode, not waiting for anything (the 2nd highlighted line with “Running (ON CPU)” above).

The good news is that we can easily drill down further! pSnapper allows you to sample /proc/PID/stack too to get a rough idea of in which kernel locations any of your threads are. You will have to scroll all the way to the right, until you see the highlighted functions:

Driver

Or, if you don’t like to read ultra-wide outputs, you can just use a sed trick to replace the -> with n in pSnapper stack prints. I usually use my terminal to highlight the normal output lines of pSnapper, so I can easily spot where in the output a stack trace ends and the next thread entry starts:

I showed the output of just 3 top pSnapper entries, but you see there’s a bunch of page-cache read-ahead happening! And worker threads are searching for free pages for this read-ahead. And to find enough free pages, they’ve had to resort to shrinking - throwing away existing pages using the pagecache LRU. And judging from the earlier CPU utilization and “runnable tasks” numbers, there are tends of threads, trying to do the same things at the same time.

Could the high CPU usage be explained by some sort of contention in the Linux kernel?

Let’s not guess - let’s measure!

I have 0x.tools enabled in my machines and among other things, run_xcpu.sh runs perf record at 1 Hz frequency. It’s always on and has unnoticeable overhead as it samples on-CPU thread stacks infrequently (I built 0x.tools for always-on low frequency profiling of production systems).

The CPU stack capture tool creates a separate output file for every minute, so I’ll pick the latest:

Here’s the output:

What’s the best solution for reducing spinlock contention? Use it less!

Since I don’t want to do less I/O, on the contrary, I want to push the system to do more of it - let’s just disable the readahead and see if it makes anything better. fio has a --fadvise_hint=random hint to tell the OS it expects to be doing random I/O. This didn’t seem to make a difference in my brief tests, so I just moved on and used the other option - I disabled the readahead at block device level:

Ok, readahead is enabled, let’s turn it off for this device:

And now let’s run the onessd.sh benchmark again, with the same command as earlier:

At first, it looked like a small success, without readahead we were doing 4000+ MiB/s instead of ~3200 MiB/s and so.

But…

… after running for ~30-35 seconds, the throughput dropped to 3000+ MiB/s again and kernel-mode CPU usage increased too. The perf CPU profile was following:

I highlighted a few items in the above screenshot with “talk bubbles”. If you compare the CPU time consumed by _raw_spinlock_irq function vs. their parents/grandparents handling cached reads or LRU operations, you’ll see that the majority of time is spent in spinlock spinning and not the actual LRU/cached read function code. One question to ask would be, why are there so many worker threads, trying to concurrently work on my 16 core/32 thread machine and ending up with a spin storm. I have only 3 fio processes accessing just one disk. With less concurrency, we’d get less spinlock contention on these pagecache LRU structures (hint: when bypassing pagecache using direct I/O, these threads won’t show up at all).

Even though the previously seen page_cache_readahead functions do not show up in kernel stack traces, we still see similar pagecache LRU replacement activity and spinlock contention. It makes sense as even without read-ahead, fio is still issuing IOs as fast as it can. My hypothesis that the readahead functions (that clearly showed up in the perf profile) had something to do with this problem, was wrong. In fact, I went back to my previous test (with read-ahead enabled) and indeed, at first I got 6800+ MiB/s scan rates - they only dropped to 3000-something after a while!

I had missed this detail at first. Even with buffered I/O mode, I got the full throughput out of the single disk (at a fairly high CPU overhead), but after a while the throughput dropped 2x, CPU usage jumped up 3-4x - and the page replacement related spinlock contention showed up in the stack. Incidentally, I have 256 GB of RAM in this machine that is not used for anything else and reading a disk for ~35 seconds at 6800 MiB gets close to 256 GB (of my RAM). And similarly, with readahead disabled, it took about 55-60 seconds at 4000 MiB/s (close to 256 GB of data read), until the pagecache LRU spinlock problem kicked in. Maybe a coincidence, but it was consistently reproducible - more research needed.

This is a good example of sudden spinlock contention showing up due to a behavior change somewhere higher in the code-path or state change in some underlying data structure, that in turn causes a behavior change. It is a feasible hypothesis that as long as there are enough unused pages in VM freelists or in the inactive area of the pagecache, kernel does not have to scan and discard pages from the LRU list, thus only allocating new pages require using spinlocks. But once you’ve exhausted the available memory (and inactive “cold” pages in the LRU), then every thread has to first search, discard pages from the LRU list, before adding them back as needed. This will suddenly cause a huge kernel CPU usage spike, even though “nothing has changed”.

There are various tracing options and /proc/vmstat metrics that would help us to drill down further, but I won’t do it here for keeping this article “short”. Remember, all this CPU usage and spinning problem happens only when using reads via OS pagecache (fio --direct=0) and we’ll go back to using direct I/O in next steps.

If you want to avoid disk I/O, use the built-in cache in your database engines and applications. They don’t have purpose-built caches, then use OS pagecache, if you expect it to be able to hold your “working set” of hot, frequently accessed data. But if you want to do efficient disk I/O, use direct I/O via whatever application configuration settings that use the O_DIRECT flag on file open.

IO configuration - IO scheduler

Here’s another example of many things that may reduce your performance when doing millions of IOPS (or even only 100k) - I/O scheduler. This feature trades CPU time for hopefully better contained I/O latency for spinning disks. Every I/O request submission and handling will have extra CPU overhead, requests get inserted into sorted queues, timestamps are checked, hashes are computed, etc. When doing thousands of IOPS, you might not even notice any CPU overhead and on “dumb” spinning disks the investment will likely pay off.

For NVME SSDs, the I/O scheduler is disabled by default - good! If you have uniform access latency regardless of data location (like SSDs provide) or a storage array/controller with large enough queue depth, so that I/Os can be rearranged & scheduled at the storage layer, you won’t need I/O scheduling at the host level anyway. And at high IOPS rates, the CPU overhead will be noticeable, with no I/O benefit.

Let’s test this anyway! The I/O scheduler is currently none for all my SSDs:

Let’s enable the mq-deadline (Multi-Queue deadline) scheduling option for my test disk:

Rerun the previous onessd.sh command with direct I/O enabled again and with 4kB I/O size, we get this:

The “run” column shows just 3 threads running on CPU most of the time, there’s no need for hundreds of kernel threads doing concurrent I/O for pagecache. pSnapper confirms that, too. Note that you don’t have to run pSnapper with sudo if you’re not reporting the “advanced” columns like syscall and kstack.

However, we only get ~525k IOPS out of these 3 processes running 100% on CPU, instead of 1150k IOPS seen earlier in my initial tests. So, with an I/O scheduler enabled, we get 2x less throughput with similar amount of CPU usage. In other, words - every I/O request seems to be using 2x more CPU now. This is not surprising, as the additional I/O scheduling magic needs to use CPU too.

As usual, you can run perf or look into a 0x.toolsalways-on CPU profiling output file from around the right time:

Partial output is below, I highlighted the interesting parts:

We are doing direct I/O, like the blkdev_direct_IO function in the stack shows, but in the bottom of the output, you see a lot of “timer” and “delay” functions show up under blk_mq_do_dispatch_sched. So, we will do extra CPU work for deciding whether an I/O request (that’s already submitted to the block device’s I/O queue in the OS) should be dispatched to the actual hardware device driver that’s handling this disk.

Ok, enough of this software stuff for now, as I mentioned I will write about performance of LVM, filesystems and databases in future posts, let’s move on.

Multi-disk test

We have been looking into just one disk’s performance so far! The hope is that now we have a better idea of theoretical max throughput of a single disk, when plenty of extra PCIe, inter-CPU-core and memory channel bandwidth was available. Now let’s put all 10 disks into use at the same time - and make sure that we have set them up right! I will show the command I used in the end of the post, let’s start with hardware setup & configuration.

PCIe 4.0 Quad-SSD adapter - Asus Hyper M.2 X16 Gen4 Card

The workstation had only 2 existing M.2 NVMe slots - and I deliberately stayed away from buying SATA SSDs as the SATA interface is completely different (and wayyy slower) than using direct PCIe-NVMe protocol. Lenovo sells quad M.2 SSD adapters that go into PCIe x16 slots, but the one they currently have supports only PCIe3.0 as far as I know.

I was happy to find that ASUS was already selling a PCIe4.0 capable adapter card that fits into a PCIe x16 slot and holds 4 x M.2 SSDs. All this for just $70. I had to buy two of these. 2 x 4 SSDs went inside these ASUS adapter cards and the remaining 2 SSDs went directly into the existing M.2 slots built in to the Lenovo machine.

Should you want to buy similar adapters, make sure you buy the “Gen 4” version from ASUS if you need PCIe 4.0 compatibility. Or when buying a different brand, you don’t want to get an adapter that has a built-in PCIe switch in it, but a card (like ASUS) that supports the motherboard’s PCIe lane bifurcation directly (see below). Otherwise, you may end up with more latency and more potential bottlenecks, if the switch doesn’t support full PCIe 4.0 speeds.

BIOS Settings

I had to make two changes in BIOS. First, I had to manually configure the PCIe Port Bifurcation from Auto to x4x4x4x4 (other options were x16 and x8x8), as the BIOS didn’t automatically realize that I had 4 separate PCIe x4 devices plugged into that physical x16 slot (no disks showed up at all). Perhaps the auto-detection didn’t work as I wasn’t using the Lenovo’s official Quad SSD adapter.

I also had to change the Link Speed from Auto to Gen 4 (16 GT/s), as otherwise BIOS defaulted the speed to PCIe 3.0 / Gen 3 and I got only half of the throughput out of the SSDs behind these Quad SSD adapters.

After enabling 16 GT/s from BIOS, I was able to get 6800 MiB/s read throughput from a single disk, yay! But…

PCIe Root Complex Bottleneck

I had physically installed the Quad SSD adapters like this (the adapter cards have a silver/aluminum casing):

However, when running a throughput test with all disks and 1 MB read sizes, I got only about 50 GiB/s aggregate throughput. Pretty good, but 10 SSDs should have given up to 68 GiB/s, judging from their specs and individual single-disk tests. There must have been some bottleneck between the disks and my CPU.

Nice_sys

Since multiple PCIe slots may be sharing a PCIe root complex and no component in the CPU-IO network topology has infinite bandwidth, I was wondering whether I had accidentally installed the SSD adapter cards under the same PCIe root complex. Modern CPUs can have separate sets of PCIe lanes connected to different “edges” of the CPU, similar to memory channels. So, if you place your PCIe cards suboptimally, you may end up saturating some PCIe lanes (going into the same CPU), while others have pretty low traffic.

Let’s see how many PCIe root complexes (PCIe lane sets connecting to the CPU) my machine has:

So, 4 PCIe root complexes, each having 32 lanes (I’m assuming equal distribution, the system has 128 lanes in total).

The lspci command shows the PCIe network topology (click on the triangle below to expand). I have removed some unused device entries from the list. Also, I made the root complex nodes bold and the SSD devices yellow.

View lspci output (initial)

Indeed, I had accidentally placed both quad SSD cards into two PCIe slots that were connected to the same PCIe root complex (and connection to the CPU). When setting things up, I didn’t bother diving deep into this workstation’s documentation and got unlucky. The two additional Samsung devices under 0000:20 are the SSDs that I plugged into the M.2 slots already built in to the workstation.

PCIe Root Complex Bottleneck (after)

From the lspci output above you can see that my NVIDIA GPU is under the PCIe address 0000:60 and there are no other I/O hungry cards there. Since I’m currently not using the GPU on that machine, I just swapped one of the Quad SSD adapters with the “mostly idle” video card. Now each of the quad SSD adapters can feed their data through separate PCIe root complexes.

View lspci output (after moving cards around)

So, now both of the heavy-hitting quad SSD cards have their own PCIe root complex to communicate through. The NVIDIA Quadro P620 video card shares the same PCIe root complex with one quad SSD adapter, but since I’m not using the GPU in this machine for anything, I don’t expect it to eat significant bandwidth.

Final results

So, here are the promised results! Let’s execute 10 fio commands concurrently (the script is listed below):

We get up to 11 million IOPS with 4k reads (while CPUs are 100% busy, mostly in kernel mode):

Since issuing so many block I/Os on my 16-core CPU keeps it 100% busy, I had to find other tricks to reduce the CPU usage of fio and I/O handling layer in general. This is why I’m running 10 separate fio commands (one for each disk) as when using one single fio instance accessing all 10 disks, I couldn’t get beyond some measly 8M IOPS while using 100% CPU. Possibly there’s some internal coordination overhead in fio, but I didn’t drill down into that, running separate fio instances was an acceptable workaround:

allmulti.sh:

At 4k I/O size, I also set fio --hipri option as this does not rely on interrupts to get notified about I/O completions, instead if polls for completion in a busy-loop on CPU. Since my CPUs were 100% busy under this load anyway, then not having to handle interrupts and go “straight to the source”, polling I/O states from NVMe queues via direct device-mapped memory reads. This kind of busy-loop polling on all CPU cores is probably not practical for real-life workloads, but for real-life workloads I wouldn’t be so cheap and would go buy a bigger CPU.

onessd.sh:

Download Nice_sys Driver

For final tests, I even disabled the frequent gettimeofday system calls that are used for I/O latency measurement. Less CPU usage = More IOPS when you’re bottlenecked by CPU capacity.

I removed the --fixedpri option in following tests, as with larger I/O sizes, the bottleneck is in waiting for the NVMe device’s DMA engine that in turn waits for the disk -> RAM memory copy to finish (over the PCIe network to memory channels to memory DIMMs). I’d rather let my application threads to go sleep and be waken up later by OS scheduler who receives an interrupt, so some CPU time would be available for application processing too!

In my earlier tests, 1M I/Os were split to 512k apparently (or perhaps fio didn’t request higher I/O sizes itself), I’ll run a 512k I/O size test:

We are now doing 138k IOPS with 512kB block size, this translates to 66 GiB/s throughput (71 GB/s when reporting the units correctly). And look at the CPU usage - my total CPU utilization is only around 7.5%!

Note that I used mpstat for CPU reporting in this case, I hit some sort of dstat bug that didn’t report CPU correctly in this case. I possibly happens due to this Ubuntu kernel not being configured to break out interrupt reporting separately (I will write about this in the future).

The full iostat output is here too (it’s pretty wide).

This is from the 4k maximize IOPS test:

Probably all disks above would have been able to deliver over 1.1M IOPS each, but my CPUs just couldn’t issue & handle I/O requests fast enough!

This is from a test where I used 64kB/s I/O sizes (8 x lower IO size compared to 4k, thus 8 x less IOPS). This test was able to saturate the 66 GiB/s max disk throughput, yet do 1M IOPS in aggregate and have 75% of CPU time free for application work!

Summary

Modern machines are ridiculously fast! You just have to build, configure and use them right. Modern SSDs are super-fast, the PCIe 4.0/CPU/RAM network has very high throughput if you have enough PCIe lanes, memory channels (with enough memory banks populated with DIMMs) and when there’s no CPU- or front-side-bus-imposed data transfer bottleneck (in some IO hub in your old-school architecture).

Download Nice_sys Driver Download

As I mentioned in the beginning, having 128 x PCIe 4.0 lanes is what makes this ridiculous I/O throughput (for a desktop workstation!) possible and the 8 memory channels should make a difference when I actually start using the blocks of data just read from disk.

  • Computers are networks.
  • PCIe, memory channel networks are just one part of the story.
  • Modern CPUs have networks within them, and their layout & capabilities differ by vendor and generation!
  • No network has infinite bandwidth and zero latency.
  • Disk I/O used to be the slow part of computing …
  • Today the bottleneck is CPU and feeding that CPU via multiple levels of networks inside your computer.

Thanks for reading all the way to the end! There will be more posts where I put the same machine into use using various database workloads, including heavy writes, different filesystems - and maybe even throw a GPU in the mix!

More stuff!

  • You can subscribe to email updates or follow me here.
  • You can sign up for a “Hacking Session” webinar on this same topic (Thursday, 4. February 12pm-13:30pm ET)
  • HackerNews discussion

Further reading

Download Nice_sys Driver Update

  • High Performance Linux Block I/O video - it’s a follow up “hacking session” video to this article
  • High Linux System Load - drilling down into kernel thread activity
  • 0x.tools - Always-on Profiling for Production Systems