art with code

2018-04-17

IO limits

It's all about latency, man. Latency, latency, latency. Latency drives your max IOPS. The other aspects are how big your IOs are and how many you can do in parallel. But, dude, it's mostly about latency. That's the thing, the big kahuna, the ultimate limit.

Suppose you've got a workload. Just chasing some pointers. This is a horrible workload. It just chases tiny 8-byte pointers around an endless expanse of memory, like some sort of demented camel doing a random walk in the Empty Quarter.

This camel, this workload, it's all about latency. How fast can you go from one pointer to the next? That gives you your IOPS. If it's from a super-fast spinning disk with a 10 ms latency, you'll get maybe like 100 IOPS. From an NVMe flash SSD with 0.1 ms latency, 10,000 IOPS. Optane's got 6-10 us latency, which gets you 100-170k IOPS. If it's, I don't know, a camel. Yeah. Man. How many IOPS can a camel do? A camel caravan can travel 40 kilometers per day. The average distance between two points in Rub' al Khali? Well, it's like a 500x1000 km rectangle, right? About 400 kilometers[1] then. So on average it'd take the camel 10 days to do an IO. That comes down to, yeah, 1157 nanoIOPS.
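Here's that math as a quick JavaScript sketch, in the spirit of footnote [1]. The latencies are the round numbers from above; IOPS is just the reciprocal.

// IOPS is just 1 / latency (latency in seconds).
const iops = (latency) => 1 / latency;

iops(10e-3);          // spinning disk, 10 ms: 100 IOPS
iops(0.1e-3);         // NVMe flash, 0.1 ms: 10,000 IOPS
iops(6e-6);           // Optane, 6 us: ~167k IOPS
iops(10 * 24 * 3600); // camel, 10 days per trip: ~1157 nanoIOPS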

Camels aren't the best for random access workloads.

There's also the question of the IO size. If you can only read one byte at a time, you aren't going to get huge throughput no matter how fast your access times. Imagine a light speed interconnect with a length of 1.5 millimeters. That's about a 10 picosecond roundtrip. One bit at a time, you could do 12.5 GB per second. So, while that's super fast, it's still an understandable number. And that's the best-case scenario raw physical limit.

Now, imagine our camel. Trudging along in the sand, carrying saddle bags with decorative stitchwork, tassels swinging from side to side. Inside the pouches of the saddle bags are 250 kilograms of MicroSD cards at 250 mg each, tiny brightly painted chips protected from the elements in anti-static bags. Each card can store 256 GB of data and the camel is carrying a million of them. The camel's IO size is 256 petabytes. At 1157 nanoIOPS, its bandwidth is 296 GB/s. The camel has a higher bandwidth than our light speed interconnect. It's an FTL camel.

Let's add IO parallelism to the mix. Imagine a caravan of twenty camels, each camel carrying 256 petabytes of data. An individual camel has a bandwidth of 296 GB/s, so if you multiply that by 20, you get the aggregate caravan bandwidth: 5.9 TB/s. These camels are a rocking interconnect in a high-latency, high-bandwidth world.
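If you want to check the caravan math, here's a back-of-the-envelope sketch with the sizes and speeds assumed above:

// Bandwidth is IOPS times IO size (times the number of camels).
const lightLink = (1 / 10e-12) / 8;            // 1 bit per 10 ps roundtrip = 12.5 GB/s
const camel = (1 / (10 * 24 * 3600)) * 256e15; // 1157 nanoIOPS x 256 PB = ~296 GB/s
const caravan = 20 * camel;                    // ~5.9 TB/s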

Back to chasing 8-byte pointers. All we want to do is find one tiny pointer, read it, and go to the next one. Now it doesn't really matter how many camels you have or how much each can carry; all that matters is how fast they can go from place to place. In this kind of scenario, the light speed interconnect would still be doing 12.5 GB/s (heck, it'd be doing 12.5 GB/s at any IO size larger than a bit), but our proud caravan of camels would be reduced to 0.0000093 bytes per second. Yes, that's bytes. 9.3 microbytes per second.
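For the morbidly curious, here's a minimal in-memory sketch of that pointer chase: an array where each slot holds the index of the next slot to visit. Every hop depends on the previous one, so bandwidth and parallelism don't help; only the hop latency matters.

// Build one big random cycle through n slots, then chase it.
const n = 1 << 24; // 16M slots, 64 MB, big enough to fall out of cache
const order = Array.from({ length: n }, (_, i) => i);
for (let i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
  const j = Math.floor(Math.random() * (i + 1));
  [order[i], order[j]] = [order[j], order[i]];
}
const next = new Uint32Array(n);
for (let i = 0; i < n; i++) next[order[i]] = order[(i + 1) % n];

let p = 0;
const t0 = Date.now();
for (let hops = 0; hops < n; hops++) p = next[p];
const secs = (Date.now() - t0) / 1000;
console.log(Math.round(n / secs) + ' IOPS, ended at slot ' + p);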

If you wanted to speed up the camel network, you could spread the camels evenly over the desert. Each camel then only has to cover its own patch of sand, so the average distance to the data shrinks with the square root of the number of camels serving the requests. This works like a Redundant Array of Independent Camels, or RAIC for short. We handwave away the question of how the camels synchronize with each other.
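And in the spirit of footnote [1], a quick Monte Carlo sanity check of that square-root claim. This sketch drops the camels at random rather than spreading them out evenly, but the scaling comes out about the same:

// Mean distance from a random request to the nearest of n camels
// in the 500x1000 km rectangle.
function nearestCamelDistance(n, samples) {
  const camels = [];
  for (let i = 0; i < n; i++) camels.push([500 * Math.random(), 1000 * Math.random()]);
  let total = 0;
  for (let s = 0; s < samples; s++) {
    const px = 500 * Math.random(), py = 1000 * Math.random();
    let best = Infinity;
    for (let c = 0; c < n; c++) {
      const d = Math.hypot(px - camels[c][0], py - camels[c][1]);
      if (d < best) best = d;
    }
    total += best;
  }
  return total / samples;
}
nearestCamelDistance(1, 100000);  // one camel: ~400 km
nearestCamelDistance(4, 100000);  // four camels: roughly half that
nearestCamelDistance(16, 100000); // sixteen camels: roughly a quarter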

Bringing all this back to the mundane world of disks and chips, the throughput of a chip device goes through two phases: at QD1, throughput grows linearly with the IO size, since you're doing a fixed number of IOPS and each IO carries more data, up to the device's maximum IO block size. Past that point, more throughput has to come from parallelism: keep more requests in flight, up to the device's maximum, until you run into the device throughput limit or the bus throughput limit.

You can roughly calculate the maximum throughput of a device by multiplying its IOPS by its IO block size and its parallelism. E.g. if a flash SSD can do ten thousand 8k IOPS and 16 parallel requests, its throughput would be 1.28 GB/s. If you keep the controller and the block size and replace the flash chips with Optane that can do 10x as many QD1 IOPS, you could reach 12.8 GB/s throughput. PCIe x16 Optane cards anyone?
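Written out as code, with the flash and Optane numbers from above plugged in:

// Rough maximum throughput = IOPS x IO block size x parallelism.
function throughput(iops, blockSize, parallelism) {
  return iops * blockSize * parallelism;
}

throughput(10000, 8000, 16);  // flash SSD: 1.28 GB/s
throughput(100000, 8000, 16); // same controller with Optane: 12.8 GB/s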

To take it a step further, DRAM runs at 50 ns latency, which would give you 20 million IOPS, or 200x that of Optane. So why don't we see RAM throughput in the 2.5 TB/s region? First, DDR block size is 64 bits (or 8 bytes). Second, CPUs only have two to four memory channels. Taking those numbers at face value, we should only be seeing 320 MB/s to 640 MB/s memory bandwidth.

"But that doesn't make sense", I hear you say, "my CPU can do 90 GB/s reads from RAM!" Glad you asked! After the initial access latency, DRAM actually operates in a streaming mode that ups the block size eight-fold to 64 bytes and uses the raw 400 MHz bus IOPS [2]. Plugging that number into our equation, we get a four channel setup running at 102.4 GB/s.

To go higher than that, you have to boost that bus. E.g. HBM uses a 1024-bit bus, which gets you up to 400 GB/s over a single channel. With dual memory channels, you're nearly at 1 TB/s. Getting to camel caravan territory. You'll still be screwed on pointer-chasing workloads though. For those, all you want is max MHz.

[1] Mean distance between two uniform random points in a 500x1000 km rectangle, by Monte Carlo (it comes out to about 400 km):
var x = 0, samples = 100000;
for (var i = 0; i < samples; i++) {
  var dx = 500 * (Math.random() - Math.random()),
      dy = 1000 * (Math.random() - Math.random());
  x += Math.sqrt(dx * dx + dy * dy);
}
x /= samples;

[2] Please tell me how it actually works; this is based on an incomplete understanding of Wikipedia's incomplete explanation. As in, what kind of workload can you run from DRAM at burst rate?
