art with code

2017-10-09

Ethereum algorithmically

Ethereum is a cryptocurrency with a mining process designed to stress random access memory bandwidth. The basic idea in mining Ethereum is to find a 64-bit number that hashes with a given seed to a 256-bit number that's smaller than the target number.

Think of it like a cryptographic lottery. You pick a number, hash it, and compare the hash to the target. If you get a hash that's below the target, you win 5 ETH.

What makes this difficult is the hashing function. Ethereum uses a hashing function that first expands the 32-byte seed into a 16 MB intermediate data structure using a memory-hard hashing function (if you have less than X bytes of RAM, the hash takes exponentially longer to compute), then expands the 16 MB intermediate data structure into a multi-gigabyte main data structure. Hashing a number generates a pseudo-random walk through the main data structure, where you do 64 rounds of "read 128 bytes from location X and update the hash and location X based on the read bytes."

While the computation part of the Ethereum hashing function isn't super cheap, it pales in comparison to the time spent doing random memory accesses. Here's the most expensive line in the Ethereum compute kernel: addToMix = mainDataStructure[X]. If you turn X into a constant, the hash function goes ten times faster.
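
Sketched in code, the walk looks something like this (made-up names and a stand-in mixing function, not the real Ethash internals; only the memory access pattern matters):

    // mainDataStructure is a Uint32Array view of the multi-gigabyte
    // structure, read in 128-byte (32-uint) pages. mix is a 32-element
    // Uint32Array of hash state.
    function hashWalk(mix, mainDataStructure) {
      const pageCount = Math.floor(mainDataStructure.length / 32);
      let X = mix[0] % pageCount;
      for (let round = 0; round < 64; round++) {
        for (let i = 0; i < 32; i++) {
          // The most expensive line: a random read from the main data structure.
          const addToMix = mainDataStructure[X * 32 + i];
          mix[i] = (Math.imul(mix[i], 0x01000193) ^ addToMix) >>> 0;
        }
        X = mix[round % 32] % pageCount; // next location depends on the hash
      }
      return mix;
    }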

Indeed, you can get a pretty accurate estimate for the mining speed of a device by taking its memory bandwidth and dividing it by 64 x 128 bytes = 8192 B.
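
For example, take a hypothetical GPU with 256 GB/s of memory bandwidth:

    // One hash costs 64 rounds x 128 bytes = 8192 bytes of random reads,
    // so hashrate is roughly memory bandwidth / 8192.
    const bandwidth = 256e9;           // 256 GB/s
    const hashrate = bandwidth / 8192; // 31.25e6, ~31 MH/s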

Zo. What is one to do.

Maximize memory bandwidth. Equip every 4 kB block of RAM with a small ALU that can receive an execution state, do a bit of integer math, and pass the execution state to another compute unit. In 4 GB of RAM, you'd have a million little compute units. If it takes 100 ns to send 128 bytes + execution state from one compute unit to another, you'd get 1.28 PB/s aggregate memory bandwidth. Yep, that's over a million gigabytes per second.

With a million GB/s, you could mine ETH at 150 GH/s. At the moment, 25 MH/s of compute power nets you about $1 a day. 150 GH/s would be $6000 per day. If you can fab ten thousand of them, you'd make sixty million a day. Woooooinflation.
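
Spelling out the back-of-the-envelope arithmetic:

    const units = 4e9 / 4096;          // one ALU per 4 kB of 4 GB RAM: ~1 million units
    const perUnit = 128 / 100e-9;      // 128 bytes per 100 ns: 1.28 GB/s per unit
    const aggregate = units * perUnit; // ~1.25e15 B/s, ~1.28 PB/s
    const hashrate = aggregate / 8192; // ~153 GH/s
    const usdPerDay = hashrate / 25e6; // at $1/day per 25 MH/s: ~$6100 per day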


2017-10-08

Fast marching cubes in JavaScript

Marching cubes! Where do they march? What is their tune? The name of their leader, a mystery if there ever was one.

Marching cubes works like this:

  1. You have a 3D array of bits and want to create a mesh out of it.
  2. Read a 2x2x2 cube from the array.
  3. Generate a mesh based on the values of the cube.
  4. Repeat for every 2x2x2 cube in the array and concatenate the meshes.

The individual cube meshes work like Lego blocks: they click together to form a seamless mesh.
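
Here's a minimal sketch of one march step (cubeMeshes is a hypothetical name for the usual 256-entry lookup table):

    // data is a flat Uint8Array of bits with dimensions w x h x d.
    function cubeMeshAt(data, w, h, x, y, z) {
      // Read the 2x2x2 cube and pack its corner bits into an 8-bit index.
      let cubeIndex = 0;
      for (let i = 0; i < 8; i++) {
        const dx = i & 1, dy = (i >> 1) & 1, dz = i >> 2;
        cubeIndex |= data[(x + dx) + (y + dy) * w + (z + dz) * w * h] << i;
      }
      // The mesh for this corner configuration, to be translated
      // by (x, y, z) and concatenated with its neighbors.
      return cubeMeshes[cubeIndex];
    }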

How to do it kinda fast:

  1. Create cached meshes and normals for each different 2x2x2 bit cube (there are 2^8 of them). You get an array like cubeMeshes[eightBitCubeIndex].
  2. Create a binary array based on the original data. Unroll loops to process in chunks of 8, do it SIMD-like, pass over the original data and spit out ones and zeroes into a Uint8Array. (You could put 8 bits per byte, but it's a hassle.) 
  3. Create a cube index Uint8Array that's going to be filled with the eight-bit cube indexes of each 2x2x2 cube in the data.
  4. Fill the cube index array by marching a 2x2x2 cube over the binary array and converting the read cube values into eight-bit cube indexes. Increment total mesh vertex count by cubeMeshLengths[eightBitCubeIndex].
  5. Allocate Float32Arrays for vertices and normals based on the total mesh vertex count.
  6. Iterate over the cube index array. Write the mesh corresponding to each cube index to the vertex array, offsetting each vertex by the xyz-coordinates of the cube. Write the normals corresponding to the cube index to the normal array.
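
Condensed into code, steps 2 to 6 look something like this. It's a sketch, not the actual fastIsosurface.js source: data, w, h, d and threshold are the inputs, and cubeMeshes, cubeNormals and cubeMeshLengths are the step 1 tables, with cubeMeshLengths counting vertices per cube configuration.

    // Step 2: threshold the input into ones and zeroes
    // (the real thing compares against a range and unrolls in chunks of 8).
    const bits = new Uint8Array(data.length);
    for (let i = 0; i < data.length; i++) {
      bits[i] = data[i] > threshold ? 1 : 0;
    }

    // Steps 3-4: march a 2x2x2 window over the binary array, build the
    // cube index array and count the output vertices as you go.
    const cubeIndexes = new Uint8Array((w - 1) * (h - 1) * (d - 1));
    let vertexCount = 0;
    for (let z = 0, j = 0; z < d - 1; z++) {
      for (let y = 0; y < h - 1; y++) {
        for (let x = 0; x < w - 1; x++, j++) {
          let ci = 0;
          for (let i = 0; i < 8; i++) {
            ci |= bits[(x + (i & 1)) + (y + ((i >> 1) & 1)) * w + (z + (i >> 2)) * w * h] << i;
          }
          cubeIndexes[j] = ci;
          vertexCount += cubeMeshLengths[ci];
        }
      }
    }

    // Step 5: allocate the output arrays once, now that the size is known.
    const vertices = new Float32Array(vertexCount * 3);
    const normals = new Float32Array(vertexCount * 3);

    // Step 6: copy in the cached meshes, offset by cube position.
    for (let z = 0, j = 0, out = 0; z < d - 1; z++) {
      for (let y = 0; y < h - 1; y++) {
        for (let x = 0; x < w - 1; x++, j++) {
          const mesh = cubeMeshes[cubeIndexes[j]]; // Float32Array of xyz triples
          const norm = cubeNormals[cubeIndexes[j]];
          for (let i = 0; i < mesh.length; i += 3, out += 3) {
            vertices[out] = mesh[i] + x;
            vertices[out + 1] = mesh[i + 1] + y;
            vertices[out + 2] = mesh[i + 2] + z;
            normals[out] = norm[i];
            normals[out + 1] = norm[i + 1];
            normals[out + 2] = norm[i + 2];
          }
        }
      }
    }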

Source: fastIsosurface.js - demo

This runs in ~150ms on an Intel i7 7700HQ for a 7 million element data array (256x256x109).

Future directions

As you may notice from the source, it's SIMD-friendly, in case you can use SIMD. The algorithm parallelizes easily too.

Web Workers with transferable objects? Transform feedback in WebGL 2 + a reducer kernel to remove empty triangles? Do it in a fragment shader to a float render target? Magic?

Handwaving

The test dataset contains 7 million 16-bit uints, which takes about 14 megabytes of RAM. This means that it won't fit in the 7700HQ's 6 MB of shared L3 cache, much less the 4x256 kB of L2 or the 4x32 kB of L1.

Compressed into a bit array, the dataset would fit in 7 megabits, or 875 kB. Processing that with four cores (8 threads) would keep the read operations in the L2 cache. Chunking the processing into 30 kB cubes would keep the reads mostly in the L1 cache.

The output array for the marching cubes step consists of a byte per cube. The original input array has two bytes per element. The bit array has one byte (or one bit, if you pack it) per element. The output vertex arrays have up to 90 floats, or 360 bytes per cube (but they're very sparse; the average is 1-2 bytes per cube). There's roughly one cube per input array element.

Taking the sum of the above, we get about 1 + 2 + 1 + 1 = 5 bytes per cube. We could process 6000 cubes in a 32kB L1 cache. That might come to 64x10x10 input elements that output 63x9x9 cubes for total memory use of 29406 bytes and 5103 cubes.
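
Checking that arithmetic:

    const inputElems = 64 * 10 * 10;    // 6400 input elements
    const cubes = 63 * 9 * 9;           // 5103 cubes
    const bytes = inputElems * (2 + 1)  // uint16 input + binary byte
                + cubes * (1 + 1);      // cube index byte + ~1 byte of output
    // bytes === 29406, comfortably inside a 32 kB L1 cache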

How fast would that be? Let's see. You need to read in the input data. That's going to come from RAM at 40 GB/s => something like 0.05 ns per cube. You can crunch it into the binary array as you go: two comparisons, a logical AND, and a store to L1 would work out to 2 ns per input element at 3GHz. For per-cube time, divide by 8 as each element is used by 8 cubes: 0.25ns per cube.

Then read through it with a 2x2x2 window for the cube generation, do a couple multiply-adds. Updating the window requires avg 4 reads per step plus processing to generate the cube indexes, say 4x7 cycles in total.

Then write the vertices to the vertex array. This might take 6 cycles for the array fetch and write.

Add some fudge, 3 GHz clock rate. Each cube takes 4x7 + 6 = 34 cycles. Estimated runtime 12ns per cube (+ 0.25ns for input data processing). Need 10 million cubes for the mesh: 120 ms. Do it in parallel in four L1 caches => 30 ms.
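
Or, as arithmetic:

    const cyclesPerCube = 4 * 7 + 6;     // window update + vertex write
    const nsPerCube = cyclesPerCube / 3; // ~11.3 ns at 3 GHz, call it 12
    const totalMs = 10e6 * 12 / 1e6;     // 10 million cubes => 120 ms
    const fourCoreMs = totalMs / 4;      // => 30 ms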

But, uh, it already runs in 150 ms for some meshes. And crunching the input data takes 20 ms of that. In JavaScript. What.
