art with code

2012-01-21

Fast code

I was thinking of the characteristics of high-performance language runtimes (read: execution times close to optimal for hardware) and came up with this list:
  • Flat data structures (err, like an array of structs where struct N is in memory right before struct N+1; see the array-of-structs sketch after this list)
    • streaming memory reads are prefetcher-friendly, spend less time chasing pointers
  • Tightly-packed data
    • memory fetches happen in cache-line-sized chunks, tightly-packed data gives you more payload per memory fetch
    • fit more payload into cache, faster subsequent memory accesses
  • Reusable memory
    • keep more of the working set of data in cache
  • Unboxed values
    • spend less time chasing pointers
    • generate tight code for data manipulation because the data type is known (float/double/int/short/byte/vector)
  • Vector instructions
    • more bytes manipulated per instruction, data moves faster through an execution unit
  • Parallel execution
    • more bytes manipulated per clock cycle, data moves faster through the processor
  • Keep data in registers when possible
    • less time spent waiting for caches
  • Keep data in cache when possible
    • less time spent waiting for memory
    • instead of going over the full data set several times end-to-end, split it into cache-sized chunks and process each one fully before moving on to the next (see the cache-blocking sketch after this list)
  • Minimize amount of unnecessary data movement between processing units
    • keep data close to the processor until you're done with it; even more important with GPUs
  • Flat code layout
    • fewer jumps per byte processed
  • Tight code
    • keep more of the program in cache
  • Interleaved I/O
    • work on already loaded data while loading in new data (see the read-ahead sketch after this list)
    • minimum amount of time spent waiting for I/O
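As a rough sketch of the first two points (all names made up for illustration), here's the difference in C between a flat array of structs and an array of pointers to heap-allocated structs: the flat version streams through memory linearly and packs four elements into each 64-byte cache line, while the boxed version does a dependent load per element.

    #include <stdlib.h>

    typedef struct { float x, y, z, w; } Particle;  /* 16 bytes, no padding */

    /* Flat, tightly-packed layout: element N+1 sits right after element N,
       so the prefetcher can stream the whole array and every 64-byte cache
       line fetched carries four useful elements. */
    float sum_flat(const Particle *ps, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += ps[i].x;
        return sum;
    }

    /* Boxed layout: each element hides behind a pointer, so every iteration
       chases a pointer to a potentially random heap address. */
    float sum_boxed(Particle *const *ps, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += ps[i]->x;
        return sum;
    }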
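The cache-blocking bullet, sketched in C (the chunk size and the two pass bodies are placeholders): instead of two full end-to-end passes over a big array, do both passes over one cache-sized chunk at a time, so the second pass finds its data still in cache.

    #include <stddef.h>

    #define CHUNK 8192  /* elements per chunk; pick it to fit in L1/L2 */

    void two_passes_blocked(float *data, size_t n) {
        for (size_t start = 0; start < n; start += CHUNK) {
            size_t end = start + CHUNK < n ? start + CHUNK : n;

            /* pass 1 over this chunk (placeholder transform) */
            for (size_t i = start; i < end; i++)
                data[i] *= 2.0f;

            /* pass 2 over the same chunk while it's still cache-hot */
            for (size_t i = start; i < end; i++)
                data[i] += 1.0f;
        }
    }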
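And the interleaved I/O bullet as a POSIX read-ahead sketch. The chunk size, the byte-sum standing in for real work, and leaning on posix_fadvise for read-ahead are my assumptions; the advice is only a hint, and a real pipeline might use a reader thread or async I/O instead.

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)  /* 1 MiB per read; made-up size */

    /* Stream a file chunk by chunk, hinting the kernel to start fetching
       the next chunk while we work on the current one. */
    unsigned long stream_file(int fd) {
        char *buf = malloc(CHUNK);
        unsigned long sum = 0;          /* stand-in for real processing */
        off_t off = 0;
        ssize_t got;
        while ((got = pread(fd, buf, CHUNK, off)) > 0) {
            /* ask for the next chunk to be read in the background */
            posix_fadvise(fd, off + got, CHUNK, POSIX_FADV_WILLNEED);
            for (ssize_t i = 0; i < got; i++)
                sum += (unsigned char)buf[i];
            off += got;
        }
        free(buf);
        return sum;
    }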
You might notice that the major theme is optimizing memory use. I started thinking of program execution as a way to read in the input data set and write out the output data set. The sizes of the input and output data give you a nice lower bound on execution time: divide the data set size by memory bandwidth (or I/O bandwidth if you're working on big things). The flow of the program then becomes pushing this river of data through the CPU.
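To put made-up numbers on that: a program that reads 1 GB of input and writes 1 GB of output on a machine with 20 GB/s of memory bandwidth can't finish much faster than 2 GB / 20 GB/s = 0.1 s, no matter how little it computes per byte. How close you get to that bound is a decent measure of how memory-friendly the program is.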

Suck in the data in cache-line-sized chunks, process the entire cache line before moving to the next, and preload the next cache line while processing the current one. Use vector instructions to manipulate several bytes of data at the same time, and use parallelism to manipulate several streams of data at the same time. Make your processing kernel fit into the L1 instruction cache.
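That paragraph compresses into a kernel shaped like this (a minimal C sketch; the function and the reliance on compiler auto-vectorization are my illustration, not something from the post): one tight, branch-free loop over contiguous data that a vectorizing compiler can turn into SIMD loads and stores, with the prefetcher pulling in the next cache lines as it goes. Splitting the array across threads then gives you the several-streams-at-once part.

    #include <stddef.h>

    /* Scale a contiguous float array in place. Linear access plus a
       branch-free body lets the compiler emit SIMD instructions that
       handle 4-16 floats per iteration while the hardware prefetcher
       streams the next cache lines in. */
    void scale(float *restrict data, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            data[i] *= k;
    }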

Gnnnn ok, back to writing JavaScript.

