The benchmark that supported the decision for 512-byte
prefetching previously lived outside the repo.
Let's add it where it belongs so it cannot get lost.
Spawn a worker goroutine that reads the next 512-byte block
while the current one is being drained.
This should help reduce waiting times when /dev/urandom is very
slow (like on Linux 3.16 kernels).
On my machine, reading 512-byte blocks from /dev/urandom
(same via getentropy syscall) is a lot faster in terms of
throughput:
Blocksize Throughput
16 28.18 MB/s
512 83.75 MB/s
For a single-threaded streaming write, this drops the CPU usage of
nonceGenerator.Get to almost 1/3:
flat flat% sum% cum cum%
Before 0 0% 95.08% 0.35s 2.92% github.com/rfjakob/gocryptfs/internal/cryptocore.(*nonceGenerator).Get
After 0.01s 0.092% 92.34% 0.13s 1.20% github.com/rfjakob/gocryptfs/internal/cryptocore.(*nonceGenerator).Get
This change makes the nonce reading single-threaded, which may
hurt massively-parallel writes.