I noticed that growslice() shows up in the CPU profile.
Avoiding slice append for the private key copy gives a 0.6% speedup:
gocryptfs/internal/speed$ benchstat old new
name             old time/op   new time/op   delta
StupidXchacha-4  5.68µs ± 0%   5.65µs ± 0%   -0.63%  (p=0.008 n=5+5)

name             old speed     new speed     delta
StupidXchacha-4  721MB/s ± 0%  725MB/s ± 0%  +0.63%  (p=0.008 n=5+5)
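
For illustration, a minimal sketch of the two ways to copy a key. The function
names copyKeyAppend and copyKeyPrealloc are hypothetical and are not taken from
the gocryptfs source; the sketch only assumes the change replaced an append onto
an empty slice with an exact-size allocation plus copy():

```go
package main

import "fmt"

// copyKeyAppend (hypothetical name) copies key by appending to an empty
// slice. The empty slice has zero capacity, so the runtime has to grow it,
// which is where growslice() comes from.
func copyKeyAppend(key []byte) []byte {
	return append([]byte{}, key...)
}

// copyKeyPrealloc (hypothetical name) allocates the exact length up front
// and then copies, so no slice growth is involved.
func copyKeyPrealloc(key []byte) []byte {
	dst := make([]byte, len(key))
	copy(dst, key)
	return dst
}

func main() {
	key := []byte{1, 2, 3, 4}
	a := copyKeyAppend(key)
	b := copyKeyPrealloc(key)
	key[0] = 99             // mutate the caller's key
	fmt.Println(a[0], b[0]) // both copies are private: prints "1 1"
}
```

Both variants return a private copy that is unaffected by later changes to the
caller's slice; they differ only in how the backing array is allocated.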