Add one more bullet point for profiling/optimization

more performant than the current implementation. (Not so important for
current performance, but this may become more relevant once we switch to
double-precision for block samples.)
* Possibility to vectorize DSP algos using SIMD. Cruder experiments are also
worth trying: hand-unroll one or two classes for N=64 (i.e., the most common
block size) and measure the performance impact (if any).
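
A minimal sketch of what such an unrolling experiment might look like. The
gain kernel and the names `apply_gain`, `apply_gain_unrolled64` and
`average_ns` are hypothetical and not taken from the code base. The idea is to
compare a generic run-time-sized loop against a variant hand-unrolled by four
for the fixed block size 64.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Assumption: 64 is the block size of interest (the most common one).
constexpr std::size_t kBlockSize = 64;

// Hypothetical DSP kernel: scale a block of samples by a constant gain.
// Generic version, block size only known at run time.
void apply_gain(const float* in, float* out, std::size_t n, float gain) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}

// Same kernel, hand-unrolled by a factor of four for exactly kBlockSize
// samples. 64 is divisible by 4, so no remainder loop is needed.
void apply_gain_unrolled64(const float* in, float* out, float gain) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < kBlockSize; i += 4) {
        out[i]     = in[i]     * gain;
        out[i + 1] = in[i + 1] * gain;
        out[i + 2] = in[i + 2] * gain;
        out[i + 3] = in[i + 3] * gain;
    }
}

// Calls f() `iterations` times and returns the average wall-clock time
// per call in nanoseconds.
template <typename F>
double average_ns(F&& f, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / iterations;
}
```

Both variants would be timed with `average_ns` over many iterations and
compared. An optimizing compiler may already unroll or auto-vectorize the
generic loop, or drop work whose result is never read, so any measured
difference should be checked against the generated assembly.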
