TODO: Error diffusion experiments

Platforms: all

Coding time: M
Experimentation time: XL
Skill required: XL

Prerequisite reading:
  doc/less-than-8-bit.txt


Overview
========

In internal/pack.h, the Requantize function takes care of requantizing
input 8 bit values to less than 8 bit. This is currently done either by
rounding-to-nearest, or by probabilistic rounding.

People have suggested trying error diffusion instead.
https://en.wikipedia.org/wiki/Error_diffusion
This technique originally from graphics might be adaptable to GEMM; however,
that is far from trivial.

Still, it may be worth experimenting with it, as the reward of higher accuracy
could be very worthwhile especially if it allows to explore even smaller
bit-depths.


Why getting error diffusion to work is nontrivial
=================================================

In graphics, there is only one array to
apply error diffusion to, and the criteria are mostly aesthetic. Here in GEMM,
there are two arrays involved, allowing for unwanted interaction between the
error diffusion terms added on either side separately; and we have stringent
accuracy criteria.

Here is a toy example showing how naive approaches to error diffusion may
suffer from unwanted interactions between the LHS and RHS separate error
diffusion terms:

Say that we're working on 1-dimensional data (as opposed to 2-D matrices) to
simplify the discussion.

Say that our input values are real numbers in [0, 1] and that we're quantizing
them to either 0 or 1.

Say that the left-hand-side is filled with the constant value 0.9 and that our
error-diffusion filter results in the following sequence of quantized values:
    1 (repeated 9 times), 0, ... (repeat).

Say that the left-hand-side is filled with the constant value 0.1 and that our
error-diffusion filter results in the following sequence of quantized values:
    0 (repeated 9 times), 1, ... (repeat).

So if we compute the dot product (which is what we really do in a GEMM) of
these quantized vectors, we're computing
    1*0 + ... (repeated 9 times) + 0*1 + ... (repeat)

So we get exactly 0! This shows how a naive approach to error diffusion may
suffer from bias issues similar to round-to-nearest.


Some avenues to explore to make error diffusion work
====================================================

1. Maybe some fixed error diffusion kernels just happen to avoid that issue?

2. Maybe it's just a matter of doing error diffusion for a different vector
error metric, e.g. l^2 instead of l^1?

3. Maybe some randomization (adding some random term to the error term being
diffused) would be acceptable? It seems like it would allow to avoid the
interference problem discussed above.


Performance considerations
==========================

Error diffusion is going to be relatively expensive compared to the current
requantization methods. It may be acceptable for large enough GEMM depth,
since it only needs to be applied once for the n^2 input matrix entries, thus
becoming negligible compared to the n^3 arithmetic cost of GEMM for large
enough n.

Alternatively, we may consider doing requantization of some matrices once and
for all, but that would likely be the case only for one of LHS or RHS,
otherwise one might as well precompute the whole GEMM.
