I was going to search on the internet about it, but then I realized you are the author (and there is nothing online I think). I imagine that the activations are left in FP16 and the weights are converted in FP16 during inference, right?
Yes, computes are carried out in FP16 (so there is no compute efficiency gains, might be latency reductions due to memory-bandwidth saving). These savings are not realized yet because no custom kernels introduced yet.
Btw very cool