5. Kimi Attention Residuals (AttnRes) + Block - nuked my initial results: horrifically slow when implemented straight from the pseudocode (~6 s/it; dropping it got me back to ~0.8-1.4 it/s). Probably needs better kernels (or a better coder than Opus, Gemini, GPT, and me).
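For context, here is a rough sketch of what an attention-residual path looks like when written naively. This is NOT the actual Kimi/AttnRes formulation (not reproduced here) - it's a generic interpretation where each layer's attention output is blended with a saved earlier attention output via a mixing weight `lam` (both the blending scheme and `lam` are my assumptions). The point it illustrates is why a pseudocode-level implementation is slow: you keep an extra full attention tensor alive and pay extra elementwise reads/writes every layer, which eager-mode code (no fused kernels) punishes heavily.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head, unmasked attention; masking/heads omitted for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def forward_with_attn_res(x, layers, lam=0.5):
    """Hypothetical attention-residual stream: every layer after the first
    blends its attention output with the first layer's attention output.
    `lam` is an assumed fixed mixing weight, not from the source."""
    saved_attn = None
    for Wq, Wk, Wv in layers:
        attn = attention(x, Wq, Wk, Wv)
        if saved_attn is None:
            saved_attn = attn  # extra tensor kept alive for the whole forward
        else:
            # Extra elementwise ops + memory traffic per layer - the naive cost.
            attn = (1.0 - lam) * attn + lam * saved_attn
        x = x + attn  # ordinary residual stream
    return x

rng = np.random.default_rng(0)
d = 16
layers = [tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
          for _ in range(4)]
x = rng.normal(size=(8, d))
y = forward_with_attn_res(x, layers)
```

The naive version adds one saved activation plus two multiplies and an add per layer; without a fused kernel those show up as separate memory-bound passes, which is consistent with the slowdown described above.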