strategies:
- strategy: Convert high-level code to hardware-specific kernel code
- strategy: Convert a small amount of high-level code to hardware-specific kernel code
- strategy: Decompose large matrix operations into blocked matmul kernels using `pl.BlockSpec` with 128-aligned tile shapes
    to fully utilize the four 128×128 MXU systolic arrays, accumulating partial results via consecutive grid iterations on
    the innermost reduction axis.
- strategy: Replace elementwise or reduction operations on large arrays with pipelined `pallas_call` kernels that tile inputs
    into VMEM-resident blocks via `BlockSpec` index maps, using bfloat16 inputs to halve HBM transfer volume because these
    kernels fall far below the ~246 ops/byte arithmetic intensity threshold and are therefore HBM-bandwidth-bound.
- strategy: Convert sparse or masked computations into scalar-prefetch-driven kernels using `pltpu.PrefetchScalarGridSpec` with
    SMEM-resident index arrays, enabling data-dependent `index_map` lookups that skip all-zero blocks and avoid unnecessary
    HBM-to-VMEM transfers.
- strategy: Fuse chains of elementwise operations (activations, scaling, bias adds) into the epilogue of a matmul kernel body
    to avoid HBM round-trips, applying the fused transform inside a
    `pl.when(pl.program_id(reduction_axis) == pl.num_programs(reduction_axis) - 1)` guard so that it runs only on the final
    accumulation step.
- strategy: Restructure reduction operations so the reduction dimension is the innermost (last) grid axis with explicit
    `pl.when(pl.program_id(axis) == 0)` initialization, enabling the pipeline to reuse the same output VMEM buffer across
    consecutive iterations for correct in-place accumulation.
