optimizations:
- strategy: reduce data movement
- strategy: overlap data movement and compute
- strategy: loop tiling
- strategy: loop reordering and restructuring
- strategy: fuse operations
- strategy: use lower precision
- strategy: double buffering
- strategy: software pipelining
- strategy: hoist redundant operations out of loops
- strategy: simplify or remove unnecessary code
- strategy: Use all 128 partitions and a free dimension of ≥128 elements per instruction to fully utilize parallel lanes
    and amortize overhead
- strategy: Fuse multiply-add with nonlinear activation into single `nisa.activation` instruction using scale and bias parameters
- strategy: Combine two sequential element-wise ops into one `nisa.tensor_scalar` instruction with dual operands (op0 and
    op1)
- strategy: Use `nisa.tensor_tensor_scan` for associative recurrences instead of loops of dependent instructions
- strategy: Accumulate partial matmul results in PSUM across contraction-dimension tiles to avoid SBUF/HBM round-trips
- strategy: Assign the matrix with the smaller free dimension as the moving tensor in `nisa.nc_matmul` to exploit fast
    LoadStationary (up to 4x speedup)
- strategy: Choose matmul operand assignment to produce output in layout expected by downstream consumer, avoiding intermediate
    transposes
- strategy: Use FP8/BF16/MxFP4 input types for matmul to achieve ~4x higher TensorE throughput versus FP32
- strategy: Maximize DMA transfer size (≥32 KiB) by loading large contiguous tiles in both partition and free dimensions
- strategy: Replace `nl.load_transpose2d` with `nl.load` + `nisa.nc_transpose` when kernel is memory-bound and TensorE is
    idle
- strategy: Use `nl.affine_range` for independent loop iterations to enable compiler pipelining
    across engines; reserve `nl.sequential_range` for loops with loop-carried dependencies
- strategy: Schedule instructions across Tensor/Vector/Scalar/GpSimd engines in parallel by structuring code to avoid data
    dependencies between engines
- strategy: Declare intermediate buffers inside inner loops to reduce SBUF fragmentation and avoid compiler-triggered spill/reload
    traffic
- strategy: Use direct allocation APIs (`ncc.sbuf.alloc`, `ncc.sbuf.mod_alloc`) to bypass compiler heuristic memory placement
    and control SBUF/PSUM layout
- strategy: Avoid PSUM bank collisions by assigning accumulation tensors to distinct banks in multi-matmul loops
- strategy: Use modulo allocation with `num_free_tiles` to implement multi-buffering with minimal code complexity
- strategy: Use SPMD grid launch with `nl.program_id` to distribute independent tiles across cores instead of sequential loops
- strategy: Delay division in softmax until after subsequent matmul, exploiting scalar-matrix associativity to reduce FLOPs
- strategy: Compute tiled reductions with running statistics (online max, log-sum-exp) to avoid materializing full intermediate
    tensors
- strategy: Use `nisa.bn_stats`/`nisa.bn_aggr` for single-pass mean and variance computation instead of separate reduction passes
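# Worked note for the delayed-division softmax strategy above. This is a sketch; the symbols
# S, V, m, l, N and d are illustrative and are not defined elsewhere in this file.
#   softmax(S) V = diag(1/l) (exp(S - m) V), with m = rowmax(S) and l = rowsum(exp(S - m)),
#   where m is broadcast across each row.
#   For S of shape N x N and V of shape N x d, normalizing before the matmul costs N*N divisions,
#   while normalizing after it costs N*d, which is fewer whenever d < N.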
