general:
- The rewritten program should be semantically equivalent to the original program, within a small numerical tolerance.
- Keep the same function name and signature as the original program (helper functions can be renamed or deleted).
- The partition dimension (P) is always the first/leftmost dimension of any SBUF/PSUM tile and has a hard maximum of 128 elements
  (pmax=128). All NKI compute APIs require tiles with the partition dimension as the first dimension.
- All kernel input/output tensors must reside in HBM; data must be explicitly loaded from HBM to SBUF via nl.load before computation,
  and results must be explicitly stored back via nl.store. Matmul results in PSUM must first be copied to SBUF via nl.copy
  before storing to HBM.
- nc_matmul computes stationary.T @ moving, reads both inputs from SBUF, and always writes FP32 results to PSUM. The contraction
  axis must be in the partition dimension for both operands. The stationary free axis must be ≤ 128, the moving free axis
  ≤ 512, and the partition axis ≤ 128.
- 'PSUM accumulation for matmuls requires exactly this pattern: (1) initialize with nl.zeros(..., buffer=nl.psum), (2) use
  nl.affine_range for the loop, and (3) accumulate via psum_buf += nl.matmul(...). Using psum_buf[...] = psum_buf + nisa.nc_matmul(...)
  will NOT trigger PSUM accumulation.'
- When tile indices may exceed actual tensor dimensions (i.e., the tensor size is not a multiple of the tile size), a mask
  parameter must be passed to nl.load, nl.store, and compute APIs to prevent out-of-bounds access. Mask expressions must be
  affine expressions of nl.arange indices, nl.affine_range loop variables, or nl.program_id values.
planning:
- Limit the scope of the plan to the selected optimization.
- Do not rule out any of the optimizations unless they are clearly irrelevant to the code.
- SBUF capacity is 24 MiB (128 partitions × 192 KiB each, of which 176 KiB per partition is usable); PSUM capacity is 2 MiB
  (128 partitions × 16 KiB each, organized as 8 banks per partition, each holding 512 FP32 elements). All simultaneously
  live tiles must fit within these limits.
- nl.affine_range must only be used when there are no loop-carried dependencies between iterations; associative reductions
  (e.g., matmul accumulation via +=) are NOT loop-carried dependencies. nl.sequential_range must be used for true loop-carried
  dependencies. Python range() is silently converted to sequential_range.
- 'Tensor Engine tile size constraints: LHS (stationary) up to [128, 128], RHS (moving) up to [128, 512]. If the contraction
  dimension exceeds 128, it must be split into chunks ≤ 128 and accumulated across multiple nc_matmul calls into the same
  PSUM buffer.'
- PSUM free dimension ≤ 512 per tile (one bank). bn_stats free dimension ≤ 512. These are hard hardware limits that constrain
  tiling strategies.
- When using direct (manual) memory allocation, ALL tensors must use direct allocation; mixing it with automatic allocation
  (buffer=nl.sbuf/nl.psum) is forbidden. In addition, nisa.nc_transpose on the Tensor Engine and high-level APIs such as
  nl.softmax are not allowed in kernels that use direct allocation.
coding:
- Wrap the generated code with ```python at the beginning and ``` at the end.
- Tiles in SBUF/PSUM must have at least 2 dimensions; 1D tiles cause 'Insufficient rank' errors. Use shapes like (128, 1)
  instead of (128,). Buffers with shape [N, 1] or [1, M] must be indexed with both indices explicitly (e.g., my_sbuf[0:N,
  0]).
- Partition dimension indices must be column vectors (nl.arange(N)[:, None]) and free dimension indices must be row vectors
  (nl.arange(M)[None, :]). Mixing basic indexing (slices) and advanced indexing (nl.arange-based) in the same index tuple
  is not supported.
- Partition dimension broadcasting is NOT supported on operator overloads (+, -, *, /, etc.); use nl.add, nl.multiply, and
  other nki.language APIs instead. Free-axis broadcasting works implicitly.
- Output tensors must be declared with buffer=nl.shared_hbm via nl.ndarray, must be explicitly written to via nl.store, and
  must be returned from the kernel. Tensors defined inside if/else/for blocks cannot be used outside that scope — declare
  in outer scope and assign into them.
- Use & to combine masks (not Python 'and'). Python logical operators (and, or, not) cannot be used on NKI tensors. Control
  flow conditions (if/while) cannot depend on nl.arange or runtime tensor values — use the mask parameter on APIs instead.
