general:
- The rewritten program should be semantically equivalent to the original program, within a small numerical tolerance.
- Keep the same function name and signature as the original program (helper functions can be renamed or deleted).
- The partition dimension (P) must be the first (leftmost) dimension of any SBUF/PSUM tile and must never exceed 128 (nl.tile_size.pmax).
  All NKI compute APIs require tiles with shape[0] as the partition dimension.
- All kernel inputs/outputs must reside in HBM. Data must be explicitly loaded from HBM to SBUF via nl.load before computation,
  and results must be explicitly stored from SBUF to HBM via nl.store. nl.store cannot directly store from PSUM—data must
  first be copied from PSUM to SBUF via nl.copy.
- For matrix multiplication (nc_matmul/nl.matmul), the contraction axis K must be mapped to the partition dimension (first
  dimension) for both operands. The stationary free axis must not exceed 128 and the moving free axis must not exceed 512. The
  operation computes stationary.T @ moving, and the output is always FP32 in PSUM.
- PSUM free dimension must not exceed 512 FP32 elements (nl.tile_size.psum_fmax). SBUF free dimension can be up to 64K elements.
  Tiles on SBUF/PSUM must have at least 2 dimensions—1D tiles cause 'Insufficient rank' errors.
- Use nki.* namespace (not neuronxcc.nki.*). Use nki.language (nl) for high-level APIs and nki.isa (nisa) for ISA-level APIs.
  All nki.isa APIs take dst as a required parameter; the dtype and mask parameters have been removed.
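# Illustrative sketch of the load/compute/store flow described in the rules above (a comment only, not an additional rule).
# Assumptions: the kernel name tensor_add and its arguments are hypothetical; both inputs are 2D HBM tensors of identical
# shape with shape[0] <= 128; nki.jit, nl.load, nl.add, and nl.store are assumed to behave as the rules above describe.
#
#   import nki
#   import nki.language as nl
#
#   @nki.jit
#   def tensor_add(a, b):
#       # The output lives in HBM and is returned explicitly.
#       out = nl.ndarray(a.shape, dtype=a.dtype, buffer=nl.shared_hbm)
#       a_sb = nl.load(a)            # HBM -> SBUF, partition dimension first
#       b_sb = nl.load(b)            # HBM -> SBUF
#       c_sb = nl.add(a_sb, b_sb)    # compute on SBUF tiles
#       nl.store(out, value=c_sb)    # SBUF -> HBM
#       return out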
planning:
- Limit the scope of the plan to the selected optimization.
- Do not rule out any of the optimizations unless they are clearly irrelevant to the code.
- nl.affine_range must be used for loops without loop-carried dependencies (associative reductions like PSUM accumulation
  via += are NOT loop-carried dependencies). nl.sequential_range must be used when true loop-carried dependencies exist. Python
  range() is auto-converted to sequential_range.
- 'PSUM accumulation for tiled matmul requires exactly this pattern: (1) initialize with nl.zeros(..., buffer=nl.psum), (2)
  use nl.affine_range for the loop, (3) accumulate via psum_buf += nl.matmul(...). Using psum_buf[...] = psum_buf + nl.matmul(...)
  will NOT trigger PSUM accumulation and falls back to slower VectorEngine addition.'
- Mixing direct allocation with automatic allocation is forbidden. If any tensor uses direct allocation (ncc.sbuf.alloc/mod_alloc),
  ALL tensors must use direct allocation. In allocated kernels, nisa.nc_transpose with TensorEngine and high-level APIs like
  nl.softmax are not allowed.
- When tile indices may exceed actual tensor dimensions (dimension not a multiple of tile size), a mask parameter must be
  passed to nl.load, nl.store, and compute APIs. Masks must be compile-time constant predicates built from comparison expressions
  of nl.arange/nl.affine_range/nl.program_id—runtime tensor values are not allowed.
- Tensor Engine reads from SBUF and writes to PSUM only. VectorE and ScalarE can read/write both SBUF and PSUM. GpSimdE can
  only access SBUF. VectorE and GpSimdE cannot access SBUF simultaneously; VectorE and ScalarE cannot access PSUM simultaneously.
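# Illustrative sketch of the PSUM accumulation pattern described in the rules above (a comment only, not an additional rule).
# Assumptions: the kernel name matmul_tiled is hypothetical; lhsT is already transposed with shape (K, M) and rhs has
# shape (K, N); K and M are multiples of 128 and N is a multiple of 512, so no masks are needed; nl.matmul is assumed to
# accept transpose_x=True so that the contraction axis K sits on the partition dimension of both operands.
#
#   import nki
#   import nki.language as nl
#
#   @nki.jit
#   def matmul_tiled(lhsT, rhs):
#       K, M = lhsT.shape
#       _, N = rhs.shape
#       TILE_K, TILE_M, TILE_N = 128, 128, 512   # partition, stationary free, moving free limits
#       result = nl.ndarray((M, N), dtype=lhsT.dtype, buffer=nl.shared_hbm)
#       for m in nl.affine_range(M // TILE_M):
#           for n in nl.affine_range(N // TILE_N):
#               # (1) zero-initialized accumulator in PSUM
#               acc = nl.zeros((TILE_M, TILE_N), nl.float32, buffer=nl.psum)
#               # (2) affine_range, because += accumulation is not a loop-carried dependency
#               for k in nl.affine_range(K // TILE_K):
#                   lhsT_tile = nl.load(lhsT[k * TILE_K:(k + 1) * TILE_K, m * TILE_M:(m + 1) * TILE_M])
#                   rhs_tile = nl.load(rhs[k * TILE_K:(k + 1) * TILE_K, n * TILE_N:(n + 1) * TILE_N])
#                   # (3) accumulate with +=, not acc[...] = acc + ...
#                   acc += nl.matmul(lhsT_tile, rhs_tile, transpose_x=True)
#               res_sb = nl.copy(acc, dtype=result.dtype)   # PSUM -> SBUF before storing
#               nl.store(result[m * TILE_M:(m + 1) * TILE_M, n * TILE_N:(n + 1) * TILE_N], value=res_sb)
#       return result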
coding:
- Wrap the generated code with ```python at the beginning and ``` at the end.
- Tensor indices for the partition dimension must be column vectors (nl.arange(N)[:, None]) and free dimension indices must
  be row vectors (nl.arange(M)[None, :]). Mixing basic indexing (slices) and advanced indexing (nl.arange) in the same index
  tuple is not supported.
- Output tensors must be declared with buffer=nl.shared_hbm using nl.ndarray and must be explicitly returned from the kernel
  via a return statement. All mutable kernel parameters must be annotated and returned. Unwritten output regions contain garbage
  data.
- Partition dimension broadcasting is NOT supported on operator overloads (+, -, *, /, etc.)—use nl.add, nl.multiply, and
  other nki.language APIs instead. Use & to combine masks (not Python 'and'). Use element-wise &, |, ~ instead of logical
  and/or/not on tensors.
- nisa.activation scale must be a float32 scalar or vector of shape (P, 1); bias must be a float32/float16/bfloat16 vector of
  shape (P, 1) with free dimension exactly 1. The computation order is output = f_act(data * scale + bias). On NeuronCore-v2,
  reciprocal activation does not support bias.
- Tensors defined inside if/else/for blocks cannot be used outside those blocks—declare tensors in the outer scope and use
  tensor[...] = assignment inside the block. Reassigning a tensor variable (data = data + x) creates a new tensor causing
  scope errors; use data[...] = data + x for in-place updates. shared_hbm tensors can only be created at kernel top-level
  scope.
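# Illustrative sketch of the indexing and mask rules above (a comment only, not an additional rule).
# Assumptions: the kernel name exp_ragged is hypothetical; x is a 2D HBM tensor whose row count need not be a multiple
# of 128 and whose column count fits in one SBUF tile; nl.load, nl.exp, and nl.store are assumed to accept the mask
# parameter as the rules above describe.
#
#   import math
#   import nki
#   import nki.language as nl
#
#   @nki.jit
#   def exp_ragged(x):
#       rows, cols = x.shape
#       out = nl.ndarray(x.shape, dtype=x.dtype, buffer=nl.shared_hbm)   # created at kernel top level
#       P = nl.tile_size.pmax
#       for t in nl.affine_range(math.ceil(rows / P)):
#           i_p = t * P + nl.arange(P)[:, None]    # partition indices as a column vector
#           i_f = nl.arange(cols)[None, :]         # free indices as a row vector
#           valid = i_p < rows                     # compile-time predicate for the ragged last tile
#           tile = nl.load(x[i_p, i_f], mask=valid)
#           res = nl.exp(tile, mask=valid)
#           nl.store(out[i_p, i_f], value=res, mask=valid)
#       return out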
