## Data Types and Constants

### Supported Data Types by NKI

The table below lists all data types supported by NKI. Almost all NKI APIs accept a data type field, `dtype`, which can be either a NumPy equivalent type or a `nki.language` data type.

Table 6: Supported Data Types by NKI

Category | Data Type | Accepted `dtype` Field by NKI APIs  
---|---|---  
Integer | 8-bit unsigned integer | `nki.language.uint8` or `numpy.uint8`  
Integer | 8-bit signed integer | `nki.language.int8` or `numpy.int8`  
Integer | 16-bit unsigned integer | `nki.language.uint16` or `numpy.uint16`  
Integer | 16-bit signed integer | `nki.language.int16` or `numpy.int16`  
Integer | 32-bit unsigned integer | `nki.language.uint32` or `numpy.uint32`  
Integer | 32-bit signed integer | `nki.language.int32` or `numpy.int32`  
Float | float8_e4m3 (1S,4E,3M) [2] | `nki.language.float8_e4m3`  
Float | float8_e5m2 (1S,5E,2M) | `nki.language.float8_e5m2`  
Float | float16 (1S,5E,10M) | `nki.language.float16` or `numpy.float16`  
Float | bfloat16 (1S,8E,7M) | `nki.language.bfloat16`  
Float | tfloat32 (1S,8E,10M) | `nki.language.tfloat32`  
Float | float32 (1S,8E,23M) | `nki.language.float32` or `numpy.float32`  
Boolean | boolean stored as uint8 | `nki.language.bool_` or `numpy.bool`  
  

---

### NKI Type Promotion

When the data types (dtypes) of the inputs to an arithmetic operation (e.g., add, multiply, tensor_tensor) differ, the dtypes are promoted according to the rules below:

(float, integer): Pick the float type. Example:

  * `(np.int32, np.float16) -> np.float16`

  * `(np.uint16, nl.tfloat32) -> nl.tfloat32`

(float, float): Pick the wider float type, or a new widened type that fits the value range of both inputs. Example:

  * `(np.float32, nl.tfloat32) -> np.float32`

  * `(np.float32, nl.bfloat16) -> np.float32`

  * `(np.float16, nl.bfloat16) -> np.float32` (new widened type)

  * `(nl.float8_e4m3, np.float16) -> np.float16`

  * `(nl.float8_e4m3, nl.bfloat16) -> nl.bfloat16`

  * `(nl.float8_e4m3, nl.float8_e5m2) -> nl.bfloat16` (new widened type)

(int, int): Pick the wider type, or a new widened type that fits the value range of both inputs. Example:

  * `(np.int16, np.int32) -> np.int32`

  * `(np.uint8, np.uint16) -> np.uint16`

  * `(np.uint16, np.int32) -> np.int32`

  * `(np.int8, np.uint8) -> np.int16` (new widened type)

  * `(np.int8, np.uint16) -> np.int32` (new widened type)

  * `(np.int32, np.uint32) -> np.float32` (new widened type is float32, since int64 isn’t supported on the hardware)

The output of the arithmetic operation will get the promoted type by default.
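The (int, int) rule can be modeled in a few lines of plain Python. This is only a sketch under stated assumptions: dtypes are represented as strings, `promote_int` is a hypothetical helper and not an NKI API, and the mixed-signedness logic is inferred from the examples listed above, so pairs that are not listed there may be handled differently by the compiler.

```python
# Hypothetical model of the (int, int) promotion rule. Dtypes are plain strings,
# not real NumPy or NKI dtype objects.
INT_TYPES = {
    "uint8": (8, False), "int8": (8, True),
    "uint16": (16, False), "int16": (16, True),
    "uint32": (32, False), "int32": (32, True),
}


def promote_int(a: str, b: str) -> str:
    """Return the promoted dtype name for two integer dtype names."""
    bits_a, signed_a = INT_TYPES[a]
    bits_b, signed_b = INT_TYPES[b]
    if signed_a == signed_b:
        # Same signedness: the wider type already fits both value ranges.
        return a if bits_a >= bits_b else b
    # Mixed signedness: a signed type must be strictly wider than the
    # unsigned operand to hold all of its values.
    unsigned_bits = bits_b if signed_a else bits_a
    signed_bits = bits_a if signed_a else bits_b
    needed = max(signed_bits, 2 * unsigned_bits)
    if needed > 32:
        # int64 is not available, so the documented fallback is float32.
        return "float32"
    return f"int{needed}"


print(promote_int("int8", "uint8"))    # int16
print(promote_int("uint16", "int32"))  # int32
print(promote_int("int32", "uint32"))  # float32
```

The sketch reproduces all six documented (int, int) examples; it makes no claim about how the compiler implements promotion internally.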

Note: The Vector Engine internally performs most of the computation in FP32 (see Vector Engine) and casts the output back to the specific type.
    
    
    x = np.ndarray((N, M), dtype=nl.float8_e4m3)
    y = np.ndarray((N, M), dtype=np.float16)
    z = nl.add(x, y) # calculation done in FP32, output cast to np.float16
    assert z.dtype == np.float16
    

To prevent the compiler from automatically widening output dtype based on mismatching input dtypes, you may explicitly set the output dtype in the arithmetic operation API. This would be useful if the output is passed into another operation that benefits from a smaller dtype.
    
    
    x = np.ndarray((N, M), dtype=nl.bfloat16)
    y = np.ndarray((N, M), dtype=np.float16)
    z = nl.add(x, y, dtype=nl.bfloat16)  # without explicit `dtype`, `z.dtype` would have been np.float32
    assert z.dtype == nl.bfloat16
    

---

### Weakly Typed Scalar Type Inference

Weakly typed scalars (scalar values where the type wasn’t explicitly specified) will be inferred as the widest dtype supported by hardware:

  * `bool --> uint8`

  * `integer --> int32`

  * `floating --> float32`

Doing an arithmetic operation with a scalar may result in a larger output type than expected, for example:

  * `(np.int8, 2) -> np.int32`

  * `(np.float16, 1.2) -> np.float32`
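The inference rule for weakly typed scalars can be illustrated with a small stand-alone Python function. This is a sketch, not NKI code: `infer_weak_scalar_dtype` is a hypothetical helper, and dtypes are returned as plain strings.

```python
def infer_weak_scalar_dtype(value) -> str:
    """Hypothetical helper: dtype name that an untyped Python scalar is inferred as."""
    # bool must be tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "uint8"
    if isinstance(value, int):
        return "int32"
    if isinstance(value, float):
        return "float32"
    raise TypeError(f"not a weakly typed scalar: {value!r}")


print(infer_weak_scalar_dtype(True))  # uint8
print(infer_weak_scalar_dtype(2))     # int32
print(infer_weak_scalar_dtype(1.2))   # float32
```

Because the scalar `2` is treated as `int32`, combining it with an `int8` tile promotes the result to `int32`, which matches the first example above.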

To prevent larger dtypes from being inferred from weak scalar types, do either of the following:

  1. Explicitly set the datatype of the scalar, like `np.int8(2)`, so that the output type is what you desire:

> 
>     x = np.ndarray((N, M), dtype=np.float16)
>     y = np.float16(2)
>     z = nl.add(x, y)
>     assert z.dtype == np.float16
>     

  2. Explicitly set the output dtype of the arithmetic operation:

> 
>     x = np.ndarray((N, M), dtype=np.int16)
>     y = 2
>     z = nl.add(x, y, dtype=nl.bfloat16)
>     assert z.dtype == nl.bfloat16
>     

Note: The Vector Engine internally performs most of the computation in FP32 (see Vector Engine) and casts the output back to the specific type.

---

### tfloat32

`tfloat32` | 32-bit floating-point number (1S,8E,10M)  

---

### bfloat16

`bfloat16` | 16-bit floating-point number (1S,8E,7M)  

---

### float8_e4m3

`float8_e4m3` | 8-bit floating-point number (1S,4E,3M)  

---

### float8_e5m2

`float8_e5m2` | 8-bit floating-point number (1S,5E,2M)  

---

### fp32

`fp32` | FP32 Constants  

---

### tile_size

  * `tile_size`: Added total_available_sbuf_size field

`tile_size` | Tile size constants.  

---

### nki.language.tile_size

class nki.language.tile_size
    

Tile size constants.

Attributes

Attribute | Description  
---|---  
`bn_stats_fmax` | Maximum free dimension of BN_STATS.  
`gemm_moving_fmax` | Maximum free dimension of the moving operand of General Matrix Multiplication on Tensor Engine.  
`gemm_stationary_fmax` | Maximum free dimension of the stationary operand of General Matrix Multiplication on Tensor Engine.  
`pmax` | Maximum partition dimension of a tile.  
`psum_fmax` | Maximum free dimension of a tile on PSUM buffer.  
`psum_min_align` | The minimum byte alignment requirement for PSUM free dimension address.  
`sbuf_min_align` | The minimum byte alignment requirement for SBUF free dimension address.  
`total_available_sbuf_size` | The total available SBUF size.  
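A typical use of these constants is splitting a tensor into tiles that respect the partition limit. The following plain-Python sketch shows only the index arithmetic. It assumes `pmax` is 128 (the partition size used in the examples throughout this reference); in a real kernel the value would be read from `nki.language.tile_size.pmax`. `partition_tiles` is a hypothetical helper.

```python
PMAX = 128  # assumed value of nki.language.tile_size.pmax


def partition_tiles(num_rows: int, pmax: int = PMAX):
    """Return (start, size) pairs that cover num_rows in chunks of at most pmax rows."""
    return [(start, min(pmax, num_rows - start))
            for start in range(0, num_rows, pmax)]


print(partition_tiles(300))  # [(0, 128), (128, 128), (256, 44)]
```

Each `(start, size)` pair describes one tile whose partition dimension does not exceed `pmax`; the last tile is smaller when the row count is not a multiple of `pmax`.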

---

### tensor

`tensor` | A tensor object represents a multidimensional, homogeneous array of fixed-size items  

## Tensor Creation and Initialization

### ndarray

`ndarray` | Create a new tensor of given shape and dtype on the specified buffer.  

---

### empty_like

`empty_like` | Create a new tensor with the same shape and type as a given tensor.  

---

### zeros

`zeros` | Create a new tensor of given shape and dtype on the specified buffer, filled with zeros.  

---

### zeros_like

`zeros_like` | Create a new tensor of zeros with the same shape and type as a given tensor.  

---

### ones

`ones` | Create a new tensor of given shape and dtype on the specified buffer, filled with ones.  

---

### full

`full` | Create a new tensor of given shape and dtype on the specified buffer, filled with initial value.  

---

### rand

`rand` | Generate a tile of given shape and dtype, filled with random values that are sampled from a uniform distribution between 0 and 1.  

---

### random_seed

`random_seed` | Sets a seed, specified by user, to the random number generator on HW.  

---

### shared_constant

`shared_constant` | Create a new tensor filled with the data specified by data array.  

---

### shared_identity_matrix

`shared_identity_matrix` | Create a new identity tensor with specified data type.  

---

### nki.language.full

nki.language.full(shape, fill_value, dtype, *, buffer=None, name='', **kwargs)
    

Create a new tensor of given shape and dtype on the specified buffer, filled with initial value.

(Similar to numpy.full)

Parameters:
    

  * shape – the shape of the tensor.

  * fill_value – the initial value of the tensor.

  * dtype – the data type of the tensor (see Supported Data Types for more information).

  * buffer – the specific buffer (i.e., sbuf, psum, or hbm); defaults to sbuf.

  * name – the name of the tensor.

Returns:
    

a new tensor allocated on the buffer.

---

### nki.language.zeros

nki.language.zeros(shape, dtype, *, buffer=None, name='', **kwargs)
    

Create a new tensor of given shape and dtype on the specified buffer, filled with zeros.

(Similar to numpy.zeros)

Parameters:
    

  * shape – the shape of the tensor.

  * dtype – the data type of the tensor (see Supported Data Types for more information).

  * buffer – the specific buffer (i.e., sbuf, psum, or hbm); defaults to sbuf.

  * name – the name of the tensor.

Returns:
    

a new tensor allocated on the buffer.

---

### nki.language.ndarray

nki.language.ndarray(shape, dtype, *, buffer=None, name='', **kwargs)
    

Create a new tensor of given shape and dtype on the specified buffer.

(Similar to numpy.ndarray)

Parameters:
    

  * shape – the shape of the tensor.

  * dtype – the data type of the tensor (see Supported Data Types for more information).

  * buffer – the specific buffer (i.e., sbuf, psum, or hbm); defaults to sbuf.

  * name – the name of the tensor.

Returns:
    

a new tensor allocated on the buffer.

---

### nki.language.rand

nki.language.rand(shape, dtype=numpy.float32, **kwargs)
    

Generate a tile of given shape and dtype, filled with random values that are sampled from a uniform distribution between 0 and 1.

Parameters:
    

  * shape – the shape of the tile.

  * dtype – the data type of the tile (see Supported Data Types for more information).

Returns:
    

a tile with random values.

---

### memset

`memset` | Initialize a tile filled with a compile-time constant value using Vector or GpSimd Engine.  

---

### iota

`iota` | Build a constant literal in SBUF using GpSimd Engine, rather than transferring the constant literal values from the host to device.  

## Memory Operations and DMA

### load

`load` | Load a tensor from device memory (HBM) into on-chip memory (SBUF).  

---

### store

`store` | Store into a tensor on device memory (HBM) from on-chip memory (SBUF).  

---

### load_transpose2d

`load_transpose2d` | Load a tensor from device memory (HBM) and 2D-transpose the data before storing into on-chip memory (SBUF).  

---

### atomic_rmw

`atomic_rmw` | Perform an atomic read-modify-write operation on HBM data `dst = op(dst, value)`  

---

### copy

`copy` | Create a copy of the src tile.  

---

### nki.isa.dma_copy

nki.isa.dma_copy(*, dst, src, mask=None, dst_rmw_op=None, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown)
    

Copy data from `src` to `dst` using DMA engine. Both `src` and `dst` tiles can be in device memory (HBM) or SBUF. However, if both `src` and `dst` tiles are in SBUF, consider using nisa.tensor_copy instead for better performance.

Parameters:
    

  * src – the source of the copy.

  * dst – the destination of the copy.

  * dst_rmw_op – the read-modify-write operation to be performed at the destination. Currently only `np.add` is supported, which adds the source data to the existing destination data. If `None`, the source data directly overwrites the destination. If `dst_rmw_op` is specified, only `oob_mode=oob_mode.error` is allowed. For best performance with Descriptor Generation Engine (DGE), unique dynamic offsets must be used to access `dst`. Multiple accesses to the same offset will cause a data hazard. If duplicated offsets are present, the compiler automatically adds synchronization to avoid hazards, which slows down computation.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * oob_mode – (optional) Specifies how to handle out-of-bounds (oob) array indices during indirect access operations. Valid modes are:

    * `oob_mode.error`: (Default) Raises an error when encountering out-of-bounds indices.

    * `oob_mode.skip`: Silently skips any operations involving out-of-bounds indices.

For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.

  * dge_mode – (optional) specify which Descriptor Generation Engine (DGE) mode to use for copy: `nki.isa.dge_mode.none` (turn off DGE) or `nki.isa.dge_mode.swdge` (software DGE) or `nki.isa.dge_mode.hwdge` (hardware DGE) or `nki.isa.dge_mode.unknown` (by default, let compiler select the best DGE mode). HWDGE is only supported for NeuronCore-v3+.

A cast is performed if `src` and `dst` have different dtypes.
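The interaction between `dst_rmw_op` and `oob_mode` for an indirect store can be modeled in plain Python. This is a sketch of the semantics described above, not of the hardware behavior: `indirect_store` is a hypothetical helper, tensors are lists of rows, `oob_mode` is a string standing in for `nisa.oob_mode.error` / `nisa.oob_mode.skip`, and the assumption that skipped rows leave the destination untouched is inferred from the wording "silently skips".

```python
def indirect_store(dst, indices, src, dst_rmw_add=False, oob_mode="error"):
    """Model of an indirect store: dst[indices[i]] = src[i], or += with dst_rmw_add.

    dst and src are lists of rows; indices holds one destination row per src row.
    """
    if dst_rmw_add and oob_mode != "error":
        # Mirrors the documented restriction on dst_rmw_op.
        raise ValueError("dst_rmw_op only allows oob_mode.error")
    for i, row in enumerate(indices):
        if not 0 <= row < len(dst):
            if oob_mode == "error":
                raise IndexError(f"index {row} is out of bounds for {len(dst)} rows")
            continue  # "skip": this row is not written (assumed behavior)
        if dst_rmw_add:
            # Read-modify-write: add the source row to the existing destination row.
            dst[row] = [d + s for d, s in zip(dst[row], src[i])]
        else:
            # Plain copy: overwrite the destination row.
            dst[row] = list(src[i])
    return dst


dst = [[1, 1] for _ in range(4)]
indirect_store(dst, [0, 2], [[10, 10], [20, 20]], dst_rmw_add=True)
print(dst)  # [[11, 11], [1, 1], [21, 21], [1, 1]]

dst = [[-1, -1] for _ in range(4)]
indirect_store(dst, [0, 6], [[10, 10], [20, 20]], oob_mode="skip")
print(dst)  # [[10, 10], [-1, -1], [-1, -1], [-1, -1]]
```

The model applies rows sequentially, so duplicate indices simply accumulate; on hardware, duplicate offsets are the data-hazard case for which the compiler inserts synchronization.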

Example:
    
    
    import neuronxcc.nki.isa as nisa
    
    ############################################################################
    # Example 1: Copy over the tensor to another tensor
    ############################################################################
    nisa.dma_copy(dst=b, src=a)
    
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 2: Load elements from HBM with indirect addressing. If addressing
    # results in out-of-bounds access, the operation will fail.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n//2, 0:m]
    
    expr_arange = 2*nl.arange(n//2)[:, None]
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
    nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 3: Load elements from HBM with indirect addressing. If addressing 
    # results in out-of-bounds access, the operation will fail.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n//2, 0:m]
    
    # indices are out of range on purpose to demonstrate the error
    expr_arange = 3*nl.arange(n//2)[:, None] 
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
    nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
    
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 4: Load elements from HBM with indirect addressing. If addressing 
    # results in out-of-bounds access, the operation will skip indices.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n//2, 0:m]
    
    # indices are out of range on purpose
    expr_arange = 3*nl.arange(n//2)[:, None] 
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
    nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.skip)
    
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 5: Store elements to HBM with indirect addressing and with a
    # read-modify-write operation.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n, 0:m]
    
    expr_arange = 2*nl.arange(n)[:, None]
    inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=1, dtype=in_tensor.dtype)
    nl.store(out_tensor, value=out_tile)
    nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, dst_rmw_op=np.add)
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 6: Store elements to HBM with indirect addressing. If indirect
    # addressing results in out-of-bounds access, the operation will fail.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n, 0:m]
    
    expr_arange = 2*nl.arange(n)[:, None]
    inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
    nl.store(out_tensor, value=out_tile)
    nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ############################################################################
    # Example 7: Store elements to HBM with indirect addressing. If indirect
    # addressing results in out-of-bounds access, the operation will fail.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n, 0:m]
    
    # indices are out of range on purpose to demonstrate the error
    expr_arange = 3*nl.arange(n)[:, None] 
    inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
    nl.store(out_tensor, value=out_tile)
    nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
    
    
    
    
    ############################################################################
    # Example 8: Store elements to HBM with indirect addressing. If indirect
    # addressing results in out-of-bounds access, the operation will skip indices.
    ############################################################################
    
    ...
    n, m = in_tensor.shape
    ix, iy = nl.mgrid[0:n, 0:m]
    
    # indices are out of range on purpose
    expr_arange = 3*nl.arange(n)[:, None] 
    inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
    idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
    
    out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
    nl.store(out_tensor, value=out_tile)
    nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.skip)
    

---

### dma_copy

`dma_copy` | Copy data from `src` to `dst` using DMA engine.  

---

### nki.language.load

nki.language.load(src, *, mask=None, dtype=None, **kwargs)
    

Load a tensor from device memory (HBM) into on-chip memory (SBUF).

See Memory hierarchy for detailed information.

Parameters:
    

  * src – HBM tensor to load the data from.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

a new tile on SBUF with values from `src`.
    
    
    import neuronxcc.nki.language as nl
    
    # load from in_tensor[P, F] that is on HBM
    # copy into data_tile[P, F] that is on SBUF
    data_tile = nl.load(in_tensor)
    ...
    

Note

Partition dimension size can’t exceed the hardware limitation of `nki.language.tile_size.pmax`, see Tile size considerations.

Partition dimension has to be the first dimension in the index tuple of a tile. Therefore, data may need to be split into multiple batches to load/store, for example:
    
    
    import neuronxcc.nki.language as nl
    
    for i_b in nl.affine_range(4):
      data_tile = nl.zeros((128, 512), dtype=in_tensor.dtype) 
      # load from in_tensor[4, 128, 512] one batch at a time
      # copy into data_tile[128, 512]
      i_p, i_f = nl.mgrid[0:128, 0:512]
      data_tile[i_p, i_f] = nl.load(in_tensor[i_b, i_p, i_f])
      ...
    

Also supports indirect DMA access with dynamic index values:
    
    
    import neuronxcc.nki.language as nl
    ...
    
    
    ############################################################################################
    # Indirect DMA read example 1:
    # - data_tensor on HBM has shape [128 x 512].
    # - idx_tensor on HBM has shape [64] (with values [0, 2, 4, 6, ...]).
    # - idx_tensor values read from HBM and stored in SBUF idx_tile of shape [64 x 1]
    # - data_tensor values read from HBM indexed by values in idx_tile 
    #   and store into SBUF data_tile of shape [64 x 512].
    ############################################################################################
    i_p = nl.arange(64)[:, None]
    i_f = nl.arange(512)[None, :]
    
    idx_tile = nl.load(idx_tensor[i_p]) # indices have to be in SBUF
    data_tile = nl.load(data_tensor[idx_tile[i_p, 0], i_f]) 
    ...
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    
    
    ############################################################################################
    # Indirect DMA read example 2:
    # - data_tensor on HBM has shape [128 x 512].
    # - idx_tile on SBUF has shape [64 x 1] (with values [[0], [2], [4], ...] generated by iota)
    # - data_tensor values read from HBM indexed by values in idx_tile 
    #   and store into SBUF data_tile of shape [64 x 512].
    ############################################################################################
    i_f = nl.arange(512)[None, :]
    
    idx_expr = 2*nl.arange(64)[:, None]
    idx_tile = nisa.iota(idx_expr, dtype=np.int32)
    data_tile = nl.load(data_tensor[idx_tile, i_f]) 
    ...
    

---

### nki.language.store

nki.language.store(dst, value, *, mask=None, **kwargs)
    

Store into a tensor on device memory (HBM) from on-chip memory (SBUF).

See Memory hierarchy for detailed information.

Parameters:
    

  * dst – HBM tensor to store the data into.

  * value – An SBUF tile that contains the values to store. If the tile is in PSUM, an extra copy will be performed to move the tile to SBUF first.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:
    

none
    
    
    import neuronxcc.nki.language as nl
    
    ...
    # store into out_tensor[P, F] that is on HBM
    # from data_tile[P, F] that is on SBUF
    nl.store(out_tensor, data_tile)
    

Note

Partition dimension size can’t exceed the hardware limitation of `nki.language.tile_size.pmax`, see Tile size considerations.

Partition dimension has to be the first dimension in the index tuple of a tile. Therefore, data may need to be split into multiple batches to load/store, for example:
    
    
    import neuronxcc.nki.language as nl
    
    for i_b in nl.affine_range(4):
      data_tile = nl.zeros((128, 512), dtype=in_tensor.dtype) 
      ...
      # store into out_tensor[4, 128, 512] one batch at a time
      # from data_tile[128, 512] 
      i_p, i_f = nl.mgrid[0:128, 0:512]
      nl.store(out_tensor[i_b, i_p, i_f], value=data_tile[i_p, i_f]) 
    

Also supports indirect DMA access with dynamic index values:
    
    
    import neuronxcc.nki.language as nl
    ...
    
    
    ##################################################################################
    # Indirect DMA write example 1:
    #  - data_tensor has shape [128 x 512].
    #  - idx_tensor on HBM has shape [64] (with values [0, 2, 4, 6, ...]).
    #  - idx_tensor values read from HBM and stored in SBUF idx_tile.
    #  - data_tile of shape [64 x 512] values written into
    #    HBM data_tensor indexed by values in idx_tile.
    ##################################################################################
    i_p = nl.arange(64)[:, None]
    i_f = nl.arange(512)[None, :]
    idx_tile = nl.load(idx_tensor[i_p]) # indices have to be in SBUF
    
    nl.store(data_tensor[idx_tile[i_p, 0], i_f], value=data_tile[0:64, 0:512])
    
    
    
    import numpy as np
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    
    
    #############################################################################################
    # Indirect DMA write example 2:
    #  - data_tensor has shape [128 x 512].
    #  - idx_tile on SBUF has shape [64 x 1] (with values [[0], [2], [4], ...] generated by iota)
    #  - data_tile of shape [64 x 512] values written into
    #    HBM data_tensor indexed by values in idx_tile.
    #############################################################################################
    i_f = nl.arange(512)[None, :]
    idx_expr = 2*nl.arange(64)[:, None]
    idx_tile = nisa.iota(idx_expr, dtype=np.int32)
    
    nl.store(data_tensor[idx_tile, i_f], value=data_tile[0:64, 0:512]) 
    

---

### nki.language.load_transpose2d

nki.language.load_transpose2d(src, *, mask=None, dtype=None, **kwargs)
    

Load a tensor from device memory (HBM) and 2D-transpose the data before storing into on-chip memory (SBUF).

Parameters:
    

  * src – HBM tensor to load the data from.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

a new tile on SBUF with values from `src` 2D-transposed.
    
    
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    ...
    
    
    # load from in_tensor[F, P] that is on HBM
    # transpose and copy into local_tile[P, F] that is on SBUF
    N, M = in_tensor.shape
    local_tile: tensor[M, N] = nl.load_transpose2d(in_tensor)
    ...
    

Note

Partition dimension size can’t exceed the hardware limitation of `nki.language.tile_size.pmax`, see Tile size considerations.

---

### nki.language.copy

nki.language.copy(src, *, mask=None, dtype=None, **kwargs)
    

Create a copy of the src tile.

Parameters:
    

  * src – the source of copy, must be a tile in SBUF or PSUM.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

a new tile with the same layout as `src`. The new tile is in SBUF by default, but can also be assigned to a PSUM tensor.

---

### nki.isa.dma_transpose

nki.isa.dma_transpose(src, *, axes=None, mask=None, dtype=None, **kwargs)
    

Perform a transpose on input `src` using DMA Engine.

The transpose permutation follows these rules:

  1. For a 2-D input tile, the permutation is [1, 0].

  2. For a 3-D input tile, the permutation is [2, 1, 0].

  3. For a 4-D input tile, the permutation is [3, 1, 2, 0].

Parameters:
    

  * src – the source of transpose, must be a tile in HBM or SBUF.

  * axes – transpose axes where the i-th axis of the transposed tile will correspond to the axes[i] of the source. Supported axes are `(1, 0)`, `(2, 1, 0)`, and `(3, 1, 2, 0)`.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * dge_mode – (optional) specify which Descriptor Generation Engine (DGE) mode to use for copy: `nki.isa.dge_mode.none` (turn off DGE) or `nki.isa.dge_mode.swdge` (software DGE) or `nki.isa.dge_mode.hwdge` (hardware DGE) or `nki.isa.dge_mode.unknown` (by default, let compiler select the best DGE mode). HWDGE is only supported for NeuronCore-v3+.

Returns:
    

a tile with transposed content
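The effect of the permutation on the tile shape can be checked with a few lines of plain Python. This is only a shape-level sketch: `transposed_shape` is a hypothetical helper, and it applies the `axes` convention stated above (axis `i` of the output corresponds to axis `axes[i]` of the source) without moving any data.

```python
# Default permutations per input rank, as listed in the rules above.
DEFAULT_AXES = {2: (1, 0), 3: (2, 1, 0), 4: (3, 1, 2, 0)}


def transposed_shape(shape, axes=None):
    """Return the shape of the transposed tile for a given source shape."""
    if axes is None:
        # Raises KeyError for ranks other than 2, 3, or 4.
        axes = DEFAULT_AXES[len(shape)]
    return tuple(shape[a] for a in axes)


print(transposed_shape((128, 512)))    # (512, 128)
print(transposed_shape((2, 3, 4)))     # (4, 3, 2)
print(transposed_shape((2, 3, 4, 5)))  # (5, 3, 4, 2)
```

Note that the 4-D case swaps only the first and last axes and leaves the two middle axes in place.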

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    
    ############################################################################
    # Example 1: Simple 2D transpose (HBM->SB)
    ############################################################################
    def nki_dma_transpose_2d_hbm2sb(a):
      b = nisa.dma_transpose(a)
      return b
    
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    
    ############################################################################
    # Example 2: Simple 2D transpose (SB->SB)
    ############################################################################
    def nki_dma_transpose_2d_sb2sb(a):
      a_sb = nl.load(a)
      b = nisa.dma_transpose(a_sb)
      return b
    

---

### tensor_copy

`tensor_copy` | Create a copy of `src` tile within NeuronCore on-chip SRAMs using Vector, Scalar or GpSimd Engine.  

---

### tensor_copy_dynamic_src

`tensor_copy_dynamic_src` | Create a copy of `src` tile within NeuronCore on-chip SRAMs using Vector or Scalar or GpSimd Engine, with `src` located at a dynamic offset within each partition.  

---

### nki.isa.tensor_copy_dynamic_src

nki.isa.tensor_copy_dynamic_src(src, *, mask=None, dtype=None, engine=engine.unknown, **kwargs)
    

Create a copy of `src` tile within NeuronCore on-chip SRAMs using Vector or Scalar or GpSimd Engine, with `src` located at a dynamic offset within each partition.

Both source and destination tiles can be in either SBUF or PSUM. By default, this API returns a tile in SBUF, unless the returned value is assigned to a pre-declared PSUM tile.

The source and destination tiles must also have the same number of partitions and the same number of elements per partition.

The dynamic offset must be a scalar value residing in SBUF. If you have a list of dynamic offsets for gathering tiles in SBUF/PSUM, you may loop over each offset and call `tensor_copy_dynamic_src` once per offset.

Estimated instruction cost:

`max(MIN_II_DYNAMIC, N)` engine cycles, where:

  * `N` is the number of elements per partition in the `src` tile,

  * `MIN_II_DYNAMIC` is the minimum instruction initiation interval for instructions with dynamic source location. `MIN_II_DYNAMIC` is roughly 600 engine cycles.
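The cost formula can be evaluated directly to see when the fixed initiation interval dominates. The snippet below is a back-of-the-envelope sketch: it takes `MIN_II_DYNAMIC` as exactly 600 cycles, although the text only gives that as a rough figure, and `estimated_cycles` is a hypothetical helper.

```python
MIN_II_DYNAMIC = 600  # approximate figure from the text; treat as an assumption


def estimated_cycles(elements_per_partition: int) -> int:
    """Estimated engine cycles for one tensor_copy_dynamic_src call."""
    return max(MIN_II_DYNAMIC, elements_per_partition)


print(estimated_cycles(1))       # 600
print(estimated_cycles(2048))    # 2048
print(64 * estimated_cycles(1))  # 38400
```

Under this estimate, gathering 64 single-element columns with one call per offset costs about 38400 cycles, so small per-offset copies are dominated by the initiation interval rather than by the amount of data moved.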

Parameters:
    

  * src – the source of copy, must be a tile in SBUF or PSUM that is dynamically indexed within each partition.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * engine – (optional) the engine to use for the operation: nki.isa.vector_engine, nki.isa.gpsimd_engine, nki.isa.scalar_engine or nki.isa.unknown_engine (default, let compiler select best engine).

Returns:
    

the modified destination of the copy.

Example:
    
    
    import neuronxcc.nki.typing as nt
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    
    
    #########################################################################################
    # TensorCopyDynamicSrc example 0:
    # - src_tensor in HBM of shape [128, 512]
    # - offsets in HBM of shape [1, 64] (with values [4, 5, 6, 7, ...])
    # - Gather tiles of shape [128, 1] from src_tensor into out_tensor using offsets
    #########################################################################################
    
    # Load src_tensor and offsets into SBUF
    src_tensor_sbuf: nt.tensor[128, 512] = nl.load(src_tensor)
    offsets_sbuf: nt.tensor[1, 64] = nl.load(offsets)
    
    # Copy into output tensor in SBUF
    out_sbuf: nt.tensor[128, 64] = nl.ndarray([128, 64], dtype=src_tensor.dtype,
                                              buffer=nl.sbuf)
    
    # Static indices to access a tile of shape [128, 1];
    # Add dynamic offsets to iy for tensor_copy_dynamic_src
    ix, iy = nl.mgrid[0:128, 0:1]
    
    for idx in nl.affine_range(offsets_sbuf.shape[1]):
      out_sbuf[ix, idx] = nisa.tensor_copy_dynamic_src(
          src_tensor_sbuf[ix, offsets_sbuf[0, idx] + iy])
    
    nl.store(out_tensor, value=out_sbuf)
    ...
    
    
    
    import neuronxcc.nki.typing as nt
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    
    
    #########################################################################################
    # TensorCopyDynamicSrc example 1:
    # - src_tensor in HBM of shape [128, 512, 4]
    # - offsets in HBM of shape [1, 8] (with values [4, 5, 6, 7, ...]) to index into
    #   second axis of src_tensor
    # - Gather tiles of shape [128, 4] from src_tensor into out_tensor using offsets
    #########################################################################################
    
    # Load src_tensor and offsets into SBUF
    src_tensor_sbuf: nt.tensor[128, 512, 4] = nl.load(src_tensor)
    offsets_sbuf: nt.tensor[1, 8] = nl.load(offsets)
    
    # Copy into output tensor in SBUF
    out_sbuf: nt.tensor[128, 8, 4] = nl.ndarray([128, 8, 4], dtype=src_tensor.dtype,
                                                buffer=nl.sbuf)
    
    # Static indices to access a tile of shape [128, 1, 4];
    # Use dynamic offsets directly to index the second axis for tensor_copy_dynamic_src
    ix, _, iz = nl.mgrid[0:128, 0:1, 0:4]
    
    for idx in nl.affine_range(offsets_sbuf.shape[1]):
      out_sbuf[ix, idx, iz] = nisa.tensor_copy_dynamic_src(
          src_tensor_sbuf[ix, offsets_sbuf[0, idx], iz])
    
    nl.store(out_tensor, value=out_sbuf)
    ...
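
The gather in example 0 can be mirrored on the host with a plain-Python reference model, which may help when checking kernel output. This is only a sketch of the indexing semantics: `gather_columns` is a hypothetical helper, nested lists stand in for SBUF tiles, and the shapes are shrunk for readability.

```python
def gather_columns(src, offsets):
    """For each offset, copy the column src[:, offset] into out[:, idx]."""
    num_partitions = len(src)
    out = [[None] * len(offsets) for _ in range(num_partitions)]
    for idx, offset in enumerate(offsets):   # one copy per dynamic offset
        for p in range(num_partitions):      # same offset applied in every partition
            out[p][idx] = src[p][offset]
    return out

# 4 partitions x 8 elements instead of 128 x 512.
src = [[10 * p + f for f in range(8)] for p in range(4)]
offsets = [4, 5, 6, 7]
out = gather_columns(src, offsets)
print(out[0])
print(out[3])
```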

---

### tensor_copy_dynamic_dst

`tensor_copy_dynamic_dst` | Create a copy of `src` tile within NeuronCore on-chip SRAMs using Vector or Scalar or GpSimd Engine, with `dst` located at a dynamic offset within each partition.  

---

### nki.isa.tensor_copy_dynamic_dst

nki.isa.tensor_copy_dynamic_dst(*, dst, src, mask=None, dtype=None, engine=engine.unknown, **kwargs)
    

Create a copy of `src` tile within NeuronCore on-chip SRAMs using Vector or Scalar or GpSimd Engine, with `dst` located at a dynamic offset within each partition.

Both source and destination tiles can be in either SBUF or PSUM.

The source and destination tiles must also have the same number of partitions and the same number of elements per partition.

The dynamic offset must be a scalar value residing in SBUF. If you have a list of dynamic offsets for scattering tiles in SBUF/PSUM, you may loop over each offset and call `tensor_copy_dynamic_dst` once per offset.

Estimated instruction cost:

`max(MIN_II_DYNAMIC, N)` engine cycles, where:

  * `N` is the number of elements per partition in the `src` tile,

  * `MIN_II_DYNAMIC` is the minimum instruction initiation interval for instructions with dynamic destination location. `MIN_II_DYNAMIC` is roughly 600 engine cycles.

Parameters:
    

  * dst – the destination of copy, must be a tile in SBUF or PSUM that is dynamically indexed within each partition.

  * src – the source of copy, must be a tile in SBUF or PSUM.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * engine – (optional) the engine to use for the operation: nki.isa.vector_engine, nki.isa.gpsimd_engine, nki.isa.scalar_engine or nki.isa.unknown_engine (default, let compiler select best engine).
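
A host-side sketch of the scatter pattern described above, assuming it mirrors the gather of `tensor_copy_dynamic_src` with the dynamic offset applied on the destination side. `scatter_columns` is a hypothetical helper and nested lists stand in for on-chip tiles; it does not call NKI.

```python
def scatter_columns(dst, src, offsets):
    """For each offset, copy the column src[:, idx] into dst[:, offset]."""
    for idx, offset in enumerate(offsets):   # one copy per dynamic offset
        for p in range(len(dst)):            # same offset applied in every partition
            dst[p][offset] = src[p][idx]
    return dst

# 4 partitions; destination has 8 elements per partition, source has 2.
dst = [[0] * 8 for _ in range(4)]
src = [[10 * p + 1, 10 * p + 2] for p in range(4)]
offsets = [6, 1]
scattered = scatter_columns(dst, src, offsets)
print(scattered[0])
print(scattered[2])
```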

---

### tensor_copy_predicated

`tensor_copy_predicated` | Conditionally copy elements from the `src` tile to the destination tile on SBUF / PSUM based on a `predicate` using Vector Engine.  

---

### nki.isa.tensor_copy_predicated

nki.isa.tensor_copy_predicated(*, src, dst, predicate, reverse_pred=False, mask=None, dtype=None, **kwargs)
    

Conditionally copy elements from the `src` tile to the destination tile on SBUF / PSUM based on a `predicate` using Vector Engine.

This instruction provides low-level control over conditional data movement on NeuronCores, optimized for scenarios where only selective copying of elements is needed. Either `src` or `predicate` may be in PSUM, but not both simultaneously. Both `src` and `predicate` are permitted to be in SBUF.

Shape and data type constraints:

  1. `src` (if it is a tensor), `dst`, and `predicate` must occupy the same number of partitions and same number of elements per partition.

  2. `predicate` must be of type `uint8`, `uint16`, or `uint32`.

  3. `src` and `dst` must share the same data type.

Behavior:

  * Where predicate is True: The corresponding elements from src are copied to dst tile. If src is a scalar, the scalar is copied to the dst tile.

  * Where predicate is False: The corresponding values in dst tile are unmodified
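
The behavior above can be expressed as a small host-side reference model. This is a sketch under assumptions: `copy_predicated` is a hypothetical helper, any nonzero predicate element is treated as True, and `reverse_pred=True` is taken to mean that copying happens where the predicate is False.

```python
def copy_predicated(src, dst, predicate, reverse_pred=False):
    """Host-side model: overwrite dst where the (optionally reversed) predicate holds."""
    for p, pred_row in enumerate(predicate):
        for f, flag in enumerate(pred_row):
            if bool(flag) != reverse_pred:
                # src may be a tile (nested list) or a single scalar.
                dst[p][f] = src[p][f] if isinstance(src, list) else src
    return dst

predicate = [[1, 0, 1], [0, 0, 1]]
tile_result = copy_predicated([[1, 2, 3], [4, 5, 6]],
                              [[-1, -1, -1], [-1, -1, -1]], predicate)
scalar_result = copy_predicated(9, [[0, 0, 0], [0, 0, 0]], predicate,
                                reverse_pred=True)
print(tile_result)
print(scalar_result)
```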

Estimated instruction cost:

Cost `(Vector Engine Cycles)` | Condition  
---|---  
`max(MIN_II, N)` | If `src` is from SBUF and `predicate` is from PSUM or the other way around  
`max(MIN_II, 2N)` | If both `src` and `predicate` are in SBUF  
  
  * `N` is the number of elements per partition in `src` tile

  * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * src – The source tile or number to copy elements from when `predicate` is True

  * dst – The destination tile to copy elements to

  * predicate – A tile that determines which elements to copy

  * reverse_pred – A boolean that reverses the effect of `predicate`.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ##################################################################
    # Example 1: Conditionally copies elements from the `on_true` tile to 
    # SBUF/PSUM destination tile using Vector Engine, where copying occurs 
    # only at positions where the predicate evaluates to True.
    ##################################################################
    
    ...
    pre_tile: tensor[128, 512] = nl.load(predicate)
    src_tile: tensor[128, 512] = nl.load(on_true_tensor)
    
    ix, iy = nl.mgrid[0:128, 0:512]
    dst_tile: tensor[128, 512] = nl.zeros(shape=src_tile.shape, dtype=src_tile.dtype)
    dst_tile[ix, iy] = nl.load(on_false_tensor)
    
    nisa.tensor_copy_predicated(src=src_tile, dst=dst_tile, predicate=pre_tile)
    

## Memory Allocation

### psum

`psum` | PSUM - Only visible to each individual kernel instance in the SPMD grid, alias of `nki.compiler.psum.auto_alloc()`  

---

### sbuf

`sbuf` | State Buffer - Only visible to each individual kernel instance in the SPMD grid, alias of `nki.compiler.sbuf.auto_alloc()`  

---

### hbm

`hbm` | HBM - Alias of private_hbm  

---

### private_hbm

`private_hbm` | HBM - Only visible to each individual kernel instance in the SPMD grid  

---

### shared_hbm

`shared_hbm` | Shared HBM - Visible to all kernel instances in the SPMD grid  

---

### sbuf.alloc

`sbuf.alloc` | Allocate SBUF memory space for each logical block in a tensor using a customized allocation method.  

---

### sbuf.mod_alloc

`sbuf.mod_alloc` | Allocate SBUF memory space for each logical tile in a tensor through modulo allocation.  

---

### sbuf.auto_alloc

`sbuf.auto_alloc` | Returns a marker to indicate the tensor should be automatically allocated by the compiler.  

---

### psum.alloc

`psum.alloc` | Allocate PSUM memory space for each logical block in a tensor using a customized allocation method.  

---

### psum.mod_alloc

`psum.mod_alloc` | Allocate PSUM memory space for each logical block in a tensor through modulo allocation.  

---

### psum.auto_alloc

`psum.auto_alloc` | Returns a marker to indicate the tensor should be automatically allocated by the compiler.  

---

### nki.compiler.sbuf.mod_alloc

nki.compiler.sbuf.mod_alloc(*, base_addr, base_partition=0, num_par_tiles=(), num_free_tiles=())
    

Allocate SBUF memory space for each logical tile in a tensor through modulo allocation.

This is one of the NKI direct allocation APIs. We recommend reading the NKI Direct Allocation Developer Guide before using these APIs.

This API is equivalent to calling `nki.compiler.sbuf.alloc()` with the callable `sbuf_modulo_alloc_func` as defined below.
    
    
    from typing import Optional, Tuple
    from functools import reduce
    from operator import mul

    def num_elms(shape):
      return reduce(mul, shape, 1)

    def linearize(shape, indices):
      return sum(i * num_elms(shape[dim+1:]) for dim, i in enumerate(indices))

    def modulo_allocate_func(base, allocate_shape, scale):
      def func(indices):
        if not allocate_shape:
          # default shape is always (1, 1, ...)
          allocate_shape_ = (1, ) * len(indices)
        else:
          allocate_shape_ = allocate_shape
        mod_idx = tuple(i % s for i, s in zip(indices, allocate_shape_))
        return linearize(shape=allocate_shape_, indices=mod_idx) * scale + base
      return func

    def mod_alloc(base_addr: int, *,
                  base_partition: Optional[int] = 0,
                  num_par_tiles: Optional[Tuple[int, ...]] = (),
                  num_free_tiles: Optional[Tuple[int, ...]] = ()):
      def sbuf_modulo_alloc_func(idx, pdim_size, fdim_size):
        return (modulo_allocate_func(base_partition, num_par_tiles, pdim_size)(idx),
                modulo_allocate_func(base_addr, num_free_tiles, fdim_size)(idx))
      return sbuf_modulo_alloc_func
    

Here’s an example usage of this API:
    
    
    nki_tensor = nl.ndarray((4, par_dim(128), 512), dtype=nl.bfloat16,
                            buffer=nki.compiler.sbuf.mod_alloc(base_addr=0, num_free_tiles=(2, )))
    
    for i_block in nl.affine_range(4):
      nki_tensor[i_block, :, :] = nl.load(...)
      ...                       = nl.exp(nki_tensor[i_block, :, :])
    

This produces the following allocation:

Table 4 Modulo Allocation Example

Logical Tile Index | Physical Tile `start_partition` | Physical Tile `byte_addr`  
---|---|---  
(0, ) | 0 | 0 + (0 % 2) * 512 * sizeof(nl.bfloat16) = 0  
(1, ) | 0 | 0 + (1 % 2) * 512 * sizeof(nl.bfloat16) = 1024  
(2, ) | 0 | 0 + (2 % 2) * 512 * sizeof(nl.bfloat16) = 0  
(3, ) | 0 | 0 + (3 % 2) * 512 * sizeof(nl.bfloat16) = 1024  
  
With the above scheme, we are able to implement double buffering in `nki_tensor`, such that `nl.load` in one iteration can write to one physical tile while `nl.exp` of the previous iteration can read from the other physical tile simultaneously.
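
The allocation table can be reproduced on the host by evaluating the modulo allocation function directly. The sketch below restates the helper functions so it runs on its own; `PDIM_SIZE` and `FDIM_SIZE` are assumed values (128 partitions per tile, and 512 bfloat16 elements at 2 bytes each along the free dimension) chosen to match the example tensor.

```python
from functools import reduce
from operator import mul

def num_elms(shape):
    return reduce(mul, shape, 1)

def linearize(shape, indices):
    return sum(i * num_elms(shape[dim + 1:]) for dim, i in enumerate(indices))

def modulo_allocate_func(base, allocate_shape, scale):
    def func(indices):
        # An empty allocate_shape means a single physical tile per dimension.
        shape = allocate_shape if allocate_shape else (1,) * len(indices)
        mod_idx = tuple(i % s for i, s in zip(indices, shape))
        return linearize(shape=shape, indices=mod_idx) * scale + base
    return func

# Assumed sizes for the example tensor (see lead-in).
PDIM_SIZE = 128
FDIM_SIZE = 512 * 2

partition_of = modulo_allocate_func(base=0, allocate_shape=(), scale=PDIM_SIZE)
byte_addr_of = modulo_allocate_func(base=0, allocate_shape=(2,), scale=FDIM_SIZE)

table = [(idx, partition_of(idx), byte_addr_of(idx))
         for idx in [(0,), (1,), (2,), (3,)]]
for row in table:
    print(row)
```

Logical tiles 0 and 2 map to byte address 0 while tiles 1 and 3 map to byte address 1024, which is the alternation that makes double buffering possible.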

---

### skip_middle_end_transformations

`skip_middle_end_transformations` | Skip all middle end transformations on the kernel  

---

### enable_stack_allocator

`enable_stack_allocator` | Use stack allocator to allocate the psum and sbuf tensors in the kernel.  

---

### force_auto_alloc

`force_auto_alloc` | Force automatic allocation to be turned on in the kernel.  

## Indexing and Slicing

### ds

`ds` | Construct a dynamic slice for simple tensor indexing.  

---

### arange

`arange` | Return contiguous values within a given interval, used for indexing a tensor to define a tile.  

---

### mgrid

`mgrid` | Same as NumPy mgrid: "An instance which returns a dense (or fleshed out) mesh-grid when indexed, so that each returned argument has the same shape."  

---

### expand_dims

`expand_dims` | Expand the shape of a tile.  

---

### nki.language.arange

nki.language.arange(*args)
    

Return contiguous values within a given interval, used for indexing a tensor to define a tile.

(Similar to `numpy.arange`.)

arange can be called as:
    

  * `arange(stop)`: Values are generated within the half-open interval `[0, stop)` (the interval including zero, excluding stop).

  * `arange(start, stop)`: Values are generated within the half-open interval `[start, stop)` (the interval including start, excluding stop).
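
The two call forms follow the same half-open convention as Python's built-in `range`, which is used here only as a host-side stand-in. This sketch does not call `nki.language.arange` itself.

```python
stop_only = list(range(4))       # arange(stop): values in [0, 4)
start_stop = list(range(2, 5))   # arange(start, stop): values in [2, 5)
print(stop_only, start_stop)
```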

---

### par_dim

`par_dim` | Mark a dimension explicitly as a partition dimension.  

---

### nki.language.par_dim

nki.language.par_dim = Ellipsis
    

Mark a dimension explicitly as a partition dimension.

## Arithmetic Operations

### add

`add` | Add the inputs, element-wise.  

---

### subtract

`subtract` | Subtract the inputs, element-wise.  

---

### multiply

`multiply` | Multiply the inputs, element-wise.  

---

### divide

`divide` | Divide the inputs, element-wise.  

---

### power

`power` | Elements of x raised to powers of y, element-wise.  

---

### mod

`mod` | Integer Mod of `x / y`, element-wise  

---

### fmod

`fmod` | Floor-mod of `x / y`, element-wise.  

---

### abs

`abs` | Absolute value of the input, element-wise.  

---

### negative

`negative` | Numerical negative of the input, element-wise.  

---

### sign

`sign` | Sign of the numbers of the input, element-wise.  

---

### square

`square` | Square of the input, element-wise.  

---

### reciprocal

`reciprocal` | Compute reciprocal of each element in the input `data` tile using Vector Engine.  

### reciprocal

`reciprocal` | Reciprocal of the input, element-wise.  

## Math Functions

### Supported Math Operators for NKI ISA

The table below lists all the mathematical operator primitives supported by NKI. Many nki.isa APIs (instructions) allow programmable operators through the `op` field. The supported operators fall into two categories: bitvec and arithmetic. In general, instructions using bitvec operators expect integer data types and treat input elements as bit patterns. Instructions using arithmetic operators, on the other hand, accept any valid NKI data type and convert input elements into float32 before performing the operation.

Table 7 Supported Math Operators by NKI ISA

Category | Operator | `op` | Legal Reduction `op` | Supported Engine  
---|---|---|---|---  
Bitvec | Bitwise Not | `nki.language.invert` | N | Vector  
Bitvec | Bitwise And | `nki.language.bitwise_and` | Y | Vector  
Bitvec | Bitwise Or | `nki.language.bitwise_or` | Y | Vector  
Bitvec | Bitwise Xor | `nki.language.bitwise_xor` | Y | Vector  
Bitvec | Arithmetic Shift Left | `nki.language.left_shift` | N | Vector  
Bitvec | Arithmetic Shift Right | Not supported | N | Vector  
Bitvec | Logical Shift Left | `nki.language.left_shift` | N | Vector  
Bitvec | Logical Shift Right | `nki.language.right_shift` | N | Vector  
Arithmetic | Add | `nki.language.add` | Y | Vector/GpSIMD/Scalar  
Arithmetic | Subtract | `nki.language.subtract` | Y | Vector  
Arithmetic | Multiply | `nki.language.multiply` | Y | Vector/GpSIMD/Scalar  
Arithmetic | Max | `nki.language.maximum` | Y | Vector  
Arithmetic | Min | `nki.language.minimum` | Y | Vector  
Arithmetic | Is Equal to | `nki.language.equal` | N | Vector  
Arithmetic | Is Not Equal to | `nki.language.not_equal` | N | Vector  
Arithmetic | Is Greater than or Equal to | `nki.language.greater_equal` | N | Vector  
Arithmetic | Is Greater than | `nki.language.greater` | N | Vector  
Arithmetic | Is Less than or Equal to | `nki.language.less_equal` | N | Vector  
Arithmetic | Is Less than | `nki.language.less` | N | Vector  
Arithmetic | Logical Not | `nki.language.logical_not` | N | Vector  
Arithmetic | Logical And | `nki.language.logical_and` | Y | Vector  
Arithmetic | Logical Or | `nki.language.logical_or` | Y | Vector  
Arithmetic | Logical Xor | `nki.language.logical_xor` | Y | Vector  
Arithmetic | Reverse Square Root | `nki.language.rsqrt` | N | GpSIMD/Scalar  
Arithmetic | Reciprocal | `nki.language.reciprocal` | N | Vector/Scalar  
Arithmetic | Absolute | `nki.language.abs` | N | Vector/Scalar  
Arithmetic | Power | `nki.language.power` | N | GpSIMD  
  
Note: Add and Multiply are supported on Scalar Engine only from NeuronCore-v3. 32-bit integer Add and Multiply are only supported on GpSIMD Engine.
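
The difference between the two operator categories can be seen on the host with a small sketch. It assumes a simple model in which arithmetic operators round each input to float32 and compute in float32, while bitvec operators act on the integer bit patterns unchanged; `to_float32` is a hypothetical helper built on the standard `struct` module, not an NKI call.

```python
import struct

def to_float32(value):
    """Round a Python number to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]

a, b = 16777217, 1  # 2**24 + 1 is not exactly representable in float32

# Bitvec-style: operate on the integer bit pattern directly.
bitwise = a & 0xFF
exact_sum = a + b

# Arithmetic-style (assumed model): convert inputs to float32, then add in float32.
float32_sum = to_float32(to_float32(a) + to_float32(b))
print(bitwise, exact_sum, float32_sum)
```

In this model the float32 path loses the low bit of a large integer input, whereas the bitwise path preserves every bit.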

---

### exp

`exp` | Exponential of the input, element-wise.  

---

### log

`log` | Natural logarithm of the input, element-wise.  

---

### cos

`cos` | Cosine of the input, element-wise.  

---

### sin

`sin` | Sine of the input, element-wise.  

---

### tan

`tan` | Tangent of the input, element-wise.  

---

### tanh

`tanh` | Hyperbolic tangent of the input, element-wise.  

---

### arctan

`arctan` | Inverse tangent of the input, element-wise.  

---

### sqrt

`sqrt` | Non-negative square-root of the input, element-wise.  

---

### rsqrt

`rsqrt` | Reciprocal of the square-root of the input, element-wise.  

---

### erf

`erf` | Error function of the input, element-wise.  

---

### erf_dx

`erf_dx` | Derivative of the Error function (erf) on the input, element-wise.  

---

### trunc

`trunc` | Truncated value of the input, element-wise.  

---

### floor

`floor` | Floor of the input, element-wise.  

---

### ceil

`ceil` | Ceiling of the input, element-wise.  

## Activation Functions

### Supported Activation Functions for NKI ISA

The table below lists all the activation functions supported by the `nki.isa.activation` API. These activation functions are approximated with piece-wise polynomials on Scalar Engine. Note: if input values fall outside the valid input range listed below, the Scalar Engine will generate invalid output results.

Table 8 Supported Activation Functions by NKI ISA

Function Name | Accepted `op` by Scalar Engine | Valid Input Range  
---|---|---  
Identity | `nki.language.copy` or `numpy.copy` | `[-inf, inf]`  
Square | `nki.language.square` or `numpy.square` | `[-inf, inf]`  
Sigmoid | `nki.language.sigmoid` | `[-inf, inf]`  
Relu | `nki.language.relu` | `[-inf, inf]`  
Gelu | `nki.language.gelu` | `[-inf, inf]`  
Gelu Derivative | `nki.language.gelu_dx` | `[-inf, inf]`  
Gelu with Tanh Approximation | `nki.language.gelu_apprx_tanh` | `[-inf, inf]`  
Gelu with Sigmoid Approximation | `nki.language.gelu_apprx_sigmoid` | `[-inf, inf]`  
Silu | `nki.language.silu` | `[-inf, inf]`  
Silu Derivative | `nki.language.silu_dx` | `[-inf, inf]`  
Tanh | `nki.language.tanh` or `numpy.tanh` | `[-inf, inf]`  
Softplus | `nki.language.softplus` | `[-inf, inf]`  
Mish | `nki.language.mish` | `[-inf, inf]`  
Erf | `nki.language.erf` | `[-inf, inf]`  
Erf Derivative | `nki.language.erf_dx` | `[-inf, inf]`  
Exponential | `nki.language.exp` or `numpy.exp` | `[-inf, inf]`  
Natural Log | `nki.language.log` or `numpy.log` | `[2^-64, 2^64]`  
Sine | `nki.language.sin` or `numpy.sin` | `[-PI, PI]`  
Arctan | `nki.language.arctan` or `numpy.arctan` | `[-PI/2, PI/2]`  
Square Root | `nki.language.sqrt` or `numpy.sqrt` | `[2^-116, 2^118]`  
Reverse Square Root | `nki.language.rsqrt` | `[2^-87, 2^97]`  
Reciprocal | `nki.language.reciprocal` or `numpy.reciprocal` | `±[2^-42, 2^42]`  
Sign | `nki.language.sign` or `numpy.sign` | `[-inf, inf]`  
Absolute | `nki.language.abs` or `numpy.abs` | `[-inf, inf]`  
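
A host-side guard based on the table can catch out-of-range inputs before they reach the Scalar Engine. The sketch below copies a few ranges from the table into a hypothetical `VALID_RANGES` dictionary; neither the dictionary nor `in_valid_range` is part of NKI.

```python
import math

# A few (low, high) ranges copied from the table above.
VALID_RANGES = {
    "log": (2.0 ** -64, 2.0 ** 64),
    "sin": (-math.pi, math.pi),
    "sqrt": (2.0 ** -116, 2.0 ** 118),
    "rsqrt": (2.0 ** -87, 2.0 ** 97),
}

def in_valid_range(op_name, values):
    """Return True if every value lies inside the listed range for op_name."""
    low, high = VALID_RANGES[op_name]
    return all(low <= v <= high for v in values)

log_ok = in_valid_range("log", [1e-3, 1.0, 1e6])
log_bad = in_valid_range("log", [0.0, 1.0])   # 0.0 is below 2^-64
sin_bad = in_valid_range("sin", [4.0])        # 4.0 is above PI
print(log_ok, log_bad, sin_bad)
```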
  

---

### sigmoid

`sigmoid` | Logistic sigmoid activation function on the input, element-wise.  

---

### relu

`relu` | Rectified Linear Unit activation function on the input, element-wise.  

---

### gelu

`gelu` | Gaussian Error Linear Unit activation function on the input, element-wise.  

---

### gelu_dx

`gelu_dx` | Derivative of Gaussian Error Linear Unit (gelu) on the input, element-wise.  

---

### gelu_apprx_tanh

`gelu_apprx_tanh` | Gaussian Error Linear Unit activation function on the input, element-wise, with tanh approximation.  

---

### gelu_apprx_sigmoid

`gelu_apprx_sigmoid` | Gaussian Error Linear Unit activation function on the input, element-wise, with sigmoid approximation.  


---

### nki.language.gelu_apprx_sigmoid

nki.language.gelu_apprx_sigmoid(x, *, dtype=None, mask=None, **kwargs)
    

Gaussian Error Linear Unit activation function on the input, element-wise, with sigmoid approximation.

Parameters:
    

  * x – a tile.

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:
    

a tile that has gelu of `x`.

---

### silu

`silu` | Sigmoid Linear Unit activation function on the input, element-wise.  

---

### silu_dx

`silu_dx` | Derivative of Sigmoid Linear Unit activation function on the input, element-wise.  

---

### softplus

`softplus` | Softplus activation function on the input, element-wise.  

---

### mish

`mish` | Mish activation function on the input, element-wise.  

---

### softmax

`softmax` | Softmax activation function on the input, element-wise.  

---

### nki.isa.activation

nki.isa.activation(op, data, *, bias=None, scale=1.0, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, mask=None, dtype=None, **kwargs)
    

Apply an activation function on every element of the input tile using Scalar Engine. The activation function is specified in the `op` input field (see Supported Activation Functions for NKI ISA for a list of supported activation functions and their valid input ranges).

The activation instruction can optionally multiply the input `data` by a scalar or vector `scale` and then add another vector `bias` before the activation function is applied, at no additional performance cost:

\[output = f_{act}(data * scale + bias)\]

When the scale is a scalar, it must be a compile-time constant. In this case, the scale is broadcasted to all the elements in the input `data` tile. When the scale/bias is a vector, it must have the same partition axis size as the input `data` tile and only one element per partition. In this case, the element of scale/bias within each partition is broadcasted to elements of the input `data` tile in the same partition.
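
The broadcasting rules above can be summarized with a host-side reference model. This is a sketch only: `activation_ref` is a hypothetical helper, nested lists stand in for tiles, and a per-partition vector is represented as a flat list with one value per partition.

```python
def activation_ref(op, data, bias=None, scale=1.0):
    """Compute op(data * scale + bias) with per-partition broadcast of scale/bias."""
    out = []
    for p, row in enumerate(data):
        # A scalar scale applies to every element; a vector scale applies per partition.
        s = scale[p] if isinstance(scale, list) else scale
        b = bias[p] if bias is not None else 0.0
        out.append([op(x * s + b) for x in row])
    return out

data = [[0.0, 1.0], [2.0, 3.0]]          # 2 partitions x 2 elements
result = activation_ref(lambda v: v * v, data, bias=[1.0, -1.0], scale=2.0)
print(result)
```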

There are 128 registers on the scalar engine for storing reduction results, corresponding to the 128 partitions of the input. The scalar engine can reduce along free dimensions without extra performance penalty, and store the result of reduction into these registers. The reduction is done after the activation function is applied.

\[
\begin{aligned}
output &= f_{act}(data * scale + bias) \\
accu\_registers &= reduce\_op(accu\_registers,\ reduce\_op(output,\ axis=\langle FreeAxis \rangle))
\end{aligned}
\]

These registers are shared between `activation` and `activation_accu` calls, and their state can be controlled via the `reduce_cmd` parameter.

  * `nisa.reduce_cmd.reset`: Reset the accumulators to zero

  * `nisa.reduce_cmd.idle`: Do not use the accumulators

  * `nisa.reduce_cmd.reduce`: Keep accumulating over the current value of the accumulators

  * `nisa.reduce_cmd.reset_reduce`: Reset the accumulators, then immediately accumulate the results of the current instruction into the accumulators
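
The effect of the four commands can be traced with a toy host-side model. This sketch makes several assumptions: the reduction operator is addition, `reset` only clears the registers without accumulating, and `ReduceCmd` and `AccumulatorModel` are hypothetical stand-ins rather than NKI types.

```python
from enum import Enum

class ReduceCmd(Enum):  # hypothetical stand-in for nisa.reduce_cmd
    reset = 0
    idle = 1
    reduce = 2
    reset_reduce = 3

class AccumulatorModel:
    """Toy model of per-partition reduction registers, assuming an add reduction."""

    def __init__(self, num_partitions):
        self.regs = [0.0] * num_partitions

    def step(self, output, cmd):
        if cmd is ReduceCmd.idle:
            return  # registers are not touched
        if cmd in (ReduceCmd.reset, ReduceCmd.reset_reduce):
            self.regs = [0.0] * len(self.regs)
        if cmd in (ReduceCmd.reduce, ReduceCmd.reset_reduce):
            # Reduce each partition along the free axis, then accumulate.
            self.regs = [r + sum(row) for r, row in zip(self.regs, output)]

acc = AccumulatorModel(num_partitions=2)
acc.step([[1, 2], [3, 4]], ReduceCmd.reset_reduce)
after_first = list(acc.regs)
acc.step([[1, 1], [1, 1]], ReduceCmd.reduce)
acc.step([[100, 100], [100, 100]], ReduceCmd.idle)
after_idle = list(acc.regs)
acc.step([[5, 5], [5, 5]], ReduceCmd.reset)
after_reset = list(acc.regs)
print(after_first, after_idle, after_reset)
```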

We can read out the current values stored in the registers by passing a tensor in the `reduce_res` argument. Reading out the accumulators incurs a small overhead.

Note that `activation_accu` can also change the state of the registers. It is the user's responsibility to ensure correct ordering. We recommend not mixing `activation_accu` and `activation` when `reduce_cmd` is not set to idle.

Note, the Scalar Engine always performs the math operations in float32 precision. Therefore, the engine automatically casts the input `data` tile to float32 before performing multiply/add/activate specified in the activation instruction. The engine is also capable of casting the float32 math results into another output data type specified by the `dtype` field at no additional performance cost. If `dtype` field is not specified, Neuron Compiler will set output data type of the instruction to be the same as input data type of `data`. On the other hand, the `scale` parameter must have a float32 data type, while the `bias` parameter can be float32/float16/bfloat16.

The input `data` tile can be an SBUF or PSUM tile. Similarly, the instruction can write the output tile into either SBUF or PSUM, which is specified using the `buffer` field. If not specified, `nki.language.sbuf` is selected by default.

Estimated instruction cost:

`max(MIN_II, N)` Scalar Engine cycles, where

  * `N` is the number of elements per partition in `data`.

  * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * op – an activation function (see Supported Activation Functions for NKI ISA for supported functions)

  * data – the input tile; layout: (partition axis <= 128, free axis)

  * bias – a vector with the same partition axis size as `data` for broadcast add (after broadcast multiply with `scale`)

  * scale – a scalar or a vector with the same partition axis size as `data` for broadcast multiply

  * reduce_op – the reduce operation to perform on the free dimension of the activation result

  * reduce_res – a tile of shape `(data.shape[0], 1)`, where data.shape[0] is the partition axis size of the input `data` tile. The result of `sum(ReductionResult)` is written in-place into the tensor.

  * reduce_cmd – an enum member from `nisa.reduce_cmd` to control the state of reduction registers

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:
    

output tile of the activation instruction; layout: same as input `data` tile

Example:
    
    
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.isa as nisa
    import numpy as np
    
    ##################################################################
    # Example 1: perform exponential function on matrix a of shape (128, 1024)
    ##################################################################
    a = nl.load(a_tensor)
    activated_a = nisa.activation(op=nl.exp, data=a)
    nl.store(a_act_tensor, activated_a)
    
    ##################################################################
    # Example 2: perform the following operations to matrix b of shape (128, 512)
    # using a single activation instruction:
    # 1) compute `np.square(b * 2.0 + c)`
    # 2) cast 1) results into bfloat16
    ##################################################################
    b = nl.load(b_tensor)
    c = nl.load(c_tensor)
    activated_b = nisa.activation(op=np.square, data=b, bias=c, scale=2.0,
                                  dtype=nl.bfloat16)
    nl.store(b_act_tensor, activated_b)
    

---

### activation_reduce

`activation_reduce` | Perform the same computation as `nisa.activation` and also a reduction along the free dimension of the `nisa.activation` result using Scalar Engine.  

## Reduction Operations

### max

`max` | Maximum of elements along the specified axis (or axes) of the input.  

---

### min

`min` | Minimum of elements along the specified axis (or axes) of the input.  

---

### mean

`mean` | Arithmetic mean along the specified axis (or axes) of the input.  

---

### var

`var` | Variance along the specified axis (or axes) of the input.  

---

### sum

`sum` | Sum of elements along the specified axis (or axes) of the input.  

---

### prod

`prod` | Product of elements along the specified axis (or axes) of the input.  

---

### all

`all` | Whether all elements along the specified axis (or axes) evaluate to True.  

---

### maximum

`maximum` | Maximum of the inputs, element-wise.  

---

### minimum

`minimum` | Minimum of the inputs, element-wise.  

---

### nki.isa.tensor_reduce

nki.isa.tensor_reduce(op, data, axis, *, mask=None, dtype=None, negate=False, keepdims=False, **kwargs)
    

Apply a reduction operation to the free axes of an input `data` tile using Vector Engine.

The reduction operator is specified in the `op` input field (see Supported Math Operators for NKI ISA for a list of supported reduction operators). There are two types of reduction operators: 1) bitvec operators (e.g., bitwise_and, bitwise_or) and 2) arithmetic operators (e.g., add, subtract, multiply). For bitvec operators, the input/output data types must be integer types and Vector Engine treats all input elements as bit patterns without any data type casting. For arithmetic operators, there is no restriction on the input/output data types, but the engine automatically casts input data types to float32 and performs the reduction operation in float32 math. The float32 reduction results are cast to the target data type specified in the `dtype` field before being written into the output tile. If the `dtype` field is not specified, it defaults to the data type of the input tile.

When the reduction `op` is an arithmetic operator, the instruction can also multiply the output reduction results by `-1.0` before writing into the output tile, at no additional performance cost. This behavior is controlled by the `negate` input field.

The reduction axes are specified in the `axis` field using a list of integer(s) to indicate axis indices. The reduction axes can contain up to four free axes and must start at the most minor free axis. Since axis 0 is the partition axis in a tile, the reduction axes must contain axis 1 (most-minor). In addition, the reduction axes must be consecutive: e.g., [1, 2, 3, 4] is a legal `axis` field, but [1, 3, 4] is not.
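
The axis rules above can be checked on the host before launching a kernel. The following sketch uses a hypothetical helper, `is_legal_reduce_axis`, and assumes the axes are listed in ascending order; it is not part of NKI.

```python
def is_legal_reduce_axis(axis):
    """Check: one to four free axes, starting at axis 1, consecutive."""
    axes = [axis] if isinstance(axis, int) else list(axis)
    if not 1 <= len(axes) <= 4:
        return False
    # Legal fields are exactly [1], [1, 2], [1, 2, 3], or [1, 2, 3, 4].
    return axes == list(range(1, len(axes) + 1))

checks = {
    "[1, 2, 3, 4]": is_legal_reduce_axis([1, 2, 3, 4]),
    "[1, 3, 4]": is_legal_reduce_axis([1, 3, 4]),
    "1": is_legal_reduce_axis(1),
    "[0, 1]": is_legal_reduce_axis([0, 1]),
}
print(checks)
```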

Since this instruction only supports free axes reduction, the output tile must have the same partition axis size as the input `data` tile. To perform a partition axis reduction, we can either:

  1. invoke a `nki.isa.nc_transpose` instruction on the input tile and then apply this `tensor_reduce` instruction to the transposed tile, or

  2. invoke `nki.isa.nc_matmul` instructions to multiply a `nki.language.ones([128, 1], dtype=data.dtype)` vector with the input tile.
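
The second option works because multiplying by a vector of ones sums over the contracted axis. The host-side sketch below models this as `ones^T @ data` with nested lists; the helper `partition_sum_via_ones` is hypothetical, and the exact operand layout expected by `nki.isa.nc_matmul` is not shown here.

```python
def partition_sum_via_ones(data):
    """Sum a [P, F] tile over the partition axis by computing ones^T @ data."""
    num_partitions = len(data)
    num_free = len(data[0])
    ones = [[1.0] for _ in range(num_partitions)]   # shape [P, 1]
    # out[0][f] = sum over p of ones[p][0] * data[p][f]
    return [[sum(ones[p][0] * data[p][f] for p in range(num_partitions))
             for f in range(num_free)]]

data = [[1.0, 2.0, 3.0],
        [10.0, 20.0, 30.0],
        [100.0, 200.0, 300.0]]     # 3 partitions x 3 free elements
col_sums = partition_sum_via_ones(data)
print(col_sums)
```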

Estimated instruction cost:

Cost (Vector Engine Cycles) | Condition  
---|---  
`N/2` | both input and output data types are `bfloat16` and the reduction operator is add or maximum  
`N` | otherwise  
  
where,

  * `N` is the number of elements per partition in `data`.

  * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * op – the reduction operator (see Supported Math Operators for NKI ISA for supported reduction operators)

  * data – the input tile to be reduced

  * axis – int or tuple/list of ints. The axis (or axes) along which to operate; must be free dimensions, not partition dimension (0); can only be the last contiguous dim(s) of the tile: `[1], [1,2], [1,2,3], [1,2,3,4]`

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * negate – if True, reduction result is multiplied by `-1.0`; only applicable when op is an arithmetic operator

  * keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

Returns:
    

output tile of the reduction result

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import numpy as np
    ...
    
    ##################################################################
    # Example 1: reduce add tile a of shape (128, 512)
    # in the free dimension and return
    # reduction result in tile b of shape (128, 1)
    ##################################################################
    i_p_a = nl.arange(128)[:, None]
    i_f_a = nl.arange(512)[None, :]
    
    b = nisa.tensor_reduce(np.add, a[i_p_a, i_f_a], axis=[1])
    

---

### tensor_reduce

`tensor_reduce` | Apply a reduction operation to the free axes of an input `data` tile using Vector Engine.  

---

### tensor_partition_reduce

`tensor_partition_reduce` | Apply a reduction operation across partitions of an input `data` tile using GpSimd Engine.  

---

### tensor_scalar_reduce

`tensor_scalar_reduce` | Perform the same computation as `nisa.tensor_scalar` with one math operator and also a reduction along the free dimension of the `nisa.tensor_scalar` result using Vector Engine.  

---

### loop_reduce

`loop_reduce` | Apply reduce operation over a loop.  

## Normalization and Statistics

### rms_norm

`rms_norm` | Apply Root Mean Square Layer Normalization.  

---

### nki.isa.bn_stats

nki.isa.bn_stats(data, *, mask=None, dtype=None, **kwargs)
    

Compute mean- and variance-related statistics for each partition of an input tile `data` in parallel using Vector Engine.

The output tile of the instruction has 6 elements per partition:

  * the `count` of the even elements (of the input tile elements from the same partition)

  * the `mean` of the even elements

  * `variance * count` of the even elements

  * the `count` of the odd elements

  * the `mean` of the odd elements

  * `variance * count` of the odd elements

To get the final mean and variance of the input tile, pass the `bn_stats` output to the `bn_aggr` instruction, which outputs two elements per partition:

  * mean (of the original input tile elements from the same partition)

  * variance

Due to a hardware limitation, the number of elements per partition (i.e., free dimension size) of the input `data` must not exceed 512 (`nl.tile_size.bn_stats_fmax`). To calculate per-partition mean/variance of a tensor with more than 512 elements in the free dimension, we can invoke `bn_stats` instructions on each 512-element tile and use a single `bn_aggr` instruction to aggregate `bn_stats` outputs from all the tiles. Refer to Example 2 for an example implementation.
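The split-then-aggregate scheme can be modelled in NumPy to see why per-tile statistics are enough to recover the overall mean and variance. This is a sketch under assumptions, not the hardware algorithm: `bn_stats_ref` and `bn_aggr_ref` are hypothetical names, "even elements" is assumed to mean free-axis indices 0, 2, 4, ..., the variance is assumed to be the population variance, and the aggregation uses the standard pooled-variance formula.

```python
import numpy as np

def bn_stats_ref(data):
    """Model of the 6-element output for a (P, N) array with 2 <= N <= 512."""
    data = np.asarray(data, dtype=np.float32)
    if data.shape[1] > 512:
        raise ValueError("at most 512 elements per partition")
    fields = []
    for part in (data[:, 0::2], data[:, 1::2]):        # assumed even / odd split
        count = np.full(part.shape[0], part.shape[1], dtype=np.float32)
        mean = part.mean(axis=1)
        m2 = ((part - mean[:, None]) ** 2).sum(axis=1)  # variance * count
        fields += [count, mean, m2]
    return np.stack(fields, axis=1)                     # shape (P, 6)

def bn_aggr_ref(stats):
    """Combine one or more concatenated bn_stats_ref outputs into (mean, variance)."""
    stats = np.asarray(stats, dtype=np.float64)
    groups = stats.reshape(stats.shape[0], -1, 3)       # (P, groups, [count, mean, m2])
    count, mean, m2 = groups[..., 0], groups[..., 1], groups[..., 2]
    total = count.sum(axis=1)
    grand_mean = (count * mean).sum(axis=1) / total
    grand_m2 = (m2 + count * (mean - grand_mean[:, None]) ** 2).sum(axis=1)
    return np.stack([grand_mean, grand_m2 / total], axis=1)

rng = np.random.default_rng(0)
b = rng.standard_normal((4, 1024)).astype(np.float32)
stats_b = np.concatenate(
    [bn_stats_ref(b[:, i:i + 512]) for i in range(0, 1024, 512)], axis=1
)
mean_var_b = bn_aggr_ref(stats_b)                       # shape (4, 2)
```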

Vector Engine performs the above statistics calculation in float32 precision. Therefore, the engine automatically casts the input `data` tile to float32 before performing float32 computation and is capable of casting the float32 computation results into another data type specified by the `dtype` field, at no additional performance cost. If `dtype` field is not specified, the instruction will cast the float32 results back to the same data type as the input `data` tile.

Estimated instruction cost:

`max(MIN_II, N)` Vector Engine cycles, where `N` is the number of elements per partition in `data` and `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * data – the input tile (up to 512 elements per partition)

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

an output tile with 6-element statistics per partition

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ##################################################################
    # Example 1: Calculate the mean and variance for each partition
    # of tile a with shape (128, 128)
    ##################################################################
    a: tensor[128, 128] = nl.load(a_tensor)
    stats_a: tensor[128, 6] = nisa.bn_stats(a)
    mean_var_a: tensor[128, 2] = nisa.bn_aggr(stats_a)
    
    # Extract mean and variance
    mean_a = mean_var_a[:, 0]
    var_a = mean_var_a[:, 1]
    nl.store(mean_a_tensor, mean_a)
    nl.store(var_a_tensor, var_a)
    
    ##################################################################
    # Example 2: Calculate the mean and variance for each partition of
    # tile b with shape [128, 1024]
    ##################################################################
    b: tensor[128, 1024] = nl.load(b_tensor)
    
    # Run bn_stats in two tiles because b has 1024 elements per partition,
    # but bn_stats has a limitation of nl.tile_size.bn_stats_fmax
    # Initialize a bn_stats output tile with shape of [128, 6*2] to
    # hold outputs of two bn_stats instructions
    stats_b = nl.ndarray((128, 6 * 2), dtype=nl.float32)
    bn_tile = nl.tile_size.bn_stats_fmax
    ix, iy = nl.mgrid[0:128, 0:bn_tile]
    iz, iw = nl.mgrid[0:128, 0:6]
    
    for i in range(1024 // bn_tile):
      stats_b[iz, i * 6 + iw] = nisa.bn_stats(b[ix, i * bn_tile + iy], dtype=nl.float32)
    
    mean_var_b = nisa.bn_aggr(stats_b)
    
    # Extract mean and variance
    mean_b = mean_var_b[:, 0]
    var_b = mean_var_b[:, 1]
    
    nl.store(mean_b_tensor, mean_b)
    nl.store(var_b_tensor, var_b)
    

---

### bn_stats

`bn_stats` | Compute mean- and variance-related statistics for each partition of an input tile `data` in parallel using Vector Engine.  

---

### bn_aggr

`bn_aggr` | Aggregate one or multiple `bn_stats` outputs to generate a mean and variance per partition using Vector Engine.  

---

### nki.isa.dropout

nki.isa.dropout(data, prob, *, mask=None, dtype=None, **kwargs)
    

Randomly replace some elements of the input tile `data` with zeros based on input probabilities using Vector Engine. The probability of replacing input elements with zeros (i.e., drop probability) is specified using the `prob` field:

  * If the probability is 1.0, all elements are replaced with zeros.

  * If the probability is 0.0, all elements are kept with their original values.

The `prob` field can be a scalar constant or a tile of shape `(data.shape[0], 1)`, where each partition contains one drop probability value. The drop probability value in each partition is applicable to the input `data` elements from the same partition only.
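The per-partition behaviour of `prob` can be pictured with a NumPy model. This is only a sketch: `dropout_ref` is a hypothetical name, and the random number generation shown here is an assumption, since the hardware's random source is not described in this section.

```python
import numpy as np

def dropout_ref(data, prob, rng=None):
    """NumPy model: zero each element with the drop probability of its partition."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    prob = np.asarray(prob, dtype=np.float32)
    if prob.ndim != 0 and prob.shape != (data.shape[0], 1):
        raise ValueError("prob must be a scalar or have shape (data.shape[0], 1)")
    # rng.random() is uniform in [0, 1): prob=1.0 never keeps, prob=0.0 always keeps.
    keep = rng.random(data.shape) >= prob
    return np.where(keep, data, np.zeros((), dtype=data.dtype))

a = np.ones((4, 8), dtype=np.float32)
per_partition = np.array([[0.0], [1.0], [0.0], [1.0]], dtype=np.float32)
c = dropout_ref(a, per_partition, rng=np.random.default_rng(0))
```

In this model, rows 0 and 2 of `c` keep their values and rows 1 and 3 are all zeros, because each row uses only its own drop probability.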

Data type of the input `data` tile can be any valid NKI data types (see Supported Data Types for more information). However, data type of `prob` has restrictions based on the data type of `data`:

  * If data type of `data` is any of the integer types (e.g., int32, int16), `prob` data type must be float32

  * If the data type of `data` is any of the float types (e.g., float32, bfloat16), the `prob` data type can be any valid float type

The output data type of this instruction is specified by the `dtype` field. If the input data type is any of the integer types, the output data type must match the data type of `data`. Otherwise, the output data type can be any valid NKI data type. If the output data type is not specified, it defaults to the input data type.

Estimated instruction cost:

`max(MIN_II, N)` Vector Engine cycles, where `N` is the number of elements per partition in `data`, and `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * data – the input tile

  * prob – a scalar or a tile of shape `(data.shape[0], 1)` to indicate the probability of replacing elements with zeros

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

an output tile of the dropout result

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ###########################################################################
    # Example 1: From an input tile a of shape [128, 512], dropout its values
    # with probabilities in tile b of shape [128, 1] and store the result in c.
    ###########################################################################
    a: tensor[128, 512] = nl.load(a_tensor)
    b: tensor[128, 1] = nl.load(b_tensor)
    
    c: tensor[128, 512] = nisa.dropout(a, prob=b)
    
    nl.store(c_tensor, c)
    
    ######################################################
    # Example 2: From an input tile a, dropout its values 
    # with probability of 0.2 and store the result in b.
    ######################################################
    a = nl.load(in_tensor)
    
    b = nisa.dropout(a, prob=0.2)
    
    nl.store(out_tensor, b)
    

---

### dropout

`dropout` | Randomly replace some elements of the input tile `data` with zeros based on input probabilities using Vector Engine.  

### dropout

`dropout` | Randomly zeroes some of the elements of the input tile given a probability rate.  

## Matrix and Tensor Operations

### matmul

`matmul` | `x @ y` matrix multiplication of `x` and `y`.  

---

### nki.language.matmul

nki.language.matmul(x, y, *, transpose_x=False, mask=None, **kwargs)
    

`x @ y` matrix multiplication of `x` and `y`.

(Similar to numpy.matmul)

Note

For optimal performance on hardware, use `nki.isa.nc_matmul()` or call `nki.language.matmul` with `transpose_x=True`. Use `nki.isa.nc_matmul` also to access low-level features of the Tensor Engine.

Note

Implementation details: `nki.language.matmul` calls `nki.isa.nc_matmul` under the hood. `nc_matmul` is a Neuron-specific implementation of matmul that computes `x.T @ y`; as a result, `matmul(x, y)` lowers to `nc_matmul(transpose(x), y)`. To avoid inserting this extra transpose instruction, pass `x.T` as the first input and set `transpose_x=True`.
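The lowering described in this note can be written out as a NumPy model. The function names below are hypothetical stand-ins for the two APIs and only mirror the arithmetic, not tiling, memory placement, or performance.

```python
import numpy as np

def nc_matmul_ref(stationary, moving):
    # Model of the Tensor Engine contract: the partition axis is the contraction axis.
    return stationary.T @ moving

def matmul_ref(x, y, transpose_x=False):
    if not transpose_x:
        x = x.T            # models the extra transpose inserted by the lowering
    return nc_matmul_ref(x, y)

rng = np.random.default_rng(0)
x = rng.integers(-3, 4, size=(8, 16)).astype(np.float32)
y = rng.integers(-3, 4, size=(16, 32)).astype(np.float32)

plain = matmul_ref(x, y)                            # x @ y, with the extra transpose
pre_transposed = matmul_ref(x.T, y, transpose_x=True)   # same result, no extra transpose
```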

Parameters:
    

  * x – a tile on SBUF (partition dimension `<= 128`, free dimension `<= 128`), `x`’s free dimension must match `y`’s partition dimension.

  * y – a tile on SBUF (partition dimension `<= 128`, free dimension `<= 512`)

  * transpose_x – Defaults to False. If `True`, `x` is treated as already transposed. If `False`, an additional transpose will be inserted to make `x`’s partition dimension the contract dimension of the matmul to align with the Tensor Engine.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:
    

`x @ y` or `x.T @ y` if `transpose_x=True`

---

### nki.isa.nc_matmul

nki.isa.nc_matmul(stationary, moving, *, is_stationary_onezero=False, is_moving_onezero=False, is_transpose=False, tile_position=(), tile_size=(), mask=None, **kwargs)
    

Compute `stationary.T @ moving` matrix multiplication using Tensor Engine.

The `nc_matmul` instruction must read inputs from SBUF and write outputs to PSUM. Therefore, the `stationary` and `moving` must be SBUF tiles, and the result tile is a PSUM tile.

The nc_matmul instruction currently supports `float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32/float32` input data types as listed in Supported Data Types. The matmul accumulation and results are always in float32.

The Tensor Engine imposes special layout constraints on the input tiles. First, the partition axis sizes of the `stationary` and `moving` tiles must be identical and `<=128`, which corresponds to the contraction dimension of the matrix multiplication. Second, the free axis sizes of `stationary` and `moving` tiles must be `<= 128` and `<=512`, respectively. For example, `stationary.shape = (128, 126)`; `moving.shape = (128, 512)` and `nc_matmul(stationary,moving)` returns a tile of `shape = (126, 512)`. For more information about the matmul layout, see Tensor Engine.

Fig. 15 MxKxN Matrix Multiplication Visualization.#

If the contraction dimension of the matrix multiplication exceeds `128`, you may accumulate multiple `nc_matmul` instruction output tiles into the same PSUM tile. See example code snippet below.
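The accumulation over the contraction dimension is ordinary blocked matrix multiplication. The NumPy sketch below models only the arithmetic (each slice of 128 rows stands in for one `nc_matmul` call, and `acc` stands in for the PSUM tile); the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.standard_normal((256, 128)).astype(np.float32)   # contraction dim = 256
e = rng.standard_normal((256, 512)).astype(np.float32)

acc = np.zeros((128, 512), dtype=np.float32)             # plays the role of the PSUM tile
for k in range(0, 256, 128):
    # One 128-row slice per step: stationary.T @ moving, accumulated in float32.
    acc += d[k:k + 128].T @ e[k:k + 128]
```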

Estimated instruction cost:

The Tensor Engine has complex performance characteristics given its data flow and pipeline design. The below formula is the average nc_matmul cost assuming many `nc_matmul` instructions of the same shapes running back-to-back on the engine:

Cost (Tensor Engine Cycles) | Condition  
---|---  
`max(min(64, N_stationary), N_moving)` | input data type is one of `float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32`  
`4 * max(min(64, N_stationary), N_moving)` | input data type is `float32`  
  
where,

  * `N_stationary` is the number of elements per partition in `stationary` tile.

  * `N_moving` is the number of elements per partition in `moving` tile.
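The cost table translates directly into a small helper. `nc_matmul_cycles` is a hypothetical name, and the data type is passed as a plain string purely for illustration.

```python
def nc_matmul_cycles(n_stationary, n_moving, dtype="bfloat16"):
    """Average back-to-back cost in Tensor Engine cycles, per the table above."""
    base = max(min(64, n_stationary), n_moving)
    return 4 * base if dtype == "float32" else base

full_bf16 = nc_matmul_cycles(128, 512)              # max(min(64, 128), 512)
full_fp32 = nc_matmul_cycles(128, 512, "float32")   # four times the bfloat16 cost
small = nc_matmul_cycles(32, 16)                    # max(min(64, 32), 16)
```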

The Tensor Engine, as a systolic array with 128 rows and 128 columns of processing elements (PEs), could be underutilized for small `nc_matmul` instructions, i.e., when the `stationary` tile has a small free axis size or a small partition axis size (e.g., 32 or 64). In such cases, the Tensor Engine allows PE tiling, i.e., executing multiple small `nc_matmul` instructions in parallel on the PE array, to improve compute throughput. PE tiling is enabled by setting `tile_position` and `tile_size`. `tile_position` indicates the starting position (row position, column position) of the PE tile for a `nc_matmul` instruction in the PE array. `tile_size` indicates the size (row size, column size) of the PE tile that the `nc_matmul` instruction occupies, starting from `tile_position`. For example, setting `tile_position` to (0, 0) and `tile_size` to (128, 128) uses the full PE array.

Requirements on `tile_position` and `tile_size` are:

  1. `tile_position` and `tile_size` must be both set to enable PE tiling.

  2. The type of values in `tile_position` and `tile_size` must be integer or affine expression.

  3. Values in `tile_position` and `tile_size` must be multiples of 32.

  4. `tile_size` must be larger than or equal to accessed `stationary` tile size.

  5. Both the row and column sizes in `tile_size` cannot be 32 for NeuronCore-v2.
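Some of these requirements can be checked ahead of time with a small helper. The sketch below is hypothetical (`check_pe_tiling` is not an NKI function) and covers only integer values. It assumes PE rows correspond to the partition axis of `stationary` and PE columns to its free axis, adds an in-bounds check that is not listed above, and leaves out requirement 5.

```python
def check_pe_tiling(tile_position, tile_size, stationary_shape, array_dim=128):
    """Return a list of violated rules; an empty list means none were found."""
    if len(tile_position) != 2 or len(tile_size) != 2:
        return ["tile_position and tile_size must both be 2D tuples"]
    problems = []
    if any(v % 32 for v in (*tile_position, *tile_size)):
        problems.append("values must be multiples of 32")
    # Assumption: (row, column) lines up with (partition axis, free axis) of stationary.
    if tile_size[0] < stationary_shape[0] or tile_size[1] < stationary_shape[1]:
        problems.append("tile_size must cover the stationary tile")
    # Assumption (not stated above): the PE tile has to fit inside the array.
    if any(p + s > array_dim for p, s in zip(tile_position, tile_size)):
        problems.append("tile exceeds the PE array")
    return problems

ok = check_pe_tiling((0, 64), (64, 64), stationary_shape=(64, 64))
too_small = check_pe_tiling((0, 0), (32, 64), stationary_shape=(64, 64))
misaligned = check_pe_tiling((16, 0), (64, 64), stationary_shape=(64, 64))
```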

Parameters:
    

  * stationary – the stationary operand on SBUF; layout: (partition axis `<= 128`, free axis `<= 128`)

  * moving – the moving operand on SBUF; layout: (partition axis `<= 128`, free axis `<= 512`)

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * is_stationary_onezero – hints to the compiler whether the `stationary` operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if `stationary` tile is in float32; the field has no impact for non-float32 `stationary`.

  * is_moving_onezero – hints to the compiler if the `moving` operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if `moving` tile is in float32; the field has no impact for non-float32 `moving`.

  * is_transpose – hints to the compiler that this is a transpose operation with `moving` as an identity matrix.

  * tile_position – a 2D tuple (row, column) for the start PE tile position to run `nc_matmul`.

  * tile_size – a 2D tuple (row, column) for the size of the PE tile that `nc_matmul` occupies, starting from `tile_position`.

Returns:
    

a tile on PSUM that has the result of matrix multiplication of `stationary` and `moving` tiles; layout: partition axis comes from free axis of `stationary`, while free axis comes from free axis of `moving`.

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    
    ##################################################################
    # Example 1:
    # multiply matrix a of shape (128, 128) and matrix b of shape (128, 512)
    # to get matrix c in PSUM of shape (128, 512)
    ##################################################################
    a_mgrid = nl.mgrid[0:128, 0:128]
    b_mgrid = nl.mgrid[0:128, 0:512]
    c_mgrid = nl.mgrid[0:128, 0:512]
    
    a = nl.load(a_tensor[a_mgrid.p, a_mgrid.x])
    b = nl.load(b_tensor[b_mgrid.p, b_mgrid.x])
    
    c_psum = nisa.nc_matmul(a[a_mgrid.p, a_mgrid.x], b[b_mgrid.p, b_mgrid.x])
    
    nl.store(c_tensor[c_mgrid.p, c_mgrid.x], c_psum)
    
    ##################################################################
    # Example 2:
    # multiply matrix d of shape (256, 128) and matrix e of shape (256, 512)
    # to get matrix f in PSUM of shape (128, 512) using psum accumulation
    ##################################################################
    d_mgrid = nl.mgrid[0:128, 0:128]
    e_mgrid = nl.mgrid[0:128, 0:512]
    f_mgrid = nl.mgrid[0:128, 0:512]
    
    f_psum = nl.zeros((128, 512), nl.float32, buffer=nl.psum)
    
    for i_contract in nl.affine_range(2):
      d = nl.load(d_tensor[i_contract * 128 + d_mgrid.p, d_mgrid.x])
      e = nl.load(e_tensor[i_contract * 128 + e_mgrid.p, e_mgrid.x])
      f_psum += nisa.nc_matmul(d[d_mgrid.p, d_mgrid.x], e[e_mgrid.p, e_mgrid.x])
      
    nl.store(f_tensor[f_mgrid.p, f_mgrid.x], f_psum)
    
    ##################################################################
    # Example 3:
    # perform batched matrix multiplication on matrix g of shape (16, 64, 64) 
    # and matrix h of shape (16, 64, 512) to get matrix i of (16, 64, 512) 
    # using Tensor Engine PE tiling mode. 
    ##################################################################
    g_mgrid = nl.mgrid[0:64, 0:64]
    h_mgrid = nl.mgrid[0:64, 0:512]
    i_mgrid = nl.mgrid[0:64, 0:512]
    
    for i in nl.affine_range(4):
      for j in nl.affine_range(4):
        g = nl.load(g_tensor[i * 4 + j, g_mgrid.p, g_mgrid.x])
        h = nl.load(h_tensor[i * 4 + j, h_mgrid.p, h_mgrid.x])
        i_psum = nisa.nc_matmul(g, h, tile_position=((i % 2) * 64, (j % 2) * 64), tile_size=(64, 64))
        nl.store(i_tensor[i * 4 + j, i_mgrid.p, i_mgrid.x], i_psum)
    
    return c_tensor, f_tensor, i_tensor
    

---

### nc_matmul

`nc_matmul` | Compute `stationary.T @ moving` matrix multiplication using Tensor Engine.  

---

### transpose

`transpose` | Transposes a 2D tile between its partition and free dimension.  

---

### nki.language.transpose

nki.language.transpose(x, *, dtype=None, mask=None, **kwargs)
    

Transposes a 2D tile between its partition and free dimension.

Parameters:
    

  * x – 2D input tile

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:
    

a tile that has the values of the input tile with its partition and free dimensions swapped.

---

### nki.isa.nc_transpose

nki.isa.nc_transpose(data, *, mask=None, dtype=None, engine=engine.unknown, **kwargs)
    

Perform a 2D transpose between the partition axis and the free axis of input `data`, i.e., a PF-transpose, using Tensor or Vector Engine. If the `data` tile has more than one free axis, this API implicitly collapses all free axes into one axis and then performs a 2D PF-transpose.
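In NumPy terms, the collapse-then-transpose behaviour looks roughly like the sketch below. `nc_transpose_ref` is a hypothetical name, and the row-major order used to collapse the free axes is an assumption.

```python
import numpy as np

def nc_transpose_ref(data):
    """NumPy model: flatten all free axes into one, then swap partition and free."""
    data = np.asarray(data)
    return data.reshape(data.shape[0], -1).T

tile = np.arange(4 * 2 * 3).reshape(4, 2, 3)   # 4 partitions, free axes (2, 3)
tile_T = nc_transpose_ref(tile)                # shape (6, 4)
```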

In NeuronCore, both Tensor and Vector Engine can perform a PF-transpose, but they support different input shapes. Tensor Engine `nc_transpose` can handle an input tile of shape (128, 128) or smaller, while Vector Engine can handle shape (32, 32) or smaller. Therefore, when the input tile shape is (32, 32) or smaller, we have an option to run it on either engine, which is controlled by the `engine` field. If no `engine` is specified, Neuron Compiler will automatically select an engine based on the input shape. Note, similar to other Tensor Engine instructions, the Tensor Engine `nc_transpose` must read the input tile from SBUF and write the transposed result to PSUM. On the other hand, Vector Engine `nc_transpose` can read/write from/to either SBUF or PSUM.

Note, PF-transpose on Tensor Engine is done by performing a matrix multiplication between `data` as the stationary tensor and an identity matrix as the moving tensor. See architecture guide for more information. On NeuronCore-v2, such matmul-style transpose is not bit-accurate if the input `data` contains NaN/Inf. You may consider replacing NaN/Inf with regular floats (float_max/float_min/zeros) in the input matrix before calling `nc_transpose(engine=nki.isa.constants.engine.tensor)`.
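One plausible way to picture the problem is that multiplying by the zeros of the identity matrix turns an infinity into NaN, which then contaminates a neighbouring output element. The NumPy sketch below models this with explicit products; whether the hardware deviates in exactly this way is an assumption, and `transpose_via_matmul` is a hypothetical name.

```python
import numpy as np

def transpose_via_matmul(data, identity):
    # out[i, j] = sum_k data[k, i] * identity[k, j], written as explicit products
    # so that every term, including inf * 0, is actually evaluated.
    with np.errstate(invalid="ignore"):
        return (data[:, :, None] * identity[:, None, :]).sum(axis=0)

x = np.array([[1.0, np.inf], [2.0, 3.0]], dtype=np.float32)
identity = np.eye(2, dtype=np.float32)

naive = transpose_via_matmul(x, identity)      # element [1, 1] becomes NaN instead of 3
cleaned = np.nan_to_num(x)                     # NaN -> 0, +/-inf -> float32 max/min
safe = transpose_via_matmul(cleaned, identity)
```

After the replacement, the matmul-style transpose in this model matches `cleaned.T` exactly.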

Estimated instruction cost:

Cost (Engine Cycles) | Condition  
---|---  
`max(MIN_II, N)` | `engine` set to `nki.isa.constants.engine.vector`  
`max(P, min(64, F))` | `engine` set to `nki.isa.constants.engine.tensor` and assuming many back-to-back `nc_transpose` of the same shape on Tensor Engine  
  
where,

  * `N` is the number of elements per partition in `data`.

  * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

  * `P` is partition axis size of `data`.

  * `F` is the number of elements per partition in `data`.
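The shape limits and the cost table can be combined into a quick estimate. Both helpers below are hypothetical; engine names are plain strings for illustration, and `MIN_II` is set to the approximate value of 64 cycles quoted above.

```python
MIN_II = 64  # approximate minimum initiation interval, in engine cycles

def eligible_engines(p, f):
    """Engines whose shape limit admits a (p, f) input tile."""
    engines = []
    if p <= 128 and f <= 128:
        engines.append("tensor")
    if p <= 32 and f <= 32:
        engines.append("vector")
    return engines

def nc_transpose_cycles(p, f, engine):
    """Estimated cost in engine cycles, following the table above."""
    if engine == "vector":
        return max(MIN_II, f)
    if engine == "tensor":
        return max(p, min(64, f))
    raise ValueError(f"unknown engine: {engine}")

engines_small = eligible_engines(32, 2)
vector_cost = nc_transpose_cycles(32, 2, "vector")
tensor_cost = nc_transpose_cycles(32, 2, "tensor")
```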

Parameters:
    

  * data – the input tile to be transposed

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – if specified and it’s different from the data type of input tile `data`, an additional nki.isa.cast instruction will be inserted to cast the transposed data into the target `dtype` (see Supported Data Types for more information)

  * engine – specify which engine to use for transpose: `nki.isa.tensor_engine` or `nki.isa.vector_engine` ; by default, the best engine will be selected for the given input tile shape

Returns:
    

a tile with transposed result of input `data` tile

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    
    ##################################################################
    # Example 1: transpose tile a of shape (128, 64)
    ##################################################################
    i_p_a = nl.arange(128)[:, None]
    i_f_a = nl.arange(64)[None, :]
    aT = nisa.nc_transpose(a[i_p_a, i_f_a])
    
    
    ##################################################################
    # Example 2: transpose tile b of shape (32, 2) using Vector Engine
    ##################################################################
    i_p_b = nl.arange(32)[:, None]
    i_f_b = nl.arange(2)[None, :]
    bT = nisa.nc_transpose(b[i_p_b, i_f_b], engine=nisa.vector_engine)
    

---

### nc_transpose

`nc_transpose` | Perform a 2D transpose between the partition axis and the free axis of input `data`, i.e., a PF-transpose, using Tensor or Vector Engine.  

---

### broadcast_to

`broadcast_to` | Broadcast the `src` tile to a new shape based on numpy broadcast rules.  

## Element-wise Tensor Engine Operations

### tensor_tensor

`tensor_tensor` | Perform an element-wise operation of two input tiles using Vector Engine or GpSimd Engine.  

---

### tensor_tensor_scan

`tensor_tensor_scan` | Perform a scan operation of two input tiles using Vector Engine.  

---

### scalar_tensor_tensor

`scalar_tensor_tensor` | Apply up to two math operators using Vector Engine: `(data <op0> operand0) <op1> operand1`.  

---

### tensor_scalar

`tensor_scalar` | Apply up to two math operators to the input `data` tile by broadcasting scalar/vector operands in the free dimension using Vector or Scalar or GpSimd Engine: `(data <op0> operand0) <op1> operand1`.  

## Bitwise Operations

### bitwise_and

`bitwise_and` | Bitwise AND of the two inputs, element-wise.  

---

### bitwise_or

`bitwise_or` | Bitwise OR of the two inputs, element-wise.  

---

### bitwise_xor

`bitwise_xor` | Bitwise XOR of the two inputs, element-wise.  

---

### invert

`invert` | Bitwise NOT of the input, element-wise.  

---

### left_shift

`left_shift` | Bitwise left-shift x by y, element-wise.  

---

### right_shift

`right_shift` | Bitwise right-shift x by y, element-wise.  

## Comparison and Logical Operations

### equal

`equal` | Element-wise boolean result of x == y.  

---

### not_equal

`not_equal` | Element-wise boolean result of x != y.  

---

### greater

`greater` | Element-wise boolean result of x > y.  

---

### greater_equal

`greater_equal` | Element-wise boolean result of x >= y.  

---

### less

`less` | Element-wise boolean result of x < y.  

---

### less_equal

`less_equal` | Element-wise boolean result of x <= y.  

---

### logical_and

`logical_and` | Element-wise boolean result of x AND y.  

---

### logical_or

`logical_or` | Element-wise boolean result of x OR y.  

---

### logical_xor

`logical_xor` | Element-wise boolean result of x XOR y.  

---

### logical_not

`logical_not` | Element-wise boolean result of NOT x.  

## Selection and Conditional Operations

### where

`where` | Return elements chosen from x or y depending on condition.  

---

### nki.isa.select_reduce

nki.isa.select_reduce(*, dst, predicate, on_true, on_false, reduce_res=None, reduce_cmd=reduce_cmd.idle, reduce_op=<function amax>, reverse_pred=False, mask=None, dtype=None, **kwargs)
    

Selectively copy elements from either `on_true` or `on_false` to the destination tile based on a `predicate` using Vector Engine, with optional reduction (max).

The operation can be expressed in NumPy as:
    
    
    # Select:
    predicate = ~predicate if reverse_pred else predicate
    result = np.where(predicate, on_true, on_false)
    
    # With Reduce:
    reduction_result = np.max(result, axis=1, keepdims=True)
    

Memory constraints:

  * Both `on_true` and `predicate` are permitted to be in SBUF

  * Either `on_true` or `predicate` may be in PSUM, but not both simultaneously

  * The destination `dst` can be in either SBUF or PSUM

Shape and data type constraints:

  * `on_true`, `dst`, and `predicate` must have identical shapes (same number of partitions and elements per partition)

  * `on_true` can be any supported dtype except `tfloat32`, `int32`, `uint32`

  * `on_false` dtype must be `float32` if `on_false` is a scalar.

  * `on_false` has to be either scalar or vector of shape `(on_true.shape[0], 1)`

  * `predicate` dtype can be any supported integer type `int8`, `uint8`, `int16`, `uint16`

  * `reduce_res` must be a vector of shape `(on_true.shape[0], 1)`

  * `reduce_res` dtype must be a float type

  * `reduce_op` only supports `max`

Behavior:

  * Where predicate is True: The corresponding elements from `on_true` are copied to `dst`

  * Where predicate is False: The corresponding elements from `on_false` are copied to `dst`

  * When reduction is enabled, the max value from each partition of the `result` is computed and stored in `reduce_res`

Accumulator behavior:

The Vector Engine maintains internal accumulator registers that can be controlled via the `reduce_cmd` parameter:

  * `nisa.reduce_cmd.reset_reduce`: Reset accumulators to -inf, then accumulate the current results

  * `nisa.reduce_cmd.reduce`: Continue accumulating without resetting (useful for multi-step reductions)

  * `nisa.reduce_cmd.idle`: No accumulation performed (default)

Note

Even when `reduce_cmd` is set to `idle`, the accumulator state may still be modified. Always use `reset_reduce` after any operations that ran with `idle` mode to ensure consistent behavior.

Note

The accumulator registers are shared with other Vector Engine accumulation instructions such as `nki.isa.range_select`.
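A multi-step reduction with `reset_reduce` followed by `reduce` can be modelled in NumPy as follows. This is a sketch only: `SelectReduceModel` is a hypothetical class, the commands are plain strings, and the model treats `idle` as leaving the accumulator untouched, which the note above says is not guaranteed on hardware.

```python
import numpy as np

class SelectReduceModel:
    """NumPy model of select plus a per-partition running max accumulator."""

    def __init__(self, partitions):
        self.acc = np.full((partitions, 1), -np.inf, dtype=np.float32)

    def run(self, predicate, on_true, on_false, reduce_cmd="idle", reverse_pred=False):
        pred = np.asarray(predicate).astype(bool)
        if reverse_pred:
            pred = ~pred
        result = np.where(pred, on_true, on_false)
        if reduce_cmd == "reset_reduce":
            self.acc[:] = -np.inf                      # reset before accumulating
        if reduce_cmd in ("reset_reduce", "reduce"):
            self.acc = np.maximum(self.acc, result.max(axis=1, keepdims=True))
        return result, self.acc.copy()

model = SelectReduceModel(partitions=2)
chunk1 = np.array([[1, 5], [2, 0]], dtype=np.float32)
chunk2 = np.array([[7, 3], [1, 1]], dtype=np.float32)
pred1 = np.array([[1, 0], [1, 1]], dtype=np.uint8)
pred2 = np.array([[0, 1], [1, 0]], dtype=np.uint8)

res1, max1 = model.run(pred1, chunk1, -10.0, reduce_cmd="reset_reduce")
res2, max2 = model.run(pred2, chunk2, -10.0, reduce_cmd="reduce")
```

Here `max2` holds the per-partition maximum over both chunks, which is the purpose of the `reduce` command.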

Parameters:
    

  * dst – The destination tile to write the selected values to

  * predicate – Tile that determines which value to select (on_true or on_false)

  * on_true – Tile to select from when predicate is True

  * on_false – Value to use when predicate is False, can be a scalar value or a vector tile of `(on_true.shape[0], 1)`

  * reduce_res – (optional) Tile to store reduction results, must have shape `(on_true.shape[0], 1)`

  * reduce_cmd – (optional) Control accumulator behavior using `nisa.reduce_cmd` values, defaults to idle

  * reduce_op – (optional) Reduction operator to apply (only `np.max` is supported)

  * reverse_pred – (optional) Reverse the meaning of the predicate condition, defaults to False

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Example 1: Basic selection
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    
    ##################################################################
    # Example 1: Basic usage of select_reduce
    # Create source data, predicate, and destination tensors
    ##################################################################
    # Create output tensor for result
    result_tensor = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.hbm)
    
    # Load input data to SBUF
    predicate = nl.load(predicate_data[...])
    on_true = nl.load(on_true_data[...])
    
    # Create destination tensor
    dst = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.sbuf)
    
    # Perform select operation - copy from on_true where predicate is true
    # and set to fp32.min where predicate is false
    nisa.select_reduce(
        dst=dst,
        predicate=predicate,
        on_true=on_true,
        on_false=nl.fp32.min,
    )
    
    # Store result to HBM
    nl.store(result_tensor, value=dst)
    

Example 2: Selection with reduction
    
    
    ##################################################################
    # Example 2: Using select_reduce with reduction
    # Perform selection and compute max reduction per partition
    ##################################################################
    # Create output tensors for results
    result_tensor = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.hbm)
    reduce_tensor = nl.ndarray((on_true_data.shape[0], 1), dtype=nl.float32, buffer=nl.hbm)
    
    # Load input data to SBUF
    predicate = nl.load(predicate_data)
    on_true = nl.load(on_true_data)
    on_false = nl.load(on_false_data)
    
    # Create destination tensor
    dst = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.sbuf)
    
    # Create tensor for reduction results
    reduce_res = nl.ndarray((on_true_data.shape[0], 1), dtype=nl.float32, buffer=nl.sbuf)
    
    # Perform select operation with reduction
    nisa.select_reduce(
        dst=dst,
        predicate=predicate,
        on_true=on_true,
        on_false=on_false,
        reduce_cmd=nisa.reduce_cmd.reset_reduce,
        reduce_res=reduce_res,
        reduce_op=nl.max
    )
    
    # Store results to HBM
    nl.store(result_tensor, value=dst)
    nl.store(reduce_tensor, value=reduce_res)
    

Example 3: Selection with reversed predicate
    
    
    ##################################################################
    # Example 3: Using select_reduce with reverse_pred option
    # Reverse the meaning of the predicate
    ##################################################################
    # Create output tensor for result
    result_tensor = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.hbm)
    
    # Load input data to SBUF
    predicate = nl.load(predicate_data[...])
    on_true = nl.load(on_true_data[...])
    
    # Create destination tensor
    dst = nl.ndarray(on_true_data.shape, dtype=nl.float32, buffer=nl.sbuf)
    
    # Perform select operation with reverse_pred=True
    # This will select on_true where predicate is FALSE
    nisa.select_reduce(
        dst=dst,
        predicate=predicate,
        on_true=on_true,
        on_false=nl.fp32.min,
        reverse_pred=True  # Reverse the meaning of the predicate
    )
    
    # Store result to HBM
    nl.store(result_tensor, value=dst)
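
As a rough NumPy model of what the three examples above compute, the sketch below may help. It is not the hardware implementation: `select_reduce_ref` is a hypothetical helper, it assumes a non-zero predicate element means "true", it assumes `FP32_MIN` equals `nl.fp32.min`, and it always applies a max reduction along the free axis, as in Example 2.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min  # assumed to equal nl.fp32.min


def select_reduce_ref(predicate, on_true, on_false, reverse_pred=False):
    """Hypothetical NumPy model of the select + max-reduce shown above."""
    pred = predicate.astype(bool)      # assumption: non-zero means "true"
    if reverse_pred:
        pred = ~pred                   # Example 3: flip the predicate
    dst = np.where(pred, on_true, on_false).astype(np.float32)
    reduce_res = dst.max(axis=1, keepdims=True)  # one value per partition
    return dst, reduce_res


predicate = np.array([[1, 0], [0, 1]], dtype=np.uint8)
on_true = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

dst, reduce_res = select_reduce_ref(predicate, on_true, FP32_MIN)
dst_rev, reduce_rev = select_reduce_ref(predicate, on_true, FP32_MIN,
                                        reverse_pred=True)
```

With `reverse_pred=True` the same predicate keeps exactly the elements that the default call replaces.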
    

---

### nki.isa.range_select

nki.isa.range_select(*, on_true_tile, comp_op0, comp_op1, bound0, bound1, reduce_cmd=reduce_cmd.idle, reduce_res=None, reduce_op=<function amax>, range_start=0, on_false_value=<property object>, mask=None, dtype=None, **kwargs)
    

Select elements from `on_true_tile` based on comparison with bounds using Vector Engine.

Note

Available only on NeuronCore-v3 and beyond.

For each element in `on_true_tile`, compares its free dimension index + `range_start` against `bound0` and `bound1` using the specified comparison operators (`comp_op0` and `comp_op1`). If both comparisons evaluate to True, copies the element to the output; otherwise uses `on_false_value`.

Additionally performs a reduction operation specified by `reduce_op` on the results, storing the reduction result in `reduce_res`.

Note on numerical stability:

In self-attention, we often have this instruction sequence: `range_select` (VectorE) -> `reduce_res` -> `activation` (ScalarE). When `range_select` outputs a full row of `on_false_value`, caution is needed to avoid NaN in the activation instruction that subtracts `reduce_res` (the max value) from the output of `range_select`:

  * If `dtype` and `reduce_res` are both FP32, we should not hit any NaN issue since `FP32_MIN - FP32_MIN = 0`. Exponentiation on 0 is stable (1.0 exactly).

  * If `dtype` is FP16/BF16/FP8, the `on_false_value` in the output tile will become `-INF` because the hardware downcasts FP32_MIN to the smaller dtype. In this case, you must make sure `reduce_res` uses the FP32 `dtype` to avoid NaN in `activation`. NaN can be avoided because `activation` always upcasts input tiles to FP32 to perform math operations: `-INF - FP32_MIN = -INF`. Exponentiation on `-INF` is stable (0.0 exactly).
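
The two cases above, plus the failure mode they guard against, can be reproduced with plain NumPy. This is only an arithmetic illustration, not NKI code: `FP32_MIN` is assumed to equal `nl.fp32.min`, and FP16 stands in for any of the narrower dtypes.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min  # assumed to equal nl.fp32.min

# A fully masked row: every element is the fill value.
row_fp32 = np.full(4, FP32_MIN, dtype=np.float32)
max_fp32 = row_fp32.max()

# Case 1: output tile and reduce_res are both FP32.
case1 = np.exp(row_fp32 - max_fp32)            # exp(0)

# Case 2: output tile is FP16 (fill value overflows to -inf),
# reduce_res stays FP32 and the subtraction happens in FP32.
with np.errstate(over="ignore"):
    row_fp16 = row_fp32.astype(np.float16)     # all -inf
case2 = np.exp(row_fp16.astype(np.float32) - max_fp32)  # exp(-inf)

# Failure mode: reduce_res is also FP16, so the max is -inf as well.
max_fp16 = row_fp16.max().astype(np.float32)
with np.errstate(invalid="ignore"):
    case3 = np.exp(row_fp16.astype(np.float32) - max_fp16)  # exp(nan)
```

Here `case1` is all ones, `case2` is all zeros, and `case3` is all NaN, which is why `reduce_res` should stay in FP32.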

Constraints:

The comparison operators must be one of:

  * np.equal

  * np.less

  * np.less_equal

  * np.greater

  * np.greater_equal

Partition dim sizes must match across `on_true_tile`, `bound0`, and `bound1`:

  * `bound0` and `bound1` must have one element per partition

  * `on_true_tile` must be one of the FP dtypes, and `bound0/bound1` must be FP32 types.

The comparison with `bound0`, `bound1`, and free dimension index is done in FP32. Make sure `range_start` + free dimension index stays within the 2^24 range, so that every index is exactly representable in FP32.
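
A short NumPy check illustrates why 2^24 is the limit: above it, consecutive integers are no longer distinct in FP32. The helper `index_range_is_safe` is hypothetical, and treating 2^24 itself as allowed is an assumption about how the bound should be read.

```python
import numpy as np

LIMIT = 2 ** 24  # 16,777,216

# 2^24 is exactly representable, but 2^24 + 1 rounds back to 2^24.
limit_is_exact = bool(np.float32(LIMIT) == LIMIT)
next_collides = bool(np.float32(LIMIT + 1) == np.float32(LIMIT))


def index_range_is_safe(range_start, free_size):
    """Hypothetical check that every compared index is exact in FP32."""
    first = range_start
    last = range_start + free_size - 1
    return max(abs(first), abs(last)) <= LIMIT
```

For example, `index_range_is_safe(0, 4096)` holds, while a `range_start` of `2 ** 24` with two or more elements per partition does not.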

Estimated instruction cost:

`max(MIN_II, N)` Vector Engine cycles, where:

  * `N` is the number of elements per partition in `on_true_tile`, and

  * `MIN_II` is the minimum instruction initiation interval for small input tiles.

  * `MIN_II` is roughly 64 engine cycles.
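
The cost estimate above is a simple floor on the cycle count. A minimal sketch, where the function name is hypothetical and 64 is the approximate `MIN_II` quoted above:

```python
def estimated_vector_cycles(elements_per_partition, min_ii=64):
    """Estimated cost: max(MIN_II, N) Vector Engine cycles."""
    return max(min_ii, elements_per_partition)


small_tile_cost = estimated_vector_cycles(16)    # dominated by MIN_II
large_tile_cost = estimated_vector_cycles(512)   # dominated by N
```

Tiles with fewer than about 64 elements per partition therefore cost roughly the same as a 64-element tile.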

Numpy equivalent:
    
    
    indices = np.zeros(on_true_tile.shape)
    indices[:] = range_start + np.arange(on_true_tile[0].size)
    
    mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
    select_out_tile = np.where(mask, on_true_tile, on_false_value)
    reduce_tile = reduce_op(select_out_tile, axis=1, keepdims=True)
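
To make the NumPy equivalent concrete, here is a self-contained version run on a tiny 2x4 tile. `range_select_ref` is a hypothetical wrapper around the code above, and `FP32_MIN` is assumed to equal `nl.fp32.min`.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min  # assumed to equal nl.fp32.min


def range_select_ref(on_true_tile, comp_op0, comp_op1, bound0, bound1,
                     range_start=0, on_false_value=FP32_MIN):
    """Hypothetical wrapper around the NumPy equivalent shown above."""
    indices = np.zeros(on_true_tile.shape, dtype=np.float32)
    indices[:] = range_start + np.arange(on_true_tile[0].size)
    mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
    select_out = np.where(mask, on_true_tile, on_false_value)
    reduce_out = np.max(select_out, axis=1, keepdims=True)
    return select_out.astype(np.float32), reduce_out.astype(np.float32)


on_true_tile = np.arange(8, dtype=np.float32).reshape(2, 4)
bound0 = np.array([[0.0], [1.0]], dtype=np.float32)  # one value per partition
bound1 = np.array([[2.0], [4.0]], dtype=np.float32)

# Keep elements whose index satisfies bound0 <= index < bound1.
selected, row_max = range_select_ref(
    on_true_tile, np.greater_equal, np.less, bound0, bound1)
```

Partition 0 keeps indices 0 and 1, partition 1 keeps indices 1 to 3, and `row_max` holds the per-partition maximum of the surviving values.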
    

Parameters:
    

  * on_true_tile – input tile containing elements to select from

  * on_false_value – constant value to use when selection condition is False. Due to HW constraints, this must be the FP32 bit pattern of FP32_MIN (`nl.fp32.min`).

  * comp_op0 – first comparison operator

  * comp_op1 – second comparison operator

  * bound0 – tile with one element per partition for first comparison

  * bound1 – tile with one element per partition for second comparison

  * reduce_op – reduction operator to apply across the selected output. Currently only `np.max` is supported.

  * reduce_res – optional tile to store reduction results.

  * range_start – starting base offset for index array for the free dimension of `on_true_tile`. Defaults to 0, and must be a compile-time integer.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

output tile with selected elements

Example:
    
    
    import neuronxcc.nki as nki
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import numpy as np
    ...
    
    ##################################################################
    # Example 1: Select elements where 
    # bound0 <= range_start + index < bound1 and compute max reduction
    # 
    # on_false_value must be nl.fp32.min
    ##################################################################
    on_true_tile = nl.load(on_true[...])
    bound0_tile = nl.load(bound0[...])
    bound1_tile = nl.load(bound1[...])
    
    reduce_res_tile = nl.ndarray((on_true.shape[0], 1), dtype=nl.float32, buffer=nl.sbuf)
    result = nl.ndarray(on_true.shape, dtype=nl.float32, buffer=nl.sbuf)
    
    result[...] = nisa.range_select(
        on_true_tile=on_true_tile,
        comp_op0=compare_op0,
        comp_op1=compare_op1,
        bound0=bound0_tile,
        bound1=bound1_tile,
        reduce_cmd=nisa.reduce_cmd.reset_reduce,
        reduce_res=reduce_res_tile,
        reduce_op=np.max,
        range_start=range_start,
        on_false_value=nl.fp32.min
    )
    
    nl.store(select_res[...], value=result[...])
    nl.store(reduce_result[...], value=reduce_res_tile[...])
    

Alternatively, `reduce_cmd` can chain multiple `range_select` calls so that they all accumulate into the same accumulation register. For example:
    
    
    import neuronxcc.nki as nki
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import numpy as np
    ...
    
    ##################################################################
    # Example 2.a: Initialize reduction with first range_select
    # Notice we don't pass reduce_res since the accumulation
    # register keeps track of the accumulation until we're ready to 
    # read it. Also we use reset_reduce in order to "clobber" or zero
    # out the accumulation register before we start accumulating.
    #
    # Note: Since these tensors are fp32, we use nl.fp32.min
    # for on_false_value due to HW constraints.
    ##################################################################
    on_true_tile = nl.load(on_true[...])
    bound0_tile = nl.load(bound0[...])
    bound1_tile = nl.load(bound1[...])
    
    reduce_res_sbuf = nl.ndarray((on_true.shape[0], 1), dtype=np.float32, buffer=nl.sbuf)
    result_sbuf = nl.ndarray(on_true.shape, dtype=np.float32, buffer=nl.sbuf)
    
    result_sbuf[...] = nisa.range_select(
        on_true_tile=on_true_tile,
        comp_op0=compare_op0,
        comp_op1=compare_op1,
        bound0=bound0_tile,
        bound1=bound1_tile,
        reduce_cmd=nisa.reduce_cmd.reset_reduce,
        reduce_op=np.max,
        range_start=range_start,
        on_false_value=nl.fp32.min
    )
    
    ##################################################################
    # Example 2.b: Chain multiple range_select operations 
    # with reduction in an affine loop. Adding ones just lets us ensure the reduction 
    # gets updated with new values.
    ##################################################################
    ones = nl.full(on_true.shape, fill_value=1, dtype=np.float32, buffer=nl.sbuf)
    # we are going to loop as if we're tiling on the partition dimension    
    iteration_step_size = on_true_tile.shape[0]
    
    # Perform chained operations using an affine loop index for range_start
    for i in range(1, 2):
        # Update input values
        on_true_tile[...] = nl.add(on_true_tile, ones)
        
        # Continue reduction with updated values
        # notice, we still don't have reduce_res specified
        result_sbuf[...] = nisa.range_select(
            on_true_tile=on_true_tile,
            comp_op0=compare_op0,
            comp_op1=compare_op1,
            bound0=bound0_tile,
            bound1=bound1_tile,
            reduce_cmd=nisa.reduce_cmd.reduce,
            reduce_op=np.max,
            # we can also use index expressions for setting the start of the range
            range_start=range_start + (i * iteration_step_size),
            on_false_value=nl.fp32.min
        )
    
    range_start = range_start + (2 * iteration_step_size)
    ##################################################################
    # Example 2.c: Final iteration. We now want the results returned
    # to the user, so we pass the reduce_res argument and the
    # reduction is written from the accumulation
    # register to reduce_res_sbuf
    ##################################################################
    on_true_tile[...] = nl.add(on_true_tile, ones)
    result_sbuf[...] = nisa.range_select(
        on_true_tile=on_true_tile,
        comp_op0=compare_op0,
        comp_op1=compare_op1,
        bound0=bound0_tile,
        bound1=bound1_tile,
        reduce_cmd=nisa.reduce_cmd.reduce,
        reduce_res=reduce_res_sbuf[...],
        reduce_op=np.max,
        range_start=range_start,
        on_false_value=nl.fp32.min
    )
    
    nl.store(select_res[...], value=result_sbuf[...])
    nl.store(reduce_result[...], value=reduce_res_sbuf[...])
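
The chaining in Examples 2.a to 2.c can be pictured with a small NumPy model. It is a conceptual sketch only: `MaxAccumulator` is a hypothetical stand-in for the accumulation register, the selection step is omitted, and the reset/accumulate behavior is inferred from the comments in the example rather than from a hardware specification.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min  # assumed to equal nl.fp32.min


class MaxAccumulator:
    """Hypothetical model of a per-partition max accumulation register."""

    def __init__(self, num_partitions):
        self.register = np.full((num_partitions, 1), FP32_MIN, np.float32)

    def step(self, tile, reset=False):
        tile_max = tile.max(axis=1, keepdims=True)
        if reset:
            # Models reduce_cmd.reset_reduce: discard the old contents.
            self.register = tile_max
        else:
            # Models reduce_cmd.reduce: keep accumulating.
            self.register = np.maximum(self.register, tile_max)
        return self.register.copy()  # what reduce_res would receive


tile = np.array([[1.0, 5.0, 2.0], [4.0, 0.0, 3.0]], dtype=np.float32)
acc = MaxAccumulator(num_partitions=2)

acc.step(tile, reset=True)           # 2.a: reset, result not read
acc.step(tile + 1)                   # 2.b: accumulate, result not read
reduce_res = acc.step(tile + 2)      # 2.c: accumulate and read back
```

Only the last call reads the register, mirroring how `reduce_res` is passed only in Example 2.c.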
    

---

### range_select

`range_select` | Select elements from `on_true_tile` based on comparison with bounds using Vector Engine.  

---

### affine_select

`affine_select` | Select elements between an input tile `on_true_tile` and a scalar value `on_false_value` according to a boolean predicate tile using GpSimd Engine.  

---

### nki.isa.sequence_bounds

nki.isa.sequence_bounds(*, segment_ids, dtype=None)
    

Compute the sequence bounds for a given set of segment IDs using GpSimd Engine.

Given a tile of segment IDs, this function identifies where each segment begins and ends. For each element, it returns a pair of values: [start_index, end_index] indicating the boundaries of the segment that element belongs to. All segment IDs must be non-negative integers. Padding elements (with segment ID of zero) receive special boundary values: a start index of n and an end index of (-1), where n is the length of `segment_ids`.

The output tile contains two values per input element: the start index (first column) and end index (second column) of each segment. The partition dimension must always be 1. For example, with input shape (1, 512), the output shape becomes (1, 2, 512), where the additional dimension holds the start and end indices for each element.

The input tile (`segment_ids`) must have data type np.float32 or np.int32. The output tile data type is specified using the `dtype` field (must be np.float32 or np.int32). If `dtype` is not specified, the output data type will be the same as the input data type of `segment_ids`.

NumPy equivalent:
    
    
    def compute_sequence_bounds(sequence):
      n = len(sequence)
    
      min_bounds = np.zeros(n, dtype=sequence.dtype)
      max_bounds = np.zeros(n, dtype=sequence.dtype)
    
      min_bound_pad = n
      max_bound_pad = -1
    
      min_bounds[0] = 0 if sequence[0] != 0 else min_bound_pad
      for i in range(1, n):
        if sequence[i] == 0:
          min_bounds[i] = min_bound_pad
        elif sequence[i] == sequence[i - 1]:
          min_bounds[i] = min_bounds[i - 1]
        else:
          min_bounds[i] = i
    
      max_bounds[-1] = n if sequence[-1] != 0 else max_bound_pad
      for i in range(n - 2, -1, -1):
        if sequence[i] == 0:
          max_bounds[i] = max_bound_pad
        elif sequence[i] == sequence[i + 1]:
          max_bounds[i] = max_bounds[i + 1]
        else:
          max_bounds[i] = i + 1
    
      return np.vstack((min_bounds, max_bounds))
    
    b = (
      np.apply_along_axis(
        compute_sequence_bounds, axis=1, arr=reshaped_segment_ids
      )
      .reshape(m, 2, n)
      .astype(np.float32)
    )
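
The snippet above leaves `reshaped_segment_ids`, `m`, and `n` undefined. As a self-contained alternative, the sketch below computes the same bounds by splitting each row into runs of equal IDs. `sequence_bounds_ref` is a hypothetical helper, not part of NKI.

```python
import numpy as np


def sequence_bounds_ref(segment_ids):
    """Hypothetical NumPy reference: (m, n) segment IDs -> (m, 2, n) bounds."""
    m, n = segment_ids.shape
    out = np.empty((m, 2, n), dtype=segment_ids.dtype)
    for row in range(m):
        seq = segment_ids[row]
        # Positions where the ID changes mark the start of a new run.
        change = np.flatnonzero(np.diff(seq) != 0) + 1
        starts = np.concatenate(([0], change))
        ends = np.concatenate((change, [n]))
        for start, end in zip(starts, ends):
            if seq[start] == 0:          # padding run
                out[row, 0, start:end] = n
                out[row, 1, start:end] = -1
            else:
                out[row, 0, start:end] = start
                out[row, 1, start:end] = end
    return out


segment_ids = np.array([[0, 1, 1, 2, 2, 2, 0, 3, 3]], dtype=np.int32)
bounds = sequence_bounds_ref(segment_ids)
# bounds[0, 0] -> start indices, bounds[0, 1] -> end indices
```

For this input, the start row is `[9, 1, 1, 3, 3, 3, 9, 7, 7]` and the end row is `[-1, 3, 3, 6, 6, 6, -1, 9, 9]`.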
    

Parameters:
    

  * segment_ids – tile containing the segment IDs. Elements with ID=0 are treated as padding.

  * dtype – data type of the output (must be np.float32 or np.int32)

Returns:
    

tile containing the sequence bounds.

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ######################################################################
    # Example 1: Generate tile of boundaries of sequence for each element:
    ######################################################################
    # Input example
    # segment_ids = np.array([[0, 1, 1, 2, 2, 2, 0, 3, 3]], dtype=np.int32)
    
    # Expected output for this example:
    # [[
    #   [9, 1, 1, 3, 3, 3, 9, 7, 7]       # start index
    #   [-1, 3, 3, 6, 6, 6, -1, 9, 9]     # end index
    #   ]]
    m, n = segment_ids.shape
    
    ix, iy, iz = nl.mgrid[0:m, 0:2, 0:n]
    
    out_tile = nl.ndarray([m, 2, n], dtype=segment_ids.dtype, buffer=nl.sbuf)
    seq_tile = nl.load(segment_ids)
    out_tile[ix, iy, iz] = nisa.sequence_bounds(segment_ids=seq_tile)
    

## Gather and Shuffle Operations

### gather_flattened

`gather_flattened` | Gather elements from `data` according to the `indices`.  

---

### nki.language.gather_flattened

nki.language.gather_flattened(data, indices, *, mask=None, dtype=None, **kwargs)
    

Gather elements from `data` according to the `indices`.

This instruction gathers elements from the `data` tensor using integer indices provided in the `indices` tensor. For each element in the `indices` tensor, it retrieves the corresponding value from the `data` tensor using the index value to select from the free dimension of `data`. The gather instruction effectively performs up to 128 parallel gather operations, with each operation using the corresponding partition of `data` and `indices`.

The output tensor has the same shape as the `indices` tensor, with each output element containing the value from `data` at the position specified by the corresponding index. Out of bounds indices will return garbage values.

Both `data` and `indices` must be 2-, 3-, or 4-dimensional. The `indices` tensor must contain uint32 values.

For indexing purposes, all free dimensions are flattened and indexed as the same “row”. Consider this example:
    
    
    data =
    [[[1., 2.],
     [3., 4.]],
    [[5., 6.],
     [7., 8.]]]
    indices =
    [[[0, 1],
      [1, 3]],
     [[3, 3],
      [1, 0]]]
    nl.gather_flattened(data, indices) produces this result:
    [[[1., 2.],
      [2., 4.]],
     [[8., 8.],
      [6., 5.]]]
    

With the exception of handling out-of-bounds indices, this behavior is equivalent to:
    
    
    indices_flattened = indices.reshape(indices.shape[0], -1)
    data_flattened = data.reshape(data.shape[0], -1)
    result = np.take_along_axis(data_flattened, indices_flattened, axis=-1)
    result = result.reshape(indices.shape)
    

(Similar to `torch.gather`.)
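
The worked example above can be checked directly with NumPy. `gather_flattened_ref` is a hypothetical helper that wraps the equivalent code, and the cast to `np.intp` is only there to keep NumPy indexing happy with `uint32` indices.

```python
import numpy as np


def gather_flattened_ref(data, indices):
    """Hypothetical NumPy reference for in-bounds indices only."""
    data_flat = data.reshape(data.shape[0], -1)
    idx_flat = indices.reshape(indices.shape[0], -1).astype(np.intp)
    gathered = np.take_along_axis(data_flat, idx_flat, axis=-1)
    return gathered.reshape(indices.shape)


data = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[5.0, 6.0], [7.0, 8.0]]], dtype=np.float32)
indices = np.array([[[0, 1], [1, 3]],
                    [[3, 3], [1, 0]]], dtype=np.uint32)

result = gather_flattened_ref(data, indices)
```

Each partition (first axis) is gathered independently from its own flattened row of four values.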

Parameters:
    

  * data – the source tensor to gather values from

  * indices – tensor containing uint32 indices to gather across the flattened free dimension.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

a tensor with the same shape as indices containing gathered values from data

Example:
    
    
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    
    ##################################################################
    # Example 1: Gather values from a tensor using indices
    ##################################################################
    # Create source tensor
    N = 32
    M = 64
    data = nl.rand((N, M), dtype=nl.float32)
    
    # Create indices tensor - gather every 5th element
    indices = nl.zeros((N, 10), dtype=nl.uint32)
    for i in nl.static_range(N):
        for j in nl.static_range(10):
            indices[i, j] = j * 5
    
    # Gather values from data according to indices
    result = nl.gather_flattened(data=data, indices=indices)
    

---

### local_gather

`local_gather` | Gather SBUF data in `src_buffer` using `index` on GpSimd Engine.  

---

### nki.isa.nc_stream_shuffle

nki.isa.nc_stream_shuffle(src, dst, shuffle_mask, *, dtype=None, mask=None, **kwargs)
    

Apply cross-partition data movement within a quadrant of 32 partitions from source tile `src` to destination tile `dst` using Vector Engine.

Both source and destination tiles can be in either SBUF or PSUM, and are passed in by reference as arguments. In-place shuffle is allowed, i.e., `dst` can be the same as `src`. `shuffle_mask` is a 32-element list. Each mask element must be an int or an affine expression. `shuffle_mask[i]` indicates which input partition the output partition [i] copies from within each 32-partition quadrant. The special value `shuffle_mask[i]=255` means the output tensor in partition [i] will be unmodified. `nc_stream_shuffle` can be applied to multiple quadrants. When more than one quadrant is involved, the shuffle is applied to each quadrant independently, and the same `shuffle_mask` is used for each quadrant. `mask` applies to `dst`, meaning that locations masked out by `mask` will be unmodified. For more information about the cross-partition data movement, see Cross-partition Data Movement.
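
The `shuffle_mask` semantics described above can be modeled in NumPy as follows. This is a sketch under assumptions: `stream_shuffle_ref` is a hypothetical helper, the optional `mask` argument is ignored, and in-place use is modeled by reading a snapshot of `src` before writing.

```python
import numpy as np

UNMODIFIED = 255  # special shuffle_mask value described above


def stream_shuffle_ref(src, dst, shuffle_mask):
    """Hypothetical NumPy model: rows are partitions, quadrants are 32 rows."""
    assert len(shuffle_mask) == 32
    src_snapshot = src.copy()            # read first so src may alias dst
    out = dst.copy()
    for base in range(0, out.shape[0], 32):       # one pass per quadrant
        for i, source in enumerate(shuffle_mask):
            if source == UNMODIFIED or base + i >= out.shape[0]:
                continue                           # leave dst partition as is
            out[base + i] = src_snapshot[base + source]
    return out


tile = np.arange(64 * 2, dtype=np.float32).reshape(64, 2)

# Rotate by one partition inside each quadrant: output i reads input i - 1.
rotate = [(i - 1) % 32 for i in range(32)]
rotated = stream_shuffle_ref(tile, tile, rotate)

# Only partition 0 is written; all others keep their dst contents.
only_first = [UNMODIFIED] * 32
only_first[0] = 5
partial = stream_shuffle_ref(tile[:32], np.zeros((32, 2), np.float32),
                             only_first)
```

Because the same mask is reused per quadrant, partition 32 of `rotated` receives partition 63, not partition 31.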

This API has 3 constraints on `src` and `dst`:

  1. `dst` must have same data type as `src`.

  2. `dst` must have the same number of elements per partition as `src`.

  3. The access start partition of `src` (`src_start_partition`), does not have to match or be in the same quadrant as that of `dst` (`dst_start_partition`). However, `src_start_partition`/`dst_start_partition` needs to follow some special hardware rules with the number of active partitions `num_active_partitions`. `num_active_partitions = ceil(max(src_num_partitions, dst_num_partitions)/32) * 32`, where `src_num_partitions` and `dst_num_partitions` refer to the number of partitions the `src` and `dst` tensors access respectively. `src_start_partition`/`dst_start_partition` is constrained based on the value of `num_active_partitions`:

     * If `num_active_partitions` is 96/128, `src_start_partition`/`dst_start_partition` must be 0.

     * If `num_active_partitions` is 64, `src_start_partition`/`dst_start_partition` must be 0/64.

     * If `num_active_partitions` is 32, `src_start_partition`/`dst_start_partition` must be 0/32/64/96.
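
The start-partition rule in constraint 3 can be written out as a small check. The helper names are hypothetical; the formula and the allowed values are taken from the text above.

```python
import math

# Allowed start partitions for each value of num_active_partitions.
ALLOWED_START_PARTITIONS = {
    32: (0, 32, 64, 96),
    64: (0, 64),
    96: (0,),
    128: (0,),
}


def num_active_partitions(src_num_partitions, dst_num_partitions):
    """ceil(max(src, dst) / 32) * 32, as defined in constraint 3."""
    return math.ceil(max(src_num_partitions, dst_num_partitions) / 32) * 32


def start_partition_is_valid(start_partition, src_num_partitions,
                             dst_num_partitions):
    """Hypothetical check of a src or dst start partition."""
    active = num_active_partitions(src_num_partitions, dst_num_partitions)
    return start_partition in ALLOWED_START_PARTITIONS[active]
```

For instance, broadcasting 1 partition to 32 gives 32 active partitions, so a start partition of 96 is acceptable, whereas a 64-partition shuffle cannot start at partition 32.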

Estimated instruction cost:

`max(MIN_II, N)` Vector Engine cycles, where `N` is the number of elements per partition in `src`, and `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

Parameters:
    

  * src – the source tile

  * dst – the destination tile

  * shuffle_mask – a 32-element list that specifies the shuffle source and destination partition

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor
    import numpy as np
    
    #####################################################################
    # Example 1: 
    # Apply cross-partition data movement to a 32-partition tensor,
    # in-place shuffling the data in partition[i] to partition[(i+1)%32].
    #####################################################################
    
    ...
    a: tensor[32, 128] = nl.load(in_tensor)
    a_mgrid = nl.mgrid[0:32, 0:128]
    shuffle_mask = [(i - 1) % 32 for i in range(32)]
    nisa.nc_stream_shuffle(src=a[a_mgrid.p, a_mgrid.x], dst=a[a_mgrid.p, a_mgrid.x], shuffle_mask=shuffle_mask)
    
    nl.store(out_tensor, value=a)
    
    
    
    #####################################################################
    # Example 2: 
    # Broadcast data in 1 partition to 32 partitions.
    #####################################################################
    
    ...
    a: tensor[1, 128] = nl.load(in_tensor)
    b = nl.ndarray(shape=(32, 128), dtype=np.float32)
    dst_mgrid = nl.mgrid[0:32, 0:128]
    src_mgrid = nl.mgrid[0:1, 0:128]
    shuffle_mask = [0] * 32
    nisa.nc_stream_shuffle(src=a[0, src_mgrid.x], dst=b[dst_mgrid.p, dst_mgrid.x], shuffle_mask=shuffle_mask)
    
    nl.store(out_tensor, value=b)
    
    
    
    #####################################################################
    # Example 3: 
    # In the case where src and dst access more than one quadrant (32 
    # partitions), the shuffle is applied to each quadrant independently, 
    # and the same shuffle_mask is used for each quadrant.
    #####################################################################
    
    ...
    a: tensor[128, 128] = nl.load(in_tensor)
    b = nl.ndarray(shape=(128, 128), dtype=np.float32)
    mgrid = nl.mgrid[0:128, 0:128]
    shuffle_mask = [(i - 1) % 32 for i in range(32)]
    nisa.nc_stream_shuffle(src=a[mgrid.p, mgrid.x], dst=b[mgrid.p, mgrid.x], shuffle_mask=shuffle_mask)
    
    nl.store(out_tensor, value=b)
    

---

### nc_stream_shuffle

`nc_stream_shuffle` | Apply cross-partition data movement within a quadrant of 32 partitions from source tile `src` to destination tile `dst` using Vector Engine.  

---

### nki.isa.nc_match_replace8

nki.isa.nc_match_replace8(*, data, vals, imm, dst_idx=None, mask=None, dtype=None, **kwargs)
    

Replace first occurrence of each value in `vals` with `imm` in `data` using the Vector engine and return the replaced tensor. If `dst_idx` tile is provided, the indices of the matched values are written to `dst_idx`.

This instruction reads the input `data`, replaces the first occurrence of each of the given values (from `vals` tensor) with the specified immediate constant and, optionally, output indices of matched values to `dst_idx`. When performing the operation, the free dimensions of both `data` and `vals` are flattened. However, these dimensions are preserved in the replaced output tensor and in `dst_idx` respectively. The partition dimension defines the parallelization boundary. Match, replace, and index generation operations execute independently within each partition.

The `data` tensor can be up to 5-dimensional, while the `vals` tensor can be up to 3-dimensional. The `vals` tensor must have exactly 8 elements per partition. The data tensor must have no more than 16,384 elements per partition. The replaced output will have the same shape as the input data tensor. `data` and `vals` must have the same number of partitions. Both input tensors can come from SBUF or PSUM.

Behavior is undefined if vals tensor contains values that are not in the data tensor.

If provided, a mask is applied to the data tensor.

Estimated instruction cost:

`max(MIN_II, N)` engine cycles, where:

  * `N` is the number of elements per partition in the data tensor

  * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.

NumPy equivalent:
    
    
    # Let's assume we work with NumPy, and ``data``, ``vals`` are 2-dimensional arrays
    # (with shape[0] being the partition axis) and imm is a constant float32 value.
    
    import numpy as np
    
    # Get original shapes
    data_shape = data.shape
    vals_shape = vals.shape
    
    # Reshape to 2D while preserving first dimension
    data_2d = data.reshape(data_shape[0], -1)
    vals_2d = vals.reshape(vals_shape[0], -1)
    
    # Initialize output array for indices
    indices = np.zeros(vals_2d.shape, dtype=np.uint32)
    
    for i in range(data_2d.shape[0]):
      for j in range(vals_2d.shape[1]):
        val = vals_2d[i, j]
        # Find first occurrence of val in data_2d[i, :]
        matches = np.where(data_2d[i, :] == val)[0]
        if matches.size > 0:
          indices[i, j] = matches[0]  # Take first match
          data_2d[i, matches[0]] = imm
    
    output = data_2d.reshape(data.shape)
    indices = indices.reshape(vals.shape) # Computed only if ``dst_idx`` is specified
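
For a concrete run of the logic above, the sketch below wraps it in a function and applies it to one partition of ten values. `match_replace8_ref` is a hypothetical helper; the `vals` tensor is built with a NumPy sort as a stand-in for `max8`, and its descending order is an assumption.

```python
import numpy as np


def match_replace8_ref(data, vals, imm):
    """Hypothetical NumPy reference; returns (replaced data, indices)."""
    out = data.copy().reshape(data.shape[0], -1)   # leave the input intact
    vals_2d = vals.reshape(vals.shape[0], -1)
    indices = np.zeros(vals_2d.shape, dtype=np.uint32)
    for p in range(out.shape[0]):                  # each partition alone
        for j in range(vals_2d.shape[1]):
            hits = np.flatnonzero(out[p] == vals_2d[p, j])
            if hits.size > 0:
                indices[p, j] = hits[0]            # first occurrence only
                out[p, hits[0]] = imm
    return out.reshape(data.shape), indices.reshape(vals.shape)


data = np.array([[3.0, 9.0, 1.0, 9.0, 7.0, 5.0, 8.0, 2.0, 6.0, 4.0]],
                dtype=np.float32)
# Stand-in for max8: the 8 largest values per partition, largest first.
vals = -np.sort(-data, axis=1)[:, :8]

replaced, idx = match_replace8_ref(data, vals, imm=-np.inf)
```

The duplicated value 9 shows the "first occurrence" rule: the first match lands on index 1, and because that element has already been replaced, the second match lands on index 3.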
    

Parameters:
    

  * data – the data tensor to modify

  * dst_idx – (optional) the destination tile to write flattened indices of matched values

  * vals – tensor containing the 8 values per partition to replace

  * imm – float32 constant to replace matched values with

  * mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  * dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.

Returns:
    

the modified data tensor

Example:
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.typing as nt
    
    
    ##################################################################
    # Example 1: Generate a tile of random floating-point values,
    # get the 8 largest values in each row, then replace their first
    # occurrences with -inf:
    ##################################################################
    N = 4
    M = 16
    data_tile = nl.rand((N, M))
    max_vals = nisa.max8(src=data_tile)
    
    result = nisa.nc_match_replace8(data=data_tile[:, :], vals=max_vals, imm=float('-inf'))
    result_tensor = nl.ndarray([N, M], dtype=nl.float32, buffer=nl.shared_hbm)
    nl.store(result_tensor, value=result)
    
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.typing as nt
    
    
    ##################################################################
    # Example 2: Read the 8 largest values in each row of the tensor,
    # replace the first occurrence with imm, write indices, and return
    # the replaced output.
    ##################################################################
    n, m = in_tensor.shape
    
    dst_idx = nl.ndarray((n, 8), dtype=idx_tensor.dtype)
    
    ix, iy = nl.mgrid[0:n, 0:8]
    
    inp_tile: nt.tensor[n, m] = nl.load(in_tensor)
    max_vals: nt.tensor[n, 8] = nisa.max8(src=inp_tile)
    
    out_tile = nisa.nc_match_replace8(
      dst_idx=dst_idx[ix, iy], data=inp_tile[:, :], vals=max_vals, imm=imm
    )
    
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.typing as nt
    
    
    ##################################################################
    # Example 3: Read the 8 largest values in each row of the tensor,
    # after applying the specified mask, replace the first occurrence
    # with imm, write indices, and return the replaced output.
    ##################################################################
    n, m = in_tensor.shape
    
    idx_tile = nisa.memset(shape=(n, 8), value=0, dtype=nl.uint32)
    
    ix, iy = nl.mgrid[0:n, 0:m]
    inp_tile: nt.tensor[n, m] = nl.load(in_tensor)
    max_vals: nt.tensor[n, 8] = nisa.max8(src=inp_tile[ix, iy], mask=((ix < n // 2) & (iy < m // 2)))
    
    out_tile = nisa.nc_match_replace8(
      dst_idx=idx_tile[:, :],
      data=inp_tile[ix, iy],
      vals=max_vals,
      imm=imm,
      mask=((ix < n // 2) & (iy < m // 2)),  # mask applies to `data`
    )
    
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.typing as nt
    
    
    ##################################################################
    # Example 4: Read the 8 largest values in each row of the tensor,
    # replace the first occurrence with 0.0, write indices, and return 
    # the replaced output.
    ##################################################################
    n, b, m = data_tensor.shape
    
    out_tensor = nl.ndarray([n, b, m], dtype=data_tensor.dtype, buffer=nl.hbm)
    idx_tensor = nl.ndarray([n, 8], dtype=nl.uint32, buffer=nl.hbm)
    
    imm = 0.0
    idx_tile = nisa.memset(shape=(n, 8), value=0, dtype=nl.uint32)
    out_tile = nisa.memset(shape=(n, b, m), value=0, dtype=data_tensor.dtype)
    
    iq, ir, iw = nl.mgrid[0:n, 0:b, 0:m]
    ip, io = nl.mgrid[0:n, 0:8]
    
    inp_tile = nl.load(data_tensor[iq, ir, iw])
    max_vals: nt.tensor[n, 8] = nisa.max8(src=inp_tile)
    
    out_tile[iq, ir, iw] = nisa.nc_match_replace8(
      dst_idx=idx_tile[ip, io],
      data=inp_tile[iq, ir, iw],
      vals=max_vals[ip, io],
      imm=imm,
    )
    
    
    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    import neuronxcc.nki.typing as nt
    
    
    ##################################################################
    # Example 5: Read the 8 largest values in each row of the tensor,
    # replace the first occurrence with 0.0 in-place and write indices.
    ##################################################################
    n, b, m = data_tensor.shape
    
    out_tensor = nl.ndarray([n, b, m], dtype=data_tensor.dtype, buffer=nl.hbm)
    idx_tensor = nl.ndarray([n, 8], dtype=nl.uint32, buffer=nl.hbm)
    
    imm = 0.0
    idx_tile = nisa.memset(shape=(n, 8), value=0, dtype=nl.uint32)
    
    iq, ir, iw = nl.mgrid[0:n, 0:b, 0:m]
    ip, io = nl.mgrid[0:n, 0:8]
    
    inp_tile = nl.load(data_tensor[iq, ir, iw])
    max_vals: nt.tensor[n, 8] = nisa.max8(src=inp_tile)
    
    inp_tile[iq, ir, iw] = nisa.nc_match_replace8(
      dst_idx=idx_tile[ip, io],
      data=inp_tile[iq, ir, iw],
      vals=max_vals[ip, io],
      imm=imm,
    )
    

---

### max8

`max8` | Find the 8 largest values in each partition of the source tile.  

---

### nc_find_index8

`nc_find_index8` | Find indices of the 8 given vals in each partition of the data tensor.  

## Control Flow and SPMD

### static_range

`static_range` | Create a sequence of numbers for use as loop iterators in NKI, resulting in a fully unrolled loop.  

---

### affine_range

`affine_range` | Create a sequence of numbers for use as parallel loop iterators in NKI.  

---

### nki.language.affine_range

nki.language.affine_range(*args, **kwargs)
    

Create a sequence of numbers for use as parallel loop iterators in NKI. `affine_range` should be the default loop iterator choice when there is no loop-carried dependency. Note that associative reductions are not considered loop-carried dependencies in this context. A concrete example of associative reduction is multiple `nl.matmul` or `nisa.nc_matmul` calls accumulating into the same output buffer defined outside of this loop level (see code example #2 below).

When the above conditions are not met, we recommend using `sequential_range` instead.

Notes:

  * Using `affine_range` prevents the Neuron compiler from unrolling the loops until entering the compiler backend, which typically results in better compilation time compared to the fully unrolled iterator `static_range`.

  * Using `affine_range` also allows the Neuron compiler to perform additional loop-level optimizations, such as loop vectorization in the current release. The exact type of loop-level optimizations applied is subject to change in future releases.

  * Since each kernel instance only runs on a single NeuronCore, `affine_range` does not parallelize different loop iterations across multiple NeuronCores. However, different iterations could be parallelized/pipelined on different compute engines within a NeuronCore depending on the invoked instructions (engines) and data dependency in the loop body.

    
    
    import neuronxcc.nki.language as nl
    
    #######################################################################
    # Example 1: No loop carried dependency
    # Input/Output tensor shape: [128, 2048]
    # Load one tile ([128, 512]) at a time, square the tensor element-wise,
    # and store it into output tile
    #######################################################################
    
    # Every loop instance works on an independent input/output tile.
    # No data dependency between loop instances.
    for i_input in nl.affine_range(input.shape[1] // 512):
      offset = i_input * 512
      input_sb = nl.load(input[0:input.shape[0], offset:offset+512])
      result = nl.multiply(input_sb, input_sb)
      nl.store(output[0:input.shape[0], offset:offset+512], result)
    
    #######################################################################
    # Example 2: Matmul output buffer accumulation, a type of associative reduction
    # Input tensor shapes for nl.matmul: xT[K=2048, M=128] and y[K=2048, N=128]
    # Load one tile ([128, 128]) from both xT and y at a time, matmul and
    # accumulate into the same output buffer
    #######################################################################
    
    result_psum = nl.zeros((128, 128), dtype=nl.float32, buffer=nl.psum)
    for i_K in nl.affine_range(xT.shape[0] // 128):
      offset = i_K * 128
      # Load the current K-slice of each operand from HBM
      xT_sbuf = nl.load(xT[offset:offset+128, 0:xT.shape[1]])
      y_sbuf = nl.load(y[offset:offset+128, 0:y.shape[1]])
    
      result_psum += nl.matmul(xT_sbuf, y_sbuf, transpose_x=True)
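
To see why accumulation into a shared buffer does not count as a loop carried dependency, the following pure-Python sketch (not NKI code) accumulates per-tile partial products in two different tile orders and compares the results. The helper names `matmul_tile` and `accumulate_over_k` are hypothetical, and the inputs are integers so that the comparison is exact; with floating-point data, different orders may differ by rounding.

```python
import random

def matmul_tile(xT_tile, y_tile):
    """Compute xT_tile.T @ y_tile for a [K_tile, M] and a [K_tile, N] nested list."""
    k_tile, m, n = len(xT_tile), len(xT_tile[0]), len(y_tile[0])
    return [[sum(xT_tile[k][i] * y_tile[k][j] for k in range(k_tile))
             for j in range(n)] for i in range(m)]

def accumulate_over_k(xT, y, tile_k, tile_order):
    """Add per-tile partial products into one buffer, visiting K tiles in tile_order."""
    m, n = len(xT[0]), len(y[0])
    acc = [[0] * n for _ in range(m)]  # plays the role of result_psum
    for t in tile_order:
        lo = t * tile_k
        partial = matmul_tile(xT[lo:lo + tile_k], y[lo:lo + tile_k])
        for i in range(m):
            for j in range(n):
                acc[i][j] += partial[i][j]
    return acc

rng = random.Random(0)
K, M, N, TILE_K = 8, 3, 2, 2
xT = [[rng.randint(-5, 5) for _ in range(M)] for _ in range(K)]
y = [[rng.randint(-5, 5) for _ in range(N)] for _ in range(K)]

in_order = accumulate_over_k(xT, y, TILE_K, [0, 1, 2, 3])
shuffled = accumulate_over_k(xT, y, TILE_K, [2, 0, 3, 1])
print(in_order == shuffled)
```

Because each iteration only adds its own partial product to the buffer, the final contents do not depend on the order in which the tiles are visited.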
    

---

### sequential_range

`sequential_range` | Create a sequence of numbers for use as sequential loop iterators in NKI.  

---

### nki.language.sequential_range

# nki.language.sequential_range
nki.language.sequential_range(*args, **kwargs)
    

Create a sequence of numbers for use as sequential loop iterators in NKI. `sequential_range` should be used when there is a loop carried dependency. Note, associative reductions are not considered loop carried dependencies in this context. See affine_range for an example of such associative reduction.

Notes:

  * Inside a NKI kernel, any use of Python `range(...)` will be replaced with `sequential_range(...)` by Neuron compiler.

  * Using `sequential_range` prevents Neuron compiler from unrolling the loops until entering compiler backend, which typically results in better compilation time compared to the fully unrolled iterator static_range.

  * Using `sequential_range` informs Neuron compiler to respect inter-loop dependency and perform much more conservative loop-level optimizations compared to `affine_range`.

  * Incorrectly using `affine_range` instead of `sequential_range` when there is a loop carried dependency is unsafe and could lead to numerical errors.

    
    
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    
    #######################################################################
    # Example 1: Loop carried dependency from tiling tensor_tensor_scan
    # Both sbuf tensor input0 and input1 shapes: [128, 2048]
    # Perform a scan operation between the two inputs using a tile size of [128, 512]
    # Store the scan output to another [128, 2048] tensor
    #######################################################################
    
    # Loop iterations communicate through this init tensor
    init = nl.zeros((128, 1), dtype=input0.dtype)
    
    # This loop will only produce correct results if the iterations are performed in order
    for i_input in nl.sequential_range(input0.shape[1] // 512):
      offset = i_input * 512
    
      # Depends on scan result from the previous loop iteration
      result = nisa.tensor_tensor_scan(input0[:, offset:offset+512],
                                       input1[:, offset:offset+512],
                                       initial=init,
                                       op0=nl.multiply, op1=nl.add)
    
      nl.store(output[0:input0.shape[0], offset:offset+512], result)
    
      # Prepare initial result for scan in the next loop iteration
      init[:, :] = result[:, 511]
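
The effect of the carried `init` value can be illustrated with a small pure-Python sketch (not NKI code). It assumes the scan computes `out[i] = out[i-1] * data0[i] + data1[i]`, which is our reading of `op0=nl.multiply, op1=nl.add`; the helpers `scan_tile` and `tiled_scan` are hypothetical.

```python
def scan_tile(data0, data1, initial):
    """Scan one tile: out[i] = out[i-1] * data0[i] + data1[i], seeded with initial."""
    out, prev = [], initial
    for a, b in zip(data0, data1):
        prev = prev * a + b
        out.append(prev)
    return out

def tiled_scan(data0, data1, tile, tile_order):
    """Scan tile by tile, carrying the last value of each tile into the next one visited."""
    out = [None] * len(data0)
    init = 0  # state shared between loop iterations
    for t in tile_order:
        lo = t * tile
        result = scan_tile(data0[lo:lo + tile], data1[lo:lo + tile], init)
        out[lo:lo + tile] = result
        init = result[-1]  # prepare the seed for the next iteration
    return out

data0 = [2, 1, 3, 1, 2, 2, 1, 3]
data1 = [1, 2, 3, 4, 5, 6, 7, 8]

reference = scan_tile(data0, data1, 0)               # untiled scan
ordered = tiled_scan(data0, data1, 2, [0, 1, 2, 3])   # tiles visited in order
reordered = tiled_scan(data0, data1, 2, [1, 0, 2, 3]) # tiles visited out of order

print(ordered == reference, reordered == reference)
```

Only the in-order traversal reproduces the untiled scan, which is the situation `sequential_range` is meant to protect.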
    

---

### program_id

`program_id` | Index of the current SPMD program along the given axis in the launch grid.  

---

### nki.language.program_id

nki.language.program_id(axis)
    

Index of the current SPMD program along the given axis in the launch grid.

Parameters:
    

axis – The axis of the ND launch grid.

Returns:
    

The program id along `axis` in the launch grid
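
As a rough mental model, the launch grid can be emulated on the host with a plain Python loop that calls the kernel once per grid point. This is only a sketch: `launch` and `row_block_kernel` are hypothetical helpers, and in a real kernel the value written here as `pid[0]` would come from `nl.program_id(0)`.

```python
from itertools import product

def launch(kernel, grid, *args):
    """Call kernel once per point of an N-D launch grid, passing the point as its program ids."""
    for pid in product(*(range(g) for g in grid)):
        kernel(pid, *args)

def row_block_kernel(pid, src, dst, rows_per_program):
    """Each program doubles its own block of rows, selected by its id along axis 0."""
    start = pid[0] * rows_per_program
    for r in range(start, start + rows_per_program):
        dst[r] = [2 * v for v in src[r]]

src = [[r * 10 + c for c in range(3)] for r in range(8)]
dst = [None] * 8

# A 1-D grid of 4 programs, each responsible for 2 of the 8 rows
launch(row_block_kernel, (4,), src, dst, 2)
print(dst[7])
```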

---

### num_programs

`num_programs` | Number of SPMD programs along the given axes in the launch grid.  

---

### program_ndim

`program_ndim` | Number of dimensions in the SPMD launch grid.  

---

### spmd_dim

`spmd_dim` | Create a dimension in the SPMD launch grid of a NKI kernel with sub-dimension tiling.  

---

### nc

`nc` | Create a logical neuron core dimension in launch grid.  

## Masking and API Behavior

### NKI API Masking

## NKI API Masking
All nki.language and nki.isa APIs accept an optional input field, `mask`. The `mask` field is an execution predicate known at compile-time, which informs the compiler to skip generating the instruction or generate the instruction with a smaller input tile shape. Masking is handled completely by Neuron compiler and hence does not incur any performance overhead in the generated instructions.

The `mask` can be created using comparison expressions (e.g., `a < b`) or multiple comparison expressions concatenated with `&` (e.g., `(a < b) & (c > d)`). The left- or right-hand side expression of each comparator must be an affine expression of `nki.language.arange()`, `nki.language.affine_range()` or `nki.language.program_id()`. Each comparison expression should indicate which range of indices along one of the input tile axes is valid for the computation. For example, assume we have an input tile `in_tile` of shape `(128, 512)` and would like to square only the elements in `[0:64, 0:256]`. We can invoke the `nki.language.square()` API as follows:
    
    
    import neuronxcc.nki.language as nl
    
    ...
    i_p = nl.arange(128)[:, None]
    i_f = nl.arange(512)[None, :]
    
    out_tile = nl.square(in_tile, mask=((i_p<64) & (i_f<256)))
    

The Neuron Compiler lowers the above example into a hardware ISA instruction that processes only 64x256 elements.
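
The selection made by such a mask can be mimicked in plain Python on a small tile. This is a sketch of the semantics only: `masked_square` is a hypothetical helper, and it marks elements outside the mask as `None` because the contents of masked-out output elements are not specified here.

```python
def masked_square(tile, p_limit, f_limit):
    """Square only elements with row index < p_limit and column index < f_limit.

    Elements outside the mask are left as None to mark them as not computed.
    """
    return [[v * v if (p < p_limit and f < f_limit) else None
             for f, v in enumerate(row)]
            for p, row in enumerate(tile)]

# A 4x6 tile stands in for the (128, 512) tile; the mask keeps the [0:2, 0:3] region
in_tile = [[p + f for f in range(6)] for p in range(4)]
out_tile = masked_square(in_tile, p_limit=2, f_limit=3)

computed = sum(v is not None for row in out_tile for v in row)
print(computed)
```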

The above `mask` definition works for most APIs where there is only one input tile or both input tiles share the same axes. One exception is the `nki.language.matmul` and similarly `nki.isa.nc_matmul` API, where the two input tiles `lhs` and `rhs` contain three unique axes:

  1. The contraction axis: both `lhs` and `rhs` partition axis (`lhs_rhs_p`)

  2. The first axis of matmul output: `lhs` free axis (`lhs_f`)

  3. The second axis of matmul output: `rhs` free axis (`rhs_f`)

As an example, let’s assume we have `lhs` tile of shape `(sz_p, sz_m)` and `rhs` tile of shape `(sz_p, sz_n)`, and we call `nki.language.matmul` to calculate an output tile of shape `(sz_m, sz_n)`:
    
    
    import neuronxcc.nki.language as nl
    
    i_p = nl.arange(sz_p)[:, None]
    
    i_lhs_f = nl.arange(sz_m)[None, :]
    i_rhs_f = nl.arange(sz_n)[None, :] # same as `i_rhs_f = i_lhs_f`
    
    result = nl.matmul(lhs[i_p, i_lhs_f], rhs[i_p, i_rhs_f], transpose_x=True)
    

Because `i_lhs_f` and `i_rhs_f` look identical to the Neuron Compiler, it cannot distinguish the two input axes if they are passed into the `mask` field directly.

Therefore, we introduce “operand masking” syntax for matmul APIs to let users precisely define the masking on each input to the matmul APIs (currently only matmul APIs support operand masking, subject to change in future releases). Let’s assume we need to constrain `sz_m <= 64` and `sz_n <= 256`:
    
    
    import neuronxcc.nki.language as nl
    
    i_p = nl.arange(sz_p)[:, None]
    
    i_lhs_f = nl.arange(sz_m)[None, :]
    i_rhs_f = nl.arange(sz_n)[None, :] # same as `i_rhs_f = i_lhs_f`
    
    i_lhs_f_virtual = nl.arange(sz_m)[None, :, None]
    
    result = nl.matmul(lhs[i_lhs_f <= 64], rhs[i_rhs_f <= 256], transpose_x=True)
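
The idea behind operand masking can be sketched in plain Python: each operand carries its own limit on its free axis, and the two limits independently bound the rows and columns of the output. The helper `masked_matmul` and its `m_limit`/`n_limit` parameters are hypothetical and only model the intent, not the NKI API.

```python
def masked_matmul(lhs, rhs, m_limit, n_limit):
    """Compute lhs.T @ rhs only for output rows < m_limit and columns < n_limit.

    lhs is [P, M] and rhs is [P, N]; both share the contraction axis P.
    """
    p_size = len(lhs)
    m_valid = min(m_limit, len(lhs[0]))  # mask on the lhs free axis
    n_valid = min(n_limit, len(rhs[0]))  # mask on the rhs free axis
    return [[sum(lhs[p][m] * rhs[p][n] for p in range(p_size))
             for n in range(n_valid)]
            for m in range(m_valid)]

lhs = [[1, 2, 3],
       [4, 5, 6]]        # shape [P=2, M=3]
rhs = [[1, 0, 2, 1],
       [0, 1, 1, 2]]     # shape [P=2, N=4]

result = masked_matmul(lhs, rhs, m_limit=2, n_limit=3)
print(result)
```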
    

There are two notable use cases for masking:

  1. Handling tiling factors that don’t divide the tensor dimension sizes

  2. Skipping ineffectual instructions that compute known output values

We will present an example of the first use case below. Let’s assume we would like to evaluate the exponential function on an input tensor of shape `[sz_p, sz_f]` from HBM. Since the input to `nki.language.load/nki.language.store/nki.language.exp` expects a tile with a partition axis size not exceeding `nki.language.tile_size.pmax == 128`, we should loop over the input tensor using a tile size of `[nki.language.tile_size.pmax, sz_f]`.

However, `sz_p` is not guaranteed to be an integer multiple of `nki.language.tile_size.pmax`. In this case, one option is to write a loop with trip count of `sz_p // nki.language.tile_size.pmax` followed by a single invocation of `nki.language.exp` with an input tile of shape `[sz_p % nki.language.tile_size.pmax, sz_f]`. This effectively “unrolls” the last instance of tile computation, which could lead to messy code in a complex kernel. Using masking here will allow us to avoid such unrolling, as illustrated in the example below:
    
    
    import math
    
    import neuronxcc.nki.language as nl
    from torch_neuronx import nki_jit
    
    @nki_jit
    def tensor_exp_kernel_(in_tensor, out_tensor):
        sz_p, sz_f = in_tensor.shape
    
        i_f = nl.arange(sz_f)[None, :]
    
        # Round up so the last, partially filled tile is also visited
        trip_count = math.ceil(sz_p / nl.tile_size.pmax)
    
        for p in nl.affine_range(trip_count):
            # Generate tensor indices for the input/output tensors
            # pad index to pmax, for simplicity
            i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]
    
            # Load input data from external memory to on-chip memory
            # only read up to sz_p
            in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p < sz_p))
    
            # perform the computation
            out_tile = nl.exp(in_tile, mask=(i_p < sz_p))
    
            # store the results back to external memory
            # only write up to sz_p
            nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p < sz_p))
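
The index arithmetic in this kernel can be checked on the host with a short pure-Python sketch. `PMAX` stands in for `nl.tile_size.pmax`, and `tile_row_ranges` is a hypothetical helper that lists, per loop iteration, the rows that satisfy `i_p < sz_p`.

```python
import math

PMAX = 128  # stands in for nl.tile_size.pmax

def tile_row_ranges(sz_p, pmax=PMAX):
    """Return, per loop iteration, the half-open range of rows that pass the mask i_p < sz_p."""
    trip_count = math.ceil(sz_p / pmax)
    ranges = []
    for p in range(trip_count):
        start = p * pmax
        stop = min(start + pmax, sz_p)  # rows at or beyond sz_p are masked out
        ranges.append((start, stop))
    return ranges

print(tile_row_ranges(300))
# [(0, 128), (128, 256), (256, 300)]
```

With `sz_p = 300`, the loop runs three times and the mask leaves only 44 valid rows in the last tile, so no separate remainder code path is needed.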
    

---

### NKI Engine Selection for Operators Supported on Multiple Engines

## NKI Engine Selection for Operators Supported on Multiple Engines
There is a tradeoff between precision and speed on different engines for operators with multiple engine options. Users can select which engine to map to based on their needs. We take reciprocal and reverse square root as two examples and explain the tradeoff below.

  1. Reciprocal can run on Scalar Engine or Vector Engine:

> Reciprocal can run on Vector Engine with `nki.isa.reciprocal` or on Scalar Engine with `nki.isa.activation(nl.reciprocal)`. Vector Engine performs reciprocal at a higher precision compared to Scalar Engine; however, the computation throughput of reciprocal on Vector Engine is about 8x lower than Scalar Engine for large input tiles. For input tiles with a small number of elements per partition (less than 64, processed one per cycle), instruction initiation interval (roughly 64 cycles) dominates performance so Scalar Engine and Vector Engine have comparable performance. In this case, we suggest using Vector Engine to achieve better precision.
> 
> Estimated cycles on different engines:
> 
> Cost (Engine Cycles) | Condition  
> ---|---  
> `max(MIN_II, N)` | mapped to Scalar Engine `nki.isa.scalar_engine`  
> `max(MIN_II, 8*N)` | mapped to Vector Engine `nki.isa.vector_engine`  
>   
> where,
> 
>   * `N` is the number of elements per partition in the input tile.
> 
>   * `MIN_II` is the minimum instruction initiation interval for small input tiles. `MIN_II` is roughly 64 engine cycles.
> 
> 

> 
> Note `nki.isa.activation(op=nl.reciprocal)` doesn’t support setting bias on NeuronCore-v2.

  2. Reverse square root can run on GpSIMD Engine or Scalar Engine:

> Reverse square root can run on GpSIMD Engine with `nki.isa.tensor_scalar(op0=nl.rsqrt, operand0=0.0)` or on Scalar Engine with `nki.isa.activation(nl.rsqrt)`. GpSIMD Engine performs reverse square root at a higher precision compared to Scalar Engine; however, the computation throughput of reverse square root on GpSIMD Engine is 4x lower than Scalar Engine.
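
The reciprocal cost table in item 1 can be turned into a small estimator for comparing the two engines at a given tile size. This is a sketch of the table's formulas only, not a measurement tool: `reciprocal_cost` is a hypothetical helper, and `MIN_II = 64` uses the approximate value quoted above.

```python
MIN_II = 64  # approximate minimum instruction initiation interval, in engine cycles

def reciprocal_cost(n, engine):
    """Estimated engine cycles for reciprocal on n elements per partition."""
    cycles_per_element = {"scalar": 1, "vector": 8}[engine]
    return max(MIN_II, cycles_per_element * n)

for n in (8, 512):
    print(n, reciprocal_cost(n, "scalar"), reciprocal_cost(n, "vector"))
```

Under these formulas the two estimates coincide whenever `8 * N <= MIN_II`, and for large `N` the Vector Engine estimate is 8x the Scalar Engine estimate.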

## Hardware Engine Enumerations

### engine

`engine` | Neuron Device engines  

---

### nki.isa.engine

class nki.isa.engine(value)
    

Neuron Device engines

Attributes

`tensor` | Tensor Engine  
---|---  
`vector` | Vector Engine  
`scalar` | Scalar Engine  
`gpsimd` | GpSIMD Engine  
`sync` | Sync Engine  
`unknown` | Unknown Engine  

---

### reduce_cmd

`reduce_cmd` | Engine Register Reduce commands  

---

### nki.isa.reduce_cmd

class nki.isa.reduce_cmd(value)
    

Engine Register Reduce commands

Attributes

`idle` | Not using the accumulator registers  
---|---  
`reset` | Resets the accumulator registers to their initial state  
`reset_reduce` | Resets the accumulator registers, then immediately accumulates the results of the current instruction into the accumulators  
`reduce` | Keeps accumulating over the current value of the accumulator registers  
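
One way to read these four commands is as a small state machine over a single accumulator. The sketch below is a host-side emulation under two stated assumptions: the initial state is zero and the reduction is a sum. `ReduceCmd` and `Accumulator` are hypothetical stand-ins that only mirror the attribute names of `nki.isa.reduce_cmd`.

```python
from enum import Enum

class ReduceCmd(Enum):
    """Hypothetical stand-in mirroring the attribute names of nki.isa.reduce_cmd."""
    idle = 0
    reset = 1
    reset_reduce = 2
    reduce = 3

class Accumulator:
    """Single accumulator register; assumes a zero initial state and a sum reduction."""

    def __init__(self):
        self.value = 0

    def issue(self, cmd, result):
        """Apply one instruction's result to the register according to cmd."""
        if cmd is ReduceCmd.reset:
            self.value = 0
        elif cmd is ReduceCmd.reset_reduce:
            self.value = result
        elif cmd is ReduceCmd.reduce:
            self.value += result
        # ReduceCmd.idle leaves the register untouched
        return self.value

acc = Accumulator()
trace = [
    acc.issue(ReduceCmd.reset_reduce, 5),  # start a new accumulation with 5
    acc.issue(ReduceCmd.reduce, 3),        # add 3 on top of it
    acc.issue(ReduceCmd.idle, 100),        # result ignored by the accumulator
    acc.issue(ReduceCmd.reset, 7),         # back to the initial state
]
print(trace)
```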

---

### dge_mode

`dge_mode` | Neuron Descriptor Generation Engine Mode  

---

### nc_version

`nc_version` | NeuronCore version  

---

### get_nc_version

`get_nc_version` | Returns the `nc_version` of the current target context.  

## Debugging

### device_print

`device_print` | Print a message with a String `prefix` followed by the value of a tile `x`.  