## Hardware Architecture Summary: AWS Trainium1 (NeuronCore-v2)

### Device Overview

A Trainium1 NeuronDevice (as used in trn1 instances) contains 2 NeuronCore-v2 cores. Each NeuronCore-v2 is a fully independent heterogeneous compute unit with four compute engines (Tensor, Vector, Scalar, GpSimd) and software-managed on-chip SRAM. An NKI kernel executes on a single NeuronCore. There is no hardware cache: all data movement between memory levels is explicit and software-controlled.

The programming model is tile-based: computation operates on multi-dimensional tiles that must be explicitly loaded from off-chip memory (HBM) into on-chip memory (SBUF), computed upon, and stored back. NKI provides near-ISA-level control over tiling, data layout, memory placement, and engine targeting. The compiler back-end handles memory allocation and instruction scheduling unless overridden via direct allocation.

### Memory Hierarchy

**HBM (Device Memory):** 32 GiB total per device (shared by 2 NeuronCores), 820 GB/s bandwidth. Linear address space; most performant with sequential access. All kernel inputs/outputs reside here. Achieving 60%+ memory bandwidth utilization (MBU) is considered good.

**SBUF (State Buffer) — Main On-Chip SRAM:** 24 MiB total per NeuronCore, organized as 128 partitions × 192 KiB each (16 KiB reserved per partition, leaving 176 KiB usable). This is the primary working memory — all compute engines can read/write SBUF. Bandwidth is approximately 20× higher than HBM, sufficient to keep all engines busy. The free dimension supports up to 64K elements per partition. SBUF is 2D: the partition dimension (P, up to 128) is the parallel axis; the free dimension (F) is the sequential/streaming axis.

**PSUM (Partial Sum Buffer) — Accumulator Memory:** 2 MiB total per NeuronCore, organized as 128 partitions × 16 KiB each. Each partition is divided into 8 banks, each holding up to 512 FP32 values (2 KiB). PSUM is dedicated to accumulating Tensor Engine matmul results via near-memory read-accumulate-write. Accumulation is always in FP32. PSUM should be treated as transient storage — evict results to SBUF as soon as possible due to limited capacity. The free dimension supports up to 4K elements per partition.
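The capacity figures above follow from simple arithmetic, and checking them is a quick way to see which limit binds for a given tile. The sketch below uses only the numbers quoted in this section; the constant and function names are illustrative, and it assumes "64K" means 65,536 elements.

```python
KIB = 1024
MIB = 1024 * KIB

PARTITIONS = 128
SBUF_PARTITION_BYTES = 192 * KIB
SBUF_RESERVED_BYTES = 16 * KIB
PSUM_PARTITION_BYTES = 16 * KIB
PSUM_BANKS = 8
SBUF_MAX_FREE_ELEMENTS = 64 * 1024  # assumption: "64K" = 65,536 elements

sbuf_total_bytes = PARTITIONS * SBUF_PARTITION_BYTES
sbuf_usable_bytes = SBUF_PARTITION_BYTES - SBUF_RESERVED_BYTES
psum_total_bytes = PARTITIONS * PSUM_PARTITION_BYTES
psum_bank_bytes = PSUM_PARTITION_BYTES // PSUM_BANKS
psum_bank_fp32 = psum_bank_bytes // 4            # FP32 values per bank
psum_partition_fp32 = PSUM_PARTITION_BYTES // 4  # FP32 values per partition


def fits_in_sbuf_partition(free_elements, bytes_per_element):
    """True if one tile row respects both the element-count and byte-capacity limits."""
    within_count = free_elements <= SBUF_MAX_FREE_ELEMENTS
    within_bytes = free_elements * bytes_per_element <= sbuf_usable_bytes
    return within_count and within_bytes


print(sbuf_total_bytes // MIB, "MiB SBUF;", psum_total_bytes // MIB, "MiB PSUM")
print(psum_bank_fp32, "FP32 values per PSUM bank;", psum_partition_fp32, "per partition")
print(fits_in_sbuf_partition(64 * 1024, 2), fits_in_sbuf_partition(64 * 1024, 4))
```

Under these numbers a 64K-element row fits for 2-byte types (128 KiB) but not for 4-byte types (256 KiB), so for FP32 the per-partition byte capacity is the tighter limit.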

**Memory access characteristics:** SBUF and PSUM support up to 4D tensorized access patterns along the free dimension with per-dimension stride. Peak bandwidth requires the most-minor free dimension stride to be < 16 bytes; strides ≥ 16 bytes incur a 50% bandwidth penalty. Each tensor access request has ~60 cycles of static overhead, amortized by large tile sizes. The partition dimension does not support striding — tensors must occupy contiguous partitions.

**Spilling:** When live data exceeds SBUF/PSUM capacity, the compiler inserts spill-save (SBUF→HBM) and spill-reload (HBM→SBUF) transfers. Spill traffic exceeding 30% of total SBUF↔HBM traffic is a significant performance concern.
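As a rough illustration of the 30% guideline, the check below computes the spill share of SBUF↔HBM traffic. The function names are hypothetical, and the byte counts are assumed to come from whatever profiler output is available; this section does not specify where they are reported.

```python
SPILL_CONCERN_FRACTION = 0.30  # threshold quoted above


def spill_fraction(spill_save_bytes, spill_reload_bytes, total_traffic_bytes):
    """Share of all SBUF<->HBM traffic (spill included) that is spill traffic."""
    if total_traffic_bytes <= 0:
        raise ValueError("total traffic must be positive")
    return (spill_save_bytes + spill_reload_bytes) / total_traffic_bytes


def spill_is_concern(spill_save_bytes, spill_reload_bytes, total_traffic_bytes):
    """True when spill traffic exceeds the 30% guideline."""
    fraction = spill_fraction(spill_save_bytes, spill_reload_bytes, total_traffic_bytes)
    return fraction > SPILL_CONCERN_FRACTION


# Made-up byte counts, purely for illustration.
print(spill_fraction(100, 300, 1000), spill_is_concern(100, 300, 1000))
```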

### Compute Engines

All four engines execute asynchronously in parallel, synchronized via compiler-inserted semaphores. Instruction-level parallelism across engines is the primary mechanism for high throughput.

**Tensor Engine (TensorE):** A 128×128 systolic array accounting for >90% of NeuronCore FLOPS. It reads from SBUF and writes to PSUM. Performance: 92 TFLOPS for BF16/FP16/TF32/cFP8; 23 TFLOPS for FP32 (4× slower). Each `nc_matmul(stationary, moving)` computes `stationary.T @ moving` via a LoadStationary (LS) + MultiplyMoving (MM) instruction pair. Tile size limits: stationary free axis ≤ 128, moving free axis ≤ 512, contraction (partition) axis ≤ 128. The contraction axis of both inputs must be mapped to the partition dimension. Fast LoadStationary is up to 4× faster than MM with the same free axis size, so when the two inputs differ in free axis size, the one with the larger free axis (within the 128 limit) should be stationary. The MM initiation interval is ~max(N, 64) TensorE cycles for BF16/FP16/TF32/cFP8, where N is the moving free axis size (MM_INIT_LATENCY = 64 cycles on trn1). Best throughput comes from back-to-back `nc_matmul` calls at maximum tile sizes. TensorE can also perform 128×128 transposes (via identity matrix multiply), partition broadcasts, and cross-partition summation, though these are secondary uses. Accumulation is always in FP32. The required PSUM accumulation pattern is: `psum_buf = nl.zeros(..., buffer=nl.psum)` + `nl.affine_range` loop + `psum_buf += nl.matmul(...)`.
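To make the `stationary.T @ moving` semantics and the tile limits concrete, here is a plain-Python reference model. It is not NKI code and does not run on TensorE: `nc_matmul_ref`, `matmul_k_tiled`, `tile_counts`, and `estimated_mm_cycles` are hypothetical helpers, and the cycle estimate only applies the ~max(N, 64) initiation interval quoted above, ignoring LoadStationary and data movement.

```python
import math

PMAX = 128             # max contraction (partition) axis per call
STATIONARY_FMAX = 128  # max stationary free axis per call
MOVING_FMAX = 512      # max moving free axis per call
MM_INIT_LATENCY = 64   # trn1 value quoted above


def nc_matmul_ref(stationary, moving):
    """Reference semantics: stationary is K x M, moving is K x N, result is M x N."""
    k_size, m_size, n_size = len(stationary), len(stationary[0]), len(moving[0])
    assert len(moving) == k_size, "contraction axis must match"
    assert k_size <= PMAX and m_size <= STATIONARY_FMAX and n_size <= MOVING_FMAX
    return [[sum(stationary[k][m] * moving[k][n] for k in range(k_size))
             for n in range(n_size)] for m in range(m_size)]


def matmul_k_tiled(lhs_t, rhs, k_tile=PMAX):
    """Accumulate partial products over K tiles, mimicking PSUM accumulation."""
    k_size, m_size, n_size = len(lhs_t), len(lhs_t[0]), len(rhs[0])
    acc = [[0.0] * n_size for _ in range(m_size)]  # stands in for a zeroed PSUM tile
    for k0 in range(0, k_size, k_tile):
        partial = nc_matmul_ref(lhs_t[k0:k0 + k_tile], rhs[k0:k0 + k_tile])
        for m in range(m_size):
            for n in range(n_size):
                acc[m][n] += partial[m][n]
    return acc


def tile_counts(m_size, k_size, n_size):
    """Number of tiles along M (stationary free), K (contraction), N (moving free)."""
    return (math.ceil(m_size / STATIONARY_FMAX),
            math.ceil(k_size / PMAX),
            math.ceil(n_size / MOVING_FMAX))


def estimated_mm_cycles(m_size, k_size, n_size):
    """Rough MM-only estimate: each call costs about max(n, 64) cycles."""
    m_tiles, k_tiles, _ = tile_counts(m_size, k_size, n_size)
    per_pass = sum(max(min(MOVING_FMAX, n_size - n0), MM_INIT_LATENCY)
                   for n0 in range(0, n_size, MOVING_FMAX))
    return m_tiles * k_tiles * per_pass


lhs_t = [[1, 2], [3, 4], [5, 6]]  # K=3, M=2
rhs = [[1, 0], [0, 1], [1, 1]]    # K=3, N=2
print(matmul_k_tiled(lhs_t, rhs, k_tile=2))
print(tile_counts(4096, 4096, 4096), estimated_mm_cycles(4096, 4096, 4096))
```

The last line of the estimate also shows why small moving tiles are wasteful: a call with a 16-element moving free axis is still charged 64 cycles under this model.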

**Vector Engine (VectorE):** 128 parallel vector lanes (one per SBUF partition), deeply pipelined. Handles operations where each output depends on multiple inputs: reductions, tensor-tensor element-wise ops, layer normalization, pooling. Performance: 2.3 TFLOPS FP32. Supports all NKI data types with automatic FP32 internal arithmetic. Free dimension up to 64K elements (SBUF) or 4K (PSUM). Cost for free axis size N > 128: ~N cycles for one-input ops, ~2N cycles for two-input ops. Cross-partition data movement limited to groups of 32 partitions (32×32 transpose, 32-partition shuffle).

**Scalar Engine (ScalarE):** 128 parallel lanes, deeply pipelined. Handles element-wise operations where each output depends on one input: activations (GELU, EXP, SQRT, etc.), scale/bias. Performance: 2.9 TFLOPS FP32. Supports all NKI data types with FP32 internal arithmetic. Key feature: pipelined multiply-add-activate in a single instruction via `nki.isa.activation` — `out = func(in * scale + bias)` — giving up to 2× speedup over separate instructions. Also supports pipelined reduction (`activation_reduce`). All activation instructions have the same cost regardless of scale/bias enablement, so always combine operations.
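The fused form `out = func(in * scale + bias)` can be pinned down with a small reference model. This is plain Python, not a call to `nki.isa.activation`; `activation_ref` and `unfused_ref` are hypothetical names used only to show that one fused pass produces the same values as three separate element-wise passes.

```python
import math


def activation_ref(func, values, scale=1.0, bias=0.0):
    """One pass: multiply, add, and apply the activation per element."""
    return [func(v * scale + bias) for v in values]


def unfused_ref(func, values, scale, bias):
    """Same math as three separate element-wise passes."""
    scaled = [v * scale for v in values]   # pass 1: scale
    shifted = [v + bias for v in scaled]   # pass 2: bias
    return [func(v) for v in shifted]      # pass 3: activation


data = [0.0, 1.0, 2.5]
fused = activation_ref(math.exp, data, scale=2.0, bias=-2.0)
separate = unfused_ref(math.exp, data, scale=2.0, bias=-2.0)
print(fused == separate)
```

Because the section states that an activation instruction costs the same with or without scale and bias, folding the first two passes into the third removes work from ScalarE at no extra cost.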

**GpSimd Engine (GpSimdE):** 8 fully programmable 512-bit vector processors, each with 64 KB tightly-coupled memory (3-cycle latency). Each processor connects to 16 SBUF partitions. Total: 128 FP32 lanes, 256 FP16 lanes, or 512 INT8 lanes. Used for custom operators and operations not natively supported by other engines.

**Engine-memory access constraints:** VectorE and GpSimdE cannot access SBUF simultaneously (serialized). VectorE and ScalarE cannot access PSUM simultaneously (serialized). Valid simultaneous SBUF access: {VectorE or GpSimdE} + ScalarE + TensorE. Valid simultaneous PSUM access: {VectorE or ScalarE} + TensorE.
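The serialization rules can be encoded as a small lookup, which is handy when reasoning about which instructions can overlap. This sketch models only the two conflict pairs stated above; the names are illustrative, and it says nothing about engine and memory combinations this section does not mention.

```python
# Engine pairs that the text says are serialized on each memory.
SERIALIZED_PAIRS = {
    "SBUF": [frozenset({"VectorE", "GpSimdE"})],
    "PSUM": [frozenset({"VectorE", "ScalarE"})],
}


def can_access_simultaneously(memory, engines):
    """True if no stated serialized pair is fully contained in `engines`."""
    active = set(engines)
    return not any(pair <= active for pair in SERIALIZED_PAIRS[memory])


print(can_access_simultaneously("SBUF", ["VectorE", "ScalarE", "TensorE"]))
print(can_access_simultaneously("SBUF", ["VectorE", "GpSimdE"]))
print(can_access_simultaneously("PSUM", ["ScalarE", "TensorE"]))
```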

### DMA Engines

16 DMA engines per NeuronCore, each capable of one transfer at a time at a peak of 27 GiB/s. All 16 operate in parallel. A single `nl.load`/`nl.store` of 128 partitions maps to 16 transfers (8 partitions each). Minimum transfer size for good bandwidth: ≥ 32 KiB per engine (e.g., 8 partitions × 1024 elements × 4 bytes). For efficient DMA, maximize both the partition dimension (ideally 128) and the free dimension (≥ 4 KiB per partition; sweet spot ~1024 elements, which is 4 KiB at 4 bytes per element). `nl.load_transpose2d` has much lower bandwidth than `nl.load`; prefer `nl.load` + `nisa.nc_transpose` on TensorE when TensorE is idle.
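The 32 KiB guideline translates into a minimum free-dimension size that depends on the element width. The sketch below covers only the 128-partition case described above (16 transfers of 8 partitions each); the helper names are hypothetical.

```python
import math

KIB = 1024
PARTITIONS_PER_TRANSFER = 8          # 128 partitions split over 16 DMA engines
MIN_BYTES_PER_TRANSFER = 32 * KIB    # guideline quoted above


def bytes_per_transfer(free_elements, bytes_per_element):
    """Bytes moved by one DMA engine for a 128-partition load or store."""
    return PARTITIONS_PER_TRANSFER * free_elements * bytes_per_element


def min_free_elements(bytes_per_element):
    """Smallest free-dimension size that reaches the 32 KiB guideline."""
    return math.ceil(MIN_BYTES_PER_TRANSFER / (PARTITIONS_PER_TRANSFER * bytes_per_element))


print(bytes_per_transfer(1024, 4) // KIB, "KiB per engine for 1024 FP32 elements")
print(bytes_per_transfer(1024, 2) // KIB, "KiB per engine for 1024 BF16 elements")
print(min_free_elements(4), min_free_elements(2))
```

By this arithmetic a 2-byte type needs about twice as many free-dimension elements as FP32 to move the same number of bytes per engine.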

### Key Tile Size Constraints

| Constraint | Value | Constant |
|---|---|---|
| Max partition dimension (SBUF/PSUM) | 128 | `nl.tile_size.pmax` |
| Max PSUM free dimension per bank | 512 (FP32 elements) | `nl.tile_size.psum_fmax` |
| Max matmul stationary free axis | 128 | `nl.tile_size.gemm_stationary_fmax` |
| Max matmul moving free axis | 512 | `nl.tile_size.gemm_moving_fmax` |
| Max SBUF free dimension per partition | 64K elements | (none) |
| Max PSUM free dimension per partition (all 8 banks) | 4K elements | (none) |

Partition start alignment: pdim_size > 64 → start at 0; 32 < pdim_size ≤ 64 → start at 0 or 64; pdim_size ≤ 32 → start at 0, 32, 64, or 96.
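The alignment rule is easy to get wrong at the boundaries, so a small validator can help. This is a sketch of the rule as stated here, with hypothetical function names; it checks only the start-partition constraint and the 128-partition limit.

```python
PMAX = 128


def allowed_partition_starts(pdim_size):
    """Start partitions permitted for a tile spanning `pdim_size` partitions."""
    if not 1 <= pdim_size <= PMAX:
        raise ValueError("pdim_size must be in 1..128")
    if pdim_size > 64:
        return (0,)
    if pdim_size > 32:
        return (0, 64)
    return (0, 32, 64, 96)


def is_valid_partition_range(start, pdim_size):
    """True if the tile starts at an allowed partition and stays within 128."""
    return start in allowed_partition_starts(pdim_size) and start + pdim_size <= PMAX


print(allowed_partition_starts(128), allowed_partition_starts(64), allowed_partition_starts(32))
print(is_valid_partition_range(64, 48), is_valid_partition_range(32, 48))
```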

### Key Optimization Principles

The arithmetic intensity threshold to saturate TensorE on NeuronCore-v2 is approximately 222 Flops/Byte for BF16. Below this, kernels are memory-bound; above, compute-bound. Maximizing tile sizes (especially blocking along free dimensions to increase data reuse from SBUF) is the primary lever for crossing this threshold.
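One way to sanity-check the threshold is a roofline estimate: peak FLOPS divided by the HBM bandwidth available to one core. The sketch assumes each of the two cores gets half of the 820 GB/s device bandwidth, which is an assumption of this example rather than something stated here; it lands near 224, close to the ~222 quoted. The matmul intensity formula further assumes each operand is read once and the output written once at 2 bytes per element.

```python
PEAK_BF16_FLOPS = 92e12            # per NeuronCore, quoted above
HBM_BYTES_PER_S_DEVICE = 820e9     # per device, quoted above
CORES_PER_DEVICE = 2
QUOTED_THRESHOLD = 222.0           # Flops/Byte, quoted above


def ridge_point():
    """Flops/Byte at which compute and memory time balance (even bandwidth split assumed)."""
    return PEAK_BF16_FLOPS / (HBM_BYTES_PER_S_DEVICE / CORES_PER_DEVICE)


def matmul_intensity(m, k, n, bytes_per_element=2):
    """Flops per HBM byte for C[m,n] = A[m,k] @ B[k,n] with ideal on-chip reuse."""
    flops = 2 * m * k * n
    hbm_bytes = bytes_per_element * (m * k + k * n + m * n)
    return flops / hbm_bytes


def is_compute_bound(intensity, threshold=QUOTED_THRESHOLD):
    return intensity > threshold


print(round(ridge_point(), 1))
for size in (256, 4096):
    ai = matmul_intensity(size, size, size)
    print(size, round(ai, 1), "compute-bound" if is_compute_bound(ai) else "memory-bound")
```

The ideal-reuse figure is an upper bound: a kernel whose tiles are too small to hold operands in SBUF re-reads them from HBM and ends up with a lower effective intensity.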

The dominant compute resource is TensorE at 92 TFLOPS vs. ~2-3 TFLOPS for Vector/Scalar engines — a ~30-40× ratio. Kernels should maximize TensorE utilization and minimize non-matmul work on the critical path. Using FP32 inputs to TensorE costs 4× throughput; downcast to BF16/FP16/TF32/cFP8 before matmul.

An instruction that uses fewer than 128 partitions leaves lanes idle on whichever engine runs it. Free dimension sizes below 128 elements per partition are dominated by the ~100-cycle static instruction overhead. The sweet spot for the free dimension is ~1024 elements.

Double/multi-buffering of SBUF tiles enables overlapping DMA and compute across engines. Declaring buffers inside inner loops (smaller scope) reduces spilling pressure versus outer-scope declarations. Use `nl.affine_range` for loops without carried dependencies (enables compiler optimization); use `nl.sequential_range` when loop order must be preserved.
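The benefit of double-buffering can be estimated with an idealized two-stage pipeline model. This sketch assumes constant per-tile load and compute times, perfect overlap between DMA and compute, and no store phase; the function names and the example timings are made up for illustration.

```python
def serial_time(n_tiles, load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return n_tiles * (load + compute)


def double_buffered_time(n_tiles, load, compute):
    """Load of tile i+1 overlaps compute of tile i (two SBUF buffers)."""
    if n_tiles == 0:
        return 0
    # First load is exposed, last compute is exposed, and the slower
    # stage sets the pace for everything in between.
    return load + (n_tiles - 1) * max(load, compute) + compute


n_tiles, load, compute = 4, 3, 5  # arbitrary time units
print(serial_time(n_tiles, load, compute))
print(double_buffered_time(n_tiles, load, compute))
```

In this model the steady-state cost per tile drops from `load + compute` to `max(load, compute)`, so the gain is largest when the two stages take similar time.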