# Instructions for writing a DSL mapper

## General instructions for writing a DSL mapper:

### Language Design
Every statement in the program should end with `;` (like `C`)

Functions should start with `def FuncName(Arg1Type arg1, ...)` and the function body needs to be wrapped with `{ ... }`. Functions will be used in the IndexTaskMap statement.

Comments should start with `#` (like `Python`)

### Task Placement
```
Task foo CPU; # for task named "foo", run on CPU
Task * GPU,OMP,CPU; # for any other task, by default try running on GPU first (if a GPU variant exists), then try OpenMP Processor, finally try CPU.
```
The task named `foo` will run on `CPU`. All other tasks use the default (fallback) strategy.
The wildcard `*` describes the fallback policy, followed by a priority list of processor kinds.
The supported processor kinds are: `CPU`, `GPU`, `OMP`.

### Region Placement
There are two modes for Region placement. One is called semantic naming mode, and the other is called index-based mode.
```
Region * * GPU FBMEM; # for any tasks (first *), any regions (second *), if mapped onto GPU, use GPU FrameBuffer Memory as default
Region * * CPU SYSMEM; # for any tasks, any regions, if mapped onto CPU, use CPU System Memory as default

Region * region_name1 GPU ZCMEM;
Region * region_name2 GPU ZCMEM;
```
The first argument is the task name; here `*` means "for any task" (that uses those regions). The second argument is the region name. Here `region_name1` and `region_name2` are the semantic names of the regions; they will be mapped to GPU ZeroCopy memory.

Users can also choose to use index-based approach for placing regions.
```
Region foo 0 GPU ZCMEM; # map task foo's first region onto GPU ZeroCopy memory
Region bar 3 GPU FBMEM; # map task bar's fourth region onto GPU FrameBuffer memory
```


The supported memory kinds include:
  - `SYSMEM`: system memory (available for CPU processors)
  - `FBMEM`: framebuffer memory (available for GPU processors)
  - `ZCMEM`: ZeroCopy memory (available for GPU processors)
  - `SOCKMEM`: Socket memory (available for OpenMP processors)
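The processor/memory pairings above can be encoded in a short, illustrative Python lookup. This dict and function are our own encoding for illustration, not part of the DSL or the Legion runtime:

```python
# Illustrative lookup of which memory kinds pair with which processor
# kinds, as listed above. This encoding is ours, not part of the DSL.
VALID_MEMORY = {
    "CPU": {"SYSMEM"},
    "GPU": {"FBMEM", "ZCMEM"},
    "OMP": {"SOCKMEM"},
}

def memory_is_valid(proc, mem):
    # True only for the processor/memory pairs listed above
    return mem in VALID_MEMORY.get(proc, set())
```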

### Layout Constraint
```
# Task (first *), Region (second *), Processor Kind (third *), List of Constraints
Layout * * * F_order AOS;
```
You should list all the constraints in one Layout statement, e.g., `Layout * * * Align==128 F_order` specifies that the memory should be aligned to 128 bytes while using Fortran order.

Below are all the supported constraints:
- `SOA` refers to `Struct of Array` while `AOS` refers to `Array of Struct`.
- `C_order` and `F_order` are two different orderings (C-style ordering and Fortran-style ordering)
- `Align` can specify memory alignment. For example, `Layout * * * Align==32` specifies that the memory of all data should align to 32 bytes.

### Backpressure
```
InstanceLimit task_4 1;
InstanceLimit task_6 1;
```
On each node, only one instance of `task_4` can be mapped at a time, and likewise only one instance of `task_6`. Tasks are typically mapped ahead of execution; backpressure prevents too many tasks from being mapped ahead of time, which would otherwise consume too much memory.


### Memory Collection
```
CollectMemory task_4 *;
```
`CollectMemory` tells the Legion runtime to stop tracking certain read-only regions as valid instances so that their memory can be collected. This is useful when users know that certain read-only regions will be used only once by some tasks, and keeping them resident would consume too much memory.

### Index Task Launch Placement
```
mcpu = Machine(CPU); # 2-dim tuple: nodes * CPU processors, assuming one Legion runtime per node
mgpu = Machine(GPU); # 2-dim tuple: nodes * GPU processors, assuming one Legion runtime per node

def block4GPUs(Task task) {
    # task.ipoint is a n-dim tuple (in this case n=1) indicating index point within the index launch domain
    # task.ispace is a n-dim tuple (in this case n=1) indicating launch domain, not used here
    return mgpu[task.ipoint[0] / 4, task.ipoint[0] % 4];
}

# specify $task_name(s) and sharding+slicing function
IndexTaskMap calculate_new_currents,distribute_charge,update_voltages block4GPUs;
```
Note that you don't have to write IndexTaskMap statements for index task launches: a cyclic-block distribution is applied by default.
If you write IndexTaskMap statements, you need to provide the function definitions as well.
`IndexTaskMap` takes a comma-separated list of task names and the corresponding sharding+slicing function to use.

At a high level, users map each N-dim point of the launch domain to a point in the machine model to pick a processor.
`Machine($PROC_KIND)` can be used to initialize a 2-dim machine model with respect to a processor kind. The initialized machine model is a 2-dim tuple, and we can access the number of nodes with `mcpu.size[0]` and the number of CPU processors in each node with `mcpu.size[1]`.

The function should always take one `Task` argument and return one point in a machine model. Inside the function, users specify how each index point (`task.ipoint[0]` for a 1D index launch) maps to a point in the machine model.
The following mapping for the task `calculate_new_currents` (with 1D launch space `task.ispace=8`) is produced by the function `block4GPUs` running on a 2-node (`mgpu.size[0]=2`), 4-GPU-per-node (`mgpu.size[1]=4`) machine.
Note that users can replace `4` (the number of GPUs per node) with `mgpu.size[1]` to make the distribution more general.
```
node mapping for calculate_new_currents (with expression task.ipoint[0] / 4):
task.ipoint = 0,1,2,3 will be mapped to node 0
task.ipoint = 4,5,6,7 will be mapped to node 1

processor mapping for calculate_new_currents (with expression task.ipoint[0] % 4):
task.ipoint=0,4 will be mapped to GPU 0 on node 0 and 1 respectively
task.ipoint=1,5 will be mapped to GPU 1 on node 0 and 1 respectively
task.ipoint=2,6 will be mapped to GPU 2 on node 0 and 1 respectively
task.ipoint=3,7 will be mapped to GPU 3 on node 0 and 1 respectively
```
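The arithmetic above can be checked with a minimal Python sketch. This is our own illustration of the `/ 4` and `% 4` expressions, assuming the 2-node, 4-GPU-per-node machine described above; it is not the DSL itself:

```python
# Sketch of the block4GPUs arithmetic for 8 index points on a machine
# with 2 nodes and 4 GPUs per node (sizes assumed for illustration).
GPUS_PER_NODE = 4

def block4gpus(ipoint):
    # node chosen blockwise, GPU by remainder within the node's block
    return (ipoint // GPUS_PER_NODE, ipoint % GPUS_PER_NODE)

mapping = [block4gpus(p) for p in range(8)]
```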

#### Some example mapping functions
```
def block1d(Task task) {
    ip = task.ipoint;
    # is = task.ispace;
    # special case -- one point task per gpu
    return mgpu[ip[0] / mgpu.size[1], ip[0] % mgpu.size[1]];
}

def hierarchicalblock1d(Task task) {
    ip = task.ipoint;
    is = task.ispace;
    node_idx = ip[0] * mgpu.size[0] / is[0]; # block over node
    blk_size = is[0] / mgpu.size[0];
    gpu_ip = ip[0] % blk_size; # index within the node-partitioned block for mapping to GPU
    gpu_idx = gpu_ip * mgpu.size[1] / blk_size; # block over gpu
    return mgpu[node_idx, gpu_idx];
}
```
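To see what `hierarchicalblock1d` computes, here is a minimal Python sketch of the same arithmetic, assuming 2 nodes, 4 GPUs per node, and a 16-point 1D launch domain (all sizes are our assumptions for illustration):

```python
# Python sketch of hierarchicalblock1d: 2 nodes, 4 GPUs per node, and a
# 16-point 1D launch domain (all sizes assumed for illustration).
NODES, GPUS_PER_NODE = 2, 4

def hierarchical_block_1d(ip, ispace):
    node_idx = ip * NODES // ispace               # block over nodes
    blk_size = ispace // NODES                    # points per node
    gpu_ip = ip % blk_size                        # index within the node's block
    gpu_idx = gpu_ip * GPUS_PER_NODE // blk_size  # block over GPUs
    return (node_idx, gpu_idx)
```

With these sizes, each GPU receives two consecutive index points, so the distribution is blockwise at both the node and GPU level.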

### Machine Model Transformation
#### Merge
The `merge` transformation is a method supported on a machine model, and it takes two integers (`dim1`, `dim2`) as the arguments. The `dim1` and `dim2` dimensions will be merged into the `dim1` dimension. More specifically, if `dim1 < dim2`, then the new merged dimension will be `dim1`; if `dim1 > dim2`, then the new merged dimension will be `dim1-1`. The returned machine model `model_new` will have one fewer dimension than `model_old`.
```
model_new = model_old.merge(dim1, dim2);
```
We can guarantee that:
- The merged dimension in the new machine model `merge_dim = dim1 < dim2 ? dim1 : (dim1 - 1)`
- We specify that `dim2` is the faster changing dimension during merging. More specifically, suppose `model_old.size[dim2]` is `dim2_volume`. Then the processor indexed by `model_new[..., i, ...]` where `i` is at index `merge_dim` refers to the same processor as indexing `model_old` with `i / dim2_volume` in `dim1` and `i % dim2_volume` in `dim2`.
- `model_new.size[merge_dim] == model_old.size[dim1] * model_old.size[dim2]`
- `model_new.size` will be a `N-1`-dim tuple if `model_old` is a `N`-dim tuple
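These guarantees can be checked with a minimal Python sketch of the index arithmetic, assuming a 2x4 (nodes x GPUs) machine merged over dimensions 0 and 1 (sizes are our assumptions for illustration):

```python
# Sketch of the merge index arithmetic for a 2x4 model (nodes x GPUs)
# merged into one 8-wide dimension; dim2 is the faster-changing one.
dim1_size, dim2_size = 2, 4

def merged_lookup(i):
    # model_new[i] refers to model_old[i // dim2_size, i % dim2_size]
    return (i // dim2_size, i % dim2_size)

new_size = dim1_size * dim2_size
```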

A real example of the `merge` transformation is below. In general, `merge` transformation can return a new machine model with fewer dimensions, which can be useful to index launches with lower dimensions.
```
m_2d = Machine(GPU); # nodes * processors
m_1d = m_2d.merge(0, 1);

def block_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
    return x[dim1] * z.size[dim2] / y.size[dim1];
}

def block1d(Task task) {
    return m_1d[block_primitive(task.ipoint, task.ispace, m_1d, 0, 0)];
}

IndexTaskMap init_cublas block1d;
```
The task `init_cublas` is a 1D index launch. The machine model `m_1d` is obtained by applying the `merge` transformation to `m_2d`, and then a blockwise distribution is applied. Here we define a function `block_primitive`, a more general blockwise function that will be useful in later examples. It takes an index point `IPoint x`, an index space `ISpace y`, and a machine model `MSpace z` as input, and lets users specify which dimension of the index launch (`int dim1`) and which dimension of the machine model (`int dim2`) to block over. In this case, because both the machine model `m_1d` and the task's index launch (`task.ipoint`, `task.ispace`) are 1D, `dim1` and `dim2` are both set to `0`.

#### Split
The `split` transformation is a method supported on a machine model, and it takes two integers (`split_dim`, `split_factor`) as the arguments. The `split_dim` of the original machine model will be split into two dimensions: `split_dim` (with size `split_factor`) and `split_dim+1`. Therefore, the returned machine model `model_new` will have one more dimension than `model_old`. You may think of `split` as the reverse of `merge`.
```
model_new = model_old.split(split_dim, split_factor);
```
We can guarantee that:
-  The processor indexed by `model_new[..., i, j, ...]` (where `i`,`j` is at index `split_dim` and `split_dim+1`) is the same as `model_old[..., i + j * split_factor]`
- `model_new.size[split_dim] * model_new.size[split_dim+1] == model_old.size[split_dim]`
- `model_new.size[split_dim] == split_factor`
- `model_new.size` will be a `N+1`-dim tuple if `model_old` is a `N`-dim tuple
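The index arithmetic can again be checked with a minimal Python sketch, assuming an 8-wide dimension split with `split_factor=2` (numbers are our assumptions for illustration):

```python
# Sketch of the split index arithmetic: an 8-wide dimension split with
# split_factor=2 into sizes (2, 4); numbers assumed for illustration.
old_size, split_factor = 8, 2

def split_lookup(i, j):
    # model_new[i, j] refers to model_old[i + j * split_factor]
    return i + j * split_factor

# every (i, j) pair covers each old index exactly once
covered = sorted(split_lookup(i, j)
                 for j in range(old_size // split_factor)
                 for i in range(split_factor))
```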

Because [auto_split](#auto_split) is more expressive, the `split` transformation is rarely used in practice.

#### Swap
The `swap` transformation is a method supported on a machine model, and it takes two integers (`dim1`, `dim2`) as the arguments. The returned machine model `model_new` will swap the two dimensions of `dim1` and `dim2`.
```
model_new = model_old.swap(dim1, dim2);
```
We can guarantee that:
-  The processor indexed by `model_new[..., i, ..., j, ...]` (where `i`,`j` is at index `dim1` and `dim2`) is the same as `model_old[..., j, ..., i, ...]`
- `model_new.size[dim1] == model_old.size[dim2]`
- `model_new.size[dim2] == model_old.size[dim1]`
- `model_new.size` will be a `N`-dim tuple if `model_old` is a `N`-dim tuple
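The effect of `swap` is just an index exchange, shown here as a minimal Python sketch on an assumed 2x4 model:

```python
# Sketch of swap on a 2x4 model: the two indices simply exchange roles.
old_size = (2, 4)

def swapped_lookup(i, j):
    # model_new[i, j] refers to model_old[j, i]
    return (j, i)

new_size = (old_size[1], old_size[0])
```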

`swap` can be useful when the default ordering produced by the `split` transformation is not what you expect. The `swap` transformation is rarely used in practice.

#### Slice
The `slice` transformation is a method supported on a machine model, and it takes three integers (`dim`, `low`, `high`) as the arguments. The returned machine model `model_new` is a machine model whose `dim` dimension only contains the subset of processors ranging from `low` to `high` (both ends included), with the other dimensions the same as `model_old`.
```
model_new = model_old.slice(dim, low, high);
```
We can guarantee that:
-  The processor indexed by `model_new[..., i, ...]` (where `i` is at index `dim`) is the same as `model_old[..., i + low, ...]`
- `model_new.size[dim] == high - low + 1`
- `model_new.size[other_dim] == model_old.size[other_dim]` (`other_dim` not equal to `dim`)
- `model_new.size` will be a `N`-dim tuple if `model_old` is a `N`-dim tuple
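A minimal Python sketch of the slice arithmetic, assuming an 8-wide dimension sliced to indices 2 through 5 (numbers are our assumptions for illustration):

```python
# Sketch of slice on an 8-wide dimension, keeping indices 2..5 inclusive
# (low=2, high=5; numbers assumed for illustration).
low, high = 2, 5

def sliced_lookup(i):
    # model_new[i] refers to model_old[i + low]
    return i + low

new_size = high - low + 1
```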

#### Auto_split

The `auto_split` transformation is a method supported on a machine model, and it takes an integer `dim` and a tuple of integers `vec` as the arguments. The `dim` dimension of the original machine model will be split into as many dimensions as the tuple `vec` has.
```
model_new = model_old.auto_split(dim, vec);
```
Suppose `vec` is a 3D tuple, `vec=(4,2,4)`, and we want to split `model_old`'s `dim` dimension of size `model_old.size[dim]=4`. The balanced way to split 4 into 3 factors with respect to `(4,2,4)` is `4=2*1*2`. As a result, `model_new.size[dim]=2, model_new.size[dim+1]=1, model_new.size[dim+2]=2`.

Generally, given an N-dim tuple $vec=(L_1, L_2, ..., L_N)$, `auto_split` aims to automatically split `model_old`'s `dim` dimension of size $O$ into an N-dim tuple $(O_1, O_2, ..., O_N)$ satisfying the following properties:
- $O_1 * O_2 * ... * O_N = O$
- Define $W_i = \frac{L_i}{O_i}$. We guarantee that $\sum_{i \neq j} W_i W_j$ is minimized


It can be proven that the minimum is achieved when $W_i$ are as close to each other as possible. In the above example, $(L_1,L_2,L_3)=(4,2,4)$, we split $O=4$ into $(O_1,O_2,O_3)=(2,1,2)$ such that $(W_1,W_2,W_3)=(\frac{L_1}{O_1}, \frac{L_2}{O_2}, \frac{L_3}{O_3}) = (2,2,2)$ are equal to each other.

This transformation primitive is quite useful in practice. If `dim` represents the node dimension (i.e., $O$ is the number of nodes) and $vec$ is the task's index launch domain, then `auto_split` can minimize inter-node communication assuming a stencil computation pattern.

We can guarantee that:

- `model_new.size` will be a `N-1+vec.len`-dim tuple if `model_old` is a `N`-dim tuple (one dimension is replaced by `vec.len` dimensions)
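The objective can be checked with a brute-force Python sketch. This is our own illustration of the minimization criterion, not the runtime's actual algorithm:

```python
# Brute-force sketch of the auto_split objective: find factors
# (O_1, ..., O_N) of O minimizing sum over i != j of W_i * W_j with
# W_i = L_i / O_i. Our own illustration, not the runtime's algorithm.
from itertools import product
from math import prod

def auto_split_factors(O, vec):
    best, best_cost = None, None
    divisors = [d for d in range(1, O + 1) if O % d == 0]
    for cand in product(divisors, repeat=len(vec)):
        if prod(cand) != O:
            continue
        W = [L / o for L, o in zip(vec, cand)]
        cost = sum(W[i] * W[j]
                   for i in range(len(W))
                   for j in range(len(W)) if i != j)
        if best_cost is None or cost < best_cost:
            best, best_cost = cand, cost
    return best
```

Running this on the example above reproduces the balanced split: 4 factored against `(4,2,4)` yields `(2,1,2)`.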

Below is a real use case for `auto_split`.

```
def block_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
    return x[dim1] * z.size[dim2] / y.size[dim1];
}

def cyclic_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
    return x[dim1] % z.size[dim2];
}

m_2d = Machine(GPU); # nodes * processors
def auto3d(Task task) {
    m_4d = m_2d.auto_split(0, task.ispace); # split the original 0 dim into 0,1,2 dim
    # subspace: task.ispace / m_4d[:-1]
    m_6d = m_4d.auto_split(3, task.ispace / m_4d[:-1]); # split the processor dim (previously 1, now 3) into dims 3,4,5 w.r.t. the subspace
    upper = tuple(block_primitive(task.ipoint, task.ispace, m_6d, i, i) for i in (0,1,2));
    lower = tuple(cyclic_primitive(task.ipoint, task.ispace, m_6d, i, i + 3) for i in (0,1,2));
    return m_6d[*upper, *lower];
}
IndexTaskMap task_5 auto3d; # task_5 launch space: (rpoc, rpoc, c)
```

The first transformation is done via `m_2d.auto_split(0, task.ispace)`, where `0` stands for the node dimension of the original machine model and `task.ispace` is a tuple of integers representing the launch domain. The launch domain is 3D, so the original dimension 0 is split into 3 dimensions; the resulting machine model is `m_4d`, whose last dimension (dimension 3) corresponds to `m_2d`'s last dimension (dimension 1, the processors per node).

Imagine that the launch domain is mapped onto the nodes and each node gets a subset of the index points. Each node's workload can then be computed as another tuple: `task.ispace / m_4d[:-1]`. Here `m_4d[:-1]` returns the leading dimensions of a machine model as a tuple of integers, and the `/` operator is supported directly on two tuples of integers (element-wise).

To express the mapping more easily, we use `m_4d.auto_split(3, task.ispace / m_4d[:-1])` to obtain a 6D machine model. Here we split the last dimension (`3`) of `m_4d` with respect to the workload tuple.

The resulting 6D machine model is intuitive: the first 3 dimensions correspond to the node dimension, and the last 3 correspond to the processor dimension. For each point in the launch domain, we choose nodes blockwise and processors cyclically. This is specified by computing the `upper` and `lower` 3D tuples with `tuple` construction and the user-defined functions `block_primitive` and `cyclic_primitive`.

Finally, the `*` operator unpacks a tuple of integers into individual integers (e.g., `(1,2,1)` becomes `1,2,1`), which we use to index the machine model and pick a specific processor via `m_6d[*upper, *lower]`.
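The two primitives used by `auto3d` reduce to simple integer arithmetic, sketched here in Python with assumed sizes (8 index points, 2 nodes, 4 processors):

```python
# Python sketch of the two primitives used by auto3d: blockwise for node
# dimensions, cyclic for processor dimensions (sizes assumed).
def block_primitive(x, ispace_size, machine_size):
    # same arithmetic as the DSL's block_primitive
    return x * machine_size // ispace_size

def cyclic_primitive(x, machine_size):
    # same arithmetic as the DSL's cyclic_primitive
    return x % machine_size

# e.g. 8 index points blocked over 2 nodes and cycled over 4 processors
nodes = [block_primitive(p, 8, 2) for p in range(8)]
procs = [cyclic_primitive(p, 4) for p in range(8)]
```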

### Single Task Launch Placement
```
m_2d = Machine(GPU); # nodes * processors
def same_point(Task task) {
    return m_2d[*task.parent.processor(m_2d)]; # same point as its parent
}
SingleTaskMap task_4 same_point;
```
For tasks that are not index launches, users can also specify where to place them.

The above example code specifies that `task_4` will be placed on the same processor as its parent task (the originating processor). `task.parent` is another `Task` object, and `task.parent.processor(m_2d)` returns a tuple representing the parent's position with respect to the machine model `m_2d`; e.g., `(1, 1)` means that `task_4`'s parent is placed on the second node's second GPU. `*` turns `(1, 1)` into `1, 1`, which we use to index `m_2d` again so that `task_4` is placed on exactly the same processor as its parent.

```
def same_node(Task task) {
    return m_2d[task.parent.processor(m_2d)[0], *]; # same node as its parent
}
```
Users can also write `*` to denote a set of points in the machine model, defining a round-robin strategy. The `same_node` function above specifies that the task will be placed on the same node as its parent (`task.parent.processor(m_2d)[0]`), and that any processor on that node (`*`) is acceptable; the runtime makes the choice dynamically.
