To write a mapper, users typically need to inherit from the `DefaultMapper` class and override certain virtual functions in C++.

Here is the full documentation of the Legion C++ mapping interface:

The Legion mapper interface is a key part of the Legion programming system.
Through the mapping interface applications can control most decisions that
impact application performance. The philosophy is that these choices are
better left to applications rather than using hard-wired heuristics in Legion
that attempt to “do the right thing” in every situation. The few performance
heuristics that are included in Legion are associated with low levels of the
system where there is no good way to expose those choices to the application.
For everything else applications can set the policies.
This design resulted from our own past experience with systems where
built-in performance heuristics did not behave as we desired and there was
no recourse to override those decisions. While Legion does allow experts
to squeeze every last bit of performance from a system, it is important to
realize that doing so potentially requires understanding and setting a wide
variety of parameters exposed in the mapping interface. This level of control
can be overwhelming at first to users who are not used to considering all the
possible dimensions that influence performance in complex, distributed and
heterogeneous systems.
To help users write initial versions of their applications without needing
to concern themselves with tuning the performance knobs exposed by the
mapper interface, Legion provides a default mapper. The default mapper
implements the Legion mapper API (like any other mapper) and provides
a number of heuristics that can provide reasonably performant, or at least
correct, initial settings. A good way to think about the default mapper is
that it is the version of Legion with built-in heuristics that allows casual
users to write Legion applications and allows experts to start quickly on a
new application. It is, however, unreasonable to expect the default mapper
to provide excellent performance; in particular, it is a mistake to assume that
the performance of an application using the default mapper even approximates
the performance that could be achieved with a custom mapper.
We will use several examples from the default mapper to illustrate how
mappers are constructed. We will also describe where possible the heuristics
that the default mapper employs to achieve reasonable performance. Because
the default mapper uses generic heuristics with no specific knowledge of the
application, it is almost certain to make poor decisions at least some of the
time. Performance benchmarking using only the default mapper is strongly
discouraged, while using custom application-specific mappers is encouraged.
It is likely that the moment when you are dissatisfied with the heuristics
in the default mapper will come sooner rather than later. At that point
the information in this chapter will be necessary for you to write your own
custom mapper. In practice, our experience has been that in many cases
all that is necessary is to replace a small number of policies in the default
mapper that are a poor fit for the application.
7.1 Mapper Organization
The Legion mapper interface is an abstract C++ class that defines a set of
pure virtual functions that the Legion runtime invokes as callbacks for making
performance-related decisions. A Legion mapper is a class that inherits from
the base abstract class and provides implementations of the associated pure
virtual methods.
A callback is just a function pointer—when the runtime system calls a
mapper function, it is said to have “invoked the callback”. Callbacks are
a commonly used mechanism in software systems for parameterizing some
specific functionality; in our case mappers parameterize the performance
heuristics of the Legion runtime system. There are a few general things to
keep in mind about mappers and callbacks:
• The runtime may invoke callbacks in an unpredictable order. While
multiple callbacks associated with a single instance of a Legion object,
such as a task, will happen in a specific order for that task, other
callbacks for other operations may be interleaved.
• Depending on the synchronization model selected (see Section 7.1.2),
mappers may have a degree of concurrency between mapper callbacks.
• Since mappers are C++ objects, they can have arbitrary internal
state. For example, it may be useful to maintain performance or
load-balancing statistics that inform mapping decisions. However, state
updates done by a mapper must take into account the unpredictable
order in which callbacks are invoked as well as any issues of concurrent
access to mapper data structures.
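The callback pattern described above can be sketched generically. The following mock (these types and names are illustrative stand-ins, not Legion's actual API) shows a runtime invoking user-supplied function pointers that were registered ahead of time:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for a runtime's callback mechanism.
using Callback = void (*)(int &state);

struct MockRuntime {
  std::vector<Callback> callbacks;
  void add_callback(Callback cb) { callbacks.push_back(cb); }
  // "Invoking the callback": the runtime, not the user, decides when
  // each registered function pointer is called.
  void run(int &state) {
    for (Callback cb : callbacks)
      cb(state);
  }
};

void double_state(int &state) { state *= 2; }
void increment_state(int &state) { state += 1; }

int run_mock(int initial) {
  MockRuntime rt;
  rt.add_callback(double_state);
  rt.add_callback(increment_state);
  rt.run(initial);
  return initial;
}
```

The user parameterizes the runtime's behavior by supplying the functions; the runtime controls when and in what order they fire, which is exactly why mapper state updates must tolerate unpredictable invocation orders.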
7.1.1 Mapper Registration
After the Legion runtime is created, but before the application begins, mapper
objects can be registered with the runtime. Figure 7.1 gives a small example
registering a custom mapper.
To register CustomMapper objects, the application adds the mapper
callback function by invoking the Runtime::add_registration_callback
method, which takes as an argument a function pointer to be invoked. The
function pointer must have a specific type, taking as arguments a Machine
object, a Runtime pointer, and a reference to an STL set of Processor
objects. The call can be invoked multiple times to record multiple callback
functions (e.g., to register multiple custom mappers, perhaps for different
libraries). All callback functions must be added prior to the invocation of
the Runtime::start method. We recommend that applications include the
registration method as a static method on the mapper class (as in Figure 7.1)
so that it is closely coupled to the custom mapper itself.
Before invoking any of the registration callback functions, the runtime
creates an instance of the default mapper for each processor of the system.
The runtime then invokes the callback functions in the order they were added.
Each callback function is invoked once on each instance of the Legion runtime.
For multi-process jobs, there will be one copy of the Legion runtime per
process and therefore one invocation of each callback per process. The set
of processors passed into each registration callback function will be the set
of application processors that are local to the process1, thereby providing
a registration callback function with the necessary context to know which
processors it will create new custom mappers for. If no callback functions
are registered then the only mappers that will be available are instances of
the default mapper associated with each application processor.
Upon invocation, the registration callbacks should create instances of
custom mappers and associate them with application processors. This
step can be done through one of two runtime mapper calls. The mapper
can replace the default mappers (always registered with MapperID 0) by

1  void top_level_task(const Task *task,
2                      const std::vector<PhysicalRegion> &regions,
3                      Context ctx,
4                      Runtime *runtime)
5  {
6    printf("Running top level task...\n");
7  }
8
9  class CustomMapperA : public DefaultMapper {
10 public:
11   CustomMapperA(MapperRuntime *rt, Machine m, Processor p)
12     : DefaultMapper(rt, m, p) { }
13 public:
14   static void register_custom_mappers(Machine machine, Runtime *rt,
15                                       const std::set<Processor> &local_procs);
16 };
17
18 /*static*/
19 void CustomMapperA::register_custom_mappers(Machine machine, Runtime *rt,
20                                             const std::set<Processor> &local_procs)
21 {
22   printf("Replacing default mappers with custom mapper A on all processors...\n");
23   MapperRuntime *const map_rt = rt->get_mapper_runtime();
24   for (std::set<Processor>::const_iterator it = local_procs.begin();
25        it != local_procs.end(); it++)
26   {
27     rt->replace_default_mapper(new CustomMapperA(map_rt, machine, *it), *it);
28   }
29 }
30
31 class CustomMapperB : public DefaultMapper {
32 public:
33   CustomMapperB(MapperRuntime *rt, Machine m, Processor p)
34     : DefaultMapper(rt, m, p) { }
35 public:
36   static void register_custom_mappers(Machine machine, Runtime *rt,
37                                       const std::set<Processor> &local_procs);
38 };
39
40 /*static*/
41 void CustomMapperB::register_custom_mappers(Machine machine, Runtime *rt,
42                                             const std::set<Processor> &local_procs)
43 {
44   printf("Adding custom mapper B for all processors...\n");
45   MapperRuntime *const map_rt = rt->get_mapper_runtime();
46   for (std::set<Processor>::const_iterator it = local_procs.begin();
47        it != local_procs.end(); it++)
48   {
49     rt->add_mapper(1/*MapperID*/, new CustomMapperB(map_rt, machine, *it), *it);
50   }
51 }
52
53 int main(int argc, char **argv)
54 {
55   Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
56   {
57     TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level_task");
58     registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
59     Runtime::preregister_task_variant<top_level_task>(registrar);
60   }
61   Runtime::add_registration_callback(CustomMapperA::register_custom_mappers);
62   Runtime::add_registration_callback(CustomMapperB::register_custom_mappers);
63
64   return Runtime::start(argc, argv);
65 }

calling Runtime::replace_default_mapper, which is the only way to replace the default mappers. Alternatively, the registration callback can use
Runtime::add_mapper to register a mapper with a new MapperID. Both the
Runtime::replace_default_mapper and the Runtime::add_mapper methods support an optional processor argument, which tells the runtime to
associate the mapper with a specific processor. If no processor is specified,
the mapper is associated with all processors on the local node. Whether one
mapper object should handle the mapping decisions of a single application
processor or those of all application processors on a node is a mapper-specific
choice. Legion supports
both use cases and it is up to custom mappers to make the best choice. From
a performance perspective, the best choice is likely to depend on the mapper
synchronization model (see Section 7.1.2).
Note that the mapper calls require a pointer to the MapperRuntime, such
as on lines 27 and 49 of Figure 7.1. The mapper runtime provides the
interface for mapper calls to call back into the runtime to acquire access to
different physical resources. We will see examples of the use of the mapper
runtime throughout this chapter.
7.1.2 Synchronization Model
Within an instance of the Legion runtime there are often several threads
performing the analysis necessary to advance the execution of an application.
If some threads are performing work for operations owned by the same
mapper, it is possible that they will attempt to invoke mapper calls for the
same mapper object concurrently. For both productivity and correctness
reasons, we do not want users to be responsible for making their mappers
thread-safe. Therefore we allow mappers to specify a synchronization model
that the runtime follows when concurrent mapper calls are made.
Each mapper object can specify its synchronization model via the get_mapper_sync_model
mapper call. The runtime invokes this method exactly once per mapper
object immediately after the mapper is registered with the runtime. Once
the synchronization model has been set for a mapper object it cannot be
changed. Currently three synchronization models are supported:
• Serialized Non-Reentrant. Calls to the mapper object are serialized
and execute atomically. If the mapper calls out to the runtime and the
mapper call is preempted, no other mapper calls can be invoked by the
runtime. This synchronization model conforms to the original version
of the Legion mapper interface.

• Serialized Reentrant. At most one mapper call executes at a time.
However, if a mapper call invokes a runtime method that preempts the
mapper call, the runtime may execute another mapper call or resume
a previously blocked mapper call. It is up to the user to handle any
changes in internal mapper state that might occur while a mapper call
is preempted (e.g., the invalidation of STL iterators to internal mapper
data structures).
• Concurrent. Mapper calls to the same mapper object can proceed
concurrently. Users can invoke the lock_mapper and unlock_mapper
calls to perform their own synchronization of the mapper. This synchronization model is particularly useful for mappers that simply return
static mapping decisions without changing internal mapper state.
The mapper synchronization model offers mappers trade-offs between mapper
complexity and performance. The default mapper uses the serialized reentrant
synchronization model as it offers a good trade-off between programmability
and performance.
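The concurrent model can be illustrated with a self-contained mock (not Legion code): callbacks on the same mapper object may run simultaneously, so the mapper guards its own state with a lock, analogous to the lock_mapper/unlock_mapper calls mentioned above.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Mock of the Concurrent synchronization model; names are illustrative.
struct MockConcurrentMapper {
  std::mutex lock;        // stand-in for lock_mapper/unlock_mapper
  long tasks_mapped = 0;  // arbitrary internal mapper state

  void map_task_callback() {
    std::lock_guard<std::mutex> guard(lock);
    ++tasks_mapped;       // safe even when callbacks run concurrently
  }
};

long simulate(int threads, int calls_per_thread) {
  MockConcurrentMapper mapper;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; t++)
    pool.emplace_back([&] {
      for (int i = 0; i < calls_per_thread; i++)
        mapper.map_task_callback();
    });
  for (auto &th : pool)
    th.join();
  return mapper.tasks_mapped;
}
```

Without the lock the increment would race; under the serialized models the runtime itself provides this mutual exclusion, which is why they are easier to program against.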
7.1.3 Machine Interface
All mappers are given a Machine object to enable introspection of the
hardware on which the application is executing. The Machine object is
defined by Realm, Legion’s low-level portability layer (see realm/machine.h).
There are two interfaces for querying the machine object. The old interface
contains methods such as get_all_processors and get_all_memories.
These methods populate STL data structures with the appropriate names
of processors and memories. We strongly discourage using these methods
as they are not scalable on large architectures with tens to hundreds of
thousands of processors or memories.
The recommended, and more efficient and scalable, interface is based on
queries, which come in two types: ProcessorQuery and MemoryQuery. Each
query is initially given a reference to the machine object. After initialization
the query lazily materializes the (entire) set of either processors or memories
of the machine. The mapper applies filters to the query to reduce the set
to processors or memories of interest. These filters can include specializing
the query to the local node using local_address_space, to one kind of
processors with the only_kind method, or by requesting that the processor
or memory have a specific affinity to another processor or memory with the
has_affinity_to method. Affinity can either be specified as a maximum
bandwidth or a minimum latency. Figure 7.2 shows how to create a custom

mapper that uses queries to find the local set of processors with the same
processor kind as, and the memories with affinities to, the local mapper
processor. In some cases, these queries are still expensive, so we encourage
the creation of mappers that memoize the results of their most commonly
invoked queries to avoid duplicated work.
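The memoization suggestion can be sketched as follows. This self-contained mock (Kind and Proc are simplified stand-ins, not Realm types) caches the result of an "expensive" machine query keyed by processor kind, so repeated queries do not redo the work:

```cpp
#include <cassert>
#include <map>
#include <vector>

enum Kind { LOC_PROC, TOC_PROC };
struct Proc { int id; Kind kind; };

struct QueryCache {
  std::map<Kind, std::vector<int>> cache;
  int query_evaluations = 0;  // counts how often the expensive query runs

  const std::vector<int> &local_procs_of_kind(
      const std::vector<Proc> &machine, Kind k) {
    auto it = cache.find(k);
    if (it != cache.end())
      return it->second;      // memoized hit, no new query
    ++query_evaluations;      // expensive path: scan the machine
    std::vector<int> result;
    for (const Proc &p : machine)
      if (p.kind == k)
        result.push_back(p.id);
    return cache.emplace(k, result).first->second;
  }
};

int evaluations_after_two_identical_queries() {
  QueryCache qc;
  std::vector<Proc> machine = {{0, LOC_PROC}, {1, TOC_PROC}, {2, LOC_PROC}};
  qc.local_procs_of_kind(machine, LOC_PROC);
  qc.local_procs_of_kind(machine, LOC_PROC);  // second call hits the cache
  return qc.query_evaluations;
}
```

A real mapper would populate such a cache in its constructor (as Figure 7.2 suggests) or lazily on first use, since the machine model does not change during execution.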
7.2 Mapping Tasks
There are a number of different kinds of operations with mapping callbacks,
but the core of the mapping interface, and the parts of mappers that users
will most commonly customize, are the callbacks for mapping tasks. When a
task is launched it proceeds through a pipeline of mapping callbacks. The
most important pipeline stages are:
1. select_task_options
2. select_sharding_functor (for control-replicated tasks)
3. slice_task (for index launches)
4. select_tasks_to_map (tasks remain in this stage until selected for
mapping)
5. map_task
Stages 2 and 3 do not apply to every task, and tasks may repeat stage 4 any
number of times depending on the implementation of select_tasks_to_map.
After discussing these five components of the task mapping pipeline, we
discuss a few other topics relevant to task mapping: allocating new physical
instances, postmapping of tasks, virtual mappings, and profiling requests.
7.2.1 Controlling Task Mapping
select_task_options is the first callback for mapping tasks. It is invoked
for every task t exactly once in the Legion process where t is launched. The
signature of the function is:
virtual void select_task_options(const MapperContext ctx,
                                 const Task& task,
                                 TaskOptions& output) = 0;
The purpose of the callback is to set fields of the output object. All of
the fields have defaults, so none are required to be set by the callback
implementation. This callback comes first because the fields of TaskOptions
control the rest of the mapping process for the task.

1  class MachineMapper : public DefaultMapper {
2  public:
3    MachineMapper(MapperRuntime *rt, Machine m, Processor p);
4  public:
5    static void register_machine_mappers(Machine machine, Runtime *rt,
6                                         const std::set<Processor> &local_procs);
7  };
8
9  MachineMapper::MachineMapper(MapperRuntime *rt, Machine m, Processor p)
10   : DefaultMapper(rt, m, p)
11 {
12   // Find all processors of the same kind on the local node
13   Machine::ProcessorQuery proc_query(m);
14   // First restrict to the same node
15   proc_query.local_address_space();
16   // Make it the same processor kind as our processor
17   proc_query.only_kind(p.kind());
18   for (Machine::ProcessorQuery::iterator it = proc_query.begin();
19        it != proc_query.end(); it++)
20   {
21     // skip ourselves
22     if ((*it) == p)
23       continue;
24     printf("Mapper %s: shares " IDFMT "\n", get_mapper_name(), it->id);
25   }
26   // Find all the memories that are visible from this processor
27   Machine::MemoryQuery mem_query(m);
28   // Find affinity to our local processor
29   mem_query.has_affinity_to(p);
30   for (Machine::MemoryQuery::iterator it = mem_query.begin();
31        it != mem_query.end(); it++)
32     printf("Mapper %s: has affinity to memory " IDFMT "\n", get_mapper_name(), it->id);
33 }
34
35 /*static*/
36 void MachineMapper::register_machine_mappers(Machine machine, Runtime *rt,
37                                              const std::set<Processor> &local_procs)
38 {
39   printf("Replacing default mappers with custom mapper A on all processors...\n");
40   MapperRuntime *const map_rt = rt->get_mapper_runtime();
41   for (std::set<Processor>::const_iterator it = local_procs.begin();
42        it != local_procs.end(); it++)
43   {
44     rt->replace_default_mapper(new MachineMapper(map_rt, machine, *it), *it);
45   }
46 }
47
48 int main(int argc, char **argv)
49 {
50   Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
51   {
52     TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level_task");
53     registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
54     Runtime::preregister_task_variant<top_level_task>(registrar);
55   }
56   Runtime::add_registration_callback(MachineMapper::register_machine_mappers);
57
58   return Runtime::start(argc, argv);
59 }
Figure 7.2: Examples/Mapping/machine/machine.cc

• For a single task t (not an index launch), output.initial_proc is
the processor that will execute t; the default is the current processor.
The processor does not need to be local—the mapper can select any
processor in the machine model for which a variant of t exists. As we
will see, t’s target processor can be changed by subsequent stages. The
reason for choosing a target processor here is that by default t is sent
to the Legion process that manages the target processor to be mapped.
• If output.inline_task is true (the default is false) the task will be
inlined into the parent task and use the parent task’s regions. Any
needed regions that are unmapped will be remapped. Inline tasks do
not go through the rest of the task pipeline, except for the selection of
a task variant.
• If output.stealable is true then the task can be stolen for load
balancing; the default is false. A stealable task t can be stolen by
another mapper until t is chosen by select_tasks_to_map.
• As mentioned above, by default the map_task stage of the mapping
pipeline is done by the Legion process that manages the processor where
the task will execute. If output.map_locally is true (the default is
false) then map_task will be run by the current mapper. Just to
emphasize: map_locally controls where a mapping callback for the
task is run, not where the task executes. This option is mostly useful for
leaf tasks that will be sent to remote processors. In this case, making
the mapping decisions locally saves transmitting task metadata to the
remote Legion runtime.
• If valid_instances is set to false, then the task will not receive a
list of the currently valid instances of regions in subsequent calls to
request_valid_instances, which saves some runtime overhead. This
setting is useful if the task will never use a currently valid region
instance, such as when all the regions of an inner task will be virtually
mapped.
• Setting replicate_default to true turns on replication of single tasks
in a control-replication context, which means that the task will be
executed separately in every Legion process participating in the replication of the parent task. The default setting is false; in this case
only one instance of a single task with a control-replicated parent is
executed on one processor and then the results are broadcast to the
other Legion processes. Replicating single tasks avoids the broadcast
communication. There are some restrictions on replicated single tasks
to ensure the replicated versions all have identical behavior: the tasks
cannot have reduction-only privileges on any field, and any fields with
write privileges must use a separate instance for each replicated task.
• A task can set the priority of the parent task by modifying output.parent_priority,
if that is permitted by the mapper. The default is the parent’s current
priority. When tasks are ready to execute, tasks with higher priority
are moved to the front of the ready queue.
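The fields above can be sketched with a mock. The MockTaskOptions struct below mirrors the documented fields and defaults with simplified types; a real mapper would instead set the fields of Legion's TaskOptions inside an overridden select_task_options. The example policy (stealable leaf tasks, locally mapped when remote) is illustrative, not a recommendation:

```cpp
#include <cassert>

// Simplified stand-in for TaskOptions, with the defaults described above.
struct MockTaskOptions {
  int  initial_proc = 0;         // "current processor" by default
  bool inline_task = false;
  bool stealable = false;
  bool map_locally = false;
  bool valid_instances = true;   // assumed default: valid lists provided
  bool replicate_default = false;
};

// Example policy mirroring the map_locally discussion: for leaf tasks
// headed to a remote processor, do the mapping here to avoid shipping
// task metadata to the remote runtime.
MockTaskOptions select_task_options_for(bool is_leaf, bool is_remote) {
  MockTaskOptions opts;          // all fields start at their defaults
  if (is_leaf) {
    opts.stealable = true;       // allow load balancing to steal it
    if (is_remote)
      opts.map_locally = true;   // run map_task on the current mapper
  }
  return opts;
}
```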
7.2.2 Sharding
As the name suggests, select_sharding_functor is used to select the functor for sharding index task launches in control-replicated contexts. Sharding
divides the index space of the task launch into subspaces and associates each
shard with a mapper (a processor) where those tasks will be mapped. This
callback is invoked once per replicated task index launch in each replicated
context:
virtual void select_sharding_functor(
                const MapperContext ctx,
                const Task& task,
                const SelectShardingFunctorInput& input,
                SelectShardingFunctorOutput& output) = 0;

struct SelectShardingFunctorInput {
  std::vector<Processor> shard_mapping;
};

struct SelectShardingFunctorOutput {
  ShardingID chosen_functor;
  bool slice_recurse;
};
The shard_mapping of the input structure provides a vector of the processors where the replicated task is running. The callback must fill in
the chosen_functor field of the output structure with the id of a sharding function registered with the mapper at startup. The callback can set
slice_recurse to indicate whether or not the index subspaces chosen by
the sharding functor should be recursively sharded on the destination processor. The same sharding functor must be selected in every control-replicated
context, which will be checked by the runtime when in debug mode.
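Conceptually, a sharding functor maps each point of the launch index space to one of the shards (the processors in shard_mapping). The block distribution below is a self-contained illustration of that computation only; a real functor implements Legion's ShardingFunctor interface rather than a free function like this:

```cpp
#include <cassert>

// Map a point of a 1-D launch domain to a shard index using evenly
// sized contiguous blocks (illustrative; not Legion's API).
int shard_for_point(long point, long num_shards, long domain_size) {
  // Ceiling division so every point lands in some shard.
  long block = (domain_size + num_shards - 1) / num_shards;
  return static_cast<int>(point / block);
}
```

Because every replicated context runs the same pure function of the point, each context independently computes the same assignment, which is what the "same sharding functor must be selected everywhere" requirement guarantees.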
7.2.3 Slicing
slice_task is called for every index launch. To make index launches
efficient, the index space of tasks is first sliced into smaller sets of tasks and each
set is sent to a destination mapper as a single object rather than sending
multiple individual tasks. The signature of slice_task is:
virtual void slice_task(const MapperContext ctx,
                        const Task& task,
                        const SliceTaskInput& input,
                        SliceTaskOutput& output) = 0;
The SliceTaskInput includes the index space of the task launch (field
domain_is). The index space of the shard is also included for control-replicated tasks.
struct SliceTaskInput {
  IndexSpace domain_is;
  Domain domain;
  IndexSpace sharding_is;
};
The implementation of slice_task should set the fields of SliceTaskOutput:
struct SliceTaskOutput {
  std::vector<TaskSlice> slices;
  bool verify_correctness; // = false

  struct TaskSlice {
  public:
    TaskSlice(void) : domain_is(IndexSpace::NO_SPACE),
        domain(Domain::NO_DOMAIN), proc(Processor::NO_PROC),
        recurse(false), stealable(false) { }
    TaskSlice(const Domain &d, Processor p, bool r, bool s)
      : domain_is(IndexSpace::NO_SPACE), domain(d),
        proc(p), recurse(r), stealable(s) { }
    TaskSlice(IndexSpace is, Processor p, bool r, bool s)
      : domain_is(is), domain(Domain::NO_DOMAIN),
        proc(p), recurse(r), stealable(s) { }
  public:
    IndexSpace domain_is;
    Domain domain;
    Processor proc;
    bool recurse;
    bool stealable;
  };
};
The slices field is a vector of TaskSlice, each of which names a subspace
of the index space in domain_is and a destination processor proc for the
slice of tasks. The tasks of the slice can be marked as stealable, and setting
the recurse field means that slice_task will be called again by the mapper
associated with the destination processor to allow the slice to be further
subdivided before processing individual tasks.
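A typical slicing policy splits the launch domain into one contiguous slice per target processor. The sketch below does this for a 1-D domain using a simplified MockSlice struct that mirrors TaskSlice's essentials (it is not Legion's type, and real domains are IndexSpace/Domain objects):

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for TaskSlice.
struct MockSlice {
  long lo, hi;    // inclusive bounds of the subdomain
  int  proc;      // destination processor index
  bool recurse = false, stealable = false;
};

// Split [lo, hi] into one near-equal contiguous slice per processor.
std::vector<MockSlice> slice_1d(long lo, long hi, int num_procs) {
  std::vector<MockSlice> slices;
  long size = hi - lo + 1;
  long base = size / num_procs, extra = size % num_procs;
  long start = lo;
  for (int p = 0; p < num_procs; p++) {
    long len = base + (p < extra ? 1 : 0);  // spread the remainder
    if (len == 0)
      continue;
    slices.push_back({start, start + len - 1, p});
    start += len;
  }
  return slices;
}
```

Setting recurse on a slice would instead send a coarser subdomain to the destination mapper and let its own slice_task subdivide further, which is useful for hierarchical machines.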

7.2.4 Selecting Tasks to Map
select_tasks_to_map gives the mapper control over which tasks should
be mapped and which should be sent to other processors—the initial processor assignment set in select_task_options can be changed if desired.
At this point in the task mapping pipeline all index tasks have been expanded into single tasks, and select_tasks_to_map is called by the mapper
associated with the destination process, unless map_locally was chosen in
select_task_options. The signature of the callback is:
virtual void select_tasks_to_map(const MapperContext ctx,
                                 const SelectMappingInput& input,
                                 SelectMappingOutput& output) = 0;

struct SelectMappingInput {
  std::list<const Task*> ready_tasks;
};

struct SelectMappingOutput {
  std::set<const Task*> map_tasks;
  std::map<const Task*,Processor> relocate_tasks;
  MapperEvent deferral_event;
};
For each task in ready_tasks of the SelectMappingInput structure, the
callback implementation can do one of three things:
• Add the task to map_tasks, in which case the task will proceed with
mapping on the assigned local processor.
• Add the task to relocate_tasks along with a new destination processor to which the task will be transferred.
• Nothing, in which case the task will remain in the ready_tasks list
for the next call to select_tasks_to_map.
If the call does not select at least one task to map or transfer, then it
must provide a MapperEvent in the field deferral_event—another call to
select_tasks_to_map will not be made until that event is triggered. Of
course, it is up to the mapper to guarantee that the event is eventually
triggered.
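The three-way choice can be sketched with simplified types (integer task ids stand in for const Task* pointers, and the window/neighbor policy is purely illustrative): map up to a fixed number of ready tasks locally and relocate the rest.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <set>

// Simplified stand-in for SelectMappingOutput.
struct MockSelection {
  std::set<int> map_tasks;
  std::map<int, int> relocate_tasks;   // task id -> destination proc
};

MockSelection select_tasks_to_map(const std::list<int> &ready_tasks,
                                  int window, int neighbor_proc) {
  MockSelection out;
  int mapped = 0;
  for (int task : ready_tasks) {
    if (mapped < window) {
      out.map_tasks.insert(task);                 // map here
      ++mapped;
    } else {
      out.relocate_tasks[task] = neighbor_proc;   // ship elsewhere
    }
  }
  // If a real mapper left both sets empty, it would have to provide a
  // deferral_event so the runtime knows when to call back.
  return out;
}
```

Tasks placed in neither set simply stay in ready_tasks for the next invocation, which is how a mapper throttles in-flight work.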
7.2.5 Map_Task
map_task is normally the final stage of the task mapping pipeline. This
callback selects a processor or processors for the task, maps the task’s region
arguments, and selects the task variant to use, after which the task will run
on one of the selected processors.

virtual void map_task(const MapperContext ctx,
                      const Task& task,
                      const MapTaskInput& input,
                      MapTaskOutput& output) = 0;

struct MapTaskInput {
  std::vector<std::vector<PhysicalInstance> > valid_instances;
  std::vector<unsigned> premapped_regions;
};

struct MapTaskOutput {
  std::vector<std::vector<PhysicalInstance> > chosen_instances;
  std::vector<std::vector<PhysicalInstance> > source_instances;
  std::vector<Memory> output_targets;
  std::vector<LayoutConstraintSet> output_constraints;
  std::set<unsigned> untracked_valid_regions;
  std::vector<Memory> future_locations;
  std::vector<Processor> target_procs;
  VariantID chosen_variant; // = 0
  TaskPriority task_priority; // = 0
  TaskPriority profiling_priority;
  ProfilingRequest task_prof_requests;
  ProfilingRequest copy_prof_requests;
  bool postmap_task; // = false
};
The input structure contains a vector of vectors of valid instances: each
element of the outer vector is a vector of instances that hold valid data for
the corresponding region requirement. The premapped_regions field is a
vector of indices of region requirements that are already satisfied and do not
need to be mapped by the callback.
The callback must fill in the following fields of the output structure:
• target_procs is a vector of processors. All processors must be on the
same node and of the same kind (e.g., all LOCs or all TOCs). The
runtime will execute the task on the first processor in the vector that
becomes available.
• chosen_variant is the VariantID of a variant of the task. The chosen
variant must be compatible with the chosen processor kind.
• For each region requirement, the input structure has a vector of valid
instances of the region in the same order as region requirements are
added to the task launcher. The corresponding entry of the chosen_instances
field should be filled either with one or more instances from the
matching entry of valid_instances, or with newly created instances
added by the mapper. A new instance is created by the runtime call
create_physical_instance, which, in addition to other arguments,
takes a target memory in which the instance should be created and a
vector of logical regions (physical instances can be created that hold
the data of multiple logical regions). If new physical instances are
created, the select_task_sources callback is invoked to choose existing
instances to be the sources of data to fill those new instances (see below).
• For any regions that are strictly output regions (e.g., with WRITE_DISCARD
privileges) where no input data will be loaded, the callback must fill
in output_targets with a memory for the corresponding region requirement. These memories must be visible to the selected processor(s).
• The callback should set a memory that will hold each future produced
by the task in future_locations.
• Normally the runtime system will retain regions with valid data even if
no tasks are known that will use those regions at the time the task finishes. This policy can lead to an accumulation of read-only regions that
are never garbage collected (since read-only regions are not invalidated
by any write operations). This policy can be controlled by specifying
a set of indices of read-only regions in untracked_valid_regions;
these instances will be marked for garbage collection after the task is
complete.
• Optionally the mapper may request that the postmap_task callback be
invoked for this task once mapping is complete; see Section 7.2.8.
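Putting the main fields together, a minimal map_task-style decision can be sketched with simplified types (integer ids stand in for processors, variants, and PhysicalInstance handles; none of this is Legion's actual API): reuse the first valid instance per region when one exists, otherwise record that a new instance would be created.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for MapTaskOutput.
struct MockMapTaskOutput {
  std::vector<int> chosen_instances;  // one instance id per region req
  std::vector<int> target_procs;
  int chosen_variant = 0;
};

MockMapTaskOutput mock_map_task(
    const std::vector<std::vector<int>> &valid_instances,
    int local_proc, int cpu_variant) {
  MockMapTaskOutput out;
  out.target_procs.push_back(local_proc);  // same node, same kind
  out.chosen_variant = cpu_variant;        // must match processor kind
  for (const auto &valid : valid_instances) {
    if (!valid.empty())
      out.chosen_instances.push_back(valid.front());  // reuse valid data
    else
      out.chosen_instances.push_back(-1);  // would call create_physical_instance
  }
  return out;
}
```

A real mapper would also rank target_procs by expected availability and weigh instance reuse against creating better-placed copies, which is where most application-specific tuning happens.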
7.2.6 Creating Physical Instances
New physical instances are created by the runtime call create_physical_instance:
bool MapperRuntime::create_physical_instance(
    MapperContext ctx, Memory target_memory,
    const LayoutConstraintSet &constraints,
    const std::vector<LogicalRegion> &regions,
    PhysicalInstance &result,
    bool acquire, GCPriority priority,
    bool tight_bounds, size_t *footprint,
    const LayoutConstraint **unsat) const
Besides the standard runtime context, the arguments to this function are:
• The target_memory is the memory where the instance will be created.
• The constraints specify the layout constraints of the region, such as
whether it should be laid out in column-major or row-major order for
2D index spaces. Layout constraints are discussed in Section 3.4.

• The regions field is a vector of logical regions, all of which should be
included in the created instance. The ability to have more than one
logical region in an instance allows for colocation of data from multiple
regions.
• The result field holds the newly created instance after the call returns;
if successful the function returns true.
• If tight_bounds is true, then the call will select the most specific
(tightest) solution to the constraints, if more than one solution is
possible. Otherwise, the runtime is free to pick any valid solution.
• footprint holds the size of the allocated instance in bytes.
• unsat holds any constraints that could not be satisfied if the call fails.
The runtime function find_or_create_physical_instance provides
higher-level functionality that preferentially finds an existing physical instance
satisfying some constraints or creates a new one if necessary. The default mapper also provides higher-level functions that wrap create_physical_instance;
see default_create_custom_instances for an example.
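The find-or-create pattern behind find_or_create_physical_instance can be sketched in isolation. In this mock, a string key stands in for the (target memory, constraints, regions) tuple and an integer id stands in for a PhysicalInstance; the point is only the reuse-before-create logic:

```cpp
#include <cassert>
#include <map>
#include <string>

struct InstanceCache {
  std::map<std::string, int> instances;  // key -> instance id
  int next_id = 0;
  int creations = 0;                     // how often we "created"

  int find_or_create(const std::string &key) {
    auto it = instances.find(key);
    if (it != instances.end())
      return it->second;   // reuse an existing satisfying instance
    ++creations;           // stand-in for create_physical_instance
    return instances[key] = next_id++;
  }
};
```

Reusing instances this way avoids both the allocation cost and the data movement needed to fill a fresh instance, which is why the find-or-create variant is usually preferable when a satisfying instance may already exist.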
7.2.7 Selecting Sources for New Physical Instances
When a new physical instance is created, if its contents may be read the
mapper callback select_task_sources will be invoked to pick a source of
data for the instance:
virtual void select_task_sources(const MapperContext ctx,
                                 const Task& task,
                                 const SelectTaskSrcInput& input,
                                 SelectTaskSrcOutput& output) = 0;

struct SelectTaskSrcInput {
  PhysicalInstance target;
  std::vector<PhysicalInstance> source_instances;
  unsigned region_req_index;
};

struct SelectTaskSrcOutput {
  std::deque<PhysicalInstance> chosen_ranking;
};
An implementation of this callback fills in chosen_ranking with a queue of
instances selected from source_instances, most preferred instance first. The
default mapper, for example, ranks instances in order of the bandwidth between
the source instance and the target memory; see default_policy_select_sources
in default_mapper.cc.

Despite its name, this callback is also used for other operations that
create new physical instances, such as copy operations.
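The bandwidth-ordered ranking described above can be sketched independently of the Legion types; in this self-contained model, SourceStub and its bandwidth_to_target field are hypothetical stand-ins for a candidate instance and a machine-model bandwidth query:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for a candidate source instance: a real mapper
// would query the machine model for the bandwidth between the instance's
// memory and the target memory.
struct SourceStub {
  int id;
  int bandwidth_to_target;  // higher is better
};

// Produce a ranking, most preferred instance first, mirroring how
// chosen_ranking in SelectTaskSrcOutput is ordered.
std::deque<SourceStub> rank_sources(std::vector<SourceStub> sources) {
  std::sort(sources.begin(), sources.end(),
            [](const SourceStub &a, const SourceStub &b) {
              return a.bandwidth_to_target > b.bandwidth_to_target;
            });
  return std::deque<SourceStub>(sources.begin(), sources.end());
}
```

With sources of bandwidth 10, 50, and 30, the ranking places the 50-bandwidth instance first, matching the default mapper's preference for the fastest copy.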
7.2.8 Postmapping
The callback postmap_task is called only if requested by map_task (see
Section 7.2.5). The purpose of this callback is to allow additional copies of
regions updated by a task to be made once the task has finished. As input
the callback is given the mapped instances for each region requirement as
well as the valid instances. The callback should fill in chosen_instances
with, for each region requirement, a vector of instances to receive additional
copies; the possible sources of these copies are specified by source_instances.
1 virtual void postmap_task(
2 const MapperContext ctx,
3 const Task& task,
4 const PostMapInput& input,
5 PostMapOutput& output) = 0;
6
7 struct PostMapInput {
8 std::vector<std::vector<PhysicalInstance> > mapped_regions;
9 std::vector<std::vector<PhysicalInstance> > valid_instances;
10 };
11
12 struct PostMapOutput {
13 std::vector<std::vector<PhysicalInstance> > chosen_instances;
14 std::vector<std::vector<PhysicalInstance> > source_instances;
15 };
7.2.9 Using Virtual Mappings
A useful optimization is to use virtual mapping for a logical region argument
that a task does not use itself but only passes as an argument to a subtask.
A virtual mapping is just a way of recording that no physical instance will be
created for the region argument, but the name and metadata for the region
are still available so that it can be passed as an argument to subtasks.
The function PhysicalInstance::get_virtual_instance() returns a
virtual instance which can be used as the chosen physical instance of a region
requirement. If a task variant is marked as an inner task (meaning that
it does not access any of its regions and only passes them on to subtasks),
the default mapper will use virtual instances for all of the region arguments,
except for fields with reduction privileges, for which the Legion runtime
always requires a real physical instance to be mapped. See map_task in
default_mapper.cc.
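The default mapper's rule for inner tasks can be modeled as a small decision function; Privilege and use_virtual_instance below are illustrative stand-ins, not the actual logic in default_mapper.cc:

```cpp
#include <cassert>

// Hypothetical simplification of a region requirement's privilege.
enum class Privilege { READ_ONLY, READ_WRITE, REDUCE };

// For an inner task variant, decide per region requirement whether a
// virtual instance may be used: fields with reduction privileges always
// require a real physical instance to be mapped.
bool use_virtual_instance(bool task_is_inner, Privilege priv) {
  if (!task_is_inner)
    return false;                      // non-inner tasks map real instances
  return priv != Privilege::REDUCE;    // reductions need a real instance
}
```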

7.3 Other Mapping Features
Custom policies for mapping tasks and their region requirements are the most
common reasons for users to write their own mappers. In this section we
cover a few other mapping features that can be included in custom mappers.
This section is very incomplete; only a handful of calls relevant to other
features covered in this manual are currently included.
7.3.1 Profiling Requests
Legion has a general interface to profiling through the type ProfilingRequest,
which has one public method, add_measurement(). Most Legion operations
take an optional profiling request that turns on the gathering of profiling
information for that specific operation. Most profiling is done in the Realm
low-level runtime, and running a Legion program with the command-line
flag -lg:prof will turn on profiling of many runtime operations; see
https://legion.stanford.edu/profiling/index.html#legion-prof for an
introduction to using the Legion profiler. Most users only use the Legion
profiler, but ProfilingRequests are available for users who want more
selective control over profiling.
7.3.2 Mapping Acquires and Releases
The callback map_acquire is called for every acquire operation. Other than
the possibility of adding a profiling request, map_acquire has no options to
set.
For release operations there is a policy decision to make, expressed through
the callback select_release_sources:
1 virtual void select_release_sources(
2 const MapperContext ctx,
3 const Release& release,
4 const SelectReleaseSrcInput& input,
5 SelectReleaseSrcOutput& output) = 0;
6
7 struct SelectReleaseSrcInput {
8 PhysicalInstance target;
9 std::vector<PhysicalInstance> source_instances;
10 };
11
12 struct SelectReleaseSrcOutput {
13 std::deque<PhysicalInstance> chosen_ranking;
14 };
Recall that the semantics of release are that it restores the copy restriction
on a region with simultaneous coherence and flushes any updates to the region
to the original target instance. This call allows the mapper to produce a
ranking chosen_ranking of which of the valid instances of the region in
source_instances should be the source of the copy to the target at the
point of the release.
7.3.3 Controlling Stealing
There are two callbacks for controlling how tasks are stolen. A mapper may
try to steal tasks from another mapper using select_steal_targets, and a
mapper can control which tasks it allows to be stolen using permit_steal_request.
Mappers that want to steal tasks should implement select_steal_targets.
This callback sets targets to a set of processors from which tasks can be
stolen. A blacklist is supplied as input, which records processors for which
a previous steal request failed due to insufficient work. The blacklist is
managed automatically by the runtime system, and processors are removed
from the blacklist when they acquire additional work.
1 struct SelectStealingInput {
2 std::set<Processor> blacklist;
3 };
4
5 struct SelectStealingOutput {
6 std::set<Processor> targets;
7 };
8
9 virtual void select_steal_targets(
10 const MapperContext ctx,
11 const SelectStealingInput& input,
12 SelectStealingOutput& output) = 0;
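The blacklist filtering described above can be sketched independently of Legion's Processor type; ProcId below is a hypothetical stand-in:

```cpp
#include <cassert>
#include <set>

using ProcId = int;  // hypothetical stand-in for Legion's Processor

// Choose steal targets: every known processor other than the local one
// that is not on the blacklist of processors whose steal requests
// recently failed for lack of work.
std::set<ProcId> choose_steal_targets(ProcId local,
                                      const std::set<ProcId> &all_procs,
                                      const std::set<ProcId> &blacklist) {
  std::set<ProcId> targets;
  for (ProcId p : all_procs)
    if (p != local && blacklist.count(p) == 0)
      targets.insert(p);
  return targets;
}
```

A real implementation would also apply a policy about when to steal at all, for example only when the local processor is running low on work.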
When a mapper receives a steal request the permit_steal_request
callback is invoked, notifying the mapper of the requesting processor (the
thief) and the tasks the mapper has available to steal, from which the
callback selects a set of stolen_tasks.
1 struct StealRequestInput {
2 Processor thief_proc;
3 std::vector<const Task*> stealable_tasks;
4 };
5
6 struct StealRequestOutput {
7 std::set<const Task*> stolen_tasks;
8 };
9
10 virtual void permit_steal_request(const MapperContext ctx,
11 const StealRequestInput& input,
12 StealRequestOutput& output) = 0;
