
.. _aliasing:

=======================================================
Understanding Memory Aliasing for Speed and Correctness
=======================================================

The aggressive reuse of memory is one of the ways through which Theano makes code fast, and
it is important for the correctness and speed of your program that you understand
how Theano might alias buffers.

This section describes the principles based on which Theano handles memory, and explains
when you might want to alter the default behaviour of some functions and
methods for faster performance.


The Memory Model: Two Spaces
============================

There are some simple principles that guide Theano's handling of memory.  The
main idea is that there is a pool of memory managed by Theano, and Theano tracks
changes to values in that pool.

* Theano manages its own memory space, which typically does not overlap with
  the memory of normal Python variables that non-Theano code creates.

* Theano functions only modify buffers that are in Theano's memory space.

* Theano's memory space includes the buffers allocated to store ``shared``
  variables and the temporaries used to evaluate functions.

* Physically, Theano's memory space may be spread across the host, a GPU
  device(s), and in the future may even include objects on a remote machine.

* The memory allocated for a ``shared`` variable buffer is unique: it is never
  aliased to another ``shared`` variable.

* Theano's managed memory is constant while Theano functions are not running
  and Theano's library code is not running.

* The default behaviour of a function is to return user-space values for
  outputs, and to expect user-space values for inputs.

The distinction between Theano-managed memory and user-managed memory can be
broken down by some Theano functions (e.g. ``shared``, ``get_value`` and the
constructors for ``In`` and ``Out``) by using a ``borrow=True`` flag.
This can make those methods faster (by avoiding copy operations) at the expense
of risking subtle bugs in the overall program (by aliasing memory).

The rest of this section is aimed at helping you to understand when it is safe
to use the ``borrow=True`` argument and reap the benefits of faster code.

Borrowing when Creating Shared Variables
========================================

A ``borrow`` argument can be provided to the shared-variable constructor.

.. testcode:: borrow

   import numpy, theano
   np_array = numpy.ones(2, dtype='float32')

   s_default = theano.shared(np_array)
   s_false   = theano.shared(np_array, borrow=False)
   s_true    = theano.shared(np_array, borrow=True)

By default (*s_default*) and when explicitly setting ``borrow=False``, the
shared variable we construct gets a [deep] copy of *np_array*.  So changes we
subsequently make to *np_array* have no effect on our shared variable.

.. testcode:: borrow

   np_array += 1 # now it is an array of 2.0 s

   print(s_default.get_value())
   print(s_false.get_value())
   print(s_true.get_value())

.. testoutput:: borrow

   [ 1.  1.]
   [ 1.  1.]
   [ 2.  2.]


If we are running this with the CPU as the device,
then changes we make to *np_array* *right away* will show up in
``s_true.get_value()``
because NumPy arrays are mutable, and *s_true* is using the *np_array*
object as it's internal buffer.

However, this aliasing of *np_array* and *s_true* is not guaranteed to occur,
and may occur only temporarily even if it occurs at all.
It is not guaranteed to occur because if Theano is using a GPU device, then the
``borrow`` flag has no effect. It may occur only temporarily because
if we call a Theano function that updates the value of *s_true* the aliasing
relationship *may* or *may not* be broken (the function is allowed to
update the ``shared`` variable by modifying its buffer, which will preserve
the aliasing, or by changing which buffer the variable points to, which
will terminate the aliasing).

*Take home message:*

It is a safe practice (and a good idea) to use ``borrow=True`` in a ``shared``
variable constructor when the ``shared`` variable stands for a large object (in
terms of memory footprint) and you do not want to create copies of it in
memory.

It is not a reliable technique to use ``borrow=True`` to modify ``shared`` variables
through side-effect, because with some devices (e.g. GPU devices) this technique will
not work.

Borrowing when Accessing Value of Shared Variables
==================================================

Retrieving
----------

A ``borrow`` argument can also be used to control how a ``shared`` variable's value is
retrieved.


.. testcode:: borrow

   s = theano.shared(np_array)

   v_false = s.get_value(borrow=False) # N.B. borrow default is False
   v_true = s.get_value(borrow=True)


When ``borrow=False`` is passed to ``get_value``, it means that the return value
may not be aliased to any part of Theano's internal memory.
When ``borrow=True`` is passed to ``get_value``, it means that the return value
*might* be aliased to some of Theano's internal memory.
But both of these calls might create copies of the internal memory.

The reason that ``borrow=True`` might still make a copy is that the internal
representation of a ``shared`` variable might not be what you expect.  When you
create a ``shared`` variable by passing a NumPy array for example, then ``get_value()``
must return a NumPy array too.  That's how Theano can make the GPU use
transparent.  But when you are using a GPU (or in the future perhaps a remote machine),
then the numpy.ndarray is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``.  It will never cast the internal object (always return in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the NumPy array).

.. testcode:: borrow

    v_internal = s.get_value(borrow=True, return_internal_type=True)

It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``, which will return a deep copy of the internal object.
This is primarily for internal debugging, not for typical use.

For the transparent use of different type of optimization Theano can make,
there is the policy that ``get_value()`` always return by default the same object type
it received when the ``shared`` variable was created. So if you created manually data on
the gpu and create a ``shared`` variable on the gpu with this data, ``get_value`` will always
return gpu data even when ``return_internal_type=False``.

*Take home message:*

It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
your code does not modify the return value.  *Do not use this to modify a ``shared``
variable by side-effect* because it will make your code device-dependent.
Modification of GPU variables through this sort of side-effect is impossible.

Assigning
---------

``Shared`` variables also have a ``set_value`` method that can accept an optional
``borrow=True`` argument. The semantics are similar to those of creating a new
``shared`` variable - ``borrow=False`` is the default and ``borrow=True`` means
that Theano *may* reuse the buffer you provide as the internal storage for the variable.

A standard pattern for manually updating the value of a ``shared`` variable is as
follows:

.. testsetup:: borrow

   def some_inplace_fn(v):
       return v

.. testcode:: borrow

    s.set_value(
        some_inplace_fn(s.get_value(borrow=True)),
        borrow=True)

This pattern works regardless of the computing device, and when the latter
makes it possible to expose Theano's internal variables without a copy, then it
proceeds as fast as an in-place update.


When ``shared`` variables are allocated on the GPU, the transfers to and from the GPU device memory can
be costly.  Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:

* Prior to Theano 0.3.1, ``set_value`` did not work in-place on the GPU. This meant that, sometimes,
  GPU memory for the new value would be allocated before the old memory was released. If you're
  running near the limits of GPU memory, this could cause you to run out of GPU memory
  unnecessarily.

  *Solution*: update to a newer version of Theano.

* If you are going to swap several chunks of data in and out of a ``shared`` variable repeatedly,
  you will want to reuse the memory that you allocated the first time if possible - it is both
  faster and more memory efficient.

  *Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
  data to make sure that every chunk is the same size.

* It is also worth mentioning that, current GPU copying routines support only contiguous memory.
  So Theano must make the value you provide *C-contiguous* prior to copying it.
  This can require an extra copy of the data on the host.

  *Solution*: make sure that the value
  you assign to a CudaNdarraySharedVariable is *already*  *C-contiguous*.

(Further information on the current implementation of the GPU version of ``set_value()`` can be found
here: :ref:`libdoc_cuda_var`)

.. _borrowfunction:

Borrowing when Constructing Function Objects
============================================

A ``borrow`` argument can also be provided to the ``In`` and ``Out`` objects
that control how ``theano.function`` handles its argument[s] and return value[s].

.. testcode::

    import theano, theano.tensor

    x = theano.tensor.matrix()
    y = 2 * x
    f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))

Borrowing an input means that Theano will treat the argument you provide as if
it were part of Theano's pool of temporaries.  Consequently, your input
may be reused as a buffer (and overwritten!) during the computation of other variables in the
course of evaluating that function (e.g. ``f``).


Borrowing an output means that Theano will not insist on allocating a fresh
output buffer every time you call the function.  It will possibly reuse the same one as
on a previous call, and overwrite the old content.  Consequently, it may overwrite
old return values through side-effect.
Those return values may also be overwritten in
the course of evaluating *another compiled function* (for example, the output
may be aliased to a ``shared`` variable).  So be careful to use a borrowed return
value right away before calling any more Theano functions.
The default is of course to *not borrow* internal results.

It is also possible to pass a ``return_internal_type=True`` flag to the ``Out``
variable which has the same interpretation as the ``return_internal_type`` flag
to the ``shared`` variable's ``get_value`` function.  Unlike ``get_value()``, the
combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
``Out()`` are not guaranteed to avoid copying an output value.  They are just
hints that give more flexibility to the compilation and optimization of the
graph.

For GPU graphs, this borrowing can have a major speed impact.  See the following code:

.. code-block:: python

   from theano import function, config, shared, sandbox, tensor, Out
   import numpy
   import time

   vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
   iters = 1000

   rng = numpy.random.RandomState(22)
   x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
   f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
   f2 = function([],
                 Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
                     borrow=True))
   t0 = time.time()
   for i in range(iters):
       r = f1()
   t1 = time.time()
   no_borrow = t1 - t0
   t0 = time.time()
   for i in range(iters):
       r = f2()
   t1 = time.time()
   print(
       "Looping %s times took %s seconds without borrow "
       "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
   )
   if numpy.any([isinstance(x.op, tensor.Elemwise) and
                 ('Gpu' not in type(x.op).__name__)
                 for x in f1.maker.fgraph.toposort()]):
       print('Used the cpu')
   else:
       print('Used the gpu')

Which produces this output:

.. code-block:: none

   $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
   Using gpu device 0: GeForce GTX 275
   Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
   Used the gpu

*Take home message:*

When an input *x* to a function is not needed after the function
returns and you would like to make it available to Theano as
additional workspace, then consider marking it with ``In(x,
borrow=True)``.  It may make the function faster and reduce its memory
requirement.  When a return value *y* is large (in terms of memory
footprint), and you only need to read from it once, right away when
it's returned, then consider marking it with an ``Out(y,
borrow=True)``.

