
.. _tut_profiling:

=========================
Profiling Theano function
=========================

.. note::

    This method replace the old ProfileMode. Do not use ProfileMode
    anymore.

Besides checking for errors, another important task is to profile your
code in terms of speed and/or memory usage.

You can profile your
functions using either of the following two options:


1. Use Theano flag :attr:`config.profile` to enable profiling.
    - To enable the memory profiler use the Theano flag:
      :attr:`config.profile_memory` in addition to :attr:`config.profile`.
    - Moreover, to enable the profiling of Theano optimization phase,
      use the Theano flag: :attr:`config.profile_optimizer` in addition
      to :attr:`config.profile`.
    - You can also use the Theano flags :attr:`profiling.n_apply`,
      :attr:`profiling.n_ops` and :attr:`profiling.min_memory_size`
      to modify the quantity of information printed.

2. Pass the argument :attr:`profile=True` to the function :func:`theano.function <function.function>`. And then call :attr:`f.profile.print_summary()` for a single function.
    - Use this option when you want to profile not all the
      functions but one or more specific function(s).
    - You can also combine the profile of many functions: 
    
      .. doctest::
          :hide:

          profile = theano.compile.ProfileStats()
          f = theano.function(..., profile=profile)  # doctest: +SKIP
          g = theano.function(..., profile=profile)  # doctest: +SKIP
          ...  # doctest: +SKIP
          profile.print_summary()



The profiler will output one profile per Theano function and profile
that is the sum of the printed profiles. Each profile contains 4
sections: global info, class info, Ops info and Apply node info.

In the global section, the "Message" is the name of the Theano
function. theano.function() has an optional parameter ``name`` that
defaults to None. Change it to something else to help you profile many
Theano functions. In that section, we also see the number of times the
function was called (1) and the total time spent in all those
calls. The time spent in Function.fn.__call__ and in thunks is useful
to understand Theano overhead.

Also, we see the time spent in the two parts of the compilation
process: optimization (modify the graph to make it more stable/faster)
and the linking (compile c code and make the Python callable returned
by function).

The class, Ops and Apply nodes sections are the same information:
information about the Apply node that ran. The Ops section takes the
information from the Apply section and merge the Apply nodes that have
exactly the same op. If two Apply nodes in the graph have two Ops that
compare equal, they will be merged. Some Ops like Elemwise, will not
compare equal, if their parameters differ (the scalar being
executed). So the class section will merge more Apply nodes then the
Ops section.

Note that the profile also shows which Ops were running a c implementation.

Developers wishing to optimize the performance of their graph should
focus on the worst offending Ops and Apply nodes – either by optimizing
an implementation, providing a missing C implementation, or by writing
a graph optimization that eliminates the offending Op altogether.
You should strongly consider emailing one of our lists about your
issue before spending too much time on this.

Here is an example output when we disable some Theano optimizations to
give you a better idea of the difference between sections. With all
optimizations enabled, there would be only one op left in the graph.

.. note::

    To profile the peak memory usage on the GPU you need to do::

        * In the file theano/sandbox/cuda/cuda_ndarray.cu, set the macro
          COMPUTE_GPU_MEM_USED to 1.
        * Then call theano.sandbox.cuda.theano_allocated()
          It return a tuple with two ints. The first is the current GPU
          memory allocated by Theano. The second is the peak  GPU memory
          that was allocated by Theano.

    Do not always enable this, as this slows down memory allocation and
    free. As this slows down the computation, this will affect speed
    profiling. So don't use both at the same time.

to run the example:

  THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py

The output:

.. literalinclude:: profiling_example_out.prof
