# Parameter guide for executing experiments

The parameters are generally separated into five categories:

- Parameters for controlling execution of the experiment
- Parameters for controlling the individual training steps
- Parameters to pass to the algorithms
- Parameters to pass to the policy
- Parameters to pass to the environments

The parameters in the head of the execute_experiment function are ordered accordingly. Some algorithms and environments have parameters that are unique to them. These are all passed in a special argument, and every possible entry is explained in detail below.

## 1. Parameters for controlling execution of the experiment

### base_folder (Str)

The path to the folder in which the data should be saved. This folder needs to already exist.

### num_runs (Int)

The number of times the algorithm should be trained.

### progress (Bool)

If True, a progress bar will be displayed showing how many of the training runs have been completed.

### project_name (Str)

The name of the project for which the experiments are done. The results will be saved under the folder with this name in the base_folder.

### runtime_estimation (Bool)

If True, an estimate of the total runtime will be calculated and displayed.

### safe_mode (Bool)

If True, the inputs will be checked. The displayed messages might make it easier to understand what parameters need to be changed in which way in case of errors.

### verbose (Bool)

If True, the function is verbose, meaning more messages will be printed to the terminal.
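
As a rough illustration of the parameters in this section, the following sketch collects them in a dictionary. Only the keyword names come from this guide. The values, the folder location, and the idea of gathering them in a dictionary named `execution_params` are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical values for illustration; only the keyword names come from this guide.
base_folder = os.path.join(tempfile.gettempdir(), "rl_experiments")
os.makedirs(base_folder, exist_ok=True)  # base_folder needs to already exist

execution_params = {
    "base_folder": base_folder,
    "num_runs": 10,
    "progress": True,
    "project_name": "gridworld_q_learning",
    "runtime_estimation": False,
    "safe_mode": True,
    "verbose": False,
}

# The results are described as being saved under base_folder/project_name.
results_folder = os.path.join(
    execution_params["base_folder"], execution_params["project_name"]
)
```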

## 2. Parameters for controlling the individual training steps

### algo (Algo)

The algorithm on which the training should be performed.

### algo_special_logs (Bool)

If True, the algorithm's special logs will be logged.

### algo_special_logs_kwargs (Dict)

If the algorithm's special logs are to be logged, this dictionary contains the necessary keywords. For the following algorithms there are special log options available:

#### ADP
| Keyword                          | Type  | Description                                                                                     |
|----------------------------------|-------|-------------------------------------------------------------------------------------------------|
| target_q_values                  | -     | If key is supplied, the whole target_q_fct will be logged as tuple: ("target Q value of state {state} and action {action}", target_q_fct[(state, action)]). |
| which_target_q_values            | tuple | A tuple containing a list of states and a list of corresponding actions to log the target_q_fct_values for. The first coordinate contains the list of states, where each state needs to be a valid positive integer. The second coordinate contains a list of lists for each state, which contain the actions corresponding to the states given as positive integers. |
| updated_q_values                 | -     | If key is supplied, the updated Q values will be logged as tuple: ("updated Q value of state {state} and action {action}", q_fct_update[(state, action)]). |
| which_updated_q_values           | tuple | A tuple containing a list of states and a list of corresponding actions to log the updated Q values for. The first coordinate contains the list of states, where each state needs to be a valid positive integer. The second coordinate contains a list of lists for each state, which contain the actions corresponding to the states given as positive integers. |
| cycle_means                      | -     | If key is supplied, the cycle mean raw updates will be logged as tuple: ("cycle mean raw_update of state {state} and action {action}", pair_cycle_mean[(state, action)]). |
| which_cycle_means                | tuple | A tuple containing a list of states and a list of corresponding actions to log the cycle mean raw updates for. The first coordinate contains the list of states, where each state needs to be a valid positive integer. The second coordinate contains a list of lists for each state, which contain the actions corresponding to the states given as positive integers. |
| global_cycle_mean_of_means       | -     | If key is supplied, the global cycle mean of absolute means will be logged as a single value: ("global cycle mean of abs means", last_mean_m_hat). |
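
A possible `algo_special_logs_kwargs` dictionary for ADP might look like the sketch below. The table only says that the flag-style keys need to be supplied, so using `None` as their value is an assumption, and the state and action numbers are made up.

```python
# Flag-style keys (type "-" in the table) only need to be present;
# using None as their value is an assumption made for this sketch.
algo_special_logs_kwargs = {
    "target_q_values": None,
    # (list of states, list of action lists, one action list per state)
    "which_target_q_values": ([1, 2], [[1, 2], [3]]),
    "global_cycle_mean_of_means": None,
}

# Expand the tuple into the individual state action pairs it selects.
states, actions_per_state = algo_special_logs_kwargs["which_target_q_values"]
selected_pairs = [
    (state, action)
    for state, actions in zip(states, actions_per_state)
    for action in actions
]
```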

### bias_estimation (Bool)

If True, the various bias estimation metrics will be computed at the individual runs and the averages over the runs will be stored.

### correct_act_q_fct_mode (Str)

If the logging of the bias estimation metrics, the correct action rates, or the initial Q function estimations is on, the correct action and Q function need to be known to the function. The mode with which the correct actions and Q function will be determined can be either "manual" or "value_iteration". In the latter case, the value iteration algorithm will be performed to determine both.

### correct_act_q_fct_mode_kwargs (Dict)

The dictionary of keyword arguments required by the chosen mode for determining the correct actions and Q function.

For "manual", it needs the following keys:

- "correct_actions": the list containing the lists of assumed best actions for each state.
- "correct_q_fct": the dictionary mapping each state action pair to its assumed optimal Q function value.

For "value_iteration", it needs the following keys:

- "n_max": a strictly positive integer representing the maximum number of iterations for which the value iteration should be performed.
- "tol": a strictly positive numerical value representing the desired maximum error after completing the value iteration.
- "env_mean_rewards": a dictionary in the same shape as rewards for the chosen environment (see below), but with the mean of the rewards in place of any stochastic rewards. If the means are not known, pass an empty dictionary.
- "mean_rewards_mc_runs": the number of Monte Carlo runs per stochastic state that should be performed to determine the environment's mean rewards in case "env_mean_rewards" is left empty.
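
To make the two modes more concrete, here is a minimal sketch of a generic value iteration on a tiny deterministic problem, followed by dictionaries in the shape described above. This is not the implementation used by execute_experiment. The `transitions` structure, the stopping rule based on the largest change, and all numbers are assumptions for the example.

```python
def value_iteration(transitions, mean_rewards, gamma, n_max, tol):
    """Generic Q value iteration on a small deterministic MDP.

    transitions[(state, action)] is the next state, or None if the
    episode ends; mean_rewards[(state, action)] is the expected reward.
    """
    q = {pair: 0.0 for pair in transitions}
    for _ in range(n_max):
        new_q = {}
        for (state, action), next_state in transitions.items():
            if next_state is None:
                future = 0.0
            else:
                future = max(v for (s, _a), v in q.items() if s == next_state)
            new_q[(state, action)] = mean_rewards[(state, action)] + gamma * future
        delta = max(abs(new_q[pair] - q[pair]) for pair in q)
        q = new_q
        if delta < tol:  # stop once the largest change falls below tol
            break
    return q


# Two states, two actions each; action 1 in state 1 reaches the goal.
transitions = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): None}
mean_rewards = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
q_star = value_iteration(transitions, mean_rewards, gamma=0.9, n_max=1000, tol=1e-8)

# One list of best actions per state, in the shape described for "manual".
correct_actions = []
for state in sorted({s for s, _a in q_star}):
    best = max(v for (s, _a), v in q_star.items() if s == state)
    correct_actions.append(
        [a for (s, a), v in sorted(q_star.items()) if s == state and abs(v - best) < 1e-12]
    )

manual_kwargs = {"correct_actions": correct_actions, "correct_q_fct": q_star}
value_iteration_kwargs = {
    "n_max": 1000,
    "tol": 1e-8,
    "env_mean_rewards": {},       # empty: means would be estimated by Monte Carlo
    "mean_rewards_mc_runs": 500,  # hypothetical number of runs
}
```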

### correct_action_log (Bool)

If True, the correct action rates will be logged after each epoch and each evaluation cycle.

### correct_action_log_which (Str/List)

If the correct action rates are logged, this describes which states should be considered when calculating the current policy's correct action rate. It can be passed as 'all', meaning all states will be considered, or as a list of valid state numbers.

### env (Env)

The environment you want to train your algorithm on.

### env_randomization (Bool)

If True, the evaluation environment's parameters will be chosen randomly after each run after the environment randomization seed schedule is exhausted. Be advised that there is no check in place that the resulting environment is sensible, for example that the "goal" is still a goal in the sense that reaching this state solves the RL problem, so think your randomization keyword arguments through thoroughly.

### env_randomization_kwargs (Dict)

If the environment randomization is on, specifies arguments to be passed to the random environment constructor. For each game a description of the parameters that determine random environments is provided.

#### 1. GridWorld

| Keyword                                   | Type                     | Description                                                                      |
|-------------------------------------------|--------------------------|----------------------------------------------------------------------------------|
| check_goal_is_goal                        | bool                     | If True, a check is performed that the state labeled "goal" indeed has the highest (discounted) average reward. Rewards of states that have higher means after being drawn will be shifted downwards. Default value is True. |
| discounted_reward_goal_limit              | float                    | The maximum ratio of reward to discounted goal value that states can get assigned. This value will be used to determine the downwards shift of reward means. Default value is 0.95. |
| reward_normalization                      | bool                     | If True, the scores on the optimum path to the goal are normalized to give a discounted sum of 1. If this is not possible, warnings will be displayed. Default value is True. |
| reward_normalization_factor_for_negatives | (int,float,Callable,str) | If reward_normalization is on, but the discounted sum of rewards on the best path is negative, the negative values need to be iteratively scaled first such that a positive reward overall is achieved. This parameter decides how they get scaled. If it is numerical, they will be scaled by the amount presented - thus, the factor should be bigger than one. If it is a lambda function, the sum of discounted rewards as an input will determine the scaling factor. If it is the string 'default', the absolute value of the sum of discounted rewards plus one will be the scaling factor. Default value is 'default'. |
| reward_normalization_num_tries            | int                      | The number of maximum tries for reward normalization if the discounted sum of rewards on the best path is negative. If it is -1, no maximum number of tries is specified. Default value is -1. |
| randomization_kwargs                      | dict                     | A dictionary containing dictionaries specifying how to randomly draw each of the parameters of GridWorld. If one or more of the dictionaries is not present, the corresponding GridWorld parameters will be taken from the initial initialization. A description of its contents is listed below. |

Elements of randomization_kwargs:

| Keyword                             | Type |Description                                                                      |
|-------------------------------------|------|----------------------------------------------------------------------------------|
| randomize_gridsize_kwargs           | dict | A tuple containing instructions on how to randomly draw the components of the grid_size parameter (first and second entry) in the format as described in the init-function. If this is activated, randomize_locations_kwargs needs to be specified as well. |
| randomize_locations_kwargs          | dict | A dictionary containing valid state names as keys mapping to a tuple. The first entry specifies the number of states of this type to be drawn and the second the locations from which it can be drawn. The number of states can be either a fixed number or a random drawing specification in the format described in the rewards parameter of GridWorld. The locations to be drawn from need to be passed as either a list of state numbers (where in case the grid does not allow for this many states, the rest of the states are ejected), a tuple of tuples of row and column, delimiting an area (upper left to lower right point of the Grid, starting with state 0 in the upper left corner), or the string "all" (meaning randomly drawn from all states). "Goal" and "Start" will be chosen first. Then any state that does not undergo randomization is set if the new grid size allows it and then all other random state locations will be drawn in the order of their appearance in the dictionary. |
| randomize_rewards_kwargs            | dict | A dictionary containing state names present in the original game as keys mapping to a tuple. Its first element specifies if it should be a random reward after drawing (True) or not (False). In case not random was selected the second element of the tuple should specify how to draw the deterministic reward in the format as described in the rewards parameter of GridWorld. In case random was assigned, you should specify how to draw the reward in the format as described in the rewards parameter of GridWorld, but instead of numerical values you may write strings, in which case you also need to pass the "codenames" dict mapping these strings to a method of how to draw the value they represent. In this case, additionally, the mean should be added to the tuple for faster computing. If it is dependent on some drawn constant, you can use the alias from the "codenames" dict. |
| codenames                           | dict | A dictionary specifying random drawings for the randomly drawn numbers in randomize_rewards_kwargs. It should use the same codewords as contained in the randomize_rewards_kwargs dictionary and the format as described in the rewards parameter of GridWorld. |
| randomize_game_modifications_kwargs | dict | A dictionary containing information regarding the arguments hovering, windy, wind_prob, wind_dir, slippery, slip_prob, random_actions, random_prob, and random_vec. It maps these to lists following the same conventions as in randomize_rewards_kwargs, with random_prob being a list of 4 of those. |
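
The following sketch shows what an `env_randomization_kwargs` dictionary for GridWorld could look like. The keyword names are taken from the tables above. The concrete values, the state names, and the exact tuple contents are assumptions for illustration and should be checked against the GridWorld init-function.

```python
# Keyword names follow the tables above; all values are illustrative assumptions.
env_randomization_kwargs = {
    "check_goal_is_goal": True,
    "discounted_reward_goal_limit": 0.95,
    "reward_normalization": True,
    # A callable receives the discounted sum of rewards and returns the
    # scaling factor, which should be bigger than one.
    "reward_normalization_factor_for_negatives": lambda discounted_sum: 2.0 + abs(discounted_sum),
    "reward_normalization_num_tries": 20,
    "randomization_kwargs": {
        # state name -> (number of states to draw, locations to draw from)
        "randomize_locations_kwargs": {
            "goal": (1, "all"),
            "start": (1, [0, 1, 2]),
        },
        # Dictionaries that are left out keep the initial GridWorld parameters.
    },
}

factor = env_randomization_kwargs["reward_normalization_factor_for_negatives"](-3.0)
```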

### env_randomization_schedule (List)

The list of seeds scheduled for being used during environment randomization. For each instance of -1, a random seed will be drawn.

### eval_reseeding (Bool)

If True, the evaluation environment's seed will be reseeded after each evaluation run after the evaluation seed schedule is exhausted.

### eval_seed_schedule (List)

The list of seeds scheduled for being used during evaluations. For each instance of -1, a random seed will be drawn.
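
The -1 convention, which is also described for env_randomization_schedule and training_seed_schedule, can be pictured with the small sketch below. The helper `resolve_seed_schedule` is hypothetical and only illustrates the idea; how the package draws its random seeds internally is not specified here.

```python
import random


def resolve_seed_schedule(schedule, rng=None):
    """Replace every -1 in the schedule by a freshly drawn random seed."""
    if rng is None:
        rng = random.Random()
    return [rng.randrange(2**32) if seed == -1 else seed for seed in schedule]


eval_seed_schedule = [0, 1, -1, 42, -1]
resolved = resolve_seed_schedule(eval_seed_schedule, random.Random(123))
```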

### eval_steps (Int)

The number of steps that the environment should be played during each evaluation.

### eval_freq (Int)

The number of update steps that should be performed before the algorithm is evaluated.

### eval_policy_choice (Str)

The choice of policy at evaluation times. Can be either greedy or softmax.

### eval_policy_choice_kwargs (Dict)

The keyword arguments corresponding to the choice of evaluation policy. For greedy this needs to be an empty dictionary and for softmax it needs to contain a value for the inverse temperature beta.
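
As a sketch of the two options, the snippet below builds both keyword dictionaries and shows a standard softmax over Q values with an inverse temperature. The key name "beta" is an assumption (the text only says the dictionary must contain a value for beta), and `softmax_probs` is a generic textbook formulation rather than the package's own code.

```python
import math


def softmax_probs(q_values, beta):
    """Action probabilities proportional to exp(beta * Q)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


greedy_eval = {"eval_policy_choice": "greedy", "eval_policy_choice_kwargs": {}}
softmax_eval = {
    "eval_policy_choice": "softmax",
    "eval_policy_choice_kwargs": {"beta": 2.0},  # key name "beta" is assumed
}

probs = softmax_probs([1.0, 0.5, 0.0], beta=2.0)
uniform = softmax_probs([1.0, 0.5, 0.0], beta=0.0)
```

A larger beta concentrates the probability mass on the action with the highest Q value, while beta equal to 0 yields a uniform choice.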

### focus_state_actions (Bool)

If True, the state action pairs contained in which_state_actions_focus will be focussed, meaning that their estimated Q values at evaluations and their biases at evaluations will be logged separately.

### max_steps_per_epoch (Int)

The maximum amount of update steps the algorithm is allowed to perform before the epoch gets cut off. If it is -1, there is no maximum amount.

### num_steps (Int)

The number of steps (as defined by the training mode) to be performed for each individual training cycle.

### policy (Policy)

The policy to be used during training.

### progress_single_games (Bool)

If True, for each individual training cycle a separate progress bar will be shown (and dropped after completion), if the general progress bar is also displayed.

### training_mode (Str)

The chosen training mode. It can either be "steps" or "epoch", depending on if you want to play a fixed number of steps or a fixed number of epochs in each training cycle.

### training_reseeding (Bool)

If True, the algorithm's seed will be reseeded after each step (in the sense of the training mode) after the training seed schedule is exhausted.

### training_seed_schedule (List)

The list of seeds scheduled for being used during training. For each instance of -1, a random seed will be drawn.

### which_state_actions_focus (Tuple)

Describes for which state action pairs the Q function and the bias at evaluations are to be logged separately. The first argument of the tuple is a list of integers corresponding to the chosen states. 'start' can be passed, corresponding to the start state. The second argument is a list of lists of actions corresponding to each of the chosen states. The actions can be either passed with their numerical value or one of the action values can be "best", meaning for this state, the best action (according to the estimated best actions) will be chosen.
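
A tuple in this shape might look as follows. The state and action numbers are made up, and passing 'start' as an element of the state list is one reading of the description above.

```python
focus_state_actions = True

# (list of states, list of action lists, one action list per state)
which_state_actions_focus = (
    ["start", 5],
    [["best"], [0, 2]],
)

focus_states, focus_actions = which_state_actions_focus
```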

## 3. Parameters to pass to the algorithms

### learning_rate_kwargs (Dict)

A dictionary that contains the keyword arguments for scheduling the learning rate. The following table summarizes the keywords and their options.

| Keyword      | Type  | Description                                                                      |
|--------------|-------|----------------------------------------------------------------------------------|
| initial_rate | float | The initial stepsize to be used at the beginning of the scheduling process. |
| mode         | str   | The mode with which the stepsize should be stepwise updated. The implemented modes are: "constant", meaning the initial rate will always be used as stepsize. "linear", meaning the initial rate and a desired end rate may be linearly interpolated between based on a set amount of steps or a slope in order to schedule the stepsizes. "rate", meaning a specified rate function is used to schedule the stepsizes until a final rate is reached. |
| mode_kwargs  | dict  | A dictionary containing the necessary keyword arguments for the chosen scheduling mode. For "constant" this needs to be a dictionary containing the keyword "final_rate" mapping to the same value as initial_rate. For "linear" it needs to be a dictionary containing the keywords "final_rate", "num_steps", and "slope", mapping to the desired final rate, the number of steps upon which it should be reached, and the slope at which this should happen. The value of the slope may either be positive, in which case the num_steps argument will be ignored and the slope will be used until the rate hits the final rate value, or it may be -1, meaning the num_steps argument (passed as a positive integer) will be used to determine the slope automatically. For "rate" it needs to be a dictionary containing the keywords "rate_fct", "iteration_num", and "final_rate", mapping to a rate function to be used, which should be a decreasing lambda function, the current iteration number, which needs to be set to one, and the desired final rate. |
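
The "constant" and "linear" modes can be sketched as below. The dictionaries follow the table. The helper `linear_rate` is a hypothetical illustration of how a decreasing linear schedule could be evaluated, assuming the rate is lowered by the slope on every step and clipped at the final rate.

```python
constant_kwargs = {
    "initial_rate": 0.1,
    "mode": "constant",
    "mode_kwargs": {"final_rate": 0.1},  # same value as initial_rate
}

linear_kwargs = {
    "initial_rate": 0.5,
    "mode": "linear",
    # slope == -1: derive the slope from num_steps
    "mode_kwargs": {"final_rate": 0.1, "num_steps": 4, "slope": -1},
}


def linear_rate(step, rate_kwargs):
    """Stepsize after `step` updates for a decreasing linear schedule."""
    initial = rate_kwargs["initial_rate"]
    mode_kwargs = rate_kwargs["mode_kwargs"]
    slope = mode_kwargs["slope"]
    if slope == -1:
        slope = (initial - mode_kwargs["final_rate"]) / mode_kwargs["num_steps"]
    # Never drop below the final rate.
    return max(mode_kwargs["final_rate"], initial - slope * step)


rates = [linear_rate(step, linear_kwargs) for step in range(6)]
```

With these values the stepsize falls from 0.5 to 0.1 over four steps and then stays at 0.1.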

### learning_rate_state_action_wise (Bool)

If True, the learning rate schedule will be executed separately for each state action pair as opposed to being updated for all state action pairs at the same time on each step. If the algorithm works using multiple copies, the learning rate schedule will be applied separately for each copy.

### gamma (Float)

The discount factor to be used for the value functions given as a float between 0 and 1.
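
For reference, the effect of the discount factor on a sequence of rewards can be illustrated with the standard discounted sum. The helper name `discounted_return` is made up for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(gamma**t * reward for t, reward in enumerate(rewards))


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```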

### algo_specific_params (Dict)

The rest of the possible parameters are algorithm specific. Some of them are equal for classes of algorithms. Therefore, in the following, for each class of algorithm and some individual algorithms a description of the parameters you can pass is provided.

#### 1. Q function based algorithms

This class currently comprises Q, Double, and WDQ.

| Keyword           | Type            | Description                                                      |
|-------------------|-----------------|-------------------------------------------------------------------|
| q_fct_manual_init | bool            | If True, the Q function(s) will be manually initialized, if False the initial Q functions will take the value 0 on all state action pairs. |
| initial_q_fct     | dict/list(dict) | The dictionary (or list of dictionaries) containing the initialization(s) of the Q function(s). In the case of the Q algorithm one dictionary needs to be passed, in the case of the Double and WDQ algorithm either one dictionary (meaning both Q functions will be initialized with the same custom Q function), or a list of two dictionaries may be passed. The keys of the dictionary (or dictionaries) should correspond to all allowed tuples of state and action (as integers) for the chosen game and the values of the dictionary (or dictionaries) to the Q values to be initialized. |
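
A manual initialization could be built as in the sketch below. It assumes a hypothetical game with 4 states in which the same 2 actions are allowed everywhere; for a real game the keys must match the allowed state action pairs.

```python
num_states, num_actions = 4, 2  # hypothetical sizes

# One entry per (state, action) pair, here initialized optimistically.
initial_q_fct = {
    (state, action): 0.5
    for state in range(num_states)
    for action in range(num_actions)
}

# Q algorithm: a single dictionary.
q_params = {"q_fct_manual_init": True, "initial_q_fct": initial_q_fct}

# Double or WDQ: optionally a list of two dictionaries, one per Q function.
double_params = {
    "q_fct_manual_init": True,
    "initial_q_fct": [dict(initial_q_fct), {pair: 0.0 for pair in initial_q_fct}],
}
```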

## 4. Parameters to pass to the environment

### env_specific_params (Dict)

For environments, there are no parameters shared in common. Therefore, in the following, for each game a description of the parameters you can pass is provided. If environment randomization is on, the environment will be initialized with the specified parameters but will be modified according to the specified environment randomization arguments passed before the first run.

#### 1. GridWorld

| Keyword        | Type  | Description                                                                    |
|----------------|-------|--------------------------------------------------------------------------------|
| grid_size      | tuple | The size of the grid given as (rows, columns). Both must be positive integers. |
| state_type_loc | dict  | The dictionary mapping state types given as strings with their name to locations and information on whether it is a terminal state. Each entry is a tuple where the first element is a list of coordinates, given as (row,column) for that state type, where the row and column numeration starts with one, and the second element is a boolean indicating whether the state is terminal. Needs to contain the locations of "goal" and "start". The goal must be terminal, while the start cannot be terminal. |
| rewards        | dict  | The dictionary mapping state types given as strings with their name (the same keywords as in the state_type_loc dictionary need to be passed) to their respective rewards. The special key "default" is used for all states not specified in the dictionary and must be passed. Some states may have stochastic rewards represented by a distribution (e.g., normal). In this case a list containing the distribution name and a dictionary of keyword arguments compatible with the numpy random generator need to be passed. Additionally, the mean of the distribution may be added as a third element of the list. In all other cases, the reward must be a numerical value. |
| hovering       | bool  | If True, the player is allowed to choose actions that make it bump into the wall and thus hover in the same place. |
| windy          | bool  | If True, wind is applied to the environment and the player may be pushed in a direction given by wind_dir instead of the direction of the action it chooses. Can not be activated in combination with slippery and/or random_actions. |
| wind_prob      | float | The probability that wind will affect the environment in each step if turned on. Must be a numerical value between 0 and 1. |
| wind_dir       | str   | Direction of the wind as one of the following strings: "up", "right", "down", "left". |
| slippery       | bool  | If True, the environment is slippery, causing random movement adjacent to the player's chosen action. Can not be activated in combination with windy and/or random_actions. |
| slip_prob      | float | The probability that a random slip occurs if turned on. Must be a numerical value between 0 and 1. |
| random_actions | bool  | If True, the environment may perform random actions instead of the player's chosen action. Can not be activated in combination with windy and/or slippery. |
| random_prob    | float | The probability that a random action will be taken if turned on. Must be a numerical value between 0 and 1. |
| random_vec     | list  | A list of four probabilities, corresponding in order to the probability of randomly moving up, right, down, and left, if turned on. All values in the list need to be numerical and between 0 and 1. |
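
Putting the table together, an `env_specific_params` dictionary for a small GridWorld might look like this sketch. The "trap" state type and all numbers are hypothetical. The keyword arguments `loc` and `scale` are those of the normal distribution in the numpy random generator.

```python
env_specific_params = {
    "grid_size": (3, 4),  # (rows, columns)
    "state_type_loc": {
        # state type -> (list of (row, column) starting at one, is_terminal)
        "start": ([(3, 1)], False),
        "goal": ([(1, 4)], True),
        "trap": ([(2, 2)], True),  # hypothetical additional state type
    },
    "rewards": {
        "default": -0.1,
        "start": -0.1,
        "goal": 1.0,
        # stochastic reward: [distribution name, kwargs, optional mean]
        "trap": ["normal", {"loc": -1.0, "scale": 0.5}, -1.0],
    },
    "hovering": True,
    "windy": True,
    "wind_prob": 0.1,
    "wind_dir": "up",
    "slippery": False,
    "slip_prob": 0.0,
    "random_actions": False,
    "random_prob": 0.0,
    "random_vec": [0.25, 0.25, 0.25, 0.25],  # up, right, down, left
}

# windy, slippery, and random_actions cannot be combined.
active_modifications = sum(
    env_specific_params[key] for key in ("windy", "slippery", "random_actions")
)
```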

## 5. Parameters to pass to the policy

### policy_specific_params (Dict)

Only one policy is implemented at the moment. Since future policies might not share the same parameters, the parameters you can pass are described below for the implemented policy, and the list can be extended as further policies are added.

#### 1. BasePolicy

| Keyword            | Type | Description                                                                 |
|--------------------|------|----------------------------------------------------------------------------|
| policy_mode        | str  | Mode for choosing the next action via the policy. The implemented modes are: "offpolicy", meaning a certain policy determined by the kwargs will always be played. "epsilon_greedy" and "epsilon_greedy_statewise", meaning the policy will decide greedy with respect to the current Q function and a certain rate in each step, which will be updated via a schedule determined by the kwargs either for all steps at once or individually, if statewise is chosen. "greedy", meaning the policy will decide greedy with respect to the current Q function. "softmax", meaning the policy will decide with respect to the softmax of the current Q function. |
| policy_mode_kwargs | dict | The dictionary containing the necessary keyword arguments for the chosen policy mode. These will be explained in the table below for each policy mode individually. |

##### Policy mode "offpolicy" 

| Keyword | Type | Description                                                                            |
|---------|------|----------------------------------------------------------------------------------------|
| type    | str  | The type of behaviour policy chosen. The implemented types are: "uniform_random", meaning the behaviour policy will be a uniform random one over all allowed actions in each state. "full_init", meaning a behaviour policy will be manually passed. |
| kwargs  | dict | The dictionary containing the necessary keyword arguments for the chosen behaviour policy type. For "uniform_random" this needs to be an empty dictionary. For "full_init" it needs to only contain the keyword "policy_list", mapping to the desired behaviour policy. The behaviour policy needs to be passed as a list that has the same length as the played game has number of states and contain the actions. For some states the behaviour policy may have stochastic preferences among the arms. In this case a list containing the distribution name "choice" and a dictionary containing the keywords "a" and "p", mapping to a list of arms "a" (passed as integers), between which the agent may choose with probabilities "p", need to be passed. In all other cases, the action must be a numerical value. |
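
For a hypothetical game with 4 states, a manually passed behaviour policy could be written as in the following sketch. The action numbers and probabilities are made up; the nesting of the dictionaries follows the tables above.

```python
num_states = 4  # hypothetical number of states of the played game

policy_list = [
    0,                                           # state 0: always action 0
    1,                                           # state 1: always action 1
    ["choice", {"a": [0, 2], "p": [0.3, 0.7]}],  # state 2: stochastic choice
    3,                                           # state 3: always action 3
]

policy_specific_params = {
    "policy_mode": "offpolicy",
    "policy_mode_kwargs": {
        "type": "full_init",
        "kwargs": {"policy_list": policy_list},
    },
}

# The uniform random behaviour policy needs an empty kwargs dictionary.
uniform_policy_params = {
    "policy_mode": "offpolicy",
    "policy_mode_kwargs": {"type": "uniform_random", "kwargs": {}},
}
```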

##### Policy modes "epsilon_greedy" and "epsilon_greedy_statewise"

| Keyword      | Type  | Description                                                                      |
|--------------|-------|----------------------------------------------------------------------------------|
| initial_rate | float | The initial epsilon to be used at the beginning of the scheduling process. |
| mode         | str   | The mode with which the rate should be stepwise updated. The implemented modes are: "constant", meaning the initial rate will always be used as epsilon. "linear", meaning the initial rate and a desired end rate may be linearly interpolated between based on a set amount of steps or a slope in order to schedule the epsilon. "rate", meaning a specified rate function is used to schedule the epsilon until a final rate is reached. |
| mode_kwargs  | dict  | The dictionary containing the necessary keyword arguments for the chosen scheduling mode. For "constant" this needs to be a dictionary containing the keyword "final_rate" mapping to the same value as initial_rate. For "linear" it needs to be a dictionary containing the keywords "final_rate", "num_steps", and "slope", mapping to the desired final rate, the number of steps upon which it should be reached, and the slope at which this should happen. The value of the slope may either be positive, in which case the num_steps argument will be ignored and the slope will be used until the rate hits the final rate value, or it may be -1, meaning the num_steps argument (passed as a positive integer) will be used to determine the slope automatically. For "rate" it needs to be a dictionary containing the keywords "rate_fct", "iteration_num", and "final_rate", mapping to a rate function to be used, which should be a decreasing lambda function, the current iteration number, which needs to be set to one, and the desired final rate. |
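
As an illustration of the "rate" mode, the sketch below configures an epsilon that decays as 1/n. The helper `epsilon_at` is hypothetical. It assumes that the rate function maps the iteration number to epsilon and that the result is clipped at the final rate, which may differ from how the package applies the schedule.

```python
policy_mode_kwargs = {
    "initial_rate": 1.0,
    "mode": "rate",
    "mode_kwargs": {
        "rate_fct": lambda n: 1.0 / n,  # decreasing in the iteration number
        "iteration_num": 1,             # needs to be set to one
        "final_rate": 0.05,
    },
}


def epsilon_at(iteration, kwargs):
    """Epsilon at a given iteration, clipped from below at final_rate."""
    mode_kwargs = kwargs["mode_kwargs"]
    return max(mode_kwargs["final_rate"], mode_kwargs["rate_fct"](iteration))


epsilons = [epsilon_at(n, policy_mode_kwargs) for n in (1, 10, 100)]
```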

##### Policy mode "greedy"

For this policy mode, the dictionary "policy_mode_kwargs" needs to be left empty.

##### Policy mode "softmax"

| Keyword     | Type  | Description                                                                       |
|-------------|-------|-----------------------------------------------------------------------------------|
| temperature | float | The temperature used in the softmax function.                                     |