# RLLF
Reinforcement Learning from Limited Feedback

Domain requirements:
- state: dimension [1, N], type np.int8
- action: discrete

Available parameters:

- domains: graph, tree, tworooms, breakout  (TODO: icu, different minatar)

- selection methods: heuristic, sequential (greedy, beam)

- dataset
    - data_collecting: 'good', 'mid', 'bad'
    - dataset_size: 100_000 for small domains, 1_000_000 for large domains

- heuristic selections:
    - algo: guided, visit, uniform
    - impute: none, zero, mean, max, min  # none disables imputation, so a truncated Q-function is learned
    - alpha schedule:
        - decay: concave, convex, linear
        - fixtime: a value in (0, 1) giving the point at which alpha has decayed to zero. E.g., 0.5 means alpha reaches zero when iteration b reaches 0.5B
        - decay_temp: controls the curvature of 'concave' and 'convex'. When it is 0, 'concave' and 'convex' both reduce to 'linear'.
    - initial_sample: number of samples to start with; the default is 1

- other:
    - budget: when not specified, the default is to iterate from 0 to B.
    - root: where to save the result files
    - expname: experiment name
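
The alpha schedule parameters above (decay, fixtime, decay_temp) can be illustrated with a minimal sketch. The exact formula used in this repository is not documented here, so the exponential curve below, the function name `alpha_schedule`, and the assumption that alpha starts at 1 are all hypothetical. The sketch only reproduces the documented behaviour: alpha reaches zero at `fixtime * B`, and `decay_temp = 0` makes 'concave' and 'convex' coincide with 'linear'.

```python
import math


def alpha_schedule(b, B, decay="linear", fixtime=0.5, decay_temp=0.0):
    """Hypothetical alpha value at iteration b out of a budget of B.

    Assumption: alpha starts at 1 and reaches 0 at b = fixtime * B.
    """
    if decay not in ("linear", "concave", "convex"):
        raise ValueError(f"unknown decay: {decay}")
    if not 0.0 < fixtime < 1.0:
        raise ValueError("fixtime must lie in (0, 1)")

    # Normalised progress towards the point where alpha hits zero.
    t = min(b / (fixtime * B), 1.0)

    if decay == "linear" or decay_temp == 0:
        return 1.0 - t

    def g(u):
        # Concave, increasing map of [0, 1] onto [0, 1];
        # approaches the identity as decay_temp goes to 0.
        return math.expm1(-decay_temp * u) / math.expm1(-decay_temp)

    if decay == "concave":
        return g(1.0 - t)  # stays high early, drops late
    return 1.0 - g(t)      # convex: drops early, flattens late


# Compare the three shapes a quarter of the way through the budget.
for shape in ("concave", "linear", "convex"):
    print(shape, round(alpha_schedule(25, 100, shape, 0.5, 12), 4))
```

With this particular curve, a larger `decay_temp` bends 'concave' and 'convex' further away from the straight line; the real implementation may use a different family of curves.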


For example, to run an experiment on the 'tworooms' domain:
```
python main.py domain=tworooms domain.exp.algo=guided domain.exp.decay=linear domain.exp.fixtime=0.7 domain.exp.decay_temp=12 general.expname=batchjob1 general.seed=99 domain.exp.initial_sample=1 domain.exp.impute=none
```
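
To repeat the command above over several seeds, the `key=value` overrides can be assembled programmatically. This is a minimal sketch: the helper `build_command` and the extra seeds 97 and 98 are hypothetical, and only the override keys shown in the example command are used.

```python
import shlex


def build_command(overrides, script="main.py"):
    """Turn a dict of key=value overrides into an argument list."""
    args = ["python", script]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Overrides copied from the example command, without the seed.
base = {
    "domain": "tworooms",
    "domain.exp.algo": "guided",
    "domain.exp.decay": "linear",
    "domain.exp.fixtime": 0.7,
    "domain.exp.decay_temp": 12,
    "domain.exp.initial_sample": 1,
    "domain.exp.impute": "none",
    "general.expname": "batchjob1",
}

# One command per seed; seeds other than 99 are illustrative.
commands = [build_command({**base, "general.seed": seed}) for seed in (97, 98, 99)]

for cmd in commands:
    print(shlex.join(cmd))
```

Each entry of `commands` is a plain argument list, so it can be passed to `subprocess.run(cmd, check=True)` to launch the run.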


