# Knowledge Tracing Set Transformers (KTSTs)

Code for _Toward Principled Transformers for Knowledge Tracing_

> Knowledge tracing aims to reason about changes in students' knowledge and to predict students' performance in educational learning settings. We propose knowledge tracing set transformers (KTSTs), a straightforward model class for knowledge tracing prediction tasks. This model class is conceptually simpler than previous state-of-the-art approaches, which are overly complex due to domain-inspired components, and which are in part based on suboptimal design choices and flawed evaluation. In contrast, for KTSTs we propose principled set representations of student interactions and a simplified variant of learnable modification of attention matrices for positional information in a student's learning history. While being largely domain agnostic, the proposed model class thus accounts for characteristic traits of knowledge tracing tasks. In extensive empirical experiments on standardized benchmark datasets, KTSTs establish new state-of-the-art performance.

## Installation

We use [conda](https://docs.conda.io/) to manage dependencies. 
To create a new conda environment with the necessary dependencies installed, execute the following command:

```bash
conda env create --file env.yaml
```

By default, this will create a conda environment named `knowledge_tracing`. To activate this environment, use:

```bash
conda activate knowledge_tracing
```

## Usage

### Pre-process data

To run the experiments, the datasets must first be downloaded and then pre-processed. 
The code for pre-processing is stored in `preprocess.py`, which has a CLI and internally uses the [pykt](https://github.com/pykt-team/pykt-toolkit) data pre-processing logic. 
Run the following command for more information about the pre-processing of the data, e.g. the links to the datasets:

```bash
python preprocess.py --help
```

### Train a model

We use [hydra](https://hydra.cc/) to manage the configurations for models, data, training, evaluation, reproduction, and so on. 
All configs are stored in `pkg/config`. 

In order to train a single KTST model with its default config on the assist2009 dataset, execute the following command:

```bash
python train.py +model=ktst +data=assist2009 data_path=<path_to_data_dir>
```

This will create a folder in `outputs` in which all relevant files (hydra configs, model checkpoints, tensorboard files, logs, etc.) are stored. 

### Evaluate a model

The training procedure logs and stores model states in the `outputs` folder. 
To evaluate a trained model, point `restore_from` at its experiment directory (see `scripts` for examples):

```bash
python eval.py restore_from=<name_of_dir_in_output_folder>
```

### Reproduce results from paper

Configuration files for reproducing the results are stored under `pkg/config/reproduce`.
Each sub-directory has a `multirun.yaml` that starts all training runs: 

```bash
python train.py -m +reproduce=<path-to-multirun-file>
```

For example, to train the benchmark models, run the following command:

```bash
python train.py -m +reproduce=ktst/benchmark/multirun
```
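The example above suggests that the `+reproduce=` value is the path of a `multirun.yaml` relative to `pkg/config/reproduce`, without the `.yaml` extension. Under that assumption, the sketch below lists the corresponding command for every multirun file it finds. To keep it self-contained, it builds a throwaway directory tree with `mktemp` as a stand-in for the real config folder; the variable names are illustrative only.

```bash
# Stand-in for pkg/config/reproduce (hypothetical temp tree, for illustration only).
CONFIG_ROOT="$(mktemp -d)/pkg/config/reproduce"
mkdir -p "$CONFIG_ROOT/ktst/benchmark"
touch "$CONFIG_ROOT/ktst/benchmark/multirun.yaml"

# For each multirun.yaml, strip the root prefix and the .yaml suffix
# to obtain the value passed to +reproduce= (assumed mapping, see above).
CMDS="$(find "$CONFIG_ROOT" -type f -name multirun.yaml | sort | while read -r f; do
    rel="${f#"$CONFIG_ROOT"/}"
    echo "python train.py -m +reproduce=${rel%.yaml}"
done)"
echo "$CMDS"
```

In the repository itself, setting `CONFIG_ROOT=pkg/config/reproduce` and dropping the `mkdir`/`touch` lines should print one command per available multirun file.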

Notebooks containing the code for creating tables and figures can be found under `scripts`. 

## Run pykt benchmark

We provide a convenient way to run the [pykt](https://pykt.org/) benchmark and compute its results (for example, to conduct paired _t_-tests) on a GPU cluster. 
The script `pykt_benchmark.py` has a CLI that starts both training and evaluation of pykt models. 
For more information on the CLI, run the following command: 

```bash
python pykt_benchmark.py --help
```

### (Important) Linking data folder

The pykt-toolkit locates the `data` folder through a relative path. 
For this to work, the data directory must be soft-linked into the `pykt-toolkit` subfolder: 

```bash
ln -s <link_to_data_dir> <link_to_this_repo>/pykt-toolkit/data
```
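A quick check that the link resolves can save a failed run later. The sketch below creates the link as in the command above and then tests that it is a symlink pointing at an existing directory. The two `mktemp` directories are hypothetical stand-ins for your data directory and repository path, so replace them with your own.

```bash
# Stand-ins for the data directory and the repository root (hypothetical temp paths).
DATA_DIR="$(mktemp -d)"
REPO_DIR="$(mktemp -d)"
mkdir -p "$REPO_DIR/pykt-toolkit"

# Create the soft link.
ln -s "$DATA_DIR" "$REPO_DIR/pykt-toolkit/data"

# -L: the path is a symlink; -d: it resolves to an existing directory.
if [ -L "$REPO_DIR/pykt-toolkit/data" ] && [ -d "$REPO_DIR/pykt-toolkit/data" ]; then
    LINK_STATUS="ok"
else
    LINK_STATUS="broken"
fi
echo "data link: $LINK_STATUS"
```

A `broken` result usually means the link target does not exist, for instance because a relative path was resolved from a different working directory.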

### Training models

To train particular models on specific datasets, pass them as comma-separated values.
For example, to train the AKT and DKT models on assist2009, the following command can be used:

```bash
python pykt_benchmark.py train --models akt,dkt --datasets assist2009
```

If no models are specified, all available models are trained; if no datasets are specified, all available datasets are used. 
Unless specified otherwise, training results are stored in `outputs/pykt-models`, organized by model, dataset and fold id. 
The presence of a `done` file in a run's folder indicates that training completed successfully. 
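
Before starting an evaluation, it can help to see which runs have finished. The sketch below counts run folders with and without a `done` marker. It assumes a nesting of `<model>/<dataset>/<fold>` below the results folder, which is a guess based on the description above and may differ from the actual layout; the tree is built in a temporary directory purely for illustration.

```bash
# Throwaway tree mimicking an ASSUMED layout: <root>/<model>/<dataset>/<fold>/done
ROOT="$(mktemp -d)/pykt-models"
mkdir -p "$ROOT/akt/assist2009/0" "$ROOT/akt/assist2009/1" "$ROOT/dkt/assist2009/0"
touch "$ROOT/akt/assist2009/0/done" "$ROOT/dkt/assist2009/0/done"

# Walk every run folder and check for the `done` marker.
N_DONE=0
N_MISSING=0
for run_dir in "$ROOT"/*/*/*/; do
    if [ -e "${run_dir}done" ]; then
        N_DONE=$((N_DONE + 1))
    else
        N_MISSING=$((N_MISSING + 1))
        echo "not finished: $run_dir"
    fi
done
echo "finished: $N_DONE, unfinished: $N_MISSING"
```

To use it on real results, set `ROOT=outputs/pykt-models` (or your custom output folder), remove the `mkdir`/`touch` lines, and adjust the glob depth if the folder nesting differs.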

### Evaluating models

To run the evaluation, change the mode from `train` to `eval`.
The models must have been trained beforehand, and the `done` file must be present in the respective directory.

```bash
python pykt_benchmark.py eval --models akt,dkt --datasets assist2009
```

## Misc

### Sweep model parameters

We provide a script for tuning model parameters with [Optuna](https://hydra.cc/docs/plugins/optuna_sweeper/).
You can retrieve information on available parameters via the CLI.

```bash
python sweep.py --help
```

A config defines the model and its corresponding parameter search space. 
Example:

```bash
python sweep.py sweep/ktst=benchmark_mean --datasets statics2011 --folds 0
```

## License

[MIT](https://choosealicense.com/licenses/mit/)
