{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BBC dataset \n",
    "**Caution:** Many data valuation methods require training large number of models to get reliable estimates. **It is extremely slow**. We recommend using embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "\n",
    "import sys\n",
    "import os\n",
    "# Add the parent directory to the sys.path\n",
    "sys.path.append(os.path.abspath('../opendataval'))\n",
    "\n",
    "# Opendataval\n",
    "from dataloader import Register, DataFetcher, mix_labels, add_gauss_noise\n",
    "from dataval import (\n",
    "    AME,\n",
    "    DVRL,\n",
    "    BetaShapley,\n",
    "    DataBanzhaf,\n",
    "    DataOob,\n",
    "    DataShapley,\n",
    "    InfluenceSubsample,\n",
    "    GradientShapley,\n",
    "    KNNShapley,\n",
    "    LavaEvaluator,\n",
    "    LeaveOneOut,\n",
    "    RandomEvaluator,\n",
    "    RobustVolumeShapley,\n",
    ")\n",
    "\n",
    "from experiment import ExperimentMediator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Step 1] Set up an environment\n",
    "`ExperimentMediator` is a fundamental concept in establishing the `opendataval` environment. It empowers users to configure hyperparameters, including a dataset, a type of synthetic noise, and a prediction model. With  `ExperimentMediator`, users can effortlessly compute various data valuation algorithms.\n",
    "\n",
    "The following code cell demonstrates how to set up `ExperimentMediator` with a pre-registered dataset and a prediction model.\n",
    "- Dataset: bbc\n",
    "- Model: transformer's DistilBertModel\n",
    "- Metric: Classification accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_name = \"bbc\" \n",
    "train_count, valid_count, test_count = 1000, 100, 500\n",
    "noise_rate = 0.1\n",
    "noise_kwargs = {'noise_rate': noise_rate}\n",
    "model_name = \"BertClassifier\"\n",
    "metric_name = \"accuracy\"\n",
    "train_kwargs = {\"epochs\": 2, \"batch_size\": 50}\n",
    "device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'\n",
    "\n",
    "exper_med = ExperimentMediator.model_factory_setup(\n",
    "    dataset_name=dataset_name,\n",
    "    cache_dir=\"../data_files/\",  \n",
    "    force_download=False,\n",
    "    train_count=train_count,\n",
    "    valid_count=valid_count,\n",
    "    test_count=test_count,\n",
    "    add_noise=mix_labels,\n",
    "    noise_kwargs=noise_kwargs,\n",
    "    train_kwargs=train_kwargs,\n",
    "    device=device,\n",
    "    model_name=model_name,\n",
    "    metric_name=metric_name\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Step 2] Compute data values\n",
    "`opendataval` provides various state-of-the-art data valuation algorithms. `ExperimentMediator.compute_data_values()` computes data values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_evaluators = [ \n",
    "    GradientShapley(),\n",
    "    RandomEvaluator(),\n",
    "#     LeaveOneOut(), # leave one out ## slow\n",
    "    InfluenceSubsample(num_models=10), # influence function\n",
    "#     DVRL(rl_epochs=10), # Data valuation using Reinforcement Learning ## inappropriate\n",
    "#     KNNShapley(k_neighbors=valid_count), # KNN-Shapley ## inappropriate\n",
    "#     DataShapley(gr_threshold=1.05, mc_epochs=300, cache_name=f\"cached\"), # Data-Shapley ## slow\n",
    "#     BetaShapley(gr_threshold=1.05, mc_epochs=300, cache_name=f\"cached\"), # Beta-Shapley ## slow\n",
    "    DataBanzhaf(num_models=10), # Data-Banzhaf\n",
    "    AME(num_models=10), # Average Marginal Effects\n",
    "    DataOob(num_models=10) # Data-OOB\n",
    "#     LavaEvaluator(),\n",
    "#     RobustVolumeShapley(mc_epochs=300)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# compute data values.\n",
    "## Training multiple DistilBERT models is extremely slow. We recommend using embeddings.\n",
    "exper_med = exper_med.compute_data_values(data_evaluators=data_evaluators)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Step 3] Store data values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from opendataval.experiment.exper_methods import save_dataval\n",
    "\n",
    "# Saving the results\n",
    "output_dir = f\"../tmp/{dataset_name}_{noise_rate=}/\"\n",
    "exper_med.set_output_directory(output_dir)\n",
    "output_dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exper_med.evaluate(save_dataval, save_output=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
