{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to make configurations\n",
    "\n",
    "FuxiCTR version: v1.0\n",
    "\n",
    "This tutorial presents the details of how to use the YAML config files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset_config contains the following keys:\n",
    "\n",
    "+ **dataset_id**: the key used to denote a dataset split, e.g., taobao_tiny_data\n",
    "+ **data_root**: the directory to save or load the h5 dataset files\n",
    "+ **data_format**: csv | h5\n",
    "+ **train_data**: training data file path\n",
    "+ **valid_data**: validation data file path\n",
    "+ **test_data**: test data file path\n",
    "+ **min_categr_count**: the default threshold used to filter rare features\n",
    "+ **feature_cols**: a list of feature columns, each containing the following keys\n",
    "  - **name**: feature name, i.e., column header name.\n",
    "  - **active**: True | False, whether to use the feature.\n",
    "  - **dtype**: the data type of this column.\n",
    "  - **type**: categorical | numeric | sequence, which type of features.\n",
    "  - **source**: optional, which feature source, such as user/item/context.\n",
    "  - **share_embedding**: optional, specify which feature_name to share embedding.\n",
    "  - **embedding_dim**: optional, embedding dim of a specific field, overriding the default embedding_dim if used.\n",
    "  - **pretrained_emb**: optional, filepath of pretrained embedding, which should be a h5 file with two columns (id, emb).\n",
    "  - **freeze_emb**: optional, True | False, whether to freeze embedding is pretrained_emb is used.\n",
    "  - **encoder**: optional, \"MaskedAveragePooling\" | \"MaskedSumPooling\" | \"null\", specify how to pool the sequence feature. \"MaskedAveragePooling\" is used by default. \"null\" means no pooling is required.\n",
    "  - **splitter**: optional, how to split the sequence feature during preprocessing; the space \" \" is used by default. \n",
    "  - **max_len**: optional, the max length set to pad or truncate the sequence features. If not specified, the max length of all the training samples will be used. \n",
    "  - **padding**: optional, \"pre\" | \"post\", either pre padding or post padding the sequence.\n",
    "  - **na_value**: optinal, what value used to fill the missing entries of a field; \"\" is used by default.\n",
    "+ **label_col**: label name, i.e., the column header of the label\n",
    "  - **name**: the column header name for label\n",
    "  - **dtype**: the data type\n",
    "  \n",
    "The model_config contains the following keys:\n",
    "\n",
    "+ **expid**: the key used to denote an experiment id, e.g., DeepFM_test. Each expid corresponds to a dataset_id and the model hyper-parameters used for experiment.\n",
    "+ **model_root**: the directory to save or load the model checkpoints and running logs.\n",
    "+ **workers**: the number of processes used for data generator.\n",
    "+ **verbose**: 0 for disabling tqdm progress bar; 1 for enabling tqdm progress bar.\n",
    "+ **patience**: how many epochs to stop training if no improvments are made.\n",
    "+ **pickle_feature_encoder**: True | False, whether pickle feature_encoder\n",
    "+ **use_hdf5**: True | False, whether reuse h5 data if available\n",
    "+ **save_best_only**: True | False, whether to save the best model weights only.\n",
    "+ **every_x_epochs**: how many epochs to evaluate the model on valiadtion set, float supported. For example, 0.5 denotes to evaluate every half epoch.\n",
    "+ **debug**: True | False, whether to enable debug mode. If enabled, every run will generate a new expid to avoid conflicted runs on two code versions. \n",
    "+ **model**: model name used to load the specific model class\n",
    "+ **dataset_id**: the dataset_id used for the experiment\n",
    "+ **loss**: currently support \"binary_crossentropy\" only.\n",
    "+ **metrics**: list, currently support ['logloss', 'AUC'] only\n",
    "+ **task**: currently support \"binary_classification\" only\n",
    "+ **optimizer**: \"adam\" is used by default\n",
    "+ **learning_rate**: the initial learning rate\n",
    "+ **batch_size**: the batch size for model training\n",
    "+ **embedding_dim**: the default embedding dim for all feature fields. It will be ignored if a feature has embedding_dim value.\n",
    "+ **epochs**: the max number of epochs for model training\n",
    "+ **shuffle**: True | False, whether to shuffle data for each epoch\n",
    "+ **seed**: int, fix the random seed for reproduciblity\n",
    "+ **monitor**: 'AUC' | 'logloss' | {'AUC': 1, 'logloss': -1}, the metric used to determine early stopping. The dict can be used for combine multiple metrics. E.g., {'AUC': 2, 'logloss': -1} means 2 * AUC - logloss and the larger the better. \n",
    "+ **monitor_mode**: 'max' | 'min', the mode of the metric. E.g., 'max' for AUC and 'min' for logloss.\n",
    "\n",
    "There are also some model-specific hyper-parameters. E.g., DeepFM has the following specific hyper-parameters:\n",
    "+ **hidden_units**: list, hidden units of MLP\n",
    "+ **hidden_activations**: str or list, e.g., 'relu' or ['relu', 'tanh']. When each layer has the same activation, one could use str; otherwise use list to set activations for each layer.\n",
    "+ **net_regularizer**: regularizaiton weight for MLP, supporting different types such as 1.e-8 | l2(1.e-8) | l1(1.e-8) | l1_l2(1.e-8, 1.e-8). l2 norm is used by default.\n",
    "+ **embedding_regularizer**: regularizaiton weight for feature embeddings, supporting different types such as 1.e-8 | l2(1.e-8) | l1(1.e-8) | l1_l2(1.e-8, 1.e-8). l2 norm is used by default.\n",
    "+ **net_dropout**: dropout rate for MLP, e.g., 0.1 denotes that hidden values are dropped randomly with 10% probability. \n",
    "+ **batch_norm**: False | True, whether to apply batch normalizaiton on MLP.\n",
    "\n",
    "\n",
    "Many config files are available at https://github.com/xue-pai/FuxiCTR/tree/main/config for your reference. Here, we take the config [demo/demo_config](https://github.com/xue-pai/FuxiCTR/tree/main/demo/demo_config) as an example. The dataset_config.yaml and model_config.yaml are as follows. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# dataset_config.yaml\n",
    "taobao_tiny_data: # dataset_id\n",
    "    data_root: ../data/\n",
    "    data_format: csv\n",
    "    train_data: ../data/tiny_data/train_sample.csv\n",
    "    valid_data: ../data/tiny_data/valid_sample.csv\n",
    "    test_data: ../data/tiny_data/test_sample.csv\n",
    "    min_categr_count: 1\n",
    "    feature_cols:\n",
    "        - {name: [\"userid\",\"adgroup_id\",\"pid\",\"cate_id\",\"campaign_id\",\"customer\",\"brand\",\"cms_segid\",\n",
    "                  \"cms_group_id\",\"final_gender_code\",\"age_level\",\"pvalue_level\",\"shopping_level\",\"occupation\"], \n",
    "                  active: True, dtype: str, type: categorical}\n",
    "    label_col: {name: clk, dtype: float}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we merge the feature_cols with the same config settings for compactness. But we also could expand them as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taobao_tiny_data:\n",
    "    data_root: ../data/\n",
    "    data_format: csv\n",
    "    train_data: ../data/tiny_data/train_sample.csv\n",
    "    valid_data: ../data/tiny_data/valid_sample.csv\n",
    "    test_data: ../data/tiny_data/test_sample.csv\n",
    "    min_categr_count: 1\n",
    "    feature_cols:\n",
    "        [{name: \"userid\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"adgroup_id\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"pid\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"cate_id\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"campaign_id\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"customer\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"brand\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"cms_segid\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"cms_group_id\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"final_gender_code\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"age_level\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"pvalue_level\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"shopping_level\", active: True, dtype: str, type: categorical},\n",
    "         {name: \"occupation\", active: True, dtype: str, type: categorical}]\n",
    "    label_col: {name: clk, dtype: float}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following model config contains two parts. When `Base` is available, the base settings will be shared by all expids. The base settings can be also overridden in expid with the same key. This design is for compactness when a large group of model configs are available, as shown in `./config` folder. `Base` and expid `DeepFM_test` can be either put in the same `model_config.yaml` file or the same `model_config` directory. Note that in any case, each expid should be unique among all the expids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model_config.yaml\n",
    "Base: \n",
    "    model_root: '../checkpoints/'\n",
    "    workers: 3\n",
    "    verbose: 1\n",
    "    patience: 2\n",
    "    pickle_feature_encoder: True\n",
    "    use_hdf5: True\n",
    "    save_best_only: True\n",
    "    every_x_epochs: 1\n",
    "    debug: False\n",
    "\n",
    "DeepFM_test:\n",
    "    model: DeepFM\n",
    "    dataset_id: taobao_tiny_data # each expid corresponds to a dataset_id\n",
    "    loss: 'binary_crossentropy'\n",
    "    metrics: ['logloss', 'AUC']\n",
    "    task: binary_classification\n",
    "    optimizer: adam\n",
    "    hidden_units: [64, 32]\n",
    "    hidden_activations: relu\n",
    "    net_regularizer: 0\n",
    "    embedding_regularizer: 1.e-8\n",
    "    learning_rate: 1.e-3\n",
    "    batch_norm: False\n",
    "    net_dropout: 0\n",
    "    batch_size: 128\n",
    "    embedding_dim: 4\n",
    "    epochs: 1\n",
    "    shuffle: True\n",
    "    seed: 2019\n",
    "    monitor: 'AUC'\n",
    "    monitor_mode: 'max'"
   ]
  },
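  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the override rule, the fragment below is a minimal sketch in which a second expid changes one `Base` setting. The expid `DeepFM_patience5` and its value are hypothetical, and most model settings are left out for brevity; a usable expid would need the same keys as `DeepFM_test`.\n",
    "\n",
    "```yaml\n",
    "Base:\n",
    "    patience: 2\n",
    "    verbose: 1\n",
    "\n",
    "DeepFM_patience5: # hypothetical expid\n",
    "    patience: 5 # same key as in Base, so this value overrides the base setting\n",
    "    model: DeepFM\n",
    "    dataset_id: taobao_tiny_data\n",
    "```"
   ]
  },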
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `load_config` method will automatically merge the above two parts. If you prefer, it is also flexible to remove `Base` and declare all the settings using only one dict as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DeepFM_test:\n",
    "    model_root: '../checkpoints/'\n",
    "    workers: 3\n",
    "    verbose: 1\n",
    "    patience: 2\n",
    "    pickle_feature_encoder: True\n",
    "    use_hdf5: True\n",
    "    save_best_only: True\n",
    "    every_x_epochs: 1\n",
    "    debug: False\n",
    "    model: DeepFM\n",
    "    dataset_id: taobao_tiny_data\n",
    "    loss: 'binary_crossentropy'\n",
    "    metrics: ['logloss', 'AUC']\n",
    "    task: binary_classification\n",
    "    optimizer: adam\n",
    "    hidden_units: [64, 32]\n",
    "    hidden_activations: relu\n",
    "    net_regularizer: 0\n",
    "    embedding_regularizer: 1.e-8\n",
    "    learning_rate: 1.e-3\n",
    "    batch_norm: False\n",
    "    net_dropout: 0\n",
    "    batch_size: 128\n",
    "    embedding_dim: 4\n",
    "    epochs: 1\n",
    "    shuffle: True\n",
    "    seed: 2019\n",
    "    monitor: 'AUC'\n",
    "    monitor_mode: 'max'"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
