{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/xyz/Doc2docBeirIR/blob/main/Swiss_Doc2doc_IR.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bzui28NKNKA1"
      },
      "source": [
        "# **Doc2doc IR:  Information Retrieval models for Swiss Court Rulings**\n",
        "\n",
        "@misc{rasiah2023scale,\n",
        "      title={SCALE: Scaling up the Complexity for Advanced Language Model Evaluation},\n",
        "      author={Vishvaksenan Rasiah and Ronja Stern and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho and Joel Niklaus},\n",
        "      year={2023},\n",
        "      eprint={2306.09237},\n",
        "      archivePrefix={arXiv},\n",
        "      primaryClass={cs.CL}\n",
        "}\n",
        "\n",
        "\n",
        "This notebook contains code from BEIR. https://github.com/beir-cellar\n",
        "\n",
        "\n",
        "## References\n",
        "We are using the BEIR packacge and provided examples of:\n",
        "  https://github.com/beir-cellar/beir/wiki/Examples-and-tutorials\n",
        "  Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt\n",
        "  (https://nthakur.xyz) (nandant@gmail.com)\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GkZ61jL8p22R"
      },
      "source": [
        "# **Run Experiments**\n",
        "\n",
        "We have 4 different experiments:\n",
        "\n",
        "\n",
        "**Lexical Retrieval using BM25 (elasticsearch)** run_lr_bm25(queries, qrels, corpus)\n",
        "\n",
        "Use different datasets for experiments and process text in two different ways. First we need to shorten the query text to avoid errors. In addition we can remove stopwords from the text to include more \"interesting\" input text.\n",
        "\n",
        "*   LR-BM25_S_DE (Remove Stopwords and shorten, German)\n",
        "*   LR-BM25_DE (Shortend, German)\n",
        "*   LR-BM25_S_DE (Remove Stopwords and shorten, French)\n",
        "*   LR-BM25_DE (Shortend, French)\n",
        "*   LR-BM25_S_DE (Remove Stopwords and shorten, Italian)\n",
        "*   LR-BM25_DE (Shortend, Italian)\n",
        "*   LR-BM25_S (Remove Stopwords and shorten, Mixed)\n",
        "*   LR-BM25 (Shortend, Mixed)\n",
        "\n",
        "\n",
        "**Multilingual Retrieval BM25 with Elasticsearch** run_mr_bm25(queries, qrels, corpus, language)\n",
        "\n",
        "Specify the used language. Use different datasets for experiments and process text in two different ways. First we need to shorten the query text to avoid errors. In addition we can remove stopwords from the text to include more \"interesting\" input text.\n",
        "\n",
        "*   MR-BM25_S_DE (Remove Stopwords and shorten, German)\n",
        "*   MR-BM25_DE (Shortend, German)\n",
        "*   MR-BM25_S_DE (Remove Stopwords and shorten, French)\n",
        "*   MR-BM25_DE (Shortend, French)\n",
        "*   MR-BM25_S_DE (Remove Stopwords and shorten, Italian)\n",
        "*   MR-BM25_DE (Shortend, Italian)\n",
        "*   MR-BM25_S (Remove Stopwords and shorten, Mixed)\n",
        "*   MR-BM25 (Shortend, Mixed)\n",
        "\n",
        "**Reranking BM25 using Cross-Encoder** run_rce_bm25(queries, qrels, corpus, model_name)\n",
        "\n",
        "Here we are using cross encoders. Ther is a recently published multilingual model called: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1\n",
        "It might be interesting to compare results when using only single language linked dataset. That means query and corpus are the same language. In addition we can try to use mutlilingual where the the corpus is multilingual.\n",
        "Do we need to shorten the input?\n",
        "\n",
        "*  RCE_BM25_M (Mixed dataset)\n",
        "*  RCE_BM25 (All languages but not mixed)\n",
        "*  RCE_BM25_DE (German)\n",
        "*  RCE_BM25_FR French\n",
        "*  RCE_BM25_IT Italian\n",
        "\n",
        "**Rerank sBert with bm25** run_sb_r(queries, qrels, corpus, model_name)\n",
        "\n",
        "We want to compare different sBert models which are multilingual. We will differ between a dataset which contains mixed linkings (different languages) and one containing only same language for query and corpus text.\n",
        "\n",
        "> We will use two different multilingual models: \"distiluse-base-multilingual-cased\" (or named \"distiluse-base-multilingual-cased-v1\") and \"paraphrase-albert-small-v2\"\n",
        "\n",
        "\n",
        "*  SB_R_M1_M\n",
        "*  SB_R_M1_\n",
        "*  SB_R_M2_M\n",
        "*  SB_R_M2_M\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Zbq2V91bCPxO",
        "outputId": "27330ecc-e6d8-4416-97a0-7f0824a63eb3"
      },
      "outputs": [
      ],
      "source": [
        "# Install the beir PyPI, datasets, packeges\n",
        "!pip install beir\n",
        "!pip install datasets\n",
        "!pip install tensorflow-text\n",
        "!pip install nltk"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gB6I0UKeRH9-"
      },
      "source": [
        "# **Data Loading**"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import random\n",
        "import re\n",
        "\n",
        "import datasets\n",
        "import json\n",
        "import gc\n",
        "import pandas as pd\n",
        "import ast\n",
        "from datasets import load_dataset\n",
        "from tqdm import tqdm\n",
        "import json\n",
        "import os, pathlib\n",
        "import re\n",
        "import nltk\n",
        "import nltk\n",
        "nltk.download('stopwords')\n",
        "from nltk.corpus import stopwords\n",
        "\n",
        "\n",
        "class ProcessData:\n",
        "\n",
        "    def __init__(self):\n",
        "        self.counter = 0\n",
        "\n",
        "    def load_from_hf(self, name):\n",
        "        dataset = load_dataset(name)\n",
        "        dataset = dataset['train']\n",
        "        return dataset\n",
        "\n",
        "    def get_data(self):\n",
        "        qrel_dataset = self.load_from_hf('xyz/qrel')\n",
        "        querie_dataset = self.load_from_hf(\"xyz/querie\")\n",
        "        corpus_dataset = self.load_from_hf(\"xyz/corpus\")\n",
        "        querie_dataset_train, querie_dataset_test = self.create_splits(querie_dataset)\n",
        "        return querie_dataset, querie_dataset_train, querie_dataset_test, qrel_dataset, corpus_dataset\n",
        "\n",
        "    def create_splits(self, querie_dataset):\n",
        "        splits = querie_dataset.train_test_split(test_size=0.9)\n",
        "        querie_dataset_train = splits['train']\n",
        "        querie_dataset_test = splits['test']\n",
        "        return querie_dataset_train, querie_dataset_test\n",
        "\n",
        "    def create_corpus_dict(self, corpus_dataset):\n",
        "        corpus_dict = {}\n",
        "        def write_dict(row):\n",
        "            id = str(row['id'])\n",
        "            corpus_dict[id] = {\"title\": '', \"text\": row['text']}\n",
        "            return row\n",
        "\n",
        "        corpus_dataset.apply(write_dict, axis=\"columns\")\n",
        "\n",
        "        return corpus_dict\n",
        "\n",
        "    def create_qrels_dict(self, qrels_dataset, corpus_dict):\n",
        "        qrels_dict = {}\n",
        "\n",
        "        def write_qrels(row):\n",
        "            if row['corp_id'] in corpus_dict:\n",
        "                if row['id'] not in qrels_dict:\n",
        "                    qrels_dict[row['id']] = {row['corp_id']: 1}\n",
        "                else:\n",
        "                    qrels_dict[row['id']][row['corp_id']] = 1\n",
        "\n",
        "        qrel_dataset = qrels_dataset.apply(write_qrels, axis='columns')\n",
        "\n",
        "        return qrels_dict\n",
        "\n",
        "    def create_query_dict(self, queries_dataset, qrels_dict):\n",
        "        queries_dict = {}\n",
        "\n",
        "        def write_dict(row):\n",
        "            id = row['id']\n",
        "            if id in qrels_dict:\n",
        "                text = row['text']\n",
        "                queries_dict[id] = str(text)\n",
        "\n",
        "        queries_dataset.apply(write_dict, axis=\"columns\")\n",
        "        return queries_dict\n",
        "\n",
        "    def create_data_dicts(self, querie_dataset, qrel_dataset, corpus_dataset):\n",
        "        corpu = {}\n",
        "\n",
        "        def write_corpus(row):\n",
        "            corpu[row['id']] = {'text': row['text'], 'title': \"\"}\n",
        "            return row\n",
        "\n",
        "        corpus_dataset = corpus_dataset.map(write_corpus)\n",
        "        print(f\"Corpus: {len(corpu.items())}\")\n",
        "\n",
        "        querie_tmp = {}\n",
        "\n",
        "        def write_queries(row):\n",
        "            id = row['id']\n",
        "            text = row['text']\n",
        "            querie_tmp[id] = str(text)\n",
        "            return row\n",
        "\n",
        "        querie_dataset = querie_dataset.map(write_queries, keep_in_memory=True)\n",
        "        print(f\"Query before cleaning: {len(querie_tmp.items())}\")\n",
        "\n",
        "        qrel = {}\n",
        "        ids = []\n",
        "        def write_qrels(row):\n",
        "            # only use qrels with valid query\n",
        "            if row['id'] in querie_tmp:\n",
        "                if row['id'] not in qrel:\n",
        "                    qrel[row['id']] = {row['corp_id']: 1}\n",
        "                    ids.append(row['id'])\n",
        "                else:\n",
        "                    qrel[row['id']][row['corp_id']] = 1\n",
        "                    ids.append(row['id'])\n",
        "            return row\n",
        "        qrel_dataset = qrel_dataset.map(write_qrels)\n",
        "        print(f\"Qrels: {len(qrel.items())}\")\n",
        "\n",
        "        query = {}\n",
        "\n",
        "        for key, value in querie_tmp.items():\n",
        "            if key in ids:\n",
        "                query[key] = value\n",
        "\n",
        "        print(f\"Query: {len(query.items())}\")\n",
        "\n",
        "        return query, qrel, corpu, querie_dataset, qrel_dataset, corpus_dataset\n",
        "\n",
        "    def remove_stopwords(self, queries, corpus, language_long):\n",
        "\n",
        "        sw_nltk = stopwords.words(language_long)\n",
        "        print(sw_nltk)\n",
        "\n",
        "        queries_filtered = {}\n",
        "\n",
        "        for key, value in queries.items():\n",
        "            value = re.sub('\\W+', ' ', str(value))\n",
        "            words = [word for word in value.split() if word.lower() not in sw_nltk]\n",
        "            new_value = \" \".join(words)\n",
        "            # new_value = new_value[:500]\n",
        "            queries_filtered[key] = new_value\n",
        "\n",
        "        corpus_filtered = {}\n",
        "\n",
        "        for key, value in corpus.items():\n",
        "            text = re.sub('\\W+', ' ', str(value['text']))\n",
        "            words = [word for word in text.split() if word.lower() not in sw_nltk]\n",
        "            new_value = \" \".join(words)\n",
        "            # new_value = new_value[:500]\n",
        "            value['text'] = new_value\n",
        "            corpus_filtered[key] = value\n",
        "\n",
        "        print(\"Shortened queries and removed stopwords\")\n",
        "\n",
        "        return queries_filtered, corpus_filtered\n",
        "\n",
        "    def create_subset(self, n, queries, qrels, corpus):\n",
        "        print(\"Start creating subsets\")\n",
        "        counter = 0\n",
        "        qrels_subset = {}\n",
        "        queries_subset = {}\n",
        "        corpus_subset = {}\n",
        "\n",
        "        print(len(queries.items()))\n",
        "        print(len(qrels.items()))\n",
        "        print(len(corpus.items()))\n",
        "        short = 0\n",
        "\n",
        "        for key, value in qrels.items():\n",
        "            for corp_id, _ in value.items():\n",
        "                found = False\n",
        "                if len(corpus[corp_id]['text']) > 10:\n",
        "                    found = True\n",
        "                    corpus_subset[corp_id] = corpus[corp_id]\n",
        "                else:\n",
        "                    short = short + 1\n",
        "            if found:\n",
        "                qrels_subset[key] = value\n",
        "                queries_subset[key] = queries[key]\n",
        "            counter += 1\n",
        "            if counter > n:\n",
        "                break\n",
        "\n",
        "        print(f\"create subset of {n} queries.\")\n",
        "        print(f\"Found {short} entries in corpus with short text.\")\n",
        "\n",
        "        return queries_subset, qrels_subset, corpus_subset\n",
        "\n",
        "    def create_splits_old(self, n, queries, qrels, corpus):\n",
        "        print(\"Start creating splits\")\n",
        "        counter = 0\n",
        "        qrels_train = {}\n",
        "        queries_train = {}\n",
        "        qrels_test = {}\n",
        "        queries_test = {}\n",
        "\n",
        "        print(f\"We have {len(queries)} queries in total. {n * len(queries)} will be used for train, the rest for test.\")\n",
        "        print(f\"We have {len(qrels)} qrels\")\n",
        "        print(f\"We have {len(corpus)} corpus\")\n",
        "        missing_corpus = 0\n",
        "        missing_qrel_train = 0\n",
        "        missing_qrel_test = 0\n",
        "\n",
        "        for key, value in queries.items():\n",
        "            if counter < n * len(queries):\n",
        "                if key in qrels:\n",
        "                    # assert all corp_ids exist in corpus!\n",
        "                    queries_train[key] = value\n",
        "                    qrels_train[key] = qrels[key]\n",
        "                    counter = counter + 1\n",
        "                    for corp_id in qrels[key].keys():\n",
        "                        if corp_id not in corpus:\n",
        "                            missing_corpus = missing_corpus + 1\n",
        "                else:\n",
        "                    missing_qrel_train = missing_qrel_train + 1\n",
        "            elif counter >= n * len(queries):\n",
        "                if key in qrels:\n",
        "                    # assert all corp_ids exist in corpus!\n",
        "                    queries_test[key] = value\n",
        "                    qrels_test[key] = qrels[key]\n",
        "                    counter = counter + 1\n",
        "                    for corp_id in qrels[key].keys():\n",
        "                        if corp_id not in corpus:\n",
        "                            missing_corpus = missing_corpus + 1\n",
        "                else:\n",
        "                    missing_qrel_test = missing_qrel_test + 1\n",
        "\n",
        "        print(f\"created splits.\")\n",
        "        print(f\"Found {missing_corpus} missing entries in corpus.\")\n",
        "        print(f\"We have {len(qrels_train)} train qrels\")\n",
        "        print(f\"We have {len(qrels_test)} test qrels\")\n",
        "        print(f\"We have {len(queries_train)} train queries\")\n",
        "        print(f\"We have {len(queries_test)} test queries\")\n",
        "        return qrels_train, queries_train, qrels_test, queries_test\n",
        "\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "NIj2nPwC4TDw",
        "outputId": "cd0c4c65-bdca-42fd-d349-9f162579e39c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
            "[nltk_data]   Package stopwords is already up-to-date!\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def get_formatted_data(filter_language):\n",
        "    process_data = ProcessData()\n",
        "    querie_dataset, querie_dataset_train, querie_dataset_test, qrels_dataset, corpus_dataset = process_data.get_data()\n",
        "\n",
        "    print(querie_dataset_train)\n",
        "    print(querie_dataset_test)\n",
        "    print(qrels_dataset)\n",
        "    print(corpus_dataset)\n",
        "\n",
        "    # corpus stays always the same\n",
        "    corpus_dict = process_data.create_corpus_dict(pd.DataFrame(corpus_dataset))\n",
        "\n",
        "    # qrels are split into mixed, and single languages\n",
        "    qrels_dataset_ssl = qrels_dataset.filter(lambda row: row['language'] == filter_language)\n",
        "    qrels_dict = process_data.create_qrels_dict(pd.DataFrame(qrels_dataset), corpus_dict)\n",
        "    qrels_dict_ssl = process_data.create_qrels_dict(pd.DataFrame(qrels_dataset_ssl), corpus_dict)\n",
        "\n",
        "    # we always split so results are comparable\n",
        "    if filter_language != 'mixed':\n",
        "        querie_dataset_test = querie_dataset_test.filter(lambda row: row['language'] == filter_language)\n",
        "    queries_dict_train = process_data.create_query_dict(pd.DataFrame(querie_dataset_train), qrels_dict)\n",
        "    queries_dict_test = process_data.create_query_dict(pd.DataFrame(querie_dataset_test), qrels_dict)\n",
        "\n",
        "    query_adapted = {}\n",
        "    for id in queries_dict_test.keys():\n",
        "        if id in qrels_dict_ssl:\n",
        "            # make_sure_this_is_done = 'XXX'\n",
        "            query_adapted[id] = queries_dict_test[id]\n",
        "\n",
        "    print(\"####################################################################################################\")\n",
        "    print(\"Successfully loaded data.\")\n",
        "    print(\"####################################################################################################\")\n",
        "\n",
        "    print(len(corpus_dict))\n",
        "    print(len(qrels_dict))\n",
        "    print(len(qrels_dict_ssl))\n",
        "    print(len(queries_dict_train))\n",
        "    print(len(queries_dict_test))\n",
        "\n",
        "    return queries_dict_train, queries_dict_test, qrels_dict, query_adapted, corpus_dict\n",
        "\n",
        "# queries_dict_train, queries_dict_test, qrels_dict, qrels_dict_ssl, corpus_dict = get_formatted_data('it')"
      ],
      "metadata": {
        "id": "i3Z6mNGB4bFh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# **Models**"
      ],
      "metadata": {
        "id": "7iE5Sr7e6Jaq"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lOvC3C4gafWf"
      },
      "source": [
        "**Lexical Retrieval using BM25 (Elasticsearch)**\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nqP2F4IKCloh"
      },
      "outputs": [],
      "source": [
        "from beir.retrieval.search.lexical import BM25Search as BM25\n",
        "from beir.retrieval.evaluation import EvaluateRetrieval\n",
        "import random\n",
        "\n",
        "def run_lr_bm25(queries, qrels, corpus):\n",
        "\n",
        "  #### Provide parameters for elastic-search\n",
        "  hostname = \"localhost\"\n",
        "  index_name = \"scifact\"\n",
        "  initialize = True # True, will delete existing index with same name and reindex all documents\n",
        "\n",
        "  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)\n",
        "  retriever = EvaluateRetrieval(model)\n",
        "\n",
        "  #### Retrieve dense results (format of results is identical to qrels)\n",
        "  results = retriever.retrieve(corpus, queries)\n",
        "\n",
        "  #### Evaluate your retrieval using NDCG@k, MAP@K ...\n",
        "  ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)\n",
        "  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric=\"recall_cap\")\n",
        "  print(ndcg)\n",
        "  print(_map)\n",
        "  print(recall)\n",
        "  print(precision)\n",
        "  print(recall_cap)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6KxlKkVeGg_6"
      },
      "source": [
        "**Multilingual Retrieval BM25 with Elasticsearch**\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K2mywlrUGgOl"
      },
      "outputs": [],
      "source": [
        "\n",
        "def run_mr_bm25(queries, qrels, corpus, language):\n",
        "\n",
        "  hostname = \"localhost\"\n",
        "  index_name = \"scifact\"\n",
        "  initialize = True # True, will delete existing index with same name and reindex all documents\n",
        "\n",
        "  #### Language ####\n",
        "  # For languages supported by Elasticsearch by default, check here ->\n",
        "  # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html\n",
        "  # Please provide full names in lowercase for eg. english, hindi ...\n",
        "\n",
        "  #### Sharding ####\n",
        "  # (1) For datasets with small corpus (datasets ~ < 5k docs) => limit shards = 1\n",
        "  # number_of_shards = 1\n",
        "  # model = BM25(index_name=index_name, hostname=hostname, language=language, initialize=initialize, number_of_shards=number_of_shards)\n",
        "\n",
        "  # (2) For datasets with big corpus ==> keep default configuration\n",
        "  model = BM25(index_name=index_name, hostname=hostname, language=language, initialize=initialize)\n",
        "  retriever = EvaluateRetrieval(model)\n",
        "\n",
        "  #### Retrieve dense results (format of results is identical to qrels)\n",
        "  results = retriever.retrieve(corpus, queries)\n",
        "\n",
        "  #### Evaluate your retrieval using NDCG@k, MAP@K ...\n",
        "  logging.info(\"Retriever evaluation for k in: {}\".format(retriever.k_values))\n",
        "  ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)\n",
        "  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric=\"recall_cap\")\n",
        "  print(ndcg)\n",
        "  print(_map)\n",
        "  print(recall)\n",
        "  print(precision)\n",
        "  print(recall_cap)\n",
        "\n",
        "  #### Retrieval Example ####\n",
        "  query_id, scores_dict = random.choice(list(results.items()))\n",
        "  logging.info(\"Query : %s\\n\" % queries[query_id])\n",
        "\n",
        "  scores = sorted(scores_dict.items(), key=lambda item: item[1], reverse=True)\n",
        "  for rank in range(10):\n",
        "      doc_id = scores[rank][0]\n",
        "      logging.info(\"Doc %d: %s [%s] - %s\\n\" % (rank+1, doc_id, corpus[doc_id].get(\"title\"), corpus[doc_id].get(\"text\")))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wfiEjuXAATkZ"
      },
      "source": [
        "**Reranking BM25 using Cross-Encoder**\n",
        "\n",
        "In this example, we rerank the top-20 documents retrieved from BM25, using ([cross-encoder/ms-marco-electra-base](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html)) SBERT cross-encoder model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ILM1SNXdAGhn"
      },
      "outputs": [],
      "source": [
        "from beir import util, LoggingHandler\n",
        "from beir.datasets.data_loader import GenericDataLoader\n",
        "from beir.retrieval.evaluation import EvaluateRetrieval\n",
        "from beir.retrieval.search.lexical import BM25Search as BM25\n",
        "from beir.reranking.models import CrossEncoder\n",
        "from beir.reranking import Rerank\n",
        "\n",
        "def run_rce_bm25(queries, qrels, corpus, model_name):\n",
        "\n",
        "  hostname = \"localhost\"\n",
        "  index_name = \"scifact\"\n",
        "  initialize = True # True, will delete existing index with same name and reindex all documents\n",
        "\n",
        "  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)\n",
        "  retriever = EvaluateRetrieval(model)\n",
        "\n",
        "  #### Retrieve dense results (format of results is identical to qrels)\n",
        "  results = retriever.retrieve(corpus, queries)\n",
        "\n",
        "  ################################################\n",
        "  #### (2) RERANK Top-100 docs using Cross-Encoder\n",
        "  ################################################\n",
        "\n",
        "  #### Reranking using Cross-Encoder models #####\n",
        "  #### https://www.sbert.net/docs/pretrained_cross-encoders.html\n",
        "  # 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'\n",
        "  cross_encoder_model = CrossEncoder(model_name)\n",
        "\n",
        "  #### Or use MiniLM, TinyBERT etc. CE models (https://www.sbert.net/docs/pretrained-models/ce-msmarco.html)\n",
        "  # cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')\n",
        "  # cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')\n",
        "\n",
        "  reranker = Rerank(cross_encoder_model, batch_size=128)\n",
        "\n",
        "  # Rerank top-100 results using the reranker provided\n",
        "  rerank_results = reranker.rerank(corpus, queries, results, top_k=100)\n",
        "\n",
        "  #### Evaluate your retrieval using NDCG@k, MAP@K ...\n",
        "  #### Evaluate your retrieval using NDCG@k, MAP@K ...\n",
        "  ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)\n",
        "  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric=\"recall_cap\")\n",
        "  print(ndcg)\n",
        "  print(_map)\n",
        "  print(recall)\n",
        "  print(precision)\n",
        "  print(recall_cap)\n",
        "\n",
        "  #### Print top-k documents retrieved ####\n",
        "  top_k = 10\n",
        "\n",
        "  query_id, ranking_scores = random.choice(list(rerank_results.items()))\n",
        "  scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)\n",
        "  logging.info(\"Query : %s\\n\" % queries[query_id])\n",
        "\n",
        "  for rank in range(top_k):\n",
        "    doc_id = scores_sorted[rank][0]\n",
        "    # Format: Rank x: ID [Title] Body\n",
        "    logging.info(\"Rank %d: %s [%s] - %s\\n\" % (rank+1, doc_id, corpus[doc_id].get(\"title\"), corpus[doc_id].get(\"text\")))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WSMSyGxGoxZy"
      },
      "source": [
        "**Rerank sbert with bm25**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZROhH8ZCo4IQ"
      },
      "outputs": [],
      "source": [
        "from beir import util, LoggingHandler\n",
        "from beir.datasets.data_loader import GenericDataLoader\n",
        "from beir.retrieval.evaluation import EvaluateRetrieval\n",
        "from beir.retrieval.search.lexical import BM25Search as BM25\n",
        "from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES\n",
        "from beir.retrieval import models\n",
        "\n",
        "import pathlib, os\n",
        "import logging\n",
        "import random\n",
        "\n",
        "def run_sb_r(queries, qrels, corpus, model_name):\n",
        "\n",
        "  hostname = \"localhost\"\n",
        "  index_name = \"scifact\"\n",
        "  initialize = True # True, will delete existing index with same name and reindex all documents\n",
        "\n",
        "  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)\n",
        "  retriever = EvaluateRetrieval(model)\n",
        "\n",
        "  #### Retrieve dense results (format of results is identical to qrels)\n",
        "  results = retriever.retrieve(corpus, queries)\n",
        "\n",
        "  #### Reranking top-100 docs using Dense Retriever model\n",
        "  # distiluse-base-multilingual-cased\n",
        "  # distiluse-base-multilingual-cased-v1\n",
        "  # paraphrase-albert-small-v2\n",
        "  model = DRES(models.SentenceBERT(model_name), batch_size=128)\n",
        "  dense_retriever = EvaluateRetrieval(model, score_function=\"cos_sim\", k_values=[1,3,5,10,100])\n",
        "\n",
        "  #### Retrieve dense results (format of results is identical to qrels)\n",
        "  rerank_results = dense_retriever.rerank(corpus, queries, results, top_k=100)\n",
        "\n",
        "  #### Evaluate your retrieval using NDCG@k, MAP@K ...\n",
        "  # ndcg, _map, recall, precision, hole = dense_retriever.evaluate(qrels, rerank_results, retriever.k_values)\n",
        "  ndcg, _map, recall, precision = dense_retriever.evaluate(qrels, rerank_results, retriever.k_values)\n",
        "  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric=\"recall_cap\")\n",
        "  print(ndcg)\n",
        "  print(_map)\n",
        "  print(recall)\n",
        "  print(precision)\n",
        "  print(recall_cap)\n",
        "\n",
        "  #### Print top-k documents retrieved ####\n",
        "  top_k = 10\n",
        "\n",
        "  query_id, ranking_scores = random.choice(list(rerank_results.items()))\n",
        "  scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)\n",
        "  logging.info(\"Query : %s\\n\" % queries[query_id])\n",
        "\n",
        "  for rank in range(top_k):\n",
        "    doc_id = scores_sorted[rank][0]\n",
        "    # Format: Rank x: ID [Title] Body\n",
        "    logging.info(\"Rank %d: %s [%s] - %s\\n\" % (rank+1, doc_id, corpus[doc_id].get(\"title\"), corpus[doc_id].get(\"text\")))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NqZekkWTM49O"
      },
      "source": [
        "**Train S-Bert hard neg**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6Hl8RxIPM2tJ"
      },
      "outputs": [],
      "source": [
        "\n",
        "\n",
        "from sentence_transformers import losses, models, SentenceTransformer\n",
        "from beir import util, LoggingHandler\n",
        "from beir.datasets.data_loader import GenericDataLoader\n",
        "from beir.retrieval.search.lexical import BM25Search as BM25\n",
        "from beir.retrieval.evaluation import EvaluateRetrieval\n",
        "from beir.retrieval.train import TrainRetriever\n",
        "import pathlib, os, tqdm\n",
        "import logging\n",
        "import json\n",
        "\n",
        "def create_triplets(corpus, queries, qrels):\n",
        "\n",
        "  hostname = \"localhost\"\n",
        "  index_name = \"scifact\"\n",
        "  initialize = True # True, will delete existing index with same name and reindex all documents\n",
        "\n",
        "  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)\n",
        "  bm25 = EvaluateRetrieval(model)\n",
        "\n",
        "  print(\"successful until retrieve corpus\")\n",
        "\n",
        "  #### Index passages into the index (seperately)\n",
        "  bm25.retriever.index(corpus)\n",
        "\n",
        "  triplets = []\n",
        "  qids = list(qrels)\n",
        "  print(qids[0])\n",
        "  hard_negatives_max = 10\n",
        "\n",
        "  #### Retrieve BM25 hard negatives => Given a positive document, find most similar lexical documents\n",
        "  for idx in tqdm.tqdm(range(len(qids)), desc=\"Retrieve Hard Negatives using BM25\"):\n",
        "    # due to key error the followin lines were adapted\n",
        "    query_id = qids[idx]\n",
        "    query_text = queries[qids[idx]]\n",
        "    pos_docs = [doc_id for doc_id in qrels[query_id] if qrels[query_id][doc_id] > 0]\n",
        "    pos_doc_texts = [corpus[doc_id][\"title\"] + \" \" + corpus[doc_id][\"text\"] for doc_id in pos_docs]\n",
        "    hits = bm25.retriever.es.lexical_multisearch(texts=pos_doc_texts, top_hits=hard_negatives_max+1)\n",
        "    print(hits)\n",
        "    for (pos_text, hit) in zip(pos_doc_texts, hits):\n",
        "        for (neg_id, _) in hit.get(\"hits\"):\n",
        "            if neg_id not in pos_docs:\n",
        "                neg_text = corpus[neg_id][\"title\"] + \" \" + corpus[neg_id][\"text\"]\n",
        "                triplets.append([query_text, pos_text, neg_text])\n",
        "\n",
        "  print(\"creted triplets\")\n",
        "  with open('triplets.json', 'w', encoding='utf-8') as f:\n",
        "    json.dump(triplets, f, ensure_ascii=False, indent=4)\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nqotyXuIBPt6"
      },
      "source": [
        "# **Set up Elasticsearch**\n",
        "\n",
        "## 1. Download and setup the Elasticsearch instance\n",
        "Reference: https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb\n",
        "\n",
        "For demo purposes, the open-source version of the elasticsearch package is used."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "doQU8m_UBFOY",
        "outputId": "10449cca-5851-4400-fa57-f7b7d4386310"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK\n"
          ]
        }
      ],
      "source": [
        "%%bash\n",
        "\n",
        "wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz\n",
        "wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512\n",
        "tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz\n",
        "sudo chown -R daemon:daemon elasticsearch-7.9.2/\n",
        "shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DruoaeGPBxg0"
      },
      "source": [
        "Run the instance as a daemon process\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JVZCVh-VBFDg"
      },
      "outputs": [],
      "source": [
        "%%bash --bg\n",
        "\n",
        "sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "O6A3xsNtBE6B"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# Sleep for few seconds to let the instance start.\n",
        "time.sleep(20)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9zLrT3zNCAoW"
      },
      "source": [
        "Once the instance has been started, grep for ``elasticsearch`` in the processes list to confirm the availability."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Pz84EzxmBEmV",
        "outputId": "78b3ea6c-371e-4a2c-956e-1e4dad9ef1c9"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root       13952   13950  0 09:57 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch\n",
            "daemon     13953   13952 10 09:57 ?        00:05:21 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-15157512740479983317 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/content/elasticsearch-7.9.2 -Des.path.conf=/content/elasticsearch-7.9.2/config -Des.distribution.flavor=oss -Des.distribution.type=tar -Des.bundled_jdk=true -cp /content/elasticsearch-7.9.2/lib/* org.elasticsearch.bootstrap.Elasticsearch\n",
            "root       27471   27469  0 10:49 ?        00:00:00 grep elasticsearch\n"
          ]
        }
      ],
      "source": [
        "%%bash\n",
        "\n",
        "ps -ef | grep elasticsearch"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "T6ASWFOGCLEm",
        "outputId": "f1fc206f-c3ed-4558-9b52-d8b43c7d9ec7"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"name\" : \"5c892e65cf69\",\n",
            "  \"cluster_name\" : \"elasticsearch\",\n",
            "  \"cluster_uuid\" : \"oYJ7HPjxSTGcTbuOJf-x_A\",\n",
            "  \"version\" : {\n",
            "    \"number\" : \"7.9.2\",\n",
            "    \"build_flavor\" : \"oss\",\n",
            "    \"build_type\" : \"tar\",\n",
            "    \"build_hash\" : \"d34da0ea4a966c4e49417f2da2f244e3e97b4e6e\",\n",
            "    \"build_date\" : \"2020-09-23T00:45:33.626720Z\",\n",
            "    \"build_snapshot\" : false,\n",
            "    \"lucene_version\" : \"8.6.2\",\n",
            "    \"minimum_wire_compatibility_version\" : \"6.8.0\",\n",
            "    \"minimum_index_compatibility_version\" : \"6.0.0-beta1\"\n",
            "  },\n",
            "  \"tagline\" : \"You Know, for Search\"\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "%%bash\n",
        "\n",
        "curl -sX GET \"localhost:9200/\""
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# **Experients**\n"
      ],
      "metadata": {
        "id": "q47Gyz0Y6d6G"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# we us train queries only for creation of triplets.\n",
        "queries_dict_train, queries_dict_test, qrels_dict, qrels_dict_ssl, corpus_dict = get_formatted_data('mixed')"
      ],
      "metadata": {
        "id": "ojttRRYT6dKL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# make sure there are only qrels for queries present in corpus\n",
        "qrels_adapted = {}\n",
        "for id in queries_dict_train.keys():\n",
        "    qrels_adapted[id] = qrels_dict[id]"
      ],
      "metadata": {
        "id": "hKi_JoF6NI0n"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(len(qrels_adapted))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hQbC2WZZNtXx",
        "outputId": "0349fd66-192f-4f5d-f7f4-1cd9272004cf"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "11246\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "create_triplets(corpus_dict, queries_dict_train, qrels_adapted)"
      ],
      "metadata": {
        "id": "u8_OXatcMqUS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# test lexical\n",
        "run_lr_bm25(queries_dict_test, qrels_adapted, corpus_dict)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 364
        },
        "id": "0jcObS9rK5XL",
        "outputId": "1bfe68da-87a3-4229-b139-5db0a69fc7c6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que:   0%|          | 0/791 [00:21<?, ?it/s]\n"
          ]
        },
        {
          "output_type": "error",
          "ename": "KeyError",
          "evalue": "ignored",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
            "\u001b[0;32m<ipython-input-23-2381631fa9f7>\u001b[0m in \u001b[0;36m<cell line: 2>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# test lexical\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun_lr_bm25\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries_dict_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqrels_adapted\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcorpus_dict\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
            "\u001b[0;32m<ipython-input-4-e65fbcd7d5e0>\u001b[0m in \u001b[0;36mrun_lr_bm25\u001b[0;34m(queries, qrels, corpus)\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m   \u001b[0;31m#### Retrieve dense results (format of results is identical to qrels)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m   \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretrieve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m   \u001b[0;31m#### Evaluate your retrieval using NDCG@k, MAP@K ...\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/evaluation.py\u001b[0m in \u001b[0;36mretrieve\u001b[0;34m(self, corpus, queries, **kwargs)\u001b[0m\n\u001b[1;32m     21\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m             \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Model/Technique has not been provided!\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtop_k\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscore_function\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     24\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     25\u001b[0m     def rerank(self, \n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/bm25_search.py\u001b[0m in \u001b[0;36msearch\u001b[0;34m(self, corpus, queries, top_k, *args, **kwargs)\u001b[0m\n\u001b[1;32m     49\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mstart_idx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdesc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'que'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     50\u001b[0m             \u001b[0mquery_ids_batch\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mquery_ids\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 51\u001b[0;31m             results = self.es.lexical_multisearch(\n\u001b[0m\u001b[1;32m     52\u001b[0m                 \u001b[0mtexts\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     53\u001b[0m                 top_hits=top_k + 1) # Add 1 extra if query is present with documents\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/elastic_search.py\u001b[0m in \u001b[0;36mlexical_multisearch\u001b[0;34m(self, texts, top_hits, skip)\u001b[0m\n\u001b[1;32m    191\u001b[0m         \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    192\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"responses\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 193\u001b[0;31m             \u001b[0mresponses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mskip\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    194\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    195\u001b[0m             \u001b[0mhits\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mKeyError\u001b[0m: 'hits'"
          ]
        }
      ]
    },
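    {
      "cell_type": "markdown",
      "source": [
        "The `KeyError: 'hits'` above is raised when an entry of the Elasticsearch multi-search response has no `hits` field, which usually means that entry reports an error instead of results. One plausible cause (an assumption, not verified for this run) is that full court rulings are too long to be sent as BM25 queries. A minimal sketch of shortening the queries before retrieval follows; the helper name `shorten_queries` and the limit of 500 words are hypothetical choices."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch only: shorten_queries is a hypothetical helper, and max_words=500 is an arbitrary limit.\n",
        "def shorten_queries(queries, max_words=500):\n",
        "    '''Truncate each query text to at most `max_words` whitespace-separated tokens.'''\n",
        "    return {qid: ' '.join(text.split()[:max_words]) for qid, text in queries.items()}\n",
        "\n",
        "toy_queries = {'q1': 'a b c d e', 'q2': 'x y'}\n",
        "print(shorten_queries(toy_queries, max_words=3))  # {'q1': 'a b c', 'q2': 'x y'}"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },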
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TMZFZ75AgbQU"
      },
      "source": [
        "# **Results**\n",
        "\n",
        "Datasets are built with `create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 394,
          "referenced_widgets": [
            "877b5c86fa5346718be88b40db57771b",
            "9007a8f6836244c8a7b875abbdb29d8f",
            "f0845b95953c4d649c3261ff58f0259c",
            "0bc607c41a6944d59900835edce97c4f",
            "b922b379281b4aaeb37a580410050f11",
            "4662daf6953c4c90b3275a0c0a9aeec6",
            "f5de52812a924c96bc4e3a40e46de0f6",
            "36af54b9ad784cc8a76245444ac98090",
            "16720ddbeebf4336b60427e2f661b1d5",
            "024128e36dd3466c802822faca9927a6",
            "8fde17264ccc4fa9b2d19c8f1803e3c3",
            "64b8b67d4f964f0f80858a73e8474cf4",
            "0cf8897876d1499683d598fa8d517b4e",
            "2335c7f5eec24bf2aa494374e8cc2c7c",
            "7449ad3c85e94bb5a6863f1fdc125e48",
            "ae06be8b67974593bd3ad105e6283313",
            "c998379ace074e3d81415d0ec147d93a",
            "23e9eb2a15ff4941930edf8a58c0a5e2",
            "b39e9e2525e24d66baeacd5bc55764a0",
            "300a990e24814347b4cf4649eb164dc9",
            "aff454941f7d42bda9ff2d54291258a4",
            "d4f24728c24345a7a17ffebcb7da5334",
            "aeb62ff7ab3547abb70fc96c7baaf730",
            "8aff1e77f342418fb2ba1b93bf6ac757",
            "949a51e0190d4a7ca5566dceaaee119d",
            "79610e0c711c45a880529a909baf3d92",
            "6b299948b30e40009b3fd38624f39a8c",
            "7f1d44013acc4b0caadce87bb4452171",
            "17d682b70b5340ba8494fa7a49e14e94",
            "5a84299de18d462f81e06fde5cc57e4b",
            "f2f0c20d8be74ad9866cf86d6e5609d9",
            "a3388734447c40478ed37d6813cfa770",
            "2266afad9442498d83fe48cb4b0cff4f"
          ]
        },
        "id": "22aue761-w0U",
        "outputId": "1d742de9-9c63-43bf-a7c8-3f0ab2dda52c"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.builder:Found cached dataset json (/root/.cache/huggingface/datasets/xyz___json/xyz--qrel-34cc5a6c2d2d10cf/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "877b5c86fa5346718be88b40db57771b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.builder:Found cached dataset json (/root/.cache/huggingface/datasets/xyz___json/xyz--corpus-165ac86cedbdd2cb/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "64b8b67d4f964f0f80858a73e8474cf4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.builder:Found cached dataset json (/root/.cache/huggingface/datasets/xyz___json/xyz--querie-0beee30470914711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "aeb62ff7ab3547abb70fc96c7baaf730",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 72344\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 674296\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 18177\n",
            "})\n"
          ]
        }
      ],
      "source": [
        "querie_dataset, qrel_dataset, corpus_dataset = load_from_hf()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 86,
          "referenced_widgets": [
            "5f2b5c1f6eb443a698d42f07879fa600",
            "fb782d4a872447a5a5c37c21f79513e0",
            "51a0394c3cc1473aaa9a2a836c37e1b3",
            "91283355d1f646c997fe8016a675919d",
            "b7ebc25caf924d9abd59d9ea3ff1cd93",
            "d55e45473b98414881590f51012330ca",
            "c1ea083cc6214778a566c750a6903005",
            "c253d69ed2ea4d4cbb4b915b48b70878",
            "e58d3f09cdf044aebacacebcf05a9dc4",
            "51f5ac865f374cfe9f03958addf3644e",
            "a62acc28875441afba71eaf58e9c469d",
            "1074c087f775426aab7f357ba3a3e99a",
            "17b6dc5d458a4a76bdbf902802acc3c8",
            "6a49050ccf7a4b17a0ce2d44f1da6c73",
            "02084fae14ea48888906282f4d9394da",
            "176b9564a619446f81b6de70b9cde277",
            "ea693b004a7c463db11144513da6dd1c",
            "fcbbf4cd015d44a29e4750d343d16d80",
            "425284aec7d245de9e3f4b7f94be3c70",
            "f20b9bcc185346bdb9f43b90106818c6",
            "56776c9679d9488585ab0707f8d2e59b",
            "9af32e24a9404f968dbf54f4904293b6"
          ]
        },
        "id": "cvXXE-b6R3s8",
        "outputId": "ca98f960-3133-4251-b29a-e324d184a7c8"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5f2b5c1f6eb443a698d42f07879fa600",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1074c087f775426aab7f357ba3a3e99a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 62557\n",
            "})\n"
          ]
        }
      ],
      "source": [
        "removed_querie_dataset = querie_dataset.filter(lambda row: len(row['text']) <= 3)\n",
        "querie_dataset = querie_dataset.filter(lambda row: len(row['text']) > 3)\n",
        "print(querie_dataset)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wuNvPmRzl5_E"
      },
      "outputs": [],
      "source": [
        "sll = False\n",
        "querie_language = 'mixed'\n",
        "stopwords_remove = False\n",
        "qrel_language = 'mixed'\n",
        "split = False\n",
        "\n",
        "process_data = ProcessData()\n",
        "querie_dataset, qrel_dataset, corpus_dataset = process_data.get_data()\n",
        "\n",
        "data = {}\n",
        "\n",
        "# filter for languages\n",
        "data['corpus'] = corpus_dataset\n",
        "data['query'] = {'de': querie_dataset.filter(lambda x: x['language'] == 'de')}\n",
        "data['query']['fr'] = querie_dataset.filter(lambda x: x['language'] == 'fr')\n",
        "data['query']['it'] = querie_dataset.filter(lambda x: x['language'] == 'it')\n",
        "data['query']['mixed'] = querie_dataset.filter(lambda x: x['language'] == 'mixed')\n",
        "data['qrel'] = {'de': qrel_dataset.filter(lambda x: x['language'] == 'de')}\n",
        "data['qrel']['fr'] = qrel_dataset.filter(lambda x: x['language'] == 'fr')\n",
        "data['qrel']['it'] = qrel_dataset.filter(lambda x: x['language'] == 'it')\n",
        "data['qrel']['mixed'] = qrel_dataset.filter(lambda x: x['language'] == 'mixed')\n",
        "print(\"After filtering for language\")\n",
        "print(data.keys())\n",
        "\n",
        "# create Train Test split (only for mixed dataset)\n",
        "queries, qrels, corpus, _, __, ___ = process_data.create_data_dicts(data['query']['mixed'], data['qrel']['mixed'], data['corpus'])\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4B95YlOvasCD"
      },
      "outputs": [],
      "source": [
        "qrels_test_mixed, queries_test_mixed = process_data.create_sll(data['qrel']['mixed'], qrels, queries, corpus)\n",
        "qrels_test_de, queries_test_de = process_data.create_sll(data['qrel']['de'], qrels, queries, corpus)\n",
        "qrels_test_fr, queries_test_fr = process_data.create_sll(data['qrel']['fr'], qrels, queries, corpus)\n",
        "qrels_test_it, queries_test_it = process_data.create_sll(data['qrel']['it'], qrels, queries, corpus)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZrYG8N41hy6Y"
      },
      "outputs": [],
      "source": [
        "from datasets import concatenate_datasets, load_dataset\n",
        "use_subset = True\n",
        "subset_length = 10000\n",
        "remove_stopwords = True\n",
        "language_long = 'german'\n",
        "\n",
        "querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')\n",
        "querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')\n",
        "querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')\n",
        "\n",
        "querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])\n",
        "qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])\n",
        "corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])\n",
        "\n",
        "querie_dataset_p = querie_dataset_p.shuffle(seed=42)\n",
        "qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)\n",
        "corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)\n",
        "\n",
        "queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "r5lYqUEpiwNq"
      },
      "outputs": [],
      "source": [
        "run_lr_bm25(queries, qrels, corpus)\n",
        "create_triplets(corpus, queries, qrels)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F02EG87WVONw"
      },
      "source": [
        "# Lexical Retrieval for monolingual datasets with stopword removal"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 367
        },
        "id": "jdToBDNQgiAr",
        "outputId": "6394076f-25b4-4ec9-8f78-013594ff3921"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "cdc69aeb1fa544d890c20908deca1283",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0fa1eb3ecf804b05a0db60c8bf064c1c",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c5d9db8b9eb3402281c403afb2b3ffe5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 1873\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 4882\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 248\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "45d9793bef784cbea771cd4219382512",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/4882 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1873\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "94800b04f49e43b9a4ce67465674660d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/248 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "248\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "311cfc9164914c72814f4be68cf581a8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/1873 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1873\n",
            "finished creating dictionaries\n",
            "['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con', 'col', 'coi', 'da', 'dal', 'dallo', 'dai', 'dagli', 'dall', 'dagl', 'dalla', 'dalle', 'di', 'del', 'dello', 'dei', 'degli', 'dell', 'degl', 'della', 'delle', 'in', 'nel', 'nello', 'nei', 'negli', 'nell', 'negl', 'nella', 'nelle', 'su', 'sul', 'sullo', 'sui', 'sugli', 'sull', 'sugl', 'sulla', 'sulle', 'per', 'tra', 'contro', 'io', 'tu', 'lui', 'lei', 'noi', 'voi', 'loro', 'mio', 'mia', 'miei', 'mie', 'tuo', 'tua', 'tuoi', 'tue', 'suo', 'sua', 'suoi', 'sue', 'nostro', 'nostra', 'nostri', 'nostre', 'vostro', 'vostra', 'vostri', 'vostre', 'mi', 'ti', 'ci', 'vi', 'lo', 'la', 'li', 'le', 'gli', 'ne', 'il', 'un', 'uno', 'una', 'ma', 'ed', 'se', 'perché', 'anche', 'come', 'dov', 'dove', 'che', 'chi', 'cui', 'non', 'più', 'quale', 'quanto', 'quanti', 'quanta', 'quante', 'quello', 'quelli', 'quella', 'quelle', 'questo', 'questi', 'questa', 'queste', 'si', 'tutto', 'tutti', 'a', 'c', 'e', 'i', 'l', 'o', 'ho', 'hai', 'ha', 'abbiamo', 'avete', 'hanno', 'abbia', 'abbiate', 'abbiano', 'avrò', 'avrai', 'avrà', 'avremo', 'avrete', 'avranno', 'avrei', 'avresti', 'avrebbe', 'avremmo', 'avreste', 'avrebbero', 'avevo', 'avevi', 'aveva', 'avevamo', 'avevate', 'avevano', 'ebbi', 'avesti', 'ebbe', 'avemmo', 'aveste', 'ebbero', 'avessi', 'avesse', 'avessimo', 'avessero', 'avendo', 'avuto', 'avuta', 'avuti', 'avute', 'sono', 'sei', 'è', 'siamo', 'siete', 'sia', 'siate', 'siano', 'sarò', 'sarai', 'sarà', 'saremo', 'sarete', 'saranno', 'sarei', 'saresti', 'sarebbe', 'saremmo', 'sareste', 'sarebbero', 'ero', 'eri', 'era', 'eravamo', 'eravate', 'erano', 'fui', 'fosti', 'fu', 'fummo', 'foste', 'furono', 'fossi', 'fosse', 'fossimo', 'fossero', 'essendo', 'faccio', 'fai', 'facciamo', 'fanno', 'faccia', 'facciate', 'facciano', 'farò', 'farai', 'farà', 'faremo', 'farete', 'faranno', 'farei', 'faresti', 'farebbe', 'faremmo', 'fareste', 'farebbero', 'facevo', 'facevi', 'faceva', 'facevamo', 'facevate', 'facevano', 
'feci', 'facesti', 'fece', 'facemmo', 'faceste', 'fecero', 'facessi', 'facesse', 'facessimo', 'facessero', 'facendo', 'sto', 'stai', 'sta', 'stiamo', 'stanno', 'stia', 'stiate', 'stiano', 'starò', 'starai', 'starà', 'staremo', 'starete', 'staranno', 'starei', 'staresti', 'starebbe', 'staremmo', 'stareste', 'starebbero', 'stavo', 'stavi', 'stava', 'stavamo', 'stavate', 'stavano', 'stetti', 'stesti', 'stette', 'stemmo', 'steste', 'stettero', 'stessi', 'stesse', 'stessimo', 'stessero', 'stando']\n",
            "Shortened queries and removed stopwords\n"
          ]
        }
      ],
      "source": [
        "### IT ###\n",
        "language = 'it'\n",
        "language_long = 'italian'\n",
        "remove_stopwords = False\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus_it = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pkFm6Ago4atO",
        "outputId": "3bfd1c74-4847-4d8a-94d6-6bf5bd2b4631"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_IT\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/248 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 15/15 [00:31<00:00,  2.09s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.062, 'NDCG@3': 0.0556, 'NDCG@5': 0.05578, 'NDCG@10': 0.06314, 'NDCG@100': 0.14398, 'NDCG@1000': 0.25371}\n",
            "{'MAP@1': 0.02395, 'MAP@3': 0.03475, 'MAP@5': 0.03742, 'MAP@10': 0.0409, 'MAP@100': 0.05172, 'MAP@1000': 0.06007}\n",
            "{'Recall@1': 0.02395, 'Recall@3': 0.04495, 'Recall@5': 0.05543, 'Recall@10': 0.07595, 'Recall@100': 0.40679, 'Recall@1000': 0.99767}\n",
            "{'P@1': 0.062, 'P@3': 0.04328, 'P@5': 0.03253, 'P@10': 0.02195, 'P@100': 0.01182, 'P@1000': 0.00288}\n",
            "#######################################################################################################\n",
            "MR-BM25_IT\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/248 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 15/15 [00:21<00:00,  1.44s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.05835, 'NDCG@3': 0.05583, 'NDCG@5': 0.05796, 'NDCG@10': 0.06482, 'NDCG@100': 0.15248, 'NDCG@1000': 0.25566}\n",
            "{'MAP@1': 0.02231, 'MAP@3': 0.03447, 'MAP@5': 0.03816, 'MAP@10': 0.04123, 'MAP@100': 0.05286, 'MAP@1000': 0.06082}\n",
            "{'Recall@1': 0.02231, 'Recall@3': 0.0466, 'Recall@5': 0.0601, 'Recall@10': 0.08027, 'Recall@100': 0.44213, 'Recall@1000': 0.99767}\n",
            "{'P@1': 0.05835, 'P@3': 0.04449, 'P@5': 0.03559, 'P@10': 0.02334, 'P@100': 0.01279, 'P@1000': 0.00288}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_IT\")\n",
        "run_lr_bm25(queries, qrels, corpus_it)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_IT\")\n",
        "run_mr_bm25(queries, qrels, corpus_it, language_long)"
      ]
    },
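    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To help read the scores above, here is a simplified per-query sketch of precision@k and recall@k. The function name and the toy data are hypothetical; the numbers reported in this notebook come from BEIR's evaluation, which averages over all queries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch only: simplified per-query metrics, not the BEIR implementation.\n",
        "def precision_recall_at_k(ranked_doc_ids, relevant_doc_ids, k):\n",
        "    '''Return (precision@k, recall@k) for a single query.'''\n",
        "    top_k = ranked_doc_ids[:k]\n",
        "    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)\n",
        "    precision = hits / k\n",
        "    recall = hits / len(relevant_doc_ids) if relevant_doc_ids else 0.0\n",
        "    return precision, recall\n",
        "\n",
        "ranked = ['d3', 'd1', 'd7', 'd2', 'd9']   # retrieval order, best first\n",
        "relevant = {'d1', 'd2', 'd4', 'd5'}       # documents judged relevant\n",
        "print(precision_recall_at_k(ranked, relevant, k=5))  # (0.4, 0.5)"
      ]
    },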
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 367
        },
        "id": "j3DCD68dTXsj",
        "outputId": "0cc07c0f-510e-4cdd-f96c-7cdbc8a3f5bc"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "24ffc2405a2b451b9038b45a4ac6f0a2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "9fb7a61d40694f06ad22f131ba2d3a3b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c394b857f35d41379aee28ba60d6cd20",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 12164\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 63203\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 2379\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "9376fde8bdb348529343616a4527a8e8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/63203 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "12164\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "f76ebe5959784d568be37f6b9320bee1",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/2379 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "2379\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0d4581c83d894d34821d9a7a34995f58",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/12164 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "12164\n",
            "finished creating dictionaries\n",
            "['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']\n",
            "Shortened queries and removed stopwords\n"
          ]
        }
      ],
      "source": [
        "### FR ###\n",
        "language = 'fr'\n",
        "language_long = 'french'\n",
        "remove_stopwords =True\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "R5c9dgaDTkOd",
        "outputId": "c71bb5a3-5ec6-4507-c321-895d7a37a76f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_IT\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/2379 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 96/96 [14:54<00:00,  9.32s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.06465, 'NDCG@3': 0.0549, 'NDCG@5': 0.0522, 'NDCG@10': 0.05596, 'NDCG@100': 0.09514, 'NDCG@1000': 0.1701}\n",
            "{'MAP@1': 0.0136, 'MAP@3': 0.02177, 'MAP@5': 0.02477, 'MAP@10': 0.02818, 'MAP@100': 0.03448, 'MAP@1000': 0.03721}\n",
            "{'Recall@1': 0.0136, 'Recall@3': 0.03061, 'Recall@5': 0.04111, 'Recall@10': 0.06049, 'Recall@100': 0.18197, 'Recall@1000': 0.57429}\n",
            "{'P@1': 0.06465, 'P@3': 0.04893, 'P@5': 0.04026, 'P@10': 0.03054, 'P@100': 0.00995, 'P@1000': 0.00304}\n",
            "#######################################################################################################\n",
            "MR-BM25_IT\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/2379 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 96/96 [12:28<00:00,  7.80s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.06439, 'NDCG@3': 0.05416, 'NDCG@5': 0.05216, 'NDCG@10': 0.05614, 'NDCG@100': 0.09538, 'NDCG@1000': 0.1727}\n",
            "{'MAP@1': 0.01358, 'MAP@3': 0.02159, 'MAP@5': 0.02472, 'MAP@10': 0.02821, 'MAP@100': 0.03451, 'MAP@1000': 0.03732}\n",
            "{'Recall@1': 0.01358, 'Recall@3': 0.03026, 'Recall@5': 0.04129, 'Recall@10': 0.06103, 'Recall@100': 0.18304, 'Recall@1000': 0.58918}\n",
            "{'P@1': 0.06439, 'P@3': 0.0479, 'P@5': 0.04036, 'P@10': 0.03083, 'P@100': 0.00998, 'P@1000': 0.0031}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_IT\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_IT\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 367
        },
        "id": "aFC73rs6UEAn",
        "outputId": "49266ad4-cb4b-4b25-c6f2-d23c705a92ab"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0c7554b816ed4d72bb2993eca9b30fbe",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d1bb422fd7824a2e90da8edb749e9ae3",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "87854952e858442592ac8fb2d205464d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 21739\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 147486\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 5653\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "3ee4b5e47ebe4e298919524b28a87752",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/147486 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "21739\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "f983de6be44141279e96b6bd2de1ad0f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/5653 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "5653\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ad8e05e18f4944499cb4eafb72ad0e18",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/21739 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "21739\n",
            "finished creating dictionaries\n",
            "['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'dass', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unsere', 'unserem', 'unseren', 'unser', 'unseres', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 
'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']\n",
            "Shortened queries and removed stopwords\n"
          ]
        }
      ],
      "source": [
        "### DE ###\n",
        "language = 'de'\n",
        "language_long = 'german'\n",
        "remove_stopwords =True\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GYqlQ6IbUGjU",
        "outputId": "c182d606-dfc2-426d-d71d-76619c259f6a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_DE\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/5653 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 170/170 [21:54<00:00,  7.73s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.06041, 'NDCG@3': 0.05317, 'NDCG@5': 0.04906, 'NDCG@10': 0.04843, 'NDCG@100': 0.08245, 'NDCG@1000': 0.13682}\n",
            "{'MAP@1': 0.00902, 'MAP@3': 0.01494, 'MAP@5': 0.01735, 'MAP@10': 0.02023, 'MAP@100': 0.02607, 'MAP@1000': 0.02816}\n",
            "{'Recall@1': 0.00902, 'Recall@3': 0.02101, 'Recall@5': 0.02917, 'Recall@10': 0.04425, 'Recall@100': 0.14389, 'Recall@1000': 0.38765}\n",
            "{'P@1': 0.06041, 'P@3': 0.0498, 'P@5': 0.04213, 'P@10': 0.03306, 'P@100': 0.01154, 'P@1000': 0.00303}\n",
            "#######################################################################################################\n",
            "MR-BM25_DE\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/5653 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 170/170 [18:59<00:00,  6.70s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.06353, 'NDCG@3': 0.05486, 'NDCG@5': 0.05059, 'NDCG@10': 0.04992, 'NDCG@100': 0.08444, 'NDCG@1000': 0.13893}\n",
            "{'MAP@1': 0.00948, 'MAP@3': 0.0154, 'MAP@5': 0.01792, 'MAP@10': 0.02091, 'MAP@100': 0.0269, 'MAP@1000': 0.02902}\n",
            "{'Recall@1': 0.00948, 'Recall@3': 0.02154, 'Recall@5': 0.03006, 'Recall@10': 0.04567, 'Recall@100': 0.14661, 'Recall@1000': 0.39049}\n",
            "{'P@1': 0.06353, 'P@3': 0.05121, 'P@5': 0.04336, 'P@10': 0.03389, 'P@100': 0.01175, 'P@1000': 0.00305}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_DE\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_DE\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GWUDth1hVe9R"
      },
      "source": [
        "# **Lexical Retrieval using full corpus, queries (mixed, de, fr, it) and full qrels. Main language for ML Lexical varies**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 330
        },
        "id": "Lt2tLtw-4eBX",
        "outputId": "4cfff1f3-fad8-4679-ab0a-fab14667f8fd"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c1fe8da6886e4ab18f06c0a32b4517ae",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "20021e87969740f395f5ca97172982ca",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "cfd887f881ca4226a0679d9954ac5507",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 31566\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 458725\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 9897\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "a386f2c175e14a13948d5b590b96e184",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/9897 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "9897\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4a1530be688945ceab89da7dd6990682",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/31566 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "31566\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "33a9a9e935214428a587377ce9ebc3ad",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/458725 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "31566\n",
            "finished creating dictionaries\n",
            "Shortened queries\n"
          ]
        }
      ],
      "source": [
        "### Mixed ###\n",
        "language = 'mixed'\n",
        "language_long = 'german'\n",
        "remove_stopwords =False\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3QaOnyWpIyVV",
        "outputId": "b7000564-16f2-4a48-ab8a-6546db5319f8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\n",
            "MR-BM25_MXF\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:02:34<00:00, 15.20s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.1137, 'NDCG@3': 0.10314, 'NDCG@5': 0.09392, 'NDCG@10': 0.08335, 'NDCG@100': 0.11514, 'NDCG@1000': 0.16627}\n",
            "{'MAP@1': 0.0102, 'MAP@3': 0.01785, 'MAP@5': 0.02127, 'MAP@10': 0.02521, 'MAP@100': 0.03283, 'MAP@1000': 0.03518}\n",
            "{'Recall@1': 0.0102, 'Recall@3': 0.0254, 'Recall@5': 0.03616, 'Recall@10': 0.0555, 'Recall@100': 0.16537, 'Recall@1000': 0.34645}\n",
            "{'P@1': 0.1137, 'P@3': 0.09885, 'P@5': 0.08538, 'P@10': 0.06656, 'P@100': 0.02082, 'P@1000': 0.00456}\n",
            "{'R_cap@1': 0.1137, 'R_cap@3': 0.09998, 'R_cap@5': 0.08823, 'R_cap@10': 0.07742, 'R_cap@100': 0.16537, 'R_cap@1000': 0.34645}\n"
          ]
        }
      ],
      "source": [
        "language_long='french'\n",
        "print(\"Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXF\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "c2grnIAU3gx6",
        "outputId": "c918b4f3-c642-4faa-9aa2-b70e3cf6f5d5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\n",
            "MR-BM25_MXI\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:08:08<00:00, 16.55s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.1008, 'NDCG@3': 0.09112, 'NDCG@5': 0.08385, 'NDCG@10': 0.07582, 'NDCG@100': 0.11021, 'NDCG@1000': 0.16311}\n",
            "{'MAP@1': 0.00956, 'MAP@3': 0.01644, 'MAP@5': 0.01955, 'MAP@10': 0.02321, 'MAP@100': 0.03074, 'MAP@1000': 0.03316}\n",
            "{'Recall@1': 0.00956, 'Recall@3': 0.02328, 'Recall@5': 0.03315, 'Recall@10': 0.05121, 'Recall@100': 0.16294, 'Recall@1000': 0.35015}\n",
            "{'P@1': 0.1008, 'P@3': 0.08696, 'P@5': 0.0762, 'P@10': 0.06069, 'P@100': 0.02049, 'P@1000': 0.00462}\n",
            "{'R_cap@1': 0.1008, 'R_cap@3': 0.0882, 'R_cap@5': 0.0791, 'R_cap@10': 0.07118, 'R_cap@100': 0.16294, 'R_cap@1000': 0.35015}\n",
            "Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\n",
            "MR-BM25_MXE\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:11:32<00:00, 17.38s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.08376, 'NDCG@3': 0.07664, 'NDCG@5': 0.07207, 'NDCG@10': 0.06658, 'NDCG@100': 0.10226, 'NDCG@1000': 0.15642}\n",
            "{'MAP@1': 0.00763, 'MAP@3': 0.01335, 'MAP@5': 0.01621, 'MAP@10': 0.01966, 'MAP@100': 0.02694, 'MAP@1000': 0.02941}\n",
            "{'Recall@1': 0.00763, 'Recall@3': 0.01923, 'Recall@5': 0.02857, 'Recall@10': 0.04607, 'Recall@100': 0.15764, 'Recall@1000': 0.34937}\n",
            "{'P@1': 0.08376, 'P@3': 0.07335, 'P@5': 0.06634, 'P@10': 0.05457, 'P@100': 0.0199, 'P@1000': 0.00461}\n",
            "{'R_cap@1': 0.08376, 'R_cap@3': 0.07449, 'R_cap@5': 0.06905, 'R_cap@10': 0.06434, 'R_cap@100': 0.15764, 'R_cap@1000': 0.34937}\n"
          ]
        }
      ],
      "source": [
        "language_long='italian'\n",
        "print(\"Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXI\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)\n",
        "language_long='english'\n",
        "print(\"Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXE\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "K19JM_tW4qs3",
        "outputId": "dd0cce1c-47c4-42c7-da79-b022f4e0fb6e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:09:08<00:00, 16.79s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.08376, 'NDCG@3': 0.07664, 'NDCG@5': 0.07207, 'NDCG@10': 0.06658, 'NDCG@100': 0.10226, 'NDCG@1000': 0.15642}\n",
            "{'MAP@1': 0.00763, 'MAP@3': 0.01335, 'MAP@5': 0.01621, 'MAP@10': 0.01966, 'MAP@100': 0.02694, 'MAP@1000': 0.02941}\n",
            "{'Recall@1': 0.00763, 'Recall@3': 0.01923, 'Recall@5': 0.02857, 'Recall@10': 0.04607, 'Recall@100': 0.15764, 'Recall@1000': 0.34937}\n",
            "{'P@1': 0.08376, 'P@3': 0.07335, 'P@5': 0.06634, 'P@10': 0.05457, 'P@100': 0.0199, 'P@1000': 0.00461}\n",
            "{'R_cap@1': 0.08376, 'R_cap@3': 0.07449, 'R_cap@5': 0.06905, 'R_cap@10': 0.06434, 'R_cap@100': 0.15764, 'R_cap@1000': 0.34937}\n",
            "#######################################################################################################\n",
            "Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\n",
            "MR-BM25_MXD\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:03:41<00:00, 15.47s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.08687, 'NDCG@3': 0.07908, 'NDCG@5': 0.0745, 'NDCG@10': 0.06816, 'NDCG@100': 0.10426, 'NDCG@1000': 0.15816}\n",
            "{'MAP@1': 0.0079, 'MAP@3': 0.01384, 'MAP@5': 0.01671, 'MAP@10': 0.02018, 'MAP@100': 0.02767, 'MAP@1000': 0.03016}\n",
            "{'Recall@1': 0.0079, 'Recall@3': 0.0199, 'Recall@5': 0.02927, 'Recall@10': 0.04652, 'Recall@100': 0.15989, 'Recall@1000': 0.35064}\n",
            "{'P@1': 0.08687, 'P@3': 0.07558, 'P@5': 0.06859, 'P@10': 0.05571, 'P@100': 0.02034, 'P@1000': 0.00464}\n",
            "{'R_cap@1': 0.08687, 'R_cap@3': 0.07681, 'R_cap@5': 0.07146, 'R_cap@10': 0.06539, 'R_cap@100': 0.15989, 'R_cap@1000': 0.35064}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='german'\n",
        "print(\"Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xhD_PLh4Qq0W"
      },
      "source": [
        "# **Lexical Retrieval Stopword removal**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gQaSJzImR3Gp",
        "outputId": "a105d201-8ff5-4c0e-e8a1-836ea9260ce9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'dass', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unsere', 'unserem', 'unseren', 'unser', 'unseres', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 
'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']\n",
            "Shortened queries and removed stopwords\n"
          ]
        }
      ],
      "source": [
        "# add stopword removal:\n",
        "queries_s, corpus_s = shorten_and_reduce(queries, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uvItHf4LMJ87",
        "outputId": "cba0b236-a212-4fdc-a910-2d83bdb408ed"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con', 'col', 'coi', 'da', 'dal', 'dallo', 'dai', 'dagli', 'dall', 'dagl', 'dalla', 'dalle', 'di', 'del', 'dello', 'dei', 'degli', 'dell', 'degl', 'della', 'delle', 'in', 'nel', 'nello', 'nei', 'negli', 'nell', 'negl', 'nella', 'nelle', 'su', 'sul', 'sullo', 'sui', 'sugli', 'sull', 'sugl', 'sulla', 'sulle', 'per', 'tra', 'contro', 'io', 'tu', 'lui', 'lei', 'noi', 'voi', 'loro', 'mio', 'mia', 'miei', 'mie', 'tuo', 'tua', 'tuoi', 'tue', 'suo', 'sua', 'suoi', 'sue', 'nostro', 'nostra', 'nostri', 'nostre', 'vostro', 'vostra', 'vostri', 'vostre', 'mi', 'ti', 'ci', 'vi', 'lo', 'la', 'li', 'le', 'gli', 'ne', 'il', 'un', 'uno', 'una', 'ma', 'ed', 'se', 'perché', 'anche', 'come', 'dov', 'dove', 'che', 'chi', 'cui', 'non', 'più', 'quale', 'quanto', 'quanti', 'quanta', 'quante', 'quello', 'quelli', 'quella', 'quelle', 'questo', 'questi', 'questa', 'queste', 'si', 'tutto', 'tutti', 'a', 'c', 'e', 'i', 'l', 'o', 'ho', 'hai', 'ha', 'abbiamo', 'avete', 'hanno', 'abbia', 'abbiate', 'abbiano', 'avrò', 'avrai', 'avrà', 'avremo', 'avrete', 'avranno', 'avrei', 'avresti', 'avrebbe', 'avremmo', 'avreste', 'avrebbero', 'avevo', 'avevi', 'aveva', 'avevamo', 'avevate', 'avevano', 'ebbi', 'avesti', 'ebbe', 'avemmo', 'aveste', 'ebbero', 'avessi', 'avesse', 'avessimo', 'avessero', 'avendo', 'avuto', 'avuta', 'avuti', 'avute', 'sono', 'sei', 'è', 'siamo', 'siete', 'sia', 'siate', 'siano', 'sarò', 'sarai', 'sarà', 'saremo', 'sarete', 'saranno', 'sarei', 'saresti', 'sarebbe', 'saremmo', 'sareste', 'sarebbero', 'ero', 'eri', 'era', 'eravamo', 'eravate', 'erano', 'fui', 'fosti', 'fu', 'fummo', 'foste', 'furono', 'fossi', 'fosse', 'fossimo', 'fossero', 'essendo', 'faccio', 'fai', 'facciamo', 'fanno', 'faccia', 'facciate', 'facciano', 'farò', 'farai', 'farà', 'faremo', 'farete', 'faranno', 'farei', 'faresti', 'farebbe', 'faremmo', 'fareste', 'farebbero', 'facevo', 'facevi', 'faceva', 'facevamo', 'facevate', 'facevano', 'feci', 'facesti', 'fece', 'facemmo', 'faceste', 'fecero', 'facessi', 'facesse', 'facessimo', 'facessero', 'facendo', 'sto', 'stai', 'sta', 'stiamo', 'stanno', 'stia', 'stiate', 'stiano', 'starò', 'starai', 'starà', 'staremo', 'starete', 'staranno', 'starei', 'staresti', 'starebbe', 'staremmo', 'stareste', 'starebbero', 'stavo', 'stavi', 'stava', 'stavamo', 'stavate', 'stavano', 'stetti', 'stesti', 'stette', 'stemmo', 'steste', 'stettero', 'stessi', 'stesse', 'stessimo', 'stessero', 'stando']\n",
            "Shortened queries and removed stopwords\n",
            "['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']\n",
            "Shortened queries and removed stopwords\n"
          ]
        }
      ],
      "source": [
        "queries_s, corpus_s = shorten_and_reduce(queries_s, corpus_s, 'italian')\n",
        "queries_s, corpus_s = shorten_and_reduce(queries_s, corpus_s, 'french')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xJuftljcSVw5",
        "outputId": "13466e4c-6336-470a-9866-63d7d4605bc8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:08:44<00:00, 16.70s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.10644, 'NDCG@3': 0.09688, 'NDCG@5': 0.08906, 'NDCG@10': 0.08044, 'NDCG@100': 0.11329, 'NDCG@1000': 0.16542}\n",
            "{'MAP@1': 0.00986, 'MAP@3': 0.0172, 'MAP@5': 0.02047, 'MAP@10': 0.02444, 'MAP@100': 0.03208, 'MAP@1000': 0.03448}\n",
            "{'Recall@1': 0.00986, 'Recall@3': 0.02444, 'Recall@5': 0.03482, 'Recall@10': 0.0543, 'Recall@100': 0.16465, 'Recall@1000': 0.34962}\n",
            "{'P@1': 0.10644, 'P@3': 0.09277, 'P@5': 0.08119, 'P@10': 0.06478, 'P@100': 0.0208, 'P@1000': 0.00461}\n",
            "{'R_cap@1': 0.10644, 'R_cap@3': 0.09403, 'R_cap@5': 0.08417, 'R_cap@10': 0.07566, 'R_cap@100': 0.16465, 'R_cap@1000': 0.34962}\n",
            "#######################################################################################################\n",
            "Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\n",
            "MR-BM25_MXD\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:08:43<00:00, 16.69s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.10879, 'NDCG@3': 0.098, 'NDCG@5': 0.09019, 'NDCG@10': 0.0814, 'NDCG@100': 0.11527, 'NDCG@1000': 0.1671}\n",
            "{'MAP@1': 0.00995, 'MAP@3': 0.0173, 'MAP@5': 0.02062, 'MAP@10': 0.02465, 'MAP@100': 0.03251, 'MAP@1000': 0.03494}\n",
            "{'Recall@1': 0.00995, 'Recall@3': 0.02449, 'Recall@5': 0.0351, 'Recall@10': 0.05471, 'Recall@100': 0.16793, 'Recall@1000': 0.35134}\n",
            "{'P@1': 0.10879, 'P@3': 0.09376, 'P@5': 0.08232, 'P@10': 0.06564, 'P@100': 0.02128, 'P@1000': 0.00465}\n",
            "{'R_cap@1': 0.10879, 'R_cap@3': 0.09499, 'R_cap@5': 0.08519, 'R_cap@10': 0.07646, 'R_cap@100': 0.16793, 'R_cap@1000': 0.35134}\n",
            "Attention: As one lanugage needs to be specified, French was chosen as it is the most common.\n",
            "MR-BM25_MXF\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 247/247 [1:07:40<00:00, 16.44s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.10971, 'NDCG@3': 0.09911, 'NDCG@5': 0.09113, 'NDCG@10': 0.08135, 'NDCG@100': 0.11405, 'NDCG@1000': 0.16614}\n",
            "{'MAP@1': 0.00993, 'MAP@3': 0.01732, 'MAP@5': 0.02072, 'MAP@10': 0.02461, 'MAP@100': 0.03229, 'MAP@1000': 0.0347}\n",
            "{'Recall@1': 0.00993, 'Recall@3': 0.02457, 'Recall@5': 0.03542, 'Recall@10': 0.05441, 'Recall@100': 0.16522, 'Recall@1000': 0.35001}\n",
            "{'P@1': 0.10971, 'P@3': 0.09484, 'P@5': 0.0831, 'P@10': 0.06525, 'P@100': 0.02088, 'P@1000': 0.00462}\n",
            "{'R_cap@1': 0.10971, 'R_cap@3': 0.09599, 'R_cap@5': 0.08606, 'R_cap@10': 0.07602, 'R_cap@100': 0.16522, 'R_cap@1000': 0.35001}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX\")\n",
        "run_lr_bm25(queries_s, qrels, corpus_s)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='german'\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD\")\n",
        "run_mr_bm25(queries_s, qrels, corpus_s, language_long)\n",
        "language_long='french'\n",
        "print(\"Attention: As one language needs to be specified, French was chosen for this run.\")\n",
        "print(\"MR-BM25_MXF\")\n",
        "run_mr_bm25(queries_s, qrels, corpus_s, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P2xfHrVDQ3Fl"
      },
      "source": [
        "# **Lexical Retrieval: SSL datasets**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 764
        },
        "id": "FpuY9pNjS0ih",
        "outputId": "5e364328-a6f1-4dc4-afd8-b215bf4f46dd"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2d4fccb5d86943748c66f71c2e1f925d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8d8a29bfa85d41bbb75ccf53b6841b7e",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "9f3529cf4a3b4f649ae7275466ef928c",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 1371\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 4882\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 248\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ed921949207645259ed355afb5b1fa93",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "3946136d9a024f479d8805dbe968378a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "94ca0a69e8a94a84b36fac0ea329210a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 11942\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 63203\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 2379\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "9fcbb5cfe6c4464fa017f8461931f382",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "cadb35ed817543fe8e72b8f6b9a52ecd",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "70c1a1326754439a9e06a46d20d806e1",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 17678\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 147486\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 5653\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "dd15a0ff82c7406ebc1b5c2e4c20d2ae",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/8280 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "8280\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5333ab4ca36a4d02b07a96f68dbefcd2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/30991 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "30991\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "17be30d073cc4d1c8944181a7f2b1f2a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/215571 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "30991\n",
            "finished creating dictionaries\n"
          ]
        }
      ],
      "source": [
        "# SSL for lexical\n",
        "from datasets import concatenate_datasets, load_dataset\n",
        "use_subset = True\n",
        "subset_length = 10000\n",
        "remove_stopwords = False\n",
        "language_long = 'german'\n",
        "\n",
        "querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')\n",
        "querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')\n",
        "querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')\n",
        "\n",
        "querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])\n",
        "qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])\n",
        "corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])\n",
        "\n",
        "querie_dataset_p = querie_dataset_p.shuffle(seed=42)\n",
        "qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)\n",
        "corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)\n",
        "\n",
        "queries_ssl, qrels_ssl, corpus_ = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 451
        },
        "id": "t5aURI4BTNWY",
        "outputId": "c7ae5b27-32f4-4a32-e4ef-819265eadf4f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que:   0%|          | 0/243 [00:21<?, ?it/s]\n"
          ]
        },
        {
          "ename": "KeyError",
          "evalue": "ignored",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
            "\u001b[0;32m<ipython-input-24-ffc190285276>\u001b[0m in \u001b[0;36m<cell line: 3>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"#######################################################################################################\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"LR-BM25_MX\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mrun_lr_bm25\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries_ssl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqrels_ssl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcorpus\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"#######################################################################################################\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mlanguage_long\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'german'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;32m<ipython-input-10-e65fbcd7d5e0>\u001b[0m in \u001b[0;36mrun_lr_bm25\u001b[0;34m(queries, qrels, corpus)\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m   \u001b[0;31m#### Retrieve dense results (format of results is identical to qrels)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m   \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretrieve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m   \u001b[0;31m#### Evaluate your retrieval using NDCG@k, MAP@K ...\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/evaluation.py\u001b[0m in \u001b[0;36mretrieve\u001b[0;34m(self, corpus, queries, **kwargs)\u001b[0m\n\u001b[1;32m     21\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m             \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Model/Technique has not been provided!\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtop_k\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscore_function\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     24\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     25\u001b[0m     def rerank(self, \n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/bm25_search.py\u001b[0m in \u001b[0;36msearch\u001b[0;34m(self, corpus, queries, top_k, *args, **kwargs)\u001b[0m\n\u001b[1;32m     49\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mstart_idx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdesc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'que'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     50\u001b[0m             \u001b[0mquery_ids_batch\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mquery_ids\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 51\u001b[0;31m             results = self.es.lexical_multisearch(\n\u001b[0m\u001b[1;32m     52\u001b[0m                 \u001b[0mtexts\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     53\u001b[0m                 top_hits=top_k + 1) # Add 1 extra if query is present with documents\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/elastic_search.py\u001b[0m in \u001b[0;36mlexical_multisearch\u001b[0;34m(self, texts, top_hits, skip)\u001b[0m\n\u001b[1;32m    191\u001b[0m         \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    192\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"responses\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 193\u001b[0;31m             \u001b[0mresponses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mskip\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    194\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    195\u001b[0m             \u001b[0mhits\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mKeyError\u001b[0m: 'hits'"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX\")\n",
        "run_lr_bm25(queries_ssl, qrels_ssl, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='german'\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD\")\n",
        "run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='french'\n",
        "print(\"Attention: As one language needs to be specified, French was chosen for this run.\")\n",
        "print(\"MR-BM25_MXF\")\n",
        "run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='italian'\n",
        "print(\"Attention: As one language needs to be specified, Italian was chosen for this run.\")\n",
        "print(\"MR-BM25_MXI\")\n",
        "run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)\n",
        "print(\"#######################################################################################################\")\n",
        "language_long='english'\n",
        "print(\"Attention: As one language needs to be specified, English was chosen for this run.\")\n",
        "print(\"MR-BM25_MXE\")\n",
        "run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RZGobu2pTN--"
      },
      "source": [
        "# Lexical Retrieval with a single language per query set"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 330
        },
        "id": "oomDnoBfVtxg",
        "outputId": "a0ff4457-b49f-4dc5-b0ca-e289771930f5"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "57c25f04c163414487b103bf1626e464",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ab863f0a05eb4bdbbd269977ea06ae07",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d89dd656083e44668e0fc6f957ecab69",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 1371\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 4882\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 248\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "74f5c6dc41cf48b0a9bbb1df443af7ca",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/248 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "248\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "48c8da0312f147e8a158b89d603b05d2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/1371 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1371\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ec292dcf3ae04c0f9ab45a2136cd0b73",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/4882 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1371\n",
            "finished creating dictionaries\n",
            "Shortened queries\n"
          ]
        }
      ],
      "source": [
        "### Italian ###\n",
        "language = 'it'\n",
        "language_long = 'italian'\n",
        "remove_stopwords = False\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus_it = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bBZikyv5WJ1R",
        "outputId": "f150690f-a891-4890-c670-eccec51501d5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_IT_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 11/11 [03:18<00:00, 18.09s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.14223, 'NDCG@3': 0.13575, 'NDCG@5': 0.15606, 'NDCG@10': 0.19881, 'NDCG@100': 0.34109, 'NDCG@1000': 0.37782}\n",
            "{'MAP@1': 0.04739, 'MAP@3': 0.081, 'MAP@5': 0.09817, 'MAP@10': 0.118, 'MAP@100': 0.1531, 'MAP@1000': 0.15644}\n",
            "{'Recall@1': 0.04739, 'Recall@3': 0.11602, 'Recall@5': 0.17553, 'Recall@10': 0.27552, 'Recall@100': 0.80278, 'Recall@1000': 1.0}\n",
            "{'P@1': 0.14223, 'P@3': 0.11743, 'P@5': 0.10562, 'P@10': 0.08366, 'P@100': 0.02298, 'P@1000': 0.00289}\n",
            "{'R_cap@1': 0.14223, 'R_cap@3': 0.13713, 'R_cap@5': 0.1785, 'R_cap@10': 0.27568, 'R_cap@100': 0.80278, 'R_cap@1000': 1.0}\n",
            "#######################################################################################################\n",
            "MR-BM25_IT_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 11/11 [02:57<00:00, 16.10s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.24799, 'NDCG@3': 0.22079, 'NDCG@5': 0.23772, 'NDCG@10': 0.28604, 'NDCG@100': 0.41971, 'NDCG@1000': 0.44462}\n",
            "{'MAP@1': 0.08239, 'MAP@3': 0.13673, 'MAP@5': 0.15927, 'MAP@10': 0.18438, 'MAP@100': 0.21999, 'MAP@1000': 0.22222}\n",
            "{'Recall@1': 0.08239, 'Recall@3': 0.18622, 'Recall@5': 0.25373, 'Recall@10': 0.37164, 'Recall@100': 0.86336, 'Recall@1000': 1.0}\n",
            "{'P@1': 0.24799, 'P@3': 0.18867, 'P@5': 0.15244, 'P@10': 0.10985, 'P@100': 0.02484, 'P@1000': 0.00289}\n",
            "{'R_cap@1': 0.24799, 'R_cap@3': 0.21882, 'R_cap@5': 0.25773, 'R_cap@10': 0.37182, 'R_cap@100': 0.86336, 'R_cap@1000': 1.0}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_IT\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_IT\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 330
        },
        "id": "UcMC9H2fVtND",
        "outputId": "952aebef-09d0-41da-b53f-d1aa2df68336"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "55f0e7dd48424f32953387983d0df3f6",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "e1227dba635d4bd6a75f958037947101",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ae94970e3b4941369be9501de57b496b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 11942\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 63203\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 2379\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1c8be2d35bff4be5ad04c295c4d20f40",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/2379 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "2379\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d6b58429443c469bbc9535fb22127d09",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/11942 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "11942\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c3a07fbbf92d487ab72d85f0cd661e61",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/63203 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "11942\n",
            "finished creating dictionaries\n",
            "Shortened queries\n"
          ]
        }
      ],
      "source": [
        "### French ###\n",
        "language = 'fr'\n",
        "language_long = 'french'\n",
        "remove_stopwords = False\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus_fr = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eQyVNyI9WHD8",
        "outputId": "45e76194-174a-45bd-d670-f437ee51f9bf"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_FR_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 94/94 [34:58<00:00, 22.32s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.07888, 'NDCG@3': 0.07454, 'NDCG@5': 0.07614, 'NDCG@10': 0.08818, 'NDCG@100': 0.17314, 'NDCG@1000': 0.25976}\n",
            "{'MAP@1': 0.01655, 'MAP@3': 0.02964, 'MAP@5': 0.03603, 'MAP@10': 0.0436, 'MAP@100': 0.05942, 'MAP@1000': 0.0644}\n",
            "{'Recall@1': 0.01655, 'Recall@3': 0.04378, 'Recall@5': 0.06573, 'Recall@10': 0.1071, 'Recall@100': 0.38333, 'Recall@1000': 0.81509}\n",
            "{'P@1': 0.07888, 'P@3': 0.06861, 'P@5': 0.06197, 'P@10': 0.05052, 'P@100': 0.01844, 'P@1000': 0.00413}\n",
            "{'R_cap@1': 0.07888, 'R_cap@3': 0.07419, 'R_cap@5': 0.0799, 'R_cap@10': 0.10915, 'R_cap@100': 0.38333, 'R_cap@1000': 0.81509}\n",
            "#######################################################################################################\n",
            "MR-BM25_FR_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 94/94 [30:26<00:00, 19.43s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.15893, 'NDCG@3': 0.14974, 'NDCG@5': 0.147, 'NDCG@10': 0.16145, 'NDCG@100': 0.2463, 'NDCG@1000': 0.31715}\n",
            "{'MAP@1': 0.03556, 'MAP@3': 0.064, 'MAP@5': 0.07524, 'MAP@10': 0.0873, 'MAP@100': 0.10642, 'MAP@1000': 0.11082}\n",
            "{'Recall@1': 0.03556, 'Recall@3': 0.09148, 'Recall@5': 0.12714, 'Recall@10': 0.1866, 'Recall@100': 0.46047, 'Recall@1000': 0.80652}\n",
            "{'P@1': 0.15893, 'P@3': 0.13739, 'P@5': 0.11496, 'P@10': 0.08482, 'P@100': 0.02152, 'P@1000': 0.00407}\n",
            "{'R_cap@1': 0.15893, 'R_cap@3': 0.14896, 'R_cap@5': 0.15089, 'R_cap@10': 0.18957, 'R_cap@100': 0.46047, 'R_cap@1000': 0.80652}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_FR\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_FR\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 330
        },
        "id": "RGKxo8AfVsef",
        "outputId": "f2bec8e4-3b23-4bae-ea0d-48001cddad9c"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d8ec4453a76a4878a9985622e936d359",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2322cee2806d443cbab2a045401890a0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4a38ffaa76e24da3a0695319d3b9d53a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 17678\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 147486\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 5653\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8ef0d859e1ae40078369208825cb7b71",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/5653 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "5653\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "e3a481b3ff3446a7b00615d825f26777",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/17678 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "17678\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "dcefc16293d2425aaa0216dfc93aec6e",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/147486 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "17678\n",
            "finished creating dictionaries\n",
            "Shortened queries\n"
          ]
        }
      ],
      "source": [
        "### German ###\n",
        "language = 'de'\n",
        "language_long = 'german'\n",
        "remove_stopwords = False\n",
        "use_subset = False\n",
        "subset_length = 0\n",
        "queries, qrels, corpus_de = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "dmHcsUB7WBmC",
        "outputId": "6e2291b9-8b5d-4f6f-935a-f3365f7a5126"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_DE_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 139/139 [54:42<00:00, 23.62s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.08525, 'NDCG@3': 0.0797, 'NDCG@5': 0.07827, 'NDCG@10': 0.08341, 'NDCG@100': 0.15674, 'NDCG@1000': 0.24341}\n",
            "{'MAP@1': 0.01374, 'MAP@3': 0.02411, 'MAP@5': 0.02963, 'MAP@10': 0.03636, 'MAP@100': 0.05085, 'MAP@1000': 0.05598}\n",
            "{'Recall@1': 0.01374, 'Recall@3': 0.03471, 'Recall@5': 0.05265, 'Recall@10': 0.08684, 'Recall@100': 0.30672, 'Recall@1000': 0.68687}\n",
            "{'P@1': 0.08525, 'P@3': 0.07548, 'P@5': 0.06834, 'P@10': 0.05675, 'P@100': 0.02128, 'P@1000': 0.00501}\n",
            "{'R_cap@1': 0.08525, 'R_cap@3': 0.07835, 'R_cap@5': 0.07865, 'R_cap@10': 0.0939, 'R_cap@100': 0.30672, 'R_cap@1000': 0.68687}\n",
            "#######################################################################################################\n",
            "MR-BM25_DE_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 139/139 [44:49<00:00, 19.35s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.09662, 'NDCG@3': 0.08977, 'NDCG@5': 0.0882, 'NDCG@10': 0.09295, 'NDCG@100': 0.1705, 'NDCG@1000': 0.25436}\n",
            "{'MAP@1': 0.01587, 'MAP@3': 0.02783, 'MAP@5': 0.03391, 'MAP@10': 0.04128, 'MAP@100': 0.05715, 'MAP@1000': 0.06233}\n",
            "{'Recall@1': 0.01587, 'Recall@3': 0.03987, 'Recall@5': 0.05944, 'Recall@10': 0.09569, 'Recall@100': 0.32771, 'Recall@1000': 0.69299}\n",
            "{'P@1': 0.09662, 'P@3': 0.08451, 'P@5': 0.07693, 'P@10': 0.06278, 'P@100': 0.02278, 'P@1000': 0.00507}\n",
            "{'R_cap@1': 0.09662, 'R_cap@3': 0.08833, 'R_cap@5': 0.08848, 'R_cap@10': 0.10343, 'R_cap@100': 0.32771, 'R_cap@1000': 0.69299}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_DE\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"MR-BM25_DE\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DUYAubkgWN7W"
      },
      "source": [
        "# Cross-Encoder, SBERT Reranking, and Lexical Retrieval on a Subset of 100 Queries\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Oc-dn8NwW-p0"
      },
      "outputs": [],
      "source": [
        "# Shuffle all three datasets with a fixed seed so the run is reproducible\n",
        "querie_dataset = querie_dataset.shuffle(seed=42)\n",
        "qrel_dataset = qrel_dataset.shuffle(seed=42)\n",
        "corpus_dataset = corpus_dataset.shuffle(seed=42)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Gz6UmoBtWvDR",
        "outputId": "972b7d7c-e92b-4792-8076-f73cc9ee3cb5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 62557\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 674296\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 18177\n",
            "})\n"
          ]
        }
      ],
      "source": [
        "print(querie_dataset)\n",
        "print(qrel_dataset)\n",
        "print(corpus_dataset)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 471
        },
        "id": "5NxmRFlcWe50",
        "outputId": "136e722d-1f35-4db9-ef36-97b12c519ffd"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/xyz___json/xyz--querie-0beee30470914711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-71dd81b95e148fbc.arrow\n",
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/xyz___json/xyz--qrel-6ed2d669b9013e6e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-419db32b04d39763.arrow\n",
            "WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/xyz___json/xyz--corpus-165ac86cedbdd2cb/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-2a6663825da27043.arrow\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 31566\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 458725\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 9897\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1daacf0465504b59a64ad04609947dc1",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/9897 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "9897\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1c7d79551d1c408e81ee92a401c479d3",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/31566 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "31566\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7d92e0ca77784cd8937c329dc3718bf0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/458725 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "31566\n",
            "finished creating dictionaries\n",
            "create subset of 10000 queries.\n",
            "10001\n",
            "10001\n",
            "9897\n",
            "Shortened queries\n"
          ]
        }
      ],
      "source": [
        "### Mixed ###\n",
        "language = 'mixed'\n",
        "language_long = 'german'\n",
        "remove_stopwords = False\n",
        "use_subset = True\n",
        "subset_length = 10000\n",
        "queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uEXmDrvHayQW",
        "outputId": "17061af3-19f8-4e26-9482-643dd671c14e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "create subset of 100 queries.\n",
            "101\n",
            "101\n",
            "9897\n"
          ]
        }
      ],
      "source": [
        "### Subset 100 ###\n",
        "queries_100, qrels_100, corpus_100 = create_subset(100, queries, qrels, corpus)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "KG28JhZPawBI",
        "outputId": "b3a3df3f-ec7a-45ba-95f2-948bd6cbd65b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "RCE_BM25_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:20<00:00, 20.96s/it]\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "f0e24eeecb2642dea4cbf045e7c99371",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)lve/main/config.json:   0%|          | 0.00/891 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4561a34b6cfc468795d523465e46c3e0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading pytorch_model.bin:   0%|          | 0.00/471M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b7fef5e9a34d42989529f86587011cda",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)okenizer_config.json:   0%|          | 0.00/435 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "a8a1a1597274412a8c37dff815102c4d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b4a35927683f4a7686b8f313d5efd389",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "31e6ce9e343a4f7e8f2cbb45e523434e",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c0f74cd947b745058adb402213db5f90",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Batches:   0%|          | 0/79 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.0297, 'NDCG@3': 0.02213, 'NDCG@5': 0.01989, 'NDCG@10': 0.01839, 'NDCG@100': 0.07353, 'NDCG@1000': 0.07353}\n",
            "{'MAP@1': 0.00182, 'MAP@3': 0.00255, 'MAP@5': 0.00278, 'MAP@10': 0.0032, 'MAP@100': 0.00893, 'MAP@1000': 0.00893}\n",
            "{'Recall@1': 0.00182, 'Recall@3': 0.00355, 'Recall@5': 0.00472, 'Recall@10': 0.0081, 'Recall@100': 0.14198, 'Recall@1000': 0.14198}\n",
            "{'P@1': 0.0297, 'P@3': 0.0198, 'P@5': 0.01782, 'P@10': 0.01683, 'P@100': 0.02426, 'P@1000': 0.00243}\n",
            "{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}\n",
            "#######################################################################################################\n",
            "SB_R_M1_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:15<00:00, 15.66s/it]\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d3799cc6216d4fc29052989d1f022a3b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)f333f/.gitattributes:   0%|          | 0.00/690 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0e4efd51315347a69a907ad4ba29a625",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4d6e1c477d99435a8f941bccb6f69384",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (…)4d423f333f/README.md:   0%|          | 0.00/4.03k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.0396, 'NDCG@3': 0.02323, 'NDCG@5': 0.02113, 'NDCG@10': 0.01654, 'NDCG@100': 0.02023, 'NDCG@1000': 0.02023}\n",
            "{'MAP@1': 0.00182, 'MAP@3': 0.00229, 'MAP@5': 0.00279, 'MAP@10': 0.00305, 'MAP@100': 0.00388, 'MAP@1000': 0.00388}\n",
            "{'Recall@1': 0.00182, 'Recall@3': 0.00324, 'Recall@5': 0.005, 'Recall@10': 0.00708, 'Recall@100': 0.02562, 'Recall@1000': 0.02562}\n",
            "{'P@1': 0.0396, 'P@3': 0.0198, 'P@5': 0.01782, 'P@10': 0.01287, 'P@100': 0.00465, 'P@1000': 0.00047}\n",
            "{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}\n",
            "#######################################################################################################\n",
            "SB_R_M2_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:14<00:00, 14.41s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.0198, 'NDCG@3': 0.03446, 'NDCG@5': 0.0291, 'NDCG@10': 0.02518, 'NDCG@100': 0.03546, 'NDCG@1000': 0.03584}\n",
            "{'MAP@1': 0.00096, 'MAP@3': 0.00329, 'MAP@5': 0.00384, 'MAP@10': 0.00444, 'MAP@100': 0.00614, 'MAP@1000': 0.00617}\n",
            "{'Recall@1': 0.00096, 'Recall@3': 0.00589, 'Recall@5': 0.00752, 'Recall@10': 0.01149, 'Recall@100': 0.05395, 'Recall@1000': 0.05477}\n",
            "{'P@1': 0.0198, 'P@3': 0.0363, 'P@5': 0.02772, 'P@10': 0.02277, 'P@100': 0.0095, 'P@1000': 0.00097}\n",
            "{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"RCE_BM25_MX_100\")\n",
        "model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'\n",
        "run_rce_bm25(queries_100, qrels_100, corpus_100, model_name)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M1_MX_100\")\n",
        "# Model 1 (model 2, distiluse-base-multilingual-cased-v1, is evaluated next)\n",
        "model_name = \"paraphrase-albert-small-v2\"\n",
        "run_sb_r(queries_100, qrels_100, corpus_100, model_name)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M2_MX_100\")\n",
        "# Model 2: multilingual DistilUSE\n",
        "model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v1'\n",
        "run_sb_r(queries_100, qrels_100, corpus_100, model_name)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 237
        },
        "id": "W_Uue59qMngk",
        "outputId": "1fd16580-a252-4c48-da2a-af45a52fde8a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "SB_R_M3_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:21<00:00, 21.54s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.17822, 'NDCG@3': 0.15366, 'NDCG@5': 0.15579, 'NDCG@10': 0.14174, 'NDCG@100': 0.21998, 'NDCG@1000': 0.22114}\n",
            "{'MAP@1': 0.01264, 'MAP@3': 0.02295, 'MAP@5': 0.03383, 'MAP@10': 0.04498, 'MAP@100': 0.07343, 'MAP@1000': 0.07364}\n",
            "{'Recall@1': 0.01264, 'Recall@3': 0.03062, 'Recall@5': 0.05417, 'Recall@10': 0.08554, 'Recall@100': 0.3219, 'Recall@1000': 0.32485}\n",
            "{'P@1': 0.17822, 'P@3': 0.14851, 'P@5': 0.15248, 'P@10': 0.12772, 'P@100': 0.05267, 'P@1000': 0.00532}\n",
            "{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M3_MX_100\")\n",
        "# Model 3: Swiss legal Sentence-BERT model\n",
        "model_name = 'xyz/sBert-swiss-legal-base'\n",
        "run_sb_r(queries_100, qrels_100, corpus_100, model_name)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "t0ICJb8oFPuA",
        "outputId": "fc459049-2a3c-4660-d358-b68616119e0c"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:22<00:00, 22.23s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.05941, 'NDCG@3': 0.10974, 'NDCG@5': 0.09872, 'NDCG@10': 0.08517, 'NDCG@100': 0.10635, 'NDCG@1000': 0.16756}\n",
            "{'MAP@1': 0.00291, 'MAP@3': 0.01122, 'MAP@5': 0.01411, 'MAP@10': 0.01803, 'MAP@100': 0.02495, 'MAP@1000': 0.02798}\n",
            "{'Recall@1': 0.00291, 'Recall@3': 0.01884, 'Recall@5': 0.0282, 'Recall@10': 0.04683, 'Recall@100': 0.14198, 'Recall@1000': 0.3337}\n",
            "{'P@1': 0.05941, 'P@3': 0.11881, 'P@5': 0.09901, 'P@10': 0.07723, 'P@100': 0.02426, 'P@1000': 0.00585}\n",
            "{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}\n",
            "#######################################################################################################\n",
            "Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\n",
            "MR-BM25_MXD_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/9897 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:19<00:00, 19.58s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.09901, 'NDCG@3': 0.11477, 'NDCG@5': 0.11066, 'NDCG@10': 0.09139, 'NDCG@100': 0.1158, 'NDCG@1000': 0.17426}\n",
            "{'MAP@1': 0.00601, 'MAP@3': 0.01368, 'MAP@5': 0.01783, 'MAP@10': 0.0208, 'MAP@100': 0.02878, 'MAP@1000': 0.03167}\n",
            "{'Recall@1': 0.00601, 'Recall@3': 0.02098, 'Recall@5': 0.0336, 'Recall@10': 0.04694, 'Recall@100': 0.1519, 'Recall@1000': 0.33719}\n",
            "{'P@1': 0.09901, 'P@3': 0.11881, 'P@5': 0.11089, 'P@10': 0.08119, 'P@100': 0.02614, 'P@1000': 0.00588}\n",
            "{'R_cap@1': 0.09901, 'R_cap@3': 0.11881, 'R_cap@5': 0.11139, 'R_cap@10': 0.08412, 'R_cap@100': 0.1519, 'R_cap@1000': 0.33719}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX_100\")\n",
        "run_lr_bm25(queries_100, qrels_100, corpus_100)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD_100\")\n",
        "run_mr_bm25(queries_100, qrels_100, corpus_100, language_long)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JHMa0HJ_StoP"
      },
      "source": [
        "# SSL (Single-Language Links) Experiments"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 833
        },
        "id": "tSypALXpPFfa",
        "outputId": "7efaaac6-8596-458f-e619-4b1bf3beb263"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "569ab439fb514a02be7713a8452982c7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7ca99638dbde42589c3777865c375819",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "fa923ad29965497db5c829c92e6aafac",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 1371\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 4882\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 248\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7fcc6865899d473cbc96c3651c2e31ae",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "debb07a4e58a486ca6c83692ac01a502",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "fac0b87fda51499a8a3b426cc1ec06e7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 11942\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 63203\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 2379\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "62a74e44f30140d4bf4c6fd600fea43f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2a1e61d2a3f74ffdac079e2a753821c4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2deacc1256234cea83ac483e2a6d2833",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 17678\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'corp_id', 'match', 'language'],\n",
            "    num_rows: 147486\n",
            "})\n",
            "Dataset({\n",
            "    features: ['id', 'text', 'language'],\n",
            "    num_rows: 5653\n",
            "})\n",
            "Finished filtering for language\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "50774bc51e9b4a569b0f4c608694e890",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/8280 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "8280\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7a0596efda52419591ed18c52b1968cd",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/30991 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "30991\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "62a0725c143a46c28ccb75becf4191b1",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Map:   0%|          | 0/215571 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "30991\n",
            "finished creating dictionaries\n",
            "create subset of 100 queries.\n",
            "101\n",
            "101\n",
            "8280\n"
          ]
        }
      ],
      "source": [
        "### Mixed single-language links ###\n",
        "from datasets import concatenate_datasets, load_dataset\n",
        "use_subset = True\n",
        "subset_length = 10000\n",
        "remove_stopwords = False\n",
        "language_long = 'german'\n",
        "\n",
        "querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')\n",
        "querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')\n",
        "querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')\n",
        "\n",
        "querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])\n",
        "qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])\n",
        "corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])\n",
        "\n",
        "querie_dataset_p = querie_dataset_p.shuffle(seed=42)\n",
        "qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)\n",
        "corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)\n",
        "\n",
        "queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)\n",
        "### Subset 100 ###\n",
        "queries_100, qrels_100, corpus_100 = create_subset(100, queries, qrels, corpus)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 451
        },
        "id": "zX6opr_NPWy2",
        "outputId": "b2de687f-5eb0-48ea-ba79-1bea0f283751"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX_SSL_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/8280 [00:00<?, ?docs/s]\n",
            "que:   0%|          | 0/1 [00:15<?, ?it/s]\n"
          ]
        },
        {
          "ename": "KeyError",
          "evalue": "ignored",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
            "\u001b[0;32m<ipython-input-27-106e2dc7b9bd>\u001b[0m in \u001b[0;36m<cell line: 3>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"#######################################################################################################\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"LR-BM25_MX_SSL_100\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mrun_lr_bm25\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries_100\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqrels_100\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcorpus_100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"#######################################################################################################\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;32m<ipython-input-10-e65fbcd7d5e0>\u001b[0m in \u001b[0;36mrun_lr_bm25\u001b[0;34m(queries, qrels, corpus)\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m   \u001b[0;31m#### Retrieve dense results (format of results is identical to qrels)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m   \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretrieve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m   \u001b[0;31m#### Evaluate your retrieval using NDCG@k, MAP@K ...\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/evaluation.py\u001b[0m in \u001b[0;36mretrieve\u001b[0;34m(self, corpus, queries, **kwargs)\u001b[0m\n\u001b[1;32m     21\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m             \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Model/Technique has not been provided!\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mretriever\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqueries\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtop_k\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscore_function\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     24\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     25\u001b[0m     def rerank(self, \n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/bm25_search.py\u001b[0m in \u001b[0;36msearch\u001b[0;34m(self, corpus, queries, top_k, *args, **kwargs)\u001b[0m\n\u001b[1;32m     49\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mstart_idx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdesc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'que'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     50\u001b[0m             \u001b[0mquery_ids_batch\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mquery_ids\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 51\u001b[0;31m             results = self.es.lexical_multisearch(\n\u001b[0m\u001b[1;32m     52\u001b[0m                 \u001b[0mtexts\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mqueries\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mstart_idx\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     53\u001b[0m                 top_hits=top_k + 1) # Add 1 extra if query is present with documents\n",
            "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/lexical/elastic_search.py\u001b[0m in \u001b[0;36mlexical_multisearch\u001b[0;34m(self, texts, top_hits, skip)\u001b[0m\n\u001b[1;32m    191\u001b[0m         \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    192\u001b[0m         \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"responses\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 193\u001b[0;31m             \u001b[0mresponses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"hits\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mskip\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    194\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    195\u001b[0m             \u001b[0mhits\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mKeyError\u001b[0m: 'hits'"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX_SSL_100\")\n",
        "run_lr_bm25(queries_100, qrels_100, corpus_100)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD_SSL_100\")\n",
        "run_mr_bm25(queries_100, qrels_100, corpus_100, language_long)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, French is used for this run.\")\n",
        "print(\"MR-BM25_MXD_SSL_100\")\n",
        "run_mr_bm25(queries_100, qrels_100, corpus_100, \"french\")\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M3_MX_SSL_100\")\n",
        "# distiluse-base-multilingual-cased\n",
        "model_name = 'xyz/sBert-swiss-legal-base'\n",
        "# model_name = \"paraphrase-albert-small-v2\"\n",
        "run_sb_r(queries_100, qrels_100, corpus_100, model_name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hjKcsd5QcYGK"
      },
      "source": [
        "# Results: SSL (mixed single language links) with shortened queries\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dOYcdVD1cmUA"
      },
      "outputs": [],
      "source": [
        "### Mixed single language links ###\n",
        "from datasets import concatenate_datasets, load_dataset\n",
        "use_subset = True\n",
        "subset_length = 10000\n",
        "remove_stopwords = True\n",
        "language_long = 'german'\n",
        "\n",
        "# Filter queries, qrels and corpus per language (Italian, French, German)\n",
        "querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')\n",
        "querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')\n",
        "querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')\n",
        "\n",
        "# Merge the three per-language splits into one mixed dataset\n",
        "querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])\n",
        "qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])\n",
        "corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])\n",
        "\n",
        "# Shuffle with a fixed seed for reproducibility\n",
        "querie_dataset_p = querie_dataset_p.shuffle(seed=42)\n",
        "qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)\n",
        "corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)\n",
        "\n",
        "# Convert the datasets to dictionaries and shorten the query text\n",
        "queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)\n",
        "queries = shorten(queries)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "IPlSV1xMFX9O",
        "outputId": "6a863917-aadc-44ad-cb16-0d7b1114fb8e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX-SLL_10000\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/8280 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 280/280 [1:26:13<00:00, 18.48s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.09128, 'NDCG@3': 0.08575, 'NDCG@5': 0.08629, 'NDCG@10': 0.09645, 'NDCG@100': 0.18027, 'NDCG@1000': 0.26529}\n",
            "{'MAP@1': 0.01785, 'MAP@3': 0.03149, 'MAP@5': 0.03831, 'MAP@10': 0.04651, 'MAP@100': 0.06339, 'MAP@1000': 0.06853}\n",
            "{'Recall@1': 0.01785, 'Recall@3': 0.04554, 'Recall@5': 0.06803, 'Recall@10': 0.11037, 'Recall@100': 0.37469, 'Recall@1000': 0.76708}\n",
            "{'P@1': 0.09128, 'P@3': 0.07969, 'P@5': 0.07143, 'P@10': 0.05846, 'P@100': 0.02106, 'P@1000': 0.00469}\n",
            "{'R_cap@1': 0.07908, 'R_cap@3': 0.07362, 'R_cap@5': 0.07727, 'R_cap@10': 0.09993, 'R_cap@100': 0.32457, 'R_cap@1000': 0.66449}\n",
            "#######################################################################################################\n",
            "Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\n",
            "MR-BM25_MXD_S\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/8280 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 280/280 [1:19:08<00:00, 16.96s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.09293, 'NDCG@3': 0.08733, 'NDCG@5': 0.08782, 'NDCG@10': 0.09699, 'NDCG@100': 0.18167, 'NDCG@1000': 0.26676}\n",
            "{'MAP@1': 0.01802, 'MAP@3': 0.03164, 'MAP@5': 0.03845, 'MAP@10': 0.04653, 'MAP@100': 0.06359, 'MAP@1000': 0.0688}\n",
            "{'Recall@1': 0.01802, 'Recall@3': 0.04568, 'Recall@5': 0.06785, 'Recall@10': 0.10937, 'Recall@100': 0.37672, 'Recall@1000': 0.76984}\n",
            "{'P@1': 0.09293, 'P@3': 0.08121, 'P@5': 0.07333, 'P@10': 0.05944, 'P@100': 0.02144, 'P@1000': 0.00473}\n",
            "{'R_cap@1': 0.0805, 'R_cap@3': 0.07498, 'R_cap@5': 0.07827, 'R_cap@10': 0.09936, 'R_cap@100': 0.32633, 'R_cap@1000': 0.66687}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX-SLL_10000\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD_SLL_10000\")\n",
        "language_long = 'german'\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cIu4eH2WeRbT",
        "outputId": "5b6982d5-d04a-4a4d-b48f-d084a28b67f7"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "create subset of 1000 queries.\n"
          ]
        }
      ],
      "source": [
        "### Subset 1000 ###\n",
        "queries, qrels, corpus = create_subset(1000, queries, qrels, corpus)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Ebrel-X-0w1a",
        "outputId": "66b7500c-945c-406c-c463-0bce32d4c519"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX-SLL_1000\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/2413 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 8/8 [01:56<00:00, 14.51s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.2266, 'NDCG@3': 0.1866, 'NDCG@5': 0.17495, 'NDCG@10': 0.17431, 'NDCG@100': 0.28217, 'NDCG@1000': 0.39647}\n",
            "{'MAP@1': 0.0302, 'MAP@3': 0.05143, 'MAP@5': 0.06379, 'MAP@10': 0.07935, 'MAP@100': 0.1103, 'MAP@1000': 0.12076}\n",
            "{'Recall@1': 0.0302, 'Recall@3': 0.06724, 'Recall@5': 0.09781, 'Recall@10': 0.15643, 'Recall@100': 0.46355, 'Recall@1000': 0.90347}\n",
            "{'P@1': 0.2266, 'P@3': 0.17376, 'P@5': 0.15426, 'P@10': 0.12394, 'P@100': 0.04066, 'P@1000': 0.00854}\n",
            "{'R_cap@1': 0.21279, 'R_cap@3': 0.16533, 'R_cap@5': 0.15468, 'R_cap@10': 0.16586, 'R_cap@100': 0.4353, 'R_cap@1000': 0.84842}\n",
            "#######################################################################################################\n",
            "Attention: As one lanugage needs to be specified, german was chosen as it is the most common.\n",
            "MR-BM25_MXD_SLL_1000\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/2413 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 8/8 [01:49<00:00, 13.73s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.22021, 'NDCG@3': 0.19416, 'NDCG@5': 0.18155, 'NDCG@10': 0.18012, 'NDCG@100': 0.28781, 'NDCG@1000': 0.40035}\n",
            "{'MAP@1': 0.02877, 'MAP@3': 0.05277, 'MAP@5': 0.06582, 'MAP@10': 0.08192, 'MAP@100': 0.11312, 'MAP@1000': 0.12355}\n",
            "{'Recall@1': 0.02877, 'Recall@3': 0.06997, 'Recall@5': 0.10269, 'Recall@10': 0.16347, 'Recall@100': 0.47183, 'Recall@1000': 0.90584}\n",
            "{'P@1': 0.22021, 'P@3': 0.18404, 'P@5': 0.16149, 'P@10': 0.12904, 'P@100': 0.04167, 'P@1000': 0.00856}\n",
            "{'R_cap@1': 0.20679, 'R_cap@3': 0.17516, 'R_cap@5': 0.16214, 'R_cap@10': 0.17292, 'R_cap@100': 0.44308, 'R_cap@1000': 0.85064}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX-SLL_1000\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD_SLL_1000\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XZdu3OzmeXSG"
      },
      "outputs": [],
      "source": [
        "### Subset 100 ###\n",
        "queries, qrels, corpus = create_subset(100, queries, qrels, corpus)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "3xKFvivBeb9i",
        "outputId": "265ede41-38aa-4679-8fc0-1156a6092f97"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "RCE_BM25_MX_100\n",
            "#######################################################################################################\n",
            "SB_R_M1_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/525 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:05<00:00,  5.71s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.06931, 'NDCG@3': 0.06931, 'NDCG@5': 0.07174, 'NDCG@10': 0.07365, 'NDCG@100': 0.19027, 'NDCG@1000': 0.19177}\n",
            "{'MAP@1': 0.01071, 'MAP@3': 0.01854, 'MAP@5': 0.02231, 'MAP@10': 0.02803, 'MAP@100': 0.05022, 'MAP@1000': 0.05046}\n",
            "{'Recall@1': 0.01071, 'Recall@3': 0.02776, 'Recall@5': 0.04216, 'Recall@10': 0.07312, 'Recall@100': 0.40525, 'Recall@1000': 0.41032}\n",
            "{'P@1': 0.06931, 'P@3': 0.06601, 'P@5': 0.06535, 'P@10': 0.05347, 'P@100': 0.03604, 'P@1000': 0.00364}\n",
            "{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}\n",
            "#######################################################################################################\n",
            "SB_R_M2_MX_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/525 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:05<00:00,  5.84s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.11881, 'NDCG@3': 0.10952, 'NDCG@5': 0.11224, 'NDCG@10': 0.10551, 'NDCG@100': 0.19898, 'NDCG@1000': 0.19985}\n",
            "{'MAP@1': 0.02043, 'MAP@3': 0.03049, 'MAP@5': 0.03721, 'MAP@10': 0.04471, 'MAP@100': 0.06809, 'MAP@1000': 0.06824}\n",
            "{'Recall@1': 0.02043, 'Recall@3': 0.03537, 'Recall@5': 0.05329, 'Recall@10': 0.08054, 'Recall@100': 0.35261, 'Recall@1000': 0.35513}\n",
            "{'P@1': 0.11881, 'P@3': 0.09901, 'P@5': 0.10297, 'P@10': 0.08317, 'P@100': 0.03564, 'P@1000': 0.00359}\n",
            "{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"RCE_BM25_MX_100\")\n",
        "model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'\n",
        "run_rce_bm25(queries, qrels, corpus, model_name)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M1_MX_100\")\n",
        "# Model 1: paraphrase-albert-small-v2\n",
        "model_name = \"paraphrase-albert-small-v2\"\n",
        "run_sb_r(queries, qrels, corpus, model_name)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M2_MX_100\")\n",
        "# Model 2: distiluse-base-multilingual-cased-v1\n",
        "model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v1'\n",
        "run_sb_r(queries, qrels, corpus, model_name)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xM9HGfY_YOd5"
      },
      "outputs": [],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"SB_R_M3_MX_SSL_100\")\n",
        "# Model 3: xyz/sBert-swiss-legal-base\n",
        "model_name = 'xyz/sBert-swiss-legal-base'\n",
        "run_sb_r(queries, qrels, corpus, model_name)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"RCE_BM25_MX_100\")\n",
        "# Cross-encoder re-ranking on top of BM25, this time with the shortened queries (queries_short)\n",
        "model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'\n",
        "run_rce_bm25(queries_short, qrels, corpus, model_name)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-jgr_IWP03yD",
        "outputId": "699cb0cc-eb39-4491-8c3a-ad19ff6632c3"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "#######################################################################################################\n",
            "LR-BM25_MX-SLL_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/525 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:05<00:00,  5.46s/it]\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.44792, 'NDCG@3': 0.43547, 'NDCG@5': 0.39547, 'NDCG@10': 0.3697, 'NDCG@100': 0.5127, 'NDCG@1000': 0.59221}\n",
            "{'MAP@1': 0.0557, 'MAP@3': 0.1331, 'MAP@5': 0.15931, 'MAP@10': 0.19631, 'MAP@100': 0.26777, 'MAP@1000': 0.28189}\n",
            "{'Recall@1': 0.0557, 'Recall@3': 0.16642, 'Recall@5': 0.21655, 'Recall@10': 0.31046, 'Recall@100': 0.72702, 'Recall@1000': 0.99779}\n",
            "{'P@1': 0.44792, 'P@3': 0.41667, 'P@5': 0.35417, 'P@10': 0.2625, 'P@100': 0.06917, 'P@1000': 0.00991}\n",
            "{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}\n",
            "#######################################################################################################\n",
            "Attention: As one language needs to be specified, German was chosen as it is the most common.\n",
            "MR-BM25_MXD_SLL_100\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  0%|          | 0/525 [00:00<?, ?docs/s]\n",
            "que: 100%|██████████| 1/1 [00:03<00:00,  3.83s/it]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'NDCG@1': 0.53125, 'NDCG@3': 0.44811, 'NDCG@5': 0.41081, 'NDCG@10': 0.37753, 'NDCG@100': 0.52324, 'NDCG@1000': 0.60228}\n",
            "{'MAP@1': 0.07362, 'MAP@3': 0.13567, 'MAP@5': 0.16665, 'MAP@10': 0.20199, 'MAP@100': 0.27637, 'MAP@1000': 0.29058}\n",
            "{'Recall@1': 0.07362, 'Recall@3': 0.15661, 'Recall@5': 0.21891, 'Recall@10': 0.30757, 'Recall@100': 0.72761, 'Recall@1000': 0.99779}\n",
            "{'P@1': 0.53125, 'P@3': 0.41319, 'P@5': 0.35833, 'P@10': 0.26146, 'P@100': 0.06958, 'P@1000': 0.00991}\n",
            "{'R_cap@1': 0.50495, 'R_cap@3': 0.40429, 'R_cap@5': 0.3637, 'R_cap@10': 0.3356, 'R_cap@100': 0.69159, 'R_cap@1000': 0.94839}\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\n"
          ]
        }
      ],
      "source": [
        "print(\"#######################################################################################################\")\n",
        "print(\"LR-BM25_MX-SLL_100\")\n",
        "run_lr_bm25(queries, qrels, corpus)\n",
        "print(\"#######################################################################################################\")\n",
        "print(\"Attention: As one language needs to be specified, German was chosen as it is the most common.\")\n",
        "print(\"MR-BM25_MXD_SLL_100\")\n",
        "run_mr_bm25(queries, qrels, corpus, language_long)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [
        "gB6I0UKeRH9-",
        "7iE5Sr7e6Jaq",
        "nqotyXuIBPt6",
        "F02EG87WVONw",
        "GWUDth1hVe9R",
        "xhD_PLh4Qq0W",
        "P2xfHrVDQ3Fl",
        "RZGobu2pTN--",
        "DUYAubkgWN7W",
        "JHMa0HJ_StoP",
        "hjKcsd5QcYGK"
      ],
      "provenance": [],
      "include_colab_link": true
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "02084fae14ea48888906282f4d9394da": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_56776c9679d9488585ab0707f8d2e59b",
            "placeholder": "​",
            "style": "IPY_MODEL_9af32e24a9404f968dbf54f4904293b6",
            "value": " 63000/72344 [00:00&lt;00:00, 120978.33 examples/s]"
          }
        },
        "024128e36dd3466c802822faca9927a6": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "0bc607c41a6944d59900835edce97c4f": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_024128e36dd3466c802822faca9927a6",
            "placeholder": "​",
            "style": "IPY_MODEL_8fde17264ccc4fa9b2d19c8f1803e3c3",
            "value": " 1/1 [00:00&lt;00:00, 23.55it/s]"
          }
        },
        "0cf8897876d1499683d598fa8d517b4e": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_c998379ace074e3d81415d0ec147d93a",
            "placeholder": "​",
            "style": "IPY_MODEL_23e9eb2a15ff4941930edf8a58c0a5e2",
            "value": "100%"
          }
        },
        "1074c087f775426aab7f357ba3a3e99a": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_17b6dc5d458a4a76bdbf902802acc3c8",
              "IPY_MODEL_6a49050ccf7a4b17a0ce2d44f1da6c73",
              "IPY_MODEL_02084fae14ea48888906282f4d9394da"
            ],
            "layout": "IPY_MODEL_176b9564a619446f81b6de70b9cde277"
          }
        },
        "16720ddbeebf4336b60427e2f661b1d5": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "176b9564a619446f81b6de70b9cde277": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": "hidden",
            "width": null
          }
        },
        "17b6dc5d458a4a76bdbf902802acc3c8": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_ea693b004a7c463db11144513da6dd1c",
            "placeholder": "​",
            "style": "IPY_MODEL_fcbbf4cd015d44a29e4750d343d16d80",
            "value": "Filter:  87%"
          }
        },
        "17d682b70b5340ba8494fa7a49e14e94": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "2266afad9442498d83fe48cb4b0cff4f": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "2335c7f5eec24bf2aa494374e8cc2c7c": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_b39e9e2525e24d66baeacd5bc55764a0",
            "max": 1,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_300a990e24814347b4cf4649eb164dc9",
            "value": 1
          }
        },
        "23e9eb2a15ff4941930edf8a58c0a5e2": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "300a990e24814347b4cf4649eb164dc9": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "36af54b9ad784cc8a76245444ac98090": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "425284aec7d245de9e3f4b7f94be3c70": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "4662daf6953c4c90b3275a0c0a9aeec6": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "51a0394c3cc1473aaa9a2a836c37e1b3": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_c253d69ed2ea4d4cbb4b915b48b70878",
            "max": 72344,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_e58d3f09cdf044aebacacebcf05a9dc4",
            "value": 72344
          }
        },
        "51f5ac865f374cfe9f03958addf3644e": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "56776c9679d9488585ab0707f8d2e59b": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "5a84299de18d462f81e06fde5cc57e4b": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "5f2b5c1f6eb443a698d42f07879fa600": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_fb782d4a872447a5a5c37c21f79513e0",
              "IPY_MODEL_51a0394c3cc1473aaa9a2a836c37e1b3",
              "IPY_MODEL_91283355d1f646c997fe8016a675919d"
            ],
            "layout": "IPY_MODEL_b7ebc25caf924d9abd59d9ea3ff1cd93"
          }
        },
        "64b8b67d4f964f0f80858a73e8474cf4": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_0cf8897876d1499683d598fa8d517b4e",
              "IPY_MODEL_2335c7f5eec24bf2aa494374e8cc2c7c",
              "IPY_MODEL_7449ad3c85e94bb5a6863f1fdc125e48"
            ],
            "layout": "IPY_MODEL_ae06be8b67974593bd3ad105e6283313"
          }
        },
        "6a49050ccf7a4b17a0ce2d44f1da6c73": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_425284aec7d245de9e3f4b7f94be3c70",
            "max": 72344,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_f20b9bcc185346bdb9f43b90106818c6",
            "value": 72344
          }
        },
        "6b299948b30e40009b3fd38624f39a8c": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "7449ad3c85e94bb5a6863f1fdc125e48": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_aff454941f7d42bda9ff2d54291258a4",
            "placeholder": "​",
            "style": "IPY_MODEL_d4f24728c24345a7a17ffebcb7da5334",
            "value": " 1/1 [00:00&lt;00:00, 52.99it/s]"
          }
        },
        "79610e0c711c45a880529a909baf3d92": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_a3388734447c40478ed37d6813cfa770",
            "placeholder": "​",
            "style": "IPY_MODEL_2266afad9442498d83fe48cb4b0cff4f",
            "value": " 1/1 [00:00&lt;00:00, 47.97it/s]"
          }
        },
        "7f1d44013acc4b0caadce87bb4452171": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "877b5c86fa5346718be88b40db57771b": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_9007a8f6836244c8a7b875abbdb29d8f",
              "IPY_MODEL_f0845b95953c4d649c3261ff58f0259c",
              "IPY_MODEL_0bc607c41a6944d59900835edce97c4f"
            ],
            "layout": "IPY_MODEL_b922b379281b4aaeb37a580410050f11"
          }
        },
        "8aff1e77f342418fb2ba1b93bf6ac757": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_7f1d44013acc4b0caadce87bb4452171",
            "placeholder": "​",
            "style": "IPY_MODEL_17d682b70b5340ba8494fa7a49e14e94",
            "value": "100%"
          }
        },
        "8fde17264ccc4fa9b2d19c8f1803e3c3": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "9007a8f6836244c8a7b875abbdb29d8f": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_4662daf6953c4c90b3275a0c0a9aeec6",
            "placeholder": "​",
            "style": "IPY_MODEL_f5de52812a924c96bc4e3a40e46de0f6",
            "value": "100%"
          }
        },
        "91283355d1f646c997fe8016a675919d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_51f5ac865f374cfe9f03958addf3644e",
            "placeholder": "​",
            "style": "IPY_MODEL_a62acc28875441afba71eaf58e9c469d",
            "value": " 69000/72344 [00:00&lt;00:00, 112989.89 examples/s]"
          }
        },
        "949a51e0190d4a7ca5566dceaaee119d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_5a84299de18d462f81e06fde5cc57e4b",
            "max": 1,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_f2f0c20d8be74ad9866cf86d6e5609d9",
            "value": 1
          }
        },
        "9af32e24a9404f968dbf54f4904293b6": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "a3388734447c40478ed37d6813cfa770": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "a62acc28875441afba71eaf58e9c469d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "ae06be8b67974593bd3ad105e6283313": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "aeb62ff7ab3547abb70fc96c7baaf730": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_8aff1e77f342418fb2ba1b93bf6ac757",
              "IPY_MODEL_949a51e0190d4a7ca5566dceaaee119d",
              "IPY_MODEL_79610e0c711c45a880529a909baf3d92"
            ],
            "layout": "IPY_MODEL_6b299948b30e40009b3fd38624f39a8c"
          }
        },
        "aff454941f7d42bda9ff2d54291258a4": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "b39e9e2525e24d66baeacd5bc55764a0": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "b7ebc25caf924d9abd59d9ea3ff1cd93": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": "hidden",
            "width": null
          }
        },
        "b922b379281b4aaeb37a580410050f11": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "c1ea083cc6214778a566c750a6903005": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "c253d69ed2ea4d4cbb4b915b48b70878": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "c998379ace074e3d81415d0ec147d93a": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "d4f24728c24345a7a17ffebcb7da5334": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "d55e45473b98414881590f51012330ca": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "e58d3f09cdf044aebacacebcf05a9dc4": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "ea693b004a7c463db11144513da6dd1c": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "f0845b95953c4d649c3261ff58f0259c": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_36af54b9ad784cc8a76245444ac98090",
            "max": 1,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_16720ddbeebf4336b60427e2f661b1d5",
            "value": 1
          }
        },
        "f20b9bcc185346bdb9f43b90106818c6": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "f2f0c20d8be74ad9866cf86d6e5609d9": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "f5de52812a924c96bc4e3a40e46de0f6": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "fb782d4a872447a5a5c37c21f79513e0": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_d55e45473b98414881590f51012330ca",
            "placeholder": "​",
            "style": "IPY_MODEL_c1ea083cc6214778a566c750a6903005",
            "value": "Filter:  95%"
          }
        },
        "fcbbf4cd015d44a29e4750d343d16d80": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}