{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3502f69e",
   "metadata": {},
   "source": [
    "# Text data\n",
    "\n",
    "This analysis uses an unpublished dataset that was kindly shared with us by one of our colleagues. It can be recreated as follows:\n",
    "- Download the PDF of each NeurIPS/ICLR paper; the download links are provided in our dataset.\n",
    "- Convert each PDF to a `.txt` file and place it in the directory for its venue and year, e.g. `paper_data_txt/NeurIPS/2022` or `paper_data_txt/ICLR/2023` (PyPDF2, for example, can do the conversion).\n",
    "- Now you're ready to run the code!"
   ]
  },
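  {
   "cell_type": "markdown",
   "id": "a3c1e5f7",
   "metadata": {},
   "source": [
    "The conversion step described above could be sketched as follows. This is a minimal sketch, not the script used to build the dataset: it assumes PyPDF2 2.0 or later (which provides `PdfReader`) is installed, and the helper names `txt_path_for` and `convert_pdf` as well as the `paper_data_pdf` input directory are hypothetical. Extraction quality varies between PDFs, so a few converted files should be spot-checked by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4d2f6a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def txt_path_for(pdf_path, venue, year, root=\"paper_data_txt\"):\n",
    "    \"\"\"Return the .txt path that the analysis expects for a downloaded PDF.\"\"\"\n",
    "    return Path(root) / venue / str(year) / (Path(pdf_path).stem + \".txt\")\n",
    "\n",
    "\n",
    "def convert_pdf(pdf_path, txt_path):\n",
    "    \"\"\"Extract the text of every page of pdf_path and write it to txt_path.\"\"\"\n",
    "    # Assumption: PyPDF2 >= 2.0 is installed; imported here so txt_path_for works without it\n",
    "    from PyPDF2 import PdfReader\n",
    "\n",
    "    reader = PdfReader(str(pdf_path))\n",
    "    # extract_text() can return an empty string for pages without a text layer\n",
    "    pages = [page.extract_text() or \"\" for page in reader.pages]\n",
    "    txt_path = Path(txt_path)\n",
    "    txt_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    # Assumption: UTF-8 output; the counting cells read with the platform default encoding\n",
    "    txt_path.write_text(\"\\n\".join(pages), encoding=\"utf-8\")\n",
    "\n",
    "\n",
    "# Hypothetical usage, assuming the PDFs were saved under paper_data_pdf/NeurIPS/2022:\n",
    "# for pdf in Path(\"paper_data_pdf/NeurIPS/2022\").glob(\"*.pdf\"):\n",
    "#     convert_pdf(pdf, txt_path_for(pdf, \"NeurIPS\", 2022))"
   ]
  },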
  {
   "cell_type": "markdown",
   "id": "2bf183b1",
   "metadata": {},
   "source": [
    "## Reproducibility Checklist / Statement presence analysis\n",
    "\n",
    "- NeurIPS states: \"Do not remove the checklist: The papers not including the checklist will be desk rejected.\" Since 2022, its call for papers has also required the checklist to be part of the paper (\"checklist must appear in the submitted PDF\").\n",
    "- ICML has the same checklist but does not state how it should be delivered.\n",
    "- ICLR has an optional reproducibility statement that does not count towards the page limit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d26c18c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Phrases indicating a NeurIPS checklist (matched case-insensitively as substrings):\n",
    "neurips_phrases = [\"1. For all authors...\", \"Reproducibility Checklist\", \"Paper Checklist\", \"Neur IPS Paper Checklist\", \"NeurIPS Paper Checklist\"]\n",
    "# Phrases indicating an ICLR reproducibility statement:\n",
    "iclr_phrases = [\"Reproducibility Statement\", \"Reproducibility\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "70ab3d93",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Directories with the papers\n",
    "neurips_path = Path(\"paper_data_txt/NeurIPS\")\n",
    "iclr_path = Path(\"paper_data_txt/ICLR\")\n",
    "\n",
    "# Counts from https://github.com/lixin4ever/Conference-Acceptance-Rate\n",
    "\n",
    "results = {\n",
    "    \"NeurIPS\": {\n",
    "        2022: {\"reproducibility_count\": 0, \"total_papers\": 2671},\n",
    "        2023: {\"reproducibility_count\": 0, \"total_papers\": 3218},\n",
    "        2024: {\"reproducibility_count\": 0, \"total_papers\": 4037},\n",
    "    },\n",
    "    \"ICLR\": {\n",
    "        2022: {\"reproducibility_count\": 0, \"total_papers\": 1574},\n",
    "        2023: {\"reproducibility_count\": 0, \"total_papers\": 2250},\n",
    "        2024: {\"reproducibility_count\": 0, \"total_papers\": 3706},\n",
    "    }\n",
    "}"
   ]
  },
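  {
   "cell_type": "markdown",
   "id": "c5e3a7b9",
   "metadata": {},
   "source": [
    "Because `total_papers` comes from an external list while the hits are counted from the files on disk, the percentages are only meaningful if every accepted paper has a `.txt` file. A quick coverage check could look like the sketch below; `count_txt_files` and `coverage_report` are hypothetical helper names and are not part of the original analysis. With the variables defined above it could be called as `coverage_report(\"paper_data_txt\", results)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6f4b8c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def count_txt_files(directory):\n",
    "    \"\"\"Count the .txt files directly inside directory (0 if it does not exist).\"\"\"\n",
    "    directory = Path(directory)\n",
    "    if not directory.is_dir():\n",
    "        return 0\n",
    "    return sum(1 for p in directory.iterdir() if p.suffix == \".txt\")\n",
    "\n",
    "\n",
    "def coverage_report(root, results):\n",
    "    \"\"\"Map (venue, year) to (files found on disk, expected total_papers).\"\"\"\n",
    "    report = {}\n",
    "    for venue, years in results.items():\n",
    "        for year, stats in years.items():\n",
    "            # Assumption: files live under <root>/<venue>/<year>, as in the setup steps\n",
    "            found = count_txt_files(Path(root) / venue / str(year))\n",
    "            report[(venue, year)] = (found, stats[\"total_papers\"])\n",
    "    return report"
   ]
  },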
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "8d0d5647",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count NeurIPS papers that contain at least one checklist phrase\n",
    "for year in range(2022, 2025):\n",
    "    for paper in (neurips_path / str(year)).iterdir():\n",
    "        if paper.suffix != \".txt\":\n",
    "            continue\n",
    "        # Lowercase once so the comparison is case-insensitive\n",
    "        text = paper.read_text().lower()\n",
    "        # any() stops at the first match, so each paper is counted at most once\n",
    "        if any(phrase.lower() in text for phrase in neurips_phrases):\n",
    "            results[\"NeurIPS\"][year][\"reproducibility_count\"] += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "76629898",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count ICLR papers that contain at least one reproducibility phrase\n",
    "for year in range(2022, 2025):\n",
    "    for paper in (iclr_path / str(year)).iterdir():\n",
    "        if paper.suffix != \".txt\":\n",
    "            continue\n",
    "        # Lowercase once so the comparison is case-insensitive\n",
    "        text = paper.read_text().lower()\n",
    "        # any() stops at the first match, so each paper is counted at most once\n",
    "        if any(phrase.lower() in text for phrase in iclr_phrases):\n",
    "            results[\"ICLR\"][year][\"reproducibility_count\"] += 1"
   ]
  },
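  {
   "cell_type": "markdown",
   "id": "e7a5c9d1",
   "metadata": {},
   "source": [
    "The two counting loops above differ only in the directory and the phrase list, so they could share one helper. The sketch below is one possible refactoring; `contains_any` and `count_matching_papers` are hypothetical names. Unlike the loops above, it also collapses runs of whitespace before matching, on the assumption that PDF-to-text conversion can break a heading across lines, so its counts may differ from the ones reported in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8b6d0e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def contains_any(text, phrases):\n",
    "    \"\"\"Return True if any phrase occurs in text, ignoring case and whitespace runs.\"\"\"\n",
    "    # Assumption: collapsing whitespace lets a phrase match across line breaks\n",
    "    normalized = re.sub(r\"\\s+\", \" \", text).lower()\n",
    "    return any(re.sub(r\"\\s+\", \" \", phrase).lower() in normalized for phrase in phrases)\n",
    "\n",
    "\n",
    "def count_matching_papers(directory, phrases):\n",
    "    \"\"\"Count the .txt files in directory that contain at least one phrase.\"\"\"\n",
    "    return sum(\n",
    "        1\n",
    "        for paper in Path(directory).iterdir()\n",
    "        if paper.suffix == \".txt\" and contains_any(paper.read_text(), phrases)\n",
    "    )\n",
    "\n",
    "\n",
    "# Hypothetical usage with the variables defined earlier in this notebook:\n",
    "# count_matching_papers(iclr_path / \"2022\", iclr_phrases)"
   ]
  },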
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "5038a5b6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NeurIPS 2022: 2105 / 2671 = 78.80943466866343%\n",
      "NeurIPS 2023: 2 / 3218 = 0.06215040397762585%\n",
      "NeurIPS 2024: 3990 / 4037 = 98.83576913549665%\n",
      "\n",
      "ICLR 2022: 455 / 1574 = 28.907242693773828%\n",
      "ICLR 2023: 474 / 2250 = 21.066666666666666%\n",
      "ICLR 2024: 585 / 3706 = 15.785213167835943%\n"
     ]
    }
   ],
   "source": [
    "# Report the share of papers per venue and year that matched at least one phrase\n",
    "for year, stats in results[\"NeurIPS\"].items():\n",
    "    count, total = stats[\"reproducibility_count\"], stats[\"total_papers\"]\n",
    "    print(f\"NeurIPS {year}: {count} / {total} = {(count / total) * 100}%\")\n",
    "\n",
    "print()\n",
    "\n",
    "for year, stats in results[\"ICLR\"].items():\n",
    "    count, total = stats[\"reproducibility_count\"], stats[\"total_papers\"]\n",
    "    print(f\"ICLR {year}: {count} / {total} = {(count / total) * 100}%\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
