{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3df1e888",
   "metadata": {},
   "source": [
    "### Working with CHRONOBERG \n",
    "\n",
    "In this notebook, we walk through the Chronoberg dataset explaining how to use it to perform lexical analysis on prefered temporal slice and lexical analysis. The dataset is publicly available at [ChronoBerg](https://huggingface.co/datasets/sdp56/ChronoBerg/tree/main).\n",
    "\n",
    "The dataset is made available in two variants: \n",
    "- The non-annotated raw ChronoBerg\n",
    "- The annotated ChronoBerg\n",
    "\n",
    "In each version, the text is grouped by years from 1750-2000s. The annotated version employs a further splitting of texts into sentences. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a72488a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json, re\n",
    "import os\n",
    "import nltk\n",
    "import pandas as pd \n",
    "import torch\n",
    "from nltk.tokenize import sent_tokenize\n",
    "from nltk.tokenize import sent_tokenize\n",
    "import itertools \n",
    "from tqdm import tqdm\n",
    "from gensim.models.word2vec import Word2Vec\n",
    "from sklearn.model_selection import train_test_split\n",
    "import gensim \n",
    "from IPython.display import display, Markdown\n",
    "from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces, strip_numeric\n",
    "from gensim.parsing.preprocessing import preprocess_string, preprocess_documents"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b24a2988",
   "metadata": {},
   "source": [
    "#### Loading the dataset\n",
    "\n",
    "We will first load the raw Chronoberg dataset and walk the steps of preprocessing and loading. \n",
    "One can download the raw Chornoberg dataset at [Chrono_raw](https://huggingface.co/datasets/chb19/ChronoBerg/blob/main/dataset/ChronoBerg_raw.jsonl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6490f1e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "249it [00:47,  5.26it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data loaded\n",
      "------------------------------------------\n",
      "The number of years in the dataset is:  249\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "data_dict = {'year': [], 'text': []}\n",
    "path_to_your_data_folder = ''\n",
    "with open(os.path.join(path_to_your_data_folder, 'ChronoBerg_raw.jsonl'), 'r', encoding='utf-8') as dataset_in:\n",
    "    for line in tqdm(dataset_in):\n",
    "        file = json.loads(line)\n",
    "        text = file['text']\n",
    "        text = text.replace('\\n', ' ')\n",
    "        # Remove all kinds of quotation marks\n",
    "        file['text'] = text\n",
    "        data_dict['text'].append(file['text'])\n",
    "        data_dict['year'].append(file['year'])\n",
    "\n",
    "print('data loaded')\n",
    "\n",
    "print(\"------------------------------------------\")\n",
    "print(\"The number of years in the dataset is: \", len(set(data_dict['year'])) )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64420a92",
   "metadata": {},
   "source": [
    "### Visualize sample textual data from a particular year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "bbd69088",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "** Time_Interval- 1900**:  [Illustration]     The Wonderful Wizard of Oz  by L. Frank Baum   This book is dedicated to my good friend & comrade My Wife L.F.B.   Contents   Introduction  Chapter I. The Cyclone  Chapter II. The Council with the Munchkins  Chapter III. How Dorothy Saved the Scarecrow  Chapter IV. The Road Through the Forest  Chapter V. The Rescue of the Tin Woodman  Chapter VI.  The Cowardly Lion  Chapter VII. The Journey to the Great Oz  Chapter VIII. The Deadly Poppy Field  Chapter IX. The Queen of the Field Mice  Chapter X. The Guardian of the Gates  Chapter XI. The Emerald City of Oz  Chapter XII. The Search for the Wicked Witch  Chapter XIII. The Rescue  Chapter XIV. The Winged Monkeys  Chapter XV. The Discovery of Oz, the Terrible  Chapter XVI. The Magic Art of the Great Humbug  Chapter XVII. How the Balloon Was Launched  Chapter XVIII. Away to the South  Chapter XIX. Attacked by the Fighting Trees  Chapter XX. The Dainty China Country  Chapter XXI. The Lion Becomes the King of Beasts  Chapter XXII. The Country of the Quadlings  Chapter XXIII. Glinda The Good Witch Grants Dorothyâs Wish  Chapter XXIV. Home Again     Introduction   Folklore, legends, myths and fairy tales have followed childhood through the ages, for every healthy youngster has a wholesome and instinctive love for stories fantastic, marvelous and manifestly unreal. The winged fairies of Grimm and Andersen have brought more happiness to childish hearts than all other human creations.  Yet the old time fairy tale, h"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#### Show a sample text from a particular year\n",
    "year = 1900\n",
    "for i in range(len(data_dict['year'])):\n",
    "    if data_dict['year'][i] == year:\n",
    "        display(Markdown(f\"** Time_Interval- {data_dict['year'][i]}**: {data_dict['text'][i][:1500]}\"))\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d6acd7af",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Short the text by year\n",
    "df = pd.DataFrame(data_dict)\n",
    "df.head()\n",
    "df = df.sort_values(by=['year'], ascending=True)\n",
    "\n",
    "### Recreate the data_dict based on the sorted dataframe\n",
    "data_dict = {'year': [], 'text': []}\n",
    "for i in range(len(df)):\n",
    "    data_dict['year'].append(df['year'].iloc[i])\n",
    "    data_dict['text'].append(df['text'].iloc[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7514e0df",
   "metadata": {},
   "source": [
    "#### Transform the block of text into a set of sentences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cadf18e9",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249/249 [35:31<00:00,  8.56s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total sentences: 88957134\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "text_total = []\n",
    "for text in tqdm(data_dict['text']):\n",
    "    sentence = sent_tokenize(text)\n",
    "    text_total.append(sentence)\n",
    "\n",
    "text_total = list(itertools.chain(*text_total))\n",
    "print(f'Total sentences: {len(text_total)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7a243ea",
   "metadata": {},
   "source": [
    "### Extract Textual data from a particular year "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "de4c43fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of sentences in year 1771: 10323\n",
      "------------------------------ Sample sentences ------------------------------\n",
      "Sentence 1:       THE EXPEDITION OF HUMPHRY CLINKER  by TOBIAS SMOLLETT     To Mr HENRY DAVIS, Bookseller, in London.\n",
      "Sentence 2: ABERGAVENNY, Aug. 4.\n",
      "Sentence 3: RESPECTED SIR,  I have received your esteemed favour of the 13th ultimo, whereby it appeareth, that you have perused those same Letters, the which were delivered unto you by my friend, the reverend Mr Hugo Behn; and I am pleased to find you think they may be printed with a good prospect of success; in as much as the objections you mention, I humbly conceive, are such as may be redargued, if not entirely removed--And, first, in the first place, as touching what prosecutions may arise from printing the private correspondence of persons still living, give me leave, with all due submission, to observe, that the Letters in question were not written and sent under the seal of secrecy; that they have no tendency to the mala fama, or prejudice of any person whatsoever; but rather to the information and edification of mankind: so that it becometh a sort of duty to promulgate them in usum publicum.\n",
      "Sentence 4: Besides, I have consulted Mr Davy Higgins, an eminent attorney of this place, who, after due inspection and consideration, declareth, That he doth not think the said Letters contain any matter which will be held actionable in the eye of the law.\n",
      "Sentence 5: Finally, if you and I should come to a right understanding, I do declare in verbo sacerdotis, that, in case of any such prosecution, I will take the whole upon my own shoulders, even quoad fine and imprisonment, though, I must confess, I should not care to undergo flagellation: Tam ad turpitudinem, quam ad amaritudinem poenoe spectans--Secondly, concerning the personal resentment of Mr Justice Lismahago, I may say, non flocci facio--I would not willingly vilipend any Christian, if, peradventure, he deserveth that epithet: albeit, I am much surprised that more care is not taken to exclude from the commission all such vagrant foreigners as may be justly suspected of disaffection to our happy constitution, in church and state--God forbid that I should be so uncharitable, as to affirm, positively, that the said Lismahago is no better than a Jesuit in disguise; but this I will assert and maintain, totis viribus, that, from the day he qualified, he has never been once seen intra templi parietes, that is to say, within the parish church.\n"
     ]
    }
   ],
   "source": [
    "def extract_text_by_year(year, data_dict):\n",
    "    sents = []\n",
    "    texts = data_dict['text'][year]\n",
    "    sents.extend(sent_tokenize(texts))\n",
    "    return sents, texts\n",
    "\n",
    "#### Transform the block of text into a set of sentences\n",
    "year = 20\n",
    "sents, texts = extract_text_by_year(year, data_dict)\n",
    "print(f'Number of sentences in year {data_dict[\"year\"][year]}: {len(sents)}')\n",
    "print(\"------------------------------ Sample sentences ------------------------------\")\n",
    "for i in range(len(sents[:5])):\n",
    "    print(f'Sentence {i+1}: {sents[i]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "626b907e",
   "metadata": {},
   "source": [
    "#### Polish the text if needed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c996191f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocess_text(texts):\n",
    "    for i in range(len(texts)):\n",
    "        text = texts[i]\n",
    "        text = re.sub(r'[\\'\\\"]', '', text)\n",
    "        text = re.sub('\\.', ' ', text)\n",
    "        text = re.sub(r'[\\x80-\\xFF]', '', text)\n",
    "        text = re.sub(r'\\d+','', text) \n",
    "        # Reduce all consecutive whitespace to a single whitespace\n",
    "        text = re.sub(r'\\s+', ' ', text)\n",
    "        texts[i] = text\n",
    "    return texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "54477090",
   "metadata": {},
   "outputs": [],
   "source": [
    "texts_ = preprocess_text(sents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a34e9bc",
   "metadata": {},
   "source": [
    "#### Load preprocessed Chronoberg\n",
    "\n",
    "We have also made available a preprocessed version of Chronoberg.  One can download the dataset at [Chronoberg_preprocessed](https://huggingface.co/datasets/chb19/ChronoBerg/blob/main/dataset/Chronoberg_preprocessed.jsonl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70ef600b",
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_data_folder = ''\n",
    "yearly_sentence_dict = {}\n",
    "with open(os.path.join(path_to_data_folder, 'ChronoBerg_preprocessed.jsonl'), 'r', encoding='utf-8') as f:\n",
    "    for line in f:\n",
    "        year, sentence = json.loads(line)['year'], json.loads(line)['sentence']\n",
    "        if year not in yearly_sentence_dict:\n",
    "            yearly_sentence_dict[year] = {}\n",
    "        yearly_sentence_dict[year] = sentence\n",
    "\n",
    "print(f'Total years in the preprocessed dataset: {len(yearly_sentence_dict.keys())}')\n",
    "print(f'Total sentences in the preprocessed dataset: {sum([len(yearly_sentence_dict[year]) for year in yearly_sentence_dict.keys()])}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13ce0970",
   "metadata": {},
   "source": [
    "#### Statistics\n",
    "\n",
    "- Number of sentences and number of words \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2d485d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Calculate the number of words in the preprocessed dataset\n",
    "\n",
    "### load the preprocessed dataset\n",
    "\n",
    "text_total = []\n",
    "for text in tqdm(data_dict['text']):\n",
    "    sentence = sent_tokenize(text)\n",
    "    text_total.append(sentence)\n",
    "\n",
    "text_total = list(itertools.chain(*text_total))\n",
    "print(f'Total sentences: {len(text_total)}')\n",
    "\n",
    "for year, sentences in yearly_sentence_dict.items():\n",
    "    for text in sentences:\n",
    "        word_list = nltk.tokenize.word_tokenize(text)\n",
    "        count += len(word_list)\n",
    "\n",
    "print(f'Total words in the preprocessed dataset: {count}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f5fd1b8",
   "metadata": {},
   "source": [
    "### Create train and test splits\n",
    "\n",
    "We will now look into ways of creating train and test splits for the different time-intervals.\n",
    "In our paper, we have reported experimental results evaluating LLMS in a sequential setup based on valence-stable or valence-shifting words. Consequently, a strategic way to create train and test splits for each time interval is to split the dataset through the occurence of words in specific intervals. \n",
    "\n",
    "\n",
    "Load the pre-processed sentence splitted dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f091a872",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_5056/4142511867.py:1: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
      "  chrono_1750 = torch.load('/app/src/ChronoBerg/cade/new_lexicons/cl_score_min_1750.pth')\n"
     ]
    }
   ],
   "source": [
    "path_to_data_folder = \"\"\n",
    "\n",
    "file = open(os.path.join(path_to_data_folder, 'ChronoBerg_preprocessed.jsonl'), 'w', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b5d3ecc",
   "metadata": {},
   "source": [
    "#### Get block of sentences per time-interval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f84bc829",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_block_sentence_per_era(chrono_dict):\n",
    "    block_sent = []\n",
    "    for year, sent_value in chrono_dict.items():\n",
    "        block_sent.append(list(sent_value))\n",
    "    \n",
    "    block_sent = list(itertools.chain(*block_sent))\n",
    "    #print(f'Number of sentences in year {data_dict[\"year\"][year]} with at least one chrono-berg word: {len(block_sent)}')\n",
    "    return block_sent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8484da87",
   "metadata": {},
   "outputs": [],
   "source": [
    "sent_b = get_block_sentence_per_era(file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a50f3b54",
   "metadata": {},
   "source": [
    "- Specify some words and create train and test splits based on that word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57e128c2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of sentences with at least one chrono-berg word: 68626\n"
     ]
    }
   ],
   "source": [
    "### Specify your words here\n",
    "word_list = ['act', 'action', 'active', 'actor', 'activity', 'actual', 'actually', 'actress', 'acts']\n",
    "sent_app = []\n",
    "sent_non_app = []\n",
    "for word in word_list:\n",
    "    print(\"Preprocessing the word: \", word)\n",
    "    for snt in sent_b:\n",
    "        if word in snt:\n",
    "            sent_app.append(snt)\n",
    "        else:\n",
    "            sent_non_app.append(snt)\n",
    "print(f'Number of sentences with at least one chrono-berg word: {len(sent_app)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "077650cf",
   "metadata": {},
   "source": [
    "##### Create train and test splits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d45656f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_splits(sent_app, sent_non_app, train_ratio=0.8):\n",
    "\n",
    "    train_app, test_app = train_test_split(sent_app, train_size=train_ratio, random_state=42)\n",
    "    train_non_app, test_non_app = train_test_split(sent_non_app, train_size=train_ratio, random_state=42)\n",
    "\n",
    "    train_data = train_app + train_non_app\n",
    "    test_data = test_app + test_non_app\n",
    "\n",
    "    print(f'Number of training samples: {len(train_data)}')\n",
    "    print(f'Number of testing samples: {len(test_data)}')\n",
    "\n",
    "    return train_data, test_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0390e625",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
