{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "76364253",
   "metadata": {},
   "source": [
    "## Determining Affective Polarity of Sentences in ChronoBerg & Comparisons with hate-check models\n",
    "\n",
    "In this notebook, we walk through the steps to determine the affective polarity of sentences (negative/positive) and compare it with the connotations assigned by hate-check tools such as Perspective API, RoBERTa, and the OpenAI moderation tool.\n",
    "\n",
    "For Perspective API and the OpenAI moderation tool, you need to supply your own API keys.\n",
    "\n",
    "The dataset and lexicons are available on Hugging Face: [CHRONOBERG](https://huggingface.co/datasets/sdp56/ChronoBerg/tree/main)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "826ac045",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n",
      "[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n",
      "[nltk_data]   Package punkt_tab is already up-to-date!\n",
      "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
      "[nltk_data]   Package wordnet is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger_eng to\n",
      "[nltk_data]     /root/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-\n",
      "[nltk_data]       date!\n"
     ]
    }
   ],
   "source": [
    "### Import necessary libraries\n",
    "import torch\n",
    "import json\n",
    "import pandas as pd\n",
    "from collections import defaultdict\n",
    "import math\n",
    "import nltk\n",
    "import numpy as np\n",
    "from googleapiclient import discovery\n",
    "import time\n",
    "import itertools\n",
    "from tqdm import tqdm\n",
    "import re\n",
    "from openai import OpenAI\n",
    "import os\n",
    "from nltk.tokenize import sent_tokenize\n",
    "nltk.download('stopwords')\n",
    "from nltk.corpus import stopwords\n",
    "nltk.download('punkt_tab')\n",
    "nltk.download('wordnet')\n",
    "nltk.download('averaged_perceptron_tagger_eng')\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "from nltk.corpus import wordnet\n",
    "#from nltk.tokenize import word_tokenize\n",
    "import argparse\n",
    "#!pip install --upgrade transformers\n",
    "import transformers\n",
    "from transformers import pipeline\n",
    "#os.chdir('/app/src/Chronoberg/')\n",
    "## Load Helper functions to load dataset\n",
    "from Dataset_statistics.load_data import load_data, preprocess_text, extract_sentence_splits_by_year, extract_text_by_year\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b0100fb",
   "metadata": {},
   "source": [
    "### Load the dataset and Valence lexicons "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "7bdd529d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading lexicons...\n",
      "Loaded 87360 words from 1750 lexicon\n",
      "Loaded 133886 words from 1800 lexicon\n",
      "Loaded 181955 words from 1850 lexicon\n",
      "Loaded 199535 words from 1900 lexicon\n",
      "Loaded 85000 words from 1950 lexicon\n",
      "All lexicons loaded.\n"
     ]
    }
   ],
   "source": [
    "print(\"Loading lexicons...\")\n",
    "\n",
    "path_lexicons = ''\n",
    "\n",
    "valence_1750 = torch.load(path_lexicons + 'valence_dic_1750_new.pt', weights_only=False)\n",
    "valence_1800 = torch.load(path_lexicons + 'valence_dic_1800_new.pt', weights_only=False)\n",
    "valence_1850 = torch.load(path_lexicons + 'valence_dic_1850_new.pt', weights_only=False)\n",
    "valence_1900 = torch.load(path_lexicons + 'valence_dic_1900_new.pt', weights_only=False)\n",
    "valence_1950 = torch.load(path_lexicons + 'valence_dic_1950_new.pt', weights_only=False)\n",
    "print(f\"Loaded {len(valence_1750)} words from 1750 lexicon\")\n",
    "print(f\"Loaded {len(valence_1800)} words from 1800 lexicon\")\n",
    "print(f\"Loaded {len(valence_1850)} words from 1850 lexicon\")\n",
    "print(f\"Loaded {len(valence_1900)} words from 1900 lexicon\")\n",
    "print(f\"Loaded {len(valence_1950)} words from 1950 lexicon\")\n",
    "\n",
    "print(\"All lexicons loaded.\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "39a7c382",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading the dataset...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "249it [00:47,  5.22it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data loaded\n",
      "data sorted\n",
      "Extracted 1214867 sentences from the years [1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998]\n"
     ]
    }
   ],
   "source": [
    "print(\"Loading the dataset...\")\n",
    "\n",
    "\n",
    "data_dict = load_data(data_path= './ChronoBerg_raw.jsonl')\n",
    "\n",
    "### Extract sentences and preprocess them\n",
    "sents, years = extract_sentence_splits_by_year(year=[1750+i for i in range(200,249)], data_dict=data_dict)\n",
    "sents = preprocess_text(sents)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7082f16e",
   "metadata": {},
   "source": [
    "- Define helper functions:\n",
    "    1. calculate_valence: assigns a single valence score to a group of tokens (the mean over the tokens found in the lexicon)\n",
    "    2. split_into_clauses: splits a sentence into clauses at commas, semicolons, `but`, and `and`\n",
    "    3. get_sentence_score: assigns a single valence score to a sentence (the minimum over its clause scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "f7ec899e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_valence(tokens, file, negation_words):\n",
    "    # Mean valence of the tokens found in the lexicon `file`; None if no token is covered.\n",
    "    # `negation_words` is accepted for a uniform interface but is not used here.\n",
    "    total_sc = []\n",
    "    for tok in tokens:\n",
    "        try:\n",
    "            total_sc.append(float(file[tok]))\n",
    "        except (KeyError, TypeError, ValueError):\n",
    "            # Token is missing from the lexicon or its value is not numeric: skip it\n",
    "            continue\n",
    "    return np.mean(total_sc) if total_sc else None\n",
    "\n",
    "def split_into_clauses(text):\n",
    "    # Split clauses using simple punctuation-based heuristic\n",
    "    clauses = re.split(r'[;,]|(?:\\sbut\\s)|(?:\\band\\b)', text, flags=re.IGNORECASE)\n",
    "    return [clause.strip() for clause in clauses if clause.strip()]\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "negation_words = [\"no\", \"not\", \"n't\", \"never\", \"none\", \"nobody\",\n",
    "    \"nothing\", \"neither\", \"nowhere\", \"hardly\",\n",
    "    \"scarcely\", \"barely\", \"without\"\n",
    "]\n",
    "stopwords_ = list(set(stopwords.words('english')) - set(negation_words))\n",
    "def get_sentence_score(sentence, lexicon, negation_words, stopwords=stopwords_):\n",
    "    # `sentence` is a list of strings. Each string is scored as the minimum of its\n",
    "    # clause-level valences; if several strings are passed, only the last score is returned.\n",
    "    sc_ = []  # defined up front so an empty input returns None instead of raising NameError\n",
    "    for text in tqdm(sentence):\n",
    "\n",
    "        clauses = split_into_clauses(text)\n",
    "        sc_ = []\n",
    "        for clause in clauses:\n",
    "            word_list = nltk.tokenize.word_tokenize(clause.lower())\n",
    "            pos_tags = nltk.pos_tag(word_list)\n",
    "            sentiment_words = []\n",
    "            for word, pos in pos_tags:\n",
    "                # Keep adjectives, verbs, lexicon entries, negations, and adverbs as sentiment carriers\n",
    "                if pos.startswith(('JJ', 'VB')):\n",
    "                    sentiment_words.append(word)\n",
    "                elif word in lexicon:  # direct membership test; avoids rebuilding the key list for every word\n",
    "                    sentiment_words.append(word)\n",
    "                elif word in negation_words:\n",
    "                    sentiment_words.append(word)\n",
    "                elif pos.startswith('RB'):\n",
    "                    sentiment_words.append(lemmatizer.lemmatize(word, pos='a'))\n",
    "\n",
    "            if sentiment_words == []:\n",
    "                sentiment_words = word_list\n",
    "            tokens = [word for word in sentiment_words if word.isalpha()]\n",
    "\n",
    "            tokens = [token for token in tokens if token not in stopwords]\n",
    "\n",
    "            if tokens == []:\n",
    "                tokens = word_list\n",
    "                tokens = [token for token in tokens if token not in stopwords]\n",
    "\n",
    "            valence = calculate_valence(tokens, lexicon, negation_words)\n",
    "            if valence is not None:\n",
    "                sc_.append(valence)\n",
    "    return min(sc_) if sc_ else None"
   ]
  },
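  {
   "cell_type": "markdown",
   "id": "c3f1a2b4",
   "metadata": {},
   "source": [
    "The scoring rule used by `get_sentence_score` (mean valence per clause, then the minimum across clauses) can be traced on a toy example. This is a minimal sketch rather than the notebook's pipeline: `toy_lexicon` and `toy_sentence_score` are hypothetical names, the valence values are made up for illustration, and POS tagging, stopword removal, and lemmatization are left out.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from statistics import mean\n",
    "\n",
    "# Hypothetical toy lexicon with made-up valence values (not taken from the ChronoBerg lexicons)\n",
    "toy_lexicon = {'pleasant': 0.6, 'filthy': -0.7, 'diseased': -0.6}\n",
    "\n",
    "def toy_sentence_score(sentence, lexicon):\n",
    "    # Same punctuation/conjunction heuristic as split_into_clauses\n",
    "    parts = re.split(r'[;,]|\\sbut\\s|\\band\\b', sentence, flags=re.IGNORECASE)\n",
    "    clause_means = []\n",
    "    for clause in (p.strip() for p in parts if p.strip()):\n",
    "        words = re.findall(r'[a-z]+', clause.lower())\n",
    "        scores = [lexicon[w] for w in words if w in lexicon]\n",
    "        if scores:\n",
    "            clause_means.append(mean(scores))\n",
    "    # The most negative clause decides the sentence score\n",
    "    return min(clause_means) if clause_means else None\n",
    "\n",
    "print(toy_sentence_score('The supper was pleasant, but the inn was filthy and diseased.', toy_lexicon))\n",
    "```\n",
    "\n",
    "Taking the minimum means that one strongly negative clause determines the sentence score, even when other clauses are positive."
   ]
  },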
  {
   "cell_type": "markdown",
   "id": "4af0e98e",
   "metadata": {},
   "source": [
    "#### Determining Affective connotation of a sentence using Valence scores\n",
    "\n",
    "Use the helper functions defined above to assign an affective polarity to a sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "cd43e9c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.19s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Valence score for the sentence during 1750s: 0.37\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "## pass your sentence here to analyze\n",
    "your_sentence  = ['The conversation at supper was very gay.']\n",
    "\n",
    "### Set the lexicon to the specified time period\n",
    "lexicon = valence_1750\n",
    "\n",
    "score = get_sentence_score(your_sentence, lexicon, negation_words)\n",
    "print(f\"The Valence score for the sentence during 1750s: {score}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f70277b",
   "metadata": {},
   "source": [
    "## Classify the connotation of a sentence using Hate-Check Tools\n",
    "\n",
    "We will use three different hate-check tools: RoBERTa, Perspective API, and the OpenAI Moderation Tool."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef2ba010",
   "metadata": {},
   "source": [
    "- RoBERTa + Perspective API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "4cbd999d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cuda:0\n"
     ]
    }
    ],
   "source": [
    "\n",
    "### Supply your Perspective API key here, e.g. through the PERSPECTIVE_API_KEY environment variable (do not commit real keys)\n",
    "### RoBERTa is free to use and does not need an API key\n",
    "API_KEY = os.environ.get('PERSPECTIVE_API_KEY', 'YOUR_PERSPECTIVE_API_KEY')\n",
    "\n",
    "pipe_fb_roberta = pipeline(\"text-classification\", model=\"facebook/roberta-hate-speech-dynabench-r4-target\")\n",
    "\n",
    "# Flag hateful sentences using RoBERTa\n",
    "# Convert the output into a number in between 0 and 1 (0 signaling nonhate, 1 signaling hate)\n",
    "def fb_roberta_predict_score(sent):\n",
    "  result = pipe_fb_roberta(sent)\n",
    "  # The pipeline reports the confidence of the predicted label,\n",
    "  # so a 'nothate' confidence is flipped to obtain a hate score\n",
    "  if result[0]['label'] == 'nothate':\n",
    "    return 1 - result[0]['score']\n",
    "  else:\n",
    "    return result[0]['score']\n",
    "\n",
    "def fb_roberta_predict_label(sent):\n",
    "  result = pipe_fb_roberta(sent)\n",
    "  return \"non-hateful\" if result[0]['label'] == \"nothate\" else \"hateful\"\n",
    "\n",
    "\n",
    "#### Set up Google Perspective API client\n",
    "# Flag hateful sentences using Google Perspective API\n",
    "# For more details, see https://developers.perspectiveapi.com/s/docs-get-started\n",
    "client = discovery.build(\n",
    "  \"commentanalyzer\",\n",
    "  \"v1alpha1\",\n",
    "  developerKey=API_KEY,\n",
    "  discoveryServiceUrl=\"https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1\",\n",
    "  static_discovery=False,\n",
    ")\n",
    "\n",
    "def google_perspective_predict(sent):\n",
    "  analyze_request = {\n",
    "      'comment': { 'text': sent },\n",
    "      'requestedAttributes': {'IDENTITY_ATTACK': {}},\n",
    "      'languages': [\"en\"],\n",
    "      }\n",
    "  response = client.comments().analyze(body=analyze_request).execute()\n",
    "  return response[\"attributeScores\"][\"IDENTITY_ATTACK\"][\"summaryScore\"][\"value\"]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44626428",
   "metadata": {},
   "source": [
    "#### Hate-Speech Classification using RoBERTa + Perspective API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "3911555e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Perspective API score: 0.2127345\n",
      "------------ Google Perspective API Result ------------\n",
      "The sentence is classified as: Non_hateful: ✅\n"
     ]
    }
    ],
   "source": [
    "def google_perspective_roberta_full_response(sent, threshold_perspective=0.5):\n",
    "    # Two-stage check: RoBERTa screens every sentence; Perspective API is only\n",
    "    # queried for sentences that RoBERTa labels as hateful.\n",
    "    label = fb_roberta_predict_label(sent)\n",
    "    if label != \"hateful\":\n",
    "        return \"Non_hateful: ✅\"\n",
    "    perspective_score = google_perspective_predict(sent)\n",
    "    print(f\"Perspective API score: {perspective_score}\")\n",
    "    if perspective_score >= threshold_perspective:\n",
    "        return \"Hateful: 🚩\"\n",
    "    return \"Non_hateful: ✅\"\n",
    "\n",
    "your_sentence  = ['Have the Christians an exclusive right of setting up a blind faith?']\n",
    "result = google_perspective_roberta_full_response(your_sentence[0])\n",
    "print(\"------------ Google Perspective API Result ------------\")\n",
    "print(f\"The sentence is classified as: {result}\")"
   ]
  },
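  {
   "cell_type": "markdown",
   "id": "d4e5f6a7",
   "metadata": {},
   "source": [
    "The two-stage decision above can be checked offline without a model download or an API key. The sketch below assumes nothing beyond the control flow of the cell above: `cascade_label`, `stub_label`, and `stub_score` are hypothetical stand-ins for `fb_roberta_predict_label` and `google_perspective_predict`, and the keyword rules inside the stubs are arbitrary.\n",
    "\n",
    "```python\n",
    "def cascade_label(sentence, first_stage, second_stage, threshold=0.5):\n",
    "    # Stage 1: a binary classifier; only sentences it calls 'hateful' go further\n",
    "    if first_stage(sentence) != 'hateful':\n",
    "        return 'Non_hateful'\n",
    "    # Stage 2: a scoring service confirms or overrules the first stage\n",
    "    return 'Hateful' if second_stage(sentence) >= threshold else 'Non_hateful'\n",
    "\n",
    "# Hypothetical stubs with arbitrary keyword rules, used only to exercise the control flow\n",
    "def stub_label(sentence):\n",
    "    return 'hateful' if 'loathe' in sentence else 'non-hateful'\n",
    "\n",
    "def stub_score(sentence):\n",
    "    return 0.8 if 'you' in sentence else 0.2\n",
    "\n",
    "for s in ('i loathe you', 'i loathe rain', 'a fine day'):\n",
    "    print(s, '->', cascade_label(s, stub_label, stub_score))\n",
    "```\n",
    "\n",
    "A sentence is flagged only when both stages agree, so the second stage can lower the number of flags but never raise it."
   ]
  },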
  {
   "cell_type": "markdown",
   "id": "487a3825",
   "metadata": {},
   "source": [
    "### OpenAI Moderation Tool"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "acd6ee14",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------ OpenAI Moderation Tool Result ------------\n",
      "The sentence is classified as: not flagged hate; text: The conversation at supper was very gay. and response: non_hate\n"
     ]
    }
    ],
   "source": [
    "\n",
    "## Supply your OpenAI API key here, e.g. through the OPENAI_API_KEY environment variable (do not commit real keys)\n",
    "key = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')\n",
    "your_sentence  = ['The conversation at supper was very gay.']\n",
    "\n",
    "def openai_moderation_tool(sent, key):\n",
    "    client = OpenAI(api_key=key)\n",
    "    response = client.moderations.create(\n",
    "        input=sent,\n",
    "    )\n",
    "\n",
    "    if response.results[0].categories.hate or response.results[0].categories.harassment:\n",
    "        response = \"flagged text: 🚩 and response: hate/harassment\"\n",
    "    else:\n",
    "        response = f'not flagged hate; text: {sent} and response: non_hate'\n",
    "    return response\n",
    "result = openai_moderation_tool(your_sentence[0], key)\n",
    "\n",
    "print(\"------------ OpenAI Moderation Tool Result ------------\")\n",
    "print(f\"The sentence is classified as: {result}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa589d5c",
   "metadata": {},
   "source": [
    "### Visualizing and comparing the connotations acquired from different hate-check tools and Valence scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "21722907",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_f316c th {\n",
       "  text-align: center;\n",
       "}\n",
       "#T_f316c td {\n",
       "  text-align: center;\n",
       "}\n",
       "#T_f316c_row0_col1, #T_f316c_row1_col1, #T_f316c_row2_col1 {\n",
       "  text-align: left;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_f316c\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_f316c_level0_col0\" class=\"col_heading level0 col0\" >YEAR</th>\n",
       "      <th id=\"T_f316c_level0_col1\" class=\"col_heading level0 col1\" >Sentences</th>\n",
       "      <th id=\"T_f316c_level0_col2\" class=\"col_heading level0 col2\" >('Hate-Check Models', 'RoBERTa+Persp')</th>\n",
       "      <th id=\"T_f316c_level0_col3\" class=\"col_heading level0 col3\" >('Hate-Check Models', 'OpenAI')</th>\n",
       "      <th id=\"T_f316c_level0_col4\" class=\"col_heading level0 col4\" >Valence Score</th>\n",
       "      <th id=\"T_f316c_level0_col5\" class=\"col_heading level0 col5\" >Affective connotation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_f316c_level0_row0\" class=\"row_heading level0 row0\" >0</th>\n",
       "      <td id=\"T_f316c_row0_col0\" class=\"data row0 col0\" >1750s</td>\n",
       "      <td id=\"T_f316c_row0_col1\" class=\"data row0 col1\" >but i loathe you, you apache indian!</td>\n",
       "      <td id=\"T_f316c_row0_col2\" class=\"data row0 col2\" >🚩</td>\n",
       "      <td id=\"T_f316c_row0_col3\" class=\"data row0 col3\" >🚩</td>\n",
       "      <td id=\"T_f316c_row0_col4\" class=\"data row0 col4\" >-0.48</td>\n",
       "      <td id=\"T_f316c_row0_col5\" class=\"data row0 col5\" >👎</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_f316c_level0_row1\" class=\"row_heading level0 row1\" >1</th>\n",
       "      <td id=\"T_f316c_row1_col0\" class=\"data row1 col0\" >1800s</td>\n",
       "      <td id=\"T_f316c_row1_col1\" class=\"data row1 col1\" >its annals are black with treason, murder, and incest.</td>\n",
       "      <td id=\"T_f316c_row1_col2\" class=\"data row1 col2\" >🚩</td>\n",
       "      <td id=\"T_f316c_row1_col3\" class=\"data row1 col3\" >✅</td>\n",
       "      <td id=\"T_f316c_row1_col4\" class=\"data row1 col4\" >-0.49</td>\n",
       "      <td id=\"T_f316c_row1_col5\" class=\"data row1 col5\" >👎</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_f316c_level0_row2\" class=\"row_heading level0 row2\" >2</th>\n",
       "      <td id=\"T_f316c_row2_col0\" class=\"data row2 col0\" >1850s</td>\n",
       "      <td id=\"T_f316c_row2_col1\" class=\"data row2 col1\" >the irish are mostly very filthy and diseased.</td>\n",
       "      <td id=\"T_f316c_row2_col2\" class=\"data row2 col2\" >🚩</td>\n",
       "      <td id=\"T_f316c_row2_col3\" class=\"data row2 col3\" >🚩</td>\n",
       "      <td id=\"T_f316c_row2_col4\" class=\"data row2 col4\" >-0.51</td>\n",
       "      <td id=\"T_f316c_row2_col5\" class=\"data row2 col5\" >👎</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x7f7a3fe28790>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Example data\n",
    "data = {\n",
    "    \"YEAR\": [\"1750s\", \"1800s\", \"1850s\"],\n",
    "    \"Sentences\": [\n",
    "        \"but i loathe you, you apache indian!\",\n",
    "        \"its annals are black with treason, murder, and incest.\",\n",
    "        \"the irish are mostly very filthy and diseased.\"\n",
    "    ],\n",
    "    (\"Hate-Check Models\", \"RoBERTa+Persp\"): [\"🚩\", \"🚩\", \"🚩\"],\n",
    "    (\"Hate-Check Models\", \"OpenAI\"): [\"🚩\", \"✅\", \"🚩\"],\n",
    "    \"Valence Score\": [-0.48, -0.49, -0.51],\n",
    "    \"Affective connotation\": [\"👎\", \"👎\", \"👎\"]\n",
    "}\n",
    "\n",
    "# Convert to DataFrame; the tuple keys are rendered as flat, tuple-named columns for the Hate-Check Models\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Style formatting\n",
    "styled = (\n",
    "    df.style\n",
    "    .format({\"Valence Score\": \"{:.2f}\"})  # 2 decimal places\n",
    "    .set_table_styles([\n",
    "        {\"selector\": \"th\", \"props\": [(\"text-align\", \"center\")]}, # center headers\n",
    "        {\"selector\": \"td\", \"props\": [(\"text-align\", \"center\")]}, # center data\n",
    "    ])\n",
    "    .set_properties(subset=[\"Sentences\"], **{\"text-align\": \"left\"})  # left align text column\n",
    ")\n",
    "\n",
    "styled\n"
   ]
  },
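  {
   "cell_type": "markdown",
   "id": "b7c8d9e0",
   "metadata": {},
   "source": [
    "The `Affective connotation` column in the table above can be derived from the valence score. The sketch below rests on an assumption that this notebook does not state: that valence scores are centred on zero, so the sign separates positive from negative. The function name `affective_connotation` and the `threshold` parameter are hypothetical.\n",
    "\n",
    "```python\n",
    "def affective_connotation(valence, threshold=0.0):\n",
    "    # No lexicon coverage means no connotation can be assigned\n",
    "    if valence is None:\n",
    "        return None\n",
    "    # Assumption: scores above the threshold are positive, all others negative\n",
    "    return '👍' if valence > threshold else '👎'\n",
    "\n",
    "for score in (-0.48, -0.49, -0.51):\n",
    "    print(score, affective_connotation(score))\n",
    "```\n",
    "\n",
    "If the lexicons use a different scale, only the `threshold` default needs to change."
   ]
  },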
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b8e8499",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
