{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expense Claim Patterns and Fraud Analysis (Flag 88)\n",
    "\n",
    "### Dataset Description\n",
    "The dataset consists of 500 entries simulating the ServiceNow fm_expense_line table, which records various attributes of financial expenses. Key fields include 'number', 'opened_at', 'amount', 'state', 'short_description', 'ci', 'user', 'department', 'category', 'processed_date', 'source_id', and 'type'. This table documents the flow of financial transactions by detailing the amount, departmental allocation, and the nature of each expense. It provides a comprehensive view of organizational expenditures across different categories, highlighting both the timing and the approval state of each financial entry. Additionally, the dataset offers insights into the efficiency of expense processing based on different states, revealing potential areas for workflow optimization.\n",
    "\n",
    "### Your Task\n",
    "**Goal**: To detect and investigate instances of repeated identical expense claims by individual users, determining whether these repetitions are fraudulent or due to misunderstandings of the expense policy.\n",
    "\n",
    "**Role**: Compliance and Audit Analyst\n",
    "\n",
    "**Difficulty**: 3 out of 5.\n",
    "\n",
    "**Category**: Finance Management\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Necessary Libraries\n",
    "This cell imports all necessary libraries required for the analysis. This includes libraries for data manipulation, data visualization, and any specific utilities needed for the tasks. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argparse\n",
    "import pandas as pd\n",
    "import json\n",
    "import requests\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "\n",
    "from openai import OpenAI\n",
    "from pandas import date_range"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Dataset\n",
    "This cell loads the expense dataset to be analyzed. The data is orginally saved in the from a CSV file, and is here imported into a DataFrame. The steps involve specifying the path to the dataset, using pandas to read the file, and confirming its successful load by inspecting the first few table entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>state</th>\n",
       "      <th>closed_at</th>\n",
       "      <th>opened_at</th>\n",
       "      <th>closed_by</th>\n",
       "      <th>number</th>\n",
       "      <th>sys_updated_by</th>\n",
       "      <th>location</th>\n",
       "      <th>assigned_to</th>\n",
       "      <th>caller_id</th>\n",
       "      <th>sys_updated_on</th>\n",
       "      <th>short_description</th>\n",
       "      <th>priority</th>\n",
       "      <th>assignement_group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Database</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2023-07-25 03:32:18.462401146</td>\n",
       "      <td>2023-01-02 11:04:00</td>\n",
       "      <td>Fred Luddy</td>\n",
       "      <td>INC0000000034</td>\n",
       "      <td>admin</td>\n",
       "      <td>Australia</td>\n",
       "      <td>Fred Luddy</td>\n",
       "      <td>ITIL User</td>\n",
       "      <td>2023-07-06 03:31:13.838619495</td>\n",
       "      <td>There was an issue</td>\n",
       "      <td>2 - High</td>\n",
       "      <td>Database</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Hardware</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2023-03-11 13:42:59.511508874</td>\n",
       "      <td>2023-01-03 10:19:00</td>\n",
       "      <td>Charlie Whitherspoon</td>\n",
       "      <td>INC0000000025</td>\n",
       "      <td>admin</td>\n",
       "      <td>India</td>\n",
       "      <td>Beth Anglin</td>\n",
       "      <td>Don Goodliffe</td>\n",
       "      <td>2023-05-19 04:22:50.443252112</td>\n",
       "      <td>There was an issue</td>\n",
       "      <td>1 - Critical</td>\n",
       "      <td>Hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Database</td>\n",
       "      <td>Resolved</td>\n",
       "      <td>2023-01-20 14:37:18.361510788</td>\n",
       "      <td>2023-01-04 06:37:00</td>\n",
       "      <td>Charlie Whitherspoon</td>\n",
       "      <td>INC0000000354</td>\n",
       "      <td>system</td>\n",
       "      <td>India</td>\n",
       "      <td>Fred Luddy</td>\n",
       "      <td>ITIL User</td>\n",
       "      <td>2023-02-13 08:10:20.378839709</td>\n",
       "      <td>There was an issue</td>\n",
       "      <td>2 - High</td>\n",
       "      <td>Database</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hardware</td>\n",
       "      <td>Resolved</td>\n",
       "      <td>2023-01-25 20:46:13.679914432</td>\n",
       "      <td>2023-01-04 06:53:00</td>\n",
       "      <td>Fred Luddy</td>\n",
       "      <td>INC0000000023</td>\n",
       "      <td>admin</td>\n",
       "      <td>Canada</td>\n",
       "      <td>Luke Wilson</td>\n",
       "      <td>Don Goodliffe</td>\n",
       "      <td>2023-06-14 11:45:24.784548040</td>\n",
       "      <td>There was an issue</td>\n",
       "      <td>2 - High</td>\n",
       "      <td>Hardware</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hardware</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2023-05-10 22:35:58.881919516</td>\n",
       "      <td>2023-01-05 16:52:00</td>\n",
       "      <td>Luke Wilson</td>\n",
       "      <td>INC0000000459</td>\n",
       "      <td>employee</td>\n",
       "      <td>UK</td>\n",
       "      <td>Charlie Whitherspoon</td>\n",
       "      <td>David Loo</td>\n",
       "      <td>2023-06-11 20:25:35.094482408</td>\n",
       "      <td>There was an issue</td>\n",
       "      <td>2 - High</td>\n",
       "      <td>Hardware</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   category     state                      closed_at            opened_at  \\\n",
       "0  Database    Closed  2023-07-25 03:32:18.462401146  2023-01-02 11:04:00   \n",
       "1  Hardware    Closed  2023-03-11 13:42:59.511508874  2023-01-03 10:19:00   \n",
       "2  Database  Resolved  2023-01-20 14:37:18.361510788  2023-01-04 06:37:00   \n",
       "3  Hardware  Resolved  2023-01-25 20:46:13.679914432  2023-01-04 06:53:00   \n",
       "4  Hardware    Closed  2023-05-10 22:35:58.881919516  2023-01-05 16:52:00   \n",
       "\n",
       "              closed_by         number sys_updated_by   location  \\\n",
       "0            Fred Luddy  INC0000000034          admin  Australia   \n",
       "1  Charlie Whitherspoon  INC0000000025          admin      India   \n",
       "2  Charlie Whitherspoon  INC0000000354         system      India   \n",
       "3            Fred Luddy  INC0000000023          admin     Canada   \n",
       "4           Luke Wilson  INC0000000459       employee         UK   \n",
       "\n",
       "            assigned_to      caller_id                 sys_updated_on  \\\n",
       "0            Fred Luddy      ITIL User  2023-07-06 03:31:13.838619495   \n",
       "1           Beth Anglin  Don Goodliffe  2023-05-19 04:22:50.443252112   \n",
       "2            Fred Luddy      ITIL User  2023-02-13 08:10:20.378839709   \n",
       "3           Luke Wilson  Don Goodliffe  2023-06-14 11:45:24.784548040   \n",
       "4  Charlie Whitherspoon      David Loo  2023-06-11 20:25:35.094482408   \n",
       "\n",
       "    short_description      priority assignement_group  \n",
       "0  There was an issue      2 - High          Database  \n",
       "1  There was an issue  1 - Critical          Hardware  \n",
       "2  There was an issue      2 - High          Database  \n",
       "3  There was an issue      2 - High          Hardware  \n",
       "4  There was an issue      2 - High          Hardware  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_path = \"csvs/flag-88.csv\"\n",
    "flag_data = pd.read_csv(dataset_path)\n",
    "df = pd.read_csv(dataset_path)\n",
    "flag_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Question 1:How many instances of repeated identical expense claims are there, and which users are involved?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plot expense distribution by department\n",
    "\n",
    "This bar visualization plots distribution of expenses across different departments within the organization, focusing on an average expenses per department.  This plot helps identify departments that might be overspending or under-utilizing resources etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N/A\n"
     ]
    }
   ],
   "source": [
    "# import matplotlib.pyplot as plt\n",
    "# import pandas as pd\n",
    "\n",
    "# # Assuming flag_data is your DataFrame containing expense data\n",
    "# # Group data by department and calculate total and average expenses\n",
    "# department_expenses = flag_data.groupby('department')['amount'].agg(['sum', 'mean']).reset_index()\n",
    "\n",
    "# # Sort data for better visualization (optional)\n",
    "# department_expenses.sort_values('sum', ascending=False, inplace=True)\n",
    "\n",
    "# # Creating the plot\n",
    "# fig, ax = plt.subplots(figsize=(14, 8))\n",
    "\n",
    "# # Bar plot for total expenses\n",
    "# # total_bars = ax.bar(department_expenses['department'], department_expenses['sum'], color='blue', label='Total Expenses')\n",
    "\n",
    "# # Bar plot for average expenses\n",
    "# average_bars = ax.bar(department_expenses['department'], department_expenses['mean'], color='green', label='Average Expenses', alpha=0.6, width=0.5)\n",
    "\n",
    "# # Add some labels, title and custom x-axis tick labels, etc.\n",
    "# ax.set_xlabel('Department')\n",
    "# ax.set_ylabel('Expenses ($)')\n",
    "# ax.set_title('Average Expenses by Department')\n",
    "# ax.set_xticks(department_expenses['department'])\n",
    "# ax.set_xticklabels(department_expenses['department'], rotation=45)\n",
    "# ax.legend()\n",
    "\n",
    "# # Adding a label above each bar\n",
    "# def add_labels(bars):\n",
    "#     for bar in bars:\n",
    "#         height = bar.get_height()\n",
    "#         ax.annotate(f'{height:.2f}',\n",
    "#                     xy=(bar.get_x() + bar.get_width() / 2, height),\n",
    "#                     xytext=(0, 3),  # 3 points vertical offset\n",
    "#                     textcoords=\"offset points\",\n",
    "#                     ha='center', va='bottom')\n",
    "\n",
    "# # add_labels(total_bars)\n",
    "# add_labels(average_bars)\n",
    "\n",
    "# plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)\n",
    "# plt.show()\n",
    "print(\"N/A\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate JSON Description for the Insight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'data_type': 'frequency',\n",
       " 'insight': \"The analysis could not be completed because the required 'department' column is missing from the dataset (flag_data). This is evidenced by the KeyError in the output indicating that 'department' is not a valid column name.\",\n",
       " 'insight_value': {},\n",
       " 'plot': {'description': 'The code attempts to create a bar chart showing average expenses by department, but the visualization failed due to missing data. The intended plot would have shown department-wise average expenses with green bars, including numerical labels and a rotated x-axis for better readability.'},\n",
       " 'question': 'How many instances of repeated identical expense claims are there, and which users are involved?',\n",
       " 'actionable_insight': \"Before analyzing expense claim patterns, the data structure needs to be verified and corrected. Specifically, ensure that the dataset contains the required 'department' and 'amount' columns. Additionally, the code should be modified to address the actual question about repeated expense claims, as the current code focuses on department-wise averages instead.\"}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{\n",
    "    \"data_type\": \"frequency\",\n",
    "    \"insight\": \"The analysis could not be completed because the required 'department' column is missing from the dataset (flag_data). This is evidenced by the KeyError in the output indicating that 'department' is not a valid column name.\",\n",
    "    \"insight_value\": {},\n",
    "    \"plot\": {\n",
    "        \"description\": \"The graph could not be generated due to missing data\"\n",
    "    },\n",
    "    \"question\": \"How many instances of repeated identical expense claims are there, and which users are involved?\",\n",
    "    \"actionable_insight\": \"No actionable insight could be generated due to missing data\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Question 2:** What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analyzing the processing times for expenses in different states reveals notable differences. Processed expenses tend to have shorter processing times compared to Declined expenses. Understanding these differences helps identify areas for potential optimization and efficiency improvements in the expense processing workflow.\"\n",
    "\n",
    "These components are designed to prompt an analysis focused on the differences in processing times based on the states of the expenses, ultimately leading to the identified insight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N/A\n"
     ]
    }
   ],
   "source": [
    "# # Calculate average processing time for each state\n",
    "# avg_processing_time_by_state = df.groupby('state')['processing_time_hours'].mean().reset_index()\n",
    "\n",
    "# # Set the style of the visualization\n",
    "# sns.set(style=\"whitegrid\")\n",
    "\n",
    "# # Create a bar plot for average processing time by state\n",
    "# plt.figure(figsize=(12, 6))\n",
    "# sns.barplot(x='state', y='processing_time_hours', data=avg_processing_time_by_state)\n",
    "# plt.title('Average Processing Time by State')\n",
    "# plt.xlabel('State')\n",
    "# plt.ylabel('Average Processing Time (hours)')\n",
    "# plt.xticks(rotation=45)\n",
    "# plt.show()\n",
    "print(\"N/A\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'data_type': 'comparative',\n",
       " 'insight': \"The analysis could not be completed because the column 'processing_time_hours' was not found in the dataset, indicating either missing or incorrectly named data\",\n",
       " 'insight_value': {},\n",
       " 'plot': {'description': 'The graph could not be generated due to missing data'},\n",
       " 'question': 'What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?',\n",
       " 'actionable_insight': 'No actionable insight could be generated due to missing data'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{\n",
    "    \"data_type\": \"comparative\",\n",
    "    \"insight\": \"The analysis could not be completed because the column 'processing_time_hours' was not found in the dataset, indicating either missing or incorrectly named data\",\n",
    "    \"insight_value\": {},\n",
    "    \"plot\": {\n",
    "        \"description\": \"The graph could not be generated due to missing data\"\n",
    "    },\n",
    "    \"question\": \"What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?\",\n",
    "    \"actionable_insight\": \"No actionable insight could be generated due to missing data\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Question 3: How many instances of any repeated identical expense claims are there?**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Frequency Distribution of Repeated Expense Claims\n",
    "\n",
    "This chart analyzes frequency of repeated identical expense claims, highlighting potential anomalies. It focuses on claims submitted by the same user, within the same category, and for the same amount. The histogram displays the distribution of these frequencies, using red bars to highlight any unusual nature of repeated claims.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N/A\n"
     ]
    }
   ],
   "source": [
    "# import matplotlib.pyplot as plt\n",
    "# import pandas as pd\n",
    "\n",
    "# # Group by user, category, and amount to count occurrences\n",
    "# grouped_data = flag_data.groupby(['user', 'category', 'amount']).size().reset_index(name='frequency')\n",
    "\n",
    "# # Filter out normal entries to focus on potential anomalies\n",
    "# potential_fraud = grouped_data[grouped_data['frequency'] > 3]  # Arbitrary threshold, adjust based on your data\n",
    "\n",
    "# # Plot histogram of frequencies\n",
    "# plt.figure(figsize=(10, 6))\n",
    "# plt.hist(potential_fraud['frequency'], bins=30, color='red', alpha=0.7)\n",
    "# plt.title('Distribution of Repeated Claims Frequency')\n",
    "# plt.xlabel('Frequency of Same Amount Claims by Same User in Same Category')\n",
    "# plt.ylabel('Count of Such Incidents')\n",
    "# plt.grid(True)\n",
    "# plt.show()\n",
    "print(\"N/A\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate JSON Description for the Insight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'data_type': 'frequency',\n",
       " 'insight': \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the dataset (flag_data)\",\n",
       " 'insight_value': {},\n",
       " 'plot': {'description': 'The code attempted to create a histogram showing the distribution of repeated claims frequency, but failed due to missing data. The intended visualization would have shown the frequency of identical expense claims made by the same user in the same category'},\n",
       " 'question': 'How many instances of any repeated identical expense claims are there?',\n",
       " 'actionable_insight': 'No actionable insight could be generated due to missing data'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{\n",
    "    \"data_type\": \"frequency\",\n",
    "    \"insight\": \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the dataset (flag_data)\",\n",
    "    \"insight_value\": {},\n",
    "    \"plot\": {\n",
    "        \"description\": \"The code attempted to create a histogram showing the distribution of repeated claims frequency, but failed due to missing data. The intended visualization would have shown the frequency of identical expense claims made by the same user in the same category\"\n",
    "    },\n",
    "    \"question\": \"How many instances of any repeated identical expense claims are there?\",\n",
    "    \"actionable_insight\": \"No actionable insight could be generated due to missing data\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Question 4:  Which users are involved in the frequent cases?**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plot repeated expense claims by user and category\n",
    "\n",
    "This plot visualizes repeated expense claims across various categories, highlighting users involved in frequent submissions. Each dot represents a unique combination of user, category, and expense amount, with the size of the dot proportional to the frequency of claims.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N/A\n"
     ]
    }
   ],
   "source": [
    "# import matplotlib.pyplot as plt\n",
    "\n",
    "# # Assume flag_data includes 'user', 'amount', 'category' columns\n",
    "# # Group data by user, category, and amount to count frequencies\n",
    "# grouped_data = flag_data.groupby(['user', 'category', 'amount']).size().reset_index(name='count')\n",
    "\n",
    "# # Filter to only include cases with more than one claim (to highlight potential fraud)\n",
    "# repeated_claims = grouped_data[grouped_data['count'] > 1]\n",
    "\n",
    "# # Create a scatter plot with sizes proportional to the count of claims\n",
    "# plt.figure(figsize=(14, 8))\n",
    "# colors = {'Travel': 'blue', 'Meals': 'green', 'Accommodation': 'red', 'Miscellaneous': 'purple'}  # Add more categories as needed\n",
    "# for ct in repeated_claims['category'].unique():\n",
    "#     subset = repeated_claims[repeated_claims['category'] == ct]\n",
    "#     plt.scatter(subset['user'], subset['amount'], s=subset['count'] * 100,  # Increased size factor for better visibility\n",
    "#                 color=colors.get(ct, 'gray'), label=f'Category: {ct}', alpha=0.6)\n",
    "\n",
    "# # Customizing the plot\n",
    "# plt.title('Repeated Expense Claims by User and Category')\n",
    "# plt.xlabel('User')\n",
    "# plt.ylabel('Amount ($)')\n",
    "# plt.legend(title='Expense Categories')\n",
    "# plt.xticks(rotation=45)  # Rotate x-axis labels for better readability\n",
    "# plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)\n",
    "\n",
    "# # Highlighting significant cases\n",
    "# # Let's annotate the specific user found in your description\n",
    "# for i, row in repeated_claims.iterrows():\n",
    "#     if row['user'] == 'Mamie Mcintee' and row['amount'] == 8000:\n",
    "#         plt.annotate(f\"{row['user']} (${row['amount']})\", (row['user'], row['amount']),\n",
    "#                      textcoords=\"offset points\", xytext=(0,10), ha='center', fontsize=9, color='darkred')\n",
    "\n",
    "# # Show plot\n",
    "# plt.show()\n",
    "print(\"N/A\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate JSON Description for the Insight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'data_type': 'frequency',\n",
       " 'insight': \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame\",\n",
       " 'insight_value': {},\n",
       " 'plot': {'description': 'A scatter plot was attempted to visualize repeated expense claims by user and category, with point sizes representing frequency of claims, but failed due to missing data'},\n",
       " 'question': 'Which users are involved in the frequent cases?',\n",
       " 'actionable_insight': \"Before proceeding with the analysis, verify that the flag_data DataFrame contains the required 'user' column and ensure data integrity\"}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{\n",
    "    \"data_type\": \"frequency\",\n",
    "    \"insight\": \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame\",\n",
    "    \"insight_value\": {},\n",
    "    \"plot\": {\n",
    "        \"description\": \"A scatter plot was attempted to visualize repeated expense claims by user and category, with point sizes representing frequency of claims, but failed due to missing data\"\n",
    "    },\n",
    "    \"question\": \"Which users are involved in the frequent cases?\",\n",
    "    \"actionable_insight\": \"Before proceeding with the analysis, verify that the flag_data DataFrame contains the required 'user' column and ensure data integrity\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **Question 5:  What department and categories are most commonly involved in these repeated claims?**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plot distribution of expense claims by department and category for Mamie Mcintee\n",
    "\n",
    "This bar graph displays the distribution of Mamie Mcintee's expense claims across different departments and categories, illustrating the specific areas where repeated claims are most frequent. One color represents a different expense category, allowing for a clear view of which combinations are most problematic.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N/A\n"
     ]
    }
   ],
   "source": [
    "# import matplotlib.pyplot as plt\n",
    "# import pandas as pd\n",
    "\n",
    "# # Assuming 'flag_data' includes 'user', 'department', 'amount', 'category' columns\n",
    "# # and it's already loaded with the data\n",
    "\n",
    "# # Filter for the specific user\n",
    "# user_data = flag_data[flag_data['user'] == 'Mamie Mcintee']\n",
    "\n",
    "# # Group data by department and category to count frequencies\n",
    "# department_category_counts = user_data.groupby(['department', 'category']).size().unstack(fill_value=0)\n",
    "\n",
    "# # Plotting\n",
    "# plt.figure(figsize=(12, 7))\n",
    "# department_category_counts.plot(kind='bar', stacked=True, color=['blue', 'green', 'red', 'purple', 'orange'], alpha=0.7)\n",
    "# plt.title('Distribution of Expense Claims by Department and Category for Mamie Mcintee')\n",
    "# plt.xlabel('Department')\n",
    "# plt.ylabel('Number of Claims')\n",
    "# plt.xticks(rotation=0)  # Keep the department names horizontal for better readability\n",
    "# plt.legend(title='Expense Categories')\n",
    "# plt.grid(True, which='both', linestyle='--', linewidth=0.5)\n",
    "# plt.show()\n",
    "print(\"N/A\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate JSON Description for the Insight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'data_type': 'distribution',\n",
       " 'insight': \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame\",\n",
       " 'insight_value': {},\n",
       " 'plot': {'description': 'No plot was generated due to missing data'},\n",
       " 'question': 'What department and categories are most commonly involved in these repeated claims?',\n",
       " 'actionable_insight': 'No actionable insight could be generated due to missing data'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{\n",
    "    \"data_type\": \"distribution\",\n",
    "    \"insight\": \"The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame\",\n",
    "    \"insight_value\": {},\n",
    "    \"plot\": {\n",
    "        \"description\": \"No plot was generated due to missing data\"\n",
    "    },\n",
    "    \"question\": \"What department and categories are most commonly involved in these repeated claims?\",\n",
    "    \"actionable_insight\": \"No actionable insight could be generated due to missing data\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Updated Summary of Findings (Flag 88):\n",
    "\n",
    "1. **Pattern Recognition:** The dataset is focused on identifying patterns in expense submissions that may indicate potential fraud or policy abuse. However, the dataset is missing key columns such as 'department', 'user', and 'processing_time_hours', which are essential for conducting the analysis.\n",
    "\n",
    "2. **Insight into User Behavior:** No analysis could be performed due to the missing 'user' column. This column is crucial for identifying repeated identical expense claims by individual users.\n",
    "\n",
    "3. **State-Based Processing Time Analysis:** The analysis could not be completed because the 'processing_time_hours' column is missing from the dataset. This column is necessary to compare processing times for expenses in various states such as Processed, Declined, Submitted, and Pending.\n",
    "\n",
    "4. **Expense Distribution by Department:** The analysis could not be completed because the 'department' column is missing from the dataset. This column is needed to plot the distribution of expenses across different departments."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
