{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ceb4f80f",
   "metadata": {},
   "source": [
    "  # Experiment Notebook Analysis\n",
    "\n",
    "  The notebook serves as an **experimental testing and debugging environment** for the PVA-SAE project's\n",
    "   dataset building and model generation pipeline. Specifically, it:\n",
    "\n",
    "  ## Main Functions:\n",
    "\n",
    "  1. **Dataset Loading Testing** - Tests different approaches to load the MBPP dataset, debugging module\n",
    "   import issues and verifying that all 974 records are properly loaded\n",
    "\n",
    "  2. **Generated Code Analysis** - Examines the outputs from the dataset building phase, showing:\n",
    "     - Original MBPP problems and test cases\n",
    "     - Generated prompts sent to the model\n",
    "     - Generated Python code solutions\n",
    "     - Test execution results (pass/fail)\n",
    "\n",
    "  3. **Model Generation Testing** - Tests direct model inference with the Gemma-2-2B model to verify\n",
    "  prompt generation and code output quality\n",
    "\n",
    "  4. **Dataset Inspection** - Loads and analyzes generated parquet files containing:\n",
    "     - Task IDs\n",
    "     - Generated code solutions\n",
    "     - Test pass/fail results\n",
    "     - Complexity scores\n",
    "\n",
    "  ## Key Findings from the Notebook:\n",
    "\n",
    "  - Shows that the current dataset has only 4 records (not the full 974)\n",
    "  - All 4 generated solutions failed their tests\n",
    "  - Reveals issues with the generated code quality (e.g., incorrect algorithms, wrong return types)\n",
    "  - Demonstrates the prompt structure being used for code generation\n",
    "\n",
    "  The notebook appears to be used for **debugging the dataset building pipeline** and **analyzing why \n",
    "  generated code solutions are failing tests**, which is crucial for the thesis research on program\n",
    "  validity awareness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dcd314dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No parquet files found\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read the most recent dataset\n",
    "import glob\n",
    "import os\n",
    "\n",
    "# Find the most recent parquet file\n",
    "parquet_files = glob.glob('data/datasets/mbpp_dataset_*.parquet')\n",
    "if parquet_files:\n",
    "    latest_file = max(parquet_files, key=os.path.getmtime)\n",
    "    print(f\"Loading: {latest_file}\")\n",
    "    df = pd.read_parquet(latest_file)\n",
    "else:\n",
    "    print(\"No parquet files found\")\n",
    "    df = None\n",
    "\n",
    "if df is not None:\n",
    "    print(f\"Dataset shape: {df.shape}\")\n",
    "    print(f\"Columns: {df.columns.tolist()}\")\n",
    "    \n",
    "    # Show basic statistics\n",
    "    if 'complexity_score' in df.columns:\n",
    "        print(f\"\\nComplexity score statistics:\")\n",
    "        print(df['complexity_score'].describe())\n",
    "        \n",
    "        # Show sample data\n",
    "        print(f\"\\nSample data:\")\n",
    "        display_cols = ['task_id', 'test_passed', 'complexity_score']\n",
    "        print(df[display_cols].head())\n",
    "    else:\n",
    "        print(\"\\nComplexity score column not found!\")\n",
    "        print(\"Available columns:\", df.columns.tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf6fc8fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For handpicked files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1f6bdf26",
   "metadata": {},
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'phase1_dataset_building'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[2], line 12\u001b[0m\n\u001b[1;32m      9\u001b[0m     \u001b[38;5;28;01mdel\u001b[39;00m sys\u001b[38;5;241m.\u001b[39mmodules[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mphase1_dataset_building\u001b[39m\u001b[38;5;124m'\u001b[39m]\n\u001b[1;32m     11\u001b[0m \u001b[38;5;66;03m# Now import the updated DatasetManager\u001b[39;00m\n\u001b[0;32m---> 12\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mphase1_dataset_building\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdataset_manager\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m DatasetManager\n\u001b[1;32m     14\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m     15\u001b[0m     \u001b[38;5;66;03m# Initialize the dataset manager\u001b[39;00m\n\u001b[1;32m     16\u001b[0m     dataset_manager \u001b[38;5;241m=\u001b[39m DatasetManager()\n",
      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'phase1_dataset_building'"
     ]
    }
   ],
   "source": [
    "# Force reload the updated DatasetManager module to clear cache\n",
    "import importlib\n",
    "import sys\n",
    "\n",
    "# Remove the module from cache if it exists\n",
    "if 'phase1_dataset_building.dataset_manager' in sys.modules:\n",
    "    del sys.modules['phase1_dataset_building.dataset_manager']\n",
    "if 'phase1_dataset_building' in sys.modules:\n",
    "    del sys.modules['phase1_dataset_building']\n",
    "\n",
    "# Now import the updated DatasetManager\n",
    "from phase1_dataset_building.dataset_manager import DatasetManager\n",
    "\n",
    "try:\n",
    "    # Initialize the dataset manager\n",
    "    dataset_manager = DatasetManager()\n",
    "    \n",
    "    # Load the full MBPP dataset\n",
    "    print(\"Loading MBPP dataset using updated DatasetManager (after module reload)...\")\n",
    "    dataset_manager.load_dataset()\n",
    "    \n",
    "    print(f\"✅ Successfully loaded MBPP dataset\")\n",
    "    print(f\"Total records: {dataset_manager.get_size()}\")\n",
    "    print(f\"Expected: 974 records\")\n",
    "    \n",
    "    if dataset_manager.get_size() == 974:\n",
    "        print(\"✅ All 974 records loaded successfully!\")\n",
    "    else:\n",
    "        print(f\"❌ Expected 974 records, but got {dataset_manager.get_size()}\")\n",
    "        print(\"The module may still be cached. Try restarting the kernel.\")\n",
    "    \n",
    "    # Show some basic info about the dataset\n",
    "    print(f\"\\nDataset structure:\")\n",
    "    if dataset_manager.is_loaded():\n",
    "        sample_record = dataset_manager.get_record(0)\n",
    "        print(f\"Keys in each record: {list(sample_record.keys())}\")\n",
    "        print(f\"First task_id: {sample_record.get('task_id', 'unknown')}\")\n",
    "        print(f\"Text: {sample_record.get('text', '')[:100]}...\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"❌ Error loading MBPP dataset: {e}\")\n",
    "    import traceback\n",
    "    traceback.print_exc()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "cc7a3d92",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Testing direct call to load_dataset('Muennighoff/mbpp', 'full')...\n",
      "Direct load result: 974 records\n",
      "DatasetManager result: 974 records\n"
     ]
    }
   ],
   "source": [
    "# Alternative: Test with direct dataset loading to verify our fix\n",
    "from datasets import load_dataset\n",
    "\n",
    "# Test the exact same call that DatasetManager should now make\n",
    "print(\"Testing direct call to load_dataset('Muennighoff/mbpp', 'full')...\")\n",
    "dataset = load_dataset(\"Muennighoff/mbpp\", \"full\")\n",
    "print(f\"Direct load result: {len(dataset['test'])} records\")\n",
    "\n",
    "# Now test DatasetManager\n",
    "from phase1_dataset_building.dataset_manager import DatasetManager\n",
    "dm = DatasetManager()\n",
    "dm.load_dataset()\n",
    "print(f\"DatasetManager result: {dm.get_size()} records\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "09dc0506",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== ANALYSIS OF GENERATED CODE ===\n",
      "\n",
      "Task ID: 1\n",
      "Test Passed: False\n",
      "\n",
      "Original Problem: Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].\n",
      "Test Cases: ['assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8', 'assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12', 'assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16']\n",
      "\n",
      "Generated Prompt:\n",
      "Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].\n",
      "\n",
      "assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8\n",
      "assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12\n",
      "assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16\n",
      "\n",
      "# Your code here\n",
      "\n",
      "Generated Code:\n",
      "\n",
      "def min_cost(cost, m, n):\n",
      "    # Write your code here\n",
      "    dp = [[0 for i in range(n+1)] for j in range(m+1)]\n",
      "    for i in range(1, m+1):\n",
      "        for j in range(1, n+1):\n",
      "            if i == 1 and j == 1:\n",
      "                dp[i][j] = cost[0][0]\n",
      "            elif i == 1:\n",
      "                dp[i][j] = dp[i][j-1] + cost[i][j]\n",
      "            elif j == 1:\n",
      "                dp[i][j] = dp[i-1][j] + cost[i][j]\n",
      "            else:\n",
      "                dp[i][j] = min(dp[i-1][j], dp[i][j-1]) + cost[i][j]\n",
      "    return dp[m][n]\n",
      "\n",
      "================================================================================\n",
      "\n",
      "Task ID: 2\n",
      "Test Passed: False\n",
      "\n",
      "Original Problem: Write a function to find the similar elements from the given two tuple lists.\n",
      "Test Cases: ['assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)', 'assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4)', 'assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14)']\n",
      "\n",
      "Generated Prompt:\n",
      "Write a function to find the similar elements from the given two tuple lists.\n",
      "\n",
      "assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n",
      "assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4)\n",
      "assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14)\n",
      "\n",
      "# Your code here\n",
      "\n",
      "Generated Code:\n",
      "\n",
      "def similar_elements(list1, list2):\n",
      "    # Write your code here\n",
      "    return list(set(list1) & set(list2))\n",
      "\n",
      "================================================================================\n",
      "\n",
      "Task ID: 3\n",
      "Test Passed: False\n",
      "\n",
      "Original Problem: Write a python function to identify non-prime numbers.\n",
      "Test Cases: ['assert is_not_prime(2) == False', 'assert is_not_prime(10) == True', 'assert is_not_prime(35) == True']\n",
      "\n",
      "Generated Prompt:\n",
      "Write a python function to identify non-prime numbers.\n",
      "\n",
      "assert is_not_prime(2) == False\n",
      "assert is_not_prime(10) == True\n",
      "assert is_not_prime(35) == True\n",
      "\n",
      "# Your code here\n",
      "\n",
      "Generated Code:\n",
      "\n",
      "def is_not_prime(n):\n",
      "    if n == 1:\n",
      "        return False\n",
      "    for i in range(2, n):\n",
      "        if n % i == 0:\n",
      "            return False\n",
      "    return True\n",
      "\n",
      "================================================================================\n",
      "\n",
      "Task ID: 4\n",
      "Test Passed: False\n",
      "\n",
      "Original Problem: Write a function to find the largest integers from a given list of numbers using heap queue algorithm.\n",
      "Test Cases: ['assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] ', 'assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] ', 'assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35]']\n",
      "\n",
      "Generated Prompt:\n",
      "Write a function to find the largest integers from a given list of numbers using heap queue algorithm.\n",
      "\n",
      "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65]\n",
      "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75]\n",
      "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35]\n",
      "\n",
      "# Your code here\n",
      "\n",
      "Generated Code:\n",
      "\n",
      "def heap_queue_largest(arr,n):\n",
      "    # Write your code here\n",
      "    heap = []\n",
      "    for i in range(n):\n",
      "        heap.append(arr[i])\n",
      "    for i in range(n-1,0,-1):\n",
      "        heap[0],heap[i] = heap[i],heap[0]\n",
      "        heap.pop()\n",
      "        heapify(heap,0,i)\n",
      "    return heap[0:n]\n",
      "\n",
      "================================================================================\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Let's examine what code is being generated and the prompts\n",
    "import pandas as pd\n",
    "import glob\n",
    "import os\n",
    "from phase1_dataset_building.dataset_manager import DatasetManager, PromptTemplateBuilder\n",
    "\n",
    "# Find the most recent parquet file\n",
    "parquet_files = glob.glob('data/datasets/mbpp_dataset_*.parquet')\n",
    "if parquet_files:\n",
    "    latest_file = max(parquet_files, key=os.path.getmtime)\n",
    "    df = pd.read_parquet(latest_file)\n",
    "    \n",
    "    # Load dataset manager to see original tasks\n",
    "    dm = DatasetManager()\n",
    "    dm.load_dataset()\n",
    "    template_builder = PromptTemplateBuilder()\n",
    "    \n",
    "    print(\"=== ANALYSIS OF GENERATED CODE ===\\n\")\n",
    "    for idx, row in df.iterrows():\n",
    "        task_id = row['task_id']\n",
    "        print(f\"Task ID: {task_id}\")\n",
    "        print(f\"Test Passed: {row['test_passed']}\")\n",
    "        \n",
    "        # Get original record\n",
    "        record = dm.get_record(task_id - 1)  # task_id is 1-indexed\n",
    "        print(f\"\\nOriginal Problem: {record['text']}\")\n",
    "        print(f\"Test Cases: {record['test_list']}\")  # Show first 2\n",
    "        \n",
    "        # Show the prompt that would be generated\n",
    "        prompt = template_builder.build_prompt(record)\n",
    "        print(f\"\\nGenerated Prompt:\\n{prompt}\")\n",
    "        \n",
    "        print(f\"\\nGenerated Code:\\n{row['generated_code']}\")\n",
    "        print(\"=\" * 80)\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "1048bf48",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Task ID: 11\n",
      "Problem: Write a python function to remove first and last occurrence of a given character from the string.\n",
      "Test cases: ['assert remove_Occ(\"hello\",\"l\") == \"heo\"', 'assert remove_Occ(\"abcda\",\"a\") == \"bcd\"', 'assert remove_Occ(\"PHP\",\"P\") == \"H\"']\n",
      "Reference solution: def remove_Occ(s,ch): \n",
      "    for i in range(len(s)): \n",
      "        if (s[i] == ch): \n",
      "            s = s[0 : i] + s[i + 1:] \n",
      "            break\n",
      "    for i in range(len(s) - 1,-1,-1):  \n",
      "        if (s[i] == ch): \n",
      "            s = s[0 : i] + s[i + 1:] \n",
      "            break\n",
      "    return s \n",
      "\n",
      "============================================================\n",
      "GENERATED PROMPT:\n",
      "============================================================\n",
      "Write a python function to remove first and last occurrence of a given character from the string.\n",
      "\n",
      "assert remove_Occ(\"hello\",\"l\") == \"heo\"\n",
      "assert remove_Occ(\"abcda\",\"a\") == \"bcd\"\n",
      "assert remove_Occ(\"PHP\",\"P\") == \"H\"\n",
      "\n",
      "# Your code here\n",
      "============================================================\n",
      "\n",
      "Prompt ends with '# Your code here': True\n",
      "Prompt length: 232 characters\n"
     ]
    }
   ],
   "source": [
    "# Let's test the prompt generation and model output directly\n",
    "from phase1_dataset_building.dataset_manager import DatasetManager, PromptTemplateBuilder\n",
    "from common.models import ModelManager\n",
    "from common.config import ModelConfiguration\n",
    "\n",
    "# Initialize components\n",
    "dm = DatasetManager()\n",
    "dm.load_dataset()\n",
    "template_builder = PromptTemplateBuilder()\n",
    "\n",
    "# Get a specific task that should be easy (task 11 from your dataset)\n",
    "record = dm.get_record(10)  # task_id 11 is at index 10\n",
    "print(f\"Task ID: {record['task_id']}\")\n",
    "print(f\"Problem: {record['text']}\")\n",
    "print(f\"Test cases: {record['test_list']}\")\n",
    "print(f\"Reference solution: {record['code']}\")\n",
    "\n",
    "# Build the prompt\n",
    "prompt = template_builder.build_prompt(record)\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"GENERATED PROMPT:\")\n",
    "print(\"=\"*60)\n",
    "print(prompt)\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Check if the prompt ends with the code initiator\n",
    "print(f\"\\nPrompt ends with '{template_builder.CODE_INITIATOR}': {prompt.endswith(template_builder.CODE_INITIATOR)}\")\n",
    "print(f\"Prompt length: {len(prompt)} characters\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a6884cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test actual model generation\n",
    "from common.config import ModelConfiguration\n",
    "\n",
    "# Create model config\n",
    "model_config = ModelConfiguration(\n",
    "    model_name=\"google/gemma-2-2b\",\n",
    "    temperature=0.0,  # Deterministic\n",
    "    max_new_tokens=200\n",
    ")\n",
    "\n",
    "# Initialize model manager\n",
    "model_manager = ModelManager(model_config)\n",
    "print(\"Loading model...\")\n",
    "model_manager.load_model()\n",
    "\n",
    "# Test generation with our prompt\n",
    "print(\"\\nGenerating code...\")\n",
    "generated_code = model_manager.generate(prompt, max_new_tokens=200, stream=True)\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"GENERATED CODE (raw):\")\n",
    "print(\"=\"*60)\n",
    "print(repr(generated_code))\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Check if the generated code is actually Python code\n",
    "print(f\"\\nGenerated code length: {len(generated_code)} characters\")\n",
    "print(f\"Contains 'def': {'def' in generated_code}\")\n",
    "print(f\"First 100 chars: {generated_code[:100]}\")\n",
    "\n",
    "# Clean up\n",
    "model_manager.cleanup()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "463a382c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Latest parquet file: data/datasets/mbpp_dataset_20250528_152327.parquet\n",
      "Modified: 2025-05-28 15:23:27.750401\n",
      "\n",
      "Dataset shape: (4, 4)\n",
      "Columns: ['task_id', 'generated_code', 'test_passed', 'complexity_score']\n",
      "\n",
      "First 5 records:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>task_id</th>\n",
       "      <th>generated_code</th>\n",
       "      <th>test_passed</th>\n",
       "      <th>complexity_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>\\ndef min_cost(cost, m, n):\\n    # Write your code here\\n    dp = [[0 for i in range(n+1)] for j...</td>\n",
       "      <td>False</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>\\ndef similar_elements(list1, list2):\\n    # Write your code here\\n    return list(set(list1) &amp; ...</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>\\ndef is_not_prime(n):\\n    if n == 1:\\n        return False\\n    for i in range(2, n):\\n       ...</td>\n",
       "      <td>False</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>\\ndef heap_queue_largest(arr,n):\\n    # Write your code here\\n    heap = []\\n    for i in range(...</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   task_id  \\\n",
       "0        1   \n",
       "1        2   \n",
       "2        3   \n",
       "3        4   \n",
       "\n",
       "                                                                                        generated_code  \\\n",
       "0  \\ndef min_cost(cost, m, n):\\n    # Write your code here\\n    dp = [[0 for i in range(n+1)] for j...   \n",
       "1  \\ndef similar_elements(list1, list2):\\n    # Write your code here\\n    return list(set(list1) & ...   \n",
       "2  \\ndef is_not_prime(n):\\n    if n == 1:\\n        return False\\n    for i in range(2, n):\\n       ...   \n",
       "3  \\ndef heap_queue_largest(arr,n):\\n    # Write your code here\\n    heap = []\\n    for i in range(...   \n",
       "\n",
       "   test_passed  complexity_score  \n",
       "0        False                 7  \n",
       "1        False                 1  \n",
       "2        False                 3  \n",
       "3        False                 1  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Test passed distribution:\n",
      "test_passed\n",
      "False    4\n",
      "Name: count, dtype: int64\n",
      "\n",
      "Complexity score distribution:\n",
      "complexity_score\n",
      "1    2\n",
      "3    1\n",
      "7    1\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Check the latest generated parquet file and display first 5 records\n",
    "import pandas as pd\n",
    "import glob\n",
    "import os\n",
    "\n",
    "# Find the most recent parquet file\n",
    "parquet_files = glob.glob('data/datasets/mbpp_dataset_*.parquet')\n",
    "if parquet_files:\n",
    "    latest_file = max(parquet_files, key=os.path.getmtime)\n",
    "    print(f\"Latest parquet file: {latest_file}\")\n",
    "    print(f\"Modified: {pd.Timestamp.fromtimestamp(os.path.getmtime(latest_file))}\")\n",
    "    \n",
    "    # Load the dataset\n",
    "    df = pd.read_parquet(latest_file)\n",
    "    print(f\"\\nDataset shape: {df.shape}\")\n",
    "    print(f\"Columns: {df.columns.tolist()}\")\n",
    "    \n",
    "    # Display the first 5 records as a table\n",
    "    print(\"\\nFirst 5 records:\")\n",
    "    display(df.head())\n",
    "    \n",
    "    # Show value counts for test_passed\n",
    "    print(\"\\nTest passed distribution:\")\n",
    "    print(df['test_passed'].value_counts())\n",
    "    \n",
    "    # Show complexity score distribution if available\n",
    "    if 'complexity_score' in df.columns:\n",
    "        print(\"\\nComplexity score distribution:\")\n",
    "        print(df['complexity_score'].value_counts().sort_index())\n",
    "else:\n",
    "    print(\"No parquet files found in data/datasets/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bb8246e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff128645",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pva_sae",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
