INTROSPECTIVE GROWTH: AUTOMATICALLY ADVANCING LLM EXPERTISE IN TECHNOLOGY JUDGMENT
=================================================================================

This is the code and data repository for the paper:
“Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment.”

The repository contains two Python scripts and a sample dataset that together
(1) generate comprehension questions for patent abstracts and
(2) answer those questions with the help of external scientific literature.

-----------------------------------------------------------------------
FILES
-----------------------------------------------------------------------
1. question_generation.py
2. question_answer.py
3. patent_pair_sample.csv (example dataset)

-----------------------------------------------------------------------
1. question_generation.py
-----------------------------------------------------------------------
Purpose:
    Produce six prerequisite-knowledge questions per patent abstract.

Key points:
    • Use large-language-model (LLM) APIs (OpenAI or OpenRouter).
    • Two embedded prompt templates:
        – Remembering (promptQ1) - low-level: recall key terms and components.
        – Understanding (promptQ2) - high-level: explain how components interact.
-----------------------------------------------------------------------
2. question_answer.py
-----------------------------------------------------------------------
Purpose:
    Answer the generated questions with retrieval-augmented generation.

Key points:
    • Load cleaned arXiv papers (full text) from local Parquet files.
    • Embed each question and text chunk with SPECTER2; similarity search via FAISS, and retrieval for relevant scientific knowledge.
    • Supply those chunks plus the question to an LLM to obtain an answer.
    • Generate a “placebo answer” of identical length by cutting a
      random snippet from the same paper (for robustness testing).
-----------------------------------------------------------------------
3. Dataset: patent_pair_sample.csv
-----------------------------------------------------------------------
Description:
    Paired patent records with semantic similarity scores and metadata.

Most useful columns:
    patent_id
    nearest_neighbor_patent_id
    similarity     (cosine similarity 0-1 using patent-specific pretrained embeddings)
    patent_abstract
    nearest_neighbor_patent_abstract
    patent_title
    nearest_neighbor_patent_title
    patent_date
    nearest_neighbor_patent_date

    cpc section, classes, subclasses, and other features are included for 
    both the focal patent and the nearest neighboring patent.

Uses:
    • Selecting representative patents for question generation/answering.
    • Nearest-neighbor prompting and comparative evaluation.

-----------------------------------------------------------------------
CITATION
-----------------------------------------------------------------------
Citation information will be added upon publication.

-----------------------------------------------------------------------
CONTACT
-----------------------------------------------------------------------
We commit to creating a GitHub repository and fully open-source our patent dataset. Please open an issue in the repository or contact the authors
after the paper is accepted.