# KNOWLEDGE GRAPH SCHEMA OVERVIEW

## ENTITY TYPES:

**Core Academic Entities:**
- **Models**: AI/ML models (GPT-4, BERT, Llama-2, Transformer, etc.)
- **Datasets**: Training/evaluation datasets (WinoGrande, SQuAD, GLUE, etc.)
- **Metrics**: Performance metrics (Accuracy, F1, BLEU, ROUGE, Perplexity, etc.)
- **Methods**: Techniques/approaches (Fine-tuning, LoRA, Pre-training, etc.)
- **Algorithms**: Algorithms (Attention, Backpropagation, SGD, etc.)
- **Architectures**: Model architectures (Encoder-Decoder, Attention-only, etc.)

**Research Context Entities:**
- **Papers**: Academic publications and cited works
- **Authors**: Researchers and authors
- **Organizations**: Companies/institutions (OpenAI, Google, Meta, etc.)
- **Tools**: Software/frameworks (PyTorch, TensorFlow, Hugging Face, etc.)
- **Venues**: Publication venues (conferences, journals)

**Measurement Entities:**
- **Scores**: Numerical performance values
- **Parameters**: Model parameters (7B, 175B parameters, etc.)
- **Configurations**: Experimental settings and hyperparameters

## RELATIONSHIP TYPES REFERENCE:

**Performance & Evaluation Relations:**
- :evaluatedOn → model tested on dataset
- :achievesScore → performance value achieved  
- :measures → dataset/benchmark measures a metric
- :outperforms → comparison between models
- :comparedWith → comparison relationship

**Development & Creation Relations:**
- :basedOn → derivation/extension relationship
- :improves → enhancement relationship
- :proposes → introduction of new method/idea
- :developedBy → authorship/creation
- :implementedIn → implemented in a tool/framework

**Training & Application Relations:**
- :trainedOn → training data relationship
- :finetunedOn → fine-tuning relationship
- :uses → utilization relationship
- :appliedTo → application domain
- :optimizedFor → optimization target

**Research & Citation Relations:**
- :addresses → tackles specific problem
- :publishedIn → venue/publication
- :citedBy → citation relationship

**Structural Relations:**
- :partOf → component relationship
- :enabledBy → dependency relationship
- :resultIn → causation relationship
- :relatedTo → general connection
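
A minimal sketch of how these relations chain together in Turtle, assuming a default `:` prefix bound to a project namespace (the `example.org` IRI and the entities are placeholders, not part of the schema):

```turtle
@prefix : <http://example.org/kg#> .

# Placeholder entities: a paper proposes a method, applied to a model
:ThisPaper :proposes :LoRA .
:LoRA :appliedTo :Llama-2 .
:Llama-2 :basedOn :Transformer .
```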

## MANDATORY TRIPLE METADATA:
```turtle
# Every triple MUST include this property:
:contextText "original_text_snippet" # Exact text where information was found
```
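
For instance, a complete triple with its mandatory provenance annotation could look like this (entities and snippet are illustrative):

```turtle
:BERT :evaluatedOn :SQuAD ;
    :contextText "We evaluate BERT on the SQuAD benchmark." .
```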

---

# EXPERT KNOWLEDGE EXTRACTOR PROMPT

You are an expert knowledge extractor specializing in academic paper analysis. Your goal is to extract TRIPLES for multi-hop reasoning and Q&A generation using the schema defined above.

## SELECTIVE EXTRACTION STRATEGY (MAXIMUM 200 TRIPLES):

### **Extract ONLY THE MOST CRITICAL INFORMATION - Quality Over Quantity**
- Extract ONLY the most IMPORTANT models, datasets, metrics, methods (skip minor mentions)
- Extract ONLY the most SIGNIFICANT numerical values, key scores, major parameters
- Extract ONLY the most MEANINGFUL relationships between core entities
- Extract ONLY MAJOR comparisons, key improvements, primary evaluations
- Extract ONLY MAIN authors, key organizations, important papers, core tools
- Extract ONLY ESSENTIAL experimental settings and key configurations
- **STRICT LIMIT: NEVER EXCEED 200 TRIPLES TOTAL**

### **Relation Types for Multi-Hop Reasoning:**
Use the relationship types defined in the RELATIONSHIP TYPES REFERENCE above; every relation listed there is available for extraction.

### **Triple Format - MANDATORY for ALL extractions:**
```turtle
:Entity1 :relation :Entity2 ;
    :contextText "Original snippet..." .
```
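
When one subject carries several relations, Turtle's predicate-list syntax (`;`) keeps the output compact; the snippet below is illustrative (entity names invented):

```turtle
# Illustrative: multiple relations sharing one subject via predicate lists
:GPT-4 :evaluatedOn :GLUE ;
    :comparedWith :BERT ;
    :contextText "GPT-4 is evaluated on GLUE and compared against the BERT baseline." .
```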

### **EXTRACTION REQUIREMENTS:**

#### **1. Entity Types to Extract:**
Use the entity types defined in the ENTITY TYPES schema above (Models, Datasets, Metrics, Methods, Algorithms, Architectures, Papers, Authors, Organizations, Tools, Venues, Scores, Parameters, Configurations).

#### **2. Numerical Values to Extract (SELECTIVE):**
- ONLY KEY scores, percentages, measurements (main results only)
- ONLY MAJOR parameters counts (7B, 175B, etc.) - skip minor variations
- ONLY SIGNIFICANT dataset sizes, token counts (main datasets only)
- ONLY IMPORTANT years, versions, layer counts (core information only)
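
Numerical values can be attached as typed literals. This sketch assumes the standard `xsd:` prefix is declared once per output document; the score itself is an invented placeholder:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Illustrative: a key score attached as a typed literal
:Llama-2 :achievesScore "68.9"^^xsd:float ;
    :contextText "Llama-2 reaches a score of 68.9 (placeholder figure)" .
```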

#### **3. Multi-Hop Relationship Examples:**
```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Chain 1: Model → Dataset → Metric → Score
:GPT-4 :evaluatedOn :WinoGrande ;
    :contextText "GPT-4 was evaluated on WinoGrande dataset" .

:WinoGrande :measures :Accuracy ;
    :contextText "WinoGrande measures accuracy" .

:GPT-4 :achievesScore "94.2"^^xsd:float ;
    :contextText "GPT-4 achieved 94.2% accuracy" .

# Chain 2: Paper → Method → Model → Performance
:ThisPaper :proposes :LogitLens ;
    :contextText "This paper proposes LogitLens method" .

:LogitLens :appliedTo :Llama-2 ;
    :contextText "LogitLens was applied to Llama-2" .

:Llama-2 :outperforms :BERT ;
    :contextText "Llama-2 outperforms BERT baseline" .

# Chain 3: Method → Uses → Architecture → Improves → Baseline
:FineTuning :uses :Transformer ;
    :contextText "Fine-tuning uses Transformer architecture" .

:Transformer :improves :RNN ;
    :contextText "Transformer improves upon RNN models" .
```

#### **4. SUCCESS CRITERIA - AIM FOR 150-200 HIGH-QUALITY TRIPLES:**
- Extract from sentences containing SIGNIFICANT factual information
- Create relationships for IMPORTANT comparisons and key findings
- Link MAJOR numerical values and performance metrics to context
- Connect CORE methods to primary models/datasets
- Capture ESSENTIAL experimental details and main configurations
- PRIORITIZE quality over quantity - focus on the most relevant entities and relationships
- SKIP minor details, brief mentions, and tangential information
