{"url":"https://www.perplexity.ai/search/eb31d5a7-705b-49b6-b096-392ed76720d7","title":"Perplexity","description":"Perplexity is a free AI-powered answer engine that provides accurate, trusted, and real-time answers to any question.","authors":[],"published_date":null,"domain":"perplexity.ai","is_paywall":false,"is_cached":false,"content":"### Files Mentioned\n* `fever_pretrained_gpt2_experiment.py`\n* `fever50k_full_20260428.md`\n* `modal_fever_run.py`\n* `run_fever_pretrained_gpu.py`\n\n### Code Blocks\n\n**Gradient blocking for non-evidence tokens (Python)**\n```python\n# inside forward(), after hidden_states = outputs.last_hidden_state\nif pooling_mode == \"evidence_only_pooling\":\n    # mask out claim + label (+ optionally everything after [SEP])\n    masked_hs = hidden_states.clone()\n    for i in range(B):\n        # sep_pos[i] is index of [SEP]\n        # zero out positions >= sep_pos[i] (claim, [LABELSEP], label, EOS, padding)\n        start = sep_pos[i]\n        masked_hs[i, start:, :] = 0.0\n    pooled = self._pool(masked_hs, pooling_mode, sep_pos, labelsep_pos, attention_mask)\nelse:\n    pooled = self._pool(hidden_states, pooling_mode, sep_pos, labelsep_pos, attention_mask)\ncons_logits = self.consistency_head(pooled)\n```\n\n**Recommended Modal Run Command (Bash)**\n```bash\npython modal_fever_run.py \\\n --model_name gpt2 \\\n --train_samples 50000 \\\n --eval_samples 5000 \\\n --max_seq_len 256 \\\n --epochs 5 \\\n --batch_size 16 \\\n --lr 5e-5 \\\n --consistency_loss_weight 0.5 \\\n --freeze_lower_layers_epochs 1 \\\n --seed 42 \\\n --output_stem fever50k_tightened_diag \\\n --variants no_consistency_loss,evidence_only_pooling,evidence_only_strict,claim_only_pooling,evidence_only_random_labels\n```\n\n### Metrics and Data Points\n\n**Evaluation Metrics**\n| Metric Name | Description |\n| :--- | :--- |\n| `cls_claim_acc` | Classification accuracy of the consistency head. |\n| `gen_claim_acc` | Accuracy of the language model generating the label token. 
|\n| `cfact_cls_follows_swap` | How often the classifier follows the label of the swapped evidence. |\n| `cfact_cls_follows_orig` | How often the classifier follows the original label despite the swapped evidence. |\n| `matched_cfact_cls_follows_swap` | Counterfactual swap metric using lexically similar claims with different labels. |\n| `matched_cfact_cls_follows_orig` | How often the classifier follows the original label in matched-claim swaps. |\n| `Δcls_claim_acc` | Difference in accuracy vs. the no-consistency baseline. |\n| `Δcfact_cls_follows_swap` | Change in counterfactual swap behavior vs. the baseline. |\n| `Δcfact_cls_follows_orig` | Change in original-label-following behavior vs. the baseline. |\n\n**Labels**\n* SUPPORTS\n* REFUTES\n* NEI (Not Enough Info)\n\n**Experimental Parameters**\n| Parameter | Value |\n| :--- | :--- |\n| Model Name | gpt2 |\n| Training Samples | 50,000 |\n| Evaluation Samples | 5,000 |\n| Max Sequence Length | 256 |\n| Epochs | 5 |\n| Batch Size | 16 |\n| Learning Rate | 5e-5 |\n| Consistency Loss Weight | 0.5 |\n| Freeze Lower Layers Epochs | 1 |\n| Seed | 42 |\n| Output Stem | fever50k_tightened_diag |\n\n### Experimental Variants\n* `no_consistency_loss`\n* `evidence_only_pooling`\n* `evidence_only_strict` (zeroes non-evidence hidden states)\n* `claim_only_pooling`\n* `evidence_only_random_labels` (consistency head trained on permuted labels)\n* `last_four_tokens_pooling` (fixed tail span baseline)\n\n### Project Components and Steps\n\n**General Mechanism Pattern**\n1. LM emits prose plus inline claims.\n2. Claims checked by strong verifier/oracle.\n3. Prose quality improves via shared hidden states and consistency loss.\n4. Fixed claim ontology (optional).\n\n**Recommended Sequence**\n1. FEVER (NLP validation)\n2. Another fact/NLI benchmark\n3. Structured-domain oracle (KataGo or code execution)\n4. Ablations on claim format and evidence noise\n\n**Best Next Steps**\n1. Run an out-of-domain verifier task.\n2. 
Test a different claim format (varying expression/length).\n3. Stress the verifier boundary (adversarial swaps, noisy retrieval).\n4. Measure transfer (coherence, factual consistency, calibration).\n5. Scale to a second real domain (medical QA, code, math).\n\n### Data Tables and Snippets\n* **CSV/JSON/JSONL Snippets:** Not available in page.\n* **Downloadable Artifacts:** Not available in page.\n* **Metric Values:** No raw result values provided (only target parameters for future runs).","error":null,"saved_to":"/home/user/.pplx/search/fetch/0c312864.json"}
