This snapshot uses 10 stationary sessions and fixed seeds for speed; the full pipeline reports 95% confidence intervals (CI95).



## TL;DR
We identify a task that is **super easy for humans** but on which all LLMs, from early 0.1B models to the most modern 600B+ ones (GPT-5, Grok-4, Gemini, DeepSeek, etc.), consistently **fail in the same way**. This pinpoints the **core challenge of MRCR**.



Multi-round co-reference (MRCR) in context interference:

Classic long-context benchmarks often test retrieving a single "needle" from a massive "haystack." MRCR raises the bar by placing many similar needles in the same context, requiring models to select the correct item (up to 8 needles), and shows that all LLMs struggle with this task.

- OpenAI MRCR dataset
- DeepMind MRCR paper: Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries

## Our test takes this one step further
If MRCR is "multiple needles in a haystack", we show the **haystack isn't necessary** to expose core retrieval failures. By isolating and precisely controlling the number of similar, co-referenced items (we repeatedly update the value of the same keys in key–value pairs), our paradigm directly measures how interference from up to 400 needles limits retrieval accuracy, even without any "haystack" as background. LLMs cannot reliably perform a task as simple as "retrieving the last value" of each co-referenced item.

- We observe a clear log-linear decline in accuracy as the number of interfering updates grows (i.e., co-references increase).
- The effect holds across the transformer models we tested. 



## Key–value update paradigm (what the model sees)
We present a classical key–value experiment: the same key is updated multiple times. The model is then asked to return the current (last) value for each key. This isolates co-reference interference without requiring extremely long distractor contexts.

Minimal example (1 key, N updates each):
```

Key1: Value_1
Key1: Value_2
......
Key1: Value_N



Question: 

What is the current value (the last value) for Key1?
```

Expected: 
```
The current value of Key1 is Value_N. 
```
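The format above is simple enough to generate in a few lines. The following is a minimal sketch, not the dataset's actual generator: the function name `build_single_key_prompt` and the `Value_i` naming are assumptions made for illustration.

```python
def build_single_key_prompt(key: str, n_updates: int) -> tuple[str, str]:
    """Return (prompt, expected_last_value) for one key updated n_updates times."""
    if n_updates < 1:
        raise ValueError("n_updates must be at least 1")
    # Each line overwrites the previous value of the same key.
    updates = [f"{key}: Value_{i}" for i in range(1, n_updates + 1)]
    question = f"What is the current value (the last value) for {key}?"
    prompt = "\n".join(updates) + "\n\nQuestion:\n\n" + question
    return prompt, f"Value_{n_updates}"


prompt, expected = build_single_key_prompt("Key1", 3)
print(prompt)
print("Expected:", expected)
```

Only the final update line determines the expected answer; every earlier line acts purely as interference.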


## Results:
None of the tested SOTA LLMs **can reliably retrieve** Value_N. Their answers are spread across Value_1 to Value_N, and **as N increases**, the **answers skew** increasingly toward **Value_1**.



## Note on dataset scale: 
N ranges from 1 to 400. We place up to 46 such groups (key1..key46) in one context and then ask the model to retrieve only the last value of each key. All values are distinct, so for any reply we can tell how far the returned value is from the correct one.
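Because every value for a key is different, a returned value maps to exactly one update position. The sketch below illustrates that idea under stated assumptions: `answer_distance` is a hypothetical helper, and the example values are made up rather than taken from the dataset.

```python
from typing import List, Optional


def answer_distance(update_history: List[str], model_answer: str) -> Optional[int]:
    """Number of updates between the model's answer and the true last value.

    0 means the model returned the last value; len(history) - 1 means it
    returned the very first one. Returns None if the answer never appeared
    in this key's history.
    """
    if model_answer not in update_history:
        return None
    # Values are unique within a key, so index() identifies a single position.
    position = update_history.index(model_answer)
    return len(update_history) - 1 - position


history = ["apple", "river", "stone", "cloud"]  # made-up values, oldest first
print(answer_distance(history, "cloud"))  # 0: the correct, last value
print(answer_distance(history, "apple"))  # 3: the very first value
print(answer_distance(history, "table"))  # None: never shown for this key
```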



## Why this is challenging for LLMs:
Multiple co-references to the same key cause strong interference:

1. As the number of updates per key (N) increases, LLMs **confuse earlier values** with the most recent one and fail to retrieve the last value. (Dataset column: exp_updates)
2. We intentionally ask only for the last value to keep the search difficulty low, showing that all LLMs lose track because of **context interference**.


## On Randomization
We **randomize** the update order after generation to mimic unpredictable changes, interleaving updates across different keys (i.e., updates to different keys occur back-to-back rather than in contiguous blocks). Counterintuitively, this often helps LLMs, since the final update usually lands near the end of the context. In the sequential setting, most smaller (under ~600B) models lose track after only a few updates, even with 5–8k-token inputs.
See the **Sequential / Non-Randomized Mode** section later in this document, where many LLMs' performance still **collapses** with only a **small amount of input (5–8k tokens)**.
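One way to picture the interleaving is sketched below. This is an illustration only, not the dataset's generator: `interleave_updates` is a hypothetical name, and the sketch assumes a full shuffle of all updates, with the ground truth derived from the shuffled stream (whether the real pipeline preserves each key's internal order is not stated here).

```python
import random
from typing import Dict, List, Tuple


def interleave_updates(
    per_key_updates: Dict[str, List[str]], seed: int = 0
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Shuffle all (key, value) updates together and recompute the ground truth."""
    stream = [(k, v) for k, values in per_key_updates.items() for v in values]
    rng = random.Random(seed)  # fixed seed keeps the stream reproducible
    rng.shuffle(stream)
    # The "current" value of a key is whatever appears last in the final stream.
    last_value: Dict[str, str] = {}
    for key, value in stream:
        last_value[key] = value
    return stream, last_value


updates = {"Key1": ["a1", "a2", "a3"], "Key2": ["b1", "b2", "b3"]}  # made-up values
stream, truth = interleave_updates(updates, seed=42)
for key, value in stream:
    print(f"{key}: {value}")
print("Ground truth:", truth)
```

Computing the ground truth after shuffling keeps the expected answer consistent with what the model actually sees, whatever order the shuffle produces.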



## Cognitive science connection: Proactive Interference (PI)
Our test adopts the **classic proactive interference** paradigm from cognitive science, a **foundational method** for studying **human working memory**. PI shows how older, similar information disrupts encoding and retrieval of newer content. Bringing this approach to LLMs allows us to directly measure how interference, not just context length, limits memory and retrieval.

- Interestingly, humans are **also affected by these three dimensions** (updates per key, number of concurrent keys, and value length), but far less than LLMs. Humans consistently outperform even the latest and largest models on this task.


## SAME Log-linear Decline of Accuracy for ALL SOTA LLMs Tested (2019–2025)
- Humans: near-ceiling accuracy (99%+) on this controlled task across conditions (see paper for protocol and exact numbers).
- LLMs: accuracy declines approximately log-linearly with the number of updates per key and with the number of concurrent update blocks (details, plots, and model list in our paper).
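A log-linear decline here is read as accuracy falling roughly linearly in the logarithm of the update count, i.e. accuracy ≈ intercept + slope · ln(N) with a negative slope. The sketch below fits that form by ordinary least squares; it is an assumption about how one might check the trend, not the paper's analysis code, and the data points are synthetic, generated from a known line rather than measured.

```python
import math
from typing import List, Tuple


def fit_log_linear(n_updates: List[int], accuracies: List[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy = intercept + slope * ln(n_updates)."""
    xs = [math.log(n) for n in n_updates]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic points built from intercept=1.0, slope=-0.1. NOT measured results.
ns = [1, 2, 4, 8, 16, 32]
accs = [1.0 - 0.1 * math.log(n) for n in ns]
intercept, slope = fit_log_linear(ns, accs)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

On real results, the fitted slope gives a single number for how quickly a model degrades as interference grows, which makes models easier to compare than raw accuracy at one setting.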


## Full detail of 3 tests
Beyond the update-count test above (dataset column: exp_updates), this dataset includes 2 additional dimensions of evaluation that show current LLMs' limits. Tested SOTA models include GPT-5, Grok-4, DeepSeek, Gemini 2.5 Pro, Mistral, Llama 4, etc.

- Experiment 2 (dataset column: exp_keys).
LLMs' capacity to resist interference and their accuracy in retrieving the last value both decrease log-linearly as the number of concurrent keys (n_keys) grows.
This experiment fixes everything else and varies only n_keys. (Two test sets are provided: one fixes the number of updates at 350, and the other fixes it at 125 as a lower-difficulty setting.)


- Experiment 3 (dataset column: exp_valuelength).
Retrieval accuracy also decreases log-linearly as value length grows, and the decline is rapid across LLMs (GPT-5 and Grok-4 decline similarly to GPT-2).
This experiment fixes everything else and varies only value_length.
Two test sets are provided: one fixes the number of updates per key at 20, and the other fixes it at only 4 as a low-difficulty setting.

(This test is hard: even with only 4 updates per key, all LLMs fail to retrieve the last value, which we chose as the target precisely to keep the search difficulty low. Retrieving a value at any other position yields even lower performance.)

## One more thing: Sequential / Non-Randomized Mode (Last but interesting)
This mode lives in a separate dataset file (dataset column: extra_exp_updates_randomoff).
It uses the exact format shown in this document, without randomization. As in the experiment above, we fix everything else and vary only the number of updates, but with randomize_mode turned off (column: randomize_mode).
- This separate dataset consists of 46 of the following blocks in non-randomized order:



Key1: Value_1
Key1: Value_2
......
Key1: Value_N


Key2: Value_1
Key2: Value_2
......
Key2: Value_N

....all the way to key46 block

Question: 

What is the current value (the last value) for key1 key2....key46?


**Result**  
- In this mode, **most modern LLMs (all those under 600B) still confuse the last value with earlier values after only 50–100 updates** (fewer than 12–25k tokens, far less than any LLM's context window).  
- Models quickly confuse earlier values with the most recent one.  
- This is the **original and simplest test**.
- Performance for this mode is also **reported in our paper (Figure 4).**
- The sequential key–value update tests show a **step-like failure pattern**: retrieval accuracy remains near-perfect as interfering information is added in strictly sequential order until a model-specific threshold is reached, after which **performance drops rapidly to near-zero**.
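Given per-setting accuracies from the sequential mode, a model-specific threshold of this kind can be located with a simple scan. The following is a rough sketch under stated assumptions: `find_collapse_threshold` and the `floor=0.5` cutoff are hypothetical choices for illustration, and the numbers are synthetic, not results from the paper.

```python
from typing import List, Optional


def find_collapse_threshold(
    n_updates: List[int], accuracies: List[float], floor: float = 0.5
) -> Optional[int]:
    """Return the smallest update count whose accuracy is below `floor`, or None."""
    # Sort by update count so the scan runs from easy to hard settings.
    for n, acc in sorted(zip(n_updates, accuracies)):
        if acc < floor:
            return n
    return None


# Synthetic step-shaped curve for illustration only. NOT measured results.
ns = [10, 25, 50, 100, 200]
accs = [0.99, 0.98, 0.97, 0.08, 0.02]
print(find_collapse_threshold(ns, accs))  # 100
```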

# Dataset File List

The dataset currently includes two files:

- **core.parquet** → Main dataset (randomized updates). Recommended as the primary/SOTA comparison setting; all tested models fail to reliably retrieve the last value.
- **sequential_additional.parquet** → Sequential mode (non-randomized, strict per-key ordered update blocks). Trivial for humans yet still challenging for many LLMs; smaller (all <600B) models are especially affected, with proactive-interference effects clearly exposed (even in short contexts, ~5–8k tokens).


## Quick Start - Evaluate Your Model

```python
from huggingface_hub import hf_hub_download
import pandas as pd
from openai import OpenAI
import json
import tiktoken
from pathlib import Path

# Set accordingly
MAX_CONTEXT_WINDOW = 1000000
MODEL = ""  # set to the name of the model you want to evaluate


# Prefer local, offline file
ROOT = Path(__file__).resolve().parent
DATASET_PATH = ROOT / "Static_Dataset_Version" / "core.parquet"

# Load dataset (DataFrame)
dataset = pd.read_parquet(DATASET_PATH)


client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")

def extract_pieces_response_to_dict(model_output, probe_target="current"):
    """
    Extract the dictionary of key-value pairs from the model output.
    First extract using verbal language match, then using colon match.
    Merge the two dictionaries, prioritizing keys from the verbal match.
    """
    import re
    
    if len(model_output) == 0:
        return None
    
    if "error code" in model_output.lower():
        return None
    
    if model_output.startswith("error") or model_output.startswith("Error"):
        return None

    if (re.search(r'\berror\b', model_output, re.IGNORECASE)) and (len(model_output) < 680):
        return None

    # Remove backslashes and asterisks
    model_output = re.sub(r'\\(?!n)', '', model_output)
    model_output = re.sub(r'\*', '', model_output)

    dict_verbal_match = _extract_verbal_matches(model_output, probe_target)
    dict_colon_match = _extract_colon_matches(model_output)

    dict_merged = dict_colon_match.copy()
    dict_merged.update(dict_verbal_match)
    dict_merged.pop("key", None)

    return dict_merged

def _extract_verbal_matches(model_output, probe_target="current"):
    """Extract key-value pairs using verbal patterns like 'The current value of X is Y'"""
    import re
    
    patterns = [
        r"(?:the)?\s*(?:most recent|final|last|latest|current|up-to-date|asked|queried|specified)\s+(?:value|word|term)?(?:s)?(?:\s+\w+){0,1}\s+(?:with|for|of|to)?\s+(?:the )?(?:category|key)?\s*([\"'\[\<]?\w+(?:\s+\w+)?[\"'\]\>]?)\s+(?:is|was)(?:\s*:\s*)?\s+([\"'\[\<]?\w+(?:\s+\w+)?[\"'\]\>]?)(?=\n|[,.;:]|$)",
    ]
    
    dict_response = {}
    for pattern in patterns:
        matches = re.findall(pattern, model_output, re.IGNORECASE | re.DOTALL)
        for match in matches:
            if len(match) >= 2:
                key, value = match[0], match[1]
                key = re.sub(r'[\*\'"""''\[\]\{\}\(\)\<\>]', '', key).strip()
                value = re.sub(r'[\*\'"""''\[\]\{\}\(\)\<\>]', '', value).strip()
                if key and value:
                    dict_response[key] = value
    return dict_response

def _extract_colon_matches(model_output):
    """Extract key-value pairs using colon-separated patterns"""
    import re
    
    # Simple colon-based extraction
    dict_response = {}
    lines = model_output.split('\n')
    for line in lines:
        if ':' in line:
            parts = line.split(':', 1)
            if len(parts) == 2:
                key = re.sub(r'[\*\'"""''\[\]\{\}\(\)\<\>]', '', parts[0]).strip()
                value = re.sub(r'[\*\'"""''\[\]\{\}\(\)\<\>]', '', parts[1]).strip()
                if key and value:
                    dict_response[key] = value
    return dict_response

def grade_pi_response(response, answer_formatted):
    """
    Compute per-row accuracy: fraction of tracked keys answered with the last value.
    - Parses the ground truth JSON string (answer_formatted) into {key: last_value}.
    - Parses model output into {key: value} using robust extractors.
    - Returns (# of keys with exact value match) / (# of keys in ground truth).
    """
    try:
        # Parse ground truth JSON
        ground_truth = json.loads(answer_formatted)
        
        # Extract key-value pairs from model response using parsing functions
        response_dict = extract_pieces_response_to_dict(response, probe_target="current")
        # Anything that did not parse into a dict (including None) scores zero
        if not isinstance(ground_truth, dict):
            return 0.0
        if not isinstance(response_dict, dict):
            return 0.0
        
        keys = list(ground_truth.keys())
        if len(keys) == 0:
            return 0.0
        correct = sum(1 for k in keys if response_dict.get(k) == ground_truth.get(k))
        return correct / len(keys)
    except Exception:
        # Malformed ground truth or unparsable output counts as a miss
        return 0.0

def n_tokens(messages):
    """Count tokens in messages."""
    return sum([len(enc.encode(m["content"])) for m in messages])

# Evaluate your model (an AUC/weighted score is recommended for reporting)
results = []
for index, row in dataset.iterrows():
    messages = json.loads(row["prompt"])
    if n_tokens(messages) > MAX_CONTEXT_WINDOW:
        continue
        
    completion = client.chat.completions.create(
        model=MODEL,
        messages=messages,
    )
    # message.content can be None; treat that as empty text so parsing scores it 0.0
    response = completion.choices[0].message.content or ""
    accuracy = grade_pi_response(response, row["answer_formatted"])
    parsed = extract_pieces_response_to_dict(response, probe_target="current")
    
    # Store result with experiment info and raw/parsed responses (useful for axes + error analysis)
    results.append({
        'experiment': row['experiment'],
        'session_id': row['session_id'],
        'run_id': row.get('run_id', None),
        'accuracy': accuracy,
        'index': index,
        'response_text': response,
        'parsed_response': parsed,
    })
    
    print(f"Row {index} ({row['experiment']}, session {row['session_id']}): {accuracy}")

# Calculate accuracy by experiment (pandas is already imported at the top)
results_df = pd.DataFrame(results)

# Group by experiment and calculate mean accuracy
experiment_accuracy = results_df.groupby('experiment')['accuracy'].agg(['mean', 'count']).reset_index()
experiment_accuracy['accuracy_percent'] = experiment_accuracy['mean'] * 100

print("\n=== Accuracy by Experiment ===")
for _, row in experiment_accuracy.iterrows():
    print(f"{row['experiment']}: {row['accuracy_percent']:.1f}% ({row['count']} samples)")

# Average across runs (e.g., 10 sessions via run_id); skip if the dataset has no run_id values
if 'run_id' in results_df.columns and results_df['run_id'].notna().any():
    # Mean accuracy per experiment per run, then average across runs
    per_run = results_df.groupby(['experiment', 'run_id'])['accuracy'].mean().reset_index()
    exp_avg = per_run.groupby('experiment')['accuracy'].mean().reset_index()
    exp_avg['accuracy_percent'] = 100 * exp_avg['accuracy']
    print("\n=== Experiment accuracy averaged across runs (run_id) ===")
    for _, r in exp_avg.iterrows():
        print(f"{r['experiment']}: {r['accuracy_percent']:.1f}% (averaged over runs)")
```
