{
  "sims": {
    "google_maps": {
      "nnetnav_live_site=google_maps_num_tasks=75_portion=2": [
        "Both datasets require location-based searches with specific criteria (e.g., proximity to landmarks, hours of operation).",
        "Tasks involve filtering results by real-time or dynamic conditions (e.g., 'open now,' traffic status).",
        "Navigation includes multi-step actions like searching, selecting, and retrieving detailed information (e.g., directions, pricing).",
        "Queries demand integration of user reviews/ratings (e.g., 'highly-rated,' '4.8 stars or higher').",
        "Tasks require parsing contextual details (e.g., accessibility features, parking availability, transit schedules).",
        "Both involve route planning with transportation modes (e.g., walking, driving, public transit).",
        "Searches focus on specific business categories (e.g., restaurants, hotels, parks, museums).",
        "Tasks necessitate validation of operational details (e.g., closing times, reservation availability).",
        "Queries target granular geographic scopes (e.g., neighborhoods, intersections, landmarks).",
        "Both require interaction with embedded tools (e.g., booking systems, direction planners, filters)."
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=3": [
        "Tasks involve searching for specific locations or services (e.g., hotels, restaurants, transit stops) using filters like ratings, hours, or accessibility.",
        "Tasks require navigating to retrieve real-time or contextual information (e.g., traffic, weather, operating hours).",
        "Tasks include generating directions or routes between locations (e.g., walking, driving, public transit).",
        "Tasks involve verifying user reviews, ratings, or comments for places or services.",
        "Tasks require filtering results by geographic proximity (e.g., \"nearest,\" \"closest\").",
        "Tasks include requests for accessibility features (e.g., wheelchair-accessible routes, parking, amenities).",
        "Tasks involve booking, reserving, or interacting with dynamic elements (e.g., hotel stays, restaurant reservations).",
        "Tasks require parsing structured information (e.g., pricing tiers, service levels, amenity lists).",
        "Tasks focus on multi-step actions (e.g., search \u2192 filter \u2192 retrieve \u2192 share).",
        "Tasks frequently specify granular location constraints (e.g., neighborhoods, intersections, landmarks)."
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=1": [
        "Tasks require searching for specific locations or services with multiple criteria (e.g., hours, ratings, accessibility)",
        "Both datasets involve filtering results by real-time availability (e.g., 'open now', date-specific availability)",
        "Navigation tasks frequently combine spatial queries with informational needs (e.g., directions + reviews + amenities)",
        "Users need to compare/verify multiple attributes simultaneously (price, ratings, accessibility features)",
        "Tasks require understanding layered UI elements (search bars, filters, category carousels, map layers)",
        "Both involve parsing and synthesizing information from multiple interface components (place cards, reviews, maps)",
        "Tasks demand temporal awareness (opening hours, traffic conditions, transit schedules)",
        "Requires interaction with dynamic content controls (pagination, expandable menus, pop-up details)",
        "Both datasets emphasize accessibility requirements in navigation (wheelchair routes, EV charging locations)",
        "Tasks involve multi-step operations combining search, comparison, and action (find->verify->navigate->share)"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=0": [
        "Tasks require search functionality with specific filters (e.g., open hours, ratings, accessibility).",
        "Both involve location-based queries (e.g., 'near X,' 'closest to Y').",
        "Tasks demand parsing real-time or dynamic information (e.g., traffic, weather, operational hours).",
        "Navigation between two points (e.g., directions, routes) is a common objective.",
        "User-generated content (e.g., reviews, ratings, comments) is frequently referenced.",
        "Tasks require interaction with map UI elements (e.g., buttons for search, directions, layers).",
        "Queries often involve price or budget constraints (e.g., hotels under $160).",
        "Accessibility features (e.g., wheelchair access, parking specifics) are explicitly requested.",
        "Time-sensitive parameters (e.g., 'open now,' check-in/check-out dates) are critical.",
        "Tasks involve extracting hierarchical or nested information (e.g., hotel amenities, trail details)."
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=4": [
        "Both datasets involve tasks requiring search functionality for specific locations or points of interest (e.g., restaurants, hotels, landmarks).",
        "Tasks in both datasets frequently require filtering results by criteria like ratings (e.g., \"highly-rated,\" \"ratings greater than 4.8\"), accessibility, or operational status (e.g., \"open now\").",
        "Navigation tasks in both datasets emphasize route planning (e.g., walking, biking, public transit) between locations.",
        "Both datasets include tasks requiring extraction of detailed information about places, such as amenities, hours, pricing, or user reviews.",
        "Tasks in both datasets involve verifying real-time or dynamic data (e.g., traffic, availability, open/closed status).",
        "Both datasets require interaction with location-based services like directions, distance calculations, or travel time estimations.",
        "Tasks in both datasets often require identifying nearby amenities (e.g., parking, EV charging stations, transit stops) relative to a specific location.",
        "Both datasets include tasks that demand parsing structured information (e.g., menus, reservation systems, price comparisons).",
        "Tasks in both datasets involve validating or cross-referencing user-generated content (e.g., reviews, photos, ratings).",
        "Both datasets prioritize accessibility and usability features, such as wheelchair-accessible routes or facilities, in task requirements."
      ]
    },
    "github": {
      "nnetnav_live_site=github_num_tasks=71_portion=3": [
        "Tasks require searching/filtering repositories based on stars, recency, or language criteria",
        "Navigation involves locating GitHub product sections like Copilot, Security, or Pricing pages",
        "Tasks require comparing feature differences between GitHub plans (Free vs Pro vs Enterprise)",
        "Queries involve finding specific documentation elements (FAQ sections, how-to guides, or release notes)",
        "Tasks require identifying security-related information (vulnerabilities, security features, or compliance details)",
        "Navigation paths include educational resources like GitHub Skills courses or student developer packs",
        "Tasks involve checking version histories (latest releases, commit changes, or contributor activity)",
        "Actions require account management navigation (sign-up flows, privacy settings, or subscription management)",
        "Queries demand identification of popularity metrics (most starred repos, trending projects, or top contributors)",
        "Tasks require cross-referencing information between product features, pricing, and customer case studies"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=2": [
        "Tasks require searching/filtering repositories by criteria (stars, update date, language)",
        "Navigation involves comparing pricing plans (e.g. Free vs Pro, Copilot vs Enterprise)",
        "Tasks require accessing documentation pages (e.g. GitHub Skills, GraphQL API, Copilot setup)",
        "Interaction with security-related sections (advisories, Advanced Security features)",
        "Usage of GitHub's native search functionality to locate repositories/topics",
        "Tasks involve finding Copilot information (pricing, data usage, features)",
        "Requires navigation to customer success stories/use cases",
        "Filtering by programming language (Python, JavaScript, C++) is common",
        "Tasks involve checking repository metadata (stars, contributors, last commit details)",
        "Navigation through feature pages (Actions, Projects, Codespaces) is required"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=0": [
        "Tasks require filtering repositories by programming language, stars, and recency of updates.",
        "Navigation involves locating specific GitHub features like Copilot, Advanced Security, and Actions.",
        "Users must compare pricing tiers across Free/Pro plans and Copilot subscriptions.",
        "Tasks demand interaction with educational resources including GitHub Skills and Classroom.",
        "Search for customer success stories from partner organizations is required.",
        "Security-focused tasks involve analyzing vulnerability fixes, secret scanning, and compliance documentation.",
        "Account management actions like sign-up flows and privacy policy review are present.",
        "Tasks require accessing technical documentation for products like Copilot and CLI.",
        "Navigation through project management tools (Issues, Projects, Discussions) is essential.",
        "Users must interpret CI/CD metrics and automation features like workflow runs."
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=4": [
        "Tasks require searching/filtering repositories based on criteria like stars, update date, and programming language",
        "Navigation to GitHub's pricing section to compare plan features (Free vs Pro vs Enterprise)",
        "Tasks involve locating GitHub Copilot-related information (features, pricing, plans)",
        "Requires accessing GitHub Security features documentation (e.g. Advanced Security, Dependabot)",
        "Tasks demand finding customer success stories or case studies",
        "Requires checking repository details including recent commits and file changes",
        "Tasks involve account management actions (sign-up, trials, plan upgrades)",
        "Navigation through GitHub's educational resources (Skills, Classroom, documentation)",
        "Tasks require comparing storage limits and package allowances across plans",
        "Demands interaction with GitHub's API/integration documentation and developer tools"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=1": [
        "Tasks require searching/filtering repositories by criteria like stars, update dates, and programming languages",
        "Navigation involves accessing GitHub's pricing pages to compare plan features and storage limits",
        "Tasks require locating specific documentation sections (e.g., GitHub Skills, Security features, Copilot FAQs)",
        "Users must identify and extract metadata from repository pages (commit histories, file changes, contributors)",
        "Tasks involve comparing enterprise vs team plan features and limitations",
        "Navigation paths require using GitHub's search functionality with technical keywords (e.g., 'blockchain', 'machine learning')",
        "Tasks demand interpretation of versioning information and release dates from repository tags/releases",
        "Users must locate and parse specialized content areas (customer stories, security advisories, project wikis)",
        "Tasks require understanding GitHub's organizational structure for features (Copilot settings, project management tools)",
        "Navigation involves interacting with authentication-required features (Copilot trials, enterprise plan details)"
      ]
    },
    "espn": {
      "nnetnav_live_site=espn_num_tasks=62_portion=0": [
        "Tasks involve retrieving real-time or recent game scores across multiple sports leagues (NBA, NHL, NCAA, etc.).",
        "Navigation requires accessing team standings, conference/division rankings, and playoff trackers.",
        "Users frequently seek player-specific stats (e.g., points, assists, salaries, physical attributes like weight).",
        "Tasks involve filtering results by date (e.g., yesterday\u2019s games, specific future/past dates).",
        "Navigation includes locating schedules, matchups, and broadcast details (TV channels, streaming platforms like ESPN+).",
        "Users compare team records (e.g., win-loss ratios, standings) within leagues or conferences.",
        "Tasks require identifying league-specific structures (e.g., NBA Play-In Tournament, NCAA brackets).",
        "Navigation involves accessing injury reports, trade updates, and postseason predictions.",
        "Users interact with fantasy sports features (e.g., Tournament Challenge, Fantasy Football/Basketball).",
        "Tasks include cross-referencing multimedia content (articles, highlights) with game/player data."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=4": [
        "Tasks involve retrieving live/final game scores and results across multiple sports leagues (NBA, NHL, MLB, NCAA).",
        "Users navigate to check team standings within conferences/divisions for leagues like NBA, NFL, and NCAA.",
        "Tasks require accessing player statistics (points, rebounds, assists, salaries, weights) from team rosters/game summaries.",
        "Users frequently look up team schedules and specific game dates (e.g., Christmas Day NBA games, bowl game schedules).",
        "Both datasets involve finding ESPN articles/news updates about trades, injuries, or league developments within specified timeframes.",
        "Navigation to fantasy sports sections for player rankings, betting tips, or tournament challenge brackets is required.",
        "Tasks involve checking playoff/tournament qualification status (NBA postseason tracker, CFP bracket).",
        "Users compare team vs. team performance metrics (score differentials, win-loss records) across seasons or matchups.",
        "Tasks require multisport coverage navigation (switching between NFL, NBA, soccer, college sports).",
        "Users employ search functionality to locate specific games/players/teams across ESPN's hierarchical content structure."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=1": [
        "Tasks involve retrieving real-time or recent sports scores and schedules across multiple leagues (e.g., NBA, NFL, NCAA).",
        "Navigation requires accessing player-specific statistics (e.g., points, assists, weight, salary) from team rosters or game summaries.",
        "Users frequently seek standings/rankings (e.g., conference standings, division rankings, playoff brackets).",
        "Tasks require interaction with fantasy sports features (e.g., Tournament Challenge brackets, fantasy rankings).",
        "Queries demand identification of temporal game information (e.g., yesterday's matchups, December 25 games).",
        "Navigation involves comparing team/player performance metrics across multiple parameters (e.g., scores, records).",
        "Tasks require accessing box scores/gamecasts with detailed breakdowns (quarters, halves, OT results).",
        "Users frequently seek multimedia content links (e.g., ESPN+ broadcasts, watch locations).",
        "Queries involve filtering by league-specific terminology (e.g., 'Agg.', 'Final/OT', 'Top 6th').",
        "Tasks require cross-referencing multiple data points (e.g., standings with game schedules, player stats with team context)."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=3": [
        "Both datasets require navigation to league-specific sections (e.g., NBA, NHL, NCAA) for task completion.",
        "Tasks involve retrieving real-time or final game scores across multiple sports leagues.",
        "Users must locate team standings, conference rankings, and divisional positions in both datasets.",
        "Player statistics (e.g., points, rebounds, salary) extraction is a common requirement.",
        "Navigation to schedule pages for date-specific matchups is essential in both datasets.",
        "Both require accessing post-game summaries including top performers and key highlights.",
        "Tasks involve identifying broadcast/streaming details (e.g., ESPN+, TNT) for live events.",
        "Navigation through hierarchical structures (scores \u2192 team logos \u2192 detailed stats) is consistently required.",
        "Both datasets necessitate filtering by temporal constraints (e.g., 'yesterday's games', 'last 3 days').",
        "Cross-referencing between team rosters, player profiles, and game results is fundamental to tasks in both datasets."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=2": [
        "Tasks require retrieving real-time or recent game scores across multiple sports leagues (NBA, NFL, NHL, NCAA).",
        "Users navigate to access team standings, conference rankings, and divisional positions.",
        "Player-specific statistics (e.g., points, salaries, weights, performance metrics) are frequently queried.",
        "Tasks involve checking schedules for past and upcoming games, including dates and times.",
        "Fantasy sports features (e.g., bracket challenges, fantasy leagues) are common targets for interaction.",
        "Updates on trades, transfers, and player transactions are sought in both datasets.",
        "Navigation includes league-specific sections (NBA, NFL, NHL, NCAA) for granular data.",
        "Search functionality is utilized to locate teams, players, or specific articles/news.",
        "Game summaries with scores, highlights, and key player performances are frequently requested.",
        "Multi-sport coverage (e.g., switching between NBA, NFL, soccer) is central to tasks in both datasets."
      ]
    },
    "huggingface": {
      "nnetnav_live_site=huggingface_num_tasks=76_portion=1": [
        "Tasks require navigating to specific model, dataset, or application pages using search functionality or category links",
        "Tasks involve retrieving metadata attributes like update timestamps, download counts, likes, or model size",
        "Tasks focus on identifying most recent/trending resources based on recency or popularity metrics",
        "Tasks require understanding and filtering by technical specifications (architecture, framework, modality)",
        "Tasks involve cross-referencing multiple information sources (model cards, documentation, GitHub)",
        "Tasks require interpreting licensing information and usage restrictions for AI resources",
        "Tasks demand understanding of different ML modalities (text, image, audio) and their applications",
        "Tasks involve interaction with platform features like Inference API, Spaces, or enterprise solutions",
        "Tasks require comparison between similar resources using quantitative metrics and qualitative descriptions",
        "Tasks necessitate understanding of collaboration features like model versions, dataset splits, and community interactions"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=0": [
        "Tasks require locating models/datasets by name, functionality, or modality (e.g., NLP, text-to-image).",
        "Tasks involve extracting metadata: names, creators, update timestamps, licenses, and performance metrics.",
        "Navigation requires filtering/sorting by criteria like recency, popularity, or technical specifications (e.g., model size).",
        "Tasks demand identifying usage guidelines, including API integration, library dependencies, or deployment requirements.",
        "Queries focus on commercial viability: licensing restrictions, pricing tiers, or enterprise-grade features.",
        "Tasks require parsing technical documentation (e.g., adapter loading with PEFT) or model cards for implementation details.",
        "Users must validate temporal constraints (e.g., 'last updated in March 2023') and version compatibility.",
        "Tasks involve comparing multiple entries (e.g., 'most downloaded', 'best performance on MMMU benchmark').",
        "Queries target platform-specific features: Spaces apps, Inference Endpoints, or community resources like forums.",
        "Tasks require distinguishing between open-source vs. proprietary models and datasets with attribution requirements."
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=4": [
        "Tasks require identifying models/datasets by name, creator, and functional description",
        "Navigation involves filtering resources by recency (e.g. 'latest', 'updated X days ago')",
        "Tasks require extracting technical specifications including model size, framework, and tensor type",
        "Users must locate performance metrics like download counts, likes, and evaluation scores",
        "Navigation paths involve core sections: Models, Datasets, Spaces, and Documentation",
        "Tasks require understanding licensing information and commercial use permissions",
        "Users must interact with API-related features for model inference and deployment",
        "Tasks involve searching/filtering by technical attributes (modality, architecture, language)",
        "Navigation requires parsing structured metadata from resource cards/listings",
        "Tasks demand comparison of community engagement metrics (stars, followers, trending status)"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=2": [
        "Tasks require locating models/datasets by specific technical criteria (e.g., NLP, translation, sentiment analysis)",
        "Tasks involve extracting metadata like update dates, download counts, and model licenses",
        "Navigation includes accessing documentation for libraries like Transformers, Diffusers, or PEFT",
        "Tasks require performance metric retrieval (evaluation scores, usage benchmarks)",
        "Both involve searching for models by modality (text, image, audio) and task specialization",
        "Tasks require understanding model cards - including architecture details and training frameworks",
        "Navigation involves filtering models by popularity metrics (most downloaded/liked/recent)",
        "Tasks require handling technical implementation details (quantization, GPU deployment, API usage)",
        "Both involve cross-referencing research papers with model implementations",
        "Tasks require distinguishing between commercial vs. open-source licensing constraints"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=3": [
        "Tasks require identifying models by specific technical attributes (e.g., architecture, size, tensor type)",
        "Users need to locate temporal metadata like 'last updated' timestamps across resources",
        "Navigation involves filtering models/datasets by domain specialization (e.g., NLP, medical, translation)",
        "Tasks require cross-referencing between model documentation and associated research papers/blogs",
        "Users must interpret quantitative performance metrics (download counts, likes, evaluation scores)",
        "Navigation patterns involve switching between model repositories and documentation sections",
        "Tasks require understanding license types and commercial use restrictions",
        "Users need to compare multiple resource versions/releases within organizational accounts",
        "Navigation involves API integration requirements (Inference API, local deployment instructions)",
        "Tasks require identifying community engagement metrics (followers, trending status)"
      ]
    },
    "coursera": {
      "nnetnav_live_site=coursera_num_tasks=72_portion=3": [
        "Tasks require searching for courses using specific keywords or filters such as subject, duration, or difficulty level.",
        "Navigation involves identifying course details like instructor names, institution, completion time, and learning outcomes.",
        "Users frequently filter results by criteria like 'Beginner Level,' 'Credit Eligible,' or duration (e.g., 1-4 weeks).",
        "Tasks involve verifying course ratings (e.g., percentage of 5-star reviews) and comparing star-level distributions.",
        "Navigation includes locating professional certificates or specializations tied to career roles (e.g., Data Analyst, Cybersecurity).",
        "Users seek information about pricing, discounts (e.g., Coursera Plus), and partnerships with companies/universities.",
        "Tasks require identifying skill-based outcomes (e.g., Python programming, AI ethics) and credential providers (e.g., Google, IBM).",
        "Navigation involves exploring degree programs (e.g., Master\u2019s in Data Science) and their application deadlines or requirements.",
        "Users cross-reference course content with specific topics (e.g., Renewable Energy, Blockchain) and verify alignment with goals.",
        "Tasks often include comparing multiple courses or programs to determine relevance to career advancement or skill development."
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=2": [
        "Tasks require filtering courses by skill level (e.g., Beginner/Intermediate)",
        "Navigation involves searching for specific course titles or topics using keyword queries",
        "Users retrieve detailed course metadata including instructors, institutions, and duration",
        "Tasks require analyzing rating distributions (e.g., star percentages) from reviews",
        "Both datasets involve exploring professional certificates with career outcome data",
        "Filtering by course duration ranges (1-3 months, 1-4 weeks) is consistently required",
        "Tasks involve identifying free course options and subscription pricing models",
        "Degree program exploration includes admission requirements and credit transfer policies",
        "Navigation patterns show emphasis on partner institutions (Google, IBM, universities)",
        "Career path information includes job availability statistics and salary ranges"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=4": [
        "Tasks require using search functionality with specific keywords to locate courses/programs",
        "Users must apply multiple filters (e.g., skill level, duration, ratings) to refine results",
        "Tasks involve extracting detailed metadata including instructor names and institutional affiliations",
        "Both require identifying credential types (Professional Certificates/Specializations/Degrees)",
        "Navigation includes comparing course offerings from multiple partner institutions/companies",
        "Tasks demand analysis of course structure (modules, weekly commitments, learning outcomes)",
        "Users must interpret and compare numerical data (ratings percentages, salary figures)",
        "Tasks require identifying temporal constraints (start dates, deadlines, completion timelines)",
        "Both involve verifying course prerequisites and technical requirements (programming languages)",
        "Navigation includes cross-referencing career outcomes with program characteristics"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=0": [
        "Search functionality with filters for duration, level, and certification eligibility",
        "Course/specialization detail pages including instructor names and institutional affiliations",
        "Professional certificate programs offered by major tech companies (Google, IBM, Microsoft)",
        "Degree program information with application deadlines and credit requirements",
        "Course duration estimates with hourly/weekly time commitment breakdowns",
        "Rating system with percentage breakdowns by star category",
        "Multi-step filtering system for skill level (Beginner/Intermediate/Advanced)",
        "Structured learning paths for career roles (Data Analyst, Cybersecurity Analyst)",
        "Price comparison features for Coursera Plus subscriptions and discounts",
        "Partner institution listings with university/corporate logos and affiliations"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=1": [
        "Tasks require filtering courses by skill level (e.g., Beginner, Intermediate)",
        "Navigation involves extracting specific course details (instructor names, duration, learning outcomes)",
        "Users frequently search for Professional Certificates from tech companies (Google, IBM, Microsoft)",
        "Queries target course ratings and review distributions (e.g., 4.5+ stars, 5-star percentages)",
        "Tasks involve identifying university partnerships (e.g., Stanford, University of Michigan)",
        "Users compare program structures (certificate vs. degree requirements)",
        "Search functionality is used for domain-specific topics (Data Science, AI, Cybersecurity)",
        "Tasks require price/benefit analysis (Coursera Plus discounts, free vs paid courses)",
        "Navigation includes career-focused metadata (median salaries, job availability stats)",
        "Users explore AI/ML-related content across multiple disciplines (finance, cybersecurity, design)"
      ]
    },
    "arxiv": {
      "nnetnav_live_site=arxiv_num_tasks=80_portion=1": [
        "Both datasets require users to perform academic paper searches using specific keywords or phrases",
        "Tasks in both datasets involve filtering results by subject categories or sub-disciplines",
        "Users need to retrieve submission dates and version history information for papers",
        "Both require navigation through hierarchical subject classifications (e.g., Physics > Astrophysics > Solar)",
        "Tasks involve identifying and counting results within specific date ranges",
        "Users must locate and interpret author information including author counts",
        "Both datasets require understanding of paper metadata structure (titles, abstracts, comments)",
        "Tasks involve comparing search results across different subject archives",
        "Users need to access specialized search filters (date ranges, field selections)",
        "Both datasets require navigation between search interfaces and paper detail pages"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=4": [
        "Search functionality with field selection options (Title, Author, Abstract, etc.)",
        "Advanced search filters including date ranges and categories",
        "Subject category browsing with hierarchical subfield navigation",
        "Access to recent/new paper listings within specific subcategories",
        "Paper metadata retrieval including submission dates and versions",
        "Category-specific search capabilities (e.g. astro-ph.EP, cs.LG)",
        "Institutional affiliation references (Cornell University maintenance)",
        "Abstract viewing and paper summary accessibility",
        "Multi-format document download options (PDF, HTML)",
        "Submission guideline access including figure format requirements"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=0": [
        "Search functionality includes field filters (title, author, abstract, category)",
        "Requires navigation through hierarchical subject categories (e.g., Physics > Astrophysics)",
        "Supports date-range filtering for paper submissions/publications",
        "Tasks involve retrieving submission dates or version history of papers",
        "Requires accessing detailed paper metadata (author count, category assignments)",
        "Includes actions to download papers in PDF format",
        "Involves cross-referencing between category-specific archives (e.g., cs.LG vs stat.ML)",
        "Tasks require parsing abstracts or specific sections (results, methodology)",
        "Navigation to institutional/contributor information (e.g., Cornell University)",
        "Relies on advanced search parameters (arXiv ID, DOI, ORCID)"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=2": [
        "Both datasets require users to perform searches using specific keywords related to academic topics (e.g., 'quantum computing', 'machine learning').",
        "Tasks in both datasets involve navigating subject categories (e.g., Physics, Computer Science) to locate specialized research areas.",
        "Users must retrieve submission dates, version history, or publication timelines for papers in both datasets.",
        "Both require interaction with category-specific archives (e.g., astro-ph.EP, cond-mat) to filter results.",
        "Tasks demand extraction of metadata such as author counts, abstracts, and institutional affiliations from papers.",
        "Both datasets include queries that require comparing results across multiple categories or archive scopes (e.g., 'search in all archives').",
        "Users must utilize advanced search parameters like date ranges (e.g., 'submitted between Jan 1-3, 2024') in both datasets.",
        "Tasks involve navigating from high-level category listings to granular subcategory pages (e.g., Mathematics \u2192 Algebraic Topology).",
        "Both require verification of arXiv's operational policies (e.g., submission guidelines, privacy policy).",
        "Tasks in both datasets involve cross-referencing paper details with external entities (e.g., Cornell University's website)."
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=3": [
        "Both datasets require users to search for specific research papers using titles or keywords within arXiv's interface.",
        "Tasks in both datasets involve filtering search results by subject categories (e.g., Computer Science, Physics, Quantum Physics).",
        "Both require users to interpret date-related parameters (submission dates, publication recency) in search results.",
        "Tasks involve extracting quantitative information from search results (e.g., number of papers, author counts).",
        "Users must identify and access sub-categories within main subject areas (e.g., Astrophysics of Galaxies under Physics).",
        "Both datasets require navigation between search results and full paper details to locate abstracts/metadata.",
        "Tasks involve comparing search results across multiple categories (e.g., specific category vs. all archives).",
        "Both require understanding arXiv's organizational structure to find institutional information (e.g., maintaining university).",
        "Tasks utilize arXiv's advanced search features including date ranges and field-specific queries (title vs. abstract).",
        "Both datasets require cross-referencing between paper metadata and external university/author information sources."
      ]
    },
    "bbc": {
      "nnetnav_live_site=bbc_num_tasks=69_portion=2": [
        "Tasks involve navigating through categorized sections (e.g., News, Sport, Business, Culture) to locate specific content.",
        "Users frequently seek regional news coverage (e.g., Middle East, Asia, Europe, US & Canada).",
        "Tasks require identifying and summarizing articles based on timestamps (e.g., '3 hrs ago', '1 day ago').",
        "Navigation includes accessing multimedia content (e.g., videos, podcasts, live coverage).",
        "Users search for updates on ongoing global conflicts (e.g., Israel-Gaza War, Ukraine-Russia War).",
        "Tasks involve locating specialized sections (e.g., BBC InDepth, BBC Verify, Innovation, Travel).",
        "Users seek sports-related information (e.g., Premier League results, cricket highlights, tournament schedules).",
        "Tasks require filtering content by topic tags (e.g., 'Business', 'World', 'Asia', 'Culture').",
        "Navigation includes finding articles with geopolitical or economic implications (e.g., tariffs, climate change, trade wars).",
        "Users interact with lists (e.g., 'Most Watched', 'Most Read', 'Other Top Stories') to prioritize content."
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=3": [
        "Both datasets require navigation through categorized sections (e.g., World, Asia, Business) to locate region- or topic-specific content.",
        "Tasks in both datasets involve identifying time-sensitive articles, often indicated by timestamps (e.g., '3 hrs ago', '21 hrs ago').",
        "Users must extract summaries from articles with structured components (headline, image captions, timestamp, and category tags).",
        "Geopolitical conflict coverage (e.g., Israel-Gaza War, Syria crisis) is a recurring theme in navigation tasks for both datasets.",
        "Both datasets include tasks requiring differentiation between live/video content and text-based articles (e.g., 'Watch live coverage' vs. written reports).",
        "Navigation relies on hierarchical organization with primary sections (e.g., News, Sport) and subsections (e.g., US & Canada, Cricket).",
        "Tasks frequently involve cross-referencing category labels (e.g., 'Asia', 'Climate') with article metadata to verify relevance.",
        "Both datasets emphasize locating crisis/disaster-related updates (plane crashes, natural disasters) with specific casualty/impact figures.",
        "Users must distinguish between breaking news and analytical/feature content (e.g., crash reports vs. cultural trend analyses).",
        "Navigation patterns require identifying recurring template elements (e.g., 'MOST READ', 'MORE TO EXPLORE') for content prioritization."
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=1": [
        "Tasks require navigating hierarchical category structures (e.g., World > Middle East, Sport > Football)",
        "Users must identify time-sensitive content via timestamps (e.g., '3 hrs ago', '10 hrs ago')",
        "Tasks involve locating region-specific news sections (e.g., Asia, Europe, US & Canada)",
        "Navigation requires interaction with standardized metadata patterns (headline + summary + category tag)",
        "Multi-step exploration needed for thematic filtering (e.g., climate change across Business/World/Science)",
        "Content discovery depends on parsing article previews with image + text combinations",
        "Tasks require differentiation between live updates vs. archived reports",
        "Users must traverse mixed media formats (text articles, videos, podcasts) within category structures",
        "Section identification relies on consistent topical tagging (e.g., 'Business', 'Middle East', 'Science & Environment')",
        "Temporal prioritization needed for 'latest' content within category hierarchies"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=0": [
        "Tasks require navigating through hierarchical website sections (e.g., World News, Sports, Business)",
        "Users must locate and summarize key points from articles on current events",
        "Need to identify time-sensitive information like latest updates/recent match results",
        "Involve extracting specific details from multimedia content (videos/podcasts)",
        "Tasks include cross-referencing regional coverage (e.g., Asia/Europe/Middle East)",
        "Require interaction with categorized content hubs (e.g., Culture/Technology/Health)",
        "Users must parse structured data elements like league tables/event calendars",
        "Tasks involve following multi-step navigation paths through topic-specific filters",
        "Need to distinguish between live updates versus archived news content",
        "Require understanding of geographic categorization in news reporting"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=4": [
        "Tasks require locating articles by specific topics such as war, sports, or climate change.",
        "Navigation involves accessing dedicated sections (e.g., Sports, Business, World News, Culture).",
        "Users must identify time-sensitive information like latest updates or recent developments.",
        "Queries often focus on geographically specific regions (e.g., Asia, Europe, Middle East).",
        "Tasks demand summarization of key points from articles or reports.",
        "Instructions involve retrieving event-specific details (e.g., natural disasters, political incidents).",
        "Some tasks require accessing multimedia content like videos or podcasts.",
        "Users are asked to find exact articles using precise titles or descriptions.",
        "Tasks include extracting headlines or top stories from specific categories.",
        "Navigation paths involve cross-sectional movement (e.g., News \u2192 Sport \u2192 Football results)."
      ]
    },
    "amazon": {
      "nnetnav_live_site=amazon_num_tasks=63_portion=2": [
        "Tasks involve specifying price ranges or budget constraints for product searches",
        "Users are required to filter results by product attributes (e.g., size, color, material)",
        "Navigation includes sorting mechanisms (price low-high, newest arrivals, customer ratings)",
        "Queries require verification of specific product features/technical specifications",
        "Tasks involve checking availability of free shipping/return policies",
        "Actions include adding identified products to shopping cart",
        "Searches target specific categories/departments (e.g., Electronics, Home & Kitchen)",
        "Users must evaluate customer review metrics (star ratings, review counts)",
        "Tasks require comparison between multiple products/results",
        "Queries include both new and used/refurbished product conditions"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=3": [
        "Tasks require product searches with specific price ranges or budget constraints.",
        "Users must apply filters such as customer ratings, categories, or product attributes.",
        "Navigation involves adding items to the cart or verifying cart actions.",
        "Tasks demand locating products within defined categories or departments (e.g., Electronics, Home & Kitchen).",
        "Users need to verify product details like specifications, availability, or delivery options.",
        "Queries involve comparing prices, features, or customer reviews across products.",
        "Tasks include sorting results by criteria like price (low-high, high-low), newest arrivals, or relevance.",
        "Users target products with explicit attributes (e.g., waterproof, eco-friendly, hypoallergenic).",
        "Tasks require identifying discounted, on-sale, or used-condition items.",
        "Users interact with account-related actions (e.g., sign-in, returns, registrations) where applicable."
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=1": [
        "Tasks require searching for products with specific price ranges and filtering by customer ratings.",
        "Navigation involves using sorting options (e.g., price high-to-low, newest arrivals, popularity).",
        "Users must add items to the cart as a common action step.",
        "Tasks include verifying product availability and delivery conditions (e.g., free shipping).",
        "Both datasets involve filtering products by attributes like size, color, material, or usage context.",
        "Customer reviews and ratings (e.g., 4+ stars) are critical criteria for product selection.",
        "Users are required to compare prices across multiple search results.",
        "Navigation through hierarchical categories (e.g., Electronics > Accessories) is essential.",
        "Tasks involve handling product variants (e.g., used vs. new, seasonal collections).",
        "Users must access additional product details like return policies or compatibility information."
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=0": [
        "Tasks involve searching for products with specific price ranges or discounts",
        "Users are required to apply filters (e.g., ratings, material, category, features) during product searches",
        "Navigation includes adding items to cart as a primary action",
        "Tasks require sorting results by criteria like price, newest arrivals, or customer ratings",
        "Product availability checks (e.g., 'Used - Good' condition, stock status) are common requirements",
        "Users must verify detailed product attributes (e.g., dimensions, technical specs, color options)",
        "Tasks involve comparing multiple products or versions of the same product",
        "Navigation includes checking customer review thresholds (e.g., 'minimum 50 reviews')",
        "Users need to locate specific policies (returns, delivery) related to products",
        "Tasks require identifying items within specific departmental categories (e.g., Electronics, Home & Kitchen)"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=4": [
        "Tasks require product searches with specific attributes (e.g., price range, material, ratings).",
        "Users frequently filter results by price thresholds (e.g., under $50, $100-$200).",
        "Cart interactions (e.g., adding items, comparing quantities) are common objectives.",
        "Tasks involve verifying availability of free shipping or returns.",
        "Users prioritize customer reviews (e.g., 4+ stars, minimum review counts).",
        "Instructions often include sorting/filtering mechanisms (e.g., newest arrivals, price high-low).",
        "Category-specific navigation is required (e.g., electronics, home essentials, books).",
        "Tasks demand validation of product specifications (e.g., dimensions, compatibility, features).",
        "Promotions/sales are key search criteria (e.g., Winter Sale, on-sale items).",
        "Users seek exact product matches using brand/color/size filters (e.g., iPhone 12 Pro Blue)."
      ]
    },
    "wolframalpha": {
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=4": [
        "Tasks require computational problem-solving with mathematical equations or scientific formulas",
        "Queries involve unit conversions and calculations across physics, chemistry, and engineering domains",
        "Requests for step-by-step solutions to mathematical/physics problems are present in both datasets",
        "Tasks demand real-world data analysis (e.g. financial metrics, population statistics, material properties)",
        "Navigation requires understanding of STEM-focused categories like Calculus, Differential Equations, and Statistics",
        "Queries frequently involve comparisons between multiple entities or methodologies",
        "Tasks require interpretation of specialized terminology in mathematics and natural sciences",
        "User intents include data lookup for chemical elements, physical properties, and astronomical phenomena",
        "Both datasets contain requests for temporal/spatial calculations (e.g. time estimates, geographic data)",
        "Tasks demonstrate need for multi-domain integration (e.g. combining nutrition data with mathematical modeling)"
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=0": [
        "Tasks require computational or mathematical problem-solving (equations, derivatives, integrals)",
        "Tasks involve scientific data retrieval (chemical properties, material characteristics, physics calculations)",
        "Tasks utilize unit conversions (mass to moles, currency, measurement systems)",
        "Tasks request step-by-step solutions or explanatory processes",
        "Tasks focus on real-world applications (financial calculations, health metrics, engineering)",
        "Tasks involve comparative analysis (pricing comparisons, material properties, packing densities)",
        "Tasks require data interpretation from structured knowledge bases (population statistics, historical trends)",
        "Tasks utilize natural language input for complex queries",
        "Tasks involve exploration of domain-specific examples (mathematics, science, everyday life categories)",
        "Tasks demand cross-domain knowledge integration (combining math with physics, chemistry with biology)"
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=1": [
        "Tasks involve mathematical problem-solving (e.g., derivatives, integrals, equations).",
        "Queries require computational or algorithmic processing (e.g., unit conversions, chemical calculations).",
        "Navigation tasks demand structured data retrieval (e.g., temperature anomalies, population statistics).",
        "Tasks include step-by-step solutions for equations or proofs (e.g., differential equations, polynomial expansions).",
        "Focus on scientific and technical domains (e.g., physics, chemistry, engineering).",
        "Queries involve real-world data analysis (e.g., climate data, financial metrics).",
        "Tasks require domain-specific knowledge (e.g., beta distributions, Fibonacci sequences).",
        "Navigation includes exploration of educational or example-based content (e.g., paradoxes, tutorials).",
        "Tasks involve comparisons (e.g., material properties, packing densities).",
        "Queries target dynamic or interactive features (e.g., plotting, parameter adjustments)."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=3": [
        "Tasks require mathematical computations (e.g., derivatives, integrals, equations).",
        "Tasks involve unit conversions (e.g., weight, currency, temperature).",
        "Tasks query scientific properties (e.g., chemical compositions, physical constants).",
        "Tasks demand step-by-step solutions for complex problems (e.g., differential equations, integrals).",
        "Tasks involve real-world applications (e.g., finance, health, engineering).",
        "Tasks require data analysis (e.g., statistical measures, population trends).",
        "Tasks utilize Wolfram Alpha's specialized computational knowledge (e.g., algorithms, curated data).",
        "Tasks target STEM disciplines (e.g., physics, chemistry, engineering).",
        "Tasks include financial calculations (e.g., present value, interest rates).",
        "Tasks focus on educational or research-oriented objectives (e.g., historical events, definitions)."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=2": [
        "Tasks require computational problem-solving across mathematics, science, and engineering domains",
        "Queries involve unit conversions and chemical/weight-to-mole calculations",
        "Requests for physical properties of materials/elements (thermal conductivity, boiling points)",
        "Tasks demand data aggregation/comparison across multiple entities (cities, materials, time periods)",
        "Navigation includes financial calculations (mortgage costs, GDP, personal finance)",
        "Health-related computations are present (calorie burn, BMI, weight loss projections)",
        "Requires handling differential equations and complex mathematical operations",
        "Tasks involve plotting/visualizing mathematical functions and physical phenomena",
        "Requests for temporal/spatial calculations (sunburn timing, planetary day lengths)",
        "Queries require accessing structured knowledge bases (element isotopes, historical climate data)"
      ]
    },
    "allrecipes": {
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=0": [
        "Tasks require filtering recipes by user ratings (e.g., 4+ stars) across both datasets",
        "Both involve searching for recipes with specific ingredient constraints (e.g., zucchini, vegan)",
        "Users frequently seek recipes with minimum review thresholds (50+ reviews in A, popularity metrics in B)",
        "Navigation requires interaction with recipe metadata: cooking time, servings, and skill level",
        "Both datasets emphasize finding seasonal/holiday-specific recipes (Easter, Christmas, etc.)",
        "Tasks require parsing nutritional information like calorie count and carbohydrate content",
        "Users need to locate and interpret user reviews/ratings for recipe quality assessment",
        "Both involve recipe category navigation (dinners, cuisines, kitchen tips)",
        "Tasks require identifying 'Save Recipe' functionality and saved recipe management",
        "Both datasets demand comparison of multiple recipe variants through filtering/sorting"
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=4": [
        "Recipes can be searched by specific dietary preferences (vegetarian, vegan, keto, etc.)",
        "User reviews and star ratings are prominently displayed for recipe evaluation",
        "Preparation/cooking time filters are available for time-sensitive searches",
        "Recipes can be filtered by required ingredients or ingredient exclusions",
        "Meal type categorization exists (dinners, desserts, breakfasts, etc.)",
        "Nutritional information (calories, carbs, etc.) is provided for health-focused queries",
        "Recipe saving/bookmarking functionality is present for meal planning",
        "User-generated content features (recipe reviews, modifications) are supported",
        "Seasonal/holiday-specific recipe collections are available (Easter, Christmas, etc.)",
        "Advanced filtering by cuisine type and cooking methods (grilled, baked) is possible"
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=1": [
        "Users search for recipes using specific filters like ratings, reviews count, and ingredients.",
        "Tasks require retrieving detailed recipe information including ingredients, steps, and cook time.",
        "Navigation involves filtering by dietary preferences (vegan, keto, gluten-free, etc.).",
        "Users interact with recipe reviews/ratings to assess quality.",
        "Tasks include saving/bookmarking recipes for future reference.",
        "Navigation paths involve category-based browsing (e.g., cuisines, meal types).",
        "Users compare multiple recipes based on criteria like nutrition or preparation time.",
        "Tasks emphasize time constraints (e.g., 'under 30 minutes' or 'quick and easy').",
        "Recipes are validated through community metrics (star ratings, review counts).",
        "Seasonal/event-driven recipe searches are common (e.g., holidays, themed meals)."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=3": [
        "Tasks require filtering recipes by user ratings (e.g., 4 stars or higher).",
        "Tasks involve searching based on specific dietary requirements (e.g., vegetarian, vegan, keto).",
        "Tasks emphasize recipe popularity metrics (e.g., minimum number of reviews).",
        "Tasks include time constraints (e.g., preparation time under 30 minutes).",
        "Tasks require extracting detailed recipe metadata (e.g., ingredients, cooking steps).",
        "Tasks involve navigation through categorized recipe types (e.g., cuisines, occasions).",
        "Tasks focus on user-generated content interaction (e.g., reviews, ratings).",
        "Tasks target seasonal or event-specific recipes (e.g., Easter, Christmas).",
        "Tasks demand aggregation of supplementary recipe information (e.g., nutrition facts).",
        "Tasks prioritize accessibility of trending or editorially highlighted recipes."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=2": [
        "Tasks require searching recipes by specific dietary restrictions (e.g., vegetarian, vegan, gluten-free)",
        "Users filter recipes by star ratings (e.g., 4 stars or higher) and review counts",
        "Tasks involve extracting detailed recipe metadata (ingredients, prep/cook time, nutrition facts)",
        "Navigation includes finding recipes for specific occasions/holidays (Easter, Thanksgiving, Christmas)",
        "Users seek recipes with ingredient constraints (e.g., 'uses zucchini', 'contains shrimp')",
        "Tasks require comparing multiple recipe versions through user reviews and ratings",
        "Navigation flows involve filtering by meal type (breakfast, dinner) and cuisine (Italian, Korean)",
        "Users look for time-efficient recipes (e.g., 'under 30 minutes', 'quick and easy')",
        "Tasks include community interaction elements like reading/writing recipe reviews",
        "Navigation patterns show interest in budget-friendly meals and repurposing leftovers"
      ]
    },
    "dictionary.cambridge": {
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=2": [
        "Tasks require searching for specific words to retrieve definitions.",
        "Users must locate pronunciation guides, including UK and US variants.",
        "Translation features are utilized to convert words into target languages.",
        "Grammar explanations and usage examples are accessed for linguistic concepts.",
        "Example sentences are provided to illustrate word usage in context.",
        "Multiple definitions or meanings for a single word are retrieved.",
        "Navigation through categorized sections (e.g., Grammar, Thesaurus) is necessary.",
        "Tasks involve comparing different linguistic elements (e.g., prepositions, adjectives).",
        "Phonetic transcriptions using IPA are required for pronunciation tasks.",
        "Exploration of additional features like word games or blog content is included."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=3": [
        "Tasks require information retrieval from dictionary entries, including definitions, pronunciations, and example sentences.",
        "Users must navigate to specific sections (e.g., Grammar, Thesaurus, Translations) via structured menus or links.",
        "Translation tasks involve switching between language pairs (e.g., English\u2013French, English\u2013Spanish) using directional controls.",
        "Pronunciation tasks require locating UK/US phonetic notations (IPA) and audio playback buttons.",
        "Grammar-related tasks demand exploration of sub-sections (e.g., modal verbs, passive voice, articles) with usage examples.",
        "Search functionality is central, with tasks requiring keyword input and filtering by dictionary type (e.g., Learner\u2019s Dictionary).",
        "Handling multiple definitions per word is common, requiring users to identify distinct meanings or contexts.",
        "Tasks involve locating synonyms, antonyms, or collocations through the Thesaurus or related word lists.",
        "Dynamic content (e.g., \"Word of the Day\", blog posts) must be accessed for specific queries or examples.",
        "Navigation includes cookie consent banners and language selection panels, requiring interaction to proceed."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=1": [
        "Both datasets require users to search for word definitions using a text input field.",
        "Tasks in both datasets involve navigating to specific sections (e.g., Grammar, Thesaurus, Pronunciation) via explicit links or menus.",
        "Users must locate and compare UK/US pronunciations, often including IPA notation, in both datasets.",
        "Translation tasks (e.g., English to Chinese/Spanish/French) are present in both datasets, requiring language-direction selection.",
        "Both include tasks to extract example sentences from word entries to demonstrate contextual usage.",
        "Grammar-related tasks (e.g., modal verbs, passive voice) require navigation to dedicated grammar explanation pages.",
        "Word entries in both datasets contain multiple numbered meanings/definitions that users must count or compare.",
        "Tasks in both datasets involve identifying synonyms or related terms through links to the Thesaurus section.",
        "Users must interact with expandable/collapsible content (e.g., 'See more' buttons) to access full translations or definitions.",
        "Both datasets require distinguishing between dictionary versions (e.g., Learner's Dictionary vs. Essential English) during navigation."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=4": [
        "Tasks require retrieving word definitions with multiple contextual meanings",
        "Users must locate pronunciation guides including UK and US variants",
        "Navigation involves accessing translation features between English and other languages",
        "Grammar explanations (e.g., modal verbs, passive voice) must be identified in dedicated sections",
        "Tasks demand extraction of example sentences demonstrating word usage",
        "Users need to navigate alphabetical word indexes (A-Z) or category-based browsing",
        "Interaction with phonetic notation (IPA) is required for pronunciation tasks",
        "Tasks involve comparing word senses across learner/essential/British/American dictionaries",
        "Users must identify and extract content from recurring features like 'Word of the Day'",
        "Navigation patterns require switching between dictionary/thesaurus/grammar/translate modes"
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=0": [
        "Tasks require searching for word definitions with detailed explanations and examples",
        "Users must navigate to pronunciation guides for both UK and US English variants",
        "Translation functionalities between English and other languages (e.g., Chinese, Spanish) are utilized",
        "Grammar sections are accessed to explore rules (e.g., modal verbs, adjectives, prepositions)",
        "Thesaurus features are used to identify synonyms and antonyms for target words",
        "Tasks involve extracting multiple meanings or contextual uses of a single word",
        "Example sentences are retrieved to demonstrate word usage in context",
        "Navigation includes accessing dedicated dictionary subsets (e.g., Learner\u2019s Dictionary, Essential British/American English)",
        "Users interact with phonetic notation systems (e.g., IPA) for pronunciation accuracy",
        "Tasks require comparing linguistic variations (e.g., regional pronunciations, grammar rules)"
      ]
    },
    "apple": {
      "nnetnav_live_site=apple_num_tasks=70_portion=1": [
        "Tasks require navigating through structured product categories (e.g., iPhone, Mac, iPad) to locate specifications",
        "Both involve price comparison tasks across multiple device models or configurations",
        "Navigation includes accessing technical details like camera specs, storage options, and processor information",
        "Tasks require interaction with product comparison features (e.g., iPhone Pro vs Pro Max, AirPods models)",
        "Both datasets include tasks that involve checking device compatibility (e.g., Apple Watch bands, accessory compatibility)",
        "Navigation flows require accessing specialized sections: support pages, environmental reports, and business solutions",
        "Tasks frequently involve cross-referencing between product pages and support/technical documentation",
        "Both require finding time-sensitive information like release dates, software updates, and seasonal promotions",
        "Tasks utilize hierarchical navigation through footer links for corporate information (e.g., privacy, sustainability)",
        "Both involve inventory checks including product availability, customization options, and regional variations"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=4": [
        "Tasks require locating product specifications (e.g., storage, camera specs, battery life)",
        "Tasks involve price comparisons between models/variants (e.g., iPhone Pro vs Pro Max)",
        "Tasks require identifying product availability (e.g., in-store pickup, release dates)",
        "Tasks involve configuration customization (e.g., color, storage, accessory selection)",
        "Tasks require navigation through multi-level product categories (e.g., iPhone > Pro models)",
        "Tasks involve accessing technical support/warranty information",
        "Tasks require comparison of features across product generations (e.g., iPhone 15 vs 16)",
        "Tasks involve locating purchase programs (e.g., Apple Trade In, education discounts)",
        "Tasks require identification of accessory compatibility (e.g., Smart Folio for specific iPad)",
        "Tasks involve extracting detailed product metadata (e.g., resolution specs, material composition)"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=0": [
        "Tasks require locating product prices across categories (e.g., iPhone, Mac, AirPods, Apple Watch).",
        "Users must compare features between product variants (e.g., Pro vs. Pro Max models, storage options).",
        "Navigation involves identifying technical specifications (e.g., chip types, camera specs, display resolutions).",
        "Tasks involve checking in-store pickup availability or locating nearby Apple Stores.",
        "Users must explore hierarchical product categories (e.g., iPhone > iPhone 16 Pro > storage configurations).",
        "Purchase-related actions include trade-in estimates, financing options, and customization workflows.",
        "Tasks target educational/business use cases (e.g., education discounts, enterprise applications of Apple products).",
        "Support documentation navigation required for compatibility checks or setup instructions (e.g., iOS versions).",
        "Accessory research tasks involve compatibility verification and purchase flows (e.g., cases, Apple Pencil).",
        "Service integration tasks appear (e.g., Apple Card, Apple Trade-In, subscription services like Fitness+)."
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=2": [
        "Tasks involve locating specific product specifications (e.g., storage options, camera features, display sizes)",
        "Users frequently compare prices across different product models or configurations",
        "Navigation includes checking product availability (e.g., in-store pickup, stock status)",
        "Tasks require identifying current product lineup details (e.g., color options, model variations)",
        "Multiple queries focus on purchasing workflows (e.g., customization, add-ons like AppleCare)",
        "Users seek official support documentation for device setup/troubleshooting",
        "Tasks involve exploring Apple service offerings (e.g., Trade-In, Business Solutions, Family Sharing)",
        "Navigation patterns include cross-referencing product pages with support articles",
        "Queries frequently target product comparison between successive generations (e.g., iPhone 15 vs 16)",
        "Tasks require understanding product hierarchy through category navigation (e.g., Store > iPhone > Accessories)"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=3": [
        "Tasks require navigating product-specific pages for detailed specifications (e.g., iPhone models, iPad storage options).",
        "Users frequently interact with product configurators to customize device features (e.g., storage, color, chipset).",
        "Price comparison between models or configurations is a common objective (e.g., iPhone Pro vs. Pro Max, AirPods variants).",
        "Tasks involve verifying accessory compatibility (e.g., cases, Apple Pencil) with specific devices.",
        "Users seek availability details like in-store pickup options or regional release dates (e.g., Apple Vision Pro).",
        "Support-related tasks focus on troubleshooting, warranty status, or repair guides (e.g., cracked iPhone screens).",
        "Tasks require extracting technical specifications (e.g., camera resolution, battery life, processor details).",
        "Users compare features across product generations (e.g., iPhone 13 vs. 14 vs. 15 camera settings).",
        "Business-oriented tasks include purchasing for organizations or exploring enterprise solutions (e.g., MacBook Air for businesses).",
        "Health-related features (e.g., Apple Watch sensors, Hearing Aid functionality) are frequently researched."
      ]
    },
    "google_search": {
      "nnetnav_live_site=google_search_num_tasks=72_portion=3": [
        "Tasks require retrieving specific factual information from search results",
        "Queries involve keyword-based search inputs for precise data extraction",
        "Navigation patterns focus on SERP interaction without deep website exploration",
        "Tasks demand parsing and synthesizing information from multiple search results",
        "Common need for current/recent information (news, stats, rankings, or trends)",
        "Frequent requirement to compare or list multiple entities/results",
        "Use of structured query operators (implied by request for specific numerical data or ordered lists)",
        "Reliance on vertical search features (images, news, shopping) without explicit vertical selection",
        "Tasks assume search engine capability to surface authoritative sources (academic papers, official stats, platform data)",
        "Common pattern of multi-intent queries requiring compound search strategies"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=2": [
        "Tasks require retrieving specific factual information from structured or semi-structured web content",
        "Tasks involve searching for current or real-time data (e.g., scores, stock prices, news)",
        "Tasks demand navigation through multiple information layers (e.g., specifications->requirements, movies->earnings)",
        "Tasks frequently require comparison of entities (products, stocks, movies)",
        "Tasks involve locating specialized content categories (recipes, technical specs, academic papers)",
        "Tasks require extraction of numerical data (dates, counts, measurements, rankings)",
        "Tasks often need identification of authoritative sources (journals, official platforms, corporate sites)",
        "Tasks involve following predefined information hierarchies (bio->achievements, movies->release details)",
        "Tasks require understanding of temporal sequences (latest/last/upcoming/current information)",
        "Tasks demand interaction with both textual and structured data elements (tables, lists, charts)"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=4": [
        "Tasks require retrieving specific factual information (e.g., dates, statistics, rankings, bios).",
        "Queries often involve structured data outputs (e.g., lists, rankings, step-by-step instructions).",
        "Tasks demand real-time or up-to-date information (e.g., latest news, live scores, current stock prices).",
        "User goals frequently involve multi-step navigation (e.g., search, filter, compare, validate).",
        "Tasks target domain-specific terminology (e.g., technical specs, scientific terms, industry jargon).",
        "Queries prioritize actionable outcomes (e.g., booking tickets, applying for jobs, purchasing items).",
        "Tasks require synthesizing information from multiple sources or pages (e.g., research papers, product specs, news articles).",
        "User intents often include comparative analysis (e.g., stock performance, movie ratings, academic programs).",
        "Queries focus on verifiable or authoritative sources (e.g., academic journals, official websites, reputable platforms).",
        "Tasks involve parsing dynamic or interactive content (e.g., search autocomplete, filters, forms)."
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=0": [
        "Tasks require retrieving specific factual information (e.g., names, dates, technical specs, numerical data).",
        "Queries demand navigation to external platforms (e.g., IMDb, GitHub, news outlets, academic journals).",
        "Tasks often involve multi-step actions (e.g., search, filter, compare, extract, or validate data).",
        "Focus on current/live data (e.g., latest news, real-time stock prices, up-to-date rankings).",
        "Requests for structured outputs (e.g., top-N lists, sorted results, comparative analyses).",
        "Emphasis on domain-specific technical details (e.g., software/hardware requirements, product specs).",
        "Use of precise keyword-based searches (e.g., names, titles, version numbers, timestamps).",
        "Tasks target entity-centric information (e.g., people, products, organizations, events).",
        "Requires parsing dynamic content (e.g., ratings, prices, research papers, event availability).",
        "Reliance on platform-specific features (e.g., search filters, voice/image search, account interactions)."
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=1": [
        "Tasks require precise information extraction from web sources.",
        "Queries involve specific entities (names, dates, technical terms, brands, or locations).",
        "Navigation focuses on retrieving current or real-time data (e.g., latest news, scores, prices).",
        "Tasks demand parsing structured data (rankings, lists, version numbers, statistics).",
        "User intent centers on factual answers with measurable outcomes (e.g., counts, dates, requirements).",
        "Queries target specialized domains (sports, tech, health, entertainment, finance).",
        "Tasks often require multi-step processes (e.g., search \u2192 filter \u2192 compare \u2192 verify).",
        "Emphasis on comparative analysis (e.g., prices, stock performance, program features).",
        "Frequent use of keyword-driven search strategies to locate niche or granular information.",
        "Interaction with dynamically updated content (e.g., live scores, trending topics, recent publications)."
      ]
    }
  },
  "diffs_synth_from_real": {
    "google_maps": {
      "nnetnav_live_site=google_maps_num_tasks=75_portion=2": [
        "Tasks in dataset A more frequently require generating or exporting map-related outputs (e.g., PDFs, sharing links, information summaries)",
        "Dataset B contains explicit requests for price comparisons and specific cost information (e.g., hotel rates, premium plan costs)",
        "Dataset A includes more tasks requiring validation of negative operational conditions (e.g., 'not open 24 hours', 'closes at night')",
        "Dataset B contains more explicit booking/reservation actions (e.g., hotel rooms, restaurant tables)",
        "Dataset A tasks more commonly require temporal coordination between multiple locations (e.g., arrival time to airport then walking time to supermarket)",
        "Dataset B includes more specific menu item requests (e.g., Crispy Chicken Sandwich, gluten-free meals)",
        "Dataset A contains more granular transportation mode combinations (e.g., walking time estimates between specific points)",
        "Dataset B features multi-stop route planning with conditional waypoints (e.g., bike ride with coffee shop stop)",
        "Dataset A tasks more frequently require identification of infrastructure-supported locations (e.g., EV charging, accessible parking)",
        "Dataset B includes more comparative analysis tasks (e.g., review comparisons, playground equipment comparisons)"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=3": [
        "Dataset A tasks emphasize time-sensitive operational constraints (e.g., 'open now but not 24 hours'), while B focuses on broader temporal or budgetary planning (e.g., '2-night stay with $400 budget').",
        "Dataset A includes explicit requests for output actions (e.g., 'print map as PDF', 'share link'), whereas B focuses on informational retrieval without output formatting requirements.",
        "Dataset A tasks frequently involve multi-location coordination (e.g., 'walking time to nearest supermarket from hotel'), while B emphasizes single-location queries with quality filters (e.g., 'highly rated hotels').",
        "Dataset A specifies concrete numerical thresholds for filtering (e.g., 'ratings greater than 4.8'), while B uses qualitative descriptors (e.g., 'moderately priced', 'highly rated').",
        "Dataset A contains explicit requests for accessibility-supported infrastructure (e.g., 'EV charging supported parking'), while B emphasizes accessibility-enabled services (e.g., 'wheelchair-accessible medical transportation').",
        "Dataset A tasks require temporal exception handling (e.g., 'not open 24 hours'), whereas B focuses on temporal availability matching (e.g., 'open Sunday and Monday').",
        "Dataset A includes granular route optimization parameters (e.g., 'least amount of walking'), while B prioritizes service quality in routing (e.g., 'walking directions to wheelchair-accessible restaurants').",
        "Dataset A tasks involve real-time contextual synthesis (e.g., 'walking time to nearest supermarket'), while B focuses on static attribute comparisons (e.g., 'price of hotel').",
        "Dataset A emphasizes geographic precision through coordinate-like references (e.g., 'corner of Elm Street and Oak Street'), while B uses landmark-centric navigation (e.g., 'near Empire State Building').",
        "Dataset A includes system configuration tasks (e.g., 'find search settings'), while B focuses exclusively on location/service discovery and reservation workflows."
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=1": [
        "Dataset B tasks frequently require date-specific future availability checks (e.g., 'January 11th'), while A focuses on real-time availability ('open now')",
        "B includes tasks requiring booking/purchasing actions (e.g., 'buy a ticket', 'make a reservation'), absent in A",
        "B contains more international location queries (Paris, Tokyo, Spain) compared to A's predominantly US-centric tasks",
        "B emphasizes multi-destination itinerary planning (e.g., 'route from X to Y to Z'), while A focuses on single-destination navigation",
        "B tasks frequently specify budget ranges and guest counts (e.g., '$350/night for 3 people'), unlike A's attribute-focused filters",
        "A includes output generation tasks (e.g., 'print as PDF', 'share link') not present in B",
        "B requires planning for tourist attractions/experiences (e.g., 'Eiffel Tower tickets', 'hike planning'), while A focuses on utilitarian services",
        "B tasks involve price verification for services (flights, admissions), absent in A's location/service-focused queries",
        "A specifies exact numerical targets (e.g., '5 beauty salons'), while B uses qualitative thresholds ('highly-rated')",
        "B includes transportation mode comparisons (e.g., 'stairs vs elevator access'), whereas A focuses on single-mode navigation"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=0": [
        "Dataset B tasks require booking/reservation parameters (e.g., check-in/check-out dates, guest counts)",
        "Dataset B emphasizes price comparisons between multiple options (e.g., hotel price comparisons)",
        "Dataset A focuses on immediate navigation needs (e.g., current walking time, real-time traffic)",
        "Dataset B includes explicit requirements for cancellation policies (e.g., free cancellation)",
        "Dataset A tasks prioritize operational hour exceptions (e.g., 'not open 24 hours')",
        "Dataset B contains tasks requiring multi-city coordination (e.g., Paris to Stanford)",
        "Dataset A emphasizes physical accessibility constraints (e.g., wheelchair access in parking)",
        "Dataset B tasks involve review composition/analysis (e.g., 'be prepared to write a review')",
        "Dataset A requires specific infrastructure queries (e.g., EV charging stations)",
        "Dataset B includes multimedia exploration tasks (e.g., 360\u00b0 views, photo galleries)"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=4": [
        "Tasks in dataset B require making reservations or bookings (e.g., hotels, restaurants) with specific dates/times, while dataset A does not.",
        "Dataset B includes tasks involving price comparisons (e.g., 'best price,' 'affordable') as explicit criteria, whereas dataset A focuses more on operational status (e.g., 'open now') without direct price competition analysis.",
        "Tasks in dataset B frequently involve multi-stop itinerary planning (e.g., 'day trip to wheelchair-accessible restaurants'), while dataset A focuses on single-point route planning between two locations.",
        "Dataset B contains tasks requiring evaluation of menu items (e.g., 'check out their menu') and food-specific amenities, which are absent in dataset A's requirements.",
        "Dataset A tasks often require interaction with map export/sharing features (e.g., 'print as PDF,' 'sharing link'), while dataset B does not include these functionalities.",
        "Dataset B includes tasks with explicit temporal event constraints (e.g., 'New Year's Eve,' 'specific dates'), whereas dataset A focuses on real-time status without date-bound events.",
        "Tasks in dataset B require analysis of elevation profiles (e.g., 'hiking trail elevation') and terrain-specific metrics, which are not present in dataset A.",
        "Dataset A emphasizes verification of facility-specific operational rules (e.g., 'not open 24 hours,' 'closes at night'), while dataset B focuses on current availability ('open now') without operational hour exceptions.",
        "Dataset B includes tasks requiring comparison of non-commercial locations (e.g., parks, trails), whereas dataset A focuses exclusively on commercial/service-oriented venues.",
        "Dataset B tasks demand evaluation of user review sentiment (e.g., 'what people think about...') as a primary objective, while dataset A uses reviews only for secondary validation of facilities."
      ]
    },
    "github": {
      "nnetnav_live_site=github_num_tasks=71_portion=3": [
        "Tasks require explicit comparisons between GitHub Copilot pricing tiers (Free vs Pro vs Enterprise)",
        "Queries involve detailed analysis of GitHub's security certifications/compliance standards (SOC 2, GDPR)",
        "Navigation paths require accessing GitHub Advisory Database for vulnerability research",
        "Tasks demand step-by-step account creation processes (email validation steps, username selection)",
        "Queries focus on legal/policy documentation (terms of service, privacy policy updates)",
        "Navigation involves direct integration with IDE-specific workflows (VS Code extensions, mobile app features)",
        "Tasks require identification of educational resource activation processes (student pack verification)",
        "Queries emphasize GitHub Project management features (task dependencies, spreadsheet-style tracking)",
        "Tasks involve security vulnerability reporting workflows (GHSA ID lookup, disclosure processes)",
        "Navigation paths require direct comparison of GitHub Codespaces compute/storage pricing metrics"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=2": [
        "Dataset B tasks require accessing security advisories by severity level (high/unreviewed)",
        "Dataset B requires finding compliance documentation (CSA STAR Certificate)",
        "Dataset B tasks involve obtaining quotes/pricing for enterprise-level services (Advanced Security, Enterprise plans)",
        "Dataset B requires investigating AI training data usage policies (Copilot model training data sources)",
        "Dataset B tasks focus on API technical specifications (GraphQL types, YAML configurations)",
        "Dataset B includes legal/license verification tasks (repository license information)",
        "Dataset B tasks require direct interaction with security feature documentation (Advanced Security features)",
        "Dataset B contains job search-related tasks within platform navigation",
        "Dataset B tasks involve package management instructions (copilot-backend installation)",
        "Dataset B requires comparing technical API implementations (REST vs GraphQL differences)"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=0": [
        "Tasks in B require understanding GitHub Copilot's security features and data usage policies, while A focuses on vulnerability fixes and compliance documentation.",
        "B includes tasks about GitHub Copilot plan comparisons and upgrade paths, whereas A compares broader Free/Pro plan tiers.",
        "B tasks demand analysis of technical documentation syntax (e.g., Markdown vs. rich text formatting), which A does not require.",
        "B requires identifying GitHub Copilot Extensions functionality, while A focuses on core Copilot features.",
        "Tasks in B involve explicit investigation of GitHub's data collection policies and confidentiality agreements, unlike A's account management focus.",
        "B includes troubleshooting specific GitHub Actions errors (e.g., Metrics embed failures), while A focuses on metric interpretation.",
        "B tasks require CLI tool configuration (e.g., GitHub CLI installation), whereas A focuses on CLI documentation access.",
        "B emphasizes educational resource configuration (e.g., autograding setup), while A focuses on resource discovery.",
        "Tasks in B require understanding feature eligibility requirements and trial processes (e.g., Enterprise Server trials), which A lacks.",
        "B includes testing error handling for non-existent URLs/navigation paths, while A focuses on valid feature discovery."
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=4": [
        "Tasks in B require retrieving pricing or cost information for specific use cases (e.g. NLP projects, educational trials)",
        "B includes tasks about legal/policy documentation (terms, data handling, intellectual property) not present in A",
        "B contains exploratory queries about platform capabilities without predefined filters (e.g. 'find out what GitHub can do')",
        "Tasks in B demand comparing plans based on hypothetical scenarios rather than concrete feature comparisons",
        "B requires finding security certifications/compliance information (e.g. CSA STAR Certificate)",
        "Tasks in B involve API implementation details (e.g. GraphQL integration workflows)",
        "B includes mobile-specific inquiries (app ratings, mobile feature availability) absent in A",
        "Tasks in B ask for vulnerability research with CVE/CWE identifiers rather than general security features",
        "B contains hypothetical system status checks (e.g. 'why is GitHub down?') not seen in A",
        "Tasks in B require interpreting upgrade paths between plans rather than direct feature comparisons"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=1": [
        "Tasks in dataset A focus on retrieving specific quantitative data (star counts, update dates, metadata statistics) while dataset B emphasizes understanding feature details and pricing structures",
        "Dataset A requires identifying exact content locations (specific documentation sections, repository tags) whereas dataset B tasks involve exploring broad product capabilities",
        "Tasks in dataset A demand temporal filtering (last week, past 2 days) while dataset B emphasizes plan comparisons without time constraints",
        "Dataset A contains technical implementation tasks (changing zsh themes, interpreting versioning) vs dataset B's focus on account management actions (subscription upgrades, trial access)",
        "Dataset A requires working with repository content analysis (commit histories, file changes) while dataset B focuses on service configuration (Copilot settings, project layouts)",
        "Tasks in dataset A specify programming language constraints (C++, Python, JavaScript) whereas dataset B uses broader technology categories (APIs, security vulnerabilities)",
        "Dataset A includes contributor analysis tasks (top 5 contributors) while dataset B emphasizes user-facing documentation comprehension (how-tos, troubleshooting)",
        "Tasks in dataset A require maintaining multiple search filters simultaneously (language + stars + date) while dataset B uses singular focus queries",
        "Dataset A contains wiki/Readme parsing requirements absent in dataset B tasks",
        "Dataset B includes explicit security vulnerability investigations (CVE tracking) not present in dataset A's security feature inquiries"
      ]
    },
    "espn": {
      "nnetnav_live_site=espn_num_tasks=62_portion=0": [
        "Dataset B includes tasks related to sports betting odds, spreads, and moneylines (e.g., Super Bowl futures, NFL Week 17 odds), while Dataset A lacks betting-oriented queries.",
        "Dataset B requires navigation for international soccer leagues (e.g., UEFA Conference League, Serie A, Spanish Supercopa), whereas Dataset A focuses on domestic leagues like NBA/NHL.",
        "Tasks in Dataset B involve locating broadcast details across non-ESPN platforms (TNT, truTV, Max, FS1), while Dataset A primarily references ESPN/ESPN+.",
        "Dataset B includes playoff/bowl game predictions (e.g., CFP bracket, Super Bowl favorites), while Dataset A focuses on real-time standings/rankings without projections.",
        "Dataset B tasks require aggregating injury reports across multiple NFL/NBA teams simultaneously, unlike Dataset A's single-team injury checks.",
        "Dataset B involves navigating college football bowl game schedules and CFB Playoff brackets, which are absent in Dataset A's NCAA-focused tasks.",
        "Dataset B includes real-time betting line updates (e.g., spread, over/under) during live game navigation, a feature not present in Dataset A.",
        "Dataset B requires accessing fantasy sports challenges like Pigskin Bracket Challenge, whereas Dataset A focuses on Tournament Challenge for basketball.",
        "Tasks in Dataset B involve cross-referencing game data with betting statistics (e.g., Gilgeous-Alexander MVP odds), while Dataset A links stats to multimedia content.",
        "Dataset B includes navigation for team-specific futures (e.g., Cavaliers' championship viability), while Dataset A focuses on historical/current team performance."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=4": [
        "Dataset B includes tasks related to NFL playoff standings and specific week analysis (e.g., Week 17, Week 18), which are absent in Dataset A.",
        "Dataset B requires navigation for soccer transfer rumors and international friendlies (e.g., US vs Canada match), while Dataset A focuses on league-specific soccer results.",
        "Dataset B involves hypothetical scenarios like simulating NBA trades involving soccer players (e.g., Manchester City to Atlanta Hawks), not present in Dataset A.",
        "Dataset B tasks include retrieving betting odds for NFL/NCAAF games, whereas Dataset A focuses on fantasy sports rankings without explicit betting tips.",
        "Dataset B emphasizes college football bowl game schedules and results (e.g., Toledo vs Arkansas), while Dataset A\u2019s NCAA tasks are limited to standings and conference data.",
        "Dataset B requests cross-sport player stat comparisons (e.g., Jahmyr Gibbs\u2019 fantasy football stats), unlike Dataset A\u2019s sport-specific player queries.",
        "Dataset B references newer seasons (e.g., 2024 NFL season, 2024-25 NHL season), while Dataset A uses 2023-24 seasons.",
        "Dataset B requires accessing podcasts (e.g., ESPN Radio NBA podcasts), a feature not mentioned in Dataset A.",
        "Dataset B tasks involve comparing team performance across multiple seasons (e.g., Cavaliers vs Thunder historical data), whereas Dataset A compares metrics within a single season.",
        "Dataset B includes navigation for lesser-covered leagues (e.g., Portuguese Primeira Liga), while Dataset A focuses on major leagues like NBA, NFL, and EPL."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=1": [
        "Dataset B tasks focus on NFL and college football leagues, while Dataset A emphasizes NBA and NHL.",
        "Dataset B includes navigation tasks for English Premier League soccer, which are absent in Dataset A.",
        "Tasks in Dataset B frequently involve College Football Playoff (CFP) schedules and brackets, not present in Dataset A.",
        "Dataset B requires interaction with NFL seasonal timelines (e.g., Week 18 scores), while Dataset A focuses on calendar dates like December 25 games.",
        "Dataset B tasks demand access to sports betting odds (e.g., AFC Champion odds), which are not highlighted in Dataset A.",
        "Dataset B involves retrieving player injury reports (e.g., Damar Hamlin\u2019s status), unlike Dataset A.",
        "Dataset B includes navigation for NFL Playoff Machine tools, absent in Dataset A\u2019s tasks.",
        "Dataset B emphasizes college football bowl game results and schedules, whereas Dataset A focuses on NCAA basketball tournaments.",
        "Tasks in Dataset B require accessing English Premier League tables and fixtures, which are not mentioned in Dataset A.",
        "Dataset B tasks involve tracking specific NFL team schedules (e.g., Saints 2024 schedule), while Dataset A focuses on real-time team/player performance comparisons."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=3": [
        "Dataset B includes tasks related to college football bowl game schedules and results, while Dataset A focuses on NBA/NHL regular season games",
        "Dataset B requires navigation to sports podcast sections and ESPN Radio content not present in Dataset A tasks",
        "Dataset B contains tasks involving NCAA Football Playoff (CFP) bracket analysis and tiebreaker scenarios absent in Dataset A",
        "Dataset B includes requests for coaching staff changes and NFL team management updates not found in Dataset A",
        "Dataset B requires comparison of complete team rosters between NFL teams, while Dataset A focuses on individual team rosters",
        "Dataset B contains tasks related to college football transfer portal news and player recruitment updates",
        "Dataset B includes requests for historical player statistics across different teams/eras (e.g. Michael Jordan with Wizards) not present in Dataset A",
        "Dataset B requires navigation to fantasy baseball rankings and projections absent in Dataset A's basketball-focused fantasy tasks",
        "Dataset B contains tasks involving specific ESPN+ streaming availability checks for NHL/NFL games",
        "Dataset B includes requests for NCAA Football Bowl Game historical results and analysis beyond basic score retrieval"
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=2": [
        "Tasks in B require accessing NFL Playoff Machine and playoff scenario simulations not present in A",
        "B includes navigation tasks for college football (NCAAF) bowl games and CFP brackets absent in A",
        "B involves retrieving injury reports for NFL teams and specific games, which A does not cover",
        "B requires interaction with 2024 NFL schedule and future game planning features",
        "B contains tasks related to National Lacrosse League scores and standings not found in A",
        "B includes esports navigation tasks (e.g., searching for 'esports' content) unlike A",
        "B features multi-team trade simulation requests (e.g., Hawks/Celtics/Bulls) not seen in A",
        "B requires accessing season-long team performance summaries (e.g., 2024 Bills season)",
        "B emphasizes NFL draft results and rookie player tracking more prominently than A",
        "B includes NCAA football Week 16-specific score queries and statistical breakdowns absent in A"
      ]
    },
    "huggingface": {
      "nnetnav_live_site=huggingface_num_tasks=76_portion=1": [
        "Tasks in B require locating academic research papers and tutorial materials alongside models/datasets",
        "B emphasizes finding implementation-specific details like GitHub repositories and code examples",
        "B includes tasks related to commercial product development and enterprise pricing information",
        "B contains requests for installation/configuration instructions for client libraries and SDKs",
        "B requires identifying beginner-friendly resources and learning materials for new users",
        "B tasks involve accessing raw dataset files and understanding dataset splits/structures",
        "B includes explicit requirements for troubleshooting and error reporting processes",
        "B contains tasks focused on generating specific creative outputs through model interactions",
        "B emphasizes workflow automation through CI/CD pipelines and GitHub Actions integration",
        "B requires understanding model deployment considerations for different hardware environments"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=0": [
        "Dataset B tasks require accessing research papers and academic references (e.g. arXiv papers, GlotLID paper) not present in Dataset A",
        "Dataset B includes explicit queries about commercial AI product development tools (e.g. GitHub Copilot pricing) beyond core platform features",
        "Dataset B tasks involve direct interaction with community contributions through comments, discussions, and pull request management",
        "Dataset B contains specific references to model architecture details (e.g. QVQ-72B-Preview image-text-to-text capabilities) rather than general functionality",
        "Dataset B requires comparison of computational resource pricing (CPU/GPU instances) not emphasized in Dataset A",
        "Dataset B tasks focus on model integration with external deployment tools (GitHub Actions) beyond basic API usage",
        "Dataset B includes explicit queries about dataset format conversions (e.g. Parquet conversion) not seen in Dataset A",
        "Dataset B tasks require accessing model-specific research papers (e.g. DialoGPT paper) alongside technical documentation",
        "Dataset B contains queries about ethical AI implementation and responsible usage guidelines as primary task objectives",
        "Dataset B tasks involve troubleshooting model access issues and license compliance scenarios for specific implementations"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=4": [
        "Dataset B tasks require identifying specific model versions (e.g. 'Llama-3.3-70B-Instruct') while Dataset A focuses on general model identification",
        "Dataset B includes tasks requiring academic paper retrieval (research papers/abstracts) while Dataset A focuses solely on model/dataset metadata",
        "Dataset B contains troubleshooting-oriented tasks (error resolution, implementation issues) absent in Dataset A",
        "Dataset B tasks involve multilingual resource handling (e.g. German documentation) not present in Dataset A",
        "Dataset B requires commercial integration analysis (business tool compatibility) while Dataset A focuses on basic commercial use permissions",
        "Dataset B includes performance optimization research tasks (e.g. 'Black-Box Prompt Optimization') absent in Dataset A",
        "Dataset B tasks demand exact model name matching (e.g. 'HuggingFaceTB/finemath') while Dataset A uses descriptive searches",
        "Dataset B contains CI/CD workflow integration tasks (GitHub Actions) not present in Dataset A",
        "Dataset B requires parsing numerical precision specifications (e.g. 'TensorRT-LLM backend precision') absent in Dataset A",
        "Dataset B includes low-resource language model searches (e.g. Indonesian sentiment analysis) not found in Dataset A"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=2": [
        "Dataset B tasks focus on general model discovery without recency requirements, while Dataset A emphasizes finding the most recently updated models",
        "Dataset B includes tasks requiring interaction with model commit histories/repository information, absent in Dataset A",
        "Dataset B contains tasks involving SDK installation/runtime environment setup, unlike Dataset A",
        "Dataset B tasks frequently require error resolution (e.g. 'Task not found' errors), not present in Dataset A",
        "Dataset B includes explicit commercial application verification tasks, while Dataset A focuses on license type identification",
        "Dataset B tasks involve direct paper retrieval from model pages, whereas Dataset A focuses on cross-referencing research",
        "Dataset B contains repository structure/organization tasks (e.g. dataset formatting), absent in Dataset A",
        "Dataset B includes user account actions (sign-ups/registrations), not required in Dataset A tasks",
        "Dataset B tasks target specific application domains (e.g. dog breeds, anime styles), while Dataset A uses broader technical categories",
        "Dataset B requires navigation through model README files for implementation details, whereas Dataset A uses centralized documentation"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=3": [
        "Tasks in B require direct interaction with specific model names (e.g., 'meta-llama/Llama-3.3-70B-Instruct') rather than generic attribute-based searches",
        "B includes tasks focused on dataset format conversions (e.g., Parquet format transformation)",
        "B contains explicit requirements for model optimization techniques (e.g., CPU inference optimization with bitsandbytes)",
        "B emphasizes practical implementation steps (e.g., library installation, local deployment instructions)",
        "Tasks in B require understanding of model fine-tuning processes (e.g., 'learn how to fine-tune a language model')",
        "B includes queries about model ethics policies and commercial suitability assessments",
        "B contains tasks requiring integration with specific Chinese platforms (WeChat, Zhihu)",
        "B emphasizes multilingual translation capabilities (e.g., sentence transformers for multilingual tasks)",
        "Tasks in B focus on medical domain applications (e.g., medical chatbot datasets, drug side effects)",
        "B requires navigation through specialized model families (Qwen, DeepSeek) rather than general categories"
      ]
    },
    "coursera": {
      "nnetnav_live_site=coursera_num_tasks=72_portion=3": [
        "Dataset A tasks require precise numerical outputs (e.g., course counts, percentage ratings, exact hours) while Dataset B focuses on qualitative exploration (e.g., curriculum content, career alignment).",
        "Dataset A tasks emphasize filtering by specific duration ranges (e.g., 1-4 weeks) whereas Dataset B tasks omit explicit duration constraints.",
        "Dataset A tasks frequently reference institutional partnerships (e.g., Google, IBM) while Dataset B tasks prioritize course providers over institutional affiliations.",
        "Dataset A tasks require cross-referencing course content with niche topics (e.g., Renewable Energy Futures) whereas Dataset B tasks target broader domains (e.g., Data Science).",
        "Dataset A tasks explicitly ask for instructor credentials and biographies; Dataset B tasks focus on skills/outcomes over instructor details.",
        "Dataset B tasks include guided project searches (e.g., Python guided projects) absent in Dataset A.",
        "Dataset A tasks involve verifying star-level distributions in reviews; Dataset B tasks omit granular rating analysis.",
        "Dataset B tasks emphasize career development goals (e.g., career skills, job roles) more prominently than Dataset A.",
        "Dataset A tasks specify time investment per week (e.g., 5 hours/week); Dataset B tasks omit time commitment parameters.",
        "Dataset B tasks prioritize course enrollment actions (e.g., 'enroll in IBM Data Science') while Dataset A focuses on informational queries."
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=2": [
        "Dataset B includes tasks focused on social impact, human rights, and finance for social good, which are absent in Dataset A.",
        "Tasks in Dataset B require identifying courses taught in specific languages (e.g., German), unlike Dataset A.",
        "Dataset B emphasizes detailed exploration of course modules and content (e.g., AI modules, Science of Well-Being topics), beyond metadata retrieval in Dataset A.",
        "Dataset B tasks prioritize professional certificates with explicit skill focuses (e.g., Google Cybersecurity), while Dataset A focuses on general certificate exploration.",
        "Dataset B includes queries about recommended experience or prerequisites for roles (e.g., Data Analyst), not present in Dataset A.",
        "Dataset B tasks involve career development alignment (e.g., data science career paths), whereas Dataset A focuses on career outcome statistics like salary ranges.",
        "Dataset B requires comparing courses (e.g., leadership courses), a task absent in Dataset A.",
        "Dataset B tasks target financial courses with ethical/social good angles, unlike Dataset A's general finance queries.",
        "Dataset B emphasizes Python programming integration in data science/business contexts, while Dataset A focuses on broader programming skills.",
        "Dataset B includes requests for course applicability to specific industries (e.g., healthcare, translation), whereas Dataset A focuses on general domain exploration."
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=4": [
        "Tasks in dataset B require identifying specific skills gained from courses, while A focuses on metadata extraction without skill outcomes.",
        "Dataset B includes tasks seeking definitions or conceptual explanations (e.g., 'What is data analytics?'), unlike A's practical focus.",
        "B contains explicit career path navigation tasks (e.g., 'How to become Data Analyst'), while A emphasizes credential comparison.",
        "Dataset B tasks request detailed curriculum breakdowns (e.g., 'First two modules of...'), whereas A focuses on structural overviews.",
        "B requires comparing courses for specific professional contexts (business professionals, teachers), while A compares institutional offerings.",
        "Tasks in B demand cross-disciplinary course combinations (e.g., 'English teaching + data analysis'), absent in A's single-domain focus.",
        "Dataset B emphasizes emerging tech domains (Generative AI, Prompt Engineering), while A focuses on established technical requirements.",
        "B includes tasks analyzing user reviews/testimonials ('What are people saying...'), unlike A's numerical rating analysis.",
        "Dataset B contains explicit enrollment/signup process tasks, while A focuses purely on discovery and comparison.",
        "B requires language-specific course filtering (e.g., 'taught in English'), whereas A's tasks assume universal language availability."
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=0": [
        "Dataset B includes tasks focused on course enrollment processes and configuration of user preferences (e.g., language settings)",
        "Dataset B emphasizes exploration of interdisciplinary course combinations (e.g., graphic design + art + culture)",
        "Dataset B contains tasks requiring analysis of course syllabi/module structures rather than just metadata",
        "Dataset B features more requests for social science/humanities content (e.g., feminism, human rights, social justice)",
        "Dataset B includes explicit requirements for subtitle availability and language localization features",
        "Dataset B tasks frequently reference specific university partnerships (e.g., Yale, Stanford) in queries",
        "Dataset B shows increased focus on free course offerings and audit options in task requirements",
        "Dataset B contains tasks requiring comparison between multiple course versions or providers",
        "Dataset B emphasizes career credential pathways over individual course outcomes in queries",
        "Dataset B includes tasks requiring navigation through multi-step enrollment/auth workflows"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=1": [
        "Dataset B tasks emphasize detailed program structure and curriculum content exploration, whereas Dataset A focuses on extracting specific course attributes like duration and ratings.",
        "Tasks in Dataset B require gathering information on enrollment processes and admission requirements, which are not present in Dataset A tasks.",
        "Dataset B includes queries about prerequisites for certifications/degrees, while Dataset A does not address eligibility criteria.",
        "Dataset B tasks explicitly target tool-specific skill development (e.g., Electric VLSI EDA Tool), unlike Dataset A's general domain focus.",
        "Financial considerations in Dataset B extend to refund policies and payment structures, whereas Dataset A focuses on price comparisons and discounts.",
        "Dataset B contains requests to exclude specific topics/categories from search results, demonstrating more complex filtering needs than Dataset A.",
        "Career alignment verification (matching courses to specific jobs) is unique to Dataset B tasks.",
        "Dataset B requires identification of course components/modules within specializations, while Dataset A focuses on overall program characteristics.",
        "Language preference specification (e.g., English-only courses) appears exclusively in Dataset B requirements.",
        "Dataset B tasks emphasize practical skill application in professional contexts (e.g., fraud detection), whereas Dataset A focuses on theoretical learning outcomes."
      ]
    },
    "arxiv": {
      "nnetnav_live_site=arxiv_num_tasks=80_portion=1": [
        "Dataset B tasks require analyzing specific paper content sections (e.g., abstracts, related work, methodology)",
        "Dataset B includes tasks involving document format troubleshooting (e.g., HTML conversion errors)",
        "Dataset B requires interaction with supplementary materials (e.g., references, full-text downloads)",
        "Dataset B tasks involve technical format analysis (e.g., TeX/LaTeX/MathML usage in papers)",
        "Dataset B contains tasks requiring copyright/license information retrieval",
        "Dataset B includes cross-platform content verification (e.g., checking related papers on other academic platforms)",
        "Dataset B tasks focus on specific paper versions/updates (e.g., v3 submission details)",
        "Dataset B requires understanding submission endorsement processes/policies",
        "Dataset B contains multimedia content access tasks (e.g., finding experiment videos)",
        "Dataset B includes tasks requiring identification of paper structural components (e.g., separating main content from references)"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=4": [
        "Dataset B tasks require interacting with detailed paper sections (e.g., introductions, experimental setups) while Dataset A focuses on metadata retrieval like dates/counts",
        "Dataset B includes technical troubleshooting tasks (e.g., HTML error investigation) absent in Dataset A",
        "Dataset B contains author-specific search tasks (e.g., finding works by Kai Schmitz) not present in Dataset A",
        "Dataset B requires cross-referencing within papers (e.g., finding specific figures/references) while Dataset A focuses on surface attributes",
        "Dataset B tasks involve interdisciplinary research queries (e.g., AI+Quantum Computing) compared to Dataset A's single-domain focus",
        "Dataset B includes code/data retrieval tasks (e.g., source code download) absent in Dataset A",
        "Dataset B requires understanding paper structure (e.g., finding specific sections) while Dataset A focuses on categorical filtering",
        "Dataset B contains conceptual explanation tasks (e.g., galaxy rotation curves) beyond Dataset A's factual queries",
        "Dataset B tasks utilize arXiv identifiers directly (e.g., 2412.18585) more frequently than Dataset A",
        "Dataset B includes emerging technology exploration (e.g., GameFi/DeFi) not covered in Dataset A's traditional categories"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=0": [
        "Dataset B tasks involve retrieving papers by specific arXiv IDs (e.g., 'arXiv 2412.18601') while Dataset A focuses on metadata-based queries without explicit ID references",
        "Dataset B requires accessing full paper sections (e.g., methodology/results) for content extraction, whereas Dataset A focuses on surface metadata like submission dates/author counts",
        "Dataset B includes license verification tasks (e.g., 'Determine the license type') absent in Dataset A",
        "Dataset B contains troubleshooting tasks (e.g., layout issues, error messages) not present in Dataset A's navigation flows",
        "Dataset B tasks require direct PDF download actions ('Download the PDF version') rather than just metadata retrieval",
        "Dataset B includes citation tracking tasks ('Find all references in...') while Dataset A focuses on cross-category comparisons",
        "Dataset B tasks involve paper content validation (e.g., 'check information on arXiv licenses') unlike Dataset A's filtering tasks",
        "Dataset B requires anomaly detection in content (e.g., 'accessibility-friendly format') not addressed in Dataset A",
        "Dataset B includes author-specific searches (e.g., 'papers written by Ariel Shlosberg') beyond Dataset A's institutional information tasks",
        "Dataset B tasks involve temporal constraints spanning future dates (e.g., '2022-2025') while Dataset A uses historical date ranges"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=2": [
        "Dataset A tasks require counting results within specific date ranges and categories, while Dataset B focuses on retrieving specific content elements from individual papers (e.g., abstracts, sections, references)",
        "Dataset A requires comparing results across multiple subject categories, whereas Dataset B tasks emphasize locating papers through singular category navigation",
        "Dataset A includes queries requiring verification of institutional affiliations through external websites, while Dataset B tasks remain confined to arXiv content",
        "Dataset B contains tasks requiring navigation through paper components (methods, figures, references), unlike Dataset A which focuses on metadata extraction",
        "Dataset A tasks demand temporal precision (e.g., 'last week', 'Jan 1-3'), while Dataset B uses relative timeframes ('recent', 'latest') without specific date constraints",
        "Dataset B includes tasks requiring PDF/download management actions absent from Dataset A's metadata-focused queries",
        "Dataset A requires cross-archive comparisons ('search in all archives'), while Dataset B tasks focus on single-archive retrieval",
        "Dataset B contains ambiguous search queries ('find anything of interest'), whereas Dataset A uses precise search parameters",
        "Dataset A tasks require policy verification (submission guidelines), while Dataset B focuses on content comprehension",
        "Dataset B includes requests for technical implementation details (source code, figure designs) not present in Dataset A's metadata-oriented tasks"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=3": [
        "Dataset B tasks require downloading full paper texts, while Dataset A focuses on extracting metadata from search results.",
        "Dataset B includes tasks involving direct access to paper references/citations, absent in Dataset A.",
        "Dataset B tasks involve troubleshooting submission processes, not present in Dataset A.",
        "Dataset B requires identifying author affiliations beyond institutional maintainers mentioned in Dataset A.",
        "Dataset B tasks demand understanding paper content structures (e.g. methodology sections), while Dataset A focuses on surface-level attributes.",
        "Dataset B includes format-specific retrieval (PDF/TeX), not required in Dataset A's metadata-focused tasks.",
        "Dataset B contains queries about paper version histories and source code availability, absent in Dataset A.",
        "Dataset B requires cross-referencing between paper content and external citation networks, unlike Dataset A's category comparisons.",
        "Dataset B tasks involve arXiv ID lookup and direct paper retrieval, while Dataset A uses keyword/title searches.",
        "Dataset B includes requests for author publication histories, whereas Dataset A focuses on institutional relationships."
      ]
    },
    "bbc": {
      "nnetnav_live_site=bbc_num_tasks=69_portion=2": [
        "Dataset A tasks focus on real-time updates and immediate summaries of current geopolitical/economic events, while Dataset B includes future-oriented queries (e.g., release dates, upcoming fixtures).",
        "Dataset A requires precise extraction of numerical/metric-based answers (e.g., stroke counts, tariff rates), whereas Dataset B emphasizes exploratory browsing without strict quantitative targets.",
        "Dataset B includes hyper-localized content requests (e.g., Cornwall course closures, Devon wellness) absent in Dataset A's regionally broad scope.",
        "Dataset B tasks involve entertainment/cultural trend analysis (e.g., celebrity flops, Joker sequel), while Dataset A prioritizes hard news and policy impacts.",
        "Dataset A tasks demand timestamp-based recency filtering (e.g., 'latest', '3 hrs ago'), whereas Dataset B includes historical/retrospective queries (e.g., 2004 tsunami anniversary).",
        "Dataset B contains explicit requests for multimedia navigation (e.g., 'watch BBC News video'), while Dataset A assumes multimedia interaction within structured content lists.",
        "Dataset A emphasizes section-specific hierarchical navigation (e.g., 'Green Living section'), while Dataset B requires lateral topic exploration across categories (e.g., 'find causes of plane crash').",
        "Dataset B includes user-action tasks (e.g., bookmarking recipes), whereas Dataset A focuses purely on information retrieval.",
        "Dataset A sports queries target match results/analyses, while Dataset B seeks future schedules (e.g., 'January 2025 fixtures').",
        "Dataset B features niche cultural phenomena (e.g., waacking dance revival), contrasting with Dataset A's standardized category navigation."
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=3": [
        "Dataset B tasks require locating multimedia content (podcasts/videos) through dedicated sections not present in Dataset A",
        "Dataset B includes time-sensitive tasks involving real-time data lookup (weather forecasts, stock prices) absent in Dataset A's article-focused timestamps",
        "Dataset B tasks involve future event planning (sport schedules for 2025) while Dataset A focuses on current/recent articles",
        "Dataset B requires navigation through specialized science/environment categories not emphasized in Dataset A's geopolitical focus",
        "Dataset B tasks demand interaction with service-oriented content (course enrollment, club memberships) beyond Dataset A's information retrieval",
        "Dataset B contains cryptocurrency price tracking requirements not present in Dataset A's financial reporting scope",
        "Dataset B includes explicit map/data visualization interpretation tasks for climate change impacts absent in Dataset A",
        "Dataset B tasks require comparative analysis of regional perspectives (e.g., Chinese views on economy) beyond Dataset A's conflict reporting",
        "Dataset B involves navigation through multi-format content hubs (video libraries, podcast archives) rather than Dataset A's article lists",
        "Dataset B tasks require synthesizing information across temporal scales (historical trends vs current events) not emphasized in Dataset A"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=1": [
        "Dataset B tasks require navigation through multimedia content types (podcasts/videos) not emphasized in Dataset A",
        "Dataset B includes exploratory tasks without specific article targets compared to Dataset A's precise article-finding objectives",
        "Dataset B tasks involve historical event research (e.g., 2004 tsunami) while Dataset A focuses on current/recent news",
        "Dataset B contains technical/specialized content requirements (AI capabilities, climate tech) absent from Dataset A's general news focus",
        "Dataset B tasks require understanding of social media trends (TikTok underconsumption) not present in Dataset A",
        "Dataset B includes interactive elements (hotel booking attempts) while Dataset A remains informational",
        "Dataset B tasks span broader thematic categories (space exploration to urban sketching) versus Dataset A's standardized news sections",
        "Dataset B requires navigation through entertainment content (TV series releases, book lists) beyond Dataset A's news structure",
        "Dataset B contains weather forecast retrieval tasks absent from Dataset A's scope",
        "Dataset B tasks involve cross-content-type synthesis (combining tech/sports/AI) unlike Dataset A's single-medium focus"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=0": [
        "Tasks in Dataset B require synthesizing information across multiple content categories for research purposes, unlike Dataset A's focus on retrieving specific details from predefined sections.",
        "Dataset B tasks involve exploratory navigation through undefined pathways (e.g., 'find products that promote longevity'), while Dataset A tasks follow established hierarchical structures (e.g., 'Check Sports section for match results').",
        "Dataset B includes objectives requiring interaction with dynamic website features (e.g., testing forecast functionality), whereas Dataset A focuses on static content consumption.",
        "Tasks in Dataset B demand cross-domain analysis (e.g., business decisions impacting Formula 1), while Dataset A maintains domain-specific queries within sections.",
        "Dataset B contains tasks requiring evaluation of real-world applications (e.g., environmental impact of flying), compared to Dataset A's emphasis on factual reporting extraction.",
        "Dataset B tasks involve speculative information gathering (e.g., 'potential alternatives to flying'), whereas Dataset A focuses on existing verified information retrieval.",
        "Dataset B requires navigation to external organizational resources (e.g., university courses, donation pages), while Dataset A remains within BBC's core news/content structure.",
        "Tasks in Dataset B necessitate temporal comparisons (e.g., 'latest graphene news' vs historical context), whereas Dataset A focuses on singular temporal snapshots ('most recent update').",
        "Dataset B includes meta-navigation tasks (e.g., 'browse through available podcasts'), while Dataset A specifies direct access to known content locations.",
        "Dataset B tasks require inference from multimedia context (e.g., video controls interaction), whereas Dataset A focuses on explicit content extraction from multimedia."
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=4": [
        "Tasks in dataset B require navigating through localized or hyper-specific regional sections (e.g., Tayside & Central Scotland, New Orleans) not present in dataset A's continent-level geographic queries.",
        "Dataset B includes tasks involving multi-disciplinary sections like 'Innovation' or 'Politics' that aren't explicitly targeted in dataset A's category-based navigation.",
        "Time-sensitive tasks in dataset B focus on multi-day/historical events (e.g., 20-year tsunami anniversary) rather than dataset A's emphasis on minute/hour-level recency.",
        "Dataset B tasks require identification of specialized content types (e.g., hotel features, academic courses) absent from dataset A's article/video-focused multimedia retrieval.",
        "Navigation in dataset B involves verifying conflicting claims (e.g., Trump's Panama Canal statements) rather than dataset A's factual summarization of predefined articles.",
        "Tasks in dataset B demand comparative analysis across content formats (e.g., books vs films vs podcasts) unlike dataset A's single-medium information extraction.",
        "Dataset B includes meta-navigation tasks testing website functionality (search/sign-up processes) absent from dataset A's content-focused requirements.",
        "Event retrieval in dataset B spans unconventional categories (e.g., cybersecurity human aspects, aurora phenomena) beyond dataset A's defined disaster/political taxonomies.",
        "Dataset B tasks require synthesis of cross-temporal information (e.g., 2024 tech trends evolution) versus dataset A's singular temporal focus on 'latest' updates.",
        "Regional specificity in dataset B extends to hyperlocal cultural elements (e.g., Indonesian coffee regions) rather than dataset A's country/continent-level geographic scope."
      ]
    },
    "amazon": {
      "nnetnav_live_site=amazon_num_tasks=63_portion=2": [
        "Dataset B tasks frequently involve purchasing intent (e.g., 'purchase', 'buy', 'add to cart') rather than pure search/filter actions",
        "Dataset B includes broader exploratory goals (e.g., 'find gift ideas', 'explore gourmet food') without specific attribute requirements",
        "Dataset B contains tasks focused on product popularity metrics (e.g., 'best-selling', 'top-selling') rather than user-defined filters",
        "Dataset B requires identification of luxury/pre-owned items (e.g., 'pre-loved Louis Vuitton', 'pre-owned devices') as explicit targets",
        "Dataset B tasks often lack numeric constraints (e.g., 'find cheapest shampoo' vs 'under $10')",
        "Dataset B includes protection plan considerations (e.g., 'including protection plan') not present in A",
        "Dataset B contains more generic category exploration (e.g., 'find laptop section') without attribute specifications",
        "Dataset B tasks reference seasonal/event-based shopping (e.g., 'Winter Sale toys', 'Friday night movie rental')",
        "Dataset B includes multi-product bundles (e.g., 'Peloton Bike and accessories') rather than single-item focus",
        "Dataset B tasks emphasize brand-specific searches (e.g., 'Homtiem Black Garlic', 'Makita Power Tools') over generic attributes"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=3": [
        "Dataset A tasks require explicit price range filters (e.g., $50-$100), while Dataset B tasks focus on finding cheapest/most expensive items without strict ranges",
        "Dataset A tasks emphasize customer rating thresholds (e.g., 4+ stars), while Dataset B rarely specifies rating requirements",
        "Dataset A includes specific condition filters (e.g., 'Used - Good'), while Dataset B tasks lack used/refurbished item requirements",
        "Dataset A tasks frequently require delivery option verification, while Dataset B tasks omit shipping/delivery details",
        "Dataset A tasks specify exact product attributes (e.g., 30\" length, 6mm thickness), while Dataset B uses broader attribute descriptions",
        "Dataset B contains generic purchase tasks (e.g., 'Buy office supplies') without specific filters, unlike Dataset A's detailed requirements",
        "Dataset A focuses on physical/digital products, while Dataset B includes service-related tasks (Prime Video rentals, Amazon Fresh)",
        "Dataset B contains more quantity-specific actions (e.g., 'Add 2 gifts'), while Dataset A focuses on single-item interactions",
        "Dataset A requires verification of technical specifications, while Dataset B emphasizes price comparison without detail checks",
        "Dataset B includes account creation tasks, while Dataset A focuses on existing account interactions (returns/sign-in)"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=1": [
        "Dataset B tasks include navigating niche categories like luxury brands (e.g., Oscar de la Renta, Aquazzura) not emphasized in Dataset A",
        "Dataset B requires interaction with Amazon services beyond core shopping (e.g., Prime Video, Kindle Unlimited, Grubhub)",
        "Dataset B contains tasks involving multi-step gift selection processes and gift card customization",
        "Dataset B includes navigation through Amazon Fresh grocery sections and fresh produce searches",
        "Dataset B tasks address site interaction challenges like CAPTCHA verification and anti-bot protections",
        "Dataset B emphasizes seasonal/holiday-specific shopping (e.g., Winter Sale, Easter gifts) more prominently",
        "Dataset B contains tasks requiring exploration of Amazon's organizational structures (brand pages, category hierarchies)",
        "Dataset B includes tasks focused on subscription-based products/services (e.g., Prime benefits, Subscribe & Save)",
        "Dataset B requires comparison of products across lifestyle categories rather than single product attributes",
        "Dataset B tasks involve navigating Amazon's premium/luxury store sections (Luxury Stores, Shopbop markdowns)"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=0": [
        "Dataset B tasks focus on finding cheapest/most expensive items without specific price ranges, while A requires precise price ranges",
        "Dataset B includes broader product category exploration (e.g. 'eco-friendly kitchen products'), while A specifies exact attributes",
        "Dataset B contains tasks for seasonal/event-specific sales (e.g. Winter Sale), which are absent in A",
        "Dataset B requires adding multiple unspecified items to cart, while A focuses on adding specific identified products",
        "Dataset B includes general shopping goals (e.g. 'buy gifts for woman'), while A requires specific feature combinations",
        "Dataset B tasks emphasize product exploration/discovery rather than strict filtering seen in A",
        "Dataset B contains more open-ended price checks ('find the price of X') without comparison parameters used in A",
        "Dataset B includes brand-focused searches (e.g. Louis Vuitton) without technical specifications required in A",
        "Dataset B tasks allow broader category browsing (e.g. office supplies), while A requires departmental filtering",
        "Dataset B contains more subjective requirements ('best selling', 'best deals') compared to A's quantifiable metrics"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=4": [
        "Dataset B tasks frequently require adding items to cart without specifying which items or their attributes (e.g. 'Add 5 items to cart')",
        "Dataset B includes gift card purchases (e.g. Christmas/graduation gifts) not seen in Dataset A",
        "Dataset B contains CAPTCHA verification tasks absent from Dataset A",
        "Dataset B tasks often lack specific numeric thresholds (e.g. 'Find shampoo prices' vs 'Under $50')",
        "Dataset B includes price checks without purchase requirements (e.g. 'Find price of Dole banana')",
        "Dataset B tasks use more generic category browsing (e.g. 'Browse fashion') vs specific subcategories in A",
        "Dataset B contains seasonal gift searches (e.g. 'Christmas gift') without detailed specifications",
        "Dataset B includes brand-specific searches without attributes (e.g. 'Find Belkin Wireless Chargers')",
        "Dataset B tasks frequently mention 'luxury' products without detailed specifications",
        "Dataset B contains open-ended discovery tasks (e.g. 'Find birthday gift ideas') without filtering criteria"
      ]
    },
    "wolframalpha": {
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=4": [
        "Dataset A tasks require applied real-world scenario modeling (e.g. nutritional calculations with weight loss timelines), while B focuses on theoretical concept exploration",
        "Dataset A contains complex multi-variable physics simulations (spring pendulums, celestial mechanics), whereas B emphasizes basic chemical equation balancing",
        "Dataset A requires comparative analysis of multiple packing methodologies, while B focuses on single-solution mathematical proofs",
        "Dataset A tasks integrate temporal-spatial parameters with scientific calculations (sunburn time estimates), whereas B handles isolated temporal calculations",
        "Dataset A demands interpretation of advanced mathematical constructs (complex number operations, pentagram inner regions), while B focuses on elementary arithmetic operations",
        "Dataset A contains tasks requiring material property comparisons across multiple elements, while B focuses on single-element property lookups",
        "Dataset A requires population growth rate calculations using real-world demographic data, whereas B handles basic financial present value computations",
        "Dataset A tasks involve geometric optimization problems (circle packing density), while B focuses on basic function plotting requests",
        "Dataset A contains advanced dynamic system modeling (spring pendulum kinematics), whereas B handles static chemical structure inquiries",
        "Dataset A requires cross-domain knowledge integration (combining calorie intake with metabolic rates), while B maintains single-domain question boundaries"
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=0": [
        "Dataset B tasks emphasize conceptual exploration (e.g., 'What is the Riemann Hypothesis?') over direct numerical computation prevalent in Dataset A.",
        "Dataset B includes explicit requests for definitions or foundational knowledge (e.g., 'What is a prime number?') absent in Dataset A's task-oriented queries.",
        "Dataset B tasks frequently involve humanities-oriented queries (e.g., 'Find the etymology of the word \"love\"') not observed in Dataset A's STEM-focused samples.",
        "Dataset B requires deeper contextual analysis (e.g., 'Find in-depth information about COVID-19') compared to Dataset A's targeted data retrieval (e.g., 'current temperature in Chicago').",
        "Dataset B includes explicit system/feature exploration (e.g., 'Investigate the features of Wolfram Language') absent in Dataset A's task execution focus.",
        "Dataset B tasks prioritize explanatory processes (e.g., 'explain step-by-step solution') over Dataset A's emphasis on result-oriented computation.",
        "Dataset B contains more historical/trend-based queries (e.g., 'historical trend of COVID-19 cases') compared to Dataset A's fixed temporal parameters (e.g., '2023 prices').",
        "Dataset B includes abstract mathematical concept exploration (e.g., 'properties of beta distribution') where Dataset A focuses on applied calculations (e.g., 'integral of 3e^(2x)').",
        "Dataset B features language/linguistics analysis tasks (e.g., 'meaning of molar mass value') beyond Dataset A's unit conversion requirements.",
        "Dataset B tasks demonstrate broader scope in information synthesis (e.g., 'climate change models with temperature anomalies') compared to Dataset A's discrete comparative analyses (e.g., 'thermal conductivity comparison')."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=1": [
        "Dataset A tasks involve real-time or current data retrieval (e.g., current temperature, 2023 prices), while B focuses on historical/long-term data analysis (e.g., 10-year temperature anomalies).",
        "Dataset A includes personalized scenarios (e.g., weight loss predictions, individual calorie intake), whereas B emphasizes general academic/theoretical queries without personalization.",
        "Dataset B tasks prioritize conceptual explanations (e.g., \"Explain chemical thermodynamics\"), while A emphasizes direct computational outputs (e.g., unit conversions).",
        "Dataset A requires immediate numerical comparisons (e.g., material conductivity comparisons), while B involves exploratory research (e.g., paradoxes, sequence relationships).",
        "Dataset B includes explicit requests for educational content navigation (e.g., \"Get familiar with examples\"), whereas A assumes prior familiarity with tool usage.",
        "Dataset A tasks often specify geographic/local constraints (e.g., city-specific data), while B uses universal scientific constants (e.g., speed of light) or abstract parameters.",
        "Dataset B contains more definitional/etymological queries (e.g., word origins), while A focuses exclusively on quantitative analysis.",
        "Dataset A features time-sensitive physical calculations (e.g., sunburn timing), while B includes celestial event timing (e.g., moon phases, eclipses).",
        "Dataset B tasks require cross-domain knowledge synthesis (e.g., linking Fibonacci sequences to Collatz conjecture), while A maintains domain-specific isolation.",
        "Dataset A uses concrete real-world measurement systems (e.g., SPF values, calorie counts), whereas B employs abstract mathematical constructs (e.g., beta distributions, polynomial roots)."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=3": [
        "Tasks in dataset B require explicit information retrieval (e.g., definitions, etymologies, historical facts) rather than direct computation.",
        "Dataset B includes tasks asking for platform-specific features or pricing (e.g., Wolfram Alpha Pro plans, product details).",
        "Tasks in dataset B involve exploring Wolfram Alpha's resources (e.g., math tools, problem generators) as part of the objective.",
        "Dataset B tasks explicitly request downloading or saving results (e.g., chemical properties, solutions in TeX format).",
        "Tasks in dataset B focus on theoretical concepts (e.g., paradoxes, hypotheses) beyond applied calculations.",
        "Dataset B includes tasks requiring research on economic or social factors (e.g., unemployment causes, mortgage options).",
        "Tasks in dataset B emphasize stock market data or financial planning (e.g., Merck & Co. stock, currency conversion for planning).",
        "Dataset B tasks often include meta-inquiries about Wolfram Alpha\u2019s capabilities (e.g., platform features, example problems).",
        "Tasks in dataset B involve linguistic or humanities-oriented queries (e.g., etymology, paradoxes, native names of deities).",
        "Dataset B includes tasks requiring verification of data dissemination permissions (e.g., checking if solutions can be shared)."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=2": [
        "Dataset B tasks more frequently involve direct lookup of single-point factual data (e.g., element properties, boiling points) rather than comparative analysis",
        "Dataset A contains more complex multi-variable physics/engineering simulations (e.g., spring pendulums) requiring dynamic system modeling",
        "Dataset B includes more basic arithmetic/algebraic equation solving without advanced mathematical operations",
        "Dataset A requires interpretation of geometric/spatial relationships (circle packing, polyomino combinations)",
        "Dataset B contains more explicit requests for data formatting/output (TeX, image downloads)",
        "Dataset A tasks frequently involve temporal projections/forecasts (weight loss timelines, population growth rates)",
        "Dataset B includes more definitional/encyclopedic queries (properties of mathematical concepts, paradox explanations)",
        "Dataset A contains more sophisticated visualization requirements (parametric curve plotting, function analysis)",
        "Dataset B features more exploratory/open-ended instructions without specific computational goals",
        "Dataset A requires integration of multiple scientific principles in single tasks (thermal+material properties, orbital mechanics)"
      ]
    },
    "allrecipes": {
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=0": [
        "Dataset A tasks require precise numerical thresholds (e.g., '50+ reviews', '4.5 stars'), while Dataset B tasks use qualitative popularity metrics without specific values.",
        "Dataset A tasks demand explicit nutritional data (calories, carbs) extraction; Dataset B rarely specifies nutritional requirements unless diet-related (e.g., keto).",
        "Dataset A tasks involve structured outputs (ingredient lists, step summaries); Dataset B tasks focus on general discovery without formatting constraints.",
        "Dataset B tasks emphasize dietary labels (keto, gluten-free) as primary filters; Dataset A prioritizes ingredient/rating constraints over named diets.",
        "Dataset A requires comparing multiple recipes via filtering/sorting; Dataset B tasks lack explicit comparison directives.",
        "Dataset A tasks reference granular metadata (e.g., 'prep time under 45 minutes'); Dataset B uses broader timeframes (e.g., 'weeknight').",
        "Dataset B includes exploratory/incomplete task phrasing (e.g., 'might have been something like...'); Dataset A tasks are fully articulated.",
        "Dataset A tasks require parsing user reviews for quality assessment; Dataset B tasks mention reviews only peripherally (e.g., 'rate the best one').",
        "Dataset A tasks specify exact serving sizes (e.g., 'suitable for 6 people'); Dataset B omits portion requirements.",
        "Dataset B contains placeholder/error tasks (e.g., garbled text samples); Dataset A maintains task coherence."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=4": [
        "Dataset B tasks emphasize saving/bookmarking for meal planning, while Dataset A focuses on retrieving specific recipe details.",
        "Dataset B includes tasks related to holiday/seasonal event menu planning beyond basic recipe search.",
        "Dataset B requires comparing multiple recipes, whereas Dataset A seeks single recipes meeting specific criteria.",
        "Dataset B involves active user contributions like submitting recipe modifications and detailed reviews.",
        "Dataset B features tasks requiring ingredient substitutions (e.g., finding alternatives for evaporated milk).",
        "Dataset B explicitly includes meal prep and structured healthy meal planning as primary objectives.",
        "Dataset B prioritizes kid-friendly recipes and family-oriented meal solutions.",
        "Dataset B tasks focus on nutritional comparisons between recipes, not just displaying nutritional information.",
        "Dataset B includes finding recipes for leftovers or repurposing specific ingredients (e.g., leftover ham).",
        "Dataset B emphasizes creating full event menus (e.g., holiday parties) rather than individual recipe discovery."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=1": [
        "Dataset B tasks frequently involve saving/bookmarking multiple recipes for future use.",
        "Dataset B includes tasks focused on holiday/event-specific recipes (e.g., Christmas, Hanukkah, New Year's).",
        "Dataset B requires interaction with user-generated content through recipe reviews or ratings (e.g., leaving feedback).",
        "Dataset B tasks emphasize nutritional information comparison between recipes (e.g., calorie counts, carb content).",
        "Dataset B includes broader exploratory queries without strict filters (e.g., 'find some recipes' or 'explore ideas').",
        "Dataset B tasks involve meal planning for specific contexts like family meals, kid-friendly options, or leftovers.",
        "Dataset B contains requests for recipe modifications or ingredient substitutions (e.g., dairy alternatives).",
        "Dataset B includes unstructured navigation paths for discovery rather than targeted filtering (e.g., browsing 'holiday appetizers').",
        "Dataset B tasks focus on preparation techniques or kitchen tips (e.g., 'freezer-friendly casseroles', 'how-to guides').",
        "Dataset B emphasizes seasonal ingredient usage beyond general dietary preferences (e.g., winter spices, festive flavors)."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=3": [
        "Tasks in B require user engagement actions like leaving reviews or interacting with recipe content",
        "Tasks in B involve broader exploration of recipe ideas without strict rating or review thresholds",
        "Tasks in B include saving recipes to collections or personal lists as a key requirement",
        "Tasks in B focus on utilizing leftover ingredients as a primary search criterion",
        "Tasks in B demand ingredient substitution suggestions within recipe interactions",
        "Tasks in B prioritize price checking for kitchen equipment alongside recipe searches",
        "Tasks in B emphasize meal planning strategies over single-recipe extraction",
        "Tasks in B require identification of seasonal ingredient pairings rather than strict event filters",
        "Tasks in B include requests for specific recipe format types (cookbooks/printables)",
        "Tasks in B emphasize dessert-specific preparation techniques over general recipe characteristics"
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=2": [
        "Dataset B tasks include explicit user engagement actions like writing reviews with personal tips or suggestions for recipe modifications.",
        "Dataset B tasks frequently involve repurposing leftovers or ingredients, focusing on creative reuse beyond initial meal preparation.",
        "Dataset B emphasizes seasonal/holiday-specific recipes for occasions like Halloween, New Year's Eve, and Christmas more granularly than Dataset A's general holiday focus.",
        "Dataset B tasks require community interaction beyond reading reviews (e.g., asking substitution questions, sharing leftover ideas) as part of the workflow.",
        "Dataset B includes budget-conscious constraints as a core task requirement across multiple samples, while Dataset A mentions it only as a navigation pattern.",
        "Dataset B tasks explicitly target beginner-friendly recipes and skill-level considerations in recipe selection.",
        "Dataset B requires users to navigate kid-friendly recipe attributes (e.g., Halloween snacks, school-safe granola bars) as a distinct filtering criterion.",
        "Dataset B tasks involve active recipe modification scenarios (e.g., substituting oils, adapting portion sizes) rather than just comparing existing versions.",
        "Dataset B includes specific requests for visual presentation elements (e.g., 'visually stunning garnish') not mentioned in Dataset A tasks.",
        "Dataset B tasks require cross-referencing multiple recipe components (e.g., matching main dishes with leftover ingredient reuse) as a consistent pattern."
      ]
    },
    "dictionary.cambridge": {
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=2": [
        "Tasks in dataset A require structured navigation through specific sections (e.g., Grammar, Thesaurus), while dataset B tasks involve broader exploration without explicit section guidance.",
        "Dataset A tasks often demand explicit interaction with pronunciation audio buttons (e.g., clicking speaker icons), whereas dataset B tasks focus on phonetic transcription extraction.",
        "Tasks in dataset A frequently specify retrieval of both UK and US pronunciations, while dataset B tasks may request only one variant or comparative analysis.",
        "Dataset A tasks require direct translation into named languages (e.g., Chinese, Spanish), while dataset B tasks involve general translation functionality without specified targets.",
        "Tasks in dataset A explicitly request numerical answers (e.g., count of definitions), whereas dataset B focuses on qualitative descriptions.",
        "Dataset B includes tasks requiring synonym/antonym exploration through Thesaurus relationships, which are absent in dataset A's sampled tasks.",
        "Tasks in dataset A specify creation of new example sentences using definitions, while dataset B focuses on finding existing usage examples.",
        "Dataset B contains tasks requiring comparison of related terms (e.g., business vocabulary clusters), whereas dataset A focuses on individual word analysis.",
        "Tasks in dataset A explicitly reference IPA notation requirements, while dataset B tasks imply phonetic understanding without technical notation demands.",
        "Dataset B includes exploratory tasks about word origins/etymology, while dataset A focuses on contemporary usage and definitions."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=3": [
        "Tasks in Dataset A require specifying exact numerical answers (e.g., count of definitions), while Dataset B focuses on conceptual understanding without numerical precision.",
        "Dataset A tasks explicitly mention language pairs for translations (e.g., English\u2013French), whereas Dataset B tasks omit directional language specifications.",
        "Dataset B includes tasks involving external content contribution (e.g., Wikipedia editing), which are absent in Dataset A.",
        "Dataset B tasks explicitly require error handling during navigation (e.g., 'handle errors'), unlike Dataset A.",
        "Dataset A tasks target granular grammar sub-sections (e.g., modal verbs), while Dataset B tasks use broader grammar categories (e.g., 'adjectives').",
        "Dataset B emphasizes collocation retrieval (e.g., 'common word combinations with \"accommodation\"'), while Dataset A focuses on synonym/antonym identification.",
        "Dataset A tasks specify UK/US pronunciation distinctions, whereas Dataset B tasks omit regional pronunciation requirements.",
        "Dataset A references future-dated dynamic content (e.g., 2025 blog posts), while Dataset B uses current or past-dated content.",
        "Dataset A tasks demand multi-context example sentences, while Dataset B tasks request general usage examples without contextual variety.",
        "Dataset B includes exploratory tasks (e.g., 'test the agent\u2019s ability to navigate'), whereas Dataset A prioritizes direct information retrieval."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=1": [
        "Tasks in dataset B require general exploration of dictionary features (e.g., 'Learn how to use the dictionary translation') rather than specific information retrieval",
        "Dataset B tasks frequently involve open-ended vocabulary improvement goals (e.g., 'Improve vocabulary by looking up meanings') without specified output formats",
        "Pronunciation tasks in B do not explicitly require IPA notation identification unlike A's explicit IPA requests",
        "B contains educational context tasks about teaching methods (e.g., 'Find information on teaching methods') absent in A",
        "Translation tasks in B lack requirements to identify translation service providers present in A's samples",
        "Grammar tasks in B focus on broad grammatical categories (e.g., 'Learn about Nouns') rather than specific usage patterns like modal verbs in A",
        "B includes website exploration tasks (e.g., 'Explore the features of the dictionary webpage') not found in A",
        "Synonym-related tasks in B require simple identification rather than navigation through Thesaurus sections as in A",
        "B contains test preparation context (e.g., 'Research vocabulary for TOEFL') absent in A's task formulations",
        "Definition tasks in B omit requirements to count/compare numbered meanings that are common in A's samples"
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=4": [
        "Tasks in dataset B require users to find synonyms or antonyms for words/phrases, while A focuses on definitions and translations.",
        "Dataset B includes tasks involving quizzes or interactive word games, whereas A does not mention such activities.",
        "Tasks in B often demand exploring multiple related words in a single query (e.g., 'jukebox, jujitsu, juice'), while A focuses on single-word analysis.",
        "Dataset B contains tasks explicitly requesting identification of parts of speech (e.g., adjectives, nouns), while A emphasizes grammatical usage examples.",
        "Tasks in B require sharing dictionary content via social media (e.g., Twitter), which is absent in A's requirements.",
        "Dataset B includes open-ended research tasks (e.g., 'explore features of the website'), whereas A tasks are narrowly scoped to specific content extraction.",
        "B tasks involve phrase-based queries (e.g., 'in a nutshell'), while A focuses on individual word analysis.",
        "Dataset B requires users to investigate word origins/etymology, while A focuses on contemporary usage and pronunciation.",
        "Tasks in B mention extracting 'codes' for words (e.g., 'find the code for \"solve\"'), a requirement not present in A.",
        "Dataset B includes tasks about business terminology and financial terms (e.g., 'reinvest'), while A emphasizes general vocabulary and grammar concepts."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=0": [
        "Tasks in dataset B focus on single-step information retrieval (e.g., basic definitions, pronunciation) whereas dataset A tasks often require multi-step synthesis (e.g., definitions + examples + comparisons)",
        "Dataset B includes ambiguous task phrasing (e.g., 'Find information about adverb phrases') while dataset A uses precise grammatical terminology (e.g., 'modal verbs in grammar section')",
        "Dataset B contains exploratory/open-ended tasks (e.g., 'Explore features and capabilities') whereas dataset A tasks are strictly answer-focused",
        "Dataset B shows emphasis on vocabulary expansion (e.g., 'related terms linked to accountability') while dataset A focuses on structured linguistic analysis",
        "Tasks in dataset B occasionally lack clear linguistic parameters (e.g., 'Find examples of phonetics') compared to dataset A's explicit scope (e.g., 'IPA notation requirements')",
        "Dataset B includes malformed/illogical queries (e.g., 'Find synonyms for \"thesaurus\"') that dataset A avoids through task validation",
        "Dataset B tasks more frequently require cross-referencing between dictionary sections (e.g., 'related terms...starting from there') compared to dataset A's linear navigation",
        "Translation tasks in dataset B use generic language pairs (e.g., 'English to Spanish') while dataset A specifies translation directions (e.g., 'English-Chinese Simplified')",
        "Dataset A consistently requires example sentence extraction across tasks whereas dataset B only occasionally mentions contextual usage",
        "Dataset B includes metadata-focused queries (e.g., '2024 word of the year') absent from dataset A's purely linguistic objectives"
      ]
    },
    "apple": {
      "nnetnav_live_site=apple_num_tasks=70_portion=1": [
        "Dataset B includes tasks requiring navigation through business/enterprise-specific sections (e.g., Apple Business Manager, enterprise solutions)",
        "Dataset B contains tasks focused on healthcare-specific product applications (e.g., Health Records, Mac in Healthcare)",
        "Dataset B requires interaction with environmental sustainability documentation beyond basic specs",
        "Dataset B includes tasks involving device troubleshooting/optimization (e.g., battery life improvement)",
        "Dataset B contains tasks requiring access to developer/enterprise account management features",
        "Dataset B includes queries about data privacy implementation details and user data handling",
        "Dataset B requires navigation through corporate responsibility sections (carbon neutrality, sustainability)",
        "Dataset B contains tasks involving accessory compatibility with specific professional use cases",
        "Dataset B includes app-specific technical requirements beyond Apple's first-party apps",
        "Dataset B requires interaction with family sharing configuration and management features"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=4": [
        "Dataset B tasks require accessing environmental impact reports and sustainability information for products",
        "Dataset B tasks involve managing Family Sharing groups and account configurations",
        "Dataset B requires checking warranty status and repair eligibility for devices",
        "Dataset B tasks include optimizing device performance/battery life through settings adjustments",
        "Dataset B requires accessing enterprise/business purchase programs and plans",
        "Dataset B tasks involve configuring parental controls and child device management",
        "Dataset B requires comparing trade-in values alongside product feature comparisons",
        "Dataset B tasks include accessing device repair services and self-repair instructions",
        "Dataset B requires researching data protection policies and privacy controls",
        "Dataset B tasks involve accessing educational institution purchasing processes and grants"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=0": [
        "Tasks in B require researching environmental sustainability reports and material recycling initiatives for products",
        "B includes tasks involving financial/business metrics like quarterly earnings reports and investor relations content",
        "B contains explicit requirements to compare product environmental impact across multiple device categories",
        "B tasks involve configuring hardware specifications during purchase workflows (e.g. chip selection, color customization)",
        "B requires navigation through enterprise/business success case studies and industry-specific solutions",
        "Tasks in B include accessing educational institution-specific pricing and academic verification processes",
        "B contains healthcare-specific navigation requirements (medical device certifications, health record systems)",
        "B tasks involve repair/service cost estimation workflows separate from trade-in programs",
        "B requires comparison of parental control features and family sharing configuration options",
        "Tasks in B include accessing developer-focused technical documentation for device integration/APIs"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=2": [
        "Dataset B tasks frequently involve enterprise/business solutions (e.g., device management, business plans)",
        "Dataset B contains specific queries about Apple's corporate operations (e.g., financial results, environmental reports)",
        "Dataset B requires navigation to business-specific support documentation (e.g., Business Conduct Policy)",
        "Dataset B tasks include bulk purchasing scenarios (e.g., multiple device configurations for organizations)",
        "Dataset B emphasizes corporate account management (e.g., Apple Business Essentials configurations)",
        "Dataset B contains explicit requests for technical specifications of enterprise-focused features",
        "Dataset B includes queries about Apple's organizational policies and compliance documentation",
        "Dataset B tasks involve deeper navigation through business-oriented support articles",
        "Dataset B requires understanding of enterprise service hierarchies (e.g., AppleCare Help Desk for businesses)",
        "Dataset B contains more complex multi-device management scenarios (e.g., enterprise device deployment)"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=3": [
        "Tasks in B require completing purchase flows or configuration processes (e.g., 'Order iPhone 16 Pro', 'Configure iPad Pro for business')",
        "B includes explicit app store-related tasks (e.g., finding app reviews, version histories like 'Kino - Pro Video Camera app')",
        "B contains specific business solution exploration (e.g., 'small business success stories', 'Mac products for businesses')",
        "B features detailed accessory compatibility checks including color variants (e.g., 'iPhone 16 Silicone Case colors')",
        "B requires navigating warranty status checks and repair workflows (e.g., 'Check iPad warranty', 'Find iPhone repair options')",
        "B includes Apple service integrations (e.g., 'Apple Family Sharing parental controls', 'Health Records patient enrollment')",
        "B tasks involve detailed technical specifications for troubleshooting (e.g., 'iPhone charging specs', 'MacBook battery optimization')",
        "B contains explicit price configuration tasks with trade-in combinations (e.g., 'iPhone pricing with trade-in options')",
        "B focuses on enterprise-specific purchasing (e.g., 'Configure iPad Pro for business use', 'Business MacBook Air purchases')",
        "B requires comparing health feature implementations across device generations (e.g., 'Apple Watch health comparisons')"
      ]
    },
    "google_search": {
      "nnetnav_live_site=google_search_num_tasks=72_portion=3": [
        "Dataset B tasks include explicit requests for local/geographic information (e.g. 'near me', zip codes) while A focuses on universal factual queries",
        "B contains tasks requiring multi-platform integration (e.g. YouTube tutorials, Allrecipes.com comparisons) where A focuses on single-platform data extraction",
        "B includes career/job search related queries (software engineer positions) absent in A",
        "B features health/medical information needs (symptoms, vaccination info) not present in A's technical/sports focus",
        "B contains explicit tutorial/guide searches (woodworking projects) while A focuses on factual record retrieval",
        "B includes tasks requiring comparison of user-generated content (recipe comparisons) vs A's structured data comparisons",
        "B features language learning/translation tasks (Spanish, Duolingo) absent in A",
        "B contains event planning/search tasks (venues, social events) not found in A",
        "B includes tasks requiring interaction with business services (schedule meetings, Google Ads) unlike A's passive data retrieval",
        "B features Wikipedia editing/content modification tasks absent in A's read-only requirements"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=2": [
        "Dataset B tasks more frequently involve user-initiated content creation or modification (e.g., recipe databases, course enrollment)",
        "Dataset B includes tasks requiring physical action planning (event venues, restaurant suggestions, travel arrangements)",
        "Dataset B contains more health/wellness focused information retrieval (symptoms, medical conditions, dietary needs)",
        "Dataset B emphasizes practical skill acquisition (language learning, woodworking tutorials, SEO strategies)",
        "Dataset B tasks more often involve commercial transactions (product purchases, ticket bookings, service enrollment)",
        "Dataset B includes more community/interactive elements (recipe ratings, social sharing, user reviews)",
        "Dataset B tasks frequently require multi-step process following (project instructions, implementation guides)",
        "Dataset B emphasizes contemporary social/environmental issues (climate action, responsible AI, sustainability)",
        "Dataset B contains more location-specific/local service searches (venue amenities, regional trends, local businesses)",
        "Dataset B tasks more often involve personalization/adaptation of information (dietary restrictions, learning preferences, risk factors)"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=4": [
        "Dataset A tasks primarily involve retrieving precise, singular data points (e.g., dates, counts, SHA hashes), while B focuses on understanding broader concepts/processes (e.g., AI innovations, climate effects)",
        "A requires direct extraction of existing structured data (e.g., rankings, bios), whereas B often involves exploratory analysis of unstructured information (e.g., research trends, program comparisons)",
        "A tasks emphasize temporal urgency (e.g., 'latest game score', 'current top artist'), while B includes strategic planning tasks (e.g., event bookings, future movie trailers)",
        "A outputs require verbatim replication of technical details (e.g., commit hashes, hardware specs), while B outputs allow synthesized summaries (e.g., research paper findings, recipe suggestions)",
        "A targets discrete platform-specific data (GitHub, IMDb, Reddit), whereas B spans cross-platform research (academic journals, news sources, corporate sites)",
        "B contains transactional objectives (job applications, ticket purchases) absent in A's purely informational goals",
        "A tasks demand precision in quantitative outputs (first 7 bits, top 5 movies), while B accepts qualitative comparisons (stock performance analysis, trend evaluations)",
        "B includes creative synthesis tasks (event activity ideas, recipe customization) not present in A's fact-retrieval paradigm",
        "A focuses on verification of existing records (biographies, release dates), while B emphasizes discovery of emerging information (new research, 2025 trends)",
        "B tasks require understanding relational context (how ML techniques apply to NLP), whereas A prioritizes isolated fact extraction (player stats, software requirements)"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=0": [
        "Tasks in B require booking/reservation actions (e.g., hotels, venues) while A focuses solely on information retrieval",
        "B includes localized search requirements (e.g., 'near me', specific cities) whereas A focuses on global/non-location-bound data",
        "B contains health/medical information requests (e.g., symptoms, vaccines) absent in A's technical/sports focus",
        "B features commercial transactions (e.g., product purchases, price checks) while A focuses on factual data extraction",
        "B includes future-oriented planning queries (e.g., travel destinations 2025) vs A's historical/current data focus",
        "B requires finding instructional content (e.g., tutorials, guides) whereas A focuses on objective facts/statistics",
        "B contains product configuration tasks (e.g., phone customization) not present in A's specification-focused queries",
        "B includes linguistic comparison tasks (e.g., translation analysis) absent in A's single-language requirements",
        "B features event planning/organization needs (e.g., venues, activities) while A focuses on entity information retrieval",
        "B requires open-ended resource gathering (e.g., 'find inspiration', 'discover ideas') whereas A demands specific structured outputs"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=1": [
        "Dataset B tasks frequently involve exploratory actions (e.g., 'browse news', 'explore programs') rather than direct fact extraction",
        "Dataset B includes tasks requiring user interaction with UI components (e.g., 'test donation prompt', 'manage translation settings')",
        "Dataset B contains more lifestyle/health-focused queries (e.g., recipes, health advice, wellness events) compared to A's technical/sports dominance",
        "Dataset B tasks often target specific organizational content (e.g., university program pages, CDC guidelines, company principles)",
        "Dataset B includes transactional objectives (e.g., 'buy tickets', 'contact experts') absent in A",
        "Dataset B queries more frequently require understanding processes/explanations (e.g., 'how autocomplete works', 'what is machine learning')",
        "Dataset B contains troubleshooting tasks (e.g., 'fix Pixel buttons', 'get support') not present in A",
        "Dataset B tasks emphasize practical implementation (e.g., 'install Python', 'set up ad campaigns') rather than pure information retrieval",
        "Dataset B includes comparative shopping tasks (e.g., hotel prices, stock comparisons) with user decision-making requirements",
        "Dataset B queries more frequently target official organizational resources (e.g., Google's AI principles, CDC website content)"
      ]
    }
  },
  "diffs_real_from_synth": {
    "google_maps": {
      "nnetnav_live_site=google_maps_num_tasks=75_portion=2": [
        "Dataset B tasks require identifying specific operational constraints beyond current time (e.g., 'not open 24 hours', 'closes at night')",
        "Dataset B includes map interaction tasks (e.g., printing maps as PDF, generating sharing links) absent in A",
        "Dataset B tasks target EV charging stations or Tesla-specific infrastructure, unlike A",
        "Dataset B tasks involve analyzing review content for granular attributes (e.g., comment themes, review proportions per facility level)",
        "Dataset B requires explicit use of location-sharing features (e.g., generating map links) not mentioned in A",
        "Dataset B contains multi-part queries requiring sequential validation (e.g., find hotel \u2192 find nearby supermarket \u2192 calculate walking time)",
        "Dataset B specifies parking facilities with exact closure times rather than general availability checks in A",
        "Dataset B tasks demand precise public transit stop identification at street intersections rather than route planning in A",
        "Dataset B uses exact numerical rating thresholds (e.g., '>4.8 stars') where A uses relative terms like 'highly-rated'",
        "Dataset B includes information-gathering tasks about institutional features (e.g., national park details, airport level statistics)"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=3": [
        "Dataset B tasks require explicit exclusion criteria in filters (e.g., 'not open 24 hours') whereas Dataset A focuses on inclusion criteria",
        "Dataset B includes tasks involving direct interaction with app functionality (e.g., printing maps, sharing links) absent in Dataset A",
        "Dataset B emphasizes precise operational hour constraints (e.g., 'closes at night') while Dataset A uses general availability checks",
        "Dataset B contains tasks requiring identification of specific infrastructure types (e.g., EV charging stations, Tesla Destination Chargers) not specified in Dataset A",
        "Dataset B tasks frequently involve spatial relationships between multiple points of interest (e.g., 'closest to X and nearest to Y') beyond Dataset A's single-location focus",
        "Dataset B includes explicit numerical thresholds in requirements (e.g., 'ratings greater than 4.8') where Dataset A uses qualitative descriptors like 'highly rated'",
        "Dataset B tasks require analysis of user-generated content characteristics (e.g., 'which level has least proportion in reviews') unlike Dataset A's basic review retrieval",
        "Dataset B contains wayfinding tasks with specific movement constraints (e.g., 'least amount of walking') absent in Dataset A's general route generation",
        "Dataset B includes map manipulation tasks (e.g., printing as PDF) not present in Dataset A's information retrieval focus",
        "Dataset B tasks frequently specify exact intersection-based locations (e.g., 'corner of Elm/Oak') while Dataset A uses neighborhood/landmark references"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=1": [
        "Dataset B tasks require direct interaction with map export/sharing features (e.g., print as PDF, generate sharing links)",
        "Dataset B contains explicit requirements to analyze review patterns/statistics (e.g., 'level with least proportion in reviews')",
        "Dataset B includes tasks requiring identification of commercial service providers with negative availability constraints (e.g., 'not open 24 hours')",
        "Dataset B tasks specifically target EV charging infrastructure and parking combinations",
        "Dataset B emphasizes precise identification of transportation nodes (e.g., 'nearest bus stop to street intersection')",
        "Dataset B contains requests for quantitative list generation (e.g., 'list three', 'find 5 beauty salons')",
        "Dataset B tasks require parsing operational hour patterns beyond simple 'open now' (e.g., 'closes at night')",
        "Dataset B includes explicit requirements to retrieve system-generated URLs/links (sharing features)",
        "Dataset B tasks focus more on infrastructure/services rather than leisure/tourism destinations",
        "Dataset B contains requests for temporal-spatial calculations between found locations (e.g., 'walking time between hotel and supermarket')"
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=0": [
        "Tasks in B require generating or exporting map data (e.g., PDF printing, sharing links), while A focuses on consumption of map data.",
        "B includes explicit requests for infrastructure-specific queries (e.g., parking garages, EV charging stations), whereas A emphasizes hospitality/services (hotels/restaurants).",
        "B contains tasks requiring analysis of hierarchical review components (e.g., 'which level has least proportion in reviews'), unlike A's general review parsing.",
        "B features precise numerical targets (e.g., '5 beauty salons', 'three bus stops'), while A uses qualitative thresholds ('highly-rated', 'moderately-priced').",
        "B requires identification of negative constraints (e.g., 'not open 24 hours'), whereas A primarily uses positive filters ('open now', 'wheelchair accessible').",
        "B includes map interaction mechanics (e.g., 'find search settings', 'share map') absent in A's navigation/booking-focused tasks.",
        "B emphasizes temporal specificity for closures (e.g., 'closes at night'), while A focuses on availability windows ('check-in dates', 'open hours').",
        "B contains multi-objective instructions with sequential dependencies (e.g., 'first search X then find Y'), whereas A tasks are single-objective focused.",
        "B requires proximity analysis between non-tourist landmarks (e.g., 'corner of Elm/Oak streets'), while A uses named destinations (cities, attractions).",
        "B includes requests for operational metadata (e.g., walking time calculations between arbitrary points) beyond A's route navigation between defined locations."
      ],
      "nnetnav_live_site=google_maps_num_tasks=75_portion=4": [
        "Dataset A tasks prioritize hospitality services (restaurants, hotels) with cuisine-specific and reservation requirements, while Dataset B focuses on logistics (parking, transit, charging stations) with operational hour constraints.",
        "Dataset A includes tasks requiring multi-stop route planning (e.g., bike trips with restaurant stops), whereas Dataset B emphasizes single-destination navigation with proximity constraints (e.g., 'closest to').",
        "Dataset B tasks frequently involve map interaction features (e.g., printing maps, generating sharing links), absent in Dataset A's sampled tasks.",
        "Dataset A contains explicit price comparison requirements (e.g., 'best price', 'affordable'), while Dataset B emphasizes operational status verification (e.g., 'not open 24 hours').",
        "Dataset B includes specific numerical quantifiers in task requirements (e.g., 'list three', '5 beauty salons'), unlike Dataset A's open-ended quantity requests.",
        "Dataset A tasks require temporal specificity for reservations (exact dates/times), while Dataset B focuses on spatial specificity (nearest to landmarks/intersections).",
        "Dataset B contains tasks requiring analysis of structured facility information (e.g., airport level statistics), unlike Dataset A's focus on amenity/service verification.",
        "Dataset A emphasizes accessibility validation within itinerary planning, while Dataset B prioritizes accessibility as a standalone filter (e.g., 'EV charging supported parking').",
        "Dataset B includes tasks requiring direct interaction with platform features (e.g., 'find search settings'), absent in Dataset A's sampled tasks.",
        "Dataset A tasks involve cross-referencing multiple qualitative factors (ratings + affordability + accessibility), while Dataset B focuses on singular quantitative thresholds (e.g., 'ratings greater than 4.8')."
      ]
    },
    "github": {
      "nnetnav_live_site=github_num_tasks=71_portion=3": [
        "Tasks in B require filtering repositories by specific time ranges (e.g., 'last week', 'past 15 days') rather than general recency",
        "B tasks demand extraction of precise repository metadata (e.g., 'files changed in last commit', 'total additions/deletions') rather than general version history checks",
        "B requires identification of educational resource structure details (e.g., number of courses in specific skill sections)",
        "Tasks in B specify technical tags/labels (e.g., 'web scraping', 'blockchain') rather than general language filters",
        "B tasks require summarization of project purposes/objectives from repository descriptions",
        "B includes explicit requirements to verify documentation completeness (e.g., presence of Readme files)",
        "Tasks in B demand comparison of numerical plan limits (e.g., 'maximum private repositories') rather than general feature differences",
        "B requires identification of repository creation dates rather than just update timestamps",
        "Tasks in B specify exact contributor metrics (e.g., 'list top five contributors') rather than general popularity indicators",
        "B includes requirements to locate specific documentation elements within wikis/FAQs (e.g., theme configuration steps)"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=2": [
        "Tasks in B require extracting specific file-level details from repository commits (e.g. changed filenames, additions/deletions)",
        "B includes tasks requiring identification of educational course structure/actions in GitHub Skills resources",
        "B tasks demand quantitative comparisons between plan features (e.g. storage numbers, repository limits)",
        "B requires locating and summarizing repository purposes/main objectives from descriptions",
        "Tasks in B specify tighter time constraints for recency (e.g. 'last 10 days' vs general 'recent')",
        "B includes explicit requirements to list/enumerate items (top contributors, course counts, story listings)",
        "Tasks in B require identification of specific configuration steps from documentation (e.g. theme changes)",
        "B contains tasks needing interpretation of tag-based organization (e.g. 'First day on GitHub' heading)",
        "B requires cross-referencing multiple specific criteria simultaneously (language + stars + topic + date)",
        "Tasks in B demand mobile-specific feature verification (e.g. Copilot chat mobile availability)"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=0": [
        "Dataset B tasks require locating repositories with specific real-time criteria (e.g., 'last week', 'past 2 days') while Dataset A focuses on general repository discovery.",
        "Dataset B includes tasks requiring direct interaction with repository content (e.g., commit history, file changes) whereas Dataset A focuses on feature documentation.",
        "Dataset B tasks demand identification of quantitative metrics (e.g., 'number of courses', 'total additions/deletions') absent in Dataset A's qualitative inquiries.",
        "Dataset B requires navigation through GitHub Skills learning paths with course-specific actions while Dataset A focuses on general educational resource discovery.",
        "Dataset B tasks involve explicit version checking (e.g., 'latest release version') unlike Dataset A's feature exploration.",
        "Dataset B requires comparison of numerical plan limits (e.g., 'package storage', 'private repositories') while Dataset A compares plan tiers qualitatively.",
        "Dataset B tasks mandate identification of temporal project metadata (e.g., creation date, update recency) not emphasized in Dataset A.",
        "Dataset B includes contributor analysis tasks (e.g., 'top five contributors') absent from Dataset A's individual-focused tasks.",
        "Dataset B requires extraction of specific configuration details (e.g., theme setup in wiki) while Dataset A focuses on general policy understanding.",
        "Dataset B tasks demand summarization of technical specifications (e.g., 'main objective', 'project purpose') whereas Dataset A requires comparative analysis of features"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=4": [
        "Tasks in B require filtering repositories by exact star counts (e.g., '500 stars') rather than general popularity criteria",
        "B demands locating specific educational courses by exact name (e.g., 'Resolve merge conflicts course') rather than general educational resources",
        "Tasks in B require extraction of detailed commit metrics (file names, additions/deletions count) rather than general commit history checks",
        "B tasks demand counting/numerical verification (e.g., 'how many courses', 'top five contributors') not required in A",
        "Dataset B requires identifying repository creation dates as explicit search criteria rather than general update date ranges",
        "Tasks in B specify exact technology stacks/tags (e.g., 'web scraping', 'blockchain technology') rather than general language filters",
        "B requires locating and interpreting wiki documentation (e.g., theme configuration instructions) rather than general feature documentation",
        "Dataset B tasks demand comparison of exact numerical values between plans (e.g., 'package storage difference') rather than general plan comparisons",
        "B requires identification of specific UI elements/categories in resources (e.g., 'First day on GitHub' heading) rather than general navigation",
        "Tasks in B demand listing ranked results (e.g., 'most stars', 'top contributors') rather than general repository discovery"
      ],
      "nnetnav_live_site=github_num_tasks=71_portion=1": [
        "Dataset B tasks emphasize precise technical repository searches with multiple filters (stars, update dates, language)",
        "Dataset B requires extracting granular repository metadata (e.g., commit file changes, contributor rankings)",
        "Dataset B includes navigation to educational resources (GitHub Skills courses) with specific learning objectives",
        "Dataset B tasks demand interaction with repository wikis for configuration instructions",
        "Dataset B focuses on quantitative plan comparisons (storage limits, private repo counts)",
        "Dataset B requires analysis of repository content evolution through commit histories",
        "Dataset B tasks involve identifying top contributors in technical domain-specific projects",
        "Dataset B emphasizes temporal constraints (recent updates within X days) for repository discovery",
        "Dataset B includes explicit requirements to locate and parse technical documentation (theme configurations, release notes)",
        "Dataset B tasks require validation of repository popularity through multiple metrics (stars, forks, creation dates)"
      ]
    },
    "espn": {
      "nnetnav_live_site=espn_num_tasks=62_portion=0": [
        "Dataset A tasks emphasize real-time game updates and live scores, while Dataset B focuses on post-game statistics and historical standings",
        "Dataset A includes tasks related to accessing multimedia content (highlights/articles) tied to games, while Dataset B prioritizes direct statistical comparisons (e.g., heaviest player, highest salary)",
        "Dataset B requires navigation of power rankings/indices (e.g., NBA Basketball Power Index) not present in Dataset A tasks",
        "Dataset B tasks involve retrieving salary data and physical attributes of players, whereas Dataset A focuses on in-game performance metrics (points/assists)",
        "Dataset A tasks frequently require date-specific filtering (yesterday's games), while Dataset B emphasizes conference/division ranking comparisons",
        "Dataset B includes tasks about season-long standings (top/bottom teams in conferences), while Dataset A focuses on single-game outcomes",
        "Dataset A navigation involves playoff trackers/bracket structures, while Dataset B requires accessing league-specific indices/metrics",
        "Dataset B tasks require identification of team geographical affiliations (e.g., 'Los Angeles' teams), not present in Dataset A",
        "Dataset A includes fantasy sports interactions tied to live tournaments, while Dataset B references betting odds analysis",
        "Dataset B tasks involve cross-referencing player positions with statistical categories (e.g., 'infielders' weight'), absent from Dataset A"
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=4": [
        "Tasks in B require identifying players with specific physical attributes (e.g., heaviest infielder) not present in A",
        "B tasks demand direct comparisons of numerical values between losing/winning teams (e.g., score highs)",
        "B requires salary information retrieval for individual players, absent in A's general statistical queries",
        "B tasks specify exact calendar dates for schedule checks (e.g., Dec 25, 2023) rather than general timeframes",
        "B includes pattern matching requirements for team names (e.g., 'Los Angeles', 'Golden') across leagues",
        "B tasks require identification of positional leaders (e.g., top rebounder's position) within conferences",
        "B involves cross-sport statistical comparisons (e.g., NCAAF vs NFL odds) not seen in A",
        "B tasks reference advanced analytical metrics (e.g., Basketball Power Index) absent in A's basic standings checks",
        "B requires multi-game performance aggregation (e.g., last 5 games for a player) rather than single-game data",
        "B tasks demand explicit conference/division hierarchy analysis in standings (e.g., NHL division breakdown)"
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=1": [
        "Tasks in B require analytical comparisons beyond direct metrics (e.g., comparing loser's high vs. winner's high in historical games).",
        "B includes queries for league-wide analytical rankings (e.g., NBA Basketball Power Index) not present in A.",
        "B tasks demand identification of player positions alongside statistical metrics (e.g., position of top scorer).",
        "B involves retrieving salary data for players (e.g., highest-paid Celtics player), whereas A focuses on performance stats.",
        "Tasks in B require division-specific standings (e.g., NHL divisions) in addition to conference rankings.",
        "B includes historical player/team performance queries (e.g., last 5 games of a player) beyond recent games.",
        "B tasks involve team naming conventions (e.g., counting teams with 'Los Angeles' or 'Golden' in their names).",
        "B requires identifying statistical leaders per conference (e.g., rebounds/assists leaders in Western Conference).",
        "B tasks require verifying broadcast details (e.g., games aired on ESPN) for temporal information.",
        "B includes queries about league structure (e.g., number of NBA/NHL teams with specific naming patterns)."
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=3": [
        "Dataset B requires identification of specific player physical attributes (e.g., weight) not present in Dataset A tasks",
        "Dataset B includes analysis of team naming patterns (e.g., 'teams with Los Angeles/Golden in name') not required in Dataset A",
        "Dataset B tasks involve direct comparison of conference/division positions (top vs bottom teams) rather than simple retrieval",
        "Dataset B requires navigation to specialized ranking systems (e.g., Basketball Power Index) absent from Dataset A",
        "Dataset B contains queries about organizational structure (number of sports leagues on homepage) not seen in Dataset A",
        "Dataset B tasks demand positional filtering of player statistics (e.g., 'infielders') not required in Dataset A",
        "Dataset B includes multi-game historical analysis of individual players (last 5 games) beyond Dataset A's single-game focus",
        "Dataset B requires identification of broadcast network-specific content (games on ESPN) rather than general broadcast info",
        "Dataset B tasks involve salary cap information extraction not present in Dataset A requirements",
        "Dataset B contains explicit temporal constraints for trade news updates (past 2 days) with stricter recency requirements"
      ],
      "nnetnav_live_site=espn_num_tasks=62_portion=2": [
        "Dataset B tasks require comparing statistical metrics across teams (e.g., Basketball Power Index rankings) while Dataset A focuses on standalone data retrieval.",
        "Dataset B includes queries about player salaries and financial metrics (e.g., highest-paid Celtics player), which are absent in Dataset A tasks.",
        "Dataset B tasks demand identification of top performers in specific categories (e.g., points, assists) from recent games, whereas Dataset A prioritizes general performance stats.",
        "Dataset B involves conditional game outcomes (e.g., 'loser high > winner high'), while Dataset A focuses on absolute scores/results.",
        "Dataset B explicitly requests player physical attributes (e.g., 'heaviest weight'), which Dataset A does not reference.",
        "Dataset B includes searches for team names based on keywords (e.g., 'Los Angeles' or 'Golden'), absent in Dataset A.",
        "Dataset B requires historical data aggregation (e.g., 'last 5 games' for a player), whereas Dataset A emphasizes real-time/recent data.",
        "Dataset B tasks involve divisional/conference positional context (e.g., standings within divisions), while Dataset A references standings more broadly.",
        "Dataset B tasks specify granular game summaries (e.g., 'top rebounder'), whereas Dataset A requests general highlights.",
        "Dataset B includes analytical queries about league composition (e.g., team counts by name/league), absent in Dataset A."
      ]
    },
    "huggingface": {
      "nnetnav_live_site=huggingface_num_tasks=76_portion=1": [
        "Dataset B tasks require filtering resources by exact date ranges (e.g. 'within March 2023') rather than relative timestamps",
        "Dataset B tasks explicitly require using Inference API for content generation (e.g. generating stories)",
        "Dataset B focuses on identifying top-ranked resources by absolute download counts rather than trending status",
        "Dataset B tasks demand retrieval of specific numerical performance metrics (e.g. BLEU scores) rather than general capabilities",
        "Dataset B requires verification of specific license types (e.g. cc-by-sa-4.0) rather than general licensing awareness",
        "Dataset B tasks emphasize identifying 'first' or 'oldest' resources within temporal constraints",
        "Dataset B requires extraction of technical implementation details from documentation (e.g. bit precision settings)",
        "Dataset A contains tasks requiring interaction with community features (forums, GitHub issues) absent in B",
        "Dataset A includes tasks requiring account creation/authentication flows for API access",
        "Dataset A contains format conversion tasks (e.g. paper to HTML) not present in B"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=0": [
        "Dataset B tasks emphasize finding the most recent/last updated entries (e.g. 'latest', 'released in past month') while A focuses on temporal constraints without recency prioritization",
        "B requires content generation through API interactions (e.g. 'generate short story') whereas A focuses solely on information retrieval",
        "B includes explicit requirements to summarize features/descriptions (e.g. 'briefly describe', 'summarize features') not present in A's tasks",
        "B contains tasks requiring identification of tutorial-specific implementation details (e.g. 'how to load in 8bit/4bit') rather than general documentation parsing in A",
        "B emphasizes license specificity (e.g. 'cc-by-sa-4.0') more prominently than A's general license identification requirements",
        "B requires tracking GitHub stars/repository metrics while A focuses solely on model/dataset metrics",
        "B tasks demand confirmation of exact temporal constraints (e.g. 'last updated in 2022') rather than relative timeframes in A",
        "B includes explicit requirements to identify 'most downloaded' entries across multiple categories (audio, translation) unlike A's general popularity filters",
        "B contains tasks requiring analysis of blog content/paper summaries rather than pure metadata extraction in A",
        "B features multi-criteria combination requirements (e.g. 'most downloaded + en-ja focus') more frequently than A's single-filter tasks"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=4": [
        "Dataset B tasks emphasize identifying 'latest' or 'most recent' resources with explicit time sensitivity (e.g. 'as of today's date', 'past month')",
        "Dataset B requires verifying open-source status/commercial permissions as a primary filter in task requirements",
        "Dataset B tasks focus on domain-specific applications (travel chats, recipe generation, fake news detection) rather than general technical capabilities",
        "Dataset B emphasizes ranking metrics (most downloaded, most likes, highest stars) as core selection criteria",
        "Dataset B contains tasks requiring summarization of features/functionality rather than pure information extraction",
        "Dataset B tasks specify concrete thresholds for recency and popularity (e.g. 'within March 2023', '1M+ downloads')",
        "Dataset B includes explicit requirements to compare multiple models/datasets (e.g. 'list three', 'identify three')",
        "Dataset B tasks frequently require identifying default configurations/implementation details (e.g. 'default model in pipeline')",
        "Dataset B emphasizes language pair specificity in multilingual tasks (e.g. English-Japanese translation models)",
        "Dataset B contains tasks requiring analysis of auxiliary content like blog posts alongside technical specifications"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=2": [
        "Dataset B tasks emphasize identifying 'most recent' updates/releases while Dataset A focuses on static version/architecture details",
        "Dataset B requires real-time/current date awareness ('as of today') not present in Dataset A tasks",
        "Dataset B includes explicit API interaction tasks (e.g., generating stories) absent from Dataset A",
        "Dataset B tasks prioritize popularity metrics (most downloaded/liked) as primary filters more than Dataset A",
        "Dataset B requires summarization of features/functionality while Dataset A focuses on exact metadata extraction",
        "Dataset B contains specific temporal constraints (e.g., 'released in past month') not seen in Dataset A",
        "Dataset B tasks require license version specificity (e.g., cc-by-sa-4.0) while Dataset A uses broader open-source/commercial distinction",
        "Dataset B includes explicit model application/use case description requirements absent from Dataset A",
        "Dataset A contains technical troubleshooting tasks (error resolution) not present in Dataset B",
        "Dataset B requires identification of tutorial workflows (e.g., pipeline usage steps) while Dataset A focuses on documentation lookup"
      ],
      "nnetnav_live_site=huggingface_num_tasks=76_portion=3": [
        "Dataset B tasks emphasize identifying models by their recency (e.g., 'latest', 'most recently updated') rather than fixed version identifiers",
        "Dataset B requires users to retrieve summary-level descriptions of model functionality beyond technical specifications",
        "Dataset B tasks focus on creative application scenarios (e.g., story generation) rather than purely technical implementations",
        "Dataset B includes explicit requirements to verify temporal constraints (e.g., 'updated within March 2023')",
        "Dataset B tasks demand comparison of models based on popularity metrics (e.g., 'most downloaded', 'most likes')",
        "Dataset B requires identification of models/datasets by specific open-source licenses rather than general license awareness",
        "Dataset B tasks involve extracting information from tutorial content rather than pure documentation references",
        "Dataset B emphasizes multilingual capabilities (e.g., English-Japanese) as core filtering criteria",
        "Dataset B requires temporal validation of model updates against current date/time constraints",
        "Dataset B tasks involve creative text generation through API integration rather than technical deployment scenarios"
      ]
    },
    "coursera": {
      "nnetnav_live_site=coursera_num_tasks=72_portion=3": [
        "Dataset B tasks frequently require analyzing numerical rating distributions (e.g., lowest percentage star ratings) while Dataset A focuses more on general rating verification.",
        "Dataset B includes explicit prompts to identify time-limited promotions (e.g., 'New year. Bigger savings' banner) not present in Dataset A tasks.",
        "Dataset B tasks require filtering by specific duration ranges (1-4 weeks, 1-3 months) more granularly than Dataset A's duration criteria.",
        "Dataset B emphasizes price comparison (e.g., Coursera Plus annual cost vs. discount) while Dataset A focuses more on pricing awareness.",
        "Dataset B tasks involve identifying partner companies/institutions as a distinct requirement, whereas Dataset A only references partnerships contextually.",
        "Dataset B requires sorting functionality (e.g., 'newest first') for course discovery, unlike Dataset A's unsorted searches.",
        "Dataset B tasks demand precise star rating thresholds (4.5+ stars) compared to Dataset A's general star-level comparisons.",
        "Dataset B includes analysis of instructor bios/career backgrounds, while Dataset A only requires identifying instructor names.",
        "Dataset B tasks specify credit eligibility filters, absent from Dataset A's filtering criteria.",
        "Dataset B requires deadline tracking for degree programs (e.g., 'latest application deadline'), whereas Dataset A focuses on general program requirements."
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=2": [
        "Dataset B tasks require precise numerical outputs (e.g., percentage calculations, result counts) while Dataset A focuses on descriptive information retrieval",
        "Dataset B involves multi-criteria filtering combinations (e.g., credit eligibility + duration) compared to Dataset A's single-filter tasks",
        "Dataset B tasks specifically request review distribution breakdowns by star levels (e.g., 3-star percentages) while Dataset A focuses on general rating analysis",
        "Dataset B includes queries about pricing models with exact discount amounts compared to Dataset A's general subscription model exploration",
        "Dataset B requires time commitment calculations based on specified weekly hours while Dataset A uses predefined duration ranges",
        "Dataset B contains time-sensitive queries about application deadlines that don't appear in Dataset A tasks",
        "Dataset B tasks require cross-referencing instructor bios with their other course offerings unlike Dataset A's single-course instructor requests",
        "Dataset B includes explicit sorting requirements (e.g., newest first) for search results while Dataset A focuses on basic keyword searches",
        "Dataset B specifies exact rating thresholds (e.g., 4.5+ stars) for course selection compared to Dataset A's general rating analysis",
        "Dataset B tasks require identification of specific partner companies as part of answers while Dataset A only mentions partner institutions in general"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=4": [
        "Dataset B tasks require calculating percentage distributions of course ratings (e.g., 5-star percentages) while Dataset A does not",
        "Dataset B includes filtering requirements based on credit eligibility status, which is absent in Dataset A tasks",
        "Dataset B tasks demand sorting functionality (e.g., newest first) that isn't present in Dataset A requirements",
        "Dataset B requires analysis of subscription pricing models (Coursera Plus) and discounts, unlike Dataset A",
        "Dataset B tasks involve identifying lowest-rated aspects of courses (star level percentages) not required in Dataset A",
        "Dataset B contains time-sensitive queries about application deadlines that don't appear in Dataset A tasks",
        "Dataset B requires biographical analysis of instructors (summarizing bios) while Dataset A only needs name extraction",
        "Dataset B specifies precise duration ranges (1-4 weeks) rather than general timelines in Dataset A",
        "Dataset B tasks require comparative analysis of multiple rating categories within reviews, unlike Dataset A's single metric checks",
        "Dataset B includes verification of credit-bearing status for courses, a dimension absent in Dataset A requirements"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=0": [
        "Tasks in dataset B require quantitative filtering results (e.g., exact course counts, percentage breakdowns of ratings)",
        "Dataset B tasks specify precise duration ranges (e.g., '1-4 years') rather than general duration estimates",
        "Dataset B includes explicit requirements to calculate time commitments using weekly/hourly breakdowns (e.g., '5 hours/week for X weeks')",
        "Tasks in B demand identification of specific rating distributions (e.g., 'lowest percentage star rating') rather than general rating systems",
        "Dataset B contains tasks requiring price comparison mathematics (e.g., calculating discount amounts from subscription plans)",
        "B tasks require multi-filter combinations (e.g., 'Credit Eligible + 1-4 Years + Beginner Level') rather than single-filter use",
        "Dataset B tasks explicitly request institutional partnership enumeration (e.g., 'list 3 companies working with Coursera')",
        "B tasks require sorting/prioritization of results by specific criteria (e.g., 'newest first') not mentioned in A",
        "Dataset B includes biographical analysis requirements (e.g., instructor bios and cross-course teaching history)",
        "Tasks in B mandate exact numerical thresholds (e.g., 'rated 4.5 stars or higher', 'less than 20 hours completion time')"
      ],
      "nnetnav_live_site=coursera_num_tasks=72_portion=1": [
        "Tasks in B require filtering courses by credit eligibility (e.g., Credit Eligible), which is absent in A",
        "B tasks involve precise quantitative analysis of review distributions (e.g., rounded percentages of 3-star ratings)",
        "B requires identifying instructor biographies and their other taught courses beyond basic instructor names",
        "Tasks in B specify exact duration ranges (e.g., 1-4 weeks) rather than general duration extraction",
        "B includes sorting functionality requirements (e.g., 'sort by newest') for course discovery",
        "Tasks in B request degree program logistics like application deadlines, not present in A",
        "B requires counting total matching results after multi-filter application (e.g., 1-3 month duration + beginner level)",
        "Tasks in B explicitly ask for Coursera Plus annual pricing and discount percentages",
        "B contains requirements to identify star rating distributions at granular levels (e.g., lowest percentage star category)",
        "Tasks in B demonstrate multi-criteria filtering combinations (credit eligibility + duration + level) unlike A's single filters"
      ]
    },
    "arxiv": {
      "nnetnav_live_site=arxiv_num_tasks=80_portion=1": [
        "Dataset B tasks require temporal precision constraints (e.g. 'last week', 'last two days') while A uses relative timeframes",
        "B contains explicit result counting requirements across filtered categories that must be numerically quantified",
        "B tasks involve cross-category comparison operations (e.g. 'search in all archives' vs specific categories)",
        "B requires extraction of institutional metadata from external linked resources (e.g. university student counts)",
        "B includes operational queries about platform governance (submission guidelines, leadership team information)",
        "B tasks demand content summarization of technical findings beyond metadata retrieval",
        "B emphasizes version history tracking with specific revision number references (e.g. 'v3 submitted')",
        "B contains format specification requirements for non-text elements (e.g. figure format guidelines)",
        "B includes merchandise inventory queries that extend beyond academic content retrieval",
        "B tasks require author count thresholds as filtering criteria (e.g. 'more than five authors')"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=4": [
        "Dataset B tasks require temporal constraints (e.g., 'last week', 'Jan 1-3, 2024') for filtering results, while Dataset A does not.",
        "Dataset B tasks frequently involve quantitative analysis (e.g., counting papers, authors, or submissions), absent in Dataset A.",
        "Dataset B tasks demand extraction of version-specific metadata (e.g., submission dates, version history), unlike Dataset A.",
        "Dataset B tasks include cross-category comparisons (e.g., specific vs. all archives), which are not present in Dataset A.",
        "Dataset B tasks require summarization of paper content (e.g., objectives, findings), whereas Dataset A focuses on retrieval without synthesis.",
        "Dataset B tasks involve multi-step operations (e.g., search + count/analyze), while Dataset A tasks are single-step retrievals.",
        "Dataset B tasks request external data (e.g., university statistics, merchandise counts) unrelated to arXiv's core research content, unlike Dataset A.",
        "Dataset B tasks explicitly ask for granular metadata (e.g., author counts, submission timestamps), while Dataset A focuses on basic metadata access.",
        "Dataset B tasks target non-research content (e.g., arXiv store, leadership team), absent in Dataset A's scholarly focus.",
        "Dataset B tasks require verifying submission guidelines details (e.g., figure formats) in conjunction with other actions, unlike Dataset A's standalone guideline queries."
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=0": [
        "Tasks in B require checking specific version submission dates (e.g., 'when was v3 submitted?')",
        "B involves quantitative analysis of paper counts within timeframes (e.g., 'how many published in the last week?')",
        "B includes cross-site navigation (e.g., accessing Cornell University's website for student statistics)",
        "B tasks demand summarization of paper findings (e.g., 'provide a brief summary of one article's main findings')",
        "B requires accessing submission guidelines (e.g., 'formats for figures') and policy documents",
        "B involves queries about arXiv's organizational structure (e.g., 'names of people in Leadership Team')",
        "B tasks include merchandise-related actions (e.g., 'how many types of merchandise are available')",
        "B requires comparative category analysis (e.g., 'results in Quantum Physics vs. all archives')",
        "B tasks involve author count thresholds (e.g., 'papers with more than five authors')",
        "B includes temporal granularity constraints (e.g., 'submitted within the last two days')"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=2": [
        "Tasks in dataset B require determining specific version submission dates (e.g., 'when was v3 submitted?') while A focuses on general version history retrieval",
        "Dataset B includes quantitative result counting tasks (e.g., 'how many have been published...') absent in A's information retrieval focus",
        "B contains tasks requiring comparison of search scope impacts (e.g., 'search in all archives' vs category-specific results)",
        "B requires summarization of paper findings (e.g., 'provide a brief summary') rather than just metadata extraction",
        "Dataset B tasks involve external quantitative verification (e.g., university enrollment numbers) beyond simple policy checks",
        "B includes merchandise inventory queries ('how many types of merchandise') unrelated to research content",
        "Dataset B demands precise temporal filtering (e.g., 'submitted within last two days') with numerical verification",
        "B contains leadership team identification tasks absent from A's operational policy verification",
        "Dataset B requires category-specific author count analysis (e.g., 'more than five authors')",
        "B tasks involve format specification retrieval from guidelines (e.g., 'formats for figures') rather than general policy verification"
      ],
      "nnetnav_live_site=arxiv_num_tasks=80_portion=3": [
        "Tasks in dataset B require determining submission dates of specific paper versions (e.g., 'when was v3 submitted?')",
        "Dataset B includes tasks involving temporal recency validation (e.g., 'uploaded this week', 'last two days')",
        "Tasks in B require quantitative comparisons between category-specific results and all archives",
        "Dataset B tasks involve counting authors per paper with numerical thresholds (e.g., 'more than five authors')",
        "B requires extracting institutional statistics from external university websites (e.g., undergraduate student counts)",
        "Tasks in B demand analysis of arXiv's operational documentation (e.g., submission guidelines, figure formats)",
        "Dataset B includes merchandise inventory queries from arXiv store navigation",
        "B tasks require identification of arXiv's organizational leadership team members",
        "Tasks in B involve time-bound category-specific publication counts (e.g., 'in the last week' constraints)",
        "Dataset B requires cross-referencing between paper metadata and arXiv's operational policies/status"
      ]
    },
    "bbc": {
      "nnetnav_live_site=bbc_num_tasks=69_portion=2": [
        "Tasks in dataset B require summarizing key points from articles, while dataset A focuses on locating information without summarization",
        "Dataset B includes tasks involving economic impact analysis and geopolitical implications, whereas dataset A focuses on factual retrieval of ongoing conflicts",
        "Dataset B tasks reference structured educational content (e.g., \"A really simple guide\"), while dataset A lacks explicit instructional content navigation",
        "Sports tasks in dataset B require identifying tournament statistics (e.g., stroke counts), unlike dataset A's general match result queries",
        "Dataset B contains tasks targeting specialized subsections like \"Green Living\" and \"The SpeciaList\", indicating deeper content categorization",
        "Regional news tasks in dataset B demand geographic impact analysis, while dataset A focuses on basic regional article retrieval",
        "Dataset B includes explicit requests for podcast metadata (e.g., \"New Releases\"), whereas dataset A's multimedia tasks are more generic",
        "Tasks in dataset B require identifying current leadership standings (e.g., Premier League table), unlike dataset A's general fixture requests",
        "Dataset B contains tasks requiring identification of corporate stakeholders in news, while dataset A focuses on general company mentions",
        "Cultural tasks in dataset B specify review analysis of media releases, whereas dataset A focuses on general cultural trends"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=3": [
        "Dataset B tasks require explicit summarization of key points from articles (e.g. 'summarize central points') while A focuses on information extraction without synthesis",
        "Dataset B contains tasks referencing specific named guides/articles (e.g. 'What is climate change? A really simple guide') not present in A's tasks",
        "Dataset B includes queries about quantitative sports metrics (e.g. '-10 strokes') requiring numerical data interpretation absent in A",
        "Dataset B tasks reference specialized sections like 'Green Living' and 'The SpeciaList' not mentioned in A's navigation requirements",
        "Dataset B requires identification of corporate entities involved in news topics (e.g. 'which companies are involved') unlike A's general topic focus",
        "Dataset B contains explicit audio/podcast navigation tasks (e.g. 'BBC News Audio') while A only differentiates video/text",
        "Dataset B tasks demand geographical categorization of headlines (e.g. 'describe the region') beyond A's region-specific content location",
        "Dataset B includes technical section navigation (e.g. 'artificial intelligence section') requiring specialized category recognition not in A",
        "Dataset B tasks require calendar-based event tracking (e.g. 'athletics calendar dates') unlike A's timestamp-focused articles",
        "Dataset B contains music industry-specific queries (e.g. 'musician headlines') indicating dedicated entertainment subsections not emphasized in A"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=1": [
        "Dataset B tasks require explicit summarization of content (e.g. 'summarize key points') while A focuses on information location",
        "Dataset B includes tasks requiring identification of implied relationships (e.g. 'economic implications of climate change') not explicitly stated in metadata",
        "Dataset B contains tasks targeting specific content formats like explanatory guides ('A really simple guide') not mentioned in A",
        "Dataset B requires cross-referencing multiple data points (e.g. tournament names + stroke counts) within single tasks",
        "Dataset B tasks demand identification of unstated content relationships (e.g. connecting musicians to specific news categories)",
        "Dataset B includes explicit requests for quantitative data extraction (e.g. '-10 strokes') from content",
        "Dataset B tasks require monitoring of recurring structured content updates (e.g. 'New Releases' podcast sections)",
        "Dataset B contains tasks targeting specialized verticals like 'Green Living' not present in A's task samples",
        "Dataset B tasks require temporal comparisons within content hierarchies (e.g. 'most recent development' tracking)",
        "Dataset B includes explicit geographic granularity requirements (e.g. 'specific cities in Travel section') beyond regional categories"
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=0": [
        "Tasks in dataset B require explicit summarization of key points from identified articles, while dataset A focuses on locating information without mandatory summarization.",
        "Dataset B tasks emphasize retrieving the most recent updates and top headlines, whereas dataset A includes both current and historical information retrieval.",
        "Dataset B involves precise structured data extraction (e.g., tournament stroke counts), while dataset A handles broader structured elements like league tables.",
        "Tasks in dataset B specify exact subsections (e.g., Green Living, Athletics calendar), requiring granular navigation, unlike dataset A's broader exploration.",
        "Dataset A tasks involve interactive multimedia actions (e.g., pausing videos), while dataset B focuses on locating multimedia content without interaction.",
        "Dataset B tasks frequently target regional impact analysis (e.g., natural disasters in Asia), whereas dataset A\u2019s geographic categorization is more general.",
        "Dataset B requires identifying featured/highlighted content (e.g., 'New Releases' podcasts), unlike dataset A\u2019s open-ended content hub exploration.",
        "Dataset B tasks demand exact article retrieval (e.g., specific guide titles), while dataset A involves topic-based browsing without precise titles.",
        "Tasks in dataset B prioritize event-specific timeframes (e.g., next Athletics game), whereas dataset A includes less time-bound event queries.",
        "Dataset B tasks focus on analytical summaries (e.g., economic implications), while dataset A emphasizes factual extraction from diverse content types."
      ],
      "nnetnav_live_site=bbc_num_tasks=69_portion=4": [
        "Tasks in B require explicit identification of subsection counts (e.g., 'how many War-related sections') while A focuses on general topic navigation without numerical verification.",
        "B includes tasks demanding structured data extraction from tables/leaderboards (e.g., 'Golf's DP World Tour Total strokes'), whereas A lacks explicit tabular data retrieval requirements.",
        "B tasks specify content hierarchy verification (e.g., 'top headline in World News') while A focuses on content discovery without hierarchical prioritization.",
        "B requires identification of newly featured/released content (e.g., 'New Releases' podcasts) whereas A focuses on existing multimedia access without recency filters.",
        "B tasks explicitly mandate section-specific summarization (e.g., 'summarize economic implications in Europe') while A allows broader summarization without regional constraints.",
        "B contains instructions for metadata verification (e.g., 'which companies are involved') alongside content retrieval, unlike A's pure content-finding focus.",
        "B includes direct reference to permanent guides/static content (e.g., 'What is climate change' article), while A tasks target transient news updates.",
        "B tasks require identification of categorical sub-sections (e.g., 'Green Living', 'The SpeciaList') not present in A's broader category navigation.",
        "B emphasizes thematic clustering analysis (e.g., 'what topics most Africa news are about') where A focuses on singular article retrieval.",
        "B includes calendar-based event verification (e.g., 'Athletics calendar date') while A's time-sensitive tasks focus on recency without date matching."
      ]
    },
    "amazon": {
      "nnetnav_live_site=amazon_num_tasks=63_portion=2": [
        "Tasks in dataset B require specifying multiple concurrent attribute filters (e.g. price + material + size)",
        "Dataset B tasks explicitly require filtering by upcoming/recent product releases with time constraints",
        "Navigation in B requires maintaining filtered/sorted state across multiple interaction steps more frequently",
        "Tasks in B mandate verification of specific technical specifications (e.g. battery life measurements)",
        "Dataset B includes explicit requirements to compare prices across different product conditions (new vs used)",
        "Tasks in B require preservation of filtered results for later reference/action (e.g. 'save the lowest priced')",
        "Dataset B contains explicit requirements for numeric measurement validations (e.g. 'minimum 30 inches length')",
        "Tasks in B frequently combine price range filtering with specific customer rating thresholds (e.g. 4+ stars)",
        "Dataset B includes explicit requirements to verify return policy details for specific product variants",
        "Tasks in B require identification of products meeting multiple simultaneous material and feature constraints (e.g. stainless steel + programmable)"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=3": [
        "Tasks in dataset B require multi-attribute specifications (e.g., water-resistant design, memory foam material)",
        "Dataset B includes explicit condition-based filters (e.g., 'Used - Good' quality tier)",
        "Tasks in B involve time-sensitive constraints (e.g., upcoming book releases within 1 month)",
        "Dataset B requires verification of return policies and free return eligibility",
        "B tasks demand sorting by specific criteria (e.g., price high-low) and processing top results",
        "Dataset B specifies minimum customer review thresholds (e.g., 50+ reviews)",
        "B tasks explicitly check delivery options/free shipping availability",
        "Dataset B includes energy efficiency ratings as search criteria",
        "B requires exact dimensional specifications (e.g., 30-inch length, 6mm thickness)",
        "Tasks in B involve saving/search persistence actions (e.g., 'save the lowest priced item')"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=1": [
        "Tasks in B require precise product specifications with multiple attributes (e.g., material, dimensions, energy efficiency ratings) while A focuses on general categories",
        "B includes explicit requirements for product conditions (e.g., 'Used - Good') whereas A focuses primarily on new items",
        "B contains tasks requiring verification of specific delivery options/free return policies while A focuses on general availability checks",
        "B emphasizes time-sensitive filters (e.g., 'released within a month') not present in A's tasks",
        "B requires comparing/analyzing multiple filtered results (e.g., 'compare prices of top three results') while A focuses on single-item actions",
        "B includes explicit requirements for customer review quantities (e.g., 'minimum of 50 reviews') beyond just star ratings",
        "B tasks specify exact price ranges with both upper and lower bounds more frequently than A's price-focused tasks",
        "B contains tasks requiring saving/search history management (e.g., 'save the lowest priced') not present in A",
        "B includes technical specification requirements (e.g., battery life, waterproof ratings) absent in A's product searches",
        "B tasks require navigation through specialized sub-stores (e.g., Kindle Store, Luxury Stores) more frequently than A's general browsing"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=0": [
        "Tasks in B require handling multi-condition filters with exact numerical constraints (e.g., '10x zoom', '10-hour battery')",
        "B includes tasks requiring future-dated product searches (e.g., 'released within a month')",
        "B tasks explicitly demand saving/bookmarking specific search results (e.g., 'save the lowest priced')",
        "B requires verifying delivery options as a core task component (e.g., 'check FREE delivery availability')",
        "Tasks in B specify exact material compositions (e.g., 'stainless steel', 'memory foam') as mandatory filters",
        "B contains tasks requiring comparison of ranked results after sorting (e.g., 'compare top three search results')",
        "Tasks in B frequently require reporting specific metric thresholds (e.g., '500+ customer reviews') in outputs",
        "B includes explicit size/space requirements (e.g., 'room size 300 sq ft', '30 inches length') as filters",
        "Tasks in B mandate energy efficiency/technical certifications as search criteria",
        "B requires identification of specific product conditions within filtered results (e.g., 'cheapest Used - Good')"
      ],
      "nnetnav_live_site=amazon_num_tasks=63_portion=4": [
        "Tasks in B require multi-attribute filtering (e.g., water-resistant + battery life + price)",
        "B tasks demand specific quantity outputs (e.g., 'provide at least 2 products')",
        "B includes explicit comparisons between results (e.g., 'compare the prices of the top three')",
        "B tasks require validating return/delivery policies (e.g., 'check if FREE return is available')",
        "B tasks specify time-bound constraints (e.g., 'released within a month')",
        "B requires conditional filtering (e.g., 'Used - Good' condition)",
        "B tasks enforce numeric thresholds in product specs (e.g., 'minimum battery life of 10 hours')",
        "B includes explicit sorting instructions with follow-up actions (e.g., 'sort by newest arrivals, then check upcoming releases')",
        "B tasks demand granular material/design attributes (e.g., 'stainless steel', 'hypoallergenic')",
        "B requires post-search actions like saving/results reporting (e.g., 'save the lowest priced')"
      ]
    },
    "wolframalpha": {
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=4": [
        "Dataset B tasks require real-time data acquisition (e.g. current temperature) while A focuses on static/historical data lookup",
        "Dataset B contains queries involving personal health metrics (e.g. calorie intake, age, height) not present in A",
        "Dataset B tasks frequently require optimization analysis (e.g. circle packing density) absent in A's requests",
        "Dataset B includes specific geographic constraints (e.g. cities, countries) in calculations more systematically than A",
        "Dataset B demonstrates stronger emphasis on material properties comparison under specific environmental conditions (e.g. thermal conductivity at 25\u00b0C)",
        "Dataset B contains more complex multi-variable physics simulations (e.g. spring pendulum with multiple parameters)",
        "Dataset B tasks often involve consumer product specifications (e.g. SPF ratings) not featured in A",
        "Dataset B requires interpretation of biological/environmental factors (e.g. skin types, UV exposure) absent in A",
        "Dataset B includes geometric construction problems (e.g. polyomino combinations) with combinatorial analysis not seen in A",
        "Dataset B tasks demonstrate greater emphasis on practical engineering applications (e.g. material comparisons, mechanical systems) compared to A's theoretical focus"
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=0": [
        "Dataset B tasks involve real-time or current data retrieval (e.g., temperature, wind speed) unlike Dataset A, which focuses on static historical or general data.",
        "Dataset B tasks require dynamic parameterization of physical systems (e.g., spring pendulum with specific mass/spring constants), while Dataset A emphasizes theoretical or formulaic computations.",
        "Dataset B includes personalized health/fitness calculations with granular biometric inputs (age, weight, height), whereas Dataset A uses generic health metrics.",
        "Dataset B tasks demand multi-constraint geometric/mechanical solutions (e.g., circle packing comparisons), while Dataset A focuses on single-method comparisons.",
        "Dataset B contains combinatorial enumeration tasks (e.g., polyomino combinations) absent in Dataset A's statistical/mathematical problems.",
        "Dataset B requires solving nonlinear differential equations with specific boundary conditions, while Dataset A focuses on standard equation solving.",
        "Dataset B tasks involve environmental condition modeling (e.g., sunburn time with SPF/skin type/location), unlike Dataset A's static unit conversions.",
        "Dataset B includes complex number operations with multiple terms, whereas Dataset A focuses on real-number mathematics.",
        "Dataset B tasks require parametric visualization (e.g., plotting curves from equations), while Dataset A emphasizes result interpretation over graphical generation.",
        "Dataset B features multi-variable optimization problems (e.g., circle packing density comparisons) absent in Dataset A's comparative analyses."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=1": [
        "Dataset B tasks frequently combine multiple real-world parameters (age, weight, SPF) in single queries",
        "Dataset B emphasizes real-time or current data retrieval (e.g., live weather conditions)",
        "Dataset B contains more health/lifestyle-oriented calculations (weight loss, sun exposure)",
        "Dataset B requires multi-constraint problem solving (geometric packing conditions)",
        "Dataset B includes dynamic physical system simulations (spring pendulum mechanics)",
        "Dataset B features precise material property comparisons with environmental specifications",
        "Dataset B utilizes parametric equations for specialized visualizations",
        "Dataset B emphasizes comparative analysis of alternative solutions/methods",
        "Dataset B incorporates personalized biological/physical characteristics in computations",
        "Dataset B contains combinatorial enumeration challenges with shape analysis"
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=3": [
        "Dataset B tasks involve real-time or current data queries (e.g., temperature, stock prices, population growth rates) while Dataset A focuses on static or historical data.",
        "Tasks in Dataset B explicitly require multi-variable or multi-constraint problem solving (e.g., combining age/weight/height for health calculations, material properties with environmental factors).",
        "Dataset B contains tasks demanding comparative analysis between multiple entities (e.g., material properties of different metals, packing densities).",
        "Tasks in Dataset B frequently involve spatial/geometric computations (e.g., circle packing, curve lengths, polyomino combinations) absent in Dataset A examples.",
        "Dataset B includes specific parametric scenario modeling (e.g., sunburn time based on location/SPF/skin type, pendulum physics with multiple initial conditions).",
        "Tasks in Dataset B require interpretation of physical system behaviors (e.g., spring pendulum dynamics, thermal properties) rather than pure mathematical formalism.",
        "Dataset B emphasizes practical engineering applications (e.g., material conductivity, mechanical systems) over Dataset A's theoretical mathematics focus.",
        "Tasks in Dataset B often combine multiple computation types in single queries (e.g., unit conversion + composition analysis + percentage calculation).",
        "Dataset B contains explicit requests for visualizations/plotting (e.g., parametric curves) as part of solutions, unlike Dataset A's text-based focus.",
        "Tasks in Dataset B require handling time-dependent physical phenomena (e.g., population growth rates, planetary day lengths) with dynamic parameters."
      ],
      "nnetnav_live_site=wolframalpha_num_tasks=66_portion=2": [
        "Dataset B tasks require multi-step problem-solving combining calculations across domains (e.g. conversion + composition analysis) while Dataset A focuses on single-operation queries",
        "Dataset B contains explicit requests for real-time/current data (e.g. 'current temperature') whereas Dataset A focuses on static historical/property data",
        "Dataset B includes applied scenario modeling with multiple variables (SPF + location + skin type for sunburn) while Dataset A handles isolated health metrics",
        "Dataset B tasks involve constraint-based mathematical constructs ('inner region of pentagram', 'packing density constraints') absent in Dataset A's equation solving",
        "Dataset B requires comparative analysis between different methodologies/systems (densest vs square packing) while Dataset A focuses on singular data aggregation",
        "Dataset B contains dynamic system simulations (spring pendulum motion with initial conditions) where Dataset A focuses on static physical properties",
        "Dataset B emphasizes percentage/composition breakdowns (element weight percentages) while Dataset A focuses on absolute unit conversions",
        "Dataset B tasks show geographic specificity (Brazil, Australia locations) while Dataset A uses generic spatial references",
        "Dataset B includes parametric constraint definitions (initial spring length/angle) where Dataset A uses standard equation parameters",
        "Dataset B features combinatorial mathematics problems (polyomino combinations) absent from Dataset A's sequence/statistics tasks"
      ]
    },
    "allrecipes": {
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=0": [
        "Dataset A tasks often involve vague or open-ended recipe searches (e.g., 'Find some Christmas dessert recipes'), while Dataset B tasks require explicit criteria (e.g., '4.5 stars, 50+ reviews').",
        "Dataset B tasks frequently demand multi-step outputs (e.g., 'list ingredients, cooking time, and steps'), whereas Dataset A focuses on single actions like locating/saving recipes.",
        "Dataset B emphasizes structured nutritional data extraction (e.g., 'carbohydrate content per serving'), while Dataset A mentions nutrition more generally.",
        "Dataset B tasks specify exact review-count thresholds (e.g., '200+ reviews'), while Dataset A uses broader terms like 'popularity metrics' or '50+ reviews'.",
        "Dataset B includes explicit dietary constraints (e.g., 'vegan, 10 ingredients or less'), while Dataset A uses broader terms like 'vegetarian' without quantified limits.",
        "Dataset B requires validation of recipe metadata alignment (e.g., 'serves 6 people'), whereas Dataset A tasks rarely mention serving-size verification.",
        "Dataset B tasks often require parsing user reviews for quality signals (e.g., 'latest review says'), while Dataset A focuses on review existence rather than analysis.",
        "Dataset B prioritizes time-bound constraints (e.g., 'under 30 minutes prep'), while Dataset A references cooking time more generally without strict limits.",
        "Dataset B tasks involve comparative popularity metrics (e.g., 'most popular recipe with 1000+ reviews'), whereas Dataset A focuses on basic filtering.",
        "Dataset B requires synthesis of multiple data points (e.g., 'create shopping list'), while Dataset A tasks target discrete information retrieval."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=4": [
        "Dataset B tasks require specifying minimum review counts and star ratings in search criteria (e.g., 'at least 50 reviews', '4 stars or higher')",
        "Dataset B tasks explicitly demand structured output formats (e.g., ingredient lists, cooking steps summaries, nutrition facts)",
        "Dataset B tasks require comparison of multiple recipe attributes simultaneously (ratings + review counts + specific ingredients)",
        "Dataset B tasks frequently involve quantitative filtering constraints (e.g., 'under 30 minutes', 'more than 200 reviews')",
        "Dataset B tasks require extraction and synthesis of information from multiple recipe sections (reviews + ingredients + nutrition)",
        "Dataset B tasks emphasize specific nutritional requirements in search queries (e.g., low-carb, protein content tracking)",
        "Dataset B tasks require verification of recipe scalability/serving sizes (e.g., 'suitable for 6 people')",
        "Dataset B tasks involve explicit requests for user review analysis (e.g., 'what the latest review says')",
        "Dataset B tasks demand ingredient quantity validation (e.g., '10 ingredients or less' requirements)",
        "Dataset B tasks require specific preparation phase time breakdowns (prep vs cook time differentiation)"
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=1": [
        "Dataset B tasks require specific numerical thresholds for reviews (e.g., '50 reviews') while A focuses on general popularity.",
        "Dataset B emphasizes structured output formats (e.g., ingredient lists, step summaries) whereas A prioritizes discovery/saving.",
        "Dataset B explicitly requests nutritional information (e.g., carb counts) while A does not require nutritional analysis.",
        "Dataset B specifies exact rating requirements (e.g., '4.5 stars') whereas A uses qualitative terms like 'highly-rated'.",
        "Dataset B tasks demand precise dietary constraints (e.g., '10 ingredients or less') while A uses broader filters like 'vegan/keto'.",
        "Dataset B requires time-bound preparation details (e.g., 'under 30 minutes') as mandatory outputs unlike A's general time awareness.",
        "Dataset B tasks frequently require serving size specifications (e.g., 'suitable for 6 people') absent in A's tasks.",
        "Dataset B emphasizes cuisine-style specificity (e.g., 'Italian-style meatballs') versus A's general category browsing.",
        "Dataset B mandates latest review analysis (e.g., 'what the latest review says') while A focuses on general review interaction.",
        "Dataset B includes output generation like shopping lists/calorie counts that A never requires."
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=3": [
        "Tasks in B require explicit mention of serving sizes or number of people the recipe serves",
        "Tasks in B demand structured output formatting for recipe components (e.g., separate ingredient lists and step summaries)",
        "Tasks in B specify exact numeric thresholds for review counts (e.g., 'more than 100 reviews')",
        "Tasks in B require extraction of specific nutritional metrics beyond general nutrition facts (e.g., carb content per serving)",
        "Tasks in B mandate comparison/verification of recipe metadata against multiple constraints simultaneously",
        "Tasks in B involve synthesizing information from reviews into summary insights",
        "Tasks in B require identification of recipe modifications/substitutions mentioned in user reviews",
        "Tasks in B specify exact time constraints in minutes rather than general time ranges",
        "Tasks in B demand explicit citation of recipe authorship/contributor information",
        "Tasks in B require creation of shopping lists from recipe ingredients"
      ],
      "nnetnav_live_site=allrecipes_num_tasks=79_portion=2": [
        "Tasks in dataset B require recipes to meet specific quantitative thresholds for reviews (e.g., 'at least 50 reviews')",
        "Dataset B tasks explicitly demand structured output formats (e.g., ingredient lists, step summaries, nutrition facts)",
        "Dataset B emphasizes precise nutritional requirements (e.g., low-carb, calorie counts) in task specifications",
        "Tasks in dataset B require identification of exact review counts rather than general popularity indicators",
        "Dataset B tasks specify exact serving sizes (e.g., 'suitable for 6 people') as a requirement",
        "Dataset B shows stronger focus on macronutrient requirements (e.g., high-protein, low-carb) in recipe searches",
        "Tasks in dataset B require explicit confirmation of specific ingredient presence/absence (e.g., 'must include bananas')",
        "Dataset B tasks demand time constraints for both preparation and cooking phases separately",
        "Dataset B emphasizes recipe scaling requirements (e.g., ingredient quantities for specific servings)",
        "Tasks in dataset B require direct extraction and reporting of numerical data from recipes (e.g., exact calorie counts)"
      ]
    },
    "dictionary.cambridge": {
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=2": [
        "Tasks in dataset B require users to provide numerical counts (e.g., number of word meanings) directly, while dataset A does not explicitly demand quantitative answers.",
        "Dataset B tasks involve identifying third-party translation service providers (e.g., asking which company provided the translation), whereas dataset A focuses solely on retrieving translated content.",
        "Tasks in dataset B explicitly require generating new example sentences in user-specified contexts, while dataset A only involves accessing pre-existing example sentences.",
        "Dataset B tasks demand detailed grammatical structure exploration (e.g., passive voice, sentence types like affirmative/negative/interrogative), whereas dataset A focuses on general grammar explanations.",
        "Tasks in dataset B emphasize synthesizing information across multiple linguistic dimensions (e.g., pronunciation + definition + contextual application), while dataset A typically isolates individual elements.",
        "Dataset B includes tasks requiring direct extraction of metadata (e.g., IPA notation specifications), whereas dataset A focuses on pronunciation lookup without explicit notation formatting requirements.",
        "Tasks in dataset B require explicit comparison of linguistic features across predefined categories (e.g., countable/uncountable nouns), while dataset A comparisons are more general.",
        "Dataset B tasks mandate explicit confirmation of regional language variants (e.g., verifying company-specific translations), unlike dataset A's translation tasks which lack attribution requirements.",
        "Tasks in dataset B often require multi-step application of retrieved information (e.g., using definitions to create new sentences), whereas dataset A focuses on information retrieval alone.",
        "Dataset B tasks include explicit instructions for structural analysis of content (e.g., identifying grouped prepositions), while dataset A tasks involve simpler categorization."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=3": [
        "Tasks in B require explicit reporting of numerical counts (e.g., number of definitions, examples).",
        "Tasks in B mandate inclusion of both UK and US phonetic notations (IPA) in responses.",
        "Tasks in B specify exact quantities of examples or contexts to retrieve (e.g., two example sentences).",
        "Tasks in B require structured answers with distinct components (definition, pronunciation, example).",
        "Tasks in B involve direct extraction of translations into explicitly named languages (e.g., Chinese, Spanish).",
        "Tasks in B focus on technical grammar constructs with explicit sub-section navigation (e.g., passive voice, articles).",
        "Tasks in B demand identification of specific linguistic elements (e.g., modal verbs for possibility).",
        "Tasks in B include corporate/service attribution requirements (e.g., translation provider).",
        "Tasks in B require verbatim extraction of example sentences from dictionary entries.",
        "Tasks in B necessitate confirmation of dynamic content details (e.g., Word of the Day attributes)."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=1": [
        "Dataset B tasks require users to report numerical counts (e.g., number of meanings) directly from definitions, while Dataset A focuses on general definition retrieval",
        "Dataset B explicitly demands translations into single target languages (e.g., Chinese/Spanish), while Dataset A emphasizes comparing translations across multiple languages",
        "Dataset B requires creation of original example sentences using defined terms, whereas Dataset A only asks for extraction of existing examples from entries",
        "Dataset B tasks specify identifying corporate attribution for translations (e.g., 'which company provided'), not present in Dataset A",
        "Dataset B includes direct interrogation of grammatical structures (e.g., 'most common prepositions'), while Dataset A focuses on general grammar section navigation",
        "Dataset B tasks require explicit differentiation between countable/uncountable noun usage in grammar explanations, unlike Dataset A",
        "Dataset B mandates comparison of grammatical forms (affirmative/negative/interrogative) within single tasks, while Dataset A addresses these separately",
        "Dataset B tasks specify multi-part responses combining pronunciation, definition, and examples in single queries, whereas Dataset A separates these components",
        "Dataset B requires identification of word groups/phrasal prepositions rather than individual terms, unlike Dataset A's focus on single-word grammar",
        "Dataset B tasks demand explicit documentation of IPA notation variants (UK/US) simultaneously, while Dataset A allows singular pronunciation checks"
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=4": [
        "Dataset B tasks require explicit reporting of the exact number of word meanings, while Dataset A focuses on retrieving synonyms or related terms without numerical quantification.",
        "Dataset B tasks specify target languages for translations (e.g., Chinese, Spanish), whereas Dataset A tasks mention translations generically without language specificity.",
        "Dataset B tasks demand identification of detailed grammatical structures (e.g., present perfect, passive voice), while Dataset A tasks involve broader grammar sections (e.g., modal verbs).",
        "Dataset B tasks explicitly require International Phonetic Alphabet (IPA) notation for pronunciations, whereas Dataset A tasks focus on regional variants (UK/US) without IPA emphasis.",
        "Dataset B tasks include identifying the translation service provider or company, which is absent in Dataset A tasks.",
        "Dataset B tasks require multiple example sentences per word to illustrate contextual variations, while Dataset A tasks seek general usage examples.",
        "Dataset B tasks frequently require both UK and US pronunciations within a single task, while Dataset A tasks may focus on one variant.",
        "Dataset B tasks involve structured grammatical requirements (e.g., affirmative/negative/interrogative sentences), while Dataset A tasks explore grammar concepts more broadly.",
        "Dataset B tasks ask for specific grammatical categories (e.g., prepositions, articles), whereas Dataset A tasks cover general parts of speech (e.g., adjectives, nouns).",
        "Dataset B tasks mandate attribution of content sources (e.g., translation providers), while Dataset A tasks lack such requirements."
      ],
      "nnetnav_live_site=dictionary.cambridge_num_tasks=54_portion=0": [
        "Tasks in dataset B require users to count or enumerate multiple entries (e.g., number of word meanings, grammatical rules)",
        "Dataset B tasks explicitly demand cross-referencing between dictionary entries and grammar sections within a single task",
        "Tasks in dataset B require users to synthesize new example sentences based on definitions, not just retrieve existing ones",
        "Dataset B includes tasks that require identifying specific corporate attribution for translation services",
        "Tasks in dataset B require explicit comparison of grammatical structures across sentence types (affirmative/negative/interrogative)",
        "Dataset B tasks specify analysis of linguistic elements in particular grammatical categories (e.g., groups of prepositions)",
        "Tasks in dataset B frequently require simultaneous extraction of pronunciation, definition, and example sentence as a combined output",
        "Dataset B includes tasks that demand identification of phonetic patterns across multiple regional variants within a single query",
        "Tasks in dataset B require explicit differentiation between countable/uncountable noun usage in grammatical explanations",
        "Dataset B tasks involve meta-analysis of dictionary structure (e.g., recognizing entry hierarchy for multiple meanings)"
      ]
    },
    "apple": {
      "nnetnav_live_site=apple_num_tasks=70_portion=1": [
        "Dataset B tasks require checking real-time inventory/pickup availability (e.g. in-store pickup scheduling) while A focuses on general compatibility checks",
        "B emphasizes precise technical specification retrieval (e.g. video recording resolution, processor details) where A focuses more on feature comparisons",
        "B contains tasks requiring identification of product variants/SKUs (e.g. color-specific pricing, storage configurations) while A focuses on model-level comparisons",
        "Dataset B includes explicit geographic/location-based tasks (zip code lookup) not present in A's region-agnostic tasks",
        "B requires direct price difference calculations between specific configurations where A focuses on model-level price ranges",
        "Dataset A contains tasks requiring navigation through corporate responsibility content (environmental reports) absent in B",
        "B emphasizes new product release timelines (specific dates) while A focuses on general time-sensitive promotions",
        "A includes tasks requiring technical support documentation navigation (repair manuals) not present in B",
        "Dataset B contains explicit count-based verification tasks (number of pencil types) absent in A's qualitative comparisons",
        "A features business/healthcare solution research tasks while B focuses strictly on consumer product specifications"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=4": [
        "Tasks in B require checking product availability at specific locations (e.g., zip code 90038) while A focuses on general availability checks",
        "B tasks involve retrieving exact technical specifications (e.g., 'maximum video recording resolution') rather than general feature descriptions",
        "B requires identifying marketing slogans for products (e.g., 'slogan for Mac') which A tasks don't address",
        "B tasks demand precise numerical comparisons of product variants (e.g., 'price difference between 3rd gen AirPods types') rather than qualitative comparisons",
        "B requires checking time-bound availability (e.g., 'schedule pickup for January 10, 2024') while A focuses on general release dates",
        "B tasks specify exact configuration parameters (e.g., 'M3 Max chip with 16-core CPU') where A focuses on configuration customization processes",
        "B involves counting distinct product variants (e.g., 'how many types of Apple Pencil') rather than assessing compatibility",
        "B requires identifying specific hardware components (e.g., 'processor for Apple TV') not typically addressed in A's tasks",
        "B tasks focus on cross-generational color comparisons (e.g., 'color options across iPhone 13-15 Pro') while A focuses on current model customization",
        "B demands verification of specific technical capabilities (e.g., 'Wireless pairing support') rather than general accessory compatibility checks"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=0": [
        "Tasks in B require precise identification of product specifications (e.g., storage options, color variants) directly from product pages, while A emphasizes multi-step comparisons across categories or use cases.",
        "Dataset B includes tasks targeting explicit temporal elements (e.g., release dates, pickup scheduling by date), absent in A\u2019s samples.",
        "B tasks involve direct price comparisons between specific variants (e.g., Pro vs. Pro Max, SE vs. standard models), whereas A focuses on price discovery within broader educational/business contexts.",
        "Tasks in B require verification of localized availability (e.g., in-store pickup by zip code), while A prioritizes general in-store availability checks without geographic granularity.",
        "Dataset B tasks demand identification of marketing slogans or product taglines (e.g., MacBook Pro\u2019s slogan), absent in A\u2019s task requirements.",
        "B tasks explicitly query accessory compatibility details (e.g., Apple Pencil wireless pairing), while A focuses on accessory purchase workflows without technical validation.",
        "Dataset B includes explicit model-year differentiation (e.g., iPhone 13 Pro vs. 14 Pro vs. 15 Pro), whereas A tasks refer to generational models without version-specific comparisons.",
        "Tasks in B require exact technical attribute extraction (e.g., video recording resolution, processor types), while A emphasizes feature summaries (e.g., camera specs, sustainability).",
        "Dataset B tasks involve direct navigation to configuration pages for storage/color selections, while A tasks center on hierarchical category exploration (e.g., iPhone > iPhone 16 Pro > storage).",
        "B tasks focus on immediate purchase logistics (e.g., pickup scheduling, price for exact configurations), whereas A integrates trade-in/financing workflows into purchase actions."
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=2": [
        "Dataset B tasks emphasize precise technical specifications retrieval (e.g., video recording resolution, processor types)",
        "Dataset B requires checking in-store pickup availability at specific locations/zip codes, while A focuses on general stock status",
        "Dataset B includes explicit queries about product release dates and regional availability timelines",
        "Dataset B tasks involve comparing prices across specific color/storage configurations rather than general model comparisons",
        "Dataset B contains queries about quantifying product variations (e.g., 'how many types of Apple Pencil exist')",
        "Dataset B requires identifying marketing slogans and promotional terminology used for products",
        "Dataset B focuses more on accessory compatibility details (e.g., wireless charging support for specific Apple Pencil models)",
        "Dataset B tasks demand comparisons across 3+ product generations rather than successive generation comparisons",
        "Dataset B includes explicit technical compatibility checks (e.g., iOS version requirements for features)",
        "Dataset B requires identifying exact geographical release schedules rather than general service offerings"
      ],
      "nnetnav_live_site=apple_num_tasks=70_portion=3": [
        "Dataset B tasks require checking real-time inventory/pickup availability (e.g., zip code-based availability checks)",
        "Dataset B focuses on identifying exact technical measurement specifications (e.g., video recording resolution, screen size comparisons)",
        "Dataset B tasks explicitly request marketing content analysis (e.g., identifying product slogans)",
        "Dataset B requires direct price comparisons between different product categories (e.g., Watch vs. Watch SE)",
        "Dataset B tasks emphasize identification of exact storage configurations across current product lines",
        "Dataset B includes historical generation comparisons for aesthetic features (e.g., color options across 3 iPhone generations)",
        "Dataset B tasks specify geographical release details (e.g., regional availability timelines)",
        "Dataset B requires differentiation between accessory versions with technical requirements (e.g., wireless charging support)",
        "Dataset B tasks involve time-bound scheduling actions (e.g., reserving specific pickup dates)",
        "Dataset B focuses on processor identification in non-computing devices (e.g., Apple TV chip details)"
      ]
    },
    "google_search": {
      "nnetnav_live_site=google_search_num_tasks=72_portion=3": [
        "Dataset B tasks require exact numerical answers (e.g., player counts, SHA bits) while Dataset A focuses on comparative numerical analysis (e.g., stock comparisons, NBA team performance)",
        "Dataset B emphasizes retrieval of unique identifiers/technical specifications (e.g., commit SHAs, hardware requirements) whereas Dataset A prioritizes practical application information (e.g., job requirements, health recommendations)",
        "Dataset B contains more time-sensitive real-time data requests (e.g., current player counts, live charts) compared to Dataset A's recent-but-not-live information needs",
        "Dataset B tasks frequently require parsing structured technical documentation (e.g., software requirements, research papers) while Dataset A more often involves synthesizing consumer-oriented content",
        "Dataset B shows higher prevalence of platform-specific metric extraction (e.g., GitHub commits, Steam player counts) versus Dataset A's general platform usage (e.g., YouTube tutorials, Google Scholar papers)",
        "Dataset B includes explicit requests for ranked list positions (e.g., 'top 5', 'number one') where Dataset A focuses on comparative analysis without positional requirements",
        "Dataset B tasks demonstrate stronger need for exact temporal precision (e.g., specific release dates, season records) compared to Dataset A's general recency requirements",
        "Dataset B contains more atomic fact retrieval (e.g., biography details, movie ratings) while Dataset A features compound exploratory tasks (e.g., symptom research, venue comparisons)",
        "Dataset B shows increased demand for technical system interoperability details (e.g., hardware compatibility, feature requirements) absent in Dataset A's tasks",
        "Dataset B tasks require precise answer formatting (e.g., specific bit sequences, ordered lists) whereas Dataset A allows more flexible information presentation"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=2": [
        "Dataset B tasks more frequently require retrieval of exact numerical identifiers (e.g., SHAs, player counts)",
        "Dataset B emphasizes precise technical specifications (e.g., hardware requirements, software compatibility)",
        "Dataset B tasks focus more on current/real-time quantitative metrics (player counts, chart positions)",
        "Dataset B requires direct extraction of specific data points from developer platforms (GitHub, Steam)",
        "Dataset B tasks involve structured sorting/ranking operations (top 5 lists, highest-grossing comparisons)",
        "Dataset B emphasizes entertainment industry metrics (box office numbers, music chart positions)",
        "Dataset B tasks require identification of version-specific technical constraints",
        "Dataset B focuses more on celebrity/public figure biographical details",
        "Dataset B tasks demand extraction of metadata from version control systems",
        "Dataset B emphasizes temporal precision for media release dates"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=4": [
        "Dataset B tasks prioritize exact numerical or alphanumeric retrieval (e.g., SHA hashes, player counts, box office figures), while Dataset A focuses on broader conceptual or procedural information retrieval.",
        "Dataset B tasks frequently require verbatim extraction of platform-specific metadata (e.g., GitHub commit details, Steam player statistics), whereas Dataset A emphasizes cross-domain knowledge integration.",
        "Dataset B includes explicit requests for ordered rankings (e.g., 'top 5 highest-grossing', 'number one artist'), while Dataset A comparisons focus on analytical contrasts without predefined ranking structures.",
        "Dataset B tasks often specify technical parameters (e.g., hardware requirements, software compatibility details) absent in Dataset A's more general domain inquiries.",
        "Dataset B queries emphasize real-time crowd-sourced metrics (e.g., current Spotify charts, live player counts) compared to Dataset A's focus on stable factual updates (e.g., research papers, event bookings).",
        "Dataset B contains explicit instructions for data manipulation (e.g., 'copy and paste SHA', 'list top 10 songs') not present in Dataset A's observational tasks.",
        "Dataset B tasks frequently target entertainment industry metrics (e.g., movie earnings, athlete statistics) while Dataset A spans academic, professional, and lifestyle domains.",
        "Dataset B requires precise temporal specificity (e.g., 'latest commit', 'current number one') whereas Dataset A uses relative temporal framing (e.g., 'recent', 'latest').",
        "Dataset B queries often demand platform-specific API-like interactions (e.g., GitHub commit inspection, Steam metrics) absent from Dataset A's general web navigation patterns.",
        "Dataset B tasks explicitly request structured output formats (e.g., 'list of...', 'first 7 bits of...') more frequently than Dataset A's open-ended information synthesis requirements."
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=0": [
        "Tasks in Dataset B more frequently require exact numerical or alphanumeric identifiers (e.g., SHA hashes, version numbers, specific bit sequences)",
        "Dataset B tasks emphasize explicit ranking positions (e.g., 'top 3', 'number one') more consistently than A",
        "Tasks in B more commonly require combining information from multiple distinct platforms/sources within a single query",
        "Dataset B shows stronger focus on version-specific technical requirements (software/hardware compatibility details)",
        "B's tasks more frequently demand real-time player counts/active user statistics compared to A",
        "Dataset B contains more requests for cryptographic/commit-specific developer workflow information",
        "Tasks in B more often require extraction of metadata from version control systems (e.g., GitHub commit messages)",
        "Dataset A includes more transactional tasks (e.g., booking, purchasing) while B focuses purely on information retrieval",
        "B's queries more consistently demand temporal precision (specific season statistics, exact scoring timelines)",
        "Dataset A contains more location-based service queries (venue rentals, local activities) compared to B's platform-agnostic requests"
      ],
      "nnetnav_live_site=google_search_num_tasks=72_portion=1": [
        "Dataset B tasks more frequently require exact numerical outputs (e.g., 'first 7 bits of SHA', 'number of players') compared to Dataset A's general quantitative requests",
        "Dataset B includes explicit requests for ordered/sorted lists (e.g., 'top 5 highest-grossing', 'sorted by box office earnings') as core task requirements",
        "Dataset B contains more technical system specification queries (e.g., 'hardware requirements', 'software requirements for iPhones') than Dataset A",
        "Dataset B tasks frequently involve developer tools/platforms (GitHub commits, SHA hashes) absent in Dataset A samples",
        "Dataset B shows stronger emphasis on version-specific technical data (e.g., 'latest version of Adobe Photoshop', 'latest commit') compared to Dataset A's general version references",
        "Dataset B includes more direct requests for metadata about information sources (e.g., journal names like 'Nature Astronomy') compared to Dataset A's general source browsing",
        "Dataset B tasks require structured data extraction from ranked listings (e.g., 'top 3 super-earth planets') more explicitly than Dataset A's comparative analysis",
        "Dataset B contains more time-sensitive real-time player/audience metrics (e.g., 'number of players in-game at this time') than Dataset A's general 'latest news' requests",
        "Dataset B shows increased focus on exact technical specifications for compatibility (e.g., 'AirDrop's web transmission requirements') absent in Dataset A",
        "Dataset B tasks frequently combine multiple structured output requirements (e.g., 'list + number + current status') in single queries more than Dataset A's singular focus tasks"
      ]
    }
  }
}