{
  "sims": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets query courses based on specific levels (e.g., basic/medium/high-level in A, Level_500 in B).",
        "Both involve filtering professors by faculty/department affiliation or employment status.",
        "Both require counting instances (e.g., courses taught, students advised, professors in roles).",
        "Both use numerical IDs (course, professor, student) as primary identifiers.",
        "Both include questions about professors teaching multiple courses or specific course ranges.",
        "Both reference academic phases or stages (e.g., pre-phase in A, Pre_Quals in B).",
        "Both involve intersections between entities (e.g., courses taught by multiple professors).",
        "Both query student statuses (year in program, phase) tied to advisor relationships.",
        "Both use aggregation operations (totals, percentages, min/max values).",
        "Both ask about professors' roles/positions (e.g., faculty employees, departmental positions)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets focus on querying relationships between professors and the courses they teach.",
        "Questions in both datasets frequently use numerical identifiers (e.g., course IDs, professor IDs) for specificity.",
        "Both include queries about counting entities (e.g., courses taught, students advised).",
        "Questions in both datasets involve filtering results based on course levels (e.g., basic, medium, high-level, Level_500).",
        "Both datasets ask about advising relationships between professors/advisors and students.",
        "Queries in both require aggregating data (e.g., totals, percentages, counts).",
        "Both include requests to list or identify professors teaching specific courses.",
        "Questions in both datasets reference academic roles (e.g., faculty members, advisors, students).",
        "Both datasets ask about course levels and their associations with professors or courses.",
        "Queries in both involve filtering or joining data across multiple entities (e.g., courses, professors, students)."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Both datasets query course IDs and their associated levels (e.g., basic, medium, high-level, undergraduate, master/graduate).",
        "Both focus on professors' affiliations with faculty positions or roles (e.g., faculty member, faculty employee).",
        "Questions in both datasets filter courses based on criteria like course level (e.g., high-level, medium) or type (e.g., professional, master).",
        "Both include queries about professors teaching multiple courses or specific course ranges (e.g., IDs 121\u2013130).",
        "Advisor-student relationships are analyzed in both, including student years in programs and advisor assignments.",
        "Both datasets use numerical identifiers (e.g., professor IDs, student IDs, course IDs) for granular filtering.",
        "Questions in both involve counting or aggregating results (e.g., total courses, percentage calculations, averages).",
        "Both include queries about professors' positions in the faculty (e.g., faculty_eme, full-time status, experience years).",
        "Course-property associations (e.g., course level tied to professor experience or faculty status) are a shared focus.",
        "Both datasets request filtering based on combined criteria (e.g., course level + faculty affiliation, ID ranges + course type)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets include queries about the number of courses taught by professors, often filtering by course level or professor attributes.",
        "Both focus on course levels (e.g., basic, medium, high-level, graduate) as a key filtering criterion in questions.",
        "Both use numerical identifiers (e.g., course IDs, professor IDs, student IDs) to reference specific entities.",
        "Both involve counting or aggregating entities (e.g., courses, students, professors) based on specific conditions like program phase or course difficulty.",
        "Both include questions about professors' affiliations or roles (e.g., faculty membership, years in the program).",
        "Both ask for lists of IDs (e.g., course IDs, professor IDs) that meet criteria such as course level or teaching responsibilities.",
        "Both reference program phases or student years (e.g., '5th year,' 'phase 2') as part of filtering conditions.",
        "Both contain queries about the relationship between advisors/advisees, though expressed differently (e.g., advisor assignments in A, student-advised relationships in B).",
        "Both require filtering courses by categorical or numerical thresholds (e.g., 'level 500 or higher,' 'high-level undergraduate').",
        "Both emphasize professional or graduate-level courses as distinct categories in questions."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Use of specific numeric identifiers (e.g., course_id, person_id, professor_id) in queries",
        "Focus on course level as a key attribute (e.g., 'basic', 'medium', 'high-level', 'master/graduate')",
        "Queries about relationships between professors and courses they teach",
        "Interest in advisor-advisee relationships (e.g., 'advised by', 'advisors for students')",
        "Requests for counts or percentages of entities (e.g., 'how many courses', 'percentage of high-level courses')",
        "Direct filtering by academic year or program phase (e.g., '5th year', 'pre-phase of qualification')",
        "Explicit use of Boolean conditions (e.g., 'basic or medium', 'high-level or harder', 'no more than two')",
        "Queries targeting faculty membership or employment status (e.g., 'faculty member', 'faculty employee')",
        "Requests to list IDs paired with attributes (e.g., 'course ID and level', 'professor ID and position')",
        "Use of comparative or superlative criteria (e.g., 'most number of professors', 'highest number of courses')"
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets use specific course IDs as primary query parameters.",
        "Questions in both datasets reference professors by their unique numerical IDs.",
        "Course levels (e.g., 'high-level undergraduate', 'Level_400') are used to filter or describe courses.",
        "Queries frequently link professors to the courses they teach using exact ID matches.",
        "Both datasets ask for the level of a specific course by its ID (e.g., 'What level is course 165?').",
        "Questions require retrieving lists of entities (courses, professors) based on ID or categorical attributes.",
        "Exact matching on numerical identifiers (course, professor, person) is a consistent criterion.",
        "Academic roles (professors, advisors) and their relationships to courses or students are central themes.",
        "Queries focus on structured relationships (e.g., 'taught by', 'advised by') between entities.",
        "Entity associations (e.g., professor-course, advisor-student) are explicitly modeled in the questions."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Queries focus on courses taught by professors based on specific attributes (e.g., faculty status, years in program).",
        "Questions filter results using course levels (e.g., basic, medium, high-level, master/graduate).",
        "Both datasets reference numerical identifiers for professors (e.g., professor ID, p_id).",
        "Queries involve counting or aggregating results (e.g., 'how many,' 'most number of').",
        "Questions use precise conditions on professor experience (e.g., years in program, faculty membership).",
        "Both datasets ask for lists of course IDs tied to professor attributes.",
        "Queries include filtering by course difficulty (e.g., 'high-level or harder undergraduate courses').",
        "Questions link professors to specific phases or years in academic programs (e.g., '12th years of program').",
        "Both datasets use exact numerical thresholds (e.g., 'at least 5 years,' 'more than 4').",
        "Queries map professors to courses they teach using relational logic (e.g., 'taught by advisors who advised student X')."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets focus on querying course levels (e.g., undergraduate, graduate, senior, advanced).",
        "Both include questions about professors teaching specific courses or course levels.",
        "Both require counting instances (e.g., number of courses, professors, or students).",
        "Both reference unique identifiers like course_id, professor_id, or person_id.",
        "Both involve filtering by faculty/program affiliation (e.g., faculty members, years in the program).",
        "Both ask about advisor-student relationships (e.g., advisors for specific student years).",
        "Both include granular queries about academic phases (e.g., pre-phase, Phase 3, program years).",
        "Both emphasize course difficulty tiers (e.g., basic, medium, high-level, professional).",
        "Both require joining entities (e.g., professors to courses, advisors to students).",
        "Both reference structured database attributes (e.g., courseLevel, inPhase, faculty positions)."
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets include queries about user subscription or payment status (e.g., trialists, paying subscribers, users with payment methods).",
        "Questions in both datasets frequently reference movie ratings (e.g., specific scores like 5-star ratings or average ratings).",
        "Directors of movies are a common entity queried in both datasets.",
        "User-created lists (e.g., titles, follower counts, creation dates) are frequently referenced.",
        "Aggregate functions (e.g., counts, averages, percentages) are used to analyze data in both datasets.",
        "Time-based filters (e.g., dates, years, timestamps) are applied to narrow results.",
        "Popularity metrics (e.g., highest popularity, most followers) are used to rank movies or lists.",
        "User followers and list follower counts are explicitly tracked and queried.",
        "Specific user IDs are referenced to filter or retrieve data (e.g., user 4208563, user_id 48298880).",
        "Queries often combine multiple criteria (e.g., user status + rating score + release year)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Queries involve filtering movies by release year or date ranges",
        "Both datasets require counting specific user actions (e.g., ratings, list creations)",
        "Questions target movie directors and their associated works",
        "Aggregation of rating scores (e.g., highest average, most '5' ratings) is present in both",
        "User IDs are used as unique identifiers for profile-related queries",
        "Both include references to user-generated lists and their metadata (titles, update timestamps, followers)",
        "Queries involve popularity metrics for movies (numeric thresholds or rankings)",
        "URLs for movie/posters/user profiles are explicitly requested",
        "Questions combine multiple conditions (e.g., subscription status + rating score + release year)",
        "Time-based analysis of user/list activity (e.g., '10 years after creation') appears in both"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Both datasets query user subscription status (e.g., trial, paying) in relation to actions like rating or creating lists.",
        "Both focus on aggregating numeric metrics (e.g., counts, averages, highest/lowest values) for ratings, followers, or popularity.",
        "Both involve filtering movies by specific attributes like release year, director, title, or popularity score ranges.",
        "Both reference user-generated lists (e.g., list titles, creation/update dates, followers) and their associated metadata.",
        "Both require joining data across entities (e.g., users, movies, ratings, lists) to answer questions.",
        "Both include questions about directors (e.g., identifying directors, counting their movies, linking directors to movie attributes).",
        "Both ask for URLs related to movies (e.g., cover images, user profile images, rating URLs).",
        "Both use temporal constraints (e.g., date ranges, years, time since creation/update) in queries.",
        "Both explicitly reference rating scores (e.g., counts of ratings with specific scores, average scores, highest/lowest scores).",
        "Both include questions about popularity metrics (e.g., popularity thresholds, rankings, comparisons)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets include questions about movie titles and their release years.",
        "Both datasets query aggregate metrics (e.g., counts, averages) for ratings, popularity, or user activity.",
        "Both datasets require filtering results by date ranges or specific years (e.g., \"after 2014\", \"released in 2021\").",
        "Both datasets ask for the identification of directors associated with specific movies or popularity metrics.",
        "Both datasets involve analyzing user-generated lists, including list titles, followers, and creation/update timestamps.",
        "Both datasets include questions about numerical thresholds (e.g., ratings > 4.0, popularity > 400).",
        "Both datasets require joining user data (e.g., trial/paying status, followers) with movie ratings or lists.",
        "Both datasets focus on ranking or identifying top results (e.g., \"most popular\", \"highest rating\", \"top 5\").",
        "Both datasets query unique counts (e.g., unique users, unique directors, unique movies).",
        "Both datasets involve extracting metadata (e.g., URLs, cover images) linked to movies or user profiles."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets query for movie titles based on specific criteria (e.g., highest rating, popularity, or ID).",
        "Both include questions about user-generated lists, such as list titles, followers, or creation/update dates.",
        "Both involve aggregations (e.g., counts, averages, maximums) for ratings, followers, or movies.",
        "Both reference numerical identifiers like user IDs, movie IDs, or list IDs in queries.",
        "Both ask for comparisons or rankings (e.g., 'highest rating,' 'most followers,' 'most popular').",
        "Both use exact numerical thresholds (e.g., ratings greater than 3, popularity between 400-500).",
        "Both include questions about user metadata (e.g., subscriber status, follower counts, profile URLs).",
        "Both require filtering results by temporal constraints (e.g., release year, list creation/update dates).",
        "Both focus on movie-director relationships (e.g., identifying directors of specific movies).",
        "Both involve explicit references to rating scores (e.g., counting ratings of 4 or 5)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets query user subscription status (e.g., 'paying subscriber', 'trialist', 'subscriber', 'payment method').",
        "Both filter results by specific rating scores (e.g., '4', '5', 'highest rating').",
        "Both use exact identifiers like movie titles, user IDs, or list IDs to retrieve data.",
        "Both include requests for URLs linked to movies, ratings, or user profiles.",
        "Both require aggregation operations (e.g., counts, averages, max/min values).",
        "Both reference list titles and their properties (e.g., followers, update dates, creation years).",
        "Both filter movies by release years or date ranges (e.g., '2014', '1924', '1988').",
        "Both ask for director names associated with movies or specific release years.",
        "Both use popularity metrics as a filter or ranking criterion.",
        "Both query relationships between users and lists (e.g., list creators, followers, list contents)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Queries often ask for movie titles based on specific criteria (e.g., highest rating, popularity, or user subscriptions).",
        "Questions frequently involve filtering results by user subscription status (e.g., trialist, paying subscriber, or users with payment methods).",
        "Both datasets include queries about movie ratings (e.g., counts of ratings, highest-rated movies, or scores like 4/5 or 5/5).",
        "Lists and their attributes (e.g., titles, update timestamps, number of movies, followers) are common query targets.",
        "User metadata (e.g., profile images, IDs, subscription eligibility) is frequently referenced in filtering or output criteria.",
        "Date/time filters (e.g., release years, list update dates, rating timestamps) are used to constrain results.",
        "Aggregations like counts, averages, and percentages are regularly requested (e.g., 'how many users', 'average rating').",
        "Questions target movie popularity metrics as a key comparative or filtering parameter.",
        "Directors are referenced in queries about movie attributes or performance (e.g., identifying directors of top movies).",
        "URLs for movie images, user profiles, or ratings are explicitly requested in output criteria."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets query movie titles based on rating scores (e.g., 'highest rating score' in B and 'most \"5\" ratings' in A).",
        "Both include questions about user counts under specific conditions (e.g., 'paying subscribers' in A and 'users with a payment method' in B).",
        "Both datasets filter results using numerical thresholds (e.g., 'popularity > 400' in A and 'rating score > 8' in B).",
        "Both involve aggregations like averages (e.g., 'average rating score' in A and B) and totals (e.g., 'total number of movies' in B).",
        "Both reference user-created lists and their properties (e.g., 'list titles created by user' in A and 'lists with >100 followers' in B).",
        "Both ask for rankings or extremes (e.g., 'most popular movie' in A and 'movie with the highest rating' in B).",
        "Both include temporal filters (e.g., 'released in 1924' in A and 'released after 2010' in B).",
        "Both query metadata like directors (e.g., 'director of Tokyo Eyes' in A) and language (e.g., 'movies in English' in B).",
        "Both use user-specific identifiers (e.g., 'user 4208563' in A and 'specific user_id' in B).",
        "Both involve counting interactions (e.g., 'number of ratings' in B and 'percentage of ratings with highest score' in A)."
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets include questions filtering apps by numerical thresholds (e.g., ratings \u22654.2, installs >10M).",
        "Both require identification of apps by specific categories/genres (e.g., 'Games,' 'Finance,' 'Sports').",
        "Both query apps based on pricing model (free vs. paid).",
        "Both involve aggregation of metrics like average ratings, average sizes, or percentage calculations.",
        "Both reference app popularity metrics (installs, reviews, and rankings like 'top 5').",
        "Both include conditional filters combining multiple attributes (e.g., rating + installs + category).",
        "Both use explicit content rating criteria (e.g., 'Everyone 10+,' 'Everyone').",
        "Both analyze app metadata such as size, version, and update status.",
        "Both require sentiment analysis metrics (polarity scores in A, sentiment polarity thresholds in B).",
        "Both focus on granular quantitative comparisons (e.g., 'percentage ratio between positive/negative sentiments')."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets focus on app ratings as a key metric (e.g., 'rating 4.5 and above' in A, 'rated higher than 4.0' in B).",
        "Both include queries about app categories/genres (e.g., 'arcade genre' in A, 'GAME category' in B).",
        "Both involve aggregations like averages (e.g., 'average sentiment polarity score' in A, 'average rating' in B).",
        "Both use numerical thresholds for filtering (e.g., 'size no more than 1.0 M' in A, 'more than 10,000 reviews' in B).",
        "Both reference app attributes like installs (e.g., '100,000,000+ installs' in A, '1,000,000,000+ installs' in B).",
        "Both analyze free vs. paid apps (e.g., 'percentage for free application' in A, 'apps listed as Free' in B).",
        "Both include sentiment-related queries (e.g., 'neutral attitude' in A, 'positive sentiment' in B).",
        "Both ask for counts under specific conditions (e.g., 'how many apps have rating of 5' in A, 'how many games have rating above 4.5' in B).",
        "Both reference Google Play Store context explicitly (e.g., 'rating in the Google Play Store' in A, 'rated higher than 4.0 on the Google Play Store' in B).",
        "Both use structured queries combining filters and metrics (e.g., 'genre + content rating' in A, 'category + review count' in B)."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Both datasets query about apps' sentiment polarity scores (average, highest, lowest, etc.).",
        "Both datasets include questions about app ratings, often with thresholds (e.g., 4.5+).",
        "Both datasets ask for install/download counts, frequently using ranges (e.g., 1,000,000+).",
        "Both datasets reference app genres/categories (e.g., action, puzzle, simulation).",
        "Both datasets require filtering apps based on free/paid status.",
        "Both datasets involve ranking or listing top apps by metrics like reviews, installs, or ratings.",
        "Both datasets use aggregations (average, total, percentage) for metrics like ratings, installs, or sentiment scores.",
        "Both datasets include questions about app content ratings (e.g., Everyone 10+).",
        "Both datasets query apps updated within specific timeframes (e.g., 'since 2018').",
        "Both datasets ask for correlations between app attributes (e.g., sentiment scores vs. installs, ratings vs. genre)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets focus on app ratings, either through specific thresholds (e.g., 'rating 4.5+') or aggregated averages.",
        "Queries frequently filter results by app categories (B) or genres (A) like 'Tools,' 'Family,' 'Sports,' or 'Games.'",
        "Numerical thresholds (e.g., '100,000+ installs,' 'more than 10,000 reviews') are used to constrain results in both datasets.",
        "Aggregation functions (e.g., average, count, total) are applied to metrics like ratings, reviews, or sentiment scores.",
        "Top-ranked entries (e.g., 'top 5 apps by reviews') are a recurring focus in queries from both datasets.",
        "User feedback is analyzed quantitatively, via sentiment polarity/subjectivity (A) or review counts/ratings (B).",
        "Queries often combine multiple criteria (e.g., 'rating > 4.5 AND category = Tools' or 'sentiment score AND installs').",
        "Results are filtered by app attributes such as content rating (A) or update status (A) and category-specific metrics (B).",
        "Both datasets include queries to count apps meeting specific conditions (e.g., 'number of apps with rating > 4.5').",
        "Data is segmented by categorical attributes (e.g., genre, category) to analyze trends or comparisons."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets include questions about app categories/genres (e.g., 'arcade genre' in A, 'BUSINESS' category in B).",
        "Both require numerical aggregation (e.g., 'average rating' in B, 'average sentiment polarity score' in A).",
        "Both reference app-specific attributes like size (e.g., 'size of Browser 4G' in A, 'size of free android games' in B).",
        "Both involve filtering by content ratings (e.g., 'Everyone 10+' in A, age suitability in B).",
        "Both ask for counts of reviews or sentiments (e.g., 'negative comments' in A, 'positive reviews' in B).",
        "Both use explicit thresholds (e.g., 'rating 4.5 and above' in A, 'rating more than 4.0' in B).",
        "Both include queries about app installs/downloads (e.g., '100,000,000+ installs' in A, 'minimum number of downloads' in B).",
        "Both reference named apps (e.g., 'Cooking Fever' in A, 'Helix Jump' in B).",
        "Both require comparisons across apps (e.g., 'top 5 shopping apps' in A, 'highest rating' in B).",
        "Both use structured filters (e.g., 'free application' in A, 'free android games' in B)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets involve queries about app ratings (e.g., average rating, specific rating thresholds).",
        "Both include questions filtering apps by category/genre (e.g., Family, Games, Photography).",
        "Both use numerical thresholds for metrics like installs (e.g., 1,000,000+, 100,000,000+).",
        "Both reference free vs. paid app distinctions in queries.",
        "Both ask for statistical aggregates (e.g., averages, counts, percentages).",
        "Both include content rating filters (e.g., Teen, Everyone 10+).",
        "Both require identification of top-ranked apps (e.g., 'top 5', 'top-rated').",
        "Both use exact numerical comparisons (e.g., 'rating of 4.5', 'sentiment objectivity of 0.3').",
        "Both involve cross-referencing multiple attributes (e.g., category + rating + installs).",
        "Both include queries about app metadata like version numbers and update years."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Both datasets include questions about app ratings (e.g., 'rating 4.5 and above' in A and 'average rating' in B).",
        "Both reference app categories/genres (e.g., 'sports apps' in A and 'Action category' in B).",
        "Both involve queries about the number of installs/downloads (e.g., '100,000,000+ installs' in A and '10 million times' in B).",
        "Both use filtering criteria like price ranges (e.g., 'average price of games' in A and 'price less than 5 dollars' in B).",
        "Both include references to user sentiment metrics (e.g., 'sentiment polarity score' in A and 'sentiment polarity' in B).",
        "Both ask for rankings or comparisons (e.g., 'top 5 shopping apps' in A and 'top 5 most popular apps' in B).",
        "Both mention content ratings or audience targeting (e.g., 'content rating of Everyone 10+' in A and 'suitable for teenagers' in B).",
        "Both reference specific apps by name (e.g., 'Cooking Fever' in A and 'Twitter' in B).",
        "Both use quantitative thresholds (e.g., 'more than 75,000,000 times' in A and '1 million installs' in B).",
        "Both include statistical aggregations like averages (e.g., 'average sentiment polarity score' in A and 'average rating' in B)."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets include questions about app ratings, either specific scores or average ratings across categories.",
        "Both datasets query the number of reviews, installs, or user counts for specific apps or categories.",
        "Both datasets involve filtering or aggregating data based on app categories (e.g., 'Tools,' 'Games,' 'Sports').",
        "Both datasets reference sentiment analysis, with Dataset A using granular metrics like polarity/subjectivity scores and Dataset B using broader terms like 'positive sentiment.'",
        "Both datasets ask for comparisons or rankings (e.g., 'top 5 apps,' 'highest rating').",
        "Both datasets require aggregations such as averages, totals, or percentages (e.g., 'average rating,' 'percentage of positive sentiments').",
        "Both datasets include questions targeting specific named apps (e.g., 'Instagram,' 'Honkai Impact 3rd').",
        "Both datasets use numerical thresholds (e.g., 'rating greater than 4.0,' '100,000,000+ installs').",
        "Both datasets focus on user-generated content metrics, such as reviews and sentiment analysis.",
        "Both datasets query metadata like app genres, categories, or content ratings (e.g., 'arcade genre,' 'Everyone 10+')."
      ]
    }
  },
  "diffs_synth_from_real": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B uses standardized course level codes (e.g., Level_500) rather than descriptive terms (e.g., 'high-level') for categorization",
        "Dataset B includes queries about professors' experience duration (e.g., 'more than 2 years in program') while A does not reference temporal experience metrics",
        "Dataset B contains explicit references to named faculties (e.g., 'Faculty of Mathematics') rather than generic departmental affiliations",
        "Dataset B asks for course/student names in some queries while A exclusively references numerical IDs",
        "Dataset B includes negation filters (e.g., 'courses not taught by professor 240') while A only uses positive inclusion criteria",
        "Dataset B queries combine teaching and advising relationships in single questions (e.g., students advised by teachers of specific courses)",
        "Dataset B explicitly tracks program completion status ('students who have not completed their program') while A focuses only on progression phases",
        "Dataset B uses ordinal phase numbering ('Phase 0') alongside named phases ('Pre_Quals') while A uses only descriptive phase labels",
        "Dataset B includes questions about database-wide aggregates ('total professors and students') while A focuses on context-specific counts",
        "Dataset B contains time-based comparisons between professor experience and student years in program that A doesn't address"
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries more frequently ask about specific named entities (e.g., 'student with the highest yearsInProgram') rather than numerical ranges used in A",
        "Dataset B contains queries requesting names of individuals ('What is the name of the person...') while A exclusively uses numerical identifiers",
        "Dataset B includes queries about course levels as standalone attributes ('What is the highest course level') without combining with other filters common in A",
        "Dataset A questions frequently require combining faculty status with course level filters while B doesn't reference faculty membership",
        "Dataset A contains queries about academic phases (e.g., 'pre-phase of qualification') not present in B",
        "Dataset B includes direct professor-student advising relationships ('Which student has been advised...') without year/program context common in A",
        "Dataset A questions use complex range specifications ('person IDs from 40 to 50') while B uses single IDs",
        "Dataset A includes percentage calculations ('Calculate the percentage...') not found in B's count-focused queries",
        "Dataset A queries frequently request position/faculty status information absent in B",
        "Dataset B contains comparative questions about maxima ('student with the highest yearsInProgram') while A focuses on quantitative thresholds ('more than 4')"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B includes references to specific course names (e.g., 'Data Structures and Algorithms'), while Dataset A uses only course IDs or generic descriptors.",
        "Dataset B uses numerical course-level designations (e.g., Level 300, Level_400), whereas Dataset A relies on qualitative terms like 'high-level' or 'medium'.",
        "Dataset B explicitly queries professors based on exact numerical thresholds for experience (e.g., '\u22655 years'), while Dataset A does not specify numerical experience criteria.",
        "Dataset B requests calculations of averages (e.g., 'average number of students per professor'), while Dataset A focuses on counts, percentages, or totals.",
        "Dataset B includes exclusion filters (e.g., 'position other than Faculty_eme' or 'not in first phase'), whereas Dataset A uses only inclusive filters.",
        "Dataset B references student names (e.g., 'names of students'), while Dataset A exclusively uses numerical student IDs.",
        "Dataset B queries professors\u2019 teaching phases (e.g., 'Phase 1'), whereas Dataset A associates phases only with student program progression.",
        "Dataset B asks about students with multiple advisors, a scenario absent in Dataset A.",
        "Dataset B filters professors by professor ID ranges (e.g., '>10 and \u226415'), while Dataset A applies ID ranges to courses or person IDs.",
        "Dataset B includes questions about students entirely lacking advisors (negation), while Dataset A focuses only on students with advisors."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B queries focus on total counts (e.g., 'total number of courses') without requesting specific ID lists in most cases, unlike A which frequently requires paired outputs like 'course ID + level' or 'student ID + professor ID'.",
        "Dataset B lacks questions about faculty membership/positions (e.g., no equivalent to A's 'Is the teacher [...] a faculty member?' or 'position in faculty' requirements).",
        "Dataset B questions use simpler aggregation phrases like 'each professor' or 'the number of' rather than A's comparative thresholds (e.g., 'more than 4', 'less than 10') and percentage calculations.",
        "Dataset B repeats identical phrasings for multiple questions (e.g., 5 instances of 'How many students are currently in the program?'), while A maintains unique question structures throughout.",
        "Dataset B omits multi-condition intersections present in A (e.g., no equivalent to A's 'professors who teach in both harder undergraduate AND graduate courses' requirements).",
        "Dataset B includes student enrollment statistics unrelated to advising relationships (e.g., 'How many students are enrolled?'), whereas A exclusively ties student counts to advisor relationships or program phases.",
        "Dataset B uses numeric course level thresholds (e.g., 'level 500') rather than A's categorical descriptors like 'high-level undergraduate' or 'professional/master' classifications.",
        "Dataset B contains basic existence checks without positional context (e.g., 'courses taught by professor ID 123') where A specifies ranges (e.g., 'course IDs from 121 to 130').",
        "Dataset B includes meta-queries about database contents (e.g., 'total number of students in the database') absent in A's context-specific operational questions.",
        "Dataset B lacks A's emphasis on hierarchical relationships (e.g., no questions about 'students with most advisors' or 'professors who advised advisors') despite both referencing advisee relationships."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in dataset B frequently request specific attributes (e.g., course level) for singular numeric identifiers (e.g., course_id 27) without aggregation or counts, unlike A's focus on multi-entity comparisons or summaries.",
        "Dataset B includes direct references to professor/staff names (e.g., 'Professor Jane', 'Professor Smith'), whereas A exclusively uses numeric identifiers (e.g., professor_id) without named entities.",
        "Queries in B often omit explicit Boolean operators (e.g., 'and', 'or') in favor of simple attribute retrieval, while A consistently combines conditions (e.g., 'basic or medium', 'high-level or harder').",
        "Dataset B contains queries about personal names (e.g., 'What is the name of the professor...'), whereas A never references names and focuses solely on IDs and attributes.",
        "B includes questions about hybrid roles (e.g., 'professor who is also a student'), which are absent in A's strictly separated entity roles (e.g., professors vs. students).",
        "Queries in B lack comparative/superlative phrasing (e.g., 'most', 'highest') present in A (e.g., 'professor who teaches the highest number of courses').",
        "Dataset B uses generic terms like 'person' (e.g., 'person with id 394') where A distinguishes roles explicitly (e.g., 'professor', 'student', 'advisor').",
        "B includes questions about existence or simple relationships (e.g., 'Which professor has advised a student?'), while A quantifies these relationships (e.g., 'how many students... advised by Advisor 5').",
        "Dataset B omits references to academic phases (e.g., 'pre-phase of qualification') and program years as filtering criteria, which are common in A.",
        "Queries in B do not request paired ID-attribute lists (e.g., 'course ID and level') and instead focus on singular attribute retrieval for specific IDs."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B questions never reference student entities or student-advisor relationships (present in A).",
        "Dataset B lacks queries involving statistical calculations (e.g., percentages, ratios) present in A.",
        "Dataset B never uses faculty employment status as a filter criterion (common in A).",
        "Dataset B questions never combine multiple categorical filters (e.g., 'basic or medium undergraduate') in single queries like A.",
        "Dataset B exclusively uses 'courseLevel' values as direct filters (e.g., Level_500), while A uses descriptive phrases (e.g., 'high-level undergraduate').",
        "Dataset B contains queries that use professor attributes (e.g., years in program) as indirect course filters, unlike A.",
        "Dataset B questions never request entity counts with conditional thresholds (e.g., 'more than 4') like A.",
        "Dataset B includes professor identifiers with varied formatting (p_id, professor_id) absent in A's consistent 'professor ID' usage.",
        "Dataset B lacks temporal/phase status filters (e.g., 'pre-phase of qualification') present in A's student-related queries.",
        "Dataset B questions never require multi-component responses (e.g., 'course ID AND level') that A frequently demands."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B questions often omit explicit requests for multiple attributes in answers (e.g., only asking for course IDs rather than 'course ID and level')",
        "Dataset B contains direct professor name references (e.g., 'professor John') while A exclusively uses numerical identifiers",
        "Dataset B queries frequently repeat identical phrasings for similar questions (e.g., multiple instances of 'more than 5 years in program')",
        "Dataset B lacks questions involving percentage calculations or ratio-based aggregations seen in A",
        "Dataset B includes simpler filtering conditions (single thresholds) compared to A's compound conditions (e.g., 'both harder undergrad and master courses')",
        "Dataset B questions never reference student-advisee relationships or student-specific attributes",
        "Dataset B shows inconsistent use of identifier terminology (mixing 'professor ID', 'p_id', and unnamed 'person') unlike A's consistent numerical ID references",
        "Dataset B contains ambiguous parameter references (e.g., 'a certain courseLevel') not found in A's explicitly specified criteria",
        "Dataset B lacks questions about employment positions/faculty roles beyond basic membership that appear in A",
        "Dataset B omits requests for ID range filtering (e.g., 'from 40 to 50') present in multiple A samples"
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B includes explicit requests for SQLite queries (e.g., 'Write a SQLite query...'), while Dataset A uses only natural language questions.",
        "Dataset B uses generic identifiers like 'p_id' and references a 'person' table, whereas Dataset A distinguishes between professor_id, person_id, and student_id explicitly.",
        "Dataset B contains repetitive phrasing for similar questions (e.g., multiple 'Which courses are taught by professors?' variations), while Dataset A uses more diverse wording for distinct query intents.",
        "Dataset A frequently references faculty positions (e.g., 'member of faculty') and employment status, whereas Dataset B omits faculty affiliation filters in most queries.",
        "Dataset A includes granular ranges for IDs (e.g., 'person IDs from 40 to 50'), while Dataset B uses singular IDs (e.g., 'course_id 1') without ranges.",
        "Dataset A explicitly asks for combined difficulty tiers (e.g., 'high-level or harder'), while Dataset B queries single-tier categories (e.g., 'advanced level') exclusively.",
        "Dataset A requires percentage calculations (e.g., 'Calculate the percentage of...'), whereas Dataset B focuses solely on absolute counts.",
        "Dataset B uses the column name 'inPhase' directly in queries, while Dataset A refers to phases descriptively (e.g., 'pre-phase of qualification') without column references.",
        "Dataset A includes multi-condition joins (e.g., professors teaching in both undergraduate and graduate courses), while Dataset B focuses on simpler single-table or single-join queries.",
        "Dataset A emphasizes advisor-student relationships with specific program years (e.g., '12th years of program'), while Dataset B uses vague references like 'student X' or 'another person in the program'."
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries include explicit references to list IDs (e.g., list ID 157292) while Dataset A does not mention list IDs in any samples.",
        "Dataset B queries utilize exact UTC timestamps (e.g., '2012-11-13 00:00:00 UTC') for time filters, whereas Dataset A uses date ranges without timestamps.",
        "Dataset B includes queries about list titles containing specific text patterns (e.g., 'title contains the word 's''), a feature absent in Dataset A.",
        "Dataset B explicitly references the 'lists' table in queries (e.g., 'from ''lists'' table'), while Dataset A does not mention table names.",
        "Dataset B queries focus on users' payment method status rather than trial eligibility when filtering ratings, unlike Dataset A which uses both trialist and subscriber statuses.",
        "Dataset B includes percentage calculations focused on list attributes (e.g., '% of lists created by trialists'), whereas Dataset A calculates percentages for rating scores (e.g., '% of ratings with highest score').",
        "Dataset B queries filter lists by numeric movie count thresholds (e.g., 'more than 25 movies in them'), while Dataset A filters lists by temporal criteria (e.g., 'updated 10 years after creation').",
        "Dataset B explicitly sorts results in descending order of followers (e.g., 'descending order of followers'), a sorting specificity not present in Dataset A.",
        "Dataset B queries reference the 'DVD list' as a distinct list category, a granular list type distinction absent in Dataset A.",
        "Dataset B includes queries about users who are both trialists and subscribers simultaneously (e.g., 'both a trialist and a subscriber'), a combined status not referenced in Dataset A."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Queries in B focus on single-condition filtering (e.g., release year alone) rather than combined status checks (subscription + rating + year)",
        "B lacks references to user subscription/trial status in filtering conditions present in all A samples",
        "Dataset B queries use simpler aggregations (count/max) without combined metric calculations (e.g., percentage of highest scores)",
        "B requests complete list metadata (titles + creation dates) without update timestamp comparisons found in A",
        "User ID usage in B is limited to list creation counts rather than complex profile/rating history analysis",
        "B contains explicit table references ('lists table', 'ratings_users') absent in A's abstracted queries",
        "Time analysis in B focuses on absolute dates rather than relative time spans (e.g., '10 years after creation')",
        "B queries frequently request total follower counts without follower thresholds or temporal constraints",
        "Dataset B lacks URL requests tied to specific user actions (ratings/list creations) present in A",
        "B contains multiple redundant phrasing variations of the same question type (highest rating/popularity) unlike A's diverse structures"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B questions focus more on list popularity metrics (e.g., 'most followers', 'highest number of followers') rather than combining list metrics with user subscription status like Dataset A",
        "Dataset B includes explicit filtering of list titles by string patterns (e.g., 'contains the word 'Avengers'') while Dataset A does not",
        "Dataset B questions frequently request top-N rankings (e.g., 'top 3', 'top 5') as explicit output requirements, unlike Dataset A",
        "Dataset B contains questions about movies being present in lists as a binary condition (e.g., 'have been added to the lists') rather than counting list contents as in Dataset A",
        "Dataset B uses generalized subscription status references ('subscribers', 'payment method') instead of Dataset A's specific distinction between trial/paying status in most questions",
        "Dataset B includes questions about directors' productivity thresholds (e.g., 'directed at least 10 movies') that don't appear in Dataset A",
        "Dataset B asks for average ratings in list contexts (e.g., 'average rating of movies in the list') while Dataset A focuses on averages tied to user subscription status",
        "Dataset B questions reference predefined list categories (e.g., 'Mubi's Top Lists') not mentioned in Dataset A",
        "Dataset B contains compound subscription status conditions (e.g., 'both trial and paid subscription') not found in Dataset A's binary status distinctions",
        "Dataset B uses numeric rating thresholds (e.g., 'rating score greater than 8') rather than Dataset A's explicit rating scores (e.g., 'score of 4')"
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B includes questions about movie genres (e.g., \"How many different genres of movies are there?\"), which are absent in dataset A.",
        "Dataset B omits references to user subscription status (e.g., trial/paying) during specific actions like rating or list creation, unlike dataset A.",
        "Dataset B does not require extracting metadata such as URLs, cover images, or profile images (e.g., user profile image URLs), which are common in dataset A.",
        "Dataset B focuses on simpler aggregations (e.g., \"average rating score for all movies\") without combining multiple numerical thresholds (e.g., \"popularity > 400 but < 500\") in a single query, unlike dataset A.",
        "Dataset B includes questions about list viewership metrics (e.g., \"most viewed lists\"), while dataset A emphasizes list update timestamps, user eligibility during creation, or age-based list activity (e.g., \"updated 10 years after creation\").",
        "Dataset B lacks questions involving percentage calculations (e.g., \"percentage of ratings with the highest score\") present in dataset A.",
        "Dataset B does not reference critic interactions (e.g., likes on critic ratings) beyond basic rating scores, unlike dataset A, which explicitly ties ratings to critic likes.",
        "Dataset B uses simpler date filters (e.g., \"after 2000\", \"released in 2020\") without granular date ranges (e.g., \"1/1/2017 to 12/31/2017\") or relative timeframes (e.g., \"10 years after creation\") seen in dataset A.",
        "Dataset B includes queries about movie categories (e.g., \"movies in each category\"), absent in dataset A, which focuses on user-generated lists instead.",
        "Dataset B omits questions requiring multi-part answers (e.g., combining director names, release years, and average ratings in one response), focusing on single-metric outputs or lists."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries are structurally simpler, often requesting single attributes (e.g., movie title) without multi-part combinations seen in A's questions like 'Indicate...' clauses",
        "Dataset B never requests URLs (profile images, movie covers, or rating URLs) unlike A which frequently includes URL retrieval requirements",
        "Dataset B uses broader temporal constraints (e.g., 'before 2014') rather than A's precise date ranges (e.g., '1/1/2017 to 12/31/2017')",
        "Dataset B never combines aggregations with metadata requirements (e.g., 'how many... and what is the image URL') as seen in A's multi-output questions",
        "Dataset B excludes user eligibility status filters (trial/paying subscriber status) present in 40% of A's samples",
        "Dataset B never requests percentage calculations ('how many percentage of ratings') seen in A",
        "Dataset B lacks references to external entities like critics/likes ('likes did the critic receive') present in A",
        "Dataset B shows repetitive question patterns (8/30 samples repeat 'movie with highest rating score') unlike A's varied phrasing",
        "Dataset B never uses non-ASCII characters or complex movie titles with colons/subtitles seen in A's samples",
        "Dataset B never requests user IDs in answers (e.g., 'indicate user ids') unlike A's explicit ID requirements"
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries often request specific list or movie titles by ID (e.g., 'list ID 134607', 'movie id 152761'), while A uses descriptive identifiers like user IDs or movie titles.",
        "Dataset B includes explicit references to database tables in queries (e.g., 'in the `lists` table', 'ratings table'), whereas A does not mention tables directly.",
        "Dataset B asks for thematic/genre-based filters (e.g., 'horror movies', 'Christianity as their theme'), while A lacks genre-specific queries.",
        "Dataset B focuses on singular metrics (e.g., 'most popular movie', 'highest rating') without combining multiple criteria in a single question, unlike A's multi-part queries.",
        "Dataset B uses simpler temporal filters (e.g., 'in the last year') without date-range calculations (e.g., '10 years after creation' in A).",
        "Dataset B does not reference trial eligibility statuses like 'trialist' or 'eligible for trial', whereas A explicitly includes these terms.",
        "Dataset B queries list followers or popularity without additional user-status conditions (e.g., 'list with the most followers'), while A often ties followers to user subscription statuses.",
        "Dataset B lacks percentage-based aggregations (e.g., 'how many percentage of the ratings') present in A.",
        "Dataset B does not request URLs for user profile images or cover images, unlike A's frequent URL inquiries.",
        "Dataset B uses phrases like 'top 5 movies' or 'top 5 rated movies' for rankings, whereas A uses comparative terms like 'highest movie popularity of all time' with contextual filters."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B queries explicitly reference database table names (e.g., 'lists', 'Ratings', 'lists_users') in their criteria, while Dataset A never mentions tables",
        "Dataset B queries focus on basic popularity/rating metrics without layered conditions (e.g., 'highest rating' standalone vs A's 'highest popularity movie of all time with trialist ratings')",
        "Dataset B lacks queries requiring percentage calculations or score distributions (e.g., no equivalent to A's 'percentage of ratings with highest score')",
        "Dataset B queries use simpler date filters (e.g., '> 2021-01-01') compared to A's complex temporal logic (e.g., 'still updated 10 years after creation')",
        "Dataset B queries never request URLs for movie images/ratings/profiles as output criteria, unlike Dataset A's frequent URL requirements",
        "Dataset B includes genre-based filtering (e.g., 'Horror' genre) which never appears in Dataset A queries",
        "Dataset B queries lack multi-part output requirements (e.g., A's 'director + release date + average rating' vs B's singular movie titles)",
        "Dataset B uses generic user identifiers (e.g., 'user1') while Dataset A employs specific numerical user IDs (e.g., 4208563)",
        "Dataset B queries never involve critic-related metrics (e.g., 'likes on critic reviews') that appear in Dataset A",
        "Dataset B lacks queries about list metadata dynamics (e.g., no equivalent to A's 'followers of lists created in 2009' or update frequency analysis)"
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B does not include requests for URLs (e.g., image URLs, user profile URLs, or rating URLs) in its queries, unlike Dataset A.",
        "Dataset B does not reference trialist status or eligibility (e.g., 'eligible for trial' or 'on a trialist') in user-related conditions.",
        "Dataset B does not ask for percentage-based metrics (e.g., 'percentage of ratings with highest score'), focusing instead on counts or averages.",
        "Dataset B does not query user-specific follower counts (e.g., 'users with >100 followers') but focuses on list follower counts.",
        "Dataset B does not include multi-part questions requiring combined outputs (e.g., director names + release dates + average ratings in a single query).",
        "Dataset B does not reference interactions with critics (e.g., 'likes related to the critic' or 'critic received the highest amount of likes').",
        "Dataset B does not use movie popularity metrics (e.g., 'popularity > 400') as a filter, relying instead on rating scores and release years.",
        "Dataset B does not include temporal filters tied to user actions (e.g., 'users who rated after 2014') but only to movie release dates.",
        "Dataset B does not use movie IDs (e.g., 'movie id 1103') as identifiers, relying solely on titles or user_ids.",
        "Dataset B does not request metadata about user-created lists beyond follower counts (e.g., no cover images or update timelines like 'updated 10 years after creation')."
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B questions exclusively use uppercase for category names (e.g., 'PERSONALIZATION', 'FAMILY'), while A uses mixed case formatting",
        "Dataset B focuses more on category-level aggregations (e.g., 'average rating of apps in EDUCATION category'), whereas A emphasizes app-specific sentiment metrics",
        "Dataset B lacks references to sentiment subjectivity/objectivity scores present in A's questions (e.g., 'sentiment objectivity of 0.3')",
        "Dataset B contains no questions about app update status/version information that appears in A's samples",
        "Dataset B questions never mention translated user reviews/comments analysis that's prominent in A's queries",
        "Dataset B emphasizes category comparison questions (e.g., 'top 3 categories with highest rating') absent in A",
        "Dataset B uses simpler percentage calculations (e.g., '% of games with X installs') compared to A's complex sentiment ratios",
        "Dataset B questions lack multi-genre analysis present in A (e.g., 'apps with multiple genres')",
        "Dataset B contains no questions about app metadata relationships to sentiment (e.g., 'size + favorability' queries in A)",
        "Dataset B focuses more on install count thresholds as primary filters, while A combines installs with complex sentiment criteria"
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries do not mention sentiment polarity or subjectivity scores, which are present in all Dataset A samples.",
        "Dataset B does not reference translated reviews, unlike Dataset A which includes requests for translated reviews in multiple queries.",
        "Dataset B does not filter or analyze apps based on specific keywords in user comments (e.g., \"gr8\"), a feature present in Dataset A.",
        "Dataset B does not query about app versions or age group targeting, both of which appear in Dataset A questions.",
        "Dataset B includes explicit references to database structures (e.g., \"playstore table\"), while Dataset A does not mention technical schema elements.",
        "Dataset B contains queries about apps with 'None' ratings (e.g., \"rating of 'None'\"), a scenario absent in Dataset A samples.",
        "Dataset B focuses exclusively on single-category filters (e.g., \"GAME category\") without combining multiple genres, unlike Dataset A which asks about multi-genre apps.",
        "Dataset B includes requests to list positive reviews (\"list one positive review\"), while Dataset A focuses on sentiment percentages rather than specific review content.",
        "Dataset B queries combine average ratings with maximum review counts (e.g., \"average rating ... and maximum number of reviews\"), a dual-metric pattern not seen in Dataset A.",
        "Dataset B does not reference app price comparisons or cost-related filters (e.g., \"most expensive app\"), which appear in Dataset A queries."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B questions focus on aggregating data across categories (e.g., 'average rating of apps from the Tools category') while A includes direct app-specific metric requests (e.g., 'size of Browser 4G').",
        "Dataset B frequently combines genre filters with popularity metrics (e.g., 'action genre with a rating of 4.5 or higher and 1,000+ installs') without mentioning granular sentiment attributes, unlike A's focus on sentiment subjectivity/objectivity scores.",
        "Dataset B structures results hierarchically (e.g., 'top 5 categories... list apps under each category'), while A lists apps or metrics without categorical grouping.",
        "Dataset B uses phrases like 'most popular' as a standalone criterion, whereas A explicitly defines popularity via install ranges or review counts.",
        "Dataset B emphasizes category-wide comparisons (e.g., 'top 3 apps in the Simulation category by installs'), while A rarely groups results by genre unless explicitly stated.",
        "Dataset B avoids references to app technical details (e.g., size, version numbers, translated reviews) present in A (e.g., 'current version', 'translated review').",
        "Dataset B prioritizes install-count thresholds (e.g., '10 million installs') as standalone filters, while A often pairs install ranges with sentiment or version constraints.",
        "Dataset B lacks questions about sentiment polarity ratios (e.g., 'percentage ratio between positive and negative sentiments') seen in A.",
        "Dataset B focuses on uniform rating thresholds (e.g., '4.5 or higher') without combining them with non-rating attributes like A's 'rating 3.9 + translated review' queries.",
        "Dataset B omits references to user sentiment granularity (e.g., 'neutral attitude', 'sentiment objectivity scores') present in A."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B focuses on average ratings and counts without incorporating sentiment polarity or subjectivity scores, which are central to Dataset A.",
        "Queries in Dataset B use explicit category labels (e.g., 'Tools,' 'Family') for filtering, while Dataset A uses genres (e.g., 'Games,' 'Arcade').",
        "Dataset A includes granular textual analysis (e.g., 'translated reviews,' 'comments containing \"gr8\"'), absent in Dataset B.",
        "Dataset B queries are structured around SQL-like aggregation syntax (e.g., 'Write a SQL query...'), whereas Dataset A uses natural language without explicit SQL references.",
        "Dataset A incorporates app metadata like content rating (e.g., 'Everyone 10+'), update status, and version numbers, which are absent in Dataset B.",
        "Dataset A combines numerical thresholds (e.g., installs, reviews) with sentiment metrics (e.g., polarity, subjectivity), while Dataset B focuses purely on numerical thresholds and ratings.",
        "Dataset A explicitly requests percentages (e.g., 'percentage ratio between positive and negative sentiments'), whereas Dataset B emphasizes averages and totals.",
        "Dataset B queries are simpler, often filtering by a single category and numerical threshold, while Dataset A frequently combines multiple complex criteria (e.g., sentiment scores + installs + update year).",
        "Dataset A includes app-specific attributes like size, price, and age groups (e.g., 'targeted at teenagers'), which are absent in Dataset B.",
        "Dataset B does not reference user feedback text (e.g., 'translated reviews,' 'neutral comments') or sentiment objectivity/subjectivity, unlike Dataset A."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B questions focus primarily on average ratings and review counts without mentioning sentiment analysis metrics like polarity or subjectivity scores, which are present in all Dataset A questions.",
        "Dataset B queries do not reference app versions or update timelines, whereas all Dataset A samples include version-specific or temporal update criteria (e.g., 'not been updated since 2018').",
        "Dataset B questions use standardized uppercase category labels (e.g., 'BUSINESS') while Dataset A uses mixed-case genre labels (e.g., 'arcade genre') and explicit genre definitions.",
        "Dataset B lacks queries about sentiment objectivity/neutrality thresholds (e.g., 'sentiment objectivity of 0.3') that appear in all Dataset A samples.",
        "Dataset B questions never request percentage calculations or ratio comparisons between metrics (e.g., 'percentage ratio between positive sentiments and negative sentiments'), unlike Dataset A.",
        "Dataset B exclusively uses simple numerical thresholds ('rating more than 4.0') while Dataset A employs complex combined filters (e.g., 'rating 4.5 and above + size constraints + update year') across all samples.",
        "Dataset B questions do not mention translated reviews or language-specific analysis present in multiple Dataset A queries (e.g., 'translated review of each app').",
        "Dataset B lacks queries about apps with multiple genres/categories, while Dataset A explicitly asks about multi-genre apps in multiple samples.",
        "Dataset B questions show repetitive structural patterns (e.g., repeated 'average rating of apps in [CATEGORY]' format) not seen in Dataset A's varied query structures.",
        "Dataset B never references specific comment content analysis (e.g., 'comments that have \"gr8\"') that appears in Dataset A queries."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries focus primarily on average ratings across categories without combining with other metrics (e.g., sentiment, installs, or updates).",
        "Dataset A includes explicit references to sentiment analysis metrics (polarity, subjectivity, objectivity) while B does not.",
        "Dataset A contains queries about translated reviews or specific comment content (e.g., 'translated review', 'gr8 in comments'), which are absent in B.",
        "Dataset B uses simpler attribute combinations (e.g., category + rating) compared to A's multi-attribute cross-references (e.g., category + rating + installs + sentiment).",
        "Dataset A includes queries about app size and price thresholds (e.g., 'size no more than 1.0 M'), while B does not.",
        "Dataset A requires percentage/ratio calculations (e.g., 'percentage ratio between positive/negative sentiments'), whereas B focuses on basic averages and counts.",
        "Dataset A queries explicitly reference app version numbers and update years, while B lacks version/update metadata requirements.",
        "Dataset B questions are formulaic/repetitive in structure (e.g., repeated 'average rating of X category' patterns), unlike A's varied phrasing.",
        "Dataset A includes comment sentiment categorization (negative/neutral/positive counts), while B only references install counts.",
        "Dataset A uses granular numerical thresholds for non-install metrics (e.g., 'sentiment objectivity \u22640.5'), whereas B only applies numerical thresholds to installs."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Queries in Dataset B focus exclusively on average ratings, while Dataset A includes other sentiment metrics (subjectivity, objectivity) and user comments analysis.",
        "Dataset B queries frequently reference the 'PlayStore' as a specific platform, whereas Dataset A does not mention a platform explicitly.",
        "Dataset A includes questions about app versioning and update timelines (e.g., 'not been updated since 2018'), while B contains no temporal version-related criteria.",
        "Dataset A queries often combine multiple complex conditions (e.g., sentiment thresholds + downloads + category) in single questions, while B uses simpler single-dimension filters.",
        "Dataset B consistently uses the term 'category' for app classification, while Dataset A interchangeably uses 'genre' and 'category'.",
        "Dataset A includes questions about percentage calculations and ratio comparisons (e.g., 'percentage ratio between positive and negative sentiments'), which are absent in B.",
        "Dataset B queries show repetitive patterns focusing on ranking phrases like 'top 5 most popular apps', while A demonstrates more varied question structures.",
        "Dataset A contains questions about app size specifications (e.g., 'size no more than 1.0 M'), which never appear in Dataset B.",
        "Dataset B uses simplified numerical formats (e.g., '10 million installs') while A maintains exact numerical formats (e.g., '100,000,000+ installs').",
        "Dataset A includes explicit requests for textual content analysis (e.g., 'translated review', 'comments containing \"gr8\"'), while B focuses purely on numerical metrics."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B queries often focus on singular metrics (e.g., 'average rating') without combining multiple attributes (e.g., polarity + installs) as in Dataset A.",
        "Dataset B uses simpler numerical thresholds (e.g., 'rating greater than 4.0') compared to Dataset A's multi-part thresholds (e.g., 'size no more than 1.0 M AND rating 4.5+').",
        "Dataset B lacks references to granular sentiment metrics (e.g., polarity, subjectivity scores) and relies on binary terms like 'positive/negative sentiment.'",
        "Dataset B does not include queries about temporal conditions (e.g., 'not updated since 2018') or app versions, which are common in Dataset A.",
        "Dataset B questions frequently repeat identical or near-identical phrasing (e.g., multiple iterations of 'average rating of apps in the Games category'), unlike Dataset A's diverse phrasing.",
        "Dataset B omits queries about translated reviews, multilingual content, or review text analysis (e.g., 'translated review,' 'comments containing \"gr8\"'), which Dataset A includes.",
        "Dataset B does not ask for combined metadata comparisons (e.g., 'apps with multiple genres and total sentiment subjectivity') seen in Dataset A.",
        "Dataset B lacks questions about sentiment objectivity (e.g., 'sentiment objectivity of 0.3') or neutral sentiment quantification, which Dataset A explicitly references.",
        "Dataset B avoids queries involving percentage ratios (e.g., 'percentage ratio between positive and negative sentiments') common in Dataset A.",
        "Dataset B does not request app-specific content ratings (e.g., 'Everyone 10+') alongside other metrics, unlike Dataset A's combined queries (e.g., 'rating + age group')."
      ]
    }
  },
  "diffs_real_from_synth": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries requesting percentage-based calculations (e.g., 'Calculate the percentage of high-level undergraduate course') while A focuses only on absolute counts.",
        "Dataset B explicitly references numerical ranges for IDs (e.g., 'person IDs from 40 to 50') in filtering conditions, whereas A uses single numerical IDs.",
        "Dataset B asks for combined criteria in course classifications (e.g., 'professional or master/graduate courses'), while A uses simpler level-based categorizations like 'basic/medium/high-level'.",
        "Dataset B includes queries about professors teaching in multiple distinct course categories (e.g., 'harder undergraduate and master/graduate courses'), whereas A focuses on single category overlaps.",
        "Dataset B uses the term 'person ID' as a generic identifier for both professors and students, while A maintains separate 'professor ID' and 'student ID' fields.",
        "Dataset B specifies output formatting requirements (e.g., 'State the course ID and level', 'List down...') more explicitly than A.",
        "Dataset B includes queries about subsets using terms like 'among' (e.g., 'Among the students...') for nested filtering, while A uses direct filters without subset emphasis.",
        "Dataset B requires listing professors' unique identifiers alongside aggregated results (e.g., 'Indicate each...professors unique identifying number'), which A does not specify.",
        "Dataset B contains requests for partial results (e.g., 'List any five of course IDs'), whereas A queries always return full result sets.",
        "Dataset B uses hybrid academic classifications (e.g., 'master/graduate courses') and explicit phase/year combinations not seen in A's simpler phase/year references."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B questions frequently request combining multiple criteria in a single query (e.g., 'basic or medium undergraduate courses taught by faculty members') while Dataset A uses singular filters.",
        "Dataset B includes queries about faculty employment status (e.g., 'faculty member,' 'faculty employee') as a filtering condition, which Dataset A does not reference.",
        "Dataset B explicitly asks for percentage-based calculations (e.g., 'percentage of high-level undergraduate course'), whereas Dataset A focuses only on absolute counts or totals.",
        "Dataset B contains queries requiring output formatting specifications (e.g., 'State the course ID and the level,' 'List out all the course id'), while Dataset A requests unstructured answers.",
        "Dataset B uses range-based filters (e.g., 'person IDs from 40 to 50,' 'course ID from 121 to 130'), which are absent in Dataset A.",
        "Dataset B references academic phases or qualification stages (e.g., 'pre-phase of qualification,' 'in phase status'), whereas Dataset A focuses only on years in the program.",
        "Dataset B includes compound course classifications (e.g., 'professional or master/undergraduate courses,' 'harder undergraduate course') instead of simple level labels like Dataset A.",
        "Dataset B explicitly asks for positional roles within the faculty (e.g., 'position in the faculty of the professor'), which Dataset A does not mention.",
        "Dataset B requires comparisons between faculty/non-faculty professors (e.g., 'faculty affiliated professor'), while Dataset A treats all professors uniformly.",
        "Dataset B includes queries with explicit output limitations (e.g., 'List any five of course IDs'), whereas Dataset A requests complete results without constraints."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B explicitly requests both course IDs and their levels in the same query (e.g., 'State course ID and level'), while A often asks for IDs alone.",
        "Dataset B uses 'teacher' interchangeably with 'professor' in questions (e.g., 'Is the teacher... a faculty member?'), whereas A exclusively uses 'professor.'",
        "Dataset B includes explicit numerical output limits (e.g., 'List any five course IDs'), while A queries for all results without such constraints.",
        "Dataset B specifies 'professional courses' as a distinct category for counting/listing (e.g., 'total of professional courses'), whereas A focuses on course levels/types without this explicit label.",
        "Dataset B uses phrases like 'harder undergraduate course' to modify course levels, a granularity not present in A's level descriptions (e.g., 'high-level').",
        "Dataset B includes queries about advisors by direct ID reference (e.g., 'Advisor 5'), while A refers to advisors indirectly via their teaching assignments or student relationships.",
        "Dataset B explicitly asks for percentages (e.g., 'percentage of high-level undergraduate courses'), whereas A focuses on counts, averages, or totals without percentage calculations.",
        "Dataset B introduces phases like 'pre-phase of qualification' in student criteria, which are absent in A's phase-related queries (e.g., 'master/graduate phase').",
        "Dataset B combines 'advised student IDs' with 'employing professor IDs' in single outputs (e.g., 'advised student IDs and IDs of employing professor'), a pairing not seen in A.",
        "Dataset B uses compound numerical constraints like 'no more than two' or 'more than 4' in filtering thresholds, whereas A typically uses open-ended ranges (e.g., 'greater than 10')."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B includes queries asking for percentage calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), which are absent in Dataset A.",
        "Dataset B explicitly references numerical ID ranges (e.g., 'person IDs from 40 to 50') in filtering conditions, while Dataset A uses single IDs or categorical thresholds.",
        "Dataset B uses the term 'person IDs' for entity identification, whereas Dataset A specifies role-based identifiers like 'professor IDs' or 'student IDs'.",
        "Dataset B contains questions about faculty positions (e.g., 'position in the faculty'), while Dataset A focuses on general faculty affiliations or roles without hierarchical details.",
        "Dataset B includes requests to 'describe' attributes (e.g., 'Describe the year in program...'), whereas Dataset A focuses solely on numerical/list outputs.",
        "Dataset B introduces the concept of 'pre-phase of qualification' for students, a program phase not referenced in Dataset A.",
        "Dataset B frequently combines course level filters with ID ranges (e.g., 'course ID from 121 to 130 of basic undergraduate courses'), while Dataset A separates these criteria.",
        "Dataset B uses the term 'teachers' interchangeably with 'professors,' while Dataset A exclusively uses 'professors.'",
        "Dataset B includes queries about advisors responsible for entire student cohorts (e.g., 'advising all the students in 1st year'), whereas Dataset A focuses on individual advisor-advisee relationships.",
        "Dataset B explicitly requests partial results (e.g., 'List any five of course IDs...'), while Dataset A requires complete lists or counts."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in B explicitly request aggregation of results with terms like 'most number of', 'highest number of', or 'total', while A focuses on direct counts without comparative aggregation.",
        "B includes requests for calculations involving percentages (e.g., 'Calculate the percentage of high-level undergraduate course'), whereas A only asks for raw counts.",
        "B uses explicit ID ranges (e.g., 'person IDs from 40 to 50', 'course ID from 121 to 130') in filters, while A queries target single IDs or unspecified ranges.",
        "B contains queries requiring multi-attribute output pairs beyond simple ID-attribute pairs (e.g., 'course ID and the level of the course' combined with professor counts), whereas A lists only basic ID-attribute pairs.",
        "B specifies granular course types (e.g., 'professional courses', 'master/undergraduate courses') alongside levels, while A uses only generic course levels like 'basic' or 'high-level'.",
        "B includes explicit requests for faculty-affiliated professors (e.g., 'professor who is currently the member of faculty'), whereas A refers to faculty status more generally (e.g., 'faculty member').",
        "B uses compound Boolean conditions (e.g., 'teaches in both harder undergraduate course and master/graduate courses'), while A employs simpler Boolean logic like 'basic or medium'.",
        "B explicitly targets professors who are advisors (e.g., 'advisors who gave advice to student with ID 376'), whereas A treats advisors and professors as separate entities.",
        "B includes queries with result limits (e.g., 'List any five of course IDs'), while A does not restrict output quantities.",
        "B requires dynamic phase-based filtering (e.g., 'undergoing the pre-phase of qualification'), whereas A uses static academic year filters like '5th year'."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes aggregation functions (e.g., 'how many', 'calculate the percentage', 'total') not present in A",
        "Dataset B requires explicit listing of multiple attributes in responses (e.g., 'State course ID and level') while A focuses on single attributes",
        "Dataset B uses range-based queries (e.g., 'person IDs from 40 to 50', 'course ID from 121 to 130') unlike A's exact single-ID references",
        "Dataset B introduces faculty membership status as a query parameter (e.g., 'faculty member', 'faculty employee') absent in A",
        "Dataset B includes threshold comparisons for quantities (e.g., 'more than 4', 'no more than two') while A only uses simple existence checks",
        "Dataset B contains questions about student academic progression phases (e.g., 'pre-phase of qualification', '5th year') not referenced in A",
        "Dataset B queries hierarchical course categorizations (e.g., 'basic or medium', 'professional or master/undergraduate') beyond A's single-level filters",
        "Dataset B requires positional faculty status responses (e.g., 'position in the faculty') not present in A's role-based questions",
        "Dataset B employs composite logical conditions (e.g., 'both harder undergraduate and master courses') where A uses singular criteria",
        "Dataset B includes explicit result limitations (e.g., 'List any five') while A always requests complete lists"
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B queries include requests for percentage calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), absent in A",
        "Dataset B explicitly asks for yes/no verification questions (e.g., 'Is the teacher... a faculty member?'), which A lacks",
        "Dataset B includes queries about student-advisor relationships (e.g., 'students being advised by Advisor 5'), while A focuses solely on professors and courses",
        "Dataset B references positional faculty roles (e.g., 'position in the faculty'), whereas A only mentions general faculty membership",
        "Dataset B specifies ID ranges (e.g., 'person IDs from 40 to 50') in filtering, while A uses only exact numerical identifiers",
        "Dataset B explicitly requests paired attribute listings (e.g., 'course ID and the level'), whereas A typically requests single attributes",
        "Dataset B includes queries about employment status of professors (e.g., 'faculty employees'), while A only references general faculty status",
        "Dataset B uses comparative thresholds in counting (e.g., 'no more than two'), whereas A uses only minimum thresholds ('at least X')",
        "Dataset B contains queries about professional course categorization (e.g., 'professional courses'), which A never references",
        "Dataset B explicitly limits result quantities (e.g., 'List any five'), while A always requests complete lists"
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B includes queries requiring percentage calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), whereas A focuses solely on direct counts.",
        "Queries in B frequently combine multiple conditional tiers (e.g., 'basic or medium undergraduate courses taught by faculty'), while A uses simpler single-tier filters like 'undergraduate' or 'graduate'.",
        "Dataset B explicitly references ranges of identifiers (e.g., 'person IDs from 40 to 50'), while A queries target specific individual IDs (e.g., 'p_id 1').",
        "B includes requests to list both identifiers and attributes simultaneously (e.g., 'course ID and the level'), whereas A often isolates single attributes (e.g., 'courseLevel').",
        "Queries in B involve comparative thresholds (e.g., 'advised more than 4 others'), while A uses fixed thresholds like 'more than 5 years in the program' without comparisons.",
        "Dataset B uses compound difficulty classifications (e.g., 'high-level or harder undergraduate courses'), while A references standalone tiers like 'basic' or 'advanced'.",
        "B includes explicit references to faculty positions (e.g., 'position in the faculty') as distinct attributes, whereas A mentions faculty affiliation more generally.",
        "Queries in B require aggregations across hybrid categories (e.g., 'professional or master/undergraduate courses'), unlike A\u2019s focus on single categories like 'graduate level'.",
        "Dataset B asks for cross-tier validation (e.g., 'teach in both harder undergraduate and master/graduate courses'), which A does not require.",
        "B includes queries about advisor-student relationships tied to specific program years (e.g., '12th years of program'), whereas A references broader phases like 'Phase 3'."
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes references to external URLs (e.g., user profile images, movie cover images) in queries, while Dataset A does not.",
        "Dataset B explicitly tracks and queries movie critic interactions (e.g., likes on critics' ratings), absent in Dataset A.",
        "Dataset B uses MM/DD/YYYY date formats for time-based filters, whereas Dataset A uses ISO timestamps (YYYY-MM-DD).",
        "Dataset B includes popularity thresholds (e.g., popularity > 400 and < 500) as query criteria, not seen in Dataset A.",
        "Dataset B references list update timestamps (e.g., 'most recently updated'), while Dataset A only references creation timestamps.",
        "Dataset B combines user eligibility status (e.g., trialist) with specific actions (e.g., rating a movie) in queries, whereas Dataset A treats statuses as static attributes.",
        "Dataset B requests percentage calculations for specific rating scores (e.g., 'percentage of ratings with the highest score'), while Dataset A uses percentages for user/list metrics.",
        "Dataset B includes movie release years as standalone query criteria (e.g., 'released in 1924'), whereas Dataset A ties years to user/list events.",
        "Dataset B explicitly asks for director-movie popularity associations (e.g., 'director of the most popular movie'), while Dataset A focuses on directors in isolation.",
        "Dataset B queries user follower counts tied to temporal constraints (e.g., 'users with >100 followers in lists created in 2009'), unlike Dataset A's general follower queries."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Queries in B explicitly reference user subscription statuses (e.g., 'paying subscribers', 'eligible for trial') as filtering conditions",
        "Dataset B includes requests for critic-related metrics (e.g., likes on critic reviews) not present in A",
        "B requires percentage calculations (e.g., 'percentage of ratings were rated with highest score') while A uses absolute counts",
        "B specifies exact date ranges (e.g., '1/1/2017 to 12/31/2017') rather than year-only filters like A",
        "B requests URLs to user profile images/cover images (e.g., 'cover image of the user') whereas A only references movie/user profile URLs",
        "B includes ordinal position requirements (e.g., 'third movie directed by') while A focuses on maximums/totals",
        "B combines multiple distinct metrics in single responses (e.g., director name + release year + average rating) more frequently than A",
        "B contains explicit references to platform-specific entities like 'Mubi' in URLs and context",
        "B uses thresholds on list contents (e.g., 'lists with at least 200 movies') as conditions, unlike A's follower-based thresholds",
        "B requires temporal validation of statuses (e.g., 'still updated 10 years after creation') rather than simple timestamp retrieval in A"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B includes explicit references to 'likes' related to critic reviews or user ratings, which are absent in Dataset A.",
        "Dataset B asks for percentage-based calculations (e.g., 'percentage of ratings with highest score'), while Dataset A focuses on absolute counts or averages.",
        "Dataset B frequently combines multiple distinct data points in a single query (e.g., director name, release year, and trialist average score), whereas Dataset A typically isolates metrics per question.",
        "Dataset B specifies exact date ranges (e.g., '1/1/2017 to 12/31/2017') more granularly, while Dataset A uses relative timeframes like 'last month' or years.",
        "Dataset B includes questions about user eligibility status (e.g., 'eligible for trial') during specific actions, whereas Dataset A refers to subscription states (trial/paying) without explicit eligibility checks.",
        "Dataset B references 'critic' interactions (e.g., likes on critic reviews), which are not mentioned in Dataset A.",
        "Dataset B explicitly requests boolean confirmation (e.g., 'Was the user... eligible for trial?') paired with additional metrics, a format absent in Dataset A.",
        "Dataset B asks for URLs tied to specific user actions (e.g., 'URL to the user profile image of the user who gave a 5 rating'), whereas Dataset A focuses on general movie or profile URLs.",
        "Dataset B includes hybrid popularity-rating thresholds (e.g., 'popularity of more than 400 but less than 500'), while Dataset A uses standalone popularity rankings or score ranges.",
        "Dataset B requires identifying users or directors based on combined conditions (e.g., 'directed at least 10 movies between 1960 to 1985'), whereas Dataset A often uses simpler attribute filters."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B includes specific references to platform-specific metadata (e.g., Mubi URLs, user profile images), while A does not mention platform-specific identifiers or URLs.",
        "Dataset B explicitly segments user statuses (e.g., \"eligible for trial,\" \"paying subscribers\") in queries, whereas A refers to generic user types (e.g., \"trial/paying status\").",
        "Dataset B combines multiple criteria (e.g., popularity ranges, rating thresholds, and user status) in single questions, while A often isolates these conditions.",
        "Dataset B frequently requests both quantitative results (e.g., counts) and qualitative data (e.g., movie titles, director names) in the same query, unlike A, which often separates these into distinct questions.",
        "Dataset B includes questions about time-based user activity (e.g., lists updated \"10 years after creation\"), while A focuses on timestamps (e.g., creation/update dates) without temporal relationships.",
        "Dataset B explicitly references percentage-based thresholds (e.g., \"percentage of ratings with highest score\"), whereas A uses absolute numerical thresholds (e.g., \"> 4.0\").",
        "Dataset B asks for interactions with critics (e.g., \"likes related to the critic\"), while A only mentions critic likes as a filter without further analysis.",
        "Dataset B includes platform-specific user IDs (e.g., \"user 4208563\") and list titles (e.g., \"250 Favourite Films\"), while A uses generic identifiers (e.g., \"list id 5\").",
        "Dataset B incorporates multi-part questions requiring compound results (e.g., \"indicate the average rating score... and the URL\"), whereas A typically asks for singular metrics.",
        "Dataset B uses precise date formats (e.g., \"4/19/2020\") and date ranges (e.g., \"1/1/2017 to 12/31/2017\"), while A uses relative or year-only date filters (e.g., \"after 2014\")."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes queries that reference specific external platforms or services (e.g., Mubi URLs for profiles, cover images, or ratings).",
        "Dataset B contains questions that explicitly combine temporal constraints with user eligibility status (e.g., users eligible for trial when rating a movie during a specific timeframe).",
        "Dataset B requires compound responses (e.g., 'Indicate the average rating score... and when was it released?'), whereas A focuses on single-value answers.",
        "Dataset B uses percentage-based aggregations (e.g., 'how many percentage of the ratings were rated with highest score'), while A uses absolute counts or averages.",
        "Dataset B includes queries about social engagement metrics (e.g., 'likes' on critic reviews or ratings), which are absent in A.",
        "Dataset B asks for hierarchical or conditional aggregations (e.g., 'average number of followers for lists with at least 200 movies'), whereas A uses simpler aggregations without nested conditions.",
        "Dataset B references user eligibility status tied to specific actions (e.g., 'paying subscribers when rating a movie'), while A only mentions static user metadata like subscriber status.",
        "Dataset B involves multi-step temporal logic (e.g., 'lists updated 10 years after creation'), while A uses basic date comparisons (e.g., 'before 2014').",
        "Dataset B includes questions about user-generated content beyond lists (e.g., critic reviews and their likes), which A does not address.",
        "Dataset B explicitly links user actions to platform-specific URLs (e.g., profile images, rating URLs), while A does not reference URLs."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes temporal conditions with specific date ranges (e.g., 'Between 1/1/2017 to 12/31/2017') in queries, whereas A uses only years or general timeframes.",
        "Dataset B combines aggregation results (e.g., counts) with non-aggregated data (e.g., movie names, URLs) in single queries, unlike A's simpler aggregation requests.",
        "Dataset B explicitly asks for percentage-based calculations (e.g., 'how many percentage of the ratings were rated with highest score'), which A does not.",
        "Dataset B requests URLs tied to user profile images (e.g., 'cover image of the user') in addition to movie/rating URLs, while A focuses on movie or rating URLs only.",
        "Dataset B includes compound questions that combine yes/no eligibility checks with quantitative outputs (e.g., 'Was the user... eligible for trial? Indicate followers'), unlike A's single-output queries.",
        "Dataset B references social engagement metrics like 'likes' related to critics or ratings (e.g., 'highest amount of likes'), which A does not mention.",
        "Dataset B queries the number of movies directed by a person based on another metric (e.g., 'director of the highest movie popularity'), whereas A only links directors to movies or years.",
        "Dataset B uses precise date formats (e.g., '4/19/2020') in filters, while A uses only years or date ranges without specific day/month granularity.",
        "Dataset B explicitly ties user subscription status to past actions (e.g., 'were eligible for trial when they rated'), while A queries current status without temporal linkage.",
        "Dataset B includes eligibility status checks (e.g., 'eligible for trial') as a distinct filter criterion, whereas A focuses on subscription types (e.g., 'paying subscriber') without eligibility nuances."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Queries in B explicitly request URLs tied to specific platforms (e.g., Mubi) for resources like movie images or user profiles, while A refers to generic URLs.",
        "B includes questions combining numeric metrics (e.g., counts) with non-numeric outputs (e.g., director names, URLs) in a single query, whereas A focuses on singular outputs.",
        "B references specific user IDs (e.g., 'user 4208563') and exact movie titles (e.g., '\"Patti Smith: Dream of Life\"') as filters, while A uses placeholders like 'user1' or generic identifiers.",
        "B frequently asks for temporal calculations involving intervals (e.g., '10 years after creation') or precise date ranges (e.g., '1/1/2017 to 12/31/2017'), unlike A's simpler date comparisons (e.g., 'greater than 2021-01-01').",
        "B includes queries about \"likes\" on critics' reviews (e.g., 'critic received the highest amount of likes'), a metric absent in A.",
        "B requires percentage-based aggregations (e.g., 'percentage of ratings with the highest score'), while A focuses on counts, averages, or maxima.",
        "B combines multiple aggregation types in a single query (e.g., 'how many movies... Indicate the name and highest rating'), whereas A typically isolates one aggregation per question.",
        "B explicitly asks for historical or ordinal data (e.g., 'third movie directed by Quentin Tarantino'), while A prioritizes current or top-ranked results (e.g., 'highest rating').",
        "B references platform-specific entities like 'critic' reviews and their attributes (e.g., likes on critiques), which are absent in A's user-centric focus.",
        "B includes conditional outputs (e.g., 'Indicate...' clauses) requiring multiple distinct data points per response, whereas A requests single-value outputs per query."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B includes requests for URLs (e.g., image URLs, user profile URLs, or rating URLs), while A does not.",
        "Dataset B combines multiple distinct data points in a single query (e.g., asking for a count and a URL simultaneously), whereas A focuses on singular outputs.",
        "Dataset B explicitly references platform-specific entities like \"Mubi\" and \"critic\" interactions, which are absent in A.",
        "Dataset B uses precise date ranges (e.g., '1/1/2017 to 12/31/2017') instead of general temporal filters like years or eras in A.",
        "Dataset B includes eligibility-based conditions tied to user actions (e.g., 'eligible for trial when they rated'), while A only references static user states (e.g., 'paying subscribers').",
        "Dataset B queries social engagement metrics (e.g., 'likes' on critic reviews), which are absent in A's focus on ratings and counts.",
        "Dataset B asks about list update timelines (e.g., 'lists updated 10 years after creation'), whereas A focuses on list creation or follower counts.",
        "Dataset B requests percentages (e.g., 'percentage of ratings with highest score'), while A uses absolute counts or averages.",
        "Dataset B includes compound filters with AND/OR logic (e.g., 'popularity > 400 but < 500'), whereas A uses simpler threshold comparisons.",
        "Dataset B explicitly queries user-generated content metadata (e.g., 'cover image of the user who created the list'), which A does not address."
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about sentiment subjectivity scores (e.g., 'total Sentiment subjectivity score'), while A only references sentiment polarity thresholds.",
        "Dataset B explicitly requests translated user reviews (e.g., 'state the translated review of each app'), unlike A which focuses solely on sentiment metrics without textual content.",
        "Dataset B contains questions about app version information (e.g., 'include their current version'), while A never references version metadata.",
        "Dataset B asks for neutral sentiment analysis (e.g., 'users hold neutral attitude'), whereas A only distinguishes between positive/negative sentiments through polarity thresholds.",
        "Dataset B requires identification of specific phrases in user comments (e.g., 'apps that have \"gr8\" in their comments'), a textual analysis absent in A.",
        "Dataset B combines multiple sentiment metrics in single queries (e.g., 'percentage ratio between positive sentiments and negative sentiments'), while A uses singular sentiment polarity thresholds.",
        "Dataset B includes questions about apps with no negative sentiment (e.g., 'does not have negative sentiment'), whereas A only uses polarity score thresholds.",
        "Dataset B references exact version numbers (e.g., 'current version') in requirements, while A never mentions version-specific criteria.",
        "Dataset B queries about comment quantity metrics (e.g., 'most no comment reviews'), a dimension absent in A's popularity metrics.",
        "Dataset B requires explicit disclosure of target demographics (e.g., 'age group that the app is targeted at'), while A only uses general content ratings."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about specific sentiment metrics (e.g., 'sentiment polarity score', 'sentiment subjectivity') not just general sentiment labels (e.g., 'positive', 'neutral') like Dataset A.",
        "Dataset B explicitly requests translated reviews (e.g., 'state the translated review of each app'), whereas Dataset A does not reference review text content.",
        "Dataset B contains queries about app versions (e.g., 'current version'), which are absent in Dataset A.",
        "Dataset B asks for percentage ratios between sentiments (e.g., 'percentage ratio between positive sentiments and negative sentiments'), while Dataset A focuses on singular percentage metrics (e.g., 'percentage for free application').",
        "Dataset B includes queries about app metadata like price tiers (e.g., 'average price of games') and exact app sizes (e.g., 'size of Browser 4G'), whereas Dataset A only uses size thresholds.",
        "Dataset B references sentiment objectivity scores (e.g., 'sentiment objectivity of 0.3'), a metric not present in Dataset A.",
        "Dataset B requires identification of demographic targeting (e.g., 'age group that the app is targeted at'), unlike Dataset A which only uses generic content ratings.",
        "Dataset B includes 'top N' ranking queries (e.g., 'top 5 shopping apps'), while Dataset A focuses on counts/averages without rankings.",
        "Dataset B explicitly ties sentiment metrics to install counts (e.g., 'highest total Sentiment polarity score' with installs), whereas Dataset A treats installs as standalone attributes.",
        "Dataset B queries combine multiple sentiment dimensions in single requests (e.g., 'total Sentiment polarity score + subjectivity score'), while Dataset A typically isolates one sentiment aspect per query."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B includes queries about specific named apps (e.g., \"Basketball Stars\"), while Dataset A does not reference individual app names.",
        "Dataset B references sentiment subjectivity and objectivity scores (e.g., \"sentiment objectivity of 0.3\"), whereas Dataset A focuses solely on polarity scores.",
        "Dataset B includes requests for translated user reviews (e.g., \"state the translated review\"), which are absent in Dataset A.",
        "Dataset B incorporates app size as a filter or metric (e.g., \"size no more than 1.0 M\"), while Dataset A does not mention size.",
        "Dataset B explicitly quantifies counts of negative, neutral, or no-comment reviews (e.g., \"How many negative comments\"), whereas Dataset A focuses on aggregated sentiment ranges.",
        "Dataset B asks for percentage ratios between positive and negative sentiments (e.g., \"percentage ratio between positive and negative sentiments\"), which Dataset A does not.",
        "Dataset B involves price-related attributes (e.g., \"average price of games\"), while Dataset A only distinguishes free/paid status without monetary values.",
        "Dataset B links content ratings to specific keywords in user comments (e.g., apps with \"gr8\" in comments), a feature absent in Dataset A.",
        "Dataset B includes app version information (e.g., \"current version\"), which is not queried in Dataset A.",
        "Dataset B references apps with multiple genres and their combined sentiment scores (e.g., \"multiple genres... total sentiment subjectivity\"), while Dataset A does not."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B queries include explicit references to sentiment polarity and subjectivity scores (e.g., 'sentiment polarity score of 0.3') as standalone metrics, whereas A uses sentiment only as a binary qualifier (e.g., 'positive sentiment reviews').",
        "Dataset B incorporates app metadata like size (e.g., 'size of Browser 4G'), price (e.g., 'average price of games'), and version numbers (e.g., 'current version'), which are absent in A's queries.",
        "Dataset B explicitly references translated user reviews (e.g., 'state the translated review of each app'), while A focuses only on quantitative review metrics like counts or averages.",
        "Dataset B uses percentage-based metrics (e.g., 'percentage of positive sentiments') and ratios (e.g., 'percentage ratio between positive and negative sentiments'), which are not present in A's queries.",
        "Dataset B includes content suitability filters targeting specific age groups (e.g., 'suitable for teenagers') and content ratings (e.g., 'Everyone 10+'), whereas A filters only by broad content ratings (e.g., 'content rating = Everyone').",
        "Dataset B queries explicitly mention app update status (e.g., 'not been updated since 2018') as a filter, while A does not reference temporal attributes like update dates.",
        "Dataset B uses exact phrases from user reviews as search criteria (e.g., 'apps with \"gr8\" in their comments'), whereas A does not analyze textual review content.",
        "Dataset B includes queries about apps with multiple genres (e.g., 'apps have multiple genres') and aggregates sentiment metrics across them, while A focuses on single-genre filters.",
        "Dataset B specifies granular sentiment subjectivity thresholds (e.g., 'sentiment subjectivity of no more than 0.5'), whereas A uses broader sentiment categories like 'positive' or 'negative.'",
        "Dataset B employs unconventional numerical formats (e.g., '75 000 000 times') and install ranges (e.g., '1,000,000,000+ installs'), while A uses simpler numerical thresholds (e.g., '10,000 reviews')."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about sentiment subjectivity and objectivity scores (e.g., 'sentiment subjectivity of no more than 0.5'), which are absent in Dataset A.",
        "Dataset B explicitly requests translated reviews (e.g., 'state the translated review of each app'), while Dataset A focuses only on raw review counts or sentiment polarity.",
        "Dataset B contains queries about temporal aspects like app update dates (e.g., 'not been updated since 2018'), whereas Dataset A lacks time-based filters.",
        "Dataset B requires percentage calculations (e.g., 'percentage ratio between positive sentiments and negative sentiments'), while Dataset A only uses absolute counts or averages.",
        "Dataset B combines multiple metrics in single queries (e.g., 'average sentiment polarity score and rating'), while Dataset A typically isolates one metric per query.",
        "Dataset B references app pricing (e.g., 'average price of games'), whereas Dataset A exclusively focuses on free vs. paid status without monetary values.",
        "Dataset B includes queries about app versions (e.g., 'current version'), which are never mentioned in Dataset A.",
        "Dataset B uses compound sentiment conditions (e.g., 'apps that do not have negative sentiment'), while Dataset A only distinguishes between positive/negative reviews.",
        "Dataset B explicitly requests genre/category popularity metrics (e.g., 'genre that have downloads more than 1000000000'), whereas Dataset A focuses on category attributes without popularity thresholds.",
        "Dataset B employs multi-criteria ranking (e.g., 'apps reviewed more than 75,000,000 times AND suitable for teenagers'), while Dataset A uses simpler single-criterion rankings like 'top 5 shopping apps'."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about sentiment analysis metrics (e.g., sentiment polarity score, sentiment subjectivity score) not present in A.",
        "Dataset B explicitly references user sentiment categories (e.g., 'neutral attitude', 'negative comments', 'positive sentiments') while A does not.",
        "Dataset B contains queries requiring combined percentage ratios (e.g., 'percentage ratio between positive and negative sentiments') unlike A's simpler aggregates.",
        "Dataset B asks for textual analysis of user reviews (e.g., 'translated review', 'comments containing \"gr8\"') which A never references.",
        "Dataset B includes specific app metadata queries about exact review text content and sentiment categorization beyond A's version/year focus.",
        "Dataset B requires listing apps alongside verbatim user review data (e.g., 'state the translated review of each app') while A only requests statistical metadata.",
        "Dataset B contains queries about app size constraints (e.g., 'size no more than 1.0 M') not mentioned in A.",
        "Dataset B features exclusionary filters (e.g., 'does not have negative sentiment') whereas A only uses inclusionary thresholds.",
        "Dataset B asks about pricing specifics (e.g., 'average price of games') rather than A's binary free/paid distinction.",
        "Dataset B includes superlative queries based on non-rating metrics (e.g., 'most expensive app', 'highest total Sentiment polarity score') absent in A's top-ranked focus."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B includes queries about temporal aspects (e.g., 'not updated since 2018') not present in A.",
        "Dataset B explicitly requests translated reviews (e.g., 'state the translated review of each app'), whereas A does not reference translation.",
        "Dataset B contains queries combining percentage calculations with specific criteria (e.g., 'percentage for free application with a rating 4.5 and above'), while A focuses on averages or counts.",
        "Dataset B requires direct comparisons of sentiment ratios (e.g., 'percentage ratio between positive sentiments and negative sentiments'), unlike A\u2019s general sentiment polarity references.",
        "Dataset B asks for exact version numbers of apps (e.g., 'current version'), a detail absent in A.",
        "Dataset B references sentiment subjectivity and objectivity scores (e.g., 'sentiment subjectivity of no more than 0.5'), whereas A only mentions sentiment polarity.",
        "Dataset B includes queries about apps with 'no comment reviews' or 'neutral attitude,' introducing granular sentiment categories not seen in A.",
        "Dataset B explicitly combines multiple genres or mixed genres (e.g., 'apps that have multiple genres'), while A focuses on single genres/categories.",
        "Dataset B uses unconventional formatting for numeric thresholds (e.g., '75 000 000' with spaces instead of commas), differing from A\u2019s standardized formatting.",
        "Dataset B includes queries about app metadata like size (e.g., 'size no more than 1.0 M') not mentioned in A."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B includes questions requiring combined metrics (e.g., 'percentage of positive sentiments' paired with 'number of installs') in a single query, whereas Dataset A queries these metrics separately.",
        "Dataset B explicitly references temporal conditions (e.g., 'not been updated since 2018'), while Dataset A lacks time-based constraints.",
        "Dataset B incorporates app-specific metadata like size (e.g., 'size no more than 1.0 M'), price, and version details, which are absent in Dataset A queries.",
        "Dataset B queries textual content of reviews (e.g., 'translated review,' 'comments with \"gr8\"'), whereas Dataset A focuses solely on sentiment metrics without analyzing raw text.",
        "Dataset B uses compound numerical thresholds (e.g., 'sentiment subjectivity \u2264 0.5' combined with '100,000,000+ installs'), while Dataset A applies simpler standalone thresholds (e.g., 'rating > 4.0').",
        "Dataset B includes percentage ratios between sentiment categories (e.g., 'percentage ratio between positive and negative sentiments'), whereas Dataset A calculates percentages for single sentiment types.",
        "Dataset B explicitly targets apps with multiple genres (e.g., 'apps with multiple genres'), while Dataset A does not reference multi-genre conditions.",
        "Dataset B requires combining sentiment scores with non-sentiment attributes (e.g., 'highest total Sentiment polarity score' paired with genre), whereas Dataset A keeps sentiment analysis separate from other attributes.",
        "Dataset B references app update status (e.g., 'not been updated since 2018') and version numbers, which are absent in Dataset A.",
        "Dataset B includes queries for exact sentiment score totals (e.g., 'total Sentiment polarity score'), while Dataset A focuses on averages or counts of sentiment metrics."
      ]
    }
  }
}