{
  "sims": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets query professors by specific ID numbers (e.g., 297, 267) to retrieve teaching assignments or attributes.",
        "Both include questions about course levels (e.g., 'high-level undergraduate' in A, 'Level_500' in B) as a classification filter.",
        "Both reference faculty membership or employment status (e.g., 'faculty member,' 'non-faculty employees') as a key attribute for filtering professors.",
        "Both involve counting or listing courses taught by professors (e.g., 'least amount of courses' in A, 'more than 2 years' in B).",
        "Both datasets use student/professor advising relationships (e.g., 'advised student ID 303' in A, 'students advised by professor 138' in B).",
        "Both require filtering by course type or category (e.g., 'professional/master/graduate' in A, 'professional course' in B).",
        "Both include queries about student phases or progress (e.g., 'phase of qualifications' in A, 'Phase 0' or 'Pre_Quals' in B).",
        "Both use numerical course IDs (e.g., 104, 147 in A; 16, 144 in B) as identifiers for specific course-related queries.",
        "Both datasets ask for aggregate metrics (e.g., 'percentage,' 'total number,' 'ratio') across courses, professors, or students.",
        "Both include comparative or ranking questions (e.g., 'top 5 professors' in A, 'highest course level' in B) to identify extremes or hierarchies."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets query about professors teaching specific courses using professor IDs and course IDs.",
        "Both involve questions about the course level (e.g., 'high-level undergraduate' in A and 'Level_500' in B).",
        "Both include requests to count the number of courses taught by professors with specific IDs.",
        "Both reference numerical identifiers for professors (e.g., 'ID 415' in A and 'ID 335' in B).",
        "Both ask about student-advisor relationships (e.g., 'students under advisor 415' in A and 'students advised by professor 335' in B).",
        "Both focus on verifying or listing professors' roles in teaching specific course levels.",
        "Both require filtering or aggregating data based on categorical course classifications (e.g., 'professional', 'undergraduate', 'Level_500').",
        "Both include queries to list course IDs taught by a specific professor (e.g., 'course ID 104' in A and 'course IDs taught by professor 79' in B).",
        "Both involve questions about the total number of students or courses in specific categories (e.g., 'total number of students' in B and 'total of professional courses' in A).",
        "Both datasets use granular identifiers (e.g., 'student ID 303' in A and 'course ID 139' in B) to retrieve precise records."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Queries involve filtering professors by their IDs and positions within the faculty.",
        "Questions frequently reference course levels (e.g., undergraduate, master/graduate, specific numeric levels).",
        "Focus on faculty membership status (e.g., faculty vs. non-faculty professors).",
        "Requests for counts or lists of courses taught by professors under specific conditions.",
        "Use of student-advisor relationships as a filtering criterion (e.g., students advised by specific professors).",
        "Numerical filters applied to IDs or course levels (e.g., ID ranges, course level thresholds).",
        "Aggregation operations (e.g., percentages, totals, averages) on course or professor data.",
        "Combination of course IDs with their corresponding levels in output requirements.",
        "Existence checks for faculty membership or course assignments (e.g., 'Is the teacher a faculty member?').",
        "Queries include ranking or extremal values (e.g., 'highest number of courses,' 'top 5 professors')."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets query academic roles (e.g., professors, students) and their relationships to courses.",
        "Questions in both datasets frequently filter results using numerical criteria (e.g., course levels, IDs).",
        "Aggregation functions (e.g., COUNT, percentage) are used to quantify entities like courses or students.",
        "Specific entity identifiers (e.g., course ID, professor ID) are central to queries in both datasets.",
        "Queries often involve hierarchical course classifications (e.g., undergraduate, graduate, professional).",
        "Both datasets include questions about status attributes (e.g., faculty membership, program enrollment phase).",
        "Questions require joins between entities (e.g., professors teaching courses, advisors linked to students).",
        "Explicit numerical thresholds (e.g., course levels > 500, years in program > 5) are used for filtering.",
        "Results are often constrained to top/bottom values (e.g., 'highest number of courses,' 'least amount').",
        "Queries focus on extracting structured tabular data (e.g., lists of IDs paired with attributes)."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets include queries filtering by specific course IDs (e.g., 27, 104, 147).",
        "Questions in both datasets ask about the academic level (e.g., undergraduate, graduate) of courses.",
        "Both datasets involve professor IDs and their roles (e.g., faculty member, position).",
        "Student-advisor relationships are queried using professor and student IDs in both datasets.",
        "Count operations are present (e.g., number of courses, students, faculty members).",
        "Questions seek to identify courses taught by specific professors using their IDs.",
        "Both datasets reference course levels (e.g., basic, medium, high-level) in queries.",
        "Faculty membership status (e.g., member of faculty or not) is a common filter in queries.",
        "Exact numerical IDs are used to reference entities (courses, professors, students) in both datasets.",
        "Queries often involve retrieving information based on a combination of entity attributes (e.g., course level and faculty status)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets query course levels (e.g., 'high-level undergraduate' in A, 'Level_500' in B).",
        "Both involve filtering or retrieving data using specific course IDs (e.g., 'course ID 104' in A, 'course id 120' in B).",
        "Both reference professor IDs (e.g., 'professor ID 297' in A, 'p_id 201' in B) to identify instructors.",
        "Questions in both datasets focus on professor-course teaching relationships (e.g., 'taught by professor' in A, 'taught by the professor' in B).",
        "Both include queries about faculty membership or employment status (e.g., 'member of faculty' in A, 'professor for more than 0 years' in B).",
        "Both datasets ask about student advising relationships (e.g., 'advised student ID 303' in A, 'students not advised' in B).",
        "ID ranges or criteria (e.g., 'course ID from 121 to 130' in A, 'professors teaching for more than 2 years' in B) are used to filter results.",
        "Both include existence checks (e.g., 'Is the teacher... a faculty member?' in A, 'Who has taught Level_500 course?' in B).",
        "Aggregation of counts (e.g., 'how many courses' in A, 'how many students' in B) is a common theme.",
        "Both datasets seek the highest or most frequent values (e.g., 'highest number of courses' in A, 'highest level of a course' in B)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Both datasets query courses taught by professors based on specific attributes (e.g., faculty status in A, years in program in B).",
        "Both involve filtering results using numerical thresholds (e.g., course levels <10 in A, years in program \u22655 in B).",
        "Course levels are explicitly referenced as a key filtering criterion in questions from both datasets.",
        "Questions in both datasets frequently request professor IDs or person IDs linked to courses.",
        "Both require aggregations (e.g., counts, percentages, ratios) of courses, professors, or students.",
        "Queries in both datasets focus on relationships between professors and courses (e.g., courses taught, course levels).",
        "Both use exact identifiers (e.g., course IDs like 104 in A, professor IDs like 1 in B) to retrieve specific records.",
        "Questions in both datasets seek to identify professors with extreme values (e.g., 'most courses taught' in A and B).",
        "Both include conditions based on categorical classifications (e.g., 'faculty member' in A, 'beginning phase' in B).",
        "Queries in both datasets involve joining entities (e.g., professors, courses, students) through shared identifiers like IDs."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets involve querying course levels (e.g., 'undergraduate,' 'graduate,' 'high-level,' 'Senior').",
        "Both include questions about professors teaching specific courses or course levels.",
        "Both require filtering or grouping by professor identifiers (e.g., 'professor ID,' 'p_id').",
        "Both involve counting entities (e.g., students, courses, professors) with specific criteria.",
        "Both reference advisor-advisee relationships (e.g., students linked to professors).",
        "Both include queries about faculty/program status (e.g., 'faculty member,' 'inPhase' status).",
        "Both require retrieving data tied to specific course IDs (e.g., course ID 104 in A, course_id 1 in B).",
        "Both focus on granular categorization of courses (e.g., 'basic,' 'medium,' 'professional,' 'advanced').",
        "Both involve calculating proportions or percentages of specific subsets (e.g., course types, faculty status).",
        "Both include requests to list IDs (e.g., professor IDs, student IDs) associated with specific conditions."
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets involve queries about user-created lists, including attributes like list titles, followers, and creation details.",
        "Both datasets require aggregating numerical data such as average ratings, counts of ratings, and percentage calculations.",
        "Both datasets include questions about specific movies, focusing on attributes like titles, directors, release years, and popularity metrics.",
        "Both datasets query user demographics, including trialists, subscribers, and users with payment methods, in relation to their interactions (e.g., ratings, list creation).",
        "Both datasets use conditional filters based on thresholds (e.g., 'more than 5 followers,' 'ratings after 2011').",
        "Both datasets involve linking movies to directors and analyzing metrics like average scores or popularity for specific directors.",
        "Both datasets reference temporal constraints, such as timestamps for list creation or rating dates.",
        "Both datasets ask for rankings or extremes (e.g., 'most followers,' 'highest average rating,' 'least ratings').",
        "Both datasets require joining entities like users, lists, ratings, and movies to answer multi-table queries (e.g., lists created by users who rated specific movies).",
        "Both datasets include queries about list metadata, such as the number of movies in a list or how long a list has gone without updates."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets query aggregate metrics (e.g., counts, averages, percentages).",
        "Both include questions about maximum values (e.g., 'most comments,' 'highest rating').",
        "Both filter results using temporal constraints (e.g., release years, creation timestamps).",
        "Both reference specific entities using identifiers (e.g., movie titles, user IDs, director names).",
        "Both involve user-generated content (e.g., lists, ratings, likes).",
        "Both use conditional thresholds (e.g., 'more than 13000 popularity,' 'lists with more than 100 followers').",
        "Both request metadata like URLs or creation timestamps for entities.",
        "Both require joins between entities (e.g., movies to directors, users to ratings).",
        "Both include questions about popularity metrics (e.g., 'most popular movie').",
        "Both reference user attributes (e.g., subscriber status, list followers)."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Both datasets focus on aggregating data (e.g., count, average, max, min) for metrics like ratings, likes, followers, or popularity.",
        "Questions in both datasets filter results using user attributes such as subscription status, trial eligibility, or payment method.",
        "Both include queries about specific entities like movie titles, directors, user IDs, list IDs, and URLs.",
        "Temporal constraints (e.g., release years, update dates, rating timestamps) are used in filtering results in both datasets.",
        "Both datasets involve user-generated content, such as lists, with attributes like followers, update dates, or creation dates.",
        "Questions in both datasets frequently combine multiple criteria (e.g., popularity + release year, director + rating score) in a single query.",
        "Both focus on popularity metrics (e.g., 'most popular movie,' 'highest number of followers') and ranking (e.g., 'top 3,' 'highest average score').",
        "Queries in both datasets explicitly reference exact identifiers (e.g., movie IDs, user IDs, list IDs) for precision.",
        "Both include requests for URLs tied to movies or ratings on the platform.",
        "Questions in both datasets involve ranking or comparing entities (e.g., 'most followers,' 'least ratings') using numerical thresholds."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets involve queries about movie titles and their release years.",
        "Both datasets include questions about average rating scores for movies.",
        "Both datasets require filtering movies based on release years.",
        "Both datasets involve aggregating data, such as counts and averages.",
        "Both datasets ask about the popularity of movies, often tied to user interactions like ratings.",
        "Both datasets reference specific numerical thresholds (e.g., ratings >4.5, movies released after 2000).",
        "Both datasets include questions about user-generated content, such as lists and followers.",
        "Both datasets require comparisons to identify extremes (e.g., highest-rated, most popular).",
        "Both datasets involve queries about directors and their associated movies.",
        "Both datasets include requests for metadata like URLs, timestamps, or list IDs."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries filter results using specific attributes (e.g., movie title, user ID, list ID).",
        "Requests retrieval of specific data fields (e.g., titles, counts, URLs).",
        "Involves entities such as movies, users, or lists.",
        "Utilizes exact match conditions (e.g., by ID, exact title).",
        "Seeks a single, precise answer (e.g., one title, a numerical count).",
        "References numerical values (e.g., ratings, counts, years).",
        "Involves user-generated data (e.g., ratings, lists, likes).",
        "Focuses on measurable metrics (e.g., popularity, average score, number of followers).",
        "Requires joining entities (e.g., movies with ratings, users with lists).",
        "Contains explicit conditional logic (e.g., 'highest,' 'most,' 'after year X')."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "All questions require retrieving specific data fields (e.g., title, count, URL) using SELECT statements.",
        "Every query includes at least one filtering condition (e.g., exact ID, movie title, numerical threshold) in a WHERE clause.",
        "All questions reference structured entities such as movies, users, lists, or ratings.",
        "Queries involve exact identifiers (e.g., user IDs, list IDs, movie titles) or explicit criteria (e.g., 'highest rating').",
        "Results depend on measurable attributes like numerical scores (ratings), dates, or engagement metrics (likes, followers).",
        "Questions map to database operations requiring joins (e.g., users-to-ratings, movies-to-lists) for cross-referenced data.",
        "User metadata (e.g., subscriber status, payment method) is frequently used as a filter or result metric.",
        "Temporal filters (e.g., release year, 'last year', update dates) are common constraints.",
        "Aggregate functions (e.g., COUNT, AVG, MAX) or comparative logic (e.g., 'most', 'least') are explicitly or implicitly required.",
        "Natural language constructs directly translate to SQL clauses (e.g., 'percentage' implies division, 'average' implies AVG)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Both datasets query movie titles based on specific conditions (e.g., highest ratings, popularity, or list membership).",
        "Both involve filtering by user subscription status (e.g., trialists, subscribers, or users with payment methods).",
        "Both focus on numeric metrics like ratings (e.g., highest score, average score, or counts of ratings).",
        "Both include aggregate functions (e.g., average, count, percentage, or max/min values).",
        "Both reference user-generated lists (e.g., list titles, list followers, or list update timestamps).",
        "Both require filtering by user attributes (e.g., user IDs, subscription eligibility, or profile status).",
        "Both involve popularity metrics (e.g., popularity numbers, list followers, or most popular movies).",
        "Both use temporal filters (e.g., release years, rating timestamps, or list update dates).",
        "Both reference structured database elements (e.g., tables like 'lists' or 'Ratings', or URLs for specific entries).",
        "Both require combining multiple criteria (e.g., filtering by year, user status, and rating score simultaneously)."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets query for highest or maximum rating scores of movies.",
        "Both involve counting specific entities such as users, ratings, or likes.",
        "Aggregate functions (e.g., average, percentage) are used to analyze numerical data.",
        "Temporal filters (e.g., release year, post-2011 ratings) are applied in queries.",
        "Specific movie titles are referenced directly in questions.",
        "User demographics (e.g., subscribers, payment methods) are used as filtering criteria.",
        "List-related metrics (e.g., followers, creation dates) are queried.",
        "Conditional operators (>, <, =) are used to filter results (e.g., ratings > 5).",
        "Queries reference user-specific identifiers (e.g., user IDs) or user-generated content.",
        "Movie attributes (e.g., title, release year) are central to query conditions."
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets focus on app ratings, including specific values and average calculations.",
        "Both datasets query the number of installs/downloads as a key metric.",
        "Both datasets filter or group data by app categories (e.g., 'GAME', 'EDUCATION').",
        "Both datasets include criteria related to app size (e.g., 'size less than 50M').",
        "Both datasets differentiate between free and paid apps in their queries.",
        "Both datasets analyze user reviews, including sentiment polarity and subjectivity scores.",
        "Both datasets reference content ratings (e.g., 'adults only 18+', 'Everyone').",
        "Both datasets request top-ranked apps based on installs, ratings, or review counts.",
        "Both datasets calculate averages for metrics like ratings or sentiment scores.",
        "Both datasets use compound conditions (e.g., rating thresholds combined with install counts or categories)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Both datasets focus on querying app ratings, including specific values, averages, and counts.",
        "Both involve filtering results by app categories (e.g., 'GAME', 'SOCIAL', 'Racing').",
        "Queries in both datasets use aggregation functions like average, count, and max/min.",
        "Both include numerical thresholds (e.g., ratings >4.0, installs >1,000,000, size \u22641.0M).",
        "Both datasets ask for counts of apps meeting specific criteria (e.g., rating thresholds, category matches).",
        "Questions in both reference user sentiment analysis (e.g., polarity scores, positive/negative reviews).",
        "Both involve queries about app metadata (e.g., install counts, size, content ratings, update status).",
        "Comparative language is used in both (e.g., 'top 5', 'highest', 'most reviews').",
        "Both datasets require statistical summaries (e.g., average ratings, percentage ratios, total installs).",
        "Queries in both combine multiple attributes (e.g., category + rating, sentiment + installs)."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Both datasets focus on app ratings, frequently querying specific values (e.g., 4.5, 5.0) and averages.",
        "Questions in both datasets filter results by app categories/genres (e.g., Puzzle, Action, Racing).",
        "Both include queries about the number of installs, often with thresholds (e.g., >1M, >10M).",
        "Sentiment analysis metrics (e.g., polarity scores, subjectivity) are referenced in questions across both datasets.",
        "Top-N rankings (e.g., 'top 5', 'top 3') are used to identify leading apps by installs, ratings, or reviews.",
        "Both datasets combine multiple criteria (e.g., rating \u22654.5 AND installs >1M AND free status) in queries.",
        "Free vs. paid app distinctions are explicitly mentioned in questions from both datasets.",
        "Time-based filters (e.g., apps not updated since 2018/2015) appear in both datasets.",
        "Quantitative comparisons (e.g., 'percentage of apps with negative sentiment') are present in both.",
        "Both datasets request translated reviews or review metadata (e.g., comments, sentiment labels)."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Both datasets focus on app ratings, including specific ratings, average ratings, and comparisons based on rating thresholds.",
        "Both datasets involve counting or aggregating the number of apps, reviews, or installs that meet specific criteria (e.g., rating thresholds, categories, review counts).",
        "Both group or filter apps by categories/genres (e.g., 'Tools,' 'Family,' 'Racing').",
        "Both use aggregation functions like averages (e.g., average rating, average sentiment polarity) and top/bottom rankings (e.g., top 5 apps).",
        "Both include queries with numerical thresholds (e.g., apps with ratings >4.5, reviews >10,000, installs >1,000,000,000).",
        "Both reference app metadata such as category, genre, content rating, and version.",
        "Both require combining multiple criteria (e.g., category + rating + review count, sentiment + installs).",
        "Both focus on quantitative outcomes (e.g., counts, percentages, averages) rather than qualitative analysis.",
        "Both involve filtering apps based on user feedback (e.g., sentiment polarity, review sentiment, subjectivity scores in A; positive sentiment reviews in B).",
        "Both include queries targeting specific named apps (e.g., 'Maps,' 'Super Mario Bros' in B; 'Dragon Ball Legends' in A)."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets focus on querying app ratings, including average ratings for specific categories or individual apps.",
        "Both datasets include questions about the number of reviews (e.g., positive, neutral, total) for apps.",
        "Both datasets involve filtering or grouping apps by category/genre (e.g., 'FAMILY', 'Racing', 'Photography').",
        "Both datasets ask for metadata about apps, such as size, version, or content rating (e.g., 'adults only 18+').",
        "Both datasets use aggregation functions (e.g., average, count, percentage) to analyze app data.",
        "Both datasets reference specific apps by name (e.g., 'Cooking Fever', 'Holy Quran Mehmet Emin Ay').",
        "Both datasets query install/download counts or thresholds (e.g., '1,000,000,000+ installs', 'minimum number of downloads').",
        "Both datasets apply rating thresholds as filters (e.g., 'rating more than 4.0', '4.5 and above').",
        "Both datasets involve user sentiment analysis, either explicitly (sentiment polarity in A) or implicitly (positive reviews in B).",
        "Both datasets include questions about app update status (e.g., 'not been updated since 2015', 'since 2018')."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Both datasets query numerical app ratings (specific values in A, averages in B).",
        "Both involve filtering or grouping by app categories/genres (e.g., Puzzle, Family).",
        "Both reference install counts (e.g., thresholds like 1,000,000+ in A and 500+ in B).",
        "Both include questions about free apps (e.g., \"top 5 free apps\" in A, \"free apps\" in B).",
        "Both use content rating filters (e.g., \"adults only 18+\" in A, \"teen\" in B).",
        "Both require aggregation functions (count, average, top N lists).",
        "Both mention specific app names (e.g., \"Dragon Ball Legends\" in A, \"BBW Dating\" in B).",
        "Both utilize numerical thresholds (e.g., \"rating 4.5+\" in A and B).",
        "Both analyze user feedback metrics (sentiment scores in A, ratings in B).",
        "Both filter results by app attributes (size, version, category, installs)."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Both datasets focus on querying app ratings, either for specific apps or aggregated averages.",
        "Both involve filtering or aggregating data based on app categories or genres (e.g., 'Games,' 'Entertainment').",
        "Numerical thresholds (e.g., ratings \u22654.5, installs \u22651M) are used in queries across both datasets.",
        "Questions in both datasets request aggregated metrics like averages, counts, percentages, or totals.",
        "Both include inquiries about user reviews, such as sentiment analysis (A) or review counts (B).",
        "Top-ranked lists (e.g., 'top 5 apps') are requested in both datasets based on ratings, installs, or reviews.",
        "App install numbers are a recurring metric for popularity or filtering criteria in queries.",
        "Both datasets use statistical measures (e.g., average rating, percentage of positive reviews) in questions.",
        "Queries in both datasets filter results using app metadata (e.g., category, rating, install range).",
        "Both emphasize comparisons or rankings within specific app groups (e.g., genres, categories, update years)."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Both datasets query app ratings, including specific values (e.g., 4.7, 5.0) and category averages (e.g., Games, Tools).",
        "Both include requests for counts of user reviews/comments (e.g., Instagram in B, Brit + Co in A).",
        "Sentiment analysis metrics are used (A: polarity scores, subjectivity; B: positive/negative sentiment categorizations).",
        "Queries target app categories/genres (e.g., Racing in A, Business in B) for filtering results.",
        "Aggregate functions like averages, percentages, and totals are applied to ratings and sentiment metrics.",
        "Specific numerical thresholds are used (e.g., 4.5+ ratings in A, >4.0 ratings in B).",
        "Questions focus on identifying extremes (e.g., 'highest rating,' 'lowest sentiment polarity score').",
        "Both reference named apps (e.g., Facebook in B, Dragon Ball Legends in A) for targeted analysis.",
        "Reviews are analyzed by sentiment type (neutral, negative, positive) across datasets.",
        "Metadata like app versions and update years (A) or table structures (B) are occasionally included in queries."
      ]
    }
  },
  "diffs_synth_from_real": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about professors' time in the program (e.g., 'more than 2 years') or student years in the program, while A does not reference temporal experience metrics.",
        "Dataset B explicitly references specific faculty names (e.g., 'Faculty of Mathematics') as filtering criteria, whereas A only distinguishes between faculty and non-faculty status without naming departments.",
        "Dataset B uses structured phase labels with underscores or numerical codes (e.g., 'Phase 0', 'Pre_Quals'), while A uses generic phrases like 'phase of qualifications'.",
        "Dataset B asks for minimum/maximum values (e.g., 'minimum and maximum number of years in program'), whereas A focuses on rankings (e.g., 'top 5 professors') without explicit min/max functions.",
        "Dataset B includes queries about student progress tied to program completion (e.g., 'students who have not completed their program'), while A focuses on phases without explicit completion status.",
        "Dataset B references students teaching courses (e.g., 'student with ID 100 is teaching'), a scenario absent in A, which only involves professors as instructors.",
        "Dataset B uses exact numerical course level classifications (e.g., 'Level_500'), whereas A employs descriptive terms like 'high-level undergraduate' without standardized codes.",
        "Dataset B queries professors' departmental positions (e.g., 'position in a department'), while A only distinguishes between faculty/non-faculty employment status.",
        "Dataset B includes explicit thresholds for experience (e.g., 'more than 5 years of experience'), whereas A uses qualitative filters like 'least amount of courses'.",
        "Dataset B asks about students' program years (e.g., 'first year in the program'), a granularity absent in A, which focuses on phases without temporal alignment."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries about individuals who are both students and professors, which is absent in A.",
        "Dataset B allows for students to be instructors (e.g., 'student has the most courses taught by them'), while A does not.",
        "Dataset B references personal names (e.g., 'name of the person') in queries, whereas A uses only numerical identifiers.",
        "Dataset A frequently involves faculty membership or position status (e.g., 'faculty member'), while B omits such references.",
        "Dataset A contains queries calculating percentages or ratios (e.g., 'percentage of courses'), which B lacks entirely.",
        "Dataset A references qualification phases (e.g., 'non-faculty members undergoing phase of qualifications'), while B does not.",
        "Dataset A uses combined course categories (e.g., 'professional or master/graduate'), while B queries single categories like 'Level_500'.",
        "Dataset A includes ranking-based queries (e.g., 'top 5 professors'), whereas B focuses solely on counts without rankings.",
        "Dataset A contains range-based queries for IDs (e.g., 'course ID from 121 to 130'), which B does not.",
        "Dataset B references student-specific attributes like 'yearsInProgram', absent in A."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Queries in B require conditions on the number of professors per course (e.g., 'taught by at least two professors', 'only one professor teaching it'), which are absent in A.",
        "B includes explicit references to professors' experience duration (e.g., '5 or more years of experience'), while A does not mention experience-based filters.",
        "B incorporates program phases (e.g., 'Phase 1', 'not in their program\u2019s first phase') as criteria, whereas A only indirectly references phases in a qualification context.",
        "B explicitly requests course names (e.g., 'Data Structures and Algorithms'), while A relies solely on course IDs and levels without naming specifics.",
        "B asks for student-centric aggregations (e.g., 'average number of students per professor', 'students with more than one advisor'), which A does not include.",
        "B references employment types like 'full-time professors', while A distinguishes only between faculty and non-faculty status without employment granularity.",
        "B uses structured numeric course levels (e.g., 'Level 300', 'Level_400'), whereas A employs qualitative descriptors like 'basic' or 'high-level' for course levels.",
        "B includes queries about students with multiple advisors, a criterion absent in A\u2019s samples.",
        "B specifies non-faculty positions (e.g., 'position other than Faculty_eme'), while A\u2019s non-faculty category lacks such granular positional distinctions.",
        "B combines student and professor roles in course-level queries (e.g., 'course levels taught by a student and the professor'), a feature not present in A."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B queries do not reference faculty membership status (e.g., 'faculty member' or 'non-faculty') as a filtering criterion, unlike Dataset A.",
        "Dataset B lacks questions involving advisor-student relationships (e.g., 'students under advisor X'), while Dataset A frequently includes these.",
        "Dataset B does not use explicit numerical ID ranges (e.g., 'course ID from 121 to 130') for filtering, unlike Dataset A.",
        "Dataset B includes repetitive questions about total counts (e.g., 'How many students are currently in the program?') with minimal variation, whereas Dataset A avoids redundancy.",
        "Dataset B omits percentage or ratio calculations (e.g., 'ratio of professors and students') present in Dataset A.",
        "Dataset B uses generalized course level terminology (e.g., 'intermediate level') instead of explicit hierarchical classifications (e.g., 'high-level undergraduate') seen in Dataset A.",
        "Dataset B queries are simpler in structure, often focusing on single aggregation (e.g., 'total number of courses'), while Dataset A combines multiple criteria (e.g., course level + faculty status + ID thresholds).",
        "Dataset B does not request top/bottom rankings (e.g., 'top 5 professors') as constraints for results, unlike Dataset A.",
        "Dataset B includes questions about professors' program tenure (e.g., 'more than 5 years in the program'), a criterion absent in Dataset A.",
        "Dataset B does not reference non-faculty member statuses (e.g., 'not undergoing phase of qualifications') or program enrollment phases, unlike Dataset A."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes references to professor names (e.g., Professor Jane, John, Smith) alongside numerical IDs, while A uses only numerical IDs.",
        "Dataset B contains queries about a professor's duration in the program (e.g., 'more than 5 years'), which A does not address.",
        "Dataset B explicitly asks about individuals who are both professors and students (dual roles), whereas A does not mention this overlap.",
        "Dataset B lacks queries involving percentage calculations (e.g., 'Calculate the percentage...'), which are present in A.",
        "Dataset B includes questions about students' program years (e.g., 'Year_1'), a concept absent in A.",
        "Dataset B omits queries requesting top/bottom rankings (e.g., 'top 5 professors', 'least amount of courses'), which are common in A.",
        "Dataset B does not filter queries by combined attributes like course level AND faculty status, unlike A.",
        "Dataset B often omits explicit academic level granularity (undergraduate/graduate) in course-related questions compared to A.",
        "Dataset B includes questions about a person\u2019s name (e.g., 'name of the professor'), while A never references names.",
        "Dataset B asks about students with no position (e.g., 'has no position'), a filter not present in A."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B uses standardized course level codes (e.g., 'Level_500') rather than descriptive classifications like 'high-level undergraduate' used in A",
        "Queries in B never reference faculty position/status details (e.g., 'position in faculty' from A) when identifying professors",
        "B includes temporal filters about professor experience (e.g., 'teaching for more than 2 years') not seen in A's samples",
        "B's queries use abbreviated ID labels like 'p_id' and 'professor_id' where A consistently uses full labels like 'professor ID'",
        "B lacks percentage/ratio calculations present in A (e.g., 'Calculate the percentage...')",
        "B shows consistent pattern of using 'course with id X' phrasing rather than A's mix of formats like 'course ID X' or 'course no.X'",
        "B never references specific student IDs in conditions (e.g., 'advised student ID 303' in A), only general advising status",
        "B's aggregation queries focus on simple counts rather than A's comparative metrics like 'highest number of courses'",
        "B's existence checks focus on teaching relationships rather than A's faculty membership verification (e.g., 'Is the teacher... faculty member')",
        "B lacks A's pattern of extremal queries about 'least amount of courses' or 'top 5 professors' rankings"
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B queries reference 'yearsInProgram' or program duration as a primary filtering attribute, while A uses faculty membership status (e.g., 'faculty member').",
        "Dataset B includes direct references to professor names (e.g., 'professor John'), whereas A exclusively uses numerical identifiers (e.g., 'teacher no.79').",
        "Dataset B contains repetitive phrasing variations for similar queries (e.g., multiple instances of 'taught by professors who have been in the program for more than 5 years'), while A demonstrates more diverse query structures.",
        "Dataset B introduces phase-based classifications (e.g., 'beginning phase') as categorical filters, whereas A uses positional classifications (e.g., 'faculty employee').",
        "Dataset B includes references to student advising relationships as a secondary filter (e.g., 'professors who have advised at least two students'), while A makes student advising relationships a primary focus of multiple queries.",
        "Dataset B uses simplified course level classifications (e.g., 'masters'), whereas A employs granular course level categories (e.g., 'basic/medium/high-level undergraduate').",
        "Dataset B queries frequently omit explicit aggregation requests (e.g., 'What courses are taught...') where A explicitly requires calculations (e.g., 'Calculate the percentage...').",
        "Dataset B includes professor identifiers with mixed formats (e.g., 'p_id 2', 'professor with ID 1'), while A maintains consistent numerical identifier formats (e.g., 'professor ID 297').",
        "Dataset B contains questions about professor-to-course relationships without additional faculty status constraints, whereas A frequently combines faculty status with other filters in single queries.",
        "Dataset B introduces temporal program phases (e.g., 'first year of a program') as categorical conditions, while A's temporal references focus on course levels rather than program duration."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B explicitly requests SQL queries (e.g., 'Write a SQLite query') while A focuses on natural language questions without code requirements.",
        "Dataset B uses simpler filtering conditions (e.g., 'more than 5 years') compared to A's multi-criteria filters (e.g., 'high-level undergraduate course of less than 10 in ID').",
        "Dataset B repeats identical question patterns multiple times (e.g., 5 instances of 'Which courses are taught by professors?') while A maintains more unique phrasings.",
        "Dataset B references database schema elements directly (e.g., 'inPhase column', 'person table') whereas A uses abstracted business terms (e.g., 'faculty member', 'professional courses').",
        "Dataset B includes queries about the database structure itself (e.g., 'course levels available in the database') rather than just operational data questions like A.",
        "Dataset B uses consistent column naming conventions (e.g., 'p_id', 'courseLevel') while A mixes formats ('professor ID', 'teacher no.79', 'course ID 104').",
        "Dataset B contains simpler counting requests (e.g., 'total number of students') without A's proportional calculations (e.g., 'percentage of high-level undergraduate course').",
        "Dataset B includes basic existence checks (e.g., 'Which courses have a professor teaching them?') where A focuses on rankings/performance (e.g., 'top 5 professors').",
        "Dataset B uses explicit phase terminology from column values (e.g., 'Phase 3 students') while A references program status abstractly (e.g., 'inPhase status').",
        "Dataset B maintains simpler relationship queries (e.g., 'Which professor is advising student X?') compared to A's complex connections (e.g., 'students advised to teach by professors teaching specific course levels')."
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries that filter lists based on specific text patterns in titles (e.g., 'contains the word \"s\"') while Dataset A does not use text-based list title filters.",
        "Dataset B explicitly references exact timestamps (e.g., 'before 2012-11-13 00:00:00 UTC') for temporal constraints, whereas Dataset A uses relative or year-only time constraints (e.g., 'after 2011').",
        "Dataset B requires sorting results in explicit output orders (e.g., 'descending order of followers'), while Dataset A only asks for rankings/extremes without specifying output ordering.",
        "Dataset B includes queries about users' payment method status (e.g., 'users who have a payment method') as a standalone or combined condition, whereas Dataset A does not reference payment methods directly.",
        "Dataset B asks for percentages or averages tied to list metadata (e.g., 'percentage of lists created by trialists'), while Dataset A calculates percentages/aggregates based on user interactions (e.g., ratings).",
        "Dataset B combines multiple user status filters in single queries (e.g., 'both a trialist and a subscriber'), whereas Dataset A treats trialists and subscribers as separate demographic groups.",
        "Dataset B explicitly references the `lists` table name in queries (e.g., 'ballet movies from \"lists\" table'), while Dataset A does not mention table names in questions.",
        "Dataset B includes queries about the relationship between list creators and their non-followed lists (e.g., 'lists that a user created and is not a follower of'), which does not appear in Dataset A.",
        "Dataset B uses exact list titles as literal filters (e.g., 'list named \"2021\"'), while Dataset A references list titles indirectly through user actions or attributes.",
        "Dataset B requires comparisons of list popularity within user-specific contexts (e.g., 'most followers among all lists created by user X'), whereas Dataset A focuses on global popularity metrics across all users."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B's questions focus on singular metrics (e.g., 'most followers') without multi-part responses, while A often combines multiple metrics (e.g., 'average number of movies... and how many rated 5').",
        "Dataset B does not include percentage-based queries (e.g., 'percentage of ratings by subscribers'), unlike A.",
        "Dataset B omits conditional user interactions (e.g., 'likes received after giving a rating score of 5') seen in A.",
        "Dataset B uses temporal constraints primarily for entity attributes (e.g., release year), while A applies them to user actions (e.g., ratings after 2011).",
        "Dataset B lacks questions about user eligibility status (e.g., 'trialists', 'subscribers') as filtering criteria, which A includes.",
        "Dataset B avoids requests for URLs as standalone outputs (e.g., 'url of the movie'), instead pairing URLs with titles (e.g., 'title and URL').",
        "Dataset B does not require ordinal positions (e.g., 'third movie') or date-duration calculations (e.g., 'longest period not updated') like A.",
        "Dataset B omits explicit comparisons of popularity metrics against conditional thresholds (e.g., 'most popular movie with <13000 popularity'), which A includes.",
        "Dataset B includes direct database-wide counts (e.g., 'total number of followers for all lists'), while A focuses on user-specific or entity-specific aggregations.",
        "Dataset B simplifies joins to single-entity relationships (e.g., 'director of movie ID 235011'), whereas A often combines joins with multi-step conditions (e.g., movies rated by users who created specific lists)."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Queries in dataset B more frequently involve subqueries or nested conditions (e.g., 'movies added to lists of users who were subscribers when they created the list'), whereas dataset A focuses on direct attribute-based aggregations.",
        "Dataset B includes explicit filters for list titles containing specific text patterns (e.g., 'Avengers'), which are absent in dataset A.",
        "Questions in dataset B use dynamic temporal constraints like 'updated in the last month,' while dataset A relies on static time ranges (e.g., 'after 2011').",
        "Dataset B requires joining multiple entities (e.g., lists, users, directors) in a single query more frequently than dataset A.",
        "Queries in dataset B explicitly reference list IDs alongside titles (e.g., 'Indicate the list_id of each list'), whereas dataset A typically requests only one identifier type per query.",
        "Dataset B includes conditions that combine subscription types (e.g., 'users with both trial and paid subscriptions'), while dataset A filters by single subscription states.",
        "Questions in dataset B use existence-based criteria (e.g., 'users who have at least one movie in their list with...'), whereas dataset A focuses on exact counts or percentages.",
        "Dataset B aggregates metrics over list attributes (e.g., 'average rating of movies in lists with >3 followers'), while dataset A aggregates over user or movie attributes directly.",
        "Queries in dataset B more frequently associate list titles with specific movie IDs (e.g., 'list title of the movie with ID 1000'), a pattern not seen in dataset A.",
        "Dataset B includes popularity comparisons within explicitly named list categories (e.g., 'Mubi's Top Lists'), whereas dataset A uses generic popularity metrics."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B includes questions about movie genres and categories, which are absent in Dataset A.",
        "Dataset B asks for database-wide statistics (e.g., total movie count, unique directors) while Dataset A focuses on user-specific metrics.",
        "Dataset B contains explicit requests for top-N rankings (e.g., 'top 5 most popular') while Dataset A focuses on singular extremes (e.g., 'most comments', 'highest average score').",
        "Dataset B uses simple popularity metrics based on rating counts, while Dataset A defines popularity through multiple engagement metrics (likes, comments, followers).",
        "Dataset B includes general movie attributes (release year, genre) in results, while Dataset A frequently requests specific platform metadata (URLs, list IDs, timestamps).",
        "Dataset B asks about aggregate movie statistics without user context, while Dataset A frequently ties questions to specific user IDs or user-generated content.",
        "Dataset B contains explicit numerical range queries (e.g., 'greater than 8 out of 10') while Dataset A uses relative thresholds (e.g., 'more than 13000 popularity number').",
        "Dataset B includes platform structure questions (e.g., 'how many lists are there for a specific user?') absent from Dataset A.",
        "Dataset B uses simplified popularity comparisons while Dataset A requires combined popularity/quality metrics (e.g., 'most popular movie had... lower than 3 ratings').",
        "Dataset B asks for basic director filmographies while Dataset A requires director performance analysis (e.g., 'most popular movie and its average rating')."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in B are exclusively single-part requests, whereas A often includes multi-part questions requiring multiple data points in the response.",
        "B does not reference temporal conditions (e.g., 'after year X' or 'longest period without updates'), while A frequently includes time-based filters or comparisons.",
        "B lacks explicit references to user eligibility statuses (e.g., 'subscribers,' 'trialists'), which A explicitly incorporates into conditional logic.",
        "B does not involve percentage calculations (e.g., 'percentage of ratings by subscribers'), whereas A includes proportional or ratio-based metrics.",
        "B omits references to directors or URLs as data fields, which A explicitly includes in some queries.",
        "B does not require joins across more than two entities (e.g., movies + ratings + users + lists), while A often involves multi-entity joins with layered conditions.",
        "B focuses on basic aggregations (e.g., total counts, averages), while A uses advanced aggregations (e.g., nested averages, combined metrics like 'average number of movies added and count of 5-star ratings').",
        "B does not reference user-generated interactions beyond ratings (e.g., comments on lists/critics, likes on reviews), which A includes as explicit metrics.",
        "B contains repetitive query structures (e.g., repeated requests for 'highest rating score'), while A exhibits greater syntactic diversity in phrasing.",
        "B avoids compound conditional logic (e.g., 'movies released in 1995 with popularity >13000 and least ratings'), relying instead on single-condition filters."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in B do not require multiple aggregated metrics (e.g., average *and* count) in a single question, unlike A.",
        "B focuses on retrieving singular titles or identifiers (e.g., 'title of the movie') without additional contextual metrics (e.g., 'number of followers') required in A.",
        "B uses simpler comparative terms like 'highest' or 'lowest' without granular thresholds (e.g., 'more than 13000 popularity'), which are common in A.",
        "B does not involve temporal calculations (e.g., 'longest period of time since last update') beyond basic date filters like 'last year'.",
        "B lacks explicit percentage calculations (e.g., 'percentage of subscribers') that require division operations, which are frequent in A.",
        "B rarely combines user metadata (e.g., 'subscriber') with multi-step filtering (e.g., 'eligible for trial when creating a list'), unlike A.",
        "B does not require nested joins (e.g., users-to-ratings-to-comments) or indirect relationships (e.g., 'director\u2019s most popular movie') seen in A.",
        "B omits explicit requests for URLs or direct resource links (e.g., 'URL to the rating on Mubi'), which are common in A.",
        "B avoids compound questions (e.g., 'state how long it has not been updated') that demand multiple outputs in a single query, unlike A.",
        "B uses table-specific qualifiers (e.g., 'in the `lists` table') for disambiguation, whereas A assumes unified entity relationships without explicit table references."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B questions explicitly reference specific database table names (e.g., 'lists', 'Ratings', 'lists_users'), while Dataset A does not explicitly name tables.",
        "Dataset B queries are more repetitive and formulaic (e.g., multiple instances of 'movie with the highest rating'), while Dataset A uses diverse phrasing for similar intents.",
        "Dataset B lacks questions involving URL retrieval, whereas Dataset A explicitly requests URLs for entries like movies or ratings.",
        "Dataset B focuses on simple top-N rankings (e.g., 'top 3', 'top 5', 'top 10') without layered conditions, while Dataset A combines rankings with secondary metrics (e.g., popularity + ratings + year).",
        "Dataset B omits percentage calculations and multi-metric combinations (e.g., 'average score AND release year') present in Dataset A.",
        "Dataset B uses generic user identifiers (e.g., 'user1', 'user ID 1') instead of specific numeric user IDs like Dataset A.",
        "Dataset B includes genre-based filters (e.g., 'Horror' genre) absent in Dataset A's samples.",
        "Dataset B lacks temporal duration calculations (e.g., 'how long it has not been updated') present in Dataset A.",
        "Dataset B does not require counting interactions tied to specific user actions (e.g., 'likes received after rating a movie'), unlike Dataset A.",
        "Dataset B queries often specify exact numerical thresholds (e.g., 'list_movie_number > 5', '50 movies') without dynamic or relative criteria (e.g., 'most comments', 'longest period') used in Dataset A."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B queries often request a single aggregate value without requiring multiple interrelated metrics in the same question (e.g., no combined requests for averages and counts).",
        "Dataset B does not reference URLs (e.g., 'url of the movie') in any query, unlike Dataset A.",
        "Dataset B uses generic placeholders like 'specific user_id' instead of explicit user IDs (e.g., 'user 8516503') seen in Dataset A.",
        "Dataset B lacks references to user roles or statuses like 'trialists,' 'subscribers,' or 'critics' as filtering criteria.",
        "Dataset B focuses on list follower counts rather than list metadata (e.g., creation date, update duration, or creator eligibility) emphasized in Dataset A.",
        "Dataset B does not include temporal granularity in filters (e.g., no specific years like '2019' or phrases like 'for the longest period of time').",
        "Dataset B omits popularity metrics (e.g., 'popularity number') as a condition or metric in queries.",
        "Dataset B does not reference movie directors, creators, or film-specific attributes beyond titles and release years.",
        "Dataset B avoids multi-condition comparisons (e.g., 'most popular movie with ratings < 3') in favor of simpler filters (e.g., 'rating > 8').",
        "Dataset B does not query ordinal rankings (e.g., 'third movie directed by...') or positional metrics like 'least ratings.'"
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries do not reference specific app names or titles in their questions, unlike Dataset A which frequently includes direct app name references (e.g., \"Brit + Co\", \"Garden Coloring Book\").",
        "Dataset B focuses on thresholds for install counts (e.g., '>10,000 installs') as explicit filtering criteria, whereas Dataset A emphasizes extremes (e.g., 'most no comment reviews', 'highest/lowest sentiment polarity').",
        "Dataset B includes queries about app price types (e.g., 'free type', 'have a price of free') as standalone conditions, while Dataset A ties price type to metrics like install counts or ratings.",
        "Dataset B lacks explicit questions about translated reviews or review text content, unlike Dataset A which frequently requests translated reviews or specific comment listings.",
        "Dataset B does not include questions about app update timelines (e.g., 'not been updated since 2015'), whereas Dataset A explicitly queries update recency and ties it to sentiment analysis.",
        "Dataset B omits granular sentiment classifications (e.g., 'neutral', 'pretty positive favorability') found in Dataset A, focusing instead on binary positive/negative thresholds like 'sentiment polarity more than 0.3'.",
        "Dataset B uses percentage calculations exclusively for install/rating combinations (e.g., '% of games with X installs and Y rating'), while Dataset A applies percentages to sentiment comparisons and update statuses.",
        "Dataset B queries category-level averages (e.g., 'average rating of apps in HOUSE_AND_HOME category') more frequently than Dataset A, which prioritizes app-specific averages.",
        "Dataset B includes direct comparisons between categories (e.g., 'top three categories with highest rating'), whereas Dataset A focuses on rankings within a single category.",
        "Dataset B does not request version numbers or technical metadata (e.g., 'current version') that appear in Dataset A questions."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries explicitly reference the 'playstore' table or context (e.g., 'in the playstore table'), while Dataset A questions omit platform references",
        "Dataset B includes queries about 'None' ratings (missing/null values), which never appear in Dataset A's samples",
        "Dataset A consistently requests translated reviews and specific review text analysis, while Dataset B only occasionally asks to 'list one review' without translation requirements",
        "Dataset B queries frequently repeat identical statistical requests (e.g., multiple variations of 'average rating in X category') with minimal contextual variations",
        "Dataset A requires combining sentiment analysis metrics (polarity scores, subjectivity) with technical attributes in 100% of samples, while Dataset B only references sentiment in 3/30 samples",
        "Dataset A questions specify version numbers and update timelines (e.g., 'not updated since 2015'), which are completely absent from Dataset B",
        "Dataset B uses simpler filter chaining (typically 1-2 conditions), while Dataset A regularly combines 3+ attributes (e.g., rating + size + sentiment + installs + update status)",
        "Dataset A explicitly requests percentage calculations and ratio comparisons in 40% of samples, compared to 0% in Dataset B",
        "Dataset B contains duplicate/rephrased questions about the same metric (e.g., 3 identical 'average rating in FAMILY category' variations), while Dataset A maintains unique contextual combinations",
        "Dataset A requires age group targeting analysis and content rating correlations (e.g., 'teens', 'adults only 18+') that never appear in Dataset B"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B's questions do not reference specific app names (e.g., \"Brit + Co\"), while Dataset A frequently queries details for explicitly named apps.",
        "Dataset B does not include technical metadata queries (e.g., app size, version number), whereas Dataset A explicitly asks for these details.",
        "Dataset A explicitly requests translated reviews (e.g., \"state the translated review\"), while Dataset B refers to reviews without translation requirements.",
        "Dataset B aggregates metrics across categories (e.g., \"top 5 categories by average rating\"), while Dataset A focuses on app-level rankings within categories.",
        "Dataset A includes demographic targeting criteria (e.g., age groups) in queries, which are absent in Dataset B.",
        "Dataset A references sentiment subjectivity scores (e.g., \"highest total Sentiment subjectivity score\"), while Dataset B only involves sentiment polarity.",
        "Dataset A explicitly quantifies neutral sentiment (e.g., \"neutral reviews\"), whereas Dataset B focuses solely on positive/negative sentiment comparisons.",
        "Dataset A filters queries by content ratings (e.g., \"adults only 18+\"), a feature absent in Dataset B.",
        "Dataset B structures multi-criteria queries hierarchically (e.g., \"top 5 categories > apps under each category\"), while Dataset A uses flat, attribute-based filters.",
        "Dataset A includes queries about the lowest-rated apps (e.g., \"top 5 lowest rated puzzle games\"), while Dataset B exclusively focuses on top-ranked/highest-performing apps."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B queries explicitly request SQL code generation (e.g., 'Write a SQL query that returns...'), while A focuses solely on natural language answers",
        "Dataset B emphasizes category-level aggregate metrics (e.g., 'average rating of apps in [category]') rather than individual app sentiment analysis present in A",
        "Dataset B focuses on review quantity thresholds (e.g., 'apps with >10,000 reviews') without examining review content/quality aspects like sentiment polarity that are central to A",
        "Dataset B queries frequently combine category filters with numerical thresholds (e.g., 'Tools category with rating >4.5') without the sentiment+category combinations seen in A",
        "Dataset B lacks references to sentiment subjectivity scores, translated reviews, or comment sentiment classification that are prevalent in A's queries",
        "Dataset B contains no questions about app update dates, version numbers, or content rating age groups that are common in A",
        "Dataset B queries never mention install counts/size metrics that are frequently combined with other criteria in A",
        "Dataset B focuses on simple average ratings rather than A's complex sentiment comparisons (e.g., 'percentage with more positive than negative sentiment')",
        "Dataset B emphasizes ranking by review quantity (e.g., 'top 5 most reviewed apps') rather than A's focus on sentiment extremes and comment analysis",
        "Dataset B queries never request percentages of sentiment distribution or direct comparisons between positive/neutral/negative ratios that are characteristic of A"
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B does not mention or query sentiment polarity scores (e.g., '-1 sentiment polarity', 'average sentiment polarity') present in all Dataset A samples.",
        "Dataset B does not include requests for translated reviews (e.g., 'state the translated review of each app') seen in Dataset A queries.",
        "Dataset B lacks questions about sentiment subjectivity scores (e.g., 'highest total Sentiment subjectivity score') featured in Dataset A.",
        "Dataset B does not combine sentiment analysis with specific metrics like app size or user demographics (e.g., 'size no more than 1.0 M' + sentiment) as seen in A.",
        "Dataset B focuses on simple aggregations (e.g., 'average rating of apps in X category') without multi-layered conditions (e.g., 'apps with 4.7 rating having more positives than negatives') common in A.",
        "Dataset B does not ask for percentage comparisons between positive/negative sentiments (e.g., 'percentage of application [...] having more positives sentiment than negative') found in A.",
        "Dataset B includes broad questions like 'How many apps are available in the Play Store?' absent in A's category/app-specific focus.",
        "Dataset B does not query age-specific targeting (e.g., 'targeted to teens') or content rating correlations (e.g., 'adults only 18+' installs + reviews) present in A.",
        "Dataset B repeats identical question structures across categories (e.g., repetitive 'average rating of apps in [CATEGORY]') unlike A's varied analytical angles.",
        "Dataset B never references sentiment-based rankings (e.g., 'highest amount of -1 sentiment polarity score') or sentiment-driven app lists seen in A."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries focus on average ratings across categories or genres, while A requests specific app ratings or sentiment scores.",
        "B uses uppercase category names with underscores (e.g., 'NEWS_AND_MAGAZINES'), whereas A uses standard capitalization (e.g., 'Puzzle').",
        "B includes explicit references to app stores (e.g., 'Play Store'), while A omits platform mentions.",
        "A combines sentiment analysis (polarity/subjectivity scores) with ratings in queries, whereas B only references numerical ratings.",
        "B uses simpler aggregation thresholds (e.g., 'apps with more than 50,000 installs'), while A employs more complex numerical ranges (e.g., '1,000,000,000+ installs').",
        "A requires hybrid metrics (e.g., percentage ratios of positive sentiment), while B focuses exclusively on average rating calculations.",
        "B queries frequently compare paid vs. free apps within categories, while A focuses on free apps without explicit payment comparisons.",
        "A includes app technical attributes (size, version, update year) in filters, while B only references category and install count attributes.",
        "B uses the phrase 'top-rated apps' without specifying ranking criteria, while A explicitly defines ranking parameters (e.g., 'most reviews').",
        "A incorporates review text analysis requirements (translated reviews, comment counts), while B focuses purely on numerical/metric analysis."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B queries consistently reference the 'PlayStore' as the data source, while Dataset A does not mention any specific platform.",
        "Dataset B focuses exclusively on average ratings and popularity metrics, whereas Dataset A includes sentiment analysis components like polarity scores, subjectivity scores, and sentiment classifications (neutral/negative).",
        "Dataset B uses price constraints (e.g., 'price less than 5 dollars') in queries, which never appear in Dataset A.",
        "Dataset A explicitly requests translated user reviews or comment content, while Dataset B never addresses review text or translations.",
        "Dataset A includes metadata filters for app size and content ratings (e.g., 'adults only 18+'), while Dataset B lacks these dimensions.",
        "Dataset B queries aggregate data at the platform-wide level (e.g., 'all apps in the PlayStore'), whereas Dataset A focuses on specific apps or narrowly defined groups.",
        "Dataset A combines multiple metrics in single questions (e.g., rating + sentiment count), while Dataset B questions typically isolate single metrics like average rating.",
        "Dataset A references age-specific targeting (e.g., 'teens') in queries, a feature absent in Dataset B.",
        "Dataset B emphasizes popularity tiers (e.g., '10 million installs') as standalone criteria, while Dataset A ties install counts to sentiment analysis outcomes.",
        "Dataset A includes temporal filters related to app updates (e.g., 'not updated since 2015'), whereas Dataset B lacks time-based constraints."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B queries do not reference specific numerical install ranges (e.g., '1,000,000,000+ installs') present in Dataset A.",
        "Dataset B does not use granular sentiment polarity scores (-1 to 1 scale) or sentiment subjectivity metrics seen in Dataset A, instead using categorical sentiment labels (positive/negative/neutral).",
        "Dataset B queries lack references to translated reviews (e.g., 'state the translated review of each app') found in Dataset A.",
        "Dataset B does not include metadata filters like app update years, content ratings (e.g., 'adults only 18+'), or app versions present in Dataset A.",
        "Dataset B uses simpler numerical thresholds (e.g., '>4.0') compared to Dataset A's mixed formats (e.g., '4.5+', exact values like '3.9').",
        "Dataset A queries target age demographics (e.g., 'teens') while Dataset B never references user demographics.",
        "Dataset B does not mention app size constraints (e.g., 'size no more than 1.0 M') present in Dataset A queries.",
        "Dataset A calculates percentage ratios of sentiment comparisons (e.g., 'percentage of applications... having more positive sentiment'), which is absent in Dataset B.",
        "Dataset B queries focus on basic aggregations (e.g., average ratings, review counts) without combining multiple metrics like Dataset A's 'top 5 lowest rated puzzle games and count negative sentiments'.",
        "Dataset B shows repeated, simplified phrasing (e.g., multiple variations of 'average rating of apps in X category') while Dataset A uses more diverse, complex query structures."
      ]
    }
  },
  "diffs_real_from_synth": {
    "computer_student": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries that calculate percentages and ratios explicitly (e.g., 'Calculate the percentage...'), whereas A focuses on totals or counts without proportional metrics.",
        "B requires partial or sampled data outputs (e.g., 'List any five...'), while A demands complete listings (e.g., 'List all...').",
        "B uses numerical range filters (e.g., 'course ID from 121 to 130') for queries, unlike A, which filters by discrete values.",
        "B combines multiple distinct attributes in output requirements (e.g., 'professor ID and position'), while A typically retrieves single attributes or homogeneous lists.",
        "B contains boolean (yes/no) verification queries (e.g., 'Is the teacher...?'), which are absent in A.",
        "B employs logical OR conditions in categorical filters (e.g., 'professional or master/graduate'), whereas A uses singular or hyphenated categories (e.g., 'professional/master/graduate').",
        "B mandates inclusion of specific identifiers (e.g., 'Indicate each professor\u2019s unique ID') alongside aggregate results, while A separates identifier retrieval from aggregation.",
        "B applies numerical thresholds directly to ID fields (e.g., 'course of less than 10 in ID'), unlike A, which restricts thresholds to non-ID attributes like years or counts.",
        "B uses imperative verbs like 'Mention' or 'Describe' to prompt multi-attribute outputs, while A relies on interrogatives (e.g., 'What...?') for simpler responses.",
        "B frequently requests combined metric-and-identifier results (e.g., 'how many... and list IDs'), whereas A isolates metric calculations from identifier listings."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries that reference faculty positions or membership status (e.g., 'faculty member', 'non-faculty'), while dataset A does not mention faculty roles.",
        "Queries in dataset B often require combining multiple attributes in responses (e.g., 'professor ID and position'), whereas dataset A typically requests single attributes like course IDs or counts.",
        "Dataset B contains analytical queries involving percentages (e.g., 'Calculate the percentage of high-level undergraduate course') or ratios, absent in dataset A.",
        "Dataset B uses ranking clauses (e.g., 'top 5 professors') and result limits (e.g., 'List any five'), which dataset A does not employ.",
        "Dataset B includes verification questions (e.g., 'Is the teacher... a faculty member?') with yes/no answers, while dataset A focuses on retrievals without boolean checks.",
        "Queries in dataset B frequently use composite categorical filters (e.g., 'professional or master/graduate'), whereas dataset A filters by singular categories like 'Level_500'.",
        "Dataset B specifies numerical ranges (e.g., 'course ID from 121 to 130') for filtering, unlike dataset A, which uses exact IDs.",
        "Dataset B requires comparative analysis (e.g., 'professor taught the least amount of courses'), while dataset A focuses on direct counts or listings.",
        "Dataset B explicitly references ratios (e.g., 'ratio of professors and students'), a metric not present in dataset A.",
        "Queries in dataset B are structurally more complex, combining multiple conditions (e.g., faculty status and course level), whereas dataset A uses simpler, singular filters."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Queries in B explicitly mention 'professional' course levels (e.g., 'professional or master/graduate courses'), which are absent in A.",
        "B includes percentage-based calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), whereas A focuses on totals, averages, or counts without percentages.",
        "B uses exact numeric ID ranges (e.g., 'course ID from 121 to 130') for filtering, while A employs inequality-based numeric filters (e.g., 'ID greater than 10').",
        "B combines descriptive course level qualifiers (e.g., 'basic,' 'medium,' 'high-level') with numeric levels, whereas A primarily uses numeric or categorical levels (e.g., 'Level 300').",
        "B explicitly requests ratios (e.g., 'ratio of professors and students'), which A does not include.",
        "B occasionally uses 'teacher' instead of 'professor' (e.g., 'teacher no.79'), while A consistently uses 'professor.'",
        "B specifies arbitrary output limits (e.g., 'List any five of course IDs'), whereas A uses ranking-based limits (e.g., 'top 5 professors').",
        "B requires combined outputs of statuses and IDs (e.g., 'position status and IDs of professor'), while A combines course IDs with levels but not multiple professor attributes.",
        "B references non-faculty members' qualification phases (e.g., 'not undergoing the phase of qualifications'), a criterion absent in A.",
        "B includes phrases like 'employed professor in faculty' or 'faculty employee professors,' emphasizing employment terminology, whereas A uses broader terms like 'faculty affiliated position.'"
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Queries in dataset B explicitly request combined attribute outputs (e.g., 'professor ID **and** position in faculty'), while A typically retrieves single attributes or counts.",
        "Dataset B includes percentage calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), whereas A focuses exclusively on absolute counts.",
        "B uses explicit numerical range criteria for entity IDs (e.g., 'course ID from 121 to 130'), while A uses thresholds based on abstracted numerical properties (e.g., 'course levels > 500').",
        "Dataset B requires queries to return partial results (e.g., 'List any five of course IDs'), whereas A consistently retrieves complete results.",
        "B explicitly references employment roles and statuses (e.g., 'non-faculty members', 'position status'), while A refers only to generic 'faculty membership' or program phases.",
        "Queries in B combine entity roles (e.g., 'advisors who taught courses'), whereas A treats roles like 'professor' and 'advisor' as distinct entities.",
        "Dataset B includes ratio-based questions (e.g., 'ratio of professors and students'), which are absent in A.",
        "B uses non-numerical thresholds for hierarchical classifications (e.g., 'basic or medium undergraduate courses'), while A relies on numerical course levels (e.g., 'level 500').",
        "Queries in B explicitly require outputs to include unique identifiers (e.g., 'Indicate each of the professors unique identifying number'), whereas A implicitly assumes identifiers are part of results without explicit mention.",
        "Dataset B includes questions about indirect relationships (e.g., 'students advised to teach by professors teaching X courses'), while A focuses on direct relationships (e.g., 'students advised by professors')."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes queries requesting percentage calculations (e.g., 'Calculate the percentage of high-level undergraduate course'), while A focuses only on direct counts.",
        "Dataset B explicitly requires comparisons of quantities (e.g., 'most number of professors,' 'least amount of courses'), whereas A does not ask for rankings or extremes.",
        "Dataset B contains range-based filters (e.g., 'course ID from 121 to 130'), while A uses only exact numerical IDs without ranges.",
        "Dataset B incorporates ratio-based questions (e.g., 'ratio of professors and students'), which are absent in A.",
        "Dataset B specifies multi-attribute output requirements (e.g., 'course ID and the level of the course') in a single query more frequently than A, which often retrieves single attributes.",
        "Dataset B includes compound status filters (e.g., 'non-faculty members not undergoing the phase of qualifications'), whereas A uses simpler status checks (e.g., 'member of faculty or not').",
        "Dataset B uses explicit category combinations in filters (e.g., 'professional or master/graduate courses'), while A references single categories like 'graduate' without such combinations.",
        "Dataset B explicitly requests top-N results (e.g., 'top 5 professors'), a feature absent in A.",
        "Dataset B includes queries that mandate listing a specific number of results (e.g., 'List any five of course IDs'), while A does not specify output limits.",
        "Dataset B introduces multi-layered conditional filters (e.g., 'high-level or harder undergraduate courses'), whereas A\u2019s filters are simpler and single-layered (e.g., 'basic, medium')."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B includes explicit requests for percentages (e.g., 'Calculate the percentage of high-level undergraduate course'), while A focuses on counts or existence checks.",
        "Dataset B combines multiple categorical filters in a single query (e.g., 'professional or master/graduate courses'), whereas A uses singular categorical filters like course levels alone.",
        "Dataset B explicitly requests positional faculty status (e.g., 'position in faculty') alongside IDs, while A only references general faculty membership.",
        "Dataset B includes ranking queries (e.g., 'top 5 professors'), while A only asks for highest/most frequent values without explicit ranking limits.",
        "Dataset B requires multi-attribute outputs in answers (e.g., 'course ID and the level'), while A typically requests single attributes per query.",
        "Dataset B introduces phase/status qualifiers beyond faculty membership (e.g., 'not undergoing the phase of qualifications'), which A does not reference.",
        "Dataset B explicitly requests ratio calculations (e.g., 'ratio of professors and students'), while A uses simple aggregations like counts.",
        "Dataset B includes compound conditional filters (e.g., 'course ID from 121 to 130 of basic undergraduate courses'), combining ID ranges with categorical criteria in one query.",
        "Dataset B uses comparative quantifiers like 'least amount of courses' for minimization, whereas A focuses on maximization (e.g., 'highest number of courses').",
        "Dataset B specifies output formatting instructions (e.g., 'List any five of course IDs'), while A lacks explicit formatting directives in queries."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B queries frequently reference specific course classifications (e.g., 'high-level undergraduate', 'professional or master/graduate') not seen in A's simpler 'masters' or numerical level filters.",
        "Dataset B explicitly uses logical OR conditions (e.g., 'professional or master/graduate') in filters, while A uses singular categorical classifications.",
        "Dataset B includes questions about faculty employment status (e.g., 'position in faculty', 'non-faculty members') as standalone attributes, whereas A uses 'faculty member' only as a binary filter.",
        "Dataset B contains queries about student-advisor relationships (e.g., 'advised student IDs'), while A focuses purely on professor-course relationships without student advising context.",
        "Dataset B explicitly requests combined outputs (e.g., 'course ID and level') in single responses, while A typically asks for single attributes per query.",
        "Dataset B includes ratio calculations (e.g., 'ratio of professors and students') not present in A's aggregation-focused questions about counts or extremes.",
        "Dataset B uses explicit ID ranges (e.g., 'course ID from 121 to 130') for filtering, unlike A's threshold-based numerical constraints.",
        "Dataset B contains yes/no verification queries (e.g., 'Is the teacher... a faculty member?'), which are absent in A's purely informational questions.",
        "Dataset B references institutional phases (e.g., 'phase of qualifications') as filtering criteria, while A only uses temporal attributes like years in program.",
        "Dataset B explicitly requests ranked/top-N results (e.g., 'top 5 professors'), whereas A focuses on singular extremes like 'most courses taught' without ranking."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Queries in B explicitly combine multiple course level descriptors in a single condition (e.g., 'professional or master/graduate') while A uses singular level terms",
        "B includes explicit requests for ranked/top-N results (e.g., 'top 5 professors') whereas A focuses on absolute counts without ranking",
        "B contains range-based ID queries (e.g., 'course ID from 121 to 130') while A only references specific single IDs",
        "B utilizes negation in filtering conditions (e.g., 'professors who are not faculty member') while A queries only positive conditions",
        "B frequently combines multiple filters with logical operators (OR/AND) in single questions where A typically uses single-filter criteria",
        "B explicitly requests percentage calculations in output (e.g., 'Calculate the percentage') while A focuses on raw counts/proportions",
        "B specifies faculty position/status details (e.g., 'position in faculty', 'member of faculty') as discrete attributes where A uses simpler faculty membership checks",
        "B requires explicit listing of course IDs in results (e.g., 'List out all the course id') while A typically requests counts or descriptions without ID enumeration",
        "B contains ratio comparison queries (e.g., 'ratio of professors and students') absent in A's count-focused questions",
        "B defines complex subset criteria using multiple layered attributes (e.g., 'basic or medium undergraduate courses taught by faculty member') where A uses simpler single-attribute subsets"
      ]
    },
    "movie_platform": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes questions about user interactions with comments on lists (e.g., 'most comments'), which are absent in A.",
        "Dataset B references 'likes' on ratings or critic interactions (e.g., 'likes received after rating'), while A does not mention likes.",
        "Dataset B explicitly requests URLs for movies or ratings (e.g., 'URL of the movie'), a feature absent in A.",
        "Dataset B asks for popularity metrics tied to numerical thresholds (e.g., 'more than 13000 popularity number'), whereas A uses popularity qualitatively.",
        "Dataset B requires multi-part answers in single queries (e.g., 'average number of movies added... and count of 5-star ratings'), while A typically asks for single metrics.",
        "Dataset B introduces 'critics' as a distinct user role (e.g., 'critic of the movie'), unlike A, which focuses on generic users.",
        "Dataset B includes temporal constraints tied to specific user eligibility (e.g., 'eligible for trial when he created the list'), whereas A uses simpler temporal filters.",
        "Dataset B queries granular metadata like 'how long a list has not been updated,' whereas A focuses on simpler temporal attributes like creation timestamps.",
        "Dataset B references direct user interactions with ratings (e.g., 'users who loved the movie'), while A focuses on aggregated metrics like average ratings.",
        "Dataset B combines popularity metrics with rating counts (e.g., 'most popular movie and its average rating'), whereas A treats these as separate dimensions."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes user engagement metrics (e.g., likes, comments) tied to specific rating actions or critic activities, whereas A focuses on followers and subscriber status.",
        "Queries in B frequently combine multiple aggregate metrics (e.g., average number of movies and count of specific ratings) in a single question, while A typically isolates one metric per query.",
        "B uses conditional thresholds on entity attributes (e.g., 'lists with at least 200 movies') as a prerequisite for aggregation, whereas A applies thresholds directly to aggregated results (e.g., 'lists with more than 100 followers').",
        "B explicitly requests ordinal positions (e.g., 'third movie directed by X'), requiring ranking logic, while A focuses on maximum/minimum values without ordinal specificity.",
        "B references trial eligibility as a user attribute (e.g., 'users eligible for trial'), whereas A only references subscriber status.",
        "B calculates temporal durations (e.g., 'how long it has not been updated') instead of solely filtering by timestamps like in A.",
        "B requests URLs tied to user-generated ratings (e.g., 'URL to the rating on Mubi'), whereas A references entity-level URLs (e.g., movie URLs).",
        "B combines distinct metrics (e.g., 'most popular movie and its average rating') in composite results, while A typically isolates single metrics (e.g., 'most popular movie').",
        "B includes counts of specific rating scores (e.g., 'number of users who rated a movie 4'), whereas A queries maximum scores or total counts without score granularity.",
        "B calculates percentages of user subgroups (e.g., 'percentage of ratings from subscribers'), whereas A focuses on absolute counts or averages."
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B includes queries about user interactions with ratings, such as likes on critic reviews or comments on lists, which are absent in Dataset A.",
        "Dataset B requests percentage-based metrics (e.g., 'percentage of subscribers who rated a movie') while Dataset A focuses on absolute counts or averages.",
        "Dataset B explicitly asks for composite answers combining multiple attributes (e.g., 'most popular movie and its average rating'), whereas Dataset A typically requests singular values or lists.",
        "Dataset B requires temporal durations (e.g., 'how long it has not been updated') in answers, while Dataset A uses temporal constraints only for filtering, not as output metrics.",
        "Dataset B references 'critics' and their activity (e.g., likes on critiques), whereas Dataset A focuses on generic users or subscribers without specialized roles.",
        "Dataset B includes queries about ordinal positions (e.g., 'third movie directed by...') combined with user identifiers, which Dataset A does not address.",
        "Dataset B asks for URLs tied to specific user ratings (e.g., URLs for ratings with a certain number of likes), while Dataset A requests URLs for movies or lists only.",
        "Dataset B combines numerical thresholds with qualitative descriptors (e.g., 'lower than 3 ratings' for the 'most popular movie'), whereas Dataset A uses thresholds independently.",
        "Dataset B explicitly requests the inclusion of secondary metrics in answers (e.g., 'Indicate how many movies did he/she give a rating score of 5'), whereas Dataset A typically asks for standalone values.",
        "Dataset B uses phrases like 'loves the movie' or 'to the highest extent' to describe user preferences, introducing subjective language absent in Dataset A's queries."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B includes questions about user-specific interactions (e.g., likes, comments) tied to individual critics or users, while A focuses on general user aggregates without granular user actions.",
        "Dataset B explicitly references user IDs (e.g., user 8516503) in queries, whereas A does not mention specific user identifiers.",
        "Dataset B requires calculations involving user statuses (e.g., subscribers vs. trialists), while A does not differentiate user types in its questions.",
        "Dataset B asks for temporal metrics related to list updates (e.g., 'longest period without updates'), whereas A focuses on list creation timestamps only.",
        "Dataset B includes percentage-based queries (e.g., '% of ratings by subscribers'), while A uses absolute numerical thresholds exclusively.",
        "Dataset B references specific movie titles (e.g., 'Apocalypse Now') and directors (e.g., Abbas Kiarostami) in questions, while A uses generic references (e.g., 'movies directed by Christopher Nolan').",
        "Dataset B requires URL retrieval tied to specific user ratings (e.g., 'URL to the rating on Mubi'), while A asks for generic metadata URLs without user associations.",
        "Dataset B combines multiple conditions in single queries (e.g., 'movies released in 2003 rated by user 2941'), whereas A typically isolates filtering criteria (e.g., 'movies released after 2000').",
        "Dataset B quantifies qualitative metrics like 'loves the movie' (e.g., 'number of Mubi users who love the movie'), while A uses standardized metrics like 'average rating score'.",
        "Dataset B includes platform-specific terminology (e.g., 'Mubi user', 'eligible for trial'), whereas A remains platform-agnostic in its phrasing."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in dataset B frequently request multiple distinct data fields or metrics in a single question (e.g., average score and release year, title and user IDs).",
        "Dataset B includes questions requiring percentage calculations or proportional metrics (e.g., '% of ratings by subscribers'), absent in A.",
        "Dataset B references granular user statuses or classifications (e.g., 'trialists,' 'subscribers') not present in A.",
        "Dataset B uses multi-layered conditional thresholds (e.g., 'lists with \u2265200 movies') prior to aggregation, unlike A's simpler filters.",
        "Dataset B incorporates entities like directors, critics, or user eligibility states, expanding join complexity beyond A's scope.",
        "Dataset B includes user interaction metrics beyond ratings (e.g., comments on lists, likes on critiques) not seen in A.",
        "Dataset B requires temporal duration calculations (e.g., 'how long since last update'), whereas A uses fixed date comparisons.",
        "Dataset B often involves hierarchical filtering (e.g., 'from movies with >13k popularity, find least-rated'), unlike A's single-step logic.",
        "Dataset B combines unrelated data fields (e.g., 'movie title and number of users who love it') in single requests, unlike A's singular focus.",
        "Dataset B includes state-dependent conditions (e.g., 'users eligible for trial when creating a list') absent in A's static filters."
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Queries in B require explicit percentage calculations (e.g., division of counts) not present in A.",
        "B includes references to user engagement metrics like comments on lists or likes on ratings, absent in A.",
        "B's questions frequently demand multiple aggregated results (e.g., average + count) in a single query, unlike A's single-metric focus.",
        "B involves temporal filters on user-generated actions (e.g., rating dates) rather than A's entity-centric dates (e.g., movie releases).",
        "B references director entities, requiring joins with director tables, a field absent in A.",
        "B requires URLs specific to user-generated ratings (e.g., individual rating pages), whereas A uses general movie/list URLs.",
        "B includes explicit duration calculations (e.g., time since last list update), absent in A's date filters.",
        "B checks user eligibility status (e.g., trialists) at the time of actions (e.g., rating), unlike A's static user filters.",
        "B uses HAVING clauses to filter aggregated results (e.g., lists with \u2265200 movies), while A applies WHERE clauses to raw fields.",
        "B combines multi-table criteria (e.g., director + release year + user ratings) in single queries, unlike A's simpler joins."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Dataset B includes queries requesting URLs (e.g., 'URL to the rating on Mubi') not present in A.",
        "Dataset B explicitly references directors (e.g., 'directed by director Abbas Kiarostami') in queries, while A does not.",
        "Dataset B specifies exact user IDs (e.g., 'user 8516503') in conditions, whereas A uses generic identifiers like 'user1'.",
        "Dataset B includes metrics about user interactions with ratings (e.g., 'likes', 'comments') absent in A.",
        "Dataset B asks for percentages (e.g., 'percentage of subscribers who rated') as standalone metrics, while A focuses on counts/aggregates without explicit percentages.",
        "Dataset B requires multi-part answers (e.g., 'Indicate how many movies did he/she give a rating score of 5') combining multiple metrics in a single response, unlike A's single-output queries.",
        "Dataset B references critic-specific actions (e.g., 'critic on which film got 1 like') not mentioned in A.",
        "Dataset B uses exact numerical thresholds (e.g., 'more than 13000 popularity number') in popularity filters, whereas A uses relative terms like 'highest' or 'most popular'.",
        "Dataset B explicitly ties movies to release years in filtering (e.g., 'released in 2003'), while A uses years primarily for temporal comparisons like update timestamps.",
        "Dataset B combines popularity metrics with rating counts (e.g., 'most popularity movie had... lower than 3 ratings') in compound conditions, a pattern not seen in A."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Queries in dataset B specifically request URLs related to movies or ratings (e.g., 'Provide the url of the movie'), whereas A does not reference URLs.",
        "Dataset B includes questions about directors (e.g., 'directed by director Abbas Kiarostami'), while A does not mention directors as a filtering or query criterion.",
        "B explicitly references user interactions with ratings, such as likes or comments on critics (e.g., 'critic of the movie ... received after ... a rating score'), which A does not include.",
        "B combines multiple distinct metrics in a single query (e.g., 'average number of movies ... Indicate how many movies ...'), while A focuses on single metrics per question.",
        "Dataset B references platform-specific terminology (e.g., 'Mubi user', 'rating on Mubi'), whereas A uses generic terms like 'database' or 'users'.",
        "B includes popularity metrics (e.g., 'more than 13000 popularity number') as query conditions, which are absent in A.",
        "Queries in B explicitly ask for time durations (e.g., 'how long it has not been updated'), whereas A uses temporal filters only for release years or rating dates.",
        "B references user eligibility status (e.g., 'eligible for trial'), while A uses broader demographic filters like 'payment method' or 'subscribers'.",
        "Dataset B includes list-specific attributes (e.g., 'number of movies in the list') as query criteria, whereas A focuses on list followers or creation dates.",
        "B requests director names or links between movies and directors (e.g., 'Jeannot Szwarc's most popular movie'), which A does not address."
      ]
    },
    "app_store": {
      "llama3.1-8b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B queries specific review attributes like comment presence (e.g., 'how many reviews have a comment?'), while A does not mention review content details.",
        "Dataset B explicitly requests translated reviews (e.g., 'state the translated review of each app'), whereas A never references translated text.",
        "Dataset B includes queries about app update timelines (e.g., 'not been updated since 2018'), while A lacks references to app version or update dates.",
        "Dataset B asks for exact counts of sentiment categories (e.g., 'neutral reviews'), whereas A focuses on average sentiment polarity/subjectivity scores.",
        "Dataset B references app versions (e.g., 'current version'), a detail absent in A\u2019s questions.",
        "Dataset B links sentiment polarity thresholds to specific apps (e.g., 'highest sentiment polarity score of Cooking Fever'), while A ties sentiment to broader categories.",
        "Dataset B queries percentage ratios of sentiment comparisons (e.g., 'percentage of apps with 4.7 rating having more positives than negatives'), whereas A calculates percentages based on installs or ratings alone.",
        "Dataset B targets extremes in sentiment scores (e.g., 'lowest sentiment polarity score'), while A focuses on average or threshold-based sentiment metrics.",
        "Dataset B combines sentiment analysis with app metadata in single queries (e.g., 'average sentiment polarity score... and its genre'), whereas A typically separates these into distinct conditions.",
        "Dataset B explicitly requests lists of user comments (e.g., 'list all negative comments'), while A never asks for raw review text."
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_test-time-info_v1": [
        "Dataset B includes queries requesting specific user review text or translated comments (e.g., 'List all negative comments'), while A only asks for sentiment counts",
        "Dataset B contains questions about percentage calculations (e.g., 'percentage of applications'), while A focuses on absolute counts and averages",
        "Dataset B explicitly references sentiment subjectivity scores and exact polarity score thresholds (-1 to 1 scale), while A uses generic sentiment labels (positive/negative)",
        "Dataset B queries app version information and update recency (e.g., 'not updated since 2018'), while A doesn't reference version history",
        "Dataset B uses precise install ranges (e.g., '1,000,000,000+ installs') compared to A's simpler thresholds (>1,000,000)",
        "Dataset B includes demographic targeting queries (e.g., 'age group targeted'), while A focuses on general content ratings",
        "Dataset B combines temporal filters with sentiment analysis (e.g., 'not updated since 2015' + sentiment ratios), which A never does",
        "Dataset B requires identification of apps with extreme sentiment metrics (e.g., 'highest total Sentiment subjectivity score'), while A focuses on basic rating extremes",
        "Dataset B explicitly references non-English review translation analysis ('state the translated review'), while A doesn't mention language aspects",
        "Dataset B contains multi-criteria size comparisons combined with sentiment (e.g., 'size \u22641.0M' + 'pretty positive favorability'), while A uses size only in basic filters"
      ],
      "llama3.1-8b_1000_few-shot_bg_v1": [
        "Dataset B includes queries requesting exact numerical counts of reviews/comments with specific sentiment labels (e.g., neutral, negative), while A focuses on averages or percentages",
        "Dataset B's questions reference specific technical details like app version numbers, which are absent in A's queries",
        "Dataset B contains questions about demographic targeting (e.g., age groups, teens), while A does not mention user demographics",
        "Dataset B explicitly asks for counts of apps with exact numeric ratings (e.g., 'How many apps have rating of 5?'), whereas A uses threshold-based rating filters (e.g., \u22654.5)",
        "Dataset B combines numerical metrics with categorical/demographic attributes (e.g., sentiment score + age group) in single queries, while A focuses on combining numerical criteria without demographic cross-references",
        "Dataset B includes queries about sentiment subjectivity scores as standalone metrics (e.g., 'highest total Sentiment subjectivity'), while A focuses primarily on polarity scores",
        "Dataset B uses billion-scale install thresholds (e.g., '1,000,000,000+ installs') in some queries, exceeding A's maximum threshold of >100 million installs",
        "Dataset B specifies granular user groups in sentiment analysis (e.g., 'people who dislikes the app pretty much'), while A uses broad sentiment categories",
        "Dataset B contains content rating filters (e.g., 'adults only 18+') not present in A's queries",
        "Dataset B explicitly pairs technical attributes (e.g., app size) with user sentiment counts in queries, while A uses technical attributes primarily as rating filters"
      ],
      "qwen2.5-coder-7b_1000_few-shot_bg_v1": [
        "Dataset B queries explicitly request details about individual reviews (e.g., comment lists, translated reviews, negative/neutral comments) while A focuses only on aggregated review metrics",
        "Dataset B contains queries about app update timelines (e.g., 'not updated since 2018/2015') which are absent in A",
        "Dataset B includes percentage-based outcomes (e.g., 'percentage of applications') while A focuses on absolute counts/averages",
        "Dataset B references specific technical metadata like app size ('size of Browser 4G') and version numbers ('current version') not seen in A",
        "Dataset B queries frequently combine sentiment analysis with demographic targeting (e.g., 'age group targeted at', 'teens') unlike A",
        "Dataset B explicitly mentions multilingual aspects through 'translated review' requirements not present in A",
        "Dataset B contains queries about app pricing status (e.g., 'free apps') as a filter criterion absent in A's samples",
        "Dataset B includes exact sentiment score thresholds (-1 polarity) and subjectivity scores (\u22640.5) while A uses broader sentiment categories",
        "Dataset B requires identification of apps by specific content ratings like 'adults only 18+' rather than just general categories",
        "Dataset B queries explicitly count neutral sentiment reviews as distinct category, while A only references positive sentiment generally"
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries explicitly reference sentiment analysis metrics (e.g., 'sentiment polarity score', 'sentiment subjectivity score'), while A only implies sentiment through generic terms like 'positive reviews'",
        "Dataset B includes requests for translated reviews (e.g., 'state the translated review of each app'), which never appear in A's questions",
        "Dataset B contains questions about comment presence/absence in reviews (e.g., 'reviews have a comment'), while A only asks for review counts without text analysis",
        "Dataset B requires percentage calculations combining multiple metrics (e.g., 'percentage of application with 4.7 rating having more positives sentiment than negative') that don't exist in A",
        "Dataset B asks for direct listing/display of review text content (e.g., 'List all of its reviews'), whereas A only queries quantitative aspects of reviews",
        "Dataset B combines sentiment analysis with demographic targeting (e.g., 'Indicate the age group that the app is targeted at') in single queries, which A never does",
        "Dataset B uses explicit sentiment subjectivity measurements (e.g., 'highest total Sentiment subjectivity score'), a dimension absent from A's questions",
        "Dataset B includes multi-component comparisons within queries (e.g., 'percentage ratio of positive sentiment reviews') that require ratio calculations beyond A's simple averages",
        "Dataset B queries combine version/content rating filters with text analysis (e.g., 'content rating of adults only 18+ and translated reviews'), while A keeps these aspects separate",
        "Dataset B specifically requests identification of extreme sentiment cases (e.g., 'highest amount of -1 sentiment polarity score'), whereas A only deals with basic positive/neutral counts"
      ],
      "llama3.1-8b_1000_zero-shot_bg_test-time-info_v1": [
        "Dataset B queries specifically analyze user reviews and comments (e.g., counting comments, listing reviews), while A focuses only on numerical ratings.",
        "B includes sentiment polarity and subjectivity scores as metrics, whereas A does not mention sentiment analysis.",
        "B requires listing or analyzing translated reviews (e.g., \"state the translated review\"), which A never references.",
        "B uses percentages or ratios (e.g., \"percentage of applications with more positive sentiment\"), while A relies solely on averages or counts.",
        "B references app metadata such as version numbers (e.g., \"current version\"), which A does not include.",
        "B incorporates app size as a filter (e.g., \"size no more than 1.0 M\"), whereas A never mentions size.",
        "B combines multiple metrics in single queries (e.g., rating + sentiment + installs), while A focuses on single attributes like rating or category.",
        "B explicitly tracks app update dates (e.g., \"not updated since 2018\"), which A does not address.",
        "B quantifies specific sentiment types (e.g., \"neutral reviews\"), while A lacks sentiment granularity.",
        "B uses sentiment subjectivity scores (e.g., \"sentiment subjectivity of no more than 0.5\"), which are absent in A."
      ],
      "llama3.1-8b_1000_zero-shot_bg_v1": [
        "Queries in dataset B explicitly request counts of user reviews with specific sentiment attributes (e.g., 'neutral reviews', 'negative comments'), while A focuses on aggregated sentiment analysis without granular sentiment-type counts.",
        "Dataset B includes direct references to app names (e.g., 'Brit + Co', 'Dragon Ball Legends') in queries, whereas A primarily uses general categorical filters without naming specific apps.",
        "Queries in B frequently combine sentiment metrics with non-sentiment app metadata (e.g., 'size', 'current version', 'content rating') in the same question, unlike A's singular focus on ratings or installs paired with categories.",
        "Dataset B explicitly references 'translated reviews' as a data point, which does not appear in A's queries.",
        "B contains queries about the temporal aspect of app updates (e.g., 'not updated since 2015'), while A lacks time-based update criteria.",
        "Dataset B explicitly quantifies 'sentiment subjectivity scores' and ties them to specific apps or categories, whereas A only references general sentiment polarity.",
        "Queries in B request exact numerical values for sentiment polarity extremes (e.g., 'highest total Sentiment subjectivity score', 'lowest sentiment polarity score'), while A uses sentiment polarity only for qualitative analysis like 'positive reviews'.",
        "Dataset B includes percentage-based comparisons of sentiment ratios (e.g., 'percentage of positive sentiment reviews') as standalone metrics, unlike A's use of percentages solely in aggregated rating contexts.",
        "B requires identification of apps targeted at specific age groups (e.g., 'teens', 'adults only 18+'), a criterion absent in A's queries.",
        "Dataset B explicitly asks for the total installs of apps with specific content ratings (e.g., 'adults only 18+'), while A uses install counts only for popularity rankings or threshold filters."
      ],
      "qwen2.5-coder-7b_1000_zero-shot_bg_v1": [
        "Dataset B queries frequently include app installation counts (e.g., '1,000,000,000+ installs') as filtering criteria, while A does not reference install metrics",
        "B explicitly requests raw text outputs of reviews/comments (e.g., 'List all of its reviews'), whereas A focuses only on quantitative analysis of reviews",
        "B contains queries about app technical specifications like size ('how much is the size of Browser 4G') not present in A",
        "B incorporates content rating categories (e.g., 'adults only 18+') as filters, while A's metadata focuses on update years/versions",
        "B combines multiple distinct metrics in single queries (e.g., rating + sentiment + installs), while A typically isolates single metrics per question",
        "B specifically references translated reviews ('state the translated review'), indicating multilingual data aspects absent in A",
        "B uses more nuanced sentiment qualifiers ('pretty positive favorability', 'dislikes pretty much') beyond basic positive/negative classifications",
        "B includes temporal constraints tied to updates ('not been updated since 2015') combined with sentiment analysis, while A's temporal references are standalone",
        "B queries frequently calculate percentage ratios between metrics (e.g., 'percentage ratio of positive sentiment reviews'), whereas A focuses on absolute values",
        "B explicitly references target demographics ('age group targeted', 'teens') as analytical dimensions, which A's queries never address"
      ]
    }
  }
}