Iteration final - PROBLEM_DESCRIPTION
Sequence: 5
Timestamp: 2025-07-28 00:25:08

Prompt:
You are a business analyst creating structured optimization problem documentation.

DATA SOURCES EXPLANATION:
- FINAL OR ANALYSIS: Final converged optimization problem from alternating process (iteration 1), contains business context and schema mapping evaluation
- DATABASE SCHEMA: Current database structure after iterative adjustments  
- DATA DICTIONARY: Business meanings and optimization roles of tables and columns
- CURRENT STORED VALUES: Realistic business data generated by triple expert (business + data + optimization)
- BUSINESS CONFIGURATION: Scalar parameters and business logic formulas separated from table data

CRITICAL REQUIREMENTS: 
- Ensure problem description naturally leads to LINEAR or MIXED-INTEGER optimization formulation
- Make business context consistent with the intended decision variables and objectives
- Align constraint descriptions with expected mathematical constraints
- Ensure data descriptions map clearly to expected coefficient sources
- Maintain business authenticity while fixing mathematical consistency issues
- Avoid business scenarios that would naturally require nonlinear relationships (variable products, divisions, etc.)

AUTO-EXTRACTED CONTEXT REQUIREMENTS:
- Business decisions match expected decision variables: allocation[i] for each institution i, representing the amount of resources allocated (continuous)
- Operational parameters align with expected linear objective: maximize total_sequence_identity = ∑(protein.sequence_identity_coefficient[i] * ResourceAllocation.allocation[i])
- Business configuration includes: Total resources available for allocation (used for Constraint bound for total resources), Coefficient representing sequence identity to human proteins (used for Objective coefficient), Capacity of the building associated with institution (used for Constraint bound for building capacity)
- Use natural language to precisely describe linear mathematical relationships
- NO mathematical formulas, equations, or symbolic notation
- Present data as current operational information
- Focus on precise operational decision-making that leads to linear formulations
- Resource limitations match expected linear constraints
- Avoid scenarios requiring variable products, divisions, or other nonlinear relationships
- Include specific operational parameters that map to expected coefficient sources
- Reference business configuration parameters where appropriate

FINAL OR ANALYSIS:
{
  "database_id": "protein_institute",
  "iteration": 1,
  "business_context": "The protein institute aims to optimize the allocation of research resources across different institutions to maximize the total sequence identity to human proteins, given constraints on building capacities and institutional enrollments.",
  "optimization_problem_description": "Optimize the allocation of research resources to maximize the total sequence identity to human proteins across institutions, subject to constraints on building capacities and total available resources.",
  "optimization_formulation": {
    "objective": "maximize total_sequence_identity = \u2211(protein.sequence_identity_coefficient[i] * ResourceAllocation.allocation[i])",
    "decision_variables": "allocation[i] for each institution i, representing the amount of resources allocated (continuous)",
    "constraints": [
      "\u2211(ResourceAllocation.allocation[i]) <= business_configuration_logic.total_available_resources",
      "ResourceAllocation.allocation[i] <= building.building_capacity[i] for each institution i"
    ]
  },
  "current_optimization_to_schema_mapping": {
    "objective_coefficients": {
      "sequence_identity_coefficient[i]": {
        "currently_mapped_to": "protein.sequence_identity_coefficient",
        "mapping_adequacy": "good",
        "description": "Coefficient representing sequence identity to human proteins for institution i"
      }
    },
    "constraint_bounds": {
      "total_available_resources": {
        "currently_mapped_to": "business_configuration_logic.total_available_resources",
        "mapping_adequacy": "good",
        "description": "Total resources available for allocation"
      },
      "building_capacity[i]": {
        "currently_mapped_to": "building.building_capacity",
        "mapping_adequacy": "good",
        "description": "Capacity of the building associated with institution i"
      }
    },
    "decision_variables": {
      "allocation[i]": {
        "currently_mapped_to": "ResourceAllocation.allocation",
        "mapping_adequacy": "good",
        "description": "Amount of resources allocated to institution i",
        "variable_type": "continuous"
      }
    }
  },
  "missing_optimization_requirements": [],
  "iteration_status": {
    "complete": true,
    "confidence": "high",
    "next_focus": "Ready for convergence"
  }
}

FINAL DATABASE SCHEMA:
```sql
-- Iteration 1 Database Schema
-- Objective: Schema changes include creating new tables for missing optimization data, modifying existing tables to improve mapping adequacy, and updating business configuration logic for scalar parameters and formulas.

CREATE TABLE ResourceAllocation (
  institution_id INTEGER,
  allocation FLOAT
);

CREATE TABLE protein (
  sequence_identity_coefficient FLOAT
);

CREATE TABLE building (
  building_capacity INTEGER
);


```

CURRENT STORED VALUES:
```sql
-- Iteration 1 Realistic Data
-- Generated by triple expert (business + data + optimization)
-- Values were determined based on typical research resource allocation scenarios, ensuring that the total resources and building capacities align with realistic institutional capabilities.

-- Realistic data for ResourceAllocation
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (1, 150.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (2, 250.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (3, 200.0);

-- Realistic data for protein
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.85);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.9);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.8);

-- Realistic data for building
INSERT INTO building (building_capacity) VALUES (600);
INSERT INTO building (building_capacity) VALUES (700);
INSERT INTO building (building_capacity) VALUES (500);


```

DATA DICTIONARY:
{
  "tables": {
    "ResourceAllocation": {
      "business_purpose": "Represents the allocation of resources to each institution",
      "optimization_role": "decision_variables",
      "columns": {
        "institution_id": {
          "data_type": "INTEGER",
          "business_meaning": "Unique identifier for each institution",
          "optimization_purpose": "Links allocation to specific institutions",
          "sample_values": "1, 2, 3"
        },
        "allocation": {
          "data_type": "FLOAT",
          "business_meaning": "Amount of resources allocated to the institution",
          "optimization_purpose": "Decision variable for resource allocation",
          "sample_values": "100.0, 200.0, 300.0"
        }
      }
    },
    "protein": {
      "business_purpose": "Stores protein data including sequence identity coefficients",
      "optimization_role": "objective_coefficients",
      "columns": {
        "sequence_identity_coefficient": {
          "data_type": "FLOAT",
          "business_meaning": "Coefficient representing sequence identity to human proteins",
          "optimization_purpose": "Objective coefficient",
          "sample_values": "0.8, 0.9, 0.85"
        }
      }
    },
    "building": {
      "business_purpose": "Stores building data including capacity",
      "optimization_role": "constraint_bounds",
      "columns": {
        "building_capacity": {
          "data_type": "INTEGER",
          "business_meaning": "Capacity of the building associated with institution",
          "optimization_purpose": "Constraint bound for building capacity",
          "sample_values": "500, 600, 700"
        }
      }
    }
  }
}


BUSINESS CONFIGURATION:
{
  "total_available_resources": {
    "data_type": "INTEGER",
    "business_meaning": "Total resources available for allocation",
    "optimization_role": "Constraint bound for total resources",
    "configuration_type": "scalar_parameter",
    "value": 1000,
    "business_justification": "Represents a realistic total resource pool available for allocation across institutions."
  },
  "sequence_identity_coefficient": {
    "data_type": "FLOAT",
    "business_meaning": "Coefficient representing sequence identity to human proteins",
    "optimization_role": "Objective coefficient",
    "configuration_type": "scalar_parameter",
    "value": 0.85,
    "business_justification": "Represents an average sequence identity coefficient for the institutions involved."
  },
  "building_capacity": {
    "data_type": "INTEGER",
    "business_meaning": "Capacity of the building associated with institution",
    "optimization_role": "Constraint bound for building capacity",
    "configuration_type": "scalar_parameter",
    "value": 600,
    "business_justification": "Represents an average building capacity for the institutions involved."
  }
}

Business Configuration Design: 
Our system separates business logic design from value determination:
- Configuration Logic (business_configuration_logic.json): Templates designed by data engineers with sample_value for scalars and actual formulas for business logic
- Configuration Values (business_configuration.json): Realistic values determined by domain experts for scalar parameters only
- Design Rationale: Ensures business logic consistency while allowing flexible parameter tuning


TASK: Create structured markdown documentation for SECTIONS 1-3 ONLY (Problem Description).

EXACT MARKDOWN STRUCTURE TO FOLLOW:

# Complete Optimization Problem and Solution: protein_institute

## 1. Problem Context and Goals

### Context  
[Regenerate business context that naturally aligns with LINEAR optimization formulation. Ensure:]
- Business decisions match expected decision variables: allocation[i] for each institution i, representing the amount of resources allocated (continuous)
- Operational parameters align with expected linear objective: maximize total_sequence_identity = ∑(protein.sequence_identity_coefficient[i] * ResourceAllocation.allocation[i])
- Business configuration includes: Total resources available for allocation (used for Constraint bound for total resources), Coefficient representing sequence identity to human proteins (used for Objective coefficient), Capacity of the building associated with institution (used for Constraint bound for building capacity)
- Use natural language to precisely describe linear mathematical relationships
- NO mathematical formulas, equations, or symbolic notation
- Present data as current operational information
- Focus on precise operational decision-making that leads to linear formulations
- Resource limitations match expected linear constraints
- Avoid scenarios requiring variable products, divisions, or other nonlinear relationships
- Include specific operational parameters that map to expected coefficient sources
- Reference business configuration parameters where appropriate
- CRITICAL: Include ALL business configuration information (scalar parameters AND business logic formulas) in natural business language

### Goals  
[Regenerate goals that clearly lead to LINEAR mathematical objective:]
- Optimization goal: maximize
- Metric to optimize: maximize total_sequence_identity = ∑(protein.sequence_identity_coefficient[i] * ResourceAllocation.allocation[i])
- Success measurement aligned with expected coefficient sources
- Use natural language to precisely describe linear optimization goal
- NO mathematical formulas, equations, or symbolic notation

## 2. Constraints    

[Regenerate constraints that directly match expected LINEAR mathematical constraints:]
- Expected constraint: ['∑(ResourceAllocation.allocation[i]) <= business_configuration_logic.total_available_resources', 'ResourceAllocation.allocation[i] <= building.building_capacity[i] for each institution i'] (Form: Standard constraint form based on business requirements)

[Each constraint should be described in business terms that naturally lead to LINEAR mathematical forms (no variable products or divisions)]

## 3. Available Data  

### Database Schema  
```sql
-- Iteration 1 Database Schema
-- Objective: Schema changes include creating new tables for missing optimization data, modifying existing tables to improve mapping adequacy, and updating business configuration logic for scalar parameters and formulas.

CREATE TABLE ResourceAllocation (
  institution_id INTEGER,
  allocation FLOAT
);

CREATE TABLE protein (
  sequence_identity_coefficient FLOAT
);

CREATE TABLE building (
  building_capacity INTEGER
);


```

### Data Dictionary  
[Create comprehensive business-oriented data dictionary mapping tables and columns to their business purposes and optimization roles - NOT technical database terms. Base this on the data_dictionary provided but rewrite in business language that clearly connects to the expected linear mathematical formulation]

### Current Stored Values  
```sql
-- Iteration 1 Realistic Data
-- Generated by triple expert (business + data + optimization)
-- Values were determined based on typical research resource allocation scenarios, ensuring that the total resources and building capacities align with realistic institutional capabilities.

-- Realistic data for ResourceAllocation
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (1, 150.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (2, 250.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (3, 200.0);

-- Realistic data for protein
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.85);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.9);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.8);

-- Realistic data for building
INSERT INTO building (building_capacity) VALUES (600);
INSERT INTO building (building_capacity) VALUES (700);
INSERT INTO building (building_capacity) VALUES (500);


```

STOP HERE - Do not include section 4 (Mathematical Optimization Formulation).

Response:
# Complete Optimization Problem and Solution: protein_institute

## 1. Problem Context and Goals

### Context  
The protein institute distributes research resources across several institutions with the aim of raising the total sequence identity to human proteins, a measure of how closely protein sequences match their human counterparts. The decision to be made is how much to allocate to each institution. Each allocation is a continuous quantity, so any amount may be assigned, not just whole units.

Each institution has a sequence identity coefficient that states how much one unit of allocated resources contributes to the total. An institution's contribution is its coefficient multiplied by its allocation, and the total sequence identity is the sum of these contributions over all institutions. Three business configuration parameters frame the decision. The total resources available for allocation, currently 1000, limits the combined allocation across institutions. The sequence identity coefficient, with a configured average value of 0.85, serves as the objective coefficient. The building capacity, with a configured average value of 600, limits how much any single institution can receive.

Every relationship in the problem is proportional or additive. Each institution's contribution grows in direct proportion to its allocation, and both resource limits are simple upper bounds on an allocation or on the sum of allocations, so the problem calls for no products of decision quantities and no ratios.

### Goals  
The goal is to maximize the total sequence identity to human proteins across all institutions by choosing how much research resource each institution receives. The metric is the sum, over institutions, of each institution's sequence identity coefficient multiplied by its allocation. The coefficients come from the protein table (currently 0.85, 0.9, and 0.8), and an allocation plan is judged better the higher the total it achieves within the resource limits.

## 2. Constraints    

The allocation is subject to two kinds of linear limits. First, the combined allocation across all institutions must not exceed the total resources available, which the business configuration sets at 1000. Second, the allocation to each institution must not exceed the capacity of the building associated with it; the building table currently lists capacities of 600, 700, and 500. The first limit keeps the plan within the shared resource pool, and the second respects the physical limits of each institution's infrastructure.
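
The two limits above can be checked mechanically against any proposed plan. The following is a minimal sketch, not part of the original documentation: the function name `check_constraints` and the dictionaries are hypothetical, and it assumes that building rows correspond to institutions 1 to 3 in insertion order, since the schema carries no explicit key for that link.

```python
# Hypothetical in-memory copy of the stored values. Pairing building rows
# with institutions 1..3 by insertion order is an ASSUMPTION; the schema
# has no institution key on the building table.
TOTAL_AVAILABLE_RESOURCES = 1000
allocation = {1: 150.0, 2: 250.0, 3: 200.0}
building_capacity = {1: 600, 2: 700, 3: 500}


def check_constraints(allocation, capacity, total_available):
    """Return a list of human-readable violations (empty if the plan is feasible)."""
    violations = []
    total = sum(allocation.values())
    # Limit 1: combined allocation must stay within the shared resource pool.
    if total > total_available:
        violations.append(f"total allocation {total} exceeds {total_available}")
    # Limit 2: each institution's allocation must fit its building capacity.
    for inst, amount in allocation.items():
        if amount > capacity[inst]:
            violations.append(
                f"institution {inst}: {amount} exceeds capacity {capacity[inst]}"
            )
    return violations


violations = check_constraints(allocation, building_capacity, TOTAL_AVAILABLE_RESOURCES)
print(violations or "current allocation is feasible")
```

Under these assumptions the currently stored allocations sum to 600, which is within the pool of 1000, and each sits below its building capacity.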

## 3. Available Data  

### Database Schema  
```sql
-- Iteration 1 Database Schema
-- Objective: Schema changes include creating new tables for missing optimization data, modifying existing tables to improve mapping adequacy, and updating business configuration logic for scalar parameters and formulas.

CREATE TABLE ResourceAllocation (
  institution_id INTEGER,
  allocation FLOAT
);

CREATE TABLE protein (
  sequence_identity_coefficient FLOAT
);

CREATE TABLE building (
  building_capacity INTEGER
);
```

### Data Dictionary  
The data dictionary provides a comprehensive mapping of tables and columns to their business purposes and optimization roles:

- **ResourceAllocation Table**: Represents the allocation of resources to each institution. The `institution_id` column serves as a unique identifier for each institution, linking the allocation to specific institutions. The `allocation` column represents the amount of resources allocated to the institution, serving as the decision variable for resource allocation.

- **Protein Table**: Stores protein data, including sequence identity coefficients. The `sequence_identity_coefficient` column represents the coefficient for sequence identity to human proteins, serving as the objective coefficient in the optimization problem. The table has no institution identifier column.

- **Building Table**: Stores building data, including capacity. The `building_capacity` column represents the capacity of the building associated with each institution, serving as the constraint bound for building capacity. The table has no institution identifier column.
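
Because only `ResourceAllocation` carries an `institution_id`, the link between an institution and its coefficient or capacity is not stated in the schema. One possible reading, shown here purely as an assumption, is that rows are matched by insertion order. The sketch below loads the stored values into an in-memory SQLite database and pairs rows using SQLite's implicit `rowid`; a production schema would more likely add an explicit `institution_id` column to each table.

```python
import sqlite3

# In-memory database mirroring the documented schema and stored values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ResourceAllocation (institution_id INTEGER, allocation FLOAT);
CREATE TABLE protein (sequence_identity_coefficient FLOAT);
CREATE TABLE building (building_capacity INTEGER);
INSERT INTO ResourceAllocation VALUES (1, 150.0), (2, 250.0), (3, 200.0);
INSERT INTO protein VALUES (0.85), (0.9), (0.8);
INSERT INTO building VALUES (600), (700), (500);
""")

# ASSUMPTION: the n-th protein and building rows belong to institution n.
# SQLite assigns rowid 1, 2, 3 to rows inserted into a fresh table.
rows = conn.execute("""
SELECT r.institution_id, r.allocation,
       p.sequence_identity_coefficient, b.building_capacity
FROM ResourceAllocation AS r
JOIN protein  AS p ON p.rowid = r.institution_id
JOIN building AS b ON b.rowid = r.institution_id
ORDER BY r.institution_id
""").fetchall()

for institution_id, alloc, coeff, cap in rows:
    print(institution_id, alloc, coeff, cap)
```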

### Current Stored Values  
```sql
-- Iteration 1 Realistic Data
-- Generated by triple expert (business + data + optimization)
-- Values were determined based on typical research resource allocation scenarios, ensuring that the total resources and building capacities align with realistic institutional capabilities.

-- Realistic data for ResourceAllocation
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (1, 150.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (2, 250.0);
INSERT INTO ResourceAllocation (institution_id, allocation) VALUES (3, 200.0);

-- Realistic data for protein
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.85);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.9);
INSERT INTO protein (sequence_identity_coefficient) VALUES (0.8);

-- Realistic data for building
INSERT INTO building (building_capacity) VALUES (600);
INSERT INTO building (building_capacity) VALUES (700);
INSERT INTO building (building_capacity) VALUES (500);
```
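
With the stored values in hand, the problem is small enough to solve by inspection. The sketch below is illustrative only and is not the solution method of the original documentation. It relies on several assumptions: rows are matched to institutions 1 to 3 by insertion order, allocations cannot be negative, and every unit of allocation consumes exactly one unit of the shared pool. Under those conditions, filling institutions in descending order of coefficient is optimal; a more general linear program would need a proper solver. The function name `allocate_greedily` is hypothetical.

```python
# Values copied from the stored data; the pairing by position is an ASSUMPTION.
TOTAL_AVAILABLE_RESOURCES = 1000
coefficients = {1: 0.85, 2: 0.9, 3: 0.8}
capacities = {1: 600, 2: 700, 3: 500}


def allocate_greedily(coefficients, capacities, total_available):
    """Fill institutions in descending coefficient order until the pool runs out.

    Valid only for this structure: one shared budget, per-institution caps,
    nonnegative allocations, and one unit of budget per unit allocated.
    """
    remaining = float(total_available)
    plan = {inst: 0.0 for inst in coefficients}
    for inst in sorted(coefficients, key=coefficients.get, reverse=True):
        if remaining <= 0 or coefficients[inst] <= 0:
            break  # nothing left to hand out, or no benefit from allocating
        amount = min(float(capacities[inst]), remaining)
        plan[inst] = amount
        remaining -= amount
    return plan


plan = allocate_greedily(coefficients, capacities, TOTAL_AVAILABLE_RESOURCES)
total_sequence_identity = sum(coefficients[i] * plan[i] for i in plan)
print(plan, total_sequence_identity)
```

Under these assumptions the plan gives institution 2 its full capacity of 700, institution 1 the remaining 300, and institution 3 nothing, for a total of roughly 885.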
