# 1 run.py
This script implements a continuous task-processing loop over MongoDB. It queries the relevant collections, selects tasks with the status "pending", sends the corresponding requests to an external API, and then updates the MongoDB records with the results or with error information. A detailed description of how it works follows:

1. **Configuring logging and loading environment variables:**
   - The `configure_logging()` function configures logging: it sets the logging level (INFO) and the message format (time, level, and message text). All messages are written to the console, which makes it possible to track the progress of the script.
   - At startup, a message confirming the successful logging configuration is logged.
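
A minimal sketch of such a logging setup, using only the standard library. The exact format string and the wording of the confirmation message are assumptions; the source only states that time, level, and message text are included.

```python
import logging


def configure_logging() -> None:
    """Send INFO-and-above messages to the console with time, level, and text."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",  # assumed format string
        force=True,  # replace handlers installed earlier (Python 3.8+)
    )
    logging.info("Logging configured successfully")  # assumed wording


configure_logging()
```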

2. **Connecting to MongoDB:**
   - The `get_mong_client()` function builds a URI for connecting to MongoDB from parameters (host, port, username, password) imported from the constants module.
   - The script tries to establish a connection to MongoDB and logs the attempt and the successful connection or, if an error occurs, the exception.
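
The URI construction could look roughly like the sketch below. The helper name `build_mongo_uri` is hypothetical, and it is an assumption that credentials are percent-encoded. The resulting string is what would be passed to `pymongo.MongoClient`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host: str, port: int, username: str, password: str) -> str:
    """Build a mongodb:// URI, percent-encoding the credentials."""
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"mongodb://{user}:{secret}@{host}:{port}/"


# Example values only; the real ones come from the constants module.
uri = build_mongo_uri("localhost", 27017, "user", "p@ss:word")
print(uri)
```

Encoding the credentials matters when a password contains characters such as `@` or `:`, which would otherwise be read as URI delimiters.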

3. **Sending API requests:**
   - The `make_request()` function accepts the following arguments:
     - `model` is the name of the model that will be used to process the request;
     - `prompt` — the text of the request;
     - `variables` — additional variables for the query;
     - `session` is a session object from the `requests` library for reusing HTTP connections.
   - Inside the function, a POST request to the API is built (using the URL given by the constant `API_URL`); its body carries the model, the prompt, the variables, and the flag `"stream": False`.
   - If the request succeeds, the API's JSON response is returned. If it fails, the error is logged and an exception is raised.
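
A sketch of this request logic, under stated assumptions: the `API_URL` value is a placeholder, and the payload key names `model`, `prompt`, and `variables` are guesses (only `"stream": False` is given in the description). The `session` argument is expected to behave like `requests.Session`.

```python
import logging
from typing import Any, Dict

API_URL = "http://api.example.invalid/generate"  # placeholder, not the real endpoint


def make_request(model: str, prompt: str, variables: Dict[str, Any], session: Any) -> Dict[str, Any]:
    """POST the task to the API and return the decoded JSON response."""
    payload = {
        "model": model,
        "prompt": prompt,
        "variables": variables,
        "stream": False,
    }
    try:
        response = session.post(API_URL, json=payload)
        response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
        return response.json()
    except Exception:
        logging.exception("API request failed for model %s", model)
        raise  # let the caller mark the task as failed
```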

4. **Processing of a single task:**
   - The `process_task()` function accepts:
     - `task` — a task document from MongoDB (includes fields such as `_id`, `prompt`, `model`, and possibly `variables`);
     - `collection` — the MongoDB collection where this task is located;
     - `session` — session for HTTP requests.
   - First, the necessary task data is extracted (ID, prompt, model, variables).
   - Then, using `make_request()`, a request is sent to the API to process the task.
   - If the request is completed successfully, the task document in the database is updated: the `status` field changes to `completed`, and the API response is recorded in the `response` field.
   - If an error occurs during processing, the task status is updated to `failed`, and error information is recorded in the `error` field. In both cases, the corresponding logging occurs.
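
The success and failure paths can be sketched as follows. The `request_fn` parameter is a hypothetical addition that stands in for `make_request()` so the example is self-contained. Storing the error as `str(exc)` and defaulting `variables` to an empty dictionary are assumptions. `collection` is expected to offer pymongo's `update_one`.

```python
import logging
from typing import Any, Callable, Dict


def process_task(task: Dict[str, Any], collection: Any, session: Any,
                 request_fn: Callable[..., Any]) -> None:
    """Run one task through the API and record the outcome on its document."""
    task_id = task["_id"]
    try:
        result = request_fn(task["model"], task["prompt"],
                            task.get("variables", {}), session)
    except Exception as exc:
        logging.exception("Task %s failed", task_id)
        collection.update_one(
            {"_id": task_id},
            {"$set": {"status": "failed", "error": str(exc)}},
        )
    else:
        collection.update_one(
            {"_id": task_id},
            {"$set": {"status": "completed", "response": result}},
        )
        logging.info("Task %s completed", task_id)
```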

5. **Task collection processing:**
   - The `process_collection()` function is responsible for processing all tasks within a specific database collection.
   - It accepts:
     - `db` — MongoDB database object,
     - `collection_name` — the name of the collection,
     - `session` — session for HTTP requests.
   - First, the function gets a collection by name and defines a list of unique models (`distinct("model")`), for which there are tasks.
   - For each model found, a loop repeatedly selects tasks with the `pending` status for that model. It uses the `find_one_and_update()` method, which finds a task and changes its status to `processing` in a single atomic operation, so the same task cannot be picked up twice.
   - If a task is found, it is passed to `process_task()` for further processing. If there are no more tasks for the model, the loop ends and processing moves on to the next model.
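
The claim-and-process loop can be sketched as below. The `handle_task` callback (standing in for `process_task()`) and the returned counter are hypothetical additions for illustration. `db[collection_name]` is expected to behave like a pymongo collection.

```python
from typing import Any, Callable


def process_collection(db: Any, collection_name: str, session: Any,
                       handle_task: Callable[[dict, Any, Any], None]) -> int:
    """Claim pending tasks model by model; return how many were handled."""
    collection = db[collection_name]
    processed = 0
    for model in collection.distinct("model"):
        while True:
            # Find one pending task and mark it as processing in one step,
            # so a concurrent worker cannot claim the same document.
            task = collection.find_one_and_update(
                {"model": model, "status": "pending"},
                {"$set": {"status": "processing"}},
            )
            if task is None:
                break  # nothing left for this model
            handle_task(task, collection, session)
            processed += 1
    return processed
```

Grouping by model means all pending tasks of one model are handled before the next model is touched, which may help when the API has to load a model before serving it.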

6. **Main processing cycle:**
   - The `run_processing_loop()` function starts a task processing cycle for all suitable collections in the database.
   - First, a session `requests.Session()` is created to optimize HTTP requests.
   - The script gets a list of all collections in the database, excluding collections named `"delete_me"` and `"test"` that are not intended for processing.
   - For each remaining collection, the `process_collection()` function is called, which processes tasks within the collection.
   - After processing all collections, the script pauses (5 seconds) to allow time for new tasks to appear, and continues the cycle.
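
One pass of this cycle might look like the sketch below. The names `run_one_pass` and `collections_to_process`, the injected `sleep` function, and passing the session in as an argument (instead of creating `requests.Session()` inside) are all hypothetical choices made to keep the example self-contained.

```python
import time
from typing import Any, Callable, List

EXCLUDED_COLLECTIONS = {"delete_me", "test"}


def collections_to_process(db: Any) -> List[str]:
    """List every collection except the ones not intended for processing."""
    return [name for name in db.list_collection_names()
            if name not in EXCLUDED_COLLECTIONS]


def run_one_pass(db: Any, process_collection: Callable[[Any, str, Any], Any],
                 session: Any, pause_seconds: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Process each eligible collection once, then pause before the next pass."""
    for name in collections_to_process(db):
        process_collection(db, name, session)
    sleep(pause_seconds)  # give new tasks time to appear
```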

7. **Running the script and working with multiple databases:**
   - The `main()` function is the entry point to the script.
   - It sets up logging and establishes a connection to MongoDB.
   - Then an infinite loop is started, in which two databases are sequentially processed: `"TrustLLM_en"` and `"TrustGen"`.
   - `run_processing_loop()` is called for each database, after which the script pauses (10 seconds) before the next iteration, ensuring continuous monitoring and processing of new tasks.

In this way, the script organizes automated processing of tasks stored in multiple MongoDB collections through the following steps:

- **Task selection:** Retrieving a task with the status `"pending"` from the collection, while updating the status to `"processing"` to prevent re-processing.
- **Task processing:** Sending an API request using a specified model, prompt, and variables.
- **Result update:** After receiving the API response, updating the task record in MongoDB with the status set to `completed` and saving the response, or setting the status to `failed` with error information when problems occur.
- **Continuous cycle:** Continuous polling of databases and collections at preset intervals, so that new tasks are picked up shortly after they appear.

The script logs all key events (connection to the database, sending a request, receiving a response, errors) for easy monitoring and debugging of the system.


# 2 task_processor.py
This script automates the creation of queue entries for tasks defined in the MongoDB database. It periodically reads tasks from the `tasks` collection, loads the dataset associated with each task, and then, for each row of that dataset and each specified model, creates a new entry in a dedicated queue collection unless such an entry already exists. A detailed description of how the script works follows.

---

### 1. Initialization and settings

- **Imports and constants:**  
  The script uses standard Python modules (`logging`, `os`, `time`), a library for working with data — `pandas`, as well as a client for MongoDB from the `pymongo` package. The MongoDB connection parameters (host, port, username, password) are imported from the module with constants.  
  The database name `MONGO_DB` is also defined, which is taken from the environment variable `MONGO_DB` (if the variable is not specified, the value `"TrustGen"` is used).

- **Logging settings:**  
  Logging is set up with `logging.basicConfig` at the `INFO` level. All messages are formatted with the time, the logging level, and the message itself. A logger object `logger` is created for further use in the script.

---

### 2. Connecting to MongoDB

- **Function `get_mong_client()`:**  
  This function generates a URI for connecting to MongoDB using the specified parameters (username, password, host, and port). An instance of `MongoClient` is created, and upon successful connection, a message is output to the log.  
  _Return value:_ the `MongoClient` object.

- **The `get_db()` function:**  
  Calls `get_mong_client()` and returns the database object using the name specified in `MONGO_DB`.

---

### 3. Getting tasks and dataset

- **Function `fetch_tasks(db)`:**
  The `tasks` collection is retrieved from the database and all of its documents are read. The tasks are returned as a list of dictionaries.  
  _Purpose:_ get a list of all tasks registered in the system.

- **Function `get_dataset_head(db, dataset_name)`:**  
  For a specific task, the name of the dataset collection is defined in the format `dataset_{dataset_name}`. All documents are extracted from this collection, based on which the `pandas.DataFrame` object is generated. If the `_id` column is present in the DataFrame, it is deleted to leave only useful data.  
  _Return value:_ a DataFrame with the dataset data, or an empty DataFrame if there are no documents.

---

### 4. Creating queue entries for tasks

- **Function `insert_queue_entries_for_task(db, task)`:**  
  This is a key function that performs the following actions for each task in the tasks collection:
  
  1. **Extracting task parameters:**  
     The following fields are extracted from the task document:
     - `task_name` is the name of the task.
     - `dataset_name` is the name of the dataset that the task is associated with.
     - `prompt` — the text of the request to be used during processing.
     - `variables_cols` — a list of column names, the values of which will be used as variables.
     - `models` — a list of models to create records for.
     - `metric` is a type of metric that defines the processing features.
     - Additional parameters such as `target`, `regexp`, `include_column`, `exclude_column`, `rta_prompt` and `rta_model`.

  2. **Getting a dataset:**  
     The `get_dataset_head` function retrieves the DataFrame for the specified dataset. If the dataset is empty, the function logs a warning and stops further processing for this task.

  3. **Defining a queue collection:**  
     The queue collection name is formed according to the `queue_{task_name}` scheme. It is in this collection that records will be saved for each row of the dataset and for each model.

  4. **Analysis of existing records:**  
     To avoid duplication, existing records are read from the queue collection, and a key pair `(line_index, model)` is formed for each of them. These keys are collected into the set `existing_keys`. When new records are created, the function checks whether a record already exists for a given dataset row and model.

  5. **Formation of new records:**  
     - The dataset is converted into a list of dictionaries, where each dictionary corresponds to a row.
     - For each row (with the index `i`) and for each element from the list `models`, it is checked whether the pair `(i, model)` is present in `existing_keys`.
     - If there is no such key, a new document is generated with the fields:
       - `"task_name"`, `"line_index"`, `"dataset_name"`, `"prompt"`, `"variables"` (dictionary made up of column values specified in `variables_cols`), `"model"`, `"metric"`, `"regexp"`, as well as the fields `"status"` (is set to `"pending"`) and `"response"` (initially `None`).
     - Further, depending on the value of the `metric` field, the document is supplemented:
       - **If `metric` is equal to `"RtA"`:**  
         If `rta_prompt` and `rta_model` are specified, they are added to the document. Also, the `target` field is set either to the `target` value (if it is a string) or to the `metric` value.
       - **If `metric` is equal to `"include_exclude"`:**  
         If the fields `include_column` and `exclude_column` are specified in the task and they are present in the dataset row, the lists `include_list` and `exclude_list` are created, respectively. The `"target"` field is also set similarly.
       - **In all other cases:**  
         If the `target` field exists and the column it names is present in the dataset row, its value is taken from the row; if not, `"target"` is set to `None`.

  6. **Inserting new entries into the queue collection:**  
     If new documents are generated, they are inserted into the collection using the `insert_many` method (with the `ordered=False` option, which lets the server insert the documents in any order and attempt every insert even if some fail). After successful insertion, the number of added documents is logged. In case of an error, the error message is logged.
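
Steps 4 to 6 can be sketched as follows. The function names `build_queue_entries` and `enqueue_missing_entries` are hypothetical, rows are passed as a plain list of dictionaries, and only the default `target` branch is shown; the `"RtA"` and `"include_exclude"` branches are omitted. `queue_coll` is expected to offer pymongo's `find` and `insert_many`.

```python
from typing import Any, Dict, List, Set, Tuple


def build_queue_entries(task: Dict[str, Any], rows: List[Dict[str, Any]],
                        existing_keys: Set[Tuple[int, str]]) -> List[Dict[str, Any]]:
    """Create one pending document per (row, model) pair not queued yet."""
    target_col = task.get("target")
    new_docs = []
    for i, row in enumerate(rows):
        for model in task["models"]:
            if (i, model) in existing_keys:
                continue  # already queued for this row and model
            has_target = isinstance(target_col, str) and target_col in row
            new_docs.append({
                "task_name": task["task_name"],
                "line_index": i,
                "dataset_name": task["dataset_name"],
                "prompt": task["prompt"],
                "variables": {col: row.get(col) for col in task["variables_cols"]},
                "model": model,
                "metric": task.get("metric"),
                "regexp": task.get("regexp"),
                "status": "pending",
                "response": None,
                "target": row[target_col] if has_target else None,
            })
    return new_docs


def enqueue_missing_entries(queue_coll: Any, task: Dict[str, Any],
                            rows: List[Dict[str, Any]]) -> int:
    """Insert the missing queue entries and return how many were added."""
    existing_keys = {
        (doc["line_index"], doc["model"])
        for doc in queue_coll.find({}, {"line_index": 1, "model": 1})
    }
    new_docs = build_queue_entries(task, rows, existing_keys)
    if new_docs:  # insert_many rejects an empty list
        queue_coll.insert_many(new_docs, ordered=False)
    return len(new_docs)
```

Because the keys are rebuilt from the queue collection on every call, running the function again over the same task and dataset adds nothing.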

---

### 5. The main execution cycle

- **Function `main()`:**  
  This function organizes an infinite loop in which the following occurs:

  1. **Connecting to the database:**  
     First, the `get_db()` function is called to get the database object.

  2. **Periodic execution:**  
     The waiting interval is set (10 seconds). In an endless loop:
     - The `fetch_tasks(db)` function is called to get all tasks from the `tasks` collection.
     - If tasks are found, then for each task the script:
       - logs the start of processing (including the `task_name`);
       - calls `insert_queue_entries_for_task(db, task)`, which creates the corresponding entries in the queue.
     - If there are no tasks, the corresponding message is logged.
     - After processing all tasks, the loop pauses for 10 seconds before the next iteration.
  
  3. **Shutdown:**  
     If the user interrupts execution (for example, with the keyboard shortcut Ctrl+C), the `KeyboardInterrupt` exception is caught, and a message about stopping the process is written to the log.
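
The polling-and-shutdown pattern described above can be sketched generically. The name `run_polling_loop`, the `step` callback (one round of fetching tasks and creating queue entries), and the injectable `sleep` are hypothetical; the log wording is an assumption.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def run_polling_loop(step: Callable[[], None], interval: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Call `step` forever, pausing between rounds, until Ctrl+C is pressed."""
    try:
        while True:
            step()
            sleep(interval)
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread.
        logger.info("Process stopped by user")
```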

---

### The final purpose of the script

The script automates the process of forming queues for processing tasks. Each document from the `tasks` collection contains information about which dataset and which models to use, as well as additional parameters (for example, metric parameters). The script reads the dataset from the corresponding collection and then creates a unique entry for each row of the dataset and each model in the queue collection named `queue_{task_name}`. This allows other processes or services to read queue entries, perform the specified processing (for example, sending API requests or analyzing data), and update the status of completed tasks.

Thus, this script is the link between defining tasks (a collection of `tasks`) and creating a queue of tasks for further processing, providing automated and periodic updating of queues in the MongoDB database.


# 3 run_rta_queuer.py
This script implements an automated process for transferring tasks with the metric **"RtA"** from regular queues to special **RTA queues** in the MongoDB database. Its main purpose is to find completed tasks (with the status `"completed"`) from collections whose names begin with the prefix `"queue_"`, check for the fields necessary for migration, transform the data and create a new record for them in the corresponding RTA queue (collection with a name starting with `"queue_rta_"`). Below is a detailed description of how the script works.

---

### 1. Setting up the environment and connecting to the database

- **Importing modules and setting variables:**  
  The script imports the necessary libraries:
  - `logging`, `os`, `time` for logging, working with the operating system and delays.
  - `pandas` for working with tabular data (although it is not used directly in this script).
  - `dotenv` for loading environment variables (it is assumed that the variables are already set).
  - `pymongo` for working with MongoDB.
  
  The connection constants are imported from the `utils.constants` module: `MONGO_HOST`, `MONGO_PASSWORD`, `MONGO_PORT` and `MONGO_USERNAME`.  
  The database name (`MONGO_DB`) is determined from the environment variable, and the default value is `"TrustGen"`.

- **Logging settings:**  
  Logging is configured at the `INFO` level with the output of messages in a format containing the time, logging level, and the message itself. This allows you to track all the key steps of script execution.

- **Connecting to MongoDB:**  
  The `get_mong_client()` function generates a URI for connecting to MongoDB using the specified constants and creates an instance of `MongoClient`. Upon successful connection, an information message is displayed in the log.  
  The `get_db()` function retrieves the database object using the name specified in the `MONGO_DB` variable.

---

### 2. Searching for tasks with the "RtA" metric

- **Function `fetch_rta_tasks(db: Database)`:**  
  This is a generator that:
  - Iterates through all collections in the database whose names start with `"queue_"` (these are regular task queues).
  - In each of these collections, it searches for documents (tasks) for which the field `"metric"` is equal to `"RtA"` and the status (`"status"`) is equal to `"completed"`.
  - For each task found, the function yields a tuple of the collection name (`coll_name`) and the task document itself.
  
  Thus, at each step of the cycle, tasks will be available, ready to be transferred to the RTA queue.
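
A sketch of such a generator, following the description literally. The `db` argument is expected to behave like a pymongo `Database` (it must offer `list_collection_names()` and item access by collection name).

```python
from typing import Any, Dict, Iterator, Tuple


def fetch_rta_tasks(db: Any) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (collection name, task) for every completed RtA task in a queue."""
    for coll_name in db.list_collection_names():
        if not coll_name.startswith("queue_"):
            continue  # not a task queue
        query = {"metric": "RtA", "status": "completed"}
        for task in db[coll_name].find(query):
            yield coll_name, task
```

Note that names beginning with `queue_rta_` also match the prefix check. According to the description, entries in those collections carry the metric `"accuracy"`, so the query would not select them.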

--- 

### 3. Transferring the task to the RTA queue

The `create_rta_queue_entry(db: Database, coll_name: str, task: Dict[str, Any])` function transfers one task from a regular queue to a special RTA queue. Its logic, step by step:

1. **Defining the target collection of an RTA queue:**
   - The task name (`task_name`) is obtained from the name of the source collection (for example, `queue_myTask`) by removing the `queue_` prefix.
   - The name of the target collection for RTA tasks is formed: `"queue_rta_{task_name}"`.

2. **Extraction and verification of required fields:**
   - The original model (`original_model`) and the original prompt (`original_prompt`) are extracted from the original task.
   - The fields `rta_model` and `rta_prompt` are extracted from the task. If one of them is missing, the task cannot be transferred:
     - In this case, the status in the source record is updated to `error`, and the corresponding message is written in the `error` field (for example, "RtA task without rta_model" or "RtA task without rta_prompt").
   - The presence of the `response` field is also checked. If there is no response, the task is marked as erroneous in the same way.

3. **Formation of a new field `variables`:**  
   - The original prompt containing placeholders (for example, `{name}`, `{value}`) is formatted using the `variables` dictionary from the original task. The result is the string `filled_input` — the original prompt with the substituted values.
   - If an error occurs during formatting (for example, due to missing variables), the task is updated with the error `"Prompt formatting error"`.
   - A new dictionary `new_variables` is created, which contains:
     - `"input"` — filled prompt (`filled_input`);
     - `"answer"` is the value of the `response` field from the original task.

4. **Checking for duplication:**
   - The target RTA queue is checked for an existing entry with the same values:
     - The `model` field must match the new `rta_model`.
     - The fields `variables.input` and `variables.answer` must match `filled_input` and `response`, respectively.
   - If such a record already exists, the status of the original task is set to `error` with the message "Duplicate in rta_queue", and the transfer is not performed.

5. **Creating a new entry for the RTA queue:**  
   If all checks pass, a new document is built for insertion into the target collection:
   - **Copy and conversion fields:**
     - `"task_name"` and `"dataset_name"` are taken from the original task.
     - `"init_prompt"` and `"init_model"` preserve the original prompt and model.
     - `"prompt"` is set to `rta_prompt` (i.e. a specialized prompt for RtA).
     - `"model"` is replaced by `rta_model`.
     - `"variables"` accepts a previously generated dictionary with the fields `"input"` and `"answer"`.
     - Additionally, the `"regexp"` and `"target"` fields are copied, if any.
   - **Default values:**  
     - The status of the new record is set to `pending`, which means that the task is awaiting further processing.
     - The `"metric"` field is hardcoded as `"accuracy"` (as the task specification requires).
   - **Synchronization with the original record:**  
     - The `source_id` field is added, which stores the ID of the original task. This makes it possible to trace which record the entry was created from.

6. **Inserting and updating statuses:**
   - A new document is inserted into the target RTA queue (`rta_coll.insert_one(doc)`).
   - After successful insertion, a message about successful transfer is logged.
   - The original task in the regular queue is updated: its status changes to `"transfered_to_rta"`, which indicates that the task has been successfully transferred.
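
The validation and transformation steps can be sketched as pure functions, leaving the database writes aside. The helper names are hypothetical, `str.format` is assumed as the formatting mechanism, and the message for a missing `response` is an assumption (the description does not give its wording).

```python
from typing import Any, Dict, Optional, Tuple


def rta_collection_name(coll_name: str) -> str:
    """Map queue_<task> to queue_rta_<task>."""
    return "queue_rta_" + coll_name[len("queue_"):]


def build_rta_entry(task: Dict[str, Any]) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (document, None) on success or (None, error message) on failure."""
    if not task.get("rta_model"):
        return None, "RtA task without rta_model"
    if not task.get("rta_prompt"):
        return None, "RtA task without rta_prompt"
    if task.get("response") is None:
        return None, "RtA task without response"  # assumed wording
    try:
        filled_input = task["prompt"].format(**task.get("variables", {}))
    except (KeyError, IndexError, ValueError):
        return None, "Prompt formatting error"
    doc = {
        "task_name": task.get("task_name"),
        "dataset_name": task.get("dataset_name"),
        "init_prompt": task["prompt"],
        "init_model": task.get("model"),
        "prompt": task["rta_prompt"],
        "model": task["rta_model"],
        "variables": {"input": filled_input, "answer": task["response"]},
        "regexp": task.get("regexp"),
        "target": task.get("target"),
        "status": "pending",
        "metric": "accuracy",
        "source_id": task["_id"],
    }
    return doc, None


def duplicate_filter(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Query that matches an already-transferred copy of this entry."""
    return {
        "model": doc["model"],
        "variables.input": doc["variables"]["input"],
        "variables.answer": doc["variables"]["answer"],
    }
```

In a full implementation, the caller would run `find_one(duplicate_filter(doc))` on the target collection before `insert_one(doc)`, and write any returned error message back to the source record.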

---

### 4. Endless task transfer cycle

- **Function `run_rta_transfer_loop(db: Database, interval: int = 10)`:**  
  This function implements the main cycle of the script:
  - At each step of the cycle, the generator `fetch_rta_tasks(db)` retrieves all tasks with the metric `"RtA"` and the status `"completed"` from the regular queues.
  - For each found task, the `create_rta_queue_entry` function is called, which tries to transfer it to the appropriate RTA queue.
  - If no tasks to transfer are found in the current iteration, the information message `"No RtA tasks to transfer. Waiting..."` is logged.
  - After processing all found tasks, the loop goes to sleep for a set interval (by default, 10 seconds) and then repeats the check.

---

### 5. The entry point to the script

- **The `main()` function:**
  - Gets the database object via `get_db()`.
  - Starts an infinite task transfer loop by calling `run_rta_transfer_loop(db, interval=10)`.

- **Script launch:**  
  If the script is executed directly (not imported as a module), the `main()` function is called, and the task migration process starts immediately.

---

### The final scheme of work

1. **Connecting to the database:**  
   The script establishes a connection with MongoDB using the specified connection parameters.

2. **Queue monitoring:**  
   It goes through all collections whose names start with `"queue_"` and selects tasks from them with the metric `"RtA"` and the status `"completed"`.

3. **Verification and transfer:**  
   For each task found, the script checks that the required fields (`rta_model`, `rta_prompt`, `response`) are present and that the original prompt can be formatted with the substituted values. If the necessary data is missing, the task is marked as erroneous. If everything is in order and there is no duplicate in the target collection, a new record is created in the RTA queue (`queue_rta_{task_name}`) with the transformed data and a link to the original record.

4. **Updating the original task:**  
   After successful transfer, the status of the original task is updated to `"transfered_to_rta"`, which prevents the task from being transferred again.

5. **Continuous cycle:**  
   The script runs in an endless loop, periodically (every 10 seconds) checking for new tasks to transfer, which allows for timely data synchronization between regular queues and RTA queues.

Thus, this script is used to automate the transfer and conversion of tasks with the **"RtA"** metric from regular queues to specialized RTA queues, while ensuring data correctness control, duplication prevention, and information synchronization between different collections in MongoDB.