Tutorial¶

In the following tutorial, we will build a solver for a predefined math dataset consisting of 500 multiplication, integer division, and exponentiation questions (see Dataset File). You can also generate a new one with this script. Our solver is an execution graph of n parallel LLMs that try to solve the problem independently of each other, followed by a filter that takes the majority result.

Cache¶

The cache stores results both during a single execution of the graph (Process Cache) and across multiple executions, e.g. when running over the dataset or optimizing hyperparameters (Persistent Cache).

In [1]:
from pathlib import Path
from llm_graph_optimizer.language_models.cache.cache import CacheContainer

cache_path = Path().resolve() / "output/cache_tutorial.pkl"
cache = CacheContainer.from_persistent_cache_file(cache_path, skip_on_file_not_found=True)

Operations¶

Operations are the building blocks of an execution graph. Typically, for each of them we have to define:

  • input_types: A dictionary mapping each input key to its type
  • output_types: A dictionary mapping each output key to its type
  • cache_seed: Optional, but it needs to be set for non-deterministic operations running in parallel so that the cached output of one operation is not reused by another.
  • other parameters: See the respective base operation for details.
In [2]:
from llm_graph_optimizer.graph_of_operations.types import ManyToOne
from llm_graph_optimizer.language_models.helpers.language_model_config import Config
from llm_graph_optimizer.language_models.openai_chat import OpenAIChat
from llm_graph_optimizer.operations.base_operations.end import End
from llm_graph_optimizer.operations.base_operations.filter_operation import FilterOperation
from llm_graph_optimizer.operations.base_operations.start import Start
from llm_graph_optimizer.operations.llm_operations.base_llm_operation import BaseLLMOperation

# Start and End have to be defined. They mark the beginning and the end of the graph and can each only exist once in a graph.

start_op = Start.factory(
    input_types={"question": str}
)
end_op = End.factory(
    input_types={"final_answer": int | None}
)

# Initialize the LLM operation and the corresponding prompter and parser

def prompter(input: str):
    return f'Answer the following math problem and only answer with the number. Do not think about it, just answer: {input}'

def parser(x: str):
    # Interpret the raw LLM response as an integer; fall back to None if parsing fails.
    try:
        x_as_number = int(x)
    except ValueError:
        x_as_number = None
    return {"output": x_as_number}
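
# Quick sanity check of the prompter/parser pair (plain Python, no LLM call;
# the example strings are illustrative):
assert parser("42") == {"output": 42}
assert parser("forty-two") == {"output": None}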

# Note that we are fixing hyperparameters (like temperature) here. We could in theory also wrap everything in a function and optimize them later.
llm = OpenAIChat(model="gpt-3.5-turbo", cache=cache, config=Config(temperature=1.0))

llm_op = BaseLLMOperation.factory(
    llm=llm,
    prompter=prompter,
    parser=parser,
    input_types={"input": str},
    output_types={"output": int | None},
    name="LLM"  # not necessary but it is shown in the graph viewer instead of the class name
)

# The filter operation differs from the others here: its input type is ManyToOne, which indicates that the operation can take multiple inputs from different operations linked to the same key.
filter_op = FilterOperation.factory(
    input_types={"outputs": ManyToOne[int | None]},
    output_types={"output": int | None},
    filter_function=lambda outputs: {"output": max(set(outputs), key=outputs.count)}
)
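
To see what the filter does, here is the majority vote applied to a plain Python list (illustrative values; note that ties are broken arbitrarily by set iteration order):

outputs = [42, 42, 41]
max(set(outputs), key=outputs.count)  # -> 42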

Graph¶

The graph contains instances of operations and defines the flow of information between them. We define it inside a function so we can optimize over the number of branches later.

In [3]:
from llm_graph_optimizer.graph_of_operations.graph_of_operations import GraphOfOperations
from llm_graph_optimizer.graph_of_operations.types import Edge

def create_graph(number_of_branches: int) -> GraphOfOperations:
    # Initialize the graph
    graph = GraphOfOperations()

    start_node = start_op()
    llm_nodes = [llm_op(cache_seed=i) for i in range(number_of_branches)]
    filter_node = filter_op()
    end_node = end_op()

    # Add nodes and edges to the graph
    graph.add_node(start_node)
    graph.add_node(filter_node)
    for i, node in enumerate(llm_nodes):
        graph.add_node(node)
        graph.add_edge(Edge(start_node, node, from_node_key="question", to_node_key="input"))
        graph.add_edge(Edge(node, filter_node, from_node_key="output", to_node_key="outputs"), order=i)  # The order parameter is not strictly necessary here. In operations with multiple ManyToOne inputs, the order can be used to sort the respective lists in the same order.
    graph.add_node(end_node)
    graph.add_edge(Edge(filter_node, end_node, from_node_key="output", to_node_key="final_answer"))
    return graph

Let's look at how the graph looks with 3 branches. You can click on the edges to see the keys and values. Currently, the values are not set. The node state is shown by color:

  • blue: The node is waiting for input,
  • yellow: The node is processing,
  • green: The node is done,
  • red: The operation failed.
In [4]:
graph = create_graph(3)
graph.snapshot.visualize(show_keys=True, show_values=True, show_state=True, notebook=True)
Out[4]:

Controller¶

The controller is the object that executes the graph. It has the following parameters:

  • graph: The graph to execute
  • scheduler: The scheduler to use (BFS and DFS are implemented. Feel free to add your own :) )
  • max_concurrent: The controller runs the execution asynchronously; this caps the number of concurrently running operations.
  • process_measurement: An object that calculates the token use and cost of the execution, both as sequential and as parallel costs. In our example, the price for running the model accumulates over every LLM execution, while the parallel duration is determined by the LLM call that took the longest, as sketched below.
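To make the sequential/parallel accounting concrete, here is a toy sketch with made-up durations (purely illustrative, not real measurements):

durations = [0.8, 0.7, 0.79]          # hypothetical durations of 3 parallel LLM calls
sequential_duration = sum(durations)  # one call after another
parallel_duration = max(durations)    # all calls run concurrently
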
In [5]:
from llm_graph_optimizer.controller.controller import Controller
from llm_graph_optimizer.measurement.process_measurement import ProcessMeasurement
from llm_graph_optimizer.schedulers.schedulers import Scheduler

def create_controller(number_of_branches: int):
    graph = create_graph(number_of_branches)
    scheduler = Scheduler.BFS
    process_measurement = ProcessMeasurement(graph)
    controller = Controller(graph, scheduler, max_concurrent=5, process_measurement=process_measurement)
    return controller

Now we can execute the graph on an example question:

In [6]:
controller = create_controller(number_of_branches=3)
answer, measurements = await controller.execute(input={"question": "What is 21*2?"})
answer
Out[6]:
{'final_answer': 42}
In [7]:
print(measurements)
ProcessMeasurement(
  total_sequential_cost=
    MeasurementsWithCache(
  no_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(2.2870641257613897), execution_cost=np.float64(3.0)),
  with_process_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(2.2870641257613897), execution_cost=np.float64(3.0)),
  with_persistent_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(2.2870641257613897), execution_cost=np.float64(3.0))
),
  total_parallel_cost=
    MeasurementsWithCache(
  no_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(0.8139148328918964), execution_cost=np.float64(1.0)),
  with_process_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(0.8139148328918964), execution_cost=np.float64(1.0)),
  with_persistent_cache=
    Measurement(request_tokens=np.float64(105.0), response_tokens=np.float64(3.0), total_tokens=np.float64(108.0), request_cost=np.float64(3.1500000000000004), response_cost=np.float64(0.18), total_cost=np.float64(3.33), execution_duration=np.float64(0.8139148328918964), execution_cost=np.float64(1.0))
)
)
In [8]:
controller.graph_of_operations.snapshot.visualize(show_multiedges=False, show_values=True, show_keys=True, show_state=True, notebook=True)
Out[8]:

Dataset Evaluation¶

Now let's run the execution over an entire dataset.

In [9]:
from pathlib import Path
from typing import Iterable

class TestDatasetLoaderWithYield(Iterable):
    def __init__(self, file_path):
        self.file_path = file_path

    def __iter__(self):
        with open(self.file_path, 'r') as f:
            for line in f:
                # Split the line into input and output
                input_str, output_str = line.strip().split(', ')
                # Yield the parsed input and output
                yield {"question": input_str}, int(output_str)


dataset_path = Path().resolve() / "dataset/test_dataset.txt"
dataloader_factory = lambda: TestDatasetLoaderWithYield(dataset_path)
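
Each line of the dataset file is assumed to look like question, answer (separated by a comma and a space), for example (illustrative):

What is 21*2?, 42

so the loader yields ({"question": "What is 21*2?"}, 42) per line.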

Next we need to define the parameters used to evaluate the dataset. For each score, we can set a confidence (alpha) value for a t-test used for early stopping, together with an acceptable confidence interval width.

In [10]:
from llm_graph_optimizer.graph_of_operations.types import ReasoningState
from llm_graph_optimizer.measurement.dataset_measurement import DatasetEvaluatorParameters, ScoreParameter


accuracy_score = ScoreParameter(
    name="accuracy",
    confidence_interval_width=0.95,
    acceptable_ci_width=0.05
)
parameters = DatasetEvaluatorParameters(
    min_runs=10,
    max_runs=500,
    score_parameters=[accuracy_score]
)

def calculate_score(reasoning_state: ReasoningState, measurement: ProcessMeasurement, ground_truth: int) -> dict[ScoreParameter, float]:
    return {accuracy_score: 1} if reasoning_state["final_answer"] == ground_truth else {accuracy_score: 0}
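
# Illustrative direct call (the measurement argument is unused by this score, so we pass None):
assert calculate_score({"final_answer": 42}, None, 42) == {accuracy_score: 1}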

Now we need to create a controller factory that takes no arguments, and a dataset evaluator object.

In [11]:
from llm_graph_optimizer.optimizer.dataset_evaluator import DatasetEvaluator


controller_factory = lambda: create_controller(number_of_branches=3)
dataset_evaluator = DatasetEvaluator(
    controller_factory=controller_factory,
    calculate_score=calculate_score,
    dataloader_factory=dataloader_factory,
    parameters=parameters
)
In [12]:
scores = await dataset_evaluator.evaluate_dataset(max_concurrent=10)
dataset_measurement = dataset_evaluator.dataset_measurement
Iteration 500: (accuracy = 0.6940, CI width = 0.0811), : 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:34<00:00, 14.64it/s]
In [13]:
print(dataset_measurement.global_evaluation_measurements)
print(dataset_measurement.dataset_evaluator_parameters)
print(dataset_measurement.scores)
GlobalEvaluationMeasurements(total_execution_duration=34.14650893211365)
DatasetEvaluatorParameters(score_parameters=[ScoreParameter(name='accuracy', map=<function mean at 0x10b9c3370>, confidence_interval_width=0.95, acceptable_ci_width=0.05)], min_runs=10, max_runs=500)
[Score(name='accuracy', value=np.float64(0.694), confidence_interval_width=np.float64(0.08106304499837319))]

Now we can look at specific measurements aggregated by a map function of our choosing, for example the mean:

In [14]:
import numpy as np


dataset_measurement.calculate_dataset_measurement(np.mean)
Out[14]:
MappedSequentialAndParallelMeasurementsWithCache(
  sequential=MeasurementsWithCache(
    no_cache=Measurement(request_tokens=np.float64(96.006), response_tokens=np.float64(5.522), total_tokens=np.float64(101.528), request_cost=np.float64(2.88018), response_cost=np.float64(0.33132), total_cost=np.float64(3.211500000000001), execution_duration=np.float64(1.4787537706159055), execution_cost=np.float64(3.0)),
    with_process_cache=Measurement(request_tokens=np.float64(93.318), response_tokens=np.float64(5.352), total_tokens=np.float64(98.67), request_cost=np.float64(2.79954), response_cost=np.float64(0.32112), total_cost=np.float64(3.120660000000001), execution_duration=np.float64(1.441199965871172), execution_cost=np.float64(2.916)),
    with_persistent_cache=Measurement(request_tokens=np.float64(93.318), response_tokens=np.float64(5.352), total_tokens=np.float64(98.67), request_cost=np.float64(2.79954), response_cost=np.float64(0.32112), total_cost=np.float64(3.120660000000001), execution_duration=np.float64(1.441199965871172), execution_cost=np.float64(2.916))),
  parallel=MeasurementsWithCache(
    no_cache=Measurement(request_tokens=np.float64(96.006), response_tokens=np.float64(5.522), total_tokens=np.float64(101.528), request_cost=np.float64(2.88018), response_cost=np.float64(0.33132), total_cost=np.float64(3.211500000000001), execution_duration=np.float64(0.6443328883992508), execution_cost=np.float64(1.0)),
    with_process_cache=Measurement(request_tokens=np.float64(93.318), response_tokens=np.float64(5.352), total_tokens=np.float64(98.67), request_cost=np.float64(2.79954), response_cost=np.float64(0.32112), total_cost=np.float64(3.120660000000001), execution_duration=np.float64(0.630056657816749), execution_cost=np.float64(0.972)),
    with_persistent_cache=Measurement(request_tokens=np.float64(93.318), response_tokens=np.float64(5.352), total_tokens=np.float64(98.67), request_cost=np.float64(2.79954), response_cost=np.float64(0.32112), total_cost=np.float64(3.120660000000001), execution_duration=np.float64(0.630056657816749), execution_cost=np.float64(0.972))))
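
Any reduction over the per-run measurements can be plugged in here; for example, dataset_measurement.calculate_dataset_measurement(np.max) should give a worst-case view (assuming the map function is applied the same way as np.mean above).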

With the to_excel method we can save the global dataset measurement as well as calculated ones to an Excel file, e.g.:

dataset_measurement.to_excel(Path().resolve() / "output/dataset_measurement.xlsx", maps_for_measurements={"mean": np.mean})

Optimization¶

Now we can optimize the number of branches. Note that we do not distinguish between a training and a test dataset here, which we would have to do in a real-world scenario.

In [15]:
import optuna


def objective(trial: optuna.Trial):
    number_of_branches = trial.suggest_int("number_of_branches", 1, 21)
    controller_factory = lambda: create_controller(number_of_branches=number_of_branches)
    # Unlike a plain Optuna objective, this returns a controller factory; the Study
    # wrapper below evaluates it on the dataset and reports the resulting scores.
    return controller_factory
In [17]:
from llm_graph_optimizer.measurement.study_measurement import StudyMeasurement
from llm_graph_optimizer.optimizer.study_optuna import Study

try:
    optuna.delete_study(study_name="test_optimizer_jupyter", storage="sqlite:///db.sqlite3")
except KeyError:
    pass
optuna_study = optuna.create_study(
    direction="maximize",
    storage="sqlite:///db.sqlite3",
    study_name="test_optimizer_jupyter",  # set load_if_exists to True if you want to continue the study after a trial failed. (See parameters in the comments below.)
    )
study_measurement = StudyMeasurement()
study = Study(
    optuna_study=optuna_study,
    metrics=[accuracy_score],
    dataset_evaluator=DatasetEvaluator(
        calculate_score=calculate_score,
        dataloader_factory=dataloader_factory,
        parameters=parameters),  # we can also set save_cache_after_each_trial=True to save the cache after each trial and cheaply recover from failed runs.
    max_concurrent=10,
    study_measurement=study_measurement  # we can also set save_study_measurement_after_each_trial for the same reason.
)
study.set_objective(objective)
[I 2025-09-24 17:23:19,311] A new study created in RDB with name: test_optimizer_jupyter
/Users/XXXX-1/llm-graph-optimizer/llm_graph_optimizer/optimizer/study_optuna.py:47: ExperimentalWarning: set_metric_names is experimental (supported from v3.2.0). The interface can change in the future.
  self.optuna_study.set_metric_names([metric.name for metric in metrics])

Run the terminal command to open the Optuna dashboard:

optuna-dashboard sqlite:///db.sqlite3

As you are inside a Jupyter notebook, the database will be created in the notebook's parent folder. Therefore, run the command from there!

In [18]:
study.optimize(n_trials=10)
Iteration 500: (accuracy = 0.7140, CI width = 0.0795), : 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [01:55<00:00,  4.33it/s]
[I 2025-09-24 17:25:23,245] Trial 0 finished with value: {'accuracy': 0.714} and parameters: {'number_of_branches': 20}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7000, CI width = 0.0806), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 247.79it/s]
[I 2025-09-24 17:25:25,280] Trial 1 finished with value: {'accuracy': 0.7} and parameters: {'number_of_branches': 12}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7020, CI width = 0.0805), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:01<00:00, 282.59it/s]
[I 2025-09-24 17:25:27,062] Trial 2 finished with value: {'accuracy': 0.702} and parameters: {'number_of_branches': 10}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7140, CI width = 0.0795), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 155.13it/s]
[I 2025-09-24 17:25:30,298] Trial 3 finished with value: {'accuracy': 0.714} and parameters: {'number_of_branches': 20}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7080, CI width = 0.0800), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:01<00:00, 334.66it/s]
[I 2025-09-24 17:25:31,804] Trial 4 finished with value: {'accuracy': 0.708} and parameters: {'number_of_branches': 8}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.6880, CI width = 0.0815), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 769.88it/s]
[I 2025-09-24 17:25:32,467] Trial 5 finished with value: {'accuracy': 0.688} and parameters: {'number_of_branches': 1}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7120, CI width = 0.0797), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:01<00:00, 348.68it/s]
[I 2025-09-24 17:25:33,915] Trial 6 finished with value: {'accuracy': 0.712} and parameters: {'number_of_branches': 7}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7140, CI width = 0.0795), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 172.16it/s]
[I 2025-09-24 17:25:36,834] Trial 7 finished with value: {'accuracy': 0.714} and parameters: {'number_of_branches': 16}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7000, CI width = 0.0806), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 223.73it/s]
[I 2025-09-24 17:25:39,083] Trial 8 finished with value: {'accuracy': 0.7} and parameters: {'number_of_branches': 12}. Best is trial 0 with value: 0.714.
Iteration 500: (accuracy = 0.7120, CI width = 0.0797), : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 187.27it/s]
[I 2025-09-24 17:25:41,766] Trial 9 finished with value: {'accuracy': 0.712} and parameters: {'number_of_branches': 15}. Best is trial 0 with value: 0.714.