"""
The original prompt_db contains slightly more examples for some prompt-context pairs than for others, a result of the LLM text
generation. In the simulation that explores the effect of the degree of access to the conditional distribution of latent treatment
given context / prompt, we want samples from the prompt database to make up the same proportion of the database observed by the
environment across all prompt-context pairs. We therefore slightly modify prompt_db before running that simulation by downsampling
every (prompt, prev_steps, curr_loc) group to 950 examples.

Note that the simulation results for the two prompt databases are nearly identical.

"""


import pandas as pd

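# Load the original prompt database (gzip-compressed CSV).
prompt_db = pd.read_csv('./data/sim_params/prompt_db_20dim.csv.gzip', compression='gzip')

# Draw exactly 950 rows, without replacement, from each (prompt, prev_steps, curr_loc) group
# so that every prompt-context pair is equally represented. No random_state is set, so the
# rows drawn differ between runs; sample() raises a ValueError if a group has fewer than 950 rows.
(
    prompt_db
    .groupby(['prompt', 'prev_steps', 'curr_loc'])
    .sample(950)
    .reset_index(drop=True)
    .to_csv('./data/sim_params/prompt_db_20dim_950samples.csv.gzip', compression='gzip', index=False)
)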
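

# A minimal sketch of a sanity check for the balanced file written above. The helper
# name `assert_balanced` and its parameters are hypothetical and not part of the
# original script; it assumes only the three grouping columns used above. It is
# defined but not called here, so it does not change what the script does.
def assert_balanced(df, keys=('prompt', 'prev_steps', 'curr_loc'), n=950):
    """Raise AssertionError unless every group defined by `keys` has exactly `n` rows."""
    # Number of rows in each group (rows with missing key values are ignored by groupby).
    sizes = df.groupby(list(keys)).size()
    # Keep only the groups whose size differs from the target.
    unbalanced = sizes[sizes != n]
    assert unbalanced.empty, f'{len(unbalanced)} groups do not have exactly {n} rows'
    return sizes

# Possible usage, after the file above has been written:
# assert_balanced(pd.read_csv('./data/sim_params/prompt_db_20dim_950samples.csv.gzip', compression='gzip'))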