# metagen
Code and data for A Foundational Model for Metamaterial Design

# Description

This repository contains the implementation of MetaDSL (called `metagen` in the code), a dataset of MetaGen programs, and utilities for batch processing: executing DSL programs, simulating and validating their outputs, and formatting the data for LLM training and evaluation.


# Installation

Prerequisites: This repository assumes that you have AWS credentials stored locally in the `.aws` subdirectory of your home directory. The easiest way to set this up is to install and configure the AWS CLI on your machine using the instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). For credential setup, follow the "IAM user long-term credentials" instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
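
As a quick sanity check of this prerequisite, the sketch below looks for the shared credentials file that the AWS CLI writes to `.aws/credentials` under your home directory. The function name `find_aws_credentials` is hypothetical and is not part of this repository. The check only confirms that the file exists, not that the credentials in it are valid.

```python
from pathlib import Path
from typing import Optional


def find_aws_credentials(home: Optional[Path] = None) -> Optional[Path]:
    """Return the path to <home>/.aws/credentials if it exists, else None.

    Hypothetical helper for illustration only; it is not part of metagen.
    """
    # Default to the current user's home directory.
    home = Path.home() if home is None else Path(home)
    candidate = home / ".aws" / "credentials"
    return candidate if candidate.is_file() else None
```

Calling `find_aws_credentials()` with no arguments checks the current user's home directory. A return value of `None` suggests that the credential setup step has not been completed yet.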

Method 1: Conda and Pip

You can install this package directly on your machine using conda and pip by running:

`conda env create -f environment.yml`

`conda activate metagen`

`pip install -e .`

Some functions require `evaluateJSONMetagen` from Microcodecs to be accessible on your `PATH`. The easiest way to satisfy this is to use this repository from within a devcontainer instead.

Method 2: Devcontainer

Instead of manually installing dependencies, you can run this repository through a devcontainer. The only dependency for doing so is Docker / Docker Engine (Windows). This devcontainer has been tested on Ubuntu and on Ubuntu through WSL 2 (on Windows). Visual Studio Code has support for dev containers, detailed [here](https://code.visualstudio.com/docs/devcontainers/containers).


# Organization

- data/ : The Metagen Dataset (program files only)
- metagen/ : Main python package
  - analysis.py : Functions to analyze program processing failures
  - augmentation.py : Unused (function to select hybridization combinations)
  - datagen.py : Data generation functions, LLM formatting functions, and utilities for pushing seed programs to S3
  - evaluation.py : Functions for evaluating LLM predictions
  - llm.py : Unused
  - openai\_parallel.py : Batch inference on OpenAI (does not handle rate limiting well)
  - processing.py : Main entry point for running and evaluating MetaGen programs
  - rendering.py : Headless rendering of metamaterial .obj files from multiple viewpoints
  - templates.py : Template format strings for LLM prompts. Includes MetaGen language specifications and task templates.
  - training.py : LLM formatting for training
  - util.py : Utility functions, including formatting and S3 access
  - validation.py : Code to validate generated metamaterials for periodicity and contiguity
- notebooks/ : Jupyter notebooks, largely development scratchpads, but also some dataset generation pipelines
- prompt\_materials/ : Deprecated in favor of templates.py
- scripts/ : Runnable scripts, largely deprecated
- scripts\_old/ : Old, deprecated scripts

# Utilities

The `metagen/utils.py` file contains many helper functions for manipulating and accessing the example directories stored on S3 (or elsewhere) by abstracting away whether paths passed are local directories or s3 URIs (s3://...).
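
To illustrate what abstracting over local paths and S3 URIs can look like, here is a minimal sketch of two path helpers that branch on the `s3://` prefix. It is not the code from this repository. The names `path_append_sketch` and `file_dir_sketch` are hypothetical stand-ins, and the behavior of the real helpers may differ.

```python
import os
import posixpath

S3_PREFIX = "s3://"


def is_s3(path: str) -> bool:
    """True if the path looks like an S3 URI rather than a local path."""
    return path.startswith(S3_PREFIX)


def path_append_sketch(base: str, suffix: str) -> str:
    """Append one path component to a local path or an S3 URI."""
    if is_s3(base):
        # S3 keys always use forward slashes, regardless of the local OS.
        return base.rstrip("/") + "/" + suffix.lstrip("/")
    return os.path.join(base, suffix)


def file_dir_sketch(path: str) -> str:
    """Return the parent 'directory' of a local path or an S3 URI."""
    if is_s3(path):
        # Split only the part after the scheme so "s3://" stays intact.
        rest = path[len(S3_PREFIX):]
        return S3_PREFIX + posixpath.dirname(rest)
    return os.path.dirname(path)
```

With this pattern, calling code can pass either kind of location without checking which one it has. For example, `path_append_sketch("s3://bucket/data", "code.py")` and `path_append_sketch("data", "code.py")` both yield a usable location for the same file name. The bucket name here is made up for the example.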

Among the more useful functions in this file are:

- open_file : like open(x, 'r') for either local or remote files
- get_text : reads local or remote text files
- file_dir : like os.path.dirname for either local or remote files
- path_append : like os.path.join for either local or remote files (but only appends one level, not a list of suffixes)
- read_json : opens and reads a local or remote JSON file
- list_all : lists all files under a given local directory or S3 URI
- list_all_filtered : like list_all, but also accepts gitignore-style accept and reject filters. For example, all of the `code.py` files under `/data` can be listed with `list_all_filtered("/data", ['*.py'])`.
- load_mesh : loads the mesh from a local or remote example directory
- load_image_bytes : returns the bytes of a local or remote image file. If the environment variable `S3CACHE` is set, remote images are also cached locally in the directory it names. This cache can be useful for collecting all image files for local training.
- format_prompt_contents : formats multimodal prompts into LLM-specific formats. This assumes that any images embedded in the prompt are given as locations delimited by `<[` and `]>`.
- graph_representation : loads `graph.json` given an example directory, either local or remote
- code_representation : loads `code.py` given an example directory, either local or remote
- read_text_file : opens and reads both local and remote text files (similar to get_text, but more robust)
- list_all_programs : finds and lists all programs in a dataset. You need to provide a ProgramType, which tells it to look for either `code.py` files or `graph.json` files.
- visualize_material : loads all 4 rendered viewpoints of a material into one PIL image for quick debugging visualization
- load_and_format_properties : loads simulation data for an example, rounds selected properties to quantization values, and returns a compact JSON representation
- parse_record_id : decodes train/test example labels
- load_jsonl : opens and loads a local or remote JSON Lines (jsonl) file
- extract_and_classify_blocks : finds code blocks in LLM responses
- parse_json : more robust JSON parsing for LLM output that tolerates the minor JSON deviations LLMs tend to produce (such as added comments)
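
The `<[` ... `]>` image delimiter convention mentioned for format_prompt_contents can be illustrated with a small sketch. This is not the repository's implementation. The function `split_prompt` is a hypothetical name, and the LLM-specific output formats of the real function are not reproduced. The sketch only shows how a prompt might be split into ordered text and image-location parts.

```python
import re
from typing import List, Tuple

# Matches an image location wrapped in the <[ ... ]> delimiters.
IMAGE_REF = re.compile(r"<\[(.*?)\]>", re.DOTALL)


def split_prompt(prompt: str) -> List[Tuple[str, str]]:
    """Split a prompt into ("text", ...) and ("image", location) parts, in order.

    Hypothetical helper for illustration only; it is not part of metagen.
    """
    parts: List[Tuple[str, str]] = []
    cursor = 0
    for match in IMAGE_REF.finditer(prompt):
        # Keep any plain text that precedes this image reference.
        if match.start() > cursor:
            parts.append(("text", prompt[cursor:match.start()]))
        parts.append(("image", match.group(1).strip()))
        cursor = match.end()
    # Keep any trailing text after the last image reference.
    if cursor < len(prompt):
        parts.append(("text", prompt[cursor:]))
    return parts
```

A formatter built on this idea could then map each `("image", location)` part to whatever image payload a given LLM API expects, for example by loading the bytes from the local or remote location.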