# Synthetic data generation for an audio model
Generate synthetic data entries for training a text-to-audio model from the audio clip descriptions provided by the user. For each description, produce a JSON object with the following fields:
1) **"input"**: The original description of the initial audio as provided by the user (required, type string).
2) **"reasoning"**: Reasoning steps that analyze the input description and decide on which instruction would be appropriate (required, type string). 
3) **"speech"**: A boolean ("true" or "false") indicating whether the provided audio caption contains speech (type boolean).
4) **"n_elements"**: A number indicating how many unique audio elements are present in the provided audio caption (type number).
5) **"instruction"**: A single editing operation described as an instruction to modify the initial audio (required, type string).
6) **"type"**: The type of edit (required, type string, one of "add", "replace", "remove", "other").
7) **"output"**: A new description reflecting the input audio after the modification according to the "instruction" (required, type string).
8) **"input_neg"**: Elements explicitly not present in the input audio, emphasizing what is absent (may be empty, type string).
9) **"output_neg"**: Elements that should be absent from the output audio after modification (may be empty, type string).
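
For orientation, a single entry with these nine fields might look like the following Python sketch. The values are illustrative assumptions built from one of the example captions in this document; they are not a canonical answer.

```python
import json

# Illustrative sample only: every value below is an assumption chosen to
# satisfy the field descriptions above.
sample = {
    "input": "An audience gives applause as a man yells",
    "reasoning": (
        "The caption contains two elements: audience applause and a man yelling. "
        "Since two distinct sounds are already present, removing one is a simple edit."
    ),
    "speech": True,
    "n_elements": 2,
    "instruction": "remove the applause",
    "type": "remove",
    "output": "A man yells",
    "input_neg": "",
    "output_neg": "applause",
}

# Serialize to JSON; Python's True becomes the JSON literal true.
print(json.dumps(sample, indent=2))
```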

Follow these content guidelines to ensure consistency, clarity, and technical feasibility.

## Content Guidelines
### "input" field:
- Use the exact audio description provided by the user.
 
### "reasoning" field:
- Analyze the provided input audio caption step by step and then choose a fitting instruction to modify the input audio.
- Should include 2-4 sentences analyzing the main elements of the input audio.
- Example:
    - Input: An audience gives applause as a man yells
    - Reasoning: Based on the input description, this clip may have been recorded in an auditorium with audible reverb. The man yelling could be changed to a woman yelling without changing the overall structure of the sound.

### "speech" field:
- Indicate whether the audio caption contains speech ("true" or "false").

### "n_elements" field:
- Represents the number of distinct audio elements in the provided caption.
 
### "instruction" field:
- Describe a **single and clear modification** to the audio.
- The edit instruction should be **simple** and **tailored** to the provided input caption. 
- Do not instruct changes that would have no clearly audible effect on the edited audio (e.g., changing "sound of ducks" to "sound of ducklings", changing "toddler crying" to "infant crying").
- If the input audio description already contains **two or more distinct sounds**, avoid adding new sounds; focus on altering characteristics or removing sounds.
- If appropriate, make sound replacements that have a similar structure (e.g., "change the dog to a parrot", "replace the guitar with a piano", "change the helicopter to a truck", "the vehicle should be a motorboat", "replace the beeping with someone clapping")
- Do not change the underlying **structure** of the sound (e.g., do not change tempo or speed, do not change the order of sounds, a continuous sound must stay continuous)
- Do not add human speech (e.g., speaking, yelling, talking, ... )
- The edit instruction should sound human-written: vary capitalization and include some grammatical mistakes.
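
Some of the rules above can be approximated with a simple post-hoc check on generated samples. The sketch below is a heuristic, not a definitive filter: the function name `instruction_warnings` and both keyword lists are hypothetical, and keyword matching will miss paraphrases.

```python
import re

# Hypothetical keyword lists; the guidelines name only a few of these words.
SPEECH_WORDS = re.compile(r"\b(speak|speaking|speech|talk|talking|yell|yelling)\b", re.I)
TEMPO_WORDS = re.compile(r"\b(tempo|speed|faster|slower)\b", re.I)


def instruction_warnings(instruction: str, edit_type: str, n_elements: int) -> list[str]:
    """Return heuristic warnings for an instruction; an empty list means none found."""
    warnings = []
    # Inputs with two or more distinct sounds should not receive additions.
    if edit_type == "add" and n_elements >= 2:
        warnings.append("adds a sound although the input already has two or more")
    # Adding human speech is not allowed.
    if edit_type == "add" and SPEECH_WORDS.search(instruction):
        warnings.append("appears to add human speech")
    # Tempo and speed changes alter the structure of the sound.
    if TEMPO_WORDS.search(instruction):
        warnings.append("appears to change tempo or speed")
    return warnings
```

For instance, `instruction_warnings("add a man talking", "add", 2)` would return two warnings, while a replacement such as "replace the guitar with a piano" would return none.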

### "type" field:
- Specifies which type of edit was generated. The type is one of ("add", "replace", "remove", "other")
- Use "other" if it does not fall into the other categories (e.g., "It should be quieter", "it should be further in the distance")

### "output" field:
- Provide a stand-alone description of the resulting audio after applying the instruction.
- Do **not** reference how it differs from the input.
- Avoid phrases that refer back to the input state (e.g., "More than before," "less than before," "Without bird sounds").
- Instead, simply describe the present sounds and their characteristics.
- Examples of invalid and valid output prompts
    - **Invalid output prompt**:
        - "Ocean waves crashing on the shore but quieter than before."
        - *Issue*: References the input state with "quieter than before."
    - **Invalid output prompt**:
        - "Now the ocean waves are gentler."
        - *Issue*: References the input state with "Now" and the comparative "gentler."
    - **Valid output prompt**:
        - "Ocean waves gently crashing on the shore."
        - *Explanation*: Describes the audio without referencing the input.
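
A rough way to screen "output" descriptions for such back-references is a phrase check, sketched below. The function name `references_input_state` and the phrase list are assumptions; only the first three phrases appear in the examples above, and the rest are guesses at similar wording.

```python
import re

# Assumed phrase list: "than before", "without", and "now" appear in the
# examples above; the remaining phrases are guesses at similar back-references.
BACK_REFERENCE = re.compile(
    r"\b(than before|without|now|no longer|anymore|instead of)\b",
    re.IGNORECASE,
)


def references_input_state(output: str) -> bool:
    """Heuristically detect wording that refers back to the input audio."""
    return BACK_REFERENCE.search(output) is not None
```

This catches only explicit phrases; a bare comparative such as "gentler" would still need manual review.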

### "input_neg" field:
- Empty when changing sound characteristics, but required when adding a new sound not mentioned in the input audio.
- Specify elements that are **not present** in the input audio.
- Example: If adding rain sounds, mention "rain sounds" in "input_neg" to emphasize their absence in the original audio.

### "output_neg" field:
- Empty when changing sound characteristics, but required when removing or replacing a sound in the input audio.
- Specify elements that should be **absent** after modification.
- Example: If removing wind sounds, mention "wind" in "output_neg".
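
The rules for the two negative fields can be summarized as a small consistency check. This is a minimal sketch: the function name `check_negatives` is hypothetical, and it assumes that "changing sound characteristics" corresponds to the edit type "other".

```python
def check_negatives(sample: dict) -> list[str]:
    """Check "input_neg" and "output_neg" against the edit type."""
    problems = []
    edit_type = sample["type"]
    input_neg = sample.get("input_neg", "").strip()
    output_neg = sample.get("output_neg", "").strip()

    # Adding a sound: its absence from the input should be stated.
    if edit_type == "add" and not input_neg:
        problems.append('"input_neg" should name the added sound')
    # Removing or replacing a sound: its absence from the output should be stated.
    if edit_type in ("remove", "replace") and not output_neg:
        problems.append('"output_neg" should name the removed or replaced sound')
    # Assumption: "changing sound characteristics" corresponds to type "other".
    if edit_type == "other" and (input_neg or output_neg):
        problems.append("both negative fields should be empty for characteristic changes")
    return problems
```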


## Formatting
- Use lowercase or mixed capitalization naturally. Proper nouns can be capitalized if needed, but uniform lowercase is acceptable.


## Diversity
- Aim to include a wide range of sounds from different categories such as nature, animals, human activities, mechanical noises, musical instruments, and ambient environments.
- Avoid overusing common elements like "dog," "cat," or "bird".
- Generate instructions that involve different types of modifications, including adding rare or unique sounds, changing environments, or altering sound characteristics in novel ways.


## Examples of Usage
- Adding, replacing, or removing specific sounds: e.g., "Add crowd cheering", "replace the man with a woman", or "Remove background noise"
- Altering the sound characteristics: e.g., "Decrease the pitch", "increase the reverb", or "decrease the volume of ambient noise"
- Adding situation-specific effects: e.g., "make it sound like it was recorded in an auditorium", or "It should be underwater"
- Many more...


## Summary
The user will provide multiple audio captions, each on a separate line.
As a reminder for the generated instruction:
- Do not change speed or tempo
- Do not change the order of elements in the audio
- The generated instruction should be simple, clear, and be tailored to the provided input caption.
Your response should be formatted as a valid JSON object with a property called "samples".
This property is an array containing an object for each audio description the user provides.
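
A consumer of this format could verify a response along the following lines. This is a sketch under stated assumptions: the function name `validate_response` is hypothetical, and "speech" and "n_elements" are treated as optional because the field list does not mark them as required.

```python
import json

# Field names and types follow the field list in this document.
# Assumption: fields not marked "required" are type-checked only when present.
REQUIRED = {"input": str, "reasoning": str, "instruction": str, "type": str, "output": str}
OPTIONAL = {"speech": bool, "n_elements": int, "input_neg": str, "output_neg": str}
EDIT_TYPES = {"add", "replace", "remove", "other"}


def validate_response(text: str, captions: list[str]) -> list[str]:
    """Return a list of problems found in a model response; empty if none."""
    data = json.loads(text)
    samples = data.get("samples") if isinstance(data, dict) else None
    if not isinstance(samples, list):
        return ['top-level object needs a "samples" array']
    problems = []
    if len(samples) != len(captions):
        problems.append(f"expected {len(captions)} samples, got {len(samples)}")
    for i, (sample, caption) in enumerate(zip(samples, captions)):
        if not isinstance(sample, dict):
            problems.append(f"sample {i}: not a JSON object")
            continue
        for field, kind in REQUIRED.items():
            if not isinstance(sample.get(field), kind):
                problems.append(f"sample {i}: missing or mistyped {field!r}")
        for field, kind in OPTIONAL.items():
            if field in sample and not isinstance(sample[field], kind):
                problems.append(f"sample {i}: mistyped {field!r}")
        if sample.get("type") not in EDIT_TYPES:
            problems.append(f"sample {i}: unknown edit type")
        # The "input" field must repeat the user's caption exactly.
        if sample.get("input") != caption:
            problems.append(f"sample {i}: input does not match the caption")
    return problems
```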
