## Evaluation Experiment Details

#### Evaluator Details

We recruited 15 graduate students as evaluators and trained them before the evaluation. During the evaluation, the evaluators were not told the sources of the scripts, and the sources of the two script excerpts compared in each questionnaire were randomized. Evaluators were asked to answer based on their genuine impressions after reading the script excerpts. Each questionnaire was designed to take 6 to 10 minutes, enough time to read the excerpts thoroughly and answer every question.

#### Evaluation Materials

`./doc/Evaluation Materials`

Our evaluation materials are divided into three groups, one for each baseline we compare against: Rolling, Dramatron, and Wawa. Each group contains five novels, totaling fifteen script excerpts, along with loglines from the original novels. The sources of all script excerpts have been randomized.

#### Questionnaire Result

`./doc/Questionnaire Result`

In the experiment, we collected 162 human evaluation questionnaires, 60 GPT-4o evaluation questionnaires, and 45 Claude 3 Opus evaluation questionnaires. We collected more human questionnaires to make the human evaluation results more stable.

#### Evaluation Questionnaire Template

`./doc/Sample Questionnaire.docx`

Figure 1 shows an example of our evaluation questionnaire. Each questionnaire includes instructions, a summary of the original text, two script excerpts adapted from the novel summary, and evaluation questions.

The questionnaire evaluates the scripts across seven aspects. The question numbers corresponding to each aspect are as follows:

- Interesting: Q1, Q2
- Coherent: Q3, Q4
- Humanlike: Q5, Q6
- Dict & Gram: Q7, Q8
- Transition: Q9, Q10
- Format: Q12
- Consistency: Q13

Additionally, Q11 serves as a lie detection question; if the responses to Q11 and Q5 contradict each other, the questionnaire will be deemed invalid.
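
The mapping and validity rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' processing script: the names `ASPECT_QUESTIONS`, `is_valid`, and `answers_by_aspect` are hypothetical, and since the exact definition of a contradiction is not spelled out here, the sketch assumes that giving the same option to Q5 and Q11 is contradictory.

```python
# Question numbers for each aspect, copied from the list above.
ASPECT_QUESTIONS = {
    "Interesting": [1, 2],
    "Coherent": [3, 4],
    "Humanlike": [5, 6],
    "Dict & Gram": [7, 8],
    "Transition": [9, 10],
    "Format": [12],
    "Consistency": [13],
}


def is_valid(answers):
    """Return True if a questionnaire passes the Q11 check.

    `answers` is a list of 13 option letters; index 0 holds Q1.
    Assumption: Q5 (more human-written) and Q11 (more AI-generated)
    ask opposite questions, so identical answers contradict each other.
    """
    q5, q11 = answers[4], answers[10]
    return q5 != q11


def answers_by_aspect(answers):
    """Group the answers of one questionnaire by evaluation aspect."""
    return {
        aspect: [answers[q - 1] for q in questions]
        for aspect, questions in ASPECT_QUESTIONS.items()
    }
```

A stricter or looser rule (for example, flagging only A/A and B/B pairs) would only require changing the body of `is_valid`.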

![ques1](./pictrue/ques1.jpg)

![ques2](./pictrue/ques2.jpg)

![ques3](./pictrue/ques3.jpg)

![ques4](./pictrue/ques4.jpg)

Figure 1: Example of the questionnaire format utilized in our experiment. Only a portion is displayed due to the script's length.

#### Prompt for Evaluation

`./doc/evaluation_prompt.txt`

This is the prompt we use for evaluation by large models. It lets GPT-4o return its chosen option for each question directly, in a form that is easy to process.

```
This is a questionnaire designed to evaluate the quality of a screenplay adapted from a novel. Below, you will read a logline of the novel's paragraphs and two screenplay excerpts, and then choose the screenplay excerpt that you believe best fits the description for each question. Each question has four options: A. Script A, B. Script B, C. Both, D. Neither. Please carefully consider each question and choose the option that best reflects your feelings.
------
Novel_Logline:
{Novel_Logline}

Script A:
{Script_A}

Script B:
{Script_B}
------
questions:
1. Which screenplay excerpt makes you more eager to know what happens next?
2. Which screenplay excerpt better captures your attention and immerses you in the story?
3. Which screenplay excerpt has more natural and smooth plot transitions?
4. Which screenplay excerpt has less jarring scene transitions?
5. Which screenplay excerpt feels more like it was written by a human screenwriter?
6. Which screenplay excerpt has more natural dialogue that aligns with how people typically express themselves?
7. Which screenplay excerpt uses more precise language and appropriate word choices?
8. Which screenplay excerpt has a more elegant grammatical structure?
9. Which screenplay excerpt presents smoother emotional changes in characters, without feeling abrupt?
10. Which screenplay excerpt has more natural pacing transitions?
11. Which screenplay excerpt feels more like it was generated by AI?
12. Which screenplay excerpt follows the standard format for screenplays more closely?
13. Which screenplay excerpt aligns more closely with the overall story outline?
------
Please output the answer options in list form according to the question order, and do not output any additional information.
```
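
One way to fill in the placeholders of this prompt and read back the reply is sketched below. The helper names `build_prompt` and `parse_answers` are hypothetical, and the exact reply format produced by the model is an assumption: the parser simply collects standalone option letters and rejects replies that do not contain exactly 13 of them.

```python
import re


def build_prompt(template, logline, script_a, script_b):
    """Fill the three placeholders of the evaluation prompt.

    Plain string replacement is used instead of str.format so that
    curly braces inside a script excerpt cannot raise an error.
    """
    return (
        template.replace("{Novel_Logline}", logline)
        .replace("{Script_A}", script_a)
        .replace("{Script_B}", script_b)
    )


def parse_answers(reply, expected=13):
    """Extract option letters from a reply such as "[A, B, C, ...]".

    Assumption: the reply contains only the answer list, so every
    standalone A, B, C, or D is an answer.
    """
    letters = re.findall(r"\b([ABCD])\b", reply)
    if len(letters) != expected:
        raise ValueError(f"expected {expected} answers, found {len(letters)}")
    return letters
```

Raising an error on an unexpected count keeps malformed replies out of the results rather than silently misaligning questions and answers.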

#### Evaluated by Claude

`./doc/Claude Results.xlsx`

In addition to using GPT-4o for script evaluation, we also employed Claude 3 Opus to verify the feasibility of large-model evaluation. The results are shown in Table 1. R2 outperforms the three baselines on all seven aspects, and the results are comparable to those from GPT-4o.

Table 1: Win rates of R2 against the three baselines, as evaluated by Claude 3 Opus.

| Approach | Interesting | Coherent | Humanlike | Dict & Gram | Transition | Format | Consistency |
| -------- | ----------- | -------- | --------- | ----------- | ---------- | ------ | ----------- |
| ROLLING  | 33.3%       | 26.7%    | 6.7%      | 0.0%        | 6.7%       | 6.7%   | 13.3%       |
| R2       | 73.3%       | 73.3%    | 93.3%     | 100.0%      | 93.3%      | 93.3%  | 93.3%       |

| Approach  | Interesting | Coherent | Humanlike | Dict & Gram | Transition | Format | Consistency |
| --------- | ----------- | -------- | --------- | ----------- | ---------- | ------ | ----------- |
| Dramatron | 14.3%       | 28.6%    | 32.1%     | 21.4%       | 14.3%      | 14.3%  | 57.1%       |
| R2        | 85.7%       | 71.4%    | 67.9%     | 78.6%       | 85.7%      | 85.7%  | 78.6%       |

| Approach | Interesting | Coherent | Humanlike | Dict & Gram | Transition | Format | Consistency |
| -------- | ----------- | -------- | --------- | ----------- | ---------- | ------ | ----------- |
| Wawa     | 23.3%       | 30.0%    | 33.3%     | 20.0%       | 13.3%      | 46.7%  | 53.3%       |
| R2       | 76.7%       | 70.0%    | 66.7%     | 83.3%       | 86.7%      | 53.3%  | 73.3%       |
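
The win-rate formula is not stated explicitly in this document, so the following is only a sketch under stated assumptions: a "Both" answer (C) counts as a win for each script, a "Neither" answer (D) counts for none, and the rate is taken over all answers to the questions of one aspect. The paired percentages in Table 1 can sum to more than 100% (for example, 33.3% and 73.3% for Interesting against ROLLING), which would be consistent with this reading. The function name `win_rates` and the record layout are hypothetical.

```python
def win_rates(records, questions):
    """Compute (R2 win rate, baseline win rate) for one aspect.

    `records` is a list of (answers, r2_label) pairs, where `answers`
    holds the 13 option letters of one valid questionnaire and
    `r2_label` is "A" or "B", i.e. which script came from R2 after
    the sources were randomized.
    Assumption: C counts for both scripts, D counts for neither.
    """
    r2_wins = baseline_wins = total = 0
    for answers, r2_label in records:
        for q in questions:
            choice = answers[q - 1]
            total += 1
            if choice == "C":
                r2_wins += 1
                baseline_wins += 1
            elif choice == r2_label:
                r2_wins += 1
            elif choice in ("A", "B"):
                baseline_wins += 1
    return r2_wins / total, baseline_wins / total


# Toy data for illustration only, not taken from the experiment.
toy = [(["A", "C"], "A"), (["B", "D"], "A")]
print(win_rates(toy, questions=[1, 2]))
```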

## Results of Different Factors

`./doc/Analysis of Traversal Methods Results.csv`

In exploring the impact of various factors on script generation quality, we utilized the same experimental dataset and randomized the sources of script segments for evaluation by GPT-4o. Table 2 displays the evaluation results for the different plot graph traversal methods.

Table 2: Win rates of the different plot graph traversal methods, as evaluated by GPT-4o.

| Approach | Interesting | Coherent | Humanlike | Dict & Gram | Transition | Format | Consistency |
| -------- | ----------- | -------- | --------- | ----------- | ---------- | ------ | ----------- |
| DFS      | 83.30%      | 100.00%  | 83.30%    | 77.80%      | 77.80%     | 88.90% | 88.90%      |
| chapter  | 16.70%      | 77.80%   | 16.70%    | 22.20%      | 33.30%     | 33.30% | 11.10%      |

| Approach | Interesting | Coherent | Humanlike | Dict & Gram | Transition | Format  | Consistency |
| -------- | ----------- | -------- | --------- | ----------- | ---------- | ------- | ----------- |
| BFS      | 77.80%      | 94.40%   | 83.30%    | 66.70%      | 83.30%     | 100.00% | 100.00%     |
| chapter  | 22.20%      | 22.20%   | 16.70%    | 33.30%      | 16.70%     | 22.20%  | 11.10%      |


## Calculation of Cohen's Kappa Coefficient

Cohen's kappa coefficient is a statistic that measures inter-rater reliability for qualitative (categorical) items. In the questionnaire, the four options A, B, C, and D for each question are defined as follows:
- A. Think script A is better
- B. Think script B is better
- C. Think both scripts are good
- D. Think both scripts are bad

The options relate to one another as follows:
1. Choosing A and choosing B represent completely opposite opinions, as do choosing C and choosing D.
2. Choosing A approves of script A, which partially agrees with choosing C; it also rejects script B, which partially agrees with choosing D. The same holds for choosing B.

Therefore, we use the following agreement weight matrix to calculate the weighted Cohen's kappa, with rows and columns ordered A, B, C, D:
$$
\begin{bmatrix}
1 & 0 & 0.5 & 0.5 \\
0 & 1 & 0.5 & 0.5 \\
0.5 & 0.5 & 1 & 0 \\
0.5 & 0.5 & 0 & 1
\end{bmatrix}
$$
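
With agreement weights like these, the standard weighted kappa is $\kappa_w = (P_o - P_e) / (1 - P_e)$, where $P_o$ is the weighted observed agreement and $P_e$ is the weighted agreement expected from each rater's marginal option frequencies. The sketch below applies that textbook formula to the matrix above. It is an illustration rather than the script used for the experiment, and the names `LABELS`, `WEIGHTS`, and `weighted_kappa` are hypothetical.

```python
LABELS = "ABCD"

# Agreement weights from the matrix above (rows and columns: A, B, C, D).
WEIGHTS = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.0],
    [0.5, 0.5, 0.0, 1.0],
]


def weighted_kappa(rater1, rater2):
    """Weighted Cohen's kappa for two equal-length lists of option letters."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater1)
    k = len(LABELS)
    index = {label: i for i, label in enumerate(LABELS)}

    # Joint counts: counts[i][j] = items rated i by rater 1 and j by rater 2.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        counts[index[a]][index[b]] += 1

    row = [sum(counts[i]) / n for i in range(k)]  # rater 1 marginals
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]  # rater 2

    p_o = sum(WEIGHTS[i][j] * counts[i][j] / n for i in range(k) for j in range(k))
    p_e = sum(WEIGHTS[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)
```

For example, `weighted_kappa(list("AABB"), list("ACBD"))` gives 0.5: the weighted observed agreement is 0.75 and the weighted expected agreement is 0.5.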