<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      PPLM Revisited: Steering and Beaming a Lumbering Mammoth to Control Text Generation &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/2021/12/01/PPLM/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">PPLM Revisited: Steering and Beaming a Lumbering Mammoth to Control Text Generation</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#natural-language-generation"> natural language generation </a>
  
    <a class="content-tag" href="/tags/#language-models"> language models </a>
  
    <a class="content-tag" href="/tags/#pplm"> pplm </a>
  
    <a class="content-tag" href="/tags/#reproducibility"> reproducibility </a>
  
    <a class="content-tag" href="/tags/#generalization"> generalization </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Anonymous</span>
  <h2 id="-1-introduction"><a name="section1"></a> 1. Introduction</h2>

<p>With access to extensively pre-trained language models such as GPT-2/3 <a href="#Brown">[Brown et al., 2020]</a>, <a href="#Radford2019">[Radford et al., 2019]</a>, there has been tremendous progress in the field of Natural Language Generation (NLG). Although these models can produce readable and coherent text, letting users influence the generated text by steering towards desired topics or attributes is a challenging task. The Plug and Play Language Model (PPLM), introduced at ICLR 2020 <a href="#Dathathri">[Dathathri et al., 2020]</a>, was one of the first works on controlled text generation. The Plug and Play (PP) component optimizes the output of a pre-trained language model towards containing certain topics or text attributes. PPLM employs a pre-trained language model (LM) that generates text based on a given prompt. The LM itself is not adapted; rather, controllability is achieved by adapting the likelihood of the words to be generated, using either a Bag-of-Words (BoW) related to a desired topic or a discriminative classifier that controls, e.g., the sentiment of a sentence. Due to its simplicity and ease of use, PPLM has been widely adopted. At the time of writing, the ICLR 2020 publication has been <a href="https://scholar.google.nl/scholar?cluster=9850887597524341216&amp;hl=en&amp;as_sdt=0,5">cited</a> more than 200 times and the <a href="https://github.com/uber-research/PPLM">official implementation</a> has received more than 800 stars on GitHub.
It also served as a basis for new controllable NLG models in various domains, such as  controlled counterfactual generation of text <a href="#Madaan">[Madaan et al., 2021]</a>, belief-based generation of argumentative claims <a href="#Alshomary">[Alshomary et al., 2021]</a> and fact-enhanced synthetic news generation <a href="#Shu">[Shu et al., 2021]</a>.
In this blogpost, we examine to what extent language generation can be controlled by investigating reproducibility, the impact of the prompt vs. the BoW, and style control.</p>

<p><strong>PPLM and mammoths</strong></p>

<p>In a <a href="https://eng.uber.com/pplm/">blogpost</a> accompanying the original paper, the authors compared language generation models with “unguided wooly mammoths that lumber wherever they please.” A language model (LM) generates a text word by word, similar to a mammoth lumbering step by step. Using this metaphor, PPLM was presented as a mouse sitting on top of the mammoth and telling it where to go. We can clarify PPLM further by including trees in the metaphor representing words. The path of a mammoth is then represented by the trees the mammoth passes on its way and consequently represents the sequence of words that is generated. By steering the mammoth towards specific trees, the mouse has control over the generated text (see Figure 1.1).</p>

<p style="text-align: center;">Figure 1.1: The sequence of words is represented by a mammoth passing trees. Each tree represents a word.</p>

<p><img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoths-path-as-word-sequence.jpg" alt="Can we steer the mammoth by only showing it the list of trees, i.e., using a BoW?" /></p>

<p><strong>Contributions</strong></p>

<p>Although this mammoth-mouse metaphor is an excellent representation of the general idea behind PPLM, it does not cover the full behavior of PPLM. With a set of experiments, we analyze PPLM in more depth and subsequently extend the mammoth metaphor to explain the workings of text controllability. Specifically, we analyze how well and to what extent the mammoth follows the mouse’s instructions. We first evaluate the reproducibility of PPLM in order to provide a baseline regarding the validation of PPLM.
Second, we analyze the interplay between the prompt and the BoW and their impact on the controllability of text by experiments on generating topic-related questions. We interpret the results by extending the mammoth metaphor: we introduce beams and islands (we keep it an intuitive metaphor, promise!). We investigate the BoW further by analyzing the importance of words within a bag, and propose an adaptation with a weighted BoW to account for general frequency of words in the English language. We summarize our conclusions from these experiments in a more comprehensive metaphor. In this way, our metaphor provides more insights into the workings of PPLM and, to some extent, into NLG controllability in general.
Third, we experiment with controlling language complexity of generated text while maintaining the topical content.</p>

<p>The code for reproducing our experiments can be found here: <a href="https://anonymous.4open.science/r/Control-Mammoth-70D6/">https://anonymous.4open.science/r/Control-Mammoth-70D6/</a>.</p>

<h2 id="-2-reproducibility"><a name="section2"></a> 2. Reproducibility</h2>
<!--- Importance of reproduciblity -->
<p>Verifying the results and findings of scientific publications can motivate the scientific community to adopt those findings faster and improve upon them. Together with the ever-growing complexity involved in training and evaluating models, this has led the Machine Learning community to put higher value on reproducing scientific outcomes, and has encouraged authors to publicly share the code and data used in their published research.</p>

<!--- The goal of this experiment -->
<p><strong>Experimental setup</strong>
In this experiment, we aim to reproduce some of the results that are presented in <a href="#Dathathri">[Dathathri et al., 2020]</a> using the <a href="https://colab.research.google.com/drive/1Ux0Z4-ruiVtJ6jUk98uk6FqfvGHCOYL3">Colab notebook</a> and the <a href="https://github.com/uber-research/PPLM">GitHub repository</a> provided by the authors.
More specifically, we prompt the language model with the prefix “The potato” (same as in the paper) and try to steer it towards a certain topic using a BoW. We use the “military” BoW provided by the authors, i.e., a set of words related to military, such as “war”, “bomb”, “attack”, etc.</p>

<p><strong>Results</strong>
We compare the results reported by the original authors with our own, obtained by executing the provided Colab notebook locally, first with unchanged hyperparameters and then after finetuning the hyperparameters following the authors’ recipe. This finetuning comprises an increase of the learning rate (step size $\alpha$) in order to generate more topic-related words and an increase of the KL-scale $\lambda_{kl}$ to produce more fluent examples. Results are shown below, with words from the BoW highlighted in red (click the example to expand full details).</p>

<p>Although two of the three texts generated by the authors contain words from the military BoW, those texts do not make much sense, probably because it is difficult for the language model to combine the potato prompt with a military context. When we run the provided code locally (without hyperparameter change), we get even poorer results: only one of the three generated texts contains a military word, and the context seems more related to cooking than to the military. After tuning the hyperparameters according to the authors’ recipe, topic relatedness increases, i.e., all texts contain words from the military BoW, but without a real connection to “potato.”</p>
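<p>The check behind statements such as “only one of the three generated texts contains a military word” can be sketched in a few lines. This is a minimal sketch and not the authors’ code: the helper <code class="language-plaintext highlighter-rouge">bow_hits</code> is hypothetical, the BoW is limited to the three words quoted above, and the two sample sentences are made up purely to exercise the function.</p>

```python
import re

# Three words quoted in this post from the authors' "military" BoW;
# the full list is longer and is not reproduced here.
MILITARY_BOW = {"war", "bomb", "attack"}


def bow_hits(text, bow):
    """Return the BoW words found in `text` (case-insensitive, whole words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok in bow]


# Made-up sentences, used only to exercise the function.
samples = [
    "The potato was used as a weapon in the war.",
    "The potato is best served with butter and salt.",
]
hit_counts = [len(bow_hits(s, MILITARY_BOW)) for s in samples]
on_topic = sum(1 for c in hit_counts if c != 0)
print(hit_counts, on_topic)
```

<p>Exact matching misses inflected forms such as “bombs”, so a count like this is only a rough indicator of topic relatedness.</p>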

<blockquote>
  <details>
<summary>Table 2.1: Results from Colab Notebook for prompt "The potato", BoW "military". Red indicates words in the BoW <small>(click to show full table)</small>
 </summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/2.png" />

</details>
</blockquote>

<p>We repeat the experiment with the authors’ code provided on GitHub, with similar results: finetuning the hyperparameters leads to more military-flavored texts, but the example from the paper is loaded with many more military words.</p>

<blockquote>
  <details>
<summary>Table 2.2: Results from GitHub repository for prompt "The potato", BoW "military". Red indicates words in the BoW <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/22.png" />

</details>
</blockquote>

<h3 id="hyperparameter-analysis">Hyperparameter analysis</h3>
<p>Beyond the reproduction, we explore the effect of hyperparameter configurations on the quality of generated text in terms of perplexity (<code class="language-plaintext highlighter-rouge">PPL</code>) under a language model <a href="#Radford2018">[Radford et al., 2018]</a> as a proxy for fluency and the number of distinct n-grams (<code class="language-plaintext highlighter-rouge">Dist</code>) as a measure of repetitiveness   <a href="#Li">[Li et al., 2016]</a>.</p>
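<p>As a rough illustration of the two metrics, the sketch below computes a distinct n-gram ratio and a perplexity from per-token log-probabilities. It rests on assumptions: the function names are ours, whitespace tokenization is used for simplicity, the distinct score is normalized by the total number of n-grams (one common choice; the exact normalization used in the experiments is not stated here), and the log-probabilities would in practice come from the scoring language model.</p>

```python
import math


def distinct_n(texts, n):
    """Ratio of unique n-grams to all n-grams over a list of texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization, an assumption
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A fully repetitive text scores low on distinctiveness.
print(distinct_n(["a a a a"], 1))
# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```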

<p><strong>Experimental setup</strong>
We focus on four hyperparameters (step size $\alpha$, KL-scale $\lambda_{kl}$, grad-length and GM-scale $\gamma_{gm}$) and study how the quality of the generated text, in terms of perplexity and distinctiveness, changes with various hyperparameter configurations. Our methodology is as follows:</p>
<ol>
  <li>We randomly select 400 different hyperparameter configurations, generated from the combination of the following values (numbers in square brackets indicate [start:end;interval]): step size $\alpha$  <code class="language-plaintext highlighter-rouge">[0:0.1;0.01]</code>, KL-scale $\lambda_{kl}$ <code class="language-plaintext highlighter-rouge">[0:0.1;0.01]</code>, grad-length <code class="language-plaintext highlighter-rouge">[0:20;2]</code>, GM-scale $\gamma_{gm}$ <code class="language-plaintext highlighter-rouge">[0:1;0.1]</code>. For each configuration, we perform steps 2 to 4.</li>
  <li>For PPLM text generation, we use the prompts and BoWs from the subsequent Section 3 and Section 4, resulting in a total of 31 prompts and BoW combinations.</li>
  <li>For each prompt+BoW combination, we generate 5 texts of length 20.</li>
  <li>We calculate perplexity and distinctiveness and average them over all generated samples for all prompt+BoW combinations.</li>
</ol>
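<p>Step 1 of the methodology above can be sketched as follows. This is a minimal sketch under stated assumptions: the end of each range is taken to be inclusive, the dictionary keys are illustrative names, and the seed is arbitrary; none of these details are confirmed by the text above.</p>

```python
import itertools
import random


def grid(start, stop, step):
    """Evenly spaced values from start to stop, both ends included (an assumption)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Ranges as listed in step 1; key names are illustrative.
SEARCH_SPACE = {
    "stepsize": grid(0, 0.1, 0.01),
    "kl_scale": grid(0, 0.1, 0.01),
    "grad_length": grid(0, 20, 2),
    "gm_scale": grid(0, 1, 0.1),
}

rng = random.Random(0)  # fixed seed, chosen here only for repeatability
all_configs = list(itertools.product(*SEARCH_SPACE.values()))
# Draw 400 distinct configurations without replacement.
sampled = [dict(zip(SEARCH_SPACE, values)) for values in rng.sample(all_configs, 400)]

print(len(all_configs), len(sampled))
print(sampled[0])
```

<p>Under these assumptions, each of the 400 sampled configurations would then be run on all 31 prompt+BoW combinations with 5 texts each, i.e., 155 generated texts per configuration.</p>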

<p><strong>Results</strong>
From the two parallel coordinate plots in Figure 2.1 and Figure 2.2 below, we observe the following:</p>
<ul>
  <li>For lower perplexity (more fluent text), higher values of all hyperparameters are better.</li>
  <li>For higher distinctiveness (more unique n-grams), lower GM-scale is better. This result is in line with the comments on the <a href="https://github.com/uber-research/PPLM">original PPLM GitHub page</a>, where the authors suggest decreasing the GM-scale to address repetitiveness in the generated text.</li>
  <li>From Figure 2.2 alone, we cannot derive much information about the other hyperparameters because the plot is too cluttered. However, after re-arranging the vertical axes and moving step size closer to distinctiveness, we observed the same pattern: lower step size yields higher distinctiveness. These extra plots can be found in this <a href="https://anonymous.4open.science/r/Control-Mammoth-70D6/hyperparameter-analysis.ipynb">notebook</a>.</li>
  <li>Thus, perplexity and distinctiveness are conflicting targets in terms of hyperparameter configuration.</li>
</ul>

<table><tr>
<td style="text-align: center; vertical-align: middle;"> Figure 2.1: parallel coordinate plot for perplexity.  </td>
<td style="text-align: center; vertical-align: middle;"> Figure 2.2: parallel coordinate plot for distinctiveness. </td>
</tr><tr>
<td style="text-align: center; vertical-align: middle;"> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_ppl_gmscale.png" alt="Drawing" style="width: 100%;" /> </td>
<td style="text-align: center; vertical-align: middle;"> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_dist_gmscale.png" alt="Drawing" style="width: 100%;" /> </td>
</tr></table>
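<p>To make the role of the GM-scale more concrete: as we read the original paper, the final sampling distribution is a normalized geometric mean of the perturbed and the unperturbed next-token distributions, with $\gamma_{gm}$ as the exponent of the perturbed one. The sketch below illustrates this fusion on a toy three-word vocabulary; the function name and all numbers are made up, and it is not taken from the official implementation.</p>

```python
def fuse(perturbed, unperturbed, gm_scale):
    """Normalized geometric mean of two next-token distributions."""
    mixed = [
        (p ** gm_scale) * (q ** (1.0 - gm_scale))
        for p, q in zip(perturbed, unperturbed)
    ]
    total = sum(mixed)
    return [m / total for m in mixed]


# Toy distributions over a three-word vocabulary (made-up numbers).
unperturbed = [0.6, 0.3, 0.1]
perturbed = [0.1, 0.2, 0.7]  # steering has pushed mass to the third word

for gm_scale in (0.0, 0.5, 1.0):
    print(gm_scale, [round(x, 3) for x in fuse(perturbed, unperturbed, gm_scale)])
```

<p>With a GM-scale of 0 the fused distribution equals the unperturbed one, and with a GM-scale of 1 it equals the perturbed one, so lowering the GM-scale keeps sampling closer to the unmodified LM.</p>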

<p>We conclude from the reproduction and hyperparameter analysis that steering language generation towards a certain topic is feasible but requires careful hyperparameter tuning. We cannot simply optimize hyperparameters for topical coherence, but need to account for perplexity and distinctiveness as well, which are already challenging to tune on their own. Even with a suitable hyperparameter configuration that balances fluency and repetitiveness, topical coherence between the original prompt and the desired topic is not guaranteed.</p>

<h2 id="-3-investigating-the-interplay-between-prompt-and-wordlist"><a name="section3"></a> 3. Investigating the interplay between prompt and wordlist</h2>

<h3 id="-31-using-pplm-to-generate-questions-and-text-for-more-specific-topics"><a name="section31"></a> 3.1 Using PPLM to generate questions and text for more specific topics</h3>
<p>The original publication <a href="#Dathathri">[Dathathri et al., 2020]</a> evaluated PPLM on general domains, both in the prompt (e.g., “potato”) and in the topics steered to, like military, science and politics. We want to explore the following: Is the mouse strong enough to steer the mammoth towards highly specific domain-related topics? What would work better for this kind of controllability: prompt or BoW? Therefore, in this subsection, we study the interplay between the prompt and the BoW setting in PPLM and its effect on the quality of the generated text on a domain-related topic.
<!--- Metaphorically, we would like to study how this interplay makes the mammoth walk between different trees on an island.-->
We perform the following text generation experiments using PPLM:</p>
<ol>
  <li>general questions</li>
  <li>machine-learning specific text</li>
  <li>machine-learning specific questions
<!-- Our motivation behind the chosen set of experiments is to check whether the model can generate Explainable AI (XAI) questions. This would help in automatic generation of questions which can be used to understand and evaluate the trustworthiness of a black-box model. -->
<!--- Metaphorically, we would like to steer the mammoth - i) to walk the path between question word trees, ii) in the machine-learning island and iii) to walk the path between question word trees in the machine-learning island.--></li>
</ol>

<h3 id="-a-generating-general-questions"><a name="section31a"></a> a. Generating general questions</h3>

<p>As questions typically start with interrogative words <code class="language-plaintext highlighter-rouge">['What','When','Why',...]</code>, we use the PPLM BoW mechanism and create a BoW, called ‘questions’ (q) using a list of <a href="https://en.wikipedia.org/wiki/Interrogative_word">English interrogative words</a> and a question mark (<code class="language-plaintext highlighter-rouge">?</code>).
<!--Metaphorically, we want to steer the mammoth to walk through the question word trees.--></p>

<p><strong>Experimental setup</strong>
As questions are generally short sentences, the text length hyperparameter of PPLM was set to 10 for both experiments to prompt generation of shorter sentences. This means PPLM will generate questions with exactly 10 tokens. We evaluate two different parameter settings for question generation:</p>
<ul>
  <li>[Q-BoW(q)] We don’t use any prompt for the text generation and set the BoW to our ‘questions’ word bank.</li>
  <li>[Q-prompt-BoW(?)] We set the prompt as ‘What’ for text generation and BoW as a single token <em>’?’</em>.</li>
</ul>

<p>We perform a qualitative evaluation by showing both good and (arguably) bad examples (if they could be generated) for each experiment. More examples can be found in the <a href="https://anonymous.4open.science/r/Control-Mammoth-70D6/Reproduce-ML-Question-Experiment.ipynb">notebook</a>.</p>

<p><strong>Results</strong>
We show the generated text in Table 3.1a with good and bad examples from each experiment.</p>
<ul>
  <li>Our results show that [Q-prompt-BoW(?)] can generate better questions compared to [Q-BoW(q)].</li>
  <li>None of the sentences generated by [Q-BoW(q)] were questions. So, we could only show bad examples for this experiment.</li>
  <li>[Q-prompt-BoW(?)] generates good questions. However, they don’t always end with a question mark.</li>
  <li>This points to the fact that PPLM is better at generating questions when prompted with interrogative words like <em>‘What’</em> compared to passing the interrogative words as BoW list.</li>
</ul>

<blockquote>
  <details>
<summary>Table 3.1a: results on generating general questions. Red indicates words/tokens in the BoW <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/31a.png" />

</details>
</blockquote>

<h3 id="-b-generating-text-about-machine-learning"><a name="section31b"></a> b. Generating text about machine learning</h3>

<p>In this step, we aim to generate text from a specific domain, namely machine learning. For this purpose, we use the PPLM-BoW mechanism, with BoW as ‘machine-learning’ (ml) created by picking 50 random words from a <a href="https://developers.google.com/machine-learning/glossary">machine learning glossary</a>, containing words like features, dataset, etc.
<!-- Metaphorically, we want to steer the mammoth in the machine learning island. --></p>

<p><strong>Experimental setup</strong>
Domain-specific text is usually longer, hence, we set the text length to 50 in this experiment.
We evaluate two different parameter settings for domain-specific text generation:</p>
<ul>
  <li>[ML-BoW(ml)] We don’t use any prompt for text generation and only use BoW as our <em>‘machine-learning’</em> word list.</li>
  <li>[ML-MLprompt] We set the prompt to a machine learning text (MLprompt): <em>‘The model is trained on the Iris dataset.’</em> and don’t pass any BoW list.</li>
</ul>

<p>We perform a qualitative evaluation by showing both good and bad examples (if they could be generated) for each experiment. More examples can be found in the <a href="https://anonymous.4open.science/r/Control-Mammoth-70D6/Reproduce-ML-Question-Experiment.ipynb">notebook</a>.</p>

<p><strong>Results</strong>
We show the generated text in Table 3.1b with good and bad examples from each experiment.</p>
<ul>
  <li>The results show that [ML-BoW(ml)] does not generate machine learning related text. However, it is interesting to observe that the text is related to some technical content.</li>
  <li>The text generated from [ML-MLprompt] is quite related to machine learning.</li>
  <li>This mirrors the observation of Experiment 3.1.a: prompts are better at generating topic-specific sentences than the BoW list.
    <blockquote>
      <details>
<summary>Table 3.1b: Results on generating text about machine learning. Red indicates words in the BoW <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/31b.png" />

</details>
    </blockquote>
  </li>
</ul>

<h3 id="c-generating-questions-about-machine-learning"><a name="section31c"></a>c. Generating questions about machine learning</h3>

<p>As a final step, we want to generate machine learning questions by combining interrogative and machine learning words.</p>

<p>For this purpose, we use the PPLM-BoW mechanism with our ‘machine-learning’ (ml) word list and vary the prompt.
<!-- Metaphorically, we want to steer the mammoth in the machine learning island in between the question word trees.--></p>

<p><strong>Experimental setup</strong>
As we expect domain-specific questions to be longer than general questions, but shorter than domain-specific text, we set the text length to 20.
We evaluate two different parameter settings for domain-specific question generation:</p>

<ul>
  <li>[MLQ-GENprompt-BoW(ml)] We set the prompt to an interrogative word (GENprompt), <em>‘How’</em> and the BoW to the <em>‘machine-learning’</em> word list.</li>
  <li>[MLQ-MLprompt-BoW(ml)] We set the prompt to a machine learning text, ending with an interrogative word (MLprompt), <em>‘The model was trained on Iris dataset. How’</em> and the BoW to the ‘machine-learning’ word list.</li>
</ul>

<p>We perform a qualitative evaluation by showing three good examples and three bad examples for each experiment. More examples can be found in the <a href="https://anonymous.4open.science/r/Control-Mammoth-70D6/Reproduce-ML-Question-Experiment.ipynb">notebook</a>.</p>
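<p>For reference, the six settings of Section 3.1 can be collected in one small configuration table. This is only a summary sketch: the <code class="language-plaintext highlighter-rouge">Setting</code> class is a hypothetical container and not part of the PPLM code base, while the prompts, BoW names and text lengths are copied from the setups described above.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Setting:
    """Hypothetical container for one experiment setting of Section 3.1."""
    name: str
    prompt: Optional[str]  # None means no prompt
    bow: Optional[str]     # name of the BoW, None means no BoW
    length: int            # number of generated tokens


SETTINGS = [
    Setting("Q-BoW(q)", None, "questions", 10),
    Setting("Q-prompt-BoW(?)", "What", "?", 10),
    Setting("ML-BoW(ml)", None, "machine-learning", 50),
    Setting("ML-MLprompt", "The model is trained on the Iris dataset.", None, 50),
    Setting("MLQ-GENprompt-BoW(ml)", "How", "machine-learning", 20),
    Setting("MLQ-MLprompt-BoW(ml)", "The model was trained on Iris dataset. How",
            "machine-learning", 20),
]

for s in SETTINGS:
    steering = [k for k, v in (("prompt", s.prompt), ("BoW", s.bow)) if v is not None]
    print(f"{s.name}: length {s.length}, steered by {' + '.join(steering)}")
```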

<p><strong>Results</strong>
We show the generated text in Table 3.1c  with good and bad examples from each experiment.</p>
<ul>
  <li>The results show that [MLQ-GENprompt-BoW(ml)] is good at generating general questions, but not at generating machine learning specific questions.</li>
  <li>The [MLQ-MLprompt-BoW(ml)] is quite good at generating machine learning specific questions.</li>
  <li>Thus, the observation here is in line with Experiments 3.1.a and 3.1.b. The machine learning related prompt containing an interrogative word is better at generating machine learning questions than the BoW <em>‘machine-learning’</em> list.</li>
</ul>

<blockquote>
  <details>
<summary>Table 3.1c: Results on generating questions about machine learning. Red indicates words in the BoW <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/31c.png" />

</details>
</blockquote>

<h3 id="32-weighted-bow-addressing-the-focus-of-the-pplm-to-common-english-words-in-the-bow"><a name="section32"></a>3.2. Weighted BoW: Addressing the focus of the PPLM to common English words in the BoW</h3>
<p>In experiment 3.1.b, the PPLM model tended to steer towards the words “feature” and “dataset”,
which are the most common words among the 50 machine learning words (according to the Corpus of Contemporary American English (<a href="https://www.wordfrequency.info">COCA</a>)).
A potential reason could be that PPLM is updated based on
$\log p(a|x) = \log (\sum_{i=0}^k p_{t+1}[w_i])$ (Equation 4 in the original paper), which is dependent on the likelihood of each word.
In the previous experiment, the word <em>“feature”</em> has the highest likelihood, so most texts are generated towards this word.
To verify this hypothesis, we set up another experiment with an empty prompt and the BoW as <code class="language-plaintext highlighter-rouge">[sport, classify, representation, instance]</code>.
We expect the word <em>“sport”</em> to be far more likely than the others, because it is a commonly used, general term.
Meanwhile, the other words are from the specific domain of machine learning and hence far less likely overall.</p>
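<p>The dominance of a frequent word in the objective above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions: the next-token probabilities are made-up numbers, not model outputs, and the helper names are ours.</p>

```python
import math


def bow_log_likelihood(next_token_probs, bow):
    """log p(a|x): log of the summed next-token probability of all BoW words."""
    return math.log(sum(next_token_probs[w] for w in bow))


def probability_share(next_token_probs, bow):
    """Fraction of the summed BoW probability contributed by each word."""
    total = sum(next_token_probs[w] for w in bow)
    return {w: next_token_probs[w] / total for w in bow}


# Made-up next-token probabilities for the four BoW words.
probs = {"sport": 0.008, "classify": 0.0005, "representation": 0.001, "instance": 0.0005}
bow = list(probs)

shares = probability_share(probs, bow)
print(round(bow_log_likelihood(probs, bow), 3))
print({w: round(s, 2) for w, s in shares.items()})
```

<p>In this toy setting, “sport” accounts for most of the summed probability, so raising its probability further is the easiest way to increase the objective.</p>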

<blockquote>
  <details>
<summary>Table 3.2: Weight modification results. Red indicates words in the BoW <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/32.png" />

</details>
</blockquote>

<p>As expected, the results show that most of the generated text is related to “sport”.
If a word in the BoW is much more common than the others, the model may ignore the less common words and be steered towards the wrong topic.
To cope with this problem, we modify the weight of each word probability.
Specifically, we add a weight parameter called $v_i$ to the above equation and get:</p>

\[\begin{equation}
\log p(a|x) = \log (\sum_{i=0}^k v_ip_{t+1}[w_i]).
\end{equation}\]
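<p>The effect of the weights $v_i$ can be sketched with the same kind of toy numbers. Everything here is an assumption made for illustration: the probabilities are made up, and the inverse-probability weights are just one possible up-weighting, not the weights used in our experiments.</p>

```python
import math


def weighted_bow_log_likelihood(next_token_probs, weights):
    """Log of the weighted sum of BoW word probabilities."""
    return math.log(sum(weights[w] * next_token_probs[w] for w in weights))


def weighted_share(next_token_probs, weights):
    """Fraction of the weighted sum contributed by each BoW word."""
    terms = {w: weights[w] * next_token_probs[w] for w in weights}
    total = sum(terms.values())
    return {w: t / total for w, t in terms.items()}


# Made-up next-token probabilities and hypothetical weights v_i.
probs = {"sport": 0.008, "classify": 0.0005, "representation": 0.001, "instance": 0.0005}
uniform = {w: 1.0 for w in probs}                 # recovers the unweighted objective
inverse = {w: 1.0 / p for w, p in probs.items()}  # one possible up-weighting

before = weighted_share(probs, uniform)
after = weighted_share(probs, inverse)
print({w: round(s, 2) for w, s in before.items()})
print({w: round(s, 2) for w, s in after.items()})
```

<p>With uniform weights the common word dominates the sum; with the inverse weights all four words contribute equally, so the rarer words are no longer drowned out.</p>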

<p>After up-weighting less common words with this modification, we run the experiment again; good and bad results are shown in Table 3.2.</p>

<p>The results show that we can steer PPLM towards less common words by re-weighting the likelihood of each word.
The samples in Table 3.2 after up-weighting are closer to the technical domain than to sports, which indicates the effectiveness of controlling the likelihood weights.</p>

<p>This is just a naive approach but shows the potential of likelihood regularization.
Another possible option would be to make the distribution over words in the BoW uniform.
Even though we can control the distribution of words in the BoW, the generation of the desired topic still depends strongly on the choice of words in the BoW.
In the scope of this blog post, we only want to raise awareness for this detail and suggest some methods to go further.</p>

<p>In conclusion, carefully picking the words in the BoW is important to control the PPLM, and modifying the weight can help to control the PPLM to generate texts related to the desired topic.</p>

<h3 id="33-concluding-with-the-mammoth-metaphor">3.3 Concluding with the mammoth metaphor</h3>
<p>The PPLM authors compared the working of PPLM with a mouse steering a mammoth along a certain path. However, our experiments show that controlling the language model for a very specific topic (e.g., machine learning) is challenging for PPLM when only using a BoW. Using a prompt, which is part of the LM rather than PP, is more effective. Hence, we introduce the analogy of the prompt with aliens in UFOs beaming the mammoth to a particular position in the World of Language. Without a prompt, the mammoth would start at a random position in this world, depending on the unknown, internal state of the LM. The Continent of General Knowledge, as visualized in Figure 3.1, is the largest in the World of Language, making it likely that the mammoth ends up there. From its starting position, the mammoth will lumber along a path of trees, as shown in Figure 3.2 (bottom path). A standard LM might take the shortest path through the forest, passing trees representing general or specific terms in a fairly random sequence with the goal of generating coherent text. PPLM with a BoW acts as the mouse steering the mammoth to preferred trees (yellow, top path in Figure 3.2). The challenge is that there exist words which are topic-related but are also used in different contexts, such as the word “feature” in experiment 3.1.b.
In such cases, BoW might be insufficient to generate good text and we need a specific prompt to change the starting position of the mammoth from the Continent of General Knowledge to a specific topic-related island. In case of a specific prompt, such as <em>“The model was trained on the Iris dataset,”</em> the mammoth is beamed to the Machine Learning Island, making it easier for the mouse to steer the mammoth to topic-relevant trees since there are more topic-related trees (yellow) rather than trees that also have another general meaning for the same word (mixed color). We can therefore conclude that beaming the mammoth to the right starting position is crucial for improved controlled generation of text.</p>

<p>In Section 3.2, we address the issue of words having multiple meanings in more detail. Some trees occur more often than others; these are represented by the multi-colored trees. Those trees are quite attractive for the mouse, and it tends to focus on them since they “have some yellow in them.” The weighted BoW approach addresses the problem by making those trees less attractive, by expressing explicit preference and adjusting the weights of words in a bag. In this way, the mouse can still steer towards the “sports” tree, but is more enthusiastic about the “classify” and “instance” trees, as visualized in Figure 3.3.</p>

<p>In summary, the combination of beaming the mammoth to a suitable starting position (prompt) and adjusting the preferences of the mouse (weighted BoW) leads to a higher topic-relevance of the generated text.</p>

<p style="text-align: center;">Figure 3.1: The prompt beams the mammoth to a starting position.</p>

<p><img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoth-beaming.jpg" alt="Beaming the mammoth to a relevant starting area?" style="width:75%;display:block;margin-left:auto;margin-right:auto;" />
<!--![Beaming the mammoth to a relevant starting area?](https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoth-beaming.jpg)--></p>

<p style="text-align: center;">Figure 3.2: A standard language model might take the shortest path (bottom), whereas PPLM acts as a mouse steering the mammoth (top).</p>

<table><tr>
<td> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoth-LM-vs-PPLM.jpg" alt="Drawing" style="width: 100%;" /> </td>
<td> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoths-Tree-Legend.jpg" alt="Drawing" style="width: 100%;" /> </td>
</tr></table>


<p style="text-align: center;">Figure 3.3: Giving higher weights to topic-relevant trees that might occur less often than trees having multiple meanings.</p>

<table><tr>
<td> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoth-BoW-vs-weighted-BoW.jpg" alt="Drawing" style="width: 100%;" /> </td>
<td> <img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoths-Tree-Legend.jpg" alt="Drawing" style="width: 100%;" /> </td>
</tr></table>


<h2 id="-4-controlling-text-complexity-with-pplm"><a name="section4"></a> 4. Controlling text complexity with PPLM</h2>

<p>Analogously to controlling the topic of generated text, it may also be desirable to control <em>stylistic aspects</em>. Intuitively, there is more than one way to express a given topic (e.g., formal, informal, polite, knowledgeable, etc.) and the appropriate formulation depends on the context <a href="#Ficler">[Ficler and Goldberg, 2017]</a>. This has important applications in, for example, conversational agents where a system may choose to use a different tone depending on the needs of the dialog partner <a href="#Smith">[Smith et al., 2020]</a>.</p>

<p>In this part we explore a simple idea: can we use PPLM to control the <em>complexity</em> of generated language? As a proof-of-concept, we use the PPLM BoW objective with a list of the most frequent English words. Intuitively, these words are likely to be known by the vast majority of English speakers. If the PPLM objective consistently favors common English words and stays on topic, the generated texts should be easier to read. While the original PPLM publication only explored controlling <em>topical aspects</em>, this experiment evaluates whether the control mechanism generalizes to <em>stylistic aspects</em> of text and can therefore be seen as a test for the generalizability of the approach.</p>

<!-- Note: should we mention related work on constrained decoding and non-plug-and-play approaches here? -->

<h4 id="-experimental-setup"><a name="section4Ex"></a> Experimental setup</h4>

<p>Our experimental setup is as follows: we use the most common 1K/2K/5K English words in the Corpus of Contemporary American English (<a href="https://www.wordfrequency.info">COCA</a>) as the PPLM BoW objective and compare against unguided generation as a baseline. We generate texts both for everyday topics in the prompt (e.g., <code class="language-plaintext highlighter-rouge">The train</code>, <code class="language-plaintext highlighter-rouge">The football</code>) and complex subjects (e.g., <code class="language-plaintext highlighter-rouge">A radiograph is</code>, <code class="language-plaintext highlighter-rouge">Convex optimization is</code>) to see how the vocabulary use differs across those contexts. We define a list of 25 prompts, which are given in the listing below. For each BoW and prompt, we generate 5 samples of up to 100 tokens each and select the sample with the lowest PPLM loss for further evaluation.</p>
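
<p>The selection procedure described above can be sketched as a small loop. This is only an illustration of the best-of-5 selection: <code class="language-plaintext highlighter-rouge">run_experiment</code>, <code class="language-plaintext highlighter-rouge">make_fake_generator</code> and the loss values are hypothetical stand-ins, and the short word list is a placeholder rather than the COCA list.</p>

```python
def run_experiment(prompts, bows, generate, num_samples=5, max_tokens=100):
    """Keep the lowest-loss sample for every (BoW, prompt) pair.

    generate(prompt, bow, max_tokens) is a hypothetical callable returning
    a dict with at least the keys "text" and "loss".
    """
    best = {}
    for bow_name, bow in bows.items():
        for prompt in prompts:
            samples = [generate(prompt, bow, max_tokens) for _ in range(num_samples)]
            best[(bow_name, prompt)] = min(samples, key=lambda s: s["loss"])
    return best


def make_fake_generator():
    """Stand-in for PPLM so the sketch runs without a model (losses are made up)."""
    losses = iter([3.0, 1.5, 2.0, 4.0, 2.5] * 100)

    def generate(prompt, bow, max_tokens):
        return {"text": prompt + " ...", "loss": next(losses)}

    return generate


ranked_words = ["the", "be", "and", "of", "a", "in"]      # placeholder, not the COCA list
bows = {"1K": ranked_words[:3], "2K": ranked_words[:6]}   # real runs use the top 1K/2K/5K
best = run_experiment(["The train", "The football"], bows, make_fake_generator())
print(best[("1K", "The train")]["loss"])
```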

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PROMPTS</span> <span class="o">=</span> <span class="p">[</span>
  <span class="s">"The steam engine is"</span><span class="p">,</span> <span class="s">"The ozone layer is"</span><span class="p">,</span> <span class="s">"A fracture is"</span><span class="p">,</span> <span class="s">"Vitamine D is"</span><span class="p">,</span> <span class="s">"Electricity is"</span><span class="p">,</span>
  <span class="s">"Machine learning is"</span><span class="p">,</span> <span class="s">"Convex optimization is"</span><span class="p">,</span> <span class="s">"A car is"</span><span class="p">,</span> <span class="s">"Gravity is"</span><span class="p">,</span> <span class="s">"Rain is"</span><span class="p">,</span>
  <span class="s">"A radiograph is"</span><span class="p">,</span> <span class="s">"A pulmonary edema is"</span><span class="p">,</span> <span class="s">"A rope is"</span><span class="p">,</span> <span class="s">"The potato"</span><span class="p">,</span> <span class="s">"The football"</span><span class="p">,</span>
  <span class="s">"The chicken"</span><span class="p">,</span> <span class="s">"The horse"</span><span class="p">,</span> <span class="s">"The pizza"</span><span class="p">,</span> <span class="s">"The lake"</span><span class="p">,</span> <span class="s">"The house"</span><span class="p">,</span> <span class="s">"The train"</span><span class="p">,</span> <span class="s">"The plain"</span><span class="p">,</span>
  <span class="s">"The tunnel"</span><span class="p">,</span> <span class="s">"The mountains"</span><span class="p">,</span> <span class="s">"The French country"</span>
<span class="p">]</span>
</code></pre></div></div>

<p style="text-align: center;">Listing 4.1: prompts used in the style experiments.</p>

<p>To objectively compare the generated texts, we employ established NLG metrics. Following <a href="#Dathathri">[Dathathri et al., 2020]</a>, we measure perplexity (<code class="language-plaintext highlighter-rouge">PPL</code>) under a language model <a href="#Radford2018">[Radford et al., 2018]</a> as a proxy for fluency and the number of distinct n-grams (<code class="language-plaintext highlighter-rouge">Dist</code>) as a measure of repetitiveness   <a href="#Li">[Li et al., 2016]</a>. <!-- We average the number of distinct 1-/2-/3-grams. --> In addition, we consider several metrics that are relevant to the task of generating simple language. These include surface-level statistics such as the number of generated words (<code class="language-plaintext highlighter-rouge">Words</code>) and the percentage of generated words that are present in the simple BoW used as PPLM objective. We refer to the latter as “Simple Word Precision” (e.g., <code class="language-plaintext highlighter-rouge">Prec. 2K EN</code>). Finally, we calculate unsupervised readability metrics such as the Flesch Reading Ease and Gunning-Fog index as an indication of the difficulty of generated language. For a recent review of readability measures, we refer the reader to <a href="#Martinc">[Martinc et al., 2021]</a>.</p>
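
<p>As an illustration of the two word-level metrics, the following sketch computes a distinct n-gram ratio and the Simple Word Precision for a single text. It assumes a naive lowercase regex tokenizer and normalizes <code class="language-plaintext highlighter-rouge">Dist</code> by the number of n-grams, which is a common variant (Li et al. scale by the number of generated tokens). The function names and the tiny word list are hypothetical.</p>

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase alphabetic words (an assumption of this sketch)."""
    return re.findall(r"[a-z']+", text.lower())

def distinct_n(tokens, n):
    """Ratio of distinct n-grams to all n-grams in the token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def simple_word_precision(tokens, bow):
    """Fraction of generated words that appear in the simple-word BoW."""
    bow = set(bow)
    return sum(t in bow for t in tokens) / len(tokens) if tokens else 0.0

text = "The steam engine is the main source of power for the steam engine"
simple_words = ["the", "is", "of", "for", "main", "power"]  # toy stand-in for the 1K list

tokens = tokenize(text)
print("Dist-1:", round(distinct_n(tokens, 1), 2))
print("Dist-2:", round(distinct_n(tokens, 2), 2))
print("Simple Word Precision:", round(simple_word_precision(tokens, simple_words), 2))
```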

<h4 id="-setting-the-hyperparameters-for-generating-simple-text"><a name="section4Se"></a> Setting the hyperparameters for generating simple text</h4>

<p>As discussed above, PPLM is sensitive to the choice of hyperparameters; in particular, fluency and non-repetitiveness are conflicting targets. In the absence of a good strategy to balance between these, we adjusted the step size $\alpha$, KL-scale $\lambda_{kl}$ and GM-scale $\gamma_{gm}$ based on one prompt and manually verified that the generated texts are both fluent and show good usage of the words in the BoW.</p>

<!--
final choice:
$\alpha = 0.04$
$\lambda_{kl} = 0.1$
$\gamma_{gm} = 0.65$
-->

<p>One interesting problem is the interaction between step size and the chosen BoW. As we increased the step size, more words of the BoW were chosen by PPLM (see Example 4.1). Because the BoW only includes lowercase word forms, the generated text contained lowercase words even at the beginning of a sentence. One might try to fix this problem by adding the capitalized forms to the BoW. However, that might result in the opposite problem, where capitalized words are generated mid-sentence.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: How does the steam engine work?

=== PPLM generated text with small stepsize (alpha = 0.08) ===
This question is often asked in a variety of ways - some folks want to know how the steam engine
works, some folks want to know how much electricity is used to create the steam, etc. The truth is,
we don't know, but we do know that it is not that simple and is a lot of work. The steam engine is
very complex machine and is very difficult to explain and explain in a simple manner. The only way
is through science - and that is not how it [...]

=== PPLM generated text with large stepsize (alpha = 0.2) ===
Generation: the steam is the main source of power for the steam engine and all other parts of the
ship. and it is a huge part of the overall weight of the ship. and the amount of power is not the
only reason for the weight and [...]
</code></pre></div></div>

<p style="text-align: center;">Example 4.1: interaction between step size $\alpha$ and the BoW of most common English words. When the step size gets too large, only words of the BoW are chosen, which leads to sentences that start with lowercase word forms.</p>

<h4 id="-quantitative-evaluation-of-generated-samples"><a name="section4Qa"></a> Quantitative evaluation of generated samples</h4>

<p>To answer the question of whether the BoW mechanism is suitable for controlling text complexity, we first look at the automated analysis of the generated samples (see Table 4.1 and Figure 4.1). We make the following observations:</p>

<ul>
  <li><strong>BoW samples have comparable fluency (<code class="language-plaintext highlighter-rouge">PPL</code>) and a consistent diversity (<code class="language-plaintext highlighter-rouge">Dist</code>).</strong> Relative to the unguided generation, the samples generated with PPLM still appear fluent and show a high diversity.</li>
  <li><strong>All samples drawn with PPLM include a significantly higher proportion of words from the word list.</strong> This confirms the effectiveness of PPLM as a mechanism to steer the generation of text. Across PPLM objectives (BoW of length 1K/2K/5K) we do not see a difference in how many words are included from the respective word lists.</li>
  <li><strong>Samples generated with PPLM are on average longer than unguided samples in the number of words.</strong> This is because of the underlying Byte-Pair Encoding (BPE). During generation we set the maximum sample length to 100 <em>tokens</em>. However, since a word may consist of multiple tokens in BPE, the number of words is <em>at most</em> the number of tokens. Since PPLM samples consist of more words, we can infer that more common words are sampled (i.e., words which do <em>not</em> have a subword segmentation).</li>
  <li><strong>We do not see a significant reduction in reading complexity when considering traditional readability measures.</strong> The results across the four readability metrics are inconclusive. While we observe increased readability in the Flesch Reading Ease and Coleman-Liau index, readability according to Gunning-Fog Index and ARI decreased.</li>
</ul>
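
<p>For reference, the standard definitions of the four readability scores are given below. They are background rather than a description of our implementation: the exact tooling (and its sentence and syllable counting) is not specified here, so small differences from the reported values are to be expected. We write $W$ for the number of words, $S$ for the number of sentences, $Y$ for the number of syllables, $C$ for the number of characters (letters and digits), and $W_{c}$ for the number of “complex” words with three or more syllables. In the Coleman-Liau index, $L$ is the average number of letters per 100 words and $S_{100}$ the average number of sentences per 100 words.</p>

<p>$$\text{Flesch Reading Ease} = 206.835 - 1.015 \, \frac{W}{S} - 84.6 \, \frac{Y}{W}$$</p>

<p>$$\text{Gunning-Fog} = 0.4 \left( \frac{W}{S} + 100 \, \frac{W_{c}}{W} \right)$$</p>

<p>$$\text{ARI} = 4.71 \, \frac{C}{W} + 0.5 \, \frac{W}{S} - 21.43$$</p>

<p>$$\text{Coleman-Liau} = 0.0588 \, L - 0.296 \, S_{100} - 15.8$$</p>

<p>A higher Flesch Reading Ease means easier text, whereas the other three scores approximate a school grade level, so lower values mean easier text. This matches the arrows in the header of Table 4.1.</p>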

<p style="text-align: center;">Table 4.1: quantitative evaluation of generated texts for text complexity experiment. Metrics are averaged over 25 samples for each PPLM objective.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">PPLM objective</th>
      <th style="text-align: right">PPL</th>
      <th style="text-align: right">Dist.</th>
      <th style="text-align: right">Words</th>
      <th style="text-align: right">1K Prec.</th>
      <th style="text-align: right">2K Prec.</th>
      <th style="text-align: right">5K Prec.</th>
      <th style="text-align: right">Flesch (↑)</th>
      <th style="text-align: right">Gunning-Fog (↓)</th>
      <th style="text-align: right">ARI (↓)</th>
      <th style="text-align: right">Coleman-Liau (↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Unguided</td>
      <td style="text-align: right">33.2</td>
      <td style="text-align: right">0.82</td>
      <td style="text-align: right">81</td>
      <td style="text-align: right">0.61</td>
      <td style="text-align: right">0.68</td>
      <td style="text-align: right">0.76</td>
      <td style="text-align: right">69</td>
      <td style="text-align: right">10.6</td>
      <td style="text-align: right">9.8</td>
      <td style="text-align: right">7.6</td>
    </tr>
    <tr>
      <td style="text-align: left">BoW English 1K</td>
      <td style="text-align: right">23.8</td>
      <td style="text-align: right">0.84</td>
      <td style="text-align: right">91.7</td>
      <td style="text-align: right">0.79</td>
      <td style="text-align: right">0.84</td>
      <td style="text-align: right">0.88</td>
      <td style="text-align: right">70.8</td>
      <td style="text-align: right">11.7</td>
      <td style="text-align: right">10.3</td>
      <td style="text-align: right">6.1</td>
    </tr>
    <tr>
      <td style="text-align: left">BoW English 2K</td>
      <td style="text-align: right">29.6</td>
      <td style="text-align: right">0.84</td>
      <td style="text-align: right">90.2</td>
      <td style="text-align: right">0.76</td>
      <td style="text-align: right">0.84</td>
      <td style="text-align: right">0.88</td>
      <td style="text-align: right">70.6</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">10.4</td>
      <td style="text-align: right">6.4</td>
    </tr>
    <tr>
      <td style="text-align: left">BoW English 5K</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">0.84</td>
      <td style="text-align: right">93.2</td>
      <td style="text-align: right">0.76</td>
      <td style="text-align: right">0.82</td>
      <td style="text-align: right">0.89</td>
      <td style="text-align: right">70.1</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">10.8</td>
      <td style="text-align: right">6.7</td>
    </tr>
  </tbody>
</table>

<p style="text-align: center;">Figure 4.1: distribution of values for selected text complexity metrics of Table 4.1.</p>

<p><img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/text-complexity-metrics.svg" alt="readability metrics density" /></p>

<h4 id="-qualitative-evaluation-of-generated-samples"><a name="section4Qua"></a> Qualitative evaluation of generated samples</h4>

<p>To get a better understanding of how the BoW objective influences text readability, we next turn to a few examples. We present both samples with high simple word precision (see Table 4.2) and samples with low simple word precision (see Table 4.3). For prompts on complex subjects (e.g., <code class="language-plaintext highlighter-rouge">A pulmonary edema is</code>, <code class="language-plaintext highlighter-rouge">Gravity is</code>) we see that the PPLM samples contain substantially fewer “technical” terms compared with the unguided sample. For many of the prompts on everyday subjects (e.g., <code class="language-plaintext highlighter-rouge">A car is</code>, <code class="language-plaintext highlighter-rouge">A rope is</code>), we cannot observe the same trend.</p>

<blockquote>
  <details>
<summary style="text-align: center;">Table 4.2: examples with a high simple-word precision: the PPLM guidance leads to qualitatively less complex texts. Red indicates words that are NOT in the bag of simple words. <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/text-complexity-examples-good.png" />

</details>
</blockquote>

<blockquote>
  <details>
<summary style="text-align: center;">Table 4.3: examples with a low simple-word precision: the PPLM guidance does not substantially reduce text complexity. Red indicates words that are NOT in the bag of simple words. <small> (click to show full table)</small></summary>

<img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/text-complexity-examples-bad.png" />

</details>
</blockquote>

<p><br />
<strong>In summary</strong>, we observe the PPLM model picking up on terms from the provided BoW, but we do not observe a conclusive influence on language complexity with the BoW approach. Speaking in terms of the mammoth metaphor, the mammoth can be beamed to a particular position in the world of language via the prompt and steered towards particular trees by the mouse on top of its head (PPLM with BoW). How fast the mammoth moves from tree to tree remains uncontrolled.</p>

<h2 id="-tldr-summary"><a name="summary"></a> TL;DR Summary</h2>
<p>To control the text generation of a pre-trained language model, the Plug and Play Language Model (PPLM) acts as a mouse that steers a lumbering mammoth from tree to tree. The mouse receives a Bag of Words (BoW) that represents the relevant trees, and we suggest using a weighted BoW to give higher preference to rare but relevant words. We also experimented with steering the mammoth towards simpler words in order to reduce text complexity while maintaining topic relevance. Our reproducibility experiment and hyperparameter analysis show that it is challenging to make the mammoth move in the proper way. Most influential for topic controllability is the prompt, which is part of the language model rather than PPLM. We compared the prompt to aliens in UFOs beaming the mammoth to a particular position in the World of Language. Without a prompt, the mammoth will start at a random position and is likely to end up in the Continent of General Knowledge. With a topic-related prompt, the mammoth will already start in a relevant area, such as the Machine Learning Island. Hence, beaming the mammoth makes it easier for the mouse to steer the mammoth to the right trees.</p>

<p align="center"> <b>How fast the mammoth moves, and whether it walks, dances or rolls from tree to tree, remains up to its own will.</b></p>

<p><img src="https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/Mammoth-dancing.jpg" alt="Dancing mammoth" style="width:30%;display:block;margin-left:auto;margin-right:auto;" /></p>

<!--
## <a name="section4Ex"></a> 5. Analyzing the effect of hyperparamater configuration on the generated text

From our reproducibility experiment, we observed that it was hard to achieve reproducibility using the hyperparameter configuration mentioned in the original PPLM paper. On tuning the hyperparameter values, specifically, step size, kl scale, grad length and gm scale, it reduced repetitive words and resulted in better quality text. We had the same observation in the experiments from Section 3 and 4. Therefore, we decided to explore the effect of hyperparameter configuration on the quality of the generated text.

**Experimental setup**

In our experiment, we focus on 4 hyperparameters - stepsize, kl scale, grad length and gm scale. We study how the quality of the generated text (in terms of perplexity and distinctiveness) changes with various hyperparameter configuration. Our methodology is as follows:
1. We randomly select 400 different hyperparameter configuration generated from the combination of the following values - stepsize having values in the range 0 to 0.1 with an interval of 0.01; kl scale values in the range of 0 to 0.1 with an interval of 0.01; grad length values in the range 0 to 20 with an interval of 2 and gm scale values in the range 0 to 1 with an interval of 0.1. For each hyperparameter configuration, we perform the steps 2 to 4.
2. We used the prompts and BoWs used in Section 3 and 4 resulting in a total of 31 prompts and BoW combination, to generate text using PPLM.
3. From each prompt+BoW combination, we generated 5 texts of length 20. For each generated text, we calculate the perplexity and distinctiveness as introduced in Section 4.
4. Then, we average the perplexity and distinctiveness over all the generated samples for all the prompt+BoW combination
5. We show the influence of the 4 hyperparameter on perplexity and distinctiveness using a parallel coordinate plot.

**Results**

We show two parallel coordinate plots - one for perplexity (ppl) and the other for distinctiveness (dist). We observe the following things:
- For lower perplexity (more fluent text), higher gm scale, higher grad length, higher kl scale and higher stepsize is better.
- For higher distinctiveness (more unique n-grams), lower gm scale is better. Not much about the other hyperparameters could be extracted from a single plot due to over-cluttering of the lines. So, we rearranged the vertical hyperparameter lines to move the stepsize closer to the perplexity vertical line and shpwed another parallel coordinate plot (second plot). We can observe that lower stepsize is better for higher distinctiveness.
- Thus, there is a contradiction between the hyperparameter values for achieving better perplexity and better distinctiveness. We did not perform any experiment to conclude the range of values that would be good for both the evaluation metrics.
- Our results from the distinctiveness plot also align with the comments in the [original PPLM github page](https://github.com/uber-research/PPLM), where the authors suggest to decrease the gm scale to address repetitiveness in the generated text.
- In conclusion, it can be seen that generated text from PPLM is quite dependent on the correct selection of the hyperparameter values.

![Perplexity image](https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_ppl_gmscale.png)
![Perplexity image](https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_ppl_stepsize.png)

![Distinctiveness image](https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_dist_gmscale.png)
![Distinctiveness image](https://iclr.iro.umontreal.ca/ef07ed73-c824-4bd8-ab24-aaaef3d8db16_1642173301/public/images/2021-12-01-PPLM/parallel_coordinate_plot_dist_stepsize.png)


## 6. Summary
-->

<h3 id="references">References</h3>

<p><a name="Alshomary" href="https://aclanthology.org/2021.eacl-main.17/">Alshomary, M., Chen, W.-F., Gurcke, T., &amp; Wachsmuth, H. (2021). Belief-based Generation of Argumentative Claims. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 224–233. </a></p>

<p><a name="Brown" href="https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html">Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … Amodei, D. (2020). Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 1877–1901. </a></p>

<p><a name="Dathathri" href="https://openreview.net/forum?id=H1edEyBKDS">Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., &amp; Liu, R. (2020). Plug and Play Language Models: A Simple Approach to Controlled Text Generation. International Conference on Learning Representations, 1–18.</a></p>

<p><a name="Ficler" href="https://doi.org/10.18653/v1/W17-4912">Ficler, J., &amp; Goldberg, Y. (2017). Controlling Linguistic Style Aspects in Neural Language Generation. Proceedings of the Workshop on Stylistic Variation, 94–104.</a></p>

<p><a name="Li" href="https://doi.org/10.18653/v1/N16-1014">Li, J., Galley, M., Brockett, C., Gao, J., &amp; Dolan, B. (2016). A Diversity-Promoting Objective Function for Neural Conversation Models. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 110–119.</a></p>

<p><a name="Madaan" href="https://ojs.aaai.org/index.php/AAAI/article/view/17594">Madaan, N., Padhi, I., Panwar, N., &amp; Saha, D. (2021). Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13516–13524.</a></p>

<p><a name="Martinc" href="https://doi.org/10.1162/coli_a_00398"> Martinc, M., Pollak, S., &amp; Robnik-Šikonja, M. (2021). Supervised and Unsupervised Neural Approaches to Text Readability. Computational Linguistics, 47(1), 141–179.</a></p>

<p><a name="Radford2018" href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">Radford, A., Narasimhan, K., Salimans, T., &amp; Sutskever, I. (2018). Improving language understanding by generative pre-training.</a></p>

<p><a name="Radford2019" href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., &amp; Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.</a></p>

<p><a name="Shu" href="https://ojs.aaai.org/index.php/AAAI/article/view/17629">Shu, K., Li, Y., Ding, K., &amp; Liu, H. (2021). Fact-Enhanced Synthetic News Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13825–13833.</a></p>

<p><a name="Smith" href="https://arxiv.org/abs/2009.10855">Smith, E. M., Gonzalez-Rico, D., Dinan, E., &amp; Boureau, Y-L. (2020). Controlling Style in Generated Dialogue.</a></p>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(/\s+/g, "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>






      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
