<html><head>
    <title>The Bitter Lesson</title>
    <style type="text/css">
  <!--
  .style1 {font-family: Palatino}
  -->
    </style>
  </head>
  <body>
  <span class="style1">
  <h1>The Bitter Lesson<br>
  </h1>
  <h2>Rich Sutton</h2>
  <h3>March 13, 2019<br>
  </h3>
  The biggest lesson that can be read from 70 years of AI research is
  that general methods that leverage computation are ultimately the most
  effective, and by a large margin. The ultimate reason for this is
  Moore's law, or rather its generalization: the continued, exponentially
  falling cost per unit of computation. Most AI research has been
  conducted as if the computation available to the agent were constant
  (in which case leveraging human knowledge would be one of the only ways
  to improve performance) but, over a slightly longer time than a typical
  research project, massively more computation inevitably becomes
  available. Seeking an improvement that makes a difference in the
  shorter term, researchers seek to leverage their human knowledge of the
  domain, but the only thing that matters in the long run is the
  leveraging of computation. These two need not run counter to each
  other, but in practice they tend to. Time spent on one is time not
  spent on the other. There are psychological commitments to investment
  in one approach or the other. And the human-knowledge approach tends to
  complicate methods in ways that make them less suited to taking
  advantage of general methods leveraging computation. There were
  many examples of AI researchers' belated learning of this bitter
  lesson, and it is instructive to review some of the most prominent.<br>
  <br>
  In computer chess, the methods that defeated the world champion,
  Kasparov, in 1997, were based on massive, deep search. At the time,
  this was looked upon with dismay by the majority of computer-chess
  researchers who had pursued methods that leveraged human understanding
  of the special structure of chess. When a simpler, search-based
  approach with special hardware and software proved vastly more
  effective, these human-knowledge-based chess researchers were not good
  losers. They said that "brute force" search may have won this time,
  but it was not a general strategy, and anyway it was not how people
  played chess. These researchers wanted methods based on human input to
  win and were disappointed when they did not.<br>
  <br>
  A similar pattern of research progress was seen in computer Go, only
  delayed by a further 20 years. Enormous initial efforts went into
  avoiding search by taking advantage of human knowledge, or of the
  special features of the game, but all those efforts proved irrelevant,
  or worse, once search was applied effectively at scale. Also important
  was the use of learning by self-play to learn a value function (as it
  was in many other games and even in chess, although learning did not
  play a big role in the 1997 program that first beat a world champion).
  Learning by self-play, and learning in general, is like search in that
  it enables massive computation to be brought to bear. Search and
  learning are the two most important classes of techniques for utilizing
  massive amounts of computation in AI research. In computer Go, as in
  computer chess, researchers' initial effort was directed towards
  utilizing human understanding (so that less search was needed), and only
  much later was far greater success achieved by embracing search and
  learning.<br>
  <br>
  In speech recognition, there was an early competition, sponsored by
  DARPA, in the 1970s. Entrants included a host of special methods that
  took advantage of human knowledge: knowledge of words, of phonemes, of the
  human vocal tract, etc. On the other side were newer methods that were
  more statistical in nature and did much more computation, based on
  hidden Markov models (HMMs). Again, the statistical methods won out
  over the human-knowledge-based methods. This led to a major change in
  all of natural language processing, gradually over decades, where
  statistics and computation came to dominate the field. The rise
  of deep learning in speech recognition is the most recent step in this
  consistent direction. Deep learning methods rely even less on human
  knowledge, and use even more computation, together with learning on
  huge training sets, to produce dramatically better speech recognition
  systems. As in the games, researchers always tried to make systems that
  worked the way the researchers thought their own minds worked (they
  tried to put that knowledge in their systems), but it proved ultimately
  counterproductive, and a colossal waste of researchers' time, when,
  through Moore's law, massive computation became available and a means
  was found to put it to good use.<br>
  <br>
  In computer vision, there has been a similar pattern. Early methods
  conceived of vision as searching for edges, or generalized cylinders,
  or in terms of SIFT features. But today all this is discarded. Modern
  deep-learning neural networks use only the notions of convolution and
  certain kinds of invariances, and perform much better.<br>
  <br>
  This is a big lesson. As a field, we still have not thoroughly learned
  it, as we are continuing to make the same kind of mistakes. To see
  this, and to effectively resist it, we have to understand the appeal of
  these mistakes. We have to learn the bitter lesson that building in how
  we think we think does not work in the long run. The bitter lesson is
  based on the historical observations that 1) AI researchers have often
  tried to build knowledge into their agents, 2) this always helps in the
  short term, and is personally satisfying to the researcher, but 3) in
  the long run it plateaus and even inhibits further progress, and 4)
  breakthrough progress eventually arrives by an opposing approach based
  on scaling computation by search and learning. The eventual success is
  tinged with bitterness, and often incompletely digested, because it is
  success over a favored, human-centric approach. <br>
  <br>
  One thing that should be learned from the bitter lesson is the great
  power of general-purpose methods, of methods that continue to scale
  with increased computation even as the available computation becomes
  very great. The two methods that seem to scale arbitrarily in this way
  are <span style="font-style: italic;">search</span> and <span style="font-style: italic;">learning</span>. <br>
  <br>
  The second general point to be learned from the bitter lesson is that
  the actual contents of minds are tremendously, irredeemably complex; we
  should stop trying to find simple ways to think about the contents of
  minds, such as simple ways to think about space, objects, multiple
  agents, or symmetries. All these are part of the arbitrary,
  intrinsically complex, outside world. They are not what should be built
  in, as their complexity is endless; instead we should build in only the
  meta-methods that can find and capture this arbitrary complexity.
  Essential to these methods is that they can find good approximations,
  but the search for them should be by our methods, not by us. We want AI
  agents that can discover like we can, not agents that contain what we have
  discovered. Building in our discoveries only makes it harder to see how
  the discovering process can be done.<br>
  <br>
  </span>
  
  
  </body></html>