{"message": {"transcript": [{"chunks": [{"end": 0.6, "start": 0.0, "text": "Hi,"}, {"end": 0.96, "start": 0.6, "text": "my"}, {"end": 1.2, "start": 0.96, "text": "name"}, {"end": 1.72, "start": 1.2, "text": "is"}, {"end": 2.44, "start": 1.72, "text": "Abby"}, {"end": 2.96, "start": 2.44, "text": "Meng"}, {"end": 3.44, "start": 2.96, "text": "Yuan,"}, {"end": 3.8, "start": 3.44, "text": "and"}, {"end": 4.12, "start": 3.8, "text": "today"}, {"end": 4.32, "start": 4.12, "text": "I'm"}, {"end": 5.04, "start": 4.32, "text": "presenting"}, {"end": 5.2, "start": 5.04, "text": "a"}, {"end": 5.48, "start": 5.2, "text": "work"}, {"end": 6.24, "start": 5.48, "text": "cooperated"}, {"end": 6.48, "start": 6.24, "text": "with"}, {"end": 6.68, "start": 6.48, "text": "my"}, {"end": 7.76, "start": 6.68, "text": "supervisors"}, {"end": 8.08, "start": 7.76, "text": "Pauline"}, {"end": 8.52, "start": 8.08, "text": "Lin"}, {"end": 8.72, "start": 8.52, "text": "and"}, {"end": 9.12, "start": 8.72, "text": "Justin"}, {"end": 10.6, "start": 9.12, "text": "Zubel."}, {"end": 11.0, "start": 10.6, "text": "The"}, {"end": 11.64, "start": 11.0, "text": "title"}, {"end": 11.96, "start": 11.64, "text": "is"}, {"end": 12.28, "start": 11.96, "text": "Document"}, {"end": 13.2, "start": 12.28, "text": "Clustering"}, {"end": 13.84, "start": 13.2, "text": "versus"}, {"end": 14.44, "start": 13.84, "text": "Topic"}, {"end": 15.04, "start": 14.44, "text": "Modeling,"}, {"end": 15.24, "start": 15.04, "text": "a"}, {"end": 16.0, "start": 15.24, "text": "Key"}, {"end": 19.16, "start": 16.0, "text": "Study."}, {"end": 20.16, "start": 19.16, "text": "So"}, {"end": 20.28, "start": 20.16, "text": "a"}, {"end": 20.6, "start": 20.28, "text": "very"}, {"end": 21.52, "start": 20.6, "text": "challenging"}, {"end": 22.24, "start": 21.52, "text": "question"}, {"end": 22.32, "start": 22.24, "text": "in"}, {"end": 22.92, "start": 22.32, "text": "IR"}, {"end": 23.16, "start": 22.92, "text": "is"}, {"end": 23.44, "start": 
23.16, "text": "to"}, {"end": 23.84, "start": 23.44, "text": "learn"}, {"end": 23.92, "start": 23.84, "text": "about"}, {"end": 24.24, "start": 23.92, "text": "the"}, {"end": 24.64, "start": 24.24, "text": "scope"}, {"end": 24.84, "start": 24.64, "text": "of"}, {"end": 25.04, "start": 24.84, "text": "a"}, {"end": 25.44, "start": 25.04, "text": "collection"}, {"end": 25.72, "start": 25.44, "text": "when"}, {"end": 26.12, "start": 25.72, "text": "we"}, {"end": 26.72, "start": 26.12, "text": "have"}, {"end": 27.0, "start": 26.72, "text": "an"}, {"end": 27.24, "start": 27.0, "text": "unknown"}, {"end": 28.12, "start": 27.24, "text": "collection."}, {"end": 28.48, "start": 28.12, "text": "And"}, {"end": 28.76, "start": 28.48, "text": "there"}, {"end": 28.92, "start": 28.76, "text": "are"}, {"end": 29.24, "start": 28.92, "text": "many"}, {"end": 29.56, "start": 29.24, "text": "forms"}, {"end": 29.96, "start": 29.56, "text": "of"}], "text": " Hi, my name is Abby Meng Yuan, and today I'm presenting a work cooperated with my supervisors Pauline Lin and Justin Zubel. The title is Document Clustering versus Topic Modeling, a Key Study. So a very challenging question in IR is to learn about the scope of a collection when we have an unknown collection. 
And there are many forms of"}, {"chunks": [{"end": 30.32, "start": 30.0, "text": "of"}, {"end": 30.8, "start": 30.32, "text": "collection"}, {"end": 31.96, "start": 30.8, "text": "descriptions."}, {"end": 32.52, "start": 31.96, "text": "For"}, {"end": 33.88, "start": 32.52, "text": "example,"}, {"end": 34.04, "start": 33.88, "text": "a"}, {"end": 34.519999999999996, "start": 34.04, "text": "very"}, {"end": 35.28, "start": 34.519999999999996, "text": "simple"}, {"end": 35.88, "start": 35.28, "text": "collection"}, {"end": 36.6, "start": 35.88, "text": "description"}, {"end": 36.88, "start": 36.6, "text": "could"}, {"end": 37.32, "start": 36.88, "text": "be"}, {"end": 37.72, "start": 37.32, "text": "a"}, {"end": 39.0, "start": 37.72, "text": "distribution"}, {"end": 39.44, "start": 39.0, "text": "of"}, {"end": 40.96, "start": 39.44, "text": "topics"}, {"end": 41.28, "start": 40.96, "text": "or"}, {"end": 41.84, "start": 41.28, "text": "themes"}, {"end": 42.0, "start": 41.84, "text": "within"}, {"end": 42.28, "start": 42.0, "text": "a"}, {"end": 43.8, "start": 42.28, "text": "collection."}, {"end": 44.12, "start": 43.8, "text": "And"}, {"end": 44.480000000000004, "start": 44.12, "text": "when"}, {"end": 45.24, "start": 44.480000000000004, "text": "we"}, {"end": 45.64, "start": 45.24, "text": "have"}, {"end": 45.88, "start": 45.64, "text": "an"}, {"end": 46.04, "start": 45.88, "text": "unknown"}, {"end": 46.8, "start": 46.04, "text": "collection,"}, {"end": 47.2, "start": 46.8, "text": "there"}, {"end": 47.44, "start": 47.2, "text": "are"}, {"end": 47.760000000000005, "start": 47.44, "text": "many"}, {"end": 48.56, "start": 47.760000000000005, "text": "questions"}, {"end": 48.760000000000005, "start": 48.56, "text": "we"}, {"end": 48.879999999999995, "start": 48.760000000000005, "text": "may"}, {"end": 49.2, "start": 48.879999999999995, "text": "want"}, {"end": 49.239999999999995, "start": 49.2, "text": "to"}, {"end": 49.8, "start": 49.239999999999995, "text": "ask"}, 
{"end": 50.08, "start": 49.8, "text": "before"}, {"end": 50.32, "start": 50.08, "text": "we"}, {"end": 51.480000000000004, "start": 50.32, "text": "use"}, {"end": 51.92, "start": 51.480000000000004, "text": "it."}, {"end": 52.32, "start": 51.92, "text": "For"}, {"end": 53.120000000000005, "start": 52.32, "text": "example,"}, {"end": 53.4, "start": 53.120000000000005, "text": "what"}, {"end": 54.0, "start": 53.4, "text": "queries"}, {"end": 54.04, "start": 54.0, "text": "can"}, {"end": 54.16, "start": 54.04, "text": "I"}, {"end": 54.72, "start": 54.16, "text": "pose"}, {"end": 54.84, "start": 54.72, "text": "with"}, {"end": 55.08, "start": 54.84, "text": "this"}, {"end": 56.44, "start": 55.08, "text": "collection?"}, {"end": 57.08, "start": 56.44, "text": "Or"}, {"end": 57.28, "start": 57.08, "text": "if"}, {"end": 57.519999999999996, "start": 57.28, "text": "I"}, {"end": 57.92, "start": 57.519999999999996, "text": "have"}, {"end": 58.36, "start": 57.92, "text": "an"}, {"end": 58.64, "start": 58.36, "text": "information"}, {"end": 59.519999999999996, "start": 58.64, "text": "need,"}, {"end": 59.72, "start": 59.519999999999996, "text": "will"}, {"end": 59.96, "start": 59.72, "text": "this"}], "text": " of collection descriptions. For example, a very simple collection description could be a distribution of topics or themes within a collection. And when we have an unknown collection, there are many questions we may want to ask before we use it. For example, what queries can I pose with this collection? 
Or if I have an information need, will this"}, {"chunks": [{"end": 61.16, "start": 60.0, "text": "collection"}, {"end": 61.36, "start": 61.16, "text": "of"}, {"end": 61.68, "start": 61.36, "text": "my"}, {"end": 62.04, "start": 61.68, "text": "interest"}, {"end": 62.28, "start": 62.04, "text": "at"}, {"end": 62.68, "start": 62.28, "text": "all?"}, {"end": 63.32, "start": 62.68, "text": "Or"}, {"end": 63.72, "start": 63.32, "text": "how"}, {"end": 63.92, "start": 63.72, "text": "can"}, {"end": 64.2, "start": 63.92, "text": "I"}, {"end": 64.32, "start": 64.2, "text": "use"}, {"end": 64.48, "start": 64.32, "text": "the"}, {"end": 65.4, "start": 64.48, "text": "collection"}, {"end": 65.76, "start": 65.4, "text": "for"}, {"end": 65.88, "start": 65.76, "text": "a"}, {"end": 66.56, "start": 65.88, "text": "specific"}, {"end": 67.44, "start": 66.56, "text": "task?"}, {"end": 67.8, "start": 67.44, "text": "And"}, {"end": 67.96, "start": 67.8, "text": "by"}, {"end": 68.03999999999999, "start": 67.96, "text": "learning"}, {"end": 68.56, "start": 68.03999999999999, "text": "the"}, {"end": 68.92, "start": 68.56, "text": "scope"}, {"end": 68.96000000000001, "start": 68.92, "text": "of"}, {"end": 69.2, "start": 68.96000000000001, "text": "a"}, {"end": 69.92, "start": 69.2, "text": "collection,"}, {"end": 70.28, "start": 69.92, "text": "it"}, {"end": 70.56, "start": 70.28, "text": "may"}, {"end": 70.96000000000001, "start": 70.56, "text": "help"}, {"end": 71.16, "start": 70.96000000000001, "text": "us"}, {"end": 71.36, "start": 71.16, "text": "to"}, {"end": 71.8, "start": 71.36, "text": "answer"}, {"end": 72.16, "start": 71.8, "text": "those"}, {"end": 74.44, "start": 72.16, "text": "questions."}, {"end": 74.84, "start": 74.44, "text": "There"}, {"end": 74.88, "start": 74.84, "text": "are"}, {"end": 75.4, "start": 74.88, "text": "two"}, {"end": 76.16, "start": 75.4, "text": "commonly"}, {"end": 76.64, "start": 76.16, "text": "used"}, {"end": 77.44, "start": 76.64, "text": 
"approaches"}, {"end": 77.92, "start": 77.44, "text": "to"}, {"end": 78.56, "start": 77.92, "text": "describe"}, {"end": 78.76, "start": 78.56, "text": "a"}, {"end": 79.32, "start": 78.76, "text": "collection."}, {"end": 79.88, "start": 79.32, "text": "The"}, {"end": 80.4, "start": 79.88, "text": "first"}, {"end": 80.72, "start": 80.4, "text": "one"}, {"end": 81.24, "start": 80.72, "text": "is"}, {"end": 81.6, "start": 81.24, "text": "document"}, {"end": 83.0, "start": 81.6, "text": "clustering"}, {"end": 83.56, "start": 83.0, "text": "and"}, {"end": 83.84, "start": 83.56, "text": "the"}, {"end": 84.16, "start": 83.84, "text": "second"}, {"end": 84.36, "start": 84.16, "text": "one"}, {"end": 84.52, "start": 84.36, "text": "is"}, {"end": 84.88, "start": 84.52, "text": "topic"}, {"end": 86.44, "start": 84.88, "text": "modeling."}, {"end": 86.84, "start": 86.44, "text": "So"}, {"end": 87.28, "start": 86.84, "text": "with"}, {"end": 87.84, "start": 87.28, "text": "the"}, {"end": 89.44, "start": 87.84, "text": "size"}, {"end": 89.96000000000001, "start": 89.44, "text": "of"}], "text": " collection of my interest at all? Or how can I use the collection for a specific task? And by learning the scope of a collection, it may help us to answer those questions. There are two commonly used approaches to describe a collection. The first one is document clustering and the second one is topic modeling. 
So with the size of"}, {"chunks": [{"end": 90.76, "start": 90.0, "text": "collections"}, {"end": 90.96, "start": 90.76, "text": "we"}, {"end": 91.08, "start": 90.96, "text": "are"}, {"end": 91.48, "start": 91.08, "text": "using"}, {"end": 92.32, "start": 91.48, "text": "today"}, {"end": 92.56, "start": 92.32, "text": "because"}, {"end": 93.16, "start": 92.56, "text": "they"}, {"end": 93.36, "start": 93.16, "text": "are"}, {"end": 93.36, "start": 93.36, "text": "very"}, {"end": 94.04, "start": 93.36, "text": "large,"}, {"end": 94.88, "start": 94.04, "text": "usually"}, {"end": 95.12, "start": 94.88, "text": "we"}, {"end": 95.44, "start": 95.12, "text": "start"}, {"end": 95.76, "start": 95.44, "text": "with"}, {"end": 96.4, "start": 95.76, "text": "k-means"}, {"end": 97.24, "start": 96.4, "text": "clustering"}, {"end": 97.64, "start": 97.24, "text": "or"}, {"end": 98.08, "start": 97.64, "text": "other"}, {"end": 99.24, "start": 98.08, "text": "algorithms"}, {"end": 99.76, "start": 99.24, "text": "derived"}, {"end": 100.0, "start": 99.76, "text": "from"}, {"end": 101.76, "start": 100.0, "text": "k-means."}, {"end": 102.28, "start": 101.76, "text": "So"}, {"end": 102.68, "start": 102.28, "text": "I'm"}, {"end": 103.0, "start": 102.68, "text": "going"}, {"end": 103.16, "start": 103.0, "text": "to"}, {"end": 103.48, "start": 103.16, "text": "introduce"}, {"end": 104.2, "start": 103.48, "text": "k-means"}, {"end": 105.03999999999999, "start": 104.2, "text": "quickly."}, {"end": 105.36, "start": 105.03999999999999, "text": "In"}, {"end": 106.52, "start": 105.36, "text": "k-means"}, {"end": 107.6, "start": 106.52, "text": "clustering,"}, {"end": 107.84, "start": 107.6, "text": "we"}, {"end": 108.2, "start": 107.84, "text": "start"}, {"end": 108.36, "start": 108.2, "text": "by"}, {"end": 109.08, "start": 108.36, "text": "choosing"}, {"end": 109.12, "start": 109.08, "text": "the"}, {"end": 109.44, "start": 109.12, "text": "number"}, {"end": 109.64, "start": 109.44, "text": 
"of"}, {"end": 111.0, "start": 109.64, "text": "clusters"}, {"end": 111.16, "start": 111.0, "text": "and"}, {"end": 111.6, "start": 111.16, "text": "then"}, {"end": 111.84, "start": 111.6, "text": "we"}, {"end": 112.8, "start": 111.84, "text": "project"}, {"end": 112.96000000000001, "start": 112.8, "text": "the"}, {"end": 113.8, "start": 112.96000000000001, "text": "vectorized"}, {"end": 114.6, "start": 113.8, "text": "documents"}, {"end": 114.84, "start": 114.6, "text": "in"}, {"end": 115.28, "start": 114.84, "text": "higher"}, {"end": 115.64, "start": 115.28, "text": "dimensional"}, {"end": 116.24, "start": 115.64, "text": "space."}, {"end": 117.12, "start": 116.24, "text": "So"}, {"end": 117.36, "start": 117.12, "text": "we"}, {"end": 117.6, "start": 117.36, "text": "end"}, {"end": 117.84, "start": 117.6, "text": "up"}, {"end": 118.24, "start": 117.84, "text": "with"}, {"end": 119.64, "start": 118.24, "text": "partitions"}, {"end": 119.76, "start": 119.64, "text": "or"}, {"end": 119.96000000000001, "start": 119.76, "text": "set"}], "text": " collections we are using today because they are very large, usually we start with k-means clustering or other algorithms derived from k-means. So I'm going to introduce k-means quickly. In k-means clustering, we start by choosing the number of clusters and then we project the vectorized documents in higher dimensional space. 
So we end up with partitions or set"}, {"chunks": [{"end": 120.28, "start": 120.0, "text": "of"}, {"end": 121.88, "start": 120.28, "text": "documents."}, {"end": 122.52, "start": 121.88, "text": "And"}, {"end": 122.8, "start": 122.52, "text": "on"}, {"end": 123.36, "start": 122.8, "text": "the"}, {"end": 123.8, "start": 123.36, "text": "other"}, {"end": 124.72, "start": 123.8, "text": "hand,"}, {"end": 124.88, "start": 124.72, "text": "the"}, {"end": 125.52, "start": 124.88, "text": "topic"}, {"end": 126.16, "start": 125.52, "text": "modeling,"}, {"end": 126.84, "start": 126.16, "text": "which"}, {"end": 127.16, "start": 126.84, "text": "does"}, {"end": 127.68, "start": 127.16, "text": "totally"}, {"end": 128.72, "start": 127.68, "text": "differently,"}, {"end": 129.84, "start": 128.72, "text": "this"}, {"end": 130.2, "start": 129.84, "text": "is"}, {"end": 130.84, "start": 130.2, "text": "LDA"}, {"end": 131.56, "start": 130.84, "text": "model,"}, {"end": 131.88, "start": 131.56, "text": "which"}, {"end": 132.08, "start": 131.88, "text": "is"}, {"end": 132.24, "start": 132.08, "text": "the"}, {"end": 132.6, "start": 132.24, "text": "one"}, {"end": 132.68, "start": 132.6, "text": "we"}, {"end": 132.96, "start": 132.68, "text": "are"}, {"end": 133.44, "start": 132.96, "text": "using"}, {"end": 133.56, "start": 133.44, "text": "in"}, {"end": 133.84, "start": 133.56, "text": "our"}, {"end": 135.32, "start": 133.84, "text": "experiments."}, {"end": 135.8, "start": 135.32, "text": "So"}, {"end": 136.0, "start": 135.8, "text": "in"}, {"end": 136.36, "start": 136.0, "text": "LDA"}, {"end": 136.76, "start": 136.36, "text": "topic"}, {"end": 137.32, "start": 136.76, "text": "modeling,"}, {"end": 138.04, "start": 137.32, "text": "what"}, {"end": 138.44, "start": 138.04, "text": "we"}, {"end": 139.48, "start": 138.44, "text": "get"}, {"end": 140.36, "start": 139.48, "text": "after"}, {"end": 140.84, "start": 140.36, "text": "we"}, {"end": 141.04, "start": 140.84, "text": 
"run"}, {"end": 141.28, "start": 141.04, "text": "the"}, {"end": 141.88, "start": 141.28, "text": "algorithm"}, {"end": 142.04, "start": 141.88, "text": "are"}, {"end": 142.4, "start": 142.04, "text": "two"}, {"end": 143.72, "start": 142.4, "text": "distributions."}, {"end": 144.6, "start": 143.72, "text": "The"}, {"end": 145.0, "start": 144.6, "text": "first"}, {"end": 145.4, "start": 145.0, "text": "one"}, {"end": 145.48, "start": 145.4, "text": "is"}, {"end": 145.52, "start": 145.48, "text": "the"}, {"end": 146.4, "start": 145.52, "text": "distribution"}, {"end": 146.4, "start": 146.4, "text": "of"}, {"end": 147.12, "start": 146.4, "text": "terms"}, {"end": 147.32, "start": 147.12, "text": "over"}, {"end": 148.24, "start": 147.32, "text": "topics,"}, {"end": 148.64, "start": 148.24, "text": "where"}, {"end": 149.16, "start": 148.64, "text": "we"}, {"end": 149.96, "start": 149.16, "text": "have"}], "text": " of documents. And on the other hand, the topic modeling, which does totally differently, this is LDA model, which is the one we are using in our experiments. So in LDA topic modeling, what we get after we run the algorithm are two distributions. 
The first one is the distribution of terms over topics, where we have"}, {"chunks": [{"end": 150.48, "start": 150.0, "text": "the"}, {"end": 150.72, "start": 150.48, "text": "words"}, {"end": 151.56, "start": 150.72, "text": "from"}, {"end": 152.36, "start": 151.56, "text": "vocabulary"}, {"end": 152.52, "start": 152.36, "text": "of"}, {"end": 152.56, "start": 152.52, "text": "the"}, {"end": 153.08, "start": 152.56, "text": "collection,"}, {"end": 153.28, "start": 153.08, "text": "and"}, {"end": 153.76, "start": 153.28, "text": "then"}, {"end": 154.08, "start": 153.76, "text": "we"}, {"end": 155.24, "start": 154.08, "text": "compute"}, {"end": 155.44, "start": 155.24, "text": "the"}, {"end": 156.24, "start": 155.44, "text": "weight,"}, {"end": 156.36, "start": 156.24, "text": "or"}, {"end": 156.52, "start": 156.36, "text": "we"}, {"end": 156.84, "start": 156.52, "text": "call"}, {"end": 156.84, "start": 156.84, "text": "them"}, {"end": 156.96, "start": 156.84, "text": "the"}, {"end": 158.0, "start": 156.96, "text": "probability"}, {"end": 158.12, "start": 158.0, "text": "of"}, {"end": 158.32, "start": 158.12, "text": "the"}, {"end": 158.96, "start": 158.32, "text": "word"}, {"end": 159.16, "start": 158.96, "text": "in"}, {"end": 159.48, "start": 159.16, "text": "each"}, {"end": 159.48, "start": 159.48, "text": "of"}, {"end": 159.64, "start": 159.48, "text": "the"}, {"end": 160.72, "start": 159.64, "text": "document,"}, {"end": 161.0, "start": 160.72, "text": "in"}, {"end": 161.24, "start": 161.0, "text": "each"}, {"end": 161.24, "start": 161.24, "text": "of"}, {"end": 161.36, "start": 161.24, "text": "the"}, {"end": 161.76, "start": 161.36, "text": "topics,"}, {"end": 162.44, "start": 161.76, "text": "sorry."}, {"end": 163.0, "start": 162.44, "text": "And"}, {"end": 163.24, "start": 163.0, "text": "the"}, {"end": 163.72, "start": 163.24, "text": "second"}, {"end": 164.68, "start": 163.72, "text": "distribution"}, {"end": 165.08, "start": 164.68, "text": "is"}, 
{"end": 165.2, "start": 165.08, "text": "the"}, {"end": 166.16, "start": 165.2, "text": "topics"}, {"end": 168.16, "start": 166.16, "text": "mixture"}, {"end": 169.07999999999998, "start": 168.16, "text": "distribution"}, {"end": 169.36, "start": 169.07999999999998, "text": "over"}, {"end": 170.84, "start": 169.36, "text": "documents."}, {"end": 171.4, "start": 170.84, "text": "So"}, {"end": 171.64, "start": 171.4, "text": "we"}, {"end": 172.28, "start": 171.64, "text": "assume"}, {"end": 173.24, "start": 172.28, "text": "that"}, {"end": 174.0, "start": 173.24, "text": "in"}, {"end": 174.44, "start": 174.0, "text": "each"}, {"end": 174.44, "start": 174.44, "text": "of"}, {"end": 174.6, "start": 174.44, "text": "the"}, {"end": 175.84, "start": 174.6, "text": "documents,"}, {"end": 176.04, "start": 175.84, "text": "we"}, {"end": 176.92000000000002, "start": 176.04, "text": "have"}, {"end": 177.16, "start": 176.92000000000002, "text": "all"}, {"end": 178.16, "start": 177.16, "text": "topics,"}, {"end": 178.64, "start": 178.16, "text": "but"}, {"end": 178.84, "start": 178.64, "text": "the"}, {"end": 179.24, "start": 178.84, "text": "only"}, {"end": 179.6, "start": 179.24, "text": "difference"}, {"end": 179.96, "start": 179.6, "text": "is"}], "text": " the words from vocabulary of the collection, and then we compute the weight, or we call them the probability of the word in each of the document, in each of the topics, sorry. And the second distribution is the topics mixture distribution over documents. 
So we assume that in each of the documents, we have all topics, but the only difference is"}, {"chunks": [{"end": 180.04, "start": 180.0, "text": "the"}, {"end": 180.4, "start": 180.04, "text": "weight"}, {"end": 180.72, "start": 180.4, "text": "for"}, {"end": 180.8, "start": 180.72, "text": "each"}, {"end": 181.52, "start": 180.8, "text": "topic"}, {"end": 181.96, "start": 181.52, "text": "is"}, {"end": 183.24, "start": 181.96, "text": "different."}, {"end": 183.68, "start": 183.24, "text": "Therefore,"}, {"end": 183.88, "start": 183.68, "text": "we"}, {"end": 184.44, "start": 183.88, "text": "say"}, {"end": 184.8, "start": 184.44, "text": "there"}, {"end": 184.88, "start": 184.8, "text": "are"}, {"end": 185.08, "start": 184.88, "text": "all"}, {"end": 185.76, "start": 185.08, "text": "words"}, {"end": 185.92, "start": 185.76, "text": "in"}, {"end": 186.08, "start": 185.92, "text": "all"}, {"end": 187.32, "start": 186.08, "text": "topics"}, {"end": 187.56, "start": 187.32, "text": "and"}, {"end": 187.68, "start": 187.56, "text": "all"}, {"end": 188.52, "start": 187.68, "text": "topics"}, {"end": 188.52, "start": 188.52, "text": "are"}, {"end": 188.56, "start": 188.52, "text": "in"}, {"end": 188.84, "start": 188.56, "text": "all"}, {"end": 190.28, "start": 188.84, "text": "documents."}, {"end": 190.64, "start": 190.28, "text": "The"}, {"end": 190.96, "start": 190.64, "text": "only"}, {"end": 191.96, "start": 190.96, "text": "difference"}, {"end": 192.48, "start": 191.96, "text": "between"}, {"end": 192.96, "start": 192.48, "text": "each"}, {"end": 193.68, "start": 192.96, "text": "topic"}, {"end": 194.08, "start": 193.68, "text": "is"}, {"end": 194.24, "start": 194.08, "text": "in"}, {"end": 194.48, "start": 194.24, "text": "the"}, {"end": 195.08, "start": 194.48, "text": "weight"}, {"end": 195.32, "start": 195.08, "text": "of"}, {"end": 195.32, "start": 195.32, "text": "the"}, {"end": 198.07999999999998, "start": 195.32, "text": "terms."}, {"end": 198.48, "start": 
198.07999999999998, "text": "And"}, {"end": 199.07999999999998, "start": 198.48, "text": "to"}, {"end": 199.48, "start": 199.07999999999998, "text": "compare"}, {"end": 200.2, "start": 199.48, "text": "clustering"}, {"end": 200.4, "start": 200.2, "text": "with"}, {"end": 200.8, "start": 200.4, "text": "topic"}, {"end": 201.44, "start": 200.8, "text": "models,"}, {"end": 201.44, "start": 201.44, "text": "what"}, {"end": 202.6, "start": 201.44, "text": "they"}, {"end": 203.68, "start": 202.6, "text": "generate"}, {"end": 203.96, "start": 203.68, "text": "or"}, {"end": 204.0, "start": 203.96, "text": "what"}, {"end": 204.92000000000002, "start": 204.0, "text": "they"}, {"end": 205.76, "start": 204.92000000000002, "text": "describe"}, {"end": 205.84, "start": 205.76, "text": "are"}, {"end": 206.12, "start": 205.84, "text": "very"}, {"end": 207.07999999999998, "start": 206.12, "text": "different"}, {"end": 207.2, "start": 207.07999999999998, "text": "in"}, {"end": 207.76, "start": 207.2, "text": "terms"}, {"end": 208.12, "start": 207.76, "text": "of"}, {"end": 209.96, "start": 208.12, "text": "collections."}], "text": " the weight for each topic is different. Therefore, we say there are all words in all topics and all topics are in all documents. The only difference between each topic is in the weight of the terms. 
And to compare clustering with topic models, what they generate or what they describe are very different in terms of collections."}, {"chunks": [{"end": 210.12, "start": 210.0, "text": "For"}, {"end": 210.52, "start": 210.12, "text": "document"}, {"end": 212.2, "start": 210.52, "text": "clustering,"}, {"end": 213.16, "start": 212.2, "text": "we"}, {"end": 213.44, "start": 213.16, "text": "are"}, {"end": 213.8, "start": 213.44, "text": "doing"}, {"end": 213.96, "start": 213.8, "text": "a"}, {"end": 214.72, "start": 213.96, "text": "completely"}, {"end": 215.72, "start": 214.72, "text": "unsupervised"}, {"end": 216.92, "start": 215.72, "text": "process,"}, {"end": 217.4, "start": 216.92, "text": "which"}, {"end": 217.76, "start": 217.4, "text": "we"}, {"end": 218.24, "start": 217.76, "text": "start"}, {"end": 218.56, "start": 218.24, "text": "by"}, {"end": 218.88, "start": 218.56, "text": "using"}, {"end": 219.08, "start": 218.88, "text": "a"}, {"end": 219.68, "start": 219.08, "text": "collection,"}, {"end": 219.92, "start": 219.68, "text": "and"}, {"end": 220.24, "start": 219.92, "text": "then"}, {"end": 220.48, "start": 220.24, "text": "we"}, {"end": 220.68, "start": 220.48, "text": "end"}, {"end": 220.84, "start": 220.68, "text": "up"}, {"end": 221.2, "start": 220.84, "text": "with"}, {"end": 222.08, "start": 221.2, "text": "partitions"}, {"end": 222.24, "start": 222.08, "text": "of"}, {"end": 223.96, "start": 222.24, "text": "collection."}, {"end": 224.32, "start": 223.96, "text": "But"}, {"end": 224.48, "start": 224.32, "text": "for"}, {"end": 224.72, "start": 224.48, "text": "topic"}, {"end": 225.72, "start": 224.72, "text": "models,"}, {"end": 225.92, "start": 225.72, "text": "it"}, {"end": 226.2, "start": 225.92, "text": "is"}, {"end": 226.4, "start": 226.2, "text": "a"}, {"end": 226.92000000000002, "start": 226.4, "text": "probabilistic"}, {"end": 228.0, "start": 226.92000000000002, "text": "approach."}, {"end": 228.4, "start": 228.0, "text": "So"}, {"end": 
228.56, "start": 228.4, "text": "what"}, {"end": 228.8, "start": 228.56, "text": "we"}, {"end": 229.24, "start": 228.8, "text": "get"}, {"end": 229.4, "start": 229.24, "text": "is"}, {"end": 229.6, "start": 229.4, "text": "not"}, {"end": 230.28, "start": 229.6, "text": "partitions"}, {"end": 230.4, "start": 230.28, "text": "of"}, {"end": 231.6, "start": 230.4, "text": "documents,"}, {"end": 231.84, "start": 231.6, "text": "but"}, {"end": 232.48, "start": 231.84, "text": "instead"}, {"end": 232.68, "start": 232.48, "text": "we"}, {"end": 233.04, "start": 232.68, "text": "have"}, {"end": 234.16, "start": 233.04, "text": "distributions"}, {"end": 234.28, "start": 234.16, "text": "or"}, {"end": 235.4, "start": 234.28, "text": "probabilities"}, {"end": 235.68, "start": 235.4, "text": "over"}, {"end": 236.92000000000002, "start": 235.68, "text": "terms"}, {"end": 237.6, "start": 236.92000000000002, "text": "and"}, {"end": 237.72, "start": 237.6, "text": "over"}, {"end": 239.96, "start": 237.72, "text": "documents."}], "text": " For document clustering, we are doing a completely unsupervised process, which we start by using a collection, and then we end up with partitions of collection. But for topic models, it is a probabilistic approach. 
So what we get is not partitions of documents, but instead we have distributions or probabilities over terms and over documents."}, {"chunks": [{"end": 240.24, "start": 240.0, "text": "And"}, {"end": 240.36, "start": 240.24, "text": "in"}, {"end": 240.64, "start": 240.36, "text": "addition"}, {"end": 240.84, "start": 240.64, "text": "to"}, {"end": 241.24, "start": 240.84, "text": "that,"}, {"end": 241.28, "start": 241.24, "text": "the"}, {"end": 242.52, "start": 241.28, "text": "topic"}, {"end": 243.08, "start": 242.52, "text": "modeling"}, {"end": 243.88, "start": 243.08, "text": "can"}, {"end": 244.24, "start": 243.88, "text": "also"}, {"end": 244.4, "start": 244.24, "text": "be"}, {"end": 244.76, "start": 244.4, "text": "used"}, {"end": 245.0, "start": 244.76, "text": "to"}, {"end": 245.48, "start": 245.0, "text": "help"}, {"end": 246.2, "start": 245.48, "text": "describe"}, {"end": 247.2, "start": 246.2, "text": "clusters"}, {"end": 247.32, "start": 247.2, "text": "in"}, {"end": 247.72, "start": 247.32, "text": "many"}, {"end": 248.36, "start": 247.72, "text": "information retrieval"}, {"end": 250.08, "start": 248.36, "text": "tasks."}, {"end": 250.64, "start": 250.08, "text": "And"}, {"end": 250.8, "start": 250.64, "text": "they"}, {"end": 251.08, "start": 250.8, "text": "can"}, {"end": 251.48, "start": 251.08, "text": "also"}, {"end": 251.6, "start": 251.48, "text": "be"}, {"end": 251.96, "start": 251.6, "text": "used"}, {"end": 252.36, "start": 251.96, "text": "to"}, {"end": 253.28, "start": 252.36, "text": "improve"}, {"end": 253.44, "start": 253.28, "text": "the"}, {"end": 254.32, "start": 253.44, "text": "performance"}, {"end": 254.52, "start": 254.32, "text": "of"}, {"end": 254.76, "start": 254.52, "text": "document"}, {"end": 256.96, "start": 254.76, "text": "clustering."}, {"end": 259.12, "start": 256.96, "text": "But"}, {"end": 259.32, "start": 259.12, "text": "it"}, {"end": 259.76, "start": 259.32, "text": "doesn't"}, {"end": 260.04, "start": 259.76, 
"text": "mean"}, {"end": 260.24, "start": 260.04, "text": "that"}, {"end": 260.72, "start": 260.24, "text": "topic"}, {"end": 261.24, "start": 260.72, "text": "modeling"}, {"end": 261.56, "start": 261.24, "text": "is"}, {"end": 261.68, "start": 261.56, "text": "a"}, {"end": 261.96, "start": 261.68, "text": "more"}, {"end": 262.88, "start": 261.96, "text": "advanced"}, {"end": 263.84, "start": 262.88, "text": "approach,"}, {"end": 264.04, "start": 263.84, "text": "or"}, {"end": 264.2, "start": 264.04, "text": "it"}, {"end": 264.56, "start": 264.2, "text": "can"}, {"end": 265.0, "start": 264.56, "text": "be"}, {"end": 265.12, "start": 265.0, "text": "a"}, {"end": 265.36, "start": 265.12, "text": "richer"}, {"end": 266.0, "start": 265.36, "text": "collection"}, {"end": 266.64, "start": 266.0, "text": "descriptor"}, {"end": 266.96, "start": 266.64, "text": "if"}, {"end": 267.44, "start": 266.96, "text": "we"}, {"end": 268.16, "start": 267.44, "text": "compare"}, {"end": 268.44, "start": 268.16, "text": "it"}, {"end": 268.92, "start": 268.44, "text": "with"}, {"end": 269.96, "start": 268.92, "text": "documents."}], "text": " And in addition to that, the topic modeling can also be used to help describe clusters in many information retrieval tasks. And they can also be used to improve the performance of document clustering. 
But it doesn't mean that topic modeling is a more advanced approach, or it can be a richer collection descriptor if we compare it with documents."}, {"chunks": [{"end": 270.8, "start": 270.0, "text": "And"}, {"end": 272.64, "start": 270.8, "text": "in"}, {"end": 273.32, "start": 272.64, "text": "fact,"}, {"end": 273.6, "start": 273.32, "text": "our"}, {"end": 274.08, "start": 273.6, "text": "recent"}, {"end": 274.96, "start": 274.08, "text": "research"}, {"end": 275.4, "start": 274.96, "text": "has"}, {"end": 275.6, "start": 275.4, "text": "already"}, {"end": 275.96, "start": 275.6, "text": "shown"}, {"end": 276.8, "start": 275.96, "text": "that"}, {"end": 277.2, "start": 276.8, "text": "document"}, {"end": 278.52, "start": 277.2, "text": "clustering,"}, {"end": 278.88, "start": 278.52, "text": "even"}, {"end": 279.4, "start": 278.88, "text": "though"}, {"end": 279.6, "start": 279.4, "text": "it"}, {"end": 279.76, "start": 279.6, "text": "may"}, {"end": 280.36, "start": 279.76, "text": "generate"}, {"end": 280.72, "start": 280.36, "text": "different"}, {"end": 281.4, "start": 280.72, "text": "clusters"}, {"end": 281.64, "start": 281.4, "text": "every"}, {"end": 282.76, "start": 281.64, "text": "time,"}, {"end": 283.52, "start": 282.76, "text": "it's"}, {"end": 283.92, "start": 283.52, "text": "still"}, {"end": 284.48, "start": 283.92, "text": "quite"}, {"end": 285.48, "start": 284.48, "text": "stable"}, {"end": 285.68, "start": 285.48, "text": "and"}, {"end": 286.04, "start": 285.68, "text": "can"}, {"end": 286.24, "start": 286.04, "text": "be"}, {"end": 286.96, "start": 286.24, "text": "descriptive"}, {"end": 287.16, "start": 286.96, "text": "as"}, {"end": 288.92, "start": 287.16, "text": "well."}, {"end": 290.0, "start": 288.92, "text": "So"}, {"end": 290.36, "start": 290.0, "text": "this"}, {"end": 291.16, "start": 290.36, "text": "figure"}, {"end": 291.4, "start": 291.16, "text": "is"}, {"end": 292.04, "start": 291.4, "text": "from"}, {"end": 292.36, "start": 
292.04, "text": "a"}, {"end": 292.68, "start": 292.36, "text": "paper"}, {"end": 292.68, "start": 292.68, "text": "we"}, {"end": 293.32, "start": 292.68, "text": "published"}, {"end": 293.52, "start": 293.32, "text": "two"}, {"end": 293.96, "start": 293.52, "text": "weeks"}, {"end": 295.04, "start": 293.96, "text": "ago."}, {"end": 295.44, "start": 295.04, "text": "The"}, {"end": 295.96, "start": 295.44, "text": "heat"}, {"end": 296.68, "start": 295.96, "text": "map"}, {"end": 297.08, "start": 296.68, "text": "is"}, {"end": 297.24, "start": 297.08, "text": "a"}, {"end": 298.08, "start": 297.24, "text": "comparison"}, {"end": 298.52, "start": 298.08, "text": "of"}, {"end": 299.36, "start": 298.52, "text": "10"}, {"end": 299.44, "start": 299.36, "text": "runs"}, {"end": 299.68, "start": 299.44, "text": "of"}, {"end": 299.96, "start": 299.68, "text": "cache"}], "text": " And in fact, our recent research has already shown that document clustering, even though it may generate different clusters every time, it's still quite stable and can be descriptive as well. So this figure is from a paper we published two weeks ago. 
The heat map is a comparison of 10 runs of cache"}, {"chunks": [{"end": 300.4, "start": 300.0, "text": "means"}, {"end": 302.2, "start": 300.4, "text": "clustering."}, {"end": 302.76, "start": 302.2, "text": "So"}, {"end": 303.0, "start": 302.76, "text": "we"}, {"end": 303.96, "start": 303.0, "text": "are"}, {"end": 304.68, "start": 303.96, "text": "comparing"}, {"end": 305.04, "start": 304.68, "text": "them"}, {"end": 305.2, "start": 305.04, "text": "by"}, {"end": 305.44, "start": 305.2, "text": "the"}, {"end": 306.12, "start": 305.44, "text": "similarity"}, {"end": 306.6, "start": 306.12, "text": "score"}, {"end": 306.96, "start": 306.6, "text": "and"}, {"end": 307.16, "start": 306.96, "text": "you"}, {"end": 307.72, "start": 307.16, "text": "can"}, {"end": 308.84, "start": 307.72, "text": "see"}, {"end": 309.64, "start": 308.84, "text": "almost"}, {"end": 310.0, "start": 309.64, "text": "all"}, {"end": 310.2, "start": 310.0, "text": "the"}, {"end": 311.56, "start": 310.2, "text": "pairs"}, {"end": 312.0, "start": 311.56, "text": "lies"}, {"end": 313.0, "start": 312.0, "text": "between"}, {"end": 313.8, "start": 313.0, "text": "orange"}, {"end": 313.88, "start": 313.8, "text": "and"}, {"end": 314.08, "start": 313.88, "text": "the"}, {"end": 315.8, "start": 314.08, "text": "red."}, {"end": 316.4, "start": 315.8, "text": "So"}, {"end": 316.64, "start": 316.4, "text": "it's"}, {"end": 316.96, "start": 316.64, "text": "not"}, {"end": 318.2, "start": 316.96, "text": "completely"}, {"end": 319.48, "start": 318.2, "text": "identical"}, {"end": 319.6, "start": 319.48, "text": "but"}, {"end": 319.96, "start": 319.6, "text": "they"}, {"end": 320.08, "start": 319.96, "text": "are"}, {"end": 320.48, "start": 320.08, "text": "very"}, {"end": 320.6, "start": 320.48, "text": "similar"}, {"end": 322.08, "start": 320.6, "text": "largely."}, {"end": 322.56, "start": 322.08, "text": "So"}, {"end": 322.72, "start": 322.56, "text": "it"}, {"end": 322.92, "start": 322.72, "text": 
"is"}, {"end": 324.04, "start": 322.92, "text": "a"}, {"end": 324.16, "start": 324.04, "text": "the"}, {"end": 325.04, "start": 324.16, "text": "clustering"}, {"end": 325.2, "start": 325.04, "text": "is"}, {"end": 326.16, "start": 325.2, "text": "generating"}, {"end": 326.8, "start": 326.16, "text": "stable"}, {"end": 329.52, "start": 326.8, "text": "clusters."}, {"end": 329.96, "start": 329.52, "text": "And"}], "text": " means clustering. So we are comparing them by the similarity score and you can see almost all the pairs lies between orange and the red. So it's not completely identical but they are very similar largely. So it is a the clustering is generating stable clusters. And"}, {"chunks": [{"end": 330.0, "start": 330.0, "text": "How"}, {"end": 331.08, "start": 330.0, "text": "can"}, {"end": 332.16, "start": 331.08, "text": "these"}, {"end": 332.64, "start": 332.16, "text": "stable"}, {"end": 333.08, "start": 332.64, "text": "clusters"}, {"end": 333.28, "start": 333.08, "text": "be"}, {"end": 333.76, "start": 333.28, "text": "used"}, {"end": 334.08, "start": 333.76, "text": "to"}, {"end": 334.68, "start": 334.08, "text": "describe"}, {"end": 336.36, "start": 334.68, "text": "contents?"}, {"end": 336.64, "start": 336.36, "text": "Well,"}, {"end": 336.88, "start": 336.64, "text": "we"}, {"end": 337.8, "start": 336.88, "text": "proposed"}, {"end": 338.12, "start": 337.8, "text": "this"}, {"end": 339.0, "start": 338.12, "text": "measure"}, {"end": 339.28, "start": 339.0, "text": "called"}, {"end": 339.84, "start": 339.28, "text": "collection"}, {"end": 341.12, "start": 339.84, "text": "coverage,"}, {"end": 341.52, "start": 341.12, "text": "which"}, {"end": 341.88, "start": 341.52, "text": "is"}, {"end": 342.48, "start": 341.88, "text": "the"}, {"end": 343.08, "start": 342.48, "text": "size"}, {"end": 343.68, "start": 343.08, "text": "of"}, {"end": 344.08, "start": 343.68, "text": "relevant"}, {"end": 345.16, "start": 344.08, "text": "clusters"}, {"end": 345.96, 
"start": 345.16, "text": "to"}, {"end": 346.32, "start": 345.96, "text": "a"}, {"end": 346.84, "start": 346.32, "text": "query"}, {"end": 347.24, "start": 346.84, "text": "divided"}, {"end": 347.56, "start": 347.24, "text": "by"}, {"end": 348.12, "start": 347.56, "text": "the"}, {"end": 348.52, "start": 348.12, "text": "size"}, {"end": 348.6, "start": 348.52, "text": "of"}, {"end": 348.64, "start": 348.6, "text": "the"}, {"end": 348.92, "start": 348.64, "text": "whole"}, {"end": 349.32, "start": 348.92, "text": "collection."}, {"end": 350.04, "start": 349.32, "text": "So"}, {"end": 350.48, "start": 350.04, "text": "if"}, {"end": 351.04, "start": 350.48, "text": "the"}, {"end": 351.72, "start": 351.04, "text": "coverage"}, {"end": 351.8, "start": 351.72, "text": "is"}, {"end": 352.56, "start": 351.8, "text": "small,"}, {"end": 352.84, "start": 352.56, "text": "that"}, {"end": 353.4, "start": 352.84, "text": "means"}, {"end": 353.4, "start": 353.4, "text": "the"}, {"end": 354.12, "start": 353.4, "text": "clusters"}, {"end": 354.68, "start": 354.12, "text": "are"}, {"end": 355.0, "start": 354.68, "text": "very"}, {"end": 355.76, "start": 355.0, "text": "effective"}, {"end": 356.2, "start": 355.76, "text": "at"}, {"end": 356.96, "start": 356.2, "text": "describing"}, {"end": 357.4, "start": 356.96, "text": "relevant"}, {"end": 358.08, "start": 357.4, "text": "documents"}, {"end": 358.56, "start": 358.08, "text": "to"}, {"end": 358.76, "start": 358.56, "text": "a"}, {"end": 359.96, "start": 358.76, "text": "query."}], "text": " How can these stable clusters be used to describe contents? Well, we proposed this measure called collection coverage, which is the size of relevant clusters to a query divided by the size of the whole collection. 
So if the coverage is small, that means the clusters are very effective at describing relevant documents to a query."}, {"chunks": [{"end": 360.8, "start": 360.0, "text": "And"}, {"end": 361.04, "start": 360.8, "text": "in"}, {"end": 361.4, "start": 361.04, "text": "this"}, {"end": 362.84, "start": 361.4, "text": "figure,"}, {"end": 363.2, "start": 362.84, "text": "every"}, {"end": 363.8, "start": 363.2, "text": "dot"}, {"end": 364.28, "start": 363.8, "text": "is"}, {"end": 364.52, "start": 364.28, "text": "a"}, {"end": 364.72, "start": 364.52, "text": "track"}, {"end": 365.92, "start": 364.72, "text": "topic,"}, {"end": 366.16, "start": 365.92, "text": "and"}, {"end": 366.2, "start": 366.16, "text": "we"}, {"end": 366.92, "start": 366.2, "text": "collected"}, {"end": 367.12, "start": 366.92, "text": "the"}, {"end": 367.6, "start": 367.12, "text": "relevant"}, {"end": 368.28, "start": 367.6, "text": "documents"}, {"end": 368.56, "start": 368.28, "text": "to"}, {"end": 368.84, "start": 368.56, "text": "each"}, {"end": 368.88, "start": 368.84, "text": "of"}, {"end": 369.0, "start": 368.88, "text": "the"}, {"end": 369.92, "start": 369.0, "text": "topics."}, {"end": 370.36, "start": 369.92, "text": "So"}, {"end": 370.8, "start": 370.36, "text": "some"}, {"end": 371.52, "start": 370.8, "text": "topics"}, {"end": 371.8, "start": 371.52, "text": "have"}, {"end": 372.16, "start": 371.8, "text": "many"}, {"end": 372.6, "start": 372.16, "text": "relevant"}, {"end": 373.6, "start": 372.6, "text": "documents,"}, {"end": 373.8, "start": 373.6, "text": "but"}, {"end": 374.12, "start": 373.8, "text": "some"}, {"end": 374.32, "start": 374.12, "text": "of"}, {"end": 374.6, "start": 374.32, "text": "them"}, {"end": 374.92, "start": 374.6, "text": "have"}, {"end": 375.36, "start": 374.92, "text": "only"}, {"end": 376.36, "start": 375.36, "text": "a"}, {"end": 377.16, "start": 376.36, "text": "few."}, {"end": 377.44, "start": 377.16, "text": "And"}, {"end": 377.72, "start": 377.44, 
"text": "the"}, {"end": 378.12, "start": 377.72, "text": "blue"}, {"end": 378.88, "start": 378.12, "text": "colored"}, {"end": 379.68, "start": 378.88, "text": "dot"}, {"end": 379.92, "start": 379.68, "text": "are"}, {"end": 380.24, "start": 379.92, "text": "from"}, {"end": 380.36, "start": 380.24, "text": "an"}, {"end": 381.2, "start": 380.36, "text": "actual"}, {"end": 382.56, "start": 381.2, "text": "k-means"}, {"end": 383.56, "start": 382.56, "text": "run,"}, {"end": 383.8, "start": 383.56, "text": "and"}, {"end": 384.0, "start": 383.8, "text": "the"}, {"end": 384.76, "start": 384.0, "text": "orange"}, {"end": 386.44, "start": 384.76, "text": "crosses"}, {"end": 386.64, "start": 386.44, "text": "are"}, {"end": 387.16, "start": 386.64, "text": "from"}, {"end": 387.32, "start": 387.16, "text": "a"}, {"end": 387.52, "start": 387.32, "text": "random"}, {"end": 389.96, "start": 387.52, "text": "partitioning."}], "text": " And in this figure, every dot is a track topic, and we collected the relevant documents to each of the topics. So some topics have many relevant documents, but some of them have only a few. 
And the blue colored dot are from an actual k-means run, and the orange crosses are from a random partitioning."}, {"chunks": [{"end": 390.24, "start": 390.0, "text": "Randomly,"}, {"end": 391.04, "start": 390.24, "text": "when"}, {"end": 391.56, "start": 391.04, "text": "we"}, {"end": 391.72, "start": 391.56, "text": "have"}, {"end": 392.04, "start": 391.72, "text": "more"}, {"end": 394.08, "start": 392.04, "text": "documents,"}, {"end": 394.76, "start": 394.08, "text": "because"}, {"end": 394.88, "start": 394.76, "text": "the"}, {"end": 395.68, "start": 394.88, "text": "partitions"}, {"end": 395.96, "start": 395.68, "text": "are"}, {"end": 396.4, "start": 395.96, "text": "random,"}, {"end": 396.84, "start": 396.4, "text": "these"}, {"end": 397.28, "start": 396.84, "text": "documents"}, {"end": 397.84, "start": 397.28, "text": "rely"}, {"end": 399.4, "start": 397.84, "text": "on"}, {"end": 399.76, "start": 399.4, "text": "each"}, {"end": 400.4, "start": 399.76, "text": "partitions"}, {"end": 401.92, "start": 400.4, "text": "equally."}, {"end": 402.2, "start": 401.92, "text": "But"}, {"end": 402.28, "start": 402.2, "text": "in"}, {"end": 402.72, "start": 402.28, "text": "fact,"}, {"end": 402.88, "start": 402.72, "text": "the"}, {"end": 403.48, "start": 402.88, "text": "actual"}, {"end": 403.84, "start": 403.48, "text": "k-means"}, {"end": 404.56, "start": 403.84, "text": "cluster"}, {"end": 405.24, "start": 404.56, "text": "generates"}, {"end": 405.68, "start": 405.24, "text": "something"}, {"end": 406.08, "start": 405.68, "text": "much"}, {"end": 406.28, "start": 406.08, "text": "more"}, {"end": 406.76, "start": 406.28, "text": "meaningful"}, {"end": 407.96, "start": 406.76, "text": "because"}, {"end": 408.12, "start": 407.96, "text": "no"}, {"end": 408.4, "start": 408.12, "text": "matter"}, {"end": 408.72, "start": 408.4, "text": "how"}, {"end": 409.04, "start": 408.72, "text": "many"}, {"end": 409.4, "start": 409.04, "text": "relevant"}, {"end": 409.88, "start": 
409.4, "text": "documents"}, {"end": 410.0, "start": 409.88, "text": "we"}, {"end": 411.0, "start": 410.0, "text": "have,"}, {"end": 411.2, "start": 411.0, "text": "we"}, {"end": 411.6, "start": 411.2, "text": "can"}, {"end": 412.4, "start": 411.6, "text": "almost"}, {"end": 412.88, "start": 412.4, "text": "find"}, {"end": 413.32, "start": 412.88, "text": "all"}, {"end": 413.6, "start": 413.32, "text": "of"}, {"end": 413.72, "start": 413.6, "text": "them"}, {"end": 414.2, "start": 413.72, "text": "within"}, {"end": 415.24, "start": 414.2, "text": "20%"}, {"end": 415.48, "start": 415.24, "text": "of"}, {"end": 415.68, "start": 415.48, "text": "the"}, {"end": 419.96, "start": 415.68, "text": "collection."}], "text": " Randomly, when we have more documents, because the partitions are random, these documents rely on each partitions equally. But in fact, the actual k-means cluster generates something much more meaningful because no matter how many relevant documents we have, we can almost find all of them within 20% of the collection."}, {"chunks": [{"end": 420.4, "start": 420.0, "text": "To"}, {"end": 420.68, "start": 420.4, "text": "use"}, {"end": 420.92, "start": 420.68, "text": "these"}, {"end": 422.08, "start": 420.92, "text": "clusters"}, {"end": 422.48, "start": 422.08, "text": "as"}, {"end": 422.56, "start": 422.48, "text": "a"}, {"end": 423.08, "start": 422.56, "text": "content"}, {"end": 423.88, "start": 423.08, "text": "descriptor"}, {"end": 424.28, "start": 423.88, "text": "for"}, {"end": 424.4, "start": 424.28, "text": "an"}, {"end": 424.64, "start": 424.4, "text": "unknown"}, {"end": 425.2, "start": 424.64, "text": "collection,"}, {"end": 426.84, "start": 425.2, "text": "we"}, {"end": 427.56, "start": 426.84, "text": "usually"}, {"end": 428.0, "start": 427.56, "text": "use"}, {"end": 428.24, "start": 428.0, "text": "the"}, {"end": 429.0, "start": 428.24, "text": "informations"}, {"end": 429.88, "start": 429.0, "text": "of"}, {"end": 431.0, "start": 429.88, 
"text": "centroid"}, {"end": 431.96, "start": 431.0, "text": "and"}, {"end": 432.16, "start": 431.96, "text": "in"}, {"end": 432.6, "start": 432.16, "text": "terms"}, {"end": 432.72, "start": 432.6, "text": "of"}, {"end": 433.12, "start": 432.72, "text": "document"}, {"end": 433.88, "start": 433.12, "text": "clustering,"}, {"end": 434.28, "start": 433.88, "text": "we"}, {"end": 435.68, "start": 434.28, "text": "don't"}, {"end": 435.88, "start": 435.68, "text": "have"}, {"end": 435.96, "start": 435.88, "text": "an"}, {"end": 436.84, "start": 435.96, "text": "actual"}, {"end": 438.0, "start": 436.84, "text": "centroid,"}, {"end": 438.28, "start": 438.0, "text": "but"}, {"end": 438.32, "start": 438.28, "text": "what"}, {"end": 438.96, "start": 438.32, "text": "we"}, {"end": 439.28, "start": 438.96, "text": "have"}, {"end": 439.6, "start": 439.28, "text": "are"}, {"end": 439.8, "start": 439.6, "text": "the"}, {"end": 440.48, "start": 439.8, "text": "documents"}, {"end": 440.76, "start": 440.48, "text": "near"}, {"end": 441.0, "start": 440.76, "text": "the"}, {"end": 441.96, "start": 441.0, "text": "center."}, {"end": 442.16, "start": 441.96, "text": "So"}, {"end": 442.44, "start": 442.16, "text": "we"}, {"end": 442.84, "start": 442.44, "text": "use"}, {"end": 443.0, "start": 442.84, "text": "the"}, {"end": 443.48, "start": 443.0, "text": "central"}, {"end": 443.96, "start": 443.48, "text": "document"}, {"end": 444.32, "start": 443.96, "text": "zone"}, {"end": 445.16, "start": 444.32, "text": "to"}, {"end": 446.32, "start": 445.16, "text": "represent"}, {"end": 446.44, "start": 446.32, "text": "the"}, {"end": 447.08, "start": 446.44, "text": "content"}, {"end": 447.16, "start": 447.08, "text": "of"}, {"end": 447.16, "start": 447.16, "text": "a"}, {"end": 447.72, "start": 447.16, "text": "cluster."}, {"end": 447.96, "start": 447.72, "text": "But"}, {"end": 448.6, "start": 447.96, "text": "with"}, {"end": 449.96, "start": 448.6, "text": "students"}], "text": " To use 
these clusters as a content descriptor for an unknown collection, we usually use the informations of centroid and in terms of document clustering, we don't have an actual centroid, but what we have are the documents near the center. So we use the central document zone to represent the content of a cluster. But with students"}, {"chunks": [{"end": 450.16, "start": 450.0, "text": "who"}, {"end": 450.36, "start": 450.16, "text": "don't"}, {"end": 450.76, "start": 450.36, "text": "have"}, {"end": 451.44, "start": 450.76, "text": "evidence"}, {"end": 451.6, "start": 451.44, "text": "to"}, {"end": 452.28, "start": 451.6, "text": "say"}, {"end": 452.6, "start": 452.28, "text": "that"}, {"end": 453.2, "start": 452.6, "text": "these"}, {"end": 453.92, "start": 453.2, "text": "centroid"}, {"end": 454.64, "start": 453.92, "text": "documents"}, {"end": 454.8, "start": 454.64, "text": "are"}, {"end": 455.16, "start": 454.8, "text": "more"}, {"end": 456.48, "start": 455.16, "text": "representative"}, {"end": 456.68, "start": 456.48, "text": "and"}, {"end": 457.84, "start": 456.68, "text": "disruptive"}, {"end": 458.24, "start": 457.84, "text": "than"}, {"end": 458.68, "start": 458.24, "text": "other"}, {"end": 459.36, "start": 458.68, "text": "documents"}, {"end": 459.6, "start": 459.36, "text": "within"}, {"end": 459.72, "start": 459.6, "text": "the"}, {"end": 461.24, "start": 459.72, "text": "clusters."}, {"end": 461.68, "start": 461.24, "text": "So"}, {"end": 462.12, "start": 461.68, "text": "we"}, {"end": 462.36, "start": 462.12, "text": "have"}, {"end": 462.68, "start": 462.36, "text": "two"}, {"end": 463.32, "start": 462.68, "text": "ways"}, {"end": 463.68, "start": 463.32, "text": "to"}, {"end": 464.04, "start": 463.68, "text": "generate"}, {"end": 465.08, "start": 464.04, "text": "keywords"}, {"end": 465.32, "start": 465.08, "text": "from"}, {"end": 466.52, "start": 465.32, "text": "clusters."}, {"end": 467.24, "start": 466.52, "text": "The"}, {"end": 467.72, "start": 
467.24, "text": "first"}, {"end": 468.64, "start": 467.72, "text": "one"}, {"end": 468.72, "start": 468.64, "text": "is"}, {"end": 469.08, "start": 468.72, "text": "to"}, {"end": 469.72, "start": 469.08, "text": "use"}, {"end": 470.04, "start": 469.72, "text": "all"}, {"end": 470.72, "start": 470.04, "text": "documents"}, {"end": 470.84, "start": 470.72, "text": "within"}, {"end": 471.12, "start": 470.84, "text": "a"}, {"end": 472.52, "start": 471.12, "text": "cluster."}, {"end": 472.64, "start": 472.52, "text": "And"}, {"end": 473.0, "start": 472.64, "text": "then"}, {"end": 473.12, "start": 473.0, "text": "we"}, {"end": 473.56, "start": 473.12, "text": "sort"}, {"end": 473.68, "start": 473.56, "text": "the"}, {"end": 474.56, "start": 473.68, "text": "terms"}, {"end": 475.0, "start": 474.56, "text": "by"}, {"end": 475.44, "start": 475.0, "text": "their"}, {"end": 476.52, "start": 475.44, "text": "frequency"}, {"end": 476.72, "start": 476.52, "text": "and"}, {"end": 476.72, "start": 476.72, "text": "we"}, {"end": 477.0, "start": 476.72, "text": "call"}, {"end": 477.12, "start": 477.0, "text": "them"}, {"end": 477.24, "start": 477.12, "text": "the"}, {"end": 477.64, "start": 477.24, "text": "cluster"}, {"end": 479.96, "start": 477.64, "text": "terms."}], "text": " who don't have evidence to say that these centroid documents are more representative and disruptive than other documents within the clusters. So we have two ways to generate keywords from clusters. The first one is to use all documents within a cluster. 
And then we sort the terms by their frequency and we call them the cluster terms."}, {"chunks": [{"end": 480.36, "start": 480.0, "text": "Second"}, {"end": 481.32, "start": 480.36, "text": "method"}, {"end": 481.92, "start": 481.32, "text": "is"}, {"end": 482.64, "start": 481.92, "text": "to"}, {"end": 482.96, "start": 482.64, "text": "use"}, {"end": 483.04, "start": 482.96, "text": "the"}, {"end": 483.64, "start": 483.04, "text": "documents"}, {"end": 483.72, "start": 483.64, "text": "near"}, {"end": 483.96, "start": 483.72, "text": "the"}, {"end": 484.2, "start": 483.96, "text": "central"}, {"end": 484.44, "start": 484.2, "text": "only,"}, {"end": 485.08, "start": 484.44, "text": "for"}, {"end": 486.0, "start": 485.08, "text": "example,"}, {"end": 486.44, "start": 486.0, "text": "10"}, {"end": 487.24, "start": 486.44, "text": "documents"}, {"end": 487.32, "start": 487.24, "text": "or"}, {"end": 487.68, "start": 487.32, "text": "20"}, {"end": 489.0, "start": 487.68, "text": "documents,"}, {"end": 489.16, "start": 489.0, "text": "and"}, {"end": 489.6, "start": 489.16, "text": "then"}, {"end": 489.76, "start": 489.6, "text": "we"}, {"end": 490.28, "start": 489.76, "text": "call"}, {"end": 490.56, "start": 490.28, "text": "them"}, {"end": 490.8, "start": 490.56, "text": "the"}, {"end": 490.92, "start": 490.8, "text": "central"}, {"end": 492.64, "start": 490.92, "text": "terms."}, {"end": 493.48, "start": 492.64, "text": "And"}, {"end": 493.76, "start": 493.48, "text": "the"}, {"end": 494.84, "start": 493.76, "text": "keywords"}, {"end": 495.12, "start": 494.84, "text": "for"}, {"end": 495.2, "start": 495.12, "text": "a"}, {"end": 496.8, "start": 495.2, "text": "topic"}, {"end": 497.24, "start": 496.8, "text": "is"}, {"end": 497.68, "start": 497.24, "text": "computed"}, {"end": 499.2, "start": 497.68, "text": "from"}, {"end": 499.36, "start": 499.2, "text": "the"}, {"end": 499.96, "start": 499.36, "text": "term"}, {"end": 501.04, "start": 499.96, "text": "topic"}, 
{"end": 502.0, "start": 501.04, "text": "distribution"}, {"end": 502.56, "start": 502.0, "text": "described"}, {"end": 504.12, "start": 502.56, "text": "previously."}, {"end": 504.4, "start": 504.12, "text": "So"}, {"end": 504.64, "start": 504.4, "text": "for"}, {"end": 505.08, "start": 504.64, "text": "each"}, {"end": 505.88, "start": 505.08, "text": "topic,"}, {"end": 506.28, "start": 505.88, "text": "we"}, {"end": 506.56, "start": 506.28, "text": "order"}, {"end": 506.96, "start": 506.56, "text": "the"}, {"end": 507.56, "start": 506.96, "text": "terms"}, {"end": 507.76, "start": 507.56, "text": "by"}, {"end": 508.04, "start": 507.76, "text": "their"}, {"end": 508.04, "start": 508.04, "text": "weight"}, {"end": 508.04, "start": 508.04, "text": "and"}, {"end": 508.24, "start": 508.04, "text": "we"}, {"end": 508.96, "start": 508.24, "text": "choose"}, {"end": 509.2, "start": 508.96, "text": "the"}, {"end": 509.96, "start": 509.2, "text": "top"}], "text": " Second method is to use the documents near the central only, for example, 10 documents or 20 documents, and then we call them the central terms. And the keywords for a topic is computed from the term topic distribution described previously. 
So for each topic, we order the terms by their weight and we choose the top"}, {"chunks": [{"end": 510.6, "start": 510.0, "text": "terms"}, {"end": 510.88, "start": 510.6, "text": "as"}, {"end": 511.04, "start": 510.88, "text": "the"}, {"end": 511.16, "start": 511.04, "text": "topic"}, {"end": 514.0, "start": 511.16, "text": "terms."}, {"end": 514.56, "start": 514.0, "text": "And"}, {"end": 514.88, "start": 514.56, "text": "we"}, {"end": 515.08, "start": 514.88, "text": "are"}, {"end": 515.68, "start": 515.08, "text": "interested"}, {"end": 516.08, "start": 515.68, "text": "in"}, {"end": 517.24, "start": 516.08, "text": "how"}, {"end": 517.44, "start": 517.24, "text": "the"}, {"end": 518.2, "start": 517.44, "text": "clusters"}, {"end": 518.64, "start": 518.2, "text": "align"}, {"end": 518.8, "start": 518.64, "text": "with"}, {"end": 518.96, "start": 518.8, "text": "the"}, {"end": 520.48, "start": 518.96, "text": "topics."}, {"end": 521.2, "start": 520.48, "text": "So"}, {"end": 521.6, "start": 521.2, "text": "for"}, {"end": 521.96, "start": 521.6, "text": "each"}, {"end": 522.68, "start": 521.96, "text": "document"}, {"end": 523.0, "start": 522.68, "text": "in"}, {"end": 523.16, "start": 523.0, "text": "a"}, {"end": 524.56, "start": 523.16, "text": "collection,"}, {"end": 524.92, "start": 524.56, "text": "we"}, {"end": 525.32, "start": 524.92, "text": "can"}, {"end": 525.52, "start": 525.32, "text": "give"}, {"end": 525.8, "start": 525.52, "text": "them"}, {"end": 525.92, "start": 525.8, "text": "a"}, {"end": 526.36, "start": 525.92, "text": "cluster"}, {"end": 527.24, "start": 526.36, "text": "label,"}, {"end": 527.64, "start": 527.24, "text": "which"}, {"end": 527.92, "start": 527.64, "text": "is"}, {"end": 528.08, "start": 527.92, "text": "the"}, {"end": 528.76, "start": 528.08, "text": "membership"}, {"end": 529.08, "start": 528.76, "text": "of"}, {"end": 529.16, "start": 529.08, "text": "a"}, {"end": 529.76, "start": 529.16, "text": "clustering."}, {"end": 
530.56, "start": 529.76, "text": "And"}, {"end": 531.04, "start": 530.56, "text": "we"}, {"end": 531.4, "start": 531.04, "text": "also"}, {"end": 532.12, "start": 531.4, "text": "generate"}, {"end": 532.12, "start": 532.12, "text": "a"}, {"end": 533.36, "start": 532.12, "text": "topic"}, {"end": 534.0, "start": 533.36, "text": "label"}, {"end": 534.2, "start": 534.0, "text": "for"}, {"end": 534.44, "start": 534.2, "text": "each"}, {"end": 534.52, "start": 534.44, "text": "of"}, {"end": 534.64, "start": 534.52, "text": "the"}, {"end": 535.32, "start": 534.64, "text": "documents."}, {"end": 536.64, "start": 535.32, "text": "The"}, {"end": 537.28, "start": 536.64, "text": "topic"}, {"end": 538.52, "start": 537.28, "text": "label"}, {"end": 539.28, "start": 538.52, "text": "is"}, {"end": 539.52, "start": 539.28, "text": "got"}, {"end": 539.96, "start": 539.52, "text": "from"}], "text": " terms as the topic terms. And we are interested in how the clusters align with the topics. So for each document in a collection, we can give them a cluster label, which is the membership of a clustering. And we also generate a topic label for each of the documents. 
The topic label is got from"}, {"chunks": [{"end": 540.68, "start": 540.0, "text": "the"}, {"end": 541.36, "start": 540.68, "text": "topic"}, {"end": 541.56, "start": 541.36, "text": "with"}, {"end": 541.68, "start": 541.56, "text": "the"}, {"end": 542.36, "start": 541.68, "text": "highest"}, {"end": 542.8, "start": 542.36, "text": "weight"}, {"end": 543.12, "start": 542.8, "text": "of"}, {"end": 543.24, "start": 543.12, "text": "the"}, {"end": 544.16, "start": 543.24, "text": "document"}, {"end": 544.44, "start": 544.16, "text": "from"}, {"end": 544.88, "start": 544.44, "text": "the"}, {"end": 545.28, "start": 544.88, "text": "second"}, {"end": 546.4, "start": 545.28, "text": "distribution."}, {"end": 546.84, "start": 546.4, "text": "And"}, {"end": 547.36, "start": 546.84, "text": "then"}, {"end": 547.64, "start": 547.36, "text": "we"}, {"end": 547.88, "start": 547.64, "text": "would"}, {"end": 548.0, "start": 547.88, "text": "like"}, {"end": 548.44, "start": 548.0, "text": "to"}, {"end": 549.16, "start": 548.44, "text": "compute"}, {"end": 549.8, "start": 549.16, "text": "the"}, {"end": 551.0, "start": 549.8, "text": "distribution"}, {"end": 551.12, "start": 551.0, "text": "of"}, {"end": 551.92, "start": 551.12, "text": "topics"}, {"end": 552.16, "start": 551.92, "text": "for"}, {"end": 552.4, "start": 552.16, "text": "each"}, {"end": 552.4, "start": 552.4, "text": "of"}, {"end": 552.52, "start": 552.4, "text": "the"}, {"end": 554.56, "start": 552.52, "text": "cluster."}, {"end": 554.84, "start": 554.56, "text": "In"}, {"end": 555.36, "start": 554.84, "text": "this"}, {"end": 556.44, "start": 555.36, "text": "comparison,"}, {"end": 557.0, "start": 556.44, "text": "the"}, {"end": 557.8, "start": 557.0, "text": "dataset"}, {"end": 558.08, "start": 557.8, "text": "we"}, {"end": 558.36, "start": 558.08, "text": "use"}, {"end": 558.72, "start": 558.36, "text": "is"}, {"end": 558.76, "start": 558.72, "text": "the"}, {"end": 558.96, "start": 558.76, "text": "Wall"}, 
{"end": 559.24, "start": 558.96, "text": "Street"}, {"end": 559.64, "start": 559.24, "text": "Journal"}, {"end": 561.28, "start": 559.64, "text": "collection."}, {"end": 561.4, "start": 561.28, "text": "And"}, {"end": 562.12, "start": 561.4, "text": "there"}, {"end": 562.68, "start": 562.12, "text": "are"}, {"end": 563.52, "start": 562.68, "text": "around"}, {"end": 564.48, "start": 563.52, "text": "98,000"}, {"end": 565.2, "start": 564.48, "text": "documents"}, {"end": 565.28, "start": 565.2, "text": "in"}, {"end": 567.88, "start": 565.28, "text": "total."}, {"end": 568.12, "start": 567.88, "text": "We"}, {"end": 568.92, "start": 568.12, "text": "also"}, {"end": 569.24, "start": 568.92, "text": "split"}, {"end": 569.28, "start": 569.24, "text": "it"}, {"end": 569.4, "start": 569.28, "text": "into"}, {"end": 569.96, "start": 569.4, "text": "subsets."}], "text": " the topic with the highest weight of the document from the second distribution. And then we would like to compute the distribution of topics for each of the cluster. In this comparison, the dataset we use is the Wall Street Journal collection. And there are around 98,000 documents in total. 
We also split it into subsets."}, {"chunks": [{"end": 570.16, "start": 570.0, "text": "by"}, {"end": 570.68, "start": 570.16, "text": "the"}, {"end": 571.32, "start": 570.68, "text": "length"}, {"end": 571.84, "start": 571.32, "text": "of"}, {"end": 572.32, "start": 571.84, "text": "the"}, {"end": 572.84, "start": 572.32, "text": "documents."}, {"end": 573.08, "start": 572.84, "text": "But"}, {"end": 573.64, "start": 573.08, "text": "the"}, {"end": 574.32, "start": 573.64, "text": "actual"}, {"end": 574.6, "start": 574.32, "text": "result"}, {"end": 574.96, "start": 574.6, "text": "doesn't"}, {"end": 575.24, "start": 574.96, "text": "show"}, {"end": 575.56, "start": 575.24, "text": "much"}, {"end": 576.6, "start": 575.56, "text": "difference."}, {"end": 576.92, "start": 576.6, "text": "So"}, {"end": 577.24, "start": 576.92, "text": "we"}, {"end": 577.4, "start": 577.24, "text": "are"}, {"end": 577.52, "start": 577.4, "text": "only"}, {"end": 578.2, "start": 577.52, "text": "presenting"}, {"end": 578.4, "start": 578.2, "text": "the"}, {"end": 578.92, "start": 578.4, "text": "results"}, {"end": 579.48, "start": 578.92, "text": "for"}, {"end": 579.64, "start": 579.48, "text": "the"}, {"end": 579.76, "start": 579.64, "text": "whole"}, {"end": 580.48, "start": 579.76, "text": "collection."}, {"end": 580.84, "start": 580.48, "text": "And"}, {"end": 581.36, "start": 580.84, "text": "if"}, {"end": 581.64, "start": 581.36, "text": "you"}, {"end": 581.88, "start": 581.64, "text": "are"}, {"end": 582.24, "start": 581.88, "text": "interested"}, {"end": 582.72, "start": 582.24, "text": "in"}, {"end": 582.88, "start": 582.72, "text": "the"}, {"end": 583.28, "start": 582.88, "text": "results"}, {"end": 583.64, "start": 583.28, "text": "for"}, {"end": 583.64, "start": 583.64, "text": "the"}, {"end": 584.64, "start": 583.64, "text": "partitions,"}, {"end": 585.68, "start": 584.64, "text": "please"}, {"end": 586.0, "start": 585.68, "text": "read"}, {"end": 586.24, "start": 586.0, 
"text": "our"}, {"end": 590.0, "start": 586.24, "text": "paper."}, {"end": 590.16, "start": 590.0, "text": "And"}, {"end": 590.44, "start": 590.16, "text": "this"}, {"end": 590.6, "start": 590.44, "text": "is"}, {"end": 590.8, "start": 590.6, "text": "the"}, {"end": 591.44, "start": 590.8, "text": "result"}, {"end": 592.2, "start": 591.44, "text": "for"}, {"end": 592.88, "start": 592.2, "text": "the"}, {"end": 593.84, "start": 592.88, "text": "distribution"}, {"end": 593.88, "start": 593.84, "text": "of"}, {"end": 594.68, "start": 593.88, "text": "topics"}, {"end": 594.84, "start": 594.68, "text": "per"}, {"end": 596.68, "start": 594.84, "text": "cluster."}, {"end": 597.0, "start": 596.68, "text": "Each"}, {"end": 598.08, "start": 597.0, "text": "bar"}, {"end": 598.4, "start": 598.08, "text": "of"}, {"end": 598.96, "start": 598.4, "text": "this"}, {"end": 599.52, "start": 598.96, "text": "figure"}, {"end": 599.96, "start": 599.52, "text": "represents"}], "text": " by the length of the documents. But the actual result doesn't show much difference. So we are only presenting the results for the whole collection. And if you are interested in the results for the partitions, please read our paper. And this is the result for the distribution of topics per cluster. 
Each bar of this figure represents"}, {"chunks": [{"end": 600.2, "start": 600.0, "text": "as"}, {"end": 600.24, "start": 600.2, "text": "a"}, {"end": 601.72, "start": 600.24, "text": "cluster,"}, {"end": 601.92, "start": 601.72, "text": "and"}, {"end": 602.16, "start": 601.92, "text": "the"}, {"end": 602.84, "start": 602.16, "text": "colored"}, {"end": 603.92, "start": 602.84, "text": "segments"}, {"end": 604.2, "start": 603.92, "text": "are"}, {"end": 606.24, "start": 604.2, "text": "topics."}, {"end": 606.6, "start": 606.24, "text": "We"}, {"end": 607.16, "start": 606.6, "text": "have"}, {"end": 607.56, "start": 607.16, "text": "the"}, {"end": 608.64, "start": 607.56, "text": "x-axis"}, {"end": 608.92, "start": 608.64, "text": "as"}, {"end": 609.12, "start": 608.92, "text": "the"}, {"end": 610.0, "start": 609.12, "text": "percentage"}, {"end": 610.12, "start": 610.0, "text": "of"}, {"end": 611.04, "start": 610.12, "text": "topics."}, {"end": 611.52, "start": 611.04, "text": "So"}, {"end": 611.6, "start": 611.52, "text": "it"}, {"end": 612.04, "start": 611.6, "text": "starts"}, {"end": 612.44, "start": 612.04, "text": "from"}, {"end": 613.52, "start": 612.44, "text": "0%"}, {"end": 613.92, "start": 613.52, "text": "to"}, {"end": 615.84, "start": 613.92, "text": "100%."}, {"end": 615.96, "start": 615.84, "text": "And"}, {"end": 616.16, "start": 615.96, "text": "the"}, {"end": 617.28, "start": 616.16, "text": "y-axis"}, {"end": 617.88, "start": 617.28, "text": "represents"}, {"end": 618.04, "start": 617.88, "text": "the"}, {"end": 619.88, "start": 618.04, "text": "clusters."}, {"end": 620.28, "start": 619.88, "text": "Within"}, {"end": 620.48, "start": 620.28, "text": "the"}, {"end": 621.28, "start": 620.48, "text": "bracket,"}, {"end": 621.64, "start": 621.28, "text": "we"}, {"end": 621.68, "start": 621.64, "text": "have"}, {"end": 622.24, "start": 621.68, "text": "the"}, {"end": 622.72, "start": 622.24, "text": "cluster"}, {"end": 623.32, "start": 622.72, "text": 
"size,"}, {"end": 623.64, "start": 623.32, "text": "which"}, {"end": 624.08, "start": 623.64, "text": "are"}, {"end": 624.48, "start": 624.08, "text": "the"}, {"end": 625.0, "start": 624.48, "text": "number"}, {"end": 625.48, "start": 625.0, "text": "of"}, {"end": 627.88, "start": 625.48, "text": "documents."}, {"end": 628.08, "start": 627.88, "text": "And"}, {"end": 628.36, "start": 628.08, "text": "this"}, {"end": 628.72, "start": 628.36, "text": "result"}, {"end": 628.92, "start": 628.72, "text": "is"}, {"end": 629.36, "start": 628.92, "text": "very"}, {"end": 629.96, "start": 629.36, "text": "impressive."}], "text": " as a cluster, and the colored segments are topics. We have the x-axis as the percentage of topics. So it starts from 0% to 100%. And the y-axis represents the clusters. Within the bracket, we have the cluster size, which are the number of documents. And this result is very impressive."}, {"chunks": [{"end": 631.96, "start": 630.0, "text": "because"}, {"end": 632.52, "start": 631.96, "text": "in"}, {"end": 633.16, "start": 632.52, "text": "almost"}, {"end": 633.44, "start": 633.16, "text": "all"}, {"end": 634.8, "start": 633.44, "text": "clusters,"}, {"end": 635.0, "start": 634.8, "text": "we"}, {"end": 635.32, "start": 635.0, "text": "can"}, {"end": 636.6, "start": 635.32, "text": "identify"}, {"end": 637.6, "start": 636.6, "text": "one"}, {"end": 637.76, "start": 637.6, "text": "or"}, {"end": 638.32, "start": 637.76, "text": "two"}, {"end": 638.76, "start": 638.32, "text": "dominant"}, {"end": 640.88, "start": 638.76, "text": "topics."}, {"end": 641.28, "start": 640.88, "text": "Even"}, {"end": 641.92, "start": 641.28, "text": "though"}, {"end": 642.04, "start": 641.92, "text": "the"}, {"end": 642.8, "start": 642.04, "text": "topics"}, {"end": 642.88, "start": 642.8, "text": "are"}, {"end": 643.8, "start": 642.88, "text": "latent,"}, {"end": 643.88, "start": 643.8, "text": "even"}, {"end": 644.04, "start": 643.88, "text": "though"}, {"end": 
644.32, "start": 644.04, "text": "we"}, {"end": 644.72, "start": 644.32, "text": "can't"}, {"end": 645.24, "start": 644.72, "text": "interpret"}, {"end": 645.44, "start": 645.24, "text": "what"}, {"end": 645.64, "start": 645.44, "text": "they"}, {"end": 647.0, "start": 645.64, "text": "mean,"}, {"end": 647.88, "start": 647.0, "text": "but"}, {"end": 648.16, "start": 647.88, "text": "we"}, {"end": 648.48, "start": 648.16, "text": "know"}, {"end": 649.28, "start": 648.48, "text": "that"}, {"end": 649.68, "start": 649.28, "text": "in"}, {"end": 650.04, "start": 649.68, "text": "these"}, {"end": 651.4, "start": 650.04, "text": "clusters,"}, {"end": 651.76, "start": 651.4, "text": "there"}, {"end": 652.32, "start": 651.76, "text": "has"}, {"end": 652.36, "start": 652.32, "text": "a"}, {"end": 652.64, "start": 652.36, "text": "topic,"}, {"end": 652.8, "start": 652.64, "text": "there"}, {"end": 653.08, "start": 652.8, "text": "has"}, {"end": 653.32, "start": 653.08, "text": "a"}, {"end": 653.44, "start": 653.32, "text": "theme"}, {"end": 653.92, "start": 653.44, "text": "that"}, {"end": 654.64, "start": 653.92, "text": "dominates"}, {"end": 654.88, "start": 654.64, "text": "the"}, {"end": 655.36, "start": 654.88, "text": "whole"}, {"end": 655.76, "start": 655.36, "text": "cluster."}, {"end": 656.16, "start": 655.76, "text": "And"}, {"end": 656.16, "start": 656.16, "text": "we"}, {"end": 657.56, "start": 656.16, "text": "can"}, {"end": 657.88, "start": 657.56, "text": "use"}, {"end": 657.88, "start": 657.88, "text": "it"}, {"end": 658.08, "start": 657.88, "text": "to"}, {"end": 658.76, "start": 658.08, "text": "describe"}, {"end": 658.76, "start": 658.76, "text": "the"}, {"end": 659.96, "start": 658.76, "text": "clusters."}], "text": " because in almost all clusters, we can identify one or two dominant topics. 
Even though the topics are latent, even though we can't interpret what they mean, but we know that in these clusters, there has a topic, there has a theme that dominates the whole cluster. And we can use it to describe the clusters."}, {"chunks": [{"end": 661.2, "start": 660.0, "text": "On"}, {"end": 662.12, "start": 661.2, "text": "the"}, {"end": 662.56, "start": 662.12, "text": "other"}, {"end": 664.12, "start": 662.56, "text": "hand,"}, {"end": 664.64, "start": 664.12, "text": "how"}, {"end": 664.76, "start": 664.64, "text": "do"}, {"end": 664.96, "start": 664.76, "text": "the"}, {"end": 665.6, "start": 664.96, "text": "topics"}, {"end": 665.92, "start": 665.6, "text": "align"}, {"end": 666.0, "start": 665.92, "text": "with"}, {"end": 667.68, "start": 666.0, "text": "clusters?"}, {"end": 668.08, "start": 667.68, "text": "Well,"}, {"end": 668.72, "start": 668.08, "text": "we've"}, {"end": 669.24, "start": 668.72, "text": "seen"}, {"end": 669.68, "start": 669.24, "text": "that"}, {"end": 670.4, "start": 669.68, "text": "how"}, {"end": 670.52, "start": 670.4, "text": "the"}, {"end": 671.24, "start": 670.52, "text": "clusters"}, {"end": 671.56, "start": 671.24, "text": "align"}, {"end": 671.76, "start": 671.56, "text": "with"}, {"end": 673.44, "start": 671.76, "text": "topics"}, {"end": 674.16, "start": 673.44, "text": "from"}, {"end": 674.32, "start": 674.16, "text": "the"}, {"end": 674.52, "start": 674.32, "text": "bar"}, {"end": 675.08, "start": 674.52, "text": "chart,"}, {"end": 675.2, "start": 675.08, "text": "and"}, {"end": 675.56, "start": 675.2, "text": "there"}, {"end": 675.88, "start": 675.56, "text": "are"}, {"end": 676.24, "start": 675.88, "text": "already"}, {"end": 676.76, "start": 676.24, "text": "other"}, {"end": 678.28, "start": 676.76, "text": "practices"}, {"end": 679.08, "start": 678.28, "text": "on"}, {"end": 679.64, "start": 679.08, "text": "using"}, {"end": 680.08, "start": 679.64, "text": "topic"}, {"end": 680.48, "start": 680.08, "text": 
"modeling"}, {"end": 680.52, "start": 680.48, "text": "to"}, {"end": 681.16, "start": 680.52, "text": "support"}, {"end": 681.8, "start": 681.16, "text": "clustering."}, {"end": 681.8, "start": 681.8, "text": "But"}, {"end": 682.84, "start": 681.8, "text": "we"}, {"end": 683.28, "start": 682.84, "text": "are"}, {"end": 683.76, "start": 683.28, "text": "also"}, {"end": 684.48, "start": 683.76, "text": "interested"}, {"end": 685.16, "start": 684.48, "text": "in"}, {"end": 685.4, "start": 685.16, "text": "whether"}, {"end": 685.76, "start": 685.4, "text": "the"}, {"end": 686.56, "start": 685.76, "text": "clusters"}, {"end": 686.84, "start": 686.56, "text": "can"}, {"end": 687.04, "start": 686.84, "text": "be"}, {"end": 687.48, "start": 687.04, "text": "used"}, {"end": 687.96, "start": 687.48, "text": "to"}, {"end": 688.44, "start": 687.96, "text": "describe"}, {"end": 689.96, "start": 688.44, "text": "topics."}], "text": " On the other hand, how do the topics align with clusters? Well, we've seen that how the clusters align with topics from the bar chart, and there are already other practices on using topic modeling to support clustering. 
But we are also interested in whether the clusters can be used to describe topics."}, {"chunks": [{"end": 690.4, "start": 690.0, "text": "And"}, {"end": 691.4, "start": 690.4, "text": "what"}, {"end": 691.76, "start": 691.4, "text": "we"}, {"end": 692.6, "start": 691.76, "text": "did"}, {"end": 692.76, "start": 692.6, "text": "is"}, {"end": 693.32, "start": 692.76, "text": "to"}, {"end": 694.28, "start": 693.32, "text": "first"}, {"end": 694.64, "start": 694.28, "text": "match"}, {"end": 695.32, "start": 694.64, "text": "each"}, {"end": 696.52, "start": 695.32, "text": "cluster"}, {"end": 696.84, "start": 696.52, "text": "with"}, {"end": 697.28, "start": 696.84, "text": "its"}, {"end": 697.52, "start": 697.28, "text": "dominant"}, {"end": 698.92, "start": 697.52, "text": "topic."}, {"end": 699.56, "start": 698.92, "text": "So"}, {"end": 700.0, "start": 699.56, "text": "for"}, {"end": 700.2, "start": 700.0, "text": "example,"}, {"end": 701.44, "start": 700.2, "text": "cluster"}, {"end": 702.52, "start": 701.44, "text": "19"}, {"end": 702.84, "start": 702.52, "text": "is"}, {"end": 703.16, "start": 702.84, "text": "mapped"}, {"end": 703.6, "start": 703.16, "text": "with"}, {"end": 704.2, "start": 703.6, "text": "topic"}, {"end": 705.0, "start": 704.2, "text": "three"}, {"end": 705.4, "start": 705.0, "text": "because"}, {"end": 705.64, "start": 705.4, "text": "this"}, {"end": 705.8, "start": 705.64, "text": "is"}, {"end": 705.88, "start": 705.8, "text": "a"}, {"end": 706.24, "start": 705.88, "text": "dominant"}, {"end": 707.72, "start": 706.24, "text": "topic."}, {"end": 708.4, "start": 707.72, "text": "And"}, {"end": 708.56, "start": 708.4, "text": "then"}, {"end": 708.88, "start": 708.56, "text": "we"}, {"end": 709.6, "start": 708.88, "text": "compare"}, {"end": 709.72, "start": 709.6, "text": "the"}, {"end": 711.44, "start": 709.72, "text": "keywords"}, {"end": 712.08, "start": 711.44, "text": "for"}, {"end": 712.52, "start": 712.08, "text": "each"}, {"end": 
712.56, "start": 712.52, "text": "of"}, {"end": 712.96, "start": 712.56, "text": "the"}, {"end": 714.24, "start": 712.96, "text": "cluster"}, {"end": 714.68, "start": 714.24, "text": "topic"}, {"end": 715.6, "start": 714.68, "text": "pair."}, {"end": 715.68, "start": 715.6, "text": "And"}, {"end": 716.24, "start": 715.68, "text": "this"}, {"end": 716.48, "start": 716.24, "text": "is"}, {"end": 717.24, "start": 716.48, "text": "very"}, {"end": 717.88, "start": 717.24, "text": "surprising"}, {"end": 719.96, "start": 717.88, "text": "because"}], "text": " And what we did is to first match each cluster with its dominant topic. So for example, cluster 19 is mapped with topic three because this is a dominant topic. And then we compare the keywords for each of the cluster topic pair. And this is very surprising because"}, {"chunks": [{"end": 720.48, "start": 720.0, "text": "We"}, {"end": 720.68, "start": 720.48, "text": "have"}, {"end": 721.0, "start": 720.68, "text": "three"}, {"end": 721.76, "start": 721.0, "text": "columns"}, {"end": 721.88, "start": 721.76, "text": "of"}, {"end": 723.12, "start": 721.88, "text": "keywords"}, {"end": 724.36, "start": 723.12, "text": "from"}, {"end": 724.48, "start": 724.36, "text": "the"}, {"end": 725.0, "start": 724.48, "text": "cluster"}, {"end": 725.64, "start": 725.0, "text": "terms"}, {"end": 725.96, "start": 725.64, "text": "and"}, {"end": 726.16, "start": 725.96, "text": "then"}, {"end": 726.6, "start": 726.16, "text": "the"}, {"end": 727.0, "start": 726.6, "text": "topic"}, {"end": 727.6, "start": 727.0, "text": "terms."}, {"end": 727.68, "start": 727.6, "text": "And"}, {"end": 728.16, "start": 727.68, "text": "we"}, {"end": 728.72, "start": 728.16, "text": "also"}, {"end": 729.0, "start": 728.72, "text": "have"}, {"end": 729.0, "start": 729.0, "text": "the"}, {"end": 729.76, "start": 729.0, "text": "terms"}, {"end": 730.08, "start": 729.76, "text": "from"}, {"end": 730.16, "start": 730.08, "text": "the"}, {"end": 730.72, 
"start": 730.16, "text": "centroid."}, {"end": 732.16, "start": 730.72, "text": "And"}, {"end": 732.44, "start": 732.16, "text": "the"}, {"end": 732.88, "start": 732.44, "text": "first"}, {"end": 733.72, "start": 732.88, "text": "observation"}, {"end": 734.76, "start": 733.72, "text": "is"}, {"end": 735.24, "start": 734.76, "text": "for"}, {"end": 735.44, "start": 735.24, "text": "the"}, {"end": 736.08, "start": 735.44, "text": "cluster"}, {"end": 736.76, "start": 736.08, "text": "terms"}, {"end": 736.96, "start": 736.76, "text": "and"}, {"end": 736.96, "start": 736.96, "text": "the"}, {"end": 737.44, "start": 736.96, "text": "topic"}, {"end": 739.0, "start": 737.44, "text": "terms,"}, {"end": 739.52, "start": 739.0, "text": "they"}, {"end": 739.8, "start": 739.52, "text": "are"}, {"end": 740.36, "start": 739.8, "text": "almost"}, {"end": 740.68, "start": 740.36, "text": "identical"}, {"end": 741.0, "start": 740.68, "text": "to"}, {"end": 741.44, "start": 741.0, "text": "each"}, {"end": 742.8, "start": 741.44, "text": "other."}, {"end": 743.48, "start": 742.8, "text": "And"}, {"end": 743.96, "start": 743.48, "text": "even"}, {"end": 744.56, "start": 743.96, "text": "the"}, {"end": 745.08, "start": 744.56, "text": "order"}, {"end": 745.4, "start": 745.08, "text": "of"}, {"end": 745.72, "start": 745.4, "text": "the"}, {"end": 746.6, "start": 745.72, "text": "terms"}, {"end": 746.8, "start": 746.6, "text": "are"}, {"end": 747.32, "start": 746.8, "text": "very"}, {"end": 749.96, "start": 747.32, "text": "similar."}], "text": " We have three columns of keywords from the cluster terms and then the topic terms. And we also have the terms from the centroid. And the first observation is for the cluster terms and the topic terms, they are almost identical to each other. 
And even the order of the terms are very similar."}, {"chunks": [{"end": 750.64, "start": 750.0, "text": "For"}, {"end": 751.44, "start": 750.64, "text": "example,"}, {"end": 751.88, "start": 751.44, "text": "the"}, {"end": 752.64, "start": 751.88, "text": "pair"}, {"end": 753.84, "start": 752.64, "text": "cluster"}, {"end": 754.6, "start": 753.84, "text": "three"}, {"end": 754.76, "start": 754.6, "text": "and"}, {"end": 755.2, "start": 754.76, "text": "topic"}, {"end": 756.2, "start": 755.2, "text": "four,"}, {"end": 756.56, "start": 756.2, "text": "the"}, {"end": 757.08, "start": 756.56, "text": "first"}, {"end": 757.52, "start": 757.08, "text": "word"}, {"end": 757.72, "start": 757.52, "text": "is"}, {"end": 760.28, "start": 757.72, "text": "quota."}, {"end": 760.72, "start": 760.28, "text": "And"}, {"end": 760.76, "start": 760.72, "text": "in"}, {"end": 760.88, "start": 760.76, "text": "the"}, {"end": 761.32, "start": 760.88, "text": "topic"}, {"end": 761.52, "start": 761.32, "text": "is"}, {"end": 761.72, "start": 761.52, "text": "also"}, {"end": 763.24, "start": 761.72, "text": "quota."}, {"end": 763.6, "start": 763.24, "text": "And"}, {"end": 763.76, "start": 763.6, "text": "by"}, {"end": 764.4, "start": 763.76, "text": "simply"}, {"end": 764.8, "start": 764.4, "text": "looking"}, {"end": 764.96, "start": 764.8, "text": "at"}, {"end": 765.24, "start": 764.96, "text": "these"}, {"end": 766.28, "start": 765.24, "text": "terms,"}, {"end": 766.52, "start": 766.28, "text": "we"}, {"end": 766.96, "start": 766.52, "text": "can"}, {"end": 767.24, "start": 766.96, "text": "have"}, {"end": 767.48, "start": 767.24, "text": "the"}, {"end": 768.04, "start": 767.48, "text": "very"}, {"end": 768.68, "start": 768.04, "text": "initial"}, {"end": 769.32, "start": 768.68, "text": "essential"}, {"end": 769.76, "start": 769.32, "text": "idea"}, {"end": 770.24, "start": 769.76, "text": "of"}, {"end": 770.48, "start": 770.24, "text": "what"}, {"end": 770.8, "start": 770.48, 
"text": "this"}, {"end": 771.36, "start": 770.8, "text": "cluster"}, {"end": 771.72, "start": 771.36, "text": "is"}, {"end": 774.16, "start": 771.72, "text": "about."}, {"end": 774.36, "start": 774.16, "text": "And"}, {"end": 774.64, "start": 774.36, "text": "we"}, {"end": 775.0, "start": 774.64, "text": "can"}, {"end": 776.6, "start": 775.0, "text": "also"}, {"end": 777.16, "start": 776.6, "text": "see"}, {"end": 777.16, "start": 777.16, "text": "that"}, {"end": 777.56, "start": 777.16, "text": "the"}, {"end": 778.88, "start": 777.56, "text": "topics"}, {"end": 779.16, "start": 778.88, "text": "is"}, {"end": 779.88, "start": 779.16, "text": "using"}, {"end": 779.96, "start": 779.88, "text": "very"}], "text": " For example, the pair cluster three and topic four, the first word is quota. And in the topic is also quota. And by simply looking at these terms, we can have the very initial essential idea of what this cluster is about. And we can also see that the topics is using very"}, {"chunks": [{"end": 780.52, "start": 780.0, "text": "very"}, {"end": 781.36, "start": 780.52, "text": "similar"}, {"end": 782.04, "start": 781.36, "text": "terms"}, {"end": 783.12, "start": 782.04, "text": "compared"}, {"end": 783.36, "start": 783.12, "text": "to"}, {"end": 783.36, "start": 783.36, "text": "the"}, {"end": 784.76, "start": 783.36, "text": "clusters."}, {"end": 784.92, "start": 784.76, "text": "So"}, {"end": 785.08, "start": 784.92, "text": "the"}, {"end": 785.84, "start": 785.08, "text": "clusters"}, {"end": 786.68, "start": 785.84, "text": "can"}, {"end": 787.2, "start": 786.68, "text": "help"}, {"end": 787.48, "start": 787.2, "text": "us"}, {"end": 787.8, "start": 787.48, "text": "to"}, {"end": 788.4, "start": 787.8, "text": "understand"}, {"end": 788.72, "start": 788.4, "text": "more"}, {"end": 789.04, "start": 788.72, "text": "about"}, {"end": 789.04, "start": 789.04, "text": "the"}, {"end": 789.76, "start": 789.04, "text": "topics."}, {"end": 792.0, "start": 789.76, 
"text": "But"}, {"end": 792.24, "start": 792.0, "text": "the"}, {"end": 792.76, "start": 792.24, "text": "second"}, {"end": 793.32, "start": 792.76, "text": "observation"}, {"end": 793.48, "start": 793.32, "text": "is"}, {"end": 794.4, "start": 793.48, "text": "that"}, {"end": 794.6, "start": 794.4, "text": "when"}, {"end": 795.0, "start": 794.6, "text": "we"}, {"end": 795.28, "start": 795.0, "text": "use"}, {"end": 795.32, "start": 795.28, "text": "the"}, {"end": 796.04, "start": 795.32, "text": "keywords"}, {"end": 796.64, "start": 796.04, "text": "collected"}, {"end": 796.84, "start": 796.64, "text": "from"}, {"end": 796.96, "start": 796.84, "text": "the"}, {"end": 797.6, "start": 796.96, "text": "centroid"}, {"end": 799.0, "start": 797.6, "text": "documents,"}, {"end": 799.32, "start": 799.0, "text": "they"}, {"end": 800.24, "start": 799.32, "text": "seem"}, {"end": 800.88, "start": 800.24, "text": "irrelevant"}, {"end": 801.12, "start": 800.88, "text": "to"}, {"end": 801.4, "start": 801.12, "text": "the"}, {"end": 801.84, "start": 801.4, "text": "topics."}, {"end": 802.72, "start": 801.84, "text": "And"}, {"end": 803.08, "start": 802.72, "text": "they"}, {"end": 803.4, "start": 803.08, "text": "seem"}, {"end": 804.08, "start": 803.4, "text": "very"}, {"end": 804.48, "start": 804.08, "text": "different"}, {"end": 805.0, "start": 804.48, "text": "from"}, {"end": 805.04, "start": 805.0, "text": "the"}, {"end": 806.68, "start": 805.04, "text": "keywords"}, {"end": 807.32, "start": 806.68, "text": "generated"}, {"end": 807.52, "start": 807.32, "text": "from"}, {"end": 807.6, "start": 807.52, "text": "the"}, {"end": 807.84, "start": 807.6, "text": "whole"}, {"end": 808.64, "start": 807.84, "text": "clusters"}, {"end": 809.56, "start": 808.64, "text": "at"}, {"end": 809.96, "start": 809.56, "text": "all."}], "text": " very similar terms compared to the clusters. So the clusters can help us to understand more about the topics. 
But the second observation is that when we use the keywords collected from the centroid documents, they seem irrelevant to the topics. And they seem very different from the keywords generated from the whole clusters at all."}, {"chunks": [{"end": 810.88, "start": 810.0, "text": "So"}, {"end": 811.36, "start": 810.88, "text": "if"}, {"end": 811.84, "start": 811.36, "text": "we"}, {"end": 812.16, "start": 811.84, "text": "are"}, {"end": 812.36, "start": 812.16, "text": "going"}, {"end": 812.6, "start": 812.36, "text": "to"}, {"end": 812.96, "start": 812.6, "text": "use"}, {"end": 814.28, "start": 812.96, "text": "clustering"}, {"end": 814.88, "start": 814.28, "text": "for"}, {"end": 816.68, "start": 814.88, "text": "document"}, {"end": 817.92, "start": 816.68, "text": "description,"}, {"end": 818.16, "start": 817.92, "text": "then"}, {"end": 818.72, "start": 818.16, "text": "maybe"}, {"end": 819.48, "start": 818.72, "text": "the"}, {"end": 820.12, "start": 819.48, "text": "documents"}, {"end": 820.44, "start": 820.12, "text": "near"}, {"end": 820.76, "start": 820.44, "text": "the"}, {"end": 821.24, "start": 820.76, "text": "central"}, {"end": 821.48, "start": 821.24, "text": "are"}, {"end": 821.88, "start": 821.48, "text": "not"}, {"end": 822.2, "start": 821.88, "text": "as"}, {"end": 823.36, "start": 822.2, "text": "descriptive"}, {"end": 823.64, "start": 823.36, "text": "and"}, {"end": 824.92, "start": 823.64, "text": "representative"}, {"end": 825.32, "start": 824.92, "text": "as"}, {"end": 825.56, "start": 825.32, "text": "we"}, {"end": 825.6, "start": 825.56, "text": "believed."}, {"end": 825.64, "start": 825.6, "text": "And"}, {"end": 826.6, "start": 825.64, "text": "this"}, {"end": 827.12, "start": 826.6, "text": "is"}, {"end": 827.44, "start": 827.12, "text": "different"}, {"end": 827.76, "start": 827.44, "text": "from"}, {"end": 828.04, "start": 827.76, "text": "our"}, {"end": 829.6, "start": 828.04, "text": "intuition."}, {"end": 829.96, "start": 829.6, 
"text": "So"}, {"end": 830.36, "start": 829.96, "text": "if"}, {"end": 830.68, "start": 830.36, "text": "we"}, {"end": 830.68, "start": 830.68, "text": "are"}, {"end": 830.72, "start": 830.68, "text": "going"}, {"end": 831.0, "start": 830.72, "text": "to"}, {"end": 831.6, "start": 831.0, "text": "use"}, {"end": 832.0, "start": 831.6, "text": "it,"}, {"end": 832.88, "start": 832.0, "text": "we"}, {"end": 833.6, "start": 832.88, "text": "start"}, {"end": 834.56, "start": 833.6, "text": "with"}, {"end": 834.76, "start": 834.56, "text": "the"}, {"end": 835.28, "start": 834.76, "text": "whole"}, {"end": 835.52, "start": 835.28, "text": "clusters."}, {"end": 835.76, "start": 835.52, "text": "That"}, {"end": 837.4, "start": 835.76, "text": "would"}, {"end": 837.64, "start": 837.4, "text": "be"}, {"end": 838.84, "start": 837.64, "text": "a"}, {"end": 839.4, "start": 838.84, "text": "better"}, {"end": 839.96, "start": 839.4, "text": "choice."}], "text": " So if we are going to use clustering for document description, then maybe the documents near the central are not as descriptive and representative as we believed. And this is different from our intuition. So if we are going to use it, we start with the whole clusters. 
That would be a better choice."}, {"chunks": [{"end": 840.92, "start": 840.0, "text": "we"}, {"end": 842.44, "start": 840.92, "text": "have"}, {"end": 843.12, "start": 842.44, "text": "shown"}, {"end": 843.44, "start": 843.12, "text": "from"}, {"end": 843.68, "start": 843.44, "text": "both"}, {"end": 844.6, "start": 843.68, "text": "directions"}, {"end": 844.76, "start": 844.6, "text": "that"}, {"end": 845.04, "start": 844.76, "text": "the"}, {"end": 845.88, "start": 845.04, "text": "clusters"}, {"end": 846.36, "start": 845.88, "text": "can"}, {"end": 846.52, "start": 846.36, "text": "be"}, {"end": 846.88, "start": 846.52, "text": "aligned"}, {"end": 847.12, "start": 846.88, "text": "with"}, {"end": 848.04, "start": 847.12, "text": "topics"}, {"end": 848.12, "start": 848.04, "text": "and"}, {"end": 848.28, "start": 848.12, "text": "the"}, {"end": 848.88, "start": 848.28, "text": "topics"}, {"end": 848.96, "start": 848.88, "text": "can"}, {"end": 849.08, "start": 848.96, "text": "be"}, {"end": 849.4, "start": 849.08, "text": "aligned"}, {"end": 849.56, "start": 849.4, "text": "with"}, {"end": 850.96, "start": 849.56, "text": "clusters"}, {"end": 851.24, "start": 850.96, "text": "by"}, {"end": 851.72, "start": 851.24, "text": "using"}, {"end": 852.48, "start": 851.72, "text": "the"}, {"end": 853.32, "start": 852.48, "text": "distribution"}, {"end": 853.4, "start": 853.32, "text": "of"}, {"end": 854.32, "start": 853.4, "text": "topics"}, {"end": 854.48, "start": 854.32, "text": "and"}, {"end": 854.68, "start": 854.48, "text": "by"}, {"end": 854.96, "start": 854.68, "text": "using"}, {"end": 855.08, "start": 854.96, "text": "the"}, {"end": 855.76, "start": 855.08, "text": "keywords"}, {"end": 856.16, "start": 855.76, "text": "generated"}, {"end": 856.32, "start": 856.16, "text": "from"}, {"end": 856.32, "start": 856.32, "text": "the"}, {"end": 857.2, "start": 856.32, "text": "clusters."}, {"end": 858.48, "start": 857.2, "text": "And"}, {"end": 858.8, "start": 858.48, 
"text": "we"}, {"end": 859.08, "start": 858.8, "text": "also"}, {"end": 859.24, "start": 859.08, "text": "need"}, {"end": 859.44, "start": 859.24, "text": "to"}, {"end": 859.84, "start": 859.44, "text": "notice"}, {"end": 860.44, "start": 859.84, "text": "that"}, {"end": 860.8, "start": 860.44, "text": "this"}, {"end": 860.92, "start": 860.8, "text": "is"}, {"end": 861.6, "start": 860.92, "text": "just"}, {"end": 861.84, "start": 861.6, "text": "an"}, {"end": 862.36, "start": 861.84, "text": "initial"}, {"end": 864.04, "start": 862.36, "text": "exploration."}, {"end": 864.44, "start": 864.04, "text": "We"}, {"end": 864.64, "start": 864.44, "text": "are"}, {"end": 864.64, "start": 864.64, "text": "not"}, {"end": 865.12, "start": 864.64, "text": "turning"}, {"end": 865.24, "start": 865.12, "text": "to"}, {"end": 866.52, "start": 865.24, "text": "parameters"}, {"end": 868.76, "start": 866.52, "text": "precisely"}, {"end": 869.12, "start": 868.76, "text": "what"}, {"end": 869.76, "start": 869.12, "text": "just"}, {"end": 869.96, "start": 869.76, "text": "used"}], "text": " we have shown from both directions that the clusters can be aligned with topics and the topics can be aligned with clusters by using the distribution of topics and by using the keywords generated from the clusters. And we also need to notice that this is just an initial exploration. 
We are not turning to parameters precisely what just used"}, {"chunks": [{"end": 870.32, "start": 870.0, "text": "using"}, {"end": 870.84, "start": 870.32, "text": "the"}, {"end": 871.76, "start": 870.84, "text": "simplest"}, {"end": 872.4, "start": 871.76, "text": "untrained"}, {"end": 872.72, "start": 872.4, "text": "models"}, {"end": 873.12, "start": 872.72, "text": "as"}, {"end": 873.8, "start": 873.12, "text": "the"}, {"end": 874.44, "start": 873.8, "text": "simplest"}, {"end": 875.44, "start": 874.44, "text": "K-means"}, {"end": 875.64, "start": 875.44, "text": "and"}, {"end": 875.92, "start": 875.64, "text": "the"}, {"end": 876.8, "start": 875.92, "text": "simplest"}, {"end": 877.44, "start": 876.8, "text": "untrained"}, {"end": 877.92, "start": 877.44, "text": "LDA"}, {"end": 878.28, "start": 877.92, "text": "top"}, {"end": 880.68, "start": 878.28, "text": "model."}, {"end": 881.12, "start": 880.68, "text": "And"}, {"end": 881.72, "start": 881.12, "text": "they"}, {"end": 882.08, "start": 881.72, "text": "both"}, {"end": 882.64, "start": 882.08, "text": "generate"}, {"end": 882.92, "start": 882.64, "text": "very"}, {"end": 883.44, "start": 882.92, "text": "similar"}, {"end": 885.24, "start": 883.44, "text": "informations"}, {"end": 885.92, "start": 885.24, "text": "as"}, {"end": 886.4, "start": 885.92, "text": "collection"}, {"end": 887.88, "start": 886.4, "text": "descriptors,"}, {"end": 888.2, "start": 887.88, "text": "even"}, {"end": 888.92, "start": 888.2, "text": "though"}, {"end": 889.64, "start": 888.92, "text": "their"}, {"end": 890.84, "start": 889.64, "text": "mechanisms"}, {"end": 890.92, "start": 890.84, "text": "are"}, {"end": 891.4, "start": 890.92, "text": "very"}, {"end": 891.84, "start": 891.4, "text": "different"}, {"end": 893.92, "start": 891.84, "text": "fundamentally."}, {"end": 895.16, "start": 893.92, "text": "And"}, {"end": 895.56, "start": 895.16, "text": "the"}, {"end": 895.8, "start": 895.56, "text": "third"}, {"end": 896.36, 
"start": 895.8, "text": "conclusion"}, {"end": 896.6, "start": 896.36, "text": "is"}, {"end": 898.08, "start": 896.6, "text": "that"}, {"end": 898.28, "start": 898.08, "text": "the"}, {"end": 899.0, "start": 898.28, "text": "documents"}, {"end": 899.24, "start": 899.0, "text": "near"}, {"end": 899.36, "start": 899.24, "text": "the"}, {"end": 899.96, "start": 899.36, "text": "century"}], "text": " using the simplest untrained models as the simplest K-means and the simplest untrained LDA top model. And they both generate very similar informations as collection descriptors, even though their mechanisms are very different fundamentally. And the third conclusion is that the documents near the century"}, {"chunks": [{"end": 900.4, "start": 900.0, "text": "are"}, {"end": 900.8, "start": 900.4, "text": "very"}, {"end": 901.44, "start": 900.8, "text": "limited"}, {"end": 901.72, "start": 901.44, "text": "at"}, {"end": 902.32, "start": 901.72, "text": "describing"}, {"end": 902.48, "start": 902.32, "text": "the"}, {"end": 903.08, "start": 902.48, "text": "content"}, {"end": 903.16, "start": 903.08, "text": "of"}, {"end": 903.16, "start": 903.16, "text": "a"}, {"end": 904.0, "start": 903.16, "text": "cluster"}, {"end": 904.36, "start": 904.0, "text": "if"}, {"end": 904.72, "start": 904.36, "text": "we"}, {"end": 905.08, "start": 904.72, "text": "use"}, {"end": 905.12, "start": 905.08, "text": "the"}, {"end": 906.12, "start": 905.12, "text": "topics"}, {"end": 906.48, "start": 906.12, "text": "as"}, {"end": 906.76, "start": 906.48, "text": "a"}, {"end": 907.36, "start": 906.76, "text": "reference."}, {"end": 907.48, "start": 907.36, "text": "And"}, {"end": 907.76, "start": 907.48, "text": "this"}, {"end": 907.96, "start": 907.76, "text": "is"}, {"end": 908.24, "start": 907.96, "text": "very"}, {"end": 908.68, "start": 908.24, "text": "different"}, {"end": 909.0, "start": 908.68, "text": "from"}, {"end": 909.28, "start": 909.0, "text": "our"}, {"end": 912.32, "start": 909.28, 
"text": "intuition."}, {"end": 912.92, "start": 912.32, "text": "So"}, {"end": 913.36, "start": 912.92, "text": "what's"}, {"end": 915.0, "start": 913.36, "text": "next?"}, {"end": 915.68, "start": 915.0, "text": "The"}, {"end": 916.04, "start": 915.68, "text": "next"}, {"end": 916.76, "start": 916.04, "text": "thing"}, {"end": 917.24, "start": 916.76, "text": "we"}, {"end": 917.44, "start": 917.24, "text": "will"}, {"end": 917.68, "start": 917.44, "text": "do"}, {"end": 918.28, "start": 917.68, "text": "is"}, {"end": 919.12, "start": 918.28, "text": "first,"}, {"end": 919.24, "start": 919.12, "text": "of"}, {"end": 919.68, "start": 919.24, "text": "course,"}, {"end": 920.0, "start": 919.68, "text": "to"}, {"end": 920.8, "start": 920.0, "text": "generalize"}, {"end": 920.8, "start": 920.8, "text": "the"}, {"end": 921.16, "start": 920.8, "text": "case"}, {"end": 922.48, "start": 921.16, "text": "study"}, {"end": 922.76, "start": 922.48, "text": "by"}, {"end": 923.12, "start": 922.76, "text": "two"}, {"end": 924.56, "start": 923.12, "text": "means."}, {"end": 925.2, "start": 924.56, "text": "We"}, {"end": 925.76, "start": 925.2, "text": "will"}, {"end": 926.52, "start": 925.76, "text": "explore"}, {"end": 927.04, "start": 926.52, "text": "other"}, {"end": 928.4, "start": 927.04, "text": "quantitative"}, {"end": 929.08, "start": 928.4, "text": "methods"}, {"end": 929.76, "start": 929.08, "text": "to"}, {"end": 929.96, "start": 929.76, "text": "visualize"}], "text": " are very limited at describing the content of a cluster if we use the topics as a reference. And this is very different from our intuition. So what's next? The next thing we will do is first, of course, to generalize the case study by two means. 
We will explore other quantitative methods to visualize"}, {"chunks": [{"end": 930.8, "start": 930.0, "text": "visualized"}, {"end": 931.64, "start": 930.8, "text": "alignment"}, {"end": 932.24, "start": 931.64, "text": "between"}, {"end": 932.84, "start": 932.24, "text": "topics"}, {"end": 932.88, "start": 932.84, "text": "and"}, {"end": 933.56, "start": 932.88, "text": "clusters,"}, {"end": 933.68, "start": 933.56, "text": "of"}, {"end": 934.16, "start": 933.68, "text": "course,"}, {"end": 934.2, "start": 934.16, "text": "and"}, {"end": 934.6, "start": 934.2, "text": "we"}, {"end": 935.0, "start": 934.6, "text": "will"}, {"end": 935.56, "start": 935.0, "text": "also"}, {"end": 935.92, "start": 935.56, "text": "use"}, {"end": 936.16, "start": 935.92, "text": "other"}, {"end": 937.16, "start": 936.16, "text": "collections"}, {"end": 937.88, "start": 937.16, "text": "with"}, {"end": 938.2, "start": 937.88, "text": "different"}, {"end": 939.32, "start": 938.2, "text": "sizes"}, {"end": 939.52, "start": 939.32, "text": "and"}, {"end": 940.04, "start": 939.52, "text": "types"}, {"end": 940.76, "start": 940.04, "text": "of"}, {"end": 942.96, "start": 940.76, "text": "documents."}, {"end": 943.08, "start": 942.96, "text": "And"}, {"end": 943.4, "start": 943.08, "text": "then"}, {"end": 943.6, "start": 943.4, "text": "we"}, {"end": 944.04, "start": 943.6, "text": "will"}, {"end": 944.6, "start": 944.04, "text": "explore"}, {"end": 945.12, "start": 944.6, "text": "how"}, {"end": 945.72, "start": 945.12, "text": "these"}, {"end": 946.76, "start": 945.72, "text": "comparison"}, {"end": 947.52, "start": 946.76, "text": "techniques"}, {"end": 947.72, "start": 947.52, "text": "can"}, {"end": 948.48, "start": 947.72, "text": "be"}, {"end": 948.96, "start": 948.48, "text": "used"}, {"end": 949.04, "start": 948.96, "text": "to"}, {"end": 949.76, "start": 949.04, "text": "enhance"}, {"end": 949.96, "start": 949.76, "text": "each"}, {"end": 950.12, "start": 949.96, "text": "other"}, 
{"end": 950.36, "start": 950.12, "text": "and"}, {"end": 950.84, "start": 950.36, "text": "how"}, {"end": 951.32, "start": 950.84, "text": "these"}, {"end": 952.2, "start": 951.32, "text": "techniques"}, {"end": 952.4, "start": 952.2, "text": "can"}, {"end": 952.56, "start": 952.4, "text": "be"}, {"end": 952.88, "start": 952.56, "text": "used"}, {"end": 954.2, "start": 952.88, "text": "to"}, {"end": 954.68, "start": 954.2, "text": "describe"}, {"end": 956.2, "start": 954.68, "text": "clusters"}, {"end": 956.28, "start": 956.2, "text": "with"}, {"end": 957.08, "start": 956.28, "text": "topics,"}, {"end": 957.24, "start": 957.08, "text": "how"}, {"end": 957.32, "start": 957.24, "text": "we"}, {"end": 957.76, "start": 957.32, "text": "can"}, {"end": 958.16, "start": 957.76, "text": "use"}, {"end": 958.4, "start": 958.16, "text": "them"}, {"end": 958.8, "start": 958.4, "text": "to"}, {"end": 959.28, "start": 958.8, "text": "expand"}, {"end": 959.84, "start": 959.28, "text": "topics"}, {"end": 959.96, "start": 959.84, "text": "with"}], "text": " visualized alignment between topics and clusters, of course, and we will also use other collections with different sizes and types of documents. 
And then we will explore how these comparison techniques can be used to enhance each other and how these techniques can be used to describe clusters with topics, how we can use them to expand topics with"}, {"chunks": [{"end": 961.6, "start": 960.0, "text": "classes."}, {"end": 961.8, "start": 961.6, "text": "And"}, {"end": 962.24, "start": 961.8, "text": "then"}, {"end": 962.4, "start": 962.24, "text": "if"}, {"end": 963.08, "start": 962.4, "text": "possible,"}, {"end": 963.48, "start": 963.08, "text": "we"}, {"end": 963.88, "start": 963.48, "text": "will"}, {"end": 964.8, "start": 963.88, "text": "examine"}, {"end": 965.64, "start": 964.8, "text": "how"}, {"end": 965.88, "start": 965.64, "text": "we"}, {"end": 966.76, "start": 965.88, "text": "can"}, {"end": 967.16, "start": 966.76, "text": "use"}, {"end": 967.56, "start": 967.16, "text": "these"}, {"end": 968.32, "start": 967.56, "text": "visualized"}, {"end": 970.0, "start": 968.32, "text": "representations"}, {"end": 970.24, "start": 970.0, "text": "these"}, {"end": 971.52, "start": 970.24, "text": "tools"}, {"end": 971.96, "start": 971.52, "text": "generate"}, {"end": 972.32, "start": 971.96, "text": "to"}, {"end": 972.56, "start": 972.32, "text": "support"}, {"end": 973.48, "start": 972.56, "text": "users"}, {"end": 973.68, "start": 973.48, "text": "with"}, {"end": 973.88, "start": 973.68, "text": "their"}, {"end": 974.96, "start": 973.88, "text": "search"}, {"end": 975.56, "start": 974.96, "text": "on"}, {"end": 975.92, "start": 975.56, "text": "an"}, {"end": 976.16, "start": 975.92, "text": "unknown"}, {"end": 978.28, "start": 976.16, "text": "collection."}, {"end": 978.64, "start": 978.28, "text": "And"}, {"end": 978.72, "start": 978.64, "text": "that's"}, {"end": 979.04, "start": 978.72, "text": "pretty"}, {"end": 979.36, "start": 979.04, "text": "much"}, {"end": 980.36, "start": 979.36, "text": "everything"}, {"end": 980.64, "start": 980.36, "text": "I'd"}, {"end": 980.96, "start": 980.64, "text": 
"like"}, {"end": 981.56, "start": 980.96, "text": "to"}, {"end": 982.0, "start": 981.56, "text": "say"}, {"end": 982.52, "start": 982.0, "text": "today."}, {"end": 983.2, "start": 982.52, "text": "Thank"}, {"end": 983.84, "start": 983.2, "text": "you"}, {"end": 984.28, "start": 983.84, "text": "for"}, {"end": 984.56, "start": 984.28, "text": "listening."}, {"end": 985.04, "start": 984.56, "text": "And"}, {"end": 985.44, "start": 985.04, "text": "I'm"}, {"end": 985.72, "start": 985.44, "text": "happy"}, {"end": 985.84, "start": 985.72, "text": "to"}, {"end": 986.04, "start": 985.84, "text": "take"}, {"end": 986.2, "start": 986.04, "text": "any"}, {"end": 988.44, "start": 986.2, "text": "questions."}, {"end": 988.8, "start": 988.44, "text": "Thank"}, {"end": 989.96, "start": 988.8, "text": "you."}], "text": " classes. And then if possible, we will examine how we can use these visualized representations these tools generate to support users with their search on an unknown collection. And that's pretty much everything I'd like to say today. Thank you for listening. And I'm happy to take any questions. Thank you."}, {"chunks": [{"end": 990.56, "start": 990.0, "text": "you"}], "text": " you"}]}}