Noise Robust Detection of the Emergence and Spread of Topics on the Web
by Masahiro Inoue, Keishi Tajima
Abstract
As the same information appears on many Web pages, we often want to
know which page is the first one that discussed it, or how the
information has spread on the Web as time passes. In this paper, we
develop two methods: a method of detecting the first page that
discussed the given information, and a method of generating a graph
showing how the number of pages discussing it has changed along the
timeline. To extract such information, we need to determine which
pages discuss the given topic, and also need to determine when these
pages were created. For the former step, we design a metric for
estimating the inclusion degree between the information and a page.
For the latter step, we develop a technique for extracting creation
timestamps from Web pages. Although timestamp extraction is a crucial
component in temporal Web analysis, no research has shown how to do it
in detail.
Both steps are, however, still error-prone. In order to improve noise
elimination, we examine not only the properties of each page, but also
the temporal relationships between pages. If the temporal relationship
between a candidate page and other pages is unlikely under typical
patterns of information spread on the Web, we eliminate the candidate
page as noise. Our experimental results show that our methods achieve
high precision and are suitable for practical use.
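
As a rough illustration of the two outputs described above (the first page and the page count over time), the following Python sketch may help. It is not the authors' method: it assumes the pages have already been judged to discuss the topic, it uses a hypothetical regular expression that recognizes only YYYY-MM-DD or YYYY/MM/DD dates, and it treats the first valid date in a page as its creation time. The inclusion-degree metric and the temporal noise elimination are not modeled, and all function names are made up for this example.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical pattern: matches only YYYY-MM-DD or YYYY/MM/DD.
DATE_RE = re.compile(r"\b(\d{4})[-/](\d{1,2})[-/](\d{1,2})\b")


def extract_timestamp(text):
    """Return the first valid calendar date found in text, or None.

    Assumption: the first date on a page is its creation date.
    """
    for y, m, d in DATE_RE.findall(text):
        try:
            return date(int(y), int(m), int(d))
        except ValueError:
            continue  # e.g. month 13; skip and try the next match
    return None


def first_page(pages):
    """pages maps URL -> page text. Return (url, date) of the earliest page."""
    dated = [(extract_timestamp(text), url) for url, text in pages.items()]
    dated = [(d, url) for d, url in dated if d is not None]
    if not dated:
        return None
    d, url = min(dated)
    return url, d


def cumulative_counts(pages):
    """Return [(date, pages created on or before date)], sorted by date."""
    per_day = Counter(
        d for d in map(extract_timestamp, pages.values()) if d is not None
    )
    total, series = 0, []
    for day in sorted(per_day):
        total += per_day[day]
        series.append((day, total))
    return series


pages = {
    "a": "Posted 2012-03-05 by someone",
    "b": "2012/01/20 news item",
    "c": "no date here",
}
earliest = first_page(pages)
series = cumulative_counts(pages)
```

Pages without a recognizable date are simply dropped here; a real system would need a fallback and, as the abstract notes, a way to reject pages whose timestamps are implausible.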
Web mining,
information retrieval,
information filtering,
temporal information,
information dissemination,
topic initiator,
topic detection,
timestamp extraction,
creation time,
publishing date,
noise elimination
Published in Proc. of TempWeb 2012 (in conjunction with WWW Conf.), pp. 9-16, Apr. 2012, Lyon