Discovery and Retrieval of Information Units in Web

by Keishi Tajima, Kenji Hatano, Takeshi Matsukura, Ryouichi Sano, Katsumi Tanaka


In ordinary search engines for Web pages, the data unit for query processing is the individual page, and an index is built for each page from the words appearing in it. In actual Web data, however, a logical document discussing one topic is often organized into a set of pages connected via links that the page author provides as "standard navigation routes." In such a situation, a conjunctive query with multiple keywords may fail to retrieve an appropriate document if those keywords appear on different pages within that document. Therefore, the data unit for Web data retrieval should be not a page but a connected subgraph corresponding to one logical document. In this paper, we develop new techniques for discovering and retrieving such logical information units in Web data. As in some previous research, we adopt minimal subgraph semantics for conjunctive queries. In our approach, given a conjunctive query, we approximate the information units that include all the given keywords in three steps: (1) we distinguish standard route links from the others, (2) we find minimal subgraphs that are connected via those links and include all the keywords, and (3) we compute a score for each subgraph based on the locality of the keywords within it, in order to examine whether it is really a logical information unit relevant to the query.
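The three steps named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not say how route links are recognized or how locality is scored, so the sketch assumes each link already carries a hypothetical `is_route` flag, treats route links as undirected, finds minimal subgraphs by brute force (practical only for tiny graphs), and uses a placeholder score of one over the number of pages. All names (`page_words`, `links`, `minimal_subgraphs`, `locality_score`) are invented for illustration.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Return True if `nodes` form one connected component under `adjacency`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency.get(current, ()):
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def minimal_subgraphs(page_words, links, keywords):
    """Brute-force search for minimal connected page sets covering all keywords."""
    # Step 1 (assumed to be done already): keep only links flagged as routes.
    adjacency = {}
    for src, dst, is_route in links:
        if is_route:
            adjacency.setdefault(src, set()).add(dst)
            adjacency.setdefault(dst, set()).add(src)

    # Step 2: enumerate page sets by increasing size, so any superset of an
    # already accepted set can be skipped as non-minimal.
    keywords = set(keywords)
    found = []
    pages = sorted(page_words)
    for size in range(1, len(pages) + 1):
        for subset in combinations(pages, size):
            candidate = frozenset(subset)
            if any(smaller < candidate for smaller in found):
                continue
            covered = set().union(*(page_words[p] for p in candidate))
            if keywords <= covered and is_connected(candidate, adjacency):
                found.append(candidate)
    return found


def locality_score(subgraph):
    """Step 3 placeholder: fewer pages means the keywords sit closer together."""
    return 1.0 / len(subgraph)


# Hypothetical four-page site; the link to "ads" is not a route link.
page_words = {
    "index": {"kyoto", "travel"},
    "temples": {"kyoto", "temple"},
    "hotels": {"hotel"},
    "ads": {"hotel"},
}
links = [
    ("index", "temples", True),
    ("index", "hotels", True),
    ("temples", "ads", False),
]
result = minimal_subgraphs(page_words, links, ["temple", "hotel"])
scores = [locality_score(subgraph) for subgraph in result]
```

In this example no single page contains both "temple" and "hotel", so a page-level conjunctive query would return nothing. The sketch instead returns the one page set {index, temples, hotels}, which is connected through route links, and ignores the "ads" page because it is reachable only through a non-route link.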



Keywords: Web, WWW, hypertext, query, data unit, structure discovery, information discovery, logical document
Published in Proc. of ACM Digital Library Workshop on Organizing Web Space (WOWS), pp. 13-23, Aug. 1999, Berkeley, CA.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
tajima@i.kyoto-u.ac.jp / Fax: +81(Japan) 75-753-5978 / Office: Research Bldg. #7, room 404