In ordinary Web search engines, the data unit for query processing
is the individual page: an index is built for each page from the
words appearing in it. In actual Web data,
however, a logical document discussing one topic is often organized
into a set of pages connected via links provided by the page author as
“standard navigation routes.” In such a situation, conjunctive
queries with multiple keywords may fail to retrieve an appropriate
document if those keywords appear in different pages within that
document. Therefore, a data unit for Web data retrieval should not be
a page but should be a connected subgraph corresponding to one logical
document. In this paper, we develop new techniques for discovering
and retrieving logical information units in Web data. As in some
previous research, we adopt minimal subgraph semantics for
conjunctive queries. In our approach, when given a conjunctive query,
we try to approximate information units including all the given
keywords in the following three steps: (1) we distinguish standard
route links from the others, (2) we find minimal subgraphs connected
via those links and including all the keywords, and (3) we compute the
score of each subgraph based on the locality of the keywords within it
in order to examine whether it is really a logical information unit
relevant to the query.
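
The three steps above can be illustrated with a minimal sketch in Python. It is not the method developed in this paper, only a brute-force illustration on a toy graph. The names `is_route_link`, `minimal_subgraphs`, and `locality_score` are hypothetical, the route-link test (the target lies in or below the source page's directory) is an assumed heuristic, minimality is taken over sets of pages, and the score (the reciprocal of the number of pages) is merely a stand-in for a locality measure.

```python
from itertools import combinations
import posixpath


def is_route_link(src, dst):
    """Step 1 (assumed heuristic): a link is a route link when its
    target lies in or below the directory of the source page."""
    return dst.startswith(posixpath.dirname(src) + "/")


def connected(nodes, adj):
    """Return True if `nodes` form one connected component in `adj`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def minimal_subgraphs(pages, links, keywords):
    """Step 2: minimal connected page sets that cover all keywords.

    pages: dict mapping URL -> set of words on that page
    links: iterable of (source URL, target URL) pairs
    Exhaustive enumeration, so only suitable for tiny examples.
    """
    adj = {}
    for src, dst in links:
        if is_route_link(src, dst):
            # Connectivity is treated as undirected in this sketch.
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    keywords = set(keywords)
    urls = sorted(pages)
    found = []
    # Enumerate candidates by increasing size so that any superset of
    # an already accepted set can be rejected as non-minimal.
    for size in range(1, len(urls) + 1):
        for combo in combinations(urls, size):
            candidate = frozenset(combo)
            if any(prev < candidate for prev in found):
                continue
            words = set().union(*(pages[u] for u in combo))
            if keywords <= words and connected(candidate, adj):
                found.append(candidate)
    return found


def locality_score(subgraph):
    """Step 3 (placeholder): fewer pages means more local keywords."""
    return 1.0 / len(subgraph)


# Toy data: one logical document under /doc/ and one unrelated page.
pages = {
    "/doc/index.html": {"web", "retrieval"},
    "/doc/ch1.html": {"subgraph"},
    "/doc/ch2.html": {"keyword"},
    "/other/page.html": {"subgraph", "keyword"},
}
links = [
    ("/doc/index.html", "/doc/ch1.html"),
    ("/doc/index.html", "/doc/ch2.html"),
    ("/doc/index.html", "/other/page.html"),
]

for unit in minimal_subgraphs(pages, links, ["web", "subgraph"]):
    print(sorted(unit), locality_score(unit))
```

In this toy data no single page contains both "web" and "subgraph", so a page-level conjunctive query would return nothing, while the subgraph-level query returns the two pages of the logical document that are joined by a route link.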