
Heading-Aware Proximity Measure and Its Application to Web Search

by Tomohiro Manabe, Keishi Tajima

Abstract

The proximity of query keyword occurrences is an important piece of evidence for effective query-biased document scoring. If a query keyword occurs close to another in a document, this suggests that the document is highly relevant to the query. The simplest way to measure the proximity between keyword occurrences is to use the distance between them, i.e., the difference between their positions. However, most web pages have a hierarchical structure composed of nested logical blocks with headings, and this structure affects logical proximity. For example, if a keyword occurs in a block and another occurs in the heading of that block, we should not measure their proximity simply by their distance. This is because a heading describes the topic of its entire block, so term occurrences in a heading are strongly connected with any term occurrences in the associated block, with less regard for the distance between them. Based on these observations, we developed a heading-aware proximity measure and applied it to three existing proximity-aware document scoring methods: MinDist, P6, and Span. We evaluated the existing methods and our modified methods on data sets from the TREC web tracks. The results indicate that our heading-aware proximity measure outperforms simple distance in all cases, and that the method combining it with Span achieved the best performance.
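
The contrast between simple distance and a heading-aware distance can be illustrated with a minimal sketch. It is not the measure defined in the paper: the `Block` representation, the `heading_cost` parameter, and the rule that a heading term is at distance `heading_cost` from every term in its block are all assumptions made for illustration. The `min_dist` function is likewise a simplified reading of a MinDist-style feature (the smallest distance between occurrences of two different query keywords).

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical logical block, given as token positions."""
    heading_start: int  # first token position of the heading
    heading_end: int    # one past the last heading token
    block_end: int      # one past the last token of the block

    def in_heading(self, pos: int) -> bool:
        return self.heading_start <= pos < self.heading_end

    def contains(self, pos: int) -> bool:
        # The heading counts as part of its own block.
        return self.heading_start <= pos < self.block_end


def simple_distance(p: int, q: int) -> int:
    """Plain positional distance between two term occurrences."""
    return abs(p - q)


def heading_aware_distance(p: int, q: int, blocks: List[Block],
                           heading_cost: int = 0) -> int:
    """Assumed rule: if one occurrence is in the heading of a block that
    contains the other, the distance is at most heading_cost."""
    for b in blocks:
        if (b.in_heading(p) and b.contains(q)) or \
           (b.in_heading(q) and b.contains(p)):
            return min(heading_cost, abs(p - q))
    return abs(p - q)


def min_dist(occurrences: Dict[str, List[int]],
             blocks: List[Block]) -> Optional[int]:
    """Smallest heading-aware distance between occurrences of two
    different query keywords; None if no such pair exists."""
    return min(
        (heading_aware_distance(p, q, blocks)
         for ps, qs in combinations(occurrences.values(), 2)
         for p in ps for q in qs),
        default=None,
    )


# One block: heading at positions 0-1, body up to position 49.
blocks = [Block(heading_start=0, heading_end=2, block_end=50)]
# "kyoto" occurs in the heading, "hotel" deep inside the block body.
occurrences = {"kyoto": [0], "hotel": [40]}
```

In this example the simple distance between the two keywords is 40, while the heading-aware distance is 0, because "kyoto" appears in the heading of the block that contains "hotel". Two occurrences that both lie in the body, or an occurrence outside the block, fall back to the plain positional distance.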

Full Text: pdf

BibTeX entry

Experiment data

Keywords

logical proximity; proximity search; hierarchical structure; document structure; sectional structure; heading structure; hierarchical headings

Published in DBSJ Journal, Vol.14, No.2, pp.1-6, 2016


tajima@i.kyoto-u.ac.jp / Fax: +81(Japan) 75-753-5978 / Office: Research Bldg. #7, room 404