Extracting Logical Hierarchical Structure of HTML Documents Based on Headings

by Tomohiro Manabe, Keishi Tajima


We propose a method for extracting the logical hierarchical structure of HTML documents. Because the mark-up structure of an HTML document does not necessarily coincide with its logical hierarchical structure, extracting the logical structure is not trivial. Human readers, however, easily understand the logical structure of such documents. The key information they use is the headings in the documents. Human readers exploit the following properties of headings: (1) headings appear at the beginning of the corresponding blocks, (2) headings are given prominent visual styles, (3) headings of the same level share the same visual style, and (4) headings of higher levels are given more prominent visual styles. Our method also exploits these properties to extract hierarchical headings and their associated blocks. Our experiment shows that our method outperforms existing methods. In addition, our method extracts not only hierarchical blocks but also their associated headings.
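
The heading properties listed in the abstract suggest how a hierarchy can be recovered once candidate headings are known. The Python sketch below is only an illustration of that idea, not the authors' method: it assumes headings have already been detected, it uses font size as a stand-in for visual prominence, and the names `Block`, `build_hierarchy`, and `outline` are hypothetical. A heading opens a block (property 1), headings with the same size land on the same level (property 3), and a larger size means a higher level (property 4).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A logical block identified by its heading (hypothetical structure)."""
    heading: str
    font_size: float  # assumed proxy for visual prominence
    children: List["Block"] = field(default_factory=list)


def build_hierarchy(headings: List[Tuple[str, float]]) -> List[Block]:
    """Nest headings, given in document order, by their prominence."""
    roots: List[Block] = []
    stack: List[Block] = []  # currently open blocks, outermost first
    for text, size in headings:
        block = Block(text, size)
        # A new heading closes every open block whose heading is
        # equally or less prominent than the new one.
        while stack and stack[-1].font_size <= size:
            stack.pop()
        if stack:
            stack[-1].children.append(block)
        else:
            roots.append(block)
        stack.append(block)
    return roots


def outline(blocks: List[Block], depth: int = 0) -> List[str]:
    """Render the hierarchy as indented heading lines."""
    lines: List[str] = []
    for b in blocks:
        lines.append("  " * depth + b.heading)
        lines.extend(outline(b.children, depth + 1))
    return lines


if __name__ == "__main__":
    sample = [("Title", 32), ("Intro", 24), ("Background", 18),
              ("Method", 24), ("Step 1", 18), ("Step 2", 18)]
    print("\n".join(outline(build_hierarchy(sample))))
```

In this toy input, "Intro" and "Method" share a size and therefore become siblings under "Title", while the smaller headings nest beneath the nearest more prominent heading before them. Real pages need a richer notion of visual style than a single number, which this sketch deliberately leaves out.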



Keywords: information extraction; data extraction; document structure; sectional structure; web pages

Published in Proc. of the VLDB Endowment, Vol. 8, No. 12, pp. 1606-1617, Kohala Coast, HI, 2015

tajima@i.kyoto-u.ac.jp / Fax: +81(Japan) 75-753-5978 / Office: Research Bldg. #7, room 404