We propose a method for extracting the logical hierarchical structure
of HTML documents. Because the mark-up structure of an HTML document
does not necessarily coincide with its logical hierarchical structure,
extracting the logical structure is not trivial. Human readers,
however, easily understand this logical structure. The key
information they use is the headings in the documents. Human readers
exploit the following properties of headings: (1) headings appear at
the beginning of the corresponding blocks, (2) headings are given
prominent visual styles, (3) headings of the same level share the same
visual style, and (4) headings of higher levels are given more
prominent visual styles. Our method exploits the same properties
to extract hierarchical headings and their associated blocks. Our
experiment shows that our method outperforms existing methods. In
addition, our method extracts not only hierarchical blocks but also
their associated headings.
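
The four heading properties can be illustrated with a minimal sketch. This is not the method proposed in the paper; it only shows how the properties jointly determine a hierarchy under strong simplifying assumptions. It assumes each text item already carries a single numeric prominence score (for example, a font size), that any item more prominent than body text is a heading, and that equal scores mean equal levels. The names `Block`, `build_hierarchy`, and `body_prominence` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A heading together with the block it introduces (hypothetical type)."""
    heading: str
    prominence: float
    body: List[str] = field(default_factory=list)
    children: List["Block"] = field(default_factory=list)


def build_hierarchy(items: List[Tuple[str, float]],
                    body_prominence: float) -> Block:
    """Nest a flat list of (text, prominence) items into hierarchical blocks.

    Assumption: prominence is one number per item; real pages need a much
    richer notion of visual style than this.
    """
    # The root stands for the whole document and outranks every heading.
    root = Block(heading="(document)", prominence=float("inf"))
    stack = [root]
    for text, prominence in items:
        # Property 2: headings are more prominent than ordinary text.
        if prominence <= body_prominence:
            stack[-1].body.append(text)
            continue
        # Properties 3 and 4: a heading closes every open block whose
        # heading has the same or a lower level (same or lower prominence).
        while len(stack) > 1 and stack[-1].prominence <= prominence:
            stack.pop()
        # Property 1: the heading opens a new block at its beginning.
        block = Block(heading=text, prominence=prominence)
        stack[-1].children.append(block)
        stack.append(block)
    return root
```

For instance, the items `[("Intro", 24), ("text", 12), ("Background", 18), ("Method", 24)]` with `body_prominence=12` yield two top-level blocks, "Intro" and "Method", with "Background" nested under "Intro".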
Keywords: information extraction; data extraction; document structure;
sectional structure; web pages
Published in Proc. of VLDB Endowment, Vol.8, No.12, pp.1606-1617, Kohala Coast, HI, 2015