In the leave-one-page-out experiment, when extraction performance is tested on a page p_i, the learned extraction program has no knowledge whatsoever of p_i itself. However, it may well be the case that the learner has seen examples of only slightly different pages--for instance, pages from the same Web site, or Web pages from different sites that present similar information. So it is still possible that the learned extraction heuristics are to some extent specialized to the benchmark problems from which they were generated, and would work poorly in a novel application domain.
We explored this issue in several ways. The 84 benchmark problems we
consider were taken from four different demonstrations of WHIRL: one
integrating information on North American birds (birds), one
concerning computer games for children (games), one concerning
movies and movie reviews (movies), and one concerning news
stories and company information (news). In the middle section of
Table
, we give the performance in the
leave-one-page-out experiments in each individual domain. Performance
seems to be roughly comparable (see footnote 7)
on all domains, a first indication that the learned extraction
heuristics are not highly domain-specific.
We explored this issue further by conducting two variants of the
leave-one-page-out experiment. The first variant is a
``leave-one-domain-out'' experiment. Here we group the pages by
domain, and for each domain, test performance of the extraction
heuristics obtained by training on the other three domains. If the
extraction heuristics were domain-specific, then one would expect to
see markedly worse performance; in fact, the performance degrades only
slightly. (Note also that less training data is available in the
``leave-one-domain-out'' experiments, another possible cause of
degraded performance.) These results are shown in the leftmost section of
Table
.
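
The leave-one-domain-out grouping described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `leave_one_domain_out`, the representation of pages as `(page_id, domain)` pairs, and the toy page identifiers are all assumptions made for the example.

```python
def leave_one_domain_out(pages):
    """Yield (held_out_domain, train_pages, test_pages), one triple per domain.

    `pages` is a list of (page_id, domain) pairs. This representation is
    hypothetical and chosen only for illustration.
    """
    domains = sorted({domain for _, domain in pages})
    for held_out in domains:
        # Train on every page outside the held-out domain.
        train = [page for page, domain in pages if domain != held_out]
        # Test on every page inside the held-out domain.
        test = [page for page, domain in pages if domain == held_out]
        yield held_out, train, test


# Toy data: the four domain names come from the text; the page ids are invented.
toy_pages = [
    ("b1", "birds"), ("b2", "birds"),
    ("g1", "games"),
    ("m1", "movies"),
    ("n1", "news"),
]
domain_splits = list(leave_one_domain_out(toy_pages))
```

Each training set here is smaller than in the plain leave-one-page-out setting, since an entire domain is withheld at once; this matches the caveat noted above about reduced training data.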
The second variant is presented in the rightmost section of
Table
, labeled as the ``intra-domain
leave-one-page-out'' experiment. Here we again group the pages by
domain, and perform a separate leave-one-page-out experiment for each
domain. Thus, in this experiment the extraction heuristics tested for
page p_i are learned from only the most similar pages--the pages
from the same domain. In this variant, one would expect a marked improvement in performance if the learned extraction heuristics
were very domain- or site-specific. In fact, there is little change.
These experiments thus support the conjecture that the learned
extraction heuristics are in fact quite general.
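
For comparison, the intra-domain leave-one-page-out scheme might be sketched as below. Again this is only an illustrative sketch under assumed names: `intra_domain_leave_one_page_out`, the `(page_id, domain)` pair representation, and the toy page identifiers do not come from the experiments themselves.

```python
from collections import defaultdict


def intra_domain_leave_one_page_out(pages):
    """Yield (domain, train_pages, held_out_page), one triple per page.

    `pages` is a list of (page_id, domain) pairs (a hypothetical
    representation). Training pages are drawn only from the held-out
    page's own domain.
    """
    by_domain = defaultdict(list)
    for page_id, domain in pages:
        by_domain[domain].append(page_id)
    for domain, members in by_domain.items():
        for held_out in members:
            # Train on the other pages of the same domain only.
            train = [page for page in members if page != held_out]
            yield domain, train, held_out


# Toy data: page ids are invented for illustration.
toy_pages = [
    ("b1", "birds"), ("b2", "birds"), ("b3", "birds"),
    ("g1", "games"), ("g2", "games"),
]
intra_folds = list(intra_domain_leave_one_page_out(toy_pages))
```

Note that in this sketch a domain containing a single page would yield an empty training set, so such a fold would have to be skipped or handled separately.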
We also explored using classification learners other than RIPPER.
Table
shows the results for the same set of
experiments using CART, a widely used decision tree
learner (see footnote 8). CART achieves performance
generally comparable to RIPPER. We also explored using C4.5
[17] and an implementation of Naive Bayes; however,
preliminary experiments suggested that their performance was somewhat
worse than that of both RIPPER and CART.