| Performance Level | # pages reached |
| perfect | 26/84 31% |
| good ($\epsilon = 1\%$) | 33/84 39% |
| good ($\epsilon = 3\%$) | 35/84 42% |
| good ($\epsilon = 5\%$) | 41/84 49% |
| good ($\epsilon = 10\%$) | 45/84 54% |
| good ($\epsilon = 15\%$) | 47/84 56% |
| good ($\epsilon = 20\%$) | 48/84 57% |
| good ($\epsilon = 25\%$) | 48/84 57% |
We used the following method to evaluate the learned extraction heuristics. For each wrapper/Web page pair $\langle w_i, p_i \rangle$, we trained the learner on a dataset constructed from all other wrapper/page pairs: that is, from the pairs $\langle w_1, p_1 \rangle, \ldots, \langle w_{i-1}, p_{i-1} \rangle, \langle w_{i+1}, p_{i+1} \rangle, \ldots, \langle w_m, p_m \rangle$. We then tested the learned extraction heuristics on data constructed from the single held-out page $p_i$, measuring the recall and precision of the learned classifier.4
This results in 168 measurements, two for each page. Before attempting to summarize these measurements, we first present the raw data in detail. All the results of this "leave one page out" experiment (and two variants that will be described shortly) are shown in the accompanying scatterplot; here we plot for each page $p_i$ a point whose x-coordinate is recall and whose y-coordinate is precision. So that nearby points can be more easily distinguished, we added 5% noise to both coordinates.5
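The leave-one-page-out procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train`, `extract`, and `target_items` are hypothetical callables standing in for the learner, the learned heuristic, and the target wrapper's output. Treating an empty extraction as precision 1 is also an assumption, chosen to match the (0%,100%) cluster discussed below.

```python
from typing import Callable, Hashable, List, Sequence, Set, Tuple

Pair = Tuple[object, object]  # (wrapper w_i, page p_i)


def precision_recall(extracted: Set[Hashable],
                     target: Set[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) of the extracted items against the target."""
    correct = len(extracted & target)
    # Assumed convention: if nothing is extracted, precision is taken to be 1.
    precision = correct / len(extracted) if extracted else 1.0
    recall = correct / len(target) if target else 1.0
    return precision, recall


def leave_one_page_out(
    pairs: Sequence[Pair],
    train: Callable[[List[Pair]], object],            # hypothetical learner
    extract: Callable[[object, object], Set[Hashable]],  # hypothetical extractor
    target_items: Callable[[object, object], Set[Hashable]],  # target wrapper output
) -> List[Tuple[float, float]]:
    """Train on all pairs except the i-th, then test on the held-out page."""
    pairs = list(pairs)
    results = []
    for i, (w_i, p_i) in enumerate(pairs):
        training_pairs = pairs[:i] + pairs[i + 1:]
        heuristic = train(training_pairs)
        extracted = extract(heuristic, p_i)
        results.append(precision_recall(extracted, target_items(w_i, p_i)))
    return results
```

With 84 pages this loop yields 84 (precision, recall) tuples, that is, the 168 measurements mentioned above.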
The scatter plot shows three distinct clusters. One cluster is near the point (100%,100%), corresponding to perfect agreement with the target wrapper program. The second cluster is near (0%,100%), and usually corresponds to a test case for which no data at all was extracted.6 The third cluster is near (50%,100%) and represents an interesting type of error: for most pages in the cluster, the learned wrapper extracted the anchor nodes correctly, but incorrectly assumed that the text node was identical to the anchor node. We note that in many cases, the choice of how much context to include in the description $s_i$ of a URL $u_i$ is somewhat arbitrary, and hand-examination of a sample of these results showed that the choices made by the learned system are usually not unreasonable; therefore it is probably appropriate to consider results in this cluster as qualified successes, rather than failures.
For an information integration system like WHIRL (one that is somewhat tolerant of imperfect extraction), many of these results would be acceptably accurate. In particular, results near either the (100%,100%) or (50%,100%) clusters are probably good enough for WHIRL's purposes. In aggregating these results, we thus considered two levels of performance. A learned extraction heuristic has perfect performance on a page $p_i$ if recall and precision are both 1. An extraction heuristic has $\epsilon$-good performance on a page $p_i$ if recall and precision are both at least $1 - \epsilon$, or if precision is at least $1 - \epsilon$ and recall is at least $1/2 - \epsilon$.
The accompanying table shows the number of perfect and $\epsilon$-good results obtained in the leave-one-out experiment above. We will use $\epsilon = 5\%$ as a baseline performance threshold for later experiments; however, as the table shows, the number of $\epsilon$-good pages does not change much as $\epsilon$ is varied (because the clusters are so tight).
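The two performance levels defined above translate directly into predicates over a page's (precision, recall) pair. The sketch below is illustrative only; the function names are hypothetical, and precision and recall are assumed to be fractions in [0, 1].

```python
from typing import Dict, Sequence, Tuple


def is_perfect(precision: float, recall: float) -> bool:
    """Perfect performance: recall and precision are both 1."""
    return precision == 1.0 and recall == 1.0


def is_eps_good(precision: float, recall: float, eps: float) -> bool:
    """Epsilon-good performance, following the definition in the text."""
    high_precision = precision >= 1.0 - eps
    near_full_recall = recall >= 1.0 - eps   # near the (100%,100%) cluster
    near_half_recall = recall >= 0.5 - eps   # near the (50%,100%) cluster
    # As literally stated, the half-recall clause also covers the full-recall
    # one; both are kept here to mirror the wording of the definition.
    return high_precision and (near_full_recall or near_half_recall)


def count_good(results: Sequence[Tuple[float, float]],
               thresholds: Sequence[float]) -> Dict[float, int]:
    """Number of epsilon-good pages for each threshold."""
    return {
        eps: sum(is_eps_good(p, r, eps) for p, r in results)
        for eps in thresholds
    }
```

Calling `count_good(results, [0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25])` on the per-page measurements would produce counts in the form reported in the table.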
We believe that this sort of aggregation is more appropriate for
measuring overall performance than other common aggregation schemes,
such as measuring average precision and recall. On problems like this,
a system that finds perfect wrappers only half the time and fails
abjectly on the remaining problems is much more useful than a system
which is consistently mediocre.
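The point about aggregation can be made concrete with a small sketch. The numbers below are invented purely for illustration and are not results from the experiments: system A is perfect on half of its pages and fails on the rest, while system B is uniformly mediocre. Both have identical average precision and recall, yet only A produces any usable wrappers.

```python
from statistics import mean
from typing import Dict, List, Tuple


def is_eps_good(precision: float, recall: float, eps: float) -> bool:
    """Epsilon-good test from the text (precision and recall as fractions)."""
    return precision >= 1.0 - eps and (
        recall >= 1.0 - eps or recall >= 0.5 - eps
    )


def summarize(results: List[Tuple[float, float]],
              eps: float = 0.05) -> Dict[str, float]:
    """Compare averaged scores with the count of epsilon-good pages."""
    return {
        "mean_precision": mean([p for p, _ in results]),
        "mean_recall": mean([r for _, r in results]),
        "eps_good_pages": sum(is_eps_good(p, r, eps) for p, r in results),
    }


# Hypothetical (precision, recall) pairs, one per page.
system_a = [(1.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.5, 0.5)]  # half perfect
system_b = [(0.75, 0.75)] * 4                                # always mediocre

summary_a = summarize(system_a)
summary_b = summarize(system_b)
```

Here the averages of the two systems coincide, while the count of $\epsilon$-good pages separates them, which is the distinction the thresholded aggregation is meant to capture.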