2.2 Cocitation Algorithm
An alternative approach for nding related pages is to examine the siblings of a starting node u
in the web graph. Two nodes are if they have a common parent. The number of common
parents of two nodes is their . As an alternative to the Companion algorithm, we
have developed a very simple algorithm that determines related pages by looking for sibling nodes
with the highest degree of co-citation. The Cocitation algorithm rst chooses up to B arbitrary
parents of u. For each of these parents p, it adds to a set S up to BF children of p that surround
the link from p to u. The elements of S are siblings of u. For each node s in S, we determine
the degree of cocitation of s with u. Finally, the algorithm returns the 10 most frequently cocited
nodes in S as the related pages.
In some cases there is an insucient level of cocitation with u to provide meaningful results. In
our implementation, if there are not at least 15 nodes in S that are cocited with u at least twice,
then we restart the algorithm using the node corresponding to u's URL with one path element
removed. For example, if u's URL is a.com/X/Y/Z and an insucient number of cocited nodes
exist for this URL, then we restart the algorithm with the URL a.com/X/Y (if the resulting URL
is invalid, we continue to chop elements until we are left with just a host name, or we nd a valid
URL).
In our implementation, we chose B to be 2000 and BF to be 8 (the same parameter values we
used for our implementation of the Companion algorithm).
One way of looking at the Cocitation algorithm is that it nds \maximal" n X 2 bipartite
subgraphs in the vicinity graph.
2.3 Netscape's Approach
Netscape introduced a new \What's Related?" feature in version 4.06 of the Netscape Communicator browser. Details about the approach used to identify related pages in their algorithm are
sketchy. However, the What's Related FAQ page indicates that the algorithm uses connectivity
information, usage information, and content analysis of the pages to determine relationships. We
quote from the \What's Related" FAQ page:
The What's Related data is created by Alexa Internet. Alexa uses crawling, archiving, categorizing and data mining techniques to build the related sites lists for millions of web URLs.
For example, Alexa uses links on the crawled pages to nd related sites. The day-to-day use of
What's Related also helps build and rene the data. As the service is used, the requested URLs
are logged. By looking at high-level trends, Netscape and Alexa can deduce relationships between
web sites. For example, if thousands of users go directly from site A to site B, the two sites are
likely to be related.
Next, Alexa checks all the URLs to make sure they are live links. This process removes links that
would try to return pages that don't exist (404 errors), as well as any links to servers that aren't
available to the general Internet population, such as servers that are no longer active or are
behind rewalls. Finally, once all of the relationships are established and the links are checked,
the top ten related sites for each URL are chosen by looking at the strength of the relationship
between the sites.
Each month, Alexa recrawls the web and rebuilds the data to pull in new sites and to rene
the relationships between the existing sites. New sites with strong relationships to a site will
automatically appear in the What's Related list for that site by displacing any sites with weaker
relationships.
Note that since the relationships between sites are based on strength, What's Related lists are not
necessarily balanced. Site A may appear in the list for Site B, but Site B may not be in the list
6