This link has been bookmarked by 62 people. It was first bookmarked on 29 Oct 2006, by Dagang Wei.
-
02 Nov 09
-
01 Nov 09
Clayton Buss: How it works. Since we are looking at search engines this coming week, some of you might like to know how one works to find what you are looking for and then display that result!
-
The citation (link) graph of the web is an important resource that has
larg -
PageRank
-
-
-
22 Oct 09
-
The Anatomy of a Large-Scale Hypertextual Web Search Engine
-
-
09 Oct 09
Tom March: Sergey Brin and Lawrence Page's early Stanford paper about how Google works.
-
01 Oct 09
-
-
23 Sep 09
-
19 Jun 09
-
18 Jun 09
-
17 Jun 09
-
10 May 09
-
22 Apr 09
-
21 Mar 09
-
15 Mar 09
-
17 Jan 09
-
13 Dec 08
-
29 Nov 08
-
remain largely a black
art and to be advertising oriented -
makes use of the link structure
-
-
utilizes
link -
PageRank extends this idea by not counting
links from all pages equally, and by normalizing by the number of links
on a page. -
damping factor
-
a PageRank for 26 million web pages can be computed in
a few hours on a medium size workstation. -
One important variation is to only add the damping factor d
to a single page, -
This idea of propagating anchor text to the page it refers to was implemented
in the World Wide Web Worm -
We use anchor propagation mostly because
anchor text can help provide better quality results. -
technically difficult
-
has location information for all hits
-
visual presentation details
-
the standard vector space model tries
to return the document that most closely approximates the query, -
s very short documents that are the
query plus a few words. -
"Bill Clinton Sucks" and picture from a "Bill
Clinton" query. -
completely uncontrolled heterogeneous documents.
-
On
the other hand, we define external meta information as information that
can be inferred about a document, but is not contained within it. -
update frequency, quality, popularity or usage, and citations.
-
virtually no control over what people can
put on the web. -
because any text on the page
which is not directly represented to the user is abused to manipulate search
engines. -
crawling, indexing, and searching
-
several
distributed crawlers. -
URLserver t
-
Every web page has an associated ID number called
a docID -
Each document is converted into a set of word
occurrences called hits. -
a partially sorted forward
index. -
It puts the anchor text into the
forward index, -
resorts them by wordID to generate
the inverted index. -
in place
-
a list of wordIDs
and offsets into the inverted index -
this list together with the lexicon produced by the indexer
-
addressable
by 64 bit integers -
BigFiles are virtual files spanning multiple file systems
-
operating systems do not provide
enough for our needs. -
We chose zlib's speed over a significant improvement in compression
offered by bzip. -
prefixed by docID, length, and URL
-
a file which lists crawler errors
-
The information stored in each entry includes the current document status,
a pointer into the repository, -
width file called docinfo which contains its URL and title.
-
pointer points into the URLlist
-
reasonably compact data structure
-
In order to find the docID of a particular URL, the
URL's checksum is computed and a binary search is performed on the checksums
file to find its docID. -
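The lookup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the use of CRC-32 as the checksum, the in-memory lists standing in for the checksums file, and the function names are all assumptions, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    # Assumption: CRC-32 stands in for whatever checksum the real system used.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    # Sort (checksum, docID) pairs by checksum, mimicking a sorted checksums file.
    pairs = sorted((url_checksum(u), d) for u, d in url_to_docid.items())
    checksums = [c for c, _ in pairs]
    docids = [d for _, d in pairs]
    return checksums, docids


def lookup_docid(url, checksums, docids):
    # Binary search on the sorted checksums; returns None if the URL is unknown.
    c = url_checksum(url)
    i = bisect.bisect_left(checksums, c)
    if i < len(checksums) and checksums[i] == c:
        return docids[i]
    return None
```

Keeping the table sorted by checksum is what makes a batch of lookups cheap: many URLs can be resolved in one pass instead of one disk seek per URL.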
URLresolver
uses to turn URLs into docIDs. -
The current lexicon contains
14 million words (though some rare words were not added to the lexicon). -
a hash table of pointers.
-
Hit lists account for most of the space used in both the forward and the
inverted indices -
URL, title, anchor text, or meta tag.
-
For anchor hits, the 8 bits
of position are split into 4 bits for position in anchor and 4 bits for
a hash of the docID the anchor occurs in. -
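The 4/4 split described above can be illustrated with a small bit-packing sketch. The bit order (position in the high nibble), the clamping of long positions, and the use of the low four bits of the docID as the "hash" are assumptions made for illustration only.

```python
def pack_anchor_position(pos_in_anchor, source_docid):
    # Clamp the position so it fits in 4 bits (assumed behavior for long anchors).
    pos = min(pos_in_anchor, 0xF)
    # Hypothetical 4-bit hash: the low four bits of the source docID.
    doc_hash = source_docid & 0xF
    return (pos << 4) | doc_hash


def unpack_anchor_position(field):
    # Returns (position in anchor, 4-bit docID hash).
    return field >> 4, field & 0xF
```

For example, `pack_anchor_position(3, 0x12345)` yields an 8-bit value that `unpack_anchor_position` splits back into position 3 and hash 5.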
wordID in the forward
index and the docID in the inverted index. -
Each barrel holds a range of wordID's.
-
-
-
16 Nov 08
-
(Note: There are two versions of this paper -- a longer full version
and a shorter printed version. The full version is available on the
web and the conference CD-ROM.)
The web creates new challenges for information retrieval. The amount
of information on the web is growing rapidly, as well as the number of
new users inexperienced in the art of web research. People are likely to
surf the web using its link graph, often starting with high quality human
maintained indices such as Yahoo! or
with search engines. Human maintained lists cover popular topics effectively
but are subjective, expensive to build and maintain, slow to improve, and
cannot cover all esoteric topics. Automated search engines that rely on
keyword matching usually return too many low quality matches. To make matters
worse, some advertisers attempt to gain people's attention by taking measures
meant to mislead automated search engines. We have built a large-scale
search engine which addresses many of the problems of existing systems.
It makes especially heavy use of the additional structure present in hypertext
to provide much higher quality search results. We chose our system name,
Google, because it is a common spelling of googol, or 10^100
and fits well with our goal of building very large-scale search engines. -
Search engine technology has had to scale dramatically to keep up with
the growth of the web. In 1994, one of the first web search engines, the
World Wide Web Worm (WWWW) [McBryan
94] had an index of 110,000 web pages and web accessible documents.
As of November, 1997, the top search engines claim to index from 2 million
(WebCrawler) to 100 million web documents (from Search
Engine Watch). It is foreseeable that by the year 2000, a comprehensive
index of the Web will contain over a billion documents. At the same time,
the number of queries search engines handle has grown incredibly too. In
March and April 1994, the World Wide Web Worm received an average of about
1500 queries per day. In November 1997, Altavista claimed it handled roughly
20 million queries per day. With the increasing number of users on the
web, and automated systems which query search engines, it is likely that
top search engines will handle hundreds of millions of queries per day
by the year 2000. The goal of our system is to address many of the problems,
both in quality and scalability, introduced by scaling search engine technology
to such extraordinary numbers.
-
-
12 Nov 08
-
In 1994, one of the first web search engines, the
World Wide Web Worm (WWWW) [McBryan
94] had an index of 110,000 web pages and web accessible documents -
WebCrawler
-
-
the number
of documents in the indices has been increasing by many orders of magnitude,
but the user's ability to look at documents has not -
build systems that reasonable numbers
of people can actually use -
To support novel research
uses, Google stores all of the actual documents it crawls in compressed
form -
First, it makes use of the link structure of the
Web to calculate a quality ranking for each web page. This ranking is called
PageRank and is described in detail in [Page 98]. Second, Google utilizes
links to improve search results. -
PageRank extends this idea by not counting
links from all pages equally, and by normalizing by the number of links
on a page. PageRank is defined as follows: -
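The definition itself is not captured in this annotation. As a rough sketch, assuming the formula as it is commonly quoted from the paper, PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages linking to A, C(T) is the number of links going out of T, and d is the damping factor, a simple iterative computation might look like the following. The function name, the `links` dictionary, the default d = 0.85, and the fixed iteration count are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Collect, for every page, the pages that link to it.
    inbound = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each linking page q passes on its rank divided by its link count.
            total = sum(rank[q] / len(links[q]) for q in inbound[p])
            new_rank[p] = (1 - d) + d * total
        rank = new_rank
    return rank
```

With `links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, page C ends up ranked above page B, because C receives rank from both A and B while B receives only half of A's.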
PageRank can be thought of as a model of user behavior
-
The text of links is treated in a special way in our search engine. Most
search engines associate the text of a link with the page that the link
is on -
The indexer
performs a number of functions. It reads the repository, uncompresses the
documents, and parses them -
It parses
out all the links in every web page and stores important information about
them in an anchors file. -
Google is designed
to avoid disk seeks whenever possible, and this has had a considerable
influence on the design of the data structures. -
The document index keeps information about each document.
-
A hit list corresponds to a list of occurrences of a particular word in
a particular document including position, font, and capitalization information -
A major performance stress is DNS lookup. Each crawler maintains
its own DNS cache so it does not need to do a DNS lookup before crawling
each document. -
For maximum speed, instead of using YACC to generate a CFG parser, we use
flex to generate a lexical analyzer which we outfit with its own stack. -
Google maintains much more information about web documents than typical
search engines. Every hitlist includes position, font, and capitalization
information. Additionally, we factor in hits from anchor text and the PageRank
of the document. Combining all of this information into a rank is difficult. -
Some simple improvements to efficiency
include query caching, smart disk allocation, and subindices.
-
-
-
21 Oct 08
-
21 Sep 08
Nick Cowie: Sergey Brin & Lawrence Page on the prototype of Google while still at Stanford
-
13 Aug 08
Lisa Spiro: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. -
08 Aug 08
-
07 Aug 08
-
28 Jul 08
-
24 Jul 08
-
20 May 08
-
09 May 08
-
05 May 08
-
18 Apr 08
-
09 Apr 08
-
Human maintained lists cover popular topics effectively
but are subjective, expensive to build and maintain, slow to improve, and
cannot cover all esoteric topics -
rely on
keyword matching -
-
-
low quality matches
-
advertisers attempt
-
mislead automated search engines
-
Our main goal is to improve the quality of web search engines
-
make it easy to find almost anything on the Web (once all the data is entered)
-
"Junk results"
often wash out any results that a user is interested in -
People are still only
willing to look at the first few tens of results -
two important features that help it produce
high precision results -
First,
-
link structure
-
PageRank
-
Second
-
objective measure of its citation importance that corresponds well with
people's subjective idea of importance -
page can have a high PageRank
if there are many pages that point to it -
or if there are some pages that
point to it and have a high PageRank -
Some argue that on the web, users should specify more accurately
what they want and add more words to their query. We disagree vehemently
with this position. If a user issues a query like "Bill Clinton" they should
get reasonable results since there is an enormous amount of high quality
information available on this topic. Given examples like these, we believe
that the standard information retrieval work needs to be extended to deal
effectively with the web. -
There are even numerous companies which specialize in manipulating
search engines for profit. -
web crawling (downloading of web pages)
-
documents are stored
one after the other and are prefixed by docID, length, and URL -
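A sequential record layout of this kind can be sketched as below. The field widths, the byte order, the choice of UTF-8, and the function names are assumptions for illustration; only the idea of a docID, length, and URL prefix in front of each compressed document comes from the text.

```python
import struct
import zlib

# Assumed header layout: 64-bit docID, 32-bit URL length, 32-bit compressed length.
HEADER = struct.Struct("<QII")


def append_record(buf, doc_id, url, html):
    # Append one record to a bytearray: header, then URL, then compressed page.
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    buf.extend(HEADER.pack(doc_id, len(url_bytes), len(body)))
    buf.extend(url_bytes)
    buf.extend(body)


def read_records(buf):
    # Walk the buffer from the start, yielding (docID, URL, page text) tuples.
    offset = 0
    while offset < len(buf):
        doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        url = bytes(buf[offset:offset + url_len]).decode("utf-8")
        offset += url_len
        html = zlib.decompress(bytes(buf[offset:offset + body_len])).decode("utf-8")
        offset += body_len
        yield doc_id, url, html
```

Because every record carries its own lengths, the whole store can be read front to back without any separate index, which is what allows other data structures to be rebuilt from it.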
The document index keeps information about each document
-
A hit list corresponds to a list of occurrences of a particular word in
a particular document including position, font, and capitalization information -
Plain hits include everything else
-
Fancy hits include hits occurring in
a URL, title, anchor text, or meta tag -
The goal of searching is to provide quality search results efficiently
-
Every hitlist includes position, font, and capitalization
information -
factor in hits from anchor text and the PageRank
of the document -
The biggest problem facing users of web search engines today is the quality
of the results they get back -
Google is a research
tool.
-
-
-
10 Feb 08
Concepción Abraira Fernández: Article presenting the Google project, from around 1997/98
-
09 Oct 07
-
01 Oct 07
Chris Chesher: Original article on Google technology
arin2610 Wk10 google architecture search engine design web history pagerank fromdelicious
-
09 Aug 07
-
In this section, we will give a high level overview of how the whole system
works as pictured in Figure 1. Further sections will discuss the applications
and data structures not mentioned in this section. Most of Google is implemented
in C or C++ for efficiency and can run in either Solaris or Linux.
In Google, the web crawling (downloading of web pages) is done by several
distributed crawlers. There is a URLserver that sends lists of URLs to
be fetched to the crawlers. The web pages that are fetched are then sent
to the storeserver. The storeserver then compresses and stores the web
pages into a repository. Every web page has an associated ID number called
a docID which is assigned whenever a new URL is parsed out of a web page.
The indexing function is performed by the indexer and the sorter. The indexer
performs a number of functions. It reads the repository, uncompresses the
documents, and parses them. Each document is converted into a set of word
occurrences called hits. The hits record the word, position in document,
an approximation of font size, and capitalization. The indexer distributes
these hits into a set of "barrels", creating a partially sorted forward
index. The indexer performs another important function. It parses
out all the links in every web page and stores important information about
them in an anchors file. This file contains enough information to determine
where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into
absolute URLs and in turn into docIDs. It puts the anchor text into the
forward index, associated with the docID that the anchor points to. It
also generates a database of links which are pairs of docIDs. The links
database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID (this is a simplification,
see Section 4.2.5), and resorts them by wordID to generate
the inverted index. This is done in place so that little temporary space
is needed for this operation. The sorter also produces a list of wordIDs
and offsets into the inverted index. A program called DumpLexicon takes
this list together with the lexicon produced by the indexer and generates
a new lexicon to be used by the searcher. The searcher is run by a web
server and uses the lexicon built by DumpLexicon together with the inverted
index and the PageRanks to answer queries.
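The sorter step described above, turning a docID-ordered forward index into a wordID-ordered inverted index, can be sketched in a few lines. This is a simplified in-memory illustration under stated assumptions: the nested-dictionary layout and the `invert` function are hypothetical, and unlike the system described, the sort here is not done in place.

```python
from collections import defaultdict


def invert(forward_index):
    """forward_index maps docID -> {wordID: [hit positions]}."""
    postings = []
    for doc_id, words in forward_index.items():
        for word_id, hits in words.items():
            postings.append((word_id, doc_id, hits))

    # Re-sort by wordID (then docID) instead of the original docID order.
    postings.sort(key=lambda p: (p[0], p[1]))

    inverted = defaultdict(list)
    for word_id, doc_id, hits in postings:
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)
```

Given `{2: {10: [0, 5], 11: [1]}, 1: {10: [3]}}`, the result lists wordID 10 with documents 1 and 2 in docID order, followed by wordID 11 with document 2.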
-
-
31 May 07
-
24 May 07
-
21 May 07
-
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
-
-
14 Apr 07
-
31 Mar 07
-
29 Mar 07
-
19 Mar 07
-
30 Jan 07
-
12 Dec 06
-
30 Oct 06
-
29 Oct 06
