This link has been bookmarked by 4 people. It was first bookmarked on 23 Mar 2009, by alfred westerveld.
-
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them in some central data store.
-
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
-
The indexer. Reads the stuff the parser parsed, and creates inverted indexes of the terms found on the web pages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc.
-
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best?
-
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
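
To make the division of labour concrete, here is a minimal Python sketch of the five parts talking to each other. Everything in it is an assumption made for illustration: FAKE_WEB is a hypothetical in-memory stand-in for real HTTP fetching, the names PageParser, crawl, build_index, rank and search are invented here, and the ranking is plain TF-IDF, which is only one simple choice and not what any particular engine uses.

```python
import math
import re
from collections import Counter, defaultdict, deque
from functools import lru_cache
from html.parser import HTMLParser

# Hypothetical stand-in for the web: URL -> raw HTML.
# A real crawler would fetch these over HTTP instead.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> apple pie recipe',
    "http://b.example/": '<a href="http://c.example/">c</a> apple apple orchard',
    "http://c.example/": "banana bread recipe",
}


class PageParser(HTMLParser):
    """Parser: keep the visible text and outgoing links, throw away the markup."""

    def __init__(self):
        super().__init__()
        self.links, self.chunks = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.chunks.append(data)


def crawl(seed):
    """Crawler: breadth-first walk that follows the links the parser suggests."""
    store, queue, seen = {}, deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        html = FAKE_WEB.get(url)
        if html is None:  # unknown page, nothing to fetch
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()  # flush any text still buffered by the parser
        text = " ".join(parser.chunks).lower()
        store[url] = re.findall(r"[a-z0-9]+", text)  # central data store: url -> tokens
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


def build_index(store):
    """Indexer: inverted index mapping term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, tokens in store.items():
        for term, count in Counter(tokens).items():
            index[term][url] = count
    return index


def rank(index, n_docs, query):
    """Ranking engine: sum tf * idf over the query terms, best score first."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for url, tf in postings.items():
            scores[url] += tf * idf
    return [url for url, _ in scores.most_common()]


STORE = crawl("http://a.example/")
INDEX = build_index(STORE)


@lru_cache(maxsize=1024)
def search(query):
    """Front end: answer a query, caching results for repeated queries."""
    return tuple(rank(INDEX, len(STORE), query))
```

With the three toy pages above, search("apple") puts b.example ahead of a.example because it mentions the term twice. A real system would swap each stage for something far heavier, but the hand-offs between the stages look much the same.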
-
Some links that may prove useful: "Agile web-crawler", a paper from Estonia (in English); the Sphinx search engine, an indexing and search API, designed for large DBs but modular and open-ended; and "Information Retrieval", a textbook about IR from Manning et al., with a good overview of how the indexes are built and the various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!
-
Xapian is another option for you. I've heard it scales better than some implementations of Lucene.