Normally the search engine's databases of Web pages are built and updated automatically by Web crawlers. When one searches the Web using a search engine, one is not searching the entire Web; one is only searching the database that the search engine has compiled.
Search Engine Architecture
A typical search engine architecture consists of many components, including the following three major ones.
1. The crawler and the indexer: It collects pages from the Web, and creates and maintains the index.
2. The user interface: It allows users to submit queries and presents the results.
3. The database and the query server: It stores information about the Web pages, processes queries, and returns results.
All search engines essentially include a crawler, an indexer and a query server, although the algorithms used in these components, and their quality, may vary significantly.
The crawler is an application program that carries out a task similar to graph traversal. It is given a set of starting URLs from which it automatically traverses the Web, retrieving pages, initially from the starting set, and following their links.
Since a crawler has a relatively simple task, it is not CPU-bound but bandwidth-bound: in crawling, network bandwidth can become the bottleneck.
Crawling the Web
A Web crawler starts with a given set of URLs and fetches those pages. This continues until no new pages are found or a threshold is reached. While the crawler is fetching pages, new pages are being put on the Web and old pages are being deleted or modified. Therefore, a crawler could potentially continue finding new or modified pages forever.
Crawlers follow an algorithm like the following:
• Find base URLs: collect a set of known, working hyperlinks.
• Build a queue: put the base URLs in the queue and add new URLs to the queue as more are discovered.
• Retrieve the next page: retrieve the next page in the queue, process it, and store it in the search engine database.
• Add to the queue: check whether the outgoing links of the current page have already been processed, and add the new ones to the queue.
• Continue the process until some stopping criteria are met.
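The steps above can be sketched as a breadth-first traversal over a link graph. The sketch below replaces real HTTP fetching and HTML link extraction with a small hypothetical in-memory "web" so that it is self-contained; a real crawler would fetch each URL over the network.

```python
from collections import deque

# A toy in-memory "web": URL -> list of outgoing links (hypothetical data,
# standing in for real HTTP fetches and link extraction from HTML).
TOY_WEB = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example", "http://d.example"],
    "http://d.example": [],
}

def crawl(seed_urls, max_pages=100):
    """Traverse the link graph breadth-first, starting from the seed URLs."""
    queue = deque(seed_urls)        # build a queue of base URLs
    visited = set()                 # pages already processed
    stored = []                     # stands in for the search engine database
    while queue and len(stored) < max_pages:   # stopping criteria
        url = queue.popleft()       # retrieve the next page in the queue
        if url in visited or url not in TOY_WEB:
            continue
        visited.add(url)
        stored.append(url)          # "process and store" the page
        for link in TOY_WEB[url]:   # add unseen outlinks to the queue
            if link not in visited:
                queue.append(link)
    return stored

print(crawl(["http://a.example"]))
```

Using a FIFO queue gives breadth-first order; a real crawler would also respect robots.txt and rate-limit requests per host.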
An index is essential to reduce the cost of query evaluation. These indexes can be built while the crawler is collecting the Web pages, or this could be done later. Indexing can be a very resource-intensive process and is often CPU-bound, since it involves a lot of analysis of each page.
Building an index requires document analysis and term extraction. Term extraction may involve extracting all the words from each page, elimination of stop words (common words like “the”, “it”, “and”, “that”) and stemming (transforming words like “computer”, “computing” and “computation” into one word, say “computer”, since there is little point in treating them as different words, and similarly transforming “find” and “found” into one word) to reduce the size of index structures such as the inverted file. It may also involve analysis of hyperlinks.
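The term-extraction and indexing steps just described can be sketched as follows. The stop-word list is abbreviated, and the suffix-stripping stemmer is a crude illustrative stand-in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "it", "and", "that", "a", "is", "in"}   # abbreviated list

def stem(word):
    # Crude suffix stripping: maps "computer", "computing" and "computation"
    # toward a shared stem. A real engine would use a proper stemmer.
    for suffix in ("ation", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    words = re.findall(r"[a-z]+", text.lower())      # extract all the words
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_inverted_index(pages):
    """pages: doc id -> page text; returns term -> set of doc ids."""
    index = {}
    for doc_id, text in pages.items():
        for term in extract_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

pages = {1: "computing the answer", 2: "the computer and that computation"}
index = build_inverted_index(pages)
print(index["comput"])   # both documents share the stem "comput"
```

Note that stop-word removal and stemming happen before the index is built, so the same transformation must be applied to query keywords at search time.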
The Query Server
First of all, a search engine needs to receive the query and check the spelling of the keywords that the user has typed. If the search engine cannot recognize the keywords as words in the language or as proper nouns, it is desirable to suggest alternative spellings to the user. Once the keywords are found to be acceptable, the query may need to be transformed. Queries are resolved using the inverted index. Consider the example query “Cat Mat Hat”. This is evaluated as follows:
• Select a word from the query (say, “Cat”)
• Retrieve the inverted list from disk for the word
• Process the list. For each document the word occurs in, add weight to an accumulator for that document based on the TF, IDF, and document length
• Repeat for each word in the query
• Find the best-ranked documents with the highest weights
• Look up the documents in the mapping table
• Retrieve and summarize the documents, and present to the user
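A minimal sketch of this accumulator-based evaluation, using a hypothetical in-memory inverted index and a simple TF-IDF weighting normalized by document length (the exact weighting formula varies from engine to engine):

```python
import math
from collections import defaultdict

# Hypothetical inverted index: term -> {doc id: term frequency in that doc}.
INDEX = {
    "cat": {1: 3, 2: 1},
    "mat": {1: 1, 3: 2},
    "hat": {2: 2},
}
DOC_LENGTHS = {1: 100, 2: 50, 3: 80}   # document lengths, in terms
N_DOCS = len(DOC_LENGTHS)

def evaluate(query_terms, top_k=3):
    accumulators = defaultdict(float)        # one accumulator per document
    for term in query_terms:                 # select a word from the query
        postings = INDEX.get(term, {})       # retrieve its inverted list
        idf = math.log(1 + N_DOCS / (1 + len(postings)))  # rarer terms weigh more
        for doc_id, tf in postings.items():  # process the list
            # Add weight based on TF, IDF, and document length.
            accumulators[doc_id] += (tf * idf) / DOC_LENGTHS[doc_id]
    # Find the best-ranked documents with the highest weights.
    return sorted(accumulators, key=accumulators.get, reverse=True)[:top_k]

print(evaluate(["cat", "mat", "hat"]))
```

The returned document ids would then be looked up in the mapping table to retrieve and summarize the actual pages.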
A user often has to submit a query a number of times, using somewhat different keywords, before more or less the “right” result is obtained. A search engine providing query refinement based on user feedback would therefore be useful. Search engines often cache the results of a query and can then use the cached results if the refined query is a modification of a query that has already been processed.
In a database system, query processing requires that attribute values match exactly the values provided in the query. In search engine query processing, an exact match is not always necessary; a partial match or a fuzzy match may be sufficient. Essentially, a cache is supposed to intercept a user's or client's request, fetch the page the user has requested if one is not available in the cache, and then save a copy of it in the cache.
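A toy sketch of query-result caching under these assumptions. The backend function and the key normalization are hypothetical; the point is that a reworded query can still hit the cache, so the backend is consulted only once.

```python
results_cache = {}   # normalized query -> cached result list

def backend_search(query):
    # Stand-in for the real query server; counts how often it is called.
    backend_search.calls += 1
    return ["result for " + query.lower()]
backend_search.calls = 0

def cached_search(query):
    # Normalize the query so reordered or re-cased keywords share a key.
    key = " ".join(sorted(query.lower().split()))
    if key not in results_cache:          # only go to the backend on a miss
        results_cache[key] = backend_search(query)
    return results_cache[key]

cached_search("cat mat hat")
cached_search("Hat Cat Mat")    # different wording, same normalized key
print(backend_search.calls)     # the backend was consulted only once
```

Real engines use more elaborate normalization (stemming, stop-word removal) and evict stale entries, but the cache-hit logic is the same.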
Search query processing
In search, true success comes from understanding what the user is asking with their query. Some user queries are simply stated, while others are stated in a Boolean format (“apples AND oranges OR bananas”), or presented as whole paragraphs, passages, or documents with a request to “find similar” information. So the search platform must have a range of tools in order to accurately understand what is being asked.
The challenge with information retrieval revolves around two basic problems: 1) getting a good query from search users, with the aim of helping them craft better questions, and 2) presenting “easy-to-judge” results, to minimize what the user has to read through. In general, queries from the user come into the query processing and transformation subsystem. This subsystem takes the original query, analyzes it, transforms it (correcting spelling mistakes, for example) and then sends the query to the search engine.
The figure above shows the elements of query and results processing. The node in the search matrix that receives the query performs its retrieval operation and returns its raw results to the results-processing subsystem, which performs duplicate removal, merging of results from different search nodes, sorting, rank ordering, etc. All results are then sent to the search user.
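The merging, duplicate removal and rank ordering performed by the results-processing subsystem can be sketched as follows; the node names, URLs and scores are made up for illustration.

```python
def merge_results(per_node_results):
    """Merge ranked (url, score) lists from several search nodes."""
    best = {}                                  # URL -> best score seen so far
    for results in per_node_results:           # results from each search node
        for url, score in results:
            if url not in best or score > best[url]:   # duplicate removal:
                best[url] = score              # keep the highest-scoring copy
    # Rank-order the merged results by descending score.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

node_a = [("http://x.example", 0.9), ("http://y.example", 0.4)]
node_b = [("http://y.example", 0.7), ("http://z.example", 0.5)]
print(merge_results([node_a, node_b]))
```

Note that duplicates across nodes (here http://y.example) are collapsed to a single entry before the final sort, so the user sees each result once.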