Search Engine Architecture
The indexing process
Text Acquisition
Crawlers / Feeds
A crawler discovers and acquires documents for the search engine.
Conversion
Documents found by a crawler must be converted from a variety of formats (HTML, XML, PDF, DOC, etc.) into a consistent text-plus-metadata format.
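As an illustration of conversion for the HTML case only, the following minimal sketch uses Python's standard html.parser module to reduce a page to plain text plus a small metadata record. The names TextAndTitleExtractor and convert_html, and the choice of the title as the only metadata field, are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextAndTitleExtractor(HTMLParser):
    """Collects visible text and the <title> content from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def convert_html(html):
    """Return the plain text and a metadata record (hypothetical layout)."""
    parser = TextAndTitleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "metadata": {"title": parser.title.strip()},
    }
```

Other formats such as PDF or DOC would need their own converters; only the output shape would stay the same.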
Document data store
Acquired documents are stored in the document data store.
Text Transformation
Parser
Document text is tokenized. Tokens are words or expressions.
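A minimal sketch of tokenization, assuming lowercased English text and treating runs of letters and digits (with an optional apostrophe suffix) as tokens. The pattern and the function name tokenize are illustrative choices, not a fixed rule.

```python
import re

# A token is a run of letters/digits, optionally followed by an apostrophe part ("isn't").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())
```

For example, tokenize("The crawler's queue isn't empty.") yields the, crawler's, queue, isn't, empty.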
Stopping
Ignoring (stopping) some common words (the, of, etc.) reduces the size of the search engine index.
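Stopping can be sketched as a simple filter over the token list. The stop word list below is a tiny made-up sample; real lists are longer and depend on the language and collection.

```python
# A small illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = frozenset({"the", "of", "a", "an", "and", "to", "in"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in stop_words]
```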
Stemming
Stemming replaces words derived from a common stem with that stem.
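To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other established algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen for the example.

```python
# Suffixes are tried in this order; longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, indexing, indexed, and indexes all map to index, so a query for any of them can match the others.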
Link extraction and analysis
Links can be used for rating the popularity of the linked documents.
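One simple popularity signal is the number of distinct documents linking to a page. The sketch below counts inbound links from a hypothetical outlinks mapping (source document to the documents it links to); it is a much simpler stand-in for link analysis algorithms such as PageRank.

```python
from collections import Counter


def inlink_counts(outlinks):
    """Count, for each target, how many distinct other documents link to it."""
    counts = Counter()
    for source, targets in outlinks.items():
        for target in set(targets):  # count each source at most once per target
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts
```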
Information extraction
Additional information (named entities such as person or company names, dates, and locations) can be extracted from the text for advanced uses.
Classifier
A classifier identifies class-related metadata (for example, topic labels) for documents or parts of documents.
Index Creation
Document statistics
Document statistics (statistical information about word features and documents) are gathered for use by the ranking component.
Weighting
A word's weight reflects its relative importance in documents.
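A common family of weighting schemes is tf-idf, which combines how often a term occurs in a document with how rare it is across the collection. The sketch below uses one simple variant (relative term frequency times the natural log of N over document frequency); the function name and the input layout are assumptions for this example.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs maps doc_id -> list of tokens; returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # each document counts once per term

    weights = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights
```

A term that occurs in every document gets weight 0 under this variant, which matches the intuition that it does not help distinguish documents.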
Inversion
Inversion creates terms -> documents information (an inverted file or inverted index) from documents -> terms information.
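The inversion step can be sketched as follows, assuming the documents -> terms information is held in a dictionary from document ID to token list. The function name invert and the in-memory layout are illustrative; a real index would also store positions or weights and live on disk.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn doc_id -> terms into term -> sorted list of doc_ids."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):
        for term in set(doc_terms[doc_id]):  # one posting per (term, document)
            index[term].append(doc_id)
    return dict(index)
```

Because documents are visited in sorted order, each posting list comes out sorted by document ID.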
Index distribution
The index can be distributed across multiple computers.
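One possible way to distribute an index is to partition it by document, assigning each document to a shard with a stable hash. The sketch below assumes this document-partitioning strategy (partitioning by term is another option) and uses zlib.crc32 so that the assignment is the same on every machine and run.

```python
import zlib


def shard_for(doc_id, n_shards):
    """Pick a shard for a document using a stable hash of its ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


def partition(doc_ids, n_shards):
    """Group document IDs into n_shards lists, one per machine."""
    shards = [[] for _ in range(n_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, n_shards)].append(doc_id)
    return shards
```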
The query process
User Interaction
Query input
Query input provides an interface and a parser for a query language.
Query transformation
The initial query can be transformed to improve it (spell checking, query suggestion, query expansion).
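As a small example of one such transformation, the sketch below corrects misspelled query words against a known vocabulary using difflib.get_close_matches from the Python standard library. The function name correct_query, the vocabulary, and the 0.8 similarity cutoff are assumptions; production spell checkers typically also use query logs and context.

```python
import difflib


def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown word with its closest vocabulary entry, if any."""
    known = set(vocabulary)
    candidates = sorted(known)
    corrected = []
    for word in query.lower().split():
        if word in known:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```

Words with no sufficiently similar vocabulary entry are left unchanged rather than guessed at.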
Result output
Result output is a list of ranked documents. It may include snippets of the retrieved documents, highlighting of important words or passages in documents, clustering of related groups of documents, or related advertising.
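Snippet generation with highlighting can be sketched as cutting a window of text around the first query-term hit and marking every hit inside it. The asterisk markers, the window width, and the name make_snippet are illustrative assumptions; the window may also cut words at its edges.

```python
import re


def make_snippet(text, query_terms, width=60):
    """Return a window around the first hit, with hits wrapped in asterisks."""
    if not query_terms:
        return text[:width]
    alternatives = "|".join(re.escape(term) for term in query_terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    match = pattern.search(text)
    # Center the window on the first hit, or start at the beginning if there is none.
    start = max(0, match.start() - width // 2) if match else 0
    window = text[start:start + width]
    return pattern.sub(lambda m: "*" + m.group(0) + "*", window)
```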
Ranking
Scoring
The scoring component calculates scores for documents using the ranking algorithm.
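A minimal scoring sketch, assuming a very simple ranking algorithm in which a document's score is the sum of the weights of the query terms it contains. The weighted_index layout (term -> {doc_id: weight}) and the function name are hypothetical; real ranking algorithms such as BM25 are more elaborate.

```python
from collections import defaultdict


def score_documents(query_terms, weighted_index):
    """Sum per-term weights for each document and return a ranked list."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in weighted_index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```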
Performance optimization
Ranking and index creation can be optimized in various ways.
Distribution
Ranking can be distributed in a way similar to index creation.
Evaluation
Logging
Logs store users' queries, clicks, etc. This information can be used for spell checking, query suggestions, and other tasks.
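As one illustration of using logged queries, the sketch below suggests completions for a typed prefix by ranking past queries by frequency. The log is assumed to be a plain list of query strings, and the name suggest is made up for this example.

```python
from collections import Counter


def suggest(prefix, query_log, limit=3):
    """Return the most frequent logged queries that start with the prefix."""
    counts = Counter(query.strip().lower() for query in query_log)
    prefix = prefix.lower()
    candidates = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
    # Most frequent first; ties are broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in candidates[:limit]]
```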
Ranking analysis
The effectiveness of the ranking algorithm can be measured using log data or other information.
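One widely used effectiveness measure is precision at k: the fraction of the top k results that are relevant. In the sketch below, the set of relevant documents is assumed to be given; it could come from human judgments or be approximated from click logs.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k
```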
Performance analysis
Performance analysis involves monitoring and improving overall engine performance.