Information retrieval

Search Engine Architecture

The indexing process

Text Acquisition

Crawlers / Feeds

A crawler discovers and acquires documents for the search engine.

Conversion

Documents found by a crawler must be converted from a variety of formats (HTML, XML, PDF, DOC, etc.) into a consistent text-plus-metadata format.
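As a rough illustration of conversion, the sketch below turns an HTML page into plain text plus one metadata field (the title) using Python's standard html.parser module. The names TextAndTitleExtractor and convert_html are hypothetical, and a real converter would handle many more formats and metadata fields.

```python
from html.parser import HTMLParser


class TextAndTitleExtractor(HTMLParser):
    """Collects visible text and the <title> of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def convert_html(html):
    """Return a dict with the page title (metadata) and the body text."""
    parser = TextAndTitleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": " ".join(parser.text_parts)}


page = "<html><head><title>Hi</title></head><body><p>Hello <b>world</b></p></body></html>"
print(convert_html(page))
```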

Document data store

The acquired documents are stored in the document data store.

Text Transformation

Parser

Document text is tokenized. Tokens are words or expressions.
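A minimal tokenizer sketch, assuming that a token is simply a run of lowercase letters or digits. Real parsers also deal with markup, punctuation rules, and multi-word expressions; the function name tokenize is illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("Search engines, explained!"))
```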

Stopping

Ignoring (stopping) some common words (the, of, etc.) reduces the size of the search engine index.
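Stopping can be sketched as a simple filter over the token list. The stopword set below is a tiny made-up sample, not a standard list.

```python
# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, keeping order."""
    return [t for t in tokens if t not in stopwords]


print(remove_stopwords(["the", "size", "of", "the", "index"]))
```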

Stemming

Stemming replaces words derived from a common stem with that stem.
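To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of 3 are assumptions chosen for the example.

```python
# Suffixes are tried in this order; "es" must come before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("fishing", "fished", "fishes", "fish"):
    print(w, "->", crude_stem(w))
```

All four words above map to "fish", so a query for any of them would match documents containing the others.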

Link extraction and analysis

Links can be used for rating the popularity of the linked documents.
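One very simple popularity signal is the number of distinct documents linking to a page. The sketch below counts such in-links; it is only an illustration and much simpler than link-analysis algorithms used in practice. The function name and the input shape (source document mapped to a list of targets) are assumptions.

```python
from collections import Counter


def inlink_counts(links):
    """Count, for each document, how many other documents link to it.

    links maps a source document id to the list of ids it links to.
    Self-links and repeated links from the same source count once or not at all.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


links = {"a": ["b", "c"], "b": ["c"], "c": ["c", "a"]}
print(inlink_counts(links).most_common())
```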

Information extraction

Additional information (named entities such as person or company names, dates, and locations) can be extracted from the text for advanced uses.

Classifier

The classifier identifies related metadata for documents or parts of documents.

Index Creation

Document statistics

Document statistics (statistical information about word features and documents) are gathered for use in the ranking component.

Weighting

A word's weight reflects its relative importance in documents.
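The notes do not name a weighting scheme. As one common choice, the sketch below computes tf-idf weights (term frequency times the log of inverse document frequency). The function name and the input shape are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc id -> list of tokens. Returns doc id -> {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


docs = {"d1": ["fish", "tank"], "d2": ["fish", "fish", "food"]}
print(tf_idf(docs))
```

In this example "fish" occurs in every document, so its weight is zero, while "tank" and "food" each get a positive weight.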

Inversion

Inversion converts documents -> terms information into terms -> documents information (an inverted file or inverted index).
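A minimal sketch of inversion, assuming the input is a mapping from document id to its list of terms. The output maps each term to a sorted list of the documents containing it; the function name invert is illustrative.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn {doc id: [terms]} into {term: sorted list of doc ids}."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


doc_terms = {"d1": ["fish", "tank"], "d2": ["fish", "food"]}
print(invert(doc_terms))
```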

Index distribution

The index can be distributed across multiple computers.
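One possible way to distribute an index is to partition it by document, assigning each document to a shard with a stable hash. This is a sketch under that assumption; the notes do not specify a partitioning strategy, and the function names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document using a stable CRC32 hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Group document ids into num_shards lists, one per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


print(partition(["doc1", "doc2", "doc3", "doc4", "doc5"], 3))
```

CRC32 is used instead of Python's built-in hash() because the latter is randomized per process for strings, which would send the same document to different shards on different runs.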

The query process

User Interaction

Query input

Query input provides an interface and a parser for a query language.

Query transformation

The initial query can be transformed to improve it (spell checking, query suggestion, query expansion).
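As an example of one such transformation, the sketch below does naive spell correction: each query word that is not in the index vocabulary is replaced by its closest vocabulary word, using difflib.get_close_matches from the standard library. The function name and the 0.8 similarity cutoff are assumptions.

```python
import difflib


def correct_query(query, vocabulary):
    """Replace unknown query words with the closest word in vocabulary.

    vocabulary is a list of known index terms. Words with no close match
    are kept unchanged.
    """
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
        else:
            matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_query("serch engine", ["search", "engine", "index"]))
```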

Result output

Result output is a ranked list of documents. It may include snippets of the retrieved documents, highlighting of important words or passages, clustering of related groups of documents, or related advertising.
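Highlighting can be sketched as wrapping each query term found in a snippet with a marker. The function name highlight and the asterisk marker are assumptions; a web interface would typically emit markup instead.

```python
import re


def highlight(text, query_terms, marker="*"):
    """Surround whole-word, case-insensitive matches of query terms with marker."""
    if not query_terms:
        return text
    alternatives = "|".join(re.escape(t) for t in query_terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


print(highlight("Tropical fish need warm water", ["fish", "water"]))
```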

Ranking

Scoring

The scoring component calculates scores for documents using the ranking algorithm.
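A minimal scoring sketch, assuming that a document's score is the sum of its precomputed weights for the query terms. Actual ranking algorithms are more involved; the function name rank and the input shape are illustrative.

```python
def rank(query_terms, weights, top_k=10):
    """Score documents by summed term weights and return the top_k.

    weights maps doc id -> {term: weight}. Documents with a zero score
    are omitted. Ties are broken by document id.
    """
    scores = {}
    for doc_id, term_weights in weights.items():
        score = sum(term_weights.get(t, 0.0) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


weights = {
    "d1": {"fish": 0.5, "tank": 1.0},
    "d2": {"fish": 2.0},
    "d3": {"cat": 2.0},
}
print(rank(["fish", "tank"], weights))
```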

Performance optimization

The ranking and index creation can be optimized in various ways.

Distribution

Ranking can be distributed in a similar way to index creation.

Evaluation

Logging

Logs store users' queries, clicks, and similar events. This information can be used for spell checking, query suggestions, and other tasks.
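As a sketch of how a query log might feed query suggestions, the function below returns the most frequent logged queries that start with what the user has typed. The function name suggest and the log format (a plain list of query strings) are assumptions.

```python
from collections import Counter


def suggest(prefix, query_log, limit=3):
    """Return up to limit logged queries starting with prefix, most frequent first."""
    counts = Counter(q.strip().lower() for q in query_log)
    prefix = prefix.lower()
    matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


log = ["fish tank", "fish food", "fish tank", "cat toys"]
print(suggest("fish", log))
```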

Ranking analysis

The effectiveness of the ranking algorithm can be measured using log data or other information.
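The notes do not name a specific effectiveness measure. One widely used measure is precision at k: the fraction of the top k results that are relevant. The sketch below assumes relevance judgments are available as a set of document ids, for example derived from clicks.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked documents that are in relevant_ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranking, relevant, 2))
```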

Performance analysis

Performance analysis involves monitoring and improving overall engine performance.