OpenFTS Primer

by Oleg Bartunov (oleg@sai.msu.su), Neophytos Demetriou (neophytos@gmail.com), Teodor Sigaev (teodor@sigaev.ru), and Daniel Wickstrom (danw@rtp.ericsson.se)

1. Introduction

OpenFTS is a PostgreSQL-based full text search engine that provides advanced features such as online indexing of data, proximity based relevance ranking, multilingual support and stemming. You can check the full list of features here.

OpenFTS is Copyright 2000-2002 XWare and licensed under the GNU General Public License, version 2 (June 1991). This means you can use it and modify it in any way you want. If you choose to redistribute OpenFTS, you must do so under the terms of the GNU license.

2. Motivation

General search engines such as Google and Altavista are great for finding content on the Web, however these search tools do not have access to information about the framework of a web site such as access control. For example, users may receive a set of links from a search that they do not have access to view. It is frustrating for users to click on the link only to get an access denied page. Thus, at the level of performing an internal site wide search, an architecture-aware search tool is required. On the other hand, most of the architecture-aware search tools utilize inverted index which is very fast for searching but very slow for online update. Incremental update of inverted index is a complex engineering task while we needed something light and free.

3. Changes

IMPORTANT NOTICE: This version is incompatible with earlier versions due to changes in the base data type, the structure of the indexing tables, and the interfaces of the dictionaries.
OpenFTS is in what is likely to be one of many stages. The OpenFTS developers are experimenting with various features which should eventually result in a full-featured search engine within PostgreSQL.

The latest incarnation has more natural interface which is easier to understand. In the old system, search queries look something like the following:


      SELECT
          txt.tid,
      FROM
          txt
      WHERE
          (txt.fts_index @ '{14054652}')
and the new system uses a natural language approach that supports boolean operators and it looks like the following:

      SELECT * FROM foo WHERE titleidx @@ '(the|this)&!we';
This is quite an improvement over the previous approach. Here's a more complete list of changes in the latest version:
  1. The latest version is based on tsearch2 , a PostgreSQL contrib module, which provides the implementation of a special text data type, namely tsvector, suitable for text indexing. It uses words 'as is' without hashing to integers and provides search interface in more natural way. For example, it's possible now to test full text search from psql. More information about tsearch2 is available Tsearch2 home page.

  2. Implementations of dictionary interfaces are required to work with lexems instead of integers: lemms method instead of lemmsid, is_stoplexem instead of is_stoplemm.

  3. Better administration and maintenance API. Added:

  4. Added generic interfaces to ISpell dictionaries and Snowball stemmers. ISpell dictionaries are free and available for many languages and could be used to return base forms of a word. This is very important for inflective languages, like Russian. Snowball stemmers, available from http://snowball.tartarus.org, can be used to stem a word, i.e. to cut the word's ending and use the linguistic root for indexing and searching.

4. Installation

Prerequisities

Preparation Tasks

Make sure you have installed all headers (for example, spi.h) using gmake-install-headers during the PostgreSQL installation. If you have not installed tsearch2, please do so now. tsearch2 is under the contrib directory of the PostgreSQL source code.

Install OpenFTS

5. How it works

Indexer Configuration

First, you should configure OpenFTS. Indexer configuration is covered mostly by Search::OpenFTS::Index->init function. You should start by creating a base table that stores the indexed documents. Here's an example from the OpenFTS distribution:

    create table txt (
      tid  int not null primary key, 
      txt varchar,
      fts_index tsvector
    );
To configure your OpenFTS instance, call Search::OpenFTS::Index->init that creates the configuration and indexing tables. In our example, the call looks like:

my $idx=Search::OpenFTS::Index->init( 
        dbi=>$dbi, 
        txttid=>'txt.tid',
        tsvector_field=>'fts_index',
        ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
        ignore_headline=>[ qw(13 15 16 17 5) ],
        map=>'{ \'19\'=>[1], 18=>[1], 8=>[1], 7=>[1], 6=>[1], 5=>[1], 4=>[1], 
}',
        dict=>[
                'Search::OpenFTS::Dict::PorterEng',
                'Search::OpenFTS::Dict::UnknownDict',
        ] 
);
You have to specify the table name to be indexed together with its primary key (txttid), the indexing field (tsvector), the types of lexemes that should be ignored by the indexer (ignore_id_index) and types ignored while constructing headlines for search results (ignore_headlines), the available dictionaries (dict), and a mapping of types of lexemes to dictionaries (map) that is used for optimization and for multi-language support.

All configuration parameters are stored in a database table, fts_conf, that is created upon the invocation of the initialization function. ignore_id_index and ignore_headline can only accept types of lexemes as specified later in this document [Parser]. For example, value 13 is the type ID for an html tag, namely SYMTAG. Type IDs are also used to map lexemes to dictionaries. This is helpful for optimizing the search engine and it is also helpful for indexing multi-languages or exotic-text documents.

You can create more than one OpenFTS instance by passing a character value (a-z) for prefix upon the invocation of the initialization function. The initialization function will also create a table, fts_unknown_lexem, that stores the lexemes that are not recognized by any of the available dictionaries. Note that fts_unknown_lexem is created by Search::OpenFTS::Dict::UnknownDict dictionary, not the OpenFTS core. If it is not specified upon initialization it will not be created.

Indexing

We have already introduced the initialization function which is part of the indexing module. When a request is received for indexing a document (Search::OpenFTS::Index->index), the parser reads and converts it into a stream of lexemes. Lexemes that were marked to be ignored are filtered and the position of each lexeme is calculated and stored in the corresponding indexing table.

Search

When a query is received (function search), it is converted into a stream of lexemes and then the SQL parts are constructed (function get_sql) and combined together to form the final SQL query (function _sql). Here's how the generated SQL query for searching for the word "stars" looks like:

    SELECT 
	txt.tid,
        rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'star\'', 0 ) as pos
    FROM
	txt
    WHERE
	txt.fts_index @@ '\'star\''
    ORDER BY pos desc

Function rank is used for ranking of the search results. The first argument ('{0.1, 0.2, 0.4, 1.0}') denotes weights for words in different parts of document - weight 0.1 used for words in body, and 1.0 - in title. Other weights (0.2, 0.4) currently are not used in OpenFTS. Read Tsearch2 documentation for details.

Second argument (txt.fts_index) is the table name together with its tsvector field. Notice, that if you use several OpenFTS instances (see Indexer Configuration), for example, prefix 'a', you will see a_txt.fts_index.

Third argument is the query itself after normalization ('stars' -> 'star').

The last argument (0) is used to normalize rank for document length ( total number of indexed words ).

In the WHERE clause the statement txt.fts_index @@ '\'star\'' is the "contains" predicate which denotes whether fts_index (of data type tsvector) contains the keyword star or not. Here's a more complicated query for the words supernovae and stars:


    SELECT 
	txt.tid,
        rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 0 ) as pos
    FROM
	txt
    WHERE
	txt.fts_index @@ '\'supernova\' & \'star\''
    ORDER BY pos desc

Dictionaries

Dictionaries are required to have two methods: lemms and is_stoplexem. In the case of a real dictionary, lemms returns an array of lexemes. is_stoplexem accepts a lexeme and returns true if it is a stop word, otherwise it returns false. For example, is_stoplexem("yahoo") returns 0 since "yahoo" is not a stop word while is_stoplexem("are") returns 1 since "are" is a stop word. An optional method init, if available, can be used to initialize the dictionary.

Proximity Ranking

A prominent feature is the ability to rank documents according to the proximity between words of the search query -- this is accomplished by maintaining coordinate information of the lexemes for each indexed document. For example, given the query "full text search", OpenFTS will rank documents that contain the phrase "full text search" higher than documents in which the words "full", "text", "search" occur in different places. The ranking procedure is a C function that utilizes PostgreSQL's SPI and it can be used inside SQL queries. The ranking function uses methodology proposed by
Andrew Kovalenko and Nickolay Kharin.

The relevance ranking function supports weights for words found in the title and the body. The default values are 1.0 for title and < code>0.1 - for body. You can change the default values by passing them as parameters to Search::OpenFTS->new.
Example:


$fts = Search::OpenFTS->new( $dbi, prefix=>$PREFIX,
          relfunc=>q[ rank( '{0.1, 0.2, 0.4, 1.0}', $TSVECTOR, $QUERY, 1 )],
 );

Parser

OpenFTS uses a parser that reads a document or a search query and converts it into a stream of lexemes. You can use different parsers for different projects. The parser distributed with OpenFTS recognizes 19 types of lexemes:

TypeIDDescriptionExam ples
LATWORD1latin wordhello
CYRWORD2cyrillic word...
UWORD3mixed word...
EMAIL4email addressteodor@sigaev.ru
FURL5full URLhttp://www.yahoo.com/index.html
HOST6host name...
SCIENTIFIC7number in scientific notation-0.12345e+15
VERSIONNUMBER8integer or version number3
7.1.2
PARTHYPHENWORD9part of mixed hyphenated word...
CYRPARTHYPHENWORD10cyrillic part of hyphenated word...
LATPARTHYPHENWORD11latin part of hyphenated wordmulti in word multi-key
SPACE12symbols$#%^
SYMTAG13HTML tag<b>
<table>
HTTP14HTTPhttp://
HYPHENWORD15mixed hyphenated word...
LATHYPHENWORD16latin hyphenated wordmulti-key
CYRHYPHENWORD17cyrillic hyphenated word...
URI18Uniform Resource Identifier/index.html
FILEPATH19filename or pathexample.txt
DECIMAL20number in decimal notation 10.345
SIGNEDINT21integer -4
UNSIGNEDINT22unsigned integer 4
HTMLENTITY23HTML entity 4

The package Search::OpenFTS::Parser is a wrapper around the parser functions. Here's an example of how it can be used:

use Search::OpenFTS::Parser;

my $parser=Search::OpenFTS::Parser->new();

my $txt = 'Hello world. OpenFTS rules!';

$parser->start_parser( \$txt );

while ((($type, $word)=$parser->get_word) && $type) {

    print $parser->type_description($type), "\t$word\n";

}

$parser->end_parser;

Save the script (above) into a file called parser-test-1.pl and then call it with perl parser-test-1.pl. The output should look like this:

Latin word	Hello
Space symbols	 
Latin word	world
Space symbols	.
Space symbols	 
Latin word	OpenFTS
Space symbols	 
Latin word	rules
Space symbols	!

To get all types that the parser supports, try this:

use Search::OpenFTS::Parser;

my $parser=Search::OpenFTS::Parser->new();

my @types = $parser->alltypes;

map { print "$_ => $types[$_]\n"; } 1..$#types;
The output should look like this:
1 => Latin word
2 => Cyrillic word
3 => Word
4 => Email
5 => URL
6 => Host
7 => Scientific notation
8 => VERSION
9 => Part of hyphenated word
10 => Cyrillic part of hyphenated word
11 => Latin part of hyphenated word
12 => Space symbols
13 => Char in tag
14 => HTTP head
15 => Hyphenated word
16 => Latin hyphenated word
17 => Cyrillic hyphenated word
18 => URI
19 => File or path name
20 => Decimal notation
21 => Signed integer
22 => Unsigned integer
23 => HTML entity

See simple_parser.pl in examples directory for example of very simple parser written in perl and recognizing space delimited words with length =>2.

echo 'Simple parser written in perl a'|perl simple_parser.pl 
Space delimited word (len=>2),word=Simple
Space delimited word (len=>2),word=parser
Space delimited word (len=>2),word=written
Space delimited word (len=>2),word=in
Space delimited word (len=>2),word=perl
0 => Unknown type
1 => Space delimited word (len=>2)
Notice, last word 'a' doesn't recognized because it's too short.

6. API

Search

Index

Parser

7. Examples

This section explains how you can run the examples that come with the OpenFTS distribution.

8. Interesting Papers

9. Related Links