OpenFTS is Copyright 2000-2002 XWare and licensed under the GNU General Public License, version 2 (June 1991). This means you can use it and modify it in any way you want. If you choose to redistribute OpenFTS, you must do so under the terms of the GNU license.
IMPORTANT NOTICE: This version is incompatible with earlier versions due to changes in the base data type, the structure of the indexing tables, and the interfaces of the dictionaries.OpenFTS is in what is likely to be one of many stages. The OpenFTS developers are experimenting with various features which should eventually result in a full-featured search engine within PostgreSQL.
The latest incarnation has more natural interface which is easier to understand. In the old system, search queries look something like the following:
SELECT
txt.tid,
FROM
txt
WHERE
(txt.fts_index @ '{14054652}')
and the new system uses a natural language approach that supports boolean
operators and it looks like the following:
SELECT * FROM foo WHERE titleidx @@ '(the|this)&!we';
This is quite an improvement over the previous approach. Here's a more complete
list of changes in the latest version:
tsvector
, suitable for text indexing.
It uses words 'as is' without hashing to integers and provides
search interface in more natural way. For example, it's possible
now to test full text search from psql. More information about
tsearch2 is available Tsearch2 home page.
lemms
method instead of lemmsid
,
is_stoplexem
instead of is_stoplemm
.
drop
--
removes all OpenFTS tables, indices, dictionaries
(if dictionary provides 'drop' method);
drop_index
--
removes all OpenFTS indices from index tables (INDEX1,,,INDEXN) and
the GiST index on the base table (where the documents
are stored together with their primary key).
gmake-install-headers
during the PostgreSQL installation. If you have not installed tsearch2,
please do so now.
tsearch2 is under the contrib directory of the PostgreSQL source code.
cd /usr/local/src/
tar -xzvf Search-OpenFTS-0.35.tar.gz
cd Search-OpenFTS-0.35
perl Makefile.PL
make
make install
Search::OpenFTS::Index->init
function. You should start by
creating a base table that stores the indexed documents. Here's an example from
the OpenFTS distribution:
create table txt (
tid int not null primary key,
txt varchar,
fts_index tsvector
);
To configure your OpenFTS instance, call
Search::OpenFTS::Index->init
that creates the configuration and
indexing tables. In our example, the call looks like:
my $idx=Search::OpenFTS::Index->init(
dbi=>$dbi,
txttid=>'txt.tid',
tsvector_field=>'fts_index',
ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
ignore_headline=>[ qw(13 15 16 17 5) ],
map=>'{ \'19\'=>[1], 18=>[1], 8=>[1], 7=>[1], 6=>[1], 5=>[1], 4=>[1],
}',
dict=>[
'Search::OpenFTS::Dict::PorterEng',
'Search::OpenFTS::Dict::UnknownDict',
]
);
You have to specify the table name to be indexed together with its primary key
(txttid
), the indexing field (tsvector
),
the types of lexemes that should be
ignored by the indexer (ignore_id_index
) and types ignored while
constructing headlines for search results (ignore_headlines
), the
available dictionaries (dict
), and a mapping of types of lexemes
to dictionaries (map
) that is used for optimization and for
multi-language support.
All configuration parameters are stored in a database table,
fts_conf
, that is created upon the invocation of the
initialization function. ignore_id_index
and
ignore_headline
can only accept types of lexemes as specified
later in this document [Parser]. For example, value 13 is the type ID for an
html tag, namely SYMTAG. Type IDs are also used to map lexemes to dictionaries.
This is helpful for optimizing the search engine and it is also helpful for
indexing multi-languages or exotic-text documents.
You can create more than one OpenFTS instance by passing a character value
(a-z) for prefix upon the invocation of the initialization function. The
initialization function will also create a table,
fts_unknown_lexem
, that stores the lexemes that are not recognized
by any of the available dictionaries. Note that
fts_unknown_lexem
is created by
Search::OpenFTS::Dict::UnknownDict dictionary, not the OpenFTS core. If it is
not specified upon initialization it will not be created.
Search::OpenFTS::Index->index
), the parser reads and converts it
into a stream of lexemes. Lexemes that were marked to be ignored are filtered
and the position of each lexeme is calculated and stored in the corresponding
indexing table.
SELECT
txt.tid,
rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'star\'', 0 ) as pos
FROM
txt
WHERE
txt.fts_index @@ '\'star\''
ORDER BY pos desc
Function rank is used for ranking of the search results. The first
argument ('{0.1, 0.2, 0.4, 1.0}'
) denotes weights for words
in different parts of document - weight 0.1 used for words in body, and
1.0 - in title. Other weights (0.2, 0.4) currently are not used in OpenFTS.
Read Tsearch2 documentation for details.
Second argument (txt.fts_index
) is the table name together
with its tsvector field. Notice, that if you use several OpenFTS instances
(see Indexer Configuration), for example, prefix 'a', you will see
a_txt.fts_index
.
Third argument is the query itself after normalization ('stars' -> 'star').
The last argument (0
) is used to normalize rank for
document length ( total number of indexed words ).
In the WHERE clause the statement
txt.fts_index @@ '\'star\''
is the "contains" predicate which denotes whether fts_index
(of data type tsvector
) contains the keyword star
or
not. Here's a more complicated query for the words supernovae
and
stars
:
SELECT
txt.tid,
rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 0 ) as pos
FROM
txt
WHERE
txt.fts_index @@ '\'supernova\' & \'star\''
ORDER BY pos desc
Dictionaries
Dictionaries are required to have two methods: lemms
and
is_stoplexem
. In the case of a real dictionary, lemms returns an
array of lexemes. is_stoplexem accepts a lexeme and returns true if it is a stop
word, otherwise it returns false. For example, is_stoplexem("yahoo") returns 0
since "yahoo" is not a stop word while is_stoplexem("are") returns 1 since "are"
is a stop word. An optional method init
, if available, can be used
to initialize the dictionary.
Proximity Ranking
A prominent feature is the ability to rank documents according to
the proximity between words of the search query --
this is accomplished by maintaining coordinate information
of the lexemes for each indexed document. For example,
given the query "full text search", OpenFTS will rank documents that contain the
phrase "full text search" higher than documents in which the
words "full", "text", "search" occur in different places.
The ranking procedure is a C function that utilizes PostgreSQL's SPI and
it can be used inside SQL queries. The ranking function uses methodology
proposed by
Andrew Kovalenko
and Nickolay Kharin.
The relevance ranking function supports weights for words found in the title
and the body. The default values are 1.0
for title and <
code>0.1 - for body. You can change the default values by passing them
as parameters to Search::OpenFTS->new
.
Example:
$fts = Search::OpenFTS->new( $dbi, prefix=>$PREFIX,
relfunc=>q[ rank( '{0.1, 0.2, 0.4, 1.0}', $TSVECTOR, $QUERY, 1 )],
);
OpenFTS uses a parser that reads a document or a search query and converts it into a stream of lexemes. You can use different parsers for different projects. The parser distributed with OpenFTS recognizes 19 types of lexemes:
Type | ID | Description | Exam ples |
LATWORD | 1 | latin word | hello |
CYRWORD | 2 | cyrillic word | ... |
UWORD | 3 | mixed word | ... |
4 | email address | teodor@sigaev.ru | |
FURL | 5 | full URL | http://www.yahoo.com/index.html |
HOST | 6 | host name | ... |
SCIENTIFIC | 7 | number in scientific notation | -0.12345e+15 |
VERSIONNUMBER | 8 | integer or version number | 3 7.1.2 |
PARTHYPHENWORD | 9 | part of mixed hyphenated word | ... |
CYRPARTHYPHENWORD | 10 | cyrillic part of hyphenated word | ... |
LATPARTHYPHENWORD | 11 | latin part of hyphenated word | multi in word multi-key |
SPACE | 12 | symbols | $#%^ |
SYMTAG | 13 | HTML tag | <b>
<table> |
HTTP | 14 | HTTP | http:// |
HYPHENWORD | 15 | mixed hyphenated word | ... |
LATHYPHENWORD | 16 | latin hyphenated word | multi-key |
CYRHYPHENWORD | 17 | cyrillic hyphenated word | ... |
URI | 18 | Uniform Resource Identifier | /index.html |
FILEPATH | 19 | filename or path | example.txt |
DECIMAL | 20 | number in decimal notation | 10.345 |
SIGNEDINT | 21 | integer | -4 |
UNSIGNEDINT | 22 | unsigned integer | 4 |
HTMLENTITY | 23 | HTML entity | 4 |
The package Search::OpenFTS::Parser
is a wrapper around the parser
functions. Here's an example of how it can be used:
use Search::OpenFTS::Parser;
my $parser=Search::OpenFTS::Parser->new();
my $txt = 'Hello world. OpenFTS rules!';
$parser->start_parser( \$txt );
while ((($type, $word)=$parser->get_word) && $type) {
print $parser->type_description($type), "\t$word\n";
}
$parser->end_parser;
Save the script (above) into a file called parser-test-1.pl
and then call it with perl parser-test-1.pl
.
The output should look like this:
Latin word Hello
Space symbols
Latin word world
Space symbols .
Space symbols
Latin word OpenFTS
Space symbols
Latin word rules
Space symbols !
To get all types that the parser supports, try this:
use Search::OpenFTS::Parser;
my $parser=Search::OpenFTS::Parser->new();
my @types = $parser->alltypes;
map { print "$_ => $types[$_]\n"; } 1..$#types;
The output should look like this:
1 => Latin word
2 => Cyrillic word
3 => Word
4 => Email
5 => URL
6 => Host
7 => Scientific notation
8 => VERSION
9 => Part of hyphenated word
10 => Cyrillic part of hyphenated word
11 => Latin part of hyphenated word
12 => Space symbols
13 => Char in tag
14 => HTTP head
15 => Hyphenated word
16 => Latin hyphenated word
17 => Cyrillic hyphenated word
18 => URI
19 => File or path name
20 => Decimal notation
21 => Signed integer
22 => Unsigned integer
23 => HTML entity
See simple_parser.pl in examples directory for example of very simple parser written in perl and recognizing space delimited words with length =>2.
echo 'Simple parser written in perl a'|perl simple_parser.pl
Space delimited word (len=>2),word=Simple
Space delimited word (len=>2),word=parser
Space delimited word (len=>2),word=written
Space delimited word (len=>2),word=in
Space delimited word (len=>2),word=perl
0 => Unknown type
1 => Space delimited word (len=>2)
Notice, last word 'a' doesn't recognized because it's too short.
Search::OpenFTS::Search->search
Search::OpenFTS::Search->get_sql
Search::OpenFTS::Search->_sql
get_sql
to get the SQL parts and then combines them to the SQL query to be executed.
Search::OpenFTS::Search->get_headline
Search::OpenFTS::Index->init
init
)
about the available dictionaries,
the parser, mapping of lexemes to dictionaries, and types of
lexemes to be ignored in indexing or headlines.
Search::OpenFTS::Index->index
Search::OpenFTS::Index->delete
Search::OpenFTS::Index->drop
Search::OpenFTS::Index->drop_index
Search::OpenFTS::Index->create_index
Search::OpenFTS::Index->start_index
Search::OpenFTS::Index->index_chunk
Search::OpenFTS::Index->flush
Search::OpenFTS::Parser->start_parser
Search::OpenFTS::Parser->end_parser
Search::OpenFTS::Parser->get_word
Search::OpenFTS::Parser->type_description
Search::OpenFTS::Parser->alltypes
$ createdb testfts
$ cd PGSQL_SRC_HOME/contrib/tsearch2
$ psql testfts < tsearch2.sql
$ cd /usr/local/src/Search-OpenFTS-0.35/examples
$ ./init.pl testfts
$ ./index.pl testfts file1 [file2 [...] ]
$ ./search.pl -p testfts word1 [word2 [...] ]