]po[ Full-Text Search Engine

(Please click on the hexagons for more information)

This package allows users to perform full text searches of the ]project-open[ database. All searching functions are permission sensitive and security conscious of a users rights and privileges. This package is an adapter, allowing PostgreSQL 's TSearch2 module to index ]project-open[ contents and then provide the search functionality for project, users, forums, files.

The Big Picture

]project-open[ emphasizes knowledge internet collaboration and knowedge management. However, internet collaboration needs to respect the tight security permissions, featured by other parts of the system. So the search engine needs to comply with all of the requirements:

High performance
Support of the same permissions as the rest of the system and
User friendlyness

Implementation Architecture

]project-open[ Search is implemented using the TSearch2 full text engine that comes as a part of the PostgreSQL search engine (on Oracle, ]project-open[ supports Intermedia).

TSearch2 allows for a tight integration of full text indices and SQL statements, allowing to mix instructions for structured queries (in order to determine permissions and object relationships) and access to full text indices.

The implementation consists of the following elements:

A centralized "im_search_objects" table that contains the full text index for all objects in the system that are indexed
A number of SQL triggers that are associated to the indexed original tables. These triggers keep im_search_objects in sync with the application tables
A centralized search form

-- The main search table with Full Text Index.
--
create table im_search_objects (
        object_id               integer,
                                -- may include "object types" outside of OpenACS
                                -- that are not in the "acs_object_types" table.
        object_type_id          integer
                                constraint im_search_objects_object_type_id_fk
                                references im_search_object_types
                                on delete cascade,
                                -- What is the topmost container for this object?
                                -- Allows to speed up the elimination of objects
                                -- that the current user can't access
        biz_object_id           integer
                                constraint im_search_objects_biz_obj_id_fk
                                references acs_objects
                                on delete cascade,
                                -- Owner may not need to be a "user" (in the case
                                -- of a deleted user). Owners can be asked to give
                                -- permissions to a document even if the document
                                -- is not readable for the searching user.
        owner_id                integer
                                constraint im_search_objects_owner_id_fk
                                references persons
                                on delete cascade,
                                -- Bitset with one bit for each "profile":
                                -- We use an integer instead of a "bit varying"
                                -- in order to keep the set compatible with Oracle.
                                -- A set bit indicates that object is readable to
                                -- members of the profile independent of the
                                -- biz_object_id permissions.
        profile_permissions     integer,
                                -- counter for number of accesses to this object
                                -- either from the permission() proc or from
                                -- reading in the server log file.
        popularity              integer,
                                -- Full Text Index
        fti                     tsvector,
                                -- For tables that don't respect the OpenACS object
                                -- scheme we may get "object_id"s that start with 0.
        primary key (object_id, object_type_id)
);

create index im_search_objects_fti_idx on im_search_objects using gist(fti);
create index im_search_objects_object_id_idx on im_search_objects (object_id);

The following steps are executed during each search:

The main search SQL query determines in a subquery the set of "container objects" (projects, customers, users, ...) that are accessible for a specific user. Only part of the permission rules are checked for the container objects, leading to a slighly more search results then what a user should see at the end. These cases are treated later in the algorithm.
The main search SQL determines all searchable objects within the "container objects".
These searchable objects are queried using the PostgreSQL TSearch2 full text index. Please note that the queried set of objects may already be greatly reduced due to step 1.
The results of the SQL search query are checked for security permissions via TCL code (the security system doesn't have a SQL API). This step eliminates some objects that might have slipped through the coarse grain (fast) permission check in step 1.
The results are displayed on the search screen, together with HTML links to the object's view pages and with a "summary" of the object's content (the TSearch2 full text index list of words).

Search Internationalization

TSearch2 contains several features allowing to adapt the search process to specific languages such as dictionaries, language specific stop words etc. However, ]project-open[ needs to be able to operate with content items of several languages at the same time.

However, it is not always possible to determine the language of a content item, so that we have decided not to implement these features at the moment.

However, the practical experiences of the use of TSearch with languages such as French, Spanish and German has required us to add a "normalization" feature to TSearch2 that "normalized" search content and queries in order to deal with accents and notational variants:

We transform all accented characters such as á, è, ü, ñ etc. into the non-accented form
We replace "@", "-" and "." characters by a space in order to split URLs and mail adddresses into their components

This normalization allows to search for "carlos" and to receive search results such as "Carlós" or "carlos@abc.com".

Actually, we had to implement this normalization ourselves, because there was no code on the PostgreSQL page about it. Also, the PostgreSQL "conversion" functionality (UTF-8 => SQL_ASCII) did not elimiated the accents. Please check for the latest version at Sourceforge.net.

Ranking

Ranking is currently limited to the built-in TSearch2 ranking functionality. In the future we are going to use several types of statistics to determine the "popularity" of a content item.

References

Related Packages

PostgreSQL File Full-Text Search Engine Package

Related Modules

Collaboration & KM

Related Software

Package Documentation

Procedure Files

tcl/intranet-search-pg-procs.tcl Procedures for tsearch full text enginge driver

Procedures

im_package_search_id       Returns the ID of the current package.

im_tsvector_to_headline       Converts a tsvector (or better: its string representation) into a text string, obviously without the stop words.

tsearch2::build_query       Convert conjunctions to query characters for tsearch2 and => & not => ! or => | space => | (or)

tsearch2::driver_info

tsearch2::index       add object to full text index

tsearch2::search       ftsenginedriver search operation implementation for tsearch2

tsearch2::summary       Highlights matching terms.

tsearch2::unindex       Remove item from FTS index

tsearch2::update_index       update full text index

SQL Files

sql/postgresql/intranet-search-pg-create.sql

sql/postgresql/intranet-search-pg-drop.sql

sql/postgresql/upgrade/upgrade-4.0.3.0.0-4.0.3.0.1.sql

sql/postgresql/upgrade/upgrade-4.0.3.3.2-4.0.3.3.3.sql

sql/postgresql/upgrade/upgrade-4.0.4.0.0-4.0.4.0.1.sql

sql/postgresql/upgrade/upgrade-4.0.5.0.0-4.0.5.0.1.sql

sql/postgresql/upgrade/upgrade-4.0.5.0.1-4.0.5.0.2.sql

sql/postgresql/upgrade/upgrade-5.0.0.0.0-5.0.0.0.1.sql

sql/postgresql/upgrade/upgrade-5.0.2.3.4-5.0.2.3.5.sql

sql/postgresql/upgrade/upgrade-5.0.4.0.0-5.0.4.0.1.sql

Content Pages

www/
advanced-search.adp
advanced-search.tcl
index.adp
index.tcl
search.adp
search.tcl