]po[ Full-Text File Search

Provides full-text indexing for the names and contents of files in the ]po[ file storage. The package uses a number of external filters to periodically scan the ]po[ file storage for new files, builds a full-text index, and lets users retrieve the files through the normal search interface.

Required Software

intranet-search-pg-files requires the following software to extract indexable strings from different file formats:

  • CatDoc: /usr/local/bin/catdoc
  • HTMLtoText: /usr/bin/html2text
  • wvText: /usr/bin/wvText (optional)

Basic Operation

The package will periodically (default: every 5 minutes) check a maximum number of objects (default: 100) for new files. Please see below for the parameters controlling the indexing behaviour.

This scheduled behaviour balances the desire for fast indexing against the considerable load that full-text indexing places on your database.
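
The batching described above can be pictured with a short sketch. It is written in Python purely for illustration (the package itself is implemented in Tcl), and the names next_batch, pending and max_files are hypothetical and not part of the package. It only shows the idea of capping the work done per run, assuming a simple queue of unindexed files.

```python
from collections import deque
from typing import Deque, List


def next_batch(pending: Deque[str], max_files: int = 100) -> List[str]:
    """Remove and return at most max_files entries from the pending queue."""
    batch: List[str] = []
    while pending and len(batch) < max_files:
        batch.append(pending.popleft())
    return batch


# Hypothetical backlog of 250 unindexed files, handled 100 per indexer run.
pending = deque(f"file_{i}.txt" for i in range(250))
batch_sizes = []
while pending:
    batch_sizes.append(len(next_batch(pending)))
```

With a backlog larger than the per-run limit, the remaining files simply wait for later runs, which is why a large initial import can take several intervals to become searchable.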

Supported File Types

  • txt, text, perl, php, sql:
    These files are considered to consist fully of indexable text.
  • doc:
    We use CatDoc to extract strings from the Microsoft Word format.
  • htm, html, xml, asp:
    We use HTMLtoText to extract the indexable text from these files.
  • The following extensions are explicitly ignored:
    • Image and audio files: gif, jpg, pgp, bmp, png, wav, mp3, ico
    • File types without a suitable converter: xls, rtf (may be added later)
    • Other files: log, bz2, zip, tar, tgz, rar, gz, js, mso, exe

To add a new file type, please see ~/packages/intranet-search-pg-files-procs.tcl and search for "intranet_search_pg_files_fti_content". Very basic Tcl skills are sufficient to add a new converter once you have the converter running at the shell level.
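
The extension handling listed above amounts to a lookup from file extension to converter. The following is a minimal sketch of that idea in Python, not the package's actual Tcl code: plan_for, PLAIN_TEXT, CONVERTERS and IGNORED are hypothetical names, the converter paths are the ones listed under Required Software, and skipping unknown extensions is an assumption. The sketch only builds the command line and does not run it.

```python
import os
from typing import Dict, List, Tuple

# Extensions whose content is treated as indexable text as-is.
PLAIN_TEXT = {"txt", "text", "perl", "php", "sql"}

# Extension -> external converter command (assumed to write text to stdout).
CONVERTERS: Dict[str, List[str]] = {
    "doc": ["/usr/local/bin/catdoc"],
    "htm": ["/usr/bin/html2text"],
    "html": ["/usr/bin/html2text"],
    "xml": ["/usr/bin/html2text"],
    "asp": ["/usr/bin/html2text"],
}

# Extensions that are explicitly ignored.
IGNORED = {
    "gif", "jpg", "pgp", "bmp", "png", "wav", "mp3", "ico",
    "xls", "rtf",
    "log", "bz2", "zip", "tar", "tgz", "rar", "gz", "js", "mso", "exe",
}


def plan_for(filename: str) -> Tuple[str, List[str]]:
    """Decide how a file would be handled, based only on its extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in PLAIN_TEXT:
        return ("plain", [])
    if ext in CONVERTERS:
        return ("convert", CONVERTERS[ext] + [filename])
    # Ignored and unknown extensions are both skipped (assumption).
    return ("skip", [])


# Adding a new file type is one more table entry; the converter path and
# the "xyz" extension here are made up for illustration.
CONVERTERS["xyz"] = ["/usr/local/bin/my-converter"]
```

Checking the converter by hand on the shell first, as suggested above, confirms that it prints plain text to standard output before it is wired into the indexer.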

Administration & Control

To control indexing, please see the page http://<your_server>/intranet-search-pg-files/. On this page you can see the files found by the indexer and re-index selected business objects.

Please see the error log at ~/log/error.log for detailed messages.


  • IndexerMaxFiles - 100
    Limits indexer activity to MaxFiles files per run. You can estimate this parameter by dividing the number of files in your intranet (example: 30,000) by the time interval, in seconds, within which all files should be checked (for example: 24*60*60 for 1 day) and multiplying the result by SearchIndexerInterval (example: 300). With these example values the result is roughly 104. Make sure that the indexer can handle MaxFiles files within SearchIndexerInterval seconds; otherwise the system may become overloaded.
  • SearchIndexerInterval - 300
    Run the search indexer every X seconds
  • IndexFileContentsP - 1
    Should we index the _contents_ of a file, in addition to its filename?
    Disable this parameter if you are running a translation business, because in general your file contents relate to your customers rather than to your own business. Set the parameter to 1 if you are interested in the contents of your files.
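
The sizing rule for IndexerMaxFiles can be written out as a small calculation. This is a sketch in Python using the example figures from the parameter description; the function name estimate_indexer_max_files is hypothetical, and rounding up is a choice made here so that a full pass finishes within the target period.

```python
import math


def estimate_indexer_max_files(
    total_files: int,
    full_scan_seconds: int,
    indexer_interval_seconds: int,
) -> int:
    """Files to check per indexer run so all files are covered in time."""
    if full_scan_seconds <= 0 or indexer_interval_seconds <= 0:
        raise ValueError("time intervals must be positive")
    per_run = total_files * indexer_interval_seconds / full_scan_seconds
    # Round up so the full scan completes within full_scan_seconds.
    return math.ceil(per_run)


# Example values: 30,000 files, one full pass per day, indexer every 300 s.
max_files = estimate_indexer_max_files(30_000, 24 * 60 * 60, 300)
```

For the example values the exact ratio is about 104.2, which rounds up to 105 and is close to the default of 100. A noticeably larger result suggests either raising IndexerMaxFiles or accepting a longer full-scan period.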


Related Packages

Related Modules

Related Software

  • PostgreSQL - we use the TSearch2 engine from PostgreSQL for full-text indexing

Package Documentation

Procedure Files

tcl/intranet-search-pg-files-procs.tcl       File Search Library 


im_package_intranet_pg_files_id       Returns the package id of the intranet-search-pg-files module 
intranet_search_pg_files_fti_content       Extract and normalize the file contents on a best-effort basis, using various filters
intranet_search_pg_files_index_all       Index the entire server 
intranet_search_pg_files_index_object       Index the files of a single object such as a project, company or user. 
intranet_search_pg_files_search_indexer       Index the entire server. 

SQL Files


Content Pages

      index.tcl Show files that are not indexed by the FTS
      reindex-biz-object.tcl Show files that are not indexed by the FTS

