]po[ Full-Text File Search


Provides full-text indexing for file names and file contents in the ]po[ file storage. The package uses a number of external filters to periodically scan the ]po[ file storage for new files, builds a full-text index, and lets users retrieve the files through the normal search interface.

Required Software

intranet-search-pg-files requires the following software to extract indexable strings from different file formats:

  • CatDoc: /usr/local/bin/catdoc
  • HTMLtoText: /usr/bin/html2text
  • wvText: /usr/bin/wvText (optional)
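A quick way to verify that the converters listed above are installed is a small check script. The following is only a sketch and not part of the package: the function name missing_converters and the CONVERTERS table are hypothetical, and the paths are simply copied from the list above.

```python
import os

# Converter paths copied from the list above; adjust them if your
# installation differs. The second tuple element marks a converter
# as required (True) or optional (False).
CONVERTERS = {
    "catdoc": ("/usr/local/bin/catdoc", True),
    "html2text": ("/usr/bin/html2text", True),
    "wvText": ("/usr/bin/wvText", False),  # listed as optional
}


def missing_converters(converters=CONVERTERS):
    """Return the names of required converters that are not executable files."""
    missing = []
    for name, (path, required) in converters.items():
        installed = os.path.isfile(path) and os.access(path, os.X_OK)
        if required and not installed:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Prints a list of missing required converters (empty if all are present).
    print(missing_converters())
```

An empty list means that all required converters were found at the expected paths.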

Basic Operation

The package will periodically (default: every 5 minutes) check a maximum number of objects (default: 100) for new files. Please see below for the parameters controlling the indexing behaviour.

This scheduled behaviour is necessary to balance the desire for fast indexing against the considerable load that full-text indexing places on your database.
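The effect of the batch limit can be illustrated with a short sketch. This is not the package's Tcl implementation. The names next_batch and runs_needed are hypothetical, and the sketch assumes that the indexer simply takes the first max_items pending items on each run.

```python
def next_batch(pending, max_items=100):
    """Split the pending items into the batch for this run and the remainder."""
    return pending[:max_items], pending[max_items:]


def runs_needed(pending_count, max_items=100):
    """Number of scheduled runs needed to work through pending_count items."""
    return -(-pending_count // max_items)  # ceiling division


if __name__ == "__main__":
    batch, rest = next_batch(list(range(250)))
    print(len(batch), len(rest))   # 100 150
    # With the default interval of 300 seconds, 250 pending items
    # need 3 runs, i.e. about 900 seconds.
    print(runs_needed(250) * 300)  # 900
```

Under these assumptions, a backlog larger than the batch limit is worked off over several runs instead of loading the database all at once.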

Supported File Types

  • txt, text, perl, php, sql:
    These files are considered to consist fully of indexable text.
  • doc:
    We use CatDoc to extract strings from Microsoft Word documents.
  • htm, html, xml, asp:
    We use HTMLtoText to extract the indexable text from these files.
  • The following extensions are explicitly ignored:
    • Image and audio files: gif, jpg, pgp, bmp, png, wav, mp3, ico
    • File types without a suitable converter: xls, rtf (may be added later)
    • Other files: log, bz2, zip, tar, tgz, rar, gz, js, mso, exe 

To add a new file type, see ~/packages/intranet-search-pg-files-procs.tcl and search for "intranet_search_pg_files_fti_content". Very basic Tcl skills are sufficient to add a new converter once you have the converter running at the shell level.
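The mapping from file extension to converter described above can be pictured as a simple lookup. The sketch below is written in Python for illustration only. The real logic lives in the Tcl procedure intranet_search_pg_files_fti_content and may differ. The function name converter_for, the case-insensitive matching and the "unknown" result for unlisted extensions are assumptions.

```python
import os

# Extension groups as listed in "Supported File Types".
PLAIN_TEXT = {"txt", "text", "perl", "php", "sql"}
CATDOC = {"doc"}
HTML2TEXT = {"htm", "html", "xml", "asp"}
IGNORED = {
    "gif", "jpg", "pgp", "bmp", "png", "wav", "mp3", "ico",
    "xls", "rtf",
    "log", "bz2", "zip", "tar", "tgz", "rar", "gz", "js", "mso", "exe",
}


def converter_for(filename):
    """Return which converter group a file name falls into."""
    # Assumption: extensions are matched case-insensitively.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in PLAIN_TEXT:
        return "plain"
    if ext in CATDOC:
        return "catdoc"
    if ext in HTML2TEXT:
        return "html2text"
    if ext in IGNORED:
        return "ignore"
    # Assumption: the text above does not say how unlisted extensions
    # are treated, so they are reported separately here.
    return "unknown"


if __name__ == "__main__":
    for name in ("notes.txt", "offer.doc", "index.html", "logo.png"):
        print(name, converter_for(name))
```

Adding a new file type then amounts to adding one more extension group and one more branch that calls the new converter.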

Administration & Control

To control indexing, see the page http://<your_server>/intranet-search-pg-files/. On this page you can see the files found by the indexer and re-index individual business objects.

Please see the error log at ~/log/error.log for detailed messages.

Parameters

  • IndexerMaxFiles - 100
    Limit the indexer's activity per run to IndexerMaxFiles. To choose a value, divide the number of files in your intranet (example: 30,000) by the time interval, in seconds, in which all files should be checked (example: 24*60*60 for 1 day) and multiply the result by SearchIndexerInterval (example: 300). Make sure that the indexer can handle IndexerMaxFiles within SearchIndexerInterval seconds; otherwise the system may become overloaded.
  • SearchIndexerInterval - 300
    Run the search indexer every SearchIndexerInterval seconds.
  • IndexFileContentsP - 1
    Should we index the _contents_ of a file, in addition to its filename?
    Disable this parameter (set it to 0) if you are running a translation business, because the contents of your files generally relate to your customers rather than to your own business. Set the parameter to 1 if you are interested in the contents of your files.
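The sizing rule for IndexerMaxFiles can be worked through with the example numbers given above. The function name indexer_max_files is hypothetical, and rounding up to the next whole number is an assumption, made so that a full pass is not slower than intended.

```python
import math


def indexer_max_files(total_files, full_scan_seconds, indexer_interval_seconds):
    """Files to check per run so that all files are covered in full_scan_seconds."""
    runs_per_scan = full_scan_seconds / indexer_interval_seconds
    return math.ceil(total_files / runs_per_scan)


if __name__ == "__main__":
    # 30,000 files, checked once per day, indexer running every 300 seconds:
    # 30000 / (86400 / 300) = 104.17, rounded up to 105.
    print(indexer_max_files(30_000, 24 * 60 * 60, 300))  # 105
```

The result is close to the default of 100, which suggests that the defaults suit an installation of roughly this size.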


References

Related Packages

Related Modules

Related Software

  • PostgreSQL  - we use the TSearch2 engine from PostgreSQL for full text indexing

Package Documentation

Procedure Files

tcl/intranet-search-pg-files-procs.tcl       File Search Library 

Procedures

im_package_intranet_pg_files_id       Returns the package id of the intranet-search-pg-files module 
im_package_search_pg_files_id_helper        
intranet_search_pg_files_fti_content       Extract and normalize the file contents in a best-effort attempt using various filters 
intranet_search_pg_files_index_all       Index the entire server 
intranet_search_pg_files_index_object       Index the files of a single object such as a project, company or user. 
intranet_search_pg_files_search_indexer       Index the entire server. 

SQL Files

sql/postgresql/intranet-search-pg-files-create.sql        
sql/postgresql/intranet-search-pg-files-drop.sql        

Content Pages

www/
      index.adp
      index.tcl Show files that are not indexed by the FTS
      reindex-biz-object.tcl Show files that are not indexed by the FTS
 

 
