DESCRIPTION:
============

Castor detects similar documents before a translation starts,
so that you can check whether it makes sense to translate them together, for example using Cyclotis.

This is not a translation memory: the stored documents are monolingual,
and should all be in the same language.

The tool only makes suggestions, which you are free to follow or ignore. Once a document has been
translated, it should be removed so that it does not appear in a new group.

INSTALLATION:
=============

You must install a PostgreSQL database server, version 8.4 or later.
Create a database and run sql/tables.sql as the user who will have write access to the tables.

USAGE:
======
First, configure the tool using config/castor.yml; the package contains some examples.
If you want to use multiple configuration files, pass the --config-file option to the scripts.

Use the script cr-import.pl to add a file to the database. The file can be in any format supported by Spongiae.
The command-line syntax is described in the script itself.

During import, the script compares the new segments with those from previously imported documents.
At the end, it tells you which documents match the one you just imported, if any.

Once a document is translated, unless you keep your project, you should use cr-delete.rb to remove it from the database,
not only to save space but also to keep it from appearing in new grouping suggestions.
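
The import, match and delete cycle described above can be pictured with a small in-memory sketch.
This is only an illustration, not Castor's actual code: the real scripts work against the PostgreSQL
tables and read files through Spongiae, while the class SegmentStore, its methods and the document
names below are hypothetical and assume that segments are compared as exact strings.

```python
from collections import defaultdict


class SegmentStore:
    """Hypothetical in-memory stand-in for the database of imported segments."""

    def __init__(self):
        # Maps each segment text to the names of the documents containing it.
        self._docs_by_segment = defaultdict(set)

    def import_document(self, name, segments):
        """Store the segments; return {other_document: number of shared segments}."""
        matches = defaultdict(int)
        for segment in set(segments):  # count each distinct segment once
            for other in self._docs_by_segment[segment]:
                if other != name:
                    matches[other] += 1
            self._docs_by_segment[segment].add(name)
        return dict(matches)

    def delete_document(self, name):
        """Forget a document so that it no longer shows up in suggestions."""
        for docs in self._docs_by_segment.values():
            docs.discard(name)


store = SegmentStore()
first = store.import_document("manual_v1", ["Press the button.", "Close the lid."])
second = store.import_document(
    "manual_v2", ["Press the button.", "Close the lid.", "Wait."]
)
store.delete_document("manual_v1")  # manual_v1 has been translated
third = store.import_document("manual_v3", ["Press the button."])
print(first, second, third)
```

In this sketch the second import reports manual_v1 as a match, and after the deletion only manual_v2
can still be suggested for the third document.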

Question: why not use Lucene indexes instead of a database?
===========================================================

Documents inserted into Castor are meant to be deleted once they have been translated, whether or not its suggestions were used.
Lucene indexes have poor support for deletions. If we want to use a local database in the future, we should use SQLite instead.

Question: why base the score on exact matches only, not on fuzzy matches?
=========================================================================

Most CAT tools use automatic insertion on 100% matches but not on fuzzy matches. The role of Castor is to detect the cases in which
your favorite CAT tool would perform auto-insertion.

Even if functions like the Levenshtein distance are symmetric by definition, the symmetry may be broken once they are computed over all segments:
since we do not consider all matches but only the best ones, the list of matches of document B in document A may differ from the list
of matches of document A in document B. As a result, the score may become asymmetric and it is difficult to set a correct acceptance
criterion.

This possibility will be studied in the future, but for the moment symmetric 100% matches seem to be the best option.
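
The following sketch illustrates the argument on a toy example. It is not taken from Castor: the
function best_matches, its max_distance threshold and the sample segments are assumptions made for
the illustration. Each segment keeps only its closest counterpart in the other document, and the
resulting lists of pairs are compared in both directions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_matches(source, target, max_distance=1):
    """Keep, for each source segment, only its closest target segment.

    Hypothetical rule: a pair is accepted when the distance is at most
    max_distance. Pairs are unordered so both directions are comparable.
    """
    pairs = set()
    if not target:
        return pairs
    for segment in source:
        best = min(target, key=lambda candidate: levenshtein(segment, candidate))
        if levenshtein(segment, best) <= max_distance:
            pairs.add(frozenset((segment, best)))
    return pairs


doc_a = ["Press the red button.", "Press the red buttons."]
doc_b = ["Press the red button."]

a_in_b = best_matches(doc_a, doc_b)  # both segments of A find a partner in B
b_in_a = best_matches(doc_b, doc_a)  # the segment of B keeps its best partner only
exact_a_in_b = set(doc_a) & set(doc_b)
exact_b_in_a = set(doc_b) & set(doc_a)
print(len(a_in_b), len(b_in_a), exact_a_in_b == exact_b_in_a)
```

Here the fuzzy lists contain two pairs in one direction and one pair in the other, while the set of
exact matches is the same whichever document is taken as the starting point.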

LICENSE:
========

Please read the file LICENSE for the latest version of the conditions that apply to this code.

