Lucene translation memories for OmegaT

Patch M-5 introduces a new kind of translation memory for OmegaT, based on the Lucene library that already ships with OmegaT. In our tests, these memories are not always faster than a TMX file fully loaded in memory, but their advantage lies elsewhere: because they are not loaded into memory as a whole, you can use gigabytes of such memories in a project without the risk of running out of memory during translation.

Usage

First, create a project exactly as you did before. In particular, add the desired TMX files in the tm/ folder, as usual.
The next step uses a new command in OmegaT's console mode:
OmegaT --mode=console-index-memory "project directory"

This replaces all TMX files (and all files readable as a binary format) in the project, except project_save and tmx2source, as follows:

  • In the translation memories folder, the TMX file is replaced by a properties file which is the loader for the plugin;
  • A new directory named tm-indexes will contain, for each TMX file, a directory with the same name but the extension .ottm (OmegaT translation memory), which is a Lucene index;
  • Optionally, if you add the parameter --save-files, the original TMX files are moved to a new directory named tm-saved. Alternatively, you can use the parameter --save-to=XXX to specify the directory they are saved to.
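
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the layout, not code from the patch: the function names are hypothetical, and the sketch assumes that subfolders of tm/ (such as auto/) are mirrored under tm-indexes/, which the description above does not state explicitly.

```python
from pathlib import PurePosixPath


def index_dir_for(tmx_relative_path):
    """Map a TMX path (relative to tm/) to its index directory.

    Assumption: the relative subfolder is kept under tm-indexes/.
    """
    rel = PurePosixPath(tmx_relative_path)
    # Same name, but the .tmx extension becomes .ottm
    return PurePosixPath("tm-indexes") / rel.with_suffix(".ottm")


def is_indexed(tmx_relative_path):
    """tmx2source holds project memories, so it is left untouched."""
    return PurePosixPath(tmx_relative_path).parts[0] != "tmx2source"


for name in ["big-corpus.tmx", "auto/legal.tmx", "tmx2source/FR.tmx"]:
    if is_indexed(name):
        print(name, "->", index_dir_for(name))
    else:
        print(name, "-> skipped")
```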

You can also choose to index only one file:
OmegaT --mode=console-index-memory source=something.tmx target=some-directory.ottm --source-lang=en --target-lang=fr

After this step, the memory works the same way as a TMX file. Properties and attributes are supported, as are the auto/ and enforce/ directories; tmx2source is not, because it contains project memories. Search is also supported.

It is not certain that you will receive exactly the same results as from the original TMX file: some segments may score poorly in Lucene although they score well in OmegaT. We hope that this is rare.
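
To see how two scoring methods can disagree, here is a minimal sketch in Python. Neither function is the real thing: the trigram overlap is a simplified stand-in for an n-gram based index score, and the character edit similarity is a simplified stand-in for a fuzzy-match score. The trigram size and both formulas are assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the two n-gram sets (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(ngram_overlap("colour", "color"))
print(edit_similarity("colour", "color"))
```

For the short pair "colour" / "color", a single deleted letter keeps the edit similarity above 0.8, but it breaks two of the four trigrams of "colour", so the trigram overlap drops to 0.4. Short segments are where this kind of disagreement is most likely.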

Technical note: these indexes are valid Lucene 5.2 indexes; however, they are based on n-gram analysers, not on linguistic ones. For this reason, it will probably not be possible to run any kind of search on them other than OmegaT's fuzzy matching (more precisely: the search windows work, but they scan the full contents instead of using the index). Implementing a different index for string searches in the search windows is left for the future.
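
The difference between an n-gram index and a word-based one can be illustrated with a toy inverted index in Python. This is a sketch under stated assumptions, not the patch's code: a plain dictionary stands in for the Lucene index, and the trigram size, lowercasing and all function names are hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Lowercased character n-grams of a segment."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(segments, n=3):
    """Map each n-gram to the ids of the segments that contain it."""
    index = defaultdict(set)
    for seg_id, text in enumerate(segments):
        for gram in char_ngrams(text, n):
            index[gram].add(seg_id)
    return index


def fuzzy_candidates(index, query, n=3):
    """Segment ids sharing n-grams with the query, best match first."""
    hits = defaultdict(int)
    for gram in char_ngrams(query, n):
        for seg_id in index.get(gram, ()):
            hits[seg_id] += 1
    return sorted(hits, key=lambda s: (-hits[s], s))


segments = ["Open the translation memory", "Close the project"]
index = build_index(segments)

print(fuzzy_candidates(index, "Open a translation memory"))
print("mem" in index, "memory" in index)
```

The terms stored in such an index are fragments like "mem" or "emo", so a fuzzy lookup finds the first segment through its shared fragments, while a lookup of the whole word "memory" finds no matching term. A word or string search would need either a second index built with a different analyser or a full scan of the contents.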