Culter Segmentation Extended Format

The Culter Segmentation Extended Format (CSEX) is a format for specifying segmentation rules.

Unlike the "Compatible" format, this format can express some rules that cannot be specified in SRX. For that reason, it cannot be converted to SRX.

Currently, the main addition is the notion of protected parts: they make it possible to prevent segmentation in some parts of the text (for example, between quotation marks).
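To make the idea of protected parts concrete, here is a minimal Ruby sketch. It is not the Culter implementation and does not use the CSEX syntax: the two regular expressions, the constant names and the `segment` method are all assumptions chosen for illustration. It breaks after sentence punctuation followed by whitespace, but ignores any break position that falls inside a pair of straight double quotes.

```ruby
# Illustration only: hypothetical names, not the Culter API or the CSEX syntax.
BREAK_AFTER = /[.!?]\s+/   # assumed break rule: sentence punctuation, then whitespace
PROTECTED   = /"[^"]*"/    # assumed protected part: text between straight double quotes

# Return [start, end] character offsets for every match of pattern in text.
def match_spans(text, pattern)
  spans = []
  text.scan(pattern) do
    m = Regexp.last_match
    spans << [m.begin(0), m.end(0)]
  end
  spans
end

# Split text at each break position that is not strictly inside a protected part.
def segment(text)
  protected_spans = match_spans(text, PROTECTED)
  segments = []
  start = 0
  match_spans(text, BREAK_AFTER).each do |_, pos|
    # A break at pos would cut a protected part in two: skip it.
    next if protected_spans.any? { |from, to| pos > from && pos < to }
    segments << text[start...pos].strip
    start = pos
  end
  tail = text[start..-1].strip
  segments << tail unless tail.empty?
  segments
end

text = 'He said "Stop. Go away." to me. Then he left.'
p segment(text)
# => ["He said \"Stop. Go away.\" to me.", "Then he left."]
```

Without the protected-part check, the same break rule would also split after "Stop.", in the middle of the quotation.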

Unlike other formats, CSEX can be used not only for splitting but also for the reverse operation: it lets you specify whether segments should be joined with spaces or not, depending on the target language (currently: one rule per language; in the future, it may be possible to have rules that depend on regular expressions before and after, exactly as for splitting).
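The "one rule per language" join behaviour can be pictured with a short Ruby sketch. Again, this is only an assumption-laden illustration, not the Culter code: the lookup table, its language codes and the `join_segments` method are hypothetical, and a real CSEX file would carry this information in its own XML form.

```ruby
# Illustration only: hypothetical table and method, not the Culter API.
# Assumed rule set: Japanese and Chinese segments are joined without a space,
# every other target language falls back to a single space.
SEPARATOR_BY_LANGUAGE = { "ja" => "", "zh" => "" }.freeze
DEFAULT_SEPARATOR = " "

# Join segments using the separator assumed for the target language.
def join_segments(segments, language)
  separator = SEPARATOR_BY_LANGUAGE.fetch(language, DEFAULT_SEPARATOR)
  segments.join(separator)
end

p join_segments(["Hello.", "How are you?"], "en")
# => "Hello. How are you?"
p join_segments(["こんにちは。", "元気ですか。"], "ja")
# => "こんにちは。元気ですか。"
```

A rule that depends on the text before and after the join point, as envisaged for the future, would need a richer structure than this single lookup per language.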

Files in this format should have the .csex extension; the final 'x' indicates that the file is in XML. In the future, we plan to add other representations of this format, for example in YAML.

You can find a sample and a schema in the package. The program seems to work correctly. However, this format is very likely to gain features later: do not consider the schema final.

About the license: the Ruby code is under EUPL 1.1, like most of our programs. The schemas are under the Creative Commons Attribution-NoDerivatives license: feel free to make your own implementation of our formats, but if you plan to make improvements to the schemas, please discuss them with us first.