Culter Segmentation Compatible Format: specification

The specification of the Culter Segmentation Compatible Format is also available as an XML schema inside the Culter distribution. This document aims to provide a human-readable version.

Since some parts of CSCX are based on SRX markup, we refer to the original SRX specification, as its license permits (CC BY 3.0 says that we are free to adapt or remix). However, because named anchors inside the original specification do not work, we link to this copy instead, where the links work correctly.

The CSC specification is distributed under the Creative Commons Attribution-NoDerivatives license. We are not strictly opposed to derivative works, but we would prefer that you first open a discussion with us to see whether we can publish a better specification instead.

A CSCX document is an XML document in the namespace "http://culter.silvestris-lab.org/compatible". Start the document with <seg-rules xmlns="http://culter.silvestris-lab.org/compatible">, or bind the namespace to a prefix and use that prefix on every element. The rest of this document assumes the first option.

seg-rules

The document is made of up to four subelements. They can appear in any order, but each one appears at most once (in XML Schema: type "all"). The rules-mapping and languagerules elements are mandatory (except when you are extending a file), while format-handles and rule-templates are optional.

Attributes:

version (optional) indicates the version number of the specification. The current version is 0.10. Since this is below 1.0, the specification is in beta status; until it is final, an engine is not required to support older versions.
extends (optional) indicates that this document re-uses data from a more generic one. The mechanism is described later in this document, under "Extending a file".
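
As a rough illustration of the constraints above, the following Python sketch checks the top-level structure of a document: the root must be seg-rules in the CSCX namespace, each subelement may appear at most once, and rules-mapping and languagerules are required unless the extends attribute is present. The function name check_seg_rules and the wording of the messages are hypothetical; this is not how Culter itself validates files (the XML schema in the distribution remains the reference).

```python
import xml.etree.ElementTree as ET

CSCX_NS = "http://culter.silvestris-lab.org/compatible"
ALLOWED = {"format-handles", "rules-mapping", "languagerules", "rule-templates"}
MANDATORY = {"rules-mapping", "languagerules"}


def local_name(tag):
    """Return the local name if the tag is in the CSCX namespace, else None."""
    # ElementTree renders qualified names as "{namespace}local".
    prefix = "{" + CSCX_NS + "}"
    return tag[len(prefix):] if tag.startswith(prefix) else None


def check_seg_rules(xml_text):
    """Return a list of problems found in the top-level structure (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "seg-rules":
        return ["root element is not seg-rules in the CSCX namespace"]
    problems = []
    seen = set()
    for child in root:
        name = local_name(child.tag)
        if name not in ALLOWED:
            problems.append(f"unexpected element: {child.tag}")
        elif name in seen:
            problems.append(f"{name} appears more than once")
        else:
            seen.add(name)
    # Mandatory elements may be omitted when the document extends another one.
    if "extends" not in root.attrib:
        for name in sorted(MANDATORY - seen):
            problems.append(f"{name} is mandatory")
    return problems
```

An empty list means that the top-level structure looks acceptable; the contents of each subelement are not checked by this sketch.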

format-handles

This element contains one or more formathandle elements; these are the same as the formathandle elements in the SRX format.

rules-mapping

This corresponds to maprules in the SRX format. Its content is identical, except that <rules-mapping> also carries the cascade attribute.

languagerules

This corresponds to the languagerules element in SRX. Its content is similar, but the definition of languagerule is slightly different.

languagerule

This corresponds to the languagerule element in SRX. The attributes are the same as in SRX. The content is one or more of the following:

rule element, as in SRX;
break-rule: synonym for a rule with break = yes;
exception-rule: synonym for a rule with break = no;
apply-rule-template.

rule-templates

This element defines one or more rule templates, which can then be used inside languagerule.
Contains one or more rule-template elements.

rule-template

Defines a template to be used by an apply-rule-template element.
Attributes: name, a free string which is referenced by apply-rule-template.

Contents: one rule, break-rule or exception-rule in which beforebreak and afterbreak contain regular expressions with one or more variables, written with the syntax %{varName}.

apply-rule-template

Indicates that we want to apply a template defined in rule-templates. See "How templates are applied" below.
Attributes: name, a free string which must be the same as in the corresponding rule-template.

Contents: one or more param elements.

param

Sets one parameter for apply-rule-template.
Attributes: name, a string without spaces, which must be identical to the name given as %{varName} inside rule-template/beforebreak or afterbreak;

mode indicates how to build the parameter, with two possibilities:

value: param also has a value attribute containing the parameter value as a string;
loop: param also contains one or more item or item-list-file elements.

item

Contains one value of a multi-valued parameter. No attributes; the content is a string.

item-list-file

Indicates that the loader must read a list of items from a file at this point.
Attributes:

name: name of the file to read. It must be accessible from the location of the current file;
format: format of the file. Currently the only supported value is txt: followed by any encoding name;
remove: optional. If present, every match of the given regular expression is removed from each item. Typically used to strip trailing carriage returns;
comments: optional. If present, lines matching the given regular expression are ignored. Typically used to allow comments in a list of items.
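
The attributes above can be pictured with the following Python sketch. It is only an illustration under several assumptions that the specification does not state: that format is written as txt:ENCODING without spaces, that the file holds one item per line, that comment lines are filtered before the remove expression is applied, that both expressions are searched anywhere in the line, and that lines left empty are skipped. The function load_items is hypothetical and takes the file content as bytes so that the example stays self-contained.

```python
import re


def load_items(raw_bytes, fmt, remove=None, comments=None):
    """Turn the content of an item list file into a list of items (hypothetical helper)."""
    kind, _, encoding = fmt.partition(":")
    if kind != "txt" or not encoding:
        raise ValueError(f"unsupported format: {fmt!r}")
    text = raw_bytes.decode(encoding)
    items = []
    for line in text.split("\n"):  # assumption: one item per line
        if comments and re.search(comments, line):
            continue  # assumption: comment lines are dropped first
        if remove:
            line = re.sub(remove, "", line)
        if line:  # assumption: empty lines are not items
            items.append(line)
    return items


# A file with Windows line endings and one comment line.
data = b"# abbreviations\r\nMr\r\nDr\r\n"
items = load_items(data, "txt:utf-8", remove=r"\r$", comments=r"^#")
```

Here remove strips the carriage return left at the end of each line, and comments discards the line starting with #.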

How templates are applied

Wherever apply-rule-template appears, the engine must generate equivalent rules. The first step is to replace every %{variable} that corresponds to a value parameter with the content of param/@value.

For multi-valued parameters (mode=loop) there are two possibilities:

Machine method: all items are joined using the regular expression OR symbol: |

This method gives faster segmentation engines, but the result is not easy to read once converted to SRX. For that reason it is used by default, but when you convert to SRX you should prefer the human method.
Human method: generates one SRX rule for each item. This is easier for a human to read. However, one long regular expression is known to be faster than many short rules in most implementations, so if the file is meant to be used by computers only, use the machine method instead.

Let's take a purely theoretical example.

<rule-template name="example"><rewrite>
   <break-rule>
      <beforebreak>(%{var1})\.</beforebreak>
      <afterbreak>\s(%{var2})?</afterbreak>
   </break-rule>
</rewrite></rule-template>

Now, let's apply the template:

<apply-rule-template name="example">
   <param name="var1" mode="loop">
      <item>A</item> 
      <item>B</item> 
   </param>
   <param name="var2" mode="loop">
      <item>a</item> 
      <item>b</item> 
   </param>
</apply-rule-template>

Having two loop parameters means that we take a Cartesian product: each item of var1 is combined with each item of var2. In machine mode, the alternations cover every combination within a single rule:

   <break-rule>
      <beforebreak>(A|B)\.</beforebreak>
      <afterbreak>\s(a|b)?</afterbreak>
   </break-rule>

In human mode, each combination becomes a separate rule:

   <break-rule>
      <beforebreak>(A)\.</beforebreak>
      <afterbreak>\s(a)?</afterbreak>
   </break-rule>
   <break-rule>
      <beforebreak>(A)\.</beforebreak>
      <afterbreak>\s(b)?</afterbreak>
   </break-rule>
   <break-rule>
      <beforebreak>(B)\.</beforebreak>
      <afterbreak>\s(a)?</afterbreak>
   </break-rule>
   <break-rule>
      <beforebreak>(B)\.</beforebreak>
      <afterbreak>\s(b)?</afterbreak>
   </break-rule>

Please note that variables are expanded as is: variable values are themselves regular expressions, so any regular expression metacharacters they contain are copied verbatim and not interpreted during expansion. The expander also does not add parentheses around the variable, because it cannot know whether you want to put a quantifier after the group or not; this is why the parentheses are in the template, not in the variables.
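
The two methods can be sketched in a few lines of Python. This is only an illustration of the expansion described above, not the code of the Culter engine: the function names are hypothetical, each parameter is represented as a list of items (a value parameter would be a list with a single entry), and rules are reduced to (beforebreak, afterbreak) pairs.

```python
import itertools
import re

# %{varName}, where the name is a string without spaces
VAR = re.compile(r"%\{([^}\s]+)\}")


def substitute(pattern, values):
    """Replace each %{name} with values[name], copied verbatim (no escaping)."""
    return VAR.sub(lambda match: values[match.group(1)], pattern)


def expand_machine(before, after, params):
    """Machine method: join the items of each parameter with | into a single rule."""
    joined = {name: "|".join(items) for name, items in params.items()}
    return [(substitute(before, joined), substitute(after, joined))]


def expand_human(before, after, params):
    """Human method: one rule per combination of items (Cartesian product)."""
    names = list(params)
    rules = []
    for combo in itertools.product(*(params[name] for name in names)):
        values = dict(zip(names, combo))
        rules.append((substitute(before, values), substitute(after, values)))
    return rules


params = {"var1": ["A", "B"], "var2": ["a", "b"]}
machine = expand_machine(r"(%{var1})\.", r"\s(%{var2})?", params)
human = expand_human(r"(%{var1})\.", r"\s(%{var2})?", params)
```

With the template and parameters of the example, machine holds the single pair (A|B)\. and \s(a|b)?, while human holds the four pairs listed above, in the same order. The substitution uses a function rather than a replacement string so that backslashes in the values are left untouched.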

Extending a file

When the top-level seg-rules element contains an extends attribute, the corresponding file is first loaded into memory. Then, for each section of the new file, you indicate how its contents affect those of the extended document.

Extending rules mapping

In the extending file, unlike in the original one, rules-mapping is optional: by default the existing mappings are reused, but they are mapped to the extended language rules rather than to the original ones.

However, you can define one new mapping (i.e. a single rules-mapping, which may contain several languagemap elements); in that case you must add an extension-mode attribute with one of the following values:

before: add the language maps from the new file before those of the original one;
after: add the language maps from the new file after those of the original one;
replace: remove the language maps of the original file, and use those of the new file instead.
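
The effect of the three values can be summarised with a small Python sketch. The function merge_sections and the sample patterns are hypothetical, and entries are shown as plain strings for brevity; a real loader would work on languagemap elements.

```python
def merge_sections(original, new, extension_mode):
    """Combine the entries of the original file with those of the extending file."""
    if extension_mode == "before":
        return list(new) + list(original)
    if extension_mode == "after":
        return list(original) + list(new)
    if extension_mode == "replace":
        return list(new)
    raise ValueError(f"unsupported extension-mode: {extension_mode!r}")


# Hypothetical language patterns, in the order they would be tried.
original_maps = ["EN.*", ".*"]
new_maps = ["FR.*"]
merged = merge_sections(original_maps, new_maps, "before")
```

In this sketch, before places the new FR.* entry ahead of the original ones, which matters whenever entries are evaluated in order.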

Extending language rules

The mechanism is exactly the same as the one described for rules-mapping. The only difference is that, since language rules are named, you are allowed to define new language rules and use them in the new rules-mapping, in which case the extension-mode attribute becomes optional.

Extending rule templates

No attribute needs to be added here. If the new file contains rule-templates, each template that has the same name as a template in the original file replaces it; the others are simply added to the dictionary and can be used in language rules.
