Segmentation and tokenisation are two important steps in the Machine Translation pipeline. KantanMT automatically applies basic segmentation and tokenisation to all data it processes. Clients can also change the segmentation and/or tokenisation rules to suit their projects' needs.
How to Change Segmentation:
Examples of segmentation rules:
Fig. 1. Example of a non-breaking rule
Fig. 2. Example of a breaking rule
Figure 1 displays a non-breaking rule (i.e., a rule that prevents segmentation), while Figure 2 displays a breaking rule (i.e., a rule that forces segmentation). To add a different rule, copy one of the rules above and edit it. The contents of the beforebreak and afterbreak tags are regular expressions: beforebreak matches the text immediately before a potential break position, and afterbreak matches the text immediately after it.
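The interaction between breaking and non-breaking rules can be illustrated with a short Python sketch. It is not KantanMT's segmenter: the rule patterns, the `RULES` list and the `segment` function are hypothetical, and Python's `re` syntax is only assumed to be close enough to the regular-expression flavour used in SRX files for patterns this simple. The sketch follows the usual SRX convention that, at each position, the first rule whose beforebreak and afterbreak patterns both match decides whether a break occurs.

```python
import re

# Hypothetical rules, listed in the order they would appear in an SRX file.
# Each tuple is (is_break, beforebreak_regex, afterbreak_regex).
RULES = [
    # Non-breaking rule: never split after abbreviations such as "Dr." or "e.g.".
    (False, r"\b(?:Dr|Mr|Mrs|e\.g|i\.e)\.", r"\s"),
    # Breaking rule: split after sentence-final punctuation followed by whitespace.
    (True, r"[.?!]", r"\s"),
]

def segment(text, rules=RULES):
    """Split text into segments; the first rule matching at a position wins."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # beforebreak must match the text ending at pos,
            # afterbreak must match the text starting at pos.
            before_ok = re.search(rf"(?:{before})\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are ignored once one rule has matched
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith arrived. He was late, e.g. by an hour. Why?"))
# ['Dr. Smith arrived.', 'He was late, e.g. by an hour.', 'Why?']
```

Because the non-breaking rule is listed first, "Dr." and "e.g." do not end a segment even though the breaking rule would otherwise match there.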
To implement segmentation rules, add them to an XML-based file named segment.srx. To change the segmentation of your training data, add the SRX file to the Training tab of your engine. To change the segmentation of the data to be translated, add the SRX file to the Translation tab of your engine.
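As a rough illustration of what such a file might contain, the following Python sketch builds a minimal document that follows the general layout of the SRX 2.0 standard (a header, a set of language rules, and a map that assigns the rule set to languages). The rule patterns, the rule-set name "Default" and the `build_srx` helper are assumptions made for this example; this article does not state which additional elements, if any, KantanMT expects.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx20"  # namespace used by SRX 2.0 documents

# Hypothetical rules: (is_break, beforebreak_regex, afterbreak_regex).
EXAMPLE_RULES = [
    (False, r"\b(?:Dr|Mr|Mrs)\.", r"\s"),  # non-breaking rule
    (True, r"[.?!]", r"\s"),               # breaking rule
]

def q(tag):
    """Qualify a tag name with the SRX namespace."""
    return f"{{{SRX_NS}}}{tag}"

def build_srx(rules, rule_set_name="Default"):
    """Return the root element of a minimal SRX 2.0 document."""
    ET.register_namespace("", SRX_NS)  # serialise without a namespace prefix
    srx = ET.Element(q("srx"), {"version": "2.0"})
    ET.SubElement(srx, q("header"), {"segmentsubflows": "yes", "cascade": "no"})
    body = ET.SubElement(srx, q("body"))
    language_rules = ET.SubElement(body, q("languagerules"))
    rule_set = ET.SubElement(
        language_rules, q("languagerule"), {"languagerulename": rule_set_name}
    )
    for is_break, before, after in rules:
        rule = ET.SubElement(rule_set, q("rule"), {"break": "yes" if is_break else "no"})
        ET.SubElement(rule, q("beforebreak")).text = before
        ET.SubElement(rule, q("afterbreak")).text = after
    # Apply the rule set to every language code (pattern ".*").
    map_rules = ET.SubElement(body, q("maprules"))
    ET.SubElement(
        map_rules,
        q("languagemap"),
        {"languagepattern": ".*", "languagerulename": rule_set_name},
    )
    return srx

root = build_srx(EXAMPLE_RULES)
print(ET.tostring(root, encoding="unicode"))
```

To save the result under the file name mentioned above, `ET.ElementTree(root).write("segment.srx", encoding="utf-8", xml_declaration=True)` could be used.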
How to Change Tokenisation:
Tokenisation is the process of dividing any text into units (called tokens). A token can be a word, a number, a punctuation mark or a combination of these.
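To make the definition concrete, here is a deliberately naive tokeniser in Python. It is only a sketch for illustration and is not the tokeniser KantanMT applies; the `tokenise` function and its pattern are assumptions.

```python
import re

def tokenise(text):
    """Split text into word/number tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Model X-200 costs 1,500 euros."))
# ['Model', 'X', '-', '200', 'costs', '1', ',', '500', 'euros', '.']
```

Note how a simple tokeniser like this splits the product code "X-200" into three tokens, which is the kind of behaviour a project may want to adjust.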
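Tokenisation can be changed by applying specific tokeniser or detokeniser exceptions. The following example, written as a Perl substitution, rejoins uppercase alphanumeric codes that were split around a hyphen (for instance, "AB - 12" becomes "AB-12"):
$segment =~ s/([A-Z0-9]+)\s-\s([A-Z0-9]+)/$1-$2/g;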
The exceptions work as search-and-replace patterns and are applied right after tokenisation (in the case of tokeniser exceptions) or right after detokenisation (in the case of detokeniser exceptions).
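The effect of the example exception can be reproduced in Python for experimentation before a pattern is added to the exceptions file. This is an equivalent sketch using Python's `re` module, not the code KantanMT runs; the `apply_exception` name and the sample segment are made up for the illustration.

```python
import re

# Same pattern as the Perl example: an uppercase/digit run, a hyphen
# surrounded by single whitespace characters, and another uppercase/digit run.
EXCEPTION = re.compile(r"([A-Z0-9]+)\s-\s([A-Z0-9]+)")

def apply_exception(segment):
    """Rejoin codes such as 'X - 200' into 'X-200' (all matches, like Perl's /g)."""
    return EXCEPTION.sub(r"\1-\2", segment)

# A segment as it might look right after tokenisation.
print(apply_exception("Model X - 200 costs 1 , 500 euros ."))
# Model X-200 costs 1 , 500 euros .
```

Only the hyphenated code is affected; the comma in "1 , 500" is left alone because the pattern requires a hyphen between the two groups.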
To implement tokeniser or detokeniser exceptions, add them to a Perl module file named tokenizerexceptions.pm and upload it to the engine. To change the tokenisation of your training data, add the file to the Training tab of your engine. To change the tokenisation of the data to be translated, add the file to the Translation tab of your engine.
If you encounter any problems or need help, please contact us at support@kantanmt.com.