KantanMT supports a number of file formats for training engines and translating files.
The following are the file formats currently supported on KantanMT:
Training Data Formats
- TMX - Translation Memory Exchange format. These are parallel texts containing both source and target segments and are used to train your KantanMT engine.
- TXT - Text-based training files are also supported, but make sure they are UTF-8 encoded and named source.utf8.src for source segments and source.utf8.trg for target segments.
- XLSX - An Excel spreadsheet with all of the source content in column A and all of the target content in column B. No cell may be left blank, and the sheet must not contain special formatting.
- TBX - TermBase eXchange format. This file lists words/phrases that are to be translated in specific ways and/or words/phrases that must not be translated. An Excel spreadsheet may also be used, but it must be named glossary.xlsx or terminology.xlsx.
- MONO - Microsoft Office DOCX and readable Adobe PDF files are treated as monolingual training data. If you want to upload a text file containing monolingual training data, please ensure it is called source.utf8.trg.mono.
- Test & Tune Data - This data should be stored in aligned, UTF-8 encoded text files called source.test.src and source.test.trg for Test data, or source.tune.src and source.tune.trg for Tune data. Each file should have one segment per line, aligned with the equivalent segment in the partner file.
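The naming and alignment rules for text-based files can be checked before upload. The following is a minimal sketch, not part of KantanMT: the helper names `write_parallel_files` and `check_alignment` are hypothetical, and the choice to drop pairs with an empty side is an assumption made for illustration.

```python
from pathlib import Path


def write_parallel_files(pairs, out_dir, stem="source.utf8"):
    """Write aligned (source, target) pairs as two UTF-8 files, one segment per line.

    `stem` selects the file name pattern, e.g. "source.utf8", "source.test"
    or "source.tune". This helper is hypothetical, not a KantanMT API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_lines, trg_lines = [], []
    for src, trg in pairs:
        # Collapse internal line breaks so each segment occupies exactly one line.
        src, trg = " ".join(src.split()), " ".join(trg.split())
        if not src or not trg:
            continue  # assumption: drop pairs where either side is empty
        src_lines.append(src)
        trg_lines.append(trg)
    src_path = out_dir / f"{stem}.src"
    trg_path = out_dir / f"{stem}.trg"
    src_path.write_text("".join(line + "\n" for line in src_lines), encoding="utf-8")
    trg_path.write_text("".join(line + "\n" for line in trg_lines), encoding="utf-8")
    return src_path, trg_path


def check_alignment(src_path, trg_path):
    """Return the shared segment count, or raise if the two files differ in length."""
    # Decoding as UTF-8 raises UnicodeDecodeError if a file uses another encoding.
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    trg = Path(trg_path).read_text(encoding="utf-8").splitlines()
    if len(src) != len(trg):
        raise ValueError(f"{len(src)} source vs {len(trg)} target segments")
    return len(src)
```

Calling `write_parallel_files(pairs, "upload", stem="source.tune")` would produce source.tune.src and source.tune.trg in the `upload` folder. Note that a matching line count only shows the files have the same length; it does not prove each line is a correct translation of its partner.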
Compression Formats
KantanMT supports two compression file formats for training data files.
- ZIP - ZIP is an archive file format that may contain one or more files or directories that may have been compressed.
- TAR.GZ/TGZ/GZ - These files are created by placing files in a TAR archive and then compressing the archive with Gzip. Compressed TAR files of this kind are called tarballs.
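Either archive type can be produced with Python's standard library. This is a minimal sketch under assumptions: the function names `make_zip` and `make_tarball` are hypothetical, and storing every file at the top level of the archive (rather than in sub-folders) is an illustrative choice, not a documented KantanMT requirement.

```python
import tarfile
import zipfile
from pathlib import Path


def make_zip(files, archive_path):
    """Bundle files into a ZIP archive, storing each under its bare file name."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in map(Path, files):
            # arcname drops the directory part so the archive has a flat layout.
            zf.write(f, arcname=f.name)
    return Path(archive_path)


def make_tarball(files, archive_path):
    """Bundle files into a gzip-compressed TAR archive (.tar.gz or .tgz)."""
    with tarfile.open(archive_path, "w:gz") as tf:
        for f in map(Path, files):
            tf.add(f, arcname=f.name)
    return Path(archive_path)
```

For example, `make_zip(["source.utf8.src", "source.utf8.trg"], "training.zip")` would package a pair of text-based training files into a single upload.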
Translation File Formats
- XLIFF - Standard XLIFF document format
- TTX - SDL TRADOS Tag format (XML version)
- SDL-XLIFF - SDL XLIFF format (TRADOS Studio 2011)
- TXML - Wordfast Translation File Format
- TMX - Translation Memory Exchange format
- EXP - Transplicity File Interchange Format
- XLZ - Idiom WorldServer Desktop Workbench Files
- MQXLZ - memoQ Bundle Files
- MQXLIFF - memoQ XLIFF File Format
- .sub.trg - Movie Subtitle File Format
- XLF - CAD file exports prepared in Muldrato
Documentation Formats
- DOCX - Microsoft Word Format
- PDF - Adobe PDF Format
- ODT - OpenOffice Document Format
- DITA - Standard DITA document format
- XML - Generic XML documents *
Desktop Publishing Formats
- INX - Adobe InDesign File Format
- IDML - Adobe IDML File Format
- XML - Adobe FrameMaker XML File Format
Web Based Formats
- HTML - Standard HTML documents used in the development of web content
- SVG - Scalable Vector Graphics files
Content Management Formats
- NovaDoc - Nova document format
- MonTag XML - Montag document format
- Arbortext XML - XML document with Arbortext markup
Misc. Formats
- TXT - Standard TXT file format (make sure the files are UTF-8 encoded!)
- XLSX - Microsoft Excel Formats
* XML files may require a custom GENTRY rule file that describes the contents of the file you would like translated.
GENTRY is our parsing technology. It is simple, flexible and customisable: if the default behaviour of KantanMT for a particular file format is not ideal, you can customise it to suit your needs.
Please log a support ticket if you require assistance in creating your custom GENTRY rule file.
For more details on supported file formats or our GENTRY parsing technology, please see the following links: