So, you have created your engine on KantanMT. Before launching the build, you need to upload your training data. You can do this by dragging and dropping your files into the 'Training' tab.
What kind of data do you need to train your engine?
- Bilingual data
  - TMX, TXML, XLSX, etc.
- Monolingual data (optional)
  - Always in the target language
  - DOCX, PDF, TXT
- Glossaries
  - Terminology databases (TBX, XLSX)
- KantanLibrary
  - Available for over 200 language combinations
  - Particularly useful if you don't have a large amount of data
You can find out more about the supported file formats here.
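As a rough illustration of the categories listed above, the sketch below sorts local files by extension before you upload them. It is not part of KantanMT: the function names and role labels are made up for this example, the extension map covers only the formats named in the list (which ends in "etc."), and an extension alone cannot tell whether an XLSX file is bilingual data or a glossary.

```python
from pathlib import Path

# Extensions named in the list above. The list is not exhaustive, so this map
# is deliberately incomplete. The role labels are illustrative, not KantanMT terms.
FORMAT_ROLES = {
    ".tmx": {"bilingual"},
    ".txml": {"bilingual"},
    ".xlsx": {"bilingual", "glossary"},  # ambiguous: depends on the file's content
    ".docx": {"monolingual"},
    ".pdf": {"monolingual"},
    ".txt": {"monolingual"},
    ".tbx": {"glossary"},
}


def possible_roles(filename: str) -> set[str]:
    """Return the training-data roles a file could play, judged by extension only."""
    return FORMAT_ROLES.get(Path(filename).suffix.lower(), set())


def has_bilingual_data(filenames: list[str]) -> bool:
    """True if at least one file could serve as bilingual data (the one required input)."""
    return any("bilingual" in possible_roles(name) for name in filenames)
```

A quick check such as `has_bilingual_data(["terms.tbx", "corpus.txt"])` returns `False`, which flags that the one required input, bilingual data, is still missing.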
While the only file you absolutely need to build your engine is your bilingual data, it's always recommended to add at least a glossary to achieve the best possible translation results.
For the best performance, your engine's unique word count should be at least 500k words for an SMT engine and 900k words for an NMT engine. If your bilingual data is not sufficient, add domain-specific KantanLibrary stock data to your training data to supplement it.
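To get a feel for where your data stands against these thresholds, here is a minimal sketch that estimates a unique word count from a collection of segments. How KantanMT itself counts unique words is not described here, so the tokenization (lowercased runs of word characters) is an assumption, and the function names are hypothetical. Treat the result as a rough estimate, not the figure the platform will report.

```python
import re

# Minimum unique word counts quoted in the text, per engine type.
MIN_UNIQUE_WORDS = {"SMT": 500_000, "NMT": 900_000}


def unique_word_count(segments):
    """Count distinct word tokens across an iterable of text segments.

    Assumption: a "word" is a lowercased run of word characters. The
    platform's own counting method may differ.
    """
    vocabulary = set()
    for segment in segments:
        vocabulary.update(re.findall(r"\w+", segment.lower()))
    return len(vocabulary)


def shortfall(count, engine_type):
    """Return how many unique words are missing for the given engine type (0 if none)."""
    return max(0, MIN_UNIQUE_WORDS[engine_type] - count)
```

For example, `shortfall(unique_word_count(my_segments), "NMT")` returning a positive number suggests that adding KantanLibrary stock data is worth considering; `my_segments` here stands for whatever list of sentences you extract from your own files.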