Improving engine quality - Build, Measure, Learn


This article discusses some of the most effective ways to improve the quality of your KantanMT engines, which will in turn improve the quality of the translations you get back from KantanMT.

There are a few things in relation to the improvement of MT engines that we should take note of:

  1. Improving quality in MT engines is an evolutionary process
    • What this means is that it will usually take a number of experiments with different combinations of training data to get the best quality out of an MT engine.
  2. The quality of MT engines is something that should be incrementally improved over time
    • It is a continuous improvement process, not a one-off effort.

At KantanMT we like to refer to this continuous, evolutionary improvement process as the Build, Measure, Learn process.

  • Build
    • The first step in the process is Build. This involves building your KantanMT engine.
    • The most important part of the Build step is the training data. The training data is what KantanMT uses to learn to translate. Simply put, the better the training data, the better the engine.
    • There are three main factors to consider in the training data:
      1. Quality of the training data: Crucially important. Good quality data is vital to building a good quality engine.
      2. Relevance to domain: Training data should be in the same domain as the content that is going to be translated through the engine.
      3. Quantity: Generally, the more the better. Consider the total word count but also the unique word count, as this gives an idea of language coverage.
    • When selecting training data it is important to strike a balance between these three factors. Sacrificing the quality of the data to gain more quantity will not necessarily result in a better engine overall. Similarly, mixing domains to increase quantity may not produce a better engine for domain-specific translations.
  • Measure
    • Once we have built our KantanMT engine we need to Measure its performance. This gives us a good idea of where we stand in terms of the quality of the engine.
    • We have selected a number of industry-standard automated measurements of MT engine quality, and we provide these for every build: BLEU, F-Measure and TER scores. NMT builds will also have a Perplexity score. These give you an idea of the quality of your engine for a specific build.
    • BuildAnalytics provides a comprehensive overview of the engine and includes features like Gap Analysis and the Rejects Report.
    • It is also useful to measure the quality of the actual translations that come from the MT engine. This can be done very easily by running a KantanAnalytics translation job, which returns the equivalent of a fuzzy match report that can be used to see the quality of the matches being found in the MT engine.
  • Learn
    • The final step in the process involves reviewing BuildAnalytics; here you will also find tips on how to improve engine quality.
    • Modifying training data
      • Add more high-quality, in-domain training data to improve the language coverage of your engine.
      • Review existing training data. This could mean removing low-quality or out-of-domain training data. Although removing data may seem a strange way to improve the quality of an engine, poor quality or out-of-domain data can sometimes drown out the high-quality, in-domain training data.
      • Add monolingual training data, which can help improve the quality of the translations.
    • You can also review sample translations to see if there are any improvements to be made, e.g. PEX rules.
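The total and unique word counts mentioned under the Build step are easy to compute for any corpus. The sketch below is plain Python; the segment lists and function name are illustrative examples, not part of the KantanMT platform. It shows how two corpora with similar total word counts can differ sharply in language coverage:

```python
from collections import Counter

def corpus_stats(segments):
    """Return (total word count, unique word count) for a list of segments."""
    counts = Counter()
    for segment in segments:
        counts.update(segment.lower().split())
    return sum(counts.values()), len(counts)

# Two small example corpora (hypothetical data for illustration).
corpus_a = ["the engine translates the text", "the text is translated"]
corpus_b = ["the engine translates the text", "the engine translates the text"]

total_a, unique_a = corpus_stats(corpus_a)   # (9, 6)
total_b, unique_b = corpus_stats(corpus_b)   # (10, 4)
# corpus_b has a similar total word count but far fewer unique words,
# so it adds less language coverage to an engine.
```

Comparing these two numbers before and after adding data is the same check recommended later in this article when a re-build's scores drop.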

Once the Build, Measure, Learn process is completed, there should be a good understanding of how the engine performs in terms of its quality and also where efforts should be focused to improve that quality.

After the process has been completed, the experimentation really kicks in. The process should now be repeated with some modifications to see if the quality can be improved, either by adding new training material or by reviewing the existing training material for data that is not having much of an impact.

The following are a few things to consider when experimenting with engine builds.

  • Unique Word Count
    • This gives an idea of the coverage of an engine. If you add more data in a re-build and the scores drop, it is worth comparing the unique word counts before and after adding the data to see if there is any change. If there is little increase in the number of unique words and the scores have decreased, it is very likely that the added data does not provide any additional language coverage; it is only adding noise to the engine, causing a drop in scores.
  • BLEU Score
    • If a BLEU score is low but the other scores are reasonably high, this is a sign that the engine is good at recalling translations for words, but not so good at arranging them correctly in a segment.
    • Engines with low BLEU scores but reasonable F-Measure and TER scores are usually very good candidates for the addition of in-domain mono-lingual data. This will usually help increase your BLEU score and improve the fluency of the translations produced by the engine.
  • F-Measure Score
    • A low F-Measure score is often a sign that the engine does not have enough training material to produce good translations. Engines with a low F-Measure score will usually benefit from an injection of high-quality training material.
  • KantanAnalytics
    • KantanAnalytics can provide great insights into the performance of an engine and where quality improvement efforts should be focused.
    • KantanAnalytics translation reports give a quality score for each translation. Using this before and after a build you can see how much the actual outputs from your engine have improved.
    • Kantan BuildAnalytics allows you to see detailed reports of your BLEU, F-Measure and TER scores at a segment level. With this you can see what segments scored well and what segments did not score well. Using this information you can identify common patterns where the engine supplies poor quality translations and then use this information to improve the training data in future builds.
    • Kantan BuildAnalytics also allows you to look at the training data rejects list. These are segments that we have rejected from your training data for one of a number of reasons. Here you can view the number of segments rejected and why they were rejected. Using this you can improve your data so that fewer segments are rejected during the build process.
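To make the BLEU and F-Measure patterns above concrete, here is a simplified sketch of the word-overlap arithmetic behind these kinds of metrics. This is an approximation for illustration only (the function names and example sentences are invented); KantanMT computes the real scores for you during a build:

```python
from collections import Counter

def f_measure(candidate, reference):
    """Harmonic mean of word precision and recall (unigram overlap)."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core ingredient of BLEU."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_ngrams = ngrams(candidate.split())
    if not cand_ngrams:
        return 0.0
    overlap = sum((cand_ngrams & ngrams(reference.split())).values())
    return overlap / sum(cand_ngrams.values())

reference = "the engine produces a good translation"
jumbled = "translation good a produces the engine"
```

Here the jumbled output contains every reference word, so the unigram F-Measure is perfect (1.0), but only one bigram survives, so the bigram precision is 0.2: exactly the low-BLEU, reasonable-F-Measure pattern described above that signals a word-order (fluency) problem.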
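TER can be illustrated in a similar spirit. The sketch below approximates it as word-level edit distance divided by reference length; real TER also counts block shifts as single edits, so treat this as a rough illustration (with invented example sentences) rather than the exact metric KantanMT reports:

```python
def ter_approx(candidate, reference):
    """Word-level edit distance divided by reference length (lower is better)."""
    cand, ref = candidate.split(), reference.split()
    # Classic dynamic-programming edit distance over words.
    prev = list(range(len(ref) + 1))
    for i, cw in enumerate(cand, start=1):
        curr = [i]
        for j, rw in enumerate(ref, start=1):
            cost = 0 if cw == rw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# A perfect match scores 0.0; "the engine fails" needs one substitution
# and one insertion against a four-word reference, scoring 0.5.
```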

After this process is completed, it is important to check the quality with native, qualified translators. Kantan LQR is a quality review tool built into the KantanMT platform which manages review projects and provides a platform where reviewers can work online and project managers can track progress as they go.
