Below we’ve summarized the key takeaways from an eBay presentation on lessons learned from evaluating machine translated content. Attendees learned best practices for evaluating MT and useful hints on how to improve the quality of machine-translated output.
Collect feedback from evaluators, but do not over-rely on it.
Verify how terminology was handled and whether instructions were followed, ask for the evaluator’s general impression, and check for issues with the test data. Typical test data issues are misalignments, too many duplicates, or test data that was already in your training data, which makes the test invalid. The general level of satisfaction matters because the evaluator is the end user, but this feedback has its limits: it is subjective. You shouldn’t rely on the evaluator’s feedback alone; you always have to compare it to empirical data. For instance:
Feedback vs Empirical data
- Test set must be varied (duplicate check).
- Test set must be representative (TM proportions check).
- Test data should not be in training data (overlap check; see the sketch after this list).
- Ask for concrete examples of bad MT segments.
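The first three checks lend themselves to simple automation. Below is a minimal sketch in Python, assuming plain-text files with one segment per line; the file names and thresholds are illustrative, not part of eBay’s actual tooling.

```python
# Minimal sketch of the test-set sanity checks above, assuming plain-text
# files with one segment per line. File names are hypothetical.

def load_segments(path):
    """Read one segment per line, stripping whitespace and empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def duplicate_ratio(segments):
    """Share of segments that are exact duplicates of an earlier one."""
    return 1 - len(set(segments)) / len(segments)

def overlap_ratio(test_segments, train_segments):
    """Share of test segments that also appear in the training data."""
    train_set = set(train_segments)
    hits = sum(1 for seg in test_segments if seg in train_set)
    return hits / len(test_segments)

if __name__ == "__main__":
    test = load_segments("test_set.txt")
    train = load_segments("training_data.txt")

    print(f"Duplicate ratio in test set: {duplicate_ratio(test):.1%}")
    print(f"Test/training overlap:       {overlap_ratio(test, train):.1%}")
    # If either number is high, the evaluation is not telling you much:
    # duplicates skew the scores, and overlap means you are testing on
    # material the engine has effectively memorized.
```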
Evaluate the MT quality and benchmark customized engines, but do not merge the two types of quality evaluations.
Evaluate adequacy and fluency. Adequacy covers the meaning; fluency refers to the style. You might have an engine that performs really well on Romance languages but not so well on Asian languages, or another that does well on Slavic languages but not so well on German. You should keep a mix of engines with different strengths for different types of content and language pairs. It is also important not to mix quality evaluation with benchmarking: mixing the two confuses evaluators, and evaluating several outputs at once is very time-consuming.
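One lightweight way to keep engines comparable across language pairs is to aggregate adequacy and fluency scores separately per engine and pair. A small sketch follows; the 1–5 scale and the sample ratings are assumptions for illustration, not eBay’s actual scoring sheet.

```python
# Aggregate adequacy/fluency ratings per engine and language pair.
# The scale (1-5) and the sample data are invented for illustration.
from collections import defaultdict
from statistics import mean

# (engine, language_pair, adequacy, fluency) as collected from evaluators
ratings = [
    ("engine_a", "en-es", 4, 5),
    ("engine_a", "en-ja", 3, 2),
    ("engine_b", "en-es", 3, 3),
    ("engine_b", "en-ja", 4, 4),
]

by_key = defaultdict(list)
for engine, pair, adequacy, fluency in ratings:
    by_key[(engine, pair)].append((adequacy, fluency))

for (engine, pair), scores in sorted(by_key.items()):
    avg_adequacy = mean(s[0] for s in scores)
    avg_fluency = mean(s[1] for s in scores)
    print(f"{engine} {pair}: adequacy {avg_adequacy:.1f}, fluency {avg_fluency:.1f}")
# Summaries like this make it easy to pick different engines for different
# language pairs without asking evaluators to rank several outputs at once.
```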
Choose evaluators with content expertise, but do not limit yourself to two evaluators to complete the evaluation faster; use at least three to avoid bias.
Evaluate terminology and TM compliance, and assess in what ways the MT is helpful to post-editors. Make sure the evaluation is reliable: it is possible for the output to be fluent but not accurate enough. Using at least three evaluators reduces MT scepticism, brings in varied feedback, and balances the results if one evaluator overedits.
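To see why the third evaluator matters in practice, you can average per-segment scores across evaluators and flag segments where they disagree strongly. A sketch with invented scores:

```python
# With three scores per segment you can average out a single outlier and
# flag segments where evaluators disagree. Scores below are made up.
from statistics import mean, stdev

segment_scores = {
    "seg_001": [4, 4, 5],
    "seg_002": [2, 5, 4],   # one sceptical outlier
    "seg_003": [1, 2, 2],
}

for seg_id, scores in segment_scores.items():
    avg = mean(scores)
    spread = stdev(scores)
    note = " <- review disagreement" if spread > 1.0 else ""
    print(f"{seg_id}: mean {avg:.1f}, spread {spread:.1f}{note}")
```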
A translation is no translation, he said, unless it will give you the music of a poem along with the words of it.
John Millington Synge
Look for patterns of overediting, but do not pressure evaluators to lower quality standards.
Using a third party to do a sanity check is one way to check for overediting. Another way to check automatically, without involving a third party, is to compute the edit distance between your MT and the existing human translation (your TM), and compare that result with the edit distance between your MT and the post-edited version. Normally the MT–PE distance should be the lower of the two, because the post-edit is based on the MT and should stay closer to it. If it is higher, that is a clear case of overediting: the post-editor has moved even further away from the MT than the human translation did, drifting into transcreation.
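Here is a rough sketch of that comparison using a plain character-level Levenshtein distance; the segments are placeholders, and in practice you would average the distances over the whole test set rather than judge single sentences.

```python
# Overediting check: compare edit distance of MT vs TM with MT vs PE.
# Segment texts are placeholders for illustration.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

mt = "You can return this item within 30 day."
tm = "You may return this item within 30 days."      # existing human translation
pe = "This item is returnable for a full month."     # heavily reworked post-edit

dist_mt_tm = levenshtein(mt, tm)
dist_mt_pe = levenshtein(mt, pe)

print(f"MT vs TM: {dist_mt_tm}, MT vs PE: {dist_mt_pe}")
if dist_mt_pe > dist_mt_tm:
    print("Possible overediting: the post-edit drifted further from the MT "
          "than the human translation did.")
```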
Let your evaluators and post-editors know not to overedit, but don’t pressure them to lower their quality standards. In the end, you don’t want results that merely look good; you want MT that really helps the translators and post-editors. With low-quality MT, understanding and post-editing the output takes longer than simply translating the content from scratch, especially if you have skilled and fast translators. When you machine translate user-facing content you need perfect style and fluency: the meaning might be correct while the style is off and feels machine translated, and that is not acceptable for user-facing content, so don’t pressure evaluators to ignore it.
Scoring translation error rate (TER) is also time-consuming: you score the initial test, which suffers from overediting, then score the next test, which will be quite similar, so you lose a lot of time on what is effectively the same test.
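The TER scores themselves can at least be computed automatically once you have the reference or post-edited segments. A minimal sketch using the open-source sacrebleu package is below; the segments are placeholders, and sacrebleu is our suggestion here, not necessarily the tooling used at eBay.

```python
# Minimal TER computation with sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import TER

mt_output = ["You can return this item within 30 day."]
references = ["You may return this item within 30 days."]

ter = TER()
result = ter.corpus_score(mt_output, [references])
print(result)  # prints something like "TER = 25.00"; lower means fewer edits
```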
Improve your engine’s vocabulary and terminology by adding new data, but do not focus on creating glossaries.
Runtime glossaries override MT translations that do not match the glossary, but this is not recommended, because it prevents the engine from learning by itself: the engine will keep making the same mistake, which you then have to override every time. It also hurts fluency, because the override does not take into account the context of the sentence around the term you are overriding.
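To make the fluency problem concrete, here is a toy post-processing substitution that mimics the effect of a runtime override. Real engines usually enforce glossary terms during decoding rather than via find-and-replace, but the underlying issue is the same: the forced term ignores the sentence around it. The glossary entry and sentences are invented.

```python
# Toy runtime-override simulation: blindly swap the engine's term for the
# glossary term. Glossary and sentences are invented (en -> es example).
import re

runtime_glossary = {"oferta": "puja"}  # engine's term -> preferred term

def apply_runtime_override(mt_sentence):
    """Blindly swap in the glossary term, the way a naive override would."""
    out = mt_sentence
    for engine_term, preferred_term in runtime_glossary.items():
        out = re.sub(rf"\b{re.escape(engine_term)}\b", preferred_term, out)
    return out

mt = "Haz una oferta por este artículo."
print(apply_runtime_override(mt))
# -> "Haz una puja por este artículo."
# It happens to work here, but the replacement knows nothing about gender,
# number, or idioms in other sentences, and the engine itself never learns
# the preferred term, so the same correction is needed on every job.
```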
Training glossaries, even extensive ones, are usually far smaller than the rest of the training data, such as translation memories.
Important facts
- Machine translation is cost-effective.
- Raw MT does not offer a quality guarantee.
- MT post-editing offers acceptable quality, but it is slow and translators dislike it.
- Technology augments humans rather than replacing them; it makes them more productive and efficient.
Translation costs are often the major factor behind localization budget decisions, and it is a fact that machine translation is cheaper than human translation. Even when linguists edit the output, they often leave the result too similar to what the MT engine originally proposed: the MT system makes a suggestion, they clean up the errors, but they can’t really produce the sentences that are their best work. Counterintuitively, MT post-editing can be slower than translating correctly the first time. Learn more
Have a question related to machine translation? No problem, we’re glad to help!