Few-shot evaluation
Keeping things simple, and sticking with only the resources at hand, what if the evaluation stage involved the following:
1. Take the draft translation and find the translation pairs most similar to the source+prediction pair.
2. Evaluate whether any of the words seem misused. One option is a rules-based matcher that flags any tokens in the source+prediction pair that occur on only one side (either the source or the target) of the similar examples; see the sketch after this list.
3. Check whether any words seem to be mistranslated relative to the gold-standard examples. POS-aligning those examples would probably help here.
4. Pass the task back to the translation bot with special notes about any seemingly mistranslated words (a prompt sketch follows the summary line below).
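To make the retrieval and matcher steps concrete, here is a minimal sketch. The placeholder choices are mine, not part of the original idea: the parallel corpus is a plain list of (source, target) string pairs, similarity is simple token overlap (Jaccard), and whitespace tokenisation is assumed to be good enough for a first pass; a real setup might swap in embeddings and proper tokenisation.

```python
# Minimal sketch of steps 1-2: retrieve similar pairs, then flag suspect tokens.
# Assumptions (placeholders, not from the original notes): Jaccard similarity
# over whitespace tokens, and a corpus held as a list of (source, target) pairs.

def tokens(text: str) -> set[str]:
    """Crude lowercased whitespace tokenisation."""
    return set(text.lower().split())

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def most_similar_pairs(source, prediction, corpus, k=5):
    """Return the k corpus pairs closest to the source+prediction pair."""
    def score(pair):
        src, tgt = pair
        return jaccard(source, src) + jaccard(prediction, tgt)
    return sorted(corpus, key=score, reverse=True)[:k]

def flag_suspect_tokens(source, prediction, similar_pairs):
    """Flag tokens from the source+prediction pair that occur on only one side
    (source OR target) of the retrieved examples -- a rough mistranslation signal."""
    seen_src, seen_tgt = set(), set()
    for src, tgt in similar_pairs:
        seen_src |= tokens(src)
        seen_tgt |= tokens(tgt)
    suspects = []
    for tok in tokens(source) | tokens(prediction):
        in_src, in_tgt = tok in seen_src, tok in seen_tgt
        if in_src != in_tgt:  # seen on exactly one side of the similar examples
            side = "source" if in_src else "target"
            suspects.append((tok, f"only seen on the {side} side of the similar examples"))
    return suspects
```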
Essentially: [ most similar examples ] --> [ prediction ] --> [ most similar examples ], i.e. the retrieved examples both prompt the draft and then serve as the reference set for checking it.
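And a correspondingly minimal sketch of the final step, handing the draft back with notes. The prompt wording and the `call_translator` function are hypothetical placeholders for whatever translation bot is actually in use.

```python
def build_retry_prompt(source, draft, suspects, similar_pairs):
    """Assemble a follow-up prompt: reference pairs, the draft, and the flagged words."""
    examples = "\n".join(f"{src} => {tgt}" for src, tgt in similar_pairs)
    notes = "\n".join(f"- '{tok}': {reason}" for tok, reason in suspects)
    return (
        "Reference translations:\n"
        f"{examples}\n\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n\n"
        "These words may be mistranslated:\n"
        f"{notes}\n\n"
        "Please revise the draft, paying particular attention to the flagged words."
    )

# Hypothetical usage -- `call_translator` stands in for the actual translation bot:
# revised = call_translator(build_retry_prompt(source, draft, suspects, similar))
```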