Few-shot evaluation
Keeping things simple and sticking with only the resources at hand, what if the evaluation stage involved the following:
- Take the draft translation and retrieve the gold translation pairs most similar to the source + prediction pair
- Try to evaluate whether any of the words seem misused. Perhaps a rules-based matcher could flag any tokens in the source + prediction pair that occur in EITHER only the source OR only the target of the similar examples
- It might also be worth examining whether any words seem mistranslated relative to the gold-standard examples
- POS-aligning these examples would probably help
 
- Pass the task back to the translate bot with special notes about any seemingly mistranslated words. 
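As a rough sketch of the rules-based matcher idea above — everything here (the whitespace tokenization, the one-sided-overlap heuristic, the function and variable names) is an assumption for illustration, not a fixed design:

```python
def tokens(text):
    """Crude lowercase whitespace tokenization (a stand-in for a real tokenizer)."""
    return set(text.lower().split())

def flag_suspect_tokens(source, prediction, examples):
    """Flag tokens from the source + prediction pair that appear on only one
    side of the similar gold examples -- a crude signal of possible misuse.

    `examples` is a list of (example_source, example_target) pairs.
    """
    src_side = set().union(*(tokens(s) for s, _ in examples))
    tgt_side = set().union(*(tokens(t) for _, t in examples))
    return {
        # Prediction tokens seen only on the SOURCE side of the examples
        # may be untranslated copies.
        "prediction": sorted(tokens(prediction) & (src_side - tgt_side)),
        # Source tokens seen only on the TARGET side are similarly odd.
        "source": sorted(tokens(source) & (tgt_side - src_side)),
    }

# Toy English->Spanish gold pairs: "red" survives untranslated in the draft.
examples = [("the red house", "la casa roja"), ("a red car", "un coche rojo")]
flags = flag_suspect_tokens("the red door", "la puerta red", examples)
# flags["prediction"] == ["red"]
```

The flagged words would then become the "special notes" handed back to the translate bot on the second pass.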
Essentially: [ most similar examples ] --> [ prediction ] --> [ check against most similar examples ] --> [ revised prediction ]