Character and word n-gram evaluation
The basic idea here is to use the chrF++ algorithm to evaluate against a known translation. Shifting the β value (where β=1 means precision [hypothesis overlap normalized by hypothesis length] and recall [hypothesis overlap normalized by reference length] are equally weighted) may be significant.
The problem here is that you always need a reference. So this might be useful for fine-tuning a model, or else just naively scoring a model, but exposing the non-overlap would be very interesting in terms of data an LLM can use, especially
How the chrF++ algorithm works
The chrF++ algorithm is a metric used in machine translation to evaluate the quality of translated text. It is an extension of the chrF metric, which is based on character n-gram precision and recall. The chrF++ algorithm adds word n-gram precision and recall to the mix, providing a more comprehensive evaluation of translation quality.
Let's break down how the chrF++ algorithm works using a toy example.
Suppose we have a reference sentence (the correct translation) and a hypothesis sentence (the translation produced by a machine translation system) as follows:
Reference: "The cat sat on the mat." Hypothesis: "A cat is sitting on the mat."
We'll use bigrams (2-grams) for simplicity in this example.
- Character n-gram precision and recall: - First, we calculate the character n-gram precision and recall. For bigrams, we look at pairs of characters. - Reference bigrams: {"Th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", "at", "t ", " o", "on", "n ", " t", "th", "he", "e ", " m", "ma", "at", "."} - Hypothesis bigrams: {"A ", " c", "ca", "at", "t ", " i", "is", "s ", " s", "si", "it", "tt", "ti", "in", "ng", "g ", " o", "on", "n ", " t", "th", "he", "e ", " m", "ma", "at", "."} - Precision (P) is the number of bigrams in the hypothesis that are also in the reference, divided by the total number of bigrams in the hypothesis. - Recall (R) is the number of bigrams in the hypothesis that are also in the reference, divided by the total number of bigrams in the reference. 
- Word n-gram precision and recall: - Next, we calculate the word n-gram precision and recall. For bigrams, we look at pairs of words. - Reference bigrams: {"The cat", "cat sat", "sat on", "on the", "the mat"} - Hypothesis bigrams: {"A cat", "cat is", "is sitting", "sitting on", "on the", "the mat"} - Again, precision (P) is the number of bigrams in the hypothesis that are also in the reference, divided by the total number of bigrams in the hypothesis. - Recall (R) is the number of bigrams in the hypothesis that are also in the reference, divided by the total number of bigrams in the reference. 
- Combine the scores: - The final chrF++ score is a combination of the character and word n-gram scores. The formula is as follows: - chrF++ = β * (1 + (P_char + R_char) / 2) * (1 + (P_word + R_word) / 2) - where β is a parameter that controls the balance between precision and recall. A common choice is β = 1, which gives equal weight to precision and recall. 
In this toy example, the numbers are made up, but in a real scenario, you would calculate the precision and recall values based on the actual overlap between the reference and hypothesis n-grams. The chrF++ score gives a measure of how similar the hypothesis is to the reference, taking into account both the character and word level information.
Last updated