Linguistic Anomaly Detection (LAD)

Linguistic Anomaly Detection (LAD) is a technique for scoring the similarity between source-source and target-target pairs, using compression-based, embedding-based, or traditional NLP methods.
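
As a concrete illustration, here is a minimal sketch of the compression-based variant, using normalized compression distance (NCD) over zlib-compressed text. The function names are hypothetical and not part of any Codex API.

```python
# Sketch: compression-based similarity for LAD (assumed approach, not a
# confirmed Codex implementation). Uses normalized compression distance.
import zlib

def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8")))

def ncd_similarity(a: str, b: str) -> float:
    """Similarity score derived from normalized compression distance.

    NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)), where C is
    compressed size. NCD is near 0 for highly similar texts and near 1
    for unrelated ones, so 1 - NCD serves as a similarity score.
    """
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return 1.0 - (cab - min(ca, cb)) / max(ca, cb)
```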

LAD compares the similarity of source texts with the similarity of their corresponding target texts. Rather than being scored against an external baseline, however, it has the essential feature of being relativized to the current state of the translation project: only the verses a human has already translated or edited serve as the gold standard.

Here's an idea for how LAD might work in a translation project (a code sketch follows the list):

  1. Suppose you have a source1-target1 pair that has been manually vetted and is considered the gold standard.

  2. You also have a source2-target2 pair that has not been vetted yet.

  3. Compare the similarity of source1 and source2. This could involve comparing compression results or similarity scores between the sources and between the targets, doing TF-IDF intersection scoring against your current project, or using some other method.

  4. If the similarity between source1 and source2 is 0.8, then the similarity between target1 and target2 should fall within one standard deviation of 0.8. (This is my suspicion, at least.)

  5. Provide scores or reports on estimated consistency: the above technique (or a variation thereof) could be used to score or report the estimated consistency of the unvetted target text (target2). This provides a quantitative measure of the translation's internal consistency and, by implication, should indicate something about its overall quality.
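
Putting the steps together, here is a hedged sketch of how such a consistency report might be computed. All names (GoldPair, lad_score) and the -0.2 threshold are illustrative assumptions, not Codex's actual implementation.

```python
# Sketch: scoring an unvetted pair (source2, target2) against vetted
# gold pairs, per steps 1-5 above. Assumed design, not a confirmed API.
import statistics
import zlib
from dataclasses import dataclass

def similarity(a: str, b: str) -> float:
    """1 minus normalized compression distance (see the earlier sketch)."""
    c = lambda t: len(zlib.compress(t.encode("utf-8")))
    ca, cb, cab = c(a), c(b), c(a + b)
    return 1.0 - (cab - min(ca, cb)) / max(ca, cb)

@dataclass
class GoldPair:
    source: str  # manually vetted source verse (source1)
    target: str  # manually vetted target verse (target1)

def lad_score(gold: list[GoldPair], source2: str, target2: str) -> dict:
    """Report how target-target similarity tracks source-source similarity.

    For each vetted pair, compute the source-source and target-target
    similarities and record the gap between them (step 4). A target that
    is consistently far less similar than its source would predict is
    flagged as a potential anomaly (step 5).
    """
    gaps = [
        similarity(pair.target, target2) - similarity(pair.source, source2)
        for pair in gold
    ]
    mean_gap = statistics.mean(gaps)   # requires at least one gold pair
    return {
        "mean_gap": mean_gap,
        "stdev_gap": statistics.pstdev(gaps),
        "anomalous": mean_gap < -0.2,  # threshold is a guess; tune per project
    }
```

The similarity function here could be swapped for embedding or TF-IDF similarity without changing the surrounding logic.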

See the cleaned-up post with additional thoughts.
