Intelligent Functions Library
We are planning to collaborate with SIL on a library of intelligent functions including:
Translation suggestion functions
Quality-checking functions
Basic checking functionality
This library will likely be a Lerna + yarn + TypeScript stack, installable as a simple JS package.
We might use some of the functionality from a proposed shared project by the Partnership for Applied Biblical NLP: bible-ai-endpoints-test
Simpler tasks
BLEU Scoring Endpoint: An endpoint that accepts a reference translation and a candidate translation and returns a BLEU score (via NLTK).
Romanizing Text Endpoint: An endpoint that converts text from a non-Latin script into a Latin-script (romanized) form. This would be particularly useful for languages with non-Latin scripts. See Ulf Hermjakob’s library uroman.
Keyword List Retrieval Endpoint: An endpoint that can extract the list of key content words or names from the text for further alignment tasks.
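To make the BLEU Scoring Endpoint above more concrete, here is a minimal TypeScript sketch of the computation such an endpoint would return: clipped n-gram precision up to 4-grams, combined with a brevity penalty. It assumes whitespace tokenization, a single reference, and no smoothing. The function names (tokenize, ngramCounts, sentenceBleu) are hypothetical and are not NLTK's API; an NLTK-backed endpoint would also offer smoothing and multiple references, so scores could differ on short sentences.

```typescript
// Illustrative sketch only: names are hypothetical, not NLTK's API.

/** Split on whitespace (assumption: no language-specific tokenization). */
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((tok) => tok.length > 0);
}

/** Count every n-gram of order n in a token list. */
function ngramCounts(tokens: string[], n: number): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts[gram] = (counts[gram] || 0) + 1;
  }
  return counts;
}

/** Unsmoothed BLEU for one candidate against one reference, uniform weights. */
function sentenceBleu(reference: string, candidate: string, maxN = 4): number {
  const ref = tokenize(reference);
  const cand = tokenize(candidate);
  if (ref.length === 0 || cand.length === 0) return 0;

  let logPrecisionSum = 0;
  for (let n = 1; n <= maxN; n++) {
    const candCounts = ngramCounts(cand, n);
    const refCounts = ngramCounts(ref, n);
    let clipped = 0;
    let total = 0;
    for (const gram of Object.keys(candCounts)) {
      // Clip each candidate n-gram count by its count in the reference.
      clipped += Math.min(candCounts[gram], refCounts[gram] || 0);
      total += candCounts[gram];
    }
    // Without smoothing, an order with no matches drives the score to 0.
    if (total === 0 || clipped === 0) return 0;
    logPrecisionSum += Math.log(clipped / total) / maxN;
  }

  // Brevity penalty: penalize candidates shorter than the reference.
  const brevityPenalty =
    cand.length >= ref.length ? 1 : Math.exp(1 - ref.length / cand.length);
  return brevityPenalty * Math.exp(logPrecisionSum);
}
```

With these assumptions, a candidate identical to a reference of four or more tokens scores 1, and a candidate that shares no words with the reference scores 0. Verse-length text often has no matching 4-grams, so a real endpoint would probably need smoothing.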
More complex tasks
FastAlign Endpoint: An endpoint that attempts to align two sentences using a simple, efficient model like FastAlign. This would likely require access to pre-trained models (do these exist, or do we need to build them?) or the ability to train models from provided datasets.
Pretrained models
Endpoints for specific language pairs/sets
Statistical Alignment Endpoint: An endpoint that takes two strings as input and attempts to statistically align them. It can use models such as IBM Models 1 to 4, HMM, etc. Are there tools in SIL Machine for this?
Input: parallel sentences
Function trains model
Output: word-by-word alignment
Syntax-Aware Alignment Endpoint: An endpoint that aligns considering the syntax of the languages. We could leverage MACULA data for source-text syntax chunking.
Multilingual Alignment Endpoint: An endpoint that can take multiple sentences in different languages and align them all together. We would need to specify the alignment method, or this could be an LLM-powered tool.
Preprocessing before alignment: merged vrefs in one language, etc.
E-Bible endpoints: what services could we expose relevant to this data?
Endpoint to get multiple versions for a given verse or range
Resolve versification between versions (using Paratext “original” versification?)
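The three-step flow listed under the Statistical Alignment Endpoint (parallel sentences in, model trained, word-by-word alignment out) can be sketched with IBM Model 1, the simplest of the models named there. This is a minimal illustration under stated assumptions: sentences arrive pre-tokenized, there is no NULL source word, and each target word links to its single most probable source word. The names SentencePair, TranslationTable, trainIbmModel1 and alignPair are hypothetical and do not come from SIL Machine or any other existing library.

```typescript
// Illustrative sketch only: types and function names are hypothetical.

type SentencePair = { source: string[]; target: string[] };

// table[targetWord][sourceWord] approximates P(targetWord | sourceWord).
type TranslationTable = Record<string, Record<string, number>>;

/** Train IBM Model 1 with expectation-maximization on parallel sentences. */
function trainIbmModel1(pairs: SentencePair[], iterations = 10): TranslationTable {
  const table: TranslationTable = Object.create(null);
  // Start uniform over every word pair that co-occurs in some sentence pair.
  for (const pair of pairs) {
    for (const tw of pair.target) {
      if (!table[tw]) table[tw] = Object.create(null);
      for (const sw of pair.source) table[tw][sw] = 1;
    }
  }

  for (let iter = 0; iter < iterations; iter++) {
    const count: TranslationTable = Object.create(null);
    const total: Record<string, number> = Object.create(null);

    // E-step: distribute each target word's mass over the source words.
    for (const pair of pairs) {
      for (const tw of pair.target) {
        let norm = 0;
        for (const sw of pair.source) norm += table[tw][sw];
        if (norm === 0) continue; // empty source sentence
        for (const sw of pair.source) {
          const fraction = table[tw][sw] / norm;
          if (!count[tw]) count[tw] = Object.create(null);
          count[tw][sw] = (count[tw][sw] || 0) + fraction;
          total[sw] = (total[sw] || 0) + fraction;
        }
      }
    }

    // M-step: re-estimate probabilities from the expected counts.
    for (const tw of Object.keys(count)) {
      for (const sw of Object.keys(count[tw])) {
        table[tw][sw] = count[tw][sw] / total[sw];
      }
    }
  }
  return table;
}

/** Link each target word to its most probable source word: [sourceIdx, targetIdx]. */
function alignPair(pair: SentencePair, table: TranslationTable): Array<[number, number]> {
  const links: Array<[number, number]> = [];
  for (let j = 0; j < pair.target.length; j++) {
    const row = table[pair.target[j]];
    let best = -1;
    let bestProb = 0;
    for (let i = 0; i < pair.source.length; i++) {
      const prob = (row && row[pair.source[i]]) || 0;
      if (prob > bestProb) {
        bestProb = prob;
        best = i;
      }
    }
    if (best >= 0) links.push([best, j]); // unseen target words stay unaligned
  }
  return links;
}
```

Training inside the request keeps the endpoint stateless, but it only works for small inputs. For larger corpora, a pretrained or cached table passed to alignPair would likely be the better design.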