Rogaton/coptic-translation-interface

title: Coptic Translation Interface
emoji: 🔮
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Coptic↔English translation + neural-symbolic parser

🔮 Coptic Translation & Parsing Interface

A comprehensive research tool for Coptic language analysis combining neural machine translation with neural-symbolic dependency parsing.

Overview

This interface provides three integrated tools for Coptic language research:

  1. Neural Machine Translation (Coptic ↔ English)
  2. Neural-Symbolic Dependency Parser (Stanza + Prolog)
  3. Grammatical Validation (Walter Till's grammar + Crum's lexicon)

Features

🔄 Translation (Coptic ↔ English)

  • Coptic → English: megalaa/coptic-english-translator
  • English → Coptic: megalaa/english-coptic-translator
  • Dialects: Sahidic (literary standard) and Bohairic (liturgical)
  • Virtual Keyboard: 31 Coptic Unicode characters
  • Example Corpus: Simple sentences, complex structures, and full texts
  • Models: Fine-tuned MarianMT on 50,000+ CopticScriptorium parallel sentences

📊 Dependency Parsing (Neural-Symbolic Hybrid)

  • Neural Layer:

    • Stanza NLP pipeline for Coptic
    • Tokenization, POS tagging, lemmatization
    • DiaParser for dependency trees
  • Symbolic Layer:

    • Prolog implementation of Walter Till's Coptic Grammar (1955)
    • Integration with Crum's Coptic Dictionary (1939)
    • Grammatical pattern detection (tripartite sentences, etc.)
    • Error detection for neural parser hallucinations
  • Export: CoNLL-U format for corpus linguistics research

🔍 Grammatical Validation

  • Detects grammatical patterns from Walter Till's grammar
  • Validates dependency structures against linguistic rules
  • Identifies common parsing errors and hallucinations
  • Provides grammatical warnings and suggestions
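
As a rough illustration of surface-level pattern detection, the sketch below flags nominal-sentence copulas (ⲡⲉ, ⲧⲉ, ⲛⲉ) that have at least one token on each side, as in ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ. It is a deliberately naive Python heuristic and not one of the project's Prolog rules; the function name `find_medial_copulas` is hypothetical.

```python
# Copula forms of the Coptic nominal sentence: masculine, feminine, plural.
COPULAS = {"ⲡⲉ", "ⲧⲉ", "ⲛⲉ"}


def find_medial_copulas(tokens):
    """Return indices of copulas that have at least one token on each side.

    Naive surface heuristic for illustration only (hypothetical helper,
    not taken from the project's grammar rules).
    """
    return [
        i for i, tok in enumerate(tokens)
        if tok in COPULAS and 0 < i < len(tokens) - 1
    ]


tokens = "ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ".split()
print(find_medial_copulas(tokens))  # [1]
```

A real rule set would work on the parsed tree rather than on whitespace tokens, since Coptic morphemes are often bound together in writing.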

Installation

Prerequisites

  • Python 3.10+
  • SWI-Prolog 8.0+ (for Prolog validation)
  • Docker (for containerized deployment)
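
Before installing, it may help to confirm which prerequisites are present. The following is a minimal sketch, assuming the SWI-Prolog and Docker executables are named `swipl` and `docker` and are on `PATH`; the helper name `check_prerequisites` is hypothetical, and the check does not verify the SWI-Prolog version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report which prerequisites appear to be available on this machine."""
    return {
        # The README asks for Python 3.10 or newer.
        "python": sys.version_info[:2] >= min_python,
        # Assumption: SWI-Prolog is installed as `swipl` and is on PATH.
        "swi_prolog": shutil.which("swipl") is not None,
        # Docker is only needed for the containerized deployment.
        "docker": shutil.which("docker") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```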

Local Installation

# Clone the repository
git clone https://github.com/Rogaton/coptic-translation-interface.git
cd coptic-translation-interface

# Install dependencies
pip install -r requirements.txt

# Download Stanza Coptic models
python -c "import stanza; stanza.download('cop')"

# Run the interface
python app.py

Docker Deployment

# Build the Docker image
docker build -t coptic-interface .

# Run the container
docker run -p 7860:7860 coptic-interface

Usage

Web Interface

Access the interface at http://localhost:7860 with three tabs:

  1. Coptic → English: Translate Coptic text to English
  2. English → Coptic: Translate English text to Coptic
  3. Dependency Analysis: Parse Coptic text with neural-symbolic validation

Example: Translation

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model
tokenizer = AutoTokenizer.from_pretrained("megalaa/coptic-english-translator")
model = AutoModelForSeq2SeqLM.from_pretrained("megalaa/coptic-english-translator")

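# Translate
# Note: "Translation Models" below lists Greek transcription and a dialect tag
# as preprocessing steps; apply them first if the model expects them.
coptic_text = "ⲡϫⲟⲉⲓⲥ ⲡⲉ ⲡⲁⲛⲟⲩⲧⲉ"
inputs = tokenizer(coptic_text, return_tensors="pt")
outputs = model.generate(**inputs)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Expected output: "The Lord is my God"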

Example: Dependency Parsing

from coptic_parser_core import CopticParserCore

# Initialize parser
parser = CopticParserCore()
parser.load_parser()

# Parse text
result = parser.parse_text("ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ")

# Export to CoNLL-U
conllu = parser.format_conllu(result)
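
For readers unfamiliar with CoNLL-U, the sketch below shows the shape of the format: ten tab-separated columns per token, `_` for empty fields, and a blank line after each sentence. It does not reproduce `format_conllu`; the token dictionaries and the `to_conllu` helper are hypothetical, and the head and relation values are illustrative rather than a verified analysis.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def to_conllu(tokens, text=None):
    """Serialize a list of token dicts as one CoNLL-U sentence block."""
    lines = []
    if text is not None:
        lines.append(f"# text = {text}")
    for tok in tokens:
        # Missing or empty fields are written as "_"; a head of 0 is kept.
        cols = [
            str(tok[f]) if tok.get(f) not in (None, "") else "_"
            for f in CONLLU_FIELDS
        ]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"


# Illustrative annotations only, not a gold-standard parse.
sentence = [
    {"id": 1, "form": "ⲁⲛⲟⲕ", "head": 3, "deprel": "nsubj"},
    {"id": 2, "form": "ⲡⲉ", "head": 3, "deprel": "cop"},
    {"id": 3, "form": "ⲡⲛⲟⲩⲧⲉ", "head": 0, "deprel": "root"},
]
print(to_conllu(sentence, text="ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ"), end="")
```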

Architecture

Neural-Symbolic Hybrid

Input Text
    ↓
[Neural Layer]
    ↓
Stanza Pipeline → Tokenization, POS, Lemmatization
    ↓
DiaParser → Dependency Trees
    ↓
[Symbolic Layer]
    ↓
Prolog Rules → Grammatical Validation
    ↓
Till Grammar + Crum Lexicon → Error Detection
    ↓
Output: Validated Parse + Warnings
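
The two-stage flow above can be sketched in Python as a neural step followed by a rule-based check. This is a minimal sketch under stated assumptions: `neural_parse` is a placeholder that stands in for Stanza and DiaParser, and `validate_tree` applies generic tree well-formedness checks in place of the Prolog rules. None of these names come from the project.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int      # 1-based position in the sentence
    form: str
    head: int    # 0 marks the root
    deprel: str


def neural_parse(text):
    """Stand-in for the neural layer: attaches every token to the first one."""
    return [
        Token(i, form, 0 if i == 1 else 1, "root" if i == 1 else "dep")
        for i, form in enumerate(text.split(), start=1)
    ]


def validate_tree(tokens):
    """Stand-in for the symbolic layer: generic well-formedness checks."""
    warnings = []
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        warnings.append(f"expected exactly one root, found {len(roots)}")
    for t in tokens:
        if t.head == t.id:
            warnings.append(f"token {t.id} is its own head")
        elif not 0 <= t.head <= len(tokens):
            warnings.append(f"token {t.id} has out-of-range head {t.head}")
    return warnings


def analyze(text):
    """Run both layers and return the parse together with any warnings."""
    tokens = neural_parse(text)
    return {"tokens": tokens, "warnings": validate_tree(tokens)}


print(analyze("ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ")["warnings"])  # []
```

Keeping the validator separate from the parser means the rules can report problems without altering the neural output, which matches the "Validated Parse + Warnings" output in the diagram.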

Translation Models

Both translation models use:

  • Architecture: MarianMT (Seq2Seq Transformer)
  • Training Data: CopticScriptorium parallel corpus (50,000+ sentences)
  • Preprocessing: Coptic Unicode → Greek transcription
  • Dialect Tags: Cyrillic markers (з for Sahidic, б for Bohairic)

Data

Test Corpus

The interface includes coptic_test_corpus.json with:

  • Simple Sentences: 10+ examples (Sahidic & Bohairic)
  • Complex Sentences: 5+ examples with subordination
  • Full Texts: Biblical narratives and parables
  • Grammar Patterns: Tripartite nominals, perfect tense, etc.

Linguistic Resources

  • Walter Till's Grammar (coptic_grammar.pl): 700+ Prolog rules
  • Crum's Dictionary (coptic_lexicon.pl): 12,000+ lexical entries
  • Stanza Models: Pre-trained on CopticScriptorium Universal Dependencies

Research Applications

  • Corpus Linguistics: CoNLL-U export for quantitative analysis
  • Digital Humanities: Automated parsing of Coptic manuscripts
  • Language Learning: Interactive translation with grammatical feedback
  • Computational Linguistics: Neural-symbolic architecture research
  • Egyptology: Analysis of Coptic Biblical and documentary texts

Performance

  • Translation Quality: BLEU score ~35-40 (Coptic→English)
  • Parsing Accuracy: UAS ~85%, LAS ~80% on CopticScriptorium test set
  • Prolog Validation: Detects 70%+ of common parsing errors
  • Inference Speed: ~0.5s per sentence (translation), ~2s (parsing with validation)
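
For reference, UAS is the share of tokens whose predicted head matches the gold head, and LAS additionally requires the dependency label to match. A minimal sketch of both metrics follows; the helper `attachment_scores` and the toy data are hypothetical and unrelated to the figures reported above.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) for aligned lists of (head, deprel) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts matching heads; LAS counts matching head and label together.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)


# Toy data: one wrong label (token 2) and one wrong head (token 4).
gold = [(3, "nsubj"), (3, "cop"), (0, "root"), (3, "obl")]
pred = [(3, "nsubj"), (3, "dep"), (0, "root"), (1, "obl")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```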

Citation

If you use this interface in your research, please cite:

@software{linden2025coptic,
  author = {Linden, André},
  title = {Coptic Translation and Parsing Interface: A Neural-Symbolic Approach},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.19487216},
  url = {https://doi.org/10.5281/zenodo.19487216},
  version = {1.0.1}
}

Related Publications

  • Enis, M. & Megalaa, A. (2024). Ancient voices, modern technology: Low-resource neural machine translation for Coptic texts.
  • Till, W. C. (1955). Koptische Grammatik (Saïdischer Dialekt). Leipzig: VEB Verlag Enzyklopädie.
  • Crum, W. E. (1939). A Coptic Dictionary. Oxford: Clarendon Press.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Components

  • Translation Models: megalaa models (various licenses)
  • Stanza: Apache License 2.0
  • Prolog Rules: CC BY-NC-SA 4.0 (based on Till's grammar)
  • Lexicon: Public domain (Crum's dictionary)

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Commit your changes (git commit -m 'Add new feature')
  4. Push to the branch (git push origin feature/improvement)
  5. Open a Pull Request

Areas for Contribution

  • Additional dialect support (Akhmimic, Lycopolitan, Fayyumic)
  • Improved Prolog rules for complex constructions
  • Enhanced error detection algorithms
  • Extended test corpus with documentary texts
  • Performance optimizations

Acknowledgments

  • CopticScriptorium for the parallel corpus and UD annotations
  • Amir Zeldes and team for Coptic NLP resources
  • Stanford NLP Group for Stanza
  • megalaa for the translation models
  • Walter Till for the foundational Coptic grammar
  • W. E. Crum for the comprehensive dictionary

Contact

André Linden

Version History

  • v1.0.1 (2025-04-09): Zenodo archive release
    • DOI: 10.5281/zenodo.19487216
    • Updated documentation with contact information
    • All features from v1.0.0
  • v1.0.0 (2025-04-09): Initial release
    • Neural machine translation (Coptic ↔ English)
    • Neural-symbolic dependency parser
    • Prolog grammatical validation
    • Web interface with examples

Built with ❤️ for Coptic language research and digital humanities
