GitHub - Rohit2sali/MachineTranslationUsingTransformer: Deep learning translation model using transformer

This repository contains the code for a 100-million-parameter Transformer model for English-to-French machine translation, trained from scratch in PyTorch.

The model was trained on a large-scale corpus of 3 billion tokens and is publicly available on the Hugging Face Hub.

🚀 Highlights
Large-Scale Training: Trained on a 3-billion-token corpus aggregated from the ParaCrawl and MultiUN datasets.

Strong Performance: Achieves a BLEU score of 20.45 on unseen data from the OPUS-Wikipedia test set.

Open-Sourced: The pre-trained model is available on the Hugging Face Hub for anyone to use or fine-tune.

📊 Model Performance
The model was evaluated on two test sets it was not trained on to verify its generalization capabilities.

Test Set (Unseen Data)	BLEU Score	N-gram Precisions (1/2/3/4-gram, %)
📝 OPUS-Wikipedia	20.45	`90.6/41.9/13.3/3.4`
🌐 ParaCrawl	18.14	`95.8/52.2/9.1/2.4`
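
As a sanity check on the table above, the four slash-separated values in each row look like sacreBLEU-style 1-gram to 4-gram precisions. That reading is an assumption, since the README does not label them explicitly. Under it, BLEU is the brevity penalty multiplied by the geometric mean of the four precisions. The brevity penalty is not reported, so the sketch below assumes 1.0; the results should land within a few tenths of the reported scores, with the small gap plausibly due to rounding and the unreported penalty.

```python
import math


def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Combine n-gram precisions (in percent) into a BLEU score (in percent).

    brevity_penalty defaults to 1.0, which is an assumption: the table does
    not report it.
    """
    # Geometric mean, computed in log space for numerical stability.
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty * math.exp(log_mean)


# Precisions copied from the table above.
wiki_bleu = bleu_from_precisions([90.6, 41.9, 13.3, 3.4])
paracrawl_bleu = bleu_from_precisions([95.8, 52.2, 9.1, 2.4])

print(f"OPUS-Wikipedia: {wiki_bleu:.2f} (reported 20.45)")
print(f"ParaCrawl:      {paracrawl_bleu:.2f} (reported 18.14)")
```

The steep drop from 1-gram to 4-gram precision suggests the model usually picks correct words, while longer exact phrase matches against the reference are much rarer.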

⚙️ Fine-Tuning
This model serves as a strong baseline for English-to-French translation and can be fine-tuned on a more specific domain (e.g., legal or medical texts).

To fine-tune the model:

Prepare your dataset: Format your parallel corpus into source and target text files.

Download the pre-trained model: Get it from the Hugging Face Hub at https://huggingface.co/Rohit2sali/en-fr-translation-transformer-100M.

Update the data loader: Point the script to your custom dataset.
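
The first two steps above can be sketched as follows. This is a minimal illustration and not the repository's own code: `fetch_pretrained`, `pair_lines`, and `read_parallel` are hypothetical helper names, and the file names in the usage note are placeholders. The download uses `snapshot_download` from the `huggingface_hub` package, which must be installed separately. How the downloaded checkpoint is loaded into the repository's model classes is not described here, so that step is left out.

```python
from itertools import zip_longest

REPO_ID = "Rohit2sali/en-fr-translation-transformer-100M"


def fetch_pretrained(local_dir="pretrained"):
    """Download every file of the model repo and return the local folder path."""
    # Imported lazily so the rest of this sketch runs without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def pair_lines(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned iterables.

    Pairs with an empty side are skipped. A length mismatch raises ValueError,
    because a misaligned corpus would silently corrupt fine-tuning.
    """
    pairs = zip_longest(src_lines, tgt_lines)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            raise ValueError(f"source and target differ in length at line {line_no}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


def read_parallel(src_path, tgt_path):
    """Stream sentence pairs from a source file and a target file."""
    with open(src_path, encoding="utf-8") as src_file, \
            open(tgt_path, encoding="utf-8") as tgt_file:
        yield from pair_lines(src_file, tgt_file)
```

A call such as `read_parallel("train.en", "train.fr")` (placeholder file names) streams pairs lazily, so a large corpus never has to fit in memory before it reaches the data loader.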

The datasets I used for training can be downloaded from the following links:
https://object.pouta.csc.fi/OPUS-MultiUN/v1/moses/en-fr.txt.zip
https://object.pouta.csc.fi/OPUS-ParaCrawl/v1/moses/en-fr.txt.zip
https://object.pouta.csc.fi/OPUS-ParaCrawl/v4/moses/en-fr.txt.zip
https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/moses/en-fr.txt.zip

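All four archives share the file name `en-fr.txt.zip`, so saving them into one folder would overwrite earlier downloads. The sketch below is one way to avoid that, using only the standard library: it derives a distinct folder name from each URL path. The helper names `corpus_name` and `download_and_extract` and the `data` folder are illustrative assumptions, not part of the repository, and the names of the files inside each archive are not assumed.

```python
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import urlparse

CORPUS_URLS = [
    "https://object.pouta.csc.fi/OPUS-MultiUN/v1/moses/en-fr.txt.zip",
    "https://object.pouta.csc.fi/OPUS-ParaCrawl/v1/moses/en-fr.txt.zip",
    "https://object.pouta.csc.fi/OPUS-ParaCrawl/v4/moses/en-fr.txt.zip",
    "https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/moses/en-fr.txt.zip",
]


def corpus_name(url):
    """Build a unique folder name such as 'OPUS-ParaCrawl_v4' from the URL path."""
    # The URLs above follow the pattern /<corpus>/<version>/moses/en-fr.txt.zip
    parts = urlparse(url).path.strip("/").split("/")
    return "_".join(parts[:2])


def download_and_extract(url, dest_root):
    """Download one archive into its own folder and unpack it there."""
    target_dir = Path(dest_root) / corpus_name(url)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive_path = target_dir / "en-fr.txt.zip"
    if not archive_path.exists():  # skip the download on re-runs
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    return target_dir


# Usage (the archives are large, so this is left commented out):
# for url in CORPUS_URLS:
#     print(download_and_extract(url, "data"))
```
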
Repository files:
Attention.py
BLEUscore.py
LICENSE
README.md
decoder.py
embedding.py
encoder.py
get_data.py
tokenization.py
train.py
transfomer.py
