This repository documents the source code for training and evaluating MIRACLE, the first German embedding model trained specifically on millions of Q&A pairs and 400,000 real-world German medical documents. The model is designed to produce high-quality embeddings tailored to the complexities of medical language, supporting use cases such as clinical data analysis, medical information retrieval, and healthcare research. If you are interested in using the model, you can try it out on Hugging Face: https://huggingface.co/SHIPAI-IKIM/miracle-german
Medical language is intricate and specialized, with jargon, abbreviations, and terminology that differ from everyday language. General-purpose language models often struggle to understand the nuances of medical text, especially in languages like German. MIRACLE addresses this gap by being the first embedding model created specifically for German medical contexts.
Here's why a specialized model like MIRACLE is so important:
- Accuracy: General models may miss crucial medical details, whereas MIRACLE is trained to capture the complexities of diagnoses, treatments, and medical outcomes.
- Efficiency: With the ability to process large datasets of medical records, MIRACLE can improve the speed and accuracy of medical information retrieval systems.
- Trustworthiness: Trained on real-world German medical documents, this model ensures better contextual understanding and reliability when working with sensitive health-related data.
Whether you work in healthcare or research, or are developing applications in the medical field, MIRACLE can significantly improve the quality and relevance of your insights.
```
MIRACLE/
├── main.py                              # Entry point
├── pyproject.toml                       # Project dependencies
├── miracle/
│   ├── training/
│   │   ├── augment_training_data.py     # Generate Q&A pairs from medical documents
│   │   ├── preprocess_training_data.py  # Preprocess and split training data
│   │   └── training.py                  # Fine-tune embedding model
│   └── evaluation/
│       ├── irr/                         # Information Retrieval Ranking evaluation
│       │   ├── BootstrapRetrievalEvaluator.py  # Custom evaluator with bootstrap metrics
│       │   ├── irr_eval.py              # Run IR evaluation pipeline
│       │   └── find_duplicates.py       # Find duplicate entries in dataset
│       └── rag/                         # RAG evaluation pipeline
│           ├── ingest.py                # Ingest documents into vector database
│           ├── retrieve_generate.py     # Retrieve docs and generate answers
│           └── evaluate.py              # Evaluate RAG outputs (BLEURT, ROUGE, BERTScore)
└── images/
```
- Python >= 3.13
- Clone the repository:

```bash
git clone https://github.com/your-org/MIRACLE.git
cd MIRACLE
```

- Install dependencies using uv:

```bash
uv sync
```

This will create a virtual environment and install all dependencies. To activate the environment:

```bash
source .venv/bin/activate
```

The project uses the following main dependencies:

- `sentence-transformers` - For training and using embedding models
- `langchain` / `langchain-community` - For the RAG pipeline and document processing
- `datasets` - Hugging Face datasets library
- `wandb` - Experiment tracking
- `scikit-learn` - For data splitting and metrics
- `pandas` / `numpy` - Data manipulation
- `aiohttp` - Async HTTP requests for data augmentation
Create a `.env` file in the project root with the following environment variables:

```
# Data paths
DATA_PICKLE_PATH=/path/to/your/dataset.pkl
TRAIN_DATASET=/path/to/training/data
METADATA_DF=/path/to/patient_metadata.pkl
DATA_OUTPUT_PATH=/path/to/output/

# Training parameters
TRAIN_MODEL=your-base-model-name
BATCH_SIZE=32
MAX_SEQ_LENGTH=512
EPOCHS=3
TRAIN_TEST_SPLIT=0.8

# Data augmentation
CHUNK_SIZE=512
CHUNK_OVERLAP=50
FIRST_PATH=/path/to/documents/folder1
SECOND_PATH=/path/to/documents/folder2
THIRD_PATH=/path/to/documents/folder3

# Weights & Biases
USERNAME=your-wandb-username

# Embedding model
EMBEDDINGS_MODEL_NAME=your-fine-tuned-model
MODEL_URL=http://your-model-endpoint

# Vector database
CONNECTION_STRING=postgresql://user:pass@host:port/db

# RAG evaluation
ENDPOINT_URL=http://your-llm-endpoint
MODEL_N_CTX=4096
TARGET_SOURCE_CHUNKS=4

# IRR evaluation
IRR_DATASET=/path/to/irr_dataset.csv
OUTPUT_PATH=/path/to/evaluation/output
FILE_OUTPUT_NAME=evaluation_results
```
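These values are read at runtime. A minimal sketch of loading them, assuming `python-dotenv` (not part of the dependency list above) or an equivalent mechanism:

```python
# Hedged sketch: assumes python-dotenv; the repository's scripts may load .env differently
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

batch_size = int(os.environ["BATCH_SIZE"])  # e.g. 32
train_model = os.environ["TRAIN_MODEL"]     # base model to fine-tune
```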
The training pipeline consists of three main steps:
Generate question-answer pairs from medical documents using an LLM:
```bash
python miracle/training/augment_training_data.py
```

This script:

- Loads medical documents from specified directories
- Splits documents into chunks using `RecursiveCharacterTextSplitter` (see the sketch after this list)
- Generates 5 medically relevant questions per chunk using an LLM
- Saves the augmented dataset to CSV
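As a rough illustration of the chunking step, a minimal sketch using LangChain's text splitter; the document text is a placeholder, and the parameter values mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` settings above rather than the repository's actual code:

```python
# Hedged sketch of the chunking step; parameters mirror CHUNK_SIZE / CHUNK_OVERLAP above
from langchain.text_splitter import RecursiveCharacterTextSplitter

document_text = "Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."  # placeholder

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # list of overlapping text chunks

# Each chunk would then be sent to an LLM to generate 5 medically relevant questions
```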
Preprocess and split the augmented data for training:
```bash
python miracle/training/preprocess_training_data.py
```

This script:

- Filters invalid/dirty rows (length constraints, proper formatting)
- Adds `query:` and `passage:` prefixes for embedding training
- Performs a patient-aware train/test split using `GroupShuffleSplit` (sketched after this list)
- Saves the preprocessed dataset as a pickle
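A patient-aware split keeps all records of one patient on the same side of the split, preventing leakage between train and test sets. A minimal sketch with hypothetical data (the `patient_id` column name is an assumption):

```python
# Hedged sketch of a patient-aware split; column names and data are hypothetical
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "query":      ["query: q1", "query: q2", "query: q3", "query: q4"],
    "passage":    ["passage: p1", "passage: p2", "passage: p3", "passage: p4"],
    "patient_id": [1, 1, 2, 3],  # grouping key: all rows of a patient stay together
})

# train_size mirrors TRAIN_TEST_SPLIT=0.8 from the .env example above
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```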
Fine-tune a sentence transformer model:
```bash
python miracle/training/training.py
```

This script:

- Loads a base SentenceTransformer model
- Uses `CachedMultipleNegativesRankingLoss` for contrastive learning (see the sketch after this list)
- Trains with configurable batch size, epochs, and learning rate
- Logs metrics to Weights & Biases
- Saves the fine-tuned model
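A minimal sketch of what such a fine-tuning run can look like with the classic `sentence-transformers` fit API; the base model name and data are placeholders, and the actual script is configured through the `.env` variables above:

```python
# Hedged sketch of contrastive fine-tuning; model name and data are placeholders
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("your-base-model-name")  # TRAIN_MODEL from .env

train_examples = [
    InputExample(texts=[
        "query: Was sind die Symptome von Diabetes?",
        "passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker...",
    ]),
    # ... millions of Q&A pairs in the real dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)  # BATCH_SIZE

# In-batch negatives with gradient caching, allowing large effective batch sizes
train_loss = losses.CachedMultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)  # EPOCHS
model.save("miracle-finetuned")
```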
Evaluate embedding model retrieval performance:
```bash
python miracle/evaluation/irr/irr_eval.py
```

Metrics computed:

- Accuracy@k
- Precision@k / Recall@k
- MRR@k (Mean Reciprocal Rank)
- NDCG@k (Normalized Discounted Cumulative Gain)
- MAP@k (Mean Average Precision)
The custom `BootstrapRetrievalEvaluator` extends the Sentence Transformers evaluator to provide raw per-query metrics for statistical analysis.
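For context, a toy example of the stock `InformationRetrievalEvaluator` that such a custom evaluator would typically build on (the exact base class used in this repository is an assumption):

```python
# Hedged sketch using the stock evaluator; the custom BootstrapRetrievalEvaluator
# additionally exposes per-query scores so confidence intervals can be bootstrapped
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("path/to/your/trained/model")

queries = {"q1": "query: Was sind die Symptome von Diabetes?"}
corpus = {"d1": "passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."}
relevant_docs = {"q1": {"d1"}}  # ground-truth mapping from query to relevant passages

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="irr")
results = evaluator(model)  # computes Accuracy@k, MRR@k, NDCG@k, MAP@k, ...
```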
- Ingest documents into the vector database:

```bash
python miracle/evaluation/rag/ingest.py
```

- Run retrieval and generation (a retrieval sketch follows this list):
```bash
python miracle/evaluation/rag/retrieve_generate.py
```

- Evaluate generated answers:
```bash
python miracle/evaluation/rag/evaluate.py
```
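Under the hood, retrieval plugs the fine-tuned embeddings into the vector store. A minimal sketch of the retrieval side, assuming LangChain's `PGVector` store; the collection name and connection string are placeholders, and the repository's actual wiring may differ:

```python
# Hedged sketch of retrieval against a PGVector store; names are placeholders
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import PGVector

embeddings = HuggingFaceEmbeddings(model_name="path/to/your/trained/model")

store = PGVector(
    connection_string="postgresql+psycopg2://user:pass@host:5432/db",  # CONNECTION_STRING
    embedding_function=embeddings,
    collection_name="miracle_docs",
)

# k mirrors TARGET_SOURCE_CHUNKS=4; retrieved chunks are then passed to the LLM endpoint
docs = store.similarity_search("query: Was sind die Symptome von Diabetes?", k=4)
```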
RAG evaluation metrics (a scoring sketch follows the list):

- BERTScore - Semantic similarity using BERT embeddings
- ROUGE (1, 2, L) - N-gram overlap metrics
- BLEURT - Learned evaluation metric
- ClinicalBLEURT - Domain-adapted BLEURT for clinical text
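As an illustration of how such reference-based metrics can be computed, a sketch using the Hugging Face `evaluate` package (an assumption; the repository's `evaluate.py` may call the metric libraries directly):

```python
# Hedged sketch of reference-based scoring; requires `evaluate`, `rouge_score`, `bert_score`
import evaluate

predictions = ["Diabetes zeigt sich durch erhöhten Blutzucker."]           # generated answer
references  = ["Diabetes mellitus zeigt sich durch erhöhten Blutzucker."]  # gold answer

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # ROUGE-1/2/L

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="de"))
```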
Load and use your fine-tuned MIRACLE model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("path/to/your/trained/model")

# Encode queries and passages
queries = ["query: Was sind die Symptome von Diabetes?"]
passages = ["passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."]
query_embeddings = model.encode(queries)
passage_embeddings = model.encode(passages)

# Compute similarity
similarities = cos_sim(query_embeddings, passage_embeddings)
```

Arzideh K., Schäfer H., Idrissi-Yaghir A., et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study." Journal of Medical Internet Research, 28, e82997. doi: https://doi.org/10.2196/82997
Feel free to contribute, ask questions, or open issues! Together, let's make medical language understanding in German more powerful and accessible.
See LICENSE for details.
