This repository documents the source code for training and evaluating MIRACLE, the first German embedding model trained specifically on millions of Q&A pairs and 400,000 real-world German medical documents. The model is designed to produce high-quality embeddings tailored to the complexities of medical language, supporting use cases such as clinical data analysis, medical information retrieval, and healthcare research. If you are interested in using the model, you can try it out on Hugging Face: https://huggingface.co/SHIPAI-IKIM/miracle-german
Medical language is intricate and specialized, with jargon, abbreviations, and terminology that differ from everyday language. General-purpose language models often struggle to understand the nuances of medical text, especially in languages like German. MIRACLE addresses this gap by being the first embedding model created specifically for German medical contexts.
Here's why a specialized model like MIRACLE is so important:
- Accuracy: General models may miss crucial medical details, whereas MIRACLE is trained to capture the complexities of diagnoses, treatments, and medical outcomes.
- Efficiency: With the ability to process large datasets of medical records, MIRACLE can improve the speed and accuracy of medical information retrieval systems.
- Trustworthiness: Trained on real-world German medical documents, this model ensures better contextual understanding and reliability when working with sensitive health-related data.
Whether you work in healthcare or research, or are developing applications in the medical field, MIRACLE can significantly improve the quality and relevance of your insights.
```
MIRACLE/
├── main.py                              # Entry point
├── pyproject.toml                       # Project dependencies
├── miracle/
│   ├── training/
│   │   ├── augment_training_data.py     # Generate Q&A pairs from medical documents
│   │   ├── preprocess_training_data.py  # Preprocess and split training data
│   │   └── training.py                  # Fine-tune embedding model
│   └── evaluation/
│       ├── irr/                         # Information Retrieval Ranking evaluation
│       │   ├── BootstrapRetrievalEvaluator.py  # Custom evaluator with bootstrap metrics
│       │   ├── irr_eval.py              # Run IR evaluation pipeline
│       │   └── find_duplicates.py       # Find duplicate entries in dataset
│       └── rag/                         # RAG evaluation pipeline
│           ├── ingest.py                # Ingest documents into vector database
│           ├── retrieve_generate.py     # Retrieve docs and generate answers
│           └── evaluate.py              # Evaluate RAG outputs (BLEURT, ROUGE, BERTScore)
└── images/
```
- Python >= 3.13
- Clone the repository:

```bash
git clone https://github.com/your-org/MIRACLE.git
cd MIRACLE
```

- Install dependencies using uv:

```bash
uv sync
```

This will create a virtual environment and install all dependencies. To activate the environment:

```bash
source .venv/bin/activate
```

The project uses the following main dependencies:

- `sentence-transformers` - For training and using embedding models
- `langchain` / `langchain-community` - For the RAG pipeline and document processing
- `datasets` - Hugging Face datasets library
- `wandb` - Experiment tracking
- `scikit-learn` - For data splitting and metrics
- `pandas` / `numpy` - Data manipulation
- `aiohttp` - Async HTTP requests for data augmentation
Create a `.env` file in the project root with the following environment variables:

```
# Data paths
DATA_PICKLE_PATH=/path/to/your/dataset.pkl
TRAIN_DATASET=/path/to/training/data
METADATA_DF=/path/to/patient_metadata.pkl
DATA_OUTPUT_PATH=/path/to/output/

# Training parameters
TRAIN_MODEL=your-base-model-name
BATCH_SIZE=32
MAX_SEQ_LENGTH=512
EPOCHS=3
TRAIN_TEST_SPLIT=0.8

# Data augmentation
CHUNK_SIZE=512
CHUNK_OVERLAP=50
FIRST_PATH=/path/to/documents/folder1
SECOND_PATH=/path/to/documents/folder2
THIRD_PATH=/path/to/documents/folder3

# Weights & Biases
USERNAME=your-wandb-username

# Embedding model
EMBEDDINGS_MODEL_NAME=your-fine-tuned-model
MODEL_URL=http://your-model-endpoint

# Vector database
CONNECTION_STRING=postgresql://user:pass@host:port/db

# RAG evaluation
ENDPOINT_URL=http://your-llm-endpoint
MODEL_N_CTX=4096
TARGET_SOURCE_CHUNKS=4

# IRR evaluation
IRR_DATASET=/path/to/irr_dataset.csv
OUTPUT_PATH=/path/to/evaluation/output
FILE_OUTPUT_NAME=evaluation_results
```
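These values are read at runtime. A minimal sketch of loading them, assuming `python-dotenv` (not part of the dependency list above) or an equivalent mechanism:

```python
# Hedged sketch: assumes python-dotenv; the repository's scripts may load .env differently
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

batch_size = int(os.environ["BATCH_SIZE"])  # e.g. 32
train_model = os.environ["TRAIN_MODEL"]     # base model to fine-tune
```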
The training pipeline consists of three main steps:
Generate question-answer pairs from medical documents using an LLM:
```bash
python miracle/training/augment_training_data.py
```

This script:

- Loads medical documents from specified directories
- Splits documents into chunks using `RecursiveCharacterTextSplitter` (see the sketch after this list)
- Generates 5 medically relevant questions per chunk using an LLM
- Saves the augmented dataset to CSV
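As a rough illustration of the chunking step, a minimal sketch using LangChain's text splitter; the document text is a placeholder, and the parameter values mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` settings above rather than the repository's actual code:

```python
# Hedged sketch of the chunking step; parameters mirror CHUNK_SIZE / CHUNK_OVERLAP above
from langchain.text_splitter import RecursiveCharacterTextSplitter

document_text = "Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."  # placeholder

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # list of overlapping text chunks

# Each chunk would then be sent to an LLM to generate 5 medically relevant questions
```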
Preprocess and split the augmented data for training:
```bash
python miracle/training/preprocess_training_data.py
```

This script:

- Filters invalid/dirty rows (length constraints, proper formatting)
- Adds `query:` and `passage:` prefixes for embedding training
- Performs a patient-aware train/test split using `GroupShuffleSplit` (sketched after this list)
- Saves the preprocessed dataset as a pickle
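A patient-aware split keeps all records of one patient on the same side of the split, preventing leakage between train and test sets. A minimal sketch with hypothetical data (the `patient_id` column name is an assumption):

```python
# Hedged sketch of a patient-aware split; column names and data are hypothetical
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "query":      ["query: q1", "query: q2", "query: q3", "query: q4"],
    "passage":    ["passage: p1", "passage: p2", "passage: p3", "passage: p4"],
    "patient_id": [1, 1, 2, 3],  # grouping key: all rows of a patient stay together
})

# train_size mirrors TRAIN_TEST_SPLIT=0.8 from the .env example above
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```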
Fine-tune a sentence transformer model:
```bash
python miracle/training/training.py
```

This script:

- Loads a base SentenceTransformer model
- Uses `CachedMultipleNegativesRankingLoss` for contrastive learning (see the sketch after this list)
- Trains with configurable batch size, epochs, and learning rate
- Logs metrics to Weights & Biases
- Saves the fine-tuned model
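A minimal sketch of what such a fine-tuning run can look like with the classic `sentence-transformers` fit API; the base model name and data are placeholders, and the actual script is configured through the `.env` variables above:

```python
# Hedged sketch of contrastive fine-tuning; model name and data are placeholders
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("your-base-model-name")  # TRAIN_MODEL from .env

train_examples = [
    InputExample(texts=[
        "query: Was sind die Symptome von Diabetes?",
        "passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker...",
    ]),
    # ... millions of Q&A pairs in the real dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)  # BATCH_SIZE

# In-batch negatives with gradient caching, allowing large effective batch sizes
train_loss = losses.CachedMultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)  # EPOCHS
model.save("miracle-finetuned")
```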
Evaluate embedding model retrieval performance:
```bash
python miracle/evaluation/irr/irr_eval.py
```

Metrics computed:

- Accuracy@k
- Precision@k / Recall@k
- MRR@k (Mean Reciprocal Rank)
- NDCG@k (Normalized Discounted Cumulative Gain)
- MAP@k (Mean Average Precision)
The custom `BootstrapRetrievalEvaluator` extends the Sentence Transformers evaluator to provide raw per-query metrics for statistical analysis.
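For context, a toy example of the stock `InformationRetrievalEvaluator` that such a custom evaluator would typically build on (the exact base class used in this repository is an assumption):

```python
# Hedged sketch using the stock evaluator; the custom BootstrapRetrievalEvaluator
# additionally exposes per-query scores so confidence intervals can be bootstrapped
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("path/to/your/trained/model")

queries = {"q1": "query: Was sind die Symptome von Diabetes?"}
corpus = {"d1": "passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."}
relevant_docs = {"q1": {"d1"}}  # ground-truth mapping from query to relevant passages

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="irr")
results = evaluator(model)  # computes Accuracy@k, MRR@k, NDCG@k, MAP@k, ...
```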
- Ingest documents into the vector database:

```bash
python miracle/evaluation/rag/ingest.py
```

- Run retrieval and generation (a retrieval sketch follows this list):
```bash
python miracle/evaluation/rag/retrieve_generate.py
```

- Evaluate generated answers:
```bash
python miracle/evaluation/rag/evaluate.py
```
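Under the hood, retrieval plugs the fine-tuned embeddings into the vector store. A minimal sketch of the retrieval side, assuming LangChain's `PGVector` store; the collection name and connection string are placeholders, and the repository's actual wiring may differ:

```python
# Hedged sketch of retrieval against a PGVector store; names are placeholders
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import PGVector

embeddings = HuggingFaceEmbeddings(model_name="path/to/your/trained/model")

store = PGVector(
    connection_string="postgresql+psycopg2://user:pass@host:5432/db",  # CONNECTION_STRING
    embedding_function=embeddings,
    collection_name="miracle_docs",
)

# k mirrors TARGET_SOURCE_CHUNKS=4; retrieved chunks are then passed to the LLM endpoint
docs = store.similarity_search("query: Was sind die Symptome von Diabetes?", k=4)
```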
RAG evaluation metrics (a scoring sketch follows the list):

- BERTScore - Semantic similarity using BERT embeddings
- ROUGE (1, 2, L) - N-gram overlap metrics
- BLEURT - Learned evaluation metric
- ClinicalBLEURT - Domain-adapted BLEURT for clinical text
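As an illustration of how such reference-based metrics can be computed, a sketch using the Hugging Face `evaluate` package (an assumption; the repository's `evaluate.py` may call the metric libraries directly):

```python
# Hedged sketch of reference-based scoring; requires `evaluate`, `rouge_score`, `bert_score`
import evaluate

predictions = ["Diabetes zeigt sich durch erhöhten Blutzucker."]           # generated answer
references  = ["Diabetes mellitus zeigt sich durch erhöhten Blutzucker."]  # gold answer

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # ROUGE-1/2/L

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="de"))
```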
Load and use your fine-tuned MIRACLE model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("path/to/your/trained/model")

# Encode queries and passages
queries = ["query: Was sind die Symptome von Diabetes?"]
passages = ["passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."]
query_embeddings = model.encode(queries)
passage_embeddings = model.encode(passages)

# Compute similarity
similarities = cos_sim(query_embeddings, passage_embeddings)
```

Arzideh K., Schäfer H., Idrissi-Yaghir A., et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study." Journal of Medical Internet Research, 28, e82997. doi: https://doi.org/10.2196/82997
Feel free to contribute, ask questions, or open issues! Together, let's make medical language understanding in German more powerful and accessible.
See LICENSE for details.
