MIRACLE: Medical Information Retrieval Using AI with Clinical Language Embeddings 🌟🩺

License: MIT · Supported Python version: >= 3.13


This repository documents the source code developed to train and evaluate MIRACLE, the first German embedding model trained specifically on millions of Q&A pairs and 400,000 real-world German medical documents. The model is designed to produce high-quality embeddings tailored to the complexities of medical language, supporting use cases such as clinical data analysis, medical information retrieval, and healthcare research. If you are interested in using the model, you can try it out on Hugging Face: https://huggingface.co/SHIPAI-IKIM/miracle-german 🚀

Why MIRACLE? 🤔

Medical language is intricate and specialized, with jargon, abbreviations, and terminology that differ from everyday language. General-purpose language models often struggle to understand the nuances of medical text, especially in languages like German. MIRACLE addresses this gap by being the first embedding model created specifically for German medical contexts.

Here's why a specialized model like MIRACLE is so important:

  • Accuracy: General models may miss crucial medical details, whereas MIRACLE is trained to capture the complexities of diagnoses, treatments, and medical outcomes.
  • Efficiency: With the ability to process large datasets of medical records, MIRACLE can improve the speed and accuracy of medical information retrieval systems.
  • Trustworthiness: Trained on real-world German medical documents, this model ensures better contextual understanding and reliability when working with sensitive health-related data.

Whether you're working in healthcare, research, or developing applications in the medical field, MIRACLE can significantly improve the quality and relevance of your insights. 🧠💡


Project Structure 📁

```
MIRACLE/
├── main.py                          # Entry point
├── pyproject.toml                   # Project dependencies
├── miracle/
│   ├── training/
│   │   ├── augment_training_data.py    # Generate Q&A pairs from medical documents
│   │   ├── preprocess_training_data.py # Preprocess and split training data
│   │   └── training.py                 # Fine-tune embedding model
│   └── evaluation/
│       ├── irr/                        # Information Retrieval Ranking evaluation
│       │   ├── BootstrapRetrievalEvaluator.py # Custom evaluator with bootstrap metrics
│       │   ├── irr_eval.py             # Run IR evaluation pipeline
│       │   └── find_duplicates.py      # Find duplicate entries in dataset
│       └── rag/                        # RAG evaluation pipeline
│           ├── ingest.py               # Ingest documents into vector database
│           ├── retrieve_generate.py    # Retrieve docs and generate answers
│           └── evaluate.py             # Evaluate RAG outputs (BLEURT, ROUGE, BERTScore)
└── images/
```

Installation & Setup ⚙️

Prerequisites

  • Python >= 3.13

Installation

  1. Clone the repository:

```shell
git clone https://github.com/your-org/MIRACLE.git
cd MIRACLE
```

  2. Install dependencies using uv:

```shell
uv sync
```

This will create a virtual environment and install all dependencies. To activate the environment:

```shell
source .venv/bin/activate
```

Dependencies

The project uses the following main dependencies:

  • sentence-transformers - For training and using embedding models
  • langchain / langchain-community - For RAG pipeline and document processing
  • datasets - Hugging Face datasets library
  • wandb - Experiment tracking
  • scikit-learn - For data splitting and metrics
  • pandas / numpy - Data manipulation
  • aiohttp - Async HTTP requests for data augmentation

Configuration 🔧

Create a .env file in the project root with the following environment variables:

Training Configuration

```
# Data paths
DATA_PICKLE_PATH=/path/to/your/dataset.pkl
TRAIN_DATASET=/path/to/training/data
METADATA_DF=/path/to/patient_metadata.pkl
DATA_OUTPUT_PATH=/path/to/output/

# Training parameters
TRAIN_MODEL=your-base-model-name
BATCH_SIZE=32
MAX_SEQ_LENGTH=512
EPOCHS=3
TRAIN_TEST_SPLIT=0.8

# Data augmentation
CHUNK_SIZE=512
CHUNK_OVERLAP=50
FIRST_PATH=/path/to/documents/folder1
SECOND_PATH=/path/to/documents/folder2
THIRD_PATH=/path/to/documents/folder3

# Weights & Biases
USERNAME=your-wandb-username
```

Evaluation Configuration

```
# Embedding model
EMBEDDINGS_MODEL_NAME=your-fine-tuned-model
MODEL_URL=http://your-model-endpoint

# Vector database
CONNECTION_STRING=postgresql://user:pass@host:port/db

# RAG evaluation
ENDPOINT_URL=http://your-llm-endpoint
MODEL_N_CTX=4096
TARGET_SOURCE_CHUNKS=4

# IRR evaluation
IRR_DATASET=/path/to/irr_dataset.csv
OUTPUT_PATH=/path/to/evaluation/output
FILE_OUTPUT_NAME=evaluation_results
```
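The scripts read these values from the environment. As a minimal sketch of loading them in Python (variable names follow the tables above; the defaults shown are illustrative assumptions, not values used by the actual scripts):

```python
import os

def load_training_config() -> dict:
    """Read training settings from environment variables.

    Names match the .env keys documented above; defaults here are
    illustrative assumptions only.
    """
    return {
        "model": os.getenv("TRAIN_MODEL", "your-base-model-name"),
        "batch_size": int(os.getenv("BATCH_SIZE", "32")),
        "max_seq_length": int(os.getenv("MAX_SEQ_LENGTH", "512")),
        "epochs": int(os.getenv("EPOCHS", "3")),
        "train_test_split": float(os.getenv("TRAIN_TEST_SPLIT", "0.8")),
    }
```

Using a loader like this keeps type conversion (int/float) in one place instead of scattering `os.environ` lookups through the scripts.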

Usage 💻

Training Pipeline

The training pipeline consists of three main steps:

1. Data Augmentation

Generate question-answer pairs from medical documents using an LLM:

```shell
python miracle/training/augment_training_data.py
```

This script:

  • Loads medical documents from specified directories
  • Splits documents into chunks using RecursiveCharacterTextSplitter
  • Generates 5 medically relevant questions per chunk using an LLM
  • Saves the augmented dataset to CSV
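The chunking step can be approximated in plain Python. This simplified stand-in for LangChain's RecursiveCharacterTextSplitter ignores separator hierarchies and just slides a fixed character window (the parameters mirror CHUNK_SIZE / CHUNK_OVERLAP from the configuration):

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks.

    A simplified sketch of the splitting behavior: each chunk starts
    chunk_size - overlap characters after the previous one, so adjacent
    chunks share `overlap` characters of context.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The real splitter additionally prefers breaking on paragraph and sentence boundaries before falling back to raw character positions.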

2. Data Preprocessing

Preprocess and split the augmented data for training:

```shell
python miracle/training/preprocess_training_data.py
```

This script:

  • Filters invalid/dirty rows (length constraints, proper formatting)
  • Adds query: and passage: prefixes for embedding training
  • Performs patient-aware train/test split using GroupShuffleSplit
  • Saves preprocessed dataset as pickle
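The patient-aware split ensures that no patient's documents leak from train into test. A simplified stand-in for scikit-learn's GroupShuffleSplit, assuming each row carries a `patient_id` plus `question`/`passage` fields (an assumed schema for illustration):

```python
import random

def patient_aware_split(rows, train_frac=0.8, seed=42):
    """Split Q&A rows so that no patient appears in both train and test.

    Shuffles the set of patient IDs, assigns the first train_frac of
    patients to train, and adds the query:/passage: prefixes the
    embedding model expects at training time.
    """
    patients = sorted({r["patient_id"] for r in rows})
    random.Random(seed).shuffle(patients)
    train_ids = set(patients[: int(len(patients) * train_frac)])
    train, test = [], []
    for r in rows:
        target = train if r["patient_id"] in train_ids else test
        target.append({
            "patient_id": r["patient_id"],
            "query": "query: " + r["question"],
            "passage": "passage: " + r["passage"],
        })
    return train, test
```

Splitting by patient rather than by row is what prevents near-duplicate documents from the same patient inflating test scores.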

3. Model Training

Fine-tune a sentence transformer model:

```shell
python miracle/training/training.py
```

This script:

  • Loads a base SentenceTransformer model
  • Uses CachedMultipleNegativesRankingLoss for contrastive learning
  • Trains with configurable batch size, epochs, and learning rate
  • Logs metrics to Weights & Biases
  • Saves the fine-tuned model
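CachedMultipleNegativesRankingLoss treats every other passage in the batch as a negative for each query. The underlying objective can be sketched in plain Python (a conceptual sketch only; the cached variant in sentence-transformers additionally processes the batch in chunks to reduce GPU memory):

```python
import math

def multiple_negatives_ranking_loss(sims: list[list[float]]) -> float:
    """In-batch contrastive loss over a query-passage similarity matrix.

    sims[i][j] is the similarity between query i and passage j; the
    matching passage sits on the diagonal, and all other passages in
    the batch act as negatives. Returns the mean cross-entropy of the
    softmax over each row.
    """
    total = 0.0
    for i, row in enumerate(sims):
        log_sum = math.log(sum(math.exp(s) for s in row))
        total += log_sum - row[i]  # -log softmax probability of the positive
    return total / len(sims)
```

Intuitively, the loss is near zero when each query is far more similar to its own passage than to any other passage in the batch, which is why larger batches (more negatives) tend to help.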

Evaluation Pipeline

Information Retrieval Ranking (IRR) Evaluation

Evaluate embedding model retrieval performance:

```shell
python miracle/evaluation/irr/irr_eval.py
```

Metrics computed:

  • Accuracy@k
  • Precision@k / Recall@k
  • MRR@k (Mean Reciprocal Rank)
  • NDCG@k (Normalized Discounted Cumulative Gain)
  • MAP@k (Mean Average Precision)

The custom BootstrapRetrievalEvaluator extends SentenceTransformers' evaluator to provide raw per-query metrics for statistical analysis.
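With raw per-query scores in hand, a percentile bootstrap yields confidence intervals around a mean metric. A minimal sketch of the idea (a hypothetical helper, not the evaluator's actual interface):

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query metric scores.

    Resamples the queries with replacement n_resamples times and takes
    the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choice(per_query_scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling over queries (not documents) matches how the metrics are computed: each query contributes one score, so queries are the independent units.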

RAG Evaluation

  1. Ingest documents into the vector database:

```shell
python miracle/evaluation/rag/ingest.py
```

  2. Run retrieval and generation:

```shell
python miracle/evaluation/rag/retrieve_generate.py
```

  3. Evaluate generated answers:

```shell
python miracle/evaluation/rag/evaluate.py
```

RAG evaluation metrics:

  • BERTScore - Semantic similarity using BERT embeddings
  • ROUGE (1, 2, L) - N-gram overlap metrics
  • BLEURT - Learned evaluation metric
  • ClinicalBLEURT - Domain-adapted BLEURT for clinical text
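To illustrate the overlap-based metrics, ROUGE-1 F1 reduces to unigram precision and recall between a generated answer and a reference. A whitespace-tokenized sketch (actual evaluations use dedicated packages with proper tokenization and stemming):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Overlap metrics like this are cheap but surface-level, which is why the pipeline complements them with learned metrics (BLEURT, BERTScore) that capture semantic similarity.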

Using the Trained Model

Load and use your fine-tuned MIRACLE model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("path/to/your/trained/model")

# Encode queries and passages
queries = ["query: Was sind die Symptome von Diabetes?"]
passages = ["passage: Diabetes mellitus zeigt sich durch erhöhten Blutzucker..."]

query_embeddings = model.encode(queries)
passage_embeddings = model.encode(passages)

# Compute similarity
similarities = cos_sim(query_embeddings, passage_embeddings)
```

Paper 📄

Arzideh K., SchΓ€fer H., Idrissi-Yaghir A. et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study". Journal of Medical Internet Research, 28, e82997, doi: https://doi.org/10.2196/82997


Contributing 🤝

Feel free to contribute, ask questions, or open issues! Together, let's make medical language understanding in German more powerful and accessible.

License 📜

See LICENSE for details.
