Welcome! This repository demonstrates a production-ready machine learning workflow that automatically trains, tests, and deploys a Random Forest regression model. Think of it as a complete pipeline that takes code changes and turns them into trained, trackable models, all without manual intervention.
What does this application do?
This project predicts California housing prices using machine learning. But more importantly, it showcases how to build an automated ML workflow that:
- Trains models on real housing data (median income, age, location, etc.)
- Tracks experiments using MLflow (so you can compare different model versions)
- Tests code automatically before deploying
- Packages everything in Docker containers for consistency
- Integrates with CI/CD so training happens automatically when code changes
The Real-World Problem: In production ML systems, you can't manually retrain models every time you change hyperparameters or fix bugs. This workflow ensures that every code change triggers automated testing, model training, and tracking, making ML development reliable and reproducible.
Here's what powers this project and why each piece matters:
| Technology | Purpose | Why It's Used |
|---|---|---|
| Python 3.11 | Core programming language | The de facto standard for ML workflows; excellent library ecosystem |
| scikit-learn | Machine learning library | Provides RandomForestRegressor and preprocessing tools; battle-tested and reliable |
| MLflow | Experiment tracking & model registry | Logs parameters, metrics, and models so you can compare runs and version control your models |
| pandas | Data manipulation | Loads and processes CSV files efficiently |
| Docker | Containerization | Ensures the same environment runs everywhere (your laptop, CI server, production) |
| GitHub Actions | CI/CD automation | Automatically builds, tests, and triggers model training when code is pushed |
| Apache Airflow | Workflow orchestration | Schedules and manages complex ML pipelines (optional; triggered via API) |
| pytest | Testing framework | Validates that data loading, preprocessing, and training functions work correctly |
sample-ml-workflow-with-github-action/
│
├── app/
│   └── train.py                # Main training script - the heart of the ML pipeline
│
├── scripts/
│   └── trigger_airflow.py      # Bridge script that triggers Airflow DAGs via API
│
├── tests/
│   └── ml_pipeline_test.py     # Unit tests ensuring each function works correctly
│
├── .github/
│   └── workflows/
│       └── cml.yaml            # GitHub Actions workflow (build → test → deploy)
│
├── Dockerfile                  # Container definition - packages the entire app
├── MLproject                   # MLflow project file - defines entry points and parameters
├── requirements.txt            # Python dependencies (pandas, scikit-learn, MLflow, etc.)
└── README.md                   # This file!
- `/app`: Contains the core training logic. The `train.py` script orchestrates data loading, preprocessing, model training, and MLflow logging.
- `/scripts`: Utility scripts that integrate with external systems. `trigger_airflow.py` handles authentication and API calls to trigger remote Airflow DAGs.
- `/tests`: Unit tests that validate individual functions (data loading, preprocessing, pipeline creation). These run automatically in CI/CD to catch bugs before deployment.
- `/.github/workflows`: GitHub Actions configuration. This YAML file defines what happens when code is pushed: build Docker image → run tests → push to registry → trigger Airflow.
- `Dockerfile`: Blueprint for creating a consistent environment. It installs Python, dependencies, and copies your code into a containerized workspace.
- `MLproject`: MLflow configuration file. Defines how to run the training script, what parameters it accepts, and which Docker image to use.
Understanding these three flows will give you a complete picture of how the system works.
Path: app/train.py → Data Loading → Preprocessing → Training → MLflow Logging
This is the core ML workflow. Let's trace what happens when you run the training script:
- Command Line Arguments (lines 62-66)
  - The script accepts hyperparameters like `--n_estimators` and `--criterion`
  - These allow you to experiment with different model configurations
- MLflow Setup (lines 70-71)
  - Connects to an MLflow tracking server (via the `MLFLOW_TRACKING_URI` environment variable)
  - Creates or selects an experiment to organize related runs
- Data Loading (line 88)
  - `load_data()` fetches the California housing dataset from an S3 URL
  - Returns a pandas DataFrame with features (income, age, location) and target (house price)
- Preprocessing (line 89)
  - `preprocess_data()` splits the data into train/test sets (80/20 split)
  - Separates features (X) from the target variable (y)
- Pipeline Creation (line 92)
  - `create_pipeline()` builds a scikit-learn Pipeline with two steps:
    - StandardScaler: Normalizes features so they're on the same scale
    - RandomForestRegressor: The actual ML model
- Training (line 93)
  - `train_model()` uses GridSearchCV to find the best hyperparameters
  - Validates using cross-validation to avoid overfitting
- Logging (lines 96-114)
  - Metrics (train score, test score, training time) are logged to MLflow
  - The trained model is saved and registered in MLflow's model registry
  - This creates a versioned artifact you can deploy later
Visual Flow:
graph TD
A[Start: python app/train.py] --> B[Parse Arguments]
B --> C[Set MLflow Tracking URI]
C --> D[Load Data from URL]
D --> E[Split Train/Test]
E --> F[Create Pipeline: Scaler + RF]
F --> G[Train with GridSearchCV]
G --> H[Evaluate on Test Set]
H --> I[Log Metrics to MLflow]
I --> J[Save Model to MLflow Registry]
J --> K[End: Model Ready]
style A fill:#e1f5ff
style K fill:#c8e6c9
style I fill:#fff9c4
style J fill:#fff9c4
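The flow above can be sketched end to end in a few lines. This is a minimal sketch, not the repository's `train.py`: it swaps the S3 download for a small synthetic frame (`make_synthetic_housing` is a hypothetical stand-in for `load_data()`), the column names and the function signatures are assumptions, and MLflow logging is left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_housing(n_rows=200, seed=0):
    """Stand-in for load_data(): synthetic rows, not the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "MedInc": rng.uniform(1, 10, n_rows),
        "HouseAge": rng.uniform(1, 50, n_rows),
    })
    # Made-up target: price rises with income, plus a little noise.
    df["Price"] = 0.5 * df["MedInc"] + rng.normal(0, 0.1, n_rows)
    return df


def preprocess_data(df, target="Price"):
    """Separate features (X) from the target (y), then split 80/20."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=42)


def create_pipeline():
    # random_state is added here only to make the sketch repeatable.
    return Pipeline(steps=[
        ("standard_scaler", StandardScaler()),
        ("Random_Forest", RandomForestRegressor(random_state=42)),
    ])


def train_model(pipe, X_train, y_train, param_grid):
    """Cross-validated grid search over the pipeline's parameters."""
    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(X_train, y_train)
    return search


X_train, X_test, y_train, y_test = preprocess_data(make_synthetic_housing())
model = train_model(
    create_pipeline(),
    X_train,
    y_train,
    param_grid={"Random_Forest__n_estimators": [10, 20]},
)
test_score = model.score(X_test, y_test)  # R^2 on the held-out 20%
```

Because the scaler lives inside the pipeline, each cross-validation fold fits it on that fold's training rows only, which is the leakage protection the pipeline pattern is meant to provide.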
Path: Code Push → Build Docker Image → Run Tests → Push to Registry → Trigger Airflow
When you push code to the main branch, GitHub Actions automatically:
- Checkout Code (line 20)
  - Downloads the latest code from the repository
- Docker Login (lines 22-26)
  - Authenticates with Docker Hub using secrets stored in GitHub
- Build Image (lines 28-34)
  - Creates a Docker image from the `Dockerfile`
  - Tags it with both the commit SHA (for versioning) and `latest` (for convenience)
- Run Tests (lines 36-45)
  - Executes `pytest` inside the Docker container
  - Uses a temporary MLflow tracking URI (file system) since we're just testing
  - If tests fail, the pipeline stops here, so no bad code gets deployed
- Push to Docker Hub (lines 47-54)
  - Only runs on the `main` branch (not on pull requests)
  - Uploads both tagged versions of the image to Docker Hub
  - Other systems can now pull this exact image version
- Trigger Airflow (lines 59-77)
  - Calls `scripts/trigger_airflow.py` with the commit hash
  - This script authenticates with Airflow's API and triggers a DAG run
  - The DAG can then pull the Docker image and run training on a remote server
Why This Matters: You never manually run tests or build images. Every code change triggers validation and deployment, ensuring consistency and catching bugs early.
Path: GitHub Actions → API Call → Airflow Authentication → DAG Trigger → Model Training
This flow connects your GitHub repository to a remote Airflow instance:
- Authentication (lines 26-56 in `trigger_airflow.py`)
  - Makes a POST request to Airflow's `/auth/token` endpoint
  - Tries JSON payload first, falls back to Basic Auth for compatibility
  - Retrieves a JWT token for subsequent API calls
- DAG Trigger (lines 58-88)
  - Sends the commit hash as configuration to Airflow
  - Airflow can use this hash to pull the exact Docker image version
  - Creates a new DAG run that executes on Airflow's schedule/resources
Why Separate from GitHub Actions? Airflow runs on dedicated infrastructure (e.g., EC2) with more compute power, scheduled retraining, and dependency management across multiple tasks.
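The authentication and trigger steps can be sketched with the standard library alone. This is not the repository's `trigger_airflow.py`: the functions only build the HTTP requests (nothing is sent), and the DAG-run path `/api/v2/dags/{dag_id}/dagRuns`, the `logical_date` field, and the `commit_hash` key inside `conf` are assumptions that should be checked against your Airflow version.

```python
import base64
import json
import urllib.request


def build_token_request(airflow_url, username, password):
    """First attempt: POST the credentials to /auth/token as JSON."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{airflow_url.rstrip('/')}/auth/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_basic_auth_request(airflow_url, username, password):
    """Fallback: same endpoint, credentials in a Basic Auth header."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{airflow_url.rstrip('/')}/auth/token",
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


def build_trigger_request(airflow_url, dag_id, jwt, commit_hash):
    """POST a new DAG run whose conf carries the commit hash."""
    # Assumed path and body shape; verify against your Airflow API docs.
    body = json.dumps(
        {"logical_date": None, "conf": {"commit_hash": commit_hash}}
    ).encode()
    return urllib.request.Request(
        f"{airflow_url.rstrip('/')}/api/v2/dags/{dag_id}/dagRuns",
        data=body,
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing any of these requests to `urllib.request.urlopen` would perform the call. Keeping request construction separate from sending makes the logic easy to unit test without a live Airflow instance.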
Here are the patterns and practices you should study closely:
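from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def create_pipeline():
    # Scale the features first, then fit the forest on the scaled values
    return Pipeline(steps=[
        ("standard_scaler", StandardScaler()),
        ("Random_Forest", RandomForestRegressor())
    ])

Why It's Powerful: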
- Encapsulates preprocessing + model in one object, preventing data leakage
- Easy to swap components (e.g., replace RandomForest with XGBoost)
- Works seamlessly with GridSearchCV for hyperparameter tuning
- Can be pickled and deployed as a single artifact
Study Point: Notice how GridSearchCV references pipeline steps with double underscores ("Random_Forest__n_estimators"). This is scikit-learn's way of accessing nested parameters.
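To see the double-underscore convention in isolation, here is a small sketch that reuses the step names from `create_pipeline()`. The parameter values are arbitrary examples, not the repository's grid.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ("standard_scaler", StandardScaler()),
    ("Random_Forest", RandomForestRegressor()),
])

# "<step name>__<parameter>" reaches into the named step.
pipe.set_params(Random_Forest__n_estimators=25, standard_scaler__with_mean=False)

forest = pipe.named_steps["Random_Forest"]
scaler = pipe.named_steps["standard_scaler"]

# A GridSearchCV param_grid uses exactly the same keys.
param_grid = {
    "Random_Forest__n_estimators": [10, 25, 50],
    "Random_Forest__criterion": ["squared_error", "absolute_error"],
}

# get_params() lists every addressable key, so typos can be caught up front.
valid_keys = set(pipe.get_params().keys())
unknown = [key for key in param_grid if key not in valid_keys]
```

Checking a grid against `get_params()` before searching is optional, but a misspelled step name otherwise only surfaces as an error once the search starts fitting.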
with mlflow.start_run():
    # ... training code ...
    mlflow.log_param("n_estimators", args.n_estimators)
    mlflow.log_metric("test_score", test_score)
    mlflow.sklearn.log_model(...)

Why It's Powerful:
- Automatic run management: The context manager handles creating and ending runs
- Exception-safe: If training crashes, MLflow still records what happened
- Version control for models: Each run creates a new model version you can compare
- Reproducibility: Logged parameters let you exactly recreate any model
Study Point: MLflow tracks three kinds of data:
- Parameters: Inputs that don't change (hyperparameters, data version)
- Metrics: Outputs that can improve (accuracy, training time)
- Artifacts: Files (models, plots, data samples)
from unittest.mock import patch

def test_load_data(mock_df):
    # Replace pd.read_csv inside app.train so no network call is made
    with patch('app.train.pd.read_csv') as mock_read_csv:
        mock_read_csv.return_value = mock_df
        df = load_data("http://fake-url.com/data.csv")
        assert not df.empty

Why It's Powerful:
- Fast tests: Don't download 2MB CSV files or train real models
- Isolated testing: Tests one function without dependencies on external services
- Deterministic: Uses fixed mock data, so results are consistent
- Catches bugs early: Runs automatically before code reaches production
Study Point: The @pytest.fixture decorator creates reusable test data. Notice how mock_df is shared across multiple test functions without duplication.
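A self-contained sketch of the fixture-plus-mock pattern follows. Here `load_data` is a stand-in assumed to wrap `pd.read_csv`, and the frame contents and column names are made up for illustration.

```python
from unittest.mock import patch

import pandas as pd
import pytest


def load_data(url):
    """Stand-in for app.train.load_data (assumed to wrap pd.read_csv)."""
    return pd.read_csv(url)


def make_mock_df():
    # Tiny made-up frame; the column names are illustrative only.
    return pd.DataFrame({"MedInc": [3.2, 5.1], "Price": [1.8, 2.9]})


@pytest.fixture
def mock_df():
    """pytest passes this to any test that names `mock_df` as an argument."""
    return make_mock_df()


def test_load_data(mock_df):
    # While patched, pd.read_csv returns the fixture instead of downloading.
    with patch("pandas.read_csv", return_value=mock_df) as mock_read_csv:
        df = load_data("http://fake-url.com/data.csv")
    mock_read_csv.assert_called_once_with("http://fake-url.com/data.csv")
    assert df.equals(mock_df)


def test_columns(mock_df):
    # A second test reuses the same fixture without repeating the setup.
    assert list(mock_df.columns) == ["MedInc", "Price"]
```

Each test receives a fresh frame because the fixture function runs once per test by default, so one test cannot corrupt the data another test sees.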
Follow these steps to run the project locally.
- Python 3.11+ installed
- Docker installed
- Git installed (usually pre-installed on macOS/Linux)
git clone <repository-url>
cd sample-ml-workflow-with-github-action

Next, create a virtual environment. It isolates your project dependencies from other Python projects:
python -m venv venv
# Activate it (macOS/Linux):
source venv/bin/activate
# Or on Windows:
# venv\Scripts\activate

pip install -r requirements.txt

This installs:
- `mlflow==3.5.0` (experiment tracking)
- `pandas==2.2.2` (data manipulation)
- `scikit-learn==1.5.0` (machine learning)
- `pytest==8.2.0` (testing)
MLflow can track experiments locally or on a remote server. For local testing:
# Option A: Use local file system (simplest)
export MLFLOW_TRACKING_URI="file:///tmp/mlruns"
# Option B: Start local MLflow server (better for viewing results)
mlflow ui --backend-store-uri file:///tmp/mlruns
# Then visit http://localhost:5000 in your browser

python app/train.py \
  --n_estimators 50 \
  --criterion squared_error \
  --experiment_name california_housing

What happens:
- Downloads California housing data from S3
- Trains a Random Forest model with 50 trees
- Logs results to MLflow
- Saves the trained model
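The three flags in the command above could be parsed along these lines. This is a sketch of the command-line interface only: the defaults and the list of allowed `--criterion` values are assumptions, not necessarily what `train.py` defines.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train the housing model")
    # Defaults below are placeholders, not the repository's actual values.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument(
        "--criterion",
        choices=["squared_error", "absolute_error", "friedman_mse", "poisson"],
        default="squared_error",
    )
    parser.add_argument("--experiment_name", default="california_housing")
    return parser


# Parse an explicit list here; a real script would call parse_args() with
# no arguments so that it reads sys.argv.
args = build_parser().parse_args(
    ["--n_estimators", "50", "--criterion", "squared_error"]
)
```

Declaring `type=int` and `choices` lets argparse reject a bad value such as `--n_estimators fifty` before any data is downloaded or any run is started.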
Verify everything works:
pytest tests/ -v

You should see all tests pass.
For a production-like environment:
# Build the Docker image
docker build -t sample-ml-workflow:latest .
# Run training inside the container
docker run --rm \
-e MLFLOW_TRACKING_URI="file:///tmp/mlruns" \
sample-ml-workflow:latest \
  python app/train.py --n_estimators 30

Why Docker? It ensures your code runs identically on your laptop, CI server, and production, with no "works on my machine" issues.
For local development and testing, you may only need MLFLOW_TRACKING_URI. However, if you want to use the complete CI/CD pipeline with Airflow integration, you'll need to configure several environment variables. Here's a template showing what each variable does and example values:
Create a .env file in the project root (and add it to .gitignore to keep secrets safe):
# MLflow Configuration
MLFLOW_TRACKING_URI="file:///tmp/mlruns" # Local file system
# OR for remote server:
# MLFLOW_TRACKING_URI="http://your-mlflow-server:5000" # Remote MLflow server URL
# Optional: Use specific experiment ID instead of name
MLFLOW_EXPERIMENT_ID="12345678-abcd-1234-efgh-123456789abc" # UUID format

These are configured in GitHub repository settings under Secrets and variables → Actions. They're used by the GitHub Actions workflow:
# Docker Hub Credentials (for pushing built images)
DOCKER_USERNAME="your-dockerhub-username" # Your Docker Hub account name
DOCKER_PASSWORD="dckr_pat_xxxxxxxxxxxxxxxxxxxx" # Docker Hub Personal Access Token
# Airflow Integration (for triggering remote DAGs)
AIRFLOW_URL="https://your-airflow-instance.ngrok-free.app" # Airflow web server URL
AIRFLOW_USERNAME="airflow_user" # Airflow admin username
AIRFLOW_PASSWORD="your_airflow_password" # Airflow admin password
# AWS Credentials (if your Airflow DAG uses AWS services)
AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE" # AWS access key
AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" # AWS secret key
# GitHub Personal Access Token (for repository access from CI/CD)
PERSONAL_ACCESS_TOKEN="ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" # GitHub PAT with repo permissions

| Variable | Used For | Required When | Example Format |
|---|---|---|---|
| `MLFLOW_TRACKING_URI` | Connecting to MLflow server | Always (local or remote) | `file:///tmp/mlruns` or `http://server:5000` |
| `MLFLOW_EXPERIMENT_ID` | Specific experiment (optional) | Only if using ID instead of name | UUID string |
| `DOCKER_USERNAME` | Docker Hub authentication | When pushing to Docker Hub | Your Docker Hub username |
| `DOCKER_PASSWORD` | Docker Hub authentication | When pushing to Docker Hub | Docker Hub PAT token |
| `AIRFLOW_URL` | Airflow API endpoint | When triggering Airflow DAGs | `https://airflow.example.com` |
| `AIRFLOW_USERNAME` | Airflow authentication | When triggering Airflow DAGs | Airflow user account |
| `AIRFLOW_PASSWORD` | Airflow authentication | When triggering Airflow DAGs | Airflow password |
| `AWS_ACCESS_KEY_ID` | AWS services access | If using S3/EC2/etc. | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | AWS services access | If using S3/EC2/etc. | AWS secret key |
| `PERSONAL_ACCESS_TOKEN` | GitHub API access | If CI/CD needs repo access | GitHub PAT |
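One way to act on the table above is a startup check that fails fast when a variable is missing. The sketch below is an illustration, not part of the repository: the stage names (`train`, `push_image`, `trigger_airflow`) and the grouping of variables under them are hypothetical.

```python
import os

# Which variables each stage needs, following the table above.
REQUIRED_BY_STAGE = {
    "train": ["MLFLOW_TRACKING_URI"],
    "push_image": ["DOCKER_USERNAME", "DOCKER_PASSWORD"],
    "trigger_airflow": ["AIRFLOW_URL", "AIRFLOW_USERNAME", "AIRFLOW_PASSWORD"],
}


def missing_variables(stage, env=None):
    """Return the names required for `stage` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BY_STAGE[stage] if not env.get(name)]


def require(stage, env=None):
    """Raise early, naming every missing variable but never printing values."""
    missing = missing_variables(stage, env)
    if missing:
        raise RuntimeError(f"Missing for {stage}: {', '.join(missing)}")
```

Calling `require("trigger_airflow")` at the top of a script turns a confusing mid-run authentication failure into a clear message that lists exactly which names to set.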
Option 1: Export in your shell session
export MLFLOW_TRACKING_URI="file:///tmp/mlruns"
export AIRFLOW_URL="https://your-airflow-url.com"
# ... etc

Option 2: Create a .env file and load it (recommended)
# Create .env file with your variables
cat > .env << EOF
MLFLOW_TRACKING_URI=file:///tmp/mlruns
AIRFLOW_URL=https://your-airflow-url.com
AIRFLOW_USERNAME=airflow_user
AIRFLOW_PASSWORD=your_password
EOF
# Load variables (add to your ~/.zshrc or ~/.bashrc for persistence)
export $(cat .env | xargs)

Option 3: Use python-dotenv (if using a library that supports it)
from dotenv import load_dotenv
load_dotenv()  # Loads variables from .env file

GitHub Actions uses Repository Secrets (not environment variables in your code). To add them:
- Go to your GitHub repository
- Click Settings → Secrets and variables → Actions
- Click New repository secret
- Add each variable name and value
Never commit .env files or hardcode secrets in your code. Always use environment variables or secret management tools (like GitHub Secrets) for sensitive information.
- Experiment with Hyperparameters: Try different `--n_estimators` values and see how they affect model performance in MLflow.
- Modify the Pipeline: Add a feature selection step or try a different algorithm (e.g., XGBoost).
- Explore MLflow UI: Start `mlflow ui` and browse experiments, compare runs, and download models.
- Add More Tests: Write tests for edge cases (empty data, missing values, etc.).
- Study the CI/CD Flow: Push a change to a feature branch and watch GitHub Actions run tests.
Found a bug or want to add a feature? Feel free to open an issue or submit a pull request!
Happy Learning!
This README was designed to help beginners understand not just what the code does, but why it's structured this way. If you have questions, don't hesitate to explore the code comments or reach out to the maintainers.